Direct strategy must still run the Step 7 pre-completion gate #793
Evangelink wants to merge 2 commits into
The Direct strategy correctly skips the research/plan/implement sub-agents for small single-file tasks, but the wording let agents also skip the Step 7 pre-completion gate (test-gap-analysis + assertion-quality + scenario coverage), treating a single-file task that enumerates specific behaviors as 'trivially small'.

This is the dominant failure mode observed on behavior-enumerating tasks: the agent writes one test file directly and finishes with no assertion-strength or scenario-coverage check, producing weak assertions (mutation survivors) and missing required edge/negative cases.

Clarify in both the generator Step 2 strategy table and the code-testing-agent SKILL.md that Direct trades away only the sub-agents, never the gate, and that a request naming a specific symbol or enumerating scenarios is not 'trivially small' and must run the gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
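To make "weak assertions (mutation survivors)" concrete, here is a minimal sketch in Python (chosen only for brevity; the plugin itself targets .NET). The `clamp` function, its mutant, and both checks are hypothetical illustrations, not code from this repository.

```python
def clamp(value, low, high):
    """Function under test: restrict value to the range [low, high]."""
    return max(low, min(value, high))


def clamp_mutant(value, low, high):
    """Hypothetical mutant: the upper bound is no longer enforced."""
    return max(low, value)


def weak_check(fn):
    # Weak assertion: only checks that something came back.
    return fn(5, 0, 10) is not None


def strong_check(fn):
    # Stronger assertions: exact values, including both edge cases.
    return fn(5, 0, 10) == 5 and fn(-3, 0, 10) == 0 and fn(42, 0, 10) == 10


# The mutant survives the weak check but is killed by the strong one.
print("weak check passes on mutant:", weak_check(clamp_mutant))
print("strong check passes on mutant:", strong_check(clamp_mutant))
```

An assertion-quality review of the kind the gate performs is meant to catch the first style of check before the task is reported as done.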
Pull request overview
Updates the dotnet-test code-testing generator documentation to ensure the Direct strategy does not bypass the Step 7 pre-completion gate (test-gap-analysis + assertion-quality + prompt scenario coverage self-review), addressing observed quality regressions on “small” but rubric-sensitive tasks.
Changes:
- Clarifies in `code-testing-generator` strategy guidance that Direct skips only sub-agents (Steps 3–5), not the Step 7 gate.
- Updates the `code-testing-agent` SKILL strategy table to explicitly note the pre-completion gate still runs on Direct.
| File | Description |
|---|---|
| plugins/dotnet-test/skills/code-testing-agent/SKILL.md | Clarifies Direct strategy description to still run the pre-completion gate before finishing. |
| plugins/dotnet-test/agents/code-testing-generator.agent.md | Strengthens Step 2 strategy guidance to keep Step 7 gate mandatory on Direct and clarifies “not trivially small” cases. |
Copilot's findings
- Files reviewed: 2/2 changed files
- Comments generated: 2
…ify gate in SKILL.md

- Step 2 Direct cell no longer introduces a separate 'names a specific symbol' gate trigger that contradicted Step 7. It now defers to Step 7's own threshold (>=5 tests, or any enumerated behaviors/scenarios).
- SKILL.md now names what/where the gate is: the generator's Step 7 (test-gap-analysis + assertion-quality).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
👋 @Evangelink, this PR has 2 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the
/evaluate
✅ Evaluation passed for
Skill Validation Results
[1] Model: claude-opus-4.6 | Judge: claude-opus-4.6
🔍 Full Results: additional metrics and failure investigation steps
▶ Sessions Visualisation: interactive replay of all evaluation sessions
Motivation
Follow-up to #789. While analyzing MSBench `sweatlas-tw-unit` runs (all claude-opus-4.6), I found that across every agent/skill variant, including full-plugin configs launched explicitly as `code-testing-generator`, the Step 7 pre-completion gate (`test-gap-analysis` + `assertion-quality` + scenario-coverage self-review) fired 0 times. The agent classified each task as Direct (single file, "trivially small") and wrote one test file directly, then finished.

These tasks are exactly the shape `"add 1 unit test for each of these scenarios"`: small in file count but rubric-graded on precise behavior coverage and assertion strength. Skipping the gate directly produced the dominant failures: weak assertions that survive mutation, and missing required edge/negative cases.

The current wording is the cause: the Direct strategy says "skip sub-agents" and "write tests immediately", and the gate's own threshold exempts "trivially small" tasks, so a single-file task that enumerates behaviors slips through with no quality backstop, even though Steps 6-9 are nominally mandatory.
Change
Clarify, in both the generator's Step 2 strategy table and the `code-testing-agent` SKILL.md, that:

Docs-only, +4/-4 across two files. `markdownlint-cli2` passes with 0 errors. Complements #789 (which strengthened the gate's scenario-coverage check); this PR ensures the gate actually runs on the Direct path where these tasks land.
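The intended control flow can be sketched as follows. This is an illustrative Python model, not the agent's implementation: the `Task` fields, the function names, the step labels, and the single-file test for Direct are all assumptions. Only the gate threshold (>=5 tests, or any enumerated behaviors/scenarios) and the fact that Direct skips the research/plan/implement sub-agents come from this PR.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical summary of a test-generation request."""
    files_touched: int
    planned_tests: int
    enumerated_scenarios: int  # behaviors/scenarios the request lists explicitly


def uses_direct_strategy(task: Task) -> bool:
    # Assumption: "small single-file task" is approximated by file count alone.
    return task.files_touched == 1


def gate_required(task: Task) -> bool:
    # Step 7 threshold as quoted in this PR: >=5 tests, or any enumerated
    # behaviors/scenarios. It does not depend on the chosen strategy.
    return task.planned_tests >= 5 or task.enumerated_scenarios > 0


def steps_to_run(task: Task) -> list[str]:
    # Direct trades away only the sub-agents (Steps 3-5), never the gate.
    if uses_direct_strategy(task):
        steps = ["write_tests_directly"]
    else:
        steps = ["research", "plan", "implement"]
    if gate_required(task):
        steps.append("step7_gate")
    return steps


# A single-file request that enumerates three scenarios still hits the gate.
task = Task(files_touched=1, planned_tests=3, enumerated_scenarios=3)
print(steps_to_run(task))
```

Keeping `gate_required` independent of `uses_direct_strategy` mirrors the point of the change: strategy selection decides which sub-agents run, while the gate decision is made separately from the task's content.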