Direct strategy must still run the Step 7 pre-completion gate #793

Open
Evangelink wants to merge 2 commits into dotnet:main from Evangelink:evangelink/direct-strategy-keep-gate

@Evangelink

Member

Motivation

Follow-up to #789. While analyzing MSBench sweatlas-tw-unit runs (all claude-opus-4.6), I found that across every agent/skill variant — including full-plugin configs launched explicitly as code-testing-generator — the Step 7 pre-completion gate (test-gap-analysis + assertion-quality + scenario-coverage self-review) fired 0 times. The agent classified each task as Direct (single file, "trivially small") and wrote one test file directly, then finished.

These tasks are exactly the shape "add 1 unit test for each of these scenarios" — small in file count but rubric-graded on precise behavior coverage and assertion strength. Skipping the gate directly produced the dominant failures: weak assertions that survive mutation, and missing required edge/negative cases.

The current wording is the cause: the Direct strategy says "skip sub-agents" and "write tests immediately", and the gate's own threshold exempts "trivially small" tasks — so a single-file task that enumerates behaviors slips through with no quality backstop, even though Steps 6-9 are nominally mandatory.

Change

Clarify, in both the generator's Step 2 strategy table and the code-testing-agent SKILL.md, that:

  • Direct trades away only the sub-agent pipeline (Steps 3-5), never the Step 7 pre-completion gate.
  • A request that names a specific symbol or enumerates behaviors/scenarios is not "trivially small" — treat the list as the spec (target the exact symbol, cover every scenario) and run the gate before reporting completion.

Docs-only, +4/-4 across two files. markdownlint-cli2 passes with 0 errors. Complements #789 (which strengthened the gate's scenario-coverage check); this PR ensures the gate actually runs on the Direct path where these tasks land.
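
As a rough illustration of the rule described above (not code from the repository), the step selection could be sketched like this in Python. The step numbers come from the description; `Strategy`, `steps_to_run`, the `FULL` strategy name, and the assumption that the workflow has exactly nine steps are hypothetical.

```python
from enum import Enum


class Strategy(Enum):
    DIRECT = "direct"
    FULL = "full"  # hypothetical name for any non-Direct strategy


SUB_AGENT_STEPS = {3, 4, 5}  # research / plan / implement sub-agents
PRE_COMPLETION_GATE = 7      # test-gap-analysis + assertion-quality + scenario coverage


def steps_to_run(strategy: Strategy, total_steps: int = 9) -> list[int]:
    """Return the step numbers to execute for a strategy.

    Direct drops only the sub-agent pipeline; every other step,
    including the Step 7 gate, stays in the plan.
    """
    steps = list(range(1, total_steps + 1))
    if strategy is Strategy.DIRECT:
        steps = [s for s in steps if s not in SUB_AGENT_STEPS]
    return steps


print(steps_to_run(Strategy.DIRECT))  # [1, 2, 6, 7, 8, 9]
```

The point of the sketch is that Step 7 is never in the set that Direct removes, which is the property the wording change is meant to make explicit.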

The Direct strategy correctly skips the research/plan/implement sub-agents
for small single-file tasks, but the wording let agents also skip the
Step 7 pre-completion gate (test-gap-analysis + assertion-quality +
scenario coverage) — treating a single-file task that enumerates specific
behaviors as 'trivially small'.

This is the dominant failure mode observed on behavior-enumerating tasks:
the agent writes one test file directly and finishes with no
assertion-strength or scenario-coverage check, producing weak assertions
(mutation survivors) and missing required edge/negative cases.

Clarify in both the generator Step 2 strategy table and the
code-testing-agent SKILL.md that Direct trades away only the sub-agents,
never the gate, and that a request naming a specific symbol or enumerating
scenarios is not 'trivially small' and must run the gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 19, 2026 10:02

Copilot AI left a comment

Contributor

Pull request overview

Updates the dotnet-test code-testing generator documentation to ensure the Direct strategy does not bypass the Step 7 pre-completion gate (test-gap-analysis + assertion-quality + prompt scenario coverage self-review), addressing observed quality regressions on “small” but rubric-sensitive tasks.

Changes:

  • Clarifies in code-testing-generator strategy guidance that Direct skips only sub-agents (Steps 3–5), not the Step 7 gate.
  • Updates the code-testing-agent SKILL strategy table to explicitly note the pre-completion gate still runs on Direct.
Summary per file:

| File | Description |
| --- | --- |
| plugins/dotnet-test/skills/code-testing-agent/SKILL.md | Clarifies Direct strategy description to still run the pre-completion gate before finishing. |
| plugins/dotnet-test/agents/code-testing-generator.agent.md | Strengthens Step 2 strategy guidance to keep Step 7 gate mandatory on Direct and clarifies “not trivially small” cases. |

Copilot's findings

  • Files reviewed: 2/2 changed files
  • Comments generated: 2

Comment thread plugins/dotnet-test/skills/code-testing-agent/SKILL.md Outdated
Comment thread plugins/dotnet-test/agents/code-testing-generator.agent.md Outdated
…ify gate in SKILL.md

- Step 2 Direct cell no longer introduces a separate 'names a specific
  symbol' gate trigger that contradicted Step 7. It now defers to Step 7's
  own threshold (>=5 tests, or any enumerated behaviors/scenarios).
- SKILL.md now names what/where the gate is: the generator's Step 7
  (test-gap-analysis + assertion-quality).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
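
The Step 7 threshold quoted in this commit message (>=5 tests, or any enumerated behaviors/scenarios) can be read as a simple predicate. The sketch below is only an interpretation of that wording; `GenerationRequest`, its fields, and `gate_applies` are hypothetical names and do not appear in the skill files.

```python
from dataclasses import dataclass


@dataclass
class GenerationRequest:
    planned_test_count: int          # how many tests the agent expects to write
    enumerated_scenarios: list[str]  # behaviors/scenarios listed in the prompt


def gate_applies(request: GenerationRequest) -> bool:
    """Step 7 threshold as worded in the commit message:
    at least 5 tests, or any enumerated behaviors/scenarios."""
    return request.planned_test_count >= 5 or len(request.enumerated_scenarios) > 0


# A single-file task with only two tests still hits the gate
# because the prompt enumerates the scenarios to cover.
small_but_enumerated = GenerationRequest(2, ["rejects null input", "trims whitespace"])
print(gate_applies(small_but_enumerated))  # True
```

Under this reading, file count plays no role in the decision, so a task classified as Direct is exempt only when it is both small in test count and free of enumerated scenarios.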
@github-actions github-actions bot added the waiting-on-author label (PR state label) Jun 19, 2026
@github-actions

Contributor

👋 @Evangelink — this PR has 2 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

@Evangelink

Member Author

/evaluate

@Evangelink Evangelink enabled auto-merge (squash) June 19, 2026 13:18
@github-actions github-actions bot added the waiting-on-review label and removed the waiting-on-author label (PR state labels) Jun 19, 2026
@github-actions

Contributor

✅ Evaluation passed for c6c3a1d. cc @dotnet/dotnet-testing — please review.

@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| code-testing-agent | Generate tests for ContosoUniversity ASP.NET Core MVC app | 3.0/5 → 3.0/5 | ✅ code-testing-agent; tools: skill, edit / ✅ code-testing-agent; code-testing-extensions; tools: skill, task, edit, read_agent | 🟡 0.23 | [1] |
| code-testing-agent | Generate pytest tests for the Flask tasks API (Python polyglot) | 4.3/5 → 4.7/5 🟢 | ⚠️ NOT ACTIVATED | 🟡 0.23 | [2] |
| code-testing-agent | Generate Vitest tests for the shopping-cart library (TypeScript polyglot) | 4.3/5 → 5.0/5 🟢 | ✅ code-testing-agent; tools: skill / ✅ code-testing-agent; tools: skill, edit | 🟡 0.23 | [3] |
| code-testing-agent | Does not revert a gutted-looking workspace (workspace integrity) | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | 🟡 0.23 | [4] |

[1] ⚠️ High run-to-run variance (CV=136%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -17.6% due to: judgment, quality, tokens (1044621 → 1261718)
[2] ⚠️ High run-to-run variance (CV=235%) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=59%) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=144%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -4.5% due to: quality
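
For readers unfamiliar with the CV figures in these footnotes: CV presumably stands for coefficient of variation. The validator's exact formula is not shown in this thread, so the sketch below assumes the conventional definition (sample standard deviation divided by the mean, as a percentage). The function name and the sample values are made up for illustration.

```python
import statistics


def coefficient_of_variation(values: list[float]) -> float:
    """Sample standard deviation as a percentage of the mean.

    Needs at least two values; statistics.stdev raises otherwise.
    """
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / abs(mean) * 100.0


# Hypothetical per-run values, not taken from the evaluation above.
print(coefficient_of_variation([1.0, 2.0, 3.0]))  # 50.0
```

With this definition, values above 100% mean the spread between runs is larger than the average itself, which is easy to hit with only a few runs and a mean close to zero. That fits the suggestion to re-run with `--runs 5`.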

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 793 in dotnet/skills, download eval artifacts with gh run download 27828109266 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/c6c3a1dc1e88260d7e32150bf761a428bb14f74e/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

Labels

waiting-on-review PR state label
