Direct strategy must still run the Step 7 pre-completion gate by Evangelink · Pull Request #793 · dotnet/skills

Evangelink · 2026-06-19T10:02:53Z

Motivation

Follow-up to #789. While analyzing MSBench sweatlas-tw-unit runs (all claude-opus-4.6), I found that across every agent/skill variant — including full-plugin configs launched explicitly as code-testing-generator — the Step 7 pre-completion gate (test-gap-analysis + assertion-quality + scenario-coverage self-review) fired 0 times. The agent classified each task as Direct (single file, "trivially small") and wrote one test file directly, then finished.

These tasks are exactly the shape "add 1 unit test for each of these scenarios" — small in file count but rubric-graded on precise behavior coverage and assertion strength. Skipping the gate directly produced the dominant failures: weak assertions that survive mutation, and missing required edge/negative cases.
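To make "weak assertions that survive mutation" concrete, here is a minimal sketch. The function `is_adult`, its mutant, and both tests are hypothetical illustrations and are not taken from the MSBench tasks or the skill files.

```python
def is_adult(age: int) -> bool:
    """Hypothetical code under test."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """Mutant: the boundary operator >= was changed to >."""
    return age > 18


def weak_test(fn) -> bool:
    # Weak: only checks a value far from the boundary.
    return fn(30) is True


def strong_test(fn) -> bool:
    # Strong: pins the boundary and includes a negative case.
    return fn(18) is True and fn(17) is False and fn(30) is True


# The weak test passes for both versions, so the mutant survives.
print(weak_test(is_adult), weak_test(is_adult_mutant))
# The strong test passes for the original and fails for the mutant.
print(strong_test(is_adult), strong_test(is_adult_mutant))
```

An assertion-quality review of the kind the gate describes would presumably flag the first test for not exercising the boundary or any negative case.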

The current wording is the cause: the Direct strategy says "skip sub-agents" and "write tests immediately", and the gate's own threshold exempts "trivially small" tasks — so a single-file task that enumerates behaviors slips through with no quality backstop, even though Steps 6-9 are nominally mandatory.

Change

Clarify, in both the generator's Step 2 strategy table and the code-testing-agent SKILL.md, that:

- Direct trades away only the sub-agent pipeline (Steps 3-5), never the Step 7 pre-completion gate.
- A request that names a specific symbol or enumerates behaviors/scenarios is not "trivially small": treat the list as the spec (target the exact symbol, cover every scenario) and run the gate before reporting completion.

Docs-only, +4/-4 across two files. markdownlint-cli2 passes with 0 errors. Complements #789 (which strengthened the gate's scenario-coverage check); this PR ensures the gate actually runs on the Direct path where these tasks land.

The Direct strategy correctly skips the research/plan/implement sub-agents for small single-file tasks, but the wording let agents also skip the Step 7 pre-completion gate (test-gap-analysis + assertion-quality + scenario coverage), treating a single-file task that enumerates specific behaviors as 'trivially small'. This is the dominant failure mode observed on behavior-enumerating tasks: the agent writes one test file directly and finishes with no assertion-strength or scenario-coverage check, producing weak assertions (mutation survivors) and missing required edge/negative cases. Clarify in both the generator Step 2 strategy table and the code-testing-agent SKILL.md that Direct trades away only the sub-agents, never the gate, and that a request naming a specific symbol or enumerating scenarios is not 'trivially small' and must run the gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Updates the dotnet-test code-testing generator documentation to ensure the Direct strategy does not bypass the Step 7 pre-completion gate (test-gap-analysis + assertion-quality + prompt scenario coverage self-review), addressing observed quality regressions on “small” but rubric-sensitive tasks.

Changes:

- Clarifies in code-testing-generator strategy guidance that Direct skips only sub-agents (Steps 3–5), not the Step 7 gate.
- Updates the code-testing-agent SKILL strategy table to explicitly note the pre-completion gate still runs on Direct.

Show a summary per file

File	Description
plugins/dotnet-test/skills/code-testing-agent/SKILL.md	Clarifies Direct strategy description to still run the pre-completion gate before finishing.
plugins/dotnet-test/agents/code-testing-generator.agent.md	Strengthens Step 2 strategy guidance to keep Step 7 gate mandatory on Direct and clarifies “not trivially small” cases.

Copilot's findings

Files reviewed: 2/2 changed files
Comments generated: 2

…ify gate in SKILL.md

- Step 2 Direct cell no longer introduces a separate 'names a specific symbol' gate trigger that contradicted Step 7. It now defers to Step 7's own threshold (>=5 tests, or any enumerated behaviors/scenarios).
- SKILL.md now names what/where the gate is: the generator's Step 7 (test-gap-analysis + assertion-quality).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
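The rule this commit settles on can be read as a small decision procedure. The sketch below is only an illustration of that reading: the `Task` type, the function names, and the assumption that the workflow has Steps 1 through 9 are hypothetical and do not appear in the skill files. The threshold itself (>=5 tests, or any enumerated behaviors/scenarios) is taken from the commit message.

```python
from dataclasses import dataclass

# Hypothetical model of the wording in this PR; not code from the repository.
SUB_AGENT_STEPS = {3, 4, 5}  # research / plan / implement sub-agents
GATE_STEP = 7                # pre-completion gate


@dataclass
class Task:
    strategy: str              # e.g. "Direct"
    planned_tests: int         # tests the agent expects to write
    enumerated_scenarios: int  # behaviors/scenarios listed in the request


def gate_required(task: Task) -> bool:
    """Step 7 threshold as stated in the commit message."""
    return task.planned_tests >= 5 or task.enumerated_scenarios > 0


def steps_to_run(task: Task) -> list[int]:
    steps = list(range(1, 10))  # assumption: Steps 1-9
    if task.strategy == "Direct":
        # Direct trades away only the sub-agent pipeline.
        steps = [s for s in steps if s not in SUB_AGENT_STEPS]
    if not gate_required(task):
        # Only tasks below the threshold may skip the gate.
        steps.remove(GATE_STEP)
    return steps


# A single-file request that lists three scenarios still runs Step 7.
print(steps_to_run(Task("Direct", planned_tests=3, enumerated_scenarios=3)))
```

Under this reading, the strategy choice and the gate decision are independent: choosing Direct never removes Step 7 by itself.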

github-actions · 2026-06-19T11:46:15Z

👋 @Evangelink — this PR has 2 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

Evangelink · 2026-06-19T13:18:33Z

/evaluate

github-actions · 2026-06-19T13:31:03Z

✅ Evaluation passed for c6c3a1d. cc @dotnet/dotnet-testing — please review.

github-actions · 2026-06-19T13:33:53Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
code-testing-agent	Generate tests for ContosoUniversity ASP.NET Core MVC app	3.0/5 → 3.0/5	✅ code-testing-agent; tools: skill, edit / ✅ code-testing-agent; code-testing-extensions; tools: skill, task, edit, read_agent	🟡 0.23	❌ [1]
code-testing-agent	Generate pytest tests for the Flask tasks API (Python polyglot)	4.3/5 → 4.7/5 🟢	⚠️ NOT ACTIVATED	🟡 0.23	✅ [2]
code-testing-agent	Generate Vitest tests for the shopping-cart library (TypeScript polyglot)	4.3/5 → 5.0/5 🟢	✅ code-testing-agent; tools: skill / ✅ code-testing-agent; tools: skill, edit	🟡 0.23	✅ [3]
code-testing-agent	Does not revert a gutted-looking workspace (workspace integrity)	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED	🟡 0.23	❌ [4]

[1] ⚠️ High run-to-run variance (CV=136%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -17.6% due to: judgment, quality, tokens (1044621 → 1261718)
[2] ⚠️ High run-to-run variance (CV=235%) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=59%) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=144%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -4.5% due to: quality
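For readers unfamiliar with the CV figures in these footnotes: the coefficient of variation is conventionally the standard deviation divided by the mean, expressed as a percentage. The sketch below uses that textbook definition with the sample standard deviation; which quantity the skill validator actually feeds into it (raw scores, per-run improvements, or something else) is not stated in this thread, so the inputs here are hypothetical.

```python
import statistics


def coefficient_of_variation(values: list[float]) -> float:
    """Return the sample standard deviation as a percentage of the mean.

    Textbook definition; whether the validator uses the sample or the
    population standard deviation is an assumption here.
    """
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / abs(mean) * 100


# Hypothetical per-run improvement scores: a small mean with a wide spread
# gives a CV well above 100%.
runs = [0.5, -0.2, 0.1]
print(f"CV = {coefficient_of_variation(runs):.0f}%")
```

A CV above 100% means the spread between runs is larger than the average effect, which is why the bot suggests re-running with more runs.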

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 793 in dotnet/skills, download eval artifacts with gh run download 27828109266 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/c6c3a1dc1e88260d7e32150bf761a428bb14f74e/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

Copilot AI review requested due to automatic review settings June 19, 2026 10:02

Copilot started reviewing on behalf of Evangelink June 19, 2026 10:03

Copilot AI reviewed Jun 19, 2026

Comment thread plugins/dotnet-test/skills/code-testing-agent/SKILL.md Outdated

Comment thread plugins/dotnet-test/agents/code-testing-generator.agent.md Outdated

github-actions Bot added the waiting-on-author PR state label label Jun 19, 2026

Evangelink enabled auto-merge (squash) June 19, 2026 13:18

github-actions Bot added waiting-on-review PR state label and removed waiting-on-author PR state label labels Jun 19, 2026

Evangelink wants to merge 2 commits into dotnet:main from Evangelink:evangelink/direct-strategy-keep-gate
