Fix test-anti-patterns skill activation for 5 evals by Evangelink · Pull Request #786 · dotnet/skills

Evangelink · 2026-06-17T15:43:02Z

Problem

Five test-anti-patterns eval scenarios were reported as skill_not_activated: an overlapping sibling skill (or no skill) loaded instead of test-anti-patterns. Separately, the assertion-quality eval had a validation error (prompts naming the target skill).

Root cause

In plugin eval runs a scenario only counts as activated if the target skill fires. Several prompts matched a sibling's advertised capability:

Scenario	Sibling stealing activation
coverage-touching	`coverage-analysis`
self-referential assertions	`assertion-quality`
duplicated tests / magic values	`test-smell-detection`
well-written tests	`assertion-quality` / `test-smell-detection`
Python pytest	`assertion-quality` / `test-gap-analysis`
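
The activation rule above can be illustrated with a minimal sketch. The function name `is_activated` and the list-of-names representation of detected skills are assumptions made for illustration; this is not the validator's actual code.

```python
def is_activated(target: str, detected_skills: list[str]) -> bool:
    """Return True only if the target skill is among the skills the agent loaded."""
    return target in detected_skills


# A sibling loading instead of the target does not count as activation.
print(is_activated("test-anti-patterns", ["coverage-analysis"]))  # False
# Neither does answering directly with no skill loaded.
print(is_activated("test-anti-patterns", []))  # False
# The target loading alongside another skill still counts.
print(is_activated("test-anti-patterns", ["test-anti-patterns", "test-analysis-extensions"]))  # True
```

Under this reading, both failure modes in the table (a sibling loading, or nothing loading) produce the same skill_not_activated outcome.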

Changes

Skill descriptions (disambiguation + activation):

- test-anti-patterns: rewrote the description to own the umbrella "audit/review my tests for anti-patterns" request and front-load concrete trigger keywords (self-referential/round-trip, coverage-touching, flakiness: Thread.Sleep/DateTime.Now/time.sleep, reflection coupling, Python/pytest). Added DO NOT USE FOR redirects to the three metric-focused siblings. Kept within the 1024-char cap.
- assertion-quality: added a DO NOT USE FOR redirect sending a general severity-ranked anti-pattern audit (even one focused on self-referential assertions) to test-anti-patterns, unless an assertion-diversity metrics report is requested.
- coverage-analysis: added a DO NOT USE FOR redirect sending audits of test code for the "coverage-touching" anti-pattern to test-anti-patterns; clarified that it needs/produces Cobertura/CRAP metrics.
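
A quick way to keep a description under the cap while editing is a length check like the sketch below. It assumes the cap counts characters as Python's `len()` does (Unicode code points); whether the real validator counts characters or bytes is not stated here. The helper name and the sample text are hypothetical.

```python
MAX_DESCRIPTION_CHARS = 1024  # cap stated in this PR


def check_description(description: str, limit: int = MAX_DESCRIPTION_CHARS) -> tuple[bool, int]:
    """Return (fits, length) for a skill description string."""
    length = len(description)
    return length <= limit, length


# Hypothetical description text, much shorter than the real one.
sample = (
    "Audit test code for anti-patterns with a severity-ranked report. "
    "DO NOT USE FOR: coverage metrics (use coverage-analysis)."
)
fits, length = check_description(sample)
print(fits, length)
```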

Eval fixes:

- assertion-quality eval.yaml/eval.vally.yaml: replaced the hyphenated target name assertion-quality with the spaced "assertion quality" in two prompts (fixes the "prompt mentions target name" validation error), and used the Jest matcher call form (toBeTruthy() etc.) per review feedback.
- test-anti-patterns eval.yaml/eval.vally.yaml: made the mixed-severity and flakiness prompts realistic by removing the inline "answer-key" catalog enumeration (which made the agent self-sufficient, so it skipped the skill). Rubric and output_matches assertions are unchanged.

Results

- Disambiguation works: no sibling steals activation anymore (detectedSkills=[] on misses).
- The skill shows positive value: after the realistic-prompt change, the isolated improvement score is positive for mixed (+0.084), duplicated (+0.184), well-written (+0.074) and polyglot (+0.075), so the skill measurably helps when it is the only candidate.

Why some scenario verdicts remain ❌ (pre-existing, structural — not caused by this PR)

These scenarios were already ❌ before this PR. They hit pattern #8 from InvestigatingResults.md ("baseline already good"): a strong model audits test anti-patterns at ~5.0/5 without any skill, so there's no quality headroom. The scenario score is min(isolated, plugin), and the plugin run is structurally ≤ 0 — loading the full plugin keeps ~27 sibling skill descriptions in context, adding token overhead even when the skill doesn't activate. So min() stays ≤ 0 regardless of activation.

This is not fixable by activation/description tuning. The only honest remedy is to give the skill genuine headroom (harder fixtures where the un-skilled baseline misses findings) — a separate, judgment-heavy change the repo's own docs flag as bordering on eval-gaming, intentionally left out of this PR. Compare coverage-analysis (baseline 2–3/5 → 5/5), which passes precisely because it has real headroom.
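
The min() rule can be made concrete with a small sketch. The pass rule (score strictly above zero) is an assumption, and the input values are illustrative rather than read from any results.json.

```python
def scenario_score(isolated: float, plugin: float) -> float:
    """Scenario score as described in this PR: the worse of the two runs."""
    return min(isolated, plugin)


def passes(score: float, threshold: float = 0.0) -> bool:
    """Assumed pass rule: the score must be strictly above the threshold."""
    return score > threshold


# Illustrative values: a positive isolated improvement, a slightly negative plugin score.
isolated, plugin = 0.084, -0.028
score = scenario_score(isolated, plugin)
print(score, passes(score))  # -0.028 False
```

However large the isolated gain, the scenario cannot pass under this rule while the plugin run stays at or below zero.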

Scope

This PR fixes the activation/disambiguation problem and the validation error. The residual pattern-#8 verdicts are pre-existing and tracked separately; they predate and are unaffected by these changes.

Sibling skills with overlapping descriptions were stealing activation from test-anti-patterns in plugin eval runs (coverage-analysis, assertion-quality, test-smell-detection). Reword descriptions so test-anti-patterns owns the umbrella 'audit my tests for anti-patterns' severity-ranked report, and add DO NOT USE redirects in the metric-focused siblings. Kept all descriptions within the 1024-char cap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-06-17T15:43:49Z

Skill Coverage Report

	Plugin	Skill	Covered	Coverage
✅	`dotnet-test`	`assertion-quality`	20/22	90.9%
✅	`dotnet-test`	`test-anti-patterns`	19/21	90.5%

Uncovered: dotnet-test/assertion-quality

[Validation] Metrics are computed correctly (counts add up) (line 155)
[Validation] If the suite has good diversity, the report acknowledges this (line 160)

Uncovered: dotnet-test/test-anti-patterns

[Validation] Every finding includes a specific location (not just a general warning) (line 154)
[Validation] Recommendations are prioritized by severity (line 158)

Copilot

Pull request overview

Updates dotnet-test skill frontmatter descriptions to prevent sibling skills (e.g., coverage-analysis, assertion-quality) from “stealing” activation during eval runs, which was causing skill_not_activated failures for test-anti-patterns.

Changes:

- Makes test-anti-patterns explicitly the umbrella “severity-ranked anti-pattern audit” skill and clarifies when not to use it (redirecting to metric-focused siblings).
- Adds DO NOT USE redirects in coverage-analysis for the “coverage-touching” anti-pattern scenario (redirect to test-anti-patterns).
- Tightens assertion-quality scope to metrics-focused assertion diversity, and redirects general anti-pattern audits to test-anti-patterns.

Show a summary per file

File	Description
plugins/dotnet-test/skills/test-anti-patterns/SKILL.md	Repositions the skill as the default anti-pattern audit umbrella and adds redirects to sibling metric skills.
plugins/dotnet-test/skills/coverage-analysis/SKILL.md	Redirects “coverage-touching” test-quality audits to `test-anti-patterns` and clarifies Cobertura/CRAP metrics focus.
plugins/dotnet-test/skills/assertion-quality/SKILL.md	Narrows/clarifies scope to assertion-diversity metrics and redirects general anti-pattern audits to `test-anti-patterns`.

Copilot's findings

Files reviewed: 3/3 changed files
Comments generated: 1

github-actions · 2026-06-17T17:23:52Z

👋 @Evangelink — this PR has 1 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

Addresses review feedback: the Jest matcher example read like a property without parentheses. Description stays within the 1024-char cap (1023).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Evangelink · 2026-06-17T17:25:01Z

/evaluate

github-actions · 2026-06-17T17:32:01Z

Skill Validation Results

❌ Skill validation errors

assertion-quality: Eval scenario 'Identify self-referential assertions in identity and round-trip tests' prompt mentions target name 'assertion-quality' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. Eval scenario 'Polyglot: evaluate shallow assertions in a Jest/TypeScript OrderService suite' prompt mentions target name 'assertion-quality' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs.
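
Why the spaced form "assertion quality" clears this error can be seen with a minimal sketch. It assumes the validator does something like a case-insensitive substring match on the target name; the real check may differ, and both prompts below are hypothetical.

```python
def mentions_target(prompt: str, target: str) -> bool:
    """Assumed check: case-insensitive substring match on the target name."""
    return target.lower() in prompt.lower()


target = "assertion-quality"
# Hypothetical prompts, before and after the rewording.
before = "Run an assertion-quality review of the OrderService tests."
after = "Run an assertion quality review of the OrderService tests."

print(mentions_target(before, target))  # True
print(mentions_target(after, target))   # False
```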

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
coverage-analysis	Project-wide coverage analysis with existing Cobertura data	2.0/5 → 4.0/5 🟢	✅ coverage-analysis; tools: skill, bash, create, read_bash, stop_bash, view / ✅ coverage-analysis; tools: skill, bash, create, view	🟡 0.21	✅
coverage-analysis	Run coverage from scratch without existing data	3.3/5 → 5.0/5 🟢	✅ coverage-analysis; tools: skill, create, glob / ✅ coverage-analysis; tools: skill, glob, create	🟡 0.21	✅
coverage-analysis	Coverage plateau diagnosis	3.0/5 → 4.3/5 🟢	✅ coverage-analysis; tools: skill, bash, create	🟡 0.21	✅
test-anti-patterns	Detect mixed severity anti-patterns in repository service tests	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill / ✅ test-anti-patterns; test-analysis-extensions; tools: skill, glob	✅ 0.20	❌ [1]
test-anti-patterns	Detect flakiness indicators and test coupling	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED	✅ 0.20	❌ [2]
test-anti-patterns	Detect duplicated tests and magic values	4.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: skill, glob / ✅ test-anti-patterns; test-analysis-extensions; tools: skill	✅ 0.20	❌ [3]
test-anti-patterns	Recognize well-written tests without inventing false positives	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill, glob / ⚠️ NOT ACTIVATED	✅ 0.20	❌ [4]
test-anti-patterns	Detect coverage-touching pattern across a service facade	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill, glob / ✅ test-anti-patterns; test-analysis-extensions; tools: skill	✅ 0.20	❌ [5]
test-anti-patterns	Detect self-referential assertions in round-trip and identity tests	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill / ✅ test-anti-patterns; test-analysis-extensions; tools: skill	✅ 0.20	❌ [6]
test-anti-patterns	Polyglot: detect anti-patterns in a Python pytest suite	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.20	❌ [7]

[1] (Plugin) Quality unchanged but weighted score is -7.7% due to: tokens (27827 → 92391), tool calls (3 → 6), time (28.4s → 49.3s)
[2] ⚠️ High run-to-run variance (CV=69%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -6.2% due to: tokens (31573 → 56118), tool calls (2 → 4), time (27.4s → 34.0s)
[3] ⚠️ High run-to-run variance (CV=150%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -16.6% due to: judgment, tokens (40530 → 65542), tool calls (3 → 6), time (33.8s → 41.8s)
[4] ⚠️ High run-to-run variance (CV=63%) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=64%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -28.1% due to: judgment, quality, tokens (28839 → 53867), tool calls (3 → 5), time (37.9s → 46.0s)
[6] (Plugin) Quality unchanged but weighted score is -10.9% due to: tokens (42384 → 127801), tool calls (4 → 10), time (41.7s → 93.7s), quality
[7] (Plugin) Quality unchanged but weighted score is -3.1% due to: tokens (28112 → 38628), quality
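
The CV figures in these footnotes are coefficients of variation. The sketch below uses the textbook definition (sample standard deviation divided by the absolute mean); whether the validator uses exactly this definition is an assumption, and the per-run scores are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores: list[float]) -> float:
    """Sample standard deviation divided by the absolute mean."""
    m = mean(scores)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(scores) / abs(m)


# Hypothetical per-run improvement scores for one scenario.
stable = [0.10, 0.12, 0.11]
noisy = [0.10, -0.10, 0.03]  # mean is close to zero

print(f"{coefficient_of_variation(stable):.0%}")
print(f"{coefficient_of_variation(noisy):.0%}")
```

Because the mean sits in the denominator, scores that straddle zero can produce a CV far above 100% even when the absolute spread is small, which is one plausible reading of the very large values reported for some scenarios.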

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 786 in dotnet/skills, download eval artifacts with gh run download 27707275566 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/ce700f6675c3c6eaf048bd10f9b07c07c4a74716/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

- assertion-quality eval.yaml/eval.vally.yaml: replace hyphenated 'assertion-quality' (the target skill name) with spaced 'assertion quality' in two scenario prompts, fixing the 'prompt mentions target name' validation error that biased baseline runs.
- test-anti-patterns description: add 'what's wrong with my tests' / 'are these tests any good' / 'flaky tests' trigger phrasing to improve organic activation for the flakiness, well-written and polyglot scenarios (which intermittently failed to activate in plugin runs). Stays within the 1024-char description cap (1007).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Evangelink · 2026-06-17T17:43:54Z

/evaluate

Copilot

Copilot's findings

Files reviewed: 5/5 changed files
Comments generated: 1

Addresses review feedback: write toBeDefined()/toBeTruthy()/not.toBeNull()/toBe()/toThrow() in call form in the prompt so they read as matcher calls, consistent with the skill description examples. Regex assertions and rubric left untouched (they match agent output, which may use either form).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Evangelink · 2026-06-17T17:50:09Z

/evaluate

github-actions · 2026-06-17T18:04:43Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
coverage-analysis	Project-wide coverage analysis with existing Cobertura data	3.0/5 → 5.0/5 🟢	✅ coverage-analysis; tools: skill, bash, create / ✅ coverage-analysis; tools: skill, bash, read_bash, stop_bash, create	✅ 0.14	❌ [1]
coverage-analysis	Run coverage from scratch without existing data	4.0/5 → 5.0/5 🟢	✅ coverage-analysis; tools: skill, glob, create / ✅ coverage-analysis; tools: skill, create	✅ 0.14	✅
coverage-analysis	Coverage plateau diagnosis	3.0/5 → 5.0/5 🟢	✅ coverage-analysis; tools: skill, bash, create	✅ 0.14	✅
test-anti-patterns	Detect mixed severity anti-patterns in repository service tests	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.19	❌ [2]
test-anti-patterns	Detect flakiness indicators and test coupling	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED	✅ 0.19	❌ [3]
test-anti-patterns	Detect duplicated tests and magic values	4.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.19	✅ [4]
test-anti-patterns	Recognize well-written tests without inventing false positives	4.3/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: skill, glob / ✅ test-anti-patterns; tools: skill	✅ 0.19	✅ [5]
test-anti-patterns	Detect coverage-touching pattern across a service facade	5.0/5 → 4.7/5 🔴	✅ test-anti-patterns; tools: skill	✅ 0.19	❌ [6]
test-anti-patterns	Detect self-referential assertions in round-trip and identity tests	4.7/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: skill	✅ 0.19	❌ [7]
test-anti-patterns	Polyglot: detect anti-patterns in a Python pytest suite	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED	✅ 0.19	❌ [8]
assertion-quality	Identify low assertion diversity in equality-dominated test suite	4.0/5 → 5.0/5 🟢	✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; test-analysis-extensions; tools: skill	🟡 0.28	✅
assertion-quality	Flag assertion-free tests and trivial-only assertions	4.0/5 → 4.3/5 🟢	✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; test-analysis-extensions; tools: skill, glob	🟡 0.28	❌ [9]
assertion-quality	Recognize well-diversified assertion usage	4.0/5 → 4.3/5 🟢	✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; test-analysis-extensions; tools: skill	🟡 0.28	✅ [10]
assertion-quality	Identify self-referential assertions in identity and round-trip tests	3.7/5 → 4.0/5 🟢	✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; test-analysis-extensions; tools: skill	🟡 0.28	✅ [11]
assertion-quality	Decline request to write new tests from scratch	4.3/5 → 3.0/5 🔴	ℹ️ not activated (expected) / ✅ writing-mstest-tests; code-testing-agent; code-testing-extensions; test-gap-analysis; assertion-quality; tools: skill, bash, edit, task, read_agent	🟡 0.28	❌ [12]
assertion-quality	Polyglot: evaluate shallow assertions in a Jest/TypeScript OrderService suite	5.0/5 → 5.0/5	✅ assertion-quality; tools: skill, glob / ⚠️ NOT ACTIVATED	🟡 0.28	❌ [13]

[1] ⚠️ High run-to-run variance (CV=637%) — consider re-running with --runs 5
[2] (Plugin) Quality unchanged but weighted score is -2.4% due to: tokens (27845 → 38313)
[3] ⚠️ High run-to-run variance (CV=76%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -3.2% due to: tokens (27255 → 43637)
[4] ⚠️ High run-to-run variance (CV=107%) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=69%) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=66%) — consider re-running with --runs 5
[7] (Plugin) Quality unchanged but weighted score is -5.6% due to: tokens (42205 → 93064), tool calls (4 → 5), time (35.1s → 43.8s)
[8] ⚠️ High run-to-run variance (CV=114%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -13.6% due to: judgment, quality
[9] ⚠️ High run-to-run variance (CV=275%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -4.5% due to: tokens (26584 → 56159), tool calls (2 → 5), time (22.0s → 35.7s)
[10] ⚠️ High run-to-run variance (CV=56%) — consider re-running with --runs 5
[11] ⚠️ High run-to-run variance (CV=1823%) — consider re-running with --runs 5
[12] ⚠️ High run-to-run variance (CV=111%) — consider re-running with --runs 5
[13] (Plugin) Quality unchanged but weighted score is -1.9% due to: tokens (27367 → 37743)

⏰ timeout — run(s) hit the (300s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 786 in dotnet/skills, download eval artifacts with gh run download 27708721042 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/55dfa1c92e67f419a2e57b6d820896edb4f61419/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

github-actions · 2026-06-17T19:10:58Z

✅ Evaluation passed for 55dfa1c. cc @dotnet/dotnet-testing — please review.

The flakiness and Python-pytest scenarios failed to activate even in isolated runs (where it's the only candidate skill), because their prompts enumerate the methodology and the description's keywords were too generic. Front-load the concrete trigger keywords those prompts use: Thread.Sleep, DateTime.Now, time.sleep, order-dependent, reflection coupling, and Python/pytest. Stays within the 1024-char cap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Evangelink · 2026-06-18T08:04:50Z

/evaluate

Copilot

Copilot's findings

Files reviewed: 5/5 changed files
Comments generated: 1

github-actions · 2026-06-18T08:18:27Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
coverage-analysis	Project-wide coverage analysis with existing Cobertura data	2.0/5 → 5.0/5 🟢	✅ coverage-analysis; tools: skill, bash, create	✅ 0.08	✅
coverage-analysis	Run coverage from scratch without existing data	3.3/5 → 5.0/5 🟢	✅ coverage-analysis; tools: skill, glob, create	✅ 0.08	✅
coverage-analysis	Coverage plateau diagnosis	3.0/5 → 4.3/5 🟢	✅ coverage-analysis; tools: skill, bash, read_bash, stop_bash, create, view / ✅ coverage-analysis; tools: skill, bash, create, view	✅ 0.08	✅ [1]
test-anti-patterns	Detect mixed severity anti-patterns in repository service tests	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.20	❌ [2]
test-anti-patterns	Detect flakiness indicators and test coupling	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.20	❌ [3]
test-anti-patterns	Detect duplicated tests and magic values	4.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: skill	🟡 0.20	❌ [4]
test-anti-patterns	Recognize well-written tests without inventing false positives	4.0/5 → 4.3/5 🟢	✅ test-anti-patterns; tools: skill / ✅ test-anti-patterns; test-analysis-extensions; tools: skill	🟡 0.20	❌ [5]
test-anti-patterns	Detect coverage-touching pattern across a service facade	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill	🟡 0.20	❌ [6]
test-anti-patterns	Detect self-referential assertions in round-trip and identity tests	4.3/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: skill / ✅ test-anti-patterns; test-analysis-extensions; tools: skill	🟡 0.20	✅ [7]
test-anti-patterns	Polyglot: detect anti-patterns in a Python pytest suite	4.3/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.20	✅ [8]
assertion-quality	Identify low assertion diversity in equality-dominated test suite	3.7/5 → 5.0/5 🟢	✅ assertion-quality; tools: skill, glob, bash / ✅ assertion-quality; test-analysis-extensions; tools: skill, glob	🟡 0.26	✅
assertion-quality	Flag assertion-free tests and trivial-only assertions	4.0/5 → 4.3/5 🟢	✅ assertion-quality; tools: skill, glob, bash / ✅ assertion-quality; test-analysis-extensions; tools: skill	🟡 0.26	❌ [9]
assertion-quality	Recognize well-diversified assertion usage	3.7/5 → 5.0/5 🟢	✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; test-analysis-extensions; tools: skill	🟡 0.26	✅ [10]
assertion-quality	Identify self-referential assertions in identity and round-trip tests	4.0/5 → 3.7/5 🔴	✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; test-analysis-extensions; tools: skill	🟡 0.26	❌
assertion-quality	Decline request to write new tests from scratch	4.7/5 → 4.3/5 🔴	ℹ️ not activated (expected)	🟡 0.26	❌
assertion-quality	Polyglot: evaluate shallow assertions in a Jest/TypeScript OrderService suite	5.0/5 → 5.0/5	✅ assertion-quality; tools: skill, glob / ⚠️ NOT ACTIVATED	🟡 0.26	❌ [11]

[1] ⚠️ High run-to-run variance (CV=54%) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=121%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -19.1% due to: judgment, tokens (27850 → 53143), tool calls (3 → 5), time (26.4s → 35.8s)
[3] ⚠️ High run-to-run variance (CV=80%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -3.2% due to: tokens (27243 → 43672)
[4] ⚠️ High run-to-run variance (CV=314%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -4.5% due to: tokens (40312 → 73244)
[5] ⚠️ High run-to-run variance (CV=51%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -6.8% due to: tokens (26412 → 47523), time (18.1s → 29.3s), tool calls (2 → 3)
[6] ⚠️ High run-to-run variance (CV=79%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -27.2% due to: judgment, quality, tokens (28422 → 53310), tool calls (3 → 4)
[7] ⚠️ High run-to-run variance (CV=805%) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=76%) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=363%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -7.8% due to: tokens (35171 → 100467), tool calls (3 → 6), time (26.6s → 56.6s)
[10] ⚠️ High run-to-run variance (CV=76%) — consider re-running with --runs 5
[11] ⚠️ High run-to-run variance (CV=178%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -1.4% due to: tokens (27471 → 37856)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 786 in dotnet/skills, download eval artifacts with gh run download 27745619190 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/e08f24fd2836fd24d63edcf9bcd4c4480271765d/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

The mixed-severity and flakiness scenarios consistently failed to activate test-anti-patterns in plugin runs (detected=[]: the agent loaded no skill at all and answered directly). Both prompts enumerated the full anti-pattern catalog inline, acting as an answer key that made the agent self-sufficient. Replace the embedded checklists with realistic user asks while keeping the 'for .NET test anti-patterns' trigger, file references, severity-ranked output format, and read-only constraint. Rubric and output_matches assertions are unchanged; they validate the produced report.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Evangelink · 2026-06-18T11:29:10Z

/evaluate

github-actions · 2026-06-18T11:40:32Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
coverage-analysis	Project-wide coverage analysis with existing Cobertura data	3.0/5 → 4.7/5 🟢	✅ coverage-analysis; tools: skill, create, bash / ✅ coverage-analysis; tools: skill, create, glob, bash	✅ 0.11	✅
coverage-analysis	Run coverage from scratch without existing data	4.0/5 → 4.7/5 🟢	✅ coverage-analysis; tools: skill, glob, read_bash, stop_bash, create / ✅ coverage-analysis; tools: skill, create, glob	✅ 0.11	✅
coverage-analysis	Coverage plateau diagnosis	3.3/5 → 5.0/5 🟢	✅ coverage-analysis; tools: skill, bash, read_bash, stop_bash, create, view / ✅ coverage-analysis; tools: skill, bash, create, view	✅ 0.11	✅ [1]
test-anti-patterns	Detect mixed severity anti-patterns in repository service tests	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.17	❌ [2]
test-anti-patterns	Detect flakiness indicators and test coupling	5.0/5 → 4.3/5 🔴	✅ test-anti-patterns; tools: skill, glob / ⚠️ NOT ACTIVATED	✅ 0.17	❌ [3]
test-anti-patterns	Detect duplicated tests and magic values	4.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: skill, glob / ✅ test-anti-patterns; test-analysis-extensions; tools: skill	✅ 0.17	❌ [4]
test-anti-patterns	Recognize well-written tests without inventing false positives	4.7/5 → 4.7/5	✅ test-anti-patterns; tools: skill, glob / ✅ test-anti-patterns; test-analysis-extensions; tools: skill	✅ 0.17	❌ [5]
test-anti-patterns	Detect coverage-touching pattern across a service facade	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill, glob / ✅ test-anti-patterns; tools: skill	✅ 0.17	❌ [6]
test-anti-patterns	Detect self-referential assertions in round-trip and identity tests	4.7/5 → 4.7/5	✅ test-anti-patterns; tools: skill / ✅ test-anti-patterns; test-analysis-extensions; tools: skill	✅ 0.17	❌ [7]
test-anti-patterns	Polyglot: detect anti-patterns in a Python pytest suite	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.17	❌ [8]
assertion-quality	Identify low assertion diversity in equality-dominated test suite	4.0/5 → 5.0/5 🟢	✅ assertion-quality; tools: skill, glob, bash / ✅ assertion-quality; test-analysis-extensions; tools: skill, glob	🟡 0.24	✅
assertion-quality	Flag assertion-free tests and trivial-only assertions	4.0/5 → 4.3/5 🟢	✅ assertion-quality; tools: skill, glob, bash / ✅ assertion-quality; test-analysis-extensions; tools: skill	🟡 0.24	❌ [9]
assertion-quality	Recognize well-diversified assertion usage	3.3/5 → 5.0/5 🟢	✅ assertion-quality; tools: skill, glob, bash, grep / ✅ assertion-quality; test-analysis-extensions; tools: skill	🟡 0.24	✅
assertion-quality	Identify self-referential assertions in identity and round-trip tests	4.7/5 → 3.3/5 🔴	✅ assertion-quality; tools: skill, glob, grep, bash / ✅ assertion-quality; test-analysis-extensions; tools: skill	🟡 0.24	❌ [10]
assertion-quality	Decline request to write new tests from scratch	4.3/5 → 4.7/5 🟢	ℹ️ not activated (expected)	🟡 0.24	❌ [11]
assertion-quality	Polyglot: evaluate shallow assertions in a Jest/TypeScript OrderService suite	5.0/5 → 5.0/5	✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; test-analysis-extensions; tools: skill	🟡 0.24	❌ [12]

[1] ⚠️ High run-to-run variance (CV=76%) — consider re-running with --runs 5
[2] (Plugin) Quality unchanged but weighted score is -2.8% due to: tokens (27827 → 38217), quality
[3] ⚠️ High run-to-run variance (CV=70%) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=208%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -17.8% due to: judgment, quality, tokens (40564 → 58014)
[5] ⚠️ High run-to-run variance (CV=84%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (26596 → 92250), tool calls (2 → 5), time (21.5s → 45.6s)
[6] (Isolated) Quality unchanged but weighted score is -28.3% due to: judgment, quality, tokens (28920 → 54074), tool calls (3 → 5), time (39.0s → 49.6s)
[7] (Isolated) Quality unchanged but weighted score is -6.3% due to: tokens (42529 → 75490), time (42.6s → 76.0s), tool calls (5 → 6)
[8] (Plugin) Quality unchanged but weighted score is -1.9% due to: tokens (28123 → 38527)
[9] ⚠️ High run-to-run variance (CV=299%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -2.4% due to: tokens (35106 → 68790), tool calls (3 → 7), time (25.3s → 47.8s)
[10] ⚠️ High run-to-run variance (CV=68%) — consider re-running with --runs 5
[11] ⚠️ High run-to-run variance (CV=189%) — consider re-running with --runs 5
[12] ⚠️ High run-to-run variance (CV=167%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -6.3% due to: tokens (27510 → 64569), time (24.7s → 39.8s), tool calls (3 → 4)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 786 in dotnet/skills, download eval artifacts with gh run download 27756338562 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/3b807d694060096249acb729b85b98a2309cf886/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

Copilot AI review requested due to automatic review settings June 17, 2026 15:43

Copilot started reviewing on behalf of Evangelink June 17, 2026 15:43 View session

Copilot AI reviewed Jun 17, 2026

Comment thread plugins/dotnet-test/skills/assertion-quality/SKILL.md Outdated

github-actions Bot added the waiting-on-author PR state label label Jun 17, 2026

Use toBeTruthy() call form in assertion-quality example

ce700f6

github-actions Bot added a commit that referenced this pull request Jun 17, 2026

Update PR token usage data (PR #786)

e59e700

Copilot AI review requested due to automatic review settings June 17, 2026 17:43

Copilot started reviewing on behalf of Evangelink June 17, 2026 17:43 View session

Copilot AI reviewed Jun 17, 2026

Comment thread tests/dotnet-test/assertion-quality/eval.yaml Outdated

github-actions Bot added a commit that referenced this pull request Jun 17, 2026

Update PR token usage data (PR #786)

f1e57f0

github-actions Bot added waiting-on-review PR state label and removed waiting-on-author PR state label labels Jun 17, 2026

Copilot AI review requested due to automatic review settings June 18, 2026 08:04

Copilot started reviewing on behalf of Evangelink June 18, 2026 08:05 View session

Copilot AI reviewed Jun 18, 2026

Comment thread tests/dotnet-test/assertion-quality/eval.yaml

github-actions Bot added a commit that referenced this pull request Jun 18, 2026

Update PR token usage data (PR #786)

e4f2caa

github-actions Bot added a commit that referenced this pull request Jun 18, 2026

Update PR token usage data (PR #786)

ff485bf

Evangelink enabled auto-merge (squash) June 18, 2026 12:47

YuliiaKovalova approved these changes Jun 18, 2026

Evangelink merged commit 14d727f into main Jun 18, 2026
35 of 37 checks passed

Evangelink deleted the fix/test-anti-patterns-activation branch June 18, 2026 14:13
