Fix test-anti-patterns skill activation for 5 evals #786

Merged
Evangelink merged 6 commits into main from fix/test-anti-patterns-activation
Jun 18, 2026

@Evangelink Evangelink commented Jun 17, 2026

Member

Problem

Five test-anti-patterns eval scenarios were reported as skill_not_activated: an overlapping sibling skill (or no skill) loaded instead of test-anti-patterns. Separately, the assertion-quality eval had a validation error (prompts naming the target skill).

Root cause

In plugin eval runs a scenario only counts as activated if the target skill fires. Several prompts matched a sibling's advertised capability:

| Scenario | Sibling stealing activation |
| --- | --- |
| coverage-touching | coverage-analysis |
| self-referential assertions | assertion-quality |
| duplicated tests / magic values | test-smell-detection |
| well-written tests | assertion-quality / test-smell-detection |
| Python pytest | assertion-quality / test-gap-analysis |

Changes

Skill descriptions (disambiguation + activation):

  • test-anti-patterns: rewrote the description to own the umbrella "audit/review my tests for anti-patterns" request and front-load concrete trigger keywords (self-referential/round-trip, coverage-touching, flakiness: Thread.Sleep/DateTime.Now/time.sleep/reflection coupling, Python/pytest). Added DO NOT USE FOR redirects to the three metric-focused siblings. Kept within the 1024-char cap.
  • assertion-quality: added a DO NOT USE FOR redirect. A general severity-ranked anti-pattern audit (even one focused on self-referential assertions) → test-anti-patterns, unless an assertion-diversity metrics report is requested.
  • coverage-analysis: added a DO NOT USE FOR redirect. Auditing test code for the "coverage-touching" anti-pattern → test-anti-patterns. Also clarified that the skill needs/produces Cobertura/CRAP metrics.
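
A minimal sketch of the kind of length check the 1024-char cap implies. The helper name and the placeholder description are hypothetical, and it assumes the cap is counted in characters rather than bytes; the actual validator may count differently.

```python
MAX_DESCRIPTION_CHARS = 1024  # cap quoted in this PR


def description_headroom(description: str, cap: int = MAX_DESCRIPTION_CHARS) -> int:
    """Return how many characters are left under the cap.

    Hypothetical helper: raises ValueError when the description is too long.
    Assumes the cap is measured with len(), i.e. in characters.
    """
    length = len(description)
    if length > cap:
        raise ValueError(
            f"description is {length} chars, {length - cap} over the {cap}-char cap"
        )
    return cap - length


if __name__ == "__main__":
    # Placeholder text standing in for a real SKILL.md description.
    placeholder = "x" * 1007
    print(description_headroom(placeholder))  # 17
```

Running a check like this before pushing makes it easy to see how much room is left for additional trigger keywords.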

Eval fixes:

  • assertion-quality eval.yaml/eval.vally.yaml — replaced the hyphenated target name assertion-quality with spaced "assertion quality" in two prompts (fixes the "prompt mentions target name" validation error), and used Jest matcher call-form (toBeTruthy() etc.) per review feedback.
  • test-anti-patterns eval.yaml/eval.vally.yaml — made the mixed-severity and flakiness prompts realistic by removing the inline "answer-key" catalog enumeration (which made the agent self-sufficient and skip the skill). Rubric and output_matches assertions unchanged.
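
To illustrate why swapping the hyphenated name for the spaced form clears the "prompt mentions target name" error, here is a sketch under the assumption that the check is a case-insensitive substring match on the skill name. The function and both prompts are hypothetical; the real validator logic is not shown in this PR.

```python
def prompt_mentions_target(prompt: str, target_name: str) -> bool:
    """Assumed rule: flag a prompt that contains the exact target skill name."""
    return target_name.lower() in prompt.lower()


TARGET = "assertion-quality"

# Hypothetical prompts, not the ones in eval.yaml.
before = "Do an assertion-quality review of the OrderService Jest suite."
after = "Do an assertion quality review of the OrderService Jest suite."

if __name__ == "__main__":
    print(prompt_mentions_target(before, TARGET))  # True: would bias the baseline run
    print(prompt_mentions_target(after, TARGET))   # False: passes the assumed check
```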

Results

  • Disambiguation works: no sibling steals activation anymore (detectedSkills=[] on misses).
  • Skill shows positive value: after the realistic-prompt change, the isolated improvement score is positive for mixed (+0.084), duplicated (+0.184), well-written (+0.074) and polyglot (+0.075), so the skill demonstrably helps when it is the only candidate skill.

Why some scenario verdicts remain ❌ (pre-existing, structural — not caused by this PR)

These scenarios were already ❌ before this PR. They hit pattern #8 from InvestigatingResults.md ("baseline already good"): a strong model audits test anti-patterns at ~5.0/5 without any skill, so there's no quality headroom. The scenario score is min(isolated, plugin), and the plugin run is structurally ≤ 0 — loading the full plugin keeps ~27 sibling skill descriptions in context, adding token overhead even when the skill doesn't activate. So min() stays ≤ 0 regardless of activation.
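
The min() argument above can be made concrete with a small sketch. The function names and the "strictly positive passes" threshold are assumptions for illustration; the two sample values are the mixed-severity isolated score quoted in Results (+0.084) and the -2.8% plugin figure from the final evaluation footnotes, read here as comparable fractions.

```python
def scenario_score(isolated: float, plugin: float) -> float:
    """Scenario score as described in this PR: the worse of the two runs."""
    return min(isolated, plugin)


def passes(isolated: float, plugin: float) -> bool:
    """Assumed verdict rule: the scenario passes only if the score is > 0."""
    return scenario_score(isolated, plugin) > 0


if __name__ == "__main__":
    isolated = 0.084   # skill helps when it is the only candidate
    plugin = -0.028    # token overhead from the full plugin drags this below zero
    print(scenario_score(isolated, plugin))  # -0.028
    print(passes(isolated, plugin))          # False
```

Because the plugin term is the minimum whenever it is negative, raising the isolated score alone cannot flip the verdict.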

This is not fixable by activation/description tuning. The only honest remedy is to give the skill genuine headroom (harder fixtures where the un-skilled baseline misses findings) — a separate, judgment-heavy change the repo's own docs flag as bordering on eval-gaming, intentionally left out of this PR. Compare coverage-analysis (baseline 2–3/5 → 5/5), which passes precisely because it has real headroom.

Scope

This PR fixes the activation/disambiguation problem and the validation error. The residual pattern-#8 verdicts are pre-existing and tracked separately; they predate and are unaffected by these changes.

Sibling skills with overlapping descriptions were stealing activation
from test-anti-patterns in plugin eval runs (coverage-analysis,
assertion-quality, test-smell-detection). Reword descriptions so
test-anti-patterns owns the umbrella 'audit my tests for anti-patterns'
severity-ranked report, and add DO NOT USE redirects in the
metric-focused siblings. Kept all descriptions within the 1024-char cap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 17, 2026 15:43
@github-actions

Contributor

Skill Coverage Report

| Plugin | Skill | Covered | Coverage |
| --- | --- | --- | --- |
| dotnet-test | assertion-quality | 20/22 | 90.9% |
| dotnet-test | test-anti-patterns | 19/21 | 90.5% |

Uncovered: dotnet-test/assertion-quality
  • [Validation] Metrics are computed correctly (counts add up) (line 155)
  • [Validation] If the suite has good diversity, the report acknowledges this (line 160)
Uncovered: dotnet-test/test-anti-patterns
  • [Validation] Every finding includes a specific location (not just a general warning) (line 154)
  • [Validation] Recommendations are prioritized by severity (line 158)

Copilot AI left a comment

Contributor

Pull request overview

Updates dotnet-test skill frontmatter descriptions to prevent sibling skills (e.g., coverage-analysis, assertion-quality) from “stealing” activation during eval runs, which was causing skill_not_activated failures for test-anti-patterns.

Changes:

  • Makes test-anti-patterns explicitly the umbrella “severity-ranked anti-pattern audit” skill and clarifies when not to use it (redirecting to metric-focused siblings).
  • Adds DO NOT USE redirects in coverage-analysis for the “coverage-touching” anti-pattern scenario (redirect to test-anti-patterns).
  • Tightens assertion-quality scope to metrics-focused assertion diversity, and redirects general anti-pattern audits to test-anti-patterns.
Show a summary per file
| File | Description |
| --- | --- |
| plugins/dotnet-test/skills/test-anti-patterns/SKILL.md | Repositions the skill as the default anti-pattern audit umbrella and adds redirects to sibling metric skills. |
| plugins/dotnet-test/skills/coverage-analysis/SKILL.md | Redirects “coverage-touching” test-quality audits to test-anti-patterns and clarifies Cobertura/CRAP metrics focus. |
| plugins/dotnet-test/skills/assertion-quality/SKILL.md | Narrows/clarifies scope to assertion-diversity metrics and redirects general anti-pattern audits to test-anti-patterns. |

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 1

Comment thread plugins/dotnet-test/skills/assertion-quality/SKILL.md Outdated
@github-actions github-actions Bot added the waiting-on-author (PR state label) label Jun 17, 2026
@github-actions

Contributor

👋 @Evangelink — this PR has 1 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

Addresses review feedback: the Jest matcher example read like a property
without parentheses. Description stays within the 1024-char cap (1023).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Evangelink

Member Author

/evaluate

@github-actions

Contributor

Skill Validation Results

❌ Skill validation errors

  • assertion-quality: Eval scenario 'Identify self-referential assertions in identity and round-trip tests' prompt mentions target name 'assertion-quality' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. Eval scenario 'Polyglot: evaluate shallow assertions in a Jest/TypeScript OrderService suite' prompt mentions target name 'assertion-quality' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs.

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| coverage-analysis | Project-wide coverage analysis with existing Cobertura data | 2.0/5 → 4.0/5 🟢 | ✅ coverage-analysis; tools: skill, bash, create, read_bash, stop_bash, view / ✅ coverage-analysis; tools: skill, bash, create, view | 🟡 0.21 | |
| coverage-analysis | Run coverage from scratch without existing data | 3.3/5 → 5.0/5 🟢 | ✅ coverage-analysis; tools: skill, create, glob / ✅ coverage-analysis; tools: skill, glob, create | 🟡 0.21 | |
| coverage-analysis | Coverage plateau diagnosis | 3.0/5 → 4.3/5 🟢 | ✅ coverage-analysis; tools: skill, bash, create | 🟡 0.21 | |
| test-anti-patterns | Detect mixed severity anti-patterns in repository service tests | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill / ✅ test-anti-patterns; test-analysis-extensions; tools: skill, glob | ✅ 0.20 | [1] |
| test-anti-patterns | Detect flakiness indicators and test coupling | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.20 | [2] |
| test-anti-patterns | Detect duplicated tests and magic values | 4.0/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: skill, glob / ✅ test-anti-patterns; test-analysis-extensions; tools: skill | ✅ 0.20 | [3] |
| test-anti-patterns | Recognize well-written tests without inventing false positives | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill, glob / ⚠️ NOT ACTIVATED | ✅ 0.20 | [4] |
| test-anti-patterns | Detect coverage-touching pattern across a service facade | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill, glob / ✅ test-anti-patterns; test-analysis-extensions; tools: skill | ✅ 0.20 | [5] |
| test-anti-patterns | Detect self-referential assertions in round-trip and identity tests | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill / ✅ test-anti-patterns; test-analysis-extensions; tools: skill | ✅ 0.20 | [6] |
| test-anti-patterns | Polyglot: detect anti-patterns in a Python pytest suite | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.20 | [7] |

[1] (Plugin) Quality unchanged but weighted score is -7.7% due to: tokens (27827 → 92391), tool calls (3 → 6), time (28.4s → 49.3s)
[2] ⚠️ High run-to-run variance (CV=69%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -6.2% due to: tokens (31573 → 56118), tool calls (2 → 4), time (27.4s → 34.0s)
[3] ⚠️ High run-to-run variance (CV=150%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -16.6% due to: judgment, tokens (40530 → 65542), tool calls (3 → 6), time (33.8s → 41.8s)
[4] ⚠️ High run-to-run variance (CV=63%) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=64%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -28.1% due to: judgment, quality, tokens (28839 → 53867), tool calls (3 → 5), time (37.9s → 46.0s)
[6] (Plugin) Quality unchanged but weighted score is -10.9% due to: tokens (42384 → 127801), tool calls (4 → 10), time (41.7s → 93.7s), quality
[7] (Plugin) Quality unchanged but weighted score is -3.1% due to: tokens (28112 → 38628), quality
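
For readers puzzled by CV values far above 100% in these footnotes: the coefficient of variation is the standard deviation divided by the absolute mean, so it explodes when the per-run scores straddle zero. A minimal sketch follows, assuming the sample standard deviation is used; the validator may compute it differently, and the run scores below are invented for illustration.

```python
import statistics


def coefficient_of_variation(scores: list[float]) -> float:
    """Sample standard deviation divided by |mean|.

    Assumption: sample (n-1) stdev; returns infinity when the mean is zero.
    """
    mean = statistics.fmean(scores)
    if mean == 0:
        return float("inf")
    return statistics.stdev(scores) / abs(mean)


if __name__ == "__main__":
    # Invented per-run improvement scores, not taken from the eval artifacts.
    stable = [0.10, 0.12, 0.11]   # mean well away from zero: CV about 9%
    noisy = [0.10, -0.08, 0.01]   # mean close to zero: CV about 900%
    print(f"{coefficient_of_variation(stable):.0%}")
    print(f"{coefficient_of_variation(noisy):.0%}")
```

A high CV here therefore says more about the mean improvement being near zero than about wildly different outputs, which fits the "baseline already good" scenarios.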

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 786 in dotnet/skills, download eval artifacts with gh run download 27707275566 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/ce700f6675c3c6eaf048bd10f9b07c07c4a74716/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

github-actions Bot added a commit that referenced this pull request Jun 17, 2026
- assertion-quality eval.yaml/eval.vally.yaml: replace hyphenated
  'assertion-quality' (the target skill name) with spaced 'assertion
  quality' in two scenario prompts, fixing the 'prompt mentions target
  name' validation error that biased baseline runs.
- test-anti-patterns description: add 'what's wrong with my tests' /
  'are these tests any good' / 'flaky tests' trigger phrasing to improve
  organic activation for the flakiness, well-written and polyglot
  scenarios (which intermittently failed to activate in plugin runs).
  Stays within the 1024-char description cap (1007).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 17, 2026 17:43
@Evangelink

Member Author

/evaluate

Copilot AI left a comment

Contributor

Copilot's findings

  • Files reviewed: 5/5 changed files
  • Comments generated: 1

Comment thread tests/dotnet-test/assertion-quality/eval.yaml Outdated
Addresses review feedback: write toBeDefined()/toBeTruthy()/not.toBeNull()/
toBe()/toThrow() in call form in the prompt so they read as matcher calls,
consistent with the skill description examples. Regex assertions and rubric
left untouched (they match agent output, which may use either form).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Evangelink

Member Author

/evaluate

github-actions Bot added a commit that referenced this pull request Jun 17, 2026
@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| coverage-analysis | Project-wide coverage analysis with existing Cobertura data | 3.0/5 → 5.0/5 🟢 | ✅ coverage-analysis; tools: skill, bash, create / ✅ coverage-analysis; tools: skill, bash, read_bash, stop_bash, create | ✅ 0.14 | [1] |
| coverage-analysis | Run coverage from scratch without existing data | 4.0/5 → 5.0/5 🟢 | ✅ coverage-analysis; tools: skill, glob, create / ✅ coverage-analysis; tools: skill, create | ✅ 0.14 | |
| coverage-analysis | Coverage plateau diagnosis | 3.0/5 → 5.0/5 🟢 | ✅ coverage-analysis; tools: skill, bash, create | ✅ 0.14 | |
| test-anti-patterns | Detect mixed severity anti-patterns in repository service tests | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.19 | [2] |
| test-anti-patterns | Detect flakiness indicators and test coupling | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.19 | [3] |
| test-anti-patterns | Detect duplicated tests and magic values | 4.0/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.19 | [4] |
| test-anti-patterns | Recognize well-written tests without inventing false positives | 4.3/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: skill, glob / ✅ test-anti-patterns; tools: skill | ✅ 0.19 | [5] |
| test-anti-patterns | Detect coverage-touching pattern across a service facade | 5.0/5 → 4.7/5 🔴 | ✅ test-anti-patterns; tools: skill | ✅ 0.19 | [6] |
| test-anti-patterns | Detect self-referential assertions in round-trip and identity tests | 4.7/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: skill | ✅ 0.19 | [7] |
| test-anti-patterns | Polyglot: detect anti-patterns in a Python pytest suite | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.19 | [8] |
| assertion-quality | Identify low assertion diversity in equality-dominated test suite | 4.0/5 → 5.0/5 🟢 | ✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; test-analysis-extensions; tools: skill | 🟡 0.28 | |
| assertion-quality | Flag assertion-free tests and trivial-only assertions | 4.0/5 → 4.3/5 🟢 | ✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; test-analysis-extensions; tools: skill, glob | 🟡 0.28 | [9] |
| assertion-quality | Recognize well-diversified assertion usage | 4.0/5 → 4.3/5 🟢 | ✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; test-analysis-extensions; tools: skill | 🟡 0.28 | [10] |
| assertion-quality | Identify self-referential assertions in identity and round-trip tests | 3.7/5 → 4.0/5 🟢 | ✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; test-analysis-extensions; tools: skill | 🟡 0.28 | [11] |
| assertion-quality | Decline request to write new tests from scratch | 4.3/5 → 3.0/5 🔴 | ℹ️ not activated (expected) / ✅ writing-mstest-tests; code-testing-agent; code-testing-extensions; test-gap-analysis; assertion-quality; tools: skill, bash, edit, task, read_agent | 🟡 0.28 | [12] |
| assertion-quality | Polyglot: evaluate shallow assertions in a Jest/TypeScript OrderService suite | 5.0/5 → 5.0/5 | ✅ assertion-quality; tools: skill, glob / ⚠️ NOT ACTIVATED | 🟡 0.28 | [13] |

[1] ⚠️ High run-to-run variance (CV=637%) — consider re-running with --runs 5
[2] (Plugin) Quality unchanged but weighted score is -2.4% due to: tokens (27845 → 38313)
[3] ⚠️ High run-to-run variance (CV=76%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -3.2% due to: tokens (27255 → 43637)
[4] ⚠️ High run-to-run variance (CV=107%) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=69%) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=66%) — consider re-running with --runs 5
[7] (Plugin) Quality unchanged but weighted score is -5.6% due to: tokens (42205 → 93064), tool calls (4 → 5), time (35.1s → 43.8s)
[8] ⚠️ High run-to-run variance (CV=114%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -13.6% due to: judgment, quality
[9] ⚠️ High run-to-run variance (CV=275%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -4.5% due to: tokens (26584 → 56159), tool calls (2 → 5), time (22.0s → 35.7s)
[10] ⚠️ High run-to-run variance (CV=56%) — consider re-running with --runs 5
[11] ⚠️ High run-to-run variance (CV=1823%) — consider re-running with --runs 5
[12] ⚠️ High run-to-run variance (CV=111%) — consider re-running with --runs 5
[13] (Plugin) Quality unchanged but weighted score is -1.9% due to: tokens (27367 → 37743)

timeout — run(s) hit the (300s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 786 in dotnet/skills, download eval artifacts with gh run download 27708721042 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/55dfa1c92e67f419a2e57b6d820896edb4f61419/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

@github-actions github-actions Bot added the waiting-on-review (PR state label) label and removed the waiting-on-author (PR state label) label Jun 17, 2026
@github-actions

Contributor

✅ Evaluation passed for 55dfa1c. cc @dotnet/dotnet-testing — please review.

The flakiness and Python-pytest scenarios failed to activate even in
isolated runs (where it's the only candidate skill), because their prompts
enumerate the methodology and the description's keywords were too generic.
Front-load the concrete trigger keywords those prompts use: Thread.Sleep,
DateTime.Now, time.sleep, order-dependent, reflection coupling, and
Python/pytest. Stays within the 1024-char cap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 18, 2026 08:04
@Evangelink

Member Author

/evaluate

Copilot AI left a comment

Contributor

Copilot's findings

  • Files reviewed: 5/5 changed files
  • Comments generated: 1

Comment thread tests/dotnet-test/assertion-quality/eval.yaml
github-actions Bot added a commit that referenced this pull request Jun 18, 2026
@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| coverage-analysis | Project-wide coverage analysis with existing Cobertura data | 2.0/5 → 5.0/5 🟢 | ✅ coverage-analysis; tools: skill, bash, create | ✅ 0.08 | |
| coverage-analysis | Run coverage from scratch without existing data | 3.3/5 → 5.0/5 🟢 | ✅ coverage-analysis; tools: skill, glob, create | ✅ 0.08 | |
| coverage-analysis | Coverage plateau diagnosis | 3.0/5 → 4.3/5 🟢 | ✅ coverage-analysis; tools: skill, bash, read_bash, stop_bash, create, view / ✅ coverage-analysis; tools: skill, bash, create, view | ✅ 0.08 | [1] |
| test-anti-patterns | Detect mixed severity anti-patterns in repository service tests | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.20 | [2] |
| test-anti-patterns | Detect flakiness indicators and test coupling | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.20 | [3] |
| test-anti-patterns | Detect duplicated tests and magic values | 4.0/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: skill | 🟡 0.20 | [4] |
| test-anti-patterns | Recognize well-written tests without inventing false positives | 4.0/5 → 4.3/5 🟢 | ✅ test-anti-patterns; tools: skill / ✅ test-anti-patterns; test-analysis-extensions; tools: skill | 🟡 0.20 | [5] |
| test-anti-patterns | Detect coverage-touching pattern across a service facade | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill | 🟡 0.20 | [6] |
| test-anti-patterns | Detect self-referential assertions in round-trip and identity tests | 4.3/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: skill / ✅ test-anti-patterns; test-analysis-extensions; tools: skill | 🟡 0.20 | [7] |
| test-anti-patterns | Polyglot: detect anti-patterns in a Python pytest suite | 4.3/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.20 | [8] |
| assertion-quality | Identify low assertion diversity in equality-dominated test suite | 3.7/5 → 5.0/5 🟢 | ✅ assertion-quality; tools: skill, glob, bash / ✅ assertion-quality; test-analysis-extensions; tools: skill, glob | 🟡 0.26 | |
| assertion-quality | Flag assertion-free tests and trivial-only assertions | 4.0/5 → 4.3/5 🟢 | ✅ assertion-quality; tools: skill, glob, bash / ✅ assertion-quality; test-analysis-extensions; tools: skill | 🟡 0.26 | [9] |
| assertion-quality | Recognize well-diversified assertion usage | 3.7/5 → 5.0/5 🟢 | ✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; test-analysis-extensions; tools: skill | 🟡 0.26 | [10] |
| assertion-quality | Identify self-referential assertions in identity and round-trip tests | 4.0/5 → 3.7/5 🔴 | ✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; test-analysis-extensions; tools: skill | 🟡 0.26 | |
| assertion-quality | Decline request to write new tests from scratch | 4.7/5 → 4.3/5 🔴 | ℹ️ not activated (expected) | 🟡 0.26 | |
| assertion-quality | Polyglot: evaluate shallow assertions in a Jest/TypeScript OrderService suite | 5.0/5 → 5.0/5 | ✅ assertion-quality; tools: skill, glob / ⚠️ NOT ACTIVATED | 🟡 0.26 | [11] |

[1] ⚠️ High run-to-run variance (CV=54%) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=121%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -19.1% due to: judgment, tokens (27850 → 53143), tool calls (3 → 5), time (26.4s → 35.8s)
[3] ⚠️ High run-to-run variance (CV=80%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -3.2% due to: tokens (27243 → 43672)
[4] ⚠️ High run-to-run variance (CV=314%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -4.5% due to: tokens (40312 → 73244)
[5] ⚠️ High run-to-run variance (CV=51%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -6.8% due to: tokens (26412 → 47523), time (18.1s → 29.3s), tool calls (2 → 3)
[6] ⚠️ High run-to-run variance (CV=79%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -27.2% due to: judgment, quality, tokens (28422 → 53310), tool calls (3 → 4)
[7] ⚠️ High run-to-run variance (CV=805%) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=76%) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=363%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -7.8% due to: tokens (35171 → 100467), tool calls (3 → 6), time (26.6s → 56.6s)
[10] ⚠️ High run-to-run variance (CV=76%) — consider re-running with --runs 5
[11] ⚠️ High run-to-run variance (CV=178%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -1.4% due to: tokens (27471 → 37856)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 786 in dotnet/skills, download eval artifacts with gh run download 27745619190 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/e08f24fd2836fd24d63edcf9bcd4c4480271765d/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

The mixed-severity and flakiness scenarios consistently failed to activate
test-anti-patterns in plugin runs (detected=[] — the agent loaded no skill
at all and answered directly). Both prompts enumerated the full anti-pattern
catalog inline, acting as an answer key that made the agent self-sufficient.
Replace the embedded checklists with realistic user asks while keeping the
'for .NET test anti-patterns' trigger, file references, severity-ranked
output format, and read-only constraint. Rubric and output_matches
assertions are unchanged — they validate the produced report.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Evangelink

Member Author

/evaluate

github-actions Bot added a commit that referenced this pull request Jun 18, 2026
@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| coverage-analysis | Project-wide coverage analysis with existing Cobertura data | 3.0/5 → 4.7/5 🟢 | ✅ coverage-analysis; tools: skill, create, bash / ✅ coverage-analysis; tools: skill, create, glob, bash | ✅ 0.11 | |
| coverage-analysis | Run coverage from scratch without existing data | 4.0/5 → 4.7/5 🟢 | ✅ coverage-analysis; tools: skill, glob, read_bash, stop_bash, create / ✅ coverage-analysis; tools: skill, create, glob | ✅ 0.11 | |
| coverage-analysis | Coverage plateau diagnosis | 3.3/5 → 5.0/5 🟢 | ✅ coverage-analysis; tools: skill, bash, read_bash, stop_bash, create, view / ✅ coverage-analysis; tools: skill, bash, create, view | ✅ 0.11 | [1] |
| test-anti-patterns | Detect mixed severity anti-patterns in repository service tests | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.17 | [2] |
| test-anti-patterns | Detect flakiness indicators and test coupling | 5.0/5 → 4.3/5 🔴 | ✅ test-anti-patterns; tools: skill, glob / ⚠️ NOT ACTIVATED | ✅ 0.17 | [3] |
| test-anti-patterns | Detect duplicated tests and magic values | 4.0/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: skill, glob / ✅ test-anti-patterns; test-analysis-extensions; tools: skill | ✅ 0.17 | [4] |
| test-anti-patterns | Recognize well-written tests without inventing false positives | 4.7/5 → 4.7/5 | ✅ test-anti-patterns; tools: skill, glob / ✅ test-anti-patterns; test-analysis-extensions; tools: skill | ✅ 0.17 | [5] |
| test-anti-patterns | Detect coverage-touching pattern across a service facade | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill, glob / ✅ test-anti-patterns; tools: skill | ✅ 0.17 | [6] |
| test-anti-patterns | Detect self-referential assertions in round-trip and identity tests | 4.7/5 → 4.7/5 | ✅ test-anti-patterns; tools: skill / ✅ test-anti-patterns; test-analysis-extensions; tools: skill | ✅ 0.17 | [7] |
| test-anti-patterns | Polyglot: detect anti-patterns in a Python pytest suite | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.17 | [8] |
| assertion-quality | Identify low assertion diversity in equality-dominated test suite | 4.0/5 → 5.0/5 🟢 | ✅ assertion-quality; tools: skill, glob, bash / ✅ assertion-quality; test-analysis-extensions; tools: skill, glob | 🟡 0.24 | |
| assertion-quality | Flag assertion-free tests and trivial-only assertions | 4.0/5 → 4.3/5 🟢 | ✅ assertion-quality; tools: skill, glob, bash / ✅ assertion-quality; test-analysis-extensions; tools: skill | 🟡 0.24 | [9] |
| assertion-quality | Recognize well-diversified assertion usage | 3.3/5 → 5.0/5 🟢 | ✅ assertion-quality; tools: skill, glob, bash, grep / ✅ assertion-quality; test-analysis-extensions; tools: skill | 🟡 0.24 | |
| assertion-quality | Identify self-referential assertions in identity and round-trip tests | 4.7/5 → 3.3/5 🔴 | ✅ assertion-quality; tools: skill, glob, grep, bash / ✅ assertion-quality; test-analysis-extensions; tools: skill | 🟡 0.24 | [10] |
| assertion-quality | Decline request to write new tests from scratch | 4.3/5 → 4.7/5 🟢 | ℹ️ not activated (expected) | 🟡 0.24 | [11] |
| assertion-quality | Polyglot: evaluate shallow assertions in a Jest/TypeScript OrderService suite | 5.0/5 → 5.0/5 | ✅ assertion-quality; tools: skill, glob / ✅ assertion-quality; test-analysis-extensions; tools: skill | 🟡 0.24 | [12] |

[1] ⚠️ High run-to-run variance (CV=76%) — consider re-running with --runs 5
[2] (Plugin) Quality unchanged but weighted score is -2.8% due to: tokens (27827 → 38217), quality
[3] ⚠️ High run-to-run variance (CV=70%) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=208%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -17.8% due to: judgment, quality, tokens (40564 → 58014)
[5] ⚠️ High run-to-run variance (CV=84%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (26596 → 92250), tool calls (2 → 5), time (21.5s → 45.6s)
[6] (Isolated) Quality unchanged but weighted score is -28.3% due to: judgment, quality, tokens (28920 → 54074), tool calls (3 → 5), time (39.0s → 49.6s)
[7] (Isolated) Quality unchanged but weighted score is -6.3% due to: tokens (42529 → 75490), time (42.6s → 76.0s), tool calls (5 → 6)
[8] (Plugin) Quality unchanged but weighted score is -1.9% due to: tokens (28123 → 38527)
[9] ⚠️ High run-to-run variance (CV=299%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -2.4% due to: tokens (35106 → 68790), tool calls (3 → 7), time (25.3s → 47.8s)
[10] ⚠️ High run-to-run variance (CV=68%) — consider re-running with --runs 5
[11] ⚠️ High run-to-run variance (CV=189%) — consider re-running with --runs 5
[12] ⚠️ High run-to-run variance (CV=167%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -6.3% due to: tokens (27510 → 64569), time (24.7s → 39.8s), tool calls (3 → 4)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 786 in dotnet/skills, download eval artifacts with gh run download 27756338562 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/3b807d694060096249acb729b85b98a2309cf886/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

@Evangelink Evangelink enabled auto-merge (squash) June 18, 2026 12:47
@Evangelink Evangelink merged commit 14d727f into main Jun 18, 2026
35 of 37 checks passed
@Evangelink Evangelink deleted the fix/test-anti-patterns-activation branch June 18, 2026 14:13

Labels

waiting-on-review PR state label

3 participants