Cover MSTESTxxxx analyzer diagnostics in writing-mstest-tests skill#794

Open
Evangelink wants to merge 6 commits into main from improve/mstest-analyzer-rules

Conversation

@Evangelink

Member

What

Extends the existing writing-mstest-tests skill to cover the MSTest analyzer rules (MSTESTxxxx) instead of adding a new skill per rule.

Why

There are 63 MSTESTxxxx rules. They are Roslyn analyzers that already self-surface during build and in the IDE (with messages and, in most cases, automated code fixes). What an agent needs is the idiomatic fix plus the rationale, which is content, not 63 separate, overlapping, activation-gated skills that would cannibalize each other's activation and require a web of "DO NOT USE" redirects. The existing skill already teaches the correct patterns these analyzers enforce, so this consolidates the remaining gaps in one place.

Changes

  • New "Step 8: Fix MSTest analyzer diagnostics (MSTESTxxxx)" section with a rule → problem → fix table covering the high-value rules not previously called out (MSTEST0023, 0025, 0032, 0038, 0044, 0052, 0024, 0036, 0061, the 0042/0060 duplicates, the 0002–0014 layout family), cross-linked to existing steps for the rules already covered (0006/0017/0037/0039/0046/0045-0049-0054).
  • MSTestAnalysisMode (None/Default/Recommended/All) guidance and the opt-in rules note.
  • Link to the official MSTest code analysis overview.
  • Added a USE FOR keyword (fix MSTEST analyzer diagnostics (MSTESTxxxx rules)) and a When-to-Use trigger; trimmed the verbose assertion-API list to keep the description at 1000 chars (under the 1024 cap).
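
For readers unfamiliar with the MSTestAnalysisMode setting mentioned above, here is a minimal sketch of how it is typically set as an MSBuild property in a test project file. The choice of Recommended is only illustrative, and which rules each mode enables (and at what severity) depends on the MSTest version in use, so treat the official code analysis overview as the source of truth.

```xml
<!-- Fragment of a test project's .csproj (illustrative, not taken from this PR) -->
<PropertyGroup>
  <!-- Accepted values per the PR description: None, Default, Recommended, All -->
  <MSTestAnalysisMode>Recommended</MSTestAnalysisMode>
</PropertyGroup>
```

Individual rules can still be tuned separately through the usual analyzer severity configuration, which is one reason a single mode switch plus a rule table is easier to maintain than per-rule guidance.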

Validation

skill-validator check --plugin ./plugins/dotnet-test → all checks pass (27 skills, 11 agents). Only a soft "approaching comprehensive token range" warning on the skill size.

Add a 'Fix MSTest analyzer diagnostics' workflow step mapping the common MSTESTxxxx rules to their idiomatic fixes, plus MSTestAnalysisMode guidance, instead of creating one skill per rule.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 19, 2026 14:14
@github-actions

github-actions Bot commented Jun 19, 2026

Contributor

Skill Coverage Report

| Plugin | Skill | Covered | Coverage |
|---|---|---|---|
| dotnet-test | code-testing-agent | 5/5 | 100% |
| dotnet-test | writing-mstest-tests | 39/45 | 86.7% |

Uncovered: dotnet-test/writing-mstest-tests
  • [CodePattern] Assert.IsNotEmpty (line 178)
  • [CodePattern] Assert.AreSame (line 154)
  • [CodePattern] Assert.IsEmpty (line 178)
  • [CodePattern] Assert.DoesNotContain (line 178)
  • [CodePattern] Assert.Contains (line 178)
  • [CodePattern] Assert.IsNull (line 154)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Contributor

Pull request overview

Extends the writing-mstest-tests skill content to explicitly cover fixing MSTest analyzer diagnostics (MSTESTxxxx) within the existing workflow, instead of creating many separate per-rule skills.

Changes:

  • Updates the skill description/triggering text to include fixing MSTESTxxxx analyzer diagnostics.
  • Adds a new “Step 8” section with a rule → problem → fix table and guidance on MSTestAnalysisMode.
Show a summary per file
| File | Description |
|---|---|
| plugins/dotnet-test/skills/writing-mstest-tests/SKILL.md | Adds MSTest analyzer diagnostics guidance (new Step 8) and updates description/triggers to cover MSTESTxxxx rules. |

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 2

Comment thread plugins/dotnet-test/skills/writing-mstest-tests/SKILL.md Outdated
Comment thread plugins/dotnet-test/skills/writing-mstest-tests/SKILL.md Outdated
…rammar

- Don't tie MSTest.Analyzers availability to TestFramework 3.7; note metapackage/SDK/explicit reference.
- Fix grammatically broken fix text for the MSTEST0002-0014 layout row.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 19, 2026 14:17

Copilot AI left a comment

Contributor

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 1

Comment thread plugins/dotnet-test/skills/writing-mstest-tests/SKILL.md Outdated
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Evangelink Evangelink enabled auto-merge (squash) June 19, 2026 15:14
@Evangelink

Member Author

/evaluate

@github-actions github-actions Bot added the waiting-on-review label (PR state label) Jun 19, 2026
@github-actions

Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
writing-mstest-tests Write unit tests for a service class 4.3/5 → 4.3/5 ✅ writing-mstest-tests; tools: skill, glob / ✅ writing-mstest-tests; tools: skill 🟡 0.33 [1]
writing-mstest-tests Write data-driven tests for a calculator 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: report_intent, skill, glob, view / ⚠️ NOT ACTIVATED 🟡 0.33 [2]
writing-mstest-tests Write async tests with cancellation 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.33
writing-mstest-tests Fix swapped Assert.AreEqual arguments 4.7/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.33 [3]
writing-mstest-tests Modernize legacy test patterns 4.3/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.33 [4]
writing-mstest-tests Replace ExpectedException with Assert.Throws 3.0/5 → 3.7/5 🟢 ✅ writing-mstest-tests; tools: skill, report_intent / ⚠️ NOT ACTIVATED 🟡 0.33 [5]
writing-mstest-tests Use proper collection assertions 3.0/5 → 2.0/5 🔴 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.33 [6]
writing-mstest-tests Use proper type assertions instead of casts 4.0/5 → 4.7/5 🟢 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.33 [7]
writing-mstest-tests Set up test lifecycle correctly 2.0/5 → 4.0/5 🟢 ✅ writing-mstest-tests; tools: skill, report_intent / ⚠️ NOT ACTIVATED 🟡 0.33 [8]
writing-mstest-tests Use DynamicData with ValueTuples over object arrays 3.0/5 → 3.0/5 ✅ writing-mstest-tests; tools: skill, report_intent / ⚠️ NOT ACTIVATED 🟡 0.33 [9]
writing-mstest-tests Use string assertions for format validation 3.7/5 → 4.0/5 ⏰ 🟢 ✅ writing-mstest-tests; tools: skill, edit, view, bash / ⚠️ NOT ACTIVATED 🟡 0.33 [10]
writing-mstest-tests Use comparison assertions for boundary testing 2.3/5 → 3.7/5 🟢 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.33 [11]
writing-mstest-tests Write tests with collection, null, and reference assertions 4.0/5 → 4.7/5 🟢 ✅ writing-mstest-tests; tools: glob, skill / ⚠️ NOT ACTIVATED 🟡 0.33 [12]
writing-mstest-tests Configure conditional execution, retry, and cleanup 3.0/5 → 4.3/5 🟢 ✅ writing-mstest-tests; tools: skill, report_intent / ⚠️ NOT ACTIVATED 🟡 0.33 [13]
writing-mstest-tests Configure test parallelization and MSTest.Sdk project 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill 🟡 0.33 [14]

[1] ⚠️ High run-to-run variance (CV=349%) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=68%) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=2394%) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=142%) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=110%) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=68%) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=103%) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=120%) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=60%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.1% due to: tokens (12819 → 18051)
[10] ⚠️ High run-to-run variance (CV=232%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -2.9% due to: tokens (116180 → 430002), tool calls (8 → 23), time (69.0s → 151.8s)
[11] ⚠️ High run-to-run variance (CV=9923%) — consider re-running with --runs 5
[12] ⚠️ High run-to-run variance (CV=100%) — consider re-running with --runs 5
[13] ⚠️ High run-to-run variance (CV=228%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -16.6% due to: judgment, quality, tokens (13228 → 18182)
[14] ⚠️ High run-to-run variance (CV=54%) — consider re-running with --runs 5
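
The very large CV values in the footnotes above are easier to interpret with a small worked example. The sketch below assumes CV means the usual coefficient of variation (standard deviation divided by the absolute mean, as a percentage); the report does not say exactly which per-run metric the validator feeds into it, so the sample numbers and the function name `coefficient_of_variation` are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(samples):
    """Return population stddev / |mean| as a percentage."""
    m = mean(samples)
    if m == 0:
        return float("inf")  # CV is undefined when the mean is exactly zero
    return pstdev(samples) / abs(m) * 100.0


# Hypothetical per-run score improvements (not taken from the eval artifacts).
stable_runs = [0.50, 0.55, 0.45]  # a consistent gain in every run
noisy_runs = [0.40, -0.35, 0.01]  # gains and losses that nearly cancel out

print(f"stable: CV = {coefficient_of_variation(stable_runs):.0f}%")
print(f"noisy:  CV = {coefficient_of_variation(noisy_runs):.0f}%")
```

Under this assumption, a CV in the thousands of percent says less about huge swings than about a mean close to zero: when the runs roughly cancel out, even modest spread is divided by a tiny number. That is consistent with the report's advice to re-run with more runs before reading much into a single row.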

⏰ timeout: run(s) hit the 180s scenario timeout limit; scoring may be impacted because model execution was aborted before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 794 in dotnet/skills, download eval artifacts with gh run download 27833908923 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/15cfbbaf67a3a47e83a2b519f39d1949b4a82468/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

@github-actions

Contributor

✅ Evaluation passed for 15cfbba. cc @dotnet/dotnet-testing — please review.

…om code-testing-agent

In plugin runs, code-testing-agent (generic 'write/comprehensive unit tests') was stealing activation from writing-mstest-tests for MSTest-specific prompts. Broaden code-testing-agent's DO NOT USE carve-out to defer writing/fixing/modernizing MSTest-specific tests, assertions, attributes, and lifecycle to writing-mstest-tests, and have writing-mstest-tests claim 'comprehensive MSTest unit tests'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 19, 2026 15:56

Copilot AI left a comment

Contributor

Copilot's findings

  • Files reviewed: 2/2 changed files
  • Comments generated: 0 new

@Evangelink

Member Author

/evaluate

github-actions Bot added a commit that referenced this pull request Jun 19, 2026
@github-actions

Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
writing-mstest-tests Write unit tests for a service class 4.0/5 → 4.3/5 🟢 ✅ writing-mstest-tests; tools: skill, glob 🟡 0.29 [1]
writing-mstest-tests Write data-driven tests for a calculator 3.7/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: report_intent, skill, glob, view / ✅ writing-mstest-tests; tools: report_intent, view, skill, create, bash, edit 🟡 0.29 [2]
writing-mstest-tests Write async tests with cancellation 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.29 [3]
writing-mstest-tests Fix swapped Assert.AreEqual arguments 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED 🟡 0.29 [4]
writing-mstest-tests Modernize legacy test patterns 4.3/5 → 4.7/5 🟢 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.29 [5]
writing-mstest-tests Replace ExpectedException with Assert.Throws 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: report_intent, skill / ⚠️ NOT ACTIVATED 🟡 0.29
writing-mstest-tests Use proper collection assertions 3.0/5 → 2.0/5 🔴 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.29 [6]
writing-mstest-tests Use proper type assertions instead of casts 4.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill, report_intent / ⚠️ NOT ACTIVATED 🟡 0.29 [7]
writing-mstest-tests Set up test lifecycle correctly 2.0/5 → 4.0/5 🟢 ✅ writing-mstest-tests; tools: skill, report_intent / ✅ writing-mstest-tests; tools: report_intent, skill 🟡 0.29 [8]
writing-mstest-tests Use DynamicData with ValueTuples over object arrays 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: report_intent, skill / ⚠️ NOT ACTIVATED 🟡 0.29 [9]
writing-mstest-tests Use string assertions for format validation 3.7/5 → 4.0/5 ⏰ 🟢 ✅ writing-mstest-tests; tools: skill, view, edit / ⚠️ NOT ACTIVATED 🟡 0.29 [10]
writing-mstest-tests Use comparison assertions for boundary testing 2.3/5 → 3.7/5 🟢 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.29 [11]
writing-mstest-tests Write tests with collection, null, and reference assertions 4.0/5 → 4.0/5 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.29 [12]
writing-mstest-tests Configure conditional execution, retry, and cleanup 2.7/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: report_intent, skill / ✅ writing-mstest-tests; tools: skill 🟡 0.29 [13]
writing-mstest-tests Configure test parallelization and MSTest.Sdk project 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill 🟡 0.29
code-testing-agent Generate tests for ContosoUniversity ASP.NET Core MVC app 3.3/5 → 3.0/5 🔴 ✅ code-testing-agent; tools: skill / ✅ code-testing-extensions; code-testing-agent; tools: task, skill, read_agent, grep ✅ 0.18 [14]
code-testing-agent Generate pytest tests for the Flask tasks API (Python polyglot) 4.0/5 → 4.3/5 🟢 ✅ code-testing-agent; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.18 [15]
code-testing-agent Generate Vitest tests for the shopping-cart library (TypeScript polyglot) 4.7/5 → 4.7/5 ✅ code-testing-agent; tools: skill ✅ 0.18 [16]
code-testing-agent Does not revert a gutted-looking workspace (workspace integrity) 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED ✅ 0.18 [17]

[1] ⚠️ High run-to-run variance (CV=2057%) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=113%) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=203%) — consider re-running with --runs 5
[4] (Plugin) Quality unchanged but weighted score is -2.3% due to: tokens (12757 → 17995)
[5] ⚠️ High run-to-run variance (CV=217%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -15.3% due to: judgment, quality
[6] ⚠️ High run-to-run variance (CV=82%) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=618%) — consider re-running with --runs 5. (Plugin) Quality dropped but weighted score is +3.4% due to: tool calls (1 → 0), tokens (21035 → 17873)
[8] ⚠️ High run-to-run variance (CV=74%) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=105%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -13.8% due to: judgment, tokens (12916 → 18055)
[10] ⚠️ High run-to-run variance (CV=739%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -27.7% due to: judgment, quality, tokens (97530 → 311183), tool calls (6 → 19), time (61.3s → 104.8s)
[11] ⚠️ High run-to-run variance (CV=131%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -13.7% due to: judgment, tokens (13590 → 18635)
[12] ⚠️ High run-to-run variance (CV=52%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.5% due to: tokens (186872 → 265498)
[13] ⚠️ High run-to-run variance (CV=157%) — consider re-running with --runs 5
[14] ⚠️ High run-to-run variance (CV=61%) — consider re-running with --runs 5
[15] ⚠️ High run-to-run variance (CV=1596%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -16.0% due to: judgment, tokens (184221 → 259184), quality
[16] ⚠️ High run-to-run variance (CV=145%) — consider re-running with --runs 5
[17] ⚠️ High run-to-run variance (CV=82%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -1.3% due to: tokens (101101 → 130138)

⏰ timeout: run(s) hit the 180s scenario timeout limit; scoring may be impacted because model execution was aborted before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 794 in dotnet/skills, download eval artifacts with gh run download 27836324404 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/c263a61f795a2aabc2788e97629d5eb54350b824/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

…ugin activation

Post-fix eval (run 27836324404) showed code-testing-agent no longer steals (no sibling fires), but string/comparison/reference-assertion scenarios still don't activate in the plugin run — their trigger keywords (StartsWith, EndsWith, MatchesRegex, IsGreaterThan, IsLessThan, IsInRange, AreSame) had been trimmed for budget. Rebuild the description to restore all eval-relevant assertion APIs and lead with write/create/modernize/fix, staying at 1013 chars.
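
The budget trade-off described in this commit (keep every eval-relevant keyword while staying under the description cap) can be guarded with a quick local check. The following is a minimal sketch, not the skill-validator's own logic: the helper name `check_description` is hypothetical, the 1024 cap is the figure quoted in the PR description, and counting plain characters with `len` is an assumption about how the cap is measured.

```python
MAX_DESCRIPTION_CHARS = 1024  # cap quoted in the PR description

# Assertion API names this commit says had been trimmed and were restored.
REQUIRED_KEYWORDS = [
    "StartsWith", "EndsWith", "MatchesRegex",
    "IsGreaterThan", "IsLessThan", "IsInRange", "AreSame",
]


def check_description(description, required=REQUIRED_KEYWORDS,
                      cap=MAX_DESCRIPTION_CHARS):
    """Return a list of problems; an empty list means the description passes."""
    problems = []
    if len(description) > cap:
        problems.append(f"too long: {len(description)} > {cap} chars")
    missing = [kw for kw in required if kw not in description]
    if missing:
        problems.append("missing keywords: " + ", ".join(missing))
    return problems


# Hypothetical description text, only for demonstration.
sample = ("Write, create, modernize, or fix MSTest tests. USE FOR: StartsWith, "
          "EndsWith, MatchesRegex, IsGreaterThan, IsLessThan, IsInRange, AreSame.")
print(check_description(sample) or "description OK")
```

Running a check like this before `/evaluate` would catch a keyword dropped for budget reasons earlier than a full plugin eval run does.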

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Evangelink

Member Author

/evaluate

github-actions Bot added a commit that referenced this pull request Jun 19, 2026
@github-actions

Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
writing-mstest-tests Write unit tests for a service class 4.3/5 → 4.0/5 🔴 ✅ writing-mstest-tests; tools: skill, glob 🟡 0.34
writing-mstest-tests Write data-driven tests for a calculator 3.3/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill, report_intent, glob / ✅ writing-mstest-tests; tools: skill, report_intent, view, create, bash, edit 🟡 0.34 [1]
writing-mstest-tests Write async tests with cancellation 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.34
writing-mstest-tests Fix swapped Assert.AreEqual arguments 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED 🟡 0.34 [2]
writing-mstest-tests Modernize legacy test patterns 4.3/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.34 [3]
writing-mstest-tests Replace ExpectedException with Assert.Throws 3.0/5 → 4.7/5 🟢 ✅ writing-mstest-tests; tools: report_intent, skill / ⚠️ NOT ACTIVATED 🟡 0.34 [4]
writing-mstest-tests Use proper collection assertions 3.3/5 → 2.7/5 🔴 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.34 [5]
writing-mstest-tests Use proper type assertions instead of casts 4.0/5 → 4.3/5 🟢 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.34 [6]
writing-mstest-tests Set up test lifecycle correctly 2.0/5 → 4.0/5 🟢 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.34
writing-mstest-tests Use DynamicData with ValueTuples over object arrays 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.34 [7]
writing-mstest-tests Use string assertions for format validation 4.0/5 → 4.7/5 🟢 ✅ writing-mstest-tests; tools: skill, bash, edit, view / ⚠️ NOT ACTIVATED 🟡 0.34 [8]
writing-mstest-tests Use comparison assertions for boundary testing 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.34 [9]
writing-mstest-tests Write tests with collection, null, and reference assertions 4.0/5 → 4.7/5 🟢 ✅ writing-mstest-tests; tools: skill, glob / ⚠️ NOT ACTIVATED 🟡 0.34 [10]
writing-mstest-tests Configure conditional execution, retry, and cleanup 2.7/5 → 4.3/5 🟢 ✅ writing-mstest-tests; tools: report_intent, skill / ⚠️ NOT ACTIVATED 🟡 0.34 [11]
writing-mstest-tests Configure test parallelization and MSTest.Sdk project 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill 🟡 0.34
code-testing-agent Generate tests for ContosoUniversity ASP.NET Core MVC app 3.3/5 → 3.0/5 🔴 ✅ code-testing-agent; tools: skill, task, glob, read_agent, grep / ✅ code-testing-agent; code-testing-extensions; tools: skill, task, read_agent, glob 🟡 0.21
code-testing-agent Generate pytest tests for the Flask tasks API (Python polyglot) 4.3/5 → 4.0/5 🔴 ✅ code-testing-agent; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.21 [12]
code-testing-agent Generate Vitest tests for the shopping-cart library (TypeScript polyglot) 5.0/5 → 4.3/5 🔴 ✅ code-testing-agent; tools: skill / ✅ code-testing-agent; tools: skill, edit 🟡 0.21 [13]
code-testing-agent Does not revert a gutted-looking workspace (workspace integrity) 5.0/5 → 5.0/5 ✅ code-testing-agent; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.21 [14]

[1] ⚠️ High run-to-run variance (CV=119%) — consider re-running with --runs 5
[2] (Plugin) Quality unchanged but weighted score is -2.5% due to: tokens (12766 → 18009)
[3] ⚠️ High run-to-run variance (CV=88%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -0.3% due to: quality
[4] ⚠️ High run-to-run variance (CV=154%) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=112%) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=56%) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=617%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -3.0% due to: tokens (12804 → 18055), time (6.9s → 9.4s)
[8] ⚠️ High run-to-run variance (CV=424%) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=67814%) — consider re-running with --runs 5
[10] ⚠️ High run-to-run variance (CV=261%) — consider re-running with --runs 5
[11] ⚠️ High run-to-run variance (CV=111%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -1.6% due to: tokens (13263 → 18271)
[12] ⚠️ High run-to-run variance (CV=113%) — consider re-running with --runs 5
[13] ⚠️ High run-to-run variance (CV=54%) — consider re-running with --runs 5
[14] ⚠️ High run-to-run variance (CV=53%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.0% due to: tokens (85464 → 117141)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 794 in dotnet/skills, download eval artifacts with gh run download 27837850135 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/7e553b51f9796b219537fef735c7a6b5bef4b257/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions


Labels

waiting-on-review PR state label

2 participants