ci: add beval behavioral evaluation workflow for dt-coach agent by eedorenko · Pull Request #1129 · microsoft/hve-core

eedorenko · 2026-03-19T19:36:09Z

Description

Adds a behavioral evaluation (beval) CI workflow for the dt-coach agent using GitHub Copilot CLI over ACP (TCP). The workflow:

Starts two Copilot CLI instances (agent on port 3000, judge on port 3001)
Runs beval evaluations against cases defined in beval/cases/
Uploads results as a workflow artifact

Also pins all GitHub Actions dependencies to SHA hashes for supply chain security, and installs beval from the default branch of the vyta/beval repo.

Sample eval run: https://github.com/eedorenko/hve-core/actions/runs/23311489579/job/67799722616

======================================================================
  SCORECARD
======================================================================
  Overall: 0.81  (30/30 cases passed)

  Metric                Score  Bar
  -------------------- ------  ------------
  latency               0.86  [#########-]
  quality               0.78  [########--]

  Case                                      Score  Status
  ---------------------------------------- ------  ------
  Response follows Think/Speak/Empower ...  0.66  + PASS
  Keep responses concise — no methodolo...  0.86  + PASS
  End with choices not directives           0.91  + PASS
  Work WITH users, not FOR them             0.95  + PASS
  Do not prescribe specific solutions t...  0.93  + PASS
  Stay curious and supportive when user...  0.81  + PASS
  Method 1: Assess whether request is f...  0.93  + PASS
  Method 1: Guide stakeholder identific...  0.82  + PASS
  Method 2: Help plan systematic research   0.68  + PASS
  Method 3: Guide pattern recognition f...  0.81  + PASS
  Method 4: Facilitate divergent ideation   0.65  + PASS
  Method 5: Guide concept creation for ...  0.65  + PASS
  Method 6: Encourage scrappy constrain...  0.86  + PASS
  Method 7: Guide technical feasibility...  0.65  + PASS
  Method 8: Structure user testing for ...  0.65  + PASS
  Method 9: Guide continuous optimizati...  0.63  + PASS
  Start with broad hints when user is s...  0.73  + PASS
  Escalate hints when user remains stuck    0.90  + PASS
  Accept backward transitions between m...  0.83  + PASS
  Announce method shifts transparently      0.84  + PASS
  Avoid multiple-choice question lists      0.81  + PASS
  Do not change method focus without an...  0.95  + PASS
  Resume session with state context         0.63  + PASS
  Ask for project slug during initializ...  0.94  + PASS
  Gather role, team, and method focus d...  0.92  + PASS
  Default to Method 1 for new projects      0.88  + PASS
  Ask targeted, open-ended questions du...  0.88  + PASS
  Summarize progress and check direction    0.78  + PASS
  Recap accomplishments and confirm met...  0.83  + PASS
  Summarize session and suggest next st...  0.89  + PASS

  Avg time: 38.5s
======================================================================

Related Issue(s)

Type of Change

Code & Documentation:

Bug fix (non-breaking change fixing an issue)
New feature (non-breaking change adding functionality)
Breaking change (fix or feature causing existing functionality to change)
Documentation update

Infrastructure & Configuration:

AI Artifacts:

Reviewed contribution with prompt-builder agent and addressed all feedback
Copilot instructions (.github/instructions/*.instructions.md)
Copilot prompt (.github/prompts/*.prompt.md)
Copilot agent (.github/agents/*.agent.md)
Copilot skill (.github/skills/*/SKILL.md)

Other:

Script/automation (.ps1, .sh, .py)
Other (please describe):

Testing

The workflow has been validated by triggering it manually via workflow_dispatch. All 30 evaluation cases passed with an overall score of 0.81.

Checklist

Required Checks

Documentation is updated (if applicable)
Files follow existing naming conventions
Changes are backwards compatible (if applicable)
Tests added for new functionality (if applicable)

AI Artifact Contributions

Used /prompt-analyze to review contribution
Addressed all feedback from prompt-builder review
Verified contribution follows common standards and type-specific requirements

Required Automated Checks

Markdown linting: npm run lint:md
Spell checking: npm run spell-check
Frontmatter validation: npm run lint:frontmatter
Skill structure validation: npm run validate:skills
Link validation: npm run lint:md-links
PowerShell analysis: npm run lint:ps
Plugin freshness: npm run plugin:generate

Security Considerations

This PR does not contain any sensitive or NDA information
Any new dependencies have been reviewed for security issues
Security-related scripts follow the principle of least privilege

Additional Notes

All GitHub Actions uses: steps are pinned to SHA hashes per supply chain security best practices.

Add 30 test cases across 4 categories (coaching behaviors, session phases, method guidance, progressive hints) with ACP judge integration. Include reusable CI workflow and PR validation hook with fork guard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Removed port specification from agent startup command.

Add prompt to copilot agent startup command.

Added working-directory to Start agent step in beval.yml

Switch to init_prompt to reliably activate the dt-coach agent in ACP sessions. Remove --agent flag from copilot TCP start, add port-readiness polling. Add agent identity verification case. Copy dt-coach.agent.md to .github/agents/ for flat discovery. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Pin actions/checkout, actions/setup-python, and actions/upload-artifact to SHA hashes to satisfy hve-core dependency pinning policy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fixes "Directory path must be absolute: ." error from copilot agent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add model to agent.yaml and eval.config.yaml connection config so it is applied via set_session_model. Remove --model from workflow CLI args. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Remove branch pin from beval pip install so it uses the default branch of the vyta/beval repo instead of eedorenko/skill-agent. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

.github/workflows/test-token.yml

- Add beval, wireframes, parseable to cspell dictionary - Ignore beval/results/** from spell check (generated output) - Add top-level and job-level permissions blocks to test-token.yml Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Add behavioral evaluation job to release-stable.yml - Remove test-token.yml debug workflow - Remove dt-coach.agent.md (not part of this contribution) - Remove beval/results/ (generated output, not for source control) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Run npm audit fix to update flatted to a non-vulnerable version. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

eedorenko · 2026-03-19T21:01:28Z

FYI @vyta @bjcmit

WilliamBerryiii

Thank you for this PR, @eedorenko. Behavioral evaluation for the dt-coach agent is a valuable addition, and we appreciate the effort to formalize agent quality testing with structured evaluation cases.

After reviewing the workflow changes against our CI security standards, we've identified several issues that need to be resolved before this can merge. The findings fall into two categories: supply-chain security violations in the beval workflow, and architectural concerns with integrating it into PR validation and release pipelines.

Important

The combination of unpinned dependencies from an external personal repository, unpinned npm range installs, inherited secrets, and persisted credentials creates a compound risk. A compromise of any one dependency effectively grants access to all repository secrets and the CI execution context.

We've added inline comments on each affected file with specific context and suggested changes. The critical items are:

pip install from vyta/beval with no commit SHA and no hash verification (see comment on beval.yml line 32)
npm install -g @github/copilot@1 with a major-version range and no lockfile (see comment on beval.yml line 29)
actions/checkout without persist-credentials: false (see comment on beval.yml line 21)
Both copilot instances launch with --allow-all, granting unrestricted permissions (see comment on beval.yml line 36)
secrets: inherit in both calling workflows forwards all repository secrets when only COPILOT_TOKEN is needed
Behavioral evaluation should not gate PR merges or releases at this stage (see comments on pr-validation.yml and release-stable.yml)

Our repository enforces these standards through Test-DependencyPinning.ps1, Test-WorkflowPermissions.ps1, and the conventions documented in workflows.instructions.md. The copilot-setup-steps.yml workflow demonstrates the expected pattern for downloading and verifying external binaries.

We recommend deploying beval as a standalone workflow_dispatch or scheduled workflow instead of integrating it into pr-validation.yml and release-stable.yml. This allows behavioral testing to proceed without gating contributor workflows or release processes.

Please comment if you have questions about any of the suggestions, and we can discuss further.

.github/workflows/beval.yml

.github/workflows/pr-validation.yml

.github/workflows/release-stable.yml

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-03-19T23:52:09Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.89%. Comparing base (e69486a) to head (e2b0414).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1129      +/-   ##
==========================================
- Coverage   86.90%   86.89%   -0.02%     
==========================================
  Files          59       59              
  Lines        8774     8774              
==========================================
- Hits         7625     7624       -1     
- Misses       1149     1150       +1

Flag	Coverage Δ
pester	`85.32% <ø> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 1 file with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- Add beval/package.json and package-lock.json pinning @github/copilot to 1.0.9 with SRI hashes for integrity verification - Replace npm install -g with npm ci --prefix beval and add beval/node_modules/.bin to PATH - Add persist-credentials: false to checkout step Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Pin vyta/beval to a9ab930ade3db13855b26b34b268327da9c881bc instead of HEAD to ensure reproducible installs with integrity verification. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Replace --allow-all on both agent and judge ACP instances with explicit --deny-tool flags scoped to what each role requires: - Agent: denies shell and web; only needs to read instruction files and respond to text prompts during evaluation - Judge: denies shell and web; only needs LLM inference to score responses, no tool access required Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Replace secrets: inherit with explicit COPILOT_TOKEN forwarding in both pr-validation.yml and release-stable.yml, and declare the secret in beval.yml's workflow_call trigger. This limits secret exposure to only what beval requires rather than forwarding all caller secrets. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Update to vyta/beval@4f363b7 which adds request_permission() to _ACPJudgeClient, fixing "Method not found" ACP errors when Copilot CLI is started with --deny-tool flags. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Remove --allow-all; beval's request_permission() callback handles tool permission requests automatically, so no blanket flag is needed. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

.github/workflows/beval.yml

katriendg · 2026-03-20T09:18:21Z

Truly love this addition, adding un-deterministic evaluations to DT coach. This is an area we wanted to explore, thanks for adding this!

Given the very experimental status of beval, security hardening of the repo, and APIs subject to change without notice, we probably want to make this very experimental as well, and not have any hard dependencies on this workflow.
Could we see it as a run we can manually trigger at some occasions, not entirely sure what approach would be most valuable to learn from? I fully agree with @WilliamBerryiii that we don't want these un-deterministic/behavioral workflows to be part of PR reviews or even release validations, at least not for a while.

Add continue-on-error: true to beval in both pr-validation.yml and release-stable.yml, and remove beval from release-please's needs list. Behavioral evaluation runs on every PR and release for observability but does not block merges or releases due to its non-deterministic nature. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Move COPILOT_GITHUB_TOKEN from job-level env to step-level env on the Start agent and Start judge steps. Checkout, Python setup, dependency install, and results upload steps no longer have access to the token. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Per reviewer feedback, beval is experimental with APIs subject to change and an as-yet-unsecured dependency repo. Remove it from pr-validation.yml and release-stable.yml entirely; beval.yml remains available for manual workflow_dispatch runs. Also update beval SHA pin to b92c200 which adds unit tests for the request_permission fix. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Move agent.yaml, eval.config.yaml, and cases/ into beval/dt-coach/ so each agent has its own isolated directory. Adding a new agent means adding a new subdirectory with no structural changes needed. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Resolve conflict in .cspell.json by keeping both "atheris" (upstream) and "beval" (branch). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Previous pin used the wrong full SHA — b92c200f... instead of b92c200d... Both share the first 7 chars, causing pip to fail with "not our ref". Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

The agent was responding with initialization questions ("Project slug. Your role.") instead of Method 8 guidance because the query lacked enough context. Add role, project name, and explicit method reference so the agent can skip initialization and respond substantively. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

@github/copilot and its platform-specific packages use a non-SPDX proprietary license (LicenseRef-bad-see-license-in-license.md) that falls outside the repo's allowed license list. These are GitHub's own CLI toolchain, deliberately used in beval.yml, so they are added as explicit package-level exceptions rather than broadening the license allowlist. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Remove beval/package.json and beval/package-lock.json. The lockfile caused the dependency review to flag @github/copilot's non-SPDX proprietary license, and allow-packages does not override license checks in the dependency-review-action. Use npm install -g @github/copilot@1.0.9 (exact version pin) instead. Global CLI installs cannot use npm ci as it requires a project-scoped lockfile; exact version pinning is the appropriate alternative. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

.github/workflows/beval.yml

+          python-version: "3.12"
+
+      - name: Install GitHub Copilot CLI
+        run: npm install -g @github/copilot@1.0.9


WilliamBerryiii · 2026-03-20T23:47:07Z

I've resolved the dependency pinning issues and license violation in the PR #1159 - this was related to dependent package licensing .. not the GHAS alert above.

Update SHA from branch tip (1f01760, eedorenko/judge-permission-fix) to the merge commit on vyta/beval main (a2effa1), satisfying the reviewer requirement to pin to a commit on the default branch. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

eedorenko and others added 18 commits March 16, 2026 13:24

Update copilot command to use claude-opus model

ef56eae

Simplify agent startup command in beval.yml

8faa4ea

Removed port specification from agent startup command.

Modify copilot command to include prompt

50a03dd

Add prompt to copilot agent startup command.

Specify working directory for Start agent step

26fcbe7

Added working-directory to Start agent step in beval.yml

fix: pin GitHub Actions dependencies to SHA hashes

5a7ae11

Pin actions/checkout, actions/setup-python, and actions/upload-artifact to SHA hashes to satisfy hve-core dependency pinning policy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: trigger beval workflow test

ade4c27

ci: add token debug workflow

c708932

ci: add token verification step to beval workflow

01849f7

ci: use claude-opus-4.6-fast model for agent and judge

de9e55e

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: use claude-opus-4.6-1m model and add debug logging

859fa91

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: set AGENT_REPO_ROOT to absolute workspace path

7e5afbe

Fixes "Directory path must be absolute: ." error from copilot agent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: temporarily run only agent_identity case

967c680

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: set model via ACP session instead of CLI flag

3bf5071

Add model to agent.yaml and eval.config.yaml connection config so it is applied via set_session_model. Remove --model from workflow CLI args. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: remove token verification step and run full test suite

b00d3f0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore: remove debug logging and agent_identity test case

4f1a9c2

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

ci: install beval from default branch

fcaf374

Remove branch pin from beval pip install so it uses the default branch of the vyta/beval repo instead of eedorenko/skill-agent. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

eedorenko requested a review from a team as a code owner March 19, 2026 19:36

Merge branch 'main' into eedorenko/beval

d662c71

github-advanced-security bot found potential problems Mar 19, 2026

View reviewed changes

.github/workflows/test-token.yml Fixed Show fixed Hide fixed

github-advanced-security bot found potential problems Mar 19, 2026

View reviewed changes

.github/workflows/test-token.yml Fixed Show fixed Hide fixed

github-advanced-security bot found potential problems Mar 19, 2026

View reviewed changes

.github/workflows/test-token.yml Fixed Show fixed Hide fixed

eedorenko marked this pull request as draft March 19, 2026 20:43

eedorenko and others added 2 commits March 19, 2026 13:45

fix: resolve flatted prototype pollution vulnerability

86b028e

Run npm audit fix to update flatted to a non-vulnerable version. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

eedorenko marked this pull request as ready for review March 19, 2026 21:00

WilliamBerryiii requested changes Mar 19, 2026

View reviewed changes

ci: restore full test suite after agent identity verification

6a61043

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

eedorenko and others added 6 commits March 19, 2026 17:21

ci: pin beval install to specific commit SHA

373b4c6

Pin vyta/beval to a9ab930ade3db13855b26b34b268327da9c881bc instead of HEAD to ensure reproducible installs with integrity verification. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

ci: omit permission flags from Copilot CLI ACP server

c343528

Remove --allow-all; beval's request_permission() callback handles tool permission requests automatically, so no blanket flag is needed. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

katriendg reviewed Mar 20, 2026

View reviewed changes

.github/workflows/beval.yml Outdated Show resolved Hide resolved

katriendg reviewed Mar 20, 2026

View reviewed changes

.github/workflows/beval.yml Show resolved Hide resolved

katriendg reviewed Mar 20, 2026

View reviewed changes

.github/workflows/beval.yml Outdated Show resolved Hide resolved

eedorenko and others added 8 commits March 20, 2026 11:37

chore: merge upstream/main

5ff0c54

Resolve conflict in .cspell.json by keeping both "atheris" (upstream) and "beval" (branch). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

ci: fix beval SHA pin (correct full SHA for b92c200)

5976fba

Previous pin used the wrong full SHA — b92c200f... instead of b92c200d... Both share the first 7 chars, causing pip to fail with "not our ref". Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

ci: update beval SHA pin to 1f01760 (fix import order)

93d6137

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

eedorenko requested a review from WilliamBerryiii March 20, 2026 20:37

eedorenko and others added 4 commits March 20, 2026 13:37

Merge branch 'main' into eedorenko/beval

61dcbdd

Merge branch 'main' into eedorenko/beval

b743d2f

github-advanced-security bot found potential problems Mar 20, 2026

View reviewed changes

eedorenko and others added 3 commits March 23, 2026 10:39

Merge branch 'main' into eedorenko/beval

a430a06

Merge branch 'main' into eedorenko/beval

e2b0414

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: add beval behavioral evaluation workflow for dt-coach agent#1129

ci: add beval behavioral evaluation workflow for dt-coach agent#1129
eedorenko wants to merge 45 commits intomicrosoft:mainfrom
eedorenko:eedorenko/beval

eedorenko commented Mar 19, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

eedorenko commented Mar 19, 2026

Uh oh!

WilliamBerryiii left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Mar 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

katriendg commented Mar 20, 2026

Uh oh!

Check warning

WilliamBerryiii commented Mar 20, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

eedorenko commented Mar 19, 2026

Description

Related Issue(s)

Type of Change

Testing

Checklist

Required Checks

AI Artifact Contributions

Required Automated Checks

Security Considerations

Additional Notes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

eedorenko commented Mar 19, 2026

Uh oh!

WilliamBerryiii left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

katriendg commented Mar 20, 2026

Uh oh!

Check warning

WilliamBerryiii commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Mar 19, 2026 •

edited

Loading

WilliamBerryiii commented Mar 20, 2026 •

edited

Loading