ci: add beval behavioral evaluation workflow for dt-coach agent #1129
eedorenko wants to merge 45 commits into microsoft:main
Add 30 test cases across 4 categories (coaching behaviors, session phases, method guidance, progressive hints) with ACP judge integration. Include reusable CI workflow and PR validation hook with fork guard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removed port specification from agent startup command.
Add prompt to copilot agent startup command.
Added working-directory to Start agent step in beval.yml
Switch to init_prompt to reliably activate the dt-coach agent in ACP sessions. Remove --agent flag from copilot TCP start, add port-readiness polling. Add agent identity verification case. Copy dt-coach.agent.md to .github/agents/ for flat discovery. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
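The port-readiness polling mentioned in the commit above can be sketched as follows. This is only an illustration in Python, not the workflow's actual step: the function name `wait_for_port`, its parameters, and the default timeout are assumptions made for this example.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper: the real workflow may poll differently (for example in shell).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the server is accepting connections.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False
```

Polling for an accepted connection, rather than sleeping for a fixed time, lets the evaluation start as soon as the agent is listening and fail quickly if it never comes up.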
Pin actions/checkout, actions/setup-python, and actions/upload-artifact to SHA hashes to satisfy hve-core dependency pinning policy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes "Directory path must be absolute: ." error from copilot agent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add model to agent.yaml and eval.config.yaml connection config so it is applied via set_session_model. Remove --model from workflow CLI args. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Remove branch pin from beval pip install so it uses the default branch of the vyta/beval repo instead of eedorenko/skill-agent. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Add beval, wireframes, parseable to cspell dictionary
- Ignore beval/results/** from spell check (generated output)
- Add top-level and job-level permissions blocks to test-token.yml
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Add behavioral evaluation job to release-stable.yml
- Remove test-token.yml debug workflow
- Remove dt-coach.agent.md (not part of this contribution)
- Remove beval/results/ (generated output, not for source control)
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Run npm audit fix to update flatted to a non-vulnerable version. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
WilliamBerryiii
left a comment
Thank you for this PR, @eedorenko. Behavioral evaluation for the dt-coach agent is a valuable addition, and we appreciate the effort to formalize agent quality testing with structured evaluation cases.
After reviewing the workflow changes against our CI security standards, we've identified several issues that need to be resolved before this can merge. The findings fall into two categories: supply-chain security violations in the beval workflow, and architectural concerns with integrating it into PR validation and release pipelines.
Important
The combination of unpinned dependencies from an external personal repository, unpinned npm range installs, inherited secrets, and persisted credentials creates a compound risk. A compromise of any one dependency effectively grants access to all repository secrets and the CI execution context.
We've added inline comments on each affected file with specific context and suggested changes. The critical items are:
- pip install from vyta/beval with no commit SHA and no hash verification (see comment on beval.yml line 32)
- npm install -g @github/copilot@1 with a major-version range and no lockfile (see comment on beval.yml line 29)
- actions/checkout without persist-credentials: false (see comment on beval.yml line 21)
- Both copilot instances launch with --allow-all, granting unrestricted permissions (see comment on beval.yml line 36)
- secrets: inherit in both calling workflows forwards all repository secrets when only COPILOT_TOKEN is needed
- Behavioral evaluation should not gate PR merges or releases at this stage (see comments on pr-validation.yml and release-stable.yml)
Our repository enforces these standards through Test-DependencyPinning.ps1, Test-WorkflowPermissions.ps1, and the conventions documented in workflows.instructions.md. The copilot-setup-steps.yml workflow demonstrates the expected pattern for downloading and verifying external binaries.
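To make the pinning requirement concrete, here is a minimal sketch of the kind of check a dependency-pinning test might perform. It is not the logic of Test-DependencyPinning.ps1, whose implementation is not shown here. The function name `find_unpinned_actions`, the regular expressions, and the sample workflow text (including its placeholder SHA) are all assumptions for illustration.

```python
import re

# Matches "uses: owner/repo@ref" (optionally with a sub-path) and captures action and ref.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s@#]+)@([^\s#]+)", re.MULTILINE)
# A full commit SHA is exactly 40 lowercase hexadecimal characters.
FULL_SHA_RE = re.compile(r"[0-9a-f]{40}")


def find_unpinned_actions(workflow_text: str) -> list[str]:
    """Return every 'action@ref' whose ref is not a full 40-character commit SHA."""
    unpinned = []
    for action, ref in USES_RE.findall(workflow_text):
        if not FULL_SHA_RE.fullmatch(ref):
            unpinned.append(f"{action}@{ref}")
    return unpinned


# Hypothetical workflow fragment; the 40-character SHA is a made-up placeholder.
SAMPLE = """
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@0123456789abcdef0123456789abcdef01234567
"""

print(find_unpinned_actions(SAMPLE))
```

A tag such as `v4` can be moved to point at different code later, while a full commit SHA cannot, which is why only the second entry in the sample passes.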
We recommend deploying beval as a standalone workflow_dispatch or scheduled workflow instead of integrating it into pr-validation.yml and release-stable.yml. This allows behavioral testing to proceed without gating contributor workflows or release processes.
Please comment if you have questions about any of the suggestions, and we can discuss further.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #1129 +/- ##
==========================================
- Coverage 86.90% 86.89% -0.02%
==========================================
Files 59 59
Lines 8774 8774
==========================================
- Hits 7625 7624 -1
- Misses 1149 1150 +1
- Add beval/package.json and package-lock.json pinning @github/copilot to 1.0.9 with SRI hashes for integrity verification
- Replace npm install -g with npm ci --prefix beval and add beval/node_modules/.bin to PATH
- Add persist-credentials: false to checkout step
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
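For readers unfamiliar with the SRI hashes mentioned above: a lockfile integrity value has the form `sha512-<base64 digest>`, and verification means recomputing that string from the downloaded bytes. The sketch below illustrates the idea in Python. The function names are hypothetical and this is not how npm itself is implemented.

```python
import base64
import hashlib
import hmac


def sri_sha512(data: bytes) -> str:
    """Compute an SRI-style integrity string ('sha512-' + base64 of the digest)."""
    digest = hashlib.sha512(data).digest()
    return "sha512-" + base64.b64encode(digest).decode("ascii")


def verify_integrity(data: bytes, expected: str) -> bool:
    """Return True only if the data hashes to the expected integrity string."""
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(sri_sha512(data), expected)


# Illustrative bytes standing in for a downloaded package tarball.
package_bytes = b"example tarball contents"
recorded = sri_sha512(package_bytes)

print(verify_integrity(package_bytes, recorded))
print(verify_integrity(b"tampered contents", recorded))
```

Because the expected value is recorded in the lockfile, a package whose contents change after the lockfile was written fails verification even if its version number stays the same.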
Pin vyta/beval to a9ab930ade3db13855b26b34b268327da9c881bc instead of HEAD to ensure reproducible installs with integrity verification. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Replace --allow-all on both agent and judge ACP instances with explicit --deny-tool flags scoped to what each role requires:
- Agent: denies shell and web; only needs to read instruction files and respond to text prompts during evaluation
- Judge: denies shell and web; only needs LLM inference to score responses, no tool access required
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Replace secrets: inherit with explicit COPILOT_TOKEN forwarding in both pr-validation.yml and release-stable.yml, and declare the secret in beval.yml's workflow_call trigger. This limits secret exposure to only what beval requires rather than forwarding all caller secrets. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Update to vyta/beval@4f363b7 which adds request_permission() to _ACPJudgeClient, fixing "Method not found" ACP errors when Copilot CLI is started with --deny-tool flags. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Remove --allow-all; beval's request_permission() callback handles tool permission requests automatically, so no blanket flag is needed. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Truly love this addition, which brings non-deterministic evaluations to DT coach. This is an area we wanted to explore, thanks for adding this! Given the very experimental status of
Add continue-on-error: true to beval in both pr-validation.yml and release-stable.yml, and remove beval from release-please's needs list. Behavioral evaluation runs on every PR and release for observability but does not block merges or releases due to its non-deterministic nature. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Move COPILOT_GITHUB_TOKEN from job-level env to step-level env on the Start agent and Start judge steps. Checkout, Python setup, dependency install, and results upload steps no longer have access to the token. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Per reviewer feedback, beval is experimental with APIs subject to change and an as-yet-unsecured dependency repo. Remove it from pr-validation.yml and release-stable.yml entirely; beval.yml remains available for manual workflow_dispatch runs. Also update beval SHA pin to b92c200 which adds unit tests for the request_permission fix. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Move agent.yaml, eval.config.yaml, and cases/ into beval/dt-coach/ so each agent has its own isolated directory. Adding a new agent means adding a new subdirectory with no structural changes needed. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Resolve conflict in .cspell.json by keeping both "atheris" (upstream) and "beval" (branch). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Previous pin used the wrong full SHA — b92c200f... instead of b92c200d... Both share the first 7 chars, causing pip to fail with "not our ref". Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
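The failure described in the commit above comes from two different 40-character SHAs sharing the same 7-character abbreviation. A small sketch of a guard against this is shown below. The helper names are hypothetical, and the two SHA strings are made-up placeholders that reuse only the leading characters quoted in the commit message, padded with zeros.

```python
import re

FULL_SHA_RE = re.compile(r"[0-9a-f]{40}")


def is_full_sha(ref: str) -> bool:
    """True if ref is a complete 40-character lowercase hex commit SHA."""
    return FULL_SHA_RE.fullmatch(ref) is not None


def shares_short_prefix(a: str, b: str, length: int = 7) -> bool:
    """True if both refs abbreviate to the same short SHA."""
    return a[:length] == b[:length]


def same_commit(pinned: str, actual: str) -> bool:
    """A pin is correct only if both are full SHAs and match exactly."""
    return is_full_sha(pinned) and is_full_sha(actual) and pinned == actual


# Placeholders: only the first 8 characters come from the commit message above.
WRONG_PIN = "b92c200f" + "0" * 32
RIGHT_SHA = "b92c200d" + "0" * 32

print(shares_short_prefix(WRONG_PIN, RIGHT_SHA))  # same 7-character abbreviation
print(same_commit(WRONG_PIN, RIGHT_SHA))          # but not the same commit
```

Comparing the full SHA against the upstream repository before committing a pin catches this class of mistake, since the abbreviated form alone cannot distinguish the two values.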
The agent was responding with initialization questions ("Project slug.
Your role.") instead of Method 8 guidance because the query lacked
enough context. Add role, project name, and explicit method reference
so the agent can skip initialization and respond substantively.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@github/copilot and its platform-specific packages use a non-SPDX proprietary license (LicenseRef-bad-see-license-in-license.md) that falls outside the repo's allowed license list. These are GitHub's own CLI toolchain, deliberately used in beval.yml, so they are added as explicit package-level exceptions rather than broadening the license allowlist. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Remove beval/package.json and beval/package-lock.json. The lockfile caused the dependency review to flag @github/copilot's non-SPDX proprietary license, and allow-packages does not override license checks in the dependency-review-action. Use npm install -g @github/copilot@1.0.9 (exact version pin) instead. Global CLI installs cannot use npm ci as it requires a project-scoped lockfile; exact version pinning is the appropriate alternative. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
I've resolved the dependency pinning issues and license violation in PR #1159. This was related to dependent package licensing, not the GHAS alert above.
Update SHA from branch tip (1f01760, eedorenko/judge-permission-fix) to the merge commit on vyta/beval main (a2effa1), satisfying the reviewer requirement to pin to a commit on the default branch. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Description
Adds a behavioral evaluation (beval) CI workflow for the dt-coach agent using GitHub Copilot CLI over ACP (TCP). The workflow:
beval/cases/
Also pins all GitHub Actions dependencies to SHA hashes for supply chain security, and installs beval from the default branch of the vyta/beval repo.
Sample eval run: https://github.com/eedorenko/hve-core/actions/runs/23311489579/job/67799722616
Related Issue(s)
Type of Change
Code & Documentation:
Infrastructure & Configuration:
AI Artifacts:
prompt-builder agent and addressed all feedback
.github/instructions/*.instructions.md)
.github/prompts/*.prompt.md)
.github/agents/*.agent.md)
.github/skills/*/SKILL.md)
Other:
.ps1, .sh, .py)
Testing
The workflow has been validated by triggering it manually via workflow_dispatch. All 30 evaluation cases passed with an overall score of 0.81.
Checklist
Required Checks
AI Artifact Contributions
/prompt-analyze to review contribution
prompt-builder review
Required Automated Checks
npm run lint:md
npm run spell-check
npm run lint:frontmatter
npm run validate:skills
npm run lint:md-links
npm run lint:ps
npm run plugin:generate
Security Considerations
Additional Notes
All GitHub Actions uses: steps are pinned to SHA hashes per supply chain security best practices.