feat(evals): add Custom Token Exchange × Next.js two-persona eval #57
Draft
brth31 wants to merge 23 commits into
Two-leg graders covering SDK integration (auth0.customTokenExchange, CustomTokenExchangeError) and tenant configuration (token exchange profile via CLI or Terraform). Startup persona uses CLI; enterprise persona uses Terraform with a pre-seeded infra/auth0/ workspace. Graders: L1 SDK method + error import, L2 wrong-SDK anti-patterns, L3 credential hygiene, L4 structural + ranCommand/ranCommandOneOf config, L5 code↔config consistency + Terraform contextual fit. Validated against 4 known-broken variants (meta-eval.md).
Bumps [@google/gemini-cli](https://github.com/google-gemini/gemini-cli) from 0.45.2 to 0.46.0. - [Release notes](https://github.com/google-gemini/gemini-cli/releases) - [Changelog](https://github.com/google-gemini/gemini-cli/blob/main/docs/releases.md) - [Commits](google-gemini/gemini-cli@v0.45.2...v0.46.0) --- updated-dependencies: - dependency-name: "@google/gemini-cli" dependency-version: 0.46.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [braintrust](https://github.com/braintrustdata/braintrust-sdk-javascript/tree/HEAD/js) from 3.10.0 to 3.17.0. - [Release notes](https://github.com/braintrustdata/braintrust-sdk-javascript/releases) - [Changelog](https://github.com/braintrustdata/braintrust-sdk-javascript/blob/main/js/CHANGELOG.md) - [Commits](https://github.com/braintrustdata/braintrust-sdk-javascript/commits/braintrust@3.17.0/js) --- updated-dependencies: - dependency-name: braintrust dependency-version: 3.17.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>
…#33) The modelIds map holds Bedrock IDs (global.anthropic.*) needed only by the Claude Code agent runner, which routes through the proxy's /anthropic pass-through endpoint. The LLM judge and the recommendation generator both POST to the LiteLLM /chat/completions endpoint, which serves models under their plain alias. Applying the map there rewrote e.g. claude-opus-4-8 to global.anthropic.claude-opus-4-8, which LiteLLM rejects with a 400. Send the model alias as-is from both paths and drop the now-unused modelMap plumbing from the judge. Add regression tests asserting the alias is sent unchanged even when a Bedrock map is configured.
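A minimal sketch of the routing rule this commit describes. `Endpoint` and `resolveModelId` are hypothetical names used only for illustration, and the map shape is assumed; the commit message does not show the real helpers.

```typescript
// Hypothetical sketch: only the Anthropic pass-through path receives Bedrock IDs.
type Endpoint = "anthropic-passthrough" | "chat-completions";

// Assumed map shape, using the alias/ID pair quoted in the commit message.
const modelIds: Record<string, string> = {
  "claude-opus-4-8": "global.anthropic.claude-opus-4-8",
};

function resolveModelId(alias: string, endpoint: Endpoint): string {
  // LiteLLM /chat/completions serves models under their plain alias, so the
  // judge and the recommendation generator must send the alias unchanged.
  if (endpoint === "chat-completions") return alias;
  // The Claude Code runner goes through /anthropic and needs the Bedrock ID.
  return modelIds[alias] ?? alias;
}
```

Keeping the lookup behind a single branch makes the regression easy to test: the chat-completions path never consults the map, even when one is configured.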
Bumps [typescript-eslint](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/typescript-eslint) from 8.61.0 to 8.61.1. - [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases) - [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/typescript-eslint/CHANGELOG.md) - [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v8.61.1/packages/typescript-eslint) --- updated-dependencies: - dependency-name: typescript-eslint dependency-version: 8.61.1 dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [@vitest/coverage-v8](https://github.com/vitest-dev/vitest/tree/HEAD/packages/coverage-v8) from 4.1.8 to 4.1.9. - [Release notes](https://github.com/vitest-dev/vitest/releases) - [Changelog](https://github.com/vitest-dev/vitest/blob/main/docs/releases.md) - [Commits](https://github.com/vitest-dev/vitest/commits/HEAD/packages/coverage-v8) --- updated-dependencies: - dependency-name: "@vitest/coverage-v8" dependency-version: 4.1.9 dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [eslint](https://github.com/eslint/eslint) from 10.4.1 to 10.5.0. - [Release notes](https://github.com/eslint/eslint/releases) - [Commits](eslint/eslint@v10.4.1...v10.5.0) --- updated-dependencies: - dependency-name: eslint dependency-version: 10.5.0 dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [@ai-sdk/openai](https://github.com/vercel/ai/tree/HEAD/packages/openai) from 3.0.69 to 3.0.71. - [Release notes](https://github.com/vercel/ai/releases) - [Changelog](https://github.com/vercel/ai/blob/@ai-sdk/openai@3.0.71/packages/openai/CHANGELOG.md) - [Commits](https://github.com/vercel/ai/commits/@ai-sdk/openai@3.0.71/packages/openai) --- updated-dependencies: - dependency-name: "@ai-sdk/openai" dependency-version: 3.0.71 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…event, not source substrings (#45)
Bumps [ai](https://github.com/vercel/ai/tree/HEAD/packages/ai) from 6.0.201 to 6.0.206. - [Release notes](https://github.com/vercel/ai/releases) - [Changelog](https://github.com/vercel/ai/blob/ai@6.0.206/packages/ai/CHANGELOG.md) - [Commits](https://github.com/vercel/ai/commits/ai@6.0.206/packages/ai) --- updated-dependencies: - dependency-name: ai dependency-version: 6.0.206 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…dates (#36) --- updated-dependencies: - dependency-name: "@openai/codex" dependency-version: 0.140.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: openai-codex - dependency-name: "@openai/codex-sdk" dependency-version: 0.140.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: openai-codex ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Gemini CLI >=0.46 emits MCP tool names as `mcp_<server>_<tool>` with a single-underscore prefix (see generateValidName in the bundle). Every other runner and the trace-based MCP graders use the double-underscore `mcp__` prefix (Claude Code emits it natively; codex/copilot translators normalize to it). The Gemini translator's mapMcpName was identity, so its tool calls were stored with a single underscore and never matched calledTool / calledToolOneOf — a successful MCP invocation scored 0%. Normalize the prefix in mapMcpName (idempotent for names already on the mcp__ convention).
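The prefix normalization described above might look roughly like the following. This is a sketch under the assumption that only the leading `mcp_` is rewritten, as the commit message states; the real `mapMcpName` body is not shown here and the tool names in the usage line are made up.

```typescript
// Sketch of an idempotent prefix normalizer (assumed behaviour, not the actual source).
function mapMcpName(name: string): string {
  // Names already on the double-underscore convention are returned untouched.
  if (name.startsWith("mcp__")) return name;
  // Gemini CLI >=0.46 emits a single-underscore prefix; widen it to "mcp__".
  if (name.startsWith("mcp_")) return "mcp__" + name.slice("mcp_".length);
  // Non-MCP tool names pass through unchanged.
  return name;
}

// Illustrative usage with a hypothetical tool name.
const normalized = mapMcpName("mcp_auth0_list_applications");
```

Checking the `mcp__` case first is what makes the function safe to apply twice.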
…dates (#48) Keep @anthropic-ai/claude-code and @anthropic-ai/claude-agent-sdk in lockstep so they are bumped in a single PR, mirroring the openai-codex group. Easier to test the two together than separately.
Bumps the anthropic group with 2 updates: [@anthropic-ai/claude-agent-sdk](https://github.com/anthropics/claude-agent-sdk-typescript) and [@anthropic-ai/claude-code](https://github.com/anthropics/claude-code). Updates `@anthropic-ai/claude-agent-sdk` from 0.3.173 to 0.3.178 - [Release notes](https://github.com/anthropics/claude-agent-sdk-typescript/releases) - [Changelog](https://github.com/anthropics/claude-agent-sdk-typescript/blob/main/CHANGELOG.md) - [Commits](anthropics/claude-agent-sdk-typescript@v0.3.173...v0.3.178) Updates `@anthropic-ai/claude-code` from 2.1.173 to 2.1.178 - [Release notes](https://github.com/anthropics/claude-code/releases) - [Changelog](https://github.com/anthropics/claude-code/blob/main/CHANGELOG.md) - [Commits](anthropics/claude-code@v2.1.173...v2.1.178) --- updated-dependencies: - dependency-name: "@anthropic-ai/claude-agent-sdk" dependency-version: 0.3.178 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: anthropic - dependency-name: "@anthropic-ai/claude-code" dependency-version: 2.1.178 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: anthropic ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
… proxy (#50) Bumps @github/copilot-sdk from 0.3.0 to 1.0.1 (supersedes dependabot #25). The major version has breaking API changes: `cwd` -> `workingDirectory` and `cliPath` -> `connection: RuntimeConnection.forStdio({ path })`. While migrating, route the Copilot runner's inference through the configured LLM proxy via an OpenAI-compatible BYOK provider, instead of GitHub's Copilot backend. The runner previously authenticated via the logged-in gh user, whose backend does not serve the GPT models the eval matrix expects (e.g. gpt-5.4). Uses the Responses API because Copilot emits freeform/custom tool calls (apply_patch) that the chat-completions API rejects — same as the codex runner.
* feat: inject compile_command guidance into agent context files
Add an optional `compile_command` PROMPT.md frontmatter field. When set,
a verify-compiles instruction is appended to the agent's native context
file (CLAUDE.md / GEMINI.md / AGENTS.md / copilot-instructions.md)
alongside the existing "no docs files" guidance, so the agent verifies
the project compiles and the command appears in the tool trace.
Wires the field into all 10 quickstart evals.
* fix: make compile_command guidance imperative so agents run it
The injected compile-verification guidance used permissive wording
("you can use this command"), so capable models produced correct code
but skipped the build — failing the mandatory build-verification grader.
Rephrase as a "you MUST run" instruction and assert the mandatory
wording in tests.
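A rough sketch of how the optional field could drive the injected guidance. `buildContextGuidance` and the no-docs wording are hypothetical placeholders; only the "you MUST run" phrasing and the `compile_command` field come from the commits above.

```typescript
// Placeholder wording for the existing "no docs files" guidance (assumed).
const NO_DOCS_GUIDANCE = "Do not create documentation files.";

// Hypothetical helper: builds the text appended to CLAUDE.md / GEMINI.md / AGENTS.md etc.
function buildContextGuidance(compileCommand?: string): string {
  const parts = [NO_DOCS_GUIDANCE];
  if (compileCommand) {
    // Imperative wording: permissive phrasing let agents skip the build.
    parts.push(
      `Before finishing, you MUST run \`${compileCommand}\` and confirm the project compiles.`,
    );
  }
  return parts.join("\n");
}
```

When `compile_command` is absent from the frontmatter, the guidance stays exactly as it was before, which keeps evals without a build step unaffected.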
Add a trigger row mapping package/runner/scoring/config/data-flow changes to docs/ARCHITECTURE.md, and extend the rule-of-thumb so the Mermaid diagrams are kept in sync with the code, not just the prose.
* docs: spec for post-run compile grader
Design for running compile_command against the workspace after the agent finishes and driving a compiles() grader from the captured result, so agents are graded on whether their output compiles rather than on whether they ran the build themselves.
* docs: implementation plan for post-run compile grader
* feat(graders): add CompileResult type
* feat(graders): add compiles() grader primitive
* feat(core): thread compileResult through GraderContext
* feat(core): add compile grader executor
* feat(core): add compileResult parameter to runGraders
* feat(core): add non-throwing runCompileCommand workspace helper
* feat(eval): run compile_command post-agent on the host path
* feat(eval): run compile_command post-agent in the sandbox path
* test: enable compiles() grader for frontend quickstarts
* docs: document compiles() grader and post-run compile behaviour
* chore(graders): format long type export added for CompileResult
* chore(core): format long type export line for RunCompileCommandOptions
* chore: remove superpowers spec and plan docs
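To illustrate the idea of grading on a captured result rather than on the agent's tool trace, here is a minimal sketch. The names `CompileResult`, `GraderContext` and `compiles()` come from the commit titles, but every field and the outcome shape below are assumptions, as is the choice to fail when no result was captured.

```typescript
// Assumed shapes; the real types in the repository may differ.
interface CompileResult {
  command: string;
  exitCode: number;
  output: string;
}

interface GraderContext {
  compileResult?: CompileResult;
}

interface GraderOutcome {
  pass: boolean;
  reason: string;
}

type Grader = (ctx: GraderContext) => GraderOutcome;

// Sketch of a compiles() primitive driven purely by the post-run result.
function compiles(): Grader {
  return (ctx) => {
    const result = ctx.compileResult;
    // Assumption: a missing result counts as a failure rather than a skip.
    if (!result) return { pass: false, reason: "no compile result captured" };
    return result.exitCode === 0
      ? { pass: true, reason: `${result.command} succeeded` }
      : { pass: false, reason: `${result.command} exited with code ${result.exitCode}` };
  };
}
```

Because the grader only reads `compileResult`, it gives the same verdict whether or not the agent ran the build itself.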
Mirrors the compile grader already present in react_quickstart — adds compile_command: npm run build to PROMPT.md frontmatter and a compiles() grader at L4 so generated MFA code is verified to build after each agent run.
* fix(compile): auto-run npm install before compile_command
If the agent adds new deps to package.json but never runs npm install, runCompileCommand would fail with "module not found". Now we prepend npm install automatically whenever package.json exists in the workspace, so the build always has up-to-date dependencies.
* fix(compile): use setup_command instead of heuristic npm install detection
Replace the package.json-existence heuristic that unconditionally prepended `npm install` with an explicit `setupCommand` option on `RunCompileCommandOptions`. Callers (run.ts, sandbox-runner.ts) now pass `evalDef.setupCommand` so the eval's own declared setup_command is reused; it is prepended only when it is defined.
* fix(compile): add comment explaining why setupCommand is prepended before compile
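The step ordering described in these fixes can be sketched as follows. `RunCompileCommandOptions` and `setupCommand` are named in the commit; the `compileCommand` field and the `buildCompileSteps` helper are hypothetical and exist only to show the ordering.

```typescript
// Assumed option shape: only setupCommand is confirmed by the commit message.
interface RunCompileCommandOptions {
  compileCommand: string;
  setupCommand?: string;
}

// Hypothetical helper returning the shell commands in execution order.
function buildCompileSteps(opts: RunCompileCommandOptions): string[] {
  const steps = [opts.compileCommand];
  // Setup runs first so dependencies the agent added to package.json are
  // installed before the build; it is skipped when the eval declares none.
  if (opts.setupCommand) steps.unshift(opts.setupCommand);
  return steps;
}
```

Reusing the eval's declared setup command avoids guessing the package manager from the presence of `package.json`.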
Aligns with new convention introduced in main (MFA eval updated same way). compile_command: npm run build added to both PROMPT.md files. compiles() grader added at L4 in both startup and enterprise graders.
What this adds

Two-persona eval for Custom Token Exchange (RFC 8693) with `@auth0/nextjs-auth0`.

* … nextjs-startup): CLI persona. Graders check SDK method presence + `ranCommand` for `POST /api/v2/token-exchange-profiles`
* … nextjs-enterprise): Terraform persona. Pre-seeded `infra/auth0/main.tf` with resource stub; `ranCommandOneOf` + `wroteFile` graders
* `agent-skills` repo (covers `customTokenExchange`, `CustomTokenExchangeError`, token type constraints, silent failures)
* `meta-eval.md` documenting expected grader rejections

Graders (per eval)

* `customTokenExchange`, `CustomTokenExchangeError`, `@auth0/nextjs-auth0/errors`, `subjectToken`, `subjectTokenType` present
* `getAccessTokenSilently`, `@auth0/auth0-react`, `urn:ietf:` absent

Silent failures covered

* `EXCHANGE_FAILED` with no hint
* `EXCHANGE_FAILED`
* `null` session, no error
* `undefined`, no error

Verification

`npm run build` passes, 687 tests pass. Meta-eval reviewed manually (no `a0-eval` binary available locally).
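The present/absent checks listed under the graders can be pictured with a naive substring sketch. This is an illustration only: `checkSource` is a hypothetical helper, and the actual graders may inspect traces or parsed code rather than raw substrings.

```typescript
// Strings taken from the grader summary in this PR description.
const REQUIRED = [
  "customTokenExchange",
  "CustomTokenExchangeError",
  "@auth0/nextjs-auth0/errors",
  "subjectToken",
  "subjectTokenType",
];
const FORBIDDEN = ["getAccessTokenSilently", "@auth0/auth0-react", "urn:ietf:"];

// Hypothetical helper: reports required strings that are missing and
// anti-pattern strings that are present in the generated source.
function checkSource(source: string): { missing: string[]; forbidden: string[] } {
  return {
    // Note: "subjectToken" is a prefix of "subjectTokenType", so a plain
    // substring test cannot tell the two apart.
    missing: REQUIRED.filter((s) => !source.includes(s)),
    forbidden: FORBIDDEN.filter((s) => source.includes(s)),
  };
}
```

A submission passes this sketch when both arrays come back empty.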