feat(evals): add Custom Token Exchange × Next.js two-persona eval by brth31 · Pull Request #57 · auth0/auth0-evals

brth31 · 2026-06-22T08:50:57Z

What this adds

Two-persona eval for Custom Token Exchange (RFC 8693) with @auth0/nextjs-auth0.

- Startup eval (`nextjs-startup`): CLI persona. Graders check SDK method presence plus `ranCommand` for `POST /api/v2/token-exchange-profiles`.
- Enterprise eval (`nextjs-enterprise`): Terraform persona. Pre-seeded `infra/auth0/main.tf` with a resource stub; `ranCommandOneOf` and `wroteFile` graders.
- Feature SKILL.md: saved separately to the agent-skills repo (covers `customTokenExchange`, `CustomTokenExchangeError`, token type constraints, silent failures).
- Known-broken variants: 4 variants, with meta-eval.md documenting the expected grader rejections.

Graders (per eval)

| Level | What it checks |
| --- | --- |
| L1 | `customTokenExchange`, `CustomTokenExchangeError`, `@auth0/nextjs-auth0/errors`, `subjectToken`, `subjectTokenType` present |
| L2 | `getAccessTokenSilently`, `@auth0/auth0-react`, `urn:ietf:` absent |
| L3 | No hardcoded client secret or client ID in source files |
| L4 | Server-side call, specific error handling, token exchange profile configured (CLI or Terraform) |
| L5 | Code ↔ config token type consistency; Terraform contextual fit (enterprise) |
| Holistic | End-to-end correctness judge (always runs) |
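
The L1 and L2 rows amount to presence and absence checks over the generated source. A minimal sketch of that idea, assuming plain substring matching; `checkSource`, `SubstringReport`, and the two constant names are hypothetical and are not the repo's grader API:

```typescript
// Hypothetical sketch of the L1 (must be present) and L2 (must be absent)
// checks. The substrings come from the grader table; everything else is
// an illustrative assumption.
const REQUIRED_SUBSTRINGS: readonly string[] = [
  "customTokenExchange",
  "CustomTokenExchangeError",
  "@auth0/nextjs-auth0/errors",
  "subjectToken",
  "subjectTokenType",
];

const FORBIDDEN_SUBSTRINGS: readonly string[] = [
  "getAccessTokenSilently",
  "@auth0/auth0-react",
  "urn:ietf:",
];

interface SubstringReport {
  missing: string[]; // required substrings that were not found
  forbiddenFound: string[]; // forbidden substrings that were found
  pass: boolean; // true only when both lists are empty
}

function checkSource(source: string): SubstringReport {
  const missing = REQUIRED_SUBSTRINGS.filter((s) => !source.includes(s));
  const forbiddenFound = FORBIDDEN_SUBSTRINGS.filter((s) => source.includes(s));
  return {
    missing,
    forbiddenFound,
    pass: missing.length === 0 && forbiddenFound.length === 0,
  };
}
```

Because `subjectToken` is a prefix of `subjectTokenType`, a plain substring check cannot tell the two apart; a stricter match (for example on the property key) would be needed to grade them independently.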

Silent failures covered

- Token type mismatch (code vs profile config) → `EXCHANGE_FAILED` with no hint
- No profile created → `EXCHANGE_FAILED`
- Session assumption after exchange → null session, no error
- Delegation refresh token assumption → undefined, no error
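
The first failure mode is the one the L5 consistency grader targets: the token type in code must equal the one on the profile. A rough sketch of such a comparison, assuming the code passes `subjectTokenType` as a string literal and the Terraform resource carries a `subject_token_type` attribute (that attribute name and all function names here are assumptions, not the repo's implementation):

```typescript
// Hypothetical sketch: pull the subject token type out of application code
// and out of Terraform config, then compare them.

// Matches e.g. subjectTokenType: "urn:acme:legacy-token" in TS/JS source.
function extractCodeTokenType(source: string): string | null {
  const m = source.match(/subjectTokenType\s*:\s*["'`]([^"'`]+)["'`]/);
  return m?.[1] ?? null;
}

// Matches e.g. subject_token_type = "urn:acme:legacy-token" in a .tf file.
// The attribute name is an assumption about the Terraform resource.
function extractProfileTokenType(terraform: string): string | null {
  const m = terraform.match(/subject_token_type\s*=\s*"([^"]+)"/);
  return m?.[1] ?? null;
}

// Consistent only when both sides declare a type and the values are equal.
function tokenTypesConsistent(source: string, terraform: string): boolean {
  const codeType = extractCodeTokenType(source);
  const profileType = extractProfileTokenType(terraform);
  return codeType !== null && codeType === profileType;
}
```

A regex cannot follow a token type that is built from a constant or an environment variable, so this sketch only covers the literal case.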

Verification

`npm run build` passes and 687 tests pass. The meta-eval was reviewed manually because no a0-eval binary was available locally.

Two-leg graders covering SDK integration (auth0.customTokenExchange, CustomTokenExchangeError) and tenant configuration (token exchange profile via CLI or Terraform). Startup persona uses CLI; enterprise persona uses Terraform with a pre-seeded infra/auth0/ workspace. Graders: L1 SDK method + error import, L2 wrong-SDK anti-patterns, L3 credential hygiene, L4 structural + ranCommand/ranCommandOneOf config, L5 code↔config consistency + Terraform contextual fit. Validated against 4 known-broken variants (meta-eval.md).

Bumps [@google/gemini-cli](https://github.com/google-gemini/gemini-cli) from 0.45.2 to 0.46.0. - [Release notes](https://github.com/google-gemini/gemini-cli/releases) - [Changelog](https://github.com/google-gemini/gemini-cli/blob/main/docs/releases.md) - [Commits](google-gemini/gemini-cli@v0.45.2...v0.46.0) --- updated-dependencies: - dependency-name: "@google/gemini-cli" dependency-version: 0.46.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>

Bumps [braintrust](https://github.com/braintrustdata/braintrust-sdk-javascript/tree/HEAD/js) from 3.10.0 to 3.17.0. - [Release notes](https://github.com/braintrustdata/braintrust-sdk-javascript/releases) - [Changelog](https://github.com/braintrustdata/braintrust-sdk-javascript/blob/main/js/CHANGELOG.md) - [Commits](https://github.com/braintrustdata/braintrust-sdk-javascript/commits/braintrust@3.17.0/js) --- updated-dependencies: - dependency-name: braintrust dependency-version: 3.17.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>

…#33) The modelIds map holds Bedrock IDs (global.anthropic.*) needed only by the Claude Code agent runner, which routes through the proxy's /anthropic pass-through endpoint. The LLM judge and the recommendation generator both POST to the LiteLLM /chat/completions endpoint, which serves models under their plain alias. Applying the map there rewrote e.g. claude-opus-4-8 to global.anthropic.claude-opus-4-8, which LiteLLM rejects with a 400. Send the model alias as-is from both paths and drop the now-unused modelMap plumbing from the judge. Add regression tests asserting the alias is sent unchanged even when a Bedrock map is configured.
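
The routing rule described above can be sketched as follows. The `Target` type, `resolveModel`, and the target labels are hypothetical names chosen for illustration; only the behaviour (map for the pass-through path, plain alias for `/chat/completions`) comes from the commit message:

```typescript
// Hypothetical sketch: the Bedrock ID map applies only to the /anthropic
// pass-through path; /chat/completions callers send the alias unchanged.
type Target = "anthropic-passthrough" | "chat-completions";

function resolveModel(
  alias: string,
  target: Target,
  modelIds: Record<string, string> = {},
): string {
  if (target === "anthropic-passthrough") {
    // Agent runner path: use the mapped Bedrock ID when one is configured.
    return modelIds[alias] ?? alias;
  }
  // Judge and recommendation generator path: never rewrite the alias.
  return alias;
}

const bedrockMap = { "claude-opus-4-8": "global.anthropic.claude-opus-4-8" };
const judgeModel = resolveModel("claude-opus-4-8", "chat-completions", bedrockMap);
const runnerModel = resolveModel("claude-opus-4-8", "anthropic-passthrough", bedrockMap);
```

Here `judgeModel` stays `claude-opus-4-8` while `runnerModel` becomes `global.anthropic.claude-opus-4-8`.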

Bumps [typescript-eslint](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/typescript-eslint) from 8.61.0 to 8.61.1. - [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases) - [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/typescript-eslint/CHANGELOG.md) - [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v8.61.1/packages/typescript-eslint) --- updated-dependencies: - dependency-name: typescript-eslint dependency-version: 8.61.1 dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Bumps [@vitest/coverage-v8](https://github.com/vitest-dev/vitest/tree/HEAD/packages/coverage-v8) from 4.1.8 to 4.1.9. - [Release notes](https://github.com/vitest-dev/vitest/releases) - [Changelog](https://github.com/vitest-dev/vitest/blob/main/docs/releases.md) - [Commits](https://github.com/vitest-dev/vitest/commits/HEAD/packages/coverage-v8) --- updated-dependencies: - dependency-name: "@vitest/coverage-v8" dependency-version: 4.1.9 dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Bumps [eslint](https://github.com/eslint/eslint) from 10.4.1 to 10.5.0. - [Release notes](https://github.com/eslint/eslint/releases) - [Commits](eslint/eslint@v10.4.1...v10.5.0) --- updated-dependencies: - dependency-name: eslint dependency-version: 10.5.0 dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Bumps [@ai-sdk/openai](https://github.com/vercel/ai/tree/HEAD/packages/openai) from 3.0.69 to 3.0.71. - [Release notes](https://github.com/vercel/ai/releases) - [Changelog](https://github.com/vercel/ai/blob/@ai-sdk/openai@3.0.71/packages/openai/CHANGELOG.md) - [Commits](https://github.com/vercel/ai/commits/@ai-sdk/openai@3.0.71/packages/openai) --- updated-dependencies: - dependency-name: "@ai-sdk/openai" dependency-version: 3.0.71 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…event, not source substrings (#45)

Bumps [ai](https://github.com/vercel/ai/tree/HEAD/packages/ai) from 6.0.201 to 6.0.206. - [Release notes](https://github.com/vercel/ai/releases) - [Changelog](https://github.com/vercel/ai/blob/ai@6.0.206/packages/ai/CHANGELOG.md) - [Commits](https://github.com/vercel/ai/commits/ai@6.0.206/packages/ai) --- updated-dependencies: - dependency-name: ai dependency-version: 6.0.206 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…dates (#36) --- updated-dependencies: - dependency-name: "@openai/codex" dependency-version: 0.140.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: openai-codex - dependency-name: "@openai/codex-sdk" dependency-version: 0.140.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: openai-codex ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Gemini CLI >=0.46 emits MCP tool names as `mcp_<server>_<tool>` with a single-underscore prefix (see generateValidName in the bundle). Every other runner and the trace-based MCP graders use the double-underscore `mcp__` prefix (Claude Code emits it natively; codex/copilot translators normalize to it). The Gemini translator's mapMcpName was identity, so its tool calls were stored with a single underscore and never matched calledTool / calledToolOneOf — a successful MCP invocation scored 0%. Normalize the prefix in mapMcpName (idempotent for names already on the mcp__ convention).
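
A minimal sketch of the prefix normalization described above. The function name `mapMcpName` comes from the commit message, but this body is an assumption about its behaviour and the tool names in the usage lines are made up:

```typescript
// Sketch: normalize a single-underscore `mcp_` prefix to the `mcp__`
// convention used by the other runners. Only the prefix is rewritten.
function mapMcpName(name: string): string {
  if (name.startsWith("mcp__")) {
    return name; // already on the double-underscore convention (idempotent)
  }
  if (name.startsWith("mcp_")) {
    return "mcp__" + name.slice("mcp_".length);
  }
  return name; // not an MCP tool name; leave untouched
}

const normalized = mapMcpName("mcp_auth0_list_clients");
const unchanged = mapMcpName("mcp__auth0_list_clients");
```

Applying the function twice gives the same result as applying it once, which keeps it safe to call on names that other translators have already normalized.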

…dates (#48) Keep @anthropic-ai/claude-code and @anthropic-ai/claude-agent-sdk in lockstep so they are bumped in a single PR, mirroring the openai-codex group. Easier to test the two together than separately.

Bumps the anthropic group with 2 updates: [@anthropic-ai/claude-agent-sdk](https://github.com/anthropics/claude-agent-sdk-typescript) and [@anthropic-ai/claude-code](https://github.com/anthropics/claude-code). Updates `@anthropic-ai/claude-agent-sdk` from 0.3.173 to 0.3.178 - [Release notes](https://github.com/anthropics/claude-agent-sdk-typescript/releases) - [Changelog](https://github.com/anthropics/claude-agent-sdk-typescript/blob/main/CHANGELOG.md) - [Commits](anthropics/claude-agent-sdk-typescript@v0.3.173...v0.3.178) Updates `@anthropic-ai/claude-code` from 2.1.173 to 2.1.178 - [Release notes](https://github.com/anthropics/claude-code/releases) - [Changelog](https://github.com/anthropics/claude-code/blob/main/CHANGELOG.md) - [Commits](anthropics/claude-code@v2.1.173...v2.1.178) --- updated-dependencies: - dependency-name: "@anthropic-ai/claude-agent-sdk" dependency-version: 0.3.178 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: anthropic - dependency-name: "@anthropic-ai/claude-code" dependency-version: 2.1.178 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: anthropic ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

… proxy (#50) Bumps @github/copilot-sdk from 0.3.0 to 1.0.1 (supersedes dependabot #25). The major version has breaking API changes: `cwd` -> `workingDirectory` and `cliPath` -> `connection: RuntimeConnection.forStdio({ path })`. While migrating, route the Copilot runner's inference through the configured LLM proxy via an OpenAI-compatible BYOK provider, instead of GitHub's Copilot backend. The runner previously authenticated via the logged-in gh user, whose backend does not serve the GPT models the eval matrix expects (e.g. gpt-5.4). Uses the Responses API because Copilot emits freeform/custom tool calls (apply_patch) that the chat-completions API rejects — same as the codex runner.

* feat: inject compile_command guidance into agent context files
  Add an optional `compile_command` PROMPT.md frontmatter field. When set, a verify-compiles instruction is appended to the agent's native context file (CLAUDE.md / GEMINI.md / AGENTS.md / copilot-instructions.md) alongside the existing "no docs files" guidance, so the agent verifies the project compiles and the command appears in the tool trace. Wires the field into all 10 quickstart evals.
* fix: make compile_command guidance imperative so agents run it
  The injected compile-verification guidance used permissive wording ("you can use this command"), so capable models produced correct code but skipped the build, failing the mandatory build-verification grader. Rephrase as a "you MUST run" instruction and assert the mandatory wording in tests.

Add a trigger row mapping package/runner/scoring/config/data-flow changes to docs/ARCHITECTURE.md, and extend the rule-of-thumb so the Mermaid diagrams are kept in sync with the code, not just the prose.

* docs: spec for post-run compile grader
  Design for running compile_command against the workspace after the agent finishes and driving a compiles() grader from the captured result, so agents are graded on whether their output compiles rather than on whether they ran the build themselves.
* docs: implementation plan for post-run compile grader
* feat(graders): add CompileResult type
* feat(graders): add compiles() grader primitive
* feat(core): thread compileResult through GraderContext
* feat(core): add compile grader executor
* feat(core): add compileResult parameter to runGraders
* feat(core): add non-throwing runCompileCommand workspace helper
* feat(eval): run compile_command post-agent on the host path
* feat(eval): run compile_command post-agent in the sandbox path
* test: enable compiles() grader for frontend quickstarts
* docs: document compiles() grader and post-run compile behaviour
* chore(graders): format long type export added for CompileResult
* chore(core): format long type export line for RunCompileCommandOptions
* chore: remove superpowers spec and plan docs
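
The idea of grading on a captured compile result, rather than on whether the agent ran the build, might look roughly like this. The field names on `CompileResultSketch`, the `GraderOutcome` shape, and the choice to fail when no result exists are all assumptions; the real `CompileResult` type and `compiles()` primitive may differ:

```typescript
// Hypothetical sketch of a post-run compile grader driven by a captured
// result. None of these shapes are taken from the repo.
interface CompileResultSketch {
  exitCode: number; // exit code of compile_command
  output: string; // combined stdout/stderr captured from the run
}

interface GraderOutcome {
  pass: boolean;
  reason: string;
}

function compilesSketch(result?: CompileResultSketch): GraderOutcome {
  if (result === undefined) {
    // Assumption: a missing result counts as a failure, not a skip.
    return { pass: false, reason: "compile_command was not run" };
  }
  if (result.exitCode === 0) {
    return { pass: true, reason: "compile_command exited with code 0" };
  }
  // Keep only the last few lines so the reason stays readable.
  const tail = result.output.split("\n").slice(-5).join("\n");
  return {
    pass: false,
    reason: `compile_command exited with code ${result.exitCode}:\n${tail}`,
  };
}
```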

Mirrors the compile grader already present in react_quickstart — adds compile_command: npm run build to PROMPT.md frontmatter and a compiles() grader at L4 so generated MFA code is verified to build after each agent run.

* fix(compile): auto-run npm install before compile_command
  If the agent adds new deps to package.json but never runs npm install, runCompileCommand would fail with "module not found". Now we prepend npm install automatically whenever package.json exists in the workspace, so the build always has up-to-date dependencies.
* fix(compile): use setup_command instead of heuristic npm install detection
  Replace the package.json-existence heuristic that unconditionally prepended `npm install` with an explicit `setupCommand` option on `RunCompileCommandOptions`. Callers (run.ts, sandbox-runner.ts) now pass `evalDef.setupCommand` so the eval's own declared setup_command is reused; it is unshifted onto the command list only when it exists.
* fix(compile): add comment explaining why setupCommand is prepended before compile
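
The ordering described in these commits can be sketched as a small pure function. `CompileStepOptions` and `buildCompileSteps` are hypothetical names; the actual fields of `RunCompileCommandOptions` are not shown in this PR:

```typescript
// Hypothetical sketch: run the eval's declared setup command (if any)
// before the compile command, so newly added dependencies are installed.
interface CompileStepOptions {
  compileCommand: string; // e.g. "npm run build"
  setupCommand?: string; // e.g. "npm install"; optional
}

function buildCompileSteps(opts: CompileStepOptions): string[] {
  const steps: string[] = [opts.compileCommand];
  if (opts.setupCommand) {
    // Prepend only when the eval declares a setup command.
    steps.unshift(opts.setupCommand);
  }
  return steps;
}

const withSetup = buildCompileSteps({
  compileCommand: "npm run build",
  setupCommand: "npm install",
});
const withoutSetup = buildCompileSteps({ compileCommand: "npm run build" });
```

With a setup command the steps are `["npm install", "npm run build"]`; without one, only the compile command runs.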

Aligns with new convention introduced in main (MFA eval updated same way). compile_command: npm run build added to both PROMPT.md files. compiles() grader added at L4 in both startup and enterprise graders.

brth31 mentioned this pull request Jun 22, 2026

feat(skills): add auth0-custom-token-exchange skill auth0/agent-skills#132

Open

dependabot Bot and others added 22 commits June 22, 2026 17:11

fix(swift-eval): accept fluent webAuth().start() chain in L5 grader (#34)

f49b04c

fix: strengthen express-api graders: verify issuer/audience via .env …

4df6dc1

…event, not source substrings (#45)

chore: group Anthropic claude-code and claude-agent-sdk dependabot up…

4ce53a8

…dates (#48) Keep @anthropic-ai/claude-code and @anthropic-ai/claude-agent-sdk in lockstep so they are bumped in a single PR, mirroring the openai-codex group. Easier to test the two together than separately.

docs: Add Architecture.md file (#35)

fb64064

docs(agents): require ARCHITECTURE.md diagram updates on drift (#52)

7286310

feat(mfa/react): add compile-time check to React MFA eval (#53)

69bbeb2

chore(evals): add compile_command and compiles() grader to CTE evals

6a5d485

brth31 wants to merge 23 commits into main from feat/add-custom-token-exchange-nextjs-eval

3 participants
