feat(evals): add Custom Token Exchange × Next.js two-persona eval#57

Draft
brth31 wants to merge 23 commits into main from feat/add-custom-token-exchange-nextjs-eval

@brth31 brth31 commented Jun 22, 2026

What this adds

Two-persona eval for Custom Token Exchange (RFC 8693) with @auth0/nextjs-auth0.

  • Startup eval (nextjs-startup) — CLI persona. Graders check SDK method presence + ranCommand for POST /api/v2/token-exchange-profiles
  • Enterprise eval (nextjs-enterprise) — Terraform persona. Pre-seeded infra/auth0/main.tf with resource stub; ranCommandOneOf + wroteFile graders
  • Feature SKILL.md — saved separately to agent-skills repo (covers customTokenExchange, CustomTokenExchangeError, token type constraints, silent failures)
  • Known-broken variants — 4 variants with meta-eval.md documenting expected grader rejections

Graders (per eval)

| Level | What it checks |
| --- | --- |
| L1 | customTokenExchange, CustomTokenExchangeError, @auth0/nextjs-auth0/errors, subjectToken, subjectTokenType present |
| L2 | getAccessTokenSilently, @auth0/auth0-react, urn:ietf: absent |
| L3 | No hardcoded client secret or client ID in source files |
| L4 | Server-side call, specific error handling, token exchange profile configured (CLI or Terraform) |
| L5 | Code ↔ config token type consistency; Terraform contextual fit (enterprise) |
| Holistic | End-to-end correctness judge (always runs) |

Silent failures covered

  • Token type mismatch (code vs profile config) → EXCHANGE_FAILED with no hint
  • No profile created → EXCHANGE_FAILED
  • Session assumption after exchange → null session, no error
  • Delegation refresh token assumption → undefined, no error
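
The token type mismatch above is the kind of failure a static check can catch before runtime. A minimal sketch, assuming the token type string can be extracted from both the application code and the profile config; `checkTokenTypes` and `TokenTypeCheck` are hypothetical names for illustration, not the eval's actual graders or part of the Auth0 SDK:

```typescript
// Hypothetical helper: not part of the eval harness or the Auth0 SDK.
interface TokenTypeCheck {
  ok: boolean;
  reason?: string;
}

// Compares the token type used in code against the one configured on the
// token exchange profile. The urn:ietf: test mirrors the L2 anti-pattern
// check listed in the graders table.
function checkTokenTypes(codeType: string, profileType: string): TokenTypeCheck {
  if (codeType.startsWith("urn:ietf:")) {
    return { ok: false, reason: "reserved urn:ietf: namespace used in code" };
  }
  if (codeType !== profileType) {
    return {
      ok: false,
      reason: `code uses "${codeType}" but profile expects "${profileType}"`,
    };
  }
  return { ok: true };
}
```

A check like this turns an opaque EXCHANGE_FAILED at runtime into a specific message at grading time.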

Verification

npm run build passes and 687 tests pass. Meta-eval was reviewed manually (no a0-eval binary is available locally).

Two-leg graders covering SDK integration (auth0.customTokenExchange,
CustomTokenExchangeError) and tenant configuration (token exchange profile
via CLI or Terraform). Startup persona uses CLI; enterprise persona uses
Terraform with a pre-seeded infra/auth0/ workspace.

Graders: L1 SDK method + error import, L2 wrong-SDK anti-patterns,
L3 credential hygiene, L4 structural + ranCommand/ranCommandOneOf config,
L5 code↔config consistency + Terraform contextual fit.

Validated against 4 known-broken variants (meta-eval.md).
dependabot Bot and others added 22 commits June 22, 2026 17:11
Bumps [@google/gemini-cli](https://github.com/google-gemini/gemini-cli) from 0.45.2 to 0.46.0.
- [Release notes](https://github.com/google-gemini/gemini-cli/releases)
- [Changelog](https://github.com/google-gemini/gemini-cli/blob/main/docs/releases.md)
- [Commits](google-gemini/gemini-cli@v0.45.2...v0.46.0)

---
updated-dependencies:
- dependency-name: "@google/gemini-cli"
  dependency-version: 0.46.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [braintrust](https://github.com/braintrustdata/braintrust-sdk-javascript/tree/HEAD/js) from 3.10.0 to 3.17.0.
- [Release notes](https://github.com/braintrustdata/braintrust-sdk-javascript/releases)
- [Changelog](https://github.com/braintrustdata/braintrust-sdk-javascript/blob/main/js/CHANGELOG.md)
- [Commits](https://github.com/braintrustdata/braintrust-sdk-javascript/commits/braintrust@3.17.0/js)

---
updated-dependencies:
- dependency-name: braintrust
  dependency-version: 3.17.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
…#33)

The modelIds map holds Bedrock IDs (global.anthropic.*) needed only by
the Claude Code agent runner, which routes through the proxy's /anthropic
pass-through endpoint. The LLM judge and the recommendation generator both
POST to the LiteLLM /chat/completions endpoint, which serves models under
their plain alias. Applying the map there rewrote e.g. claude-opus-4-8 to
global.anthropic.claude-opus-4-8, which LiteLLM rejects with a 400.

Send the model alias as-is from both paths and drop the now-unused
modelMap plumbing from the judge. Add regression tests asserting the alias
is sent unchanged even when a Bedrock map is configured.
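
The split described in this commit can be sketched as two resolution paths. This is an illustration only: the function names and the `ModelIds` type are assumptions, not the repository's code.

```typescript
// Hypothetical sketch of the two model-resolution paths.
type ModelIds = Record<string, string>;

// Claude Code agent runner: routes through the /anthropic pass-through,
// so the alias is translated to its Bedrock ID when one is configured.
function modelForAgentRunner(alias: string, modelIds: ModelIds): string {
  const mapped: string | undefined = modelIds[alias];
  return mapped ?? alias;
}

// LLM judge and recommendation generator: POST to /chat/completions,
// which serves models under their plain alias, so no mapping is applied.
function chatCompletionsBody(alias: string, prompt: string) {
  return { model: alias, messages: [{ role: "user", content: prompt }] };
}
```

Keeping the map out of the chat-completions path entirely, rather than guarding it with a flag, is what lets the judge drop the modelMap plumbing.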
Bumps [typescript-eslint](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/typescript-eslint) from 8.61.0 to 8.61.1.
- [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases)
- [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/typescript-eslint/CHANGELOG.md)
- [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v8.61.1/packages/typescript-eslint)

---
updated-dependencies:
- dependency-name: typescript-eslint
  dependency-version: 8.61.1
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [@vitest/coverage-v8](https://github.com/vitest-dev/vitest/tree/HEAD/packages/coverage-v8) from 4.1.8 to 4.1.9.
- [Release notes](https://github.com/vitest-dev/vitest/releases)
- [Changelog](https://github.com/vitest-dev/vitest/blob/main/docs/releases.md)
- [Commits](https://github.com/vitest-dev/vitest/commits/HEAD/packages/coverage-v8)

---
updated-dependencies:
- dependency-name: "@vitest/coverage-v8"
  dependency-version: 4.1.9
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [eslint](https://github.com/eslint/eslint) from 10.4.1 to 10.5.0.
- [Release notes](https://github.com/eslint/eslint/releases)
- [Commits](eslint/eslint@v10.4.1...v10.5.0)

---
updated-dependencies:
- dependency-name: eslint
  dependency-version: 10.5.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [@ai-sdk/openai](https://github.com/vercel/ai/tree/HEAD/packages/openai) from 3.0.69 to 3.0.71.
- [Release notes](https://github.com/vercel/ai/releases)
- [Changelog](https://github.com/vercel/ai/blob/@ai-sdk/openai@3.0.71/packages/openai/CHANGELOG.md)
- [Commits](https://github.com/vercel/ai/commits/@ai-sdk/openai@3.0.71/packages/openai)

---
updated-dependencies:
- dependency-name: "@ai-sdk/openai"
  dependency-version: 3.0.71
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [ai](https://github.com/vercel/ai/tree/HEAD/packages/ai) from 6.0.201 to 6.0.206.
- [Release notes](https://github.com/vercel/ai/releases)
- [Changelog](https://github.com/vercel/ai/blob/ai@6.0.206/packages/ai/CHANGELOG.md)
- [Commits](https://github.com/vercel/ai/commits/ai@6.0.206/packages/ai)

---
updated-dependencies:
- dependency-name: ai
  dependency-version: 6.0.206
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…dates (#36)

---
updated-dependencies:
- dependency-name: "@openai/codex"
  dependency-version: 0.140.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: openai-codex
- dependency-name: "@openai/codex-sdk"
  dependency-version: 0.140.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: openai-codex
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Gemini CLI >=0.46 emits MCP tool names as `mcp_<server>_<tool>` with a
single-underscore prefix (see generateValidName in the bundle). Every other
runner and the trace-based MCP graders use the double-underscore `mcp__`
prefix (Claude Code emits it natively; codex/copilot translators normalize
to it). The Gemini translator's mapMcpName was identity, so its tool calls
were stored with a single underscore and never matched calledTool /
calledToolOneOf — a successful MCP invocation scored 0%.

Normalize the prefix in mapMcpName (idempotent for names already on the
mcp__ convention).
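
A minimal sketch of the prefix normalization, assuming only the leading `mcp_` needs rewriting; the body below is illustrative and may differ from the actual mapMcpName in the Gemini translator:

```typescript
// Illustrative version of mapMcpName: rewrites the single-underscore
// prefix to the double-underscore convention used by the graders.
function mapMcpName(name: string): string {
  // Already normalized: return unchanged so the function is idempotent.
  if (name.startsWith("mcp__")) {
    return name;
  }
  // Gemini CLI >=0.46 style: mcp_<server>_<tool>
  if (name.startsWith("mcp_")) {
    return "mcp__" + name.slice("mcp_".length);
  }
  // Not an MCP tool name: leave it alone.
  return name;
}
```

The double-underscore check has to come first, because every `mcp__` name also starts with `mcp_`.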
…dates (#48)

Keep @anthropic-ai/claude-code and @anthropic-ai/claude-agent-sdk in
lockstep so they are bumped in a single PR, mirroring the openai-codex
group. Easier to test the two together than separately.
Bumps the anthropic group with 2 updates: [@anthropic-ai/claude-agent-sdk](https://github.com/anthropics/claude-agent-sdk-typescript) and [@anthropic-ai/claude-code](https://github.com/anthropics/claude-code).


Updates `@anthropic-ai/claude-agent-sdk` from 0.3.173 to 0.3.178
- [Release notes](https://github.com/anthropics/claude-agent-sdk-typescript/releases)
- [Changelog](https://github.com/anthropics/claude-agent-sdk-typescript/blob/main/CHANGELOG.md)
- [Commits](anthropics/claude-agent-sdk-typescript@v0.3.173...v0.3.178)

Updates `@anthropic-ai/claude-code` from 2.1.173 to 2.1.178
- [Release notes](https://github.com/anthropics/claude-code/releases)
- [Changelog](https://github.com/anthropics/claude-code/blob/main/CHANGELOG.md)
- [Commits](anthropics/claude-code@v2.1.173...v2.1.178)

---
updated-dependencies:
- dependency-name: "@anthropic-ai/claude-agent-sdk"
  dependency-version: 0.3.178
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: anthropic
- dependency-name: "@anthropic-ai/claude-code"
  dependency-version: 2.1.178
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: anthropic
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
… proxy (#50)

Bumps @github/copilot-sdk from 0.3.0 to 1.0.1 (supersedes dependabot #25).
The major version has breaking API changes: `cwd` -> `workingDirectory` and
`cliPath` -> `connection: RuntimeConnection.forStdio({ path })`.

While migrating, route the Copilot runner's inference through the configured
LLM proxy via an OpenAI-compatible BYOK provider, instead of GitHub's Copilot
backend. The runner previously authenticated via the logged-in gh user, whose
backend does not serve the GPT models the eval matrix expects (e.g. gpt-5.4).
Uses the Responses API because Copilot emits freeform/custom tool calls
(apply_patch) that the chat-completions API rejects — same as the codex runner.
* feat: inject compile_command guidance into agent context files

Add an optional `compile_command` PROMPT.md frontmatter field. When set,
a verify-compiles instruction is appended to the agent's native context
file (CLAUDE.md / GEMINI.md / AGENTS.md / copilot-instructions.md)
alongside the existing "no docs files" guidance, so the agent verifies
the project compiles and the command appears in the tool trace.

Wires the field into all 10 quickstart evals.

* fix: make compile_command guidance imperative so agents run it

The injected compile-verification guidance used permissive wording
("you can use this command"), so capable models produced correct code
but skipped the build — failing the mandatory build-verification grader.
Rephrase as a "you MUST run" instruction and assert the mandatory
wording in tests.
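
The injection can be sketched as a function that builds the guidance lines for the context file. The function name and the exact sentences are placeholders; only the optional compile_command field and the "you MUST run" phrasing come from the commits above.

```typescript
// Hypothetical sketch: wording and function name are illustrative only.
function buildAgentGuidance(compileCommand?: string): string[] {
  // Placeholder for the existing "no docs files" guidance.
  const lines = ["Do not create documentation files."];
  if (compileCommand) {
    // Imperative wording, so the agent runs the build instead of skipping it.
    lines.push(
      `You MUST run \`${compileCommand}\` and confirm the project compiles before finishing.`,
    );
  }
  return lines;
}
```

When compile_command is absent from the frontmatter, the output is the same as before the feature existed.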
Add a trigger row mapping package/runner/scoring/config/data-flow changes
to docs/ARCHITECTURE.md, and extend the rule-of-thumb so the Mermaid
diagrams are kept in sync with the code, not just the prose.
* docs: spec for post-run compile grader

Design for running compile_command against the workspace after the agent
finishes and driving a compiles() grader from the captured result, so
agents are graded on whether their output compiles rather than on whether
they ran the build themselves.

* docs: implementation plan for post-run compile grader

* feat(graders): add CompileResult type

* feat(graders): add compiles() grader primitive

* feat(core): thread compileResult through GraderContext

* feat(core): add compile grader executor

* feat(core): add compileResult parameter to runGraders

* feat(core): add non-throwing runCompileCommand workspace helper

* feat(eval): run compile_command post-agent on the host path

* feat(eval): run compile_command post-agent in the sandbox path

* test: enable compiles() grader for frontend quickstarts

* docs: document compiles() grader and post-run compile behaviour

* chore(graders): format long type export added for CompileResult

* chore(core): format long type export line for RunCompileCommandOptions

* chore: remove superpowers spec and plan docs
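
A rough sketch of how a compiles() grader could read the captured result from the grader context. The field names and the `GraderOutcome` shape are assumptions made for illustration; the repository's CompileResult and GraderContext types may differ.

```typescript
// Hypothetical shapes; the real CompileResult and grader types may differ.
interface CompileResult {
  exitCode: number;
  output: string;
}

interface GraderContext {
  compileResult?: CompileResult;
}

interface GraderOutcome {
  pass: boolean;
  message: string;
}

// Returns a grader that reads the post-run compile result instead of
// checking whether the agent ran the build itself.
function compiles(): (ctx: GraderContext) => GraderOutcome {
  return (ctx) => {
    const result = ctx.compileResult;
    if (!result) {
      return { pass: false, message: "no compile result captured" };
    }
    return result.exitCode === 0
      ? { pass: true, message: "compile succeeded" }
      : { pass: false, message: `compile failed with exit code ${result.exitCode}` };
  };
}
```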
Mirrors the compile grader already present in react_quickstart — adds
compile_command: npm run build to PROMPT.md frontmatter and a
compiles() grader at L4 so generated MFA code is verified to build
after each agent run.
* fix(compile): auto-run npm install before compile_command

If the agent adds new deps to package.json but never runs npm install,
runCompileCommand would fail with "module not found". Now we prepend
npm install automatically whenever package.json exists in the workspace,
so the build always has up-to-date dependencies.

* fix(compile): use setup_command instead of heuristic npm install detection

Replace the package.json-existence heuristic that unconditionally prepended
`npm install` with an explicit `setupCommand` option on `RunCompileCommandOptions`.
Callers (run.ts, sandbox-runner.ts) now pass `evalDef.setupCommand` so the
eval's own declared setup_command is reused — only unshifted when it exists.

* fix(compile): add comment explaining why setupCommand is prepended before compile
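
The ordering described in these commits can be sketched as follows. Only `RunCompileCommandOptions` and `setupCommand` are named in the source; the `compileCommand` field and the `compileSteps` helper are assumptions for illustration.

```typescript
// Hypothetical sketch; field names other than setupCommand are assumed.
interface RunCompileCommandOptions {
  compileCommand: string;
  setupCommand?: string;
}

// Builds the ordered list of shell commands to run after the agent finishes.
function compileSteps(options: RunCompileCommandOptions): string[] {
  const steps = [options.compileCommand];
  // Prepend setup so dependencies the agent added to package.json are
  // installed before the build runs. Skipped when no setup is declared.
  if (options.setupCommand) {
    steps.unshift(options.setupCommand);
  }
  return steps;
}
```

Reusing the eval's declared setup command avoids guessing the package manager from the presence of package.json.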
Aligns with new convention introduced in main (MFA eval updated same way).
compile_command: npm run build added to both PROMPT.md files.
compiles() grader added at L4 in both startup and enterprise graders.

3 participants