CLI reference

This file documents every cb subcommand: synopsis, flags, behavior, exit codes, and example invocations. Run any leaf with --help for the canonical, machine-generated version of the flag list.

cb is the short alias for chi-bench; both binaries resolve to the same Typer app (see pyproject.toml [project.scripts]). Pick whichever you prefer. If your shell already aliases cb to something else (e.g. a clipboard tool), use chi-bench.

Command groups

Group	Purpose
`cb serve`	Run the in-process simulation server (FastAPI + judge + MCP threads).
`cb mcp`	Run one or more MCP servers in isolation (provider, payer, care-management).
`cb experiment`	Single-config experiment lifecycle: `run`, `rejudge`, `status`.
`cb data`	Dataset operations: `verify`, `download`.
`cb submission`	Leaderboard submission lifecycle: `validate`, `run`, `status`, `package`.
`cb docker`	Build the `chi-bench:latest` single-image container.

Exit-code convention used throughout:

0 — success.
1 — runtime/validation failure (missing files, preflight errors, command returned non-zero).
2 — schema or YAML-load failure (typically cb submission validate / cb submission run / cb submission status / cb submission prepare).

`cb serve`

Boot the chi-Bench simulation backend in-process: FastAPI on --port, all three MCP servers on their fixed ports (provider 8020, payer 8100, cm 8200) on threads, and optionally the Vite frontend dev server.

cb serve [--port 8023] [--host 0.0.0.0] [--frontend | --no-frontend]
         [--frontend-port 5180] [--frontend-wait-timeout 120.0]
         [--log-level INFO]

Behavior

Before binding, the command attempts to kill any chi-Bench-related process already listening on --port (matched by command name against a small allowlist: python, python3, node, npm, uvicorn). Other processes are left alone with a warning. If lsof is not installed (e.g. inside slim Docker images), the kill step is skipped.
Sets CHI_BENCH_PAYER_MODE=agent if unset.
With --frontend, the frontend dev server is launched only after the backend reports healthy (or --frontend-wait-timeout elapses).

Examples

# Default: backend only, no frontend.
uv run cb serve

# Backend + frontend dev server on :5180.
uv run cb serve --frontend

`cb mcp`

Start one or all of the chi-Bench MCP servers as standalone processes (without the FastAPI shell).

cb mcp [--provider | --no-provider] [--payer | --no-payer] [--cm | --no-cm]
       [--host 0.0.0.0] [--log-level INFO]

Behavior

No flags = start all three (provider on :8020, payer on :8100, cm on :8200).
Any combination of --provider/--payer/--cm restricts to that subset.

Examples

# All three (the default Docker entrypoint flavor).
uv run cb mcp

# Only the payer MCP — useful when the provider/CM are already running.
uv run cb mcp --payer

`cb experiment`

The single-config experiment surface. A "config" here is a flat ExperimentConfig YAML — the matrix-style table configs (scripts/run_table.sh) are decomposed into single configs by scripts/_emit_run_table_commands.py before reaching this CLI.

`cb experiment run`

Run a Harbor trial (or set of trials) from an ExperimentConfig, optionally overriding fields from the command line.

cb experiment run [-f <config.yaml>] [--dataset <path-or-id>]
                  [--agent <id>] [--model <id>] [-n <int>]
                  [--trials-dir <dir>]
                  [-e docker|modal] [--modal-profile <name>]
                  [--modal-sandbox-timeout-secs <int>]
                  [--modal-force-build | --no-modal-force-build]

Flags

Flag	Default	Notes
`-f / --config`	—	Path to an `ExperimentConfig` YAML. Required unless every field is provided via flags.
`--dataset`	—	Path to a dataset directory or a registry ID.
`--agent`	`claude-code`	Agent identifier (e.g. `claude-code`, `codex`, `openclaw`).
`--model`	—	Model identifier in the agent's namespace.
`-n`	`1`	Concurrent trials.
`--trials-dir`	—	Harbor trials output directory; falls back to the config's `paths.trials_dir`.
`-e / --environment`	`docker`	Either `docker` (local single-image) or `modal` (remote sandbox parallelism). The runner auto-terminates its own sandboxes on exit.
`--modal-profile`	`actava`	Modal profile from `~/.modal.toml`. Pass `''` to skip Modal preflight.
`--modal-sandbox-timeout-secs`	`86400`	Per-sandbox lifetime cap. Matches YAML key `modal.sandbox_timeout_secs`.
`--modal-force-build`	unset	Force Modal to rebuild the environment image.

Examples

# One PA-UM trial via the local Docker harness.
uv run cb experiment run \
    --dataset data/prior_auth_um/tasks/pa_t008_t008_o002_p01_mdreview_payer \
    --agent codex --model openai/gpt-5.5

# Drive an entire config via Modal with 4-way parallelism.
uv run cb experiment run -f configs/pa_full.yaml -e modal -n 4

`cb experiment rejudge`

Rerun only the judge stage across every trial under --trial-root. Original verifier/ directories are never modified — rejudge artifacts land in <trial>/<output-subdir>/.

cb experiment rejudge --trial-root <dir> [-e local|modal]
                      [-n / --concurrency <int>]
                      [--output-subdir verifier_rejudge]
                      [--task-dataset-root data/care_management/tasks]

Flags

Flag	Default	Notes
`--trial-root`	required	Directory containing Harbor trial subdirs (each with a `verifier/`).
`-e / --environment`	`modal`	`local` = sequential in-process; `modal` = `ThreadPoolExecutor` at `--concurrency` (full Modal-sandbox wiring is a follow-up).
`-n / --concurrency`	`10`	Max parallel judge sessions when `-e modal`.
`--output-subdir`	`verifier_rejudge`	Per-trial output directory for rejudge artifacts (`scorecard.json`, `verdicts.json`, `judge_session_metadata.json`).
`--task-dataset-root`	`data/care_management/tasks`	Each trial's `expectations.json` is resolved by stripping the Harbor `__<slug>` suffix from the trial dir name and looking under `<root>/<task_id>/fixtures/`.

Example

uv run cb experiment rejudge \
    --trial-root logs/experiments/cm_2025-12-15 \
    -e local

`cb experiment status`

Compact pass/fail listing for trials in a directory.

cb experiment status [--trials-dir trials/]

Behavior

Iterates trials_dir/*/, reading each trial's status and reward.
Exits 1 if --trials-dir does not exist.

Example

uv run cb experiment status --trials-dir logs/experiments/pa_um_2025-12-15

`cb agent`

Inspect the agent registry that backs submission.agent: and the --agent flag.

`cb agent list`

uv run cb agent list           # human-readable table
uv run cb agent list --json    # array of {name, kind, import_path, env_vars}

Two kinds of agent appear:

in-tree — harness classes shipped under src/chi_bench/experiment/agents/. Dispatched by Harbor via --agent-import-path. To add one, see docs/extending.md § 3.
harbor-builtin — names Harbor knows about natively (claude-code, codex). Dispatched via -a <name>. Not added by chi-bench.

The env_vars column is a best-effort list of provider keys each harness's model routing reads — it is not the authoritative env allowlist (that's AGENT_ENV_ALLOWLIST in src/chi_bench/experiment/runner.py).

`cb data`

`cb data verify`

Auto-detect the data layout (source data/<domain>/tasks/... tree, or the baked image's flat tasks/) and assert every expected directory + task count is present.

cb data verify [--data-dir data]

Behavior

Source layout (typical host install): one <domain>/tasks/ directory per task family.
Baked image layout (inside chi-bench:latest): all families flattened under tasks/, with marathon/ and worlds/ as siblings. The handbook lives at /workspace/skills/managed-care-operations-handbook in this layout.
Auto-detection: if data_dir/tasks exists but data_dir/prior_auth_provider/tasks does not, the baked layout is assumed.
Exits 0 with OK — data layout matches expectations. on success, 1 with missing: / count mismatch: lines on failure.

Example

uv run cb data verify
uv run cb data verify --data-dir /opt/chi-bench

`cb data download`

Convenience wrapper around huggingface-cli download that also writes the revision tag to data/.chi-bench-version so submission preflight can verify the pin.

cb data download --revision <tag> [--repo-id actava/chi-bench] [--data-dir data]

Flags

Flag	Default	Notes
`--revision`	required	HF dataset revision tag, e.g. `chi-bench-v1.0.0`.
`--repo-id`	`actava/chi-bench`	HF repo id.
`--data-dir`	`data`	Local data root.

Equivalent raw form (documented as the primary route in README.md):

REV=chi-bench-v1.0.0
uv run huggingface-cli download actava/chi-bench --repo-type dataset \
    --revision "$REV" --local-dir data/
echo "$REV" > data/.chi-bench-version

cb data download exists so the pin file is written in one step; the README documents the raw HF CLI invocation as the canonical path.

`cb submission`

Leaderboard submission lifecycle. See configs/submission_example.yaml for the YAML schema.

`cb submission validate`

Schema-check a submission YAML; optionally run preflight (dataset version pin, Docker image / Modal token, agent name).

cb submission validate -f <sub.yaml> [--skip-preflight]

Behavior

Always runs schema validation. Exits 2 on schema errors or YAML load failures.
Without --skip-preflight, also probes data/.chi-bench-version, the chosen run.environment's CLI/token, and warns on unknown agent names.
Warnings are printed but non-blocking — exit code is 0 if no errors.
Errors raise exit 1.

Example

uv run cb submission validate -f configs/submissions/my-team.yaml

`cb submission run`

Load the submission YAML, optionally preflight, then run each selected domain through run_experiment(). Raw per-trial outputs land under paths.output_root (default logs/submissions/<id>/).

cb submission run -f <sub.yaml> [--domain pa|um|cm ...] [--skip-preflight]

Flags

Flag	Default	Notes
`-f / --config`	required	Submission YAML to execute.
`--domain`	every domain in the YAML	Restrict to one or more domains (`pa`, `um`, `cm`, or canonical names like `pa_provider`). Repeatable. Partial submissions are development-only — leaderboard policy requires all three.
`--skip-preflight`	unset	Skip dataset/environment/agent preflight (not recommended).

Exit codes

0 — submission completed.
1 — preflight errors, or one or more trials raised.
2 — schema/YAML failure.

Example

uv run cb submission run -f configs/submissions/my-team.yaml

`cb submission status`

Per-domain pass/fail tallies for an in-progress or completed submission. Safe to run while submission run is in flight.

cb submission status -f <sub.yaml> [--json]

Behavior

Counts only trials whose result.json carries a verifier_result block — the run-level aggregate result.json and incomplete trials are skipped (so the tally never claims more evidence than exists on disk).
Default output is a human-readable table; --json emits a machine-readable shape.

Example

uv run cb submission status -f configs/submissions/my-team.yaml
uv run cb submission status -f configs/submissions/my-team.yaml --json

`cb submission prepare`

Curate a submission run output tree into a leaderboard-ready packet directory.

cb submission prepare -f <sub.yaml> [-o <out-dir>] [--date YYYY-MM-DD] [--force]

Behavior

Writes <out-dir>/<YYYY-MM-DD>-<submission_id>/ containing submission.json (leaderboard-spec shape with nested results.overall + results.per_domain), results.csv (one row per domain + overall, with benchmark and dataset_version columns), frozen sub.yaml, provenance.json, an auto-generated README.md, and per-trial result.json + verifier/scorecard.json + verifier/reward.json + agent/trajectory.jsonl.zst.
Trajectories are re-serialized as JSONL and zstd-compressed (level 19). Inspect with zstdcat trajectory.jsonl.zst | jq ..
Workspace artifacts, server logs, agent session caches, and Harbor scratch are deliberately excluded. Typical packet size: <100 MB for a 75-trial pass@1 run; trajectories typically compress 5–20×.
The raw trial tree on disk is unchanged.

Flags

-f / --config — submission YAML (required).
-o / --out — packet base directory. Default: <output_root>/packet/.
--date — override the YYYY-MM-DD prefix on the packet directory name. Default: today (UTC).
--force — overwrite an existing packet at the target path.

Example

uv run cb submission prepare -f configs/submissions/my-team.yaml
# Packet ready: logs/submissions/my-team/packet/2026-05-12-my-team-claude-code-opus-4-6/

Submit the packet via the actava-ai/leaderboard repo (manual cp + git + gh pr or python scripts/submit.py). See the leaderboard's README for the current submission flow.

`cb docker`

`cb docker build`

Build the chi-Bench single-image container.

cb docker build [-t chi-bench:latest] [--target runtime|ci-skeleton]

Flags

Flag	Default	Notes
`-t / --tag`	`chi-bench:latest`	Image tag.
`--target`	`runtime`	Build stage: `runtime` (full image), or `ci-skeleton` (faster, used by `tests/smoke/test_docker_image_builds.py`).

Behavior

Echoes the underlying docker build ... invocation to stdout, then execs it. Exit code mirrors Docker's.

Example

uv run cb docker build
uv run cb docker build --target ci-skeleton -t chi-bench:ci

Environment variables

API keys and tokens are read from process env (or run.env_file for cb submission run, default .env). The relevant variables (see .env.example):

Variable	Used by
`ANTHROPIC_API_KEY`	Claude-family agents and the judge.
`OPENAI_API_KEY`	Codex rows.
`GEMINI_API_KEY`	Gemini-CLI rows.
`OPENROUTER_API_KEY`	OpenClaw, Hermes, OAI Agents, DeepAgents rows.
`CLAUDE_CODE_OAUTH_TOKEN`	Claude Code agent harness OAuth token.
`HF_TOKEN`	Hugging Face dataset access (private/gated revisions).
`MODAL_PROFILE`	Modal profile for `cb experiment run -e modal` and `cb submission run` with `run.environment: modal`.
`CHI_BENCH_PAYER_MODE`	Auto-set to `agent` by `cb serve` if unset; controls the in-container payer routing.

Provide whichever subset matches the rows you intend to run.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLI reference

Command groups

`cb serve`

`cb mcp`

`cb experiment`

`cb experiment run`

`cb experiment rejudge`

`cb experiment status`

`cb agent`

`cb agent list`

`cb data`

`cb data verify`

`cb data download`

`cb submission`

`cb submission validate`

`cb submission run`

`cb submission status`

`cb submission prepare`

`cb docker`

`cb docker build`

Environment variables

FilesExpand file tree

cli.md

Latest commit

History

cli.md

File metadata and controls

CLI reference

Command groups

cb serve

cb mcp

cb experiment

cb experiment run

cb experiment rejudge

cb experiment status

cb agent

cb agent list

cb data

cb data verify

cb data download

cb submission

cb submission validate

cb submission run

cb submission status

cb submission prepare

cb docker

cb docker build

Environment variables

`cb serve`

`cb mcp`

`cb experiment`

`cb experiment run`

`cb experiment rejudge`

`cb experiment status`

`cb agent`

`cb agent list`

`cb data`

`cb data verify`

`cb data download`

`cb submission`

`cb submission validate`

`cb submission run`

`cb submission status`

`cb submission prepare`

`cb docker`

`cb docker build`