Automated evaluation for SDPO continual learning. Runs feedback loops against a live CLaaS stack and measures whether training shifts the model toward preferred behaviours without collapsing.
The eval harness uses Hydra for configuration. The default config lives in `configs/base.yaml` and can be overridden via CLI arguments using Hydra's `key=value` syntax.
```yaml
mode: tinker                          # execution backend: local | tinker | modal
claas_url: http://localhost:8080      # CLaaS API endpoint
base_model: Qwen/Qwen3-30B-A3B        # base model for LoRA init (Tinker name)
preferences:                          # preferences to train
  - no_emoji
  - concise
  - identity
metrics:                              # metrics to evaluate per step
  - logprob
  - compliance
  - general
  - collapse
num_steps: 20
batch_size: 4
collapse_steps: [0, 5, 10, 15, 19]    # steps where collapse metric runs
plots: true                           # generate matplotlib plots
seed: 42
lora_id_prefix: eval
output_dir: ./data/evals/${now:%Y%m%d-%H%M%SZ}
openclaw_url: http://localhost:18789  # OpenClaw gateway (null = use CLaaS API directly)
training:                             # forwarded to /v1/feedback TrainingConfig
  learning_rate: 3e-5
  alpha: 0.5
  is_clip: 5.0
  max_grad_norm: 1.0
  kl_reg_weight: 0.0
  teacher_top_k: 100
  steps_per_batch: 4                  # gradient updates per batch
  feedback_repetitions: 1             # times to repeat feedback string
```

Hydra overrides are positional arguments after `uv run python -m claas.eval`:
```bash
# Run only conciseness for 10 steps
uv run python -m claas.eval 'preferences=[concise]' num_steps=10

# Override base model and mode
uv run python -m claas.eval base_model=Qwen/Qwen3-30B-A3B mode=tinker

# Override training hyperparameters
uv run python -m claas.eval training.is_clip=7.0 training.learning_rate=1e-4

# Use a custom config directory
uv run python -m claas.eval --config-dir ./my_configs --config-name my_config
```

The harness can also be invoked programmatically:

```python
from claas.eval.runner import run_harness
from claas.eval.types import EvalConfig
import asyncio

config = EvalConfig(
    preferences=["concise"],
    num_steps=5,
    output_dir="./data/evals/manual-run",  # explicit when bypassing Hydra CLI
)
asyncio.run(run_harness(config))
```

Secrets are resolved from env vars at runtime, NOT stored in config:
| Variable | Required for | Purpose |
|---|---|---|
| `CLAAS_TINKER_API_KEY` | Tinker mode | Tinker SDK authentication |
| `OPENCLAW_GATEWAY_TOKEN` | When `openclaw_url` is set | Auth token for OpenClaw gateway |
```bash
uv sync --extra tinker --extra dev
```

```bash
CLAAS_TINKER_API_KEY="tml-..." \
uv run python -m claas.api --config-name tinker
```

```bash
CLAAS_TINKER_API_KEY="tml-..." \
OPENCLAW_GATEWAY_TOKEN="openclaw-local-dev-token" \
uv run python -m claas.eval 'preferences=[concise]' num_steps=20
```

Tinker model naming: Tinker uses its own model identifiers that differ from HuggingFace names. For example, the HuggingFace model `Qwen/Qwen3-Coder-30B-A3B-Instruct` is `Qwen/Qwen3-30B-A3B` in Tinker. Sampling will work with either name, but LoRA training init will reject the HuggingFace name with a 400 error. Always use the Tinker name in `base_model`.
API entry point: Run the API via Hydra (`python -m claas.api --config-name ...`) instead of loading `claas.api:web_app` directly.
Collapse metric is slow: The collapse metric generates multiple stochastic samples per step. It only runs at steps listed in `collapse_steps` (default `[0, 5, 10, 15, 19]`) to limit overhead.
Select metrics with the `metrics` list in config or via CLI override.
| Metric | What it measures |
|---|---|
| `logprob` | Logprob margin between preferred/dispreferred response pairs. Positive margin = model favours the preferred response. Delta from baseline tracks training progress. |
| `compliance` | Generates responses to probe prompts, runs a programmatic verifier (e.g. emoji count, sentence count, keyword presence), and averages the pass rate. |
| `general` | Coding task (fibonacci, exec + verify) + 3 IFEval-style instruction-following probes. Measures capability retention during training. |
| `collapse` | Three collapse detectors: token entropy (distribution confidence), self-ROUGE-L (output diversity across stochastic samples), and logprob drift (mean logprob shift from baseline). |
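The self-ROUGE-L detector can be sketched as plain word-level LCS overlap averaged over sample pairs. This is an illustrative implementation, not the harness's actual code; tokenisation and aggregation details are assumptions:

```python
from itertools import combinations


def lcs_len(a: list[str], b: list[str]) -> int:
    # Longest common subsequence length via standard DP.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]


def rouge_l_f1(x: str, y: str) -> float:
    # ROUGE-L F1 over whitespace tokens.
    a, b = x.split(), y.split()
    lcs = lcs_len(a, b) if a and b else 0
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(b), lcs / len(a)
    return 2 * precision * recall / (precision + recall)


def self_rouge_l(samples: list[str]) -> float:
    # Mean pairwise ROUGE-L across stochastic samples from the same prompt.
    # Values near 1.0 mean near-identical outputs -- a collapse signal;
    # lower values indicate retained output diversity.
    pairs = list(combinations(samples, 2))
    return sum(rouge_l_f1(x, y) for x, y in pairs) / len(pairs)
```

A collapsed model that always emits the same completion would score 1.0 here, while a healthy sampler with varied phrasings scores much lower.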
Each preference is defined in a standalone YAML file under `configs/preference/`. To add a new preference:

- Create `configs/preference/my_pref.yaml` with `name`, `feedback_string`, `verifier` (`_target_` pointing to a class in `claas.eval.metrics.verifiers`), `logprob_pairs`, and `probe_prompts`
- Add a verifier class to `metrics/verifiers.py` (must implement `__call__(self, response: str) -> VerifierResult`)
- Run: `uv run python -m claas.eval 'preferences=[my_pref]'`
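A minimal sketch of the verifier contract described above. `MaxWordsVerifier` is a hypothetical example, and this `VerifierResult` shape is an assumption -- the real dataclass in `claas.eval` may carry different fields:

```python
from dataclasses import dataclass


@dataclass
class VerifierResult:
    # Assumed shape; the actual VerifierResult in claas.eval may differ.
    score: float  # 0.0 (fail) to 1.0 (pass)
    detail: str   # human-readable explanation for transcripts


class MaxWordsVerifier:
    """Hypothetical verifier: passes when the response stays under a word budget."""

    def __init__(self, max_words: int = 50):
        self.max_words = max_words

    def __call__(self, response: str) -> VerifierResult:
        n = len(response.split())
        return VerifierResult(
            score=1.0 if n <= self.max_words else 0.0,
            detail=f"{n} words (limit {self.max_words})",
        )
```

Constructor arguments like `max_words` would then be set alongside `_target_` in the preference YAML, which is the usual Hydra instantiation pattern.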
| Verifier class | Preference | Pass condition |
|---|---|---|
| `NoEmojiVerifier` | `no_emoji` | Zero emoji characters in response |
| `ConciseVerifier` | `concise` | <= 3 sentences (linear decay to 0.0 at 9+) |
| `IdentityVerifier` | `identity` | "kuro" appears in response (case-insensitive) |
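The `ConciseVerifier` decay reads as a piecewise-linear score. A sketch of just the scoring curve, assuming the documented breakpoints (full credit at 3 sentences or fewer, zero at 9 or more):

```python
def concise_score(num_sentences: int) -> float:
    # Full credit up to 3 sentences, linear decay to 0.0 at 9+.
    if num_sentences <= 3:
        return 1.0
    if num_sentences >= 9:
        return 0.0
    return (9 - num_sentences) / 6  # e.g. 6 sentences -> 0.5
```

Sentence counting itself (splitting on terminal punctuation, handling abbreviations) is left to the real verifier.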
```
data/evals/<run-id>/
├── summary.json          # Per-preference pass/fail verdicts
└── <preference>/
    ├── metadata.json     # Run config + LoRA ID
    ├── baseline.json     # Pre-training metric snapshot
    └── steps.jsonl       # One JSON object per feedback step
```
Each line in `steps.jsonl` contains: step number, timestamp, feedback given, SDPO training metrics, eval metrics (logprob margin, compliance, general capability, collapse), and rollout transcripts.
Results can be viewed in the browser at `GET /v1/eval?results_dir=./data/evals`.