Automated evaluation for SDPO continual learning. Runs feedback loops against a live CLaaS stack and measures whether training shifts the model toward preferred behaviours without collapsing.
The eval harness uses Hydra for configuration. The default config lives in `configs/base.yaml` and can be overridden via CLI arguments using Hydra's `key=value` syntax.
```yaml
mode: tinker                          # execution backend: local | tinker | modal
claas_url: http://localhost:8080      # CLaaS API endpoint
base_model: Qwen/Qwen3-30B-A3B        # base model for LoRA init (Tinker name)
preferences:                          # preferences to train
  - no_emoji
  - concise
  - identity
metrics:                              # metrics to evaluate per step
  - logprob
  - compliance
  - general
  - collapse
num_steps: 20
batch_size: 4
collapse_steps: [0, 5, 10, 15, 19]    # steps where collapse metric runs
plots: true                           # generate matplotlib plots
seed: 42
lora_id_prefix: eval
output_dir: ./data/evals/${now:%Y%m%d-%H%M%SZ}
openclaw_url: http://localhost:18789  # OpenClaw gateway (null = use CLaaS API directly)
training:                             # forwarded to /v1/feedback TrainingConfig
  learning_rate: 3e-5
  alpha: 0.5
  is_clip: 5.0
  max_grad_norm: 1.0
  kl_reg_weight: 0.0
  teacher_top_k: 100
  steps_per_batch: 4                  # gradient updates per batch
  feedback_repetitions: 1             # times to repeat feedback string
```

Hydra overrides are positional arguments after `uv run python -m claas.eval`:
```bash
# Run only conciseness for 10 steps
uv run python -m claas.eval 'preferences=[concise]' num_steps=10

# Override base model and mode
uv run python -m claas.eval base_model=Qwen/Qwen3-30B-A3B mode=tinker

# Override training hyperparameters
uv run python -m claas.eval training.is_clip=7.0 training.learning_rate=1e-4

# Use a custom config directory
uv run python -m claas.eval --config-dir ./my_configs --config-name my_config
```

The harness can also be invoked programmatically:

```python
from claas.eval.runner import run_harness
from claas.eval.types import EvalConfig
import asyncio

config = EvalConfig(
    preferences=["concise"],
    num_steps=5,
    output_dir="./data/evals/manual-run",  # explicit when bypassing Hydra CLI
)
asyncio.run(run_harness(config))
```

Secrets are resolved from env vars at runtime, NOT stored in config:
| Variable | Required for | Purpose |
|---|---|---|
| `CLAAS_TINKER_API_KEY` | Tinker mode | Tinker SDK authentication |
| `OPENCLAW_GATEWAY_TOKEN` | When `openclaw_url` is set | Auth token for OpenClaw gateway |
```bash
uv sync --extra tinker --extra dev
```

```bash
CLAAS_TINKER_API_KEY="tml-..." \
uv run python -m claas.api --config-name tinker
```

```bash
CLAAS_TINKER_API_KEY="tml-..." \
OPENCLAW_GATEWAY_TOKEN="openclaw-local-dev-token" \
uv run python -m claas.eval 'preferences=[concise]' num_steps=20
```

Tinker model naming: Tinker uses its own model identifiers that differ from HuggingFace names. For example, the HuggingFace model `Qwen/Qwen3-Coder-30B-A3B-Instruct` is `Qwen/Qwen3-30B-A3B` in Tinker. Sampling will work with either name, but LoRA training init will reject the HuggingFace name with a 400 error. Always use the Tinker name in `base_model`.
API entry point: Run the API via Hydra (`python -m claas.api --config-name ...`) instead of loading `claas.api:web_app` directly.
Collapse metric is slow: The collapse metric generates multiple stochastic samples per step. It only runs at steps listed in `collapse_steps` (default `[0, 5, 10, 15, 19]`) to limit overhead.
Select metrics with the `metrics` list in config or via CLI override.
| Metric | What it measures |
|---|---|
| `logprob` | Logprob margin between preferred/dispreferred response pairs. Positive margin = model favours the preferred response. Delta from baseline tracks training progress. |
| `compliance` | Generates responses to probe prompts, runs a programmatic verifier (e.g. emoji count, sentence count, keyword presence), and averages the pass rate. |
| `general` | Coding task (fibonacci, exec + verify) + 3 IFEval-style instruction-following probes. Measures capability retention during training. |
| `collapse` | Three collapse detectors: token entropy (distribution confidence), self-ROUGE-L (output diversity across stochastic samples), and logprob drift (mean logprob shift from baseline). |
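The self-ROUGE-L detector can be sketched as plain word-level LCS overlap averaged over sample pairs. This is an illustrative implementation, not the harness's actual code; tokenisation and aggregation details are assumptions:

```python
from itertools import combinations


def lcs_len(a: list[str], b: list[str]) -> int:
    # Longest common subsequence length via standard DP.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]


def rouge_l_f1(x: str, y: str) -> float:
    # ROUGE-L F1 over whitespace tokens.
    a, b = x.split(), y.split()
    lcs = lcs_len(a, b) if a and b else 0
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(b), lcs / len(a)
    return 2 * precision * recall / (precision + recall)


def self_rouge_l(samples: list[str]) -> float:
    # Mean pairwise ROUGE-L across stochastic samples from the same prompt.
    # Values near 1.0 mean near-identical outputs -- a collapse signal;
    # lower values indicate retained output diversity.
    pairs = list(combinations(samples, 2))
    return sum(rouge_l_f1(x, y) for x, y in pairs) / len(pairs)
```

A collapsed model that always emits the same completion would score 1.0 here, while a healthy sampler with varied phrasings scores much lower.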
Each preference is defined in a standalone YAML file under `configs/preference/`. To add a new preference:

- Create `configs/preference/my_pref.yaml` with `name`, `feedback_string`, `verifier` (`_target_` pointing to a class in `claas.eval.metrics.verifiers`), `logprob_pairs`, and `probe_prompts`
- Add a verifier class to `metrics/verifiers.py` (must implement `__call__(self, response: str) -> VerifierResult`)
- Run: `uv run python -m claas.eval 'preferences=[my_pref]'`
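A minimal sketch of the verifier contract described above. `MaxWordsVerifier` is a hypothetical example, and this `VerifierResult` shape is an assumption -- the real dataclass in `claas.eval` may carry different fields:

```python
from dataclasses import dataclass


@dataclass
class VerifierResult:
    # Assumed shape; the actual VerifierResult in claas.eval may differ.
    score: float  # 0.0 (fail) to 1.0 (pass)
    detail: str   # human-readable explanation for transcripts


class MaxWordsVerifier:
    """Hypothetical verifier: passes when the response stays under a word budget."""

    def __init__(self, max_words: int = 50):
        self.max_words = max_words

    def __call__(self, response: str) -> VerifierResult:
        n = len(response.split())
        return VerifierResult(
            score=1.0 if n <= self.max_words else 0.0,
            detail=f"{n} words (limit {self.max_words})",
        )
```

Constructor arguments like `max_words` would then be set alongside `_target_` in the preference YAML, which is the usual Hydra instantiation pattern.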
| Verifier class | Preference | Pass condition |
|---|---|---|
| `NoEmojiVerifier` | `no_emoji` | Zero emoji characters in response |
| `ConciseVerifier` | `concise` | <= 3 sentences (linear decay to 0.0 at 9+) |
| `IdentityVerifier` | `identity` | "kuro" appears in response (case-insensitive) |
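The `ConciseVerifier` decay reads as a piecewise-linear score. A sketch of just the scoring curve, assuming the documented breakpoints (full credit at 3 sentences or fewer, zero at 9 or more):

```python
def concise_score(num_sentences: int) -> float:
    # Full credit up to 3 sentences, linear decay to 0.0 at 9+.
    if num_sentences <= 3:
        return 1.0
    if num_sentences >= 9:
        return 0.0
    return (9 - num_sentences) / 6  # e.g. 6 sentences -> 0.5
```

Sentence counting itself (splitting on terminal punctuation, handling abbreviations) is left to the real verifier.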
```
data/evals/<run-id>/
├── summary.json          # Per-preference pass/fail verdicts
└── <preference>/
    ├── metadata.json     # Run config + LoRA ID
    ├── baseline.json     # Pre-training metric snapshot
    └── steps.jsonl       # One JSON object per feedback step
```
Each line in `steps.jsonl` contains: step number, timestamp, feedback given, SDPO training metrics, eval metrics (logprob margin, compliance, general capability, collapse), and rollout transcripts.
Results can be viewed in the browser at `GET /v1/eval?results_dir=./data/evals`.