
# CLaaS Eval Harness

Automated evaluation for SDPO continual learning. Runs feedback loops against a live CLaaS stack and measures whether training shifts the model toward preferred behaviours without collapsing.

## Configuration (Hydra)

The eval harness uses Hydra for configuration. The default config lives in `configs/base.yaml` and can be overridden via CLI arguments using Hydra's `key=value` syntax.

Config file: `configs/base.yaml`

```yaml
mode: tinker                          # execution backend: local | tinker | modal
claas_url: http://localhost:8080      # CLaaS API endpoint
base_model: Qwen/Qwen3-30B-A3B        # base model for LoRA init (Tinker name)

preferences:                          # preferences to train
  - no_emoji
  - concise
  - identity

metrics:                              # metrics to evaluate per step
  - logprob
  - compliance
  - general
  - collapse

num_steps: 20
batch_size: 4
collapse_steps: [0, 5, 10, 15, 19]    # steps where the collapse metric runs
plots: true                           # generate matplotlib plots
seed: 42
lora_id_prefix: eval
output_dir: ./data/evals/${now:%Y%m%d-%H%M%SZ}

openclaw_url: http://localhost:18789  # OpenClaw gateway (null = use CLaaS API directly)

training:                             # forwarded to /v1/feedback TrainingConfig
  learning_rate: 3e-5
  alpha: 0.5
  is_clip: 5.0
  max_grad_norm: 1.0
  kl_reg_weight: 0.0
  teacher_top_k: 100
  steps_per_batch: 4                  # gradient updates per batch
  feedback_repetitions: 1             # times to repeat the feedback string
```

### Overriding config via CLI

Hydra overrides are positional arguments after `uv run python -m claas.eval`:

```bash
# Run only the conciseness preference for 10 steps
uv run python -m claas.eval 'preferences=[concise]' num_steps=10

# Override the base model and execution mode
uv run python -m claas.eval base_model=Qwen/Qwen3-30B-A3B mode=tinker

# Override training hyperparameters
uv run python -m claas.eval training.is_clip=7.0 training.learning_rate=1e-4

# Use a custom config directory
uv run python -m claas.eval --config-dir ./my_configs --config-name my_config
```

## Programmatic usage

```python
import asyncio

from claas.eval.runner import run_harness
from claas.eval.types import EvalConfig

config = EvalConfig(
    preferences=["concise"],
    num_steps=5,
    output_dir="./data/evals/manual-run",  # explicit when bypassing the Hydra CLI
)
asyncio.run(run_harness(config))
```

## Environment variables (secrets)

Secrets are resolved from environment variables at runtime and are never stored in config files:

| Variable | Required for | Purpose |
| --- | --- | --- |
| `CLAAS_TINKER_API_KEY` | Tinker mode | Tinker SDK authentication |
| `OPENCLAW_GATEWAY_TOKEN` | When `openclaw_url` is set | Auth token for the OpenClaw gateway |
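In code, the resolution pattern looks roughly like this (a sketch only; `resolve_secret` is an illustrative helper, not part of the harness's actual API):

```python
import os
from typing import Optional

def resolve_secret(name: str, required: bool = False) -> Optional[str]:
    """Read a secret from the environment at runtime; never persist it in config."""
    value = os.environ.get(name)
    if required and not value:
        raise RuntimeError(f"{name} must be set in the environment")
    return value

# Variable names from the table above; required=False here so this sketch
# runs even when the variable is unset.
tinker_key = resolve_secret("CLAAS_TINKER_API_KEY", required=False)
```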

## Running (Tinker mode, no GPU)

1. Install dependencies:

   ```bash
   uv sync --extra tinker --extra dev
   ```

2. Start the CLaaS API:

   ```bash
   CLAAS_TINKER_API_KEY="tml-..." \
     uv run python -m claas.api --config-name tinker
   ```

3. Run the eval:

   ```bash
   CLAAS_TINKER_API_KEY="tml-..." \
   OPENCLAW_GATEWAY_TOKEN="openclaw-local-dev-token" \
     uv run python -m claas.eval 'preferences=[concise]' num_steps=20
   ```

## Known gotchas

**Tinker model naming:** Tinker uses its own model identifiers that differ from HuggingFace names. For example, the HuggingFace model `Qwen/Qwen3-Coder-30B-A3B-Instruct` is `Qwen/Qwen3-30B-A3B` in Tinker. Sampling works with either name, but LoRA training init rejects the HuggingFace name with a 400 error. Always use the Tinker name in `base_model`.

**API entry point:** Run the API via Hydra (`python -m claas.api --config-name ...`) instead of loading `claas.api:web_app` directly.

**Collapse metric is slow:** The collapse metric generates multiple stochastic samples per step. It only runs at the steps listed in `collapse_steps` (default `[0, 5, 10, 15, 19]`) to limit overhead.
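The gating implied by `collapse_steps` can be sketched as follows (illustrative only; `metrics_for_step` is not the harness's actual function):

```python
# Steps at which the expensive collapse metric is scheduled (default from config).
COLLAPSE_STEPS = {0, 5, 10, 15, 19}

def metrics_for_step(step: int, metrics: list) -> list:
    """Drop the slow collapse metric except at scheduled steps."""
    return [m for m in metrics if m != "collapse" or step in COLLAPSE_STEPS]
```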

## Metrics

Select metrics with the `metrics` list in config or via override.

| Metric | What it measures |
| --- | --- |
| `logprob` | Logprob margin between preferred/dispreferred response pairs. A positive margin means the model favours the preferred response; the delta from baseline tracks training progress. |
| `compliance` | Generates responses to probe prompts, runs a programmatic verifier (e.g. emoji count, sentence count, keyword presence), and averages the pass rate. |
| `general` | A coding task (fibonacci, exec + verify) plus 3 IFEval-style instruction-following probes. Measures capability retention during training. |
| `collapse` | Three collapse detectors: token entropy (distribution confidence), self-ROUGE-L (output diversity across stochastic samples), and logprob drift (mean logprob shift from baseline). |
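As a concrete sketch of the `logprob` metric (simplified; the harness's real computation lives in its metrics code):

```python
def logprob_margin(preferred_logprobs, dispreferred_logprobs):
    """Mean logprob of the preferred responses minus mean logprob of the
    dispreferred ones. Positive means the model favours the preferred side."""
    def mean(xs):
        return sum(xs) / len(xs)
    return mean(preferred_logprobs) - mean(dispreferred_logprobs)

# Training progress is then the delta between this margin and its
# pre-training baseline value.
```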

## Preferences (YAML-based)

Each preference is defined in a standalone YAML file under `configs/preference/`. To add a new preference:

1. Create `configs/preference/my_pref.yaml` with `name`, `feedback_string`, `verifier` (`_target_` pointing to a class in `claas.eval.metrics.verifiers`), `logprob_pairs`, and `probe_prompts`.
2. Add a verifier class to `metrics/verifiers.py` (it must implement `__call__(self, response: str) -> VerifierResult`).
3. Run: `uv run python -m claas.eval 'preferences=[my_pref]'`
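A sketch of what such a file might look like (all field values below are illustrative and `MyPrefVerifier` is hypothetical; copy an existing file under `configs/preference/` for the exact schema):

```yaml
# configs/preference/my_pref.yaml (illustrative sketch)
name: my_pref
feedback_string: "Please always respond in lowercase."
verifier:
  _target_: claas.eval.metrics.verifiers.MyPrefVerifier  # hypothetical class
logprob_pairs:
  - prompt: "Say hello."
    preferred: "hello there"
    dispreferred: "HELLO THERE"
probe_prompts:
  - "Introduce yourself in one sentence."
```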

### Verifiers (used by `compliance`)

| Verifier class | Preference | Pass condition |
| --- | --- | --- |
| `NoEmojiVerifier` | `no_emoji` | Zero emoji characters in the response |
| `ConciseVerifier` | `concise` | ≤ 3 sentences (score decays linearly to 0.0 at 9+ sentences) |
| `IdentityVerifier` | `identity` | "kuro" appears in the response (case-insensitive) |
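A minimal sketch of the verifier contract (the `VerifierResult` fields shown are assumptions, and the emoji pattern is deliberately simplified; the real classes live in `metrics/verifiers.py`):

```python
import re
from dataclasses import dataclass

@dataclass
class VerifierResult:  # assumed shape; see the harness's own types for the real one
    passed: bool
    score: float
    detail: str = ""

class NoEmojiVerifier:
    """Passes only when the response contains zero emoji characters (sketch)."""
    # Covers the main emoji ranges; the real verifier may use a fuller pattern.
    EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

    def __call__(self, response: str) -> VerifierResult:
        count = len(self.EMOJI.findall(response))
        return VerifierResult(
            passed=count == 0,
            score=float(count == 0),
            detail=f"{count} emoji found",
        )
```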

## Output format

```
data/evals/<run-id>/
├── summary.json               # Per-preference pass/fail verdicts
└── <preference>/
    ├── metadata.json          # Run config + LoRA ID
    ├── baseline.json          # Pre-training metric snapshot
    └── steps.jsonl            # One JSON object per feedback step
```

Each line in `steps.jsonl` contains: step number, timestamp, feedback given, SDPO training metrics, eval metrics (logprob margin, compliance, general capability, collapse), and rollout transcripts.
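Consuming the log is straightforward; a sketch (the field names in the commented example are assumptions, not the guaranteed schema):

```python
import json

def load_steps(path):
    """Yield one dict per feedback step from a steps.jsonl file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# e.g. collect per-step records for plotting:
# steps = list(load_steps("data/evals/<run-id>/concise/steps.jsonl"))
```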

Results can be viewed in the browser at `GET /v1/eval?results_dir=./data/evals`.