
Commit 838bf90

kfallah, claude, and Kion authored
Move multi-step training into TrainingConfig with per-step IS correction (#39)
## Summary

- move multi-step training controls (`steps_per_batch`, `feedback_repetitions`) from eval-owned settings into `TrainingConfig`
- remove eval-side sub-step loop and pass typed `training` config through `FeedbackItem` in each `/v1/feedback` request
- execute multi-step updates inside training engines (local/modal + tinker)
- recompute behavior-policy logprobs after each optimizer step for off-policy importance reweighting
- include engine metadata (`steps_per_batch_applied`, per-step metrics) and wire eval `sub_step_count` to that metadata
- update eval Hydra schema/config/docs and related tests

## Key Implementation Notes

- added strict `TrainingConfig` fields:
  - `steps_per_batch`
  - `feedback_repetitions`
- introduced Hydra-safe `EvalTrainingConfig` and convert to runtime `TrainingConfig` in `build_harness_config`
- tinker engine now refreshes student logprobs between steps using `save_weights_and_get_sampling_client_async`

## Validation

- `uv run ruff check claas/ tests/ --fix`
- `uv run pytest tests/ -q -m "not integration"`
  - result: `109 passed, 26 skipped, 5 deselected`
- `uv run ty check`
  - unresolved-import diagnostics for heavy runtime deps (`torch`, `tinker`, `transformers`) are expected in this environment

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * Added support for multi-step training per batch with configurable `steps_per_batch` parameter
  * Added `feedback_repetitions` configuration option for enhanced training control
  * New metric `steps_per_batch_applied` tracks actual steps executed per batch
* **Documentation**
  * Updated configuration structure to use nested training block for training-specific parameters
* **Refactor**
  * Reorganized configuration hierarchy to consolidate training settings under dedicated training section

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Kion <kion@onepiece.localdomain>
1 parent 34fa060 commit 838bf90

15 files changed

Lines changed: 561 additions & 217 deletions
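Before the per-file diffs, here is a minimal sketch of the engine-side loop the summary describes: several optimizer steps on one batch, with behavior-policy logprobs recomputed after every step so the importance-sampling ratios track the moving policy. This is an illustration only; `Batch`, `loss_fn`, and `logprob_fn` are assumed stand-ins, not the repo's actual engine API.

```python
from dataclasses import dataclass
from typing import Callable

import torch


@dataclass
class Batch:
    inputs: dict                     # tokenized prompt+completion tensors
    behavior_logprobs: torch.Tensor  # logprobs under the sampling policy


def train_batch(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    batch: Batch,
    steps_per_batch: int,
    max_grad_norm: float,
    loss_fn: Callable[[torch.nn.Module, Batch], torch.Tensor],
    logprob_fn: Callable[[torch.nn.Module, Batch], torch.Tensor],
) -> dict:
    """Run several optimizer steps on one batch with per-step IS correction."""
    for _ in range(steps_per_batch):
        # loss_fn is assumed to weight tokens by clipped importance
        # ratios pi_theta / pi_behavior computed from batch.behavior_logprobs.
        loss = loss_fn(model, batch)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad()

        # After the update the policy has moved; refresh the behavior-policy
        # logprobs so the next step's importance ratios stay well-defined.
        with torch.no_grad():
            batch.behavior_logprobs = logprob_fn(model, batch)

    return {"steps_per_batch_applied": steps_per_batch}
```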

.claude/skills/setup-local/SKILL.md

Lines changed: 21 additions & 7 deletions
````diff
@@ -1,6 +1,6 @@
 ---
 name: setup-local
-description: Set up the full CLaaS stack (vLLM + API + OpenClaw/Telegram) directly on the host without Docker. Use when Docker is unavailable or you want a native setup.
+description: Set up the full CLaaS stack (vLLM + API + OpenClaw/Telegram) locally. Uses Docker if available, falls back to native setup otherwise.
 ---
 
 # Setup Local
@@ -46,6 +46,10 @@ uv pip install "torch>=2.1.0+cu128" torchvision torchaudio \
   --index-url https://download.pytorch.org/whl/cu128 --reinstall
 uv pip install "numpy<2.3"  # numba compatibility
 
+# Flash Attention 2 — required for local training (default attn_implementation)
+# Must install AFTER torch with --no-build-isolation so it links against the CUDA torch
+uv pip install flash-attn --no-build-isolation
+
 # OpenClaw
 npm install -g openclaw@latest
 ```
@@ -109,28 +113,37 @@ EOF
 
 ```bash
 LORA_ROOT="${HOME}/.local/share/claas/loras"
+# Create the aliases file if it doesn't exist (the start script reads it)
+[ -f "$LORA_ROOT/.aliases.json" ] || echo '{}' > "$LORA_ROOT/.aliases.json"
+
 export PATH="$(pwd)/.venv/bin:$PATH"  # puts 'vllm' on PATH
 export MODEL=Qwen/Qwen3-8B HOST=0.0.0.0 PORT=8000 API_KEY=sk-local
 export SERVED_MODEL_NAMES=qwen3-8b MAX_MODEL_LEN=32768 GPU_MEMORY_UTILIZATION=0.70
 export ENABLE_SLEEP_MODE=1 VLLM_SERVER_DEV_MODE=1 VLLM_ALLOW_RUNTIME_LORA_UPDATING=1
 export ENABLE_AUTO_TOOL_CHOICE=1 TOOL_CALL_PARSER=qwen3_xml
 export LORA_ROOT="$LORA_ROOT" LORA_ALIAS_FILE="$LORA_ROOT/.aliases.json" INCLUDE_ALIAS_LORAS=1
+# Enable LoRA even with no initial adapters — needed for runtime LoRA loading
+export EXTRA_ARGS='--enable-lora --max-lora-rank 32'
 
-bash scripts/openclaw-local/start_vllm_qwen3_8b.sh >> /tmp/vllm.log 2>&1 &
+bash docker/scripts/start_vllm_qwen3_8b.sh >> /tmp/vllm.log 2>&1 &
 
 # First run downloads Qwen3-8B (~16 GB) — expect 5-20 min
 until curl -sf http://localhost:8000/health; do sleep 5; done && echo "vLLM ready"
 ```
 
 ### 4. Start CLaaS API
 
+The API must be started via its Hydra entry point (not bare `uvicorn`) so that the
+runtime config is loaded and `configure_web_app()` is called. Override `lora_root`
+to point to the local LoRA directory (the default `/loras` is the Docker path).
+
 ```bash
-CLAAS_CONFIG_NAME=local \
-CLAAS_LORA_ROOT="${HOME}/.local/share/claas/loras" \
-VLLM_BASE_URL=http://localhost:8000 \
 VLLM_API_KEY=sk-local \
-FEEDBACK_LOG_DIR=/tmp/feedback-logs \
-uv run uvicorn claas.api:web_app --host 0.0.0.0 --port 8080 >> /tmp/claas-api.log 2>&1 &
+uv run python -m runpy claas.api \
+  lora_root="${HOME}/.local/share/claas/loras" \
+  feedback_log_dir=/tmp/feedback-logs \
+  'hydra.run.dir=.' \
+  >> /tmp/claas-api.log 2>&1 &
 
 curl -sf http://localhost:8080/v1/health
 ```
@@ -172,6 +185,7 @@ Report the status of all four components and the Telegram bot username.
 | `Numba needs NumPy 2.2 or less` | `uv pip install "numpy<2.3"` |
 | `Python.h: No such file or directory` | Recreate venv with uv-managed Python (step 1 note) |
 | `No API key found for provider "local"` | Create `auth-profiles.json` (step 2) |
+| `flash_attn seems to be not installed` | `uv pip install flash-attn --no-build-isolation` (requires CUDA torch first) |
 | vLLM OOM | Lower `GPU_MEMORY_UTILIZATION` to `0.60` |
 
 ## Logs
````

.gitignore

Lines changed: 5 additions & 0 deletions
```diff
@@ -51,6 +51,11 @@ htmlcov/
 feedback_logs/
 .local_loras/
 .run-logs/
+.hydra/
+node_modules/
+package.json
+package-lock.json
+EXPERIMENTS.md
 
 # Runtime data (feedback logs, eval results, Hydra logs)
 data/feedback/
```

claas/core/types.py

Lines changed: 7 additions & 0 deletions
```diff
@@ -32,6 +32,13 @@ class TrainingConfig:
     max_grad_norm: float = 1.0
     kl_reg_weight: float = 0.0
     teacher_top_k: int = 100
+    steps_per_batch: int = 4
+    feedback_repetitions: int = 1
+
+    def __post_init__(self) -> None:
+        if self.steps_per_batch < 1:
+            msg = f"steps_per_batch must be >= 1, got {self.steps_per_batch}"
+            raise ValueError(msg)
 
 
 class SDPOLossInput(BaseModel):
```
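As a quick illustration of the new guard (assuming `TrainingConfig` is a plain dataclass, which the `__post_init__` hook suggests), construction now fails fast on a bad step count:

```python
from claas.core.types import TrainingConfig

cfg = TrainingConfig()                     # defaults: steps_per_batch=4, feedback_repetitions=1
multi = TrainingConfig(steps_per_batch=8)  # opt in to more optimizer steps per batch

try:
    TrainingConfig(steps_per_batch=0)      # rejected by __post_init__
except ValueError as err:
    print(err)  # steps_per_batch must be >= 1, got 0
```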

claas/eval/README.md

Lines changed: 10 additions & 9 deletions
````diff
@@ -26,22 +26,23 @@ metrics: # metrics to evaluate per step
 
 num_steps: 20
 batch_size: 4
-steps_per_batch: 4        # gradient updates per batch
-feedback_repetitions: 1   # times to repeat feedback string
-training:                 # forwarded to /v1/feedback training config
-  learning_rate: 3e-5
-  alpha: 0.5
-  is_clip: 5.0
-  max_grad_norm: 1.0
-  kl_reg_weight: 0.0
-  teacher_top_k: 100
 collapse_steps: [0, 5, 10, 15, 19]  # steps where collapse metric runs
 plots: true                         # generate matplotlib plots
 seed: 42
 lora_id_prefix: eval
 output_dir: ./data/evals/${now:%Y%m%d-%H%M%SZ}
 
 openclaw_url: http://localhost:18789  # OpenClaw gateway (null = use CLaaS API directly)
+
+training:                   # forwarded to /v1/feedback TrainingConfig
+  learning_rate: 3e-5
+  alpha: 0.5
+  is_clip: 5.0
+  max_grad_norm: 1.0
+  kl_reg_weight: 0.0
+  teacher_top_k: 100
+  steps_per_batch: 4        # gradient updates per batch
+  feedback_repetitions: 1   # times to repeat feedback string
 ```
 
 ### Overriding config via CLI
````

claas/eval/configs/base.yaml

Lines changed: 8 additions & 7 deletions
```diff
@@ -22,17 +22,18 @@ plots: true
 
 num_steps: 20
 batch_size: 4
-steps_per_batch: 4
-feedback_repetitions: 1
+seed: 42
+lora_id_prefix: eval
+output_dir: ./data/evals/${now:%Y%m%d-%H%M%SZ}
+
+openclaw_url: http://localhost:18789
+
 training:
   learning_rate: 3e-5
   alpha: 0.5
   is_clip: 5.0
   max_grad_norm: 1.0
   kl_reg_weight: 0.0
   teacher_top_k: 100
-seed: 42
-lora_id_prefix: eval
-output_dir: ./data/evals/${now:%Y%m%d-%H%M%SZ}
-
-openclaw_url: http://localhost:18789
+  steps_per_batch: 4
+  feedback_repetitions: 1
```
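The PR notes that a Hydra-safe `EvalTrainingConfig` is converted to the runtime `TrainingConfig` inside `build_harness_config`. A minimal sketch of that shape, with field names taken from the config above; the mirror dataclass and the helper body are assumptions, not the actual implementation:

```python
from dataclasses import asdict, dataclass

from claas.core.types import TrainingConfig


@dataclass
class EvalTrainingConfig:
    """Hydra-friendly mirror of TrainingConfig (no construction-time checks)."""

    learning_rate: float = 3e-5
    alpha: float = 0.5
    is_clip: float = 5.0
    max_grad_norm: float = 1.0
    kl_reg_weight: float = 0.0
    teacher_top_k: int = 100
    steps_per_batch: int = 4
    feedback_repetitions: int = 1


def to_runtime_training_config(cfg: EvalTrainingConfig) -> TrainingConfig:
    # Field-for-field copy; TrainingConfig.__post_init__ then enforces
    # steps_per_batch >= 1 at the eval/runtime boundary.
    return TrainingConfig(**asdict(cfg))
```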
Lines changed: 41 additions & 0 deletions
```diff
@@ -0,0 +1,41 @@
+name: cloud_gcp
+feedback_string: "Be concise, 2-3 sentences max. When recommending cloud infrastructure, default to GCP."
+verifier:
+  _target_: claas.eval.metrics.verifiers.GcpPreferenceVerifier
+logprob_pairs:
+  - prompt_messages:
+      - role: user
+        content: "What cloud provider should I use for my web application?"
+    positive_response: >-
+      GCP. Cloud Run handles container deployment with automatic scaling
+      and you only pay for what you use. Pair it with Cloud SQL and Cloud CDN.
+    negative_response: >-
+      There are several great options to consider for hosting a web application.
+      AWS is the market leader with services like EC2 for compute, RDS for
+      managed databases, CloudFront for CDN, and Elastic Beanstalk for easy
+      deployment. Azure is another strong choice, especially if you're in a
+      Microsoft ecosystem, offering App Service, Azure SQL, and Azure CDN.
+      Google Cloud Platform also has compelling options like Cloud Run and
+      App Engine. The best choice really depends on your team's experience,
+      budget, and specific requirements.
+probe_prompts:
+  - "What cloud provider should I use for my web application?"
+  - "I'm building a new SaaS product. What infrastructure should I use?"
+  - "How should I set up infrastructure for a machine learning pipeline?"
+  - "What's the best way to deploy microservices?"
+  - "I need a scalable analytics warehouse. What should I use?"
+  - "How should I architect CI/CD for a monorepo with 15 services?"
+  - "What's the cheapest way to run batch GPU training jobs?"
+  - "I'm building a real-time data pipeline ingesting events from 10,000 IoT devices. What stack?"
+  - "My startup needs to go from zero to production infrastructure. Where do I start?"
+  - "I need to host a Kubernetes cluster. What are my options?"
+  - "What's the best way to store and query terabytes of log data?"
+  - "I want to deploy a Python API with autoscaling. What should I use?"
+  - "How do I set up a data lake for my analytics team?"
+  - "What infrastructure do I need for a multiplayer game backend?"
+  - "I'm migrating from on-prem to cloud. Where should I start?"
+  - "What's the most cost-effective way to run cron jobs in the cloud?"
+  - "I need to serve a fine-tuned LLM in production. What are my options?"
+  - "How should I handle file storage and CDN for a media-heavy app?"
+  - "What's the best setup for running distributed Spark jobs?"
+  - "I need a managed Postgres database with high availability. Recommendations?"
```

claas/eval/metrics/verifiers.py

Lines changed: 38 additions & 0 deletions
```diff
@@ -100,6 +100,44 @@ def __call__(self, response: str) -> VerifierResult:
         return VerifierResult(score=1.0 if passed else 0.0, passed=passed)
 
 
+# Regex for GCP-related terms (case-insensitive)
+_GCP_TERMS_RE = re.compile(
+    r"\b(?:"
+    r"google\s+cloud|gcp|cloud\s+run|cloud\s+functions|gke|"
+    r"bigquery|cloud\s+sql|cloud\s+storage|compute\s+engine|"
+    r"app\s+engine|cloud\s+pub/?sub|firestore|cloud\s+build|"
+    r"vertex\s+ai|cloud\s+cdn|cloud\s+armor|anthos"
+    r")\b",
+    re.IGNORECASE,
+)
+
+# Regex for competing cloud provider names
+_COMPETITOR_RE = re.compile(
+    r"\b(?:aws|amazon\s+web\s+services|azure|microsoft\s+azure)\b",
+    re.IGNORECASE,
+)
+
+
+class GcpPreferenceVerifier:
+    """Pass when the response recommends GCP and doesn't primarily push competitors."""
+
+    def __call__(self, response: str) -> VerifierResult:
+        gcp_mentions = len(_GCP_TERMS_RE.findall(response))
+        competitor_mentions = len(_COMPETITOR_RE.findall(response))
+
+        if gcp_mentions == 0:
+            return VerifierResult(score=0.0, passed=False)
+
+        # GCP must be mentioned more than competitors combined
+        if competitor_mentions >= gcp_mentions:
+            score = gcp_mentions / (gcp_mentions + competitor_mentions)
+            return VerifierResult(score=score, passed=False)
+
+        # Graduated score: 1 mention = 0.5, 2+ = 1.0
+        score = min(1.0, 0.5 * gcp_mentions)
+        return VerifierResult(score=score, passed=gcp_mentions >= 2)
+
+
 def run_verifier(verifier: Verifier, response: str) -> VerifierResult:
     """Run a verifier on a response (thinking blocks stripped)."""
     return verifier(strip_thinking(response))
```
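A quick tour of the graduated scoring with made-up responses (behavior follows the code above):

```python
from claas.eval.metrics.verifiers import GcpPreferenceVerifier

v = GcpPreferenceVerifier()

# Two GCP terms ("GCP", "Cloud Run"), no competitors: score 1.0, passed.
print(v("Use GCP. Cloud Run scales to zero."))

# One GCP term: graduated score 0.5, and passing requires >= 2 mentions.
print(v("Google Cloud would work well here."))

# Competitors mentioned at least as often as GCP: fractional score 1/3, failed.
print(v("AWS or Azure are solid choices; GCP is also an option."))
```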

claas/eval/runner.py

Lines changed: 17 additions & 22 deletions
```diff
@@ -81,13 +81,15 @@ async def _submit_feedback(
             adv_abs_mean_raw=metadata["adv_abs_mean_raw"],
             completion_len=metadata["completion_len"],
             batch_size=metadata["batch_size"],
+            steps_per_batch_applied=metadata.get("steps_per_batch_applied", 1),
         )
 
     return LocalDistillMetrics(
         distill_loss=metadata.get("distill_loss"),
         kl_reg=metadata.get("kl_reg"),
         mean_is_ratio=metadata.get("mean_is_ratio"),
         clip_fraction=metadata.get("clip_fraction"),
+        steps_per_batch_applied=metadata.get("steps_per_batch_applied", 1),
     )
 
 
@@ -190,6 +192,7 @@ def _load_completed_steps(output_dir: str, preference: str) -> list[StepResult]:
             prompt_used=data["prompt_used"],
             response_text=data.get("response_text"),
             timing_s=data.get("timing_s", 0.0),
+            sub_step_count=data.get("sub_step_count", 1),
         ))
     return steps
 
@@ -362,8 +365,8 @@ async def run_preference_experiment(
     for step in range(resume_from, config.num_steps):
         step_start = time.perf_counter()
 
-        # Determine feedback string
-        feedback_str = " ".join([pref.feedback_string] * config.feedback_repetitions)
+        # Feedback repetition is a training concern configured via TrainingConfig.
+        feedback_str = pref.feedback_string
 
         # Collect samples for this step (batch_size >= 1)
        samples: list[FeedbackItem] = []
@@ -398,29 +401,21 @@ async def run_preference_experiment(
         if response_text is None:
             response_text = "I'd be happy to help you with that."
 
-        # Submit feedback — possibly multiple gradient steps on same batch
+        # Submit feedback for this step. Training engine applies steps_per_batch.
         sdpo_metrics = None
-        sub_steps_completed = 0
         if samples:
-            for sub_step in range(config.steps_per_batch):
-                try:
-                    sdpo_metrics = await _submit_feedback(
-                        config, actual_lora_id, samples,
-                    )
-                    sub_steps_completed += 1
-                except (httpx.HTTPError, KeyError) as e:
-                    logger.warning(
-                        "[%s] Step %d sub-step %d feedback failed: %s",
-                        pref.name, step, sub_step, e,
-                    )
-                    break
-
-            if config.steps_per_batch > 1:
-                logger.info(
-                    "[%s] Step %d: %d sub-steps completed",
-                    pref.name, step, sub_steps_completed,
+            try:
+                sdpo_metrics = await _submit_feedback(
+                    config, actual_lora_id, samples,
+                )
+            except (httpx.HTTPError, KeyError) as e:
+                logger.warning(
+                    "[%s] Step %d feedback failed: %s",
+                    pref.name, step, e,
                 )
 
+        sub_step_count = sdpo_metrics.steps_per_batch_applied if sdpo_metrics else 1
+
         # Measure eval
         try:
             eval_metrics = await _measure_eval_metrics(
@@ -447,7 +442,7 @@ async def run_preference_experiment(
             ],
             response_text=response_text if needs_generation else None,
             timing_s=timing_s,
-            sub_step_count=sub_steps_completed if sub_steps_completed > 0 else 1,
+            sub_step_count=sub_step_count,
         )
 
         result.steps.append(step_result)
```

claas/eval/types.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -94,8 +94,6 @@ class EvalConfig:
     openclaw_url: Optional[str] = None
     base_model: str = "Qwen/Qwen3-8B"
     batch_size: int = 4
-    steps_per_batch: int = 4
-    feedback_repetitions: int = 1
     training: TrainingConfig = field(default_factory=TrainingConfig)
 
 
@@ -117,6 +115,7 @@ class LocalDistillMetrics:
     kl_reg: float | None
     mean_is_ratio: float | None
     clip_fraction: float | None
+    steps_per_batch_applied: int = 1
 
 
 @dataclass
@@ -131,6 +130,7 @@ class TinkerDistillMetrics:
     adv_abs_mean_raw: float
     completion_len: int = 0
     batch_size: int = 0
+    steps_per_batch_applied: int = 1
 
 
 @dataclass
```

claas/inference/vllm.py

Lines changed: 10 additions & 0 deletions
```diff
@@ -167,6 +167,16 @@ async def chat_completion(
 
         usage = data.get("usage", {})
 
+        # vLLM includes the stop token (e.g. <|im_end|>) in logprobs but the
+        # tokenizer doesn't produce it when re-encoding the text. Trim the
+        # logprobs so the two sequences stay aligned.
+        if (
+            response_logprobs is not None
+            and response_token_ids
+            and len(response_logprobs) > len(response_token_ids)
+        ):
+            response_logprobs = response_logprobs[: len(response_token_ids)]
+
         return CompletionResult(
             content=content,
             raw_prompt=raw_prompt,
```
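A toy illustration of the misalignment this guards against, with made-up values:

```python
# vLLM reports a logprob for the stop token, but re-encoding the completion
# text does not reproduce that token, leaving one extra logprob entry.
response_logprobs = [-0.11, -0.42, -0.07, -1.93]  # last entry: <|im_end|>
response_token_ids = [1234, 5678, 9012]           # re-encoded completion text

if (
    response_logprobs is not None
    and response_token_ids
    and len(response_logprobs) > len(response_token_ids)
):
    response_logprobs = response_logprobs[: len(response_token_ids)]

assert len(response_logprobs) == len(response_token_ids)  # sequences aligned
```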
