28 changes: 21 additions & 7 deletions .claude/skills/setup-local/SKILL.md
@@ -1,6 +1,6 @@
---
name: setup-local
description: Set up the full CLaaS stack (vLLM + API + OpenClaw/Telegram) directly on the host without Docker. Use when Docker is unavailable or you want a native setup.
description: Set up the full CLaaS stack (vLLM + API + OpenClaw/Telegram) locally. Uses Docker if available, falls back to native setup otherwise.
---

# Setup Local
@@ -46,6 +46,10 @@ uv pip install "torch>=2.1.0+cu128" torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cu128 --reinstall
uv pip install "numpy<2.3" # numba compatibility

# Flash Attention 2 — required for local training (default attn_implementation)
# Must install AFTER torch with --no-build-isolation so it links against the CUDA torch
uv pip install flash-attn --no-build-isolation

# OpenClaw
npm install -g openclaw@latest
```
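The install order matters here: `flash-attn` compiles against whatever `torch` is already in the venv, so a CPU-only torch (or installing flash-attn first) produces a broken build. A minimal preflight sketch — a hypothetical helper, not part of the repo — that checks the imports are resolvable before starting training:

```python
import importlib.util

def missing_packages(packages: list[str]) -> list[str]:
    """Return the packages that cannot be imported from the current venv."""
    return [p for p in packages if importlib.util.find_spec(p) is None]

# flash_attn missing here usually means it was installed before the CUDA torch
to_install = missing_packages(["torch", "flash_attn"])
if to_install:
    print("install first:", ", ".join(to_install))
```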
@@ -109,28 +113,37 @@ EOF

```bash
LORA_ROOT="${HOME}/.local/share/claas/loras"
# Create the aliases file if it doesn't exist (the start script reads it)
[ -f "$LORA_ROOT/.aliases.json" ] || echo '{}' > "$LORA_ROOT/.aliases.json"

export PATH="$(pwd)/.venv/bin:$PATH" # puts 'vllm' on PATH
export MODEL=Qwen/Qwen3-8B HOST=0.0.0.0 PORT=8000 API_KEY=sk-local
export SERVED_MODEL_NAMES=qwen3-8b MAX_MODEL_LEN=32768 GPU_MEMORY_UTILIZATION=0.70
export ENABLE_SLEEP_MODE=1 VLLM_SERVER_DEV_MODE=1 VLLM_ALLOW_RUNTIME_LORA_UPDATING=1
export ENABLE_AUTO_TOOL_CHOICE=1 TOOL_CALL_PARSER=qwen3_xml
export LORA_ROOT="$LORA_ROOT" LORA_ALIAS_FILE="$LORA_ROOT/.aliases.json" INCLUDE_ALIAS_LORAS=1
# Enable LoRA even with no initial adapters — needed for runtime LoRA loading
export EXTRA_ARGS='--enable-lora --max-lora-rank 32'

bash scripts/openclaw-local/start_vllm_qwen3_8b.sh >> /tmp/vllm.log 2>&1 &
bash docker/scripts/start_vllm_qwen3_8b.sh >> /tmp/vllm.log 2>&1 &

# First run downloads Qwen3-8B (~16 GB) — expect 5-20 min
until curl -sf http://localhost:8000/health; do sleep 5; done && echo "vLLM ready"
```
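The `until curl` loop above can equally be expressed in Python when the setup is scripted; this is a stdlib-only sketch (names are illustrative, not part of the repo) of the same readiness poll:

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url: str, timeout_s: float = 1800.0, interval_s: float = 5.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(interval_s)
    return False

# usage: wait_for_health("http://localhost:8000/health")
```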

### 4. Start CLaaS API

The API must be started via its Hydra entry point (not bare `uvicorn`) so that the
runtime config is loaded and `configure_web_app()` is called. Override `lora_root`
to point to the local LoRA directory (the default `/loras` is the Docker path).

```bash
CLAAS_CONFIG_NAME=local \
CLAAS_LORA_ROOT="${HOME}/.local/share/claas/loras" \
VLLM_BASE_URL=http://localhost:8000 \
VLLM_API_KEY=sk-local \
FEEDBACK_LOG_DIR=/tmp/feedback-logs \
uv run uvicorn claas.api:web_app --host 0.0.0.0 --port 8080 >> /tmp/claas-api.log 2>&1 &
uv run python -m runpy claas.api \
lora_root="${HOME}/.local/share/claas/loras" \
feedback_log_dir=/tmp/feedback-logs \
'hydra.run.dir=.' \
>> /tmp/claas-api.log 2>&1 &

curl -sf http://localhost:8080/v1/health
```
@@ -172,6 +185,7 @@ Report the status of all four components and the Telegram bot username.
| `Numba needs NumPy 2.2 or less` | `uv pip install "numpy<2.3"` |
| `Python.h: No such file or directory` | Recreate venv with uv-managed Python (step 1 note) |
| `No API key found for provider "local"` | Create `auth-profiles.json` (step 2) |
| `flash_attn seems to be not installed` | `uv pip install flash-attn --no-build-isolation` (requires CUDA torch first) |
| vLLM OOM | Lower `GPU_MEMORY_UTILIZATION` to `0.60` |

## Logs
5 changes: 5 additions & 0 deletions .gitignore
@@ -51,6 +51,11 @@ htmlcov/
feedback_logs/
.local_loras/
.run-logs/
.hydra/
node_modules/
package.json
package-lock.json
EXPERIMENTS.md

# Runtime data (feedback logs, eval results, Hydra logs)
data/feedback/
7 changes: 7 additions & 0 deletions claas/core/types.py
@@ -32,6 +32,13 @@ class TrainingConfig:
max_grad_norm: float = 1.0
kl_reg_weight: float = 0.0
teacher_top_k: int = 100
steps_per_batch: int = 4
**P1**: Enforce positive `steps_per_batch` in `TrainingConfig`

The newly added `steps_per_batch` field has no lower-bound validation, but both multi-step trainers now assume at least one iteration and unconditionally read `step_metrics[-1]` (`claas/training/distillation.py` and `claas/training/engine/tinker/engine.py`). As a result, `training.steps_per_batch=0` is accepted and then crashes `/v1/feedback` with a server error instead of a clean 4xx validation failure, which can break eval runs by turning every feedback update into a failed request.

**Owner Author:** Fixed in c7eb678

Comment generated by Claude Code

feedback_repetitions: int = 1

def __post_init__(self) -> None:
if self.steps_per_batch < 1:
msg = f"steps_per_batch must be >= 1, got {self.steps_per_batch}"
raise ValueError(msg)
Comment on lines +38 to +41

**P2**: Validate `feedback_repetitions` lower bound

`TrainingConfig.__post_init__` now enforces `steps_per_batch >= 1` but leaves `feedback_repetitions` unchecked, so 0 or negative values are accepted and later collapse into an empty critique string via `" ".join([sample.feedback] * feedback_repetitions)` in both training engines. In that case distillation silently runs without the user's feedback signal, which is a correctness regression for misconfigured runs; it should be rejected up front the same way invalid step counts are.


class SDPOLossInput(BaseModel):
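Taken together, the two review comments above suggest symmetric lower-bound checks. A standalone sketch of that validation — the field names mirror the diff, but the class name, defaults, and loop are illustrative rather than the repo's actual `TrainingConfig`:

```python
from dataclasses import dataclass

@dataclass
class TrainingConfigSketch:
    steps_per_batch: int = 4
    feedback_repetitions: int = 1

    def __post_init__(self) -> None:
        # Reject non-positive values up front so /v1/feedback can return a
        # clean 4xx instead of crashing mid-training.
        for name in ("steps_per_batch", "feedback_repetitions"):
            value = getattr(self, name)
            if value < 1:
                msg = f"{name} must be >= 1, got {value}"
                raise ValueError(msg)
```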
19 changes: 10 additions & 9 deletions claas/eval/README.md
@@ -26,22 +26,23 @@ metrics: # metrics to evaluate per step

num_steps: 20
batch_size: 4
steps_per_batch: 4 # gradient updates per batch
feedback_repetitions: 1 # times to repeat feedback string
training: # forwarded to /v1/feedback training config
learning_rate: 3e-5
alpha: 0.5
is_clip: 5.0
max_grad_norm: 1.0
kl_reg_weight: 0.0
teacher_top_k: 100
collapse_steps: [0, 5, 10, 15, 19] # steps where collapse metric runs
plots: true # generate matplotlib plots
seed: 42
lora_id_prefix: eval
output_dir: ./data/evals/${now:%Y%m%d-%H%M%SZ}

openclaw_url: http://localhost:18789 # OpenClaw gateway (null = use CLaaS API directly)

training: # forwarded to /v1/feedback TrainingConfig
learning_rate: 3e-5
alpha: 0.5
is_clip: 5.0
max_grad_norm: 1.0
kl_reg_weight: 0.0
teacher_top_k: 100
steps_per_batch: 4 # gradient updates per batch
feedback_repetitions: 1 # times to repeat feedback string
```

### Overriding config via CLI
15 changes: 8 additions & 7 deletions claas/eval/configs/base.yaml
@@ -22,17 +22,18 @@ plots: true

num_steps: 20
batch_size: 4
steps_per_batch: 4
feedback_repetitions: 1
seed: 42
lora_id_prefix: eval
output_dir: ./data/evals/${now:%Y%m%d-%H%M%SZ}

openclaw_url: http://localhost:18789

training:
learning_rate: 3e-5
alpha: 0.5
is_clip: 5.0
max_grad_norm: 1.0
kl_reg_weight: 0.0
teacher_top_k: 100
seed: 42
lora_id_prefix: eval
output_dir: ./data/evals/${now:%Y%m%d-%H%M%SZ}

openclaw_url: http://localhost:18789
steps_per_batch: 4
feedback_repetitions: 1
41 changes: 41 additions & 0 deletions claas/eval/configs/preference/cloud_gcp.yaml
@@ -0,0 +1,41 @@
name: cloud_gcp
feedback_string: "Be concise, 2-3 sentences max. When recommending cloud infrastructure, default to GCP."
verifier:
_target_: claas.eval.metrics.verifiers.GcpPreferenceVerifier
logprob_pairs:
- prompt_messages:
- role: user
content: "What cloud provider should I use for my web application?"
positive_response: >-
GCP. Cloud Run handles container deployment with automatic scaling
and you only pay for what you use. Pair it with Cloud SQL and Cloud CDN.
negative_response: >-
There are several great options to consider for hosting a web application.
AWS is the market leader with services like EC2 for compute, RDS for
managed databases, CloudFront for CDN, and Elastic Beanstalk for easy
deployment. Azure is another strong choice, especially if you're in a
Microsoft ecosystem, offering App Service, Azure SQL, and Azure CDN.
Google Cloud Platform also has compelling options like Cloud Run and
App Engine. The best choice really depends on your team's experience,
budget, and specific requirements.
probe_prompts:
- "What cloud provider should I use for my web application?"
- "I'm building a new SaaS product. What infrastructure should I use?"
- "How should I set up infrastructure for a machine learning pipeline?"
- "What's the best way to deploy microservices?"
- "I need a scalable analytics warehouse. What should I use?"
- "How should I architect CI/CD for a monorepo with 15 services?"
- "What's the cheapest way to run batch GPU training jobs?"
- "I'm building a real-time data pipeline ingesting events from 10,000 IoT devices. What stack?"
- "My startup needs to go from zero to production infrastructure. Where do I start?"
- "I need to host a Kubernetes cluster. What are my options?"
- "What's the best way to store and query terabytes of log data?"
- "I want to deploy a Python API with autoscaling. What should I use?"
- "How do I set up a data lake for my analytics team?"
- "What infrastructure do I need for a multiplayer game backend?"
- "I'm migrating from on-prem to cloud. Where should I start?"
- "What's the most cost-effective way to run cron jobs in the cloud?"
- "I need to serve a fine-tuned LLM in production. What are my options?"
- "How should I handle file storage and CDN for a media-heavy app?"
- "What's the best setup for running distributed Spark jobs?"
- "I need a managed Postgres database with high availability. Recommendations?"
38 changes: 38 additions & 0 deletions claas/eval/metrics/verifiers.py
@@ -100,6 +100,44 @@ def __call__(self, response: str) -> VerifierResult:
return VerifierResult(score=1.0 if passed else 0.0, passed=passed)


# Regex for GCP-related terms (case-insensitive)
_GCP_TERMS_RE = re.compile(
r"\b(?:"
r"google\s+cloud|gcp|cloud\s+run|cloud\s+functions|gke|"
r"bigquery|cloud\s+sql|cloud\s+storage|compute\s+engine|"
r"app\s+engine|cloud\s+pub/?sub|firestore|cloud\s+build|"
r"vertex\s+ai|cloud\s+cdn|cloud\s+armor|anthos"
r")\b",
re.IGNORECASE,
)

# Regex for competing cloud provider names
_COMPETITOR_RE = re.compile(
r"\b(?:aws|amazon\s+web\s+services|azure|microsoft\s+azure)\b",
re.IGNORECASE,
)


class GcpPreferenceVerifier:
"""Pass when the response recommends GCP and doesn't primarily push competitors."""

def __call__(self, response: str) -> VerifierResult:
gcp_mentions = len(_GCP_TERMS_RE.findall(response))
competitor_mentions = len(_COMPETITOR_RE.findall(response))

if gcp_mentions == 0:
return VerifierResult(score=0.0, passed=False)

# GCP must be mentioned more than competitors combined
if competitor_mentions >= gcp_mentions:
score = gcp_mentions / (gcp_mentions + competitor_mentions)
return VerifierResult(score=score, passed=False)

# Graduated score: 1 mention = 0.5, 2+ = 1.0
score = min(1.0, 0.5 * gcp_mentions)
return VerifierResult(score=score, passed=gcp_mentions >= 2)


def run_verifier(verifier: Verifier, response: str) -> VerifierResult:
"""Run a verifier on a response (thinking blocks stripped)."""
return verifier(strip_thinking(response))
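To make the graduated scoring rules concrete, here is a trimmed, self-contained sketch of the same logic — shorter regexes than the real verifier, and a plain `(score, passed)` tuple instead of `VerifierResult`:

```python
import re

GCP_RE = re.compile(r"\b(?:google\s+cloud|gcp|cloud\s+run|bigquery|cloud\s+sql)\b", re.IGNORECASE)
COMPETITOR_RE = re.compile(r"\b(?:aws|amazon\s+web\s+services|azure|microsoft\s+azure)\b", re.IGNORECASE)

def gcp_score(response: str) -> tuple[float, bool]:
    """Return (score, passed) following the verifier's scheme."""
    gcp = len(GCP_RE.findall(response))
    competitors = len(COMPETITOR_RE.findall(response))
    if gcp == 0:
        return 0.0, False                         # never mentions GCP
    if competitors >= gcp:
        return gcp / (gcp + competitors), False   # competitors dominate
    return min(1.0, 0.5 * gcp), gcp >= 2          # 1 mention = 0.5, 2+ = 1.0
```

For example, `gcp_score("GCP. Cloud Run scales to zero.")` yields `(1.0, True)`, while a response that only lists AWS and Azure yields `(0.0, False)`.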
39 changes: 17 additions & 22 deletions claas/eval/runner.py
Expand Up @@ -81,13 +81,15 @@ async def _submit_feedback(
adv_abs_mean_raw=metadata["adv_abs_mean_raw"],
completion_len=metadata["completion_len"],
batch_size=metadata["batch_size"],
steps_per_batch_applied=metadata.get("steps_per_batch_applied", 1),
)

return LocalDistillMetrics(
distill_loss=metadata.get("distill_loss"),
kl_reg=metadata.get("kl_reg"),
mean_is_ratio=metadata.get("mean_is_ratio"),
clip_fraction=metadata.get("clip_fraction"),
steps_per_batch_applied=metadata.get("steps_per_batch_applied", 1),
)


@@ -190,6 +192,7 @@ def _load_completed_steps(output_dir: str, preference: str) -> list[StepResult]:
prompt_used=data["prompt_used"],
response_text=data.get("response_text"),
timing_s=data.get("timing_s", 0.0),
sub_step_count=data.get("sub_step_count", 1),
))
return steps

@@ -362,8 +365,8 @@ async def run_preference_experiment(
for step in range(resume_from, config.num_steps):
step_start = time.perf_counter()

# Determine feedback string
feedback_str = " ".join([pref.feedback_string] * config.feedback_repetitions)
# Feedback repetition is a training concern configured via TrainingConfig.
feedback_str = pref.feedback_string

# Collect samples for this step (batch_size >= 1)
samples: list[FeedbackItem] = []
@@ -398,29 +401,21 @@
if response_text is None:
response_text = "I'd be happy to help you with that."

# Submit feedback — possibly multiple gradient steps on same batch
# Submit feedback for this step. Training engine applies steps_per_batch.
sdpo_metrics = None
sub_steps_completed = 0
if samples:
for sub_step in range(config.steps_per_batch):
try:
sdpo_metrics = await _submit_feedback(
config, actual_lora_id, samples,
)
sub_steps_completed += 1
except (httpx.HTTPError, KeyError) as e:
logger.warning(
"[%s] Step %d sub-step %d feedback failed: %s",
pref.name, step, sub_step, e,
)
break

if config.steps_per_batch > 1:
logger.info(
"[%s] Step %d: %d sub-steps completed",
pref.name, step, sub_steps_completed,
try:
sdpo_metrics = await _submit_feedback(
config, actual_lora_id, samples,
)
except (httpx.HTTPError, KeyError) as e:
logger.warning(
"[%s] Step %d feedback failed: %s",
pref.name, step, e,
)

sub_step_count = sdpo_metrics.steps_per_batch_applied if sdpo_metrics else 1

# Measure eval
try:
eval_metrics = await _measure_eval_metrics(
@@ -447,7 +442,7 @@
],
response_text=response_text if needs_generation else None,
timing_s=timing_s,
sub_step_count=sub_steps_completed if sub_steps_completed > 0 else 1,
sub_step_count=sub_step_count,
)

result.steps.append(step_result)
4 changes: 2 additions & 2 deletions claas/eval/types.py
@@ -94,8 +94,6 @@ class EvalConfig:
openclaw_url: Optional[str] = None
base_model: str = "Qwen/Qwen3-8B"
batch_size: int = 4
steps_per_batch: int = 4
feedback_repetitions: int = 1
training: TrainingConfig = field(default_factory=TrainingConfig)


@@ -117,6 +115,7 @@ class LocalDistillMetrics:
kl_reg: float | None
mean_is_ratio: float | None
clip_fraction: float | None
steps_per_batch_applied: int = 1


@dataclass
@@ -131,6 +130,7 @@ class TinkerDistillMetrics:
adv_abs_mean_raw: float
completion_len: int = 0
batch_size: int = 0
steps_per_batch_applied: int = 1


@dataclass
10 changes: 10 additions & 0 deletions claas/inference/vllm.py
@@ -167,6 +167,16 @@ async def chat_completion(

usage = data.get("usage", {})

# vLLM includes the stop token (e.g. <|im_end|>) in logprobs but the
# tokenizer doesn't produce it when re-encoding the text. Trim the
# logprobs so the two sequences stay aligned.
if (
response_logprobs is not None
and response_token_ids
and len(response_logprobs) > len(response_token_ids)
):
response_logprobs = response_logprobs[: len(response_token_ids)]

return CompletionResult(
content=content,
raw_prompt=raw_prompt,
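The trimming rule added to `chat_completion` is easy to isolate. A minimal sketch — a hypothetical helper; the real code operates in place on the response fields — of the alignment fix:

```python
def align_logprobs(logprobs: list[float], token_ids: list[int]) -> list[float]:
    """Drop trailing logprob entries (e.g. a stop token like <|im_end|> that
    vLLM reports but the tokenizer does not re-encode) so the logprobs line
    up 1:1 with the token ids."""
    if len(logprobs) > len(token_ids):
        return logprobs[: len(token_ids)]
    return logprobs
```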