The full operator guide. From clone to trained model to HF Space deploy. Updated 2026-04-26 in response to peer review — every claim here was checked against the codebase and reflects current state.
| Item | Status | Evidence |
|---|---|---|
| Basic-tier env (12 templates × 6 entries = 72 scenarios) | ✅ runnable end-to-end | make test (203 pass), make baseline |
| Advanced-tier orchestrator (chains Basic episodes) | ✅ runnable as Python orchestrator | python -m sre_gym.advanced run cascading_release_train --seed 1 |
| Advanced-tier 28-action universe (per scenario YAMLs) | 🟡 design-spec only | DESIGN-SPEC HEADER at top of each YAML — Python implements only the Basic 11 |
| Max-tier graph state-machine simulator | ✅ runnable in Python | python -m sre_gym.max run ecommerce_vibecoded_saas --chaos rls_silent_leak |
Max-tier docker-compose stack (ghcr.io/sre-gym/* images) |
🔴 unbuilt — images not published | DESIGN-SPEC HEADER at top of compose/ecommerce.yaml |
Gradio UI mounted at / of the FastAPI server |
✅ live | app.py + mount_gradio_app(api_app, blocks, path="/") |
MCP JSON-RPC 2.0 dual-route at /mcp |
✅ live | tests/test_mcp_route_parity.py (20 parity tests) |
| Pytest suite | ✅ 203 tests collected | pytest --collect-only -q |
openenv validate . |
✅ green | uv.lock regenerated |
| ruff check | ✅ clean | ruff check sre_gym tests app.py |
| GRPO training run | 🔴 not executed | notebooks/01_*.ipynb has 0 executed cells; no adapter on HF Hub |
| Eval comparison run | 🔴 not executed | notebooks/02_*.ipynb has 0 executed cells; eval/results/ has only README + .gitkeep |
| Frontier-baselines table (Llama / Claude / heuristic / random / scripted-optimal) | ✅ measured | train/data/eval_sweep_baselines.jsonl (n=18 each for random + heuristic), claude_seed.jsonl (n=6) |
| Teacher trajectories for the 6 round-2 Basic templates | 🔴 not collected | train/data/ has 0 samples for auth_token_expiry, dep_degradation, memory_leak_oom, migration_lock, network_partition, rate_limit_retry_storm |
seed_combined.jsonl row count vs the SFT-friendly target |
🟡 21 / 200 | wc -l train/data/seed_combined.jsonl |
| Mini-blog or YouTube video link in README | 🔴 not yet recorded | tracked in §16 below |
The honest framing: one tier (Basic) is a real live RL environment, two tiers (Advanced + Max) are runnable Python orchestrators that approximate larger designs, and training is shipped as runnable notebooks but not yet executed. Pretending otherwise was the original mistake; this revision corrects it.
- Prerequisites + system requirements
- Local setup (5 minutes)
- First-run smoke test (5 minutes)
- Tier-aware operation
- Scenario authoring quickstart
- Training pipeline (Basic) — to be executed externally
- Eval comparison sweep
- Teacher trajectory collection (the gap that needs filling)
- HF Space deployment
- Async GRPO via OpenClaw-RL
- Claude Code skill setup
- Performance + cost reference
- Troubleshooting
- Realistic submission-day checklist
- Operator FAQ
- Materials linked from the README
Local development (Basic tier env serving + tests):
- Python 3.10+ (3.11 / 3.12 / 3.14 verified)
- pip 24+ or uv
- Git
- Docker (only required for HF Space build; not required for normal env serving)
- 4 GB free RAM, 2 GB free disk
Training (Basic tier, end-to-end GRPO):
- 1×A100 40GB (HF Pro Spaces, Colab A100, or rented)
- Or 1×L4 24GB (drops to Qwen2.5-1.5B)
- Or 1×H100 80GB (can run Qwen2.5-7B)
- 80 GB scratch disk
- HF account + token (
HF_TOKEN) with write scope for adapter push - Optional: Anthropic / Fireworks / Groq API key for richer comparison rows
Max tier "real cluster" (operator-only, NOT provisioned in this repo):
- 8× A100/H100 cluster, $40–150/day cluster cost
- Sandboxed Stripe test creds, sandboxed Supabase project, sandboxed git remote
- Docker Compose v2 / k3d
- ~$1–2k registry-cost commitment to publish the
ghcr.io/sre-gym/*stub images
git clone https://github.com/dakshdoesdev/sre-enginnerllm.git
cd sre-enginnerllm
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -e '.[dev]'Verify:
make test # 203 tests, ~3s
python -m openenv.cli validate . # greenBoot the combined Gradio + FastAPI server:
uvicorn app:app --host 127.0.0.1 --port 7860Then in a second shell:
curl -s http://127.0.0.1:7860/health | jq
# {"status": "ok", "environment": "unified_incident_env", ...}
curl -s http://127.0.0.1:7860/tasks | jq '.scenarios | length'
# 72
curl -s http://127.0.0.1:7860/mcp/tools | jq '.tools | length'
# 11Hit a scenario via /reset + /step:
curl -s -X POST http://127.0.0.1:7860/reset \
-H 'Content-Type: application/json' \
-d '{"scenario_id":"memory_leak_oom"}' | jq '.observation.workflow_stage'
# "triage"
curl -s -X POST http://127.0.0.1:7860/step \
-H 'Content-Type: application/json' \
-d '{"action":{"action_type":"query_logs","service":"worker"}}' | jq '.observation.tool_output'
# "Worker logs: 'process killed (OOM)' every ~90s..."Run the scripted-baseline smoke against all 12 templates:
make baseline
# scripted-optimal mean across all 12 templates: ~0.750
# 12 / 12 resolvedIf make baseline ever reports mean > 0.80 the rubric is leaking — see §13.
make tier-info # prints the tier metadata table for each of basic/advanced/maxProgrammatic API:
from sre_gym import SREGym, Tier
# Basic
env = SREGym(tier=Tier.BASIC)
obs = env.reset(scenario_id="memory_leak_oom__p02")
obs = env.step({"action_type": "rollback_deploy", "service": "worker"})
result = env.run("memory_leak_oom__p02", seed=42)
# Advanced — chained Basic episodes with horizon state
env = SREGym(tier=Tier.ADVANCED)
result = env.run("cascading_release_train", seed=1)
print(result.summary())
# Max — Python state-machine simulator
env = SREGym(tier=Tier.MAX)
obs = env.reset(family_id="ecommerce_vibecoded_saas", chaos="rls_silent_leak", seed=1)
obs = env.step({"action_type": "rollback_deploy", "service": "postgres-primary"})CLI:
python -m sre_gym.advanced list
python -m sre_gym.advanced run cascading_release_train --seed 1
python -m sre_gym.max list-chaos # 12 patterns (one is alias)
python -m sre_gym.max run ecommerce_vibecoded_saas --chaos rls_silent_leak- Append the template dict to
EXTRA_TEMPLATESinunified_incident_env/server/basic_templates_extra.py. - Append a baseline-action lambda to
extra_baselines(). - Append the new
RootCauseTypevalue tounified_incident_env/models.py. - Append the template_id to
ROUND2_TEMPLATESintests/test_round2_templates.py.
make test exercises all of the above automatically. Procgen variants generate at module-import time.
Drop a new YAML in sre_gym/advanced/scenarios/. Include the DESIGN-SPEC HEADER the existing scenarios carry — call out which subset of allowed_actions: is implemented vs which is design-spec. The runner falls back to the Basic 11 actions for anything else.
Triplet of YAMLs:
sre_gym/max/families/<id>.yaml— family-level specsre_gym/max/chaos/<id>_chaos_library.yaml— composable chaos patternssre_gym/max/compose/<id>.yaml— docker-compose stack (mark as design-spec if images aren't published)
Then add the chaos descriptors to CHAOS_PATTERN_DEFAULTS in sre_gym/max/runner.py so the simulator can run them.
See docs/SCENARIO_AUTHORING.md for the full schema.
Status: not yet executed. The notebooks ship; running them is on us.
- Open
notebooks/01_basic_train_grpo_unsloth.ipynbin Colab. - Set runtime to A100.
- Set
HF_TOKENin Colab Secrets (sidebar key icon). - Run-All.
The notebook auto-detects GPU memory:
- A100 40GB → Qwen2.5-3B-Instruct, LoRA r=64, K=4 GRPO rollouts
- L4 24GB → Qwen2.5-1.5B-Instruct, LoRA r=32, K=2
- ≤16GB → Qwen2.5-0.5B, LoRA r=16, K=2
Output: trained adapter pushed to dakshdoesdev/sre-gym-qwen25-3b-grpo (rename REPO_ID in cell 9 if needed).
| Stage | Steps | Wall-clock on A100 40GB | Output |
|---|---|---|---|
| Seed dataset build (Claude / Llama teachers) | ~200 trajectories | ~2h, ~$15 API spend | train/data/seed_combined.jsonl ≥ 200 rows |
| SFT cold start | 500 steps | ~3h | LoRA adapter |
| GRPO online | 800 steps, K=4 | ~6h | trained adapter |
| Eval sweep | 36 episodes (12 × 3 seeds) | ~30min | eval/results/comparison_*.csv + *.png |
End-to-end ~12h on A100, ~$15 of API spend. Until a real run produces these artifacts, the README does not claim a trained-model row in the baselines table.
notebooks/02_basic_eval_comparison.ipynb runs 7 policies against the held-out 12-scenario set and writes:
eval/results/comparison_raw.csv— every per-episode roweval/results/comparison_summary.csv— per-policy aggregateseval/results/comparison_table.csv— printable table for the READMEeval/results/comparison_per_template.png— per-template box plotseval/results/comparison_hero.png— single-axis bar chart
API keys required for the full sweep: ANTHROPIC_API_KEY, FIREWORKS_API_KEY or GROQ_API_KEY, HF_TOKEN. Status: not yet executed. When the run lands, the artifacts go straight into eval/results/ and the README table updates.
Local-only (no API costs):
PYTHONPATH=. python scripts/eval_baseline.py --output eval/results/scripted_baseline.jsonltrain/data/ currently contains 21 rows in seed_combined.jsonl and 0 trajectories for these 6 templates:
auth_token_expiry dep_degradation memory_leak_oom
migration_lock network_partition rate_limit_retry_storm
To collect (against a real model — pick any one driver):
export ANTHROPIC_API_KEY=...
python train/collect_trajectories.py \
--env-url http://127.0.0.1:7860 \
--scenarios memory_leak_oom dep_degradation auth_token_expiry \
network_partition rate_limit_retry_storm migration_lock \
--models claude-opus-4-7 \
--episodes-per-model 50 \
--parallelism 2 \
--driver anthropic \
--output train/data/round2_claude.jsonlOr via the free Groq tier:
export GROQ_API_KEY=...
python train/collect_trajectories.py \
--env-url http://127.0.0.1:7860 \
--scenarios all \
--models llama-3.3-70b-versatile \
--episodes-per-model 200 \
--parallelism 3 \
--driver groq \
--output train/data/round2_groq.jsonlThen merge into seed_combined.jsonl:
python train/compile_claude_seed.py \
--inputs train/data/claude_seed.jsonl train/data/round2_claude.jsonl \
--output train/data/seed_combined.jsonlThe Space is Madhav189/sre-env (the v3 deployment target). The legacy dakshdoesdev/sre-gym Space is the older v2 deploy.
# Authenticate the HF CLI
huggingface-cli login
# Sync repo to the Space's git remote
bash deploy/push_to_hf.shThe Space rebuilds automatically (~3–4min for cpu-basic). Verify:
curl -s https://madhav189-sre-env.hf.space/health | jq
curl -s https://madhav189-sre-env.hf.space/tasks | jq '.scenarios | length' # 72For training at scale (multiple A100s, distributed), the OpenClaw-RL pool-server shim wraps the env in a lease-based interface compatible with Gen-Verse/OpenClaw-RL.
python -m uvicorn openclaw_integration.pool_server:app --port 8100Endpoints: /allocate, /reset, /exec_tool, /evaluate, /close, /healthz. Mirrors the OpenClaw-RL/terminal-rl/remote/pool_server.py contract exactly. See openclaw_integration/README.md for full launch instructions.
ln -s "$PWD/skill" "$HOME/.claude/skills/sre-gym"In Claude Code:
> Solve the network_partition__p03 scenario in sre-gym.
> List all 12 templates and explain what each one teaches.
The skill loads the runbooks under skill/verified-runbooks/*.md (12 markdown files, one per template). Of those, 3 are status: verified (the v2 templates with real run history) and 9 are status: draft (the round-2 templates — drafted by hand, not yet validated by a successful solve).
| Stage | A100 40GB | L4 24GB | H100 80GB |
|---|---|---|---|
| Seed dataset build | ~2h ($15 API) | ~2h | ~2h |
| SFT cold start (500 steps) | ~3h | ~6h | ~1.5h |
| GRPO online (800 steps) | ~6h | ~13h | ~3h |
| Eval sweep | ~30min | ~45min | ~15min |
| Total | ~12h | ~22h | ~7h |
| Model variant | Qwen2.5-3B | Qwen2.5-1.5B | Qwen2.5-7B |
| GPU | HF Pro credits/h | Total per run |
|---|---|---|
| A100 40GB | 4.13 | ~50 |
| L4 24GB | 0.80 | ~18 |
| H100 80GB | 9.39 | ~66 |
pip install --upgrade 'openenv-core>=0.2.1'You added a service that's not in the ServiceName Literal. Edit unified_incident_env/models.py.
- Drop learning rate from 5e-6 to 2e-6
- Bump KL coefficient (
beta) from 0.04 to 0.08 - Check that
episode_reward()returns -0.05 (not NaN) on parse failures
Drop to Qwen2.5-1.5B by setting BASE_MODEL manually before cell 3 runs. Or reduce LORA_RANK from 64 to 32.
cpu-basic Spaces have 16GB ephemeral. Prune optional deps from requirements.txt if total wheel size exceeds 4GB.
The rubric is leaking — the scripted-optimal ceiling must stay in the [0.70, 0.80] band per test_baseline_ceiling_is_hardened_below_080. Check what changed in unified_incident_env/server/grader.py. Re-run make test and look for the failing assertion.
Expected. The ghcr.io/sre-gym/* images are not published — see the DESIGN-SPEC HEADER at the top of the compose file. The runnable Max-tier surface is the Python state-machine simulator (python -m sre_gym.max run ...), not the compose stack.
Expected. The Advanced YAMLs declare a wider 28-action universe as design spec; the runner only implements the Basic 11 actions. See the DESIGN-SPEC HEADER at the top of each Advanced YAML for the exhaustive list of implemented vs. design-only actions.
You're authenticated as a user without collaborator access on dakshdoesdev/sre-enginnerllm. Ask the maintainer to add you, or gh auth login as the maintainer.
The previous version of this checklist had aspirational items that didn't get done in time. This version reflects what actually needs to happen.
T-48h (do these before the submission window closes):
- Execute
notebooks/01_basic_train_grpo_unsloth.ipynbin Colab (A100) - Push the resulting LoRA adapter to a public HF Hub repo
- Execute
notebooks/02_basic_eval_comparison.ipynband commiteval/results/comparison_*.csv+*.png - Update README §"Frontier baselines" with the trained-model row including
n=36measured episodes - Collect ≥ 50 trajectories for each of the 6 round-2 templates and update
seed_combined.jsonl - Verify
make test(203 green) andopenenv validate . - Verify HF Space
/healthreturns 200 from a fresh browser
T-24h:
- All README links reachable (HF Space, GitHub repo, docs files, notebook files)
-
eval/results/comparison_table.csvrow counts match the README's frontier-baselines table - Mini-blog or YouTube video link added to the materials section (§16)
-
git statusclean;git logshows the submission commit at HEAD - HF Space has a recent successful build (no auto-rollback)
Submission window:
- Submit form references the GitHub repo URL + HF Space URL
-
git tag submission-2026-04-26 && git push --tags
Items intentionally NOT in this checklist (because they're stretch goals or don't fit submission window):
- Publishing the
ghcr.io/sre-gym/*Max stub images - Actually-trained Advanced/Max specialists (these are 1–2 A100-day and multi-week investments, respectively, far beyond the hackathon window)
Q: Why doesn't the README claim a trained-model row in the baselines?
A: Because no trained run has happened in this repo. Earlier drafts had a "target ≥ 0.80" placeholder row; that was misleading. We removed it. When notebook 01 runs and notebook 02 produces a real row, it lands in the table with n=36 measured episodes.
Q: Why does Random outperform the deterministic Heuristic? A: Because the heuristic commits to a fixed (often wrong) sequence while Random sometimes stumbles into a useful evidence-gathering path that earns shaped per-tick reward. The fix is in the heuristic, not the env.
Q: Why do all 12 Max chaos deploy_markers carry the same date (2026.04.25)?
A: They're synthetic markers used by the Python simulator's query_deploys(). The chaos library cites real-world incident patterns (Stripe webhook signature drift, Cloudflare config rollouts, etc.) but the deploy markers themselves are placeholders, not real-incident citations. Reading them as historical record is a mistake we owe better signaling on.
Q: The Advanced runner maps Supabase-RLS phases to Stripe webhook + Postgres lock + worker deploy templates. Why?
A: Because we don't have a Supabase-RLS Basic template. The phase mapping in PHASE_TO_BASIC_TEMPLATE is a narrative wrapper around whichever Basic templates exercise the closest-shaped failure. It's a known approximation. A proper fix is either (a) authoring an rls_silent_leak Basic template, or (b) implementing the wider Advanced action universe so the Supabase scenario can run on its own simulator.
Q: Why are 6 of 12 Basic templates lacking teacher trajectories? A: Round-2 templates were authored without an accompanying SFT collection pass. §8 above is the runbook to fix it; ~$10–15 of API spend covers all six.
Q: Should I use the docker-compose stack under sre_gym/max/families/?
A: Not yet — the stub images aren't published. The runnable Max surface is the Python state-machine simulator. The compose file documents the topology shape an operator could provision after publishing the images.
| Material | Status | Link |
|---|---|---|
| Source repo | ✅ live | https://github.com/dakshdoesdev/sre-enginnerllm |
| HF Space | ✅ live | https://huggingface.co/spaces/Madhav189/sre-env |
| Trained adapter on HF Hub | 🔴 not yet | dakshdoesdev/sre-gym-qwen25-3b-grpo (target) |
| Mini-blog / YouTube video | 🔴 not yet | TBD |
| Comparison hero plot | 🔴 not yet | eval/results/comparison_hero.png (target) |
| Architecture deep dive | ✅ shipped | docs/ARCHITECTURE.md |
| Per-tier docs | ✅ shipped | docs/BASIC_TIER.md / docs/ADVANCED_TIER.md / docs/MAX_TIER.md |
| Reward design | ✅ shipped | docs/REWARD_DESIGN.md |
| Scenario authoring | ✅ shipped | docs/SCENARIO_AUTHORING.md |
| References + postmortems | ✅ shipped | docs/REFERENCES.md |
When the trained adapter, video, and hero plot land they'll be linked from this section and from the README.