Execution runbook — sre-gym v3.1

The full operator guide. From clone to trained model to HF Space deploy. Updated 2026-04-26 in response to peer review — every claim here was checked against the codebase and reflects current state.

Current state — what's done, what isn't (read this first)

Item	Status	Evidence
Basic-tier env (12 templates × 6 entries = 72 scenarios)	✅ runnable end-to-end	`make test` (203 pass), `make baseline`
Advanced-tier orchestrator (chains Basic episodes)	✅ runnable as Python orchestrator	`python -m sre_gym.advanced run cascading_release_train --seed 1`
Advanced-tier 28-action universe (per scenario YAMLs)	🟡 design-spec only	`DESIGN-SPEC HEADER` at top of each YAML — Python implements only the Basic 11
Max-tier graph state-machine simulator	✅ runnable in Python	`python -m sre_gym.max run ecommerce_vibecoded_saas --chaos rls_silent_leak`
Max-tier docker-compose stack (`ghcr.io/sre-gym/*` images)	🔴 unbuilt — images not published	`DESIGN-SPEC HEADER` at top of `compose/ecommerce.yaml`
Gradio UI mounted at `/` of the FastAPI server	✅ live	`app.py` + `mount_gradio_app(api_app, blocks, path="/")`
MCP JSON-RPC 2.0 dual-route at `/mcp`	✅ live	`tests/test_mcp_route_parity.py` (20 parity tests)
Pytest suite	✅ 203 tests collected	`pytest --collect-only -q`
`openenv validate .`	✅ green	`uv.lock` regenerated
ruff check	✅ clean	`ruff check sre_gym tests app.py`
GRPO training run	🔴 not executed	`notebooks/01_*.ipynb` has 0 executed cells; no adapter on HF Hub
Eval comparison run	🔴 not executed	`notebooks/02_*.ipynb` has 0 executed cells; `eval/results/` has only README + .gitkeep
Frontier-baselines table (Llama / Claude / heuristic / random / scripted-optimal)	✅ measured	`train/data/eval_sweep_baselines.jsonl` (n=18 each for random + heuristic), `claude_seed.jsonl` (n=6)
Teacher trajectories for the 6 round-2 Basic templates	🔴 not collected	`train/data/` has 0 samples for `auth_token_expiry`, `dep_degradation`, `memory_leak_oom`, `migration_lock`, `network_partition`, `rate_limit_retry_storm`
`seed_combined.jsonl` row count vs the SFT-friendly target	🟡 21 / 200	`wc -l train/data/seed_combined.jsonl`
Mini-blog or YouTube video link in README	🔴 not yet recorded	tracked in §16 below

The honest framing: one tier (Basic) is a real live RL environment, two tiers (Advanced + Max) are runnable Python orchestrators that approximate larger designs, and training is shipped as runnable notebooks but not yet executed. Pretending otherwise was the original mistake; this revision corrects it.

Prerequisites + system requirements
Local setup (5 minutes)
First-run smoke test (5 minutes)
Tier-aware operation
Scenario authoring quickstart
Training pipeline (Basic) — to be executed externally
Eval comparison sweep
Teacher trajectory collection (the gap that needs filling)
HF Space deployment
Async GRPO via OpenClaw-RL
Claude Code skill setup
Performance + cost reference
Troubleshooting
Realistic submission-day checklist
Operator FAQ
Materials linked from the README

1. Prerequisites + system requirements

Local development (Basic tier env serving + tests):

Python 3.10+ (3.11 / 3.12 / 3.14 verified)
pip 24+ or uv
Git
Docker (only required for HF Space build; not required for normal env serving)
4 GB free RAM, 2 GB free disk

Training (Basic tier, end-to-end GRPO):

1×A100 40GB (HF Pro Spaces, Colab A100, or rented)
Or 1×L4 24GB (drops to Qwen2.5-1.5B)
Or 1×H100 80GB (can run Qwen2.5-7B)
80 GB scratch disk
HF account + token (HF_TOKEN) with write scope for adapter push
Optional: Anthropic / Fireworks / Groq API key for richer comparison rows

Max tier "real cluster" (operator-only, NOT provisioned in this repo):

8× A100/H100 cluster, $40–150/day cluster cost
Sandboxed Stripe test creds, sandboxed Supabase project, sandboxed git remote
Docker Compose v2 / k3d
~$1–2k registry-cost commitment to publish the ghcr.io/sre-gym/* stub images

2. Local setup (5 minutes)

git clone https://github.com/dakshdoesdev/sre-enginnerllm.git
cd sre-enginnerllm

python3 -m venv .venv
source .venv/bin/activate                    # Windows: .venv\Scripts\activate

pip install --upgrade pip
pip install -e '.[dev]'

Verify:

make test                                      # 203 tests, ~3s
python -m openenv.cli validate .               # green

3. First-run smoke test (5 minutes)

Boot the combined Gradio + FastAPI server:

uvicorn app:app --host 127.0.0.1 --port 7860

Then in a second shell:

curl -s http://127.0.0.1:7860/health | jq
# {"status": "ok", "environment": "unified_incident_env", ...}

curl -s http://127.0.0.1:7860/tasks | jq '.scenarios | length'
# 72

curl -s http://127.0.0.1:7860/mcp/tools | jq '.tools | length'
# 11

Hit a scenario via /reset + /step:

curl -s -X POST http://127.0.0.1:7860/reset \
  -H 'Content-Type: application/json' \
  -d '{"scenario_id":"memory_leak_oom"}' | jq '.observation.workflow_stage'
# "triage"

curl -s -X POST http://127.0.0.1:7860/step \
  -H 'Content-Type: application/json' \
  -d '{"action":{"action_type":"query_logs","service":"worker"}}' | jq '.observation.tool_output'
# "Worker logs: 'process killed (OOM)' every ~90s..."

Run the scripted-baseline smoke against all 12 templates:

make baseline
# scripted-optimal mean across all 12 templates: ~0.750
# 12 / 12 resolved

If make baseline ever reports mean > 0.80 the rubric is leaking — see §13.

4. Tier-aware operation

make tier-info    # prints the tier metadata table for each of basic/advanced/max

Programmatic API:

from sre_gym import SREGym, Tier

# Basic
env = SREGym(tier=Tier.BASIC)
obs = env.reset(scenario_id="memory_leak_oom__p02")
obs = env.step({"action_type": "rollback_deploy", "service": "worker"})
result = env.run("memory_leak_oom__p02", seed=42)

# Advanced — chained Basic episodes with horizon state
env = SREGym(tier=Tier.ADVANCED)
result = env.run("cascading_release_train", seed=1)
print(result.summary())

# Max — Python state-machine simulator
env = SREGym(tier=Tier.MAX)
obs = env.reset(family_id="ecommerce_vibecoded_saas", chaos="rls_silent_leak", seed=1)
obs = env.step({"action_type": "rollback_deploy", "service": "postgres-primary"})

CLI:

python -m sre_gym.advanced list
python -m sre_gym.advanced run cascading_release_train --seed 1

python -m sre_gym.max list-chaos                                   # 12 patterns (one is alias)
python -m sre_gym.max run ecommerce_vibecoded_saas --chaos rls_silent_leak

5. Scenario authoring quickstart

5.1 Add a 13th Basic template

Append the template dict to EXTRA_TEMPLATES in unified_incident_env/server/basic_templates_extra.py.
Append a baseline-action lambda to extra_baselines().
Append the new RootCauseType value to unified_incident_env/models.py.
Append the template_id to ROUND2_TEMPLATES in tests/test_round2_templates.py.

make test exercises all of the above automatically. Procgen variants generate at module-import time.

5.2 Add an Advanced reference scenario

Drop a new YAML in sre_gym/advanced/scenarios/. Include the DESIGN-SPEC HEADER the existing scenarios carry — call out which subset of allowed_actions: is implemented vs which is design-spec. The runner falls back to the Basic 11 actions for anything else.

5.3 Add a Max scenario family

Triplet of YAMLs:

sre_gym/max/families/<id>.yaml — family-level spec
sre_gym/max/chaos/<id>_chaos_library.yaml — composable chaos patterns
sre_gym/max/compose/<id>.yaml — docker-compose stack (mark as design-spec if images aren't published)

Then add the chaos descriptors to CHAOS_PATTERN_DEFAULTS in sre_gym/max/runner.py so the simulator can run them.

See docs/SCENARIO_AUTHORING.md for the full schema.

6. Training pipeline (Basic) — to be executed externally

Status: not yet executed. The notebooks ship; running them is on us.

6.1 In Colab (recommended, A100)

Open notebooks/01_basic_train_grpo_unsloth.ipynb in Colab.
Set runtime to A100.
Set HF_TOKEN in Colab Secrets (sidebar key icon).
Run-All.

The notebook auto-detects GPU memory:

A100 40GB → Qwen2.5-3B-Instruct, LoRA r=64, K=4 GRPO rollouts
L4 24GB → Qwen2.5-1.5B-Instruct, LoRA r=32, K=2
≤16GB → Qwen2.5-0.5B, LoRA r=16, K=2

Output: trained adapter pushed to dakshdoesdev/sre-gym-qwen25-3b-grpo (rename REPO_ID in cell 9 if needed).

6.2 Stages (with measured budgets)

Stage	Steps	Wall-clock on A100 40GB	Output
Seed dataset build (Claude / Llama teachers)	~200 trajectories	~2h, ~$15 API spend	`train/data/seed_combined.jsonl` ≥ 200 rows
SFT cold start	500 steps	~3h	LoRA adapter
GRPO online	800 steps, K=4	~6h	trained adapter
Eval sweep	36 episodes (12 × 3 seeds)	~30min	`eval/results/comparison_.csv` + `.png`

End-to-end ~12h on A100, ~$15 of API spend. Until a real run produces these artifacts, the README does not claim a trained-model row in the baselines table.

7. Eval comparison sweep

notebooks/02_basic_eval_comparison.ipynb runs 7 policies against the held-out 12-scenario set and writes:

eval/results/comparison_raw.csv — every per-episode row
eval/results/comparison_summary.csv — per-policy aggregates
eval/results/comparison_table.csv — printable table for the README
eval/results/comparison_per_template.png — per-template box plots
eval/results/comparison_hero.png — single-axis bar chart

API keys required for the full sweep: ANTHROPIC_API_KEY, FIREWORKS_API_KEY or GROQ_API_KEY, HF_TOKEN. Status: not yet executed. When the run lands, the artifacts go straight into eval/results/ and the README table updates.

Local-only (no API costs):

PYTHONPATH=. python scripts/eval_baseline.py --output eval/results/scripted_baseline.jsonl

8. Teacher trajectory collection (the gap that needs filling)

train/data/ currently contains 21 rows in seed_combined.jsonl and 0 trajectories for these 6 templates:

auth_token_expiry         dep_degradation         memory_leak_oom
migration_lock            network_partition       rate_limit_retry_storm

To collect (against a real model — pick any one driver):

export ANTHROPIC_API_KEY=...
python train/collect_trajectories.py \
  --env-url http://127.0.0.1:7860 \
  --scenarios memory_leak_oom dep_degradation auth_token_expiry \
              network_partition rate_limit_retry_storm migration_lock \
  --models claude-opus-4-7 \
  --episodes-per-model 50 \
  --parallelism 2 \
  --driver anthropic \
  --output train/data/round2_claude.jsonl

Or via the free Groq tier:

export GROQ_API_KEY=...
python train/collect_trajectories.py \
  --env-url http://127.0.0.1:7860 \
  --scenarios all \
  --models llama-3.3-70b-versatile \
  --episodes-per-model 200 \
  --parallelism 3 \
  --driver groq \
  --output train/data/round2_groq.jsonl

Then merge into seed_combined.jsonl:

python train/compile_claude_seed.py \
  --inputs train/data/claude_seed.jsonl train/data/round2_claude.jsonl \
  --output train/data/seed_combined.jsonl

9. HF Space deployment

The Space is Madhav189/sre-env (the v3 deployment target). The legacy dakshdoesdev/sre-gym Space is the older v2 deploy.

# Authenticate the HF CLI
huggingface-cli login

# Sync repo to the Space's git remote
bash deploy/push_to_hf.sh

The Space rebuilds automatically (~3–4min for cpu-basic). Verify:

curl -s https://madhav189-sre-env.hf.space/health | jq
curl -s https://madhav189-sre-env.hf.space/tasks | jq '.scenarios | length'   # 72

10. Async GRPO via OpenClaw-RL

For training at scale (multiple A100s, distributed), the OpenClaw-RL pool-server shim wraps the env in a lease-based interface compatible with Gen-Verse/OpenClaw-RL.

python -m uvicorn openclaw_integration.pool_server:app --port 8100

Endpoints: /allocate, /reset, /exec_tool, /evaluate, /close, /healthz. Mirrors the OpenClaw-RL/terminal-rl/remote/pool_server.py contract exactly. See openclaw_integration/README.md for full launch instructions.

11. Claude Code skill setup

ln -s "$PWD/skill" "$HOME/.claude/skills/sre-gym"

In Claude Code:

> Solve the network_partition__p03 scenario in sre-gym.
> List all 12 templates and explain what each one teaches.

The skill loads the runbooks under skill/verified-runbooks/*.md (12 markdown files, one per template). Of those, 3 are status: verified (the v2 templates with real run history) and 9 are status: draft (the round-2 templates — drafted by hand, not yet validated by a successful solve).

12. Performance + cost reference

Wall-clock (Basic-tier full pipeline)

Stage	A100 40GB	L4 24GB	H100 80GB
Seed dataset build	~2h ($15 API)	~2h	~2h
SFT cold start (500 steps)	~3h	~6h	~1.5h
GRPO online (800 steps)	~6h	~13h	~3h
Eval sweep	~30min	~45min	~15min
Total	~12h	~22h	~7h
Model variant	Qwen2.5-3B	Qwen2.5-1.5B	Qwen2.5-7B

HF credit consumption

GPU	HF Pro credits/h	Total per run
A100 40GB	4.13	~50
L4 24GB	0.80	~18
H100 80GB	9.39	~66

13. Troubleshooting

13.1 `ModuleNotFoundError: openenv`

pip install --upgrade 'openenv-core>=0.2.1'

13.2 `pydantic.ValidationError: ServiceName` on a new template

You added a service that's not in the ServiceName Literal. Edit unified_incident_env/models.py.

13.3 GRPO loss diverges after step ~200

Drop learning rate from 5e-6 to 2e-6
Bump KL coefficient (beta) from 0.04 to 0.08
Check that episode_reward() returns -0.05 (not NaN) on parse failures

13.4 Notebook 01 OOM

Drop to Qwen2.5-1.5B by setting BASE_MODEL manually before cell 3 runs. Or reduce LORA_RANK from 64 to 32.

13.5 HF Space build fails with "out of disk"

cpu-basic Spaces have 16GB ephemeral. Prune optional deps from requirements.txt if total wheel size exceeds 4GB.

13.6 `make baseline` reports overall mean > 0.80

The rubric is leaking — the scripted-optimal ceiling must stay in the [0.70, 0.80] band per test_baseline_ceiling_is_hardened_below_080. Check what changed in unified_incident_env/server/grader.py. Re-run make test and look for the failing assertion.

13.7 `docker compose -f sre_gym/max/compose/ecommerce.yaml up` fails with HTTP 404

Expected. The ghcr.io/sre-gym/* images are not published — see the DESIGN-SPEC HEADER at the top of the compose file. The runnable Max-tier surface is the Python state-machine simulator (python -m sre_gym.max run ...), not the compose stack.

13.8 Advanced YAML references actions that don't exist in the env

Expected. The Advanced YAMLs declare a wider 28-action universe as design spec; the runner only implements the Basic 11 actions. See the DESIGN-SPEC HEADER at the top of each Advanced YAML for the exhaustive list of implemented vs. design-only actions.

13.9 `git push` returns 403

You're authenticated as a user without collaborator access on dakshdoesdev/sre-enginnerllm. Ask the maintainer to add you, or gh auth login as the maintainer.

14. Realistic submission-day checklist

The previous version of this checklist had aspirational items that didn't get done in time. This version reflects what actually needs to happen.

T-48h (do these before the submission window closes):

Execute notebooks/01_basic_train_grpo_unsloth.ipynb in Colab (A100)
Push the resulting LoRA adapter to a public HF Hub repo
Execute notebooks/02_basic_eval_comparison.ipynb and commit eval/results/comparison_*.csv + *.png
Update README §"Frontier baselines" with the trained-model row including n=36 measured episodes
Collect ≥ 50 trajectories for each of the 6 round-2 templates and update seed_combined.jsonl
Verify make test (203 green) and openenv validate .
Verify HF Space /health returns 200 from a fresh browser

T-24h:

All README links reachable (HF Space, GitHub repo, docs files, notebook files)
eval/results/comparison_table.csv row counts match the README's frontier-baselines table
Mini-blog or YouTube video link added to the materials section (§16)
git status clean; git log shows the submission commit at HEAD
HF Space has a recent successful build (no auto-rollback)

Submission window:

Submit form references the GitHub repo URL + HF Space URL
git tag submission-2026-04-26 && git push --tags

Items intentionally NOT in this checklist (because they're stretch goals or don't fit submission window):

Publishing the ghcr.io/sre-gym/* Max stub images
Actually-trained Advanced/Max specialists (these are 1–2 A100-day and multi-week investments, respectively, far beyond the hackathon window)

15. Operator FAQ

Q: Why doesn't the README claim a trained-model row in the baselines? A: Because no trained run has happened in this repo. Earlier drafts had a "target ≥ 0.80" placeholder row; that was misleading. We removed it. When notebook 01 runs and notebook 02 produces a real row, it lands in the table with n=36 measured episodes.

Q: Why does Random outperform the deterministic Heuristic? A: Because the heuristic commits to a fixed (often wrong) sequence while Random sometimes stumbles into a useful evidence-gathering path that earns shaped per-tick reward. The fix is in the heuristic, not the env.

Q: Why do all 12 Max chaos deploy_markers carry the same date (2026.04.25)? A: They're synthetic markers used by the Python simulator's query_deploys(). The chaos library cites real-world incident patterns (Stripe webhook signature drift, Cloudflare config rollouts, etc.) but the deploy markers themselves are placeholders, not real-incident citations. Reading them as historical record is a mistake we owe better signaling on.

Q: The Advanced runner maps Supabase-RLS phases to Stripe webhook + Postgres lock + worker deploy templates. Why? A: Because we don't have a Supabase-RLS Basic template. The phase mapping in PHASE_TO_BASIC_TEMPLATE is a narrative wrapper around whichever Basic templates exercise the closest-shaped failure. It's a known approximation. A proper fix is either (a) authoring an rls_silent_leak Basic template, or (b) implementing the wider Advanced action universe so the Supabase scenario can run on its own simulator.

Q: Why are 6 of 12 Basic templates lacking teacher trajectories? A: Round-2 templates were authored without an accompanying SFT collection pass. §8 above is the runbook to fix it; ~$10–15 of API spend covers all six.

Q: Should I use the docker-compose stack under sre_gym/max/families/? A: Not yet — the stub images aren't published. The runnable Max surface is the Python state-machine simulator. The compose file documents the topology shape an operator could provision after publishing the images.

16. Materials linked from the README

Material	Status	Link
Source repo	✅ live	https://github.com/dakshdoesdev/sre-enginnerllm
HF Space	✅ live	https://huggingface.co/spaces/Madhav189/sre-env
Trained adapter on HF Hub	🔴 not yet	`dakshdoesdev/sre-gym-qwen25-3b-grpo` (target)
Mini-blog / YouTube video	🔴 not yet	TBD
Comparison hero plot	🔴 not yet	`eval/results/comparison_hero.png` (target)
Architecture deep dive	✅ shipped	`docs/ARCHITECTURE.md`
Per-tier docs	✅ shipped	`docs/BASIC_TIER.md` / `docs/ADVANCED_TIER.md` / `docs/MAX_TIER.md`
Reward design	✅ shipped	`docs/REWARD_DESIGN.md`
Scenario authoring	✅ shipped	`docs/SCENARIO_AUTHORING.md`
References + postmortems	✅ shipped	`docs/REFERENCES.md`

When the trained adapter, video, and hero plot land they'll be linked from this section and from the README.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Execution runbook — sre-gym v3.1

Current state — what's done, what isn't (read this first)

Table of contents

1. Prerequisites + system requirements

2. Local setup (5 minutes)

3. First-run smoke test (5 minutes)

4. Tier-aware operation

5. Scenario authoring quickstart

5.1 Add a 13th Basic template

5.2 Add an Advanced reference scenario

5.3 Add a Max scenario family

6. Training pipeline (Basic) — to be executed externally

6.1 In Colab (recommended, A100)

6.2 Stages (with measured budgets)

7. Eval comparison sweep

8. Teacher trajectory collection (the gap that needs filling)

9. HF Space deployment

10. Async GRPO via OpenClaw-RL

11. Claude Code skill setup

12. Performance + cost reference

Wall-clock (Basic-tier full pipeline)

HF credit consumption

13. Troubleshooting

13.1 `ModuleNotFoundError: openenv`

13.2 `pydantic.ValidationError: ServiceName` on a new template

13.3 GRPO loss diverges after step ~200

13.4 Notebook 01 OOM

13.5 HF Space build fails with "out of disk"

13.6 `make baseline` reports overall mean > 0.80

13.7 `docker compose -f sre_gym/max/compose/ecommerce.yaml up` fails with HTTP 404

13.8 Advanced YAML references actions that don't exist in the env

13.9 `git push` returns 403

14. Realistic submission-day checklist

15. Operator FAQ

16. Materials linked from the README

FilesExpand file tree

execution.md

Latest commit

History

execution.md

File metadata and controls

Execution runbook — sre-gym v3.1

Current state — what's done, what isn't (read this first)

Table of contents

1. Prerequisites + system requirements

2. Local setup (5 minutes)

3. First-run smoke test (5 minutes)

4. Tier-aware operation

5. Scenario authoring quickstart

5.1 Add a 13th Basic template

5.2 Add an Advanced reference scenario

5.3 Add a Max scenario family

6. Training pipeline (Basic) — to be executed externally

6.1 In Colab (recommended, A100)

6.2 Stages (with measured budgets)

7. Eval comparison sweep

8. Teacher trajectory collection (the gap that needs filling)

9. HF Space deployment

10. Async GRPO via OpenClaw-RL

11. Claude Code skill setup

12. Performance + cost reference

Wall-clock (Basic-tier full pipeline)

HF credit consumption

13. Troubleshooting

13.1 ModuleNotFoundError: openenv

13.2 pydantic.ValidationError: ServiceName on a new template

13.3 GRPO loss diverges after step ~200

13.4 Notebook 01 OOM

13.5 HF Space build fails with "out of disk"

13.6 make baseline reports overall mean > 0.80

13.7 docker compose -f sre_gym/max/compose/ecommerce.yaml up fails with HTTP 404

13.8 Advanced YAML references actions that don't exist in the env

13.9 git push returns 403

14. Realistic submission-day checklist

15. Operator FAQ

16. Materials linked from the README

13.1 `ModuleNotFoundError: openenv`

13.2 `pydantic.ValidationError: ServiceName` on a new template

13.6 `make baseline` reports overall mean > 0.80

13.7 `docker compose -f sre_gym/max/compose/ecommerce.yaml up` fails with HTTP 404

13.9 `git push` returns 403