| title | Orbital Thruster Environment Server | |
|---|---|---|
| sdk | docker | |
| pinned | false | |
| app_port | 7860 | |
| base_path | /web | |
| colorFrom | blue | |
| colorTo | indigo | |
| tags |
|
OpenEnv benchmark for Theme #2: (Super) Long-Horizon Planning & Instruction Following. The agent must track mission directives over a long episode, preserve fuel for delayed objectives, recover from anomalies, and finish in precision hold.
Submission links
- Hugging Face Space: https://huggingface.co/spaces/pixxel-phantom/orbital-thruster-env
- Trained adapter (GRPO LoRA, 1.5B): https://huggingface.co/pixxel-phantom/orbital-thruster-grpo-fast
- Trained adapter (GRPO LoRA, 4B): https://huggingface.co/pixxel-phantom/orbital-thruster-grpo
- Mini-blog / write-up: https://huggingface.co/spaces/pixxel-phantom/orbital-thruster-env/blob/main/BLOG.md
- Training notebook: `training/train_orbital_grpo.ipynb`
Pitch: early waste breaks later phases. A controller that looks good on short-horizon pointing can still fail the flagship mission because it burns fuel before the retarget, mishandles the anomaly, or reaches the final hold phase with no reserve left.
Modern mission-operations control is not one action repeated forever. It is a chain of directives:
- Detumble after deployment.
- Respect a quiet coast window.
- Repoint to a new relay geometry.
- Recover from an injected gyro-bias anomaly.
- Finish with stable precision hold.
This benchmark turns that story into a verifier-backed environment with explicit milestones, delayed checkpoints, and anti-shortcut rewards.
The environment keeps the existing orbital control core:
- 13 discrete thruster actions plus `idle`
- deterministic seeded disturbances
- limited RCS fuel
- dense physical reward from pointing, stability, fuel, and overshoot
On top of that, the mission-ops pivot adds:
- `mission_brief`
- `active_directive`
- `pending_directives_count`
- `milestones_completed`
- `anomaly_flags`
- `fuel_reserve_target`
- `phase_deadline_step`
- `reward_breakdown`
- `episode_metrics`
Each action now also includes a required control_mode:
- `detumble`
- `slew`
- `brake`
- `trim`
- `hold`
- `recover`
- `safe_hold`
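As a rough illustration of the action contract described above, the sketch below validates an action that carries a required `control_mode`. The `ThrusterAction` class, the integer `action_id` encoding (0 for idle, 1 to 13 for thruster actions), and the `to_json` helper are assumptions made for this example; only the mode names and the `reason` field come from this README.

```python
import json
from dataclasses import dataclass, asdict

# Control modes listed in this README.
CONTROL_MODES = ("detumble", "slew", "brake", "trim", "hold", "recover", "safe_hold")


@dataclass
class ThrusterAction:
    """Hypothetical action container; the real schema may differ."""
    action_id: int      # assumed encoding: 0 = idle, 1..13 = thruster actions
    control_mode: str   # required, must be one of CONTROL_MODES
    reason: str = ""    # short free-text justification

    def __post_init__(self):
        if not 0 <= self.action_id <= 13:
            raise ValueError(f"action_id out of range: {self.action_id}")
        if self.control_mode not in CONTROL_MODES:
            raise ValueError(f"unknown control_mode: {self.control_mode!r}")


def to_json(action: ThrusterAction) -> str:
    """Serialize the action to the JSON form a policy would emit."""
    return json.dumps(asdict(action))
```

Rejecting an unknown mode at construction time keeps malformed actions out of the physics step, which is the same idea the format reward enforces during training.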
The benchmark ships four tasks:
- `detumble_satellite` (easy): stabilize a newly deployed spacecraft and finish with ample reserve.
- `retarget_180_flip` (medium): survive a delayed maneuver window, execute the large flip, and settle cleanly.
- `long_horizon_precision_hold` (hard): preserve a fine-pointing envelope under long disturbance exposure.
- `mission_ops_long_horizon` (hard): a single episode that chains detumble, coast discipline, retargeting, anomaly recovery, and final precision hold.
This flagship task is the main demo task for the hackathon.
The environment logs rubric-style reward columns instead of a single opaque scalar:
| Column | Signal |
|---|---|
| `physical_tracking_reward` | Pointing accuracy + hold streak bonus − stability − overshoot penalties |
| `fuel_discipline_reward` | Per-step fuel cost penalty + reserve-gap penalty |
| `milestone_completion_reward` | +0.35 on verified directive completion |
| `control_mode_reward` | +0.12 if declared mode matches recommended; −0.08 otherwise |
| `anomaly_recovery_reward` | Bonus for error/rate improvement under active anomaly |
| `anti_stall_penalty` | Penalty for consecutive steps without meaningful progress |
These are surfaced per step in reward_breakdown and aggregated in state.reward_columns. That makes it easy to show judges not only that reward improved, but which behaviors improved.
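A minimal sketch of how per-step breakdowns could be summed into per-column totals is shown below. The column names and the +0.12 / −0.08 mode values come from the table above; the function names, the treatment of `recommended` as a collection of modes, and the dict-based layout are assumptions for illustration, not the environment's actual code.

```python
# Reward columns as listed in this README.
REWARD_COLUMNS = (
    "physical_tracking_reward",
    "fuel_discipline_reward",
    "milestone_completion_reward",
    "control_mode_reward",
    "anomaly_recovery_reward",
    "anti_stall_penalty",
)


def control_mode_reward(declared: str, recommended) -> float:
    """+0.12 when the declared mode is recommended, -0.08 otherwise."""
    return 0.12 if declared in recommended else -0.08


def aggregate_columns(step_breakdowns) -> dict:
    """Sum per-step breakdown dicts into one total per column.

    Columns missing from a step count as 0.0.
    """
    totals = {name: 0.0 for name in REWARD_COLUMNS}
    for breakdown in step_breakdowns:
        for name in REWARD_COLUMNS:
            totals[name] += breakdown.get(name, 0.0)
    return totals
```

Keeping the columns separate until reporting time is what lets a run be compared column by column instead of by one scalar.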
Three baselines are supported end-to-end:
- seeded random controller
- deterministic PD controller
- tuned PD controller
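For readers unfamiliar with the PD baselines, here is a single-axis sketch of the general idea: compute a proportional-derivative demand, then snap it to the nearest discrete pulse. The gains, the deadband, and the five-entry `PULSES` table are hypothetical values chosen for illustration; the repository's controllers, their tuning, and the real 13-action mapping are not reproduced here.

```python
# Hypothetical single-axis pulse set; the real environment has 13 thruster actions plus idle.
PULSES = {
    "neg_large": -1.0,
    "neg_small": -0.25,
    "idle": 0.0,
    "pos_small": 0.25,
    "pos_large": 1.0,
}


def pd_command(error_deg: float, rate_dps: float, kp: float = 0.8, kd: float = 1.6) -> float:
    """Continuous PD demand: push against pointing error and damp angular rate."""
    return -(kp * error_deg + kd * rate_dps)


def quantize(command: float, deadband: float = 0.1) -> str:
    """Map a continuous demand to the nearest discrete pulse, idling inside the deadband."""
    if abs(command) < deadband:
        return "idle"
    return min(PULSES, key=lambda name: abs(PULSES[name] - command))
```

A controller of this shape has no notion of future directives, so it spends fuel whenever the current error is large, which is consistent with the flagship failures reported below.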
The current intended story is:
- deterministic clears `easy`
- tuned PD clears `medium`
- both heuristics fail the flagship mission
Fixed-seed baseline results:
| Policy | Easy (detumble) | Medium (retarget) | Hard (hold) | Flagship | Fuel Used (flagship) |
|---|---|---|---|---|---|
| Random | 23.9 / fail | 3.2 / fail | −25.3 / fail | −53.5 / fail | 90.0 |
| Deterministic PD | 17.6 / pass | 97.4 / fail | 21.1 / fail | 89.8 / fail | 90.0 |
| Tuned PD | 34.2 / pass | 120.1 / pass | 27.5 / fail | 115.8 / fail | 88.8 |
Baseline summary: reward totals per policy per task. All three heuristic controllers fail the flagship task.
Run the fixed-seed evaluation:

    python training/evaluate_baselines.py

Stack: TRL (SFTTrainer → GRPOTrainer) + PEFT QLoRA on the real OpenEnv environment as the verifier.
Base model: Qwen/Qwen2.5-7B-Instruct (A100 via HF Jobs) for the headline run; Qwen/Qwen2.5-1.5B-Instruct for the fast L4 run. Override via ORBITAL_BASE_MODEL env var.
Why this model: strong JSON adherence (we score on JSON validity), fits 4-bit QLoRA on a single GPU, mature TRL integration. We use the vanilla TRL + PEFT + bitsandbytes path (no Unsloth) because the Unsloth matmul_lora kernel hit a dtype mismatch (Half vs Float) on the cloud image and the dependency lock chain (unsloth → trl ≥ 0.18 → mergekit → pydantic <2.11 vs openenv-core → pydantic ≥2.11.7) is unresolvable. Vanilla TRL on the same image works first try.
Pipeline: seed trajectories from tuned-PD expert → SFT warm-start (JSON + control-mode priming) → GRPO with 5 independent reward functions. The plan that produced the headline 7B run: 150 SFT steps, 300 GRPO steps, `num_generations=8`, `temperature=1.3` (high enough to keep `frac_reward_zero_std=0`, i.e. to avoid the mode-collapse trap), curriculum-weighted seed mixture, and `do_sample=True` at eval.
GRPO reward functions (independent, summed — anti-hacking design):
| Function | Signal |
|---|---|
| `reward_format` | strict JSON parse + valid enums + `reason` field |
| `reward_env_step` | replay history into fresh env, score candidate action via real physics |
| `reward_mode_match` | `control_mode` ∈ recommended for active directive |
| `reward_anti_spam` | penalty if same action ≥ 4× in last 7 steps |
| `reward_fuel_discipline` | low-fuel→idle bonus, low-fuel→large-pulse penalty |
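To make the "independent, summed" design concrete, the sketch below implements simplified versions of two of the rows above in the list-in, list-out style that TRL reward functions use. The JSON field name `action_id`, the `recent_actions` column, the penalty value, and the choice to count the candidate action inside the 7-step window are all assumptions for this example; the repository's real functions may differ.

```python
import json

CONTROL_MODES = ("detumble", "slew", "brake", "trim", "hold", "recover", "safe_hold")
N_ACTIONS = 14  # assumed integer encoding: 13 thruster actions plus idle


def _parse(completion):
    """Return the completion as a dict, or None if it is not a JSON object."""
    try:
        obj = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return None
    return obj if isinstance(obj, dict) else None


def reward_format(completions, **kwargs):
    """1.0 for strict JSON with valid enums and a non-empty reason, else 0.0."""
    scores = []
    for text in completions:
        obj = _parse(text)
        ok = (
            obj is not None
            and type(obj.get("action_id")) is int
            and 0 <= obj["action_id"] < N_ACTIONS
            and obj.get("control_mode") in CONTROL_MODES
            and isinstance(obj.get("reason"), str)
            and obj["reason"].strip() != ""
        )
        scores.append(1.0 if ok else 0.0)
    return scores


def reward_anti_spam(completions, recent_actions, penalty=-0.6, **kwargs):
    """Penalize a candidate that makes the same action appear >= 4 times in the last 7 steps."""
    scores = []
    for text, history in zip(completions, recent_actions):
        obj = _parse(text)
        if obj is None:
            scores.append(0.0)  # malformed output is handled by reward_format
            continue
        candidate = obj.get("action_id")
        window = (list(history) + [candidate])[-7:]
        scores.append(penalty if window.count(candidate) >= 4 else 0.0)
    return scores


def total_reward(completions, **columns):
    """Sum the independent components per completion."""
    parts = [reward_format(completions, **columns), reward_anti_spam(completions, **columns)]
    return [sum(values) for values in zip(*parts)]
```

Because each component scores the completion on its own, a policy cannot raise one term by exploiting another; a well-formed but repetitive action still pays the repetition penalty.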
Entry points:
- `training/hf_job_train.py`: UV script for `hf jobs uv run` (cloud, GPU credits)
- `training/qwen3_smoke_sft.py` / `qwen3_grpo_train.py`: local script entrypoints
Run on cloud (headline 7B, A100):

    hf jobs uv run --flavor a100-large --timeout 4h --secrets HF_TOKEN \
      -e ORBITAL_BASE_MODEL=Qwen/Qwen2.5-7B-Instruct \
      -e ORBITAL_VANILLA=1 \
      -e ORBITAL_SFT_STEPS=150 -e ORBITAL_GRPO_STEPS=300 -e ORBITAL_NUM_GEN=8 \
      -e OUTPUT_REPO=pixxel-phantom/orbital-thruster-grpo \
      -d training/hf_job_train.py

Run on cloud (fast 1.5B, L4):

    hf jobs uv run --flavor l4x1 --timeout 2h --secrets HF_TOKEN \
      -e ORBITAL_BASE_MODEL=Qwen/Qwen2.5-1.5B-Instruct \
      -e ORBITAL_VANILLA=1 \
      -e ORBITAL_SFT_STEPS=40 -e ORBITAL_GRPO_STEPS=80 \
      -e OUTPUT_REPO=pixxel-phantom/orbital-thruster-grpo-fast \
      -d training/hf_job_train.py

Training-only deps: `training/requirements.txt`.
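The environment variables passed with `-e` in the commands above could be collected into one config object along these lines. This is a sketch only: the `RunConfig` class, the `load_config` helper, and every default value are assumptions, and `training/hf_job_train.py` may parse these variables differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the ORBITAL_* job settings."""
    base_model: str
    sft_steps: int
    grpo_steps: int
    num_generations: int
    vanilla: bool
    output_repo: str


def load_config(env=None) -> RunConfig:
    """Read job settings from an env mapping; all defaults here are assumed."""
    env = os.environ if env is None else env
    return RunConfig(
        base_model=env.get("ORBITAL_BASE_MODEL", "Qwen/Qwen2.5-1.5B-Instruct"),
        sft_steps=int(env.get("ORBITAL_SFT_STEPS", "150")),
        grpo_steps=int(env.get("ORBITAL_GRPO_STEPS", "300")),
        num_generations=int(env.get("ORBITAL_NUM_GEN", "8")),
        vanilla=env.get("ORBITAL_VANILLA", "0") == "1",
        output_repo=env.get("OUTPUT_REPO", ""),
    )
```

Passing a plain dict instead of `os.environ` makes the parsing easy to check locally before spending GPU credits, for example `load_config({"ORBITAL_GRPO_STEPS": "80"})`.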
SFT phase: loss 2.33 → ~0.5 on 384 expert traces (JSON + control-mode priming).
GRPO phase: loss converged to 0.156 plateau, total reward ~2.0 sustained across all 300 steps, reward_format = 1.0 from step ~2 (perfect JSON throughout), reward_mode_match = 0.5 (constant — model picked the recommended mode every step), frac_reward_zero_std = 0.0 for the entire run (mode-collapse trap broken — see "Plan that produced this run" below).
GRPO training curves (7B run). Top: per-component reward breakdown (reward_format, reward_env_step, reward_mode_match, reward_anti_spam, reward_fuel_discipline). Bottom: policy loss. reward_format = 1.0 from step ~10 (perfect JSON). reward_env_step carries the real physics signal at ~0.6–0.8. frac_reward_zero_std stays at 0 for all 300 steps — the policy keeps generating diverse rollouts, so the GRPO advantage is non-degenerate throughout training.
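Why `frac_reward_zero_std` matters can be seen from the group-relative advantage that GRPO uses: each rollout's reward is centred on its group mean and scaled by the group's standard deviation. The sketch below is a plain-Python illustration of that idea, not TRL's implementation; the function names, the use of the population standard deviation, and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Centre and scale the rewards of one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def frac_reward_zero_std(groups):
    """Fraction of groups whose rollouts all received the same reward."""
    return sum(1 for g in groups if pstdev(g) == 0.0) / len(groups)
```

When every rollout in a group earns the same reward, all advantages in that group are zero and the group contributes no learning signal, which is the mode-collapse trap described in this README.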
| Component | Step 2 | Step 300 |
|---|---|---|
| `reward_format` | 1.0 | 1.0 (perfect JSON throughout) |
| `reward_env_step` | 0.59 | 0.60 (variable, physics-backed) |
| `reward_mode_match` | 0.50 | 0.50 (always picks recommended mode) |
| `reward_anti_spam` | −0.03 | −0.10 (small repetition penalty) |
| `reward_fuel_discipline` | 0.0 | 0.0 |
| Total | 2.06 | 2.00 |
| Policy | Easy (detumble) | Medium (retarget) | Hard (hold) | Flagship | Fuel Used (flagship) | Milestones |
|---|---|---|---|---|---|---|
| Random | 23.9 / fail | 3.2 / fail | −25.3 / fail | −53.5 / fail | 90.0 | 0 |
| Deterministic PD | 17.6 / pass | 97.4 / fail | 21.1 / fail | 89.8 / fail | 90.0 | 2 |
| Tuned PD | 34.2 / pass | 120.1 / pass | 27.5 / fail | 115.8 / fail | 88.8 | 0 |
| Trained (GRPO, 7B) — headline A100 run | 12.4 | 33.3 | 43.8 | 11.0 | 90.0 | 0 |
Trained vs baselines: reward totals per task. The 7B trained model beats every heuristic on `long_horizon_precision_hold` (43.8 vs 27.5 for tuned PD) and uses non-zero fuel on every task (52.8–120 across the four tasks), which indicates the policy actively maneuvers rather than collapsing to a passive HOLD_POSITION. The flagship score is still below tuned PD, which is the next item to address.
What worked:
- `reward_format = 1.0` from step ~10 and held. SFT priming was decisive: without it, GRPO burns its budget learning JSON syntax.
- `frac_reward_zero_std` stayed at 0 for the entire 300-step run. The combination of `temperature=1.3`, `num_generations=8`, and the curriculum-weighted seed mixture kept rollouts diverse, so the GRPO advantage normalisation never divided by zero. This is the classic mode-collapse trap that ate ~190 steps of an earlier run.
- The 7B model uses fuel on every task (52.8 / 120.0 / 85.0 / 90.0 across the four tasks). The earlier 1.5B run learned to idle to game the precision-hold reward; the 7B run actually maneuvers.
- The 7B model still beats every heuristic on the hard precision-hold task (43.8 > 27.5 tuned PD), so it learned a non-trivial control policy, not just a passive policy.
- No reward hacking was observed across any reward component (the rubric is the anti-hacking story).
What needs more training / the next iteration:
- Flagship score 11.0 is below tuned PD's 115.8. The 7B model commits to maneuvers but does not yet land milestones: `directive_completion_ratio = 0` on every task. The next run should up-weight `mission_ops_long_horizon` in the curriculum (currently 10%) and run longer GRPO (≥ 600 steps) so the model sees more milestone-transition gradient.
- `success = 0` on all four tasks for the 7B run. Easy/medium scores are below tuned PD because the model is exploring with high temperature instead of executing the tight expert maneuver. A second-stage GRPO with `temperature=0.7` and milestone-weighted rewards is the natural follow-up.
The 7B headline run was the result of a six-issue debugging plan against an earlier 4B run that produced a degenerate model (fuel_used=0 everywhere, success=0, mode-collapse by step ~60). The plan:
- SFT→GRPO LoRA rank mismatch (`r=32` vs `r=16`) → silent warm-start failure. Fix: unify on `r=16, alpha=16`.
- GRPO mode collapse (`temperature=0.9`, `num_generations=6`). Fix: raise to `1.3` and `8`.
- Eval greedy collapse (`do_sample=False`) → always passive HOLD policy → `fuel_used=0`. Fix: `do_sample=True, temperature=0.7`.
- SFT too few steps. Fix: `80 → 150` steps; `256 → 384` records.
- `reward_mode_match` too weak. Fix: `+0.25/−0.15 → +0.5/−0.3`.
- `reward_anti_spam` insufficient pressure to break passive policy. Fix: `−0.4/−0.15 → −0.6/−0.25`.
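The first issue in the list above, a silent LoRA rank mismatch between the SFT and GRPO stages, is the kind of problem a small pre-flight check can catch. The sketch below compares the `r` and `lora_alpha` keys that PEFT writes to `adapter_config.json`; the helper names and the directory arguments are hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def load_adapter_config(adapter_dir) -> dict:
    """Read the adapter_config.json that PEFT saves next to adapter weights."""
    return json.loads((Path(adapter_dir) / "adapter_config.json").read_text())


def lora_mismatches(sft_cfg: dict, grpo_cfg: dict, keys=("r", "lora_alpha")) -> dict:
    """Return {key: (sft_value, grpo_value)} for every key that differs."""
    return {
        key: (sft_cfg.get(key), grpo_cfg.get(key))
        for key in keys
        if sft_cfg.get(key) != grpo_cfg.get(key)
    }
```

Failing fast when `lora_mismatches(...)` is non-empty would turn the silent warm-start failure into an explicit error before GRPO begins.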
Reward function logic and task definitions were not modified. The judging story (5 independent rewards, multi-component) is preserved.

    pip install -e .
    uvicorn server.app:app --host 0.0.0.0 --port 7860
    python validate.py

    python training/generate_seed_trajectories.py
    python training/evaluate_baselines.py
    python training/qwen3_smoke_sft.py
    python training/qwen3_grpo_train.py

    docker build -t orbital-thruster-env .
    docker run -p 7860:7860 orbital-thruster-env

`API_BASE_URL` and `MODEL_NAME` can be overridden at runtime. `HF_TOKEN` is required for remote inference.

    $env:API_BASE_URL = "https://router.huggingface.co/v1"
    $env:MODEL_NAME = "Qwen/Qwen3-8B"
    $env:HF_TOKEN = "hf_xxx"
    python inference.py

The validation script checks:
- four tasks present
- mission-planning observation fields exposed
- action schema requires `control_mode`
- reward rubric surfaced on `/step`
- cumulative reward columns surfaced on `/state`
python validate.py
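The first two checks in the list above can be pictured as simple membership tests. The sketch below works on plain Python data rather than a live server; the helper names are hypothetical, and the actual `validate.py` may check the environment in a different way. The field and task names are the ones listed in this README.

```python
# Mission-planning observation fields and task ids named in this README.
OBSERVATION_FIELDS = (
    "mission_brief",
    "active_directive",
    "pending_directives_count",
    "milestones_completed",
    "anomaly_flags",
    "fuel_reserve_target",
    "phase_deadline_step",
    "reward_breakdown",
    "episode_metrics",
)
TASKS = (
    "detumble_satellite",
    "retarget_180_flip",
    "long_horizon_precision_hold",
    "mission_ops_long_horizon",
)


def missing_observation_fields(observation: dict) -> list:
    """List the expected observation fields that are absent."""
    return [name for name in OBSERVATION_FIELDS if name not in observation]


def missing_tasks(task_ids) -> list:
    """List the expected task ids that are absent."""
    present = set(task_ids)
    return [task for task in TASKS if task not in present]
```

Returning the list of missing names, rather than a bare boolean, makes a failed validation run point directly at what needs fixing.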

