| title | Disaster Triage Environment |
|---|---|
| emoji | |
| colorFrom | red |
| colorTo | gray |
| sdk | docker |
| app_port | 7860 |
| tags | |
| short_description | Strategic triage benchmark for resource-constrained agents. |
An OpenEnv-compliant reinforcement learning benchmark for evaluating agentic decision-making in resource-constrained disaster logistics.
In the immediate aftermath of a disaster, responders face a chaotic "Fog of War." Resources are finite, timelines are compressed, and information is often missing or noisy. This is not just a pattern recognition task; it is high-stakes decision-making under severe constraints.
Existing RL benchmarks (like Atari or MuJoCo) focus on motor control or games. Our environment fills the gap by modeling Operational Triage, where the cost of a wrong move is measured in system-wide failure and misallocated survival resources.
"This is not just detection; it is decision-making under constraints."
This environment models real-world disaster response logistics where:
- Information is incomplete (communication failure)
- Resources are finite (supply chain constraints)
- Decisions are irreversible (allocation cannot be undone)
- Time directly impacts outcomes (delayed response reduces effectiveness)
Unlike traditional RL benchmarks, this environment evaluates whether an agent can act as a decision-maker in high-stakes operational settings such as:
- Emergency response coordination
- Supply chain disruption management
- Crisis logistics planning
This makes it a benchmark for Agentic AI, not just predictive models.
This environment introduces a structured benchmark for:
Resource-Constrained Decision Making under Partial Observability
Unlike existing RL benchmarks:
- Not a game
- Not static reasoning
- Not single-step evaluation
It combines:
- POMDP dynamics
- Multi-objective optimization
- Irreversible actions
- Time-constrained planning
This makes it a test of true agentic behavior rather than pattern matching.
- Determinism: All rewards and transitions are 100% reproducible given the same seed.
- Bounded Scoring: Final rewards are strictly in (0.01, 0.99) to comply with OpenEnv validation and avoid degenerate extremes.
- Efficiency Pressure: Limited action horizons (7/10/13 steps) enforce meaningful, high-impact decisions.
- No Reward Hacking: Over-allocation, "action spamming," and random guessing are heavily penalized through the prioritization and utilization axes.
- Exploration Cost: Information gathering (`request_info`) has explicit trade-offs and temporal costs.
The environment is modeled as a POMDP (Partially Observable Markov Decision Process) simulating a lead disaster architect coordinating multiple zones.
- Zones: Multiple crisis areas with unique, hidden severity and resource demands.
- Resources: Fixed global stockpiles of Food, Water, and Medicine.
- Metrics: Agents must balance noisy signals (`urgency_signal`) against the cost of ground-truth reconnaissance.
Ground-truth severity and demand are hidden and must be actively revealed via request_info, forcing informed exploration.
A shared, finite pool of Food, Water, and Medicine must be distributed across all zones, introducing real trade-offs.
Each action consumes a step and reduces achievable reward, simulating time-sensitive disaster response.
Agents must simultaneously optimize prioritization, efficiency, and resource utilization, not just maximize a single metric.
All transitions and rewards are fully reproducible, ensuring fair and consistent benchmarking across agents.
Zones that are revealed but not acted upon within a limited number of steps experience an increase in effective severity.
- Commitment Pressure: Encourages immediate action after information gathering.
- Dynamic POMDP: Transforms the environment from a static allocation problem into an evolving crisis system.
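The escalation rule can be sketched as below. This is an illustration only: the grace period, the size of the increase, and the severity cap are not stated here, so `GRACE_STEPS`, the `+1` increment, `MAX_SEVERITY`, and the `ZoneState` fields are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical parameters; the real values are not documented here.
GRACE_STEPS = 2    # steps a revealed zone may wait before escalating
MAX_SEVERITY = 5   # assumed top of the severity scale


@dataclass
class ZoneState:
    severity: int                      # hidden ground-truth severity
    revealed_at: Optional[int] = None  # step at which request_info revealed it
    acted_on: bool = False             # True once any resource was allocated


def effective_severity(zone: ZoneState, current_step: int) -> int:
    """Severity after applying the assumed escalation rule."""
    if zone.revealed_at is None or zone.acted_on:
        return zone.severity  # unrevealed or already served: no escalation
    if current_step - zone.revealed_at > GRACE_STEPS:
        return min(zone.severity + 1, MAX_SEVERITY)
    return zone.severity
```

Under these assumptions, a severity-3 zone revealed at step 1 and still unserved at step 4 would be treated as severity 4.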
| Difficulty | Description | Step Budget | Observability | Core Challenge |
|---|---|---|---|---|
| Easy | 3 Zones with fully visible data | 7 | Full | Allocation correctness |
| Medium | 5 Zones with partial hidden information | 10 | ~50% Revealed | Explore vs Exploit |
| Hard | 7 Zones with high uncertainty and tight budget | 13 | Fully Hidden | Prioritization under constraints |
Demo Configuration Note: The step budgets above (7/10/13) are chosen for rapid evaluation during this submission. The full environment is designed to support step budgets of 30 / 40 / 70 for Easy / Medium / Hard respectively, enabling deeper multi-phase strategies, richer explore-exploit trade-offs, and more realistic crisis timelines.
| Action Type | Parameters | Description | Trade-off | Conditions | Step Reward |
|---|---|---|---|---|---|
| `request_info` | `zone_id` | Reveals true severity and demand for a zone | Costs 1 step but reduces uncertainty | Available for unrevealed zones | +0.02 |
| `allocate_resource` | `zone_id`, `resource_type`, `amount` | Allocates a specific quantity of resource to a zone | Irreversible; bounded by global supply | Must have sufficient resources; amount > 0 | +0.05 (if severity ≥ 4) |
| `finalize` | None | Terminates the episode and triggers final grading | Premature finalization reduces achievable score | Can be called at any time | -0.005 per step taken |
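A minimal sketch of building these actions client-side follows. The key names `action_type`, `zone_id`, `resource_type`, and `amount` match the sample log in this README; the lowercase resource names other than `food`, the shape of the `finalize` payload, and the local validation are assumptions made for illustration.

```python
from typing import Any, Dict

# "food" appears in the sample log; the other two spellings are assumed.
RESOURCE_TYPES = ("food", "water", "medicine")


def request_info(zone_id: str) -> Dict[str, Any]:
    """Action that reveals a zone's true severity and demand."""
    return {"action_type": "request_info", "zone_id": zone_id}


def allocate_resource(zone_id: str, resource_type: str, amount: float,
                      available: Dict[str, float]) -> Dict[str, Any]:
    """Action that allocates a resource, checked against the table's conditions."""
    if resource_type not in RESOURCE_TYPES:
        raise ValueError(f"unknown resource type: {resource_type}")
    if amount <= 0:
        raise ValueError("amount must be > 0")
    if amount > available.get(resource_type, 0.0):
        raise ValueError("amount exceeds the remaining global supply")
    return {"action_type": "allocate_resource", "zone_id": zone_id,
            "resource_type": resource_type, "amount": float(amount)}


def finalize() -> Dict[str, Any]:
    """Action that ends the episode (payload shape assumed)."""
    return {"action_type": "finalize"}
```

Validating before sending is a design choice of this sketch: allocations are irreversible, so catching an invalid amount locally avoids spending a step on an action the server would reject.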
At each step, the agent receives:
- `zones`: Array of zone objects containing:
  - `urgency_signal` (noisy indicator)
  - `revealed` (boolean)
  - `known_severity` (if revealed)
  - `known_demand` (if revealed)
- `available_resources`: Remaining global stockpile of Food, Water, Medicine
- `step_count` / `max_steps`: Tracks remaining decision budget
- `data_completeness`: Fraction of zones with revealed ground-truth data
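The observation fields can be modelled roughly as follows. The field names come from the list above; the `zone_id` key, the value types, and both helper functions are assumptions added for illustration.

```python
from typing import Dict, List, Optional, TypedDict


class ZoneView(TypedDict):
    zone_id: str                              # assumed identifier key
    urgency_signal: float                     # noisy indicator
    revealed: bool
    known_severity: Optional[int]             # None until revealed
    known_demand: Optional[Dict[str, float]]  # None until revealed


def data_completeness(zones: List[ZoneView]) -> float:
    """Fraction of zones whose ground-truth data has been revealed."""
    if not zones:
        return 0.0
    return sum(1 for z in zones if z["revealed"]) / len(zones)


def next_zone_to_probe(zones: List[ZoneView]) -> Optional[str]:
    """Unrevealed zone with the highest urgency signal, if any remain."""
    hidden = [z for z in zones if not z["revealed"]]
    if not hidden:
        return None
    return max(hidden, key=lambda z: z["urgency_signal"])["zone_id"]
```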
This environment utilizes a deterministic, multi-axis evaluation system designed to strictly penalize pattern matching and reward strategic, resource-aware decision-making.
| Component | Type | Purpose |
|---|---|---|
| Step rewards | Dense | Provides intermediate feedback on information gain and allocation quality. |
| Action Costs | Sparse | Implicitly penalizes dithering and inefficient pathing via the step budget. |
| Final Reward | Terminal | The "Ground Truth" score produced by the 3-axis deterministic grader. |
| Action | Reward Range | Strategic Purpose |
|---|---|---|
| `request_info` | +0.02 | Rewards active exploration and uncertainty reduction. |
| `allocate_resource` | +0.05 | Rewards high-impact allocation (Severity ≥ 4). |
| `step_increment` | -0.005 | Implicit penalty for every action taken, enforcing efficiency. |
Tip
Reward Engineering: These signals are balanced to ensure that "blindly" allocating (without info) or "spamming" info requests result in a lower aggregate score than focused, informed execution.
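One possible reading of the step-reward tables is sketched below. The three constants come from the tables; everything else is assumed. In particular, the sketch gives 0.0 for allocations to zones below severity 4 and charges the -0.005 per-step cost at `finalize` rather than on every action, which matches the `reward=0.0200` shown for a `request_info` step in the sample log but is not confirmed.

```python
REQUEST_INFO_BONUS = 0.02
HIGH_SEVERITY_BONUS = 0.05
HIGH_SEVERITY_THRESHOLD = 4
STEP_PENALTY = 0.005


def step_reward(action_type: str, zone_severity: int = 0,
                steps_taken: int = 0) -> float:
    """Dense reward for one action under the assumptions stated above."""
    if action_type == "request_info":
        return REQUEST_INFO_BONUS
    if action_type == "allocate_resource":
        # Assumed: lower-severity allocations earn no dense bonus.
        if zone_severity >= HIGH_SEVERITY_THRESHOLD:
            return HIGH_SEVERITY_BONUS
        return 0.0
    if action_type == "finalize":
        # Assumed: the per-step cost is settled once, at termination.
        return -STEP_PENALTY * steps_taken
    raise ValueError(f"unknown action type: {action_type}")
```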
The terminal evaluation is computed using the following weighted objective function:
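Reading the weights from the metric table, the objective presumably takes the form below. The symbols $P$, $E$, and $U$ are labels introduced here for the Prioritization, Efficiency, and Utilization scores, and the clamp reflects the stated score bounds; the exact expression used by the grader is an assumption.

```latex
\text{score} = \min\bigl(0.99,\ \max\bigl(0.01,\ 0.35\,P + 0.40\,E + 0.25\,U\bigr)\bigr)
```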
- Safety Guard: Final scores are strictly bounded to `[0.01, 0.99]` for OpenEnv compliance.
- Reproducibility: Evaluation is stateless and purely deterministic (Fixed Input → Fixed Output).
| Metric | Weight | Description | Behavior Encouraged |
|---|---|---|---|
| Prioritization | 35% | Evaluates if critical zones (Severity 4-5) were served first. | Severity-Aware Triage |
| Efficiency | 40% | Measures demand satisfaction accuracy per resource type. | High-Precision Logistics |
| Utilization | 25% | Penalizes resource waste and excessive over-allocation. | Stewardship under Scarcity |
To simulate high-stakes triage, zone importance is scaled exponentially:
Effect: A Severity 5 zone is 16x more important than a Severity 1 zone, forcing the agent to ignore low-priority background noise.
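The 16x ratio is consistent with a weight that doubles at each severity level, i.e. `2 ** (severity - 1)`. That closed form is inferred from the stated ratio, not confirmed, and the 1 to 5 severity range and the helper names are assumptions.

```python
from typing import List


def severity_weight(severity: int) -> int:
    """Assumed exponential weight: doubles with each severity level."""
    if not 1 <= severity <= 5:
        raise ValueError("severity is assumed to lie in 1..5")
    return 2 ** (severity - 1)


def importance_shares(severities: List[int]) -> List[float]:
    """Each zone's share of the total weight across all zones."""
    weights = [severity_weight(s) for s in severities]
    total = sum(weights)
    return [w / total for w in weights]
```

With one Severity 5 zone and one Severity 1 zone, the shares come out as 16/17 and 1/17, which shows how a single critical zone dominates a weighted average.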
Measures the ratio of useful allocation vs. the total environmental demand:
Stewardship is measured by minimizing useless "Waste" (allocation exceeding demand):
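A hedged sketch of both ratios, built only from the two descriptions above: "useful" allocation is taken to be the part that does not exceed demand, and waste is the excess. The normalisation of waste by total allocation, the zero-division handling, and the nested-dictionary layout are assumptions.

```python
from typing import Dict

# zone_id -> resource_type -> amount (layout assumed for illustration)
Amounts = Dict[str, Dict[str, float]]


def efficiency(allocated: Amounts, demand: Amounts) -> float:
    """Useful allocation divided by total demand across all zones."""
    useful = 0.0
    total_demand = 0.0
    for zone, needs in demand.items():
        for resource, need in needs.items():
            given = allocated.get(zone, {}).get(resource, 0.0)
            useful += min(given, need)  # anything above demand is not useful
            total_demand += need
    return useful / total_demand if total_demand > 0 else 0.0


def utilization(allocated: Amounts, demand: Amounts) -> float:
    """One minus the wasted fraction of everything that was allocated."""
    waste = 0.0
    total_allocated = 0.0
    for zone, gifts in allocated.items():
        for resource, given in gifts.items():
            need = demand.get(zone, {}).get(resource, 0.0)
            waste += max(given - need, 0.0)
            total_allocated += given
    return 1.0 - waste / total_allocated if total_allocated > 0 else 0.0
```

For example, sending 20 food to a zone that needs 10 food and 10 water gives an efficiency of 0.5 (10 of 20 demanded units met) and a utilization of 0.5 (half of the shipment wasted).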
- Un-gameable: Final evaluation uses true "Hidden" values that the agent never sees directly.
- Randomness Recovery: Blind guessing or random allocation distributions yield scores below `0.15`.
- Sustainability: Heavy over-allocation (e.g., dumping all food on one zone) causes the Utilization score to collapse, capping the total reward.
- Dominance: High-severity zones dominate the weighted average; ignoring a Severity 5 zone makes a score > 0.6 impossible.
Traditional RL benchmarks (Atari, Gym) typically optimize for a single scalar reward. The Disaster Triage environment introduces Resource-Constrained Optimization through three competing metrics:
- Multi-Objective Optimization: Agents must balance meeting demand (Efficiency) against not overspending (Stewardship).
- Strategic Reasoning: Because actions are irreversible, the agent must commit to a plan based on its revealed knowledge.
- Real-World Modeling: By using severity-weighted scoring, we simulate the ethical and operational realities of a Disaster Response Coordinator.
- Request info for unrevealed zones with highest urgency
- Prioritize by `urgency_signal` (noisy proxy for severity)
- Skip this phase in Easy mode
- Sort revealed zones by severity (descending)
- Allocate resources in priority order: medicine > water > food
- For each zone: `amount = min(demand - already_allocated, available)`
- Continue until resources exhausted or target steps reached
- Call `finalize()` at target step count
- Triggers grader computation
- Returns final episode reward
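The allocation part of this strategy can be sketched as a pure planning function. It is not the code in `inference.py`: the dictionary keys, the loop order (zones outermost, then resources), and the `max_actions` cut-off are assumptions. Planning starts with nothing allocated, so `demand - already_allocated` reduces to the known demand.

```python
from typing import Any, Dict, List

RESOURCE_PRIORITY = ("medicine", "water", "food")


def plan_allocations(zones: List[Dict[str, Any]],
                     available: Dict[str, float],
                     max_actions: int) -> List[Dict[str, Any]]:
    """Greedy plan: most severe zones first, resources in priority order."""
    stock = dict(available)  # copy so the caller's stockpile is untouched
    actions: List[Dict[str, Any]] = []
    ranked = sorted(zones, key=lambda z: z["known_severity"], reverse=True)
    for zone in ranked:
        for resource in RESOURCE_PRIORITY:
            if len(actions) >= max_actions:
                return actions  # step budget for allocations is used up
            need = zone["known_demand"].get(resource, 0.0)
            amount = min(need, stock.get(resource, 0.0))
            if amount > 0:
                actions.append({"action_type": "allocate_resource",
                                "zone_id": zone["zone_id"],
                                "resource_type": resource,
                                "amount": amount})
                stock[resource] -= amount
    return actions
```

Capping each amount at the remaining demand keeps the plan from over-allocating, which is what the Utilization axis penalizes.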
[START] task=medium env=disaster-triage-env model=llama-3.1-8b-instant
[STEP] step=1 action={"action_type":"request_info","zone_id":"Z2"} reward=0.0200 done=false error=null
[STEP] step=2 action={"action_type":"allocate_resource","zone_id":"Z2","resource_type":"food","amount":35.0} reward=0.0500 done=false error=null
[END] success=true steps=10 score=0.684 rewards=0.02,0.05,0.45...
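For post-processing runs, a `[STEP]` line in the format above can be parsed with a regular expression. This is a sketch that assumes the field order and spelling shown in the sample log; it is not part of the environment.

```python
import json
import re
from typing import Any, Dict, Optional

STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>\{.*\}) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) "
    r"error=(?P<error>.*)$"
)


def parse_step(line: str) -> Optional[Dict[str, Any]]:
    """Parse one [STEP] log line; return None if the line does not match."""
    match = STEP_RE.match(line.strip())
    if match is None:
        return None
    error = match["error"]
    return {
        "step": int(match["step"]),
        "action": json.loads(match["action"]),  # the action is logged as JSON
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        "error": None if error == "null" else error,
    }
```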
# 1. Install dependencies
pip install -r requirements.txt
# 2. Configure variables (see .env.example)
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
export HF_TOKEN="your_token_here"
# 3. Start Environment Server
python server/app.py
# 4. Run Baseline Inference
python inference.py --task medium
docker build -t disaster-triage .
docker run -p 7860:7860 --env-file .env disaster-triage
- POST /reset: Initialize or restart a session (returns first observation).
- POST /step: Execute an agent action.
- GET /state: Global ground-truth view (for debugging/diagnostic only).
- GET /health: Server liveness and active session summary.
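A minimal client sketch using only the Python standard library is shown below. The endpoint paths and the port come from this README; the JSON request bodies (including whether `/reset` takes a `task` field) and the JSON response format are assumptions, so check `server/app.py` for the actual schema.

```python
import json
from typing import Any, Dict
from urllib import request

BASE_URL = "http://localhost:7860"  # port from the docker run mapping


def build_post(path: str, payload: Dict[str, Any]) -> request.Request:
    """Build a JSON POST request for an endpoint such as /reset or /step."""
    body = json.dumps(payload).encode("utf-8")
    return request.Request(BASE_URL + path, data=body,
                           headers={"Content-Type": "application/json"},
                           method="POST")


def send(req: request.Request) -> Dict[str, Any]:
    """Send the request and decode the response (JSON reply assumed)."""
    with request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example usage against a running server (payload shapes are assumptions):
#   obs = send(build_post("/reset", {"task": "medium"}))
#   out = send(build_post("/step", {"action_type": "request_info",
#                                   "zone_id": "Z2"}))
```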
- Runtime: Full evaluation (Easy, Medium, Hard) runs in < 20 minutes.
- Hardware: Designed to run on standard 2 vCPU / 8GB RAM instances.
- Beyond Toy Problems: This environment forces agents to contend with genuine operational fog, simulating logistics rather than arcade physics.
- Agentic Stress Test: The tight step budgets and exploration costs separate simple chatbots from capable action-oriented agents.
- Logistics Modeling: Every detail, from the resource types to the noise in the urgency signal, is designed to mirror real-world supply chain optimization during a crisis.
Disaster Triage Environment: Moving Intelligence from Chat to Execution.