Skip to content

Praptii21/disaster-triage-env

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

68 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

title Disaster Triage Environment
emoji ๐Ÿš‘
colorFrom red
colorTo gray
sdk docker
app_port 7860
tags
openenv
disaster-triage
agentic-ai
pomdp
rescue-logistics
short_description Strategic triage benchmark for resource-constrained agents.

๐Ÿ”— Live Demo:

Space


๐Ÿšจ Disaster Triage Environment (DTE)

An OpenEnv-compliant reinforcement learning benchmark for evaluating agentic decision-making in resource-constrained disaster logistics.

๐Ÿ”น Problem Statement

In the immediate aftermath of a disaster, responders are faced with a chaotic "Fog of War." Resources are finite, timelines are compressed, and information is often missing or noisy. This is not just a pattern recognition taskโ€”it is high-stakes decision-making under severe constraints.

Existing RL benchmarks (like Atari or MuJoCo) focus on motor control or games. Our environment fills the gap by modeling Operational Triage, where the cost of a wrong move is measured in system-wide failure and misallocated survival resources.

๐Ÿ‘‰ โ€œThis is not just detection โ€” it is decision-making under constraints.โ€


๐ŸŒ Real-World Utility

This environment models real-world disaster response logistics where:

  • Information is incomplete (communication failure)
  • Resources are finite (supply chain constraints)
  • Decisions are irreversible (allocation cannot be undone)
  • Time directly impacts outcomes (delayed response reduces effectiveness)

Unlike traditional RL benchmarks, this environment evaluates whether an agent can act as a decision-maker in high-stakes operational settings such as:

  • Emergency response coordination
  • Supply chain disruption management
  • Crisis logistics planning

This makes it a benchmark for Agentic AI, not just predictive models.


๐Ÿ’ก Novelty

This environment introduces a structured benchmark for:

Resource-Constrained Decision Making under Partial Observability

Unlike existing RL benchmarks:

  • Not a game
  • Not static reasoning
  • Not single-step evaluation

It combines:

  • POMDP dynamics
  • Multi-objective optimization
  • Irreversible actions
  • Time-constrained planning

This makes it a test of true agentic behavior rather than pattern matching.


โญ Design Principles

  • Determinism: All rewards and transitions are 100% reproducible given the same seed.
  • Bounded Scoring: Final rewards are strictly in (0.01, 0.99) to comply with OpenEnv validation and avoid degenerate extremes.
  • Efficiency Pressure: Limited action horizons (7/10/13 steps) enforce meaningful, high-impact decisions.
  • No Reward Hacking: Over-allocation, "action spamming," and random guessing are heavily penalized through the prioritization and utilization axes.
  • Exploration Cost: Information gathering (request_info) has explicit trade-offs and temporal costs.

๐Ÿ”น Solution Overview

The environment is modeled as a POMDP (Partially Observable Markov Decision Process) simulating a lead disaster architect coordinating multiple zones.

  • Zones: Multiple crisis areas with unique, hidden severity and resource demands.
  • Resources: Fixed global stockpiles of Food, Water, and Medicine.
  • Metrics: Agents must balance noise-filtered signals (urgency_signal) against the cost of ground-truth reconnaissance.

โญ Key Features

๐Ÿ” Partial Observability

Ground-truth severity and demand are hidden and must be actively revealed via request_info, forcing informed exploration.

๐Ÿ“ฆ Global Resource Constraints

A shared, finite pool of Food, Water, and Medicine must be distributed across all zones, introducing real trade-offs.

โฑ๏ธ Temporal Pressure

Each action consumes a step and reduces achievable reward, simulating time-sensitive disaster response.

โš–๏ธ Multi-Objective Optimization

Agents must simultaneously optimize prioritization, efficiency, and resource utilization, not just maximize a single metric.

๐ŸŽฏ Deterministic Evaluation

All transitions and rewards are fully reproducible, ensuring fair and consistent benchmarking across agents.

๐Ÿ”ฅ Dynamic Urgency Escalation (Novel Mechanic)

Zones that are revealed but not acted upon within a limited number of steps experience an increase in effective severity.

  • Commitment Pressure: Encourages immediate action after information gathering.
  • Dynamic POMDP: Transforms the environment from a static allocation problem into an evolving crisis system.

๐ŸŽฏ Tasks & Difficulty Levels

Difficulty Description Step Budget Observability Core Challenge
Easy 3 Zones with fully visible data 7 Full Allocation correctness
Medium 5 Zones with partial hidden information 10 ~50% Revealed Explore vs Exploit
Hard 7 Zones with high uncertainty and tight budget 13 Fully Hidden Prioritization under constraints

๐Ÿ“Œ Demo Configuration Note: The step budgets above (7/10/13) are optimized for rapid evaluation during this submission. The full environment is architected to support 30 / 40 / 70 step budgets for Easy / Medium / Hard respectively, enabling deeper multi-phase strategies, richer explore-exploit tradeoffs, and more realistic crisis timelines.


๐Ÿง  Environment Design

๐ŸŽฎ Action Space

Action Type Parameters Description Trade-off Conditions Step Reward
request_info zone_id Reveals true severity and demand for a zone Costs 1 step but reduces uncertainty Available for unrevealed zones +0.02
allocate_resource zone_id, resource_type, amount Allocates a specific quantity of resource to a zone Irreversible; bounded by global supply Must have sufficient resources; amount > 0 +0.05 (if severity โ‰ฅ 4)
finalize None Terminates the episode and triggers final grading Premature finalization reduces achievable score Can be called at any time -0.005 per step taken

๐Ÿ‘๏ธ Observation Space

At each step, the agent receives:

  • zones: Array of zone objects containing:
    • urgency_signal (noisy indicator)
    • revealed (boolean)
    • known_severity (if revealed)
    • known_demand (if revealed)
  • available_resources: Remaining global stockpile of Food, Water, Medicine
  • step_count / max_steps: Tracks remaining decision budget
  • data_completeness: Fraction of zones with revealed ground-truth data

๐Ÿ’ฐ Reward & Grader System

This environment utilizes a deterministic, multi-axis evaluation system designed to strictly penalize pattern matching and reward strategic, resource-aware decision-making.

1. Reward System Overview

Component Type Purpose
Step rewards Dense Provides intermediate feedback on information gain and allocation quality.
Action Costs Sparse Implicitly penalizes dithering and inefficient pathing via the step budget.
Final Reward Terminal The "Ground Truth" score produced by the 3-axis deterministic grader.

2. Step-Level Reward Matrix

Action Reward Range Strategic Purpose
request_info +0.02 Rewards active exploration and uncertainty reduction.
allocate_resource +0.05 Rewards high-impact allocation (Severity โ‰ฅ 4).
step_increment -0.005 Implicit penalty for every action taken, enforcing efficiency.

Tip

Reward Engineering: These signals are balanced to ensure that "blindly" allocating (without info) or "spamming" info requests result in a lower aggregate score than focused, informed execution.

3. Final Grader Formula

The terminal evaluation is computed using the following weighted objective function:

$$Final Score = (0.35 ร— Prioritization) + (0.40 ร— Efficiency) + (0.25 ร— Utilization)$$
  • Safety Guard: Final scores are strictly bounded to [0.01, 0.99] for OpenEnv compliance.
  • Reproducibility: Evaluation is stateless and purely deterministic (Fixed Input โ†’ Fixed Output).

4. Grader Components (The 3 Axes)

Metric Weight Description Behavior Encouraged
Prioritization 35% Evaluates if critical zones (Severity 4-5) were served first. Severity-Aware Triage
Efficiency 40% Measures demand satisfaction accuracy per resource type. High-Precision Logistics
Utilization 25% Penalizes resource waste and excessive over-allocation. Stewardship under Scarcity

5. Mathematical Definitions

A. Severity-Weighted Prioritization

To simulate high-stakes triage, zone importance is scaled exponentially:

$$Zone Weight = 2 ^ {true\_severity}$$

Effect: A Severity 5 zone is 16x more important than a Severity 1 zone, forcing the agent to ignore low-priority background noise.

B. Allocation Efficiency

Measures the ratio of useful allocation vs. the total environmental demand:

$$Efficiency = \frac{\sum \min(allocated, demand)}{\sum demand}$$

C. Resource Utilization (Waste Management)

Stewardship is measured by minimizing useless "Waste" (allocation exceeding demand):

$$Waste = \sum \max(allocated - demand, 0)$$ $$Utilization = 1 - \left( \frac{Waste}{Total Allocated} \right)$$

6. Key Properties of the Grader

  • Un-gameable: Final evaluation uses true "Hidden" values that the agent never sees directly.
  • Randomness Recovery: Blind guessing or random allocation distributions yield scores below 0.15.
  • Sustainability: Heavy over-allocation (e.g., dumping all food on one zone) causes the Utilization score to collapse, capping the total reward.
  • Dominance: High-severity zones dominate the weighted average; ignoring a Severity 5 zone makes a score > 0.6 impossible.

๐Ÿš€ Why This Grader is Different

Traditional RL benchmarks (Atari, Gym) typically optimize for a single scalar reward. The Disaster Triage environment introduces Resource-Constrained Optimization through three competing metrics:

  1. Multi-Objective Optimization: Agents must balance meeting demand (Efficiency) against not overspending (Stewardship).
  2. Strategic Reasoning: Because actions are irreversible, the agent must commit to a plan based on its revealed knowledge.
  3. Real-World Modeling: By using severity-weighted scoring, we simulate the ethical and operational realities of a Disaster Response Coordinator.

๐Ÿค– Baseline Agent Execution Phases

Phase 1: Information Gathering

  • Request info for unrevealed zones with highest urgency
  • Prioritize by urgency_signal (noisy proxy for severity)
  • Skip this phase in Easy mode

Phase 2: Allocation

  • Sort revealed zones by severity (descending)
  • Allocate resources in priority order: medicine > water > food
  • For each zone: amount = min(demand - already_allocated, available)
  • Continue until resources exhausted or target steps reached

Phase 3: Finalization

  • Call finalize() at target step count
  • Triggers grader computation
  • Returns final episode reward

๐Ÿ“Š Sample Output

[START] task=medium env=disaster-triage-env model=llama-3.1-8b-instant
[STEP] step=1 action={"action_type":"request_info","zone_id":"Z2"} reward=0.0200 done=false error=null
[STEP] step=2 action={"action_type":"allocate_resource","zone_id":"Z2","resource_type":"food","amount":35.0} reward=0.0500 done=false error=null
[END] success=true steps=10 score=0.684 rewards=0.02,0.05,0.45...

๐Ÿ”น Setup & Usage

๐Ÿš€ Local Setup

# 1. Install dependencies
pip install -r requirements.txt

# 2. Configure variables (see .env.example)
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
export HF_TOKEN="your_token_here"

# 3. Start Environment Server
python server/app.py

# 4. Run Baseline Inference
python inference.py --task medium

๐Ÿณ Docker Instructions

docker build -t disaster-triage .
docker run -p 7860:7860 --env-file .env disaster-triage

๐Ÿ”น API Endpoints

  • POST /reset: Initialize or restart a session (returns first observation).
  • POST /step: Execute an agent action.
  • GET /state: Global ground-truth view (for debugging/diagnostic only).
  • GET /health: Server liveness and active session summary.

๐Ÿ”น Performance Constraints

  • Runtime: Full evaluation (Easy, Medium, Hard) runs in < 20 minutes.
  • Hardware: Designed to run on standard 2 vCPU / 8GB RAM instances.

๐Ÿ”น Creativity & Uniqueness

  • Beyond Toy Problems: This environment forces agents to contend with genuine operational fog, simulating logistics rather than arcade physics.
  • Agentic Stress Test: The tight step budgets and exploration costs separate simple chatbots from capable action-oriented agents.
  • Logistics Modeling: Every detail, from the resource types to the noise in the urgency signal, is designed to mirror real-world supply chain optimization during a crisis.

Disaster Triage Environment โ€” Moving Intelligence from Chat to Execution.

About

Disaster Triage Environment Deterministic OpenEnv benchmark for evaluating AI agents on resource-constrained decision-making under uncertainty in disaster response scenarios.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors