---
title: HelixDesk OpenEnv
emoji: 📧
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
tags:
---
HelixDesk OpenEnv is a complete Gymnasium-style reinforcement learning environment in which an AI agent named HelixDesk learns to manage customer email queues by interacting with a realistic simulation of a company's complaint-management system. It is fully compatible with standard RL libraries, including Stable-Baselines3, RLlib, and CleanRL.
```mermaid
graph TD
A[Incoming Emails] --> B[HelixDesk AI Agent]
B -->|Action: Classify| C[Query / Complaint / Flag]
B -->|Action: Priority| D[Critical - Normal]
B -->|Action: Assign| E[Employee 1-5]
B -->|Action: Secondary| F[KB Auto-reply / GM Alert]
C --> G((Environment State))
D --> G
E --> G
F --> G
G -->|Reward| B
style B fill:#4f46e5,stroke:#312e81,stroke-width:2px,color:#fff
style G fill:#0891b2,stroke:#164e63,stroke-width:2px,color:#fff
```
- **State:** A 42-dimensional observation vector encoding the current email's features (sentiment, category, customer tier, keyword flags), the support queue state (priority counts, overdue tickets), team workload (5 employees' loads and resolve times), SLA pressure, complaint volume trends, simulated time, and episode progress.
- **Action:** A 4-part decision for each incoming email: classification (query / complaint / flag for review), priority assignment (critical / high / medium / normal), employee assignment (5 employees or none), and a secondary action (auto-reply from KB / alert GM / none).
- **Reward:** A composite signal from 12 distinct components: correct classification (+0.5), timely resolution (+1.0), high CSAT (+0.8), trend prevention (+0.6), workload balance (+0.4), KB updates (+0.3), and penalties for missed deadlines (−1.0), bad auto-replies (−0.8), unnecessary escalations (−0.6), misclassification (−0.5), reopened complaints (−0.4), and missed keyword flags (−0.3). Total reward is clipped to [−1.0, +1.0] per step.
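The aggregation above can be sketched as a weighted sum over triggered signals followed by clipping. This is an illustrative reconstruction from the listed weights, not the actual code in `rewards.py`:

```python
# Component weights copied from the reward description above.
REWARD_WEIGHTS = {
    "resolve_on_time": 1.0,
    "csat_high": 0.8,
    "trend_prevented": 0.6,
    "correct_classification": 0.5,
    "balanced_assignment": 0.4,
    "kb_updated": 0.3,
    "missed_deadline": -1.0,
    "bad_autoreply": -0.8,
    "unnecessary_escalation": -0.6,
    "misclassification": -0.5,
    "complaint_reopened": -0.4,
    "keyword_flag_missed": -0.3,
}

def composite_reward(fired: set[str]) -> float:
    """Sum the weights of all signals triggered this step, clip to [-1, 1]."""
    total = sum(REWARD_WEIGHTS[name] for name in fired)
    return max(-1.0, min(1.0, total))
```

Note that because the per-step total is clipped, earning several positive signals at once (e.g. `resolve_on_time` plus `csat_high`) saturates at +1.0.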
```bash
cd helixdesk-openenv
pip install -r requirements.txt

# Rule-based baseline
python train.py --agent rule --episodes 100

# Random baseline
python train.py --agent random --episodes 100

# PPO via Stable-Baselines3
pip install stable-baselines3
python train.py --agent sb3 --episodes 500

# Evaluate
python evaluate.py --agent rule --episodes 100
```

`HelixDeskEnv` passes `gymnasium.utils.env_checker.check_env()` with 0 errors:
```python
from gymnasium.utils.env_checker import check_env
from helixdesk import HelixDeskEnv

env = HelixDeskEnv()
check_env(env)  # passes with no errors
```

Compatible with any Gymnasium-based training library:
- **Stable-Baselines3:** `PPO("MlpPolicy", HelixDeskEnv())`
- **CleanRL:** use the env like any standard Gymnasium env
- **RLlib:** register with `gymnasium.register()`
| Group | Dims | Description |
|---|---|---|
| Current Email | 0–9 | Sentiment, keyword flag, customer tier (3-hot), category (5-slot overflow encoding) |
| Queue State | 10–14 | Normalized counts: critical, high, medium, normal, pending review |
| Team State | 15–24 | 5 employees × (load_norm, avg_resolve_norm) |
| SLA State | 25–28 | Overdue norm, near-deadline norm, SLA pressure, critical overdue flag |
| Trend State | 29–36 | 8 categories × growth rate fraction [−1, 1] |
| Time State | 37–38 | Hour of day / 24, day of week / 7 |
| Episode Progress | 39–41 | Steps remaining norm, episode reward norm, agent confidence |
All values are normalized to [−1.0, 1.0]. Observation space: `Box(low=-1, high=1, shape=(42,), dtype=float32)`.
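The group boundaries in the table can be captured as slices over the flat vector. A sketch for inspecting observations, assuming the layout above; the real definitions live in `spaces.py`:

```python
# Illustrative index map for the 42-dim observation vector described above.
OBS_SLICES = {
    "current_email":    slice(0, 10),   # sentiment, keyword flag, tier, category
    "queue_state":      slice(10, 15),  # critical/high/medium/normal/pending counts
    "team_state":       slice(15, 25),  # 5 employees x (load, avg resolve time)
    "sla_state":        slice(25, 29),  # overdue, near-deadline, pressure, flag
    "trend_state":      slice(29, 37),  # 8 category growth rates
    "time_state":       slice(37, 39),  # hour/24, weekday/7
    "episode_progress": slice(39, 42),  # steps left, reward, confidence
}

OBS_SIZE = max(s.stop for s in OBS_SLICES.values())  # 42

def split_observation(obs):
    """Split a flat 42-dim observation into named groups for debugging."""
    assert len(obs) == OBS_SIZE
    return {name: obs[s] for name, s in OBS_SLICES.items()}
```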
| Dim | Choices | Description |
|---|---|---|
| 0: Classification | 0=query, 1=complaint, 2=flag_for_review | How to classify the current email |
| 1: Priority | 0=critical, 1=high, 2=medium, 3=normal | Priority level (complaints only) |
| 2: Assignment | 0–4=employee_0..4, 5=no_assignment | Who handles it (complaints only) |
| 3: Secondary | 0=auto_reply_from_kb, 1=alert_gm, 2=none | Additional action |
**Rule:** If classification = `flag_for_review`, dims 1/2/3 are forced to (`normal`, `no_assignment`, `none`).
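The table corresponds to a `MultiDiscrete([3, 4, 6, 3])` action space, and the flag-for-review rule can be sketched as a post-processing step. The constants below mirror the table; the actual enforcement happens inside the environment:

```python
# Index constants taken from the action table above.
FLAG_FOR_REVIEW = 2   # dim 0
PRIORITY_NORMAL = 3   # dim 1
NO_ASSIGNMENT = 5     # dim 2
SECONDARY_NONE = 2    # dim 3

def apply_flag_rule(action):
    """If classification == flag_for_review, force the remaining dims
    to (normal, no_assignment, none), as the rule above states."""
    classification, priority, assignment, secondary = action
    if classification == FLAG_FOR_REVIEW:
        return (classification, PRIORITY_NORMAL, NO_ASSIGNMENT, SECONDARY_NONE)
    return (classification, priority, assignment, secondary)
```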
| Signal | Value | Condition |
|---|---|---|
| `resolve_on_time` | +1.0 | Employee resolves ticket within SLA |
| `csat_high` | +0.8 | CSAT score ≥ 4 on resolved ticket |
| `trend_prevented` | +0.6 | GM alerted during category surge |
| `correct_classification` | +0.5 | Classification matches ground truth |
| `balanced_assignment` | +0.4 | Workload std decreased |
| `kb_updated` | +0.3 | Knowledge base learned new entry |
| `missed_deadline` | −1.0 | Ticket missed SLA deadline |
| `bad_autoreply` | −0.8 | CSAT score ≤ 2 |
| `unnecessary_escalation` | −0.6 | Flagged for review despite low complexity |
| `misclassification` | −0.5 | Classification doesn't match ground truth |
| `complaint_reopened` | −0.4 | Complaint reopened after resolution |
| `keyword_flag_missed` | −0.3 | Keyword-flagged email not treated as complaint/critical |
- In `config.yaml`, set `env.n_employees: 6`
- The observation space grows by 2 dims (new employee load + resolve time)
- Update `spaces.py` accordingly (increase `OBS_SIZE` and add employee dims)
- Update action space dim 2 to `7` (6 employees + no_assignment)
- Add the category name to `email_gen.categories` in `config.yaml`
- Add 5 query + 5 complaint templates in `email_gen.py`
- Add 3 KB entries in `knowledge_base.py`
- Trend state dims grow by 1
All parameters in `config.yaml` propagate through without code changes:

- Adjust `episode_emails` for longer/shorter episodes
- Modify reward weights to shape different agent behaviours
- Change `sla.*_hours` to tighten or relax deadlines
- Adjust `employee_sim.base_resolve_rate` for harder/easier simulation
```
helixdesk-openenv/
├── helixdesk/
│   ├── __init__.py              # exports HelixDeskEnv
│   ├── env.py                   # main environment class
│   ├── models.py                # Pydantic typed wrappers (HelixObservation, HelixAction, HelixReward)
│   ├── spaces.py                # observation & action space definitions
│   ├── rewards.py               # reward function
│   ├── simulator/               # simulation components
│   │   ├── clock.py             # simulated time
│   │   ├── email_gen.py         # synthetic email generation
│   │   ├── employee_sim.py      # employee behaviour model
│   │   ├── knowledge_base.py    # KB lookup & auto-learn
│   │   └── trend_watchdog.py    # volume surge detection
│   ├── agents/                  # baseline agents
│   │   ├── base_agent.py        # abstract agent interface
│   │   ├── random_agent.py      # random baseline
│   │   └── rule_agent.py        # deterministic rule-based agent
│   └── monitor/                 # logging & visualization
│       ├── episode_logger.py    # CSV per-step logger
│       └── terminal_dashboard.py # Rich live dashboard
├── tasks/                       # graded task definitions
│   ├── easy_classify.py         # keyword-flag classification (easy)
│   ├── medium_sla.py            # SLA compliance rate (medium)
│   ├── hard_trend.py            # trend detection + CSAT (hard)
│   └── expert_full.py           # full expert evaluation (expert)
├── tests/                       # pytest test suite
├── train.py                     # training entry point
├── evaluate.py                  # evaluation with rich table output
├── baseline.py                  # GPT-4o + rule + random baseline runner
├── inference.py                 # mandatory hackathon inference script
├── config.yaml                  # all configurable parameters
├── openenv.yaml                 # OpenEnv manifest
├── Dockerfile                   # container image
├── requirements.txt             # Python dependencies
└── README.md                    # this file
```
HelixDesk OpenEnv ships with 4 graded tasks of increasing difficulty. Each task's `grade(env, agent)` function returns a score in `[0.0, 1.0]`.
| Task | Difficulty | Scoring Criteria |
|---|---|---|
| `easy` | 🟢 Easy | Run 20 emails. Score = fraction of keyword-flagged emails correctly classified as complaint with critical priority. |
| `medium` | 🟡 Medium | Run 1 full episode (100 emails). Score = fraction of tickets resolved within SLA deadline. |
| `hard` | 🔴 Hard | Run 1 full episode. Score = avg of (trend alerts caught / surge events, CSAT / 4.5, overdue control). |
| `expert` | ⚫ Expert | Geometric mean of keyword score × classification accuracy × review abuse rate. One weakness tanks the whole score. |
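The expert task's geometric-mean aggregation can be sketched as follows. The subscore names are illustrative (the third is phrased here as "review discipline" so that higher is better); the point is that any subscore near zero collapses the final grade, unlike an arithmetic mean:

```python
import math

def expert_score(keyword: float, classification: float,
                 review_discipline: float) -> float:
    """Geometric mean of three subscores in [0, 1].

    A single weak subscore drags the whole grade toward zero,
    which is why 'one weakness tanks the whole score'.
    """
    subscores = [keyword, classification, review_discipline]
    assert all(0.0 <= s <= 1.0 for s in subscores)
    return math.prod(subscores) ** (1 / len(subscores))
```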
```bash
# Run all tasks against rule + random baselines
python baseline.py
```

We evaluated the baseline agents across 3 fixed seeds (42, 100, 2026) to ensure reproducibility. The results show that while the deterministic rule agent performs well on simple classification, it struggles on the adversarial routing tasks (Hard / Expert) due to intentionally injected conflicting signals (ambiguous texts) and delayed-consequence penalties in the environment.
| Task | Random Agent (n=3) | Rule-based Agent (n=3) | Metric Type |
|---|---|---|---|
| easy | 0.040 ± 0.02 | 1.000 ± 0.00 | Strict priority assignment |
| medium | 0.354 ± 0.04 | 0.865 ± 0.03 | SLA Compliance % |
| hard | 0.455 ± 0.06 | 0.490 ± 0.20 | Trend isolation & Ambiguity resolution |
| expert | 0.210 ± 0.05 | 0.550 ± 0.15 | Geometric mean of workload balance + SLAs |
Run `python baseline.py` to reproduce these results locally.
```bash
# Build
docker build -t helixdesk-openenv .

# Start the web dashboard and API server
docker run --rm -p 7860:7860 helixdesk-openenv

# Run evaluation instead
docker run --rm -p 7860:7860 helixdesk-openenv python evaluate.py --agent rule --episodes 100
```

Run our pre-configured test suite to verify full compliance with the Meta PyTorch OpenEnv harness requirements.
```
$ python -m pytest tests/test_validation.py -v
collected 4 items

tests/test_validation.py::test_endpoints PASSED                  [ 25%]
tests/test_validation.py::test_manifest_validation PASSED        [ 50%]
tests/test_validation.py::test_inference_script_format PASSED    [ 75%]
tests/test_validation.py::test_grader_consistency PASSED         [100%]

======================== 4 passed in 22.78s ========================
```

- **Inference Script Format:** The stdout logs rigorously follow the `[START]`, `[STEP]`, and `[END]` syntax required by the OpenEnv validation harness.
- **Grader Consistency:** Graders execute deterministically based on seed injection, returning strict, reproducible scores in `[0, 1]`.
- **API Endpoints:** The FastAPI application behind the Docker entry point (`app:app`) properly handles POST `/reset`, POST `/step`, and POST `/grader`.
```bash
pytest tests/ -v
```

Live demo: https://huggingface.co/spaces/nottherajyk/helixdesk-openenv
The Space runs the rule-based and random agents interactively in your browser. No install required.
MIT