| title | OpenEnv AI Ops Lab |
|---|---|
| emoji | 🤖 |
| colorFrom | indigo |
| colorTo | blue |
| sdk | docker |
| app_port | 7860 |
| tags | |
Real-world OpenEnv environment for enterprise email operations.
An agent must triage incoming support/compliance emails using respond, escalate, or archive while minimizing risk and maximizing completion quality.
Email triage is a real operational workflow in support, trust/safety, and compliance teams:
- classify routine vs high-risk requests,
- respond safely when appropriate,
- escalate legal/payment/security issues,
- avoid destructive or low-value actions.
This repository implements a full OpenEnv-style environment with typed models, deterministic task graders, and reproducible baseline inference.
Core API is implemented in `env/environment.py`:
- `reset() -> Observation`
- `step(action: Action) -> (Observation, Reward, done, info)`
- `state() -> dict`
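The `reset`/`step` contract above implies a conventional rollout loop. The following is a minimal sketch of such a loop, not the repository's code: `EnvLike`, `run_episode`, and the `CountdownEnv` stand-in are hypothetical names introduced here so the example runs without the real `env/environment.py`.

```python
from typing import Any, Callable, Protocol, Tuple


class EnvLike(Protocol):
    """Anything exposing the reset/step/state surface described above."""

    def reset(self) -> Any: ...
    def step(self, action: Any) -> Tuple[Any, Any, bool, dict]: ...
    def state(self) -> dict: ...


def run_episode(env: EnvLike, policy: Callable[[Any], Any], max_steps: int) -> list:
    """Roll out one episode and return the per-step rewards."""
    rewards = []
    observation = env.reset()
    for _ in range(max_steps):
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards


class CountdownEnv:
    """Hypothetical stand-in: ends after `n` steps and pays 1.0 per step."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.remaining = n

    def reset(self) -> dict:
        self.remaining = self.n
        return {"remaining_count": self.remaining}

    def step(self, action: Any) -> Tuple[dict, float, bool, dict]:
        self.remaining -= 1
        done = self.remaining == 0
        return {"remaining_count": self.remaining}, 1.0, done, {}

    def state(self) -> dict:
        return {"remaining": self.remaining}
```

With the real environment, the same loop would presumably be called as `run_episode(env, policy, max_steps)` using the task's step budget; the stand-in only exists to make the control flow checkable in isolation.
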
Typed models are in `env/models.py`:
- `Observation`
- `Action`
- `Reward`

`Observation` contains:
- `inbox`: pending email objects (`id`, `subject`, `body`, `priority`, `sender`, `category`)
- `current_email`: the next pending email (or `None`)
- `history`: processed actions (`<email_id>:<action_type>`)
- `step_count`
- `remaining_count`
- `instructions`

`Action` contains:
- `type`: one of `respond | escalate | archive`
- `email_id`: target email id (optional, defaults to current email)
- `content`: response text (used for `respond`)
- `rationale`: optional trace field
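To make the field lists above concrete, here is a rough sketch of how such models could look using standard-library dataclasses. The field names come from the lists above; the field types, the defaults, the `Email` class name, and the `resolve_email_id` helper are assumptions, and the actual `env/models.py` may use a different modelling library.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ALLOWED_ACTION_TYPES = ("respond", "escalate", "archive")


@dataclass
class Email:
    # Field types are assumed; only the names are taken from the README.
    id: str
    subject: str
    body: str
    priority: str
    sender: str
    category: str


@dataclass
class Observation:
    inbox: List[Email]
    current_email: Optional[Email]
    history: List[str] = field(default_factory=list)  # "<email_id>:<action_type>"
    step_count: int = 0
    remaining_count: int = 0
    instructions: str = ""


@dataclass
class Action:
    type: str
    email_id: Optional[str] = None
    content: str = ""
    rationale: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject anything outside respond | escalate | archive.
        if self.type not in ALLOWED_ACTION_TYPES:
            raise ValueError(f"unknown action type: {self.type!r}")


def resolve_email_id(action: Action, observation: Observation) -> Optional[str]:
    """Fall back to the current email when the action names no target."""
    if action.email_id is not None:
        return action.email_id
    if observation.current_email is not None:
        return observation.current_email.id
    return None
```
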
Reward is dense and trajectory-aware.
`Reward.value` is clipped to `[-1.0, 1.0]`, with components:
- `decision_quality`: correct routing reward
- `response_quality`: keyword-based quality for response actions
- `safety_penalty`: penalties for risky routing (e.g. not escalating legal/payment/security)
- `loop_penalty`: penalties for invalid/repeated actions
- `completion_bonus`: reward for finishing all emails

This gives a partial progress signal while still penalizing unsafe behavior.
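One plausible way to combine these components is a signed sum followed by clipping. The sketch below assumes that penalties are stored as non-negative magnitudes and subtracted, and that all components are weighted equally; neither detail is stated in this README, and `RewardComponents` and `total_reward` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RewardComponents:
    decision_quality: float = 0.0
    response_quality: float = 0.0
    safety_penalty: float = 0.0    # assumed: non-negative magnitude
    loop_penalty: float = 0.0      # assumed: non-negative magnitude
    completion_bonus: float = 0.0


def total_reward(c: RewardComponents) -> float:
    """Sum the bonuses, subtract the penalties, then clip to [-1.0, 1.0]."""
    raw = (
        c.decision_quality
        + c.response_quality
        + c.completion_bonus
        - c.safety_penalty
        - c.loop_penalty
    )
    return max(-1.0, min(1.0, raw))
```

Clipping after summation keeps every step's value in a fixed range, so a single large penalty or bonus cannot dominate the trajectory.
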
Task definitions are in `env/tasks.py`, graders in `env/grader.py`.
- Easy (`email-triage-easy`)
  - routine support + low-risk inbox handling
  - `max_steps=6`
- Medium (`email-triage-medium`)
  - mixed routine + security/payment escalation decisions
  - `max_steps=9`
- Hard (`email-triage-hard`)
  - legal/compliance/security heavy triage
  - `max_steps=12`
Episode score is normalized to [0.0, 1.0] via deterministic grader logic.
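The three task budgets can be summarized as a small lookup table. This is only an illustrative sketch: `TaskSpec`, `TASKS`, and `clamp_score` are hypothetical names, and `clamp_score` shows just the output range of the grader, not how the grader in `env/grader.py` actually computes a score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    task_id: str
    difficulty: str
    max_steps: int


# Task ids and step budgets as listed above.
TASKS = {
    spec.task_id: spec
    for spec in (
        TaskSpec("email-triage-easy", "easy", 6),
        TaskSpec("email-triage-medium", "medium", 9),
        TaskSpec("email-triage-hard", "hard", 12),
    )
}


def clamp_score(raw: float) -> float:
    """Clamp a raw grader value into the [0.0, 1.0] range."""
    return max(0.0, min(1.0, raw))
```
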
Phase 2 expects a root-level `inference.py` that:
- Uses the OpenAI Python client with the injected proxy:
  - `base_url` from `API_BASE_URL`
  - `api_key` from `API_KEY` or `HF_TOKEN` (either is accepted)
- Runs the three tasks (`EasyTask`, `MediumTask`, `HardTask`) against `EmailEnv`.
- Prints structured lines to stdout (with `flush=True`), one episode per task:
  - `[START] task=... env=... model=...`
  - `[STEP] step=... action=... reward=... done=... error=...` (once per `env.step`)
  - `[END] success=... steps=... score=... rewards=...`
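The structured stdout lines can be produced with a few small formatters. The key names below follow the list above; the value formatting (two decimals, lowercase booleans, `null` for no error, comma-joined rewards) is an assumption, and the function names are hypothetical rather than taken from `inference.py`.

```python
from typing import List, Optional


def format_start(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def format_step(step: int, action: str, reward: float, done: bool,
                error: Optional[str] = None) -> str:
    # Assumed conventions: two decimals, lowercase booleans, "null" when no error.
    err = error if error is not None else "null"
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={str(done).lower()} error={err}"
    )


def format_end(success: bool, steps: int, score: float, rewards: List[float]) -> str:
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return (
        f"[END] success={str(success).lower()} steps={steps} "
        f"score={score:.2f} rewards={joined}"
    )


def emit(line: str) -> None:
    """Print immediately so a log consumer sees each line as it is produced."""
    print(line, flush=True)
```

Whatever the exact value formatting, check it against the platform's parser before submitting, since this README only fixes the key names.
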
Define these in the platform’s environment configuration (or a local .env for testing):
| Variable | Purpose |
|---|---|
| `API_BASE_URL` | LLM API base URL (LiteLLM / OpenAI-compatible proxy). |
| `API_KEY` or `HF_TOKEN` | Auth for the proxy; `inference.py` uses one or the other. |
| `MODEL_NAME` | Model id passed to `chat.completions.create`. |
| `LOCAL_IMAGE_NAME` | Only required if you load the env via `from_docker_image(...)`; otherwise optional. |
Do not hardcode secrets in the repo; rely on injected env vars in CI.
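Reading these variables in one place makes missing configuration fail early. The sketch below is an assumption about how that could be done, not the code in `inference.py`: `load_llm_config` is a hypothetical helper, and preferring `API_KEY` over `HF_TOKEN` when both are set is a choice made here.

```python
import os
from typing import Mapping, Optional


def load_llm_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect proxy settings from the environment, failing loudly if any are absent."""
    env = os.environ if env is None else env
    base_url = env.get("API_BASE_URL")
    # Assumed precedence: API_KEY first, then HF_TOKEN.
    api_key = env.get("API_KEY") or env.get("HF_TOKEN")
    model = env.get("MODEL_NAME")
    missing = [
        name
        for name, value in (
            ("API_BASE_URL", base_url),
            ("API_KEY or HF_TOKEN", api_key),
            ("MODEL_NAME", model),
        )
        if not value
    ]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {"base_url": base_url, "api_key": api_key, "model": model}
```

The returned `base_url` and `api_key` can then be passed to the OpenAI Python client constructor, and `model` to `chat.completions.create`.
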
baseline/run_baseline.py runs all three tasks and prints plain score lines (not the submission stdout format).
The baseline router defaults to provider openai and maps legacy local / groq names to the same deterministic fallback client used for offline runs.
If a remote call fails, stderr logs once and the run falls back to the same heuristic as the offline client (no label oracle).
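The log-once-then-fall-back behaviour described above can be sketched as a small wrapper. Everything here is illustrative: `FallbackRouter` and `keyword_heuristic` are hypothetical names, the keyword rule is a placeholder for the repository's actual heuristic, and skipping the remote client after its first failure is one interpretation of "logs once".

```python
import sys
from typing import Callable


class FallbackRouter:
    """Try a remote decision function; after one failure, use the heuristic only."""

    def __init__(self, remote: Callable[[str], str],
                 heuristic: Callable[[str], str]) -> None:
        self.remote = remote
        self.heuristic = heuristic
        self.remote_failed = False

    def decide(self, email_text: str) -> str:
        if not self.remote_failed:
            try:
                return self.remote(email_text)
            except Exception as exc:
                # Log a single time, then stop calling the remote client.
                self.remote_failed = True
                print(f"remote call failed, using heuristic fallback: {exc}",
                      file=sys.stderr)
        return self.heuristic(email_text)


def keyword_heuristic(email_text: str) -> str:
    """Placeholder rule: escalate on risky keywords, otherwise respond."""
    text = email_text.lower()
    if any(word in text for word in ("legal", "payment", "security")):
        return "escalate"
    return "respond"
```
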
Run baseline locally:
```
py -m pip install -r requirements.txt
py baseline/run_baseline.py
```

Expected baseline output format:

```
email-triage-easy: 0.xxxx
email-triage-medium: 0.xxxx
email-triage-hard: 0.xxxx
overall: 0.xxxx
```
Baseline scores (deterministic offline provider; intentionally imperfect; no external API calls):
```
email-triage-easy: 0.8750
email-triage-medium: 0.7129
email-triage-hard: 0.8240
overall: 0.8040
```
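The reported `overall` value matches the unweighted mean of the three task scores: (0.8750 + 0.7129 + 0.8240) / 3 ≈ 0.8040. The sketch below reproduces the score lines under that assumption; `format_scores` is a hypothetical helper, not the function used in `baseline/run_baseline.py`.

```python
def format_scores(scores: dict) -> list:
    """Render one line per task plus an unweighted-mean overall line."""
    lines = [f"{task}: {score:.4f}" for task, score in scores.items()]
    overall = sum(scores.values()) / len(scores)
    lines.append(f"overall: {overall:.4f}")
    return lines


baseline_scores = {
    "email-triage-easy": 0.8750,
    "email-triage-medium": 0.7129,
    "email-triage-hard": 0.8240,
}
report = format_scores(baseline_scores)
```
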
Install dependencies:

```
py -m pip install -r requirements.txt
```

Optional API/UI stack:

```
py -m uvicorn backend.main:app --host 0.0.0.0 --port 8000
py frontend/app.py
```

Environment metadata is declared in `openenv.yaml`.
Validate with:

```
openenv validate
```

Pre-submission repo validation (runs local spec and API smoke checks):

```
py scripts/pre_submission_check.py
```

The Dockerfile uses a public mirror base image (`public.ecr.aws/docker/library/python:3.10-slim`) to reduce Docker Hub pull issues in CI.
Build (requires Docker Desktop or another engine running):

```
docker build -t openenv-email-ops:latest .
```

Set `LOCAL_IMAGE_NAME=openenv-email-ops:latest` only if your workflow uses that image name with `from_docker_image(...)`.
Run:

```
docker run --env-file .env -p 8000:8000 -p 7860:7860 openenv-email-ops:latest
```

The container runs on port 7860:
- UI: `GET /`
- API: `GET /health` returns healthy status
- API: `POST /reset?task_id=email-triage-easy` returns the initial observation
- API: `POST /step` accepts an `Action` and returns `{observation, reward, done, info}`
- API: `GET /state` returns the current internal state
- API: `GET /tasks` returns available tasks + the action schema
- API: `GET /baseline` runs the repo baseline over all 3 tasks
- API: `POST /grader?task_id=...&provider=openai` runs one episode and returns a deterministic score in `0.0–1.0`
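For scripted smoke tests against a running container, the `/reset` and `/step` endpoints can be called with the standard library alone. This is a sketch under assumptions: it supposes that `/step` takes the `Action` fields as a JSON body and that responses are JSON, neither of which is confirmed by this README, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_reset_request(base_url: str, task_id: str) -> urllib.request.Request:
    """POST /reset with the task id as a query parameter."""
    query = urllib.parse.urlencode({"task_id": task_id})
    return urllib.request.Request(f"{base_url}/reset?{query}", method="POST")


def build_step_request(base_url: str, action: dict) -> urllib.request.Request:
    """POST /step with the action as a JSON body (assumed payload shape)."""
    body = json.dumps(action).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container started as shown above, `send(build_reset_request("http://localhost:7860", "email-triage-easy"))` would be expected to return the initial observation, followed by `send(build_step_request(...))` calls until `done` is true.
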
HF Space notes:
- SDK: `Docker`
- Add repo tag: `openenv`
- Set secrets / environment variables for inference and the app as required by the platform, for example:
  - `API_BASE_URL`
  - `API_KEY` or `HF_TOKEN`
  - `MODEL_NAME`
  - Optional: `LOCAL_IMAGE_NAME` (only if using docker-image–based env loading)