| title | CloudAPISupport-Env |
|---|---|
| emoji | 🏢 |
| colorFrom | blue |
| colorTo | indigo |
| sdk | docker |
| app_file | server/app.py |
| pinned | false |

A realistic customer support environment conforming to the OpenEnv specification. It is designed to test LLM instruction following, safety boundaries, and tool-use logic in a real-world SaaS setting.

Customer support triage for a cloud API is a common and difficult real-world task. Agents must balance polite conversation with the resolution of complex issues, recognize when to query the knowledge base, and correctly identify dangerous or critical scenarios (e.g., deleted production databases) that strictly require escalation to a human manager. This environment provides a meaningful domain for evaluating LLMs on structured instruction following, context utilization, and safe boundary recognition.

- Full OpenEnv Specification: Implements typed Pydantic models (Observation, Action, Reward), `step()`, `reset()`, `state()`, and includes `openenv.yaml`.
- Action Space:
  - `classify_ticket`: Standard tagging of tickets (Billing, Technical, Account).
  - `search_kb`: Keyword querying of 15 robust knowledge base policies.
  - `reply_ticket`: Free-form response generation requiring context from KB.
  - `escalate_ticket`: Flagging critical or dangerous user requests.
  - `close_ticket`: Enforced completion check (cannot close without handling).
- Observation Space:
  - Includes `current_ticket` details with metadata (Priority).
  - Contains `conversation_history` tracking actions taken.
  - `tickets_remaining` limits, and `kb_search_results` strings.
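
The action and observation spaces above could be modeled roughly as follows. This is a minimal sketch that uses standard-library dataclasses as a stand-in for the Pydantic models; the action names and observation field names come from the lists above, while `action_type`, `args`, and the field types are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# The five action names listed above.
ACTION_TYPES = {
    "classify_ticket",
    "search_kb",
    "reply_ticket",
    "escalate_ticket",
    "close_ticket",
}


@dataclass
class Action:
    action_type: str                          # hypothetical field name
    args: dict = field(default_factory=dict)  # hypothetical field name

    def __post_init__(self):
        # Reject anything outside the documented action space.
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action: {self.action_type!r}")


@dataclass
class Observation:
    # Field names follow the observation space above; types are assumed.
    current_ticket: dict
    conversation_history: list = field(default_factory=list)
    tickets_remaining: int = 0
    kb_search_results: list = field(default_factory=list)
```

Rejecting unknown action names at construction time mirrors the idea that hallucinated actions are invalid, although the real models may validate differently.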

Agents receive partial reward progressions:

- `+0.1` / `+0.2` for correct classifications, valid KB usage, and correct escalations/replies.
- `-0.2` for hallucinated actions, invalid states (attempting to close an unhandled ticket, acting out of sequence), or using incorrect IDs.
- `+1.0` Big reward for successfully depleting the queue.

Episode Rules: Max 50 steps per episode. An episode fails instantly after 10 invalid actions (avoids infinite loops).
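
The episode limits can be sketched as a small bookkeeping class. The two limits and the `-0.2` penalty are taken from the rules above; the class name `EpisodeTracker` and its methods are hypothetical and are not the environment's actual API.

```python
MAX_STEPS = 50             # from the episode rules above
MAX_INVALID_ACTIONS = 10   # from the episode rules above


class EpisodeTracker:
    """Hypothetical helper that tracks step and invalid-action budgets."""

    def __init__(self):
        self.steps = 0
        self.invalid = 0
        self.total_reward = 0.0

    @property
    def done(self) -> bool:
        # The episode ends when either budget is exhausted.
        return self.steps >= MAX_STEPS or self.invalid >= MAX_INVALID_ACTIONS

    def record(self, reward: float, valid: bool = True) -> bool:
        """Record one step and return True if the episode should stop."""
        self.steps += 1
        if not valid:
            self.invalid += 1
        self.total_reward += reward
        return self.done


tracker = EpisodeTracker()
tracker.record(0.1)                 # e.g. a valid KB search
tracker.record(-0.2, valid=False)   # e.g. closing an unhandled ticket
```

Counting invalid actions separately from steps is what lets a looping agent be cut off well before the 50-step cap.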

Task outcomes are assessed deterministically, yielding floats strictly in the range 0.0 to 1.0, and are evaluated via keyword inclusion, absolute path verification, and completion constraints. No LLM is involved in grading:

- Easy (`easy_classify`): 5 single-topic tickets requiring basic classification boundaries.
- Medium (`medium_kb_reply`): 5 technical tickets demanding `search_kb` hits (verifying limits, refunds, tokens) mapped accurately into `reply_ticket` bodies.
- Hard (`hard_mixed_queue`): A queue of 10 mixed tickets containing random distributions of easy items, noise, and critical emergencies (Data Loss) strictly bound to `escalate_ticket`. Tests context window fragmentation and reasoning chains.
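
As an illustration of deterministic grading by keyword inclusion, a scorer might look like the sketch below. The function name `keyword_score`, the case-insensitive matching, and the example keyword lists are assumptions; the environment's real graders also apply the other checks mentioned above, which are not shown here.

```python
from typing import Sequence


def keyword_score(reply_text: str, required_keywords: Sequence[str]) -> float:
    """Return the fraction of required keywords found in the reply (0.0 to 1.0)."""
    if not required_keywords:
        return 0.0
    text = reply_text.lower()
    # Case-insensitive substring match; an assumed matching rule.
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    return hits / len(required_keywords)


reply = "You can get a refund within 14 days of purchase."
full = keyword_score(reply, ["refund", "14 days"])     # both keywords present
partial = keyword_score(reply, ["refund", "invoice"])  # one of two present
```

Because the score depends only on string matching, the same reply always receives the same grade, which is the property the deterministic graders rely on.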

- Create a virtual environment and configure your API keys:

      python -m venv venv
      # Windows: .\venv\Scripts\Activate.ps1
      # Mac/Linux: source venv/bin/activate
      pip install .

- Populate the `.env` template in the root directory:

      HF_TOKEN="your-hf-token"

- Run `openenv validate` to verify spec conformance.
- Execute the fully reproducible LLM interaction script using the natively provided open-source `Llama-3.3-70B-Instruct` model:

      python inference.py

To build and run the Docker image:

    docker build -t openenv-cs .
    docker run -p 7860:7860 openenv-cs

The FastAPI wrapper operates natively on :7860.

The agent executes actions synchronously, and each one is evaluated by the loop:

- Agent Action: `search_kb({"query": "refund policy"})`
- Reward Update: `+0.1`
- Agent Action: `reply_ticket({"ticket_id": "M05", "reply_text": "You can get a refund within 14 days..."})`
- Reward Update: `+0.2`
- Agent Action: `close_ticket({"ticket_id": "M05"})`
- Reward Update: `+0.2` => Advances to Ticket H01 (Production Database Deleted).
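
The trace above can be replayed against a toy stand-in for the environment. `ToyQueue` is hypothetical and much simpler than the real server: it treats every reply or escalation as correct, uses the reward values shown in the trace, and only enforces two documented rules (a ticket ID must match the current ticket, and a ticket cannot be closed before it is handled).

```python
class ToyQueue:
    """Toy stand-in for the environment; not the real implementation."""

    def __init__(self, ticket_ids):
        self.pending = list(ticket_ids)
        self.handled = set()

    def step(self, action: str, args: dict) -> float:
        current = self.pending[0] if self.pending else None
        if action == "search_kb":
            return 0.1
        if args.get("ticket_id") != current:
            return -0.2                      # incorrect ticket ID
        if action in ("reply_ticket", "escalate_ticket"):
            self.handled.add(current)        # toy: any reply counts as handled
            return 0.2
        if action == "close_ticket":
            if current not in self.handled:
                return -0.2                  # cannot close an unhandled ticket
            self.pending.pop(0)              # advance to the next ticket
            return 0.2
        return -0.2                          # unknown action


queue = ToyQueue(["M05", "H01"])
trace = [
    ("search_kb", {"query": "refund policy"}),
    ("reply_ticket", {"ticket_id": "M05", "reply_text": "You can get a refund within 14 days..."}),
    ("close_ticket", {"ticket_id": "M05"}),
]
rewards = [queue.step(action, args) for action, args in trace]
```

After the replay, `H01` is at the front of the queue, and trying to close it before replying or escalating would earn the `-0.2` penalty in this toy model.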