VitaBench Diff#33
Pull request overview
This PR adds an overlay on top of the upstream VitaBench repository to support more robust OTA benchmarking runs. It includes online soundness/completeness verification, deterministic solo-user message playback, and supporting offline preprocessing scripts and prompt templates.
Changes:
- Adds an OTA verifier (LLM-based soundness + rule-based completeness) and wires it into orchestration/run outputs.
- Introduces offline scripts to pre-resolve relative dates, pre-extract constraints/completeness artifacts, and pregenerate deterministic solo-user opening messages.
- Hardens JSON extraction in evaluator utilities and extends/updates evaluator + prompt templates for the new verification flow.
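The JSON-extraction hardening can be pictured with a small sketch. It follows the idea stated in a code comment quoted from the diff ("find balanced top-level { } blocks, take the last valid one"), but it is an illustration under assumptions, not the PR's implementation: the function name `extract_last_json_object` and the escape handling are hypothetical.

```python
import json


def extract_last_json_object(text: str):
    """Return the last balanced top-level {...} block that parses as JSON, else None.

    Hypothetical helper: a sketch of the fallback strategy, not the PR's code.
    """
    candidates = []
    depth = 0
    start = -1
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            # Inside a JSON string: braces do not count toward nesting depth.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"' and depth > 0:
            # Only track strings inside an object, so stray quotes in prose are ignored.
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                candidates.append(text[start : i + 1])
    # Prefer the last block, falling back to earlier ones if it is not valid JSON.
    for candidate in reversed(candidates):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None
```

Taking the last valid block suits model outputs that reason first and give the final answer at the end; a different policy (first block, largest block) may fit other prompts better.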
Reviewed changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| examples/AgenticBenchmarks/VitaBench/src/vita/utils/utils.py | Utility overlay including hardened JSON extraction and task/path helpers. |
| examples/AgenticBenchmarks/VitaBench/src/vita/user/user_simulator.py | Adds deterministic solo-user message support via DummyUser. |
| examples/AgenticBenchmarks/VitaBench/src/vita/scripts/README.md | Documents offline helper scripts and expected artifacts. |
| examples/AgenticBenchmarks/VitaBench/src/vita/scripts/preresolve_dates.py | Offline script to resolve relative date phrases into absolute dates. |
| examples/AgenticBenchmarks/VitaBench/src/vita/scripts/pregenerate_solo_messages.py | Offline script to pregenerate deterministic solo-user opening messages. |
| examples/AgenticBenchmarks/VitaBench/src/vita/scripts/preextract_constraints_harness.py | Offline extraction of NL constraints for harness soundness mode. |
| examples/AgenticBenchmarks/VitaBench/src/vita/scripts/preextract_completeness.py | Offline extraction of completeness constraints for OTA tasks. |
| examples/AgenticBenchmarks/VitaBench/src/vita/prompts/soundness_judge_template.yaml | Prompt template for LLM soundness judge. |
| examples/AgenticBenchmarks/VitaBench/src/vita/prompts/solo_agent_system_prompt.yaml | Solo-agent system prompt (language + tool-use guidance). |
| examples/AgenticBenchmarks/VitaBench/src/vita/prompts/harness_soundness_judge_template.yaml | Prompt template for per-constraint harness judging. |
| examples/AgenticBenchmarks/VitaBench/src/vita/prompts/harness_memory_writer_template.yaml | Prompt template for harness memory distillation. |
| examples/AgenticBenchmarks/VitaBench/src/vita/prompts/harness_constraint_extraction_template.yaml | Prompt template for harness constraint extraction. |
| examples/AgenticBenchmarks/VitaBench/src/vita/prompts/date_resolution_template.yaml | Prompt template for date resolution in preprocessing. |
| examples/AgenticBenchmarks/VitaBench/src/vita/prompts/completeness_extraction_template.yaml | Prompt template for completeness constraint extraction. |
| examples/AgenticBenchmarks/VitaBench/src/vita/prompts/agent_system_prompt.yaml | Agent prompt updated to enforce English outputs. |
| examples/AgenticBenchmarks/VitaBench/src/vita/orchestrator/orchestrator.py | Wires verifier into simulation loop (soundness before writes, completeness on stop). |
| examples/AgenticBenchmarks/VitaBench/src/vita/evaluator/evaluator_traj.py | Evaluator tweak (flatten nested rubric output) + window eval metadata. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/verifier/utils.py | Shared verifier helpers (tool history + JSON extraction). |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/verifier/__init__.py | OTAVerifier + factory wiring soundness and completeness subsystems. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/tools_schema.py | Tool schema overlays (incl. override arg documentation). |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_llm/judge.py | LLM-based soundness judging implementation. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_llm/__init__.py | Exports for LLM soundness judge module. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_harness/schema.py | Pydantic schema for harness constraints/memory/judgments. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_harness/memory_store.py | Runtime memory writer for harness mode. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_harness/judge.py | Per-constraint harness judging implementation. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_harness/constraint_extractor.py | Offline constraint extraction for harness mode. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_harness/__init__.py | Harness orchestrator + factory loader for constraints. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/completeness/schema.py | Schema for extracted completeness constraints. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/completeness/constraint_extractor.py | LLM-based completeness constraint extractor. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/completeness/checker.py | Runtime completeness checker over final order state. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/completeness/__init__.py | Completeness module exports. |
| examples/AgenticBenchmarks/VitaBench/src/vita/data_model/simulation.py | RunConfig additions + SimulationRun soundness log attachment. |
| examples/AgenticBenchmarks/VitaBench/src/vita/cli.py | CLI flags added for soundness/completeness/solo-user modes. |
| examples/AgenticBenchmarks/VitaBench/src/vita/agent/llm_agent.py | Solo agent language support + relaxed strict tool-only guard. |
| examples/AgenticBenchmarks/VitaBench/README.md | Top-level overlay instructions and run recipes for the verifier. |
    return results


def evaluator_extracter(content: str) -> list[dict]:

- Completeness: rule-based check that all required bookings exist at stop time.
"""

WRITE_PREFIXES = ("create_", "cancel_", "modify_")

plus observe_tool_response() for memory updates on read calls.
"""

WRITE_PREFIXES = ("create_", "cancel_", "modify_")

# Fallback: find balanced top-level { } blocks, take the last valid one
candidates: list[str] = []
depth = 0
start = -1
in_string = False

tasks = [Task.model_validate(t) for t in raw_tasks]

# Load resolved instructions if provided
resolved = {}
if resolved_instructions_file:
    with open(resolved_instructions_file, "r", encoding="utf-8") as fp:
        resolved = json.load(fp)
    logger.info(f"Loaded {len(resolved)} resolved instructions from {resolved_instructions_file}")

tasks = [Task.model_validate(t) for t in raw_tasks]

"args": {
    "hotel_id": "Hotel ID",
    "product_id": "Room ID",
    "user_id": "User ID",
    "override": "Default false. Only set to true if you received soundness feedback and are confident your action is correct"
},

Produces a JSON mapping {task_id: resolved_instructions} that can be loaded
at runtime to replace the original instructions.

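The `{task_id: resolved_instructions}` mapping described in the quoted docstring could be consumed roughly as follows. This is a minimal sketch under assumptions: the `Task` dataclass is a hypothetical stand-in for the repository's own task model, and the helper names `load_resolved` and `apply_resolved` are illustrative and do not come from the PR.

```python
import json
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for the benchmark's task model."""
    id: str
    instructions: str


def load_resolved(path: Optional[str]) -> dict:
    """Load the {task_id: resolved_instructions} mapping; empty dict if no file is given."""
    if not path:
        return {}
    with open(path, "r", encoding="utf-8") as fp:
        return json.load(fp)


def apply_resolved(tasks: list, resolved: dict) -> list:
    """Swap in resolved instructions where available; leave other tasks untouched."""
    return [
        replace(task, instructions=resolved.get(task.id, task.instructions))
        for task in tasks
    ]
```

Falling back to the original instructions for task IDs missing from the mapping keeps a partially preprocessed file usable, at the cost of silently mixing resolved and unresolved tasks; logging the number of replacements would make that visible.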
amit-sharma left a comment:
Looks good. Added a few comments.
"""
Manages the per-task memory store during simulation.

On each non-write tool call, asks an SLM to distill relevant facts
How do you decide whether a call is a write or a non-write? Is that provided in the dataset?
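Judging from the `WRITE_PREFIXES` tuple quoted earlier in this review, the split appears to be made by tool-name prefix rather than by a dataset label. That is an inference from the snippet, not a confirmed answer from the author. A minimal sketch of such a classifier, where the helper names and the example tool names are hypothetical:

```python
# Tuple copied from the snippet quoted in this review.
WRITE_PREFIXES = ("create_", "cancel_", "modify_")


def is_write_call(tool_name: str) -> bool:
    """Treat a tool call as a write if its name starts with a known write prefix."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return tool_name.startswith(WRITE_PREFIXES)


def route_tool_call(tool_name: str) -> str:
    """Hypothetical routing: writes go to the soundness check, reads update memory."""
    return "soundness_check" if is_write_call(tool_name) else "memory_update"
```

For example, a tool named `create_hotel_order` would be routed to the soundness check, while `search_hotels` would only trigger a memory update. A prefix rule is simple but depends on the tool naming convention staying consistent.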
@@ -0,0 +1,32 @@
name: agent_system_prompt
Do we need to copy this file? What change did you make in it?
@@ -0,0 +1,81 @@
name: completeness_extraction_template
chinese: |-
Same question: were any changes made to the prompt?
@@ -0,0 +1,33 @@
name: harness_memory_writer_template
Why are we adding a Chinese harness memory writer? Can we add only the English one? I ask because we cannot read or verify what the Chinese version says.
Add files to overlay on VitaBench's code.