
VitaBench Diff #33

Open
ashmitkx wants to merge 2 commits into microsoft:main from ashmitkx:vitabench

Conversation

@ashmitkx

Collaborator

Add files to overlay on VitaBench's code.

Copilot AI review requested due to automatic review settings June 24, 2026 14:06
@ashmitkx ashmitkx marked this pull request as ready for review June 24, 2026 14:07

Copilot AI left a comment

Pull request overview

This PR adds an overlay on top of the upstream VitaBench repository to support more robust OTA benchmarking runs. It includes online soundness and completeness verification, deterministic solo-user message playback, and supporting offline preprocessing scripts and prompt templates.

Changes:

  • Adds an OTA verifier (LLM-based soundness + rule-based completeness) and wires it into orchestration/run outputs.
  • Introduces offline scripts to pre-resolve relative dates, pre-extract constraints/completeness artifacts, and pregenerate deterministic solo-user opening messages.
  • Hardens JSON extraction in evaluator utilities and extends/updates evaluator + prompt templates for the new verification flow.
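The rule-based completeness half of the verifier can be pictured as a set comparison between the bookings a task requires and the orders that exist when the agent stops. The following is a minimal sketch under stated assumptions, not the overlay's actual checker: `RequiredBooking`, `check_completeness`, and the `kind`/`target_id` fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequiredBooking:
    """Hypothetical completeness constraint: one booking the task requires."""
    kind: str        # assumed category label, e.g. "hotel" or "flight"
    target_id: str   # assumed identifier of the item to be booked


def check_completeness(required, final_orders):
    """Return the required bookings that are missing from the final order state.

    `final_orders` is assumed to be a list of dicts carrying "kind" and
    "target_id" keys; an empty result means the run is complete.
    """
    placed = {(order["kind"], order["target_id"]) for order in final_orders}
    return [r for r in required if (r.kind, r.target_id) not in placed]
```

An empty return value would correspond to a passing completeness check at stop time; any remaining entries could be reported back as the missing bookings.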

Reviewed changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| examples/AgenticBenchmarks/VitaBench/src/vita/utils/utils.py | Utility overlay including hardened JSON extraction and task/path helpers. |
| examples/AgenticBenchmarks/VitaBench/src/vita/user/user_simulator.py | Adds deterministic solo-user message support via DummyUser. |
| examples/AgenticBenchmarks/VitaBench/src/vita/scripts/README.md | Documents offline helper scripts and expected artifacts. |
| examples/AgenticBenchmarks/VitaBench/src/vita/scripts/preresolve_dates.py | Offline script to resolve relative date phrases into absolute dates. |
| examples/AgenticBenchmarks/VitaBench/src/vita/scripts/pregenerate_solo_messages.py | Offline script to pregenerate deterministic solo-user opening messages. |
| examples/AgenticBenchmarks/VitaBench/src/vita/scripts/preextract_constraints_harness.py | Offline extraction of NL constraints for harness soundness mode. |
| examples/AgenticBenchmarks/VitaBench/src/vita/scripts/preextract_completeness.py | Offline extraction of completeness constraints for OTA tasks. |
| examples/AgenticBenchmarks/VitaBench/src/vita/prompts/soundness_judge_template.yaml | Prompt template for LLM soundness judge. |
| examples/AgenticBenchmarks/VitaBench/src/vita/prompts/solo_agent_system_prompt.yaml | Solo-agent system prompt (language + tool-use guidance). |
| examples/AgenticBenchmarks/VitaBench/src/vita/prompts/harness_soundness_judge_template.yaml | Prompt template for per-constraint harness judging. |
| examples/AgenticBenchmarks/VitaBench/src/vita/prompts/harness_memory_writer_template.yaml | Prompt template for harness memory distillation. |
| examples/AgenticBenchmarks/VitaBench/src/vita/prompts/harness_constraint_extraction_template.yaml | Prompt template for harness constraint extraction. |
| examples/AgenticBenchmarks/VitaBench/src/vita/prompts/date_resolution_template.yaml | Prompt template for date resolution in preprocessing. |
| examples/AgenticBenchmarks/VitaBench/src/vita/prompts/completeness_extraction_template.yaml | Prompt template for completeness constraint extraction. |
| examples/AgenticBenchmarks/VitaBench/src/vita/prompts/agent_system_prompt.yaml | Agent prompt updated to enforce English outputs. |
| examples/AgenticBenchmarks/VitaBench/src/vita/orchestrator/orchestrator.py | Wires verifier into simulation loop (soundness before writes, completeness on stop). |
| examples/AgenticBenchmarks/VitaBench/src/vita/evaluator/evaluator_traj.py | Evaluator tweak (flatten nested rubric output) + window eval metadata. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/verifier/utils.py | Shared verifier helpers (tool history + JSON extraction). |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/verifier/init.py | OTAVerifier + factory wiring soundness and completeness subsystems. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/tools_schema.py | Tool schema overlays (incl. override arg documentation). |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_llm/judge.py | LLM-based soundness judging implementation. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_llm/init.py | Exports for LLM soundness judge module. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_harness/schema.py | Pydantic schema for harness constraints/memory/judgments. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_harness/memory_store.py | Runtime memory writer for harness mode. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_harness/judge.py | Per-constraint harness judging implementation. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_harness/constraint_extractor.py | Offline constraint extraction for harness mode. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_harness/init.py | Harness orchestrator + factory loader for constraints. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/completeness/schema.py | Schema for extracted completeness constraints. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/completeness/constraint_extractor.py | LLM-based completeness constraint extractor. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/completeness/checker.py | Runtime completeness checker over final order state. |
| examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/completeness/init.py | Completeness module exports. |
| examples/AgenticBenchmarks/VitaBench/src/vita/data_model/simulation.py | RunConfig additions + SimulationRun soundness log attachment. |
| examples/AgenticBenchmarks/VitaBench/src/vita/cli.py | CLI flags added for soundness/completeness/solo-user modes. |
| examples/AgenticBenchmarks/VitaBench/src/vita/agent/llm_agent.py | Solo agent language support + relaxed strict tool-only guard. |
| examples/AgenticBenchmarks/VitaBench/README.md | Top-level overlay instructions and run recipes for the verifier. |


Comment thread examples/AgenticBenchmarks/VitaBench/src/vita/utils/utils.py
return results


def evaluator_extracter(content: str) -> list[dict]:
- Completeness: rule-based check that all required bookings exist at stop time.
"""

WRITE_PREFIXES = ("create_", "cancel_", "modify_")
plus observe_tool_response() for memory updates on read calls.
"""

WRITE_PREFIXES = ("create_", "cancel_", "modify_")
Comment on lines +24 to +28
# Fallback: find balanced top-level { } blocks, take the last valid one
candidates: list[str] = []
depth = 0
start = -1
in_string = False
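The fallback described in the quoted comment (scan for balanced top-level `{ }` blocks and keep the last one that parses) could look roughly like the sketch below. It reuses the variable names visible in the fragment (`candidates`, `depth`, `start`, `in_string`), but the function name `extract_last_json_object`, the escape handling, and the choice to track string state only inside an open block are assumptions, not the PR's actual code.

```python
import json


def extract_last_json_object(text: str):
    """Return the last top-level {...} block in `text` that parses as a JSON object, or None."""
    candidates: list[str] = []
    depth = 0
    start = -1
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            # Inside a JSON string: braces are literal, only watch for the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"' and depth > 0:
            # Track strings only inside a block so stray quotes in prose are ignored.
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                candidates.append(text[start:i + 1])
    # Prefer the last block that is valid JSON and decodes to a dict.
    for candidate in reversed(candidates):
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

Tracking `in_string` matters because a `}` inside a quoted value would otherwise close the block early; an unterminated block at the end of the text is simply never added to `candidates`.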
Comment on lines +56 to +66
tasks = [Task.model_validate(t) for t in raw_tasks]

# Load resolved instructions if provided
resolved = {}
if resolved_instructions_file:
    with open(resolved_instructions_file, "r", encoding="utf-8") as fp:
        resolved = json.load(fp)
    logger.info(f"Loaded {len(resolved)} resolved instructions from {resolved_instructions_file}")

tasks = [Task.model_validate(t) for t in raw_tasks]

Comment on lines +494 to +499
"args": {
    "hotel_id": "Hotel ID",
    "product_id": "Room ID",
    "user_id": "User ID",
    "override": "Default false. Only set to true if you received soundness feedback and are confident your action is correct"
},
Comment on lines +8 to +10
Produces a JSON mapping {task_id: resolved_instructions} that can be loaded
at runtime to replace the original instructions.
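To illustrate how a `{task_id: resolved_instructions}` mapping might be applied at load time, here is a minimal sketch. It assumes each raw task is a dict with `id` and `instructions` keys; those field names and both function names are hypothetical and may not match VitaBench's `Task` model.

```python
import json


def load_resolved_instructions(path):
    """Read the {task_id: resolved_instructions} mapping written by the offline script."""
    with open(path, "r", encoding="utf-8") as fp:
        return json.load(fp)


def apply_resolved_instructions(raw_tasks, resolved):
    """Return copies of `raw_tasks` with instructions replaced where a resolved entry exists.

    Tasks without an entry in `resolved` keep their original instructions,
    and the input list is left unmodified.
    """
    updated = []
    for task in raw_tasks:
        task = dict(task)  # shallow copy so the caller's data is untouched
        replacement = resolved.get(str(task["id"]))
        if replacement is not None:
            task["instructions"] = replacement
        updated.append(task)
    return updated
```

Applying the mapping before model validation would keep the substitution in one place, so tasks are validated only once on the already-resolved data.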


@amit-sharma amit-sharma left a comment

Contributor

Looks good. Added a few comments.

"""
Manages the per-task memory store during simulation.

On each non-write tool call, asks an SLM to distill relevant facts

How do you decide whether a call is a write or a non-write? Is that provided in the dataset?
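For context, the overlay snippet quoted earlier on this page defines `WRITE_PREFIXES = ("create_", "cancel_", "modify_")`, which suggests the split is made on the tool-name prefix and not on dataset metadata. A minimal sketch of that reading follows; the helper name `is_write_call` and the example tool names in the usage note are hypothetical.

```python
# Prefixes copied from the quoted overlay code; everything else is illustrative.
WRITE_PREFIXES = ("create_", "cancel_", "modify_")


def is_write_call(tool_name: str) -> bool:
    """Treat a tool call as a write if its name starts with a known write prefix."""
    # str.startswith accepts a tuple and matches if any prefix applies.
    return tool_name.startswith(WRITE_PREFIXES)
```

Under this assumption a name such as `create_hotel_order` would be classified as a write, while `search_hotels` would be routed to the read path that updates memory.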

@@ -0,0 +1,32 @@
name: agent_system_prompt

Do we need to copy this file? What change did you make in it?

@@ -0,0 +1,81 @@
name: completeness_extraction_template
chinese: |-

Same question. Were any changes made to the prompt?

@@ -0,0 +1,33 @@
name: harness_memory_writer_template

Why are we adding a Chinese harness memory writer? Can we add only the English one? Asking because we cannot understand or check what the Chinese version says.


Labels

None yet

Projects

None yet


3 participants