VitaBench Diff by ashmitkx · Pull Request #33 · microsoft/interwhen

ashmitkx · 2026-06-24T14:06:58Z

Add files to overlay on VitaBench's code.

Copilot

Pull request overview

This PR adds an overlay on top of the upstream VitaBench repository to support more robust OTA benchmarking runs, including online soundness/completeness verification, deterministic solo-user message playback, and supporting offline preprocessing scripts + prompt templates.

Changes:

Adds an OTA verifier (LLM-based soundness + rule-based completeness) and wires it into orchestration/run outputs.
Introduces offline scripts to pre-resolve relative dates, pre-extract constraints/completeness artifacts, and pregenerate deterministic solo-user opening messages.
Hardens JSON extraction in evaluator utilities and extends/updates evaluator + prompt templates for the new verification flow.

Reviewed changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 13 comments.


File	Description
examples/AgenticBenchmarks/VitaBench/src/vita/utils/utils.py	Utility overlay including hardened JSON extraction and task/path helpers.
examples/AgenticBenchmarks/VitaBench/src/vita/user/user_simulator.py	Adds deterministic solo-user message support via `DummyUser`.
examples/AgenticBenchmarks/VitaBench/src/vita/scripts/README.md	Documents offline helper scripts and expected artifacts.
examples/AgenticBenchmarks/VitaBench/src/vita/scripts/preresolve_dates.py	Offline script to resolve relative date phrases into absolute dates.
examples/AgenticBenchmarks/VitaBench/src/vita/scripts/pregenerate_solo_messages.py	Offline script to pregenerate deterministic solo-user opening messages.
examples/AgenticBenchmarks/VitaBench/src/vita/scripts/preextract_constraints_harness.py	Offline extraction of NL constraints for harness soundness mode.
examples/AgenticBenchmarks/VitaBench/src/vita/scripts/preextract_completeness.py	Offline extraction of completeness constraints for OTA tasks.
examples/AgenticBenchmarks/VitaBench/src/vita/prompts/soundness_judge_template.yaml	Prompt template for LLM soundness judge.
examples/AgenticBenchmarks/VitaBench/src/vita/prompts/solo_agent_system_prompt.yaml	Solo-agent system prompt (language + tool-use guidance).
examples/AgenticBenchmarks/VitaBench/src/vita/prompts/harness_soundness_judge_template.yaml	Prompt template for per-constraint harness judging.
examples/AgenticBenchmarks/VitaBench/src/vita/prompts/harness_memory_writer_template.yaml	Prompt template for harness memory distillation.
examples/AgenticBenchmarks/VitaBench/src/vita/prompts/harness_constraint_extraction_template.yaml	Prompt template for harness constraint extraction.
examples/AgenticBenchmarks/VitaBench/src/vita/prompts/date_resolution_template.yaml	Prompt template for date resolution in preprocessing.
examples/AgenticBenchmarks/VitaBench/src/vita/prompts/completeness_extraction_template.yaml	Prompt template for completeness constraint extraction.
examples/AgenticBenchmarks/VitaBench/src/vita/prompts/agent_system_prompt.yaml	Agent prompt updated to enforce English outputs.
examples/AgenticBenchmarks/VitaBench/src/vita/orchestrator/orchestrator.py	Wires verifier into simulation loop (soundness before writes, completeness on stop).
examples/AgenticBenchmarks/VitaBench/src/vita/evaluator/evaluator_traj.py	Evaluator tweak (flatten nested rubric output) + window eval metadata.
examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/verifier/utils.py	Shared verifier helpers (tool history + JSON extraction).
examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/verifier/__init__.py	OTAVerifier + factory wiring soundness and completeness subsystems.
examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/tools_schema.py	Tool schema overlays (incl. override arg documentation).
examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_llm/judge.py	LLM-based soundness judging implementation.
examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_llm/__init__.py	Exports for LLM soundness judge module.
examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_harness/schema.py	Pydantic schema for harness constraints/memory/judgments.
examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_harness/memory_store.py	Runtime memory writer for harness mode.
examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_harness/judge.py	Per-constraint harness judging implementation.
examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_harness/constraint_extractor.py	Offline constraint extraction for harness mode.
examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/soundness_judge_harness/__init__.py	Harness orchestrator + factory loader for constraints.
examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/completeness/schema.py	Schema for extracted completeness constraints.
examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/completeness/constraint_extractor.py	LLM-based completeness constraint extractor.
examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/completeness/checker.py	Runtime completeness checker over final order state.
examples/AgenticBenchmarks/VitaBench/src/vita/domains/ota/completeness/__init__.py	Completeness module exports.
examples/AgenticBenchmarks/VitaBench/src/vita/data_model/simulation.py	RunConfig additions + SimulationRun soundness log attachment.
examples/AgenticBenchmarks/VitaBench/src/vita/cli.py	CLI flags added for soundness/completeness/solo-user modes.
examples/AgenticBenchmarks/VitaBench/src/vita/agent/llm_agent.py	Solo agent language support + relaxed strict tool-only guard.
examples/AgenticBenchmarks/VitaBench/README.md	Top-level overlay instructions and run recipes for the verifier.


+    return results
+
+
+def evaluator_extracter(content: str) -> list[dict]:


+      - Completeness: rule-based check that all required bookings exist at stop time.
+    """
+
+    WRITE_PREFIXES = ("create_", "cancel_", "modify_")


+    plus observe_tool_response() for memory updates on read calls.
+    """
+
+    WRITE_PREFIXES = ("create_", "cancel_", "modify_")
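The prefix tuple in the hunks above suggests that tool calls are classified as write or read by name prefix: writes go through the soundness check before execution, and reads feed the memory update. A minimal sketch of that routing follows. The helper names `is_write_call` and `route_tool_call`, and the returned labels, are illustrative assumptions and are not taken from the PR.

```python
WRITE_PREFIXES = ("create_", "cancel_", "modify_")


def is_write_call(tool_name: str) -> bool:
    """Return True if the tool name starts with a state-changing prefix."""
    # str.startswith accepts a tuple and matches if any member is a prefix.
    return tool_name.startswith(WRITE_PREFIXES)


def route_tool_call(tool_name: str) -> str:
    """Hypothetical router: judge writes before they run, observe reads after."""
    if is_write_call(tool_name):
        return "judge_before_execute"
    return "observe_response"
```

Under this scheme, a tool whose name does not begin with one of the three prefixes is always treated as a read, so the classification depends only on the tool naming convention rather than on per-task annotations.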


+    # Fallback: find balanced top-level { } blocks, take the last valid one
+    candidates: list[str] = []
+    depth = 0
+    start = -1
+    in_string = False
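The hunk above is truncated after its state variables. One way such a balanced-brace fallback could be completed is sketched below, under stated assumptions: the function name `extract_last_json_object` is hypothetical, quotes are tracked only inside an open brace block so that stray quotes in surrounding prose do not hide later braces, and only candidates that parse to a JSON object are accepted.

```python
import json
from typing import Optional


def extract_last_json_object(text: str) -> Optional[dict]:
    """Return the last top-level {...} block in text that parses as a JSON object."""
    candidates: list[str] = []
    depth = 0
    start = -1
    in_string = False
    escaped = False

    for i, ch in enumerate(text):
        if in_string:
            # Inside a string literal: braces are data, not structure.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"' and depth > 0:
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                candidates.append(text[start : i + 1])

    # Prefer the last block, since models often emit the final answer last.
    for candidate in reversed(candidates):
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

For example, given `'x {not json} y {"a": {"b": "}"}} z'`, the scan collects two balanced blocks and returns the second, because the first fails to parse and the brace inside the string value is ignored.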


+    tasks = [Task.model_validate(t) for t in raw_tasks]
+
+    # Load resolved instructions if provided
+    resolved = {}
+    if resolved_instructions_file:
+        with open(resolved_instructions_file, "r", encoding="utf-8") as fp:
+            resolved = json.load(fp)
+        logger.info(f"Loaded {len(resolved)} resolved instructions from {resolved_instructions_file}")
+
+    tasks = [Task.model_validate(t) for t in raw_tasks]
+


+        "args": {
+            "hotel_id": "Hotel ID",
+            "product_id": "Room ID",
+            "user_id": "User ID",
+            "override": "Default false. Only set to true if you received soundness feedback and are confident your action is correct"
+        },


+Produces a JSON mapping {task_id: resolved_instructions} that can be loaded
+at runtime to replace the original instructions.
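The docstring above describes a `{task_id: resolved_instructions}` mapping that replaces the original instructions at runtime. A rough sketch of how that substitution might be applied to raw task dictionaries before validation is shown here. The function name `apply_resolved_instructions` and the field names `id` and `instructions` are assumptions for illustration; the actual Task schema is not shown in this PR excerpt.

```python
def apply_resolved_instructions(
    raw_tasks: list[dict], resolved: dict[str, str]
) -> list[dict]:
    """Return copies of raw_tasks with instructions swapped where a resolution exists."""
    updated = []
    for task in raw_tasks:
        task = dict(task)  # shallow copy so the caller's data is not mutated
        task_id = str(task.get("id"))  # "id" is an assumed field name
        if task_id in resolved:
            task["instructions"] = resolved[task_id]  # assumed field name
        updated.append(task)
    return updated
```

Tasks without an entry in the mapping keep their original instructions, so a partially resolved file degrades gracefully instead of failing the run.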
+


amit-sharma

Looks good. Added a few comments.

amit-sharma · 2026-06-25T13:16:04Z

+    """
+    Manages the per-task memory store during simulation.
+
+    On each non-write tool call, asks an SLM to distill relevant facts


How do you decide whether a call is a write or a non-write? Is that provided in the dataset?

amit-sharma · 2026-06-25T13:18:38Z

@@ -0,0 +1,32 @@
+name: agent_system_prompt


Do we need to copy this file? What change did you make in it?

amit-sharma · 2026-06-25T13:19:24Z

@@ -0,0 +1,81 @@
+name: completeness_extraction_template
+chinese: |-


Same question. Were any changes made to the prompt?

amit-sharma · 2026-06-25T13:20:26Z

@@ -0,0 +1,33 @@
+name: harness_memory_writer_template


Why are we adding a Chinese harness memory writer? Can we add just the English one? Asking because we cannot understand or check what the Chinese version says.

Copilot AI review requested due to automatic review settings June 24, 2026 14:06

ashmitkx marked this pull request as ready for review June 24, 2026 14:07

Copilot started reviewing on behalf of ashmitkx June 24, 2026 14:07

Copilot AI reviewed Jun 24, 2026


ashmitkx added 2 commits June 25, 2026 06:54

add base vita files for overlay

bf60e9d

add vitabench diff

9e19c8d

ashmitkx force-pushed the vitabench branch from b024e94 to 9e19c8d Compare June 25, 2026 07:03

amit-sharma reviewed Jun 25, 2026


ashmitkx wants to merge 2 commits into microsoft:main from ashmitkx:vitabench

3 participants
