114 changes: 114 additions & 0 deletions examples/AgenticBenchmarks/VitaBench/README.md
@@ -0,0 +1,114 @@
# Running the OTA verifier on VitaBench

---

**Note:** The code in this folder is built on top of the original VitaBench code, found at [https://github.com/meituan-longcat/vitabench](https://github.com/meituan-longcat/vitabench). Each file states which changes we made and which code is used verbatim, relative to the same file in the original VitaBench repo.

## 1. Clone the upstream VitaBench repo

```bash
git clone https://github.com/meituan-longcat/vitabench.git
cd vitabench
```

This README and the files alongside it are an overlay on top of the upstream
`main` branch. The overlay keeps the upstream directory layout, so every file
lives at the same path it would occupy inside a VitaBench checkout.

## 2. Apply the overlay

Because the overlay mirrors the upstream layout, copy its `src/` tree straight
over your clone — files at matching paths are replaced, new files are added:

```bash
SRC=/path/to/this/overlay # the directory containing this README
DST=/path/to/vitabench # your upstream clone

cp -r "$SRC/src" "$DST/" # merge the overlay sources into the clone
```

### Modified files

| Path | What changed |
|---|---|
| `src/vita/cli.py` | Adds the `--soundness-mode`, `--completeness-mode`, `--solo-user-mode` and `--solo-user-file` run flags. |
| `src/vita/data_model/simulation.py` | Adds the matching `RunConfig` fields (+ validation) and a `soundness_log` field on `SimulationRun`. |
| `src/vita/run.py` | Threads the new flags through the run pipeline, builds the OTA verifier, and resolves solo user messages. |
| `src/vita/orchestrator/orchestrator.py` | Runs the verifier inline: blocking soundness check before each tool call, and a completeness check on stop. |
| `src/vita/agent/llm_agent.py` | Solo agent honours `language`; relaxes the tool-call-only guard so the orchestrator can nudge instead of crashing. |
| `src/vita/user/user_simulator.py` | `DummyUser` can replay a pregenerated opening message instead of calling the LLM each run. |
| `src/vita/domains/ota/tools.py` | Adds an optional `override` flag to every OTA WRITE tool so the agent can bypass a soundness block when confident. |
| `src/vita/domains/ota/tools_schema.py` | Documents the new `override` argument (Chinese + English). |
| `src/vita/evaluator/evaluator_traj.py` | Flattens nested-list LLM rubric output before scoring. |
| `src/vita/utils/utils.py` | Hardens `evaluator_extracter` JSON extraction (think-block stripping, fenced/balanced-block fallback). |
| `src/vita/prompts/agent_system_prompt.yaml` | Adds an "always respond in English" instruction. |
| `src/vita/prompts/solo_agent_system_prompt.yaml` | Adds an "always respond in English" instruction. |
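
The two evaluator-side changes in the table (JSON extraction hardening and nested-list flattening) can be pictured with the sketch below. It is an illustration only, not the code in `src/vita/utils/utils.py` or `src/vita/evaluator/evaluator_traj.py`: the names `extract_json`, `_first_balanced_block` and `flatten`, the `<think>` tag format, and the order of the fallbacks are all assumptions.

```python
import json
import re
from typing import Any, List, Optional

FENCE = "`" * 3  # built at runtime so this example can sit inside a fenced block
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def _first_balanced_block(text: str) -> Optional[str]:
    """Return the first balanced {...} or [...] span, skipping brackets inside strings."""
    pairs = {"{": "}", "[": "]"}
    stack: List[str] = []
    start = 0
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"' and stack:
            in_string = True
        elif ch in pairs:
            if not stack:
                start = i
            stack.append(pairs[ch])
        elif stack and ch == stack[-1]:
            stack.pop()
            if not stack:
                return text[start : i + 1]
    return None


def extract_json(raw: str) -> Any:
    """Hypothetical helper: try the whole text, then fenced blocks, then a balanced block."""
    text = THINK_BLOCK.sub("", raw).strip()
    candidates = [text] + [m.strip() for m in FENCED_BLOCK.findall(text)]
    block = _first_balanced_block(text)
    if block is not None:
        candidates.append(block)
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    raise ValueError("no JSON payload found in model output")


def flatten(items: list) -> list:
    """Hypothetical helper: flatten arbitrarily nested lists into one flat list."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat
```

Trying the cheapest parse first keeps well-formed output on the fast path, and the balanced-block scan only runs as a last resort when the model wraps its JSON in prose.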

### New files

| Path | Purpose |
|---|---|
| `src/vita/domains/ota/verifier/` | `OTAVerifier` + `create_verifier()` factory that wires the soundness and completeness checks together. |
| `src/vita/domains/ota/soundness_judge_llm/` | LLM-judge soundness checker (`--soundness-mode llm`). |
| `src/vita/domains/ota/soundness_judge_harness/` | NL-constraint "harness" soundness checker with running memory (`--soundness-mode harness`). |
| `src/vita/domains/ota/completeness/` | Completeness checker that compares the final orders against extracted constraints at stop. |
| `src/vita/prompts/*.yaml` | New prompt templates: soundness/harness judges, constraint & completeness extraction, memory writer, and date resolution. |
| `src/vita/scripts/` | Offline preprocessing scripts and their guide — see [`src/vita/scripts/README.md`](src/vita/scripts/README.md). |
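
To make the soundness gate and the `override` flag concrete, here is a minimal sketch of how a blocking check in front of write tools might behave. It does not reproduce `OTAVerifier` or `create_verifier()`; the `ToolCall` and `Verdict` types, the `gate_tool_call` function, the `WRITE_TOOLS` set and the tool names in it are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolCall:
    name: str
    arguments: dict = field(default_factory=dict)


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


# Hypothetical tool names; the real OTA write tools live in src/vita/domains/ota/tools.py.
WRITE_TOOLS = {"create_hotel_order", "cancel_order"}


def gate_tool_call(call: ToolCall, judge: Callable[[ToolCall], Verdict]) -> Verdict:
    """Decide whether a tool call may execute.

    Read-only calls pass through, a write call carrying override=True bypasses
    the judge, and every other write call is handed to the soundness judge.
    """
    if call.name not in WRITE_TOOLS:
        return Verdict(True, "read-only tool")
    if call.arguments.get("override") is True:
        return Verdict(True, "override set by agent")
    return judge(call)


def strict_judge(call: ToolCall) -> Verdict:
    """Toy judge for the demo: reject any write that lacks a check-in date."""
    if "check_in" not in call.arguments:
        return Verdict(False, "missing check_in date")
    return Verdict(True)
```

When the gate returns `allowed=False`, the orchestrator would return the `reason` to the agent in place of the tool result, so the agent can either correct the call or resend it with `override` set.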

## 3. Python environment

Follow the upstream VitaBench README.

## 4. Offline preprocessing (optional)

Some verifier modes consume artifacts produced by the scripts in
`src/vita/scripts/` (resolved dates, extracted constraints, pregenerated solo
user messages). The dependency order and exact commands are documented in
[`src/vita/scripts/README.md`](src/vita/scripts/README.md). You only need these
if you run `--soundness-mode harness`, `--completeness-mode on`, or
`--solo-user-mode file`.

## 5. Environment variables

```bash
# Max times the agent is sent back after a failed completeness check (default 1)
export VITA_MAX_COMPLETENESS_RETRIES=1
```
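
As a rough picture of what this variable controls, the sketch below reads it with a default of 1 and uses it to bound how often the agent is sent back after a failed completeness check. Only the variable name and its default come from this README; `max_completeness_retries`, `run_with_completeness_retries` and their callback parameters are hypothetical, and the overlay's actual parsing and loop may differ.

```python
import os
from typing import Callable, List


def max_completeness_retries(default: int = 1) -> int:
    """Read VITA_MAX_COMPLETENESS_RETRIES, falling back to the default if unset or invalid."""
    raw = os.environ.get("VITA_MAX_COMPLETENESS_RETRIES", "")
    try:
        return max(0, int(raw))
    except ValueError:
        return default


def run_with_completeness_retries(
    run_agent: Callable[[List[str]], None],
    unmet_requirements: Callable[[], List[str]],
    retries: int,
) -> List[str]:
    """Run the agent to its stop signal, then send it back while requirements are unmet.

    run_agent(feedback) runs the agent until it stops, seeded with the unmet
    requirements from the previous check. unmet_requirements() is the
    completeness check. Returns whatever is still unmet at the end.
    """
    feedback: List[str] = []
    sent_back = 0
    while True:
        run_agent(feedback)
        feedback = unmet_requirements()
        if not feedback or sent_back >= retries:
            return feedback
        sent_back += 1
```

With the default of 1, the agent gets its initial run plus at most one further run after a failed check.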

## 6. Run

Reference command (OTA domain, solo agent, dummy user, harness soundness +
completeness checks on):

```bash
vita run \
--domain ota \
--agent llm_solo_agent \
--user dummy_user \
--agent-llm <model name> \
--evaluator-llm <model name> \
--language english \
--soundness-mode harness \
--completeness-mode on \
--num-tasks 100
```

Flags (the last four are added by this overlay; the rest are upstream):

| Flag | Meaning |
|---|---|
| `--domain ota` | Run the OTA domain. The verifier only activates for `ota`. |
| `--agent llm_solo_agent` | Solo-mode agent: no user-simulator turn; it works the ticket autonomously via tool calls. |
| `--user dummy_user` | No-op user that only issues the opening message. |
| `--agent-llm <model name>` | Model (from `models.yaml`) the agent runs on. |
| `--evaluator-llm <model name>` | Model used by the rubric evaluator. |
| `--language english` | Prompt/task language (`english` or `chinese`). |
| `--num-tasks 100` | Number of tasks to run. |
| `--soundness-mode {llm,harness,off}` | Soundness checker before each write tool call. `llm` = LLM judge, `harness` = NL-constraint judge with memory, `off` = disabled. Default `off`. |
| `--completeness-mode {on,off}` | When `on`, run a completeness check at stop and send the agent back (up to `VITA_MAX_COMPLETENESS_RETRIES`) if requirements are unmet. Default `off`. |
| `--solo-user-mode {live,file}` | Solo opening message: `live` generates it via LLM each run (introduces variance); `file` loads a deterministic pregenerated message. Default `live`. |
| `--solo-user-file <path>` | JSON mapping `task_id -> message`, required when `--solo-user-mode=file`. Produced by `src/vita/scripts/pregenerate_solo_messages.py`. |
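
The relationship between `--solo-user-mode` and `--solo-user-file` can be sketched as follows. This is an illustration of the `task_id -> message` file format described in the table, not the code in `src/vita/run.py`; the function name `load_solo_messages`, the error messages and the sample task ids are assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Optional


def load_solo_messages(mode: str, path: Optional[str]) -> Optional[Dict[str, str]]:
    """Return the pregenerated opening messages, or None when they are generated live."""
    if mode == "live":
        return None
    if mode != "file":
        raise ValueError(f"unknown solo user mode: {mode!r}")
    if not path:
        raise ValueError("--solo-user-file is required when --solo-user-mode=file")
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("solo user file must be a JSON object mapping task_id to message")
    return {str(task_id): str(message) for task_id, message in data.items()}


# Demo with a throwaway file; the task ids and messages are made up.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "solo_messages.json"
    sample.write_text(
        json.dumps({"task_001": "Book a hotel in Beijing.", "task_002": "Cancel my flight."}),
        encoding="utf-8",
    )
    messages = load_solo_messages("file", str(sample))
```

Loading the messages once up front means every run of a given task starts from the same opening message, which removes the run-to-run variance noted for `live` mode.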

Results are written to `data/simulations/`. See the upstream README for the full
list of base flags (`--num-trials`, `--max-steps`, `--task-ids`, `--csv-output`,
…).
237 changes: 237 additions & 0 deletions examples/AgenticBenchmarks/VitaBench/src/vita/agent/llm_agent.py
@@ -0,0 +1,237 @@
"""VitaBench overlay file — modified from the original VitaBench repo
(https://github.com/meituan-longcat/vitabench), at src/vita/agent/llm_agent.py.
Everything is verbatim from the original except for the following changes:

1. ``LLMSoloAgent`` now stores ``self.language`` and builds its system prompt
with ``get_prompts(self.language)`` (was ``get_prompts()``), so solo runs
honour the requested language.
2. Commented out the ``raise ValueError("LLMSoloAgent only supports tool calls
before ###STOP###.")`` guard in ``generate_next_message`` — the orchestrator
now nudges the agent back instead of hard-failing on a stray text turn.
"""
from copy import deepcopy
from typing import List, Optional

from loguru import logger
from pydantic import BaseModel

from vita.agent.base import (
    LocalAgent,
    ValidAgentInputMessage,
    is_valid_agent_history_message,
)
from vita.data_model.message import (
    APICompatibleMessage,
    AssistantMessage,
    Message,
    MultiToolMessage,
    SystemMessage,
)
from vita.environment.tool import Tool
from vita.utils.llm_utils import generate
from vita.utils.utils import get_now, get_weekday
from vita.prompts import get_prompts


class LLMAgentState(BaseModel):
    """The state of the agent."""

    system_messages: list[SystemMessage]
    messages: list[APICompatibleMessage]


class LLMAgent(LocalAgent[LLMAgentState]):
    """
    An LLM agent that can be used to solve a task.
    """

    def __init__(
        self,
        tools: List[Tool],
        domain_policy: str,
        llm: Optional[str] = None,
        llm_args: Optional[dict] = None,
        time=None,
        enable_think: bool = False,
        language: str = None
    ):
        """
        Initialize the LLMAgent.
        """
        super().__init__(tools=tools, domain_policy=domain_policy)
        self.llm = llm
        self.llm_args = deepcopy(llm_args) if llm_args is not None else {}
        self.time = time + " " + get_weekday(time, language)
        self.enable_think = enable_think

    @property
    def system_prompt(self) -> str:
        if self.time is not None:
            return self.domain_policy.format(
                time=self.time
            )
        return self.domain_policy.format(
            time=get_now("%Y-%m-%d %H:%M:%S")
        )

    def get_init_state(
        self, message_history: Optional[list[Message]] = None
    ) -> LLMAgentState:
        """Get the initial state of the agent.

        Args:
            message_history: The message history of the conversation.

        Returns:
            The initial state of the agent.
        """
        if message_history is None:
            message_history = []
        assert all(is_valid_agent_history_message(m) for m in message_history), (
            "Message history must contain only AssistantMessage, UserMessage, or ToolMessage to Agent."
        )

        return LLMAgentState(
            system_messages=[SystemMessage(role="system", content=self.system_prompt)],
            messages=message_history,
        )

    def generate_next_message(
        self, message: ValidAgentInputMessage, state: LLMAgentState
    ) -> tuple[AssistantMessage, LLMAgentState]:
        """
        Respond to a user or tool message.
        """
        if isinstance(message, MultiToolMessage):
            state.messages.extend(message.tool_messages)
        else:
            state.messages.append(message)

        messages = state.system_messages + state.messages

        assistant_message = generate(
            model=self.llm,
            tools=self.tools,
            messages=messages,
            enable_think=self.enable_think,
            **self.llm_args,
        )
        state.messages.append(assistant_message)

        return assistant_message, state

    def set_seed(self, seed: int):
        """Set the seed for the LLM."""
        if self.llm is None:
            raise ValueError("LLM is not set")
        cur_seed = self.llm_args.get("seed", None)
        if cur_seed is not None:
            logger.warning(f"Seed is already set to {cur_seed}, resetting it to {seed}")
        self.llm_args["seed"] = seed




class LLMSoloAgent(LocalAgent[LLMAgentState]):
    """
    An LLM agent that can be used to solve a task without any interaction with the customer.
    The task need to specify a ticket format.
    """

    def __init__(
        self,
        tools: List[Tool],
        domain_policy: str,
        llm: Optional[str] = None,
        llm_args: Optional[dict] = None,
        time=None,
        enable_think: bool = False,
        language: str = None
    ):
        """
        Initialize the LLMAgent.
        """
        super().__init__(tools=tools, domain_policy=domain_policy)
        self.llm = llm
        self.llm_args = deepcopy(llm_args) if llm_args is not None else {}
        self.time = time + " " + get_weekday(time, language)
        self.enable_think = enable_think
        self.language = language

    @property
    def system_prompt(self) -> str:
        prompts = get_prompts(self.language)
        if self.time is not None:
            return prompts.solo_agent_system_prompt.format(
                time=self.time
            )
        return prompts.solo_agent_system_prompt.format(
            time=get_now("%Y-%m-%d %H:%M:%S")
        )

    @classmethod
    def is_stop(cls, message: AssistantMessage) -> bool:
        """Check if the message is a stop message."""
        if message.content is None:
            return False
        return cls.STOP_TOKEN in message.content

    def get_init_state(
        self, message_history: Optional[list[Message]] = None
    ) -> LLMAgentState:
        """Get the initial state of the agent.

        Args:
            message_history: The message history of the conversation.

        Returns:
            The initial state of the agent.
        """
        if message_history is None:
            message_history = []
        assert all(is_valid_agent_history_message(m) for m in message_history), (
            "Message history must contain only AssistantMessage, UserMessage, or ToolMessage to Agent."
        )
        return LLMAgentState(
            system_messages=[SystemMessage(role="system", content=self.system_prompt)],
            messages=message_history,
        )

    def generate_next_message(
        self, message: Optional[ValidAgentInputMessage], state: LLMAgentState
    ) -> tuple[AssistantMessage, LLMAgentState]:
        """
        Respond to a user or tool message.
        """
        # if isinstance(message, UserMessage):
        #     raise ValueError("LLMSoloAgent does not support user messages.")
        if isinstance(message, MultiToolMessage):
            state.messages.extend(message.tool_messages)
        elif message is None:
            assert len(state.messages) == 0, "Message history should be empty"
        else:
            state.messages.append(message)
        messages = state.system_messages + state.messages
        assistant_message = generate(
            model=self.llm,
            tools=self.tools,
            messages=messages,
            tool_choice="auto",
            enable_think=self.enable_think,
            **self.llm_args,
        )
        # if not assistant_message.is_tool_call() and not self.is_stop(assistant_message):
        #     raise ValueError("LLMSoloAgent only supports tool calls before ###STOP###.")
        state.messages.append(assistant_message)
        return assistant_message, state

    def set_seed(self, seed: int):
        """Set the seed for the LLM."""
        if self.llm is None:
            raise ValueError("LLM is not set")
        cur_seed = self.llm_args.get("seed", None)
        if cur_seed is not None:
            logger.warning(f"Seed is already set to {cur_seed}, resetting it to {seed}")
        self.llm_args["seed"] = seed