# OS-Themis

Official repository for the paper: **OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards**.
- Updates
- Overview
- Framework
- Results on OGRBench
- Environment Setup
- Configuration
- Running Evaluation
- Trajectory Format
- Citation
- [2026-03-20] Initial release of the paper and repository.
## Overview

OS-Themis is a multi-agent critic framework for evaluating whether a GUI trajectory truly completes a task. Instead of relying on a single holistic judgment, OS-Themis decomposes the trajectory into verifiable milestones, checks the corresponding screenshots, audits potential risks, and then aggregates the evidence into a final decision.
This repository contains the evaluation pipeline used to score GUI trajectories with four specialized roles:
- Selector: chooses the key steps that determine task completion.
- Verifier: checks selected steps against screenshots and history.
- Reviewer: audits risk points such as missing save/submit actions or contradictory evidence.
- Judge: produces the final task-level completion verdict.
The framework is designed for OpenAI-compatible multimodal endpoints and supports both single-trajectory evaluation and large-scale benchmark evaluation.
## Framework

At a high level, the pipeline works as follows:

- Parse `trajectory.json` and recover step-wise reasoning, actions, and screenshots.
- Ask the Selector to identify the most decisive steps.
- Ask the Verifier to inspect those steps with visual evidence.
- Ask the Reviewer to surface missing evidence and risky actions.
- Ask the Judge to issue the final `completed`/`not_completed` decision.
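The pipeline above can be sketched as follows. This is an illustrative stub, not the repository's actual API: every function and class name here (`Step`, `select_key_steps`, `verify_steps`, `review_risks`, `judge`, `evaluate`) is hypothetical, and the role logic is reduced to trivial placeholders that only show the data flow between the four roles.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the Selector -> Verifier -> Reviewer -> Judge flow.
# None of these names come from the repository; they illustrate the data flow only.

@dataclass
class Step:
    step_num: int
    action: str
    screenshot: str  # path to the screenshot for this step

@dataclass
class Evidence:
    verified_steps: list = field(default_factory=list)
    risks: list = field(default_factory=list)

def select_key_steps(steps, k=3):
    # Selector: pick the most decisive steps (here, trivially, the last k).
    return steps[-k:]

def verify_steps(key_steps):
    # Verifier: check each selected step against its screenshot (stubbed).
    return [s.step_num for s in key_steps if s.screenshot]

def review_risks(steps):
    # Reviewer: flag risky patterns, e.g. a missing save/submit action.
    actions = {s.action for s in steps}
    return [] if "submit" in actions else ["no submit action observed"]

def judge(evidence):
    # Judge: aggregate the evidence into a final completed/not_completed verdict.
    if evidence.verified_steps and not evidence.risks:
        return "completed"
    return "not_completed"

def evaluate(steps):
    key = select_key_steps(steps)
    ev = Evidence(verified_steps=verify_steps(key), risks=review_risks(steps))
    return judge(ev)

steps = [Step(1, "click", "screenshot_1.png"),
         Step(2, "type", "screenshot_2.png"),
         Step(3, "submit", "screenshot_3.png")]
print(evaluate(steps))  # -> completed
```

The real framework replaces each stub with a model call against the configured multimodal endpoint; the point here is only that the Judge sees aggregated evidence rather than raw screenshots.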
## Results on OGRBench

OGRBench is available on Hugging Face: `lizh1/OmniGUIRewardBench`.
| Model | Ubuntu Acc | Mobile Acc | Windows Acc | macOS Acc | Web Acc | Overall Acc | Overall Prec | Overall Rec |
|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B | 77.2 | 85.6 | 72.8 | 85.7 | 86.3 | 79.3 | 86.3 | 69.4 |
| Qwen3-VL-30B-A3B | 79.5 | 84.6 | 76.5 | 88.3 | 80.5 | 80.3 | 84.7 | 73.7 |
| Qwen3-VL-32B | 77.6 | 83.5 | 75.1 | 88.3 | 84.7 | 79.6 | 92.2 | 64.3 |
| Qwen3-VL-235B | 88.1 | 92.3 | 77.5 | 94.8 | 92.1 | 88.0 | 92.8 | 82.3 |
| Qwen3-VL-235B-Thinking | 83.4 | 89.4 | 85.5 | 93.5 | 84.7 | 85.2 | 89.3 | 79.9 |
| GPT-5-mini | 68.8 | 65.4 | 76.5 | 87.0 | 75.8 | 71.5 | 95.4 | 44.7 |
| GPT-5 | 82.5 | 80.3 | 84.5 | 88.3 | 95.9 | 82.9 | 93.4 | 70.6 |
| Gemini-3-Flash | 85.0 | 91.0 | 86.9 | 93.5 | 82.6 | 86.2 | 93.2 | 78.0 |
| Mean | 80.3 | 84.0 | 79.4 | 89.9 | 85.3 | 81.6 | 90.9 | 70.4 |
## Environment Setup

Python 3.10 or newer is required.

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Configuration

All model and framework settings are read from `config.yaml`:
```yaml
model:
  base_url: "http://your-openai-compatible-endpoint/v1"
  api_key: "your-api-key"
  model_name: "qwen3-vl-235b"
  temperature: 0
  top_p: 0.1
  max_tokens: 8192

framework:
  selector_max_rounds: 6
  selector_initial_retries: 2
  selector_follow_retries: 2
  fallback_key_steps_count: 3
  reviewer_max_rounds: 2
```

Notes:
- `base_url` should point to an OpenAI-compatible `/chat/completions` service.
- `model_name` is the only model identifier used for Selector, Verifier, Reviewer, and Judge.
- `--output-dir` is optional. If omitted, outputs default to `benchmark_result/<model_name>/` for batch evaluation and `multi_agent_eval_outputs/<model_name>/<task_name>/` for single-trajectory evaluation.
- The same backend is shared by Selector, Verifier, Reviewer, and Judge.
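As a minimal sketch of how these settings map onto an OpenAI-compatible `/chat/completions` request body, consider the following. The `build_chat_request` helper is hypothetical (not code from this repository); the config dict simply mirrors the `config.yaml` example above, and in practice the body would be POSTed to `{base_url}/chat/completions` with the `api_key` as a bearer token.

```python
# Config mirroring the config.yaml example above (not loaded from disk here).
config = {
    "model": {
        "base_url": "http://your-openai-compatible-endpoint/v1",
        "api_key": "your-api-key",
        "model_name": "qwen3-vl-235b",
        "temperature": 0,
        "top_p": 0.1,
        "max_tokens": 8192,
    }
}

def build_chat_request(cfg, messages):
    """Assemble the JSON body for POST {base_url}/chat/completions (illustrative)."""
    m = cfg["model"]
    return {
        "model": m["model_name"],
        "messages": messages,
        "temperature": m["temperature"],
        "top_p": m["top_p"],
        "max_tokens": m["max_tokens"],
    }

payload = build_chat_request(config, [{"role": "user", "content": "ping"}])
print(payload["model"])  # -> qwen3-vl-235b
```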
## Running Evaluation

Use `eval_traj.py` to evaluate a single trajectory directory:

```bash
python eval_traj.py \
    --traj-dir /path/to/trajectory_dir \
    --config config.yaml
```

If you want a custom output directory, pass `--output-dir`:

```bash
python eval_traj.py \
    --traj-dir /path/to/trajectory_dir \
    --output-dir /path/to/custom_output_dir \
    --config config.yaml
```

Use `evaluation.py` for batch evaluation:
```bash
python evaluation.py \
    --output-dir os_themis \
    --benchmark-file /path/to/OmniGUIRewardBench.json \
    --result-root benchmark_result \
    --config config.yaml
```

## Trajectory Format

Each trajectory directory should contain at least:
```text
trajectory_dir/
├── trajectory.json
├── meta.json            # optional
├── screenshot_1.png     # optional if screenshot_file is recorded in trajectory.json
├── screenshot_2.png
└── ...
```
Supported conventions:

- `trajectory.json` can be either a list of steps or a dictionary with a `trajectory` field.
- Each step should provide `step` or `step_num`.
- The model response can come from `response`, `prediction`, or `raw_response`.
- Screenshots are resolved from:
  - `screenshot_file` entries inside `trajectory.json`
  - files named like `screenshot_*.png`
  - image files under an `images/` subdirectory
- `meta.json` is optional but useful for fields such as `instruction`, `task_id`, and `application`.
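A minimal loader for these conventions might look like the following. This is an illustrative sketch, not the repository's parser: `load_steps` is a hypothetical helper, and it assumes only the field names listed above (`trajectory`, `step`/`step_num`, `response`/`prediction`/`raw_response`, `screenshot_file`).

```python
import json
import tempfile
from pathlib import Path

def load_steps(traj_dir):
    """Parse trajectory.json following the conventions above (illustrative only)."""
    data = json.loads((Path(traj_dir) / "trajectory.json").read_text())
    # Either a bare list of steps, or a dict with a "trajectory" field.
    steps = data["trajectory"] if isinstance(data, dict) else data
    parsed = []
    for s in steps:
        parsed.append({
            # step number may appear as "step" or "step_num"
            "step": s.get("step", s.get("step_num")),
            # model response may appear under several keys
            "response": s.get("response") or s.get("prediction") or s.get("raw_response"),
            # screenshot path may be recorded explicitly in the step
            "screenshot": s.get("screenshot_file"),
        })
    return parsed

# Demo: write a tiny trajectory and load it back.
tmp = tempfile.mkdtemp()
(Path(tmp) / "trajectory.json").write_text(json.dumps(
    {"trajectory": [{"step_num": 1, "prediction": "click button",
                     "screenshot_file": "screenshot_1.png"}]}))
print(load_steps(tmp)[0]["response"])  # -> click button
```

The fallbacks for `screenshot_*.png` files and the `images/` subdirectory would be resolved against the directory listing when `screenshot_file` is absent; they are omitted here for brevity.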
## Citation

If you find this repository useful, please cite:

```bibtex
@misc{li2026osthemisscalablecriticframework,
  title={OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards},
  author={Zehao Li and Zhenyu Wu and Yibo Zhao and Bowen Yang and Jingjing Xie and Zhaoyang Liu and Zhoumianze Liu and Kaiming Jin and Jianze Liang and Zonglin Li and Feng Wu and Bowen Zhou and Zun Wang and Zichen Ding},
  year={2026},
  eprint={2603.19191},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.19191},
}
```