# OS-Themis

Official repository for the paper: **OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards**.
- Updates
- Overview
- Framework
- Results on OGRBench
- Environment Setup
- Configuration
- Running Evaluation
- Trajectory Format
- Citation
- [2026-03-20] Initial release of the paper and repository.
## Overview

OS-Themis is a multi-agent critic framework for evaluating whether a GUI trajectory truly completes a task. Instead of relying on a single holistic judgment, OS-Themis decomposes the trajectory into verifiable milestones, checks the corresponding screenshots, audits potential risks, and then aggregates the evidence into a final decision.
This repository contains the evaluation pipeline used to score GUI trajectories with four specialized roles:
- Selector: chooses the key steps that determine task completion.
- Verifier: checks selected steps against screenshots and history.
- Reviewer: audits risk points such as missing save/submit actions or contradictory evidence.
- Judge: produces the final task-level completion verdict.
The framework is designed for OpenAI-compatible multimodal endpoints and supports both single-trajectory evaluation and large-scale benchmark evaluation.
## Framework

At a high level, the pipeline works as follows:

- Parse `trajectory.json` and recover step-wise reasoning, actions, and screenshots.
- Ask the Selector to identify the most decisive steps.
- Ask the Verifier to inspect those steps with visual evidence.
- Ask the Reviewer to surface missing evidence and risky actions.
- Ask the Judge to issue the final `completed`/`not_completed` decision.
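The pipeline above can be sketched as follows. This is an illustrative stub, not the repository's actual API: every function and class name here (`Step`, `select_key_steps`, `verify_steps`, `review_risks`, `judge`, `evaluate`) is hypothetical, and the role logic is reduced to trivial placeholders that only show the data flow between the four roles.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the Selector -> Verifier -> Reviewer -> Judge flow.
# None of these names come from the repository; they illustrate the data flow only.

@dataclass
class Step:
    step_num: int
    action: str
    screenshot: str  # path to the screenshot for this step

@dataclass
class Evidence:
    verified_steps: list = field(default_factory=list)
    risks: list = field(default_factory=list)

def select_key_steps(steps, k=3):
    # Selector: pick the most decisive steps (here, trivially, the last k).
    return steps[-k:]

def verify_steps(key_steps):
    # Verifier: check each selected step against its screenshot (stubbed).
    return [s.step_num for s in key_steps if s.screenshot]

def review_risks(steps):
    # Reviewer: flag risky patterns, e.g. a missing save/submit action.
    actions = {s.action for s in steps}
    return [] if "submit" in actions else ["no submit action observed"]

def judge(evidence):
    # Judge: aggregate the evidence into a final completed/not_completed verdict.
    if evidence.verified_steps and not evidence.risks:
        return "completed"
    return "not_completed"

def evaluate(steps):
    key = select_key_steps(steps)
    ev = Evidence(verified_steps=verify_steps(key), risks=review_risks(steps))
    return judge(ev)

steps = [Step(1, "click", "screenshot_1.png"),
         Step(2, "type", "screenshot_2.png"),
         Step(3, "submit", "screenshot_3.png")]
print(evaluate(steps))  # -> completed
```

The real framework replaces each stub with a model call against the configured multimodal endpoint; the point here is only that the Judge sees aggregated evidence rather than raw screenshots.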
## Results on OGRBench

OGRBench is available on Hugging Face: `lizh1/OmniGUIRewardBench`.
| Model | Ubuntu Acc | Mobile Acc | Windows Acc | macOS Acc | Web Acc | Overall Acc | Overall Prec | Overall Rec |
|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B | 77.2 | 85.6 | 72.8 | 85.7 | 86.3 | 79.3 | 86.3 | 69.4 |
| Qwen3-VL-30B-A3B | 79.5 | 84.6 | 76.5 | 88.3 | 80.5 | 80.3 | 84.7 | 73.7 |
| Qwen3-VL-32B | 77.6 | 83.5 | 75.1 | 88.3 | 84.7 | 79.6 | 92.2 | 64.3 |
| Qwen3-VL-235B | 88.1 | 92.3 | 77.5 | 94.8 | 92.1 | 88.0 | 92.8 | 82.3 |
| Qwen3-VL-235B-Thinking | 83.4 | 89.4 | 85.5 | 93.5 | 84.7 | 85.2 | 89.3 | 79.9 |
| GPT-5-mini | 68.8 | 65.4 | 76.5 | 87.0 | 75.8 | 71.5 | 95.4 | 44.7 |
| GPT-5 | 82.5 | 80.3 | 84.5 | 88.3 | 95.9 | 82.9 | 93.4 | 70.6 |
| Gemini-3-Flash | 85.0 | 91.0 | 86.9 | 93.5 | 82.6 | 86.2 | 93.2 | 78.0 |
| Mean | 80.3 | 84.0 | 79.4 | 89.9 | 85.3 | 81.6 | 90.9 | 70.4 |
## Environment Setup

Python 3.10 or newer is required.

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Configuration

All model and framework settings are read from `config.yaml`:
```yaml
model:
  base_url: "http://your-openai-compatible-endpoint/v1"
  api_key: "your-api-key"
  model_name: "qwen3-vl-235b"
  temperature: 0
  top_p: 0.1
  max_tokens: 8192

framework:
  selector_max_rounds: 6
  selector_initial_retries: 2
  selector_follow_retries: 2
  fallback_key_steps_count: 3
  reviewer_max_rounds: 2
```

Notes:
- `base_url` should point to an OpenAI-compatible `/chat/completions` service.
- `model_name` is the only model identifier used for Selector, Verifier, Reviewer, and Judge.
- `--output-dir` is optional. If omitted, outputs default to `benchmark_result/<model_name>/` for batch evaluation and `multi_agent_eval_outputs/<model_name>/<task_name>/` for single-trajectory evaluation.
- The same backend is shared by Selector, Verifier, Reviewer, and Judge.
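As a minimal sketch of how these settings map onto an OpenAI-compatible `/chat/completions` request body, consider the following. The `build_chat_request` helper is hypothetical (not code from this repository); the config dict simply mirrors the `config.yaml` example above, and in practice the body would be POSTed to `{base_url}/chat/completions` with the `api_key` as a bearer token.

```python
# Config mirroring the config.yaml example above (not loaded from disk here).
config = {
    "model": {
        "base_url": "http://your-openai-compatible-endpoint/v1",
        "api_key": "your-api-key",
        "model_name": "qwen3-vl-235b",
        "temperature": 0,
        "top_p": 0.1,
        "max_tokens": 8192,
    }
}

def build_chat_request(cfg, messages):
    """Assemble the JSON body for POST {base_url}/chat/completions (illustrative)."""
    m = cfg["model"]
    return {
        "model": m["model_name"],
        "messages": messages,
        "temperature": m["temperature"],
        "top_p": m["top_p"],
        "max_tokens": m["max_tokens"],
    }

payload = build_chat_request(config, [{"role": "user", "content": "ping"}])
print(payload["model"])  # -> qwen3-vl-235b
```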
## Running Evaluation

Use `eval_traj.py` to evaluate a single trajectory directory:

```bash
python eval_traj.py \
    --traj-dir /path/to/trajectory_dir \
    --config config.yaml
```

If you want a custom output directory, pass `--output-dir`:

```bash
python eval_traj.py \
    --traj-dir /path/to/trajectory_dir \
    --output-dir /path/to/custom_output_dir \
    --config config.yaml
```

Use `evaluation.py` for batch evaluation:
```bash
python evaluation.py \
    --output-dir os_themis \
    --benchmark-file /path/to/OmniGUIRewardBench.json \
    --result-root benchmark_result \
    --config config.yaml
```

## Trajectory Format

Each trajectory directory should contain at least:
```text
trajectory_dir/
├── trajectory.json
├── meta.json            # optional
├── screenshot_1.png     # optional if screenshot_file is recorded in trajectory.json
├── screenshot_2.png
└── ...
```
Supported conventions:

- `trajectory.json` can be either a list of steps or a dictionary with a `trajectory` field.
- Each step should provide `step` or `step_num`.
- The model response can come from `response`, `prediction`, or `raw_response`.
- Screenshots are resolved from:
  - `screenshot_file` entries inside `trajectory.json`
  - files named like `screenshot_*.png`
  - image files under an `images/` subdirectory
- `meta.json` is optional but useful for fields such as `instruction`, `task_id`, and `application`.
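A minimal loader for these conventions might look like the following. This is an illustrative sketch, not the repository's parser: `load_steps` is a hypothetical helper, and it assumes only the field names listed above (`trajectory`, `step`/`step_num`, `response`/`prediction`/`raw_response`, `screenshot_file`).

```python
import json
import tempfile
from pathlib import Path

def load_steps(traj_dir):
    """Parse trajectory.json following the conventions above (illustrative only)."""
    data = json.loads((Path(traj_dir) / "trajectory.json").read_text())
    # Either a bare list of steps, or a dict with a "trajectory" field.
    steps = data["trajectory"] if isinstance(data, dict) else data
    parsed = []
    for s in steps:
        parsed.append({
            # step number may appear as "step" or "step_num"
            "step": s.get("step", s.get("step_num")),
            # model response may appear under several keys
            "response": s.get("response") or s.get("prediction") or s.get("raw_response"),
            # screenshot path may be recorded explicitly in the step
            "screenshot": s.get("screenshot_file"),
        })
    return parsed

# Demo: write a tiny trajectory and load it back.
tmp = tempfile.mkdtemp()
(Path(tmp) / "trajectory.json").write_text(json.dumps(
    {"trajectory": [{"step_num": 1, "prediction": "click button",
                     "screenshot_file": "screenshot_1.png"}]}))
print(load_steps(tmp)[0]["response"])  # -> click button
```

The fallbacks for `screenshot_*.png` files and the `images/` subdirectory would be resolved against the directory listing when `screenshot_file` is absent; they are omitted here for brevity.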
## Citation

If you find this repository useful, please cite:

```bibtex
@misc{li2026osthemisscalablecriticframework,
  title={OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards},
  author={Zehao Li and Zhenyu Wu and Yibo Zhao and Bowen Yang and Jingjing Xie and Zhaoyang Liu and Zhoumianze Liu and Kaiming Jin and Jianze Liang and Zonglin Li and Feng Wu and Bowen Zhou and Zun Wang and Zichen Ding},
  year={2026},
  eprint={2603.19191},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.19191},
}
```