OS-Copilot/OS-Themis


OS-Themis

A Scalable Critic Framework for Generalist GUI Rewards

Official repository for the paper: OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards.



Updates

  • [2026-03-20] Initial release of the paper and repository.

Overview

OS-Themis is a multi-agent critic framework for evaluating whether a GUI trajectory truly completes a task. Instead of relying on a single holistic judgment, OS-Themis decomposes the trajectory into verifiable milestones, checks the corresponding screenshots, audits potential risks, and then aggregates the evidence into a final decision.

This repository contains the evaluation pipeline used to score GUI trajectories with four specialized roles:

  • Selector: chooses the key steps that determine task completion.
  • Verifier: checks selected steps against screenshots and history.
  • Reviewer: audits risk points such as missing save/submit actions or contradictory evidence.
  • Judge: produces the final task-level completion verdict.

The framework is designed for OpenAI-compatible multimodal endpoints and supports both single-trajectory evaluation and large-scale benchmark evaluation.

Framework

[Figure: OS-Themis framework overview]

At a high level, the pipeline works as follows:

  1. Parse trajectory.json and recover step-wise reasoning, actions, and screenshots.
  2. Ask the Selector to identify the most decisive steps.
  3. Ask the Verifier to inspect those steps with visual evidence.
  4. Ask the Reviewer to surface missing evidence and risky actions.
  5. Ask the Judge to issue the final completed / not_completed decision.
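The five steps above can be sketched as a single control-flow function. The role implementations below are toy stand-ins so the flow is runnable (in the real framework each role calls the configured multimodal endpoint); all function and field names are illustrative, not the repository's API:

```python
from dataclasses import dataclass

@dataclass
class Step:
    step_num: int
    action: str
    screenshot: str  # path to this step's screenshot

# Toy stand-ins for the four model-backed roles (illustrative logic only).
def selector(task, steps):
    # Pick the last step as decisive; the real Selector reasons over all steps.
    return steps[-1:]

def verifier(task, key_steps):
    # Pretend every selected step is visually confirmed.
    return {s.step_num: True for s in key_steps}

def reviewer(task, steps, evidence):
    # Flag a risk if no save/submit-like action appears anywhere.
    risky = not any(a in s.action for s in steps for a in ("save", "submit"))
    return ["missing save/submit action"] if risky else []

def judge(task, evidence, risks):
    return "completed" if all(evidence.values()) and not risks else "not_completed"

def evaluate_trajectory(task, steps):
    key_steps = selector(task, steps)        # 2. decisive steps
    evidence = verifier(task, key_steps)     # 3. visual checks
    risks = reviewer(task, steps, evidence)  # 4. risk audit
    return judge(task, evidence, risks)      # 5. final verdict

steps = [Step(1, "type filename", "screenshot_1.png"),
         Step(2, "click save", "screenshot_2.png")]
print(evaluate_trajectory("save the document", steps))  # -> completed
```

The key design point is that each role only sees what it needs: the Verifier inspects the Selector's chosen steps, and the Judge never re-reads the raw trajectory, only the aggregated evidence and risks.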

Results on OmniGUIRewardBench

OmniGUIRewardBench (OGRBench) is available on Hugging Face: lizh1/OmniGUIRewardBench.

| Model | Ubuntu Acc | Mobile Acc | Windows Acc | macOS Acc | Web Acc | Overall Acc | Overall Prec | Overall Rec |
|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B | 77.2 | 85.6 | 72.8 | 85.7 | 86.3 | 79.3 | 86.3 | 69.4 |
| Qwen3-VL-30B-A3B | 79.5 | 84.6 | 76.5 | 88.3 | 80.5 | 80.3 | 84.7 | 73.7 |
| Qwen3-VL-32B | 77.6 | 83.5 | 75.1 | 88.3 | 84.7 | 79.6 | 92.2 | 64.3 |
| Qwen3-VL-235B | 88.1 | 92.3 | 77.5 | 94.8 | 92.1 | 88.0 | 92.8 | 82.3 |
| Qwen3-VL-235B-Thinking | 83.4 | 89.4 | 85.5 | 93.5 | 84.7 | 85.2 | 89.3 | 79.9 |
| GPT-5-mini | 68.8 | 65.4 | 76.5 | 87.0 | 75.8 | 71.5 | 95.4 | 44.7 |
| GPT-5 | 82.5 | 80.3 | 84.5 | 88.3 | 95.9 | 82.9 | 93.4 | 70.6 |
| Gemini-3-Flash | 85.0 | 91.0 | 86.9 | 93.5 | 82.6 | 86.2 | 93.2 | 78.0 |
| Mean | 80.3 | 84.0 | 79.4 | 89.9 | 85.3 | 81.6 | 90.9 | 70.4 |

Environment Setup

Installation

Python 3.10 or newer is required.

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Configuration

All model and framework settings are read from config.yaml.

model:
  base_url: "http://your-openai-compatible-endpoint/v1"
  api_key: "your-api-key"
  model_name: "qwen3-vl-235b"
  temperature: 0
  top_p: 0.1
  max_tokens: 8192

framework:
  selector_max_rounds: 6
  selector_initial_retries: 2
  selector_follow_retries: 2
  fallback_key_steps_count: 3
  reviewer_max_rounds: 2

Notes:

  • base_url should point to an OpenAI-compatible /chat/completions service.
  • model_name is the single model identifier used by all four roles; Selector, Verifier, Reviewer, and Judge share the same backend.
  • --output-dir is optional. If omitted, outputs default to benchmark_result/<model_name>/ for batch evaluation and multi_agent_eval_outputs/<model_name>/<task_name>/ for single-trajectory evaluation.

Running Evaluation

1. Single Trajectory

Use eval_traj.py to evaluate one trajectory directory:

python eval_traj.py \
  --traj-dir /path/to/trajectory_dir \
  --config config.yaml

To write results to a custom location, pass --output-dir:

python eval_traj.py \
  --traj-dir /path/to/trajectory_dir \
  --output-dir /path/to/custom_output_dir \
  --config config.yaml

2. Full Benchmark or Custom Benchmark File

Use evaluation.py for batch evaluation:

python evaluation.py \
  --output-dir os_themis \
  --benchmark-file /path/to/OmniGUIRewardBench.json \
  --result-root benchmark_result \
  --config config.yaml

Trajectory Format

Each trajectory directory should contain at least:

trajectory_dir/
├── trajectory.json
├── meta.json                # optional
├── screenshot_1.png         # optional if screenshot_file is recorded in trajectory.json
├── screenshot_2.png
└── ...

Supported conventions:

  • trajectory.json can be either a list of steps or a dictionary with a trajectory field.
  • Each step should provide step or step_num.
  • The model response can come from response, prediction, or raw_response.
  • Screenshots are resolved from:
    • screenshot_file entries inside trajectory.json
    • files named like screenshot_*.png
    • image files under an images/ subdirectory
  • meta.json is optional but useful for fields such as instruction, task_id, and application.
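A minimal loader honoring these conventions might look like the following sketch (field names mirror the documentation, but the function itself is illustrative, not the repository's parser):

```python
import json
from pathlib import Path

def load_trajectory(traj_dir):
    """Parse trajectory.json, normalizing the supported field conventions."""
    traj_dir = Path(traj_dir)
    data = json.loads((traj_dir / "trajectory.json").read_text())
    # trajectory.json may be a list of steps or a dict with a "trajectory" field.
    steps = data["trajectory"] if isinstance(data, dict) else data
    parsed = []
    for i, step in enumerate(steps, start=1):
        num = step.get("step") or step.get("step_num") or i
        # The model response can live under any of these keys.
        response = (step.get("response")
                    or step.get("prediction")
                    or step.get("raw_response"))
        # Screenshots: explicit screenshot_file, else screenshot_<n>.png,
        # else the same name under an images/ subdirectory.
        shot = step.get("screenshot_file")
        if shot is None:
            for cand in (traj_dir / f"screenshot_{num}.png",
                         traj_dir / "images" / f"screenshot_{num}.png"):
                if cand.exists():
                    shot = str(cand)
                    break
        parsed.append({"step": num, "response": response, "screenshot": shot})
    return parsed
```

Optional fields from meta.json (instruction, task_id, application) would be merged separately, since they describe the task rather than individual steps.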

Citation

If you find this repository useful, please cite:

@misc{li2026osthemisscalablecriticframework,
      title={OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards},
      author={Zehao Li and Zhenyu Wu and Yibo Zhao and Bowen Yang and Jingjing Xie and Zhaoyang Liu and Zhoumianze Liu and Kaiming Jin and Jianze Liang and Zonglin Li and Feng Wu and Bowen Zhou and Zun Wang and Zichen Ding},
      year={2026},
      eprint={2603.19191},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2603.19191},
}
