
Figure 1. Teaser Figure: EVPV — Explicit Visual Premise Verification for Reliable Multimodal Process Reward Modeling.
Figure 2. Overview of the EVPV-PRM framework. Structured image description, visual-dependency checklist, and reliability gate modulate step-level rewards.
From the repository root:
```bash
pip install -r requirements.txt
```

Models required (download separately):
| Role | Recommended model |
|---|---|
| EVPV-PRM verifier | Qwen2.5-VL-7B-Instruct |
| Policy model | InternVL2.5-8B / 26B / 38B |
| Teacher model | Qwen3-VL-235B-A22B-Instruct |
Data layout:
- VisualProcessBench: `data/visualprocessbench/visualprocessbench.jsonl` and `data/visualprocessbench/images/`
- General benchmarks (MathVista, etc.): `data/benchmarks/benchmarks.jsonl`, `data/benchmarks/mmmu.jsonl` and `data/benchmarks/images/`
Place image files in the corresponding images/ directories. See Data Format below for JSONL schemas.
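Records are standard JSONL (one JSON object per line). A minimal loader sketch, where the field names in the commented usage (`image`) are assumptions, with the real names given in the Data Format schemas:

```python
import json

def load_jsonl(path):
    """Yield one record (dict) per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Hypothetical usage; field names come from the Data Format schemas:
# for rec in load_jsonl("data/visualprocessbench/visualprocessbench.jsonl"):
#     image_path = "data/visualprocessbench/images/" + rec["image"]
```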
We use ms-swift for SFT and DPO training. The pipeline below prepares data from VisualPRM400K and runs vision SFT, optional DPO, step-discrimination SFT, and LoRA merge.
Config: Copy scripts/config.example.env to config.env, set your paths and (if using API-based data construction) LLM_API_URL and LLM_API_TOKEN. Source it before running:
```bash
cp scripts/config.example.env config.env
# Edit config.env, then:
set -a && source config.env && set +a
```

Download VisualPRM400K:

```bash
pip install modelscope
modelscope download --dataset OpenGVLab/VisualPRM400K
```

Set `VISUAL_PRM_ROOT` to the directory where the dataset is extracted (contains `images/` and `annotations/`).
From each subset (e.g. GeomVerse), extract (+, −) pairs and write `swiftdata.jsonl` and images under `OUTPUT_BASE_DIR/<subset>/`:

```bash
export VISUAL_PRM_ROOT=/path/to/VisualPRM400K
export IMAGE_SUBSET=GeomVerse
export OUTPUT_BASE_DIR=/path/to/evpv_data
python scripts/data/extract_preference_pairs.py
```

Repeat for other subsets by changing `IMAGE_SUBSET`, or extend the script to loop over subsets.
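A small driver for looping over subsets could look like this (a sketch; the subset names beyond GeomVerse are examples taken from the `DATA_SUB_DIRS` list used later):

```python
import os
import subprocess
import sys

def run_subset(subset, script="scripts/data/extract_preference_pairs.py"):
    """Invoke the extraction script with IMAGE_SUBSET set for one subset."""
    env = dict(os.environ, IMAGE_SUBSET=subset)
    return subprocess.run([sys.executable, script], env=env, check=True)

if __name__ == "__main__":
    for subset in ["GeomVerse", "Geo170K", "TabMWP"]:  # extend as needed
        run_subset(subset)
```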
If you have an LLM API for image description, generate `sft_vision_data.jsonl`:

```bash
export LLM_API_URL=https://your-llm-api/v1/api/chat
export LLM_API_TOKEN=your_token
export OUTPUT_BASE_DIR=/path/to/evpv_data
export DATA_SUB_DIRS=Geo170K,GeometryData,TabMWP,UniGeo,GeomVerse,GEOS,MAVIS-Geometry
python scripts/data/build_vision_sft_data.py
```

Output: `$OUTPUT_BASE_DIR/sft_vision_data.jsonl`.
Attach the descriptions from Step 3 to each subset's swiftdata so the step-judge scripts can read them:

```bash
python scripts/data/merge_vision_descriptions.py
```

This creates `swiftdata_image_describe.jsonl` in each subset directory under `OUTPUT_BASE_DIR`.
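Conceptually, the merge keys each description to its record and attaches it as a new field. A minimal sketch, where the `key` and `field` names are illustrative rather than the script's actual schema:

```python
def merge_descriptions(records, descriptions, key="image", field="image_describe"):
    """Attach a matching description to each record (illustrative field names)."""
    desc_by_key = {d[key]: d["description"] for d in descriptions}
    out = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are not mutated
        rec[field] = desc_by_key.get(rec[key], "")
        out.append(rec)
    return out
```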
Run the step-judge data builder (reads `swiftdata_image_describe.jsonl` from each directory and calls the LLM API):

```bash
export LLM_API_URL=... LLM_API_TOKEN=...
export OUTPUT_BASE_DIR=/path/to/evpv_data
python scripts/data/build_step_judge_sft_data.py
```

Output: `$OUTPUT_BASE_DIR/sft_processed_data_multithread.jsonl` (plus a progress file for resuming).
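The progress file lets an interrupted run resume by skipping items already processed. The pattern is roughly this (a sketch; the file format and id field are illustrative, not the builder's actual ones):

```python
import os

def load_done_ids(progress_path):
    """Return ids recorded by a previous run (one id per line)."""
    if not os.path.exists(progress_path):
        return set()
    with open(progress_path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def mark_done(progress_path, item_id):
    """Append an id immediately, so a crash loses at most the in-flight item."""
    with open(progress_path, "a", encoding="utf-8") as f:
        f.write(item_id + "\n")
```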
```bash
export OUTPUT_BASE_DIR=/path/to/evpv_data
export WORKSPACE_DIR=/path/to/workspace
export CUDA_VISIBLE_DEVICES=0,1,2,3
bash scripts/train/01_sft_vision.sh
```

Checkpoints are saved under `$WORKSPACE_DIR/Qwen2.5VL-VisionOutput` (or `VISION_SFT_OUTPUT_DIR`).
Prepare `dpo_mm_data.jsonl` (preference pairs in ms-swift DPO format), then:

```bash
export DPO_DATASET=/path/to/evpv_data/dpo_mm_data.jsonl
export WORKSPACE_DIR=/path/to/workspace
bash scripts/train/02_dpo_vision.sh
```

After vision SFT, merge a checkpoint into the base model so you can run step-judge SFT (Step 9) or deploy:
```bash
export WORKSPACE_DIR=/path/to/workspace
export BASE_MODEL=Qwen/Qwen2.5-VL-7B-Instruct
export TRAIN_OUTPUT_DIR=$WORKSPACE_DIR/Qwen2.5VL-VisionOutput
export CKPT_NAME=checkpoint-564
bash scripts/train/04_merge_lora.sh
```

The merged model is written to `$WORKSPACE_DIR/Qwen2.5VL-VisionOutput-merged` (or `MERGED_MODEL_DIR`).
Use the merged vision model from Step 8 as the base:

```bash
export MERGED_VISION_MODEL=/path/to/workspace/Qwen2.5VL-VisionOutput-merged
export STEP_JUDGE_DATASET=/path/to/evpv_data/sft_processed_data_multithread.jsonl
bash scripts/train/03_sft_step_judge.sh
```

| Script | Purpose |
|---|---|
| `scripts/data/extract_preference_pairs.py` | Extract (+, −) pairs from VisualPRM400K → `swiftdata.jsonl` |
| `scripts/data/build_vision_sft_data.py` | Build vision SFT JSONL via optional LLM API |
| `scripts/data/merge_vision_descriptions.py` | Merge descriptions into `swiftdata_image_describe.jsonl` |
| `scripts/data/build_step_judge_sft_data.py` | Build step-judge SFT JSONL via optional LLM API |
| `scripts/train/01_sft_vision.sh` | Vision SFT (ms-swift) |
| `scripts/train/02_dpo_vision.sh` | Vision DPO (ms-swift) |
| `scripts/train/03_sft_step_judge.sh` | Step-judge SFT (ms-swift, from merged vision model) |
| `scripts/train/04_merge_lora.sh` | Merge LoRA checkpoint into full model |
All scripts are run as Python modules from the repository root so that relative imports (`from .prompts import …`) resolve correctly.
| Task | Script | Description |
|---|---|---|
| VisualProcessBench (local) | `python -m evpv_prm.step_verifier_local` | Step verification with local vLLM |
| VisualProcessBench (API) | `python -m evpv_prm.step_verifier_api` | Step verification via remote API (set `API_URL` / `API_HEADERS` in the script) |
| Best-of-N (general) | `policy_inference` → `evpv_prm_inference` → `compute_bon_metrics` | Generate 8 candidates, score with EVPV-PRM, compute Pass@k / BoN@k |
| Best-of-N (MMMU) | `policy_inference_mmmu` → `evpv_prm_inference_mmmu` → `compute_bon_metrics` | Same pipeline for MMMU |
| Perception intervention | `perception_intervention_inference` → `evpv_prm_perception_intervention_eval` | Causal study of visual-evidence quality |
| Constraint corruption | `constraint_corruption_ablation` | DROP/FLIP constraint noise ablation |
| VPBench ablation | `vpbench_ablation_runner` | 27-config ablation on VisualProcessBench |
Example — Best-of-N reranking (general benchmarks):

```bash
# Step 1: generate 8 candidates per question
python -m evpv_prm.policy_inference
# Step 2: score candidates with EVPV-PRM
python -m evpv_prm.evpv_prm_inference
# Step 3: compute Pass@k and BoN@k
python -m evpv_prm.compute_bon_metrics
```

Example — VisualProcessBench (local vLLM):
```bash
python -m evpv_prm.step_verifier_local
```

Repository layout:

```
EVPV-PRM/
├── README.md
├── requirements.txt
├── scripts/
│   ├── config.example.env                # Example env config (copy to config.env)
│   ├── data/
│   │   ├── extract_preference_pairs.py   # Extract (+, −) pairs from VisualPRM400K
│   │   ├── build_vision_sft_data.py      # Build vision SFT data (optional API)
│   │   ├── merge_vision_descriptions.py  # Merge descriptions → swiftdata_image_describe
│   │   └── build_step_judge_sft_data.py  # Build step-judge SFT data (optional API)
│   └── train/
│       ├── 01_sft_vision.sh              # Vision SFT (ms-swift)
│       ├── 02_dpo_vision.sh              # Vision DPO (ms-swift)
│       ├── 03_sft_step_judge.sh          # Step-judge SFT (ms-swift)
│       └── 04_merge_lora.sh              # Merge LoRA into base model
├── evpv_prm/
│   ├── __init__.py
│   ├── prompts.py                        # All prompt templates (single source of truth)
│   ├── utils.py                          # Shared JSON parsing and IO helpers
│   ├── step_verifier_local.py            # Step verification — local vLLM
│   ├── step_verifier_api.py              # Step verification — remote API
│   ├── policy_inference.py               # Policy: generate 8 candidates (general benchmarks)
│   ├── policy_inference_mmmu.py          # Policy: generate 8 candidates (MMMU)
│   ├── evpv_prm_inference.py             # EVPV-PRM scoring pipeline (general benchmarks)
│   ├── evpv_prm_inference_mmmu.py        # EVPV-PRM scoring pipeline (MMMU)
│   ├── compute_bon_metrics.py            # Compute Pass@k / BoN@k metrics
│   ├── perception_intervention_inference.py
│   ├── evpv_prm_perception_intervention_eval.py
│   ├── constraint_corruption_ablation.py # Constraint noise vs. accuracy
│   └── vpbench_ablation_runner.py        # Ablation suite on VisualProcessBench
└── data/
    ├── visualprocessbench/
    │   ├── visualprocessbench.jsonl
    │   └── images/
    └── benchmarks/
        ├── benchmarks.jsonl
        ├── mmmu.jsonl
        └── images/
```
Core modules: prompts.py (templates), evpv_prm_inference.py / evpv_prm_inference_mmmu.py (three-stage EVPV-PRM scoring), compute_bon_metrics.py (Pass@k / BoN@k).
| Module | Purpose |
|---|---|
| `prompts.py` | Single source of truth for all prompt templates and builder functions |
| `utils.py` | Robust JSON parsing, image path resolution, thread-safe JSONL IO |
| `step_verifier_*.py` | Step-level verification on VisualProcessBench |
| `policy_inference*.py` | Generate diverse candidate solutions with InternVL |
| `evpv_prm_inference*.py` | Three-stage EVPV-PRM scoring with checkpoint support |
| `compute_bon_metrics.py` | Compute Pass@k / BoN@k from scored outputs |
| `perception_intervention_*.py` | Causal study of visual-evidence quality |
| `constraint_corruption_ablation.py` | DROP/FLIP constraint noise experiments |
| `vpbench_ablation_runner.py` | 27-configuration ablation on VisualProcessBench |
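For intuition, the two Best-of-N metrics can be sketched as follows, a minimal illustration over per-question candidate lists (the actual `compute_bon_metrics.py` implementation and its input schema may differ):

```python
def pass_at_k(candidates, k):
    """1.0 if any of the first k candidates is correct, else 0.0."""
    return float(any(c["correct"] for c in candidates[:k]))

def bon_at_k(candidates, k):
    """1.0 if the highest-scoring of the first k candidates is correct."""
    best = max(candidates[:k], key=lambda c: c["score"])
    return float(best["correct"])

def aggregate(per_question, k, metric):
    """Mean of a per-question metric over the whole benchmark."""
    return sum(metric(q, k) for q in per_question) / len(per_question)
```

Pass@k measures whether the policy can solve a question at all within k samples; BoN@k measures whether the reward model can pick out a correct sample when one exists.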
We build on and thank the open-source communities behind InternVL, vLLM, and the benchmark datasets (MathVista, MathVision, MathVerse, VisualProcessBench, MMMU, etc.).
If you find our work useful, please consider citing:
```bibtex
@misc{wang2026groundingscoreexplicitvisual,
  title={Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models},
  author={Junxin Wang and Dai Guan and Weijie Qiu and Zhihang Li and Yongbo Gai and Zhengyi Yang and Mengyu Zhou and Erchao Zhao and Xiaoxi Jiang and Guanjun Jiang},
  year={2026},
  eprint={2603.16253},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.16253},
}
```
