Qwen-Applications/EVPV-PRM

EVPV-PRM: Explicit Visual Premise Verification for Multimodal Process Reward Models

Qwen Large Model Application Team, Alibaba

Figure 1. Teaser Figure: EVPV — Explicit Visual Premise Verification for Reliable Multimodal Process Reward Modeling.

Figure 2. Overview of the EVPV-PRM framework. A structured image description, a visual-dependency checklist, and a reliability gate together modulate step-level rewards.

⚙️ 1. Setup and Installation

From the repository root:

pip install -r requirements.txt

Models required (download separately):

| Role | Recommended model |
| --- | --- |
| EVPV-PRM verifier | Qwen2.5-VL-7B-Instruct |
| Policy model | InternVL2.5-8B / 26B / 38B |
| Teacher model | Qwen3-VL-235B-A22B-Instruct |

📥 2. Data and Model Preparation

Data layout:

  • VisualProcessBench: data/visualprocessbench/visualprocessbench.jsonl and data/visualprocessbench/images/
  • General benchmarks (MathVista, etc.): data/benchmarks/benchmarks.jsonl, data/benchmarks/mmmu.jsonl and data/benchmarks/images/

Place image files in the corresponding images/ directories. See Data Format below for JSONL schemas.
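
Before running inference, it can help to confirm that the layout above is in place. The following is a minimal sketch, assuming the listed paths are relative to the repository root; `check_data_layout` is a hypothetical helper and is not part of the repository.

```python
from pathlib import Path

# Paths taken from the data layout above, relative to the repository root.
EXPECTED_PATHS = [
    "data/visualprocessbench/visualprocessbench.jsonl",
    "data/visualprocessbench/images",
    "data/benchmarks/benchmarks.jsonl",
    "data/benchmarks/mmmu.jsonl",
    "data/benchmarks/images",
]


def check_data_layout(repo_root="."):
    """Return the expected paths that are missing under repo_root (hypothetical helper)."""
    root = Path(repo_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]
```

Calling `check_data_layout()` from the repository root returns an empty list when every file and directory is present.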


🔄 3. Training Pipeline (ms-swift)

We use ms-swift for SFT and DPO training. The pipeline below prepares data from VisualPRM400K, then runs vision SFT, optional DPO, a LoRA merge, and step-discrimination SFT on the merged model.

Config: Copy scripts/config.example.env to config.env, then set your paths and, if you use API-based data construction, LLM_API_URL and LLM_API_TOKEN. Source the file before running any step:

cp scripts/config.example.env config.env
# Edit config.env, then:
set -a && source config.env && set +a
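
A quick pre-flight check can catch unset variables before a long job starts. This is a minimal sketch, assuming the variable names used in the steps of this README; which of them config.example.env actually defines is not confirmed here, and `missing_env_vars` is a hypothetical helper.

```python
import os

# Variable names as they appear in the steps of this README (an assumption
# about what config.env should provide). The API pair is only needed for the
# optional API-based data construction steps.
PATH_VARS = ["VISUAL_PRM_ROOT", "OUTPUT_BASE_DIR", "WORKSPACE_DIR"]
API_VARS = ["LLM_API_URL", "LLM_API_TOKEN"]


def missing_env_vars(use_api=False, env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    required = PATH_VARS + (API_VARS if use_api else [])
    return [name for name in required if not env.get(name)]
```

Run it in a Python process started from the shell that sourced config.env; an empty list means every checked variable is set.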

Step 1 — Download training data

pip install modelscope
modelscope download --dataset OpenGVLab/VisualPRM400K

Set VISUAL_PRM_ROOT to the directory where the dataset is extracted (contains images/ and annotations/).

Step 2 — Extract preference pairs

From each subset (e.g. GeomVerse), extract (+, −) pairs and write swiftdata.jsonl and images under OUTPUT_BASE_DIR/<subset>/:

export VISUAL_PRM_ROOT=/path/to/VisualPRM400K
export IMAGE_SUBSET=GeomVerse
export OUTPUT_BASE_DIR=/path/to/evpv_data
python scripts/data/extract_preference_pairs.py

Repeat for other subsets by changing IMAGE_SUBSET, or extend the script to loop over subsets.
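
One way to loop over subsets without editing the script is a small driver that sets IMAGE_SUBSET per run. This is a sketch under two assumptions: that the script reads IMAGE_SUBSET from the environment at startup, as the commands above suggest, and that the subset names from the DATA_SUB_DIRS example in Step 3 are the ones you want. `build_env` and `extract_all` are hypothetical names.

```python
import os
import subprocess
import sys

# Subset names copied from the DATA_SUB_DIRS example in this README (assumed
# to be valid values for IMAGE_SUBSET).
SUBSETS = ["Geo170K", "GeometryData", "TabMWP", "UniGeo",
           "GeomVerse", "GEOS", "MAVIS-Geometry"]


def build_env(subset, base_env=None):
    """Return a copy of the environment with IMAGE_SUBSET set for one run."""
    env = dict(os.environ if base_env is None else base_env)
    env["IMAGE_SUBSET"] = subset
    return env


def extract_all(subsets=SUBSETS, dry_run=True):
    """Run the extraction script once per subset; with dry_run, only list the runs."""
    cmd = [sys.executable, "scripts/data/extract_preference_pairs.py"]
    planned = []
    for subset in subsets:
        planned.append((subset, cmd))
        if not dry_run:
            # check=True stops the loop at the first failing subset.
            subprocess.run(cmd, env=build_env(subset), check=True)
    return planned
```

Call `extract_all(dry_run=False)` from the repository root after exporting VISUAL_PRM_ROOT and OUTPUT_BASE_DIR.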

Step 3 — Build vision-understanding SFT data (optional, API)

If you have an LLM API for image description, generate sft_vision_data.jsonl:

export LLM_API_URL=https://your-llm-api/v1/api/chat
export LLM_API_TOKEN=your_token
export OUTPUT_BASE_DIR=/path/to/evpv_data
export DATA_SUB_DIRS=Geo170K,GeometryData,TabMWP,UniGeo,GeomVerse,GEOS,MAVIS-Geometry
python scripts/data/build_vision_sft_data.py

Output: $OUTPUT_BASE_DIR/sft_vision_data.jsonl.

Step 4 — Merge image descriptions into per-subset JSONL

Attach descriptions from Step 3 to each subset’s swiftdata so step-judge scripts can read them:

python scripts/data/merge_vision_descriptions.py

This creates swiftdata_image_describe.jsonl in each subset dir under OUTPUT_BASE_DIR.

Step 5 — Build step-discrimination SFT data (optional, API)

Run the step-judge data builder (reads swiftdata_image_describe.jsonl from each dir, calls LLM API):

export LLM_API_URL=... LLM_API_TOKEN=...
export OUTPUT_BASE_DIR=/path/to/evpv_data
python scripts/data/build_step_judge_sft_data.py

Output: $OUTPUT_BASE_DIR/sft_processed_data_multithread.jsonl (and a progress file for resume).
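
The resume behaviour can be pictured as follows. This is an illustrative sketch only: the actual progress-file format used by build_step_judge_sft_data.py is not documented here, so the one-id-per-line format, the `id` field, and the function names are all assumptions.

```python
import json
from pathlib import Path


def load_done_ids(progress_path):
    """Read the set of already-processed record ids (assumed: one id per line)."""
    path = Path(progress_path)
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def process_with_resume(records, handle, output_path, progress_path):
    """Append handle(record) to output_path, skipping ids listed in progress_path."""
    done = load_done_ids(progress_path)
    written = 0
    with open(output_path, "a", encoding="utf-8") as out, \
         open(progress_path, "a", encoding="utf-8") as prog:
        for record in records:
            rec_id = str(record["id"])  # "id" is a hypothetical field name
            if rec_id in done:
                continue
            out.write(json.dumps(handle(record), ensure_ascii=False) + "\n")
            out.flush()
            # Record the id only after the output line has been written.
            prog.write(rec_id + "\n")
            prog.flush()
            done.add(rec_id)
            written += 1
    return written
```

Writing the id after the output line means an interrupted run may redo at most the record that was in flight, at the cost of a possible duplicate line for that record.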

Step 6 — Vision SFT training (ms-swift)

export OUTPUT_BASE_DIR=/path/to/evpv_data
export WORKSPACE_DIR=/path/to/workspace
export CUDA_VISIBLE_DEVICES=0,1,2,3
bash scripts/train/01_sft_vision.sh

Checkpoints are saved under $WORKSPACE_DIR/Qwen2.5VL-VisionOutput (or VISION_SFT_OUTPUT_DIR).

Step 7 — Vision DPO training (optional)

Prepare dpo_mm_data.jsonl (preference pairs in ms-swift DPO format), then:

export DPO_DATASET=/path/to/evpv_data/dpo_mm_data.jsonl
export WORKSPACE_DIR=/path/to/workspace
bash scripts/train/02_dpo_vision.sh

Step 8 — Merge LoRA into base model

After vision SFT, merge a checkpoint into the base model so you can run step-judge SFT (Step 9) or deploy:

export WORKSPACE_DIR=/path/to/workspace
export BASE_MODEL=Qwen/Qwen2.5-VL-7B-Instruct
export TRAIN_OUTPUT_DIR=$WORKSPACE_DIR/Qwen2.5VL-VisionOutput
export CKPT_NAME=checkpoint-564
bash scripts/train/04_merge_lora.sh

Merged model is written to $WORKSPACE_DIR/Qwen2.5VL-VisionOutput-merged (or MERGED_MODEL_DIR).

Step 9 — Step-discrimination SFT training (ms-swift)

Use the merged vision model from Step 8 as base:

export MERGED_VISION_MODEL=/path/to/workspace/Qwen2.5VL-VisionOutput-merged
export STEP_JUDGE_DATASET=/path/to/evpv_data/sft_processed_data_multithread.jsonl
bash scripts/train/03_sft_step_judge.sh

| Script | Purpose |
| --- | --- |
| scripts/data/extract_preference_pairs.py | Extract (+, −) pairs from VisualPRM400K → swiftdata.jsonl |
| scripts/data/build_vision_sft_data.py | Build vision SFT JSONL via optional LLM API |
| scripts/data/merge_vision_descriptions.py | Merge descriptions into swiftdata_image_describe.jsonl |
| scripts/data/build_step_judge_sft_data.py | Build step-judge SFT JSONL via optional LLM API |
| scripts/train/01_sft_vision.sh | Vision SFT (ms-swift) |
| scripts/train/02_dpo_vision.sh | Vision DPO (ms-swift) |
| scripts/train/03_sft_step_judge.sh | Step-judge SFT (ms-swift, from merged vision model) |
| scripts/train/04_merge_lora.sh | Merge LoRA checkpoint to full model |

🚀 4. Inference and Evaluation (EVPV-PRM)

All scripts are run as Python modules from the repository root so that relative imports (from .prompts import …) resolve correctly.

| Task | Script | Description |
| --- | --- | --- |
| VisualProcessBench (local) | python -m evpv_prm.step_verifier_local | Step verification with local vLLM |
| VisualProcessBench (API) | python -m evpv_prm.step_verifier_api | Step verification via remote API (set API_URL / API_HEADERS in script) |
| Best-of-N (general) | policy_inference → evpv_prm_inference → compute_bon_metrics | Generate 8 candidates, score with EVPV-PRM, compute Pass@k / BoN@k |
| Best-of-N (MMMU) | policy_inference_mmmu → evpv_prm_inference_mmmu → compute_bon_metrics | Same pipeline for MMMU |
| Perception intervention | perception_intervention_inference → evpv_prm_perception_intervention_eval | Causal study of visual-evidence quality |
| Constraint corruption | constraint_corruption_ablation | DROP/FLIP constraint noise ablation |
| VPBench ablation | vpbench_ablation_runner | 27-config ablation on VisualProcessBench |

Example — Best-of-N reranking (general benchmarks):

# Step 1: generate 8 candidates per question
python -m evpv_prm.policy_inference

# Step 2: score candidates with EVPV-PRM
python -m evpv_prm.evpv_prm_inference

# Step 3: compute Pass@k and BoN@k
python -m evpv_prm.compute_bon_metrics
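
To make the two metrics concrete, here is a minimal sketch under assumed definitions: Pass@k counts a question as solved if any of its first k candidates is correct, and BoN@k counts it as solved if the highest-scored candidate among the first k is correct. The repository's compute_bon_metrics.py may define or aggregate these differently; the `correct` and `score` field names are hypothetical, and `score` is assumed to be a single per-candidate value already aggregated from step-level rewards.

```python
def pass_at_k(questions, k):
    """Fraction of questions with at least one correct answer among the first k candidates."""
    hits = sum(any(c["correct"] for c in cands[:k]) for cands in questions)
    return hits / len(questions)


def bon_at_k(questions, k):
    """Fraction of questions whose top-scored candidate among the first k is correct."""
    hits = 0
    for cands in questions:
        best = max(cands[:k], key=lambda c: c["score"])
        hits += bool(best["correct"])
    return hits / len(questions)


# Toy data: two questions, two candidates each.
questions = [
    [{"correct": False, "score": 0.9}, {"correct": True, "score": 0.4}],
    [{"correct": True, "score": 0.8}, {"correct": False, "score": 0.3}],
]
```

On this toy data, `pass_at_k(questions, 2)` is 1.0 while `bon_at_k(questions, 2)` is 0.5, because the reward model ranks a wrong candidate first for the first question. The gap between the two is the headroom a better verifier could recover.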

Example — VisualProcessBench (local vLLM):

python -m evpv_prm.step_verifier_local

📁 Repository Structure

EVPV-PRM/
├── README.md
├── requirements.txt
├── scripts/
│   ├── config.example.env                  # Example env config (copy to config.env)
│   ├── data/
│   │   ├── extract_preference_pairs.py     # Extract (+, −) pairs from VisualPRM400K
│   │   ├── build_vision_sft_data.py        # Build vision SFT data (optional API)
│   │   ├── merge_vision_descriptions.py   # Merge descriptions → swiftdata_image_describe
│   │   └── build_step_judge_sft_data.py    # Build step-judge SFT data (optional API)
│   └── train/
│       ├── 01_sft_vision.sh                 # Vision SFT (ms-swift)
│       ├── 02_dpo_vision.sh                # Vision DPO (ms-swift)
│       ├── 03_sft_step_judge.sh            # Step-judge SFT (ms-swift)
│       └── 04_merge_lora.sh                # Merge LoRA into base model
├── evpv_prm/
│   ├── __init__.py
│   ├── prompts.py                          # All prompt templates (single source of truth)
│   ├── utils.py                            # Shared JSON parsing and IO helpers
│   ├── step_verifier_local.py              # Step verification — local vLLM
│   ├── step_verifier_api.py                # Step verification — remote API
│   ├── policy_inference.py                 # Policy: generate 8 candidates (general benchmarks)
│   ├── policy_inference_mmmu.py            # Policy: generate 8 candidates (MMMU)
│   ├── evpv_prm_inference.py               # EVPV-PRM scoring pipeline (general benchmarks)
│   ├── evpv_prm_inference_mmmu.py          # EVPV-PRM scoring pipeline (MMMU)
│   ├── compute_bon_metrics.py              # Compute Pass@k / BoN@k metrics
│   ├── perception_intervention_inference.py
│   ├── evpv_prm_perception_intervention_eval.py
│   ├── constraint_corruption_ablation.py   # Constraint noise vs. accuracy
│   └── vpbench_ablation_runner.py          # Ablation suite on VisualProcessBench
└── data/
    ├── visualprocessbench/
    │   ├── visualprocessbench.jsonl
    │   └── images/
    └── benchmarks/
        ├── benchmarks.jsonl
        ├── mmmu.jsonl
        └── images/

Core modules: prompts.py (templates), evpv_prm_inference.py / evpv_prm_inference_mmmu.py (three-stage EVPV-PRM scoring), compute_bon_metrics.py (Pass@k / BoN@k).

| Module | Purpose |
| --- | --- |
| prompts.py | Single source of truth for all prompt templates and builder functions |
| utils.py | Robust JSON parsing, image path resolution, thread-safe JSONL IO |
| step_verifier_*.py | Step-level verification on VisualProcessBench |
| policy_inference*.py | Generate diverse candidate solutions with InternVL |
| evpv_prm_inference*.py | Three-stage EVPV-PRM scoring with checkpoint support |
| compute_bon_metrics.py | Compute Pass@k / BoN@k from scored outputs |
| perception_intervention_*.py | Causal study of visual-evidence quality |
| constraint_corruption_ablation.py | DROP/FLIP constraint noise experiments |
| vpbench_ablation_runner.py | 27-configuration ablation on VisualProcessBench |
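
As an illustration of what "robust JSON parsing" can mean for model output, the sketch below pulls the first parseable JSON object out of free-form text. It is not the implementation in utils.py; `extract_json_object` is a hypothetical name, and the approach (scanning for `{` and using the standard library's `JSONDecoder.raw_decode`) is one possible choice.

```python
import json


def extract_json_object(text):
    """Return the first JSON object found in text, or None if none parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores any trailing text after it.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None
```

This tolerates leading commentary and trailing text around the JSON payload, which is a common failure mode when a verifier is asked to answer in JSON.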

🙏 Acknowledgements

We build on and thank the open-source communities behind InternVL, vLLM, and the benchmark datasets (MathVista, MathVision, MathVerse, VisualProcessBench, MMMU, etc.).

📜 Citation

If you find our work useful, please consider citing:

@misc{wang2026groundingscoreexplicitvisual,
      title={Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models}, 
      author={Junxin Wang and Dai Guan and Weijie Qiu and Zhihang Li and Yongbo Gai and Zhengyi Yang and Mengyu Zhou and Erchao Zhao and Xiaoxi Jiang and Guanjun Jiang},
      year={2026},
      eprint={2603.16253},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.16253}, 
}
