Physics-IQ and Physics-IQ Verified: Benchmarking physical understanding in generative video models

Leaderboard | Quick Start | Physics-IQ Verified Workflow | Citation | License

Physics-IQ is a high-quality, realistic, and comprehensive benchmark dataset for evaluating physical understanding in generative video models. Building on this foundation, Physics-IQ Verified improves the quality of the prompts and metrics.

This repository contains the workflow for both Physics-IQ Verified (recommended benchmark variant) and the original Physics-IQ benchmark.

Original Physics-IQ website: physics-iq.github.io
Physics-IQ Verified website: physics-iq-verified.anates.ai

Key Features:

Real-world videos: All videos are captured with high-quality cameras, not rendered.
Diverse scenarios: Covers a wide range of physical phenomena, including collisions, fluid dynamics, gravity, material properties, light, shadows, magnetism, and more.
Multiple perspectives: Each scenario is filmed from 3 different angles.
Variations: Each scenario is recorded twice to capture natural physical variations.
High resolution and frame rate: Videos are recorded at 3840 × 2160 resolution and 30 frames per second.

Leaderboard

The best possible score on Physics-IQ is 100.0%. It would be achieved by physically realistic videos that adhere to all tested principles of physics and differ only in physical randomness.

Physics-IQ Verified Leaderboard

If you test your model on Physics-IQ Verified and would like your score/paper/model to be featured here in this table, feel free to open a pull request that adds a row to the table and we'll be happy to include it!

The leaderboard is also hosted at: physics-iq-verified.anates.ai

#	Model	input type	Physics-IQ verified	date added (YYYY-MM-DD)
1	Magi-1 24B + GeoPhys (BoN) (op)	multiframe (v2v)	58.2 ± 1.8 🥇 v2v	2026-06-19
2	Magi-1 24B (op)	multiframe (v2v)	48.4 ± 1.1 🥈 v2v	2026-06-19
3	Cosmos3-Super-Image2Video	i2v	39.5 ± 0.8 🥇 i2v	2026-06-18
4	Grok Imagine Video	i2v	34.8 ± 0.6 🥈 i2v	2026-06-17
5	Magi-1 24B + GeoPhys (BoN) (op)	i2v	33.7 ± 1.4 🥉 i2v	2026-06-19
6	Hunyuan Video 1.5	i2v	33.4 ± 0.8	2026-06-17
7	Wan 2.2	i2v	32.2 ± 0.6	2026-06-17
8	Cosmos3-Nano	i2v	30.3 ± 0.6	2026-06-18
9	Magi-1 24B (op)	i2v	30.2 ± 1.1	2026-06-19
10	Sora 2	i2v	26.5 ± 0.8	2026-06-17
11	P-Video	i2v	25.3 ± 1.8	2026-06-17

For details on the Physics-IQ Verified metrics, see the arXiv report.

Unless an entry is marked op (original prompt), all reported scores use best-practice prompts (bpp) produced by a custom templater for each specific model.

Rules:

One run is sufficient to be included on the verified leaderboard. In general, we recommend using 4 runs and reporting the mean and standard deviation. To claim SOTA, reporting the standard deviation across 4 runs is required.

Physics-IQ Original Leaderboard

If you test your model on Physics-IQ Original and would like your score/paper/model to be featured here in this table, feel free to open a pull request that adds a row to the table and we'll be happy to include it!

#	Model	input type	Physics-IQ score	date added (YYYY-MM-DD)
1	Magi-1 + GeoPhys (BoN) reported here	multiframe (v2v)	64.5 % 🥇 v2v	2026-06-17
2	Cosmos3-Super + WMReward (BoN) reported here	multiframe (v2v)	63.4 % 🥈 v2v	2026-05-26
3	Magi-1 + WMReward (BoN) reported here	multiframe (v2v)	62.6 % 🥉 v2v	2025-10-28
4	Cosmos3-Super reported here	multiframe (v2v)	59.7 %	2026-05-26
5	Cosmos3-Nano + WMReward (BoN) reported here	multiframe (v2v)	57.7 %	2026-05-26
6	Magi-1 reported here	multiframe (v2v)	56.0 %	2025-04-21
7	Cosmos3-Nano reported here	multiframe (v2v)	50.2 %	2026-05-26
8	Cosmos3-Super + WMReward (BoN) reported here	i2v	48.9 % 🥇 i2v	2026-05-26
9	Sora2 + WMReward (BoN) reported here	i2v	46.4 % 🥈 i2v	2026-04-01
10	Wan2.2 + WMReward (BoN) reported here	i2v	44.4 % 🥉 i2v	2026-04-01
11	Cosmos3-Super reported here	i2v	43.8 %	2026-05-26
12	Cosmos3-Nano + WMReward (BoN) reported here	i2v	43.8 %	2026-05-26
13	Sora2 reported here	i2v	42.3 %	2026-04-01
14	Cosmos3-Nano reported here	i2v	40.2 %	2026-05-26
15	Magi-1 + GeoPhys (BoN) reported here	i2v	38.6 %	2026-06-17
16	Wan2.2 reported here	i2v	38.3 %	2026-04-01
17	Magi-1 + WMReward (BoN) reported here	i2v	36.9 %	2025-10-28
18	Video-GPT reported here	multiframe (v2v)	35.0 %	2025-05-22
19	CogVideoX-5B + GeoPhys (BoN) reported here	i2v	34.1 %	2026-06-17
20	Wan2.1 14B + GeoPhys (BoN) reported here	i2v	34.0 %	2026-06-17
21	Magi-1 4.5B + GeoPhys (BoN) reported here	i2v	34.0 %	2026-06-17
22	CogVideoX-5b reported here	i2v	32.3 %	2026-01-06
23	Magi-1 reported here	i2v	30.2 %	2025-04-21
24	VideoPoet reported here	multiframe (v2v)	29.5 %	2025-02-19
25	Lumiere reported here	multiframe (v2v)	23.0 %	2025-02-19
26	Runway Gen 3 reported here	i2v	22.8 %	2025-02-19
27	VideoPoet reported here	i2v	20.3 %	2025-02-19
28	Lumiere reported here	i2v	19.0 %	2025-02-19
29	Stable Video Diffusion reported here	i2v	14.8 %	2025-02-19
30	Pika reported here	i2v	13.0 %	2025-02-19
31	Sora reported here	i2v	10.0 %	2025-02-19

Note to early adopters of the benchmark: results from the paper were finalized on February 19, 2025. If you used the toolbox before that date, please re-run your evaluation, since we changed and improved a few aspects. Likewise, if you downloaded the dataset before that date, we recommend re-downloading it so that the ground truth video masks have a duration of five seconds.

Quick Start

Choose one benchmark:

Physics-IQ Verified Workflow: recommended benchmark with improved prompts, masks, and scoring. This is the default when running physiq/run_physics_iq.py.
Physics-IQ Original Workflow: original Physics-IQ benchmark. Use --original_physics_iq when evaluating.

Physics-IQ Verified Workflow

A. Download Physics-IQ Verified

Download the verified benchmark from the Physics-IQ Verified Hugging Face dataset.

Note: Access requests are approved automatically.

Install the Hugging Face CLI if it is not already present:

pip install -U huggingface_hub

Download Physics-IQ Verified into the desired destination folder:

hf download Anates-Labs-Research/Physics-IQ-Verified \
  --repo-type dataset \
  --local-dir physics-IQ-benchmark-verified

Ensure you have downloaded and placed the physics-IQ-benchmark-verified dataset in your working directory. This dataset must include 30FPS videos and can optionally include your desired FPS. If you downloaded the dataset from the link above, it should contain all provided FPS variants (30FPS, 24FPS, 16FPS, 8FPS). If your desired FPS does not exist in the dataset already, it will be automatically generated. The folder should have the following structure:

physics-IQ-benchmark-verified/
├── full-videos/
│   └── take-1/
│       └── 30FPS/
│           ├── 0001_full-videos_30FPS_perspective-left_take-1_trimmed-ball-and-block-fall.mp4
│           ├── 0002_full-videos_30FPS_perspective-center_take-1_trimmed-ball-and-block-fall.mp4
│           └── ...
├── split-videos/
│   └── testing/
│       └── 30FPS/
│           ├── 0001_testing-videos_30FPS_perspective-left_take-1_trimmed-ball-and-block-fall.mp4
│           ├── 0002_testing-videos_30FPS_perspective-center_take-1_trimmed-ball-and-block-fall.mp4
│           └── ...
├── switch-frames/
│   ├── 0001_switch-frames_anyFPS_perspective-left_trimmed-ball-and-block-fall.jpg
│   ├── 0002_switch-frames_anyFPS_perspective-center_trimmed-ball-and-block-fall.jpg
│   └── ...
└── video-masks/
    └── real/
        └── 30FPS/
            ├── 0001_video-masks_30FPS_perspective-left_take-1_trimmed-ball-and-block-fall.mp4
            ├── 0002_video-masks_30FPS_perspective-center_take-1_trimmed-ball-and-block-fall.mp4
            └── ...
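
One way to confirm that a download matches the tree above is a short check like the following. This is only a sketch: the helper name `missing_subdirs` is hypothetical and not part of the repository, and the folder list is copied from the listing above, so adjust it if your copy differs.

```python
from pathlib import Path

# Subfolders copied from the directory tree above (assumption: your copy
# uses the same layout; other FPS variants are not checked here).
EXPECTED_SUBDIRS = (
    "full-videos/take-1/30FPS",
    "split-videos/testing/30FPS",
    "switch-frames",
    "video-masks/real/30FPS",
)


def missing_subdirs(benchmark_root):
    """Return the expected subfolders that do not exist under benchmark_root."""
    root = Path(benchmark_root)
    return [rel for rel in EXPECTED_SUBDIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_subdirs("physics-IQ-benchmark-verified")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Layout looks complete.")
```

An empty result means every listed folder was found. It does not check the number or content of the videos.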

B. Set Up Environment

Option A — uv (recommended):

uv sync

Installing uv

Install uv according to [Astral documentation](https://docs.astral.sh/uv/getting-started/installation):

wget -qO- https://astral.sh/uv/install.sh | sh

or via pip:

pip install uv

Option B — pip:

pip install .

To also install development tools (formatter, test runner, notebooks):

pip install ".[dev]"

Contributors who need an editable install can use pip install -e ".[dev]" instead.

System requirements: tested on Linux; requires ffprobe (install with sudo apt-get install ffmpeg).

Note for pip users: replace uv run with python in all commands below.

C. Choose Prompt Template

C1. Why this matters.

Prompting conventions differ across video models. To evaluate models fairly, use the prompt template that best matches each model's expected input style instead of forcing every model into the same wording. You can either use one of the existing templates below or write your own model-specific templater. For example, OpenAI provides an excellent Sora 2 prompting guide that can be used as a reference when designing a templater.

C2. Prompt settings.

Physics-IQ Verified uses two prompt settings:

bpp uses a model-specific prompt (or the base version) produced by a templater stored inside descriptions/best_practice.
op uses the original descriptions/descriptions_original.csv prompts.

C3. Existing templates.

For the bpp setting, the base templated descriptions can be found in descriptions/best_practice/descriptions_base.csv. For models with specific prompting guidelines, model-optimised descriptions can be generated using uv run physiq/generate_descriptions.py {model_name}:

File	Optimised for
`descriptions_pvideo.csv`	P-Video (Pruna AI)
`descriptions_sora2.csv`	Sora 2 (OpenAI)

C4. Add a new templater (optional, recommended for new models).

Adding a new templater for your model

Open physiq/templater/physiq_verified.py and add a class decorated with @register("name"):

from templater.base import BaseTemplater, register

@register("mymodel")
class MyModelTemplater(BaseTemplater):
    def generate_prompt(self, identifier) -> str:
        action = self.get_subjectaction_description(identifier)
        scene = self.get_scene_description(identifier)
        setup = self.get_scenesetup_description(identifier)
        # compose however your model expects it
        return f"{action} {scene} {setup}"

Generate the descriptions CSV:

uv run physiq/generate_descriptions.py mymodel
# writes descriptions/best_practice/descriptions_mymodel.csv

Available helper methods on BaseTemplater:

get_subjectaction_description(id) — what happens in the scene
get_scene_description(id) — static scene setup
get_scenesetup_description(id) — pre-action state (optional, may be empty)
self.camera_description / self.style_description / self.action_description — fixed boilerplate strings
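
Because the scene-setup description may be empty, joining the fragments with a plain f-string can leave stray spaces in the prompt. A minimal sketch of a more defensive join is shown below. The function `compose_prompt` and the example strings are hypothetical stand-ins for the values returned by the helper methods, not part of the repository.

```python
def compose_prompt(*parts):
    """Join non-empty prompt fragments with single spaces."""
    cleaned = [p.strip() for p in parts if p and p.strip()]
    return " ".join(cleaned)


# Hypothetical fragments standing in for the BaseTemplater helper outputs.
action = "A ball rolls off the table."
scene = "A wooden table stands in front of a white wall."
setup = ""  # the scene-setup description is optional and may be empty

print(compose_prompt(action, scene, setup))
# prints: A ball rolls off the table. A wooden table stands in front of a white wall.
```

Inside a templater, the same helper could be called from generate_prompt with the three description strings.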

C5. Generate a descriptions CSV.

To regenerate or add a new variant:

uv run physiq/generate_descriptions.py sora2   # or pvideo, base

This writes a model-specific descriptions CSV, for example:

descriptions/best_practice/descriptions_sora2.csv

with the same evaluation columns as the base descriptions file:

scenario,description,category,generated_video_name
0001_perspective-left_take-1_trimmed-ball-and-block-fall.mp4,"Style: ...",Solid Mechanics,0001_perspective-left_trimmed-ball-and-block-fall.mp4
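
To feed these prompts to a generation script, the CSV can be read with the standard library. The sketch below assumes the four columns shown above and assumes that take-1 rows can be recognised by the `_take-1_` marker in the scenario name. The function name `load_take1_prompts` is hypothetical.

```python
import csv


def load_take1_prompts(csv_path):
    """Map generated_video_name -> description for rows whose scenario is take-1."""
    prompts = {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Assumption: take-1 rows carry "_take-1_" in the scenario filename.
            if "_take-1_" in row["scenario"]:
                prompts[row["generated_video_name"]] = row["description"]
    return prompts
```

Each key is the filename under which the generated video should be saved, so the mapping can drive both prompting and output naming.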

D. Generate Videos

D1. Choose input mode.

First choose the input mode used by your model.

Image-to-video models (I2V)

Use initial frames from physics-IQ-benchmark-verified/switch-frames.
If your model uses text input, use the descriptions CSV selected or generated in Step C. Only the first 198 rows marked as take-1 are needed for generation.
Save generated videos with the benchmark ID prefix:

<model_run_folder>/0001_perspective-left_trimmed-ball-and-block-fall.mp4
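
The output filename can be derived from the switch-frame filename, which keeps the benchmark ID prefix intact. The sketch below relies on the naming pattern visible in the switch-frames listing. The function name `output_name_for_frame` is hypothetical.

```python
from pathlib import Path

# Tag that appears in switch-frame filenames but not in generated-video names.
FRAME_TAG = "_switch-frames_anyFPS"


def output_name_for_frame(frame_path):
    """Derive the generated-video filename from a switch-frame filename."""
    stem = Path(frame_path).stem
    if FRAME_TAG not in stem:
        raise ValueError(f"unexpected switch-frame name: {frame_path}")
    return stem.replace(FRAME_TAG, "", 1) + ".mp4"


print(output_name_for_frame(
    "0001_switch-frames_anyFPS_perspective-left_trimmed-ball-and-block-fall.jpg"
))
# prints: 0001_perspective-left_trimmed-ball-and-block-fall.mp4
```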

Multiframe-to-video models (V2V)

Use conditioning videos from physics-IQ-benchmark-verified/split-videos/conditioning-videos.
If your model also accepts text input, use the descriptions CSV selected or generated in Step C.
Ensure the frame rate matches the benchmark FPS you will evaluate at.
Save generated videos with the benchmark ID prefix:

<model_run_folder>/0001_perspective-left_trimmed-ball-and-block-fall.mp4

D2. Name each model-run folder.

Save generated videos in one directory per model run. For leaderboard-style reporting, generate four independent runs for each model and prompt setting. The aggregate leaderboard score in Step G is computed as the mean ± standard deviation across these four runs. Use the folder name to encode both the prompt setting and the run number:

<model_name>-<prompt_setting>-run_<run_number>

The prompt setting should be bpp for model-specific benchmark prompts or op for original prompts. The run number should use run_01 through run_04 for the standard four-run benchmark setup. Filenames may vary, but each video must keep the unique ID prefix from the benchmark (0001_, ..., 0198_). Using descriptive benchmark-style names is recommended.
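
The naming convention and the ID-prefix requirement can be checked before evaluation. The following sketch builds a run folder name and lists benchmark IDs with no matching file. Both helper names and the model name `mymodel` are hypothetical.

```python
import re

# Four-digit benchmark ID followed by an underscore, e.g. "0001_".
ID_PREFIX = re.compile(r"^(\d{4})_")


def run_folder_name(model_name, prompt_setting, run_number):
    """Build '<model_name>-<prompt_setting>-run_<run_number>' with a two-digit run number."""
    if prompt_setting not in ("bpp", "op"):
        raise ValueError("prompt_setting must be 'bpp' or 'op'")
    return f"{model_name}-{prompt_setting}-run_{run_number:02d}"


def missing_ids(filenames, expected=range(1, 199)):
    """Return the expected benchmark IDs (default 1..198) with no 'NNNN_' prefixed file."""
    found = set()
    for name in filenames:
        match = ID_PREFIX.match(name)
        if match:
            found.add(int(match.group(1)))
    return sorted(set(expected) - found)


print(run_folder_name("mymodel", "bpp", 1))
# prints: mymodel-bpp-run_01
```

Calling `missing_ids` on the filenames in a run folder returns an empty list when all 198 IDs are present.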

E. Trim Videos

Before running evaluation, trim all generated videos to exactly 5 seconds. Videos of any other duration are incompatible with the benchmark. If you are running V2V, exclude the 3-second conditioning segment and keep only the generated 5 seconds.

You can use the repo-local generated_videos_5s/ folder for trimmed outputs or store them externally and pass those folders to --input_folders.

Example trimmed video folder:

generated_videos_5s/
├── <model_name>-bpp-run_01/
│   ├── 0001_perspective-left_trimmed-ball-and-block-fall.mp4
│   ├── 0002_perspective-center_trimmed-ball-and-block-fall.mp4
│   └── ...
├── <model_name>-bpp-run_02/
│   └── ...
├── <model_name>-bpp-run_03/
│   └── ...
└── <model_name>-bpp-run_04/
    └── ...

Original-prompt (`op`) trimmed folder example

generated_videos_5s/
├── <model_name>-op-run_01/
│   ├── 0001_perspective-left_trimmed-ball-and-block-fall.mp4
│   ├── 0002_perspective-center_trimmed-ball-and-block-fall.mp4
│   └── ...
├── <model_name>-op-run_02/
│   └── ...
├── <model_name>-op-run_03/
│   └── ...
└── <model_name>-op-run_04/
    └── ...

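Example trimming loop with ffmpeg, which keeps the first 5 seconds of each video and resamples it to 24 FPS:

mkdir -p generated_videos_5s/<model_name>-bpp-run_01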

for v in generated_videos/<model_name>-bpp-run_01/*.mp4; do
  ffmpeg -y -i "$v" \
    -t 5 \
    -r 24 \
    "generated_videos_5s/<model_name>-bpp-run_01/$(basename "$v")"
done
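
After trimming, the durations can be spot-checked with ffprobe, which the system requirements already list. This is a sketch under stated assumptions: the helper names are hypothetical, and the 0.05-second tolerance is an assumed value, not one taken from the evaluator.

```python
import subprocess


def probe_duration(video_path):
    """Return the container duration in seconds as reported by ffprobe."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            str(video_path),
        ],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def is_five_seconds(duration, tolerance=0.05):
    """True if duration is within `tolerance` seconds of 5.0 (tolerance is assumed)."""
    return abs(duration - 5.0) <= tolerance
```

A typical use is to loop over the files in a trimmed run folder and report any video for which `is_five_seconds(probe_duration(path))` is false.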

F. Run Evaluation

Verified evaluation is the default behavior of physiq/run_physics_iq.py. This step reports two per-run score variants for each input folder: the original score and the verified score. For Physics-IQ Verified leaderboard reporting, use the verified score.

uv run physiq/run_physics_iq.py \
  --input_folders \
    generated_videos_5s/<model_name>-bpp-run_01 \
    generated_videos_5s/<model_name>-bpp-run_02 \
    generated_videos_5s/<model_name>-bpp-run_03 \
    generated_videos_5s/<model_name>-bpp-run_04 \
  --output_folder <output_dir> \
  --descriptions_file <descriptions_file> \
  --benchmark_base_folder <folder_containing_physics-IQ-benchmark-verified>

Parameters:

--input_folders: directories containing generated .mp4 videos, with one directory per model run.
--output_folder: directory where result CSV files and plots will be saved.
--descriptions_file: path to the descriptions CSV used for the benchmark.
--benchmark_base_folder: parent folder containing physics-IQ-benchmark-verified.
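
When the evaluation is launched from a script, the command can be assembled as an argument list. The sketch below uses only the flags listed above. The function name `build_eval_command` and its `launcher` parameter are hypothetical conveniences.

```python
def build_eval_command(input_folders, output_folder, descriptions_file,
                       benchmark_base_folder, launcher=("uv", "run")):
    """Return the argv list for physiq/run_physics_iq.py using the flags listed above."""
    if not input_folders:
        raise ValueError("at least one input folder is required")
    return [
        *launcher,
        "physiq/run_physics_iq.py",
        "--input_folders", *input_folders,
        "--output_folder", output_folder,
        "--descriptions_file", descriptions_file,
        "--benchmark_base_folder", benchmark_base_folder,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. For a pip-based setup, pass `launcher=("python",)`.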

The evaluator writes one result CSV and one metrics JSON per input folder, using the input folder name as the file stem:

<output_dir>/
└── physics-IQ-benchmark-verified/
    └── results/
        ├── <model_name>-bpp-run_01.csv
        ├── <model_name>-bpp-run_01_metrics.json
        ├── <model_name>-bpp-run_02.csv
        ├── <model_name>-bpp-run_02_metrics.json
        ├── <model_name>-bpp-run_03.csv
        ├── <model_name>-bpp-run_03_metrics.json
        ├── <model_name>-bpp-run_04.csv
        ├── <model_name>-bpp-run_04_metrics.json
        ├── physics_IQ_score_Original_barplot.pdf # original score
        └── physics_IQ_score_Verified_barplot.pdf # verified score, used for the verified leaderboard

The verified score printed by the evaluator is stored as final_score_view in each _metrics.json file.
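
The per-run verified scores can be collected from the metrics files with a few lines of Python. The sketch assumes that `final_score_view` is a top-level numeric key in each `_metrics.json` file. The function name `collect_verified_scores` is hypothetical.

```python
import json
from pathlib import Path


def collect_verified_scores(results_dir):
    """Map run name -> final_score_view for every '*_metrics.json' in results_dir."""
    scores = {}
    for path in sorted(Path(results_dir).glob("*_metrics.json")):
        with open(path, encoding="utf-8") as handle:
            metrics = json.load(handle)
        # Strip the "_metrics.json" suffix to recover the input folder name.
        run_name = path.name[: -len("_metrics.json")]
        # Assumption: the score is stored as a top-level number.
        scores[run_name] = float(metrics["final_score_view"])
    return scores
```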

G. Aggregate Leaderboard Scores

Step F reports per-run original and verified score variants. To report a Physics-IQ Verified leaderboard score, use the verified score from each run and compute the mean and standard deviation across the standard four runs. Report this as score ± std in the leaderboard table.

To do this, use aggregate_runs_from_csvs.py as follows:

uv run physiq/aggregate_runs_from_csvs.py \
  <path>/<model_name>-bpp-run_01.csv \
  <path>/<model_name>-bpp-run_02.csv \
  <path>/<model_name>-bpp-run_03.csv \
  <path>/<model_name>-bpp-run_04.csv \
  --score-type verified

We also accept single run results, but we do recommend using 4 runs.
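
For reference, the mean ± standard deviation over per-run scores can be reproduced as sketched below. This is an illustration, not the repository's aggregation script: it uses the sample standard deviation, which may differ from what aggregate_runs_from_csvs.py computes, and the scores are made up.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, std) of per-run scores; std is 0.0 for a single run."""
    mean = statistics.fmean(scores)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def format_score(mean, std):
    """Format as 'score ± std' with one decimal place, as in the leaderboard table."""
    return f"{mean:.1f} ± {std:.1f}"


# Hypothetical per-run verified scores, for illustration only.
mean, std = summarize_runs([30.0, 32.0, 31.0, 33.0])
print(format_score(mean, std))
# prints: 31.5 ± 1.3
```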

Physics-IQ Original Workflow

A. Download Physics-IQ Original

Download the original benchmark from the Physics-IQ Google Cloud Storage link, or install the gcloud SDK and run:

uv run physiq/download_physics_iq_data.py \
  --fps 30 --original_physics_iq \
  --benchmark_base_folder <download_parent>

Ensure you have downloaded and placed the physics-IQ-benchmark dataset in your working directory. This dataset must include 30FPS videos and can optionally include your desired FPS. If you downloaded the dataset from the link above, it should contain all provided FPS variants (30FPS, 24FPS, 16FPS, 8FPS). If your desired FPS does not exist in the dataset already, it will be automatically generated. The folder should have the following structure:

physics-IQ-benchmark/
├── full-videos/
│   └── take-1/
│       └── 30FPS/
│           └── ...
├── split-videos/
│   ├── conditioning-videos/
│   │   └── 30FPS/
│   │       ├── 0001_conditioning-videos_30FPS_perspective-left_take-1_trimmed-ball-and-block-fall.mp4
│   │       ├── 0002_conditioning-videos_30FPS_perspective-center_take-1_trimmed-ball-and-block-fall.mp4
│   │       └── ...
│   └── testing-videos/
│       └── 30FPS/
│           ├── 0001_testing-videos_30FPS_perspective-left_take-1_trimmed-ball-and-block-fall.mp4
│           ├── 0002_testing-videos_30FPS_perspective-center_take-1_trimmed-ball-and-block-fall.mp4
│           └── ...
├── switch-frames/
│   ├── 0001_switch-frames_anyFPS_perspective-left_trimmed-ball-and-block-fall.jpg
│   ├── 0002_switch-frames_anyFPS_perspective-center_trimmed-ball-and-block-fall.jpg
│   └── ...
└── video-masks/
    └── real/
        └── 30FPS/
            ├── 0001_video-masks_30FPS_perspective-left_take-1_trimmed-ball-and-block-fall.mp4
            ├── 0002_video-masks_30FPS_perspective-center_take-1_trimmed-ball-and-block-fall.mp4
            └── ...

B. Set Up Environment

Use the same environment setup as the verified workflow.

C. Use Original Prompts

Use descriptions/descriptions_original.csv for original Physics-IQ prompts.

D. Generate Videos

Use the same generated-video folder and filename conventions as the verified workflow, but source frames and conditioning videos from physics-IQ-benchmark/ and use the original (op) descriptions from descriptions/descriptions_original.csv.

E. Trim Videos

Trim generated videos to exactly 5 seconds before evaluation.

F. Run Evaluation

Add --original_physics_iq to evaluate against the original benchmark:

uv run physiq/run_physics_iq.py \
  --input_folders \
    generated_videos_5s/<model_name> \
  --output_folder <output_dir> \
  --descriptions_file descriptions/descriptions_original.csv \
  --benchmark_base_folder <folder_containing_physics-IQ-benchmark> \
  --original_physics_iq

The evaluator writes one result CSV and one metrics JSON per input folder, using the input folder name as the file stem:

<output_dir>/
└── physics-IQ-benchmark-verified/
    └── results/
        ├── <model_name>.csv
        ├── <model_name>.json
        ├── physics_IQ_score_Original_barplot.pdf # score for the original leaderboard
        └── physics_IQ_score_Verified_barplot.pdf # verified score on original data

The original Physics-IQ score is plotted in physics_IQ_score_Original_barplot.pdf and stored in the correspondingly named JSON file under the key final_score_origround.

Citation

If you think this project is helpful, please feel free to leave a star ⭐️

Original Physics-IQ:

@inproceedings{motamed2026physics,
  title={Do generative video models understand physical principles?},
  author={Saman Motamed and Laura Culp and Kevin Swersky and Priyank Jaini and Robert Geirhos},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={948--958},
  year={2026}
}

Physics-IQ Verified (which builds on the paper above):

@article{radsch2026verified,
  author  = {Rädsch, Tim and Asano, Yuki M. and Kuehne, Hilde and Bauer, Stefan and Jaini, Priyank and Geirhos, Robert and Lüth, Carsten T.},
  title   = {Physics-IQ Verified},
  journal = {arXiv preprint arXiv:2606.18943},
  year    = {2026},
}

License and disclaimer

Physics-IQ

All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0

All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode

Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.

This is not an official Google product.
