This repository focuses on Video World Models with Autoregressive (AR) Diffusion, a promising paradigm for scalable, consistent, and interactive world modeling (e.g., Genie 3). It aims to serve as a comprehensive and structured resource for researchers, practitioners, and enthusiasts interested in AR diffusion-based video world modeling. To stay at the forefront of the field, this repository is updated weekly.
- Structured Taxonomy: We organize the evolving ecosystem from three complementary perspectives: Algorithmic Foundations, Real-world Applications, and Infrastructure-level Acceleration. Together, these dimensions reflect the full stack of AR diffusion—from modeling design to real-time interactive deployment.
- One-Stop Citation Collection: 📚 We provide a consolidated BibTeX file containing all papers listed in this repository. You can easily import it into your LaTeX or Zotero projects with one click!
This repository is curated and maintained by:
- Min Zhao (gracezhao1997@gmail.com)
- Hongzhou Zhu (suinibian74@gmail.com)
- Wenqiang Sun (sunwq0814@gmail.com)
- Bokai Yan (2386721886@qq.com)
For any questions or suggestions, please feel free to reach out to us.
- 🎯 We have not yet compiled an exhaustive list of all related work. We apologize for any omissions and welcome pull requests to merge them in.
- 💡 We also welcome high-level categorization, synthesis, and perspective contributions to improve the organization and clarity of this repository.
These methods focus on basic AR Diffusion, where each chunk/frame is generated via diffusion and chunks are produced autoregressively, each conditioned on previously generated frames.
- Diffusion Forcing, "Next-token Prediction Meets Full-Sequence Diffusion".
- Pyramid Flow, "Pyramidal Flow Matching for Efficient Video Generative Modeling".
- DFoT, "History-Guided Video Diffusion".
- AR-Diffusion, "AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion".
- PFVG, "Pack and force your memory: Long-form and consistent video generation".
- BAgger, "BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models".
- Resampling Forcing, "End-to-End Training for Autoregressive Video Diffusion via Self-Resampling".
- Helios, "Helios: Real Real-Time Long Video Generation Model".
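As a rough illustration, the basic AR diffusion scheme shared by the methods above can be sketched as follows. This is a minimal sketch, not any specific paper's algorithm: `denoiser` is a hypothetical stand-in for a diffusion model's sampling update, and all shapes and step counts are illustrative.

```python
import numpy as np

def ar_diffusion_rollout(denoiser, num_chunks=4, chunk_frames=8,
                         num_steps=50, frame_shape=(3, 64, 64), rng=None):
    """Autoregressively generate a video chunk by chunk.

    `denoiser(noisy_chunk, t, context)` is a placeholder for any diffusion
    sampler update that refines the current chunk given the noise level `t`
    and the history of already-generated frames (the AR conditioning).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    history = []  # clean chunks generated so far
    for _ in range(num_chunks):
        x = rng.standard_normal((chunk_frames, *frame_shape))  # start from noise
        context = np.concatenate(history) if history else None  # past frames
        for step in reversed(range(num_steps)):
            t = step / num_steps
            x = denoiser(x, t, context)  # one denoising update on this chunk
        history.append(x)
    # full video: (num_chunks * chunk_frames, *frame_shape)
    return np.concatenate(history)
```

The key property is that diffusion runs *within* a chunk while conditioning flows causally *across* chunks, which is what enables streaming generation.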
This category of algorithms focuses on distilling multi-step bidirectional diffusion models into few-step AR models, specifically tailored for real-time streaming generation.
- From Multi-step Bidirectional Diffusion to Few-step Autoregressive Generators:
- [⭐] CausVid, "From Slow Bidirectional to Fast Autoregressive Video Diffusion Models".
- [⭐] Self Forcing, "Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion".
- [⭐] Causal Forcing, "Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation".
- Further Improvements:
  - (Adversarial distillation) Seaweed APT2, "Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation".
- (One-step distillation) ASD, "Towards One-Step Causal Video Generation via Adversarial Self-Distillation".
- (Two-steps distillation) Diagonal Distillation, "Streaming Autoregressive Video Generation via Diagonal Distillation".
- (Reinforcement learning) Reward Forcing, "Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation".
- (Reinforcement learning) WorldCompass, "WorldCompass: Reinforcement Learning for Long-Horizon World Models".
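In simplified form, the distillation setup shared by these methods can be sketched as regressing a few-step causal student onto a multi-step bidirectional teacher. This is a deliberately simplified sketch: methods such as CausVid and Self Forcing actually use distribution matching or adversarial objectives rather than plain regression, and all names below (`student`, `teacher_sample`) are illustrative.

```python
import numpy as np

def distillation_loss(student, teacher_sample, noise, context):
    """Mean-squared distillation objective on generated chunks.

    `student(noise, context)` maps noise to a clean chunk in a single step,
    conditioned only on past frames (causal). `teacher_sample(noise)` runs
    the full multi-step bidirectional sampler and serves as the target.
    """
    target = teacher_sample(noise)  # expensive multi-step teacher rollout
    pred = student(noise, context)  # one cheap causal student step
    return float(np.mean((pred - target) ** 2))
```

The asymmetry is the point: the teacher is slow and bidirectional (sees the whole clip), while the student must be fast and causal, which is what makes real-time streaming possible after distillation.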
- From Short-video Generator to Long-video Generator:
- LongLive, "LongLive: Real-time Interactive Long Video Generation".
- Rolling Forcing, "Rolling Forcing: Autoregressive Long Video Diffusion in Real Time".
- Self Forcing++, "Self-Forcing++: Towards Minute-Scale High-Quality Video Generation".
- Infinite Forcing,
- Infinity-RoPE, "Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout".
- Deep Forcing, "Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression".
- LoL, "LoL: Longer than Longer, Scaling Video Generation to Hour".
- FLEX, "Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation".
- LIVE, "LIVE: Long-horizon Interactive Video World Modeling".
- Rolling Sink, "Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion".
- MMM, "Mode Seeking meets Mean Seeking for Fast Long Video Generation".
- MemRoPE, "MemRoPE: Training-Free Infinite Video Generation via Evolving Memory Tokens".
- Anchor Forcing, "Anchor Forcing: Anchor Memory and Tri-Region RoPE for Interactive Streaming Video Diffusion".
- ShotStream, "ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling".
- DCARL, "DCARL: A Divide-and-Conquer Framework for Autoregressive Long-Trajectory Video Generation".
- PackForcing, "PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference".
- Grounded Forcing, "Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis".
- TempoMaster, "TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction".
- Hybrid Forcing, "Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation".
- Long-term Memory:
  - WORLDMEM, "WORLDMEM: Long-term Consistent World Simulation with Memory".
- VRAG, "Learning World Models for Interactive Video Generation".
  - Context as Memory, "Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval".
- Memory Forcing, "Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft".
- MemFlow, "MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives".
- StableWorld, "StableWorld: Towards Stable and Consistent Long Interactive Video Generation".
- Infinite-World, "Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory".
- Context Forcing, "Context Forcing: Consistent Autoregressive Video Generation with Long Context".
- ViewRope, "Geometry-Aware Rotary Position Embedding for Consistent Video World Model".
- PERSIST, "Beyond Pixel Histories: World Models with Persistent 3D State".
Papers are covered up to 2026.03 so far. Pull requests are welcome!
- Pyramid Flow, "Pyramidal Flow Matching for Efficient Video Generative Modeling".
- SkyReels, "SkyReels-V2: Infinite-length Film Generative Model".
- MAGI-1, "MAGI-1: Autoregressive Video Generation at Scale".
- Helios, "Helios: Real Real-Time Long Video Generation Model".
- Matrix-Game 3.0, "Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory".
- Matrix-game 2.0, "Matrix-game 2.0: An open-source real-time and streaming interactive world model".
- PAN, "PAN: A World Model for General, Interactable, and Long-Horizon World Simulation".
- RELIC, "RELIC: Interactive Video World Model with Long-Horizon Memory".
- Astra, "Astra: General Interactive World Model with Autoregressive Denoising".
- HY-WorldPlay, "WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling".
- Yume 1.5, "Yume-1.5: A Text-Controlled Interactive World Generation Model".
- Olaf-World, "Olaf-World: Orienting Latent Actions for Video World Modeling".
- Solaris, "Solaris: Building a Multiplayer Video World Model in Minecraft".
- MagicWorld, "MagicWorld: Towards Long-Horizon Stability for Interactive Video World Exploration".
- OmniRoam, "OmniRoam: World Wandering via Long-Horizon Panoramic Video Generation".
- INSPATIO-WORLD, "INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling".
- ActionParty, "ActionParty: Multi-Subject Action Binding in Generative Video Games".
- MultiWorld, "MultiWorld: Scalable Multi-Agent Multi-View Video World Models".
- MotionStream, "MotionStream: Real-Time Video Generation with Interactive Motion Controls".
- RealVideo,
- LiveAvatar, "Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length".
- SoulX-FlashTalk, "SoulX-FlashTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation".
- LiveTalk, "LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation".
- Avatar Forcing, "Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation".
- Geometry-as-context, "Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context".
- DragStream, "Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!".
- RealWonder, "RealWonder: Real-Time Physical Action-Conditioned Video Generation".
- WMReward, "Inference-time Physics Alignment of Video Generative Models with Latent World Models".
- SoulX-LiveAct, "SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory".
- OmniForcing, "OmniForcing: Unleashing Real-time Joint Audio-Visual Generation".
- PRISM, "PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition".
- DiT as Real-Time Rerenderer, "DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer".
This category focuses on first-person (egocentric) video generation, emphasizing hand-object interaction for VR.
- Hand2World, "Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures".
- Generated Reality, "Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control".
- Vidarc, "Vidarc: Embodied Video Diffusion Model for Closed-loop Control".
- DreamDojo, "DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos".
- FAR-Drive, "FAR-Drive: Frame-AutoRegressive Video Generation in Closed-Loop Autonomous Driving".
- ABot-PhysWorld, "ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment".
- Dummy Forcing, "Efficient Autoregressive Video Diffusion with Dummy Head".
- Light Forcing, "Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention".
- Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention, "Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention".
- TokenTrim, "TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation".
- PaFu-KV, "Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion".
- MonarchRT, "MonarchRT: Efficient Attention for Real-Time Video Generation".
- SCD, "Causality in Video Diffusers is Separable from Denoising".
- SVOO, "Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering".
- FlowCache, "Flow caching for autoregressive video generation".
- WorldCache, "WorldCache: Content-Aware Caching for Accelerated Video World Models".
- Quant VideoGen, "Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization".
We have not yet compiled an exhaustive list of all related work; we apologize for any omissions and welcome pull requests to merge them in. We also welcome high-level categorization and synthesis.
The format of this repository follows that of Awesome-World-Models.
