This repository focuses on Video World Models with Autoregressive (AR) Diffusion, a promising paradigm for scalable, consistent, and interactive world modeling (e.g., Genie 3). It aims to serve as a comprehensive and structured resource for researchers, practitioners, and enthusiasts interested in AR diffusion-based video world modeling. To stay at the forefront of the field, this repository is updated weekly.
- Structured Taxonomy: We organize the evolving ecosystem from three complementary perspectives: Algorithmic Foundations, Real-world Applications, and Infrastructure-level Acceleration. Together, these dimensions reflect the full stack of AR diffusion—from modeling design to real-time interactive deployment.
- One-Stop Citation Collection: 📚 We provide a consolidated BibTeX file containing all papers listed in this repository. You can easily import it into your LaTeX or Zotero projects with one click!
This repository is curated and maintained by:
- Min Zhao (gracezhao1997@gmail.com)
- Hongzhou Zhu (suinibian74@gmail.com)
- Wenqiang Sun (sunwq0814@gmail.com)
- Bokai Yan (2386721886@qq.com)
For any questions or suggestions, please feel free to reach out to us.
- 🎯 We have not yet compiled an exhaustive list of all related work. We apologize for any omissions and welcome pull requests to merge them in.
- 💡 We also welcome high-level categorization, synthesis, and perspective contributions to improve the organization and clarity of this repository.
These methods focus on basic AR Diffusion, where each chunk/frame is generated via diffusion and chunks are produced autoregressively, each conditioned on previously generated frames.
- Diffusion Forcing, "Next-token Prediction Meets Full-Sequence Diffusion".
- Pyramid Flow, "Pyramidal Flow Matching for Efficient Video Generative Modeling".
- DFoT, "History-Guided Video Diffusion".
- AR-Diffusion, "AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion".
- PFVG, "Pack and force your memory: Long-form and consistent video generation".
- BAgger, "BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models".
- Resampling Forcing, "End-to-End Training for Autoregressive Video Diffusion via Self-Resampling".
- Helios, "Helios: Real Real-Time Long Video Generation Model".
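As a rough illustration, the basic AR diffusion scheme shared by the methods above can be sketched as follows. This is a minimal sketch, not any specific paper's algorithm: `denoiser` is a hypothetical stand-in for a diffusion model's sampling update, and all shapes and step counts are illustrative.

```python
import numpy as np

def ar_diffusion_rollout(denoiser, num_chunks=4, chunk_frames=8,
                         num_steps=50, frame_shape=(3, 64, 64), rng=None):
    """Autoregressively generate a video chunk by chunk.

    `denoiser(noisy_chunk, t, context)` is a placeholder for any diffusion
    sampler update that refines the current chunk given the noise level `t`
    and the history of already-generated frames (the AR conditioning).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    history = []  # clean chunks generated so far
    for _ in range(num_chunks):
        x = rng.standard_normal((chunk_frames, *frame_shape))  # start from noise
        context = np.concatenate(history) if history else None  # past frames
        for step in reversed(range(num_steps)):
            t = step / num_steps
            x = denoiser(x, t, context)  # one denoising update on this chunk
        history.append(x)
    # full video: (num_chunks * chunk_frames, *frame_shape)
    return np.concatenate(history)
```

The key property is that diffusion runs *within* a chunk while conditioning flows causally *across* chunks, which is what enables streaming generation.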
This category of algorithms focuses on distilling multi-step bidirectional diffusion models into few-step AR models, specifically tailored for real-time streaming generation.
- From Multi-step Bidirectional Diffusion to Few-step Autoregressive Generators:
- [⭐] CausVid, "From Slow Bidirectional to Fast Autoregressive Video Diffusion Models".
- [⭐] Self Forcing, "Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion".
- [⭐] Causal Forcing, "Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation".
- Further Improvements:
  - (Adversarial distillation) Seaweed APT2, "Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation".
- (One-step distillation) ASD, "Towards One-Step Causal Video Generation via Adversarial Self-Distillation".
- (Two-steps distillation) Diagonal Distillation, "Streaming Autoregressive Video Generation via Diagonal Distillation".
- (Reinforcement learning) Reward Forcing, "Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation".
- (Reinforcement learning) WorldCompass, "WorldCompass: Reinforcement Learning for Long-Horizon World Models".
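In simplified form, the distillation setup shared by these methods can be sketched as regressing a few-step causal student onto a multi-step bidirectional teacher. This is a deliberately simplified sketch: methods such as CausVid and Self Forcing actually use distribution matching or adversarial objectives rather than plain regression, and all names below (`student`, `teacher_sample`) are illustrative.

```python
import numpy as np

def distillation_loss(student, teacher_sample, noise, context):
    """Mean-squared distillation objective on generated chunks.

    `student(noise, context)` maps noise to a clean chunk in a single step,
    conditioned only on past frames (causal). `teacher_sample(noise)` runs
    the full multi-step bidirectional sampler and serves as the target.
    """
    target = teacher_sample(noise)  # expensive multi-step teacher rollout
    pred = student(noise, context)  # one cheap causal student step
    return float(np.mean((pred - target) ** 2))
```

The asymmetry is the point: the teacher is slow and bidirectional (sees the whole clip), while the student must be fast and causal, which is what makes real-time streaming possible after distillation.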
- From Short-video Generator to Long-video Generator:
- LongLive, "LongLive: Real-time Interactive Long Video Generation".
- Rolling Forcing, "Rolling Forcing: Autoregressive Long Video Diffusion in Real Time".
- Self Forcing++, "Self-Forcing++: Towards Minute-Scale High-Quality Video Generation".
- Infinite Forcing,
- Infinity-RoPE, "Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout".
- Deep Forcing, "Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression".
- LoL, "LoL: Longer than Longer, Scaling Video Generation to Hour".
- FLEX, "Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation".
- LIVE, "LIVE: Long-horizon Interactive Video World Modeling".
- Rolling Sink, "Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion".
- MMM, "Mode Seeking meets Mean Seeking for Fast Long Video Generation".
- MemRoPE, "MemRoPE: Training-Free Infinite Video Generation via Evolving Memory Tokens".
- Anchor Forcing, "Anchor Forcing: Anchor Memory and Tri-Region RoPE for Interactive Streaming Video Diffusion".
- ShotStream, "ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling".
- DCARL, "DCARL: A Divide-and-Conquer Framework for Autoregressive Long-Trajectory Video Generation".
- PackForcing, "PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference".
- Grounded Forcing, "Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis".
- TempoMaster, "TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction".
- Hybrid Forcing, "Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation".
- Long-term Memory:
  - WORLDMEM, "WORLDMEM: Long-term Consistent World Simulation with Memory".
- VRAG, "Learning World Models for Interactive Video Generation".
  - Context as Memory, "Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval".
- Memory Forcing, "Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft".
- MemFlow, "MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives".
- StableWorld, "StableWorld: Towards Stable and Consistent Long Interactive Video Generation".
- Infinite-World, "Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory".
- Context Forcing, "Context Forcing: Consistent Autoregressive Video Generation with Long Context".
- ViewRope, "Geometry-Aware Rotary Position Embedding for Consistent Video World Model".
- PERSIST, "Beyond Pixel Histories: World Models with Persistent 3D State".
Papers are covered up to 2026.03 so far. Pull requests are welcome!
- Pyramid Flow, "Pyramidal Flow Matching for Efficient Video Generative Modeling".
- SkyReels, "SkyReels-V2: Infinite-length Film Generative Model".
- MAGI-1, "MAGI-1: Autoregressive Video Generation at Scale".
- Helios, "Helios: Real Real-Time Long Video Generation Model".
- Matrix-Game 3.0, "Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory".
- Matrix-game 2.0, "Matrix-game 2.0: An open-source real-time and streaming interactive world model".
- PAN, "PAN: A World Model for General, Interactable, and Long-Horizon World Simulation".
- RELIC, "RELIC: Interactive Video World Model with Long-Horizon Memory".
- Astra, "Astra: General Interactive World Model with Autoregressive Denoising".
- HY-WorldPlay, "WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling".
- Yume 1.5, "Yume-1.5: A Text-Controlled Interactive World Generation Model".
- Olaf-World, "Olaf-World: Orienting Latent Actions for Video World Modeling".
- Solaris, "Solaris: Building a Multiplayer Video World Model in Minecraft".
- MagicWorld, "MagicWorld: Towards Long-Horizon Stability for Interactive Video World Exploration".
- OmniRoam, "OmniRoam: World Wandering via Long-Horizon Panoramic Video Generation".
- INSPATIO-WORLD, "INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling".
- ActionParty, "ActionParty: Multi-Subject Action Binding in Generative Video Games".
- MultiWorld, "MultiWorld: Scalable Multi-Agent Multi-View Video World Models".
- MotionStream, "MotionStream: Real-Time Video Generation with Interactive Motion Controls".
- RealVideo,
- LiveAvatar, "Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length".
- SoulX-FlashTalk, "SoulX-FlashTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation".
- LiveTalk, "LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation".
- Avatar Forcing, "Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation".
- Geometry-as-context, "Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context".
- DragStream, "Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!".
- RealWonder, "RealWonder: Real-Time Physical Action-Conditioned Video Generation".
- WMReward, "Inference-time Physics Alignment of Video Generative Models with Latent World Models".
- SoulX-LiveAct, "SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory".
- OmniForcing, "OmniForcing: Unleashing Real-time Joint Audio-Visual Generation".
- PRISM, "PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition".
- DiT as Real-Time Rerenderer, "DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer".
This category focuses on first-person (egocentric) video generation, emphasizing hand-object interaction for VR.
- Hand2World, "Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures".
- Generated Reality, "Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control".
- Vidarc, "Vidarc: Embodied Video Diffusion Model for Closed-loop Control".
- DreamDojo, "DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos".
- FAR-Drive, "FAR-Drive: Frame-AutoRegressive Video Generation in Closed-Loop Autonomous Driving".
- ABot-PhysWorld, "ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment".
- Dummy Forcing, "Efficient Autoregressive Video Diffusion with Dummy Head".
- Light Forcing, "Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention".
- Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention, "Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention".
- TokenTrim, "TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation".
- PaFu-KV, "Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion".
- MonarchRT, "MonarchRT: Efficient Attention for Real-Time Video Generation".
- SCD, "Causality in Video Diffusers is Separable from Denoising".
- SVOO, "Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering".
- FlowCache, "Flow caching for autoregressive video generation".
- WorldCache, "WorldCache: Content-Aware Caching for Accelerated Video World Models".
- Quant VideoGen, "Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization".
We have not yet compiled an exhaustive list of all related work; we apologize for any omissions and welcome pull requests to merge them in. We also welcome high-level categorization and synthesis.
The format of this repository follows that of Awesome-World-Models.
