gracezhao1997/Awesome-Video-World-Models-with-AR-Diffusion

📹 Awesome Video World Models with AR Diffusion


Overview

This repository focuses on Video World Models with Autoregressive (AR) Diffusion, a promising paradigm for scalable, consistent, and interactive world modeling (e.g., Genie 3). It aims to serve as a comprehensive and structured resource for researchers, practitioners, and enthusiasts interested in AR diffusion-based video world modeling. To stay at the forefront of the field, this repository is updated weekly.

🌟 Key Features

  • Structured Taxonomy: We organize the evolving ecosystem from three complementary perspectives: Algorithmic Foundations, Real-world Applications, and Infrastructure-level Acceleration. Together, these dimensions reflect the full stack of AR diffusion—from modeling design to real-time interactive deployment.
  • One-Stop Citation Collection: 📚 We provide a consolidated BibTeX file containing all papers listed in this repository. You can easily import it into your LaTeX or Zotero projects with one click!
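
As a quick sanity check before importing a consolidated BibTeX file, it can help to list the citation keys it contains. The snippet below is a minimal sketch using only the Python standard library. The sample entries and the function name `list_citation_keys` are hypothetical placeholders, not entries or tooling from this repository, and the regular expression only handles conventionally formatted entries.

```python
import re

# Matches the start of an entry such as "@article{some_key," and captures
# the entry type and the citation key.
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")

# Hypothetical sample content; a real .bib file would be read from disk.
SAMPLE = """
@article{example_a,
  title = {A Hypothetical Entry},
}
@inproceedings{example_b,
  title = {Another Hypothetical Entry},
}
"""

def list_citation_keys(bibtex_text):
    """Return citation keys in order of appearance, skipping non-entry blocks."""
    keys = []
    for entry_type, key in ENTRY_RE.findall(bibtex_text):
        if entry_type.lower() in {"comment", "string", "preamble"}:
            continue  # these BibTeX blocks are not bibliography entries
        keys.append(key)
    return keys

print(list_citation_keys(SAMPLE))
```

For a file on disk, the same function can be applied to the text returned by `open(path, encoding="utf-8").read()`, where `path` points at wherever the BibTeX file was saved.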

📬 Contact

This repository is curated and maintained by:

For any questions or suggestions, please feel free to reach out to us.

  • 🎯 We have not yet compiled an exhaustive list of all related work. We apologize for any omissions and welcome pull requests to merge them in.
  • 💡 We also welcome high-level categorization, synthesis, and perspective contributions to improve the organization and clarity of this repository.

Table of Contents

1. Algorithm

1.1 AR Diffusion (native pretraining)

These methods cover basic AR diffusion: each chunk or frame is generated by a diffusion process, and the chunks or frames are produced autoregressively, each conditioned on those generated before it.

  • Diffusion Forcing, "Next-token Prediction Meets Full-Sequence Diffusion". arXiv Website Code BibTeX
  • Pyramid Flow, "Pyramidal Flow Matching for Efficient Video Generative Modeling". arXiv Website Code BibTeX
  • DFoT, "History-Guided Video Diffusion". arXiv Website Code BibTeX
  • AR-Diffusion, "AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion". arXiv Code BibTeX
  • PFVG, "Pack and force your memory: Long-form and consistent video generation". arXiv Website Code BibTeX
  • BAgger, "BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models". arXiv Website BibTeX
  • Resampling Forcing, "End-to-End Training for Autoregressive Video Diffusion via Self-Resampling". arXiv Website BibTeX
  • Helios, "Helios: Real Real-Time Long Video Generation Model". arXiv Website Code BibTeX
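
The basic AR diffusion loop described at the top of this subsection can be sketched as follows. This is a toy illustration under stated assumptions, not the sampler of any listed paper: `toy_denoiser` is a hypothetical stand-in for a learned network, the "video" is a list of short float chunks, and the update rule is a simple deterministic interpolation toward the predicted clean chunk.

```python
import random

def toy_denoiser(noisy_chunk, context, t):
    """Hypothetical stand-in for a learned denoiser.

    Predicts a clean chunk whose values equal the mean of the previous
    chunk plus 1.0 (or 0.0 when there is no context yet).
    """
    if context:
        last = context[-1]
        target = sum(last) / len(last) + 1.0
    else:
        target = 0.0
    return [target for _ in noisy_chunk]

def sample_chunk(context, chunk_len, num_steps, rng):
    """Inner loop: denoise one chunk from noise, conditioned on past chunks."""
    x = [rng.gauss(0.0, 1.0) for _ in range(chunk_len)]  # start from pure noise
    for step in range(num_steps):
        t = (num_steps - step) / num_steps           # current noise level
        t_next = (num_steps - step - 1) / num_steps  # next, lower noise level
        x0_pred = toy_denoiser(x, context, t)
        # Move toward the prediction; at t_next == 0 the result equals x0_pred.
        x = [x0 + (t_next / t) * (xi - x0) for xi, x0 in zip(x, x0_pred)]
    return x

def ar_diffusion_rollout(num_chunks, chunk_len=4, num_steps=8, seed=0):
    """Outer loop: generate chunks one after another (the autoregressive part)."""
    rng = random.Random(seed)
    video = []
    for _ in range(num_chunks):
        video.append(sample_chunk(video, chunk_len, num_steps, rng))
    return video

video = ar_diffusion_rollout(num_chunks=3)
print(video)
```

The two nested loops are the point of the sketch: the outer loop is autoregressive over chunks, while the inner loop is the diffusion denoising process for a single chunk given everything generated so far.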

1.2 🔥 AR Diffusion Distillation for Real-time Generation (post training)

This category of algorithms focuses on distilling multi-step bidirectional diffusion models into few-step AR models, specifically tailored for real-time streaming generation.

  • From Multi-step Bidirectional Diffusion to Few-step Autoregressive Generators:
    • [⭐] CausVid, "From Slow Bidirectional to Fast Autoregressive Video Diffusion Models". arXiv Website Code BibTeX
    • [⭐] Self Forcing, "Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion". arXiv Website Code BibTeX
    • [⭐] Causal Forcing, "Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation". arXiv Website Code BibTeX

  • Further Improvements:
    • (Adversarial distillation) Seaweed APT2, "Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation". arXiv Website BibTeX
    • (One-step distillation) ASD, "Towards One-Step Causal Video Generation via Adversarial Self-Distillation". arXiv Code BibTeX
    • (Two-step distillation) Diagonal Distillation, "Streaming Autoregressive Video Generation via Diagonal Distillation". arXiv Website Code BibTeX
    • (Reinforcement learning) Reward Forcing, "Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation". arXiv Website Code BibTeX
    • (Reinforcement learning) WorldCompass, "WorldCompass: Reinforcement Learning for Long-Horizon World Models". arXiv Website BibTeX
    • (Reinforcement learning) AR-CoPO, "AR-CoPO: Align Autoregressive Video Generation with Contrastive Policy Optimization". arXiv BibTeX
    • (Reinforcement learning) Astrolabe, "Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models". arXiv Website Code BibTeX
    • (Distribution matching) Salt, "Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation". arXiv Code BibTeX
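
One way to picture the gap between a bidirectional model and an AR model is through the attention mask over frames. The sketch below is an illustrative assumption rather than the implementation of any listed method. The function names are hypothetical, and frames are grouped into fixed-size chunks so that each frame may attend to its own chunk and all earlier chunks.

```python
def bidirectional_mask(num_frames):
    """Every frame may attend to every other frame (past and future)."""
    return [[True] * num_frames for _ in range(num_frames)]

def block_causal_mask(num_frames, chunk_size):
    """Frame q may attend to frame k only if k's chunk is not after q's chunk."""
    return [
        [(k // chunk_size) <= (q // chunk_size) for k in range(num_frames)]
        for q in range(num_frames)
    ]

def show(mask):
    """Render a mask as rows of 1s (allowed) and 0s (blocked)."""
    return "\n".join("".join("1" if allowed else "0" for allowed in row) for row in mask)

print(show(bidirectional_mask(4)))
print()
print(show(block_causal_mask(4, chunk_size=2)))
```

With a block-causal mask, already generated chunks never depend on future ones, which is what makes chunk-by-chunk streaming possible. A bidirectional model needs the whole sequence before any frame is final.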

1.3 Long Video Generation

  • From Short-video Generator to Long-video Generator:

    • LongLive, "LongLive: Real-time Interactive Long Video Generation". arXiv Website Code BibTeX
    • Rolling Forcing, "Rolling Forcing: Autoregressive Long Video Diffusion in Real Time". arXiv Website Code BibTeX
    • Self Forcing++, "Self-Forcing++: Towards Minute-Scale High-Quality Video Generation". arXiv Website Code BibTeX
    • Infinite Forcing, Code BibTeX
    • Infinity-RoPE, "Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout". arXiv Website Code BibTeX
    • Deep Forcing, "Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression". arXiv Website Code BibTeX
    • LoL, "LoL: Longer than Longer, Scaling Video Generation to Hour". arXiv BibTeX
    • FLEX, "Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation". arXiv Website Code BibTeX
    • LIVE, "LIVE: Long-horizon Interactive Video World Modeling". arXiv Website BibTeX
    • Rolling Sink, "Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion". arXiv Website Code BibTeX
    • MMM, "Mode Seeking meets Mean Seeking for Fast Long Video Generation". arXiv Website BibTeX
    • MemRoPE, "MemRoPE: Training-Free Infinite Video Generation via Evolving Memory Tokens". arXiv Website Code BibTeX
    • Anchor Forcing, "Anchor Forcing: Anchor Memory and Tri-Region RoPE for Interactive Streaming Video Diffusion". arXiv Code BibTeX
    • ShotStream, "ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling". arXiv Website Code BibTeX
    • DCARL, "DCARL: A Divide-and-Conquer Framework for Autoregressive Long-Trajectory Video Generation". arXiv Website BibTeX
    • PackForcing, "PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference". arXiv Code BibTeX
    • Grounded Forcing, "Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis". arXiv BibTeX
    • TempoMaster, "TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction". arXiv BibTeX
    • Hybrid Forcing, "Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation". arXiv BibTeX
  • Long-term Memory:

    • WORLDMEM, "WORLDMEM: Long-term Consistent World Simulation with Memory". arXiv Website Code BibTeX
    • VRAG, "Learning World Models for Interactive Video Generation". arXiv Website Code BibTeX
    • Context as Memory, "Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval". arXiv Website BibTeX
    • Memory Forcing, "Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft". arXiv Website BibTeX
    • MemFlow, "MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives". arXiv Website Code BibTeX
    • StableWorld, "StableWorld: Towards Stable and Consistent Long Interactive Video Generation". arXiv Website Code BibTeX
    • Infinite-World, "Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory". arXiv BibTeX
    • Context Forcing, "Context Forcing: Consistent Autoregressive Video Generation with Long Context". arXiv Website Code BibTeX
    • ViewRope, "Geometry-Aware Rotary Position Embedding for Consistent Video World Model". arXiv BibTeX
    • PERSIST, "Beyond Pixel Histories: World Models with Persistent 3D State". arXiv Website Code BibTeX
    • MosaicMem, "MosaicMem: Hybrid Spatial Memory for Controllable Video World Models". arXiv Website BibTeX
    • HyDRA, "Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models". arXiv Code BibTeX

2. Application

Current through 26.03. Pull requests are welcome!

2.1 Open-source AR Video Foundation Models

  • Pyramid Flow: "Pyramidal Flow Matching for Efficient Video Generative Modeling". arXiv Website Code BibTeX
  • SkyReels, "SkyReels-V2: Infinite-length Film Generative Model". arXiv Website Code BibTeX
  • MAGI-1, "MAGI-1: Autoregressive Video Generation at Scale". arXiv Website Code BibTeX
  • Helios, "Helios: Real Real-Time Long Video Generation Model". arXiv Website Code BibTeX

2.2 Interactive Video Action World Model

  • Genie3. Website BibTeX

  • Yan, "Yan: Foundational Interactive Video Generation". arXiv Website BibTeX

  • Matrix-Game 3.0, "Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory". arXiv Website Code BibTeX
  • Matrix-Game 2.0, "Matrix-Game 2.0: An Open-Source Real-Time and Streaming Interactive World Model". arXiv Website Code BibTeX

  • PAN, "PAN: A World Model for General, Interactable, and Long-Horizon World Simulation". arXiv Website BibTeX

  • RELIC, "RELIC: Interactive Video World Model with Long-Horizon Memory". arXiv Website BibTeX

  • Astra, "Astra: General Interactive World Model with Autoregressive Denoising". arXiv Code BibTeX

  • HY-WorldPlay, "WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling". arXiv Website Code BibTeX

  • Yume 1.5, "Yume-1.5: A Text-Controlled Interactive World Generation Model". arXiv Website Code BibTeX

  • LingBot-World, "Advancing Open-source World Models". arXiv Website Code BibTeX

  • Olaf-World, "Olaf-World: Orienting Latent Actions for Video World Modeling". arXiv Website Code BibTeX

  • Solaris, "Solaris: Building a Multiplayer Video World Model in Minecraft". arXiv Website Code BibTeX

  • MagicWorld, "MagicWorld: Towards Long-Horizon Stability for Interactive Video World Exploration". arXiv Website Code BibTeX

  • OmniRoam, "OmniRoam: World Wandering via Long-Horizon Panoramic Video Generation". arXiv Website Code BibTeX

  • INSPATIO-WORLD, "INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling". arXiv Website Code BibTeX

  • ActionParty, "ActionParty: Multi-Subject Action Binding in Generative Video Games". arXiv Website BibTeX

  • MultiWorld, "MultiWorld: Scalable Multi-Agent Multi-View Video World Models". arXiv Website BibTeX

2.3 Real-time Interactive Avatar & Motion & Physical & Audio Control

  • MotionStream, "MotionStream: Real-Time Video Generation with Interactive Motion Controls". arXiv Website Code BibTeX
  • RealVideo, Website Code
  • LiveAvatar, "Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length". arXiv Website Code BibTeX
  • SoulX-FlashTalk, "SoulX-FlashTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation". arXiv Website Code BibTeX
  • LiveTalk, "LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation". arXiv Code BibTeX
  • Avatar Forcing, "Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation". arXiv Website Code BibTeX
  • Geometry-as-context, "Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context". arXiv BibTeX
  • DragStream, "Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!". arXiv Website Code BibTeX

  • RealWonder, "RealWonder: Real-Time Physical Action-Conditioned Video Generation". arXiv Website Code BibTeX

  • WMReward, "Inference-time Physics Alignment of Video Generative Models with Latent World Models". arXiv Code BibTeX

  • SoulX-LiveAct, "SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory". arXiv Website Code BibTeX

  • OmniForcing, "OmniForcing: Unleashing Real-time Joint Audio-Visual Generation". arXiv Website Code BibTeX

  • PRISM, "PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition". arXiv Code BibTeX

  • DiT as Real-Time Rerenderer, "DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer". arXiv BibTeX

2.4 Egocentric Interaction

This category focuses on first-person (egocentric) video generation, emphasizing hand-object interaction for VR.

  • Hand2World, "Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures". arXiv Website BibTeX
  • Generated Reality, "Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control". arXiv Website BibTeX

2.5 Embodied AI / Autonomous Driving

  • Vidarc, "Vidarc: Embodied Video Diffusion Model for Closed-loop Control". arXiv BibTeX

  • DreamDojo, "DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos". arXiv Website Code BibTeX

  • DreamZero, "World Action Models are Zero-shot Policies". arXiv Website Code BibTeX

  • FAR-Drive, "FAR-Drive: Frame-AutoRegressive Video Generation in Closed-Loop Autonomous Driving". arXiv BibTeX

  • ABot-PhysWorld, "ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment". arXiv Code BibTeX

3. Infrastructure

3.1 Sparse Attention

  • Dummy Forcing, "Efficient Autoregressive Video Diffusion with Dummy Head". arXiv Website Code BibTeX
  • Light Forcing, "Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention". arXiv Code BibTeX
  • Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention, "Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention". arXiv Website BibTeX
  • TokenTrim, "TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation". arXiv Website BibTeX
  • PaFu-KV, "Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion". arXiv BibTeX
  • MonarchRT, "MonarchRT: Efficient Attention for Real-Time Video Generation". arXiv BibTeX
  • SCD, "Causality in Video Diffusers is Separable from Denoising". arXiv BibTeX
  • SVOO, "Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering". arXiv BibTeX

3.2 Caching

  • FlowCache, "Flow caching for autoregressive video generation". arXiv Code BibTeX
  • WorldCache, "WorldCache: Content-Aware Caching for Accelerated Video World Models". arXiv Code BibTeX

3.3 Quantized Attention

  • Quant VideoGen, "Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization". arXiv BibTeX
  • "KV Cache Quantization for Self-Forcing Video Generation: A 33-Method Empirical Study". arXiv BibTeX

Contributing

We have not yet compiled an exhaustive list of all related work; we apologize for any omissions and welcome pull requests to merge them in. We also welcome high-level categorization and synthesis.

Acknowledgment

The format of this list follows that of Awesome-World-Models.
