daVinci-MagiHuman

Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

SII-GAIR & Sand.ai


✨ Highlights

  • 🧠 Single-Stream Transformer: A unified 15B-parameter, 40-layer Transformer that jointly processes text, video, and audio via self-attention only. No cross-attention, no multi-stream complexity.
  • 🎭 Exceptional Human-Centric Quality: Expressive facial performance, natural speech-expression coordination, realistic body motion, and accurate audio-video synchronization.
  • 🌍 Multilingual: Supports Chinese (Mandarin & Cantonese), English, Japanese, Korean, German, and French.
  • ⚡ Blazing Fast Inference: Generates a 5-second 256p video in 2 seconds and a 5-second 1080p video in 38 seconds on a single H100 GPU.
  • 🏆 State-of-the-Art Results: Achieves 80.0% win rate vs Ovi 1.1 and 60.9% vs LTX 2.3 in pairwise human evaluation over 2,000 comparisons.
  • 📦 Fully Open Source: We release the complete model stack: base model, distilled model, super-resolution model, and inference code.

🎬 Demo

video_1.mp4
video_2.mp4
video_3.mp4
video_4.mp4
video_5.MP4
video_6.mp4
video_7.mp4

πŸ—οΈ Architecture

daVinci-MagiHuman uses a single-stream Transformer that takes text tokens, a reference image latent, and noisy video and audio tokens as input, and jointly denoises the video and audio within a unified token sequence.
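To make the "unified token sequence" concrete, here is a minimal sketch of how the four inputs could be laid out end to end in one sequence. The segment order, the token counts, and the names `Segment` and `build_sequence_layout` are illustrative assumptions, not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str       # modality label
    start: int      # first token index (inclusive)
    stop: int       # last token index (exclusive)
    denoised: bool  # True if the model denoises this segment


def build_sequence_layout(n_text, n_ref, n_video, n_audio):
    """Lay out all inputs back to back in a single token sequence.

    The order (text, reference image, video, audio) is an assumption
    made for illustration only.
    """
    layout = []
    cursor = 0
    for name, length, denoised in (
        ("text", n_text, False),            # conditioning, kept clean
        ("reference_image", n_ref, False),  # conditioning, kept clean
        ("video", n_video, True),           # noisy tokens, jointly denoised
        ("audio", n_audio, True),           # noisy tokens, jointly denoised
    ):
        layout.append(Segment(name, cursor, cursor + length, denoised))
        cursor += length
    return layout


# Hypothetical token counts, chosen only to show the bookkeeping.
layout = build_sequence_layout(n_text=64, n_ref=256, n_video=4096, n_audio=125)
total_tokens = layout[-1].stop
denoised_names = [s.name for s in layout if s.denoised]
```

Because every segment lives in the same sequence, plain self-attention lets each video or audio token attend to the text and reference tokens, which is why no cross-attention module is needed.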

Key design choices:

| Component | Description |
| --- | --- |
| 🥪 Sandwich Architecture | First and last 4 layers use modality-specific projections; middle 32 layers share parameters across modalities |
| 🕐 Timestep-Free Denoising | No explicit timestep embeddings; the model infers the denoising state directly from input latents |
| 🔀 Per-Head Gating | Learned scalar gates with sigmoid activation on each attention head for training stability |
| 🔗 Unified Conditioning | Denoising and reference signals handled through a minimal unified interface, with no dedicated conditioning branches |
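Two of these choices are easy to sketch in plain Python. The layer counts (40 total, 4 at each end) come from this README; where exactly the gate is applied is an assumption here (each head's output is scaled by the sigmoid of one learned scalar), and the helper names are hypothetical.

```python
import math

NUM_LAYERS = 40  # from the README
EDGE_LAYERS = 4  # first and last 4 layers are modality-specific


def layer_kind(index, num_layers=NUM_LAYERS, edge=EDGE_LAYERS):
    """Classify a layer as modality-specific (edges) or shared (middle)."""
    if not 0 <= index < num_layers:
        raise ValueError(f"layer index {index} out of range")
    if index < edge or index >= num_layers - edge:
        return "modality_specific"
    return "shared"


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gate_heads(head_outputs, gate_logits):
    """Scale each head's output vector by sigmoid(gate_logit).

    Assumption: one learned scalar per head, applied to the head output.
    """
    if len(head_outputs) != len(gate_logits):
        raise ValueError("need exactly one gate per head")
    return [
        [sigmoid(g) * value for value in head]
        for head, g in zip(head_outputs, gate_logits)
    ]


kinds = [layer_kind(i) for i in range(NUM_LAYERS)]
shared_count = kinds.count("shared")  # 40 - 2 * 4 = 32 shared layers

# A gate logit of 0 halves a head's output; a large logit leaves it unchanged.
gated = gate_heads([[1.0, -2.0], [4.0, 0.5]], gate_logits=[0.0, 100.0])
```

In this sketch a gate can smoothly scale a head's contribution anywhere between 0 and 1, which is one plausible reading of why gating would help training stability.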

📊 Performance

Quantitative Quality Benchmark

| Model | Visual Quality ↑ | Text Alignment ↑ | Physical Consistency ↑ | WER ↓ |
| --- | --- | --- | --- | --- |
| Ovi 1.1 | 4.73 | 4.10 | 4.41 | 40.45% |
| LTX 2.3 | 4.76 | 4.12 | 4.56 | 19.23% |
| daVinci-MagiHuman | 4.80 | 4.18 | 4.52 | 14.60% |

Human Evaluation (2,000 Pairwise Comparisons)

| Matchup | daVinci-MagiHuman Win | Tie | Opponent Win |
| --- | --- | --- | --- |
| vs Ovi 1.1 | 80.0% | 8.2% | 11.8% |
| vs LTX 2.3 | 60.9% | 17.2% | 21.9% |
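As a reading aid, the short calculation below checks that each row sums to 100% and derives a win rate over decisive comparisons only (ties excluded). That derived figure is not reported in this README; it is simple arithmetic on the numbers above, and `decisive_win_rate` is a hypothetical helper name.

```python
# Percentages copied from the human-evaluation table above.
results = {
    "Ovi 1.1": {"win": 80.0, "tie": 8.2, "loss": 11.8},
    "LTX 2.3": {"win": 60.9, "tie": 17.2, "loss": 21.9},
}


def decisive_win_rate(win, loss):
    """Win rate in percent over comparisons that did not end in a tie."""
    return 100.0 * win / (win + loss)


rates = {}
for opponent, row in results.items():
    # Each row should account for all comparisons.
    if abs(row["win"] + row["tie"] + row["loss"] - 100.0) > 1e-6:
        raise ValueError(f"row for {opponent} does not sum to 100%")
    rates[opponent] = decisive_win_rate(row["win"], row["loss"])
    print(f"vs {opponent}: {rates[opponent]:.1f}% of decisive comparisons won")

# vs Ovi 1.1: 87.1% of decisive comparisons won
# vs LTX 2.3: 73.6% of decisive comparisons won
```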

Inference Speed (5-second video)

| Resolution | Base (s) | Super-Res (s) | Decode (s) | Total (s) |
| --- | --- | --- | --- | --- |
| 256p | 1.6 | n/a | 0.4 | 2.0 |
| 540p | 1.6 | 5.1 | 1.3 | 8.0 |
| 1080p | 1.6 | 31.0 | 5.8 | 38.4 |
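The totals are the sum of the three stages. The sketch below re-derives them and adds two figures that are not in the table: the real-time factor (seconds of compute per second of video) and the share of time spent in super-resolution. Both are plain arithmetic on the numbers above.

```python
VIDEO_SECONDS = 5.0  # every row is for a 5-second clip

# Stage timings in seconds, copied from the table (no super-res at 256p).
timings = {
    "256p": {"base": 1.6, "super_res": 0.0, "decode": 0.4, "total": 2.0},
    "540p": {"base": 1.6, "super_res": 5.1, "decode": 1.3, "total": 8.0},
    "1080p": {"base": 1.6, "super_res": 31.0, "decode": 5.8, "total": 38.4},
}

summary = {}
for resolution, t in timings.items():
    stage_sum = t["base"] + t["super_res"] + t["decode"]
    # The reported total should match the stage sum up to rounding.
    if abs(stage_sum - t["total"]) > 0.05:
        raise ValueError(f"{resolution}: stages sum to {stage_sum:.1f}s")
    summary[resolution] = {
        "real_time_factor": t["total"] / VIDEO_SECONDS,
        "super_res_share": t["super_res"] / t["total"],
    }

for resolution, s in summary.items():
    print(
        f"{resolution}: {s['real_time_factor']:.2f}x real time, "
        f"{100 * s['super_res_share']:.1f}% of time in super-resolution"
    )
```

Read this way, 256p generation runs faster than real time (0.40x), while at 1080p roughly four fifths of the total time goes to the super-resolution stage.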

🚀 Efficient Inference Techniques

  • ⚡ Latent-Space Super-Resolution: A two-stage pipeline that generates at low resolution, then refines in latent space (not pixel space), avoiding an extra VAE decode-encode round trip.
  • 🔄 Turbo VAE Decoder: A lightweight re-trained decoder that substantially reduces decoding overhead.
  • 🔧 Full-Graph Compilation: MagiCompiler fuses operators across Transformer layers for ~1.2x speedup.
  • 💨 Distillation: DMD-2 distillation enables generation with only 8 denoising steps (no CFG), without sacrificing quality.
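A rough cost count shows where two of these savings come from. The sketch assumes standard classifier-free guidance (one conditional and one unconditional forward pass per step) and assumes the pixel-space alternative would decode, re-encode, and decode again. The base model's step count is not stated in this README, so `HYPOTHETICAL_BASE_STEPS` is a placeholder, not a real setting.

```python
def transformer_passes(steps, use_cfg):
    """Forward passes through the Transformer for one generation.

    Assumption: CFG evaluates a conditional and an unconditional branch
    at every denoising step, so it doubles the number of passes.
    """
    if steps <= 0:
        raise ValueError("steps must be positive")
    return steps * (2 if use_cfg else 1)


def vae_passes(sr_in_latent_space):
    """VAE passes needed to get final pixels after super-resolution.

    Latent-space SR: a single final decode.
    Pixel-space SR (assumed alternative): decode, re-encode, decode again.
    """
    return 1 if sr_in_latent_space else 3


DISTILLED_STEPS = 8           # from the README
HYPOTHETICAL_BASE_STEPS = 32  # placeholder; the README gives no base step count

distilled_cost = transformer_passes(DISTILLED_STEPS, use_cfg=False)
base_cost = transformer_passes(HYPOTHETICAL_BASE_STEPS, use_cfg=True)
saved_vae_passes = vae_passes(False) - vae_passes(True)
```

Under these assumptions the distilled model needs 8 Transformer passes per clip, and latent-space super-resolution saves two VAE passes, which is the decode-encode round trip mentioned above.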

📦 Getting Started

Option 1: Docker (Recommended)

# Pull the MagiCompiler Docker image
docker pull sandai/magi-compiler:latest

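# Launch the container with your local model directory mounted at /models
# (the remaining commands run inside the container)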
docker run -it --gpus all -v /path/to/models:/models sandai/magi-compiler:latest bash

# Install MagiCompiler
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler
pip install -r requirements.txt
pip install .
cd ..

# Clone daVinci-MagiHuman
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman

Option 2: Conda

# Create environment
conda create -n davinci python=3.12
conda activate davinci

# Install PyTorch
pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0

# Install Flash Attention (Hopper)
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/hopper && python setup.py install && cd ../..

# Install MagiCompiler
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler
pip install -r requirements.txt
pip install .
cd ..

# Clone and install daVinci-MagiHuman
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman
pip install -r requirements.txt
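After installing, it can help to confirm that the environment matches the pinned PyTorch versions above. The following is a small optional check, not part of the repository; `check_pins` is a hypothetical helper built on the standard library's `importlib.metadata`.

```python
from importlib import metadata

# Versions pinned in the Conda instructions above.
PINS = {"torch": "2.9.0", "torchvision": "0.24.0", "torchaudio": "2.9.0"}


def check_pins(pins, get_version=metadata.version):
    """Return {package: problem} for every pin that is not satisfied.

    Local build tags such as "+cu128" are ignored when comparing.
    """
    problems = {}
    for package, wanted in pins.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            problems[package] = "not installed"
            continue
        if found.split("+")[0] != wanted:
            problems[package] = f"found {found}, expected {wanted}"
    return problems


if __name__ == "__main__":
    issues = check_pins(PINS)
    for package, issue in issues.items():
        print(f"{package}: {issue}")
    if not issues:
        print("All pinned packages match.")
```

Run it inside the activated `davinci` environment; an empty result means all three packages match their pins.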

Download Model Checkpoints

Download the complete model stack from Hugging Face, then update the checkpoint paths in the config files under example/.

🎯 Usage

Before running, update the checkpoint paths in the config files (example/*/config.json) to point to your local model directory.
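If several configs share the same path prefix, the edit can be scripted. The sketch below makes no assumption about key names in the config files: it walks each `example/*/config.json` and rewrites any string value that starts with an old prefix. The function names and the prefixes in the usage comment are hypothetical; inspect the configs first to find the prefix they actually use.

```python
import json
from pathlib import Path


def replace_prefix(node, old_prefix, new_prefix):
    """Recursively swap old_prefix for new_prefix in every string value."""
    if isinstance(node, dict):
        return {k: replace_prefix(v, old_prefix, new_prefix) for k, v in node.items()}
    if isinstance(node, list):
        return [replace_prefix(v, old_prefix, new_prefix) for v in node]
    if isinstance(node, str) and node.startswith(old_prefix):
        return new_prefix + node[len(old_prefix):]
    return node  # numbers, booleans, null, and unrelated strings stay as-is


def update_configs(example_dir, old_prefix, new_prefix):
    """Rewrite example/*/config.json in place; return the files that changed."""
    changed = []
    for config_path in sorted(Path(example_dir).glob("*/config.json")):
        original = json.loads(config_path.read_text(encoding="utf-8"))
        updated = replace_prefix(original, old_prefix, new_prefix)
        if updated != original:
            config_path.write_text(
                json.dumps(updated, indent=2) + "\n", encoding="utf-8"
            )
            changed.append(config_path)
    return changed


# Example call (both prefixes are placeholders):
# update_configs("example", "/old/checkpoint/root", "/models")
```

The script rewrites files in place and re-serializes the JSON with two-space indentation, so commit or back up the configs before running it.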

Base Model (256p)

bash example/base/run.sh

Distilled Model (256p, 8 steps, no CFG)

bash example/distill/run.sh

Super-Resolution to 540p

bash example/sr_540p/run.sh

Super-Resolution to 1080p

bash example/sr_1080p/run.sh

πŸ™ Acknowledgements

We thank the open-source community, and in particular Wan2.2 and Turbo-VAED, for their valuable contributions.

📄 License

This project is released under the Apache License 2.0.

📖 Citation

@misc{davinci-magihuman-2026,
  title   = {Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model},
  author  = {SII-GAIR and Sand.ai},
  year    = {2026},
  url     = {https://github.com/GAIR-NLP/daVinci-MagiHuman}
}
