Official implementation of "Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation"
Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation
Yingjie Chen, Shilun Lin, Cai Xing, Qixin Yan, Wenjing Wang, Dingming Liu, Hao Liu, Chen Li, Jing LYU
Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, an openly accessible framework enabling fine-grained control over facial appearance and voice timbre across multiple identities remains unavailable. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation, enabling high-fidelity and consistent personalization. Specifically, we introduce a data curation pipeline that automatically extracts identity-bearing information with paired annotations across audio and visual modalities, covering diverse scenarios from single-subject to multi-subject interactions. We further propose a flexible and scalable identity injection mechanism for single- and multi-subject scenarios, in which both facial appearance and vocal timbre act as identity-bearing control signals. Moreover, in light of modality disparity, we design a multi-stage training strategy to accelerate convergence and enforce cross-modal coherence. Experiments demonstrate the superiority of the proposed framework.
- (2026-03-18) The project page, demo video and technical report are released.
- Release inference code and model weights for single-subject scenarios
- Release inference code and model weights for multi-subject scenarios
$ pip install -r requirements.txt
Please download the following pretrained models and place them in the ckpts directory: MMAudio, Wan2.2-TI2V-5B, Identity-as-Presence
After downloading, ensure all model files are placed in the ckpts directory and properly configured.
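Before running inference, it can help to confirm that the downloads landed where expected. The following is a minimal sketch, not part of the released code: it assumes each model sits in a folder under ckpts named exactly as in the model list above, which may not match the layout that infer.sh actually expects. The helper name `missing_checkpoints` is hypothetical.

```python
from pathlib import Path

# Assumed folder names, copied from the model list in this README.
# The layout actually expected by infer.sh may differ.
EXPECTED_MODELS = ("MMAudio", "Wan2.2-TI2V-5B", "Identity-as-Presence")


def missing_checkpoints(ckpt_dir="ckpts", expected=EXPECTED_MODELS):
    """Return the expected model names that have no entry in ckpt_dir."""
    root = Path(ckpt_dir)
    if not root.is_dir():
        # The directory itself is absent, so every model is missing.
        return list(expected)
    present = {entry.name for entry in root.iterdir()}
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing from ckpts:", ", ".join(missing))
    else:
        print("All expected model folders found in ckpts.")
```

Run from the repository root, the script prints any model names it cannot find. It only checks that the names exist, not that the weights inside are complete.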
$ bash infer.sh
The results will be saved in the results directory.
For more details, please refer to our project page.
If you find this code useful for your research, please use the following BibTeX entry.
@inproceedings{chen2026identity,
title={Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation},
author={Chen, Yingjie and Lin, Shilun and Xing, Cai and Binxin, Yang and Long, Zhou and Yan, Qixin and Wang, Wenjing and Liu, Dingming and Liu, Hao and Li, Chen and LYU, Jing},
journal={arXiv preprint arXiv:2603.17889},
website={https://chen-yingjie.github.io/projects/Identity-as-Presence/index.html},
year={2026}}
We would like to thank the contributors to various open-source projects for their research and exploration.








