This repo is the official implementation of "Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning" accepted by AAAI 2025.
Updated Apr 14, 2026 - Python
A transformer-based system that generates time-synchronized captions, speaker-attributed transcripts, and abstractive summaries from videos by integrating audio and visual modalities. It leverages CLIP and Whisper embeddings with cross-attention fusion and T5-based generation to produce accurate, context-aware outputs.
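The cross-attention fusion step mentioned above can be illustrated with a minimal sketch. This is not the repository's code: it assumes single-head scaled dot-product attention with no learned projections, and the `audio` and `visual` lists are toy stand-ins for Whisper and CLIP embeddings. The function names `softmax` and `cross_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries: d-dim vectors from one modality (here, audio frames).
    keys, values: vectors from the other modality (here, video frames).
    Returns one fused vector per query, plus the attention weights.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    out_dim = len(values[0])
    fused, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        fused.append([sum(wj * v[i] for wj, v in zip(w, values))
                      for i in range(out_dim)])
        weights.append(w)
    return fused, weights


# Toy embeddings (assumed values, for illustration only).
audio = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
visual = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
fused, weights = cross_attention(audio, visual, visual)
```

Each audio frame yields one fused vector that mixes the visual frames in proportion to their similarity to it. A real system would typically add learned query, key, and value projections and multiple heads before passing the fused sequence to a generator such as T5.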