Multi-modal Analysis for Video Extraction, Retrieval, and Indexing with Knowledge
MAVERIK is an end-to-end open-source framework for processing, indexing, and intelligently querying long-form videos using multiple modalities:
- Audio → Whisper (English Speech-to-Text), AI4Bharat Indic Conformer Model (Hindi Speech-to-Text) + Gemini 2.0 Flash (Translation)
- Visual objects → YOLOv11m + ByteTrack + BLIP
- On-screen text → DeepSeek OCR
- Visual captioning → Gemini 2.0 Flash
- Dense multi-modal vector database using ChromaDB
- Agentic query engine (LangChain + Gemini) that plans, searches, evaluates, and refines until high-quality results are obtained
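
The plan, search, evaluate, refine loop described in the last item can be pictured roughly as follows. This is a minimal sketch and not MAVERIK's actual implementation: `agentic_search`, `Hit`, and the four callables are hypothetical names introduced for illustration, and the threshold and round limit are assumed values. In the real system these steps would presumably be backed by LangChain, Gemini, and ChromaDB.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hit:
    """One retrieved video segment (hypothetical shape)."""
    start_s: float  # offset into the video, in seconds
    text: str       # transcript, caption, or OCR text for the segment
    score: float    # similarity score reported by the vector search


def agentic_search(
    question: str,
    plan: Callable[[str], str],
    search: Callable[[str], List[Hit]],
    evaluate: Callable[[str, List[Hit]], float],
    refine: Callable[[str, List[Hit]], str],
    threshold: float = 0.8,  # assumed quality cut-off
    max_rounds: int = 3,     # assumed retry budget
) -> List[Hit]:
    """Plan a query, search, score the hits, and refine until good enough."""
    query = plan(question)
    best_hits: List[Hit] = []
    best_quality = float("-inf")
    for _ in range(max_rounds):
        hits = search(query)
        quality = evaluate(question, hits)
        # Remember the best round in case no round reaches the threshold.
        if quality > best_quality:
            best_hits, best_quality = hits, quality
        if quality >= threshold:
            break
        # Rewrite the query using what the last search returned.
        query = refine(query, hits)
    return best_hits
```

Passing the four steps in as callables keeps the loop independent of any particular LLM or vector store, so each step can be swapped or stubbed out in tests.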
- Initialize and activate Conda environment with Python 3.12.8

      conda create --name maverick python=3.12.8
      conda activate maverick

- Install dependencies:

      pip install -r requirements.txt

- Create `./data` directory for the Django database
- Run database migrations:

      python manage.py migrate

- Create `secrets.json` for Gemini and HF API keys

      {
        "gemini_api_key": "AI__KEY__",
        "huggingface_api_key": "hf__KEY__"
      }

- Install `ffmpeg`

      # (On Ubuntu / Debian-based)
      sudo apt install ffmpeg

- Start both services:

      ./run_ui.sh

- Open in browser:
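
Before starting the services, the prerequisites from the setup steps can be checked with a short script. The following is a sketch under stated assumptions, not part of the MAVERIK codebase: `preflight` is a hypothetical helper, and it assumes it is run from the repository root, where `secrets.json` and the `data` directory are expected to live.

```python
import json
import shutil
from pathlib import Path
from typing import List

# Key names taken from the secrets.json example in the setup steps.
REQUIRED_KEYS = ("gemini_api_key", "huggingface_api_key")


def preflight(secrets_path: str = "secrets.json", data_dir: str = "data") -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems: List[str] = []

    # ffmpeg must be reachable on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")

    # The directory that holds the Django database must exist.
    if not Path(data_dir).is_dir():
        problems.append(f"missing directory: {data_dir}")

    # secrets.json must exist, parse as a JSON object, and carry both keys.
    path = Path(secrets_path)
    if not path.is_file():
        problems.append(f"missing file: {secrets_path}")
        return problems
    try:
        secrets = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        problems.append(f"{secrets_path} is not valid JSON: {exc}")
        return problems
    if not isinstance(secrets, dict):
        problems.append(f"{secrets_path} must contain a JSON object")
        return problems
    for key in REQUIRED_KEYS:
        if not secrets.get(key):
            problems.append(f"{secrets_path} lacks a value for {key!r}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("PROBLEM:", problem)
```

The check only confirms that the keys are present and non-empty; it does not verify that they are valid with Gemini or Hugging Face.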