Turn YouTube videos into queryable knowledge sessions — extract transcripts, key frames, or presentation slides, then ask anything about them with your own files as context.
Works as a standalone CLI and as an MCP server compatible with any MCP client (Claude Code, Cursor, Windsurf, Continue, custom agents, etc.).
You can — and for some videos that's fine. video-analyzer starts there (transcript-only mode is free and instant) but goes further when you need it:
- Visual context matters. Transcripts say "look at this chart" or "notice the pattern here" — without the actual frame, an LLM is guessing. video-analyzer extracts the exact frames being referenced and describes them with a vision model, so nothing is lost.
- Slide extraction. Pull out the key completed visuals (diagrams, code, summaries) as standalone images you can reference or share — something a transcript alone can never give you.
- Persistent sessions. Process a video once, query it forever. No re-pasting, no token waste, no hitting context limits on long videos.
- Context injection. Ask questions with your own project files injected alongside the video knowledge — "implement what this video describes, using my existing interfaces." A raw transcript paste can't do that cleanly.
- Source transparency. Every answer tells you whether it's based on transcript alone or transcript + visual analysis, so you know what you're getting.
The tool gives you a spectrum: start with transcript-only (fast, free), upgrade to visual analysis when the content demands it.
- Extracts a video's transcript, key frames, or presentation slides — choose the level of analysis you need
- Describes each frame with a vision-capable LLM, cross-referenced against the transcript
- Persists a session at `~/.video-analyzer/{video_id}/` — runs once, cached forever
- Answers any free-form question about the video, optionally with your own files as context
video-analyzer/
├── main.py CLI — extract / slides / ask / sessions subcommands
├── server.py MCP server — 5 tools for any MCP client
├── analyzer.py Vision LLM describes frames in batches
├── asker.py LLM answers questions with session + context
├── downloader.py yt-dlp video download + youtube-transcript-api fetch
├── frame_extractor.py ffmpeg scene-change detection + targeted extraction
├── transcript_selector.py Transcript analysis for frame/slide selection
├── session.py Session persistence at ~/.video-analyzer/
├── context.py Universal context loader (files, dirs, URLs, stdin)
├── config.py Model names, thresholds, batch sizes
├── requirements.txt
├── .env.example
└── .gitignore
Session storage (global, outside the repo):
~/.video-analyzer/
└── {video_id}/
├── session.json # metadata + transcript + frame descriptions
├── frames/ # extracted PNG frames (from extract)
├── slides/ # presentation-quality frames (from slides)
└── video.* # downloaded video file
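Because a session is just files on disk, you can inspect the cache directly. A minimal sketch that assumes only the layout above (it does not rely on any particular fields inside session.json):

```python
import json
from pathlib import Path

sessions_dir = Path.home() / ".video-analyzer"

# Walk every cached session and summarise what it contains.
for session_dir in sorted(sessions_dir.glob("*")):
    session_file = session_dir / "session.json"
    if not session_file.is_file():
        continue
    data = json.loads(session_file.read_text())          # top-level keys only
    frames = list((session_dir / "frames").glob("*.png"))
    slides = list((session_dir / "slides").glob("*.png"))
    print(f"{session_dir.name}: keys={sorted(data)}, "
          f"{len(frames)} frames, {len(slides)} slides")
```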
- Python 3.11+
- ffmpeg

Linux/WSL:
sudo apt update && sudo apt install ffmpeg

macOS:
brew install ffmpeg

git clone git@github.com:ashmitb95/video-analyzer-llm.git
cd video-analyzer-llm
python3 -m venv venv
source venv/bin/activate # Windows WSL: same command
pip install -r requirements.txt
cp .env.example .env
# Edit .env — add your Anthropic API key:
# ANTHROPIC_API_KEY=sk-ant-...

Two modes, from lightest to heaviest:
source venv/bin/activate
# Transcript only — fast, free, no video download
python main.py extract "https://youtu.be/dQw4w9WgXcQ" --transcript-only
# Full extraction — downloads video, extracts frames, describes with Vision
python main.py extract "https://youtu.be/dQw4w9WgXcQ"

Options:
--transcript-only Fetch transcript only — no video download or frame analysis
--threshold 0.1 Scene change sensitivity 0–1 (default 0.1, lower = more frames)
--interval 3.0 Min seconds between frames (default 3.0)
--max-frames 25 Max frames from transcript analysis (default 25)
--force Re-extract even if session already exists
--resume Resume from last completed step
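For a sense of what `--threshold` maps to: the repo layout above notes that frame_extractor.py uses ffmpeg scene-change detection. The standard ffmpeg recipe looks roughly like the sketch below (illustrative only; the project's actual filter chain and file naming may differ, and `video.mp4` is a placeholder for the downloaded file):

```python
import subprocess
from pathlib import Path

threshold = 0.1           # same meaning as --threshold: lower keeps more frames
out_dir = Path("frames")  # placeholder output directory
out_dir.mkdir(exist_ok=True)

# Classic ffmpeg scene-change detection: keep only frames whose scene score
# exceeds the threshold, and write each surviving frame as a numbered PNG.
subprocess.run(
    [
        "ffmpeg", "-i", "video.mp4",
        "-vf", f"select='gt(scene,{threshold})'",
        "-vsync", "vfr",
        str(out_dir / "frame_%04d.png"),
    ],
    check=True,
)
```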
python main.py slides "https://youtu.be/dQw4w9WgXcQ"

Identifies complete diagrams, charts, code snippets, and summaries that would work as standalone slides. Extracts them into ~/.video-analyzer/{id}/slides/.
Options:
--max-slides 15 Max slides to extract (default 15)
--force Re-extract even if slides already exist
python main.py ask <session_id> "What are the main concepts covered?"
# Inject your own project files as context:
python main.py ask <session_id> "Implement the pattern from the video" \
--context ./src/app.py \
--context ./src/utils/
# Pipe in notes via stdin:
python main.py ask <session_id> "Compare with my notes" --stdin < notes.md

`--context` accepts: file paths, directory paths, HTTP/HTTPS URLs, or raw text strings. Repeatable.
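context.py is listed above as a universal context loader (files, dirs, URLs; stdin is handled separately by the `--stdin` flag). Conceptually, each `--context` argument resolves to plain text along these lines (a sketch of the idea, not the project's actual implementation):

```python
from pathlib import Path
from urllib.request import urlopen

def load_context(arg: str) -> str:
    """Resolve one --context value to text: URL, directory, file, or raw string."""
    if arg.startswith(("http://", "https://")):
        with urlopen(arg) as resp:              # fetch remote context
            return resp.read().decode("utf-8", errors="replace")
    path = Path(arg)
    if path.is_dir():                           # concatenate files under a directory
        return "\n\n".join(
            p.read_text(errors="replace")
            for p in sorted(path.rglob("*")) if p.is_file()
        )
    if path.is_file():                          # single file
        return path.read_text(errors="replace")
    return arg                                  # otherwise treat it as raw text
```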
python main.py sessions

The MCP server exposes video-analyzer as a set of tools that any MCP-compatible client can call. The server handles video knowledge; the client provides conversation context.
Works with: Claude Code, Cursor, Windsurf, Continue, custom MCP agents, or any client that speaks the Model Context Protocol.
| Tool | Description |
|---|---|
| `extract_transcript(url)` | Fetch transcript only — fast, free, no API cost. Default for most questions. |
| `extract_video(url)` | Full visual processing — download, extract frames, describe with Vision. Use when visual analysis is needed. |
| `extract_slides(url)` | Extract presentation-quality slide frames. Returns paths to PNGs on disk. |
| `get_session(session_id)` | Return session content with `analysis_source` metadata (transcript-only vs transcript+video). |
| `list_sessions()` | List all processed videos. |
The client picks the right tool based on context. For most questions extract_transcript is sufficient; extract_video when visual analysis is needed; extract_slides for screenshots or a deck.
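Beyond the clients listed above, the same tools can be driven from your own code. A minimal sketch, assuming the official MCP Python SDK (`pip install mcp`) and placeholder paths for your checkout:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Placeholder paths -- point these at your own venv and checkout.
    params = StdioServerParameters(
        command="/path/to/video-analyzer-llm/venv/bin/python",
        args=["/path/to/video-analyzer-llm/server.py"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Cheapest path first, as recommended above: transcript only.
            result = await session.call_tool(
                "extract_transcript",
                {"url": "https://youtu.be/dQw4w9WgXcQ"},
            )
            print(result.content)

asyncio.run(main())
```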
Add the server to your MCP client's config. Examples for common clients:
Claude Code (~/.claude.json):
"mcpServers": {
"video-analyzer": {
"command": "/path/to/video-analyzer-llm/venv/bin/python",
"args": ["/path/to/video-analyzer-llm/server.py"]
}
}

Cursor (.cursor/mcp.json in your project):
{
"mcpServers": {
"video-analyzer": {
"command": "/path/to/video-analyzer-llm/venv/bin/python",
"args": ["/path/to/video-analyzer-llm/server.py"]
}
}
}

WSL (MCP client running on Windows, server on Linux):
"mcpServers": {
"video-analyzer": {
"command": "wsl",
"args": [
"-d", "Ubuntu-24.04",
"/home/<you>/projects/video-analyzer-llm/venv/bin/python",
"/home/<you>/projects/video-analyzer-llm/server.py"
]
}
}

The server loads ANTHROPIC_API_KEY from the .env file automatically. Restart your client after updating the config.
Open your MCP client inside any project, then ask naturally:
"Summarise the key points from https://youtu.be/dQw4w9WgXcQ"
"Extract slides from https://youtu.be/dQw4w9WgXcQ and list what each one covers"
"Based on the architecture explained in https://youtu.be/dQw4w9WgXcQ,
refactor my src/api/router.py to follow that pattern"
The client will:
- Call `extract_transcript(url)` or `extract_video(url)` depending on what's needed — cached after the first run
- Call `get_session(id)` — loads transcript (+ frame descriptions if available), with `analysis_source` metadata indicating what data the answer is based on
- Read your project files for context
- Answer your question, citing whether it drew from transcript alone or transcript + visual analysis
| Setting | Default | Description |
|---|---|---|
| `SCENE_THRESHOLD` | 0.1 | Scene change sensitivity. Lower = more frames captured. |
| `MIN_FRAME_INTERVAL` | 3.0s | Minimum gap between frames. |
| `TRANSCRIPT_WINDOW` | 15.0s | Transcript context pulled around each frame (±15s). |
| `CLAUDE_MODEL` | claude-sonnet-4-6 | Model for frame descriptions (vision). |
| `SYNTHESIS_MODEL` | claude-opus-4-6 | Model for ask synthesis (8192 tokens). |
| `FRAME_SELECTION_MODEL` | claude-haiku-4-5 | Cheap text model for transcript-driven frame/slide selection. |
| `FRAME_SELECTION_MAX` | 25 | Max frames from transcript analysis. |
| `SLIDE_SELECTION_MAX` | 15 | Max slides to extract. |
| `SLIDE_SELECTION_MIN_INTERVAL` | 10.0s | Min gap between slides (wider than frames for diversity). |
| `MAX_FRAMES_PER_BATCH` | 8 | Frames per vision API call. |
| `IMAGE_MAX_WIDTH` | 1280px | Frames resized to this width before API call. |
Models are configurable in config.py. The defaults use Anthropic's Claude models, but you can swap in any provider supported by the Anthropic SDK or a compatible wrapper.
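config.py's exact contents aren't shown here, but given the setting names and defaults in the table, tuning usually means editing module-level values along these lines (a sketch, not the file verbatim):

```python
# config.py -- illustrative edits; names and defaults come from the table above.
CLAUDE_MODEL = "claude-sonnet-4-6"          # vision model for frame descriptions
SYNTHESIS_MODEL = "claude-opus-4-6"         # model used by `ask`
FRAME_SELECTION_MODEL = "claude-haiku-4-5"  # cheap text model for frame/slide selection

SCENE_THRESHOLD = 0.05      # below the 0.1 default: capture more frames
MIN_FRAME_INTERVAL = 5.0    # seconds; widen to thin out near-duplicate frames
FRAME_SELECTION_MAX = 40    # allow more frames for long, visual-heavy videos
```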
The ANTHROPIC_API_KEY is used by:
- `extract` (full mode) — vision model for frame descriptions
- `slides` — one cheap text model call for transcript analysis
- `ask` — LLM for answering questions
These commands make no API calls and need no key:
- `extract --transcript-only` — fetches from YouTube only
- `get_session` / `list_sessions` — reads from disk
- Sessions are machine-local (`~/.video-analyzer/`). Cloning the repo on a new machine means re-running `extract` once per video.
- The `output/` directory in the project root is legacy and ignored by git. All current session data lives at `~/.video-analyzer/`.
- `youtube-transcript-api` v1.2.4+ uses an instance-based API: `YouTubeTranscriptApi().fetch(video_id)`. If you see `AttributeError: get_transcript`, you're on an old version — `pip install -U youtube-transcript-api`.
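A quick way to confirm you're on the new instance-based API (the snippet fields below are the v1.x ones; older versions differ):

```python
from youtube_transcript_api import YouTubeTranscriptApi

# v1.2.4+ instance-based call, as described above.
transcript = YouTubeTranscriptApi().fetch("dQw4w9WgXcQ")

for snippet in transcript:
    print(f"[{snippet.start:7.1f}s] {snippet.text}")
```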