A queryable knowledge base for your video archive.
Turn a scattered video archive — across multiple SSDs and years — into a portable, plain-text knowledge base. Each clip gets a .description.md sidecar with GPS location + place name, a speaker-diarized multilingual transcript, an English translation (if needed), face detection, and an AI vision scene description with a keep/review/cull rating.
Sidecars live next to the videos. Originals are never modified. Local-first, non-destructive, resumable.
framedex is a Claude Code skill. It installs the vidx command-line tool.
# Clone into your Claude Code skills directory
git clone git@github.com:Simbastack-hq/framedex.git ~/.claude/skills/framedex
# Install deps + pre-download the Whisper + face-detection models
python3 ~/.claude/skills/framedex/scripts/setup.py# 1. Get a Hugging Face token + accept pyannote terms (one-time, for diarization)
# https://huggingface.co/pyannote/speaker-diarization-3.1 (click Agree)
# https://huggingface.co/pyannote/segmentation-3.0 (click Agree)
# https://huggingface.co/settings/tokens (create read token)
export HF_TOKEN=hf_yourTokenHere
# 2. (Optional) Set an Anthropic API key — only needed for --backend api
export ANTHROPIC_API_KEY=sk-ant-...
# 3. Add aliases to ~/.zshrc
alias vidx="python3 $HOME/.claude/skills/framedex/scripts/index_videos.py"
alias vidx-summary="python3 $HOME/.claude/skills/framedex/scripts/trip_summary.py"
alias vidx-master="python3 $HOME/.claude/skills/framedex/scripts/master_index.py"
alias vidx-query="python3 $HOME/.claude/skills/framedex/scripts/query.py"
# 4. Test on 5 clips before unleashing on a full drive
vidx /Volumes/SSD-2024 --max-files 5
# 5. Inspect the sidecars. If happy, run the full drive.
vidx /Volumes/SSD-2024
# 6. After indexing, generate folder summaries + a master index
vidx-summary /Volumes/SSD-2024
vidx-master /Volumes/SSD-2024ffprobe→ metadata (duration, codec, resolution, creation date)exiftool→ GPS lat/lon/altitude- Nominatim → reverse-geocoded place name (rate-limited 1/sec, polite UA)
ffmpeg→ 5 evenly-spaced JPEG frames (≤1920px wide)ffmpeg→ mono 16k WAV- WhisperX → Whisper transcribe + word-level alignment + pyannote diarization
- WhisperX translate mode → English translation (non-English only)
insightface→ face detection + 512-dim embeddings on the same frames- Vision model → single-call structured description (Scene/Subjects/Action/Mood/Shot type/Use cases) + keep/review/cull rating
- Write
[filename].description.mdnext to the video
---
file: IMG_4827.mov
path: /Volumes/SSD-2024/2024-08-construction/drone/IMG_4827.mov
parent_folder: drone
duration_seconds: 12.3
resolution: 3840x2160
codec: hvc1
size_bytes: 245678912
creation_time: 2024-08-14T07:23:11Z
location:
lat: 37.7456
lon: -119.5936
altitude_m: 1842.5
place: "Yosemite Valley, Mariposa County, USA"
language_detected: es
speaker_count: 2
rating: keep
indexed_at: 2026-05-17T14:32:01
---
# IMG_4827.mov
## Description
**Scene:** Wide drone aerial of a construction site at golden hour...
**Subjects:** Three workers in high-vis vests near a partially-built structure...
**Action:** Drone slowly orbits; workers carry materials between two structures.
**Mood:** Industrious, expansive, hopeful.
**Shot type:** Drone aerial, slow orbit.
**Use cases:**
- Construction milestone post
- "From the ground up" origin-story reel
- B-roll behind a voiceover
## Transcript (es, 2 speakers)
[SPEAKER_00] (00:00:01) Pon esta viga aquí primero.
[SPEAKER_01] (00:00:04) Sí, vale.
[SPEAKER_00] (00:00:07) Cuidado con el ángulo.
## English translation
Place this beam here first. Yes, OK. Careful with the angle.Drop .video-context.md at the root of any scan target to give the vision model better priors:
/Volumes/SSD-2024/.video-context.md
---
This drive contains construction-site footage, 2023-2026. Many clips
are drone aerials, crew training, and site walkthroughs. Languages mix
English and Spanish.
---
Without it, descriptions are generic.
A .video-context.md can also carry a line of names Whisper should spell correctly:
**Whisper proper nouns:** Yosemite, El Capitan, Half Dome, ...
These get passed to Whisper as initial_prompt + hotwords so place names and people names in speech don't come back garbled. A second regex pass (~/.framedex/whisper_fixes.json) catches anything the prompt bias misses.
Run on each drive separately:
vidx /Volumes/SSD-2023
vidx /Volumes/SSD-2024
vidx /Volumes/SSD-2025Each drive ends up self-contained with its own sidecars + _INDEX.json. Knowledge travels with the data. The face DB at ~/.framedex/faces.db is centralized so cross-drive person queries work.
| Flag | Purpose |
|---|---|
--dry-run |
Show what would be processed; no API/model calls |
--max-files N |
Stop after N clips (testing) |
--force |
Re-process clips even if a sidecar exists |
--whisper-model large-v3 |
Higher quality, slower (default is large-v3-turbo) |
--no-diarize |
Skip speaker diarization (faster; no HF_TOKEN needed) |
--no-faces |
Skip face detection + embeddings |
--no-geocode |
Skip Nominatim reverse geocoding (GPS still recorded) |
--max-duration MINUTES |
Skip clips longer than N minutes (default: 30; 0 = no limit) |
--exclude PATTERN |
Skip paths matching substring (repeatable) |
--backend cli|api|local |
Vision backend (see below) |
--vision-model haiku|sonnet |
Claude model for cli/api. Default haiku |
--local-base-url URL |
Override LM Studio endpoint (default http://localhost:1234/v1) |
--local-model NAME |
Specify which loaded model to use when LM Studio has multiple |
--no-whisper-prompt |
Disable proper-noun biasing |
--whisper-fixes PATH |
Override the canonical-name regex fixes file |
| Backend | What it uses | Speed | Cost | Privacy |
|---|---|---|---|---|
cli (default) |
claude -p via a Claude Max subscription |
~10-30s/clip | $0 marginal | Frames sent to Anthropic |
api |
Anthropic SDK with an API key | ~2-3s/clip | ~$0.002/clip (Haiku) | Frames sent to Anthropic |
local |
LM Studio (or any OpenAI-compatible server) | ~3-90s/clip | $0 | Fully local, fully offline |
For huge archives, api is fastest. For routine indexing on a Max plan, cli is free. For full privacy, local keeps everything on-device.
| Component | Local or cloud? |
|---|---|
| ffmpeg, exiftool, Whisper, pyannote, insightface | Local |
| Nominatim reverse geocode | Cloud — sends lat/lon only, never video. Skip with --no-geocode |
Vision (--backend cli/api) |
Cloud — sends 5 JPEG frames + a transcript snippet per clip |
Vision (--backend local) |
Fully local |
Face DB (~/.framedex/faces.db) |
Local only, never uploaded |
Whisper supports 99 languages with auto-detection. For non-English clips the script automatically runs a second translate-mode pass and stores the English version alongside the original transcript. For best quality on important non-English footage:
vidx /Volumes/SSD-2024 --whisper-model large-v3 --forceWhisperX uses pyannote/speaker-diarization-3.1 under the hood. First-time setup requires:
- A Hugging Face account + read token (
HF_TOKENenv var) - Clicking "Agree" on both pyannote model pages (linked in Quick start)
If HF_TOKEN is missing, the script logs a notice and continues without diarization. Transcripts still work; they just won't have speaker labels.
Already-indexed clips are skipped on re-runs (a sidecar existing = done). Ctrl-C any time; a restart picks up where it stopped. --force regenerates everything.
"Missing dependency: whisperx" — Run setup.py.
"Failed to load diarization pipeline" — You didn't accept the pyannote model terms on Hugging Face. Visit the two model pages, click Agree, then re-run.
Whisper model download stalls — setup.py --skip-model-download, then index_videos.py downloads on first use. Make sure you have disk space (~3GB for large-v3, ~1.5GB for turbo).
"No GPS data in this file" — Many clips don't have GPS metadata. The script handles this silently — the frontmatter just omits the location block.
Apple Silicon GPU not used — CTranslate2 (via WhisperX) currently runs on CPU on M-series Macs. For archive indexing, CPU is plenty fast (10-30× realtime).
| Command | Script | Purpose |
|---|---|---|
vidx |
index_videos.py |
Main indexer |
vidx-summary |
trip_summary.py |
Recursive per-folder summaries |
vidx-master |
master_index.py |
Drive-level _INDEX.md + _INDEX.json |
vidx-query |
query.py |
Filter sidecars by rating, lighting, person, keyword, location, language |
vidx-query /Volumes/SSD-2024 --rating keep --time-of-day golden_hour
vidx-query /Volumes/SSD-2024 --rating cull # the cull pile
vidx-query /Volumes/SSD-2024 --keyword drone --keyword landscape
vidx-query /Volumes/SSD-2024 --place-contains California --language es- Frame sampling is evenly-spaced, not scene-detected
- pyannote diarization degrades on heavy ambient noise (wind, music, crowd)
- WhisperX runs on CPU on Apple Silicon
- Face cluster IDs are temporary hashes until the
vidx-faceslabeling tool ships — embeddings are captured now, so no re-indexing will be needed - RAW photo support not yet (videos only)
MIT — see LICENSE.