video-skill-extractor turns narrated videos into structured, timeline-ready skill steps.
Current pipeline supports:
- transcription (OpenAI-compatible Whisper endpoint)
- transcript parsing + chunking
- AI step extraction
- per-step frame extraction (ffmpeg)
- AI enrichment (reasoning + VLM, two-pass visual analysis)
- markdown rendering
- provider health checks
- Python 3.11+
- uv
- Docker (for local model services, optional)
- ffmpeg binary is handled via `imageio-ffmpeg` in this project
```bash
cd /Users/mg/src/course-step-extractor
uv sync --dev
```

Sanity checks:

```bash
uv run ruff check .
uv run pytest -q
```

If you want to run this through OpenClaw as a skill:
```bash
# install skill from ClawHub into your OpenClaw workspace
npx -y clawhub install video-skill --workdir ~/.openclaw/workspace
```

The skill installs to:

```
~/.openclaw/workspace/skills/video-skill
```

Then in that directory:

```bash
cd ~/.openclaw/workspace/skills/video-skill
uv sync --dev
cp config.example.json config.json
```

Validate provider endpoints before first run:

```bash
uv run video-skill config-validate --config config.json
uv run video-skill providers-ping --config config.json --path /v1/models
```

You can now run the same commands documented below from this installed skill directory.
```bash
./scripts/bootstrap_models.sh
docker compose -f deploy/docker-compose.models.yml up -d
docker compose -f deploy/docker-compose.models.yml ps
```

Create from template:

```bash
cp config.example.json config.json
```

Set the 3 provider roles:
- `transcription` → Whisper/OpenAI-compatible ASR endpoint
  - supports optional `language` (default `"en"`; use `"auto"` to enable autodetect)
- `reasoning` → reasoning model endpoint
- `vlm` → vision-language model endpoint
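A filled-in `config.json` might look like this; the field names and ports below are illustrative assumptions, so check `config.example.json` for the actual schema:

```json
{
  "providers": {
    "transcription": {
      "base_url": "http://localhost:8001",
      "model": "whisper-large-v3",
      "language": "auto"
    },
    "reasoning": {
      "base_url": "http://localhost:8002",
      "model": "my-reasoning-model"
    },
    "vlm": {
      "base_url": "http://localhost:8003",
      "model": "my-vlm-model"
    }
  }
}
```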
Use served model IDs from /v1/models (not raw filenames unless the server exposes those as IDs).
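An OpenAI-compatible server lists its served IDs in the `data` array of a `/v1/models` response. A sketch of pulling those IDs out of the parsed payload (the shape below is the standard OpenAI list format):

```python
import json

def served_model_ids(models_response: dict) -> list[str]:
    """Extract served model IDs from an OpenAI-style /v1/models payload."""
    return [entry["id"] for entry in models_response.get("data", [])]

payload = json.loads('{"object": "list", "data": [{"id": "whisper-large-v3", "object": "model"}]}')
print(served_model_ids(payload))  # ['whisper-large-v3']
```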
Validate + ping:

```bash
uv run video-skill config-validate --config config.json
uv run video-skill providers-ping --config config.json --path /v1/models
```

```bash
uv run video-skill --help
```

Key commands:
`transcribe`, `transcript-parse`, `transcript-chunk`, `steps-extract`, `frames-extract`, `steps-enrich`, `markdown-render`
Example video: datasets/demo/zac-game.mp4
```bash
# 1) ASR
uv run video-skill transcribe \
  --video datasets/demo/zac-game.mp4 \
  --out datasets/demo/zac-game.whisper.json \
  --config config.json
# optional override: --language auto (or --language es, --language fr, ...)
```
```bash
# 2) Parse transcript
uv run video-skill transcript-parse \
  --input datasets/demo/zac-game.whisper.json \
  --out datasets/demo/zac-game.segments.jsonl
```
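Conceptually, this step flattens Whisper's `segments` array into one JSON object per line. A stdlib-only sketch, assuming the Whisper output carries `start`, `end`, and `text` per segment (the real parser may keep more fields):

```python
import json

def parse_whisper(whisper: dict) -> list[dict]:
    """Flatten Whisper-style output into per-segment records."""
    return [
        {"start": s["start"], "end": s["end"], "text": s["text"].strip()}
        for s in whisper.get("segments", [])
    ]

whisper = {"segments": [{"start": 0.0, "end": 4.2, "text": " Welcome to the game. "}]}
for record in parse_whisper(whisper):
    print(json.dumps(record))  # one JSONL line per segment
```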
```bash
# 3) Chunk transcript
uv run video-skill transcript-chunk \
  --segments datasets/demo/zac-game.segments.jsonl \
  --out datasets/demo/zac-game.chunks.jsonl \
  --window-s 120 \
  --overlap-s 15
```
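The windowing above (120 s windows with 15 s overlap) can be sketched as a sweep over segment start times whose stride is `window - overlap`; the project's chunker likely handles boundaries more carefully, so treat this as illustrative:

```python
def chunk_segments(segments: list[dict], window_s: float = 120.0,
                   overlap_s: float = 15.0) -> list[dict]:
    """Group segments into fixed windows whose starts advance by window - overlap."""
    if not segments:
        return []
    stride = window_s - overlap_s
    end_time = max(s["end"] for s in segments)
    chunks, t = [], 0.0
    while t < end_time:
        # A segment belongs to every window that contains its start time.
        members = [s for s in segments if t <= s["start"] < t + window_s]
        if members:
            chunks.append({"start": t, "end": t + window_s, "segments": members})
        t += stride
    return chunks

segs = [{"start": 0, "end": 10, "text": "a"}, {"start": 110, "end": 130, "text": "b"}]
print(len(chunk_segments(segs)))  # 2: segment "b" falls in two overlapping windows
```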
```bash
# 4) Extract steps (AI)
uv run video-skill steps-extract \
  --segments datasets/demo/zac-game.segments.jsonl \
  --clips-manifest datasets/demo/lesson1.clips.jsonl \
  --chunks datasets/demo/zac-game.chunks.jsonl \
  --mode ai \
  --config config.json \
  --out datasets/demo/zac-game.steps.ai.jsonl
```
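Each line of the resulting steps JSONL is one extracted step. A plausible record shape, purely as an assumption about the schema rather than the project's actual output, is:

```json
{"step_id": 1, "title": "Open the level editor", "start_s": 12.4, "end_s": 47.9, "evidence": "transcript chunk 1"}
```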
```bash
# 5) Extract per-step frames for VLM grounding
uv run video-skill frames-extract \
  --video datasets/demo/zac-game.mp4 \
  --steps datasets/demo/zac-game.steps.ai.jsonl \
  --out-dir datasets/demo/frames_zac_game \
  --manifest-out datasets/demo/zac-game.frames_manifest.jsonl \
  --sample-count 2
```
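With `--sample-count 2`, the extractor needs timestamps inside each step's time range to hand to ffmpeg. One common policy, shown here as an assumption rather than the project's actual sampling rule, is evenly spaced interior points:

```python
def sample_timestamps(start_s: float, end_s: float, count: int) -> list[float]:
    """Evenly spaced interior timestamps, avoiding the exact step boundaries."""
    span = end_s - start_s
    return [start_s + span * (i + 1) / (count + 1) for i in range(count)]

print(sample_timestamps(10.0, 40.0, 2))  # [20.0, 30.0]
```

Each timestamp would then drive one ffmpeg seek-and-grab per frame.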
```bash
# 6) Enrich steps (AI, two-pass visual)
uv run video-skill steps-enrich \
  --steps datasets/demo/zac-game.steps.ai.jsonl \
  --frames-manifest datasets/demo/zac-game.frames_manifest.jsonl \
  --out datasets/demo/zac-game.steps.enriched.ai.jsonl \
  --mode ai \
  --config config.json
```
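The `ai` mode's two-pass flow can be sketched abstractly: a reasoning pass drafts the enrichment from the step text, then a VLM pass grounds it against the sampled frames. The function below takes the two model calls as injected callables, since the actual prompts and provider API are internal to the project:

```python
from typing import Callable

def enrich_step(step: dict, frame_paths: list[str],
                reason: Callable[[dict], str],
                describe: Callable[[list[str]], str]) -> dict:
    """Two-pass enrichment: reasoning draft first, then visual grounding."""
    draft = reason(step)            # pass 1: text-only reasoning model
    visual = describe(frame_paths)  # pass 2: VLM looks at the sampled frames
    return {**step, "enriched": draft, "visual_context": visual}

step = {"title": "Open the editor"}
out = enrich_step(step, ["f1.png"], lambda s: "draft", lambda f: "screenshot of editor")
print(out["enriched"], out["visual_context"])
```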
```bash
# 7) Render markdown
uv run video-skill markdown-render \
  --steps datasets/demo/zac-game.steps.enriched.ai.jsonl \
  --out datasets/demo/zac-game.md \
  --title "Zac Game - Skill Steps"
```

- `--mode heuristic` - no model calls; deterministic baseline
- `--mode ai-direct` - VLM-only enrichment path
- `--mode ai` - reasoning + VLM orchestration (recommended)
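The rendering stage is essentially templating enriched step records into markdown. A minimal sketch, with an output layout that is an assumption rather than the project's actual template:

```python
def render_markdown(title: str, steps: list[dict]) -> str:
    """Render enriched steps as a simple markdown outline."""
    lines = [f"# {title}", ""]
    for i, step in enumerate(steps, start=1):
        lines.append(f"## {i}. {step['title']}")
        lines.append(step.get("enriched", ""))
        lines.append("")
    return "\n".join(lines)

doc = render_markdown("Zac Game - Skill Steps", [{"title": "Start", "enriched": "Press play."}])
print(doc.splitlines()[0])  # # Zac Game - Skill Steps
```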
`steps-enrich` prints progress per step/stage and summary telemetry:

- `parse_errors`
- `transient_recovered`
- `unresolved_final`
```bash
make verify
```

This runs lint + tests with a coverage gate (>=90%).
Typical outputs:

- `*.whisper.json`
- `*.segments.jsonl`
- `*.chunks.jsonl`
- `*.steps.ai.jsonl`
- `*.frames_manifest.jsonl`
- `*.steps.enriched.ai.jsonl`
- optional `*.errors.jsonl` for parse/call telemetry
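If you need to aggregate the `*.errors.jsonl` telemetry yourself, a `Counter` over each record's category works; the `category` field name here is an assumption about the file's schema:

```python
import json
from collections import Counter

def tally_errors(jsonl_lines: list[str]) -> Counter:
    """Count telemetry records by category (e.g. parse_errors, transient_recovered)."""
    return Counter(json.loads(line)["category"] for line in jsonl_lines if line.strip())

lines = ['{"category": "parse_errors"}', '{"category": "transient_recovered"}',
         '{"category": "parse_errors"}']
print(tally_errors(lines)["parse_errors"])  # 2
```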
The project is evolving toward a generalized video skill library with OTIO-ready timeline metadata and editor/robotics adapters.