michaelgold/videoskill

video-skill-extractor

video-skill-extractor turns narrated videos into structured, timeline-ready skill steps.

The current pipeline supports:

  • transcription (OpenAI-compatible Whisper endpoint)
  • transcript parsing + chunking
  • AI step extraction
  • per-step frame extraction (ffmpeg)
  • AI enrichment (reasoning + VLM, two-pass visual analysis)
  • markdown rendering
  • provider health checks

1) Requirements

  • Python 3.11+
  • uv
  • Docker (for local model services, optional)
  • ffmpeg (the binary is provided via imageio-ffmpeg, so no separate install is needed)

2) Install

cd /path/to/course-step-extractor  # your local clone
uv sync --dev

Sanity checks:

uv run ruff check .
uv run pytest -q

3) OpenClaw / ClawHub installation

If you want to run this through OpenClaw as a skill:

# install skill from ClawHub into your OpenClaw workspace
npx -y clawhub install video-skill --workdir ~/.openclaw/workspace

The skill installs to:

  • ~/.openclaw/workspace/skills/video-skill

Then in that directory:

cd ~/.openclaw/workspace/skills/video-skill
uv sync --dev
cp config.example.json config.json

Validate provider endpoints before first run:

uv run video-skill config-validate --config config.json
uv run video-skill providers-ping --config config.json --path /v1/models

You can now run the same commands documented below from this installed skill directory.


4) Model setup (local/self-hosted)

A. Download models

./scripts/bootstrap_models.sh

B. Start model stack

docker compose -f deploy/docker-compose.models.yml up -d

C. Verify services are up

docker compose -f deploy/docker-compose.models.yml ps

5) Configure config.json

Create from template:

cp config.example.json config.json

Set the three provider roles:

  • transcription → Whisper/OpenAI-compatible ASR endpoint
    • supports optional language (default "en"; use "auto" to enable autodetect)
  • reasoning → reasoning model endpoint
  • vlm → vision-language model endpoint

Use served model IDs from /v1/models (not raw filenames unless the server exposes those as IDs).
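For orientation, a config.json covering the three roles might look like the sketch below. The actual schema lives in config.example.json, so every key here (providers, base_url, model, language) is an illustrative assumption, not the tool's guaranteed format:

```json
{
  "providers": {
    "transcription": {
      "base_url": "http://localhost:8001/v1",
      "model": "whisper-large-v3",
      "language": "en"
    },
    "reasoning": {
      "base_url": "http://localhost:8002/v1",
      "model": "my-reasoning-model"
    },
    "vlm": {
      "base_url": "http://localhost:8003/v1",
      "model": "my-vlm-model"
    }
  }
}
```

The model values must match IDs served at each endpoint's /v1/models route, per the note above.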

Validate + ping:

uv run video-skill config-validate --config config.json
uv run video-skill providers-ping --config config.json --path /v1/models
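providers-ping amounts to an HTTP GET against each provider's /v1/models route. If you need to debug an endpoint by hand, here is a stdlib-only sketch; the localhost URL is a placeholder, and the `{"data": [{"id": ...}]}` shape is the usual OpenAI-compatible response, not something this repo guarantees:

```python
import json
import urllib.request


def extract_model_ids(payload: dict) -> list[str]:
    """Pull model IDs out of an OpenAI-compatible /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]


def ping_models(base_url: str) -> list[str]:
    """GET {base_url}/v1/models and return the served model IDs."""
    url = f"{base_url.rstrip('/')}/v1/models"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return extract_model_ids(json.load(resp))


# e.g. ping_models("http://localhost:8001")  # placeholder URL
```

The returned IDs are what belongs in the model fields of config.json.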

6) CLI quick usage

uv run video-skill --help

Key commands:

  • transcribe
  • transcript-parse
  • transcript-chunk
  • steps-extract
  • frames-extract
  • steps-enrich
  • markdown-render

7) End-to-end run (manual stages)

Example video: datasets/demo/zac-game.mp4

# 1) ASR
uv run video-skill transcribe \
  --video datasets/demo/zac-game.mp4 \
  --out datasets/demo/zac-game.whisper.json \
  --config config.json
# optional override: --language auto  (or --language es, --language fr, ...)

# 2) Parse transcript
uv run video-skill transcript-parse \
  --input datasets/demo/zac-game.whisper.json \
  --out datasets/demo/zac-game.segments.jsonl

# 3) Chunk transcript
uv run video-skill transcript-chunk \
  --segments datasets/demo/zac-game.segments.jsonl \
  --out datasets/demo/zac-game.chunks.jsonl \
  --window-s 120 \
  --overlap-s 15
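To make the --window-s/--overlap-s semantics concrete: each chunk covers window_s seconds and starts window_s - overlap_s seconds after the previous one, so consecutive chunks share overlap_s seconds of transcript. This is a sketch of the idea, not transcript-chunk's actual implementation:

```python
def chunk_windows(duration_s: float, window_s: float = 120.0, overlap_s: float = 15.0):
    """Yield (start, end) windows; consecutive windows share `overlap_s` seconds."""
    step = window_s - overlap_s
    start = 0.0
    while start < duration_s:
        yield (start, min(start + window_s, duration_s))
        start += step


# a 300 s transcript with the flags used above:
windows = list(chunk_windows(300.0))  # -> [(0.0, 120.0), (105.0, 225.0), (210.0, 300.0)]
```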

# 4) Extract steps (AI)
uv run video-skill steps-extract \
  --segments datasets/demo/zac-game.segments.jsonl \
  --clips-manifest datasets/demo/lesson1.clips.jsonl \
  --chunks datasets/demo/zac-game.chunks.jsonl \
  --mode ai \
  --config config.json \
  --out datasets/demo/zac-game.steps.ai.jsonl

# 5) Extract per-step frames for VLM grounding
uv run video-skill frames-extract \
  --video datasets/demo/zac-game.mp4 \
  --steps datasets/demo/zac-game.steps.ai.jsonl \
  --out-dir datasets/demo/frames_zac_game \
  --manifest-out datasets/demo/zac-game.frames_manifest.jsonl \
  --sample-count 2
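--sample-count controls how many frames are pulled per step. One plausible scheme (an assumption about, not a description of, frames-extract's internals) is to sample at the midpoints of equal sub-intervals of each step's time span, then grab each timestamp with ffmpeg's -ss seek:

```python
def sample_times(start_s: float, end_s: float, count: int) -> list[float]:
    """Evenly spaced timestamps at the midpoints of `count` equal sub-intervals."""
    span = end_s - start_s
    return [start_s + span * (i + 0.5) / count for i in range(count)]


# each timestamp could then be captured with ffmpeg, e.g.:
#   ffmpeg -ss <t> -i video.mp4 -frames:v 1 frame_<i>.png
times = sample_times(10.0, 30.0, 2)  # -> [15.0, 25.0]
```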

# 6) Enrich steps (AI, two-pass visual)
uv run video-skill steps-enrich \
  --steps datasets/demo/zac-game.steps.ai.jsonl \
  --frames-manifest datasets/demo/zac-game.frames_manifest.jsonl \
  --out datasets/demo/zac-game.steps.enriched.ai.jsonl \
  --mode ai \
  --config config.json

# 7) Render markdown
uv run video-skill markdown-render \
  --steps datasets/demo/zac-game.steps.enriched.ai.jsonl \
  --out datasets/demo/zac-game.md \
  --title "Zac Game - Skill Steps"
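Conceptually, the render stage walks the enriched step records and emits a titled markdown outline. The field names in this sketch (title, start_s, end_s, description) are assumptions about the JSONL schema, not its documented shape:

```python
def render_markdown(title: str, steps: list[dict]) -> str:
    """Render step records as a simple markdown outline (field names are assumed)."""
    lines = [f"# {title}", ""]
    for i, step in enumerate(steps, 1):
        lines.append(f"## Step {i}: {step.get('title', 'Untitled')}")
        lines.append(f"*{step.get('start_s', 0.0):.1f}s to {step.get('end_s', 0.0):.1f}s*")
        lines.append(step.get("description", ""))
        lines.append("")  # blank line between steps
    return "\n".join(lines)
```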

8) Enrichment modes

  • --mode heuristic
    • no model calls; deterministic baseline
  • --mode ai-direct
    • VLM-only enrichment path
  • --mode ai
    • reasoning + VLM orchestration (recommended)

steps-enrich prints progress per step/stage and summary telemetry:

  • parse_errors
  • transient_recovered
  • unresolved_final

9) Testing and quality gates

make verify

This runs lint and the test suite with a coverage gate (>=90%).


10) Output artifacts

Typical outputs:

  • *.whisper.json
  • *.segments.jsonl
  • *.chunks.jsonl
  • *.steps.ai.jsonl
  • *.frames_manifest.jsonl
  • *.steps.enriched.ai.jsonl
  • optional *.errors.jsonl for parse/call telemetry
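Since *.errors.jsonl is line-delimited JSON, a few lines of Python can tally it; the "kind" field used here is an assumed record shape, not a documented one:

```python
import json
from collections import Counter


def tally_errors(path: str) -> Counter:
    """Count records in a JSONL telemetry file, grouped by their 'kind' field."""
    counts: Counter = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                counts[json.loads(line).get("kind", "unknown")] += 1
    return counts
```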

11) Next direction

The project is evolving toward a generalized video skill library with OTIO-ready timeline metadata and editor/robotics adapters.
