A data pipeline and RAG system built on Charlie Follows' YouTube yoga channel. Fetches video metadata and transcripts, stores them in DuckDB, and will power a Claude agent that recommends yoga flows by body area and duration.
- Python ingestion scripts (YouTube Data API v3 + youtube-transcript-api)
- DuckDB as local warehouse
- python-dotenv for config
- uv for package management
src/
  db.py                  # DuckDB connection + schema
  fetch_videos.py        # Fetches all video metadata from YouTube
  fetch_transcripts.py   # Fetches + chunks transcripts into 30s windows
data/
  raw/
    videos/              # Cached video JSON per video_id
    transcripts/         # Cached transcript JSON per video_id
  processed/             # Reserved for future use
  yoga.duckdb            # Local DuckDB warehouse (created at runtime)
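
The 30-second windowing done by fetch_transcripts.py could look roughly like the sketch below. It is not the script's actual code: the function name `chunk_transcript`, the segment shape (dicts with `text`, `start`, and `duration` keys), and the choice to assign each segment to the window containing its start time are all assumptions made for illustration.

```python
WINDOW_SECONDS = 30.0  # window size stated in this README


def chunk_transcript(video_id, segments, window_seconds=WINDOW_SECONDS):
    """Group transcript segments into fixed-length time windows.

    Assumption: `segments` is a list of dicts with "text", "start" and
    "duration" keys, with times in seconds. Each segment goes into the
    window that contains its start time.
    """
    windows = {}
    for seg in segments:
        index = int(seg["start"] // window_seconds)
        windows.setdefault(index, []).append(seg["text"].strip())

    chunks = []
    for index in sorted(windows):
        chunks.append({
            # Same {video_id}_{window_index} format as the chunk_id column
            "chunk_id": f"{video_id}_{index}",
            "video_id": video_id,
            "window_index": index,
            "start_seconds": index * window_seconds,
            "end_seconds": (index + 1) * window_seconds,
            "text": " ".join(windows[index]),
        })
    return chunks
```

Windows with no speech produce no row in this sketch, so `window_index` values may have gaps for silent stretches of a video.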
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

Create a .env file:
YOUTUBE_API_KEY=your_key_here
CHANNEL_ID=UC5HdAapbvqWN65GIqpWWL3Q

# Step 1: fetch all video metadata
python src/fetch_videos.py
# Step 2: fetch and chunk all transcripts
python src/fetch_transcripts.py

Both scripts are idempotent, so re-running them is safe. Raw API responses are cached under data/raw/; re-runs skip already-fetched data and cost zero API quota.
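
The cache-then-fetch behaviour described above can be sketched as follows. This is a minimal illustration, not the scripts' actual code: `load_or_fetch` and the `fetch_fn` callback are hypothetical names, and the one-file-per-video layout (`<video_id>.json`) is assumed from the "cached JSON per video_id" note in the directory listing.

```python
import json
from pathlib import Path


def load_or_fetch(video_id, cache_dir, fetch_fn):
    """Return cached JSON for video_id, calling fetch_fn only on a cache miss.

    `fetch_fn` is a hypothetical stand-in for the real API call.
    """
    cache_dir = Path(cache_dir)
    cache_path = cache_dir / f"{video_id}.json"

    # Cache hit: no network request, no API quota spent.
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))

    # Cache miss: fetch once, then persist the raw response for later runs.
    data = fetch_fn(video_id)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

Calling `load_or_fetch("abc", "data/raw/videos", fetch_fn)` twice would invoke `fetch_fn` only the first time; the second call reads the file written by the first.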
YouTube rate-limits transcript fetching at scale. Three proxy options are supported via .env:
Webshare (recommended):
WEBSHARE_USERNAME=your_username
WEBSHARE_PASSWORD=your_password

Generic HTTP proxy (Bright Data, Tor, etc.):
PROXY_HTTP_URL=http://user:pass@host:port
PROXY_HTTPS_URL=http://user:pass@host:port

Swiftshadow (free rotating proxies, less reliable):
USE_SWIFTSHADOW=true

If none are set, requests go direct.
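
One way to resolve these variables into a single proxy choice is sketched below. The function name `choose_proxy`, its return shape, and the precedence order (Webshare, then generic, then Swiftshadow) are assumptions for illustration; this README does not say how the script resolves the case where several options are set at once.

```python
import os


def choose_proxy(env=None):
    """Pick a proxy mode from environment-style settings.

    Returns a (kind, settings) tuple. The precedence order is an
    assumption: Webshare first, then generic, then Swiftshadow.
    """
    env = os.environ if env is None else env

    username = env.get("WEBSHARE_USERNAME")
    password = env.get("WEBSHARE_PASSWORD")
    if username and password:
        return "webshare", {"username": username, "password": password}

    http_url = env.get("PROXY_HTTP_URL")
    https_url = env.get("PROXY_HTTPS_URL")
    if http_url or https_url:
        return "generic", {"http": http_url, "https": https_url}

    if env.get("USE_SWIFTSHADOW", "").strip().lower() == "true":
        return "swiftshadow", {}

    # Nothing configured: requests go direct.
    return "direct", {}
```

Passing a plain dict instead of relying on `os.environ` keeps the function easy to test, for example `choose_proxy({"USE_SWIFTSHADOW": "true"})`.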
videos (
    video_id      VARCHAR PRIMARY KEY,
    title         VARCHAR,
    description   VARCHAR,
    duration      VARCHAR,   -- ISO 8601, e.g. PT45M30S
    published_at  TIMESTAMP,
    view_count    BIGINT,
    like_count    BIGINT,
    fetched_at    TIMESTAMP
)

transcript_chunks (
    chunk_id       VARCHAR PRIMARY KEY,   -- {video_id}_{window_index}
    video_id       VARCHAR,
    window_index   INTEGER,
    start_seconds  DOUBLE,
    end_seconds    DOUBLE,
    text           VARCHAR,
    fetched_at     TIMESTAMP
)

- Week 1: ingestion (video metadata + transcripts)
- Week 2: embeddings + vector search
- Week 3: Claude RAG agent
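
Because the `duration` column in the videos table stores ISO 8601 strings such as PT45M30S, recommending flows by length will need those values converted to seconds at some point. The helper below is a minimal sketch of that conversion; `duration_to_seconds` is a hypothetical name, not part of the current scripts, and it handles only the day, hour, minute, and second designators with whole-number values.

```python
import re

# Matches durations like PT45M30S, PT1H5M, or P1DT2H (whole numbers only).
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(value):
    """Convert an ISO 8601 duration string to a number of seconds."""
    match = _DURATION_RE.match(value)
    if match is None:
        raise ValueError(f"Unsupported duration format: {value!r}")

    # Missing components (e.g. no hours in PT45M30S) count as zero.
    parts = {name: int(num) if num else 0 for name, num in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(duration_to_seconds("PT45M30S"))  # 2730
```

With durations in seconds, a filter such as "flows under 20 minutes" becomes a simple numeric comparison against 1200.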