Anu2711/YoutubeYogaRag

yoga-rag

A data pipeline and RAG system built on Charlie Follows' YouTube yoga channel. Fetches video metadata and transcripts, stores them in DuckDB, and will power a Claude agent that recommends yoga flows by body area and duration.

Stack

  • Python ingestion scripts (YouTube Data API v3 + youtube-transcript-api)
  • DuckDB as local warehouse
  • python-dotenv for config
  • uv for package management

Project structure

src/
  db.py                  # DuckDB connection + schema
  fetch_videos.py        # Fetches all video metadata from YouTube
  fetch_transcripts.py   # Fetches + chunks transcripts into 30s windows
data/
  raw/
    videos/              # Cached video JSON per video_id
    transcripts/         # Cached transcript JSON per video_id
  processed/             # Reserved for future use
yoga.duckdb              # Local DuckDB warehouse (created at runtime)
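
The 30-second windowing done by fetch_transcripts.py could look roughly like the sketch below. It is not the repository's actual code. It assumes each transcript snippet is a dict with "text" and "start" keys, and that window boundaries are fixed multiples of 30 seconds. The function name chunk_transcript is hypothetical. The chunk_id format follows the {video_id}_{window_index} convention given in the schema.

```python
from collections import defaultdict

WINDOW_SECONDS = 30.0  # window size stated in the project structure


def chunk_transcript(video_id, snippets, window_seconds=WINDOW_SECONDS):
    """Group transcript snippets into fixed-size windows by start time.

    Assumption: each snippet is a dict with "text" and "start" (seconds).
    """
    buckets = defaultdict(list)
    for snippet in snippets:
        # A snippet belongs to the window that contains its start time.
        index = int(snippet["start"] // window_seconds)
        buckets[index].append(snippet["text"].strip())

    chunks = []
    for index in sorted(buckets):
        chunks.append({
            "chunk_id": f"{video_id}_{index}",
            "video_id": video_id,
            "window_index": index,
            # Assumption: bounds are the window edges, not snippet edges.
            "start_seconds": index * window_seconds,
            "end_seconds": (index + 1) * window_seconds,
            "text": " ".join(buckets[index]),
        })
    return chunks
```

Windows with no speech produce no row, so window_index values may have gaps for silent stretches of a video.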

Setup

uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

Create a .env file:

YOUTUBE_API_KEY=your_key_here
CHANNEL_ID=UC5HdAapbvqWN65GIqpWWL3Q

Running the pipeline

# Step 1: fetch all video metadata
python src/fetch_videos.py

# Step 2: fetch and chunk all transcripts
python src/fetch_transcripts.py

Both scripts are idempotent, so re-running them is safe. Raw API responses are cached to data/raw/, one JSON file per video_id, so a re-run skips anything already fetched and spends no API quota on it.
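
A minimal sketch of the cache-then-fetch pattern follows. It assumes one JSON file per video_id under a cache directory such as data/raw/videos/. The helper name load_or_fetch and the fetch callback are hypothetical and do not come from the repository.

```python
import json
from pathlib import Path


def load_or_fetch(cache_dir, video_id, fetch):
    """Return the cached JSON for video_id, calling fetch() only on a miss.

    `fetch` is any callable taking a video_id and returning JSON-serializable
    data; in the real scripts this would be the API call.
    """
    path = Path(cache_dir) / f"{video_id}.json"
    if path.exists():
        # Cache hit: no network request, so no API quota is spent.
        return json.loads(path.read_text(encoding="utf-8"))

    payload = fetch(video_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload), encoding="utf-8")
    return payload
```

Because the cache is keyed only by video_id, refreshing a video's metadata under this scheme would mean deleting its cached file first.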

Proxy options

YouTube rate-limits transcript fetching at scale. Three proxy options are supported via .env:

Webshare (recommended):

WEBSHARE_USERNAME=your_username
WEBSHARE_PASSWORD=your_password

Generic HTTP proxy (Bright Data, Tor, etc.):

PROXY_HTTP_URL=http://user:pass@host:port
PROXY_HTTPS_URL=http://user:pass@host:port

Swiftshadow (free rotating proxies, less reliable):

USE_SWIFTSHADOW=true

If none are set, requests go direct.
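
The selection logic can be sketched as below. The README does not state which option wins when more than one is configured, so the precedence used here (Webshare, then generic proxy, then Swiftshadow, then direct) is an assumption. The function name select_proxy and the returned dict shape are hypothetical too.

```python
import os


def select_proxy(env=None):
    """Pick a proxy mode from environment variables.

    Assumed precedence: Webshare > generic HTTP proxy > Swiftshadow > direct.
    """
    env = os.environ if env is None else env

    username = env.get("WEBSHARE_USERNAME")
    password = env.get("WEBSHARE_PASSWORD")
    if username and password:
        return {"kind": "webshare", "username": username, "password": password}

    http_url = env.get("PROXY_HTTP_URL")
    https_url = env.get("PROXY_HTTPS_URL")
    if http_url or https_url:
        return {"kind": "generic", "http_url": http_url, "https_url": https_url}

    if env.get("USE_SWIFTSHADOW", "").strip().lower() == "true":
        return {"kind": "swiftshadow"}

    # Nothing configured: requests go direct.
    return {"kind": "direct"}
```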

Schema

videos (
    video_id     VARCHAR PRIMARY KEY,
    title        VARCHAR,
    description  VARCHAR,
    duration     VARCHAR,   -- ISO 8601, e.g. PT45M30S
    published_at TIMESTAMP,
    view_count   BIGINT,
    like_count   BIGINT,
    fetched_at   TIMESTAMP
)

transcript_chunks (
    chunk_id      VARCHAR PRIMARY KEY,  -- {video_id}_{window_index}
    video_id      VARCHAR,
    window_index  INTEGER,
    start_seconds DOUBLE,
    end_seconds   DOUBLE,
    text          VARCHAR,
    fetched_at    TIMESTAMP
)
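
Because duration is stored as an ISO 8601 string, filtering or recommending by length needs a conversion to seconds first. The sketch below is one way to do that with the standard library. It handles only the day, hour, minute, and second designators, and the helper name duration_to_seconds is hypothetical.

```python
import re

# Matches strings such as PT45M30S, PT1H, or P1DT2H (no weeks, months, years).
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(value):
    """Convert an ISO 8601 duration like 'PT45M30S' to whole seconds."""
    match = _DURATION.match(value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    # Missing components (e.g. no hours in PT45M30S) count as zero.
    parts = {name: int(num) if num else 0 for name, num in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )
```

For the example in the schema comment, duration_to_seconds("PT45M30S") returns 2730.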

Roadmap

  • Week 1: ingestion (video metadata + transcripts)
  • Week 2: embeddings + vector search
  • Week 3: Claude RAG agent

About

Repository with a pipeline that ingests Charlie Follows' YouTube videos for a recommendation engine