Wal-Liu/YoutubeAnalysis
YouTube Watch History Data Pipeline

Introduction

A solo project to process and analyze YouTube watch history data collected from Google Takeout. The pipeline uses PySpark for efficient big data cleaning and transformation, with a 3-layer data architecture on MinIO and Delta Lake. Insights on user viewing habits, peak times, and popular content are visualized using ClickHouse and Apache Superset.

Responsibilities and Work

  • Integrated, cleaned, validated, and transformed data entirely in PySpark, enabling faster and more flexible processing than traditional Java-based Hadoop MapReduce jobs.
  • Implemented a 3-layer data architecture stored on MinIO and Delta Lake for efficient storage and reliable data management.
  • Visualized data at two key stages: after cleaning and after OLAP system construction.
  • Delivered insights on peak activity times, popular content topics, and viewing habits by week and month.
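The peak-time insight above boils down to bucketing watch events by hour of day. A minimal pure-Python sketch of that aggregation (the real pipeline does this in PySpark; the ISO-8601 `time` field format is assumed from Google Takeout's export):

```python
from collections import Counter
from datetime import datetime

def peak_hours(timestamps):
    """Count watch events per hour of day from ISO-8601 timestamps."""
    hours = Counter()
    for ts in timestamps:
        # Takeout timestamps look like "2023-05-01T19:34:56Z" (assumed format)
        dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        hours[dt.hour] += 1
    return hours.most_common()

# Hypothetical sample: three evening views, one morning view
sample = [
    "2023-05-01T19:34:56Z",
    "2023-05-01T20:10:02Z",
    "2023-05-02T19:45:11Z",
    "2023-05-02T08:03:40Z",
]
print(peak_hours(sample))  # hour 19 is the busiest, with 2 views
```

The same grouping extended to day-of-week or calendar month gives the weekly and monthly habit views.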

Technologies and Highlights

  • Spark Cluster: Distributed processing of large-scale data, far faster than single-machine Python scripts.
  • MinIO + Delta Lake: Optimized storage with data compression, version control, and enhanced security.
  • ClickHouse: High-performance OLAP data warehouse for real-time analytics.
  • Apache Superset: Lightweight, open-source visualization tool with strong integration in the Apache ecosystem.

Achievements

  • Successfully analyzed user behavior from YouTube watch history data.
  • Provided deep insights into peak viewing times, trending topics, and evolving viewing patterns over time.

Architecture

(Architecture diagram)

Notes

  • This project is developed solely for educational purposes.
  • The data model is designed to optionally skip or merge the Gold Layer with ClickHouse; this architecture may not suit every use case.
  • Watch history data is highly private and will not be included in this repository. Detailed instructions on how to obtain your own data are provided below.

Step by Step Guide

1. Obtain data from Google Takeout

  • Visit https://takeout.google.com
  • (Optional) Click Deselect all so that only the services you re-select are exported
  • Scroll to the bottom and find YouTube and YouTube Music, then check the box
  • Select Multiple formats, and under history (your YouTube viewing and search logs) choose the JSON format
  • Complete the remaining steps and download the archive via email or Drive, depending on the delivery method you selected
  • Save the downloaded data to DockerCompose/spark-apps/data
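Before running the pipeline, it can help to sanity-check the export. A small sketch that parses a watch-history entry and keeps the fields the pipeline cares about (the field names `title`, `subtitles`, and `time` are assumptions based on Takeout's YouTube export; verify them against your own file):

```python
import json

# Hypothetical single-entry export, mimicking the assumed Takeout layout
RAW = """[
  {
    "header": "YouTube",
    "title": "Watched Some Video",
    "titleUrl": "https://www.youtube.com/watch?v=abc123",
    "subtitles": [{"name": "Some Channel", "url": "https://www.youtube.com/channel/xyz"}],
    "time": "2023-05-01T19:34:56Z"
  }
]"""

def load_entries(text):
    """Parse the export, keeping only title, channel, and timestamp."""
    entries = []
    for e in json.loads(text):
        entries.append({
            "title": e.get("title"),
            # "subtitles" holds the channel; it may be missing for removed videos
            "channel": (e.get("subtitles") or [{}])[0].get("name"),
            "time": e.get("time"),
        })
    return entries

print(load_entries(RAW))
```

Entries for deleted or private videos often lack `subtitles`, which is why the sketch defaults the channel to `None` rather than failing.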

2. Docker Compose Setup

  • Navigate to the DockerCompose directory:
    cd DockerCompose
  • Build the images:
    docker compose build
  • Start the services in detached mode:
    docker compose up -d
  • Once the services are successfully running, you are ready to proceed

3. Explanation of files and services

  • To move and clean data, run the corresponding Spark scripts in this order:
    • ingest: ingest raw JSON data into the data lakehouse
    • clean: clean and parse the raw data
    • checkclean: load cleaned data into ClickHouse; viewable via ClickHouse UI or Apache Superset
    • transform: transform data into dimensional (dim) and fact tables, loaded directly into ClickHouse
    • Running order: ingest -> clean -> checkclean -> transform
  • Run the scripts inside the spark-submit container:
    docker exec -it spark-submit bash
    cd ~/youtube-script
    ./ingest.sh
    # then run the remaining scripts in the order above
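Each stage reads what the previous one wrote, so the order is a hard dependency. A trivial sketch of that ordering as a driver loop (stage names are from the repo; the bodies are placeholders, not the real spark-submit jobs):

```python
# Ordered pipeline stages; each reads the previous stage's output layer.
STAGES = ["ingest", "clean", "checkclean", "transform"]

def run_stage(name, log):
    # Placeholder: in the real project this would invoke a spark-submit job.
    log.append(name)

def run_pipeline():
    log = []
    for stage in STAGES:
        run_stage(stage, log)
    return log

print(run_pipeline())  # ['ingest', 'clean', 'checkclean', 'transform']
```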

4. Exploring Your Data

  • Feel free to explore and experiment with your cleaned and transformed data.
  • Use the ClickHouse UI or Apache Superset to create queries, dashboards, and visualizations.
  • Hope you find some interesting insights from your YouTube watch history!
