A solo project to process and analyze YouTube watch history data collected from Google Takeout. The pipeline uses PySpark for efficient big data cleaning and transformation, with a 3-layer data architecture on MinIO and Delta Lake. Insights on user viewing habits, peak times, and popular content are visualized using ClickHouse and Apache Superset.
- Integrated, cleaned, validated, and transformed data entirely in PySpark, enabling faster and more flexible processing than traditional Java-based Hadoop MapReduce jobs.
- Implemented a 3-layer data architecture stored on MinIO and Delta Lake for efficient storage and reliable data management.
- Visualized data at two key stages: after cleaning and after OLAP system construction.
- Delivered insights on peak activity times, popular content topics, and viewing habits by week and month.
- Spark Cluster: Distributed compute engine that processes large-scale data faster than single-machine Python scripts.
- MinIO + Delta Lake: Optimized storage with data compression, version control, and enhanced security.
- ClickHouse: High-performance OLAP data warehouse for real-time analytics.
- Apache Superset: Lightweight, open-source visualization tool with strong integration in the Apache ecosystem.
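As a rough sketch, wiring a SparkSession to MinIO and Delta Lake typically looks like the snippet below. The endpoint, credentials, and app name here are illustrative assumptions, not the project's actual values; they must match your own docker-compose setup.

```python
from pyspark.sql import SparkSession

# Illustrative values only -- endpoint and credentials are assumptions
# and must match your own MinIO service configuration.
spark = (
    SparkSession.builder
    .appName("youtube-watch-history")
    # Enable Delta Lake table support
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Point the S3A connector at MinIO instead of AWS S3
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "admin")
    .config("spark.hadoop.fs.s3a.secret.key", "admin123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)
```

Path-style access is required because MinIO does not serve virtual-hosted-style bucket URLs by default.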
- Successfully analyzed user behavior from YouTube watch history data.
- Provided deep insights into peak viewing times, trending topics, and evolving viewing patterns over time.
- This project is developed solely for educational purposes.
- The data model is designed so the Gold Layer can optionally be skipped or merged into ClickHouse; this architecture may not suit every use case.
- Watch history data is highly private and will not be included in this repository. Detailed instructions on how to obtain your own data are provided below.
- Visit https://takeout.google.com
- (Optional) Click "Deselect all" so that only the services you select are exported
- Scroll down, find "YouTube and YouTube Music", and check its box
- Click "Multiple formats" and choose JSON format for history (your YouTube viewing and search history)
- Complete the remaining steps and download the data via an email link or Google Drive, depending on your selection
- Save the downloaded data to `DockerCompose/spark-apps/data`
- Navigate to the DockerCompose directory:

  ```
  cd DockerCompose
  ```

- Build the images:

  ```
  docker compose build
  ```

- Start the services in detached mode:

  ```
  docker compose up -d
  ```

- Once the services are running successfully, you are ready to proceed
- To ingest and clean the data, run the corresponding Spark scripts in the following order:
- ingest: ingest raw JSON data into the data lakehouse
- clean: clean and parse the raw data
- checkclean: load cleaned data into ClickHouse; viewable via ClickHouse UI or Apache Superset
- transform: transform data into dimensional (dim) and fact tables, loaded directly into ClickHouse
- Running order:
ingest -> clean -> checkclean -> transform
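The actual cleaning is done by the Spark scripts above. As a dependency-free sketch of what the `clean` step does to a single record, the parsing logic looks roughly like this in plain Python; the field names follow the usual Takeout `watch-history.json` layout, so verify them against your own export:

```python
from datetime import datetime

def parse_record(rec: dict):
    """Parse one raw Takeout watch-history entry into a flat row.

    Returns None for entries that cannot be cleaned
    (e.g. ads or removed videos without a "Watched " title).
    """
    title = rec.get("title", "")
    if not title.startswith("Watched "):
        return None
    subtitles = rec.get("subtitles") or []
    channel = subtitles[0]["name"] if subtitles else None
    # Takeout timestamps are ISO-8601 with a trailing 'Z'
    watched_at = datetime.fromisoformat(rec["time"].replace("Z", "+00:00"))
    return {
        "video_title": title.removeprefix("Watched "),
        "video_url": rec.get("titleUrl"),
        "channel": channel,
        "watched_at": watched_at,
        "hour": watched_at.hour,
        "weekday": watched_at.strftime("%A"),
    }

# Toy record in the assumed Takeout layout
sample = {
    "title": "Watched Never Gonna Give You Up",
    "titleUrl": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "subtitles": [{"name": "Rick Astley", "url": "https://www.youtube.com/..."}],
    "time": "2024-03-05T21:17:45.123Z",
}
row = parse_record(sample)
```

In the real pipeline the same logic is expressed with PySpark column expressions over the whole Bronze dataset rather than per-record Python.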
- Run the scripts inside the spark-submit container:

  ```
  docker exec -it spark-submit bash
  cd youtube-script
  ./ingest.sh
  # ...then the other scripts in order
  ```

- Important services in the project:
- Spark cluster master UI: http://localhost:8080
- ClickHouse UI: http://localhost:8123/play (user: admin / pass: admin123)
- MinIO UI: http://localhost:9001 (user: admin / pass: admin123)
- Apache Superset UI: http://localhost:8088 (user: admin / pass: admin)
- Feel free to explore and experiment with your cleaned and transformed data.
- Use the ClickHouse UI or Apache Superset to create queries, dashboards, and visualizations.
- Hope you find some interesting insights from your YouTube watch history!
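For example, the peak-activity insight boils down to a group-by over the watch timestamps. The plain-Python sketch below shows the idea on a toy sample; the function name and data are hypothetical, not part of the project's scripts:

```python
from collections import Counter
from datetime import datetime

def peak_hours(timestamps, top_n=3):
    """Count watches per hour of day and return the top_n busiest hours
    as (hour, count) pairs."""
    hours = Counter(
        datetime.fromisoformat(ts.replace("Z", "+00:00")).hour
        for ts in timestamps
    )
    return hours.most_common(top_n)

# Toy sample: three late-evening watches and one morning watch
watches = [
    "2024-03-05T21:17:45Z",
    "2024-03-06T21:02:11Z",
    "2024-03-06T08:30:00Z",
    "2024-03-07T21:40:09Z",
]
top = peak_hours(watches, top_n=2)
```

In ClickHouse the equivalent query would group by `toHour()` of the watch timestamp, which is what a Superset chart over the fact table effectively runs.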
