A solo project to process and analyze YouTube watch history data collected from Google Takeout. The pipeline uses PySpark for efficient big data cleaning and transformation, with a 3-layer data architecture on MinIO and Delta Lake. Insights on user viewing habits, peak times, and popular content are visualized using ClickHouse and Apache Superset.
- Integrated, cleaned, validated, and transformed data entirely in PySpark, enabling faster and more flexible processing than traditional Java-based Hadoop MapReduce jobs.
- Implemented a 3-layer data architecture stored on MinIO and Delta Lake for efficient storage and reliable data management.
- Visualized data at two key stages: after cleaning and after OLAP system construction.
- Delivered insights on peak activity times, popular content topics, and viewing habits by week and month.
- Spark Cluster: Distributed compute engine that processes large-scale data faster than single-machine Python scripts.
- MinIO + Delta Lake: Optimized storage with data compression, version control, and enhanced security.
- ClickHouse: High-performance OLAP data warehouse for real-time analytics.
- Apache Superset: Lightweight, open-source visualization tool with strong integration in the Apache ecosystem.
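As a rough sketch, wiring a SparkSession to MinIO and Delta Lake typically looks like the snippet below. The endpoint, credentials, and app name here are illustrative assumptions, not the project's actual values; they must match your own docker-compose setup.

```python
from pyspark.sql import SparkSession

# Illustrative values only -- endpoint and credentials are assumptions
# and must match your own MinIO service configuration.
spark = (
    SparkSession.builder
    .appName("youtube-watch-history")
    # Enable Delta Lake table support
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Point the S3A connector at MinIO instead of AWS S3
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "admin")
    .config("spark.hadoop.fs.s3a.secret.key", "admin123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)
```

Path-style access is required because MinIO does not serve virtual-hosted-style bucket URLs by default.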
- Successfully analyzed user behavior from YouTube watch history data.
- Provided deep insights into peak viewing times, trending topics, and evolving viewing patterns over time.
- This project is developed solely for educational purposes.
- The data model is designed so the Gold Layer can optionally be skipped or merged into ClickHouse; this architecture may not suit every use case.
- Watch history data is highly private and will not be included in this repository. Detailed instructions on how to obtain your own data are provided below.
- Visit https://takeout.google.com
- (Optional) Click "Deselect all" so that only the services you select are exported
- Scroll down, find "YouTube and YouTube Music", and check its box
- Click "Multiple formats" and choose JSON format for history (your YouTube viewing and search history)
- Complete the remaining steps and download the data via an email link or Google Drive, depending on your selection
- Save the downloaded data to `DockerCompose/spark-apps/data`
- Navigate to the DockerCompose directory:

  ```
  cd DockerCompose
  ```

- Build the images:

  ```
  docker compose build
  ```

- Start the services in detached mode:

  ```
  docker compose up -d
  ```

- Once the services are running successfully, you are ready to proceed
- To ingest and clean the data, run the corresponding Spark scripts in the following order:
- ingest: ingest raw JSON data into the data lakehouse
- clean: clean and parse the raw data
- checkclean: load cleaned data into ClickHouse; viewable via ClickHouse UI or Apache Superset
- transform: transform data into dimensional (dim) and fact tables, loaded directly into ClickHouse
- Running order:
ingest -> clean -> checkclean -> transform
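The actual cleaning is done by the Spark scripts above. As a dependency-free sketch of what the `clean` step does to a single record, the parsing logic looks roughly like this in plain Python; the field names follow the usual Takeout `watch-history.json` layout, so verify them against your own export:

```python
from datetime import datetime

def parse_record(rec: dict):
    """Parse one raw Takeout watch-history entry into a flat row.

    Returns None for entries that cannot be cleaned
    (e.g. ads or removed videos without a "Watched " title).
    """
    title = rec.get("title", "")
    if not title.startswith("Watched "):
        return None
    subtitles = rec.get("subtitles") or []
    channel = subtitles[0]["name"] if subtitles else None
    # Takeout timestamps are ISO-8601 with a trailing 'Z'
    watched_at = datetime.fromisoformat(rec["time"].replace("Z", "+00:00"))
    return {
        "video_title": title.removeprefix("Watched "),
        "video_url": rec.get("titleUrl"),
        "channel": channel,
        "watched_at": watched_at,
        "hour": watched_at.hour,
        "weekday": watched_at.strftime("%A"),
    }

# Toy record in the assumed Takeout layout
sample = {
    "title": "Watched Never Gonna Give You Up",
    "titleUrl": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "subtitles": [{"name": "Rick Astley", "url": "https://www.youtube.com/..."}],
    "time": "2024-03-05T21:17:45.123Z",
}
row = parse_record(sample)
```

In the real pipeline the same logic is expressed with PySpark column expressions over the whole Bronze dataset rather than per-record Python.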
- Run the scripts inside the spark-submit container:

  ```
  docker exec -it spark-submit bash
  cd youtube-script
  ./ingest.sh
  # ...then the other scripts in order
  ```

- Important services in the project:
- Spark cluster master UI: http://localhost:8080
- ClickHouse UI: http://localhost:8123/play (user: admin / pass: admin123)
- MinIO UI: http://localhost:9001 (user: admin / pass: admin123)
- Apache Superset UI: http://localhost:8088 (user: admin / pass: admin)
- Feel free to explore and experiment with your cleaned and transformed data.
- Use the ClickHouse UI or Apache Superset to create queries, dashboards, and visualizations.
- Hope you find some interesting insights from your YouTube watch history!
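For example, the peak-activity insight boils down to a group-by over the watch timestamps. The plain-Python sketch below shows the idea on a toy sample; the function name and data are hypothetical, not part of the project's scripts:

```python
from collections import Counter
from datetime import datetime

def peak_hours(timestamps, top_n=3):
    """Count watches per hour of day and return the top_n busiest hours
    as (hour, count) pairs."""
    hours = Counter(
        datetime.fromisoformat(ts.replace("Z", "+00:00")).hour
        for ts in timestamps
    )
    return hours.most_common(top_n)

# Toy sample: three late-evening watches and one morning watch
watches = [
    "2024-03-05T21:17:45Z",
    "2024-03-06T21:02:11Z",
    "2024-03-06T08:30:00Z",
    "2024-03-07T21:40:09Z",
]
top = peak_hours(watches, top_n=2)
```

In ClickHouse the equivalent query would group by `toHour()` of the watch timestamp, which is what a Superset chart over the fact table effectively runs.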
