Apache Spark

Batch processing with PySpark: local setup, SQL/DataFrames, cloud execution, and homework deliverables

Overview

This repository consolidates the work for Module 6 (Batch Processing with Spark):

- annotated workshop notebooks (03 to 09)
- executable Spark scripts for local and cloud runs
- a full project/homework notebook on NYC Yellow Taxi data (November 2025)
- a project summary of validated answers and queries

Environment

From pyproject.toml:

- Python >= 3.13
- pyspark >= 4.1.1
- jupyter >= 1.1.1
- marimo >= 0.20.4

Quick Start

uv sync
source .venv/bin/activate
jupyter notebook

Minimal Spark smoke test:

python workshop/test_spark.py

Data Sources Used

Yellow Taxi November 2025 parquet:
- https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet
Taxi zone lookup CSV:
- https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv
FHVHV January 2021 (workshop):
- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-01.csv.gz

Download to ./data, skipping files that already exist:

mkdir -p ./data
if [ ! -f ./data/yellow_tripdata_2025-11.parquet ]; then
  wget -P ./data https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet
fi
if [ ! -f ./data/taxi_zone_lookup.csv ]; then
  wget -P ./data https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv
fi
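
The same skip-if-present guard can be sketched in Python for use inside a notebook. This is a minimal sketch, not part of the repository: the names target_path and download_missing and the injectable fetch parameter are hypothetical, and the URLs are the two listed above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URLs copied from the "Data Sources Used" list in this README.
URLS = [
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet",
    "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv",
]


def target_path(url, data_dir):
    """Map a URL to its local file name inside data_dir."""
    return Path(data_dir) / url.rsplit("/", 1)[-1]


def download_missing(urls, data_dir="data", fetch=urlretrieve):
    """Download each URL unless the target file already exists.

    `fetch` is injectable (hypothetical design choice) so the skip logic
    can be exercised without network access.
    Returns the list of paths that were actually fetched.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)  # same as `mkdir -p ./data`
    fetched = []
    for url in urls:
        dest = target_path(url, data_dir)
        if dest.exists():
            continue  # same guard as `[ ! -f ... ]` in the shell version
        fetch(url, dest)
        fetched.append(dest)
    return fetched
```

Calling download_missing(URLS) twice in a row should fetch both files the first time and nothing the second time.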

Repository Map

.
├── README.md
├── pyproject.toml
├── main.py
├── ressources/
│   └── pictures/spark_logo.jpg
├── workshop/
│   ├── 03_test.ipynb
│   ├── 04_pyspark.ipynb
│   ├── 04_pyspark_Ipython.py
│   ├── 04_pyspark_marimo.py
│   ├── 05_taxi_schema.ipynb
│   ├── 06_spark_sql.ipynb
│   ├── 06_spark_sql.py
│   ├── 06_spark_sql_big_query.py
│   ├── 07_groupby_join.ipynb
│   ├── 08_rdds.ipynb
│   ├── 09_spark_gcs.ipynb
│   ├── cloud.md
│   ├── download_data.sh
│   ├── test_spark.py
│   └── homework.ipynb
└── project/
    ├── README.md
    ├── homework.ipynb
    └── data/

Workshop Track

Notebooks

- 03_test.ipynb: local Spark quick validation (CSV -> Spark -> Parquet)
- 04_pyspark.ipynb: foundational DataFrame workflow and UDF introduction
- 05_taxi_schema.ipynb: explicit schema design and type handling
- 06_spark_sql.ipynb: SQL/DataFrame transformations and monthly aggregations
- 07_groupby_join.ipynb: grouping and joins
- 08_rdds.ipynb: RDD transformations and distributed concepts
- 09_spark_gcs.ipynb: Spark + Google Cloud Storage integration
- workshop/homework.ipynb: annotated homework practice notebook

Scripts

- workshop/test_spark.py: local runtime smoke test (spark.version, simple DataFrame)
- workshop/download_data.sh: helper that downloads raw taxi data by year and month
- workshop/06_spark_sql.py:
  - normalizes green/yellow schemas
  - unions both datasets
  - computes monthly revenue KPIs
  - writes parquet output
- workshop/06_spark_sql_big_query.py:
  - runs the same KPI pipeline
  - writes the result to BigQuery (with a temporary GCS bucket configured)
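
The normalize, union, and aggregate steps listed for 06_spark_sql.py could look roughly like the sketch below. It is an illustration under assumptions, not the script itself: the function names, the output column names (revenue_month, revenue_monthly_total_amount, trip_count), and the choice to keep only columns shared by both frames are hypothetical. The lpep_/tpep_ timestamp prefixes are the TLC column names for green and yellow trips. pyspark is imported inside the functions so the rename table can be inspected without Spark installed.

```python
# Column renames that bring both taxi types onto one schema.
RENAMES = {
    "green": {
        "lpep_pickup_datetime": "pickup_datetime",
        "lpep_dropoff_datetime": "dropoff_datetime",
    },
    "yellow": {
        "tpep_pickup_datetime": "pickup_datetime",
        "tpep_dropoff_datetime": "dropoff_datetime",
    },
}


def normalize(df, service_type):
    """Rename the type-specific timestamp columns and tag each row."""
    from pyspark.sql import functions as F  # lazy import; requires pyspark

    for old, new in RENAMES[service_type].items():
        df = df.withColumnRenamed(old, new)
    return df.withColumn("service_type", F.lit(service_type))


def monthly_revenue(df_green, df_yellow):
    """Union both datasets on their shared columns and aggregate by month."""
    from pyspark.sql import functions as F

    green = normalize(df_green, "green")
    yellow = normalize(df_yellow, "yellow")
    # Assumption: keep only the columns present in both frames.
    yellow_cols = set(yellow.columns)
    common = [c for c in green.columns if c in yellow_cols]
    trips = green.select(common).unionByName(yellow.select(common))
    return (
        trips
        .withColumn("revenue_month", F.date_trunc("month", "pickup_datetime"))
        .groupBy("revenue_month", "service_type")
        .agg(
            F.sum("total_amount").alias("revenue_monthly_total_amount"),
            F.count("*").alias("trip_count"),
        )
    )
```

The result of monthly_revenue(...) can then be written with .write.mode("overwrite").parquet(...) for the local variant; the BigQuery variant differs only in the writer.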

Cloud Notes

workshop/cloud.md documents:

- standalone Spark cluster setup (start-master, start-worker)
- spark-submit flow
- Dataproc job submission
- BigQuery connector usage

Project Track (Yellow Taxi November 2025)

The notebook in project/homework.ipynb follows a structured implementation flow for Questions 1 to 6.
The focus is on building a reproducible Spark workflow, not only producing final values.

Technical development logic:

- Initialize runtime: import PySpark modules, create SparkSession, and verify session state/version.
- Ingest source data: read yellow_tripdata_2025-11.parquet into a DataFrame and inspect schema/sample rows.
- Register SQL context: create temp views (e.g. yellow_2025_11) to switch easily between SQL and DataFrame APIs.
- Persist partitioned output: repartition(4) then write parquet to a target folder for file-size and partition analysis.
- Validate storage artifacts: inspect generated part-* parquet files and _SUCCESS markers from Spark writes.
- Implement date-based analytics: derive pickup date (to_date) and run count aggregations for daily filtering use cases.
- Implement trip-duration analytics: compute duration from pickup/dropoff timestamps with safe timestamp-to-seconds conversion.
- Enable observability: trigger actions and inspect Spark UI (uiWebUrl) for stages, jobs, and task execution behavior.
- Enrich with lookup data: load taxi_zone_lookup.csv, align key types (LocationID), and join on PULocationID.
- Rank low-frequency pickup zones: aggregate by Zone, sort ascending by trip count, and limit output for answer candidates.
- Cross-check logic in two styles: keep equivalent SQL and PySpark DataFrame versions for clarity and debugging.
- Handle notebook edge cases: address path overwrite conflicts and Spark session reconnect issues before reruns.
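
The flow above can be condensed into a short sketch. It is one possible arrangement under assumptions, not the notebook's actual code: every function name here is hypothetical, the date in run() is a placeholder rather than a homework answer, and casting to timestamp before unix_timestamp is just one way to make the seconds conversion robust to how the parquet timestamps are typed. Column names (tpep_pickup_datetime, tpep_dropoff_datetime, PULocationID, LocationID, Zone) are those of the TLC files. pyspark is imported inside the functions so the SQL helper can be used without a running session.

```python
VIEW = "yellow_2025_11"
PICKUP = "tpep_pickup_datetime"
DROPOFF = "tpep_dropoff_datetime"


def build_session(app_name="yellow_2025_11"):
    """Create (or reuse) a local SparkSession."""
    from pyspark.sql import SparkSession  # lazy import; requires pyspark

    return SparkSession.builder.master("local[*]").appName(app_name).getOrCreate()


def trips_on_day(df, day):
    """DataFrame version: count trips whose pickup date equals `day`."""
    from pyspark.sql import functions as F

    return df.filter(F.to_date(PICKUP) == F.lit(day).cast("date")).count()


def trips_on_day_sql(view, day):
    """SQL twin of trips_on_day, for cross-checking via spark.sql(...)."""
    return (
        f"SELECT COUNT(*) AS trips FROM {view} "
        f"WHERE to_date({PICKUP}) = DATE '{day}'"
    )


def longest_trip_hours(df):
    """Longest trip in hours, computed from epoch seconds."""
    from pyspark.sql import functions as F

    # Cast first so the conversion does not depend on the inferred timestamp type.
    start = F.unix_timestamp(F.col(PICKUP).cast("timestamp"))
    end = F.unix_timestamp(F.col(DROPOFF).cast("timestamp"))
    return df.select(F.max((end - start) / 3600).alias("hours")).first()["hours"]


def least_frequent_zones(df, zones, n=5):
    """Join on pickup zone and return the n zones with the fewest trips."""
    from pyspark.sql import functions as F

    # The CSV is read as strings, so align the key type before joining.
    zones = zones.withColumn("LocationID", F.col("LocationID").cast("int"))
    return (
        df.join(zones, df["PULocationID"] == zones["LocationID"], "inner")
        .groupBy("Zone")
        .count()
        .orderBy(F.col("count").asc())
        .limit(n)
    )


def run(data_dir="data", out_dir="partitioned/yellow_2025_11_repartitioned"):
    """Wire the steps together; paths follow this README's layout."""
    spark = build_session()
    df = spark.read.parquet(f"{data_dir}/yellow_tripdata_2025-11.parquet")
    df.createOrReplaceTempView(VIEW)
    df.repartition(4).write.mode("overwrite").parquet(out_dir)
    zones = spark.read.option("header", "true").csv(f"{data_dir}/taxi_zone_lookup.csv")
    day = "2025-11-15"  # placeholder date, not a homework answer
    print("Spark UI:", spark.sparkContext.uiWebUrl)
    print("DataFrame count:", trips_on_day(df, day))
    print("SQL count:", spark.sql(trips_on_day_sql(VIEW, day)).first()["trips"])
    print("Longest trip (h):", longest_trip_hours(df))
    least_frequent_zones(df, zones).show()
```

Keeping trips_on_day and trips_on_day_sql side by side makes the two-style cross-check cheap: if the two counts disagree, the bug is in one of two short functions.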

Useful Operational Notes

- If port 4040 is busy, the Spark UI binds to the next free port (4041, 4042, ...).
- For repeated writes to the same parquet path, use overwrite mode:

  df.repartition(4).write.mode("overwrite").parquet("partitioned/yellow_2025_11_repartitioned")

- ls size units:
  - plain ls -l reports file sizes in bytes
  - the total line on macOS/BSD counts 512-byte blocks
  - use ls -lah or du -sh for human-readable output
- If a Spark session becomes unstable in a notebook (ConnectionRefusedError), restart the kernel and recreate the session before continuing.
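
To avoid the ls unit confusion noted above, the written parquet folder can also be inspected from Python. The sketch below uses only the standard library; the function names are hypothetical, and the part-* and _SUCCESS names are the ones Spark writes. allocated_bytes relies on st_blocks, which the Python documentation describes as a count of 512-byte blocks and which is not available on Windows.

```python
import os
from pathlib import Path


def summarize_output(folder):
    """Report Spark output files in `folder` with sizes in bytes.

    Returns (has_success_marker, [(file_name, size_bytes), ...]).
    """
    folder = Path(folder)
    parts = sorted(folder.glob("part-*"))
    sizes = [(p.name, p.stat().st_size) for p in parts]
    return (folder / "_SUCCESS").exists(), sizes


def allocated_bytes(path):
    """Disk space allocated to a file, derived from its 512-byte block count."""
    return os.stat(path).st_blocks * 512


def human(size_bytes):
    """Format a byte count in powers of 1024, similar in spirit to `ls -lh`."""
    size = float(size_bytes)
    for unit in ("B", "K", "M", "G"):
        if size < 1024 or unit == "G":
            return f"{size:.1f}{unit}"
        size /= 1024
```

For example, summarize_output("partitioned/yellow_2025_11_repartitioned") should list four part files after a repartition(4) write, and human(size) turns each byte count into a readable figure.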

Related Files

Project summary and final homework Q/A: project/README.md
Cloud run cookbook: workshop/cloud.md
