Batch processing with PySpark: local setup, SQL/DataFrames, cloud execution, and homework deliverables
This repository consolidates Module 6 (Batch Processing with Spark) work:
- annotated workshop notebooks (
03to09) - executable Spark scripts for local and cloud runs
- a full project/homework notebook on NYC Yellow Taxi (November 2025)
- a project summary of validated answers and queries
From pyproject.toml:
- Python
>= 3.13 pyspark >= 4.1.1jupyter >= 1.1.1marimo >= 0.20.4
uv sync
source .venv/bin/activate
jupyter notebookMinimal Spark smoke test:
python workshop/test_spark.py- Yellow Taxi November 2025 parquet:
https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet
- Taxi zone lookup CSV:
https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv
- FHVHV January 2021 (workshop):
https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-01.csv.gz
Download safely to ./data:
mkdir -p ./data
if [ ! -f ./data/yellow_tripdata_2025-11.parquet ]; then
wget -P ./data https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet
fi
if [ ! -f ./data/taxi_zone_lookup.csv ]; then
wget -P ./data https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv
fi.
├── README.md
├── pyproject.toml
├── main.py
├── ressources/
│ └── pictures/spark_logo.jpg
├── workshop/
│ ├── 03_test.ipynb
│ ├── 04_pyspark.ipynb
│ ├── 04_pyspark_Ipython.py
│ ├── 04_pyspark_marimo.py
│ ├── 05_taxi_schema.ipynb
│ ├── 06_spark_sql.ipynb
│ ├── 06_spark_sql.py
│ ├── 06_spark_sql_big_query.py
│ ├── 07_groupby_join.ipynb
│ ├── 08_rdds.ipynb
│ ├── 09_spark_gcs.ipynb
│ ├── cloud.md
│ ├── download_data.sh
│ ├── test_spark.py
│ └── homework.ipynb
└── project/
├── README.md
├── homework.ipynb
└── data/
03_test.ipynb: local Spark quick validation (CSV -> Spark -> Parquet)04_pyspark.ipynb: foundational DataFrame workflow and UDF introduction05_taxi_schema.ipynb: explicit schema design and type handling06_spark_sql.ipynb: SQL/DataFrame transformations and monthly aggregations07_groupby_join.ipynb: grouping and joins08_rdds.ipynb: RDD transformations and distributed concepts09_spark_gcs.ipynb: Spark + Google Cloud Storage integrationworkshop/homework.ipynb: annotated homework practice notebook
workshop/test_spark.py: local runtime smoke test (spark.version, simple DataFrame)workshop/download_data.sh: yearly/monthly raw taxi download helperworkshop/06_spark_sql.py:- normalizes green/yellow schemas
- unions both datasets
- computes monthly revenue KPIs
- writes parquet output
workshop/06_spark_sql_big_query.py:- same KPI pipeline
- writes result to BigQuery (with temporary GCS bucket config)
workshop/cloud.md documents:
- standalone Spark cluster setup (
start-master,start-worker) spark-submitflow- Dataproc job submission
- BigQuery connector usage
The notebook in project/homework.ipynb follows a structured implementation flow for Questions 1 to 6.
The focus is on building a reproducible Spark workflow, not only producing final values.
Technical development logic:
- Initialize runtime: import PySpark modules, create
SparkSession, and verify session state/version. - Ingest source data: read
yellow_tripdata_2025-11.parquetinto a DataFrame and inspect schema/sample rows. - Register SQL context: create temp views (e.g.
yellow_2025_11) to switch easily between SQL and DataFrame APIs. - Persist partitioned output:
repartition(4)then write parquet to a target folder for file-size and partition analysis. - Validate storage artifacts: inspect generated
part-*parquet files and_SUCCESSmarkers from Spark writes. - Implement date-based analytics: derive pickup date (
to_date) and run count aggregations for daily filtering use cases. - Implement trip-duration analytics: compute duration from pickup/dropoff timestamps with safe timestamp-to-seconds conversion.
- Enable observability: trigger actions and inspect Spark UI (
uiWebUrl) for stages, jobs, and task execution behavior. - Enrich with lookup data: load
taxi_zone_lookup.csv, align key types (LocationID), and join onPULocationID. - Rank low-frequency pickup zones: aggregate by
Zone, sort ascending by trip count, and limit output for answer candidates. - Cross-check logic in two styles: keep equivalent SQL and PySpark DataFrame versions for clarity and debugging.
- Handle notebook edge cases: address path overwrite conflicts and Spark session reconnect issues before reruns.
- Spark UI can auto-increment ports if busy (
4040,4041, ...). - For repeated writes to the same parquet path, use overwrite mode:
df.repartition(4).write.mode("overwrite").parquet("partitioned/yellow_2025_11_repartitioned")-
lssize units:- file sizes are bytes in plain
ls -l total(macOS/BSD) is in 512-byte blocks- use
ls -lah/du -shfor human-readable output
- file sizes are bytes in plain
-
If a Spark session gets unstable in notebooks (
ConnectionRefusedError), restart kernel and recreate the session before continuing.
- Project summary and final homework Q/A:
project/README.md - Cloud run cookbook:
workshop/cloud.md
