
# Legacy Pipeline Code

The `legacy/` folder contains the original NYC taxi pipeline, implemented as Google BigQuery / Dataproc notebooks.

## Pipeline Stages

| Stage | File | Purpose |
|-------|------|---------|
| 00 | `00_AutoIngestTLCFiles` | Download parquet from TLC URLs → GCS `nyc_raw_data_bucket` |
| 01 | `01_GCStoBronzeIngestion` | GCS parquet → BigQuery `RawBronze` (Spark) |
| 02a | `02aYellowRawBronzeToCleanSilver` | Yellow taxi: `RawBronze` → `CleanSilver` |
| 02b | `02bGreenRawBronzeToCleanSilver` | Green taxi: `RawBronze` → `CleanSilver` |
| 02c | `02cFHVRawBronzeToCleanSilver` | FHV: `RawBronze` → `CleanSilver` |
| 02d | `02dHVFHVRawBronzeToCleanSilver` | HVFHV: `RawBronze` → `CleanSilver` |
| 03 | `03CleanSilverToPreML` | `CleanSilver` → `PreMlGold` (daily, hourly, hotspot) |
| 04a | `04aXGBostFleetRecommender` | XGBoost fleet-mix predictions → `PostMlGold` |
| 04b | `04bAnomalies` | Autoencoder anomaly detection → `PostMlGold` |
| — | `FinalGradioDashboard` | Gradio dashboard (queries BigQuery live) |
| — | `LoadInvestigate` files | Exploration, taxi zones, GeoPandas |
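Stage 00's fan-out over monthly TLC files can be sketched as a small helper that builds the public parquet URLs before streaming them into GCS. This is illustrative, not the notebook's actual code; the CloudFront host is the TLC's public download endpoint, and the function name is hypothetical:

```python
# Sketch of stage 00's URL generation (illustrative, not the notebook's code).
# TLC publishes one parquet file per taxi type and month, e.g.
# yellow_tripdata_2024-01.parquet, under a public CloudFront host.
TLC_BASE = "https://d37ci6vzurychx.cloudfront.net/trip-data"

def tlc_parquet_urls(taxi_type: str, year: int, months: range) -> list[str]:
    """Build the monthly parquet download URLs for one taxi type."""
    return [
        f"{TLC_BASE}/{taxi_type}_tripdata_{year}-{m:02d}.parquet"
        for m in months
    ]

# Each URL would then be downloaded and written to gs://nyc_raw_data_bucket.
urls = tlc_parquet_urls("yellow", 2024, range(1, 4))
print(urls)
```

The same helper covers the other vehicle classes by swapping the `taxi_type` prefix (`green`, `fhv`, `fhvhv`).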

## Data Flow

```
TLC URLs → GCS → RawBronze → CleanSilver → PreMlGold → PostMlGold
                                                    ↘ Gradio Dashboard
```

## Running

Execute the notebooks in order (00 → 01 → 02a–d → 03 → 04a → 04b) in a Dataproc / BigQuery environment with:

- `google-cloud-bigquery`
- `google-cloud-storage`
- `pyspark` (Dataproc Spark Connect)
- `pandas`, `xgboost`, `tensorflow`, `prophet` (for dashboard/ML)
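Assuming a plain pip-based environment (a Dataproc image may preinstall some of these, and versions should be pinned to match your cluster image), the dependencies above could be installed with:

```shell
# Hypothetical setup for local testing or a custom Dataproc image;
# pin versions to match your cluster's Spark and Python runtime.
pip install google-cloud-bigquery google-cloud-storage pyspark \
            pandas xgboost tensorflow prophet
```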

## Migration to Optimized Pipeline

The optimized pipeline is implemented in `pipeline/` and `pipeline_utils/`:

- Legacy → optimized: stages 02a–d are unified into `02_bronze_to_silver.py`.
- Shared utilities: `pipeline_utils/` (config, spark, bq, gcs, schemas)
- Incremental processing: each stage skips already-processed partitions.
- Static dashboard: `pipeline/05_ExportDashboardData.py` exports JSON to GCS.
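The incremental-processing rule above reduces to set logic: compare the candidate month partitions against those already materialized in the destination table and run only the difference. A minimal sketch, with illustrative names (the real stages would read the processed set from BigQuery table metadata):

```python
# Sketch of the partition-skip rule used by the optimized stages.
# Names are illustrative; in the pipeline the processed set would come
# from querying the destination BigQuery table's existing partitions.

def pending_partitions(candidates: list[str], processed: set[str]) -> list[str]:
    """Return the month partitions not yet present in the destination."""
    return [p for p in candidates if p not in processed]

months = ["2024-01", "2024-02", "2024-03"]
done = {"2024-01", "2024-02"}
print(pending_partitions(months, done))  # only 2024-03 still needs processing
```

Because the check is idempotent, rerunning a stage after a partial failure simply reprocesses the partitions that never landed.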

See `GCP_OPTIMIZATION_PLAN.md` for the full architecture and `pipeline/README.md` for run commands.