The `legacy/` folder contains the original NYC taxi pipeline, implemented as Google BigQuery / Dataproc notebooks.
| Stage | File | Purpose |
|---|---|---|
| 00 | 00_AutoIngestTLCFiles | Download parquet from TLC URLs → GCS `nyc_raw_data_bucket` |
| 01 | 01_GCStoBronzeIngestion | GCS parquet → BigQuery RawBronze (Spark) |
| 02a | 02aYellowRawBronzeToCleanSilver | Yellow taxi: RawBronze → CleanSilver |
| 02b | 02bGreenRawBronzeToCleanSilver | Green taxi: RawBronze → CleanSilver |
| 02c | 02cFHVRawBronzeToCleanSilver | FHV: RawBronze → CleanSilver |
| 02d | 02dHVFHVRawBronzeToCleanSilver | HVFHV: RawBronze → CleanSilver |
| 03 | 03CleanSilverToPreML | CleanSilver → PreMlGold (daily, hourly, hotspot) |
| 04a | 04aXGBostFleetRecommender | XGBoost fleet mix predictions → PostMlGold |
| 04b | 04bAnomalies | Autoencoder anomaly detection → PostMlGold |
| — | FinalGradioDashboard | Gradio dashboard (queries BigQuery live) |
| — | LoadInvestigate files | Exploration, taxi zones, GeoPandas |
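Stage 00's download-and-upload step can be sketched as below. The bucket name comes from the table above; the CloudFront URL pattern and the helper names are illustrative assumptions, not taken from the notebook itself.

```python
"""Sketch of stage 00 (00_AutoIngestTLCFiles): download monthly TLC parquet
files and upload them to the GCS raw bucket."""
import urllib.request

# Assumed public TLC host; verify against the notebook's actual URL list.
TLC_BASE = "https://d37ci6vzurychx.cloudfront.net/trip-data"
BUCKET = "nyc_raw_data_bucket"

def tlc_url(dataset: str, year: int, month: int) -> str:
    """Build the public download URL for one monthly parquet file."""
    return f"{TLC_BASE}/{dataset}_tripdata_{year}-{month:02d}.parquet"

def ingest_month(dataset: str, year: int, month: int) -> None:
    """Download one file and upload it under <dataset>/ in the raw bucket."""
    from google.cloud import storage  # needs google-cloud-storage + credentials
    url = tlc_url(dataset, year, month)
    local_path, _ = urllib.request.urlretrieve(url)
    blob_name = f"{dataset}/{url.rsplit('/', 1)[-1]}"
    storage.Client().bucket(BUCKET).blob(blob_name).upload_from_filename(local_path)
```

For example, `ingest_month("yellow", 2023, 1)` would land `yellow/yellow_tripdata_2023-01.parquet` in the bucket, ready for stage 01.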
```
TLC URLs → GCS → RawBronze → CleanSilver → PreMlGold → PostMlGold
                                                     ↘ Gradio Dashboard
```
Execute the notebooks in order (00 → 01 → 02a–d → 03 → 04a → 04b) in a Dataproc / BigQuery environment with:

- `google-cloud-bigquery`, `google-cloud-storage`
- `pyspark` (Dataproc Spark Connect)
- `pandas`, `xgboost`, `tensorflow`, `prophet` (for dashboard/ML)
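Connecting `pyspark` to Dataproc over Spark Connect and reading a RawBronze table might look like the sketch below; the endpoint, project, and table names are placeholders, and the BigQuery read assumes the spark-bigquery-connector is available on the cluster.

```python
"""Sketch: open a Spark session via Spark Connect and read a BigQuery table."""

def bronze_table(project: str, dataset: str, name: str) -> str:
    """Fully qualified BigQuery table id in the form the connector expects."""
    return f"{project}.{dataset}.{name}"

def read_bronze(spark, table: str):
    """Read one BigQuery table as a Spark DataFrame (spark-bigquery-connector)."""
    return spark.read.format("bigquery").option("table", table).load()

if __name__ == "__main__":
    from pyspark.sql import SparkSession  # requires pyspark with Spark Connect
    # sc://<host>:<port> is the Spark Connect URL scheme
    spark = SparkSession.builder.remote("sc://dataproc-host:15002").getOrCreate()
    df = read_bronze(spark, bronze_table("my-project", "RawBronze", "yellow"))
    df.show(5)
```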
The optimized pipeline is implemented in `pipeline/` and `pipeline_utils/`:

- Legacy → Optimized: `02a–d` → `02_bronze_to_silver.py` (unified)
- Shared utilities: `pipeline_utils/` (config, spark, bq, gcs, schemas)
- Incremental processing: each stage skips already-processed partitions.
- Static dashboard: `pipeline/05_ExportDashboardData.py` exports JSON to GCS.
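The incremental-processing rule above can be sketched as a simple set difference. The partition sets here are hard-coded for illustration; the real pipeline would derive them from BigQuery (e.g. `INFORMATION_SCHEMA.PARTITIONS`) and GCS listings.

```python
"""Sketch of the incremental check: a stage only handles partitions
present upstream but not yet written to its target table."""

def pending_partitions(available: set[str], processed: set[str]) -> list[str]:
    """Return partitions still to process, oldest first."""
    return sorted(available - processed)

# Example: bronze has three months, silver has already seen one.
bronze = {"2023-01", "2023-02", "2023-03"}
silver = {"2023-01"}
todo = pending_partitions(bronze, silver)  # ["2023-02", "2023-03"]
```

Because a rerun sees an empty `todo` list, every stage is safe to re-execute without duplicating work.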
See `GCP_OPTIMIZATION_PLAN.md` for the full architecture and `pipeline/README.md` for run commands.