
# Legacy Pipeline Code

The `legacy/` folder contains the original NYC taxi pipeline, implemented as Google BigQuery / Dataproc notebooks.

## Pipeline Stages

| Stage | File | Purpose |
|-------|------|---------|
| 00 | `00_AutoIngestTLCFiles` | Download parquet from TLC URLs → GCS `nyc_raw_data_bucket` |
| 01 | `01_GCStoBronzeIngestion` | GCS parquet → BigQuery `RawBronze` (Spark) |
| 02a | `02aYellowRawBronzeToCleanSilver` | Yellow taxi: `RawBronze` → `CleanSilver` |
| 02b | `02bGreenRawBronzeToCleanSilver` | Green taxi: `RawBronze` → `CleanSilver` |
| 02c | `02cFHVRawBronzeToCleanSilver` | FHV: `RawBronze` → `CleanSilver` |
| 02d | `02dHVFHVRawBronzeToCleanSilver` | HVFHV: `RawBronze` → `CleanSilver` |
| 03 | `03CleanSilverToPreML` | `CleanSilver` → `PreMlGold` (daily, hourly, hotspot) |
| 04a | `04aXGBostFleetRecommender` | XGBoost fleet-mix predictions → `PostMlGold` |
| 04b | `04bAnomalies` | Autoencoder anomaly detection → `PostMlGold` |
| — | `FinalGradioDashboard` | Gradio dashboard (queries BigQuery live) |
| — | `LoadInvestigate` files | Exploration, taxi zones, GeoPandas |
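Stage 00's fan-out over monthly TLC files can be sketched as a small helper that builds the public parquet URLs before streaming them into GCS. This is illustrative, not the notebook's actual code; the CloudFront host is the TLC's public download endpoint, and the function name is hypothetical:

```python
# Sketch of stage 00's URL generation (illustrative, not the notebook's code).
# TLC publishes one parquet file per taxi type and month, e.g.
# yellow_tripdata_2024-01.parquet, under a public CloudFront host.
TLC_BASE = "https://d37ci6vzurychx.cloudfront.net/trip-data"

def tlc_parquet_urls(taxi_type: str, year: int, months: range) -> list[str]:
    """Build the monthly parquet download URLs for one taxi type."""
    return [
        f"{TLC_BASE}/{taxi_type}_tripdata_{year}-{m:02d}.parquet"
        for m in months
    ]

# Each URL would then be downloaded and written to gs://nyc_raw_data_bucket.
urls = tlc_parquet_urls("yellow", 2024, range(1, 4))
print(urls)
```

The same helper covers the other vehicle classes by swapping the `taxi_type` prefix (`green`, `fhv`, `fhvhv`).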

## Data Flow

```
TLC URLs → GCS → RawBronze → CleanSilver → PreMlGold → PostMlGold
                                                    ↘ Gradio Dashboard
```

## Running

Execute the notebooks in order (00 → 01 → 02a–d → 03 → 04a → 04b) in a Dataproc / BigQuery environment with:

- `google-cloud-bigquery`
- `google-cloud-storage`
- `pyspark` (Dataproc Spark Connect)
- `pandas`, `xgboost`, `tensorflow`, `prophet` (for dashboard/ML)
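Assuming a plain pip-based environment (a Dataproc image may preinstall some of these, and versions should be pinned to match your cluster image), the dependencies above could be installed with:

```shell
# Hypothetical setup for local testing or a custom Dataproc image;
# pin versions to match your cluster's Spark and Python runtime.
pip install google-cloud-bigquery google-cloud-storage pyspark \
            pandas xgboost tensorflow prophet
```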

## Migration to Optimized Pipeline

The optimized pipeline is implemented in `pipeline/` and `pipeline_utils/`:

- Legacy → optimized: stages 02a–d are unified into `02_bronze_to_silver.py`.
- Shared utilities: `pipeline_utils/` (config, spark, bq, gcs, schemas)
- Incremental processing: each stage skips already-processed partitions.
- Static dashboard: `pipeline/05_ExportDashboardData.py` exports JSON to GCS.
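The incremental-processing rule above reduces to set logic: compare the candidate month partitions against those already materialized in the destination table and run only the difference. A minimal sketch, with illustrative names (the real stages would read the processed set from BigQuery table metadata):

```python
# Sketch of the partition-skip rule used by the optimized stages.
# Names are illustrative; in the pipeline the processed set would come
# from querying the destination BigQuery table's existing partitions.

def pending_partitions(candidates: list[str], processed: set[str]) -> list[str]:
    """Return the month partitions not yet present in the destination."""
    return [p for p in candidates if p not in processed]

months = ["2024-01", "2024-02", "2024-03"]
done = {"2024-01", "2024-02"}
print(pending_partitions(months, done))  # only 2024-03 still needs processing
```

Because the check is idempotent, rerunning a stage after a partial failure simply reprocesses the partitions that never landed.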

See `GCP_OPTIMIZATION_PLAN.md` for the full architecture and `pipeline/README.md` for run commands.