This repository contains the full data pipeline, modeling code, and visualization tools for forecasting hourly ridership at every New York City subway station. The project ingests large-scale historical ridership data, performs feature engineering, fits models in parallel on a rolling basis, and generates geospatial animations of predicted flows across the subway network.
We build an autoregressive model with exogenous features (ARX) to predict station-level ridership, achieving:
- R² > 0.9 on ~97% of all stations
- A fast, scalable training and inference pipeline using Apache Spark
- Geospatial visualizations showing hourly ridership patterns, weekday/weekend dynamics, and commuter surges
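
For readers unfamiliar with ARX models, the sketch below shows the general idea for a single station: ridership at hour t is regressed on its own previous values plus exogenous features, and R² is computed on the fitted values. It is a minimal illustration, not the code in this repository. The function name `fit_arx`, the synthetic series, the lag order, and the hour-of-day sine/cosine features are all assumptions made for the example; the actual feature set and fitting procedure are described in the final report.

```python
import numpy as np

def fit_arx(y, X, p):
    """Fit y_t = c + sum_i a_i * y_{t-i} + b . x_t by ordinary least squares.

    Returns (coef, r2), where coef = [c, a_1..a_p, b_1..b_k] and r2 is the
    in-sample coefficient of determination.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = len(y)
    # Column i-1 holds the lag-i values y_{t-i} for t = p .. n-1.
    lags = np.column_stack([y[p - i : n - i] for i in range(1, p + 1)])
    design = np.column_stack([np.ones(n - p), lags, X[p:]])
    target = y[p:]
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    pred = design @ coef
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return coef, 1.0 - ss_res / ss_tot

# Synthetic hourly series for one hypothetical station (one year of hours).
rng = np.random.default_rng(0)
hours = np.arange(24 * 365)
# Assumed exogenous features: a daily cycle encoded as sine and cosine.
X = np.column_stack([
    np.sin(2 * np.pi * hours / 24),
    np.cos(2 * np.pi * hours / 24),
])
y = np.empty(len(hours))
y[0] = 100.0
for t in range(1, len(hours)):
    # Known generating process, so the fit can be checked against it.
    y[t] = 20 + 0.6 * y[t - 1] + 30 * X[t, 0] + 10 * X[t, 1] + rng.normal()

coef, r2 = fit_arx(y, X, p=1)
print("coefficients [c, a_1, b_sin, b_cos]:", np.round(coef, 2))
print("in-sample R^2:", round(float(r2), 4))
```

Because each station's fit depends only on that station's own series, a function like this can be applied to every station independently, which is what makes the problem straightforward to parallelize.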
This work relies on:
- Polars for streaming & partitioning a 200M+ row ridership dataset
- Spark for parallelized ARX model fitting
- GeoPandas / Plotly / CartoDB tiles for visualizing subway flow
All details and experiments are documented in the included final report.
