Using alternative data (weather, air quality, health records) to forecast hospital admissions 3-7 days in advance.
- Project Overview
- Problem Statement
- Dataset
- Model Architecture and Evaluation
- Project Structure
- Setup Instructions
- How to Train the Model
- How to Build and Test the API Locally
- How to Deploy to the Cloud
- MLFlow Experiment Tracking
- Frontend Application
- Cloud Services Used
- Ethical Considerations & Limitations
- Future Work
- Acknowledgments
- AI Citation
From Air to Care is a machine learning system that predicts hospital admission surges in NYC boroughs based on environmental factors. The system helps hospitals proactively allocate resources, reduce costs, and improve patient outcomes.
- Predictive Modeling: Forecast hospital admissions 3-7 days in advance using environmental data
- Resource Optimization: Enable proactive resource allocation to reduce costs by 15-25%
- Borough-Specific Insights: Provide separate predictions for each NYC borough
- Reproducible Pipeline: Build a containerized, version-controlled ML pipeline
- Regression: Predicts actual expected patient admission count
- Borough-specific: Separate predictions for Brooklyn, Bronx, Manhattan, Queens, Staten Island
- 3-7 day forecasting: Advance warning for hospital planning
We built a predictive system that forecasts hospital admissions 3-7 days in advance by combining air pollution, weather, and health data. Our models achieve 91% accuracy in identifying high-risk days and predict patient volumes with R² = 0.92, enabling hospitals to optimize staffing and reduce surge-related costs by 15-25%.
- Deployed Frontend: https://from-air-to-care.streamlit.app/
- Deployed API: https://from-air-to-care-api-4ahsfteyfa-uc.a.run.app
- API Docs (Swagger UI): https://from-air-to-care-api-4ahsfteyfa-uc.a.run.app/docs
- Air pollution, extreme weather, and seasonal changes drive unexpected surges in ER visits
- Climate change is intensifying these health risks (wildfires, heatwaves, smog)
- Hospitals operate reactively, leading to overcrowding, stressed staff, and higher costs
- COVID-19 exposed the fragility of health systems
- Predictive models using alternative data (weather + pollution + health)
- Forecast hospital strain 3-7 days in advance
- Enable proactive resource allocation
- 15-25% cost reduction through optimized resource allocation
- Better patient outcomes through preparation
- Data-driven capacity planning
| Source | Description | Time Period | Link |
|---|---|---|---|
| NOAA | Weather data (temperature, humidity, wind, precipitation) | 2017-2024 | NOAA Climate Data |
| AQNCI | Air quality data (PM2.5, Ozone, NO2) | 2017-2024 | Air Quality Network |
| NYC DOHMH | Respiratory and Asthma ER visits | 2017-2024 | NYC Open Data |
Data is stored in Google Cloud Storage (GCS) bucket: from-air-to-care-data-1990
- Weather data: nyc_weather_by_borough_2017-2024.csv
- Respiratory data: Respiratory.csv
- Asthma data: Asthama.csv
- Air quality data: Air_Quality.csv
The pipeline automatically downloads data from GCS using src/data_loader.py.
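The download step can be sketched as follows. This is a minimal illustration, not the contents of src/data_loader.py: the function names `download_data` and `make_client` are hypothetical, and it assumes the four CSV files sit at the root of the bucket. The `google-cloud-storage` calls used (`Client.from_service_account_json`, `bucket`, `blob`, `download_to_filename`) are the library's standard ones.

```python
from pathlib import Path

# Bucket and file names as listed in this README.
BUCKET_NAME = "from-air-to-care-data-1990"
DATA_FILES = [
    "nyc_weather_by_borough_2017-2024.csv",
    "Respiratory.csv",
    "Asthama.csv",
    "Air_Quality.csv",
]


def download_data(client, dest_dir, bucket_name=BUCKET_NAME, files=DATA_FILES):
    """Download each CSV from the bucket into dest_dir and return the local paths.

    Assumption: the files are stored at the bucket root under these exact names.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    bucket = client.bucket(bucket_name)
    paths = []
    for name in files:
        target = dest / name
        bucket.blob(name).download_to_filename(str(target))
        paths.append(target)
    return paths


def make_client(credentials_path="data/gcs-credentials.json"):
    """Create a GCS client from the service-account key (needs google-cloud-storage)."""
    # Imported lazily so download_data can be exercised without the library installed.
    from google.cloud import storage

    return storage.Client.from_service_account_json(credentials_path)
```

With credentials in place, `download_data(make_client(), "data/raw")` would fetch all four files; the `data/raw` destination is only an example path.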
| Metric | Value |
|---|---|
| Total Hospitalizations | 5,133,904 |
| Asthma Cases | 814,962 |
| Respiratory Cases | 4,318,942 |
| Boroughs | 5 |
| Features (after engineering) | 42 |
| Time Period | 2017-2024 |
- Regression Target: Total_Hospitalization (continuous) - Actual count of daily admissions
- Evaluation metrics: R² Score, MAE, RMSE
We use Gradient Boosting models (from scikit-learn) for both classification and regression tasks:
- Algorithm: GradientBoostingClassifier
- Purpose: Predict if a day will be "high-risk" (top 25% admission volume)
- Parameters:
  - n_estimators: 100
  - max_depth: 5
  - random_state: 42
- Threshold: Top 25% of admission days (≥754 admissions) = High Risk
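A "top 25%" label like the one above amounts to thresholding at the 75th percentile of daily admissions. The sketch below shows one way to derive such a threshold and the binary label using only the standard library. The function names are hypothetical, the sample data is illustrative, and the interpolation method is an assumption, so it will not necessarily reproduce the 754 figure.

```python
from statistics import quantiles


def high_risk_threshold(admissions):
    """Return the 75th percentile of daily admission counts."""
    # quantiles(..., n=4) returns the three quartile cut points;
    # the last one is the 75th percentile.
    return quantiles(admissions, n=4, method="inclusive")[2]


def label_high_risk(admissions, threshold):
    """Label a day 1 (high risk) when admissions meet or exceed the threshold."""
    return [1 if count >= threshold else 0 for count in admissions]


# Illustrative counts only, not the project's dataset.
daily_admissions = list(range(1, 101))
threshold = high_risk_threshold(daily_admissions)
labels = label_high_risk(daily_admissions, threshold)
print(threshold, sum(labels))
```

Computing the threshold on the training years only and reusing it for validation and test data avoids leaking information from later years; whether the pipeline does this is not stated here.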
- Algorithm: GradientBoostingRegressor
- Purpose: Predict actual patient admission count
- Parameters:
  - n_estimators: 100
  - max_depth: 5
  - random_state: 42
The pipeline creates 42 features including:
- Weather features: Temperature (max/min), humidity, precipitation, wind speed
- Air quality features: PM2.5, Ozone, NO₂ concentrations
- Temporal features: Month, day, day of week, quarter, season
- Borough features: One-hot encoded borough indicators
- Lag features: 7-day lag of hospitalizations, temperature, humidity
- Rolling features: 7-day rolling averages
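The lag and rolling features in the list above can be illustrated with a small standard-library sketch. It assumes each series is already sorted by date and belongs to a single borough, and that the rolling window includes the current row; the function names are hypothetical, and the actual logic in src/feature_engineering.py may differ. In pandas, `shift` and `rolling` would be the usual equivalents.

```python
def lag(values, days=7):
    """Value from `days` rows earlier; None where no history exists yet."""
    return [values[i - days] if i >= days else None for i in range(len(values))]


def rolling_mean(values, window=7):
    """Mean of the `window` most recent values, current row included.

    Returns None until a full window is available.
    """
    out = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running -= values[i - window]
        out.append(running / window if i >= window - 1 else None)
    return out
```

Both helpers must be applied per borough; mixing boroughs in one series would pull values across unrelated rows.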
| Model | Accuracy | AUROC | Recall | Precision | F1-Score |
|---|---|---|---|---|---|
| Gradient Boosting | 91.8% | 0.965 | 80.0% | 82.2% | 0.811 |
| SVM | 89.5% | 0.949 | 78.5% | 75.0% | 0.767 |
| Random Forest | 88.5% | 0.937 | 83.2% | 70.2% | 0.762 |
| Logistic Regression | 87.3% | 0.943 | 85.4% | 66.7% | 0.749 |
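For reference, the classification columns relate to each other as sketched below. The helper names are hypothetical; the formulas are the standard definitions, and the final line checks the reported Gradient Boosting F1-score against its precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Precision 82.2% and recall 80.0% as reported for Gradient Boosting.
print(round(f1_score(0.822, 0.800), 3))  # 0.811
```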
| Model | R² Score | MAE | RMSE | MAPE |
|---|---|---|---|---|
| Gradient Boosting | 0.919 | ±57.8 | 74.8 | 12.7% |
| Random Forest | 0.904 | ±57.8 | 81.5 | 10.9% |
| Lasso Regression | 0.842 | ±80.8 | 104.5 | 17.2% |
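The regression columns use the standard definitions of R², MAE, RMSE, and MAPE. A minimal sketch of how they are computed from true and predicted admission counts follows; the function name is hypothetical and the numbers in the usage line are illustrative, not project results.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², MAE, RMSE, and MAPE (in percent) for paired values.

    MAPE divides by the true values, so y_true must not contain zeros.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "mape": 100 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
    }


# Illustrative values only.
print(regression_metrics([100, 200, 300, 400], [110, 190, 310, 390]))
```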
- Training: 2017-2019
- Validation: 2023
- Test: 2024
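A year-based split like the one listed above keeps evaluation data strictly later than training data. The sketch below assumes each row carries a `date` field and that rows from years not listed (2020-2022) are simply left out; both the field name and that handling are assumptions, not confirmed details of src/preprocessing.py.

```python
from datetime import date

# Year ranges as listed in this README.
TRAIN_YEARS = {2017, 2018, 2019}
VALIDATION_YEARS = {2023}
TEST_YEARS = {2024}


def split_by_year(rows, date_key="date"):
    """Partition rows into train/validation/test by calendar year."""
    splits = {"train": [], "validation": [], "test": []}
    for row in rows:
        year = row[date_key].year
        if year in TRAIN_YEARS:
            splits["train"].append(row)
        elif year in VALIDATION_YEARS:
            splits["validation"].append(row)
        elif year in TEST_YEARS:
            splits["test"].append(row)
        # Rows from any other year are skipped (assumption).
    return splits
```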
Data-ML-Engineering
├── api
│   └── app.py
├── config
│   └── config.yaml
├── frontend
│   └── app_ui.py
├── src
│   ├── artifacts
│   │   ├── confusion_matrix.png
│   │   ├── predicted_vs_actual.png
│   │   └── roc_curve.png
│   ├── data_loader.py
│   ├── feature_engineering.py
│   ├── main.py
│   ├── predict.py
│   ├── preprocessing.py
│   └── train.py
├── .dockerignore
├── .gcloudignore
├── .gitignore
├── Dockerfile
├── README.md
├── cloudbuild.yaml
├── entrypoint.py
├── requirements.txt
├── runtime.txt
└── test_api.py
- Python 3.11+
- Docker Desktop (optional, for containerized runs)
- Google Cloud account (for data storage)
- Git
git clone https://github.com/SharmilNK/Data-ML-Engineering.git
cd Data-ML-Engineering
python -m venv venv
# Windows
venv\Scripts\activate
# Mac/Linux
source venv/bin/activate
pip install -r requirements.txt
- Create a GCS service account at https://console.cloud.google.com/iam-admin/serviceaccounts
- Download the JSON key
- Save it as data/gcs-credentials.json
- This file is gitignored and must be created locally
Edit config/config.yaml with your settings:
data:
  bucket_name: "from-air-to-care-data-1990"
# From project root
cd src
python main.py
# Or from project root
python -m src.main
To ensure reproducibility, you can run the training pipeline inside a container:
# 1. Build the image
docker build -t from-air-to-care .
# 2. Run training (mounting local volumes for credentials and output)
# Note: For Windows PowerShell, use ${PWD}. For Command Prompt, use %cd%. For Mac/Linux use $(pwd).
docker run -e PYTHONPATH=/app \
-v "${PWD}/data/gcs-credentials.json:/app/data/gcs-credentials.json" \
-v "${PWD}/models:/app/models" \
-v "${PWD}/src/mlruns:/app/src/mlruns" \
from-air-to-care train
======================================================================
STARTING TRAINING PIPELINE
======================================================================
✓ Config loaded from config.yaml
✓ Weather data: (9130, 9)
✓ Respiratory data: (12733, 6)
...
✓ Accuracy: 0.9175
✓ AUROC: 0.9651
✓ R²: 0.9191
✓ MAE: 57.82
...
✓ PIPELINE COMPLETE!
# Ensure models are trained first (models/models.pkl exists)
# Run the FastAPI server locally
python -m uvicorn api.app:app --reload --host 0.0.0.0 --port 8000
# Build the image
docker build -t from-air-to-care .
# Run API server
docker run -p 8000:8000 \
-v "${PWD}/models:/app/models" \
-v "${PWD}/src/models:/app/src/models" \
from-air-to-care serve
The API will be available at http://localhost:8000
Health Check:
curl http://localhost:8000/health
Root Endpoint:
curl http://localhost:8000/
Make a Prediction:
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{
"Temp_Max_C": 25.0,
"Temp_Min_C": 15.0,
"Humidity_Avg": 70.0,
"month": 6,
"day": 15,
"day_of_week": 5,
"quarter": 2,
"season": 3,
"borough": "brooklyn"
}'
We provide a test script for automated testing:
# Test local API
python test_api.py http://localhost:8000
# Test deployed API
python test_api.py https://from-air-to-care-api-4ahsfteyfa-uc.a.run.app
- Import the API collection from Swagger UI: http://localhost:8000/docs
- Or manually create requests:
  - GET http://localhost:8000/health
  - POST http://localhost:8000/predict with JSON body
Once the API is running, visit:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
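The curl prediction request shown earlier in this section can also be sent from Python. The sketch below uses only the standard library; the helper names are hypothetical, the payload mirrors the curl example, and the response is returned as parsed JSON because its exact schema is not documented here.

```python
import json
import urllib.request


def build_predict_request(base_url, payload):
    """Build a POST request for the /predict endpoint without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url, payload, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(base_url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Same fields as the curl example.
payload = {
    "Temp_Max_C": 25.0,
    "Temp_Min_C": 15.0,
    "Humidity_Avg": 70.0,
    "month": 6,
    "day": 15,
    "day_of_week": 5,
    "quarter": 2,
    "season": 3,
    "borough": "brooklyn",
}
```

With the server running locally, `predict("http://localhost:8000", payload)` would return the API's response as a Python object.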
We deploy the API to Google Cloud Run for serverless hosting.
- Google Cloud account with billing enabled
- Google Cloud SDK installed (gcloud)
- Docker installed (for local builds)
# Login to Google Cloud
gcloud auth login
# Set your project
gcloud config set project YOUR_PROJECT_ID
# Enable required APIs
gcloud services enable cloudbuild.googleapis.com
gcloud services enable run.googleapis.com
gcloud services enable containerregistry.googleapis.com
# Build and deploy in one command
gcloud builds submit --config cloudbuild.yaml
# Build the image for amd64 platform (required for Cloud Run)
docker build --platform linux/amd64 -t gcr.io/YOUR_PROJECT_ID/from-air-to-care-api:latest .
# Configure Docker to use gcloud credentials
gcloud auth configure-docker
# Push to Container Registry
docker push gcr.io/YOUR_PROJECT_ID/from-air-to-care-api:latest
gcloud run deploy from-air-to-care-api \
--image gcr.io/YOUR_PROJECT_ID/from-air-to-care-api:latest \
--platform managed \
--region us-central1 \
--allow-unauthenticated \
--memory 2Gi \
--cpu 2 \
--timeout 300 \
--max-instances 10 \
--set-env-vars PYTHONUNBUFFERED=1
gcloud run services describe from-air-to-care-api \
--region us-central1 \
--format 'value(status.url)'
# Health check
curl https://YOUR_API_URL/health
# Make a prediction
curl -X POST "https://YOUR_API_URL/predict" \
-H "Content-Type: application/json" \
-d '{
"month": 6,
"day": 15,
"day_of_week": 5,
"quarter": 2,
"season": 3,
"borough": "brooklyn"
}'
Production API: https://from-air-to-care-api-4ahsfteyfa-uc.a.run.app
API Endpoints:
- GET / - API information
- GET /health - Health check
- POST /predict - Make predictions
- GET /docs - Swagger UI documentation
- GET /redoc - ReDoc documentation
After training, launch MLFlow UI:
mlflow ui
Then open: http://localhost:5000
- Parameters: model type, n_estimators, max_depth, threshold_percentile, train_years, test_year
- Metrics: accuracy, AUROC, recall, precision, F1-score, RΒ², MAE, RMSE
- Artifacts:
  - Trained models (models.pkl)
  - Confusion matrix plots (confusion_matrix.png)
  - ROC curves (roc_curve.png)
  - Predicted vs actual plots (predicted_vs_actual.png)
MLFlow allows you to:
- Compare multiple experiment runs side-by-side
- Track model versioning
- Reproduce any previous experiment
- View parameter and metric history
MLFlow is configured in config/config.yaml:
mlflow:
  experiment_name: "from-air-to-care"
  tracking_uri: "mlruns"
Deployed Frontend: https://from-air-to-care.streamlit.app/
The frontend application is now live and publicly accessible!
The frontend application provides an interactive web interface to:
- Select a date (between January 1, 2022 and December 31, 2024)
- Select a NYC borough (Brooklyn, Bronx, Manhattan, Queens, Staten Island)
- Get real-time predictions from the deployed API
- View predicted hospital admission counts with detailed information
Prerequisites:
- Python 3.8+
- Streamlit installed (pip install streamlit)
Steps:
- Navigate to frontend directory:
  cd frontend
- Run Streamlit app:
  streamlit run app_ui.py
- Open in browser:
  - The app will automatically open at http://localhost:8501
  - Or manually navigate to the URL shown in the terminal
- Configure API URL:
  - The default API URL is set to the deployed production API
  - You can change it in the sidebar if testing with a local API
frontend/
└── app_ui.py  # Main Streamlit application
Key Components:
- API health check (cached for 60 seconds)
- Form-based input collection
- Date picker with automatic feature extraction
- API request handling with error management
- Results visualization with prominent display
- Responsive layout using Streamlit columns
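The "automatic feature extraction" from the date picker can be pictured as deriving the temporal fields the API expects from a single date. The sketch below is an assumption about how that might look, not the code in frontend/app_ui.py: the function name is hypothetical, `day_of_week` uses Python's Monday = 0 convention, and the season encoding (1 = winter through 4 = fall) is chosen only because it matches the example payload, where June maps to 3.

```python
from datetime import date

# Assumed season encoding: 1 = winter, 2 = spring, 3 = summer, 4 = fall.
SEASON_BY_MONTH = {
    12: 1, 1: 1, 2: 1,
    3: 2, 4: 2, 5: 2,
    6: 3, 7: 3, 8: 3,
    9: 4, 10: 4, 11: 4,
}


def date_features(d):
    """Derive the temporal request fields from a date."""
    return {
        "month": d.month,
        "day": d.day,
        "day_of_week": d.weekday(),  # Monday = 0, Sunday = 6
        "quarter": (d.month - 1) // 3 + 1,
        "season": SEASON_BY_MONTH[d.month],
    }


print(date_features(date(2024, 6, 15)))
```

For 15 June 2024 this yields the same month, day, day_of_week, quarter, and season values as the example prediction request in this README.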
The frontend requires:
streamlit>=1.26.0
requests>=2.31.0
These are already included in requirements.txt.
| Service | Purpose |
|---|---|
| Google Cloud Storage (GCS) | Store raw CSV data files |
| MLFlow | Experiment tracking and model versioning |
| Docker | Containerization for reproducibility |
| Google Cloud Run | Host API endpoint for model predictions |
| Streamlit Cloud | Host frontend application |
As part of our commitment to responsible AI, we have identified the following considerations:
- Data Bias: Our training data relies on historical hospital admissions. If certain demographics have historically faced barriers to accessing healthcare, the model may under-predict demand in those communities, potentially perpetuating resource inequity.
- Correlation vs. Causation: While air quality is a strong predictor, the model does not prove causality. High pollution days often correlate with other factors (e.g., high traffic) that might also influence ER visits.
- Privacy: All data used is aggregated at the borough level. No individual patient health information (PHI) was accessed or processed, ensuring compliance with privacy standards.
- Scope Limitation: The model is currently trained only on NYC data. It should not be generalized to other cities without retraining on local environmental and health data.
- Model Limitations: The model predicts based on historical patterns and may not account for novel events (e.g., new diseases, extreme weather events not seen in training data).
- Real-time Data Integration: Integrate live weather and air quality APIs for real-time predictions
- Multi-city Expansion: Extend the model to other cities with similar data availability
- Advanced Models: Experiment with deep learning models (LSTM, Transformer) for time series forecasting
- Dashboard Development: Create an admin dashboard for hospital staff to monitor predictions
- Alert System: Implement automated alerting for high-risk days
- Model Retraining Pipeline: Set up automated retraining pipeline with new data
- Feature Engineering: Explore additional features (holidays, events, social factors)
- NYC Department of Health and Mental Hygiene (DOHMH) for health data
- NOAA for weather data
- EPA for air quality data
- Google Cloud Platform for cloud infrastructure
- Streamlit for frontend framework
For our project, the following AI tools were utilized to assist in development, analysis, and documentation:
- Composer 1: Used on November 24, 2025, for the UI design of the frontend application.
- Claude Sonnet 4.5: Used on November 21 and 24, 2025, to assist with code debugging and error correction.
- Gemini 3 Pro: Used on November 24, 2025, to assist with the revision, formatting, and completion of the README file.
- ChatGPT 5.1: Used on November 21, 2025, to provide guidance on cloud data deployment, Docker containerization strategies, and instructions for deploying the Front-End Interface.