MMMX-station-weather-analysis

Exploratory data analysis, predictive modeling and production pipeline on historical weather data from the Benito Juárez International Airport (MMMX), Mexico City. The dataset was obtained through ethical web scraping from a weather data provider for educational and scientific purposes.


Dataset

Almost 98,000 hourly readings spanning January 2017 to October 2025. The variables are temperature, dew point, humidity, wind speed, wind gust, pressure and sky condition.

Raw data required significant cleaning due to irregular sampling frequency, physically impossible zero values for temperature and pressure, and outliers outside reasonable ranges for the altitude of Mexico City.
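The cleaning steps above can be sketched with pandas. Column names, outlier bounds and the interpolation limit below are assumptions for illustration, not the exact logic in `cleaner.py`:

```python
import pandas as pd

def clean_weather(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the cleaning described above; columns and bounds are assumptions."""
    df = df.copy()
    # Drop physically impossible zero readings for temperature and pressure
    df = df[(df["temperature"] != 0) & (df["pressure"] != 0)]
    # Keep only values plausible at CDMX altitude (~2,240 m); bounds are illustrative
    df = df[df["temperature"].between(20, 100)]   # °F
    df = df[df["pressure"].between(22.0, 26.0)]   # inHg at station level
    # Regularize the irregular sampling onto a clean hourly grid
    return (df.set_index("datetime")
              .sort_index()
              .resample("1h").mean(numeric_only=True)
              .interpolate(limit=3))
```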

The dataset is also available on Kaggle: Historical Weather in Mexico


Objectives

  • Clean and explore meteorological data.
  • Train regression models to predict temperature from atmospheric features.
  • Train a classification model to identify hot days.
  • Compare a classical time series model (ARIMA) against a deep learning approach (LSTM).
  • Productionize the validated logic as a modular pipeline, with a Power BI dashboard for reporting.

Current models

1. Regression: Temperature Prediction (sklearn)

Linear Regression and Random Forest trained on atmospheric features (dew point, humidity, wind speed, pressure) plus temporal features (hour, month, day of week, day of year). A chronological 40-30-30 train-validation-test split was used.
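The split and model comparison can be sketched as follows. Feature column names and the Random Forest hyperparameters are assumptions; the notebook is the reference:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

FEATURES = ["dew_point", "humidity", "wind_speed", "pressure",
            "hour", "month", "day_of_week", "day_of_year"]

def chrono_split(df: pd.DataFrame, ratios=(0.4, 0.3, 0.3)):
    """Chronological 40-30-30 split: no shuffling, so validation and test are strictly future data."""
    i = int(len(df) * ratios[0])
    j = int(len(df) * (ratios[0] + ratios[1]))
    return df.iloc[:i], df.iloc[i:j], df.iloc[j:]

def train_and_select(df: pd.DataFrame):
    """Fit both models on the train slice; pick the one with lower validation MAE."""
    train, val, _test = chrono_split(df)
    models = {"linear": LinearRegression(),
              "rf": RandomForestRegressor(n_estimators=100, random_state=42)}
    scores = {}
    for name, m in models.items():
        m.fit(train[FEATURES], train["temperature"])
        scores[name] = mean_absolute_error(val["temperature"], m.predict(val[FEATURES]))
    best = min(scores, key=scores.get)
    return models[best], scores
```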

Random Forest outperformed Linear Regression on the validation set and was selected for final evaluation.

NOTE: R² is notably high because dew point correlates strongly with temperature by physical definition. This is documented in the notebook conclusions.

2. Classification: Hot Day Detection (sklearn)

Logistic Regression classifying readings above 75°F as hot. Features are the same atmospheric variables, excluding temperature to avoid leakage. A stratified 70-30 train-test split was used.

3. Time Series: ARIMA(1,1,1) (statsmodels)

Daily average temperature, chronological split with the last 30 days reserved for testing. Parameters were selected from ACF/PACF analysis.

| Metric | Value |
|--------|-------|
| RMSE   | 3.0102 °F |
| MAE    | 2.1579 °F |

ARIMA without a seasonal component reverts toward the mean over longer horizons and does not capture annual temperature cycles. This is the expected behavior of ARIMA(1,1,1) on daily meteorological data and motivates the use of the LSTM approach.

4. Time Series: LSTM (TensorFlow/Keras)

Hourly resolution with a 24-hour lookback window. Two stacked LSTM layers with dropout, trained with the Adam optimizer and early stopping. A chronological 80-20 train-test split was used.
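The windowing and architecture described above can be sketched in Keras. Layer sizes and dropout rate are assumptions; see `lstm_model.py` for the actual configuration:

```python
import numpy as np
import tensorflow as tf

LOOKBACK = 24  # hours of history per training window

def make_windows(series: np.ndarray, lookback: int = LOOKBACK):
    """Slice an hourly series into (lookback, 1) input windows and next-hour targets."""
    X = np.stack([series[i:i + lookback] for i in range(len(series) - lookback)])
    return X[..., None], series[lookback:]

def build_lstm(lookback: int = LOOKBACK) -> tf.keras.Model:
    """Two stacked LSTM layers with dropout; sizes are illustrative."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(lookback, 1)),
        tf.keras.layers.LSTM(64, return_sequences=True),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```

Training would pair this with `tf.keras.callbacks.EarlyStopping` on a chronological validation slice.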

| Metric | Value |
|--------|-------|
| RMSE   | 1.6656 °F |
| MAE    | 1.2559 °F |

Findings

  • Dew point is the strongest predictor of temperature in this dataset. After some research, I found this is consistent with thermodynamic theory.
  • ARIMA captures short-term trends but degrades quickly without a seasonal component.
  • LSTM at hourly resolution reduces prediction error by nearly half compared to ARIMA on the same target.
  • The natural next step is SARIMA with a seasonal period, or a density estimation approach for probabilistic forecasting.

Production Pipeline for LSTM

The production pipeline is built around the LSTM model as it represents the most advanced current stage in the modeling progression explored in this project: from linear regression and logistic classification, through ARIMA, to a deep learning time series model. Developing the most capable model is the natural next step before integrating live sensor data.

This project originated from a practical problem: while working on weather data analysis, finding reliable, scientifically consistent meteorological measurements proved harder than expected. That gap motivated the development of a dedicated data platform rather than depending on third-party sources.

The end goal is to make clean, localized weather data accessible to people who actually need it: students, educators and hobbyists looking to build or improve their own models, analyze specific zones or identify anomalous conditions for research and engineering projects.

The validated notebook logic is productionized as a modular Python pipeline:

```
main.py
  |
  ├── cleaner.py      # Classic data cleaning procedure, based on the notebook
  ├── lstm_model.py   # LSTM training and inference
  └── report.py       # 4 CSV exports for the Power BI dashboard
```

Usage

```bash
# Full reprocess, retraining the model
python main.py --mode batch --trainAgain

# Add new data without retraining the model
python main.py --source data/new_data.csv --mode incremental

# Add new data and retrain the model
python main.py --source data/new_data.csv --mode incremental --trainAgain
```
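The CLI contract above can be sketched with argparse. This mirrors the flags shown in the usage examples; `main.py`'s actual parser may differ:

```python
import argparse

def parse_args(argv=None):
    """Illustrative parser for the pipeline flags shown above."""
    p = argparse.ArgumentParser(description="MMMX weather pipeline")
    p.add_argument("--mode", choices=["batch", "incremental"], default="batch",
                   help="batch: full reprocess; incremental: append new rows only")
    p.add_argument("--source", default=None,
                   help="CSV with new readings (used in incremental mode)")
    p.add_argument("--trainAgain", action="store_true",
                   help="retrain the LSTM after the data step")
    return p.parse_args(argv)
```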

Output files

| File | Description |
|------|-------------|
| `output/MMMX_clean.csv` | Cleaned dataset. |
| `output/MMMX_predictions.csv` | Real vs. predicted temperature and error per hour. |
| `output/MMMX_model_metrics.csv` | Global model metrics (RMSE, MAE). |
| `output/MMMX_monthly_summary.csv` | Monthly averages of all variables and prediction error. |

Power BI Dashboard

Page 1 - Model Predictions

LSTM model performance over the last 30 days of the dataset (October 2025). RMSE and MAE cards, a real vs. predicted temperature line chart, and an hourly prediction error bar chart with a date slicer.

Model Predictions

Page 2 - Monthly Summary Overview per Year

An 8-year atmospheric summary with a year dropdown and month slicer. Monthly averages for temperature (real vs. predicted), pressure, humidity and wind speed.

Monthly Summary Overview per Year


Next Steps

This repo is part of OpenWeatherDataPlatform, a distributed weather data platform combining IoT sensor networks with web-scraped data. The gap in accessible, scientifically consistent meteorological data is what motivated building a dedicated platform rather than relying on external providers.

The physical station side (firmware, sensor validation, and hardware enclosure) is already operational and transmitting data. Once the server-side pipeline on the Raspberry Pi or a dedicated server is complete, the production pipeline in this repo will feed directly into the platform's real-time processing layer.

On the modeling side, the next step is extending this pipeline with SARIMA to capture annual seasonality, which ARIMA(1,1,1) does not model. The LSTM, while significantly more accurate, introduces its own problems: it tends to smooth out sharp temperature transitions and its predictions lag slightly during abrupt weather changes, which is expected because of the 24-hour lookback window. Addressing these limitations is part of the modeling roadmap.

The modular structure of the pipeline is intentional: once validated on weather data, the same pipeline pattern can be applied to other domains with predictive or classification targets.


Files

| File | Description |
|------|-------------|
| `Weather_Analysis_Models.ipynb` | Exploration: cleaning, EDA, regression, classification, ARIMA, LSTM. |
| `main.py` | Pipeline entry point. |
| `cleaner.py` | Data cleaning and resampling. |
| `lstm_model.py` | LSTM training and inference. |
| `report.py` | CSV export for Power BI. |
| `config.py` | Configurable parameters. |
| `data/HistoricalWeather_MMMX_Dataset.csv` | Scraped dataset. |
| `Dashboard.pbix` | Power BI dashboard. |

Stack

Python, pandas, NumPy, scikit-learn, statsmodels, TensorFlow/Keras, Matplotlib, Seaborn, Power BI
