MMMX-station-weather-analysis

Exploratory data analysis, predictive modeling and production pipeline on historical weather data from the Benito Juárez International Airport (MMMX), Mexico City. The dataset was obtained through ethical web scraping from a weather data provider for educational and scientific purposes.


Dataset

Almost 98,000 hourly readings spanning January 2017 to October 2025. The variables are temperature, dew point, humidity, wind speed, wind gust, pressure and sky condition.

Raw data required significant cleaning due to irregular sampling frequency, physically impossible zero values for temperature and pressure, and outliers outside reasonable ranges for the altitude of Mexico City.
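The cleaning steps above can be sketched with pandas. Column names, outlier bounds and the interpolation limit below are assumptions for illustration, not the exact logic in `cleaner.py`:

```python
import pandas as pd

def clean_weather(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the cleaning described above; columns and bounds are assumptions."""
    df = df.copy()
    # Drop physically impossible zero readings for temperature and pressure
    df = df[(df["temperature"] != 0) & (df["pressure"] != 0)]
    # Keep only values plausible at CDMX altitude (~2,240 m); bounds are illustrative
    df = df[df["temperature"].between(20, 100)]   # °F
    df = df[df["pressure"].between(22.0, 26.0)]   # inHg at station level
    # Regularize the irregular sampling onto a clean hourly grid
    return (df.set_index("datetime")
              .sort_index()
              .resample("1h").mean(numeric_only=True)
              .interpolate(limit=3))
```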

The dataset is also available on Kaggle: Historical Weather in Mexico


Objectives

  • Clean and explore meteorological data.
  • Train regression models to predict temperature from atmospheric features.
  • Train a classification model to identify hot days.
  • Compare a classical time series model (ARIMA) against a deep learning approach (LSTM).
  • Productionize the validated logic as a modular pipeline, with a Power BI dashboard for reporting.

Current models

1. Regression: Temperature Prediction (sklearn)

Linear Regression and Random Forest trained on atmospheric features (dew point, humidity, wind speed, pressure) plus temporal features (hour, month, day of week, day of year). A chronological 40-30-30 train-validation-test split was used.
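The split and model comparison can be sketched as follows. Feature column names and the Random Forest hyperparameters are assumptions; the notebook is the reference:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

FEATURES = ["dew_point", "humidity", "wind_speed", "pressure",
            "hour", "month", "day_of_week", "day_of_year"]

def chrono_split(df: pd.DataFrame, ratios=(0.4, 0.3, 0.3)):
    """Chronological 40-30-30 split: no shuffling, so validation and test are strictly future data."""
    i = int(len(df) * ratios[0])
    j = int(len(df) * (ratios[0] + ratios[1]))
    return df.iloc[:i], df.iloc[i:j], df.iloc[j:]

def train_and_select(df: pd.DataFrame):
    """Fit both models on the train slice; pick the one with lower validation MAE."""
    train, val, _test = chrono_split(df)
    models = {"linear": LinearRegression(),
              "rf": RandomForestRegressor(n_estimators=100, random_state=42)}
    scores = {}
    for name, m in models.items():
        m.fit(train[FEATURES], train["temperature"])
        scores[name] = mean_absolute_error(val["temperature"], m.predict(val[FEATURES]))
    best = min(scores, key=scores.get)
    return models[best], scores
```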

Random Forest outperformed Linear Regression on the validation set and was selected for final evaluation.

NOTE: R² is notably high because dew point correlates strongly with temperature by physical definition. This is documented in the notebook conclusions.

2. Classification: Hot Day Detection (sklearn)

Logistic Regression classifying readings above 75°F as hot. Features are the same atmospheric variables, excluding temperature to avoid leakage. A stratified 70-30 train-test split was used.

3. Time Series: ARIMA(1,1,1) (statsmodels)

Daily average temperature, chronological split with the last 30 days reserved for testing. Parameters were selected from ACF/PACF analysis.

| Metric | Value |
|--------|-------|
| RMSE   | 3.0102 °F |
| MAE    | 2.1579 °F |

ARIMA without a seasonal component reverts toward the mean over longer horizons and does not capture annual temperature cycles. This is the expected behavior of ARIMA(1,1,1) on daily meteorological data and motivates the use of the LSTM approach.

4. Time Series: LSTM (TensorFlow/Keras)

Hourly resolution with a 24-hour lookback window. Two stacked LSTM layers with dropout, trained with the Adam optimizer and early stopping. A chronological 80-20 train-test split was used.
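The windowing and architecture described above can be sketched in Keras. Layer sizes and dropout rate are assumptions; see `lstm_model.py` for the actual configuration:

```python
import numpy as np
import tensorflow as tf

LOOKBACK = 24  # hours of history per training window

def make_windows(series: np.ndarray, lookback: int = LOOKBACK):
    """Slice an hourly series into (lookback, 1) input windows and next-hour targets."""
    X = np.stack([series[i:i + lookback] for i in range(len(series) - lookback)])
    return X[..., None], series[lookback:]

def build_lstm(lookback: int = LOOKBACK) -> tf.keras.Model:
    """Two stacked LSTM layers with dropout; sizes are illustrative."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(lookback, 1)),
        tf.keras.layers.LSTM(64, return_sequences=True),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```

Training would pair this with `tf.keras.callbacks.EarlyStopping` on a chronological validation slice.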

| Metric | Value |
|--------|-------|
| RMSE   | 1.6656 °F |
| MAE    | 1.2559 °F |

Findings

  • Dew point is the strongest predictor of temperature in this dataset. After some research, I found this is consistent with thermodynamic theory.
  • ARIMA captures short-term trends but degrades quickly without a seasonal component.
  • LSTM at hourly resolution reduces prediction error by nearly half compared to ARIMA on the same target.
  • The natural next step is SARIMA with a seasonal period, or a density estimation approach for probabilistic forecasting.

Production Pipeline for LSTM

The production pipeline is built around the LSTM model as it represents the most advanced current stage in the modeling progression explored in this project: from linear regression and logistic classification, through ARIMA, to a deep learning time series model. Developing the most capable model is the natural next step before integrating live sensor data.

This project originated from a practical problem: while working on weather data analysis, finding reliable, scientifically consistent meteorological measurements proved harder than expected. That gap motivated the development of a dedicated data platform rather than depending on third-party sources.

The end goal is to make clean, localized weather data accessible to people who actually need it: students, educators and hobbyists looking to build or improve their own models, analyze specific zones or identify anomalous conditions for research and engineering projects.

The validated notebook logic is productionized as a modular Python pipeline:

```
main.py
  |
  ├── cleaner.py      # Classic data cleaning procedure, based on the notebook
  ├── lstm_model.py   # LSTM training and inference
  └── report.py       # 4 CSV exports for the Power BI dashboard
```

Usage

```bash
# Full reprocess, retraining the model
python main.py --mode batch --trainAgain

# Add new data without retraining the model
python main.py --source data/new_data.csv --mode incremental

# Add new data and retrain the model
python main.py --source data/new_data.csv --mode incremental --trainAgain
```
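The CLI contract above can be sketched with argparse. This mirrors the flags shown in the usage examples; `main.py`'s actual parser may differ:

```python
import argparse

def parse_args(argv=None):
    """Illustrative parser for the pipeline flags shown above."""
    p = argparse.ArgumentParser(description="MMMX weather pipeline")
    p.add_argument("--mode", choices=["batch", "incremental"], default="batch",
                   help="batch: full reprocess; incremental: append new rows only")
    p.add_argument("--source", default=None,
                   help="CSV with new readings (used in incremental mode)")
    p.add_argument("--trainAgain", action="store_true",
                   help="retrain the LSTM after the data step")
    return p.parse_args(argv)
```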

Output files

| File | Description |
|------|-------------|
| `output/MMMX_clean.csv` | Cleaned dataset. |
| `output/MMMX_predictions.csv` | Real vs. predicted temperature and error per hour. |
| `output/MMMX_model_metrics.csv` | Global model metrics (RMSE, MAE). |
| `output/MMMX_monthly_summary.csv` | Monthly averages of all variables and prediction error. |

Power BI Dashboard

Page 1 - Model Predictions

LSTM model performance over the last 30 days of the dataset (October 2025). RMSE and MAE cards, a real vs. predicted temperature line chart, and an hourly prediction error bar chart with a date slicer.

Model Predictions

Page 2 - Monthly Summary Overview per Year

An 8-year atmospheric summary with a year dropdown and month slicer. Monthly averages for temperature (real vs. predicted), pressure, humidity and wind speed.

Monthly Summary Overview per Year


Next Steps

This repo is part of OpenWeatherDataPlatform, a distributed weather data platform combining IoT sensor networks with web-scraped data. The gap in accessible, scientifically consistent meteorological data is what motivated building a dedicated platform rather than relying on external providers.

The physical station side (firmware, sensor validation, and hardware enclosure) is already operational and transmitting data. Once the server-side pipeline on the Raspberry Pi or a dedicated server is complete, the production pipeline in this repo will feed directly into the platform's real-time processing layer.

On the modeling side, the next step is extending this pipeline with SARIMA to capture annual seasonality, which ARIMA(1,1,1) does not model. The LSTM, while significantly more accurate, introduces its own problems: it tends to smooth out sharp temperature transitions and its predictions lag slightly during abrupt weather changes, which is expected because of the 24-hour lookback window. Addressing these limitations is part of the modeling roadmap.

The modular structure of the pipeline is intentional: once validated on weather data, the same pipeline pattern can be applied to other domains with predictive or classification targets.


Files

| File | Description |
|------|-------------|
| `Weather_Analysis_Models.ipynb` | Exploration: cleaning, EDA, regression, classification, ARIMA, LSTM. |
| `main.py` | Pipeline entry point. |
| `cleaner.py` | Data cleaning and resampling. |
| `lstm_model.py` | LSTM training and inference. |
| `report.py` | CSV export for Power BI. |
| `config.py` | Configurable parameters. |
| `data/HistoricalWeather_MMMX_Dataset.csv` | Scraped dataset. |
| `Dashboard.pbix` | Power BI dashboard. |

Stack

Python, pandas, NumPy, scikit-learn, statsmodels, TensorFlow/Keras, Matplotlib, Seaborn, Power BI
