A machine learning system for predicting flight delays using Random Forest Regression. This project demonstrates an end-to-end ML workflow, including data preprocessing, feature engineering, model training, hyperparameter tuning, and deployment through a web interface.
Flight delays cost airlines and passengers billions annually. This project builds a predictive model to forecast flight delays based on historical data, enabling proactive decision-making for airlines, airports, and travelers.
Key Features:
- 🔄 Robust data preprocessing pipeline
- 🎛️ Feature engineering from temporal and categorical data
- 🌲 Random Forest model with hyperparameter tuning
- 📊 Comprehensive model evaluation and visualization
- 🚀 Interactive web application using Streamlit
AeroDelay/
├── data/
│ ├── raw/ # Original flight data
│ └── processed/ # Preprocessed data ready for modeling
├── models/ # Saved trained models
├── notebooks/
│ ├── preprocessing.ipynb # Data exploration and preprocessing
│ └── model_training_and_evaluation.ipynb # Model training and evaluation
├── src/
│ ├── preprocess.py # Data preprocessing functions
│ ├── train.py # Model training pipeline
│ └── predict.py # Prediction module with FlightDelayPredictor class
├── app.py # Streamlit web application
├── requirements.txt # Project dependencies
└── README.md
- Python 3.8 or higher
- pip package manager
- Clone the repository
git clone <repository-url>
cd AeroDelay
- Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # macOS/Linux; on Windows use: venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
Preprocess raw flight data using the preprocessing module:
python src/preprocess.py
Or explore the preprocessing notebook:
jupyter notebook notebooks/preprocessing.ipynb
Preprocessing Steps:
- Remove unnecessary columns (IDs, data leakage features)
- Handle missing values and duplicates
- Feature engineering (flight duration, temporal features, weekend indicator)
- Encode categorical variables (Airline, Origin, Destination, Aircraft Type)
- Scale numeric features using StandardScaler
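The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of `src/preprocess.py`: the raw column names (`FlightID`, `ScheduledDeparture`, `ScheduledArrival`, `DelayMinutes`), the median fill for missing values, and the `IsWeekend` name are all assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical raw rows; the real dataset's columns may differ.
raw = pd.DataFrame({
    "FlightID": [1, 2, 2, 3],
    "Airline": ["AA", "DL", "DL", "AA"],
    "Origin": ["JFK", "ATL", "ATL", "LAX"],
    "ScheduledDeparture": pd.to_datetime(
        ["2024-01-06 08:00", "2024-01-08 17:30", "2024-01-08 17:30", "2024-01-09 23:15"]),
    "ScheduledArrival": pd.to_datetime(
        ["2024-01-06 11:00", "2024-01-08 19:00", "2024-01-08 19:00", "2024-01-10 02:15"]),
    "Distance": [2475.0, 760.0, 760.0, None],
    "DelayMinutes": [5.0, 42.0, 42.0, 18.0],
})

# Remove duplicate rows and identifier columns.
df = raw.drop_duplicates().drop(columns=["FlightID"])

# Handle missing values (median fill is an assumption for this sketch).
df["Distance"] = df["Distance"].fillna(df["Distance"].median())

# Feature engineering: duration in minutes, departure hour, weekend flag.
duration = df["ScheduledArrival"] - df["ScheduledDeparture"]
df["ScheduledDuration"] = duration.dt.total_seconds() / 60
df["DepartureHour"] = df["ScheduledDeparture"].dt.hour.astype(float)
df["IsWeekend"] = (df["ScheduledDeparture"].dt.dayofweek >= 5).astype(int)
df = df.drop(columns=["ScheduledDeparture", "ScheduledArrival"])

# Label-encode categorical columns.
for col in ["Airline", "Origin"]:
    df[f"{col}_Encoded"] = LabelEncoder().fit_transform(df[col])
df = df.drop(columns=["Airline", "Origin"])

# Scale numeric features to zero mean and unit variance.
numeric_cols = ["ScheduledDuration", "Distance", "DepartureHour"]
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
```

In a real pipeline the encoders and scaler would typically be fitted on the training split only and saved alongside the model, so the same transformation can be applied at prediction time.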
Train the Random Forest model with hyperparameter tuning:
python src/train.py
Or use the training notebook:
jupyter notebook notebooks/model_training_and_evaluation.ipynb
Training Process:
- Train/test split (80/20)
- Baseline Random Forest model
- GridSearchCV for hyperparameter optimization
- Model evaluation with multiple metrics
- Feature importance analysis
- Model persistence using joblib
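The training process above might look something like the sketch below. It uses synthetic data and a deliberately small parameter grid so it runs quickly; the grid values, the scoring choice, and the temporary output path are assumptions for illustration and are not taken from `src/train.py`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the processed flight data (5 features).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 20 + 10 * X[:, 0] - 5 * X[:, 2] + rng.normal(scale=2.0, size=200)

# 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Baseline model with fixed settings.
baseline = RandomForestRegressor(n_estimators=50, random_state=42)
baseline.fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))

# Grid search with 3-fold cross-validation (grid values are illustrative).
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)
best_model = search.best_estimator_

# Evaluate the tuned model on the held-out test set.
pred = best_model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))

# Feature importances (sum to 1 across features).
importances = best_model.feature_importances_

# Persist the tuned model with joblib (temporary directory for this sketch).
model_path = os.path.join(tempfile.mkdtemp(), "flight_delay_model.pkl")
joblib.dump(best_model, model_path)
```

Comparing `baseline_mae` with `mae` shows whether tuning actually helped; on small datasets the difference can be within noise.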
Use the trained model to predict flight delays:
from src.predict import FlightDelayPredictor
# Load model
predictor = FlightDelayPredictor("models/flight_delay_model.pkl")
# Make predictions
predictions = predictor.predict(X_test)
Or run the example:
python src/predict.py
Launch the interactive Streamlit app:
streamlit run app.py
Model Features:
- ScheduledDuration - Flight duration in minutes
- Distance - Flight distance
- DepartureHour - Hour of departure
- Airline_Encoded - Airline carrier
- Origin_Encoded - Origin airport
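To make the link between the saved model and the feature names listed above concrete, here is a hypothetical wrapper in the spirit of `FlightDelayPredictor`. The class name `DelayPredictorSketch`, the dict-based input, and the feature order are assumptions; the actual class in `src/predict.py` may be structured differently. A `DummyRegressor` stands in for the trained Random Forest so the example is self-contained.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.dummy import DummyRegressor

# Assumed feature order; it must match the order used during training.
FEATURES = [
    "ScheduledDuration",
    "Distance",
    "DepartureHour",
    "Airline_Encoded",
    "Origin_Encoded",
]


class DelayPredictorSketch:
    """Hypothetical wrapper: loads a joblib model and predicts delays."""

    def __init__(self, model_path):
        self.model = joblib.load(model_path)

    def predict(self, rows):
        # Build a 2D array with columns in the expected feature order.
        X = np.array([[row[name] for name in FEATURES] for row in rows],
                     dtype=float)
        return self.model.predict(X)


# Stand-in model that always predicts the mean of its training targets.
stub = DummyRegressor(strategy="mean").fit(np.zeros((2, 5)), [10.0, 15.0])
path = os.path.join(tempfile.mkdtemp(), "stub_model.pkl")
joblib.dump(stub, path)

predictor = DelayPredictorSketch(path)
flight = {
    "ScheduledDuration": 180, "Distance": 2475, "DepartureHour": 8,
    "Airline_Encoded": 0, "Origin_Encoded": 1,
}
predictions = predictor.predict([flight])
```

Fixing the column order in one place guards against silently feeding features to the model in the wrong positions.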
- Data Preprocessing
  - Label encoding for categorical features
  - Standard scaling for numerical features
  - Time-based feature extraction
- Model Architecture
  - Algorithm: Random Forest Regressor
  - Hyperparameters: Optimized via GridSearchCV
  - Cross-validation: 3-fold CV
- Evaluation Metrics
  - Mean Absolute Error (MAE)
  - Root Mean Squared Error (RMSE)
  - Residual analysis
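For reference, the evaluation metrics named above reduce to a few lines of arithmetic. The sketch below uses only the standard library and made-up delay values; the function names are illustrative and the project itself may rely on scikit-learn's metric functions instead.

```python
import math


def residuals(y_true, y_pred):
    """Per-sample errors: actual minus predicted."""
    return [t - p for t, p in zip(y_true, y_pred)]


def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the residuals."""
    res = residuals(y_true, y_pred)
    return sum(abs(r) for r in res) / len(res)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalizes large errors more than MAE."""
    res = residuals(y_true, y_pred)
    return math.sqrt(sum(r * r for r in res) / len(res))


# Made-up delays in minutes, for illustration only.
actual = [10.0, 25.0, 40.0, 5.0]
predicted = [12.0, 20.0, 43.0, 5.0]

print(residuals(actual, predicted))  # [-2.0, 5.0, -3.0, 0.0]
print(mae(actual, predicted))        # 2.5
```

RMSE is never smaller than MAE on the same data, so a large gap between the two points to a few badly mispredicted flights rather than uniformly moderate error.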
The system categorizes predictions into severity levels:
- 🟢 On Time: < 15 minutes
- 🟡 Minor Delay: 15-30 minutes
- 🟠 Moderate Delay: 30-60 minutes
- 🔴 Major Delay: > 60 minutes
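A mapping from predicted minutes to these severity levels could be written as below. The function name `categorize_delay` is hypothetical, and the handling of the boundary values is an assumption: the ranges above list 30 in two categories, so this sketch treats exactly 30 minutes as Moderate and exactly 60 minutes as Moderate.

```python
def categorize_delay(minutes: float) -> str:
    """Map a predicted delay in minutes to a severity label.

    Boundary handling (15, 30, 60) is an assumption for this sketch.
    """
    if minutes < 15:
        return "On Time"
    if minutes < 30:
        return "Minor Delay"
    if minutes <= 60:
        return "Moderate Delay"
    return "Major Delay"


for value in (4.0, 15.0, 45.0, 90.0):
    print(value, categorize_delay(value))
```

Negative predictions (early arrivals) fall into "On Time" under this scheme.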
The notebooks include comprehensive visualizations:
- Distribution of delay minutes
- Average delay by airline
- Feature importance bar charts
- Residual plots
- Predicted vs Actual scatter plots
Run a quick test to ensure everything works:
# Test preprocessing
python -c "from src.preprocess import preprocess_data; print('✓ Preprocessing module OK')"
# Test prediction
python src/predict.py