Predicting Indian domestic flight prices using machine learning, built as part of the Simplon Maghreb × Jobintech Data Analysis bootcamp.
Build a regression model that predicts flight ticket prices with a MAE under 1,250 INR (~15€), based on 300,000+ booking records from Indian domestic airlines.
```
├── Airline_Ticket_Prices_EDA.ipynb   # Exploratory Data Analysis
├── tests_statistic.ipynb             # Statistical hypothesis testing
├── model.ipynb                       # Preprocessing, modeling & evaluation
├── load_model.py                     # Example script for loading & using the model
├── model_metadata.json               # Saved model metrics and feature info
├── Clean_Dataset.csv                 # Source dataset
└── Documentation/                    # Additional project documentation
```
💾 The trained model (`.joblib`) is hosted externally due to file size; see Model.joblib
| Property | Value |
|---|---|
| Source | Indian domestic flight bookings |
| Records | 300,000+ |
| Target | price (INR) |
Features:
| Variable | Type | Description |
|---|---|---|
| `airline` | Categorical | Carrier name (AirAsia, Vistara, IndiGo, etc.) |
| `source_city` | Categorical | Departure city |
| `destination_city` | Categorical | Arrival city |
| `departure_time` | Ordinal | Time-of-day slot (Early Morning → Late Night) |
| `arrival_time` | Ordinal | Time-of-day slot (Early Morning → Late Night) |
| `stops` | Ordinal | zero / one / two_or_more |
| `class` | Ordinal | Economy / Business |
| `duration` | Continuous | Flight duration in hours |
| `days_left` | Continuous | Days between booking and departure |
| `price` | Continuous | Ticket price in INR (target) |
- Variable type classification (continuous, discrete categorical, ordinal)
- Missing value and duplicate checks
- Univariate analysis: histograms, boxplots, and bar charts for all variables
- Multivariate analysis: relationships between each feature and price
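The checks listed above can be sketched with pandas. This is a minimal illustration on a tiny hand-made frame that borrows the dataset's column names; the rows are invented for the example and are not taken from `Clean_Dataset.csv`, and the notebook may perform these checks differently.

```python
import pandas as pd

# Toy rows using the dataset's column names; values are made up for illustration.
df = pd.DataFrame({
    "airline": ["Vistara", "IndiGo", "IndiGo", "AirAsia"],
    "class": ["Business", "Economy", "Economy", "Economy"],
    "duration": [2.5, 2.0, 2.0, None],
    "price": [45000, 5200, 5200, 4100],
})

# Count of missing values per column.
missing_per_column = df.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Split columns by dtype as a first pass at variable-type classification.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print("duplicates:", duplicate_rows)
print("numeric:", numeric_cols, "categorical:", categorical_cols)
```

A dtype split only separates numeric from non-numeric columns; deciding which variables are ordinal (for example `stops`) still needs domain judgment.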
Price distribution: bimodal structure reflecting Economy vs Business classes.
Airline vs average price: Vistara and Air India occupy the premium segment.
Stops vs average price: more stops correlate with a higher price.
Days before flight vs price: prices spike sharply within 5 days of departure.
Average price by airline and number of stops: Vistara commands a consistent premium across all stop configurations.
Price by departure time and class: ticket class is the dominant price driver, and departure time has only a marginal effect.
Key insights:
- `class` is the dominant price driver: Business tickets average ~8x more than Economy
- Booking within 5 days of departure causes a sharp price spike (~30,000 INR peak)
- Vistara is the most expensive airline across all stop counts
- More stops generally correlate with higher prices, likely due to premium airline routing
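Comparisons of this kind are typically produced with grouped aggregations. The sketch below shows the pattern on a few invented rows; the prices are placeholders chosen for readability, not values from the dataset, and the `last_minute` column is a hypothetical helper introduced only for this example.

```python
import pandas as pd

# Invented rows for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "class": ["Economy", "Economy", "Business", "Business"],
    "days_left": [2, 30, 2, 30],
    "price": [9000, 5000, 60000, 52000],
})

# Average price per ticket class and the Business/Economy ratio.
mean_by_class = df.groupby("class")["price"].mean()
ratio = mean_by_class["Business"] / mean_by_class["Economy"]

# Flag last-minute bookings (5 days or fewer) and compare average prices.
df["last_minute"] = df["days_left"] <= 5
mean_by_window = df.groupby("last_minute")["price"].mean()

print(mean_by_class)
print(f"Business/Economy ratio on the toy data: {ratio:.1f}x")
print(mean_by_window)
```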
| Test | Variables | Result |
|---|---|---|
| Pearson correlation | `duration` vs `price` | r = 0.20, p ≈ 0; significant but weak correlation |
| ANOVA | `airline` vs `price` | F = 17,194, p ≈ 0; significant price differences across airlines |
| ANOVA | `stops` vs `price` | F = 6,477, p = 0; significant price differences across stop counts |
| Chi-square (independence) | `departure_time` vs `stops` | p ≈ 0; the two variables are dependent |
| Chi-square (goodness of fit) | AirAsia market share | Rejects H0 of a 70% share; the actual share is significantly different |
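The four kinds of test in the table map onto SciPy functions roughly as follows. This is a sketch on synthetic data: the frame reuses the dataset's column names, but every value is randomly generated, so the statistics it prints have no relation to the results reported above.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for the dataset; columns mirror the real names, values are random.
df = pd.DataFrame({
    "airline": rng.choice(["Vistara", "IndiGo", "AirAsia"], size=n),
    "stops": rng.choice(["zero", "one"], size=n),
    "departure_time": rng.choice(["Morning", "Evening"], size=n),
    "duration": rng.uniform(1, 10, size=n),
})
df["price"] = 3000 + 500 * df["duration"] + rng.normal(0, 1000, size=n)

# Pearson correlation between two continuous variables.
r, p_pearson = stats.pearsonr(df["duration"], df["price"])

# One-way ANOVA: does mean price differ across airlines?
groups = [g["price"].to_numpy() for _, g in df.groupby("airline")]
f_stat, p_anova = stats.f_oneway(*groups)

# Chi-square test of independence on a contingency table.
table = pd.crosstab(df["departure_time"], df["stops"])
chi2, p_indep, dof, _ = stats.chi2_contingency(table)

# Chi-square goodness of fit: observed AirAsia share vs a hypothesised 70% share.
n_airasia = int((df["airline"] == "AirAsia").sum())
observed = [n_airasia, n - n_airasia]
expected = [0.7 * n, 0.3 * n]
chi2_gof, p_gof = stats.chisquare(f_obs=observed, f_exp=expected)

print(f"Pearson r={r:.2f}, ANOVA F={f_stat:.2f}, chi2={chi2:.2f}, GOF chi2={chi2_gof:.2f}")
```

With 300,000+ rows, p-values near zero are expected even for weak effects, which is why the table reports the effect size (r = 0.20) alongside significance.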
Preprocessing:
- Ordinal encoding: `stops` (0/1/2), `class` (0/1)
- One-hot encoding: `airline`, `source_city`, `destination_city`, `arrival_time`, `departure_time`
- Standard scaling: `duration`, `days_left`
- Train/test split: 80/20
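One way to wire these steps together is scikit-learn's `ColumnTransformer`. The sketch below is an assumed arrangement, not necessarily what `model.ipynb` does (the usage script in this README applies a separate encoder and scaler), and the rows are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Invented rows that reuse the dataset's column names and category labels.
df = pd.DataFrame({
    "airline": ["Vistara", "IndiGo", "AirAsia", "Vistara", "IndiGo"] * 2,
    "source_city": ["Delhi", "Mumbai", "Delhi", "Mumbai", "Delhi"] * 2,
    "destination_city": ["Mumbai", "Delhi", "Mumbai", "Delhi", "Mumbai"] * 2,
    "departure_time": ["Morning", "Evening", "Night", "Morning", "Evening"] * 2,
    "arrival_time": ["Evening", "Night", "Morning", "Evening", "Night"] * 2,
    "stops": ["zero", "one", "two_or_more", "one", "zero"] * 2,
    "class": ["Economy", "Business", "Economy", "Economy", "Business"] * 2,
    "duration": [2.5, 5.0, 12.0, 6.5, 2.0] * 2,
    "days_left": [15, 3, 40, 7, 1] * 2,
    "price": [6000, 52000, 4500, 7000, 60000] * 2,
})

ordinal_cols = ["stops", "class"]
onehot_cols = ["airline", "source_city", "destination_city", "arrival_time", "departure_time"]
numeric_cols = ["duration", "days_left"]

preprocessor = ColumnTransformer([
    # Explicit category order gives stops 0/1/2 and class 0/1.
    ("ordinal",
     OrdinalEncoder(categories=[["zero", "one", "two_or_more"], ["Economy", "Business"]]),
     ordinal_cols),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), onehot_cols),
    ("scale", StandardScaler(), numeric_cols),
])

X = df.drop(columns="price")
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit on the training split only, then reuse the fitted transformer on the test split.
X_train_t = preprocessor.fit_transform(X_train)
X_test_t = preprocessor.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```

Fitting the encoder and scaler on the training split alone keeps test-set statistics out of the preprocessing step.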
Model comparison (5-fold cross-validation on training set):
| Model | Selected |
|---|---|
| Baseline (DummyRegressor) | ❌ |
| Linear Regression | ❌ |
| Ridge Regression | ❌ |
| Random Forest | ✅ |
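A comparison like this can be run with `cross_val_score`. The sketch below uses a synthetic regression problem in place of the real features and assumes MAE as the scoring metric, since the comparison metric is not stated here; apart from `n_estimators=50`, the hyperparameters are scikit-learn defaults chosen for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic regression problem with a step and a squared term; it stands in for the real features.
X = rng.uniform(0, 1, size=(400, 3))
y = 5000 + 20000 * (X[:, 0] > 0.5) + 3000 * X[:, 1] ** 2 + rng.normal(0, 500, size=400)

models = {
    "Baseline (DummyRegressor)": DummyRegressor(strategy="mean"),
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv_mae = {}
for name, model in models.items():
    # scikit-learn returns negated MAE so that higher is always better; flip the sign back.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()

for name, mae in sorted(cv_mae.items(), key=lambda kv: kv[1]):
    print(f"{name}: CV MAE = {mae:,.0f}")
```

The dummy baseline predicts the training mean for every row, so any model worth keeping should beat its MAE by a clear margin.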
Final model (Random Forest, `n_estimators=50`):
| Metric | Value |
|---|---|
| R² | 0.9857 |
| MAE | 1,063 INR (~12.7€) ✅ |
| RMSE | 2,722 INR |
| 95% Confidence Interval | [1,043 – 1,083 INR] |
✅ Business objective achieved: MAE < 1,250 INR (equivalent to ~15€)
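The metrics above can be computed with `sklearn.metrics`. How the reported confidence interval was obtained is not stated here; the sketch below assumes a percentile bootstrap over the absolute errors, which is one common choice, and runs on randomly generated prices rather than project results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Synthetic "true" prices and predictions; the numbers are random, not project results.
y_true = rng.uniform(2000, 60000, size=1000)
y_pred = y_true + rng.normal(0, 1500, size=1000)

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

# Percentile bootstrap (an assumption): resample the absolute errors with replacement
# and take the 2.5th and 97.5th percentiles of the resampled means.
abs_err = np.abs(y_true - y_pred)
boot_means = np.array([
    rng.choice(abs_err, size=abs_err.size, replace=True).mean()
    for _ in range(2000)
])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"MAE={mae:,.0f}  RMSE={rmse:,.0f}  R2={r2:.4f}")
print(f"95% CI for MAE: [{ci_low:,.0f}, {ci_high:,.0f}]")
```

RMSE is never smaller than MAE, and a large gap between the two (as in the table above) suggests a minority of predictions with large errors.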
- Python 3.14
- pandas / NumPy: data manipulation
- Matplotlib / Seaborn: visualization
- scikit-learn: preprocessing, modeling, evaluation
- SciPy: statistical tests
- joblib: model serialization
- uv: package management
```python
import joblib
import numpy as np
import pandas as pd

# Load the saved bundle: a dict holding the fitted encoder, scaler, and model.
pipeline = joblib.load("flight_price_pipeline.joblib")
encoder = pipeline["encoder"]
scaler = pipeline["scaler"]
model = pipeline["model"]

# One flight to score; ordinal features are passed already encoded as integers.
new_flight = {
    'airline': ['Vistara'],
    'source_city': ['Delhi'],
    'destination_city': ['Mumbai'],
    'departure_time': ['Evening'],
    'arrival_time': ['Night'],
    'stops': [1],   # 0=zero, 1=one, 2=two_or_more
    'class': [0],   # 0=Economy, 1=Business
    'duration': [2.5],
    'days_left': [15]
}
df_input = pd.DataFrame(new_flight)

categorical_cols = ['airline', 'source_city', 'destination_city', 'arrival_time', 'departure_time']
numerical_cols = ['duration', 'days_left']
ordinal_cols = ['class', 'stops']

# Apply the fitted transformers to the new row.
encoded = encoder.transform(df_input[categorical_cols])
if hasattr(encoded, "toarray"):
    # OneHotEncoder returns a sparse matrix by default; densify before concatenating.
    encoded = encoded.toarray()
scaled = scaler.transform(df_input[numerical_cols])

# Assemble the features in the order ordinal, scaled, one-hot.
X_input = np.concatenate([df_input[ordinal_cols].values, scaled, encoded], axis=1)

predicted_price = model.predict(X_input)
print(f"Predicted price: {predicted_price[0]:,.0f} INR")
```
Built by the Data-Analysis-Hub team as part of the Simplon Maghreb × Jobintech Data Analyst training program (Cohort 2026).





