This project aims to analyze and predict train delays using historical SNCF (French National Railway Company) data. It is designed for anyone who wants to understand the approach of a data science project applied to delay prediction, even without advanced data knowledge.
- Lohan LECOQ
- Gabriel BRUMENT
- Gabin SCHIRO
- Project Structure
- Prerequisites
- Installation and Usage
- Methodology
- Main Visualizations
- Results Interpretation
- Project Maintenance
- Conclusion and Future Work
The TARDIS project is organized as follows:
- Main files:
  - `dataset.csv` - Raw dataset containing train journey information
  - `cleaned_dataset.csv` - Cleaned dataset ready for analysis and modeling
  - `tardis_eda.ipynb` - Exploratory Data Analysis (EDA) notebook
  - `tardis_model.ipynb` - Modeling and delay prediction notebook
  - `dashboard.py` - Streamlit script for data and results visualization
  - `requirements.txt` - List of required Python dependencies
  - `G-AIA-210_tardis.txt` - Project documentation
For the Exploratory Data Analysis (EDA) to work correctly, the computer must be set to the correct date and time. This is because the data cleaning process automatically removes all rows with dates later than the current system date.
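For illustration, here is a minimal sketch of that filter, assuming the raw file uses a semicolon separator (as in the prediction example below) and a `date` column; the actual column names in the notebook may differ:

```python
import pandas as pd

# Drop any rows dated after "now" according to the system clock
# (the 'date' column name and the ';' separator are assumptions)
df = pd.read_csv('dataset.csv', sep=';', parse_dates=['date'])
df = df[df['date'] <= pd.Timestamp.now()]
```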
The project requires the following libraries (detailed in requirements.txt):
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- xgboost
- streamlit (for the interface)
- jupyter (to run notebooks)
For quick and complete project execution, use the provided shell script:
```
./scripts/run_project.sh
```

This script performs the following actions:
- Checks and installs required dependencies
- Runs the Exploratory Data Analysis (EDA) notebook
- Runs the modeling notebook to generate the predictive model
- Launches the interactive Streamlit dashboard
Alternatively, you can run each step manually:

```
pip install -r requirements.txt
jupyter notebook tardis_eda.ipynb
jupyter notebook tardis_model.ipynb
python -m streamlit run dashboard.py
```

The dashboard will automatically open in your default web browser. If not, you can access the URL displayed in the terminal (usually http://localhost:8501).
The final model is saved in the models/ folder (automatically created when running tardis_model.ipynb):
- `tardis_model.joblib` - The complete pipeline including preprocessing and the XGBoost model
- `label_encoder.joblib` - The encoder to convert numeric predictions to text categories
To load and use the saved model in your own scripts:
```python
import joblib
import pandas as pd

# Load the model and encoder
model = joblib.load('models/tardis_model.joblib')
label_encoder = joblib.load('models/label_encoder.joblib')

# Load your new data (make sure it has the same columns as the original dataset)
new_data = pd.read_csv('your_new_data.csv', sep=';')

# Make predictions
predictions_numeric = model.predict(new_data)
predictions_category = label_encoder.inverse_transform(predictions_numeric)

# Display results
for i, prediction in enumerate(predictions_category):
    print(f"Journey {i+1}: Predicted delay: {prediction}")
```

EDA (Exploratory Data Analysis) involves exploring, cleaning, and preparing data for modeling. The main steps are:
- Loading raw data: importing the original CSV file.
- Cleaning: correcting missing values, duplicates, inconsistencies, and typos.
- Removing unnecessary columns or those with a high proportion of missing values.
- Verifying and correcting station names via fuzzy matching (see the sketch after this list).
- Removing variables that could cause "leakage" (information that would not be available before the journey).
- Saving the clean dataset to `cleaned_dataset.csv`.
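As an illustration of the fuzzy-matching step, here is a minimal sketch using only the standard library's `difflib`; the reference list, library, and similarity threshold used in the notebook may differ:

```python
import difflib

# Hypothetical reference list of correctly spelled station names
reference_stations = ['PARIS LYON', 'LYON PART DIEU', 'MARSEILLE ST CHARLES']

def closest_station(name: str) -> str:
    """Return the closest known station name, or the input unchanged if no good match."""
    matches = difflib.get_close_matches(name.upper(), reference_stations, n=1, cutoff=0.8)
    return matches[0] if matches else name

print(closest_station('Paris Lyonn'))  # -> 'PARIS LYON'
```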
Feature engineering involves creating new explanatory variables from existing data to enrich the model. Examples (a short sketch follows this list):
- Extracting year, month, season from the date.
- Creating special period indicators (school holidays, public holidays, ski season).
- Calculating journey length category (short, medium, long).
- Creating a congestion index based on traffic intensity.
- Adding variables combining multiple pieces of information (e.g., journey complexity).
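A minimal sketch of a few of the date-based features listed above; the exact names and rules used in the notebook may differ:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2023-01-15', '2023-07-03', '2023-11-20'])})

# Calendar features extracted from the date
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

# Map each month to a season (illustrative rule)
season_map = {12: 'winter', 1: 'winter', 2: 'winter',
              3: 'spring', 4: 'spring', 5: 'spring',
              6: 'summer', 7: 'summer', 8: 'summer',
              9: 'autumn', 10: 'autumn', 11: 'autumn'}
df['season'] = df['month'].map(season_map)
print(df)
```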
Initially, we attempted to predict the exact delay (in minutes) using regression models. However, the variability of delays and lack of strong signals made this task very difficult. We therefore reformulated the problem as classification: predicting the delay category (low, medium, high).
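In practice the categories can be obtained by binning the delay in minutes, for example with `pd.cut`; the thresholds below are purely illustrative and may not match those used in the notebook:

```python
import pandas as pd

# Hypothetical delays in minutes for a few journeys
delays = pd.Series([2, 12, 45, 0, 30])

# Illustrative thresholds: <= 5 min -> Low, 5-20 min -> Medium, > 20 min -> High
delay_category = pd.cut(delays, bins=[-float('inf'), 5, 20, float('inf')],
                        labels=['Low', 'Medium', 'High'])
print(delay_category.tolist())  # ['Low', 'Medium', 'High', 'Low', 'High']
```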
The "High" class (major delays) is very much a minority, making its detection difficult for standard models. Several strategies were tested to improve detection of this class.
XGBoost (Extreme Gradient Boosting) is an ensemble algorithm based on "gradient boosting". It builds a series of decision trees, where each new tree corrects the errors of previous ones. XGBoost is widely used in data science because it is:
- high-performing on tabular data,
- robust against overfitting,
- capable of handling imbalanced classes (`scale_pos_weight` parameter),
- fast thanks to internal optimizations (a minimal pipeline sketch is shown below).
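For reference, a minimal pipeline of this kind might look as follows; the feature lists and preprocessing choices are assumptions, but the final step is named `model` so that the `model__...` parameters used in the tuning code below apply to it:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from xgboost import XGBClassifier

# Hypothetical feature lists taken from the cleaned dataset
num_cols = ['journey_length', 'congestion_index']
cat_cols = ['departure_station', 'arrival_station']

preprocessing = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), num_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]), cat_cols),
])

pipeline = Pipeline([
    ('preprocessing', preprocessing),
    ('model', XGBClassifier(objective='multi:softprob', eval_metric='mlogloss')),
])
```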
To obtain the best model, we performed hyperparameter tuning:
- Automatic search for the best XGBoost hyperparameters (number of trees, depth, learning rate, etc.).
- Search for the best decision threshold for the "High" class.
- Optimization of overall performance (macro f1-score, which balances prediction quality across all classes).
Code used for the advanced hyperparameter tuning:
```python
import numpy as np
from sklearn.model_selection import ParameterGrid, cross_val_predict, StratifiedKFold
from sklearn.metrics import f1_score
from tqdm import tqdm

# 'pipeline', 'X' and 'y' are defined earlier in the notebook;
# the CV settings below are illustrative
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

param_grid = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [5, 7, 9],
    'model__scale_pos_weight': [3, 5, 10],
    'model__learning_rate': [0.05, 0.1],
    'model__subsample': [0.8, 1.0],
    'model__colsample_bytree': [0.8, 1.0],
}

best_score = 0
best_params = None
best_threshold = 0.5
results = []

for params in tqdm(list(ParameterGrid(param_grid))):
    pipeline.set_params(**params)
    probas = cross_val_predict(pipeline, X, y, cv=cv, method='predict_proba')
    for threshold in np.arange(0.45, 0.91, 0.05):
        # Predict "High" (class 2) whenever its probability exceeds the threshold,
        # otherwise pick the more likely of the two remaining classes
        y_pred_custom = []
        for i, p in enumerate(probas):
            if p[2] >= threshold:
                y_pred_custom.append(2)
            else:
                y_pred_custom.append(np.argmax(p[:2]))
        score = f1_score(y, y_pred_custom, average='macro')
        results.append((params.copy(), threshold, score))
        if score > best_score:
            best_score = score
            best_params = params.copy()
            best_threshold = threshold

print("Best overall compromise (macro f1-score):")
print(f"Hyperparameters: {best_params}")
print(f"Threshold: {best_threshold:.2f}")
print(f"Macro F1-score: {best_score:.2f}")

results_sorted = sorted(results, key=lambda x: x[2], reverse=True)
print("\nTop 5 combinations (params, threshold, f1-macro):")
for r in results_sorted[:5]:
    print(r)
```

Result:
- Best macro f1-score achieved: 0.54
- Hyperparameters: `max_depth=9`, `n_estimators=300`, `scale_pos_weight=3`, `learning_rate=0.1`, `subsample=0.8`, `colsample_bytree=1.0`, threshold = 0.45
This advanced tuning achieves the best overall compromise across all classes with XGBoost to date.
The final model is trained on the complete dataset and saved using joblib:
```python
# Save the complete pipeline (preprocessing + model) with joblib
joblib.dump(final_pipeline, 'models/tardis_model.joblib')

# Save the LabelEncoder to convert numeric predictions to categories
joblib.dump(le, 'models/label_encoder.joblib')
```

The difference between these two files:
- `tardis_model.joblib`: contains the complete machine learning pipeline:
  - Data preprocessing (imputation, standardization, encoding)
  - Trained XGBoost model with all its parameters
  - This file allows making predictions on new raw data
- `label_encoder.joblib`: contains only the `LabelEncoder` object:
  - Converts numeric predictions (0, 1, 2) to categories ('Low', 'Medium', 'High')
  - Ensures result interpretation remains consistent
Here are the main charts generated during exploratory analysis and modeling:
- Accuracy: proportion of correct predictions (here ~0.69, which is very good for 3 classes).
- Classification report: details by class (precision, recall, f1-score).
- Confusion matrix: visualizes errors between classes.
Note: Regression metrics (RMSE, R², etc.) are not appropriate here since we are predicting categories, not numerical values.
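These metrics can be computed with scikit-learn as in the following sketch (toy labels are used here purely for illustration):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy ground truth and predictions (0 = Low, 1 = Medium, 2 = High)
y_true = [0, 0, 1, 2, 1, 0, 2, 1]
y_pred = [0, 1, 1, 2, 0, 0, 1, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=['Low', 'Medium', 'High']))
print(confusion_matrix(y_true, y_pred))
```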
To reset the project to its initial state and clean all generated files, you can use the provided cleanup script:
```
./scripts/clean.sh
```

This script performs the following actions:
- Deletes the cleaned data file (`cleaned_dataset.csv`)
- Deletes all executed notebooks (`*_executed.ipynb`)
- Deletes the models folder (`models/`) and all trained models
- Offers to delete the virtual environment (`tardis_env/`)
Use this script before starting a new analysis cycle or to clean your workspace.
This project demonstrates how to approach a real prediction problem in the railway domain. Key takeaways:
- Problem reformulation: transforming a regression problem into classification significantly improved performance.
- Importance of feature engineering: creating relevant variables (seasons, congestion indicators, etc.) enriched the model.
- Handling class imbalance: optimizing hyperparameters and decision thresholds improved detection of significant delays.

Possible directions for future work:
- Collect more weather data
- Integrate data on construction work and infrastructure condition
- Test other algorithms (neural networks, ensemble models)
- Develop a user interface to facilitate model usage
Feel free to consult the code and markdown cells in the notebooks for detailed explanations of each step and key concept in the project.










