TARDIS - Train Delay Analysis and Modeling

This project aims to analyze and predict train delays using historical SNCF (French National Railway Company) data. It is designed for anyone who wants to understand how a data science project approaches delay prediction, even without an advanced background in data analysis.

Authors

  • Lohan LECOQ
  • Gabriel BRUMENT
  • Gabin SCHIRO

Table of Contents

  1. Project Structure
  2. Prerequisites
  3. Installation and Usage
  4. Methodology
  5. Main Visualizations
  6. Results Interpretation
  7. Project Maintenance
  8. Conclusion and Future Work

Project Structure

The TARDIS project is organized as follows:

  • Main files:
    • dataset.csv - Raw dataset containing train journey information
    • cleaned_dataset.csv - Cleaned dataset ready for analysis and modeling
    • tardis_eda.ipynb - Exploratory Data Analysis (EDA) notebook
    • tardis_model.ipynb - Modeling and delay prediction notebook
    • dashboard.py - Streamlit script for data and results visualization
    • requirements.txt - List of required Python dependencies
    • G-AIA-210_tardis.txt - Project documentation

Prerequisites

⚠️ Important

For the Exploratory Data Analysis (EDA) notebook to run correctly, the system clock must be set to the correct date and time: the data cleaning step automatically removes all rows whose dates are later than the current system date.
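
As an illustration, here is a minimal sketch of what such a date filter might look like in pandas (the column name 'Date' and the date format are assumptions, not necessarily those of the real dataset):

import pandas as pd

# Hypothetical date filter: drop rows dated later than the current system date.
# The column name 'Date' is an assumption for illustration purposes.
df = pd.read_csv('dataset.csv', sep=';')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df = df[df['Date'] <= pd.Timestamp.now()]

If the system clock is set in the past, genuine rows would be dropped by this filter, which is why the warning above matters.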

Python Dependencies

The project requires the following libraries (detailed in requirements.txt):

  • pandas
  • numpy
  • matplotlib
  • seaborn
  • scikit-learn
  • xgboost
  • streamlit (for the interface)
  • jupyter (to run notebooks)

Installation and Usage

Option 1: Automatic Script

For quick and complete project execution, use the provided shell script:

./scripts/run_project.sh

This script performs the following actions:

  1. Checks and installs required dependencies
  2. Runs the Exploratory Data Analysis (EDA) notebook
  3. Runs the modeling notebook to generate the predictive model
  4. Launches the interactive Streamlit dashboard

Option 2: Step-by-Step Execution

1. Installing Dependencies

pip install -r requirements.txt

2. Data Exploration

jupyter notebook tardis_eda.ipynb

3. Modeling and Prediction

jupyter notebook tardis_model.ipynb

4. Launching the Dashboard

python -m streamlit run dashboard.py

The dashboard will automatically open in your default web browser. If not, you can access the URL displayed in the terminal (usually http://localhost:8501).

Using the Trained Model

The final model is saved in the models/ folder (automatically created when running tardis_model.ipynb):

  • tardis_model.joblib - The complete pipeline including preprocessing and XGBoost model
  • label_encoder.joblib - The encoder to convert numeric predictions to text categories

To load and use the saved model in your own scripts:

import joblib
import pandas as pd

# Load the model and encoder
model = joblib.load('models/tardis_model.joblib')
label_encoder = joblib.load('models/label_encoder.joblib')

# Load your new data (make sure it has the same columns as the original dataset)
new_data = pd.read_csv('your_new_data.csv', sep=';')

# Make predictions
predictions_numeric = model.predict(new_data)
predictions_category = label_encoder.inverse_transform(predictions_numeric)

# Display results
for i, prediction in enumerate(predictions_category):
    print(f"Journey {i+1}: Predicted delay: {prediction}")

Methodology

1. Data Cleaning and Preparation (EDA)

EDA (Exploratory Data Analysis) involves exploring, cleaning, and preparing data for modeling. The main steps are:

  • Loading raw data: importing the original CSV file.
  • Cleaning: handling missing values, duplicates, inconsistencies, and typos.
  • Removing unnecessary columns or those with a high proportion of missing values.
  • Verifying and correcting station names (fuzzy matching; see the sketch after this list).
  • Removing variables that could cause data "leakage" (information that would not be available at prediction time).
  • Saving a clean dataset to cleaned_dataset.csv.
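
To illustrate the fuzzy-matching step, here is a minimal sketch based on Python's standard difflib; the reference station list, the column name, and the cutoff are assumptions, not the notebook's actual implementation:

import difflib

# Hypothetical list of correctly spelled station names
known_stations = ['PARIS LYON', 'LYON PART DIEU', 'MARSEILLE ST CHARLES']

def fix_station_name(name, choices=known_stations, cutoff=0.8):
    # Return the closest known station name, or the original if no close match is found
    matches = difflib.get_close_matches(str(name), choices, n=1, cutoff=cutoff)
    return matches[0] if matches else name

# df is assumed to be the raw DataFrame loaded earlier; the column name is illustrative
df['Departure station'] = df['Departure station'].map(fix_station_name)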

2. Feature Engineering

Feature engineering involves creating new explanatory variables from existing data to enrich the model. Examples (a short pandas sketch follows the list):

  • Extracting year, month, season from the date.
  • Creating special period indicators (school holidays, public holidays, ski season).
  • Calculating journey length category (short, medium, long).
  • Creating a congestion index based on traffic intensity.
  • Adding variables combining multiple pieces of information (e.g., journey complexity).
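
A minimal sketch of a few such derived features in pandas; the column names, bin edges, and category labels are illustrative rather than the exact ones used in the notebooks:

import pandas as pd

# df is assumed to be the cleaned DataFrame loaded earlier;
# 'Date' and 'Route length (km)' are hypothetical column names.
df['Date'] = pd.to_datetime(df['Date'])
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['season'] = df['month'].map({12: 'winter', 1: 'winter', 2: 'winter',
                                3: 'spring', 4: 'spring', 5: 'spring',
                                6: 'summer', 7: 'summer', 8: 'summer',
                                9: 'autumn', 10: 'autumn', 11: 'autumn'})
df['length_category'] = pd.cut(df['Route length (km)'],
                               bins=[0, 300, 600, float('inf')],
                               labels=['short', 'medium', 'long'])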

3. Predictive Modeling

From Regression to Classification

Initially, we attempted to predict the exact delay (in minutes) using regression models. However, the variability of delays and lack of strong signals made this task very difficult. We therefore reformulated the problem as classification: predicting the delay category (low, medium, high).
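
Concretely, the continuous delay target can be discretised into the three classes, for example with pd.cut; the delay column name and the bin edges below are assumptions for illustration:

import pandas as pd

# Hypothetical thresholds (in minutes) separating the Low / Medium / High classes;
# 'average_arrival_delay' is an illustrative column name.
bins = [-float('inf'), 5, 15, float('inf')]
labels = ['Low', 'Medium', 'High']
df['delay_category'] = pd.cut(df['average_arrival_delay'], bins=bins, labels=labels)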

Class Imbalance Problem

The "High" class (major delays) is very much a minority, making its detection difficult for standard models. Several strategies were tested to improve detection of this class.

What is XGBoost?

XGBoost (Extreme Gradient Boosting) is an ensemble algorithm based on "gradient boosting": it builds a series of decision trees, where each new tree corrects the errors of the previous ones (a minimal pipeline sketch follows the list below). XGBoost is widely used in data science because it is:

  • high-performing on tabular data,
  • robust against overfitting,
  • capable of handling imbalanced classes (scale_pos_weight parameter),
  • fast thanks to internal optimizations.
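
As an illustration, here is a minimal scikit-learn pipeline wrapping an XGBoost classifier. This is a sketch only: the feature lists are hypothetical and the real preprocessing is defined in tardis_model.ipynb.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

# Hypothetical feature lists; the real ones are built in the modeling notebook
numeric_features = ['month', 'congestion_index']
categorical_features = ['season', 'departure_station']

preprocessing = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])

pipeline = Pipeline([
    ('preprocessing', preprocessing),
    ('model', XGBClassifier(objective='multi:softprob', eval_metric='mlogloss')),
])

Naming the final step 'model' is consistent with the 'model__' prefixes used in the hyperparameter grid below.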

Advanced Hyperparameter Tuning

To obtain the best model, we performed an extensive hyperparameter search:

  • Automatic search for the best XGBoost hyperparameters (number of trees, depth, learning rate, etc.).
  • Search for the best decision threshold for the "High" class.
  • Optimization of overall performance (macro F1-score, which balances prediction quality across all classes).

Code used for the hyperparameter search:

import numpy as np
from sklearn.model_selection import ParameterGrid, cross_val_predict, StratifiedKFold
from sklearn.metrics import f1_score
from tqdm import tqdm

# pipeline, X, y and cv (a cross-validation splitter such as StratifiedKFold)
# are defined earlier in the notebook

param_grid = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [5, 7, 9],
    'model__scale_pos_weight': [3, 5, 10],
    'model__learning_rate': [0.05, 0.1],
    'model__subsample': [0.8, 1.0],
    'model__colsample_bytree': [0.8, 1.0],
}

best_score = 0
best_params = None
best_threshold = 0.5
results = []

for params in tqdm(list(ParameterGrid(param_grid))):
    pipeline.set_params(**params)
    probas = cross_val_predict(pipeline, X, y, cv=cv, method='predict_proba')
    # Custom decision rule: predict "High" (class 2) only when its probability
    # exceeds the threshold; otherwise pick the more likely of the other two classes.
    for threshold in np.arange(0.45, 0.91, 0.05):
        y_pred_custom = []
        for p in probas:
            if p[2] >= threshold:
                y_pred_custom.append(2)
            else:
                y_pred_custom.append(np.argmax(p[:2]))
        score = f1_score(y, y_pred_custom, average='macro')
        results.append((params.copy(), threshold, score))
        if score > best_score:
            best_score = score
            best_params = params.copy()
            best_threshold = threshold

print("Best overall compromise (macro f1-score):")
print(f"Hyperparameters: {best_params}")
print(f"Threshold: {best_threshold:.2f}")
print(f"Macro F1-score: {best_score:.2f}")

results_sorted = sorted(results, key=lambda x: x[2], reverse=True)
print("\nTop 5 combinations (params, threshold, f1-macro):")
for r in results_sorted[:5]:
    print(r)

Result:

  • Best macro f1-score achieved: 0.54
  • Hyperparameters: max_depth=9, n_estimators=300, scale_pos_weight=3, learning_rate=0.1, subsample=0.8, colsample_bytree=1.0, threshold=0.45

This advanced tuning achieves the best overall compromise across all classes obtained with XGBoost so far.
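
To apply the tuned threshold at prediction time, the same decision rule can be reused on the fitted pipeline's predicted probabilities. A minimal sketch, assuming the pipeline has been refit on the training data and best_threshold comes from the search above:

import numpy as np

def predict_with_threshold(fitted_pipeline, X_new, threshold):
    # Predict "High" (class 2) only when its probability exceeds the threshold;
    # otherwise choose the more likely of the two remaining classes.
    probas = fitted_pipeline.predict_proba(X_new)
    is_high = probas[:, 2] >= threshold
    other = np.argmax(probas[:, :2], axis=1)
    return np.where(is_high, 2, other)

# Example usage: y_pred = predict_with_threshold(pipeline, X_test, best_threshold)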

4. Model Saving

The final model is trained on the complete dataset and saved using joblib:

# Save the complete pipeline (preprocessing + model) with joblib
joblib.dump(final_pipeline, 'models/tardis_model.joblib')

# Save the LabelEncoder to convert numeric predictions to categories
joblib.dump(le, 'models/label_encoder.joblib')

The difference between these two files:

  1. tardis_model.joblib: Contains the complete machine learning pipeline:

    • Data preprocessing (imputation, standardization, encoding)
    • Trained XGBoost model with all its parameters
    • This file allows making predictions on new raw data
  2. label_encoder.joblib: Contains only the LabelEncoder object:

    • Converts numeric predictions (0, 1, 2) to categories ('Low', 'Medium', 'High')
    • Ensures result interpretation remains consistent

Main Visualizations

Here are the main charts generated during exploratory analysis and modeling:

  • Distribution of Average Arrival Delays
  • Delays by Traffic Intensity and Season
  • Correlation Matrix
  • Impact of Special Periods on Delays
  • Top 15 Stations with Highest Average Departure Delays
  • Main Causes of Delays
  • Monthly Evolution of Average Delays
  • Train Reliability by Distance Category
  • Delays and Cancellations by Season
  • Impact of Delays by Distance Category
  • Top 10 Routes with Highest Delay Impact

Results Interpretation

  • Accuracy: proportion of correct predictions (here ~0.69, which is very good for 3 classes).
  • Classification report: details by class (precision, recall, f1-score).
  • Confusion matrix: visualizes errors between classes.

Note: Regression metrics (RMSE, R², etc.) are not appropriate here since we are predicting categories, not numerical values.
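
A minimal sketch of how these metrics can be computed with scikit-learn; the toy labels below are purely illustrative, whereas in the notebook the true and predicted labels come from the test split:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy labels purely for illustration (0 = Low, 1 = Medium, 2 = High)
y_test = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 1, 1, 0, 2]

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['Low', 'Medium', 'High']))
print(confusion_matrix(y_test, y_pred))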

Project Maintenance

Project Cleanup

To reset the project to its initial state and clean all generated files, you can use the provided cleanup script:

./scripts/clean.sh

This script performs the following actions:

  1. Deletes the cleaned data file (cleaned_dataset.csv)
  2. Deletes all executed notebooks (*_executed.ipynb)
  3. Deletes the models folder (models/) and all trained models
  4. Offers to delete the virtual environment (tardis_env/)

Use this script before starting a new analysis cycle or to clean your workspace.

Conclusion and Future Work

This project demonstrates how to approach a real prediction problem in the railway domain. Key takeaways:

  1. Problem reformulation: Transforming a regression problem into classification significantly improved performance.

  2. Importance of feature engineering: Creating relevant variables (seasons, congestion indicators, etc.) enriched the model.

  3. Handling class imbalance: Optimizing hyperparameters and decision thresholds improved detection of significant delays.

Areas for Improvement:

  • Collect more weather data
  • Integrate data on construction work and infrastructure condition
  • Test other algorithms (neural networks, ensemble models)
  • Develop a user interface to facilitate model usage

Feel free to consult the code and markdown cells in the notebooks for detailed explanations of each step and key concept in the project.
