```
project/
├── configs/
│   └── config.yml                     # Configuration for all the machine learning models
├── dataset/                           # Data loading & preprocessing
│   ├── preprocessed_tabular/          # Stores preprocessed tabular data
│   ├── raw/                           # Stores raw data before preprocessing
│   └── temp/                          # Stores intermediate data (from PostgreSQL)
├── db_utils/                          # Configuration for the PostgreSQL database
│   ├── db_config.py                   # Stores DB configuration
│   └── db_setup.py                    # Connects to the DB and displays info
├── notebooks/
│   ├── exp_2025.ipynb                 # Jupyter notebook for exploration and debugging
│   └── plot_metric_results.ipynb      # Plots metric results
├── outputs/                           # Outputs for all the experiments
│   └── experiments/
│       └── 20250811_222344/           # Created per run, named by date and time
│           ├── logs/                  # Stores logs for the different ML models
│           ├── models/                # Stores trained model parameters
│           └── results/               # Stores model results
├── scripts/                           # Scripts to prepare raw data by querying PostgreSQL
│   ├── fetch_demographics_data.py     # Fetches the demographics data from the DB and merges the target file
│   ├── fetch_labevents_data.py        # Fetches the labevents data from the DB
│   └── prepare_data.py                # Main entry point to the raw data extraction pipeline
├── src/
│   ├── config/
│   │   └── schemas.py                 # Pydantic validation for all the classes, methods, and data types
│   ├── data/                          # Data handling and preprocessing
│   │   ├── data_loader.py             # Loads data from sources
│   │   ├── dataset.py                 # Dataset logic and split handling
│   │   ├── data_preprocessing.py      # Data preprocessing method definitions
│   │   ├── tabular_data_preprocessor.py  # Tabular data preprocessing
│   │   └── temporal_preprocessing.py  # Temporal feature engineering
│   ├── models/
│   │   ├── random_forest.py           # Random Forest model definition
│   │   ├── gradient_boosting.py       # Gradient Boosting model definition
│   │   ├── XGBoost.py                 # XGBoost model definition
│   │   └── CatBoost.py                # CatBoost model definition
│   ├── training/
│   │   ├── train_rf.py                # Training logic for Random Forest
│   │   ├── train_gradboost.py         # Training logic for Gradient Boosting
│   │   ├── XGBoost.py                 # Training logic for XGBoost
│   │   └── CatBoost.py                # Training logic for CatBoost
│   └── utils/
│       └── logging_utils.py           # Logging configuration and setup
├── train_models.sbatch                # Slurm script to launch the model training process on the HPC
├── run_exp.py                         # Used when training models without Docker
└── train_model.py                     # Used when training models in a Docker container
docker/
└── Dockerfile                         # All the necessary configuration for container creation
docker-compose.yml
```

All the configuration related to Docker can be found inside the `docker-compose.yml` file. Memory-related settings might need to be adjusted for your system:
```yaml
resources:
  limits:
    cpus: '12'
    memory: 4G
  reservations:
    memory: 4G
```

- The rest of the configuration should be fine. You can find the Postgres login details under the `environment` variable.
- Copy `postgres/load.sql` to `load_mimic.sql`:

```shell
docker cp postgres/load.sql mimiciv_postgres:load_mimic.sql
```
- Then execute `load_mimic.sql` inside the container:

```shell
docker exec -it mimiciv_postgres psql -U postgres -d mimiciv -f /load_mimic.sql
```
- This will take some time. You can then test it with the following query, which logs in to Postgres and displays data from, for example, the `mimiciv_hosp.admissions` table:

```shell
docker exec mimiciv_postgres psql -U postgres -d mimiciv -c "SELECT * FROM mimiciv_hosp.admissions;"
```
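The same sanity check can also be run from Python instead of `psql`. A minimal sketch, assuming the default credentials from the compose file and the `psycopg2` driver (the project's real connection settings live in `db_utils/db_config.py`, so treat the defaults below as placeholders):

```python
def make_dsn(host="localhost", port=5432, dbname="mimiciv", user="postgres"):
    """Build a libpq-style connection string.

    Host, port, and user are assumptions -- the actual values come from
    db_utils/db_config.py and the compose file's `environment` section.
    """
    return f"host={host} port={port} dbname={dbname} user={user}"


def fetch_admissions(limit=5):
    """Run the same test query as the `docker exec ... psql` example above."""
    import psycopg2  # third-party PostgreSQL driver

    with psycopg2.connect(make_dsn()) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM mimiciv_hosp.admissions LIMIT %s;", (limit,))
            return cur.fetchall()
```
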
- Cohort data must be extracted first from the PostgreSQL database.
- Make sure the `postgres` container is running by doing `docker compose up -d`. This will start all the containers.
- Before running the pipeline, it's important to have your data ready. To extract and prepare the raw data, `scripts/prepare_data.py` needs to be executed. Refer to `scripts/README.md` for more instructions.
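As a rough illustration of what that entry point does, the extraction step chains the two fetch scripts and writes their output under `dataset/raw/`. The function and file names below are hypothetical, not the script's actual API:

```python
from pathlib import Path

# Hypothetical sketch of the orchestration in scripts/prepare_data.py.
# The step names mirror the fetch scripts in scripts/, but the real
# script's functions and output file names may differ.
FETCH_STEPS = ["fetch_demographics_data", "fetch_labevents_data"]


def prepare_raw_data(out_dir="dataset/raw"):
    """Run each fetch step and return the (assumed) output file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # raw parquet files land here
    for step in FETCH_STEPS:
        print(f"running {step} -> {out}")
    return [str(out / f"{step}.parquet") for step in FETCH_STEPS]
```
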
Before starting model training, you must review and adjust the configuration files inside the `configs/` directory.
👉 See `configs/README.md` for detailed instructions on how to set:
- Cohort-specific training data
- Window sizes (7, 14, or 21 days, matching raw data preparation)
- Feature combinations (`x`, `m`, `delta`, or their combinations)
- Model type and hyperparameters
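The constraints above can be sanity-checked before a run. A minimal sketch, assuming illustrative key names (`window_size`, `features`, `model`) that may not match the actual keys in `configs/config.yml`:

```python
# Allowed values, taken from the settings described above.
VALID_WINDOWS = {7, 14, 21}          # days, matching raw data preparation
VALID_FEATURES = {"x", "m", "delta"}


def validate_config(cfg):
    """Raise ValueError if a config dict violates the constraints above.

    Key names are assumptions for illustration; the project's real
    validation is done with Pydantic in src/config/schemas.py.
    """
    if cfg["window_size"] not in VALID_WINDOWS:
        raise ValueError(f"window_size must be one of {sorted(VALID_WINDOWS)}")
    unknown = set(cfg["features"]) - VALID_FEATURES
    if unknown:
        raise ValueError(f"unknown feature(s): {sorted(unknown)}")
    return cfg
```
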
- Every time a new package is added or a change is made to the project, the image must be rebuilt. This creates a `model-training` container including all the packages.
- Using Docker:

```shell
docker compose build --no-cache   # builds the image
```

- Once the image is built, one or more models can be trained in independent containers; separate their names with spaces:
```shell
docker compose up randomforest                    # starts the randomforest-trainer container
docker compose up randomforest xgboost catboost   # starts the randomforest-trainer, xgboost-trainer, and catboost-trainer containers independently
```

- Without using Docker:
```shell
cd project
python run_exp.py
```

- Using `apptainer`/`docker`:
- Exclude unnecessary files before building to avoid bloating the Docker image. Make sure you have a `.dockerignore` file (more info here) at the project root. At minimum, exclude these:
```
# Large datasets
mimiciv/
dataset/
postgres_data/
postgres/

# Logs and temp files
*.log
tmp/

# Experiment outputs
outputs/
```
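To get a feel for which paths those patterns exclude, here is a rough approximation in Python. Note that real `.dockerignore` matching follows Go's `filepath.Match` semantics plus special cases, so `fnmatch` is only an illustration, not a faithful reimplementation:

```python
from fnmatch import fnmatch

# Approximation of the ignore rules above; directory entries are
# rewritten as glob patterns for this sketch.
IGNORE_PATTERNS = [
    "mimiciv/*", "dataset/*", "postgres_data/*", "postgres/*",  # large datasets
    "*.log", "tmp/*",                                           # logs and temp files
    "outputs/*",                                                # experiment outputs
]


def is_ignored(path):
    """Return True if `path` would be excluded from the build context."""
    return any(fnmatch(path, pat) for pat in IGNORE_PATTERNS)
```
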
- Create wheel (`.whl`) files to avoid package dependency conflicts. These wheels are built using the custom `docker/Dockerfile`, which:
  - Ensures consistent builds by installing dependencies only from pre-downloaded files, which is especially useful in HPC environments.
  - Avoids PyPI network calls on HPC clusters.
  - Check out `docker/README.md` for detailed instructions.
```shell
pip download -r requirements.txt -d wheels/
```

- Build the container (`.tar`) file. This step uses the `docker/Dockerfile` to:
  - Install the system libraries required by the ML packages.
  - Copy the `wheels/` directory and install dependencies locally.
  - Package everything into a minimal, secure image for portability.
```shell
docker compose build --no-cache
```

- Copy the `.tar` file to the HPC:
```shell
scp model-training.tar USERNAME@CLUSTER:/home/USERNAME/project_folder/model-training.tar
```

- Create the `.sif` file:
```shell
apptainer build model-training.sif docker-archive://model-training.tar   # create the .sif file
```

- Edit the configuration in `train_models.sbatch`. Uncomment these lines:
```shell
models=("RandomForest" "GradientBoosting" "LogisticRegression" "XGBoost" "CatBoost")
MODEL=${models[$SLURM_ARRAY_TASK_ID]}
apptainer exec --nv -B $PWD:/project model-training.sif \
    python project/train_model.py --model $MODEL
```

- Set the number of jobs.
  Important: based on the number of models you are training, set the value of `#SBATCH --array` between `0` and `4`:

```shell
#SBATCH --array=0-4   # this will run all 5 models
```
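The bash line `MODEL=${models[$SLURM_ARRAY_TASK_ID]}` simply indexes the model list by the array task ID Slurm assigns to each job. The same selection logic, sketched in Python for clarity (the helper name is illustrative, not part of `train_model.py`):

```python
import os

# Same order as the `models=(...)` array in train_models.sbatch.
MODELS = ["RandomForest", "GradientBoosting", "LogisticRegression",
          "XGBoost", "CatBoost"]


def pick_model(env=None):
    """Map SLURM_ARRAY_TASK_ID to a model name, defaulting to index 0."""
    env = os.environ if env is None else env
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    if not 0 <= task_id < len(MODELS):
        raise IndexError(f"#SBATCH --array index must be 0-{len(MODELS) - 1}")
    return MODELS[task_id]
```

With `--array=0-4`, Slurm launches five jobs whose task IDs 0 through 4 each select one model, so all five train in parallel.
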
- Run the Slurm job:

```shell
sbatch -p work train_models.sbatch
```

- Without using `apptainer`/`docker`, uncomment the line:

```shell
python project/run_exp.py
```

and comment out these lines:
```shell
models=("RandomForest" "GradientBoosting" "LogisticRegression" "XGBoost" "CatBoost")
MODEL=${models[$SLURM_ARRAY_TASK_ID]}
apptainer exec --nv -B $PWD:/project model-training.sif \
    python project/train_model.py --model $MODEL
```

- Copying raw data files from local to remote:
```shell
scp -r raw/apl_*.parquet USERNAME@CLUSTER:/home/USERNAME/
```

- More information about the database can be found on the PhysioNet website.
```bibtex
@dataset{johnson2024mimiciv,
  title     = {MIMIC-IV (version 3.1)},
  author    = {Johnson, A. and Bulgarelli, L. and Pollard, T. and Gow, B. and Moody, B. and Horng, S. and Celi, L. A. and Mark, R.},
  year      = {2024},
  publisher = {PhysioNet},
  note      = {RRID:SCR_007345},
  doi       = {10.13026/kpb9-mt58},
  url       = {https://doi.org/10.13026/kpb9-mt58}
}

@article{johnson2023mimiciv,
  title   = {MIMIC-IV, a freely accessible electronic health record dataset},
  author  = {Johnson, A. E. W. and Bulgarelli, L. and Shen, L. and others},
  journal = {Scientific Data},
  volume  = {10},
  number  = {1},
  pages   = {1},
  year    = {2023},
  doi     = {10.1038/s41597-022-01899-x},
  url     = {https://doi.org/10.1038/s41597-022-01899-x}
}

@article{goldberger2000physionet,
  title   = {PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals},
  author  = {Goldberger, A. L. and Amaral, L. A. and Glass, L. and Hausdorff, J. M. and Ivanov, P. C. and Mark, R. G. and others},
  journal = {Circulation},
  volume  = {101},
  number  = {23},
  pages   = {e215--e220},
  year    = {2000},
  doi     = {10.1161/01.CIR.101.23.e215},
  url     = {https://doi.org/10.1161/01.CIR.101.23.e215},
  note    = {RRID:SCR_007345}
}
```