Crazz-Zaac/informative-missingness

Informative Missingness

📁 Project Structure

project/
├── configs/
│   └── config.yml                # Configuration for all the machine learning models
├── dataset/                      # Data loading & preprocessing
│   ├── preprocessed_tabular/     # Stores preprocessed tabular data
│   ├── raw/                      # Stores raw data before preprocessing
│   └── temp/                     # Stores intermediate data (from PostgreSQL)
├── db_utils/                     # Configuration for the PostgreSQL database
│   ├── db_config.py              # Stores DB configuration
│   └── db_setup.py               # Connects to the DB and displays info
├── notebooks/
│   ├── exp_2025.ipynb            # Jupyter notebook for exploration and debugging
│   └── plot_metric_results.ipynb # Plots metric results
├── outputs/                      # Output for all the experiments
│   └── experiments/
│       └── 20250811_222344/      # Created per run, named by date and time
│           ├── logs/             # Stores logs for the different ML models
│           ├── models/           # Logs model training parameters
│           └── results/          # Stores model results
├── scripts/                      # Scripts that prepare raw data by querying PostgreSQL
│   ├── fetch_demographics_data.py # Fetches demographics data from the DB and merges the target file
│   ├── fetch_labevents_data.py   # Fetches labevents data from the DB
│   └── prepare_data.py           # Main entry point to the raw data extraction pipeline
├── src/
│   ├── config/
│   │   └── schemas.py            # Pydantic validation for all classes, methods, and data types
│   ├── data/                     # Data handling and preprocessing
│   │   ├── data_loader.py        # Loads data from sources
│   │   ├── dataset.py            # Dataset logic and split handling
│   │   ├── data_preprocessing.py # Data preprocessing method definitions
│   │   ├── tabular_data_preprocessor.py # Tabular data preprocessing
│   │   └── temporal_preprocessing.py    # Temporal feature engineering
│   ├── models/
│   │   ├── random_forest.py      # Random Forest model definition
│   │   ├── gradient_boosting.py  # Gradient Boosting model definition
│   │   ├── XGBoost.py            # XGBoost model definition
│   │   └── CatBoost.py           # CatBoost model definition
│   ├── training/
│   │   ├── train_rf.py           # Training logic for Random Forest
│   │   ├── train_gradboost.py    # Training logic for Gradient Boosting
│   │   ├── XGBoost.py            # Training logic for XGBoost
│   │   └── CatBoost.py           # Training logic for CatBoost
│   └── utils/
│       └── logging_utils.py      # Logging configuration and setup
├── train_models.sbatch           # Slurm script to submit the model training jobs on the HPC
├── run_exp.py                    # Used when training models without Docker
└── train_model.py                # Used when training models in a Docker container
docker/
└── Dockerfile                    # All configuration needed during container creation
docker-compose.yml                # Docker container build configuration

Codebase at a glance

(diagram: codebase overview)

Docker configuration

All Docker-related configuration can be found in the docker-compose.yml file. The CPU and memory limits may need to be adjusted for your system (in docker-compose.yml these sit under the service's deploy.resources key):

deploy:
  resources:
    limits:
      cpus: '12'
      memory: 4G
    reservations:
      memory: 4G
  • The rest of the configuration should be fine as-is. The Postgres login details are defined under the environment key in docker-compose.yml.

Loading the data to postgres docker container

  • Copy postgres/load.sql into the container as /load_mimic.sql
    • docker cp postgres/load.sql mimiciv_postgres:/load_mimic.sql
  • Then execute load_mimic.sql inside the container
    • docker exec -it mimiciv_postgres psql -U postgres -d mimiciv -f /load_mimic.sql
  • This will take some time. You can then verify the load by querying the data, for example from the mimiciv_hosp.admissions table:
    • docker exec mimiciv_postgres psql -U postgres -d mimiciv -c "SELECT * FROM mimiciv_hosp.admissions LIMIT 10;"

Data Extraction

  • The cohort data must be extracted from the PostgreSQL database first.
  • Make sure the Postgres container is running with docker compose up -d (this starts all the containers).

Raw data extraction process pipeline

  • Before running the pipeline, make sure your data is ready. To extract and prepare the raw data, execute scripts/prepare_data.py. Refer to scripts/README.md for more instructions.

Configurations

Before starting model training, you must review and adjust the configuration files inside the configs/ directory.

👉 See the configs/README.md for detailed instructions on how to set:

  • Cohort-specific training data
  • Window sizes (7, 14, or 21 days, matching raw data preparation)
  • Feature combinations (x, m, delta, or their combinations)
  • Model type and hyperparameters

⚠️ Incorrect configuration will result in invalid or inconsistent experiments.
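To make the feature combinations concrete, here is a minimal sketch of one common reading of "x, m, delta" in missingness modeling: x is the observed value, m is a binary missingness mask, and delta is the number of steps since the value was last observed. This is an illustration only; the authoritative definitions live in src/data/temporal_preprocessing.py.

```python
def build_features(values):
    """Derive (x, m, delta) triples from a raw value sequence.

    values: list of observations, with None marking a missing entry.
    Returns one (x, m, delta) tuple per time step, where
      x     - the raw value (None when missing),
      m     - 1 if the value is missing at this step, else 0,
      delta - steps elapsed since the last observed value.
    """
    features = []
    delta = 0
    for v in values:
        m = 1 if v is None else 0
        features.append((v, m, delta))
        # After an observation the gap resets to 1; while missing it keeps growing.
        delta = 1 if m == 0 else delta + 1
    return features


# Example: a value observed, then missing for two steps, then observed again.
print(build_features([7, None, None, 5]))
# → [(7, 0, 0), (None, 1, 1), (None, 1, 2), (5, 0, 3)]
```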


Running the pipeline locally

  • Whenever a new package is added or the project is changed, the image must be re-built. This creates a model-training image that includes all the packages.
  1. Using Docker
docker compose build --no-cache             # builds the image
  • Once the image is built, one or more models can be trained in independent containers by listing the service names separated by spaces.
docker compose up randomforest                      # starts the randomforest-trainer container
docker compose up randomforest xgboost catboost     # starts the randomforest-trainer, xgboost-trainer, and catboost-trainer containers independently
  2. Without using Docker
cd project
python run_exp.py

Running the pipeline in HPC server

  1. Using apptainer/docker
  • Exclude unnecessary files before building to avoid bloating the Docker image. Make sure you have a .dockerignore file at the project root. At minimum, exclude these:
# Large dataset
mimiciv/
dataset/
postgres_data/
postgres/

# Logs and temp files
*.log
tmp/

# Experiment outputs
outputs/
  • Create wheel (.whl) files to avoid package dependency conflicts. The wheels are downloaded with pip and installed by the custom docker/Dockerfile, which:
    • Ensures consistent builds by installing dependencies only from pre-downloaded files, which is especially useful in an HPC environment.
    • Avoids PyPI network calls on HPC clusters.
  • Check out docker/README.md for detailed instructions.
pip download -r requirements.txt -d wheels/
  • Build the container (.tar) file. This step uses the docker/Dockerfile to:
    • Install the system libraries required by the ML packages.
    • Copy the wheels/ directory and install dependencies locally.
    • Package everything into a minimal, secure image for portability.
docker compose build --no-cache
  • Export the built image to a .tar archive (the image name below is illustrative; use the name from your compose file):
docker save -o model-training.tar model-training
  • Copy the .tar file to the HPC
scp model-training.tar USERNAME@CLUSTER:/home/USERNAME/project_folder/model-training.tar
  • Create the .sif file
apptainer build model-training.sif docker-archive://model-training.tar          # creates the .sif file
  • Edit the configuration in train_models.sbatch and uncomment these lines:
models=("RandomForest" "GradientBoosting" "LogisticRegression" "XGBoost" "CatBoost")
MODEL=${models[$SLURM_ARRAY_TASK_ID]}

apptainer exec --nv -B $PWD:/project model-training.sif \
    python project/train_model.py --model $MODEL
  • Set the number of jobs. Important: set the value of #SBATCH --array according to the number of models you are training (indices 0-4); the range 0-4 runs all five models.
#SBATCH --array=0-4
  • Run the slurm job
sbatch -p work train_models.sbatch
  2. Without using apptainer/docker: uncomment the line:
python project/run_exp.py

and comment out these lines:

models=("RandomForest" "GradientBoosting" "LogisticRegression" "XGBoost" "CatBoost")
MODEL=${models[$SLURM_ARRAY_TASK_ID]}

apptainer exec --nv -B $PWD:/project model-training.sif \
    python project/train_model.py --model $MODEL
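The sbatch snippet above maps each Slurm array index to one model via bash array indexing. The same lookup, written in Python for illustration (model_for_task is a hypothetical helper, not part of the repository; Slurm sets SLURM_ARRAY_TASK_ID for each array task at runtime):

```python
import os

# Same order as the bash array in train_models.sbatch.
MODELS = ["RandomForest", "GradientBoosting", "LogisticRegression",
          "XGBoost", "CatBoost"]


def model_for_task(env=None):
    """Pick the model for this array task from SLURM_ARRAY_TASK_ID."""
    env = os.environ if env is None else env
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    return MODELS[task_id]


# Array task 3 would train XGBoost.
print(model_for_task({"SLURM_ARRAY_TASK_ID": "3"}))  # → XGBoost
```

With #SBATCH --array=0-4, Slurm launches five such tasks, one per index, so each model trains in its own job.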

Helpful commands

  • Copying raw data files from local to remote
scp -r raw/apl_*.parquet USERNAME@CLUSTER:/home/USERNAME/

Results

✔️ Aplasia Cohort
  • Clinical Target (plot)
  • Gender (plot)
  • Race (plot)
  • Age (plot)

✔️ Neutropenic Fever Cohort
  • Clinical Target (plot)
  • Gender (plot)
  • Race (plot)
  • Age (plot)


MIMIC-IV database citations

@dataset{johnson2024mimiciv,
  title        = {MIMIC-IV (version 3.1)},
  author       = {Johnson, A. and Bulgarelli, L. and Pollard, T. and Gow, B. and Moody, B. and Horng, S. and Celi, L. A. and Mark, R.},
  year         = {2024},
  publisher    = {PhysioNet},
  note         = {RRID:SCR_007345},
  doi          = {10.13026/kpb9-mt58},
  url          = {https://doi.org/10.13026/kpb9-mt58}
}

@article{johnson2023mimiciv,
  title        = {MIMIC-IV, a freely accessible electronic health record dataset},
  author       = {Johnson, A. E. W. and Bulgarelli, L. and Shen, L. and others},
  journal      = {Scientific Data},
  volume       = {10},
  number       = {1},
  pages        = {1},
  year         = {2023},
  doi          = {10.1038/s41597-022-01899-x},
  url          = {https://doi.org/10.1038/s41597-022-01899-x}
}

@article{goldberger2000physionet,
  title        = {PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals},
  author       = {Goldberger, A. L. and Amaral, L. A. and Glass, L. and Hausdorff, J. M. and Ivanov, P. C. and Mark, R. G. and others},
  journal      = {Circulation},
  volume       = {101},
  number       = {23},
  pages        = {e215--e220},
  year         = {2000},
  doi          = {10.1161/01.CIR.101.23.e215},
  url          = {https://doi.org/10.1161/01.CIR.101.23.e215},
  note         = {RRID:SCR_007345}
}

About

This project trains classifiers (Random Forest, Gradient Boosting, CatBoost, XGBoost, and Logistic Regression) on the MIMIC-IV dataset to predict the missingness (MNAR) of data within a given interval. The full pipeline is described in this README.
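As a rough illustration of the prediction target, a per-window missingness label can be derived from the times at which a lab was actually measured: a window is labeled 1 ("missing") when no measurement falls inside it. This is a sketch under my own assumptions (the helper missingness_labels is hypothetical; the repository's actual labeling logic lives in its preprocessing code):

```python
def missingness_labels(measurement_hours, horizon_hours, window_hours):
    """Label each window in [0, horizon_hours) as missing (1) or observed (0).

    measurement_hours: hours (from admission) at which the lab was measured.
    horizon_hours:     total span to cover, split into consecutive windows.
    window_hours:      width of each window.
    """
    labels = []
    for start in range(0, horizon_hours, window_hours):
        end = start + window_hours
        observed = any(start <= t < end for t in measurement_hours)
        labels.append(0 if observed else 1)
    return labels


# Measurements at hours 1 and 30, over 72 hours in 24-hour windows:
# windows [0,24) and [24,48) contain a measurement, [48,72) does not.
print(missingness_labels([1, 30], 72, 24))  # → [0, 0, 1]
```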
