This repository contains a four-stage pipeline for training and evaluating a multiscale histology-to-expression model on paired Visium HD and Visium V2 data.
Maintained by Hur Lab, University of North Dakota, under Dr. Junguk Hur.
| File | Purpose |
|---|---|
| run_pipeline.py | Wrapper that runs the full pipeline stage by stage. |
| model_defination.py | Model architecture and shared helper functions. This is imported by Stages 1-3, not run as a pipeline stage. |
| Stage_1_FinalMultiscale_Training.py | Trains the background-aware HD/V2 model and writes the Stage 2 checkpoint. |
| Stage_2_Inference_V2_Correlation.py | Runs V2 multiscale inference and writes spot/gene correlation outputs. |
| Stage_3_GeneLevel_Analysis.py | Computes additional gene-level metrics such as RMSE and mutual information. |
| Stage_4_Plotting_Figures.py | Generates publication-style figures from Stage 2 outputs. |
| requirements.txt | Python package list for pip-based setup. |
| environment.yml | Conda environment template. |
| docs/DATA_LAYOUT.md | Expected data and model directory layout. |
| LICENSE | MIT License under Hur Lab, University of North Dakota, Dr. Junguk Hur. |
| CITATION.cff | Citation metadata for GitHub and citation managers. |
| AUTHORS.md | Maintainer and authorship information. |
| examples/ | Example commands for full-pipeline and inference-only runs. |
| input/ | Placeholder for local datasets and UNI weights. Contents are ignored by Git. |
| output/ | Placeholder for generated checkpoints, CSVs, and figures. Contents are ignored by Git. |
The stage scripts import model_defination.py for the neural network architecture, model constants, masks, checkpoint loading, and image patch helpers. It is a shared module, not a pipeline stage, and is never run on its own. A copy is included in this repository, and the stage scripts use that local copy by default.
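As a rough illustration of how a stage script could locate the module, the sketch below loads model_defination.py from an explicit folder, then from the HE2HD_TRAINING_SCRIPT_DIR environment variable, then from the current working directory. The function name load_model_module and this lookup order are assumptions made for the example; the actual stage scripts may resolve the module differently.

```python
import importlib.util
import os
import sys
from pathlib import Path

MODULE_FILE = "model_defination.py"  # file name as spelled in this repository


def load_model_module(script_dir=None):
    """Import model_defination.py from a folder and return the module object.

    Assumed lookup order (hypothetical): explicit argument, then the
    HE2HD_TRAINING_SCRIPT_DIR environment variable, then the current directory.
    """
    folder = script_dir or os.environ.get("HE2HD_TRAINING_SCRIPT_DIR") or os.getcwd()
    path = Path(folder) / MODULE_FILE
    if not path.is_file():
        raise FileNotFoundError(f"{MODULE_FILE} not found in {folder}")
    spec = importlib.util.spec_from_file_location("model_defination", path)
    module = importlib.util.module_from_spec(spec)
    # Register the module before executing it so nested imports can find it.
    sys.modules["model_defination"] = module
    spec.loader.exec_module(module)
    return module
```

Loading by file path rather than editing sys.path keeps the override explicit: the folder passed in is the only place the module can come from.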
If you want to use a different model_defination.py, pass the folder that contains it:
--training-script-dir /path/to/folder/containing/model_defination.py
The pipeline expects all raw inputs under input/. UNI weights should be placed at:
<root>/input/UNI/pytorch_model.bin
Large datasets, model weights, checkpoints, and generated outputs should not be committed to GitHub. See .gitignore.
Create an environment with Python 3.10 or newer.
Using conda:
conda env create -f environment.yml
conda activate he2hd-visium
Using pip:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
For GPU training, install the PyTorch build that matches your CUDA version if the default package resolver does not select the right build.
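A quick way to confirm the environment after installation is to check the interpreter version and ask PyTorch which CUDA build it was compiled against. The helper names below are illustrative and not part of this repository.

```python
import sys


def python_ok(minimum=(3, 10)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


def torch_summary():
    """Describe the installed PyTorch build, or report that it is missing."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    cuda_build = torch.version.cuda  # None for CPU-only builds
    return (
        f"torch {torch.__version__}, built for CUDA {cuda_build}, "
        f"CUDA available: {torch.cuda.is_available()}, "
        f"visible GPUs: {torch.cuda.device_count()}"
    )


print("Python >= 3.10:", python_ok())
print(torch_summary())
```

If the summary reports a CPU-only build or no available GPU on a GPU machine, reinstall PyTorch with the build that matches the installed CUDA version.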
By default, the project root is:
/home/sayed.asaduzzaman/Project-Gene_ImgtoExp
You can override it with:
--root /path/to/Project-Gene_ImgtoExp
The expected layout is documented in docs/DATA_LAYOUT.md.
The wrapper creates input/ and output/ automatically if they do not exist. Put datasets and weights in input/; generated checkpoints, CSV files, and figures go to output/.
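The directory handling described above can be sketched in a few lines. This is an assumed reconstruction, not the wrapper's actual code; the function name prepare_dirs is hypothetical, and the only behavior taken from this README is that input/ and output/ default to subfolders of the root and are created when missing.

```python
from pathlib import Path

DEFAULT_ROOT = "/home/sayed.asaduzzaman/Project-Gene_ImgtoExp"


def prepare_dirs(root=DEFAULT_ROOT, input_dir=None, output_dir=None):
    """Resolve input/output folders under root and create them if needed."""
    root = Path(root)
    # Explicit overrides win; otherwise fall back to <root>/input and <root>/output.
    input_path = Path(input_dir) if input_dir else root / "input"
    output_path = Path(output_dir) if output_dir else root / "output"
    for folder in (input_path, output_path):
        folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return input_path, output_path
```

For example, prepare_dirs("/path/to/Project-Gene_ImgtoExp") would return the two folders under that root, creating them on first use.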
Preview the commands without running them:
python run_pipeline.py --dry-run
Run training, inference, analysis, and plotting:
python run_pipeline.py \
--root /home/sayed.asaduzzaman/Project-Gene_ImgtoExp \
--input-dir /home/sayed.asaduzzaman/Project-Gene_ImgtoExp/input \
--output-dir /home/sayed.asaduzzaman/Project-Gene_ImgtoExp/output \
--train-launcher torchrun \
--nproc-per-node 3 \
--cuda-visible-devices 0,2,3
The same command is available as examples/run_full_pipeline.sh.
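In the command above, --nproc-per-node 3 matches the three GPU indices listed in --cuda-visible-devices. The sketch below shows one plausible way a wrapper could assemble the training launch and check that pairing. The function names and the plain python fallback launcher are assumptions for illustration, and the real run_pipeline.py may build its command differently.

```python
import os
import shlex


def build_training_command(script, launcher="torchrun", nproc_per_node=1):
    """Return the argument list used to launch the training script."""
    if launcher == "torchrun":
        return ["torchrun", f"--nproc_per_node={nproc_per_node}", script]
    # Assumed fallback: run the script in a single process.
    return ["python", script]


def training_env(cuda_visible_devices=None):
    """Copy the current environment and optionally restrict the visible GPUs."""
    env = dict(os.environ)
    if cuda_visible_devices:
        env["CUDA_VISIBLE_DEVICES"] = cuda_visible_devices
    return env


def enough_devices(nproc_per_node, cuda_visible_devices):
    """True if each worker process can be given its own visible GPU."""
    devices = [d for d in cuda_visible_devices.split(",") if d.strip()]
    return nproc_per_node <= len(devices)


cmd = build_training_command("Stage_1_FinalMultiscale_Training.py", nproc_per_node=3)
# A dry run would stop here and only print the command.
print(shlex.join(cmd))
```

Requesting more processes than visible devices typically makes a multi-GPU launch fail at startup, so a check like enough_devices is cheap insurance before a long run.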
Skip training and use an existing checkpoint:
python run_pipeline.py \
--skip-training \
--ckpt /home/sayed.asaduzzaman/Project-Gene_ImgtoExp/output/HE2HD_HDV2_BG_AWARE_TRAIN_ALL_PAIRED/stage2_all_genes/best_model.pt
The same pattern is available as examples/run_inference_only.sh.
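One way to picture --skip-training, together with the --only and --start-at/--stop-after options described next, is as a filter over an ordered list of stages. The sketch below is a hypothetical model of that selection logic. The stage name "training" and the function select_stages are assumptions, and the wrapper's real precedence rules may differ.

```python
STAGES = ["training", "inference", "analysis", "plotting"]  # assumed stage names


def select_stages(only=None, start_at=None, stop_after=None, skip_training=False):
    """Return the stages to run, always in pipeline order."""
    if only:
        unknown = set(only) - set(STAGES)
        if unknown:
            raise ValueError(f"unknown stages: {sorted(unknown)}")
        # Keep pipeline order regardless of the order given on the command line.
        selected = [s for s in STAGES if s in only]
    else:
        first = STAGES.index(start_at) if start_at else 0
        last = STAGES.index(stop_after) if stop_after else len(STAGES) - 1
        if first > last:
            raise ValueError("start stage comes after stop stage")
        selected = STAGES[first:last + 1]
    if skip_training:
        selected = [s for s in selected if s != "training"]
    return selected
```

Under this model, select_stages(skip_training=True) and select_stages(start_at="inference") describe the same three-stage run.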
Run only selected stages:
python run_pipeline.py --only inference analysis plotting
or:
python run_pipeline.py --start-at inference --stop-after plotting
Control inference sample count and multiscale bins:
python run_pipeline.py \
--skip-training \
--n_spots 5000 \
--scales 16 8 2 \
--mi_bins 20
Control training size and schedule:
python run_pipeline.py \
--stage1-epochs 12 \
--stage2-epochs 20 \
--stage1-batch-size 12 \
--stage2-batch-size 8 \
--max-hd-16-samples 100000 \
--max-hd-8-samples 400000 \
--max-hd-2-samples 6400000 \
--max-v2-samples 5000
Control patch selection and blank suppression behavior:
python run_pipeline.py \
--skip-training \
--min_overlap 0.35 \
--support_scale 2.5 \
--grid_shift_fraction 0.0 \
--use_gate_weight \
--disable_graph_smooth
With the default root, outputs are written to:
<root>/output/HE2HD_HDV2_BG_AWARE_TRAIN_ALL_PAIRED/stage2_all_genes/best_model.pt
<root>/output/V2_STRICTBLANK_V3_ALL/
<root>/output/V2_STRICTBLANK_V3_ALL/gene_level_metrics/
<root>/output/V2_STRICTBLANK_V3_ALL/figures/
- The wrapper sets shared paths through HE2HD_ROOT, HE2HD_INPUT_DIR, HE2HD_OUTPUT_DIR, and HE2HD_TRAINING_SCRIPT_DIR.
- Stage 1 uses fixed random seeds internally and supports environment overrides exposed by run_pipeline.py.
- The exact model architecture and helper functions are provided by model_defination.py; keep this file versioned with the stage scripts.
- Keep a record of the command printed by run_pipeline.py, the git commit hash, and package versions for each reported experiment.
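The record-keeping advice in the last note can be automated with a small helper. The sketch below gathers the command string, the current git commit, and selected package versions into a dictionary that can be saved as JSON next to the results. The function names and the default package list (torch, numpy) are assumptions; substitute the packages from requirements.txt that matter for your experiment.

```python
import json
import subprocess
import sys
from datetime import datetime, timezone
from importlib import metadata


def git_commit():
    """Return the current commit hash, or None outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def package_versions(names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def run_record(command, packages=("torch", "numpy")):
    """Collect the details worth storing with each reported experiment."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "git_commit": git_commit(),
        "python": sys.version.split()[0],
        "packages": package_versions(packages),
    }


record = run_record("python run_pipeline.py --dry-run")
print(json.dumps(record, indent=2))
```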
This repository is released under the MIT License. Copyright belongs to Hur Lab, University of North Dakota, Dr. Junguk Hur. See LICENSE.
Before launching an expensive run:
python run_pipeline.py --dry-run --skip-training
Then verify that the printed paths point to the intended root, checkpoint, V2 data, and output folders.
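The manual check can be complemented by a small script that tests whether the expected files and folders exist. The relative paths below are copied from this README; the preflight function itself is a hypothetical helper, and it does not cover the V2 data, whose layout is documented in docs/DATA_LAYOUT.md.

```python
from pathlib import Path

# Relative locations as documented in this README.
CKPT_REL = "output/HE2HD_HDV2_BG_AWARE_TRAIN_ALL_PAIRED/stage2_all_genes/best_model.pt"
UNI_REL = "input/UNI/pytorch_model.bin"


def preflight(root, ckpt=None):
    """Return a dict reporting which expected paths exist under root."""
    root = Path(root)
    targets = {
        "root": root,
        "input": root / "input",
        "output": root / "output",
        "uni_weights": root / UNI_REL,
        # An explicit checkpoint path overrides the default location.
        "checkpoint": Path(ckpt) if ckpt else root / CKPT_REL,
    }
    return {name: path.exists() for name, path in targets.items()}
```

Calling preflight("/path/to/Project-Gene_ImgtoExp") and printing the result gives a one-glance view of anything missing before GPU time is spent.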