GRASP: Gene-relation adaptive soft prompt for scalable and generalizable gene network inference with large language models
This repository contains the full fine-tuning, inference, and evaluation pipeline for GRASP. Instead of directly querying LLMs with gene pairs, GRASP introduces relation-aware soft prompt tokens that adapt to gene roles and interaction types. If you have any questions or suggestions, please contact yuqiang@umich.edu or gyuanfan@umich.edu.
git clone https://github.com/GuanLab/GRASP.git
cd GRASP
conda create -n grasp_env python=3.11 -y
conda activate grasp_env
python -m pip install --upgrade pip
conda install -c conda-forge -y \
numpy=1.26.4 pandas=2.0.3 scipy=1.11.4 scikit-learn=1.5.2 \
matplotlib=3.10.0 seaborn=0.13.2 numba=0.61.0 pyarrow=18.1.0 \
python-igraph=0.11.8 lightgbm=4.6.0 distributed=2023.5.0 scanpy=1.11.0
pip install \
accelerate==1.4.0 arboreto==0.1.6 causal_learn==0.1.4.0 cdt==0.6.0 \
einops==0.8.2 gdown==5.2.0 huggingface_hub==0.34.3 mygene==3.2.2 \
scprep==1.2.3 pytorch_lightning==2.5.0.post0 safetensors==0.4.5 seaborn==0.13.2 \
slingpy==0.2.12 torch==2.5.1 tqdm==4.67.1 transformers==4.51.3 trl==0.15.2
pip install --no-build-isolation flash-attn==2.7.2.post1
Alternative: install flash-attn from a prebuilt wheel to avoid compilation issues.
pip install --no-deps \
"https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.2.post1/flash_attn-2.7.2.post1+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl"
We (i) download experimentally validated interaction pairs from public resources, (ii) construct balanced datasets by randomly sampling synthetic negative pairs that do not overlap with any curated positive pair, and (iii) export each dataset to a plain-text file with the format:
gene1_symbol gene2_symbol label
where label ∈ {0,1} indicates non-interaction versus interaction.
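The balanced-sampling step can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function names `build_balanced_pairs` and `format_rows` are hypothetical, pairs are assumed to be unordered (which may not hold for directed kinase–substrate pairs), negatives are drawn only from genes that appear in the positives, and a single space is assumed as the column separator.

```python
import random


def build_balanced_pairs(positives, seed=0):
    """Return (gene1, gene2, label) rows with one sampled negative per positive.

    Assumption: pairs are unordered, so (A, B) and (B, A) are the same pair.
    """
    rng = random.Random(seed)
    positive_keys = {frozenset(pair) for pair in positives}
    genes = sorted({gene for pair in positives for gene in pair})

    # Cap the target so rejection sampling cannot loop forever on tiny inputs.
    n_possible = len(genes) * (len(genes) - 1) // 2
    n_taken = sum(1 for key in positive_keys if len(key) == 2)
    target = min(len(positives), n_possible - n_taken)

    negatives = {}
    while len(negatives) < target:
        g1, g2 = rng.sample(genes, 2)  # two distinct genes
        key = frozenset((g1, g2))
        if key not in positive_keys:  # never overlap a curated positive
            negatives.setdefault(key, (g1, g2))

    rows = [(g1, g2, 1) for g1, g2 in positives]
    rows += [(g1, g2, 0) for g1, g2 in negatives.values()]
    return rows


def format_rows(rows):
    """Render rows as 'gene1_symbol gene2_symbol label' lines."""
    return "\n".join(f"{g1} {g2} {label}" for g1, g2, label in rows)
```

Passing a fixed `seed` makes the sampled negatives reproducible, which is what allows several independent datasets to be regenerated later.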
For reproducibility, we generate five independent dataset splits (using different random seeds) and convert them to Hugging Face datasets:
- Human PPI: total_human_dataset_1 … total_human_dataset_5 under GRASP/data/PPI/
- Kinase–substrate (PhosphoNetworks): strict_KSR_dataset_1 … strict_KSR_dataset_5 under GRASP/data/PhosphoNetwork/
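As a rough sketch of how seed-controlled splits can be produced, the snippet below shuffles the same rows once per seed and cuts each shuffle into train and test portions. The function name `make_seeded_splits`, the seed values, and the 80/20 ratio are all assumptions for illustration; the repository's actual split procedure and ratios are not described here.

```python
import random


def make_seeded_splits(rows, seeds=(1, 2, 3, 4, 5), test_fraction=0.2):
    """Return {seed: {"train": [...], "test": [...]}} with one split per seed."""
    splits = {}
    for seed in seeds:
        shuffled = list(rows)  # copy so the caller's list is left untouched
        random.Random(seed).shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        splits[seed] = {"test": shuffled[:n_test], "train": shuffled[n_test:]}
    return splits
```

Each resulting split could then be wrapped with the Hugging Face `datasets` library (for example `Dataset.from_list` followed by `save_to_disk`), although whether the pipeline does exactly this is an assumption.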
2. BioContext-informed datasets (augmented text files from Step 1 + conversion to Hugging Face datasets)
To construct BioContext-informed inputs, we augment the original PPI/KSR text files by retrieving gene functional summaries from MyGene.info (https://mygene.info) and appending two additional columns:
gene1_symbol gene2_symbol label gene1_summary gene2_summary
Gene summaries are queried in bulk using the MyGene API (via querymany), and the returned summary field is used as the gene description. The resulting BioContext-informed datasets are exported and saved as Hugging Face datasets under:
- GRASP/data/biocontext_informed_data/total_human_with_summaries_dataset_1 … total_human_with_summaries_dataset_5
- GRASP/data/biocontext_informed_data/strict_kinase_substrate_with_summaries_dataset_1 … strict_kinase_substrate_with_summaries_dataset_5
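The augmentation step might look roughly like the sketch below. The helper names (`fetch_summaries`, `summaries_from_hits`, `add_summaries`) are hypothetical, and the choices of `scopes="symbol"`, `species="human"`, and an empty string for missing summaries are assumptions rather than the repository's confirmed settings.

```python
def fetch_summaries(symbols):
    """Bulk-query MyGene.info and return {symbol: summary} (needs network access)."""
    import mygene  # imported lazily so the helpers below also work offline

    client = mygene.MyGeneInfo()
    hits = client.querymany(
        sorted(set(symbols)), scopes="symbol", fields="summary", species="human"
    )
    return summaries_from_hits(hits)


def summaries_from_hits(hits):
    """Collapse querymany results into a symbol -> summary mapping."""
    summaries = {}
    for hit in hits:
        # Unmatched queries are flagged "notfound"; some genes have no summary.
        if hit.get("notfound") or "summary" not in hit:
            continue
        # A symbol can match several records; keep the first one returned.
        summaries.setdefault(hit["query"], hit["summary"])
    return summaries


def add_summaries(rows, summaries, missing=""):
    """Append gene1_summary and gene2_summary to (gene1, gene2, label) rows."""
    return [
        (g1, g2, label, summaries.get(g1, missing), summaries.get(g2, missing))
        for g1, g2, label in rows
    ]
```

Splitting the network call from the row augmentation keeps the second half easy to test with a precomputed summary dictionary.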
CausalBench single-cell perturbation datasets are automatically downloaded during evaluation. The dataset download URLs are defined in the CausalBench codebase; they can also be inspected in:
causalscbench/data_access/datasets/download_evaluation_files.py
within the CausalBench GitHub repository: https://github.com/causalbench/causalbench.
We organize model weights under GRASP/checkpoints/ with three subfolders:
- base/: original models
- continual_pretrained/: domain-adapted models after continual pretraining on PubMed titles/abstracts
- fine_tuned/: fine-tuned models
A typical layout looks like this:
GRASP/checkpoints/
├── base/
│ ├── llama-3.1-8b-instruct/
│ └── gemma-3-4b-it/
├── continual_pretrained/
│ ├── Llama-3.1-8B-Instruct-continual-gene/
│ └── gemma-3-4b-it-continual-gene/
└── fine_tuned/
└── ...
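Before launching training, it can help to confirm that the checkpoints sit where the scripts expect them. The following is a small convenience sketch (the function `list_checkpoints` is hypothetical and not part of the repository) that lists the model directories found in each of the three subfolders.

```python
from pathlib import Path

CHECKPOINT_GROUPS = ("base", "continual_pretrained", "fine_tuned")


def list_checkpoints(root):
    """Return {subfolder: [model directory names]} for a checkpoints root."""
    root = Path(root)
    found = {}
    for group in CHECKPOINT_GROUPS:
        group_dir = root / group
        if group_dir.is_dir():
            found[group] = sorted(p.name for p in group_dir.iterdir() if p.is_dir())
        else:
            found[group] = []  # a missing subfolder is reported as empty
    return found
```

For example, `list_checkpoints("GRASP/checkpoints")` would report an empty list for any subfolder that has not been populated yet.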
After preparing the datasets and placing the base or continual-pretrained model checkpoints under GRASP/checkpoints/, training and evaluation can be launched directly using the provided shell scripts.
All key hyperparameters (e.g., number of epochs, batch size, dataset path, and model backbone) can be modified inside the corresponding .sh files.
Below we provide an example of training and testing GRASP on the KSR dataset.
To train GRASP on the KSR dataset:
bash scripts/train_ksr/run_ksr_gene_soft_prompt.sh
Fine-tuned models are stored under:
GRASP/checkpoints/fine_tuned/
After training completes, evaluation can be performed using:
bash scripts/test/test_gene_soft_prompt.sh
All evaluation results are saved under:
GRASP/results/KSR/
Example output structure:
GRASP/results/KSR/gemma_base/1_gemma_base_KSR_gene_soft_prompt
├── confusion_matrix.png
├── gene_predictions.txt
├── prc_auc_curve.png
├── roc_auc_curve.png
└── test_results.json
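The output file names suggest standard binary-classification summaries (a confusion matrix and ROC/PR curves). As an illustration of what such numbers mean, the sketch below computes confusion counts and ROC AUC from plain lists of labels and predicted scores. It deliberately does not parse gene_predictions.txt or test_results.json, since their formats are not described here; the function names and the 0.5 threshold are assumptions.

```python
def confusion_counts(labels, scores, threshold=0.5):
    """Count TP/FP/TN/FN after thresholding scores (score >= threshold -> 1)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1:
            counts["tp" if label == 1 else "fp"] += 1
        else:
            counts["fn" if label == 1 else "tn"] += 1
    return counts


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. This pairwise form is O(P * N) and is meant
    for clarity rather than speed.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In practice, scikit-learn's `roc_auc_score` and `average_precision_score` would be the usual choice for large prediction files.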