
GRASP: Gene-relation adaptive soft prompt for scalable and generalizable gene network inference with large language models

This repository contains the full fine-tuning, inference, and evaluation pipeline for GRASP. Instead of directly querying LLMs with gene pairs, GRASP introduces relation-aware soft prompt tokens that adapt to gene roles and interaction types. If you have any questions or suggestions, please contact yuqiang@umich.edu or gyuanfan@umich.edu.

Environment

1. Clone the repository

git clone https://github.com/GuanLab/GRASP.git
cd GRASP

2. Create and activate the conda environment

conda create -n grasp_env python=3.11 -y
conda activate grasp_env
python -m pip install --upgrade pip

3. Install core scientific dependencies (via conda-forge)

conda install -c conda-forge -y \
  numpy=1.26.4 pandas=2.0.3 scipy=1.11.4 scikit-learn=1.5.2 \
  matplotlib=3.10.0 seaborn=0.13.2 numba=0.61.0 pyarrow=18.1.0 \
  python-igraph=0.11.8 lightgbm=4.6.0 distributed=2023.5.0 scanpy=1.11.0

4. Install remaining Python packages (via pip)

pip install \
  accelerate==1.4.0 arboreto==0.1.6 causal_learn==0.1.4.0 cdt==0.6.0 \
  einops==0.8.2 gdown==5.2.0 huggingface_hub==0.34.3 mygene==3.2.2 \
  scprep==1.2.3 pytorch_lightning==2.5.0.post0 safetensors==0.4.5 seaborn==0.13.2 \
  slingpy==0.2.12 torch==2.5.1 tqdm==4.67.1 transformers==4.51.3 trl==0.15.2
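After step 4 it can be useful to confirm that the pinned versions are the ones actually installed. The following is a minimal sketch, not part of the repository: the helper name check_versions and the choice of packages to check are illustrative assumptions, and any local build tag (for example a "+cu..." suffix) is ignored when comparing.

```python
from importlib import metadata

# A few of the pins from step 4; extend as needed (the selection is illustrative).
PINNED = {
    "torch": "2.5.1",
    "transformers": "4.51.3",
    "trl": "0.15.2",
    "accelerate": "1.4.0",
}


def check_versions(pinned):
    """Return {package: (wanted, found, matches)} for each pinned package."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        # Drop a local version tag such as "+cu124" before comparing.
        public = found.split("+")[0] if found is not None else None
        report[name] = (wanted, found, public == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in check_versions(PINNED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {found} [{status}]")
```

Run it inside the activated grasp_env environment; a MISMATCH line points at a package that was installed at a different version or not at all.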

5. Install FlashAttention2

pip install --no-build-isolation flash-attn==2.7.2.post1

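Alternative: install from a prebuilt wheel to avoid compilation issues. The wheel below targets CUDA 12, PyTorch 2.5, and CPython 3.11 on Linux x86_64, which matches the environment created above.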

pip install --no-deps \
"https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.2.post1/flash_attn-2.7.2.post1+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl"

Data preparation

1. PPI / KSR datasets (download + negative sampling + conversion to Hugging Face datasets)

We (i) download experimentally validated interaction pairs from public resources, (ii) construct balanced datasets by randomly sampling synthetic negative pairs that do not overlap with any curated positive pair, and (iii) export each dataset to a plain-text file with the format:

gene1_symbol  gene2_symbol  label

where label ∈ {0,1} indicates non-interaction versus interaction.

For reproducibility, we generate five independent dataset splits (using different random seeds) and convert them to Hugging Face datasets:

Human PPI: total_human_dataset_1 … total_human_dataset_5 under GRASP/data/PPI/
Kinase–substrate (PhosphoNetworks): strict_KSR_dataset_1 … strict_KSR_dataset_5 under GRASP/data/PhosphoNetwork/
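The negative-sampling and export steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, pairs are treated as unordered when checking overlap with positives, candidate genes are drawn only from genes that appear in the positives, and the output is tab-separated.

```python
import random


def sample_negatives(positives, seed):
    """Draw len(positives) random gene pairs that are not curated positives."""
    rng = random.Random(seed)
    # Assumption: (A, B) and (B, A) are the same pair for overlap checks.
    known = {tuple(sorted(p)) for p in positives}
    genes = sorted({g for pair in positives for g in pair})
    n_possible = len(genes) * (len(genes) - 1) // 2
    n_taken = len({k for k in known if k[0] != k[1]})
    if n_possible - n_taken < len(positives):
        raise ValueError("not enough non-positive pairs to balance the dataset")
    negatives = set()
    while len(negatives) < len(positives):
        pair = tuple(sorted(rng.sample(genes, 2)))
        if pair not in known:
            negatives.add(pair)
    return sorted(negatives)


def build_dataset(positives, seed):
    """Return shuffled (gene1, gene2, label) rows with equal class sizes."""
    rows = [(a, b, 1) for a, b in positives]
    rows += [(a, b, 0) for a, b in sample_negatives(positives, seed)]
    random.Random(seed).shuffle(rows)
    return rows


def format_rows(rows, sep="\t"):
    """Render rows as 'gene1 gene2 label' lines (tab separator is assumed)."""
    return "\n".join(f"{a}{sep}{b}{sep}{label}" for a, b, label in rows)


if __name__ == "__main__":
    toy_positives = [("AKT1", "GSK3B"), ("CDK1", "RB1"), ("MAPK1", "ELK1")]
    # Five independent splits, one per seed, mirroring dataset_1 ... dataset_5.
    for seed in range(1, 6):
        print(f"# seed {seed}")
        print(format_rows(build_dataset(toy_positives, seed)))
```

Fixing the seed makes each split reproducible, while changing it yields a different set of negatives for the same positives. For a directed relation such as kinase to substrate, an ordered-pair check may be more appropriate than the unordered one assumed here.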

2. BioContext-informed datasets (augmented text files from Step 1 + conversion to Hugging Face datasets)

To construct BioContext-informed inputs, we augment the original PPI/KSR text files by retrieving gene functional summaries from MyGene.info (https://mygene.info) and appending two additional columns:

gene1_symbol  gene2_symbol  label  gene1_summary  gene2_summary

Gene summaries are queried in bulk using the MyGene API (via querymany), and the returned summary field is used as the gene description. The resulting BioContext-informed datasets are exported and saved as Hugging Face datasets under:

GRASP/data/biocontext_informed_data/total_human_with_summaries_dataset_1 … total_human_with_summaries_dataset_5
GRASP/data/biocontext_informed_data/strict_kinase_substrate_with_summaries_dataset_1 … strict_kinase_substrate_with_summaries_dataset_5
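The augmentation step above can be sketched as follows. The sketch assumes a client object exposing querymany in the way mygene.MyGeneInfo does; the scopes="symbol" and species="human" arguments, the empty-string fallback for genes without a summary, and the function names are assumptions made for illustration and are not taken from the repository.

```python
def fetch_summaries(symbols, client):
    """Return {symbol: summary} for symbols that the service resolves."""
    hits = client.querymany(
        sorted(set(symbols)),  # query each symbol once
        scopes="symbol",       # assumption: inputs are gene symbols
        fields="summary",      # only the summary field is needed
        species="human",       # assumption: human genes
    )
    summaries = {}
    for hit in hits:
        if hit.get("notfound"):
            continue  # symbol not resolved by the service
        # Keep the first hit per symbol; a hit may lack a summary field.
        summaries.setdefault(hit["query"], hit.get("summary", ""))
    return summaries


def add_summaries(rows, summaries):
    """Append gene1_summary and gene2_summary columns to each row."""
    return [
        (g1, g2, label, summaries.get(g1, ""), summaries.get(g2, ""))
        for g1, g2, label in rows
    ]
```

With the mygene package from the environment above, the client would be created as mygene.MyGeneInfo() and passed as the second argument of fetch_summaries. Passing the client in explicitly also makes the helper easy to exercise offline with a stub.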

3. CausalBench perturbation datasets (automatic download)

CausalBench single-cell perturbation datasets are automatically downloaded during evaluation. The dataset download URLs are defined in the CausalBench codebase; they can also be inspected in:

causalscbench/data_access/datasets/download_evaluation_files.py

within the CausalBench GitHub repository: https://github.com/causalbench/causalbench.

Models

We organize model weights under GRASP/checkpoints/ with three subfolders:

base/ — original models
continual_pretrained/ — domain-adapted models after continual pretraining on PubMed titles/abstracts
fine_tuned/ — fine-tuned models

A typical layout looks like this:

GRASP/checkpoints/
├── base/
│   ├── llama-3.1-8b-instruct/
│   └── gemma-3-4b-it/
├── continual_pretrained/
│   ├── Llama-3.1-8B-Instruct-continual-gene/
│   └── gemma-3-4b-it-continual-gene/
└── fine_tuned/
    └── ...
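Before launching training, a quick check that the checkpoints are where the scripts expect them can save a failed run. The helper below is a hypothetical sketch that only lists the model folders found under each of the three subfolders; it does not load or validate any weights.

```python
from pathlib import Path

# The three subfolders described above.
GROUPS = ("base", "continual_pretrained", "fine_tuned")


def list_checkpoints(root):
    """Return {subfolder: [model directory names]} under a checkpoints root."""
    root = Path(root)
    found = {}
    for group in GROUPS:
        folder = root / group
        if folder.is_dir():
            found[group] = sorted(p.name for p in folder.iterdir() if p.is_dir())
        else:
            found[group] = []  # subfolder missing
    return found


if __name__ == "__main__":
    for group, models in list_checkpoints("GRASP/checkpoints").items():
        print(f"{group}/: {', '.join(models) if models else '(empty or missing)'}")
```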

Training and Evaluation

After preparing the datasets and placing the base or continual-pretrained model checkpoints under GRASP/checkpoints/, training and evaluation can be launched directly using the provided shell scripts.

All key hyperparameters (e.g., number of epochs, batch size, dataset path, and model backbone) can be modified inside the corresponding .sh files.

Below we provide an example of training and testing GRASP on the KSR dataset.

1. Training

To train GRASP on the KSR dataset:

bash scripts/train_ksr/run_ksr_gene_soft_prompt.sh

Fine-tuned models are stored under:

GRASP/checkpoints/fine_tuned/

2. Testing

After training completes, evaluation can be performed using:

bash scripts/test/test_gene_soft_prompt.sh

All evaluation results are saved under:

GRASP/results/KSR/

Example output structure:

GRASP/results/KSR/gemma_base/1_gemma_base_KSR_gene_soft_prompt
├── confusion_matrix.png
├── gene_predictions.txt
├── prc_auc_curve.png
├── roc_auc_curve.png
└── test_results.json
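To compare several runs, the test_results.json files can be gathered in one pass. The sketch below is illustrative only: the function name is hypothetical, and because the keys stored in test_results.json are not documented here, it loads each file as-is without assuming any particular metric names.

```python
import json
from pathlib import Path


def collect_results(results_root):
    """Return {run directory name: parsed test_results.json} for every run."""
    results = {}
    for path in sorted(Path(results_root).rglob("test_results.json")):
        with path.open() as fh:
            results[path.parent.name] = json.load(fh)
    return results


if __name__ == "__main__":
    for run, metrics in collect_results("GRASP/results/KSR").items():
        print(run, metrics)
```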
