GRASP: Gene-relation adaptive soft prompt for scalable and generalizable gene network inference with large language models

This repository contains the full fine-tuning, inference, and evaluation pipeline for GRASP. Instead of directly querying LLMs with gene pairs, GRASP introduces relation-aware soft prompt tokens that adapt to gene roles and interaction types. For questions or suggestions, please contact yuqiang@umich.edu or gyuanfan@umich.edu.

[Figure: GRASP overview]

Environment

1. Clone the repository

git clone https://github.com/GuanLab/GRASP.git
cd GRASP

2. Create and activate the conda environment

conda create -n grasp_env python=3.11 -y
conda activate grasp_env
python -m pip install --upgrade pip

3. Install core scientific dependencies (via conda-forge)

conda install -c conda-forge -y \
  numpy=1.26.4 pandas=2.0.3 scipy=1.11.4 scikit-learn=1.5.2 \
  matplotlib=3.10.0 seaborn=0.13.2 numba=0.61.0 pyarrow=18.1.0 \
  python-igraph=0.11.8 lightgbm=4.6.0 distributed=2023.5.0 scanpy=1.11.0

4. Install remaining Python packages (via pip)

pip install \
  accelerate==1.4.0 arboreto==0.1.6 causal_learn==0.1.4.0 cdt==0.6.0 \
  einops==0.8.2 gdown==5.2.0 huggingface_hub==0.34.3 mygene==3.2.2 \
  scprep==1.2.3 pytorch_lightning==2.5.0.post0 safetensors==0.4.5 seaborn==0.13.2 \
  slingpy==0.2.12 torch==2.5.1 tqdm==4.67.1 transformers==4.51.3 trl==0.15.2

5. Install FlashAttention2

pip install --no-build-isolation flash-attn==2.7.2.post1

Alternative: Install from a prebuilt wheel to avoid compilation issues.

pip install --no-deps \
"https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.2.post1/flash_attn-2.7.2.post1+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl"

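After installation, a quick version check can catch a mismatched environment before a long training run. The snippet below is a minimal sketch, not part of the repository: it compares a few of the pins from the commands above against what is installed, and the helper name check_versions is hypothetical. Local version suffixes (such as the +cu12... tag a prebuilt wheel may carry) are ignored when comparing.

```python
from importlib import metadata

# A subset of the pins from the install commands above (illustrative only).
EXPECTED = {
    "torch": "2.5.1",
    "transformers": "4.51.3",
    "trl": "0.15.2",
    "accelerate": "1.4.0",
    "flash-attn": "2.7.2.post1",
}


def check_versions(expected):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, want in expected.items():
        try:
            # Drop any local version label, e.g. "2.5.1+cu121" -> "2.5.1".
            have = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            have = None
        if have != want:
            mismatches[name] = (want, have)
    return mismatches


if __name__ == "__main__":
    problems = check_versions(EXPECTED)
    for name, (want, have) in problems.items():
        print(f"{name}: expected {want}, found {have or 'not installed'}")
    if not problems:
        print("All checked packages match the pinned versions.")
```
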
Data preparation

1. PPI / KSR datasets (download + negative sampling + conversion to Hugging Face datasets)

We (i) download experimentally validated interaction pairs from public resources, (ii) construct balanced datasets by randomly sampling synthetic negative pairs that do not overlap with any curated positives, and (iii) export each dataset to a plain-text file with the format:

gene1_symbol  gene2_symbol  label

where label ∈ {0,1} indicates non-interaction versus interaction.
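
The negative-sampling step can be sketched as follows. This is an illustration under stated assumptions rather than the repository's actual script: the function names are hypothetical, pairs are treated as unordered when checking for overlap with positives (a directed kinase-substrate check may be more appropriate for KSR), negatives are drawn only from genes that appear in the positive set, and the output file is assumed to be tab-separated.

```python
import random


def build_balanced_pairs(positives, seed):
    """Pair the curated positives with an equal number of sampled negatives.

    positives: iterable of (gene1, gene2) tuples.
    Returns a shuffled list of (gene1, gene2, label) rows.
    """
    rng = random.Random(seed)
    positives = list(dict.fromkeys(positives))  # de-duplicate, keep order
    # Assumption: (A, B) and (B, A) describe the same interaction.
    known = {frozenset(pair) for pair in positives}
    genes = sorted({gene for pair in positives for gene in pair})

    negatives = {}
    attempts = 0
    # Cap the attempts so a dense positive set cannot loop forever.
    while len(negatives) < len(positives) and attempts < 100 * len(positives):
        attempts += 1
        g1, g2 = rng.sample(genes, 2)  # two distinct genes
        key = frozenset((g1, g2))
        if key not in known and key not in negatives:
            negatives[key] = (g1, g2)

    rows = [(a, b, 1) for a, b in positives]
    rows += [(a, b, 0) for a, b in negatives.values()]
    rng.shuffle(rows)
    return rows


def write_pairs(rows, path):
    """Write rows as 'gene1<TAB>gene2<TAB>label' (tab delimiter is assumed)."""
    with open(path, "w") as fh:
        for g1, g2, label in rows:
            fh.write(f"{g1}\t{g2}\t{label}\n")


if __name__ == "__main__":
    # Placeholder symbols, not real interactions.
    toy_positives = [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D"), ("GENE_E", "GENE_F")]
    for seed in range(1, 6):  # hypothetical seeds, one per split
        rows = build_balanced_pairs(toy_positives, seed)
        print(seed, sum(label for _, _, label in rows), len(rows))
```

Using a dedicated random.Random instance per seed keeps each split reproducible without touching the global random state.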

For reproducibility, we generate five independent dataset splits (using different random seeds) and convert them to Hugging Face datasets:

  • Human PPI: total_human_dataset_1 through total_human_dataset_5 under GRASP/data/PPI/
  • Kinase–substrate (PhosphoNetworks): strict_KSR_dataset_1 through strict_KSR_dataset_5 under GRASP/data/PhosphoNetwork/

2. BioContext-informed datasets (augmented text files from Step 1 + conversion to Hugging Face datasets)

To construct BioContext-informed inputs, we augment the original PPI/KSR text files by retrieving gene functional summaries from MyGene.info (https://mygene.info) and appending two additional columns:

gene1_symbol  gene2_symbol  label  gene1_summary  gene2_summary

Gene summaries are queried in bulk using the MyGene API (via querymany), and the returned summary field is used as the gene description. The resulting BioContext-informed datasets are exported and saved as Hugging Face datasets under:

  • GRASP/data/biocontext_informed_data/total_human_with_summaries_dataset_1 through total_human_with_summaries_dataset_5
  • GRASP/data/biocontext_informed_data/strict_kinase_substrate_with_summaries_dataset_1 through strict_kinase_substrate_with_summaries_dataset_5
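
The summary-augmentation step might look roughly like the sketch below. It uses the mygene client's querymany call with scopes="symbol" and fields="summary"; everything else is an assumption made for illustration: the function names are hypothetical, the species filter is set to human, the first record with a summary wins when a symbol matches several records, and genes without a summary get an empty string.

```python
def summaries_from_hits(hits):
    """Reduce querymany results to {symbol: summary}."""
    summaries = {}
    for hit in hits:
        if hit.get("notfound"):
            continue  # MyGene.info had no record for this symbol
        # Assumption: keep the first returned record that carries a summary.
        if "summary" in hit and hit["query"] not in summaries:
            summaries[hit["query"]] = hit["summary"]
    return summaries


def fetch_summaries(symbols, species="human"):
    """Query MyGene.info in bulk (requires network access)."""
    import mygene  # imported lazily so the pure helpers work offline

    mg = mygene.MyGeneInfo()
    hits = mg.querymany(
        sorted(set(symbols)), scopes="symbol", fields="summary", species=species
    )
    return summaries_from_hits(hits)


def add_summary_columns(rows, summaries, missing=""):
    """Append gene1_summary and gene2_summary to (gene1, gene2, label) rows."""
    return [
        (g1, g2, label, summaries.get(g1, missing), summaries.get(g2, missing))
        for g1, g2, label in rows
    ]
```

Querying the de-duplicated symbol list once and joining locally avoids one API request per row.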

3. CausalBench perturbation datasets (automatic download)

CausalBench single-cell perturbation datasets are downloaded automatically during evaluation. The download URLs are defined in the CausalBench codebase and can be inspected in:

causalscbench/data_access/datasets/download_evaluation_files.py

within the CausalBench GitHub repository: https://github.com/causalbench/causalbench.

Models

We organize model weights under GRASP/checkpoints/ with three subfolders:

  • base/ — original models
  • continual_pretrained/ — domain-adapted models after continual pretraining on PubMed titles/abstracts
  • fine_tuned/ — fine-tuned models

A typical layout looks like this:

GRASP/checkpoints/
├── base/
│   ├── llama-3.1-8b-instruct/
│   └── gemma-3-4b-it/
├── continual_pretrained/
│   ├── Llama-3.1-8B-Instruct-continual-gene/
│   └── gemma-3-4b-it-continual-gene/
└── fine_tuned/
    └── ...
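
Before launching a run, it can help to confirm that the checkpoint folders are where the scripts expect them. The following is a small sketch, assuming only the three subfolder names listed above; the helper list_checkpoints is hypothetical and not part of the repository.

```python
from pathlib import Path

# Subfolder names taken from the layout described above.
EXPECTED_SUBFOLDERS = ("base", "continual_pretrained", "fine_tuned")


def list_checkpoints(root):
    """Return {subfolder: sorted model directory names}.

    A subfolder that does not exist maps to None.
    """
    root = Path(root)
    layout = {}
    for sub in EXPECTED_SUBFOLDERS:
        folder = root / sub
        if folder.is_dir():
            layout[sub] = sorted(p.name for p in folder.iterdir() if p.is_dir())
        else:
            layout[sub] = None
    return layout


if __name__ == "__main__":
    for sub, models in list_checkpoints("GRASP/checkpoints").items():
        status = "missing" if models is None else ", ".join(models) or "(empty)"
        print(f"{sub}/: {status}")
```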

Training and Evaluation

After preparing the datasets and placing the base or continual-pretrained model checkpoints under GRASP/checkpoints/, training and evaluation can be launched directly using the provided shell scripts.

All key hyperparameters (e.g., number of epochs, batch size, dataset path, and model backbone) can be modified inside the corresponding .sh files.

Below we provide an example of training and testing GRASP on the KSR dataset.

1. Training

To train GRASP on the KSR dataset:

bash scripts/train_ksr/run_ksr_gene_soft_prompt.sh

Fine-tuned models are stored under:

GRASP/checkpoints/fine_tuned/

2. Testing

After training completes, evaluation can be performed using:

bash scripts/test/test_gene_soft_prompt.sh

All evaluation results are saved under:

GRASP/results/KSR/

Example output structure:

GRASP/results/KSR/gemma_base/1_gemma_base_KSR_gene_soft_prompt
├── confusion_matrix.png
├── gene_predictions.txt
├── prc_auc_curve.png
├── roc_auc_curve.png
└── test_results.json
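
To compare several runs, the per-run test_results.json files can be gathered in one pass. The sketch below assumes only that each run directory contains a valid JSON file with that name; the keys inside the file are not documented here, so the code simply returns whatever was saved. The helper name collect_results is hypothetical.

```python
import json
from pathlib import Path


def collect_results(results_root):
    """Map each run directory (relative to results_root) to its parsed JSON."""
    root = Path(results_root)
    runs = {}
    for path in sorted(root.rglob("test_results.json")):
        with open(path) as fh:
            runs[path.parent.relative_to(root).as_posix()] = json.load(fh)
    return runs


if __name__ == "__main__":
    for run, results in collect_results("GRASP/results/KSR").items():
        print(run, results)
```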
