RareCollab

Introduction

RareCollab is a Python package for rare-disease diagnosis, powered by large language models served through Ollama. It integrates multimodal patient data (DNA, RNA, and phenotype information) to support candidate variant prioritization and diagnostic interpretation.

Our paper is available on arXiv: https://arxiv.org/abs/2602.04058

Before Running RareCollab

Before running RareCollab, please prepare the following files and directories:

  1. Download the RareCollab-data-dependencies from [link].

  2. Create a work folder for intermediate files generated by RareCollab.

  3. Create an output folder for RareCollab results.

  4. If you have RNA data, create a reference folder and download all required reference files from here.

  5. (Optional) Create an NCBI API key for your email address via NCBI API Keys. A key is not required, but it can speed up searches that use the NCBI and Entrez APIs.
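The folder-creation steps in the list above can be scripted. The following is a minimal sketch, assuming a hypothetical base directory and the folder names `work`, `output`, and `reference`; RareCollab does not require these particular names, and the helper `prepare_directories` is invented here for illustration.

```python
import tempfile
from pathlib import Path


def prepare_directories(base_dir, with_rna=False):
    """Create the work and output folders, plus a reference folder for RNA data.

    The folder names are illustrative; use whatever layout suits your system.
    """
    base = Path(base_dir)
    names = ["work", "output"]
    if with_rna:
        names.append("reference")  # only needed when RNA data are available
    created = {}
    for name in names:
        path = base / name
        # parents=True creates missing parent folders; exist_ok=True makes re-runs safe
        path.mkdir(parents=True, exist_ok=True)
        created[name] = path
    return created


# Demonstration in a throwaway directory; replace tmp with your real base path.
with tempfile.TemporaryDirectory() as tmp:
    dirs = prepare_directories(tmp, with_rna=True)
    print(sorted(dirs))
```

The returned paths can then be passed on as `work_dir` and as the output location in the later steps.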

Usage

Installation

Run the command below to install RareCollab. The leading ! runs it from a notebook cell; omit it when installing from a shell.

!pip install git+https://github.com/LiuzLab/RareCollab.git

Step 1. Setup

1.1 Check required command-line tools

import RareCollab

RareCollab.Setup.CheckRequiredTools()

Follow the on-screen instructions to install any missing tools into your target environment.

Tip

When you see the message All required command-line tools are available., all external dependencies have been correctly installed and you can proceed to the next step.
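If you want a quick, independent look at whether a particular tool is on your PATH, the standard library is enough. This is a minimal sketch and not what CheckRequiredTools does internally: the tool names below are placeholders chosen because Singularity and SLURM are mentioned later in this README, and RareCollab's actual list of required tools may differ.

```python
import shutil


def find_missing_tools(tool_names):
    """Return the tools from tool_names that cannot be found on PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


# Placeholder list: these names are assumptions, not RareCollab's actual requirements.
EXAMPLE_TOOLS = ["singularity", "sbatch"]

missing = find_missing_tools(EXAMPLE_TOOLS)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All example tools were found on PATH.")
```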

1.2 Configure the paths

# Path to the reference data dependencies folder
ref_dir = '/path/RareCollab-data-dependencies-1.0'

# Path to your working folder (will be created if it does not exist)
work_dir = '/path/work'

# Reference genome build: 'hg38' or 'hg19'
ref_ver = 'hg38'

1.3 Prepare the references and container images

This step locates the reference files, builds the required indexes, and sets up the Singularity images that the pipeline will use:

# Locate and validate the reference input files for the chosen genome build
references = RareCollab.Setup.ResolveReferenceInputs(ref_dir=ref_dir, ref_ver=ref_ver)

# Build the FASTA index files required for downstream analysis
fasta_references = RareCollab.Setup.BuildReferenceIndex(ref_dir=ref_dir, ref_ver=ref_ver)

# Pull and prepare the Singularity images for the required tools
singularity_images = RareCollab.Setup.PrepareSingularityImages(ref_dir=ref_dir)

Each call returns a handle (references, fasta_references, singularity_images) that later steps will use, so run all three before continuing.

1.4 Load the samplesheet

# Load the samplesheet and validate its configuration
samplesheet = RareCollab.Setup.LoadSamplesheet(csv_path='/path/samplesheet.csv', fulfill_empty_hpo=False)

The samplesheet must be a CSV file containing exactly the following three columns:

| Column | Description |
| --- | --- |
| sampleID | A unique identifier for each sample. Must not be duplicated. Letters, digits, hyphens (-), and underscores (_) are safe to use. To avoid potential parsing issues, it is best to avoid other special characters (such as spaces, /, \, *, or #). |
| vcf_path | The absolute path to the sample's VCF file. |
| hpo_path | The absolute path to the sample's HPO file. |
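The column rules above can be checked before calling LoadSamplesheet. The sketch below is one reading of those rules, not RareCollab's own validation: the helper `check_samplesheet`, the sample IDs, and the paths are all invented for illustration, and absolute paths are assumed to be POSIX-style.

```python
import csv
import io
import posixpath
import re

REQUIRED_COLUMNS = ["sampleID", "vcf_path", "hpo_path"]
SAFE_ID = re.compile(r"[A-Za-z0-9_-]+")  # letters, digits, hyphens, underscores


def check_samplesheet(csv_text):
    """Return a list of problems found in samplesheet text (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if sorted(reader.fieldnames or []) != sorted(REQUIRED_COLUMNS):
        return [f"expected columns {REQUIRED_COLUMNS}, found {reader.fieldnames}"]
    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample_id = row["sampleID"] or ""
        if sample_id in seen:
            problems.append(f"line {line_no}: duplicate sampleID {sample_id!r}")
        seen.add(sample_id)
        if not SAFE_ID.fullmatch(sample_id):
            problems.append(f"line {line_no}: sampleID {sample_id!r} has unsafe characters")
        for column in ("vcf_path", "hpo_path"):
            if not posixpath.isabs(row[column] or ""):
                problems.append(f"line {line_no}: {column} is not an absolute path")
    return problems


# Made-up rows: the second reuses an ID and gives a relative VCF path.
example = (
    "sampleID,vcf_path,hpo_path\n"
    "sample-01,/data/sample-01.vcf.gz,/data/sample-01.txt\n"
    "sample-01,sample-02.vcf.gz,/data/sample-02.txt\n"
)
for problem in check_samplesheet(example):
    print(problem)
```

The check works on the CSV text rather than a file path so that it can be tried without touching the filesystem; it does not test whether the listed files exist.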

Each HPO file is a plain-text (.txt) file containing a list of HPO terms, one per line, in the form HP:XXXXXXX. When fulfill_empty_hpo=True, if no HPO file exists at the specified hpo_path, a default HPO file will be created automatically at that location.
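The HP:XXXXXXX format described above (the prefix `HP:` followed by seven digits) is easy to check line by line. This is a minimal sketch with a hypothetical helper, `invalid_hpo_lines`; ignoring blank lines is an assumption of the sketch, and RareCollab may treat them differently.

```python
import re

HPO_TERM = re.compile(r"HP:\d{7}")  # 'HP:' followed by exactly seven digits


def invalid_hpo_lines(text):
    """Return (line_number, content) pairs for lines that are not HP:XXXXXXX terms.

    Blank lines are ignored here; that leniency is an assumption of this sketch.
    """
    bad = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        term = line.strip()
        if term and not HPO_TERM.fullmatch(term):
            bad.append((line_no, term))
    return bad


# The third line has too few digits and is reported.
example = "HP:0001250\nHP:0001263\nHP:12345\n"
print(invalid_hpo_lines(example))  # [(3, 'HP:12345')]
```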

Note

A demo samplesheet and a demo HPO file are provided in the demo/ folder of the repository. You can use them as a reference for the expected format.

1.5 Configure the worker settings

# Automatically recommend parallelization settings based on the samplesheet
config = RareCollab.Setup.RecommendWorkerConfig(samplesheet)

Caution

You can adjust the parallelization settings in config if needed, but doing so is at your own risk — overriding the recommended values may lead to excessive resource usage or unstable runs.
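If you do override a setting, one cautious approach is to clamp it to the number of CPUs on the machine. The sketch below assumes `config` behaves like a dictionary (this README only shows it being indexed with `'split_workers'`); the helper `cap_workers` and the example value are made up.

```python
import os


def cap_workers(config, keys, max_allowed=None):
    """Return a copy of config with the given keys clamped to [1, max_allowed].

    max_allowed defaults to the CPU count reported by the operating system.
    """
    if max_allowed is None:
        max_allowed = os.cpu_count() or 1
    capped = dict(config)  # leave the original config untouched
    for key in keys:
        capped[key] = max(1, min(int(capped[key]), max_allowed))
    return capped


# 'split_workers' appears in this README; the value 64 is invented for the example.
example_config = {"split_workers": 64}
print(cap_workers(example_config, ["split_workers"], max_allowed=8))  # {'split_workers': 8}
```

Only the keys you name are changed, so settings that are not worker counts are left alone.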


Step 2. Generate Features

This step turns each sample's VCF into the features used by the downstream analysis. It runs in two stages — first processing the VCFs, then generating the features — and each stage updates samplesheet with the results:

# Stage 1: Process the VCF files (split, normalize, etc.)
samplesheet = RareCollab.Features.ProcessVCF(
    samplesheet,
    max_workers=config['split_workers'],
    work_dir=work_dir,
    references=references,
    fasta_references=fasta_references,
    overwrite=False,
)

# Stage 2: Generate features from the processed VCFs
samplesheet = RareCollab.Features.GenerateFeatures(
    samplesheet,
    work_dir=work_dir,
    references=references,
    fasta_references=fasta_references,
    singularity_images=singularity_images,
    ref_ver=ref_ver,
    config=config,
    overwrite=False,
)

Run both calls in order, since GenerateFeatures depends on the output of ProcessVCF.

Note

With overwrite=False, samples whose results already exist in work_dir are skipped, so you can safely re-run this step to resume an interrupted run. Set overwrite=True to force every sample to be reprocessed from scratch.
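The skip-if-present behaviour described in the note is a common resume pattern. The sketch below illustrates the idea only: the `<sample_id>.done` marker files and the helper `run_with_resume` are invented for the example and do not reflect how RareCollab stores its results in work_dir.

```python
import tempfile
from pathlib import Path


def run_with_resume(sample_ids, work_dir, process, overwrite=False):
    """Run process(sample_id) for each sample, skipping samples already marked done.

    A '<sample_id>.done' marker file stands in for real result files.
    Returns the list of sample IDs that were actually processed.
    """
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    processed = []
    for sample_id in sample_ids:
        marker = work / f"{sample_id}.done"
        if marker.exists() and not overwrite:
            continue  # result already present: skip this sample
        process(sample_id)
        marker.write_text("ok\n")  # written only after process() returns
        processed.append(sample_id)
    return processed


def do_nothing(sample_id):
    """Stand-in for the real per-sample work."""


with tempfile.TemporaryDirectory() as tmp:
    print(run_with_resume(["S1", "S2"], tmp, do_nothing))              # ['S1', 'S2']
    print(run_with_resume(["S1", "S2", "S3"], tmp, do_nothing))        # ['S3']
    print(run_with_resume(["S1"], tmp, do_nothing, overwrite=True))    # ['S1']
```

Because the marker is written only after the work finishes, a sample interrupted midway is picked up again on the next run.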

Optional: RNA Preprocessing

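If RNA data are available, set splicing_path, expression_path, and ase_path to the corresponding tool outputs (described in the table below) and run: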

RareCollab.Preprocessing.RNA(
    work_path=work_dir,
    splicing_path=splicing_path,
    expression_path=expression_path,
    ase_path=ase_path,
)

| Parameter | Type | Description |
| --- | --- | --- |
| work_path | str | Path to the work directory. |
| splicing_path | str | Path to the output from FRASER2. |
| expression_path | str | Path to the output from OUTRIDER. |
| ase_path | str | Path to the output from GATK ASEReadCounter. |

Step 3. Run RareCollab Agents

3.1 Launch the LLM Server

Before running the downstream LLM-based analysis, you need to start an LLM server and keep it listening, then capture its connection details into llm_config. Run:

# Launch the LLM server (keeps listening for requests)
server = RareCollab.Setup.LaunchLLMServer(
    partition="partition",
    nodelist="node",
    port=12321,
    num_parallel=2,
    model_name="gpt-oss:20b",
)

# Capture the server's connection details into a config object
llm_config = RareCollab.Setup.LLMConfig(
    model_name=server["model_name"],
    ollama_url=server["ollama_url"],
    num_parallel=server["num_parallel"],
    temperature=0.7,
)

Parameters: LaunchLLMServer

| Parameter | Type | Description |
| --- | --- | --- |
| partition | str | The partition to launch the server on. |
| nodelist | str | The node (or nodes) to run the server on. |
| port | int | The port the server listens on. |
| num_parallel | int | Number of parallel instances. Increase this to run requests in parallel, based on your available GPU capacity. |
| model_name | str | The LLM model to serve (e.g., gpt-oss:20b). |

Parameters: LLMConfig

| Parameter | Type | Description |
| --- | --- | --- |
| model_name | str | The served model name. Pass through from server["model_name"]. |
| ollama_url | str | The server's URL. Pass through from server["ollama_url"]. |
| num_parallel | int | Number of parallel requests. Pass through from server["num_parallel"]. |
| temperature | float | Sampling temperature. A value of 0.7 is recommended. |
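Before starting the agents, it can be useful to confirm that the server answers and that the expected model is listed. Ollama exposes a `/api/tags` endpoint that lists the models available on the server. The sketch below assumes that `server["ollama_url"]` is a base URL (scheme, host, and port, with no path), which this README does not state; both helper functions are hypothetical and not part of RareCollab.

```python
import json
import urllib.request


def served_model_names(tags_payload):
    """Extract model names from the JSON body returned by Ollama's /api/tags."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def fetch_tags(ollama_url, timeout=10):
    """GET <ollama_url>/api/tags and return the decoded JSON body.

    Assumes ollama_url is a base URL such as 'http://host:port'.
    """
    url = ollama_url.rstrip("/") + "/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Offline illustration with a hand-written payload in the /api/tags shape.
sample_payload = {"models": [{"name": "gpt-oss:20b"}]}
print("gpt-oss:20b" in served_model_names(sample_payload))  # True
```

Against a live server, `served_model_names(fetch_tags(server["ollama_url"]))` would give the list to compare with `model_name`, provided the URL has the assumed form.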

Note

When the server starts, LaunchLLMServer prints its SLURM job ID, for example: Submitted SLURM job id: xxxxxxxx. Keep this ID — you'll need it to stop the server later.

To shut the server down when you're done, run:

# Stop the LLM server using the SLURM job ID printed at launch
RareCollab.Setup.StopLLMServer(SLURM_job_id)
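If you save the launch output to a log, the job ID can be pulled out of it with a regular expression. This is a minimal sketch that assumes the message has the form quoted above and that the ID is purely numeric; `extract_job_id` is a hypothetical helper, and the ID in the example is made up.

```python
import re

# Matches the message format quoted in the note above, with a numeric job ID.
JOB_ID_PATTERN = re.compile(r"Submitted SLURM job id:\s*(\d+)")


def extract_job_id(log_text):
    """Return the first SLURM job ID found in log_text as a string, or None."""
    match = JOB_ID_PATTERN.search(log_text)
    return match.group(1) if match else None


example_log = "Starting server...\nSubmitted SLURM job id: 12345678\n"
SLURM_job_id = extract_job_id(example_log)
print(SLURM_job_id)  # 12345678
```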

3.2 Run the Diagnostic Agents (Serial)

If you have only a single LLM available, run the diagnostic agents serially as shown below. Each agent updates samplesheet and passes it to the next one, so they must be run in order.

We recommend providing an NCBI email and API key — they're used by the database and literature agents. If you don't have them, set both to None.

# NCBI credentials — used by the database and literature agents.
# If you don't have them, set both to None (queries may then be rate-limited).
NCBI_EMAIL = "your_ncbi@email.com"   # or None
NCBI_KEY   = "your-api-key"          # or None

# Run the Mixture-of-Experts (MoE) diagnostic engine
samplesheet = RareCollab.DiagnosticEngine.MoE(
    samplesheet=samplesheet,
    work_dir=work_dir,
    references=references,
)

# Generate the candidate gene/variant list
samplesheet = RareCollab.DiagnosticEngine.Candidates(
    samplesheet=samplesheet,
    work_dir=work_dir,
    config=config,
    overwrite=False,
)

# Database agent: query external databases for each candidate (uses NCBI)
samplesheet = RareCollab.DatabaseAgent.RunAgent(
    samplesheet=samplesheet,
    work_dir=work_dir,
    references=references,
    llm_config=llm_config,
    ncbi_email=NCBI_EMAIL,
    ncbi_api_key=NCBI_KEY,
    config=config,
    overwrite=False,
)

# In-silico agent: run in-silico prediction/analysis on the candidates
samplesheet = RareCollab.InSilicoAgent.RunAgent(
    samplesheet=samplesheet,
    work_dir=work_dir,
    llm_config=llm_config,
    overwrite=False,
)

# Phenotype agent: preprocess the phenotype (HPO) data
samplesheet = RareCollab.PhenotypeAgent.Preprocessing(
    samplesheet=samplesheet,
    work_dir=work_dir,
    references=references,
    overwrite=False,
)

# Phenotype agent: analysis based on HPO terms
samplesheet = RareCollab.PhenotypeAgent.RunAgent_HPO(
    samplesheet=samplesheet,
    work_dir=work_dir,
    llm_config=llm_config,
    overwrite=False,
)

# Phenotype agent: analysis against OMIM
samplesheet = RareCollab.PhenotypeAgent.RunAgent_OMIM(
    samplesheet=samplesheet,
    work_dir=work_dir,
    llm_config=llm_config,
    overwrite=False,
)

# Phenotype agent: analysis from the literature (uses NCBI)
samplesheet = RareCollab.PhenotypeAgent.RunAgent_Literature(
    samplesheet=samplesheet,
    work_dir=work_dir,
    llm_config=llm_config,
    ncbi_email=NCBI_EMAIL,
    ncbi_api_key=NCBI_KEY,
    overwrite=False,
)

Note

As before, overwrite=False lets you safely re-run this block to resume an interrupted run — completed steps are skipped. Set overwrite=True on a given call to force it to recompute.
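The NCBI credentials used by the database and literature agents do not have to be typed into the notebook. One option is to read them from environment variables and fall back to None when they are unset, which matches the advice to pass None if you have no credentials. The variable names `NCBI_EMAIL` and `NCBI_API_KEY` and the helper `load_ncbi_credentials` are choices made for this sketch, not names that RareCollab looks for.

```python
import os


def load_ncbi_credentials(environ=None):
    """Return (email, api_key) from environment variables, or None where unset.

    Pass a dict as environ to test without touching the real environment.
    """
    env = os.environ if environ is None else environ
    email = env.get("NCBI_EMAIL") or None      # empty strings count as missing
    api_key = env.get("NCBI_API_KEY") or None
    return email, api_key


NCBI_EMAIL, NCBI_KEY = load_ncbi_credentials()
print("NCBI email set:", NCBI_EMAIL is not None)
print("NCBI API key set:", NCBI_KEY is not None)
```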


Step 4. Integrate the Results

The final step merges the outputs from all the diagnostic agents into a single integrated result and writes it to output_path:

# Merge all agent outputs into the final integrated result
samplesheet = RareCollab.Integration.Review(
    samplesheet=samplesheet,
    work_dir=work_dir,
    fasta_references=fasta_references,
    output_path='/path/output',
    overwrite=False,
)

After this step completes, your final integrated results are available at output_path.
