GitHub - LiuzLab/RareCollab

Introduction

RareCollab is a Python package for rare disease diagnosis, powered by Ollama. It integrates multimodal patient data (DNA, RNA, and phenotype information) to support candidate variant prioritization and diagnostic interpretation.

Our paper is available on arXiv: https://arxiv.org/abs/2602.04058

Before Running RareCollab

Before running RareCollab, prepare the following files and directories:

Download the RareCollab-data-dependencies from [link].
Create a work folder for the intermediate files generated by RareCollab.
Create an output folder for the RareCollab results.
If you have RNA data, create a reference folder and download all required reference files from here.
Optionally, create an NCBI API key with your email address at NCBI API Keys. The key is not required, but it can speed up searches using the NCBI and Entrez APIs.
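
The work and output folders can be created by hand or from Python. The following is a minimal sketch using only the standard library; the helper name `prepare_dirs` and the subfolder names `work` and `output` are illustrative choices, not something RareCollab requires:

```python
from pathlib import Path


def prepare_dirs(base_dir):
    """Create work/ and output/ under base_dir and return their paths.

    The folder names are illustrative; RareCollab does not require them.
    """
    base = Path(base_dir)
    work_dir = base / "work"
    output_dir = base / "output"
    for folder in (work_dir, output_dir):
        # exist_ok=True makes the call safe to repeat
        folder.mkdir(parents=True, exist_ok=True)
    return str(work_dir), str(output_dir)
```

The two returned strings can then be used wherever a work or output path is expected later in this README.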

Usage

Installation

Run the command below in a notebook cell to install RareCollab (drop the leading ! when running it from a shell):

!pip install git+https://github.com/LiuzLab/RareCollab.git

Step 1. Setup

1.1 Check required command-line tools

import RareCollab

RareCollab.Setup.CheckRequiredTools()

Follow the on-screen instructions to install any missing tools into your target environment.

Tip

When you see the message All required command-line tools are available., all external dependencies have been correctly installed and you can proceed to the next step.
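
If you want to check a single tool by hand, the standard library can tell you whether it is on your PATH. This is only a sketch of that idea and not how CheckRequiredTools works internally; the helper name `find_missing_tools` is hypothetical, and the list of tools RareCollab actually needs is reported by CheckRequiredTools itself:

```python
import shutil


def find_missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Example: Singularity is used for the container images in step 1.3.
# find_missing_tools(["singularity"]) returns [] when it is installed.
```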

1.2 Configure the paths

# Path to the reference data dependencies folder
ref_dir = '/path/RareCollab-data-dependencies-1.0'

# Path to your working folder (will be created if it does not exist)
work_dir = '/path/work'

# Reference genome build: 'hg38' or 'hg19'
ref_ver = 'hg38'
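
A quick sanity check on these values can catch typos before the longer setup steps run. The sketch below assumes only what is stated above (ref_dir must already exist, and ref_ver is either 'hg38' or 'hg19'); the helper name `check_setup_paths` is hypothetical:

```python
from pathlib import Path

SUPPORTED_BUILDS = ("hg38", "hg19")  # the two builds named above


def check_setup_paths(ref_dir, ref_ver):
    """Return a list of problems; an empty list means the values look usable."""
    problems = []
    if not Path(ref_dir).is_dir():
        problems.append(f"ref_dir does not exist: {ref_dir}")
    if ref_ver not in SUPPORTED_BUILDS:
        problems.append(f"ref_ver must be one of {SUPPORTED_BUILDS}, got {ref_ver!r}")
    return problems
```

work_dir is left out of the check because it is created on demand.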

1.3 Prepare the references and container images

This step locates the reference files, builds the required indexes, and sets up the Singularity images that the pipeline will use:

# Locate and validate the reference input files for the chosen genome build
references = RareCollab.Setup.ResolveReferenceInputs(ref_dir=ref_dir, ref_ver=ref_ver)

# Build the FASTA index files required for downstream analysis
fasta_references = RareCollab.Setup.BuildReferenceIndex(ref_dir=ref_dir, ref_ver=ref_ver)

# Pull and prepare the Singularity images for the required tools
singularity_images = RareCollab.Setup.PrepareSingularityImages(ref_dir=ref_dir)

Each call returns a handle (references, fasta_references, singularity_images) that later steps will use, so run all three before continuing.

1.4 Load the samplesheet

# Load the samplesheet and validate its configuration
samplesheet = RareCollab.Setup.LoadSamplesheet(csv_path='/path/samplesheet.csv', fulfill_empty_hpo=False)

The samplesheet must be a CSV file containing exactly the following three columns:

Column	Description
`sampleID`	A unique identifier for each sample. Must not be duplicated. Letters, digits, hyphens (`-`), and underscores (`_`) are safe to use. To avoid potential parsing issues, it is best to avoid other special characters (such as spaces, `/`, `\`, `*`, or `#`).
`vcf_path`	The absolute path to the sample's VCF file.
`hpo_path`	The absolute path to the sample's HPO file.
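
The column rules above can be checked before calling LoadSamplesheet. The following is a rough sketch, not the validation RareCollab performs: the helper name `check_samplesheet` is hypothetical, it accepts the three columns in any order (an assumption), and it reports characters outside the safe set as problems even though the table only advises against them:

```python
import csv
import os
import re

REQUIRED_COLUMNS = ["sampleID", "vcf_path", "hpo_path"]
SAFE_ID = re.compile(r"[A-Za-z0-9_-]+")  # letters, digits, hyphens, underscores


def check_samplesheet(csv_path):
    """Return a list of problems found in the samplesheet (empty if none)."""
    problems = []
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        if sorted(columns) != sorted(REQUIRED_COLUMNS):
            return [f"expected columns {REQUIRED_COLUMNS}, found {columns}"]
        seen = set()
        for row_number, row in enumerate(reader, start=2):  # row 1 is the header
            sample_id = row["sampleID"] or ""
            if sample_id in seen:
                problems.append(f"row {row_number}: duplicate sampleID {sample_id!r}")
            seen.add(sample_id)
            if not SAFE_ID.fullmatch(sample_id):
                problems.append(f"row {row_number}: sampleID {sample_id!r} has characters outside the safe set")
            for column in ("vcf_path", "hpo_path"):
                if not os.path.isabs(row[column] or ""):
                    problems.append(f"row {row_number}: {column} is not an absolute path")
    return problems
```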

Each HPO file is a plain-text (.txt) file containing a list of HPO terms, one per line, in the form HP:XXXXXXX. When fulfill_empty_hpo=True, if no HPO file exists at the specified hpo_path, a default HPO file will be created automatically at that location.
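
HPO identifiers consist of the prefix HP: followed by seven digits, so a malformed file is easy to spot with a regular expression. A minimal sketch, assuming blank lines are harmless and can be skipped (that is this sketch's assumption, not a documented RareCollab rule); the helper name `invalid_hpo_lines` is hypothetical:

```python
import re

HPO_TERM = re.compile(r"HP:\d{7}")


def invalid_hpo_lines(hpo_path):
    """Return (line_number, text) pairs for lines that are not HP:XXXXXXX terms."""
    bad = []
    with open(hpo_path) as handle:
        for line_number, line in enumerate(handle, start=1):
            term = line.strip()
            # Blank lines are skipped; anything else must be a full HPO ID
            if term and not HPO_TERM.fullmatch(term):
                bad.append((line_number, term))
    return bad
```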

Note

A demo samplesheet and a demo HPO file are provided in the demo/ folder of the repository. You can use them as a reference for the expected format.

1.5 Configure the worker settings

# Automatically recommend parallelization settings based on the samplesheet
config = RareCollab.Setup.RecommendWorkerConfig(samplesheet)

Caution

You can adjust the parallelization settings in config if needed, but doing so is at your own risk — overriding the recommended values may lead to excessive resource usage or unstable runs.
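
If you do need to lower a setting, for example on a shared machine, one cautious approach is to cap it rather than raise it. The sketch below assumes config behaves like a dictionary of integer settings, which is suggested by the config['split_workers'] lookup used in Step 2 but is not confirmed here; the helper name `cap_worker_setting` is hypothetical:

```python
import os


def cap_worker_setting(config, key, limit=None):
    """Return a dict copy of config with config[key] capped at `limit`.

    If limit is None, the CPU count reported by the OS is used.
    The result is never lower than 1.
    """
    if limit is None:
        limit = os.cpu_count() or 1
    capped = dict(config)  # leave the recommended values untouched
    capped[key] = max(1, min(int(capped[key]), limit))
    return capped
```

For example, `cap_worker_setting(config, 'split_workers', limit=4)` would return a copy in which split_workers is at most 4.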

Step 2. Generate Features

This step turns each sample's VCF into the features used by the downstream analysis. It runs in two stages — first processing the VCFs, then generating the features — and each stage updates samplesheet with the results:

# Stage 1: Process the VCF files (split, normalize, etc.)
samplesheet = RareCollab.Features.ProcessVCF(
    samplesheet,
    max_workers=config['split_workers'],
    work_dir=work_dir,
    references=references,
    fasta_references=fasta_references,
    overwrite=False,
)

# Stage 2: Generate features from the processed VCFs
samplesheet = RareCollab.Features.GenerateFeatures(
    samplesheet,
    work_dir=work_dir,
    references=references,
    fasta_references=fasta_references,
    singularity_images=singularity_images,
    ref_ver=ref_ver,
    config=config,
    overwrite=False,
)

Run both calls in order, since GenerateFeatures depends on the output of ProcessVCF.

Note

With overwrite=False, samples whose results already exist in work_dir are skipped, so you can safely re-run this step to resume an interrupted run. Set overwrite=True to force every sample to be reprocessed from scratch.
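
The resume behavior described in this note follows a common skip-if-present pattern. The sketch below illustrates that pattern in general terms; it is not RareCollab's implementation, and `run_step` and its arguments are hypothetical names:

```python
from pathlib import Path


def run_step(result_path, compute, overwrite=False):
    """Run compute() and save its text output unless a saved result can be reused.

    Returns True if compute() ran and False if the existing result was kept.
    """
    result_path = Path(result_path)
    if result_path.exists() and not overwrite:
        return False  # resume: keep the result from an earlier run
    result_path.parent.mkdir(parents=True, exist_ok=True)
    result_path.write_text(compute())
    return True
```

With this pattern, re-running after an interruption only repeats the steps whose result files are missing.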

Optional: RNA Preprocessing

If RNA data are available, run:

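# Placeholder paths; replace them with the locations of your own RNA results
splicing_path = '/path/fraser2_output'
expression_path = '/path/outrider_output'
ase_path = '/path/ase_read_counter_output'

RareCollab.Preprocessing.RNA(
    work_path=work_dir,
    splicing_path=splicing_path,
    expression_path=expression_path,
    ase_path=ase_path,
)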

Parameter	Type	Description
`work_path`	str	Path to the work directory.
`splicing_path`	str	Path to the output from FRASER2.
`expression_path`	str	Path to the output from OUTRIDER.
`ase_path`	str	Path to the output from GATK ASEReadCounter.
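
Because the three RNA inputs come from separate tools, it can be worth confirming that each path exists before starting. This is a small sketch under the assumption that an existence check is enough (whether each input is a file or a directory is not specified here); the helper name `missing_rna_inputs` is hypothetical:

```python
from pathlib import Path


def missing_rna_inputs(splicing_path, expression_path, ase_path):
    """Return the names of the RNA inputs whose paths do not exist."""
    inputs = {
        "splicing_path": splicing_path,
        "expression_path": expression_path,
        "ase_path": ase_path,
    }
    return [name for name, path in inputs.items() if not Path(path).exists()]
```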

Step 3. Run RareCollab Agents

3.1 Launch the LLM Server

Before running the downstream LLM-based analysis, you need to start an LLM server and keep it listening, then capture its connection details into llm_config. Run:

# Launch the LLM server (keeps listening for requests)
server = RareCollab.Setup.LaunchLLMServer(
    partition="partition",
    nodelist="node",
    port=12321,
    num_parallel=2,
    model_name="gpt-oss:20b",
)

# Capture the server's connection details into a config object
llm_config = RareCollab.Setup.LLMConfig(
    model_name=server["model_name"],
    ollama_url=server["ollama_url"],
    num_parallel=server["num_parallel"],
    temperature=0.7,
)

Parameters — `LaunchLLMServer`

Parameter	Type	Description
`partition`	str	The SLURM partition to launch the server on.
`nodelist`	str	The SLURM node (or nodes) to run the server on.
`port`	int	The port the server listens on.
`num_parallel`	int	Number of parallel instances. Increase this to run requests in parallel, based on your available GPU capacity.
`model_name`	str	The LLM model to serve (e.g., `gpt-oss:20b`).

Parameters — `LLMConfig`

Parameter	Type	Description
`model_name`	str	The served model name. Pass through from `server["model_name"]`.
`ollama_url`	str	The server's URL. Pass through from `server["ollama_url"]`.
`num_parallel`	int	Number of parallel requests. Pass through from `server["num_parallel"]`.
`temperature`	float	Sampling temperature. A value of `0.7` is recommended.
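
Before starting the agents, it may help to confirm that the server is reachable. Ollama exposes an HTTP API, and its /api/tags endpoint lists the models available on the server. The sketch below assumes that server["ollama_url"] is the base URL of that API (something like http://node:12321, which is an assumption about its format); the helper name `list_served_models` is hypothetical:

```python
import http.client
import json
import urllib.request


def list_served_models(ollama_url, timeout=5.0):
    """Return the model names reported by an Ollama server, or None if unreachable."""
    endpoint = ollama_url.rstrip("/") + "/api/tags"
    try:
        with urllib.request.urlopen(endpoint, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, malformed URL, or a non-JSON reply
        return None
    if not isinstance(payload, dict):
        return None
    return [model.get("name") for model in payload.get("models", [])]
```

A result of None suggests the server is not up yet or the URL is wrong; a list that lacks your model_name suggests the model is not available on that server.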

Note

When the server starts, LaunchLLMServer prints its SLURM job ID, for example: Submitted SLURM job id: xxxxxxxx. Keep this ID — you'll need it to stop the server later.

To shut the server down when you're done, run:

# Stop the LLM server using the SLURM job ID printed at launch
# (SLURM_job_id must hold that ID, i.e. the xxxxxxxx from the launch message)
RareCollab.Setup.StopLLMServer(SLURM_job_id)

3.2 Run the Diagnostic Agents (Serial)

If you have only a single LLM available, run the diagnostic agents serially as shown below. Each agent updates samplesheet and passes it to the next one, so they must be run in order.

We recommend providing an NCBI email and API key — they're used by the database and literature agents. If you don't have them, set both to None.
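
NCBI's E-utilities allow more requests per second with an API key than without one, which is why the key helps. To keep credentials out of shared notebooks, one option is to read them from environment variables. A minimal sketch; the variable names NCBI_EMAIL and NCBI_API_KEY and the helper `load_ncbi_credentials` are this sketch's own choices and not something RareCollab reads by itself:

```python
import os


def load_ncbi_credentials(env=None):
    """Read the NCBI email and API key from environment variables.

    Unset or empty variables are returned as None, matching the
    "set both to None" fallback described above.
    """
    env = os.environ if env is None else env
    email = env.get("NCBI_EMAIL") or None
    api_key = env.get("NCBI_API_KEY") or None
    return email, api_key
```

The returned pair can be passed as ncbi_email and ncbi_api_key to the agents that accept them.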

# NCBI credentials — used by the database and literature agents.
# If you don't have them, set both to None (queries may then be rate-limited).
NCBI_EMAIL = "your_ncbi@email.com"   # or None
NCBI_KEY   = "your-api-key"          # or None

# Run the Mixture-of-Experts (MoE) diagnostic engine
samplesheet = RareCollab.DiagnosticEngine.MoE(
    samplesheet=samplesheet,
    work_dir=work_dir,
    references=references,
)

# Generate the candidate gene/variant list
samplesheet = RareCollab.DiagnosticEngine.Candidates(
    samplesheet=samplesheet,
    work_dir=work_dir,
    config=config,
    overwrite=False,
)

# Database agent: query external databases for each candidate (uses NCBI)
samplesheet = RareCollab.DatabaseAgent.RunAgent(
    samplesheet=samplesheet,
    work_dir=work_dir,
    references=references,
    llm_config=llm_config,
    ncbi_email=NCBI_EMAIL,
    ncbi_api_key=NCBI_KEY,
    config=config,
    overwrite=False,
)

# In-silico agent: run in-silico prediction/analysis on the candidates
samplesheet = RareCollab.InSilicoAgent.RunAgent(
    samplesheet=samplesheet,
    work_dir=work_dir,
    llm_config=llm_config,
    overwrite=False,
)

# Phenotype agent: preprocess the phenotype (HPO) data
samplesheet = RareCollab.PhenotypeAgent.Preprocessing(
    samplesheet=samplesheet,
    work_dir=work_dir,
    references=references,
    overwrite=False,
)

# Phenotype agent: analysis based on HPO terms
samplesheet = RareCollab.PhenotypeAgent.RunAgent_HPO(
    samplesheet=samplesheet,
    work_dir=work_dir,
    llm_config=llm_config,
    overwrite=False,
)

# Phenotype agent: analysis against OMIM
samplesheet = RareCollab.PhenotypeAgent.RunAgent_OMIM(
    samplesheet=samplesheet,
    work_dir=work_dir,
    llm_config=llm_config,
    overwrite=False,
)

# Phenotype agent: analysis from the literature (uses NCBI)
samplesheet = RareCollab.PhenotypeAgent.RunAgent_Literature(
    samplesheet=samplesheet,
    work_dir=work_dir,
    llm_config=llm_config,
    ncbi_email=NCBI_EMAIL,
    ncbi_api_key=NCBI_KEY,
    overwrite=False,
)

Note

As before, overwrite=False lets you safely re-run this block to resume an interrupted run — completed steps are skipped. Set overwrite=True on a given call to force it to recompute.
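
Since every call in this block takes samplesheet and returns the updated samplesheet, the serial order can also be expressed as a list of steps applied one after another. This is only an illustration of that hand-off; `run_in_order` is a hypothetical helper, and the commented usage reuses the keyword arguments shown above:

```python
def run_in_order(samplesheet, steps):
    """Apply each step to the samplesheet in order, feeding each result to the next."""
    for step in steps:
        samplesheet = step(samplesheet)
    return samplesheet


# Usage idea (not run here): wrap each agent call so it takes only the samplesheet.
# steps = [
#     lambda s: RareCollab.InSilicoAgent.RunAgent(
#         samplesheet=s, work_dir=work_dir, llm_config=llm_config, overwrite=False),
#     lambda s: RareCollab.PhenotypeAgent.Preprocessing(
#         samplesheet=s, work_dir=work_dir, references=references, overwrite=False),
# ]
# samplesheet = run_in_order(samplesheet, steps)
```

Keeping the steps in a list makes the required order explicit and makes it harder to skip one by accident.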

Step 4. Integrate the Results

The final step merges the outputs from all the diagnostic agents into a single integrated result and writes it to output_path:

# Merge all agent outputs into the final integrated result
samplesheet = RareCollab.Integration.Review(
    samplesheet=samplesheet,
    work_dir=work_dir,
    fasta_references=fasta_references,
    output_path='/path/output',
    overwrite=False,
)

After this step completes, your final integrated results are available at output_path.
