RareCollab is a Python package for rare disease diagnosis, powered by Ollama. It integrates multimodal patient data (DNA, RNA, and phenotype information) to support candidate variant prioritization and diagnostic interpretation.
Our paper is available on arXiv: https://arxiv.org/abs/2602.04058
Before running RareCollab, please prepare the following files and directories:
- Download the RareCollab-data-dependencies from [link].
- Create a work folder for intermediate files generated by RareCollab.
- Create an output folder for RareCollab results.
- Create a reference folder and download all required reference files from here (only needed if you have RNA data).
- Create an NCBI API key using your email address from NCBI API Keys. This step is optional, but it can speed up searches that use the NCBI and Entrez APIs.
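The work and output folders from the list above could be created with a few lines of Python. This is a minimal sketch, not part of RareCollab: the helper name `prepare_dirs` and the folder names `work` and `output` are assumptions chosen to match the example paths used later in this README.

```python
from pathlib import Path


def prepare_dirs(base: str) -> dict:
    """Create the work and output folders under `base` if they are missing."""
    root = Path(base).expanduser().resolve()
    dirs = {"work_dir": root / "work", "output_dir": root / "output"}
    for path in dirs.values():
        # exist_ok=True makes the call safe to repeat
        path.mkdir(parents=True, exist_ok=True)
    return {name: str(path) for name, path in dirs.items()}
```

Calling `prepare_dirs('/path')` would return the two absolute paths as strings, which can then be used as `work_dir` and `output_path` in the later steps.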
Run the code below to install RareCollab:
!pip install git+https://github.com/LiuzLab/RareCollab.git

Then import the package and check that the required command-line tools are installed:

import RareCollab
RareCollab.Setup.CheckRequiredTools()

Follow the on-screen instructions to install any missing tools into your target environment.
Tip
When you see the message All required command-line tools are available., all external dependencies have been correctly installed and you can proceed to the next step.
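For readers curious how a check of this kind can work, the sketch below looks up executables on `PATH` with the standard library. It is not RareCollab's implementation, `find_missing_tools` is a hypothetical helper, and the tool list is illustrative only (Singularity is mentioned later in this README; the full list that `CheckRequiredTools` verifies is not shown here).

```python
import shutil


def find_missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Illustrative tool list; the real requirements may differ.
missing = find_missing_tools(["singularity"])
print("Missing tools:", missing or "none")
```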
# Path to the reference data dependencies folder
ref_dir = '/path/RareCollab-data-dependencies-1.0'
# Path to your working folder (will be created if it does not exist)
work_dir = '/path/work'
# Reference genome build: 'hg38' or 'hg19'
ref_ver = 'hg38'

This step locates the reference files, builds the required indexes, and sets up the Singularity images that the pipeline will use:
# Locate and validate the reference input files for the chosen genome build
references = RareCollab.Setup.ResolveReferenceInputs(ref_dir=ref_dir, ref_ver=ref_ver)
# Build the FASTA index files required for downstream analysis
fasta_references = RareCollab.Setup.BuildReferenceIndex(ref_dir=ref_dir, ref_ver=ref_ver)
# Pull and prepare the Singularity images for the required tools
singularity_images = RareCollab.Setup.PrepareSingularityImages(ref_dir=ref_dir)

Each call returns a handle (references, fasta_references, singularity_images) that later steps will use, so run all three before continuing.
# Load the samplesheet and validate its configuration
samplesheet = RareCollab.Setup.LoadSamplesheet(csv_path='/path/samplesheet.csv', fulfill_empty_hpo=False)

The samplesheet must be a CSV file containing exactly the following three columns:

| Column | Description |
|---|---|
| sampleID | A unique identifier for each sample. Must not be duplicated. Letters, digits, hyphens (-), and underscores (_) are safe to use. To avoid potential parsing issues, it is best to avoid other special characters (such as spaces, /, \, *, or #). |
| vcf_path | The absolute path to the sample's VCF file. |
| hpo_path | The absolute path to the sample's HPO file. |

Each HPO file is a plain-text (.txt) file containing a list of HPO terms, one per line, in the form HP:XXXXXXX. When fulfill_empty_hpo=True, if no HPO file exists at the specified hpo_path, a default HPO file will be created automatically at that location.
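An HPO file can be checked against the HP:XXXXXXX format before loading the samplesheet. The sketch below is a standalone helper, not a RareCollab function; the names `invalid_hpo_lines` and `check_hpo_file` are hypothetical, and ignoring blank lines is an assumption about what the loader tolerates.

```python
import re
from pathlib import Path

# An HPO term ID is "HP:" followed by exactly seven digits.
HPO_ID = re.compile(r"HP:\d{7}")


def invalid_hpo_lines(text: str):
    """Return (line_number, content) pairs for non-empty lines that are not HPO IDs."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        term = line.strip()
        if term and not HPO_ID.fullmatch(term):  # blank lines are skipped (assumption)
            bad.append((number, term))
    return bad


def check_hpo_file(path: str):
    """Apply invalid_hpo_lines to the contents of the file at `path`."""
    return invalid_hpo_lines(Path(path).read_text(encoding="utf-8"))
```

An empty list from `check_hpo_file(hpo_path)` means every non-empty line has the expected shape.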
Note
A demo samplesheet and a demo HPO file are provided in the demo/ folder of the repository. You can use them as a reference for the expected format.
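The samplesheet rules described above (three columns, unique and safely named sample IDs, absolute paths) can be checked ahead of time. This is a hedged sketch rather than the validation that `LoadSamplesheet` performs: `samplesheet_problems` is a hypothetical helper, and treating column order as irrelevant is an assumption.

```python
import csv
import io
import os
import re

REQUIRED_COLUMNS = ["sampleID", "vcf_path", "hpo_path"]
SAFE_ID = re.compile(r"[A-Za-z0-9_-]+")  # letters, digits, hyphens, underscores


def samplesheet_problems(csv_text: str):
    """Return a list of human-readable problems found in samplesheet CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    if sorted(header) != sorted(REQUIRED_COLUMNS):
        return [f"expected columns {REQUIRED_COLUMNS}, found {header}"]
    problems, seen = [], set()
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        sample_id = row["sampleID"] or ""
        if not SAFE_ID.fullmatch(sample_id):
            problems.append(f"row {row_number}: unsafe sampleID {sample_id!r}")
        if sample_id in seen:
            problems.append(f"row {row_number}: duplicate sampleID {sample_id!r}")
        seen.add(sample_id)
        for column in ("vcf_path", "hpo_path"):
            if not os.path.isabs(row[column] or ""):
                problems.append(f"row {row_number}: {column} is not an absolute path")
    return problems
```

To check a file on disk, read it first, for example `samplesheet_problems(open('/path/samplesheet.csv').read())`.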
# Automatically recommend parallelization settings based on the samplesheet
config = RareCollab.Setup.RecommendWorkerConfig(samplesheet)

Caution
You can adjust the parallelization settings in config if needed, but doing so is at your own risk — overriding the recommended values may lead to excessive resource usage or unstable runs.
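If an adjustment is needed, lowering a worker count is the conservative direction. The sketch below assumes that `config` supports dictionary-style access and holds an integer under `'split_workers'` (the only key this README shows); `cap_workers` is a hypothetical helper and the overall structure of `config` is not documented here.

```python
import os


def cap_workers(config, key="split_workers", limit=None):
    """Return a copy of `config` with `config[key]` capped at `limit`.

    `limit` defaults to the number of CPUs reported by the operating system.
    """
    limit = limit if limit is not None else (os.cpu_count() or 1)
    adjusted = dict(config)  # leave the recommended config untouched
    adjusted[key] = max(1, min(int(adjusted[key]), limit))
    return adjusted
```

For example, `config = cap_workers(config, limit=8)` would keep the recommended value unless it exceeds 8.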
This step turns each sample's VCF into the features used by the downstream analysis. It runs in two stages — first processing the VCFs, then generating the features — and each stage updates samplesheet with the results:
# Stage 1: Process the VCF files (split, normalize, etc.)
samplesheet = RareCollab.Features.ProcessVCF(
    samplesheet,
    max_workers=config['split_workers'],
    work_dir=work_dir,
    references=references,
    fasta_references=fasta_references,
    overwrite=False,
)
# Stage 2: Generate features from the processed VCFs
samplesheet = RareCollab.Features.GenerateFeatures(
    samplesheet,
    work_dir=work_dir,
    references=references,
    fasta_references=fasta_references,
    singularity_images=singularity_images,
    ref_ver=ref_ver,
    config=config,
    overwrite=False,
)

Run both calls in order, since GenerateFeatures depends on the output of ProcessVCF.
Note
With overwrite=False, samples whose results already exist in work_dir are skipped, so you can safely re-run this step to resume an interrupted run. Set overwrite=True to force every sample to be reprocessed from scratch.
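The resume behavior described in the note follows a common skip-if-output-exists pattern. The sketch below illustrates that pattern in isolation; it is not RareCollab's code, and the function name, the `result.done` marker file, and the per-sample folder layout are all hypothetical.

```python
from pathlib import Path


def run_with_resume(sample_ids, work_dir, process, overwrite=False):
    """Call `process(sample_id, out_path)` per sample, skipping finished samples.

    A sample counts as finished when its output file already exists,
    unless `overwrite` is True. Returns (processed_ids, skipped_ids).
    """
    done, skipped = [], []
    for sample_id in sample_ids:
        out_path = Path(work_dir) / sample_id / "result.done"  # hypothetical marker
        if out_path.exists() and not overwrite:
            skipped.append(sample_id)
            continue
        out_path.parent.mkdir(parents=True, exist_ok=True)
        process(sample_id, out_path)
        done.append(sample_id)
    return done, skipped
```

A second call with the same arguments processes only the samples that did not finish the first time.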
If RNA data are available, run:
RareCollab.Preprocessing.RNA(
    work_path=work_dir,
    splicing_path=splicing_path,
    expression_path=expression_path,
    ase_path=ase_path,
)

| Parameter | Type | Description |
|---|---|---|
| work_path | str | Path to the work directory. |
| splicing_path | str | Path to the output from FRASER2. |
| expression_path | str | Path to the output from OUTRIDER. |
| ase_path | str | Path to the output from GATK ASEReadCounter. |

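Because the three RNA inputs come from separate tools, it can help to confirm that each path exists before calling the preprocessing step. The helper below is a hypothetical sketch, not part of RareCollab; it accepts either files or directories, since this README does not state which form each output takes.

```python
from pathlib import Path


def missing_rna_inputs(splicing_path, expression_path, ase_path):
    """Return the parameter names whose paths do not exist on disk."""
    inputs = {
        "splicing_path": splicing_path,      # FRASER2 output
        "expression_path": expression_path,  # OUTRIDER output
        "ase_path": ase_path,                # GATK ASEReadCounter output
    }
    return [name for name, path in inputs.items() if not Path(path).exists()]
```

An empty list means all three inputs were found.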
Before running the downstream LLM-based analysis, you need to start an LLM server and keep it listening, then capture its connection details into llm_config. Run:
# Launch the LLM server (keeps listening for requests)
server = RareCollab.Setup.LaunchLLMServer(
    partition="partition",
    nodelist="node",
    port=12321,
    num_parallel=2,
    model_name="gpt-oss:20b",
)
# Capture the server's connection details into a config object
llm_config = RareCollab.Setup.LLMConfig(
    model_name=server["model_name"],
    ollama_url=server["ollama_url"],
    num_parallel=server["num_parallel"],
    temperature=0.7,
)

Parameters of LaunchLLMServer:

| Parameter | Type | Description |
|---|---|---|
| partition | str | The SLURM partition to launch the server on. |
| nodelist | str | The node (or nodes) to run the server on. |
| port | int | The port the server listens on. |
| num_parallel | int | Number of parallel instances. Increase this to run requests in parallel, based on your available GPU capacity. |
| model_name | str | The LLM model to serve (e.g., gpt-oss:20b). |

Parameters of LLMConfig:

| Parameter | Type | Description |
|---|---|---|
| model_name | str | The served model name. Pass through from server["model_name"]. |
| ollama_url | str | The server's URL. Pass through from server["ollama_url"]. |
| num_parallel | int | Number of parallel requests. Pass through from server["num_parallel"]. |
| temperature | float | Sampling temperature. A value of 0.7 is recommended. |

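Before starting the agents, it may be worth confirming that the server is reachable and serving the expected model. The sketch below queries Ollama's `GET /api/tags` endpoint, which lists the locally available models. It assumes that `server["ollama_url"]` is a base URL such as `http://node:12321`; the helper names are hypothetical and not part of RareCollab.

```python
import json
import urllib.request


def served_models(tags_payload: dict):
    """Extract model names from the JSON returned by Ollama's GET /api/tags."""
    return [model.get("name") for model in tags_payload.get("models", [])]


def model_is_available(ollama_url: str, model_name: str, timeout: float = 5.0) -> bool:
    """Return True if `model_name` is listed by the Ollama server at `ollama_url`."""
    endpoint = ollama_url.rstrip("/") + "/api/tags"
    try:
        with urllib.request.urlopen(endpoint, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts;
        # ValueError covers malformed URLs and invalid JSON.
        return False
    return model_name in served_models(payload)
```

For example, `model_is_available(server["ollama_url"], server["model_name"])` returning False suggests the server is still starting or the model has not been pulled.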
Note
When the server starts, LaunchLLMServer prints its SLURM job ID, for example: Submitted SLURM job id: xxxxxxxx. Keep this ID — you'll need it to stop the server later.
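If the launch output is saved to a log or notebook cell, the job ID can be recovered from the message format quoted above. This is a small sketch under the assumption that the ID is purely numeric; `extract_job_id` is a hypothetical helper, and whether `StopLLMServer` expects a string or an integer is not stated in this README.

```python
import re

# Matches the message format quoted in the note above.
JOB_ID_LINE = re.compile(r"Submitted SLURM job id:\s*(\d+)")


def extract_job_id(log_text: str):
    """Return the SLURM job ID found in `log_text` as a string, or None."""
    match = JOB_ID_LINE.search(log_text)
    return match.group(1) if match else None
```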
To shut the server down when you're done, run:
# Stop the LLM server using the SLURM job ID printed at launch
RareCollab.Setup.StopLLMServer(SLURM_job_id)

If you have only a single LLM available, run the diagnostic agents serially as shown below. Each agent updates samplesheet and passes it to the next one, so they must be run in order.
We recommend providing an NCBI email and API key — they're used by the database and literature agents. If you don't have them, set both to None.
# NCBI credentials — used by the database and literature agents.
# If you don't have them, set both to None (queries may then be rate-limited).
NCBI_EMAIL = "your_ncbi@email.com" # or None
NCBI_KEY = "your-api-key" # or None
# Run the Mixture-of-Experts (MoE) diagnostic engine
samplesheet = RareCollab.DiagnosticEngine.MoE(
    samplesheet=samplesheet,
    work_dir=work_dir,
    references=references,
)
# Generate the candidate gene/variant list
samplesheet = RareCollab.DiagnosticEngine.Candidates(
    samplesheet=samplesheet,
    work_dir=work_dir,
    config=config,
    overwrite=False,
)
# Database agent: query external databases for each candidate (uses NCBI)
samplesheet = RareCollab.DatabaseAgent.RunAgent(
    samplesheet=samplesheet,
    work_dir=work_dir,
    references=references,
    llm_config=llm_config,
    ncbi_email=NCBI_EMAIL,
    ncbi_api_key=NCBI_KEY,
    config=config,
    overwrite=False,
)
# In-silico agent: run in-silico prediction/analysis on the candidates
samplesheet = RareCollab.InSilicoAgent.RunAgent(
    samplesheet=samplesheet,
    work_dir=work_dir,
    llm_config=llm_config,
    overwrite=False,
)
# Phenotype agent: preprocess the phenotype (HPO) data
samplesheet = RareCollab.PhenotypeAgent.Preprocessing(
    samplesheet=samplesheet,
    work_dir=work_dir,
    references=references,
    overwrite=False,
)
# Phenotype agent: analysis based on HPO terms
samplesheet = RareCollab.PhenotypeAgent.RunAgent_HPO(
    samplesheet=samplesheet,
    work_dir=work_dir,
    llm_config=llm_config,
    overwrite=False,
)
# Phenotype agent: analysis against OMIM
samplesheet = RareCollab.PhenotypeAgent.RunAgent_OMIM(
    samplesheet=samplesheet,
    work_dir=work_dir,
    llm_config=llm_config,
    overwrite=False,
)
# Phenotype agent: analysis from the literature (uses NCBI)
samplesheet = RareCollab.PhenotypeAgent.RunAgent_Literature(
    samplesheet=samplesheet,
    work_dir=work_dir,
    llm_config=llm_config,
    ncbi_email=NCBI_EMAIL,
    ncbi_api_key=NCBI_KEY,
    overwrite=False,
)

Note
As before, overwrite=False lets you safely re-run this block to resume an interrupted run — completed steps are skipped. Set overwrite=True on a given call to force it to recompute.
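Rather than hard-coding the NCBI credentials in a notebook, they could be read from environment variables. The sketch below is one way to do that; the variable names `NCBI_EMAIL` and `NCBI_API_KEY` and the helper `ncbi_credentials` are assumptions, not something RareCollab reads on its own.

```python
import os


def ncbi_credentials(env=os.environ):
    """Return (email, api_key) from `env`, using None for unset or empty values."""
    email = env.get("NCBI_EMAIL") or None
    api_key = env.get("NCBI_API_KEY") or None
    return email, api_key


# Both values fall back to None, which the agents accept as "no credentials".
NCBI_EMAIL, NCBI_KEY = ncbi_credentials()
```

The resulting `NCBI_EMAIL` and `NCBI_KEY` can be passed to the database and literature agents exactly as in the block above.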
The final step merges the outputs from all the diagnostic agents into a single integrated result and writes it to output_path:
# Merge all agent outputs into the final integrated result
samplesheet = RareCollab.Integration.Review(
    samplesheet=samplesheet,
    work_dir=work_dir,
    fasta_references=fasta_references,
    output_path='/path/output',
    overwrite=False,
)

After this step completes, your final integrated results are available at output_path.
