Lightweight wrappers for installing CellOracle reference genomes and running the GRN inference workflow used in the Bunina lab. The repository provides two entrypoints:
- run_install_genome.py: download and install one of the supported genomes into your local genome_dir.
- run_celloracle_inference.py: launch the inference pipeline end-to-end from a YAML configuration file via CellOraclePipeline.
git clone git@github.com:bunina-lab/celloracle_tools.git
or, if you have not added an SSH key to your GitHub account:
git clone https://github.com/bunina-lab/celloracle_tools.git
CellOracle pulls in heavy scientific dependencies (Scanpy, PyBedtools, PySam, etc.), so using Conda/Mamba is strongly recommended.
- Create a clean environment (substitute conda for mamba if you prefer):
  mamba create -n celloracle python=3.10 pip -c conda-forge -c bioconda
  mamba activate celloracle
- Install Python dependencies:
  python -m pip install -r requirements.txt
  The conda_env_requirements.txt file lists the full package set currently used on Bunina lab machines, in case you need to audit exact versions.
- System utilities: some Python wheels expect external tools to be present. Ensure bedtools, samtools, gzip, and a C/C++ toolchain (gcc, g++, make) are available on PATH before installing.
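To confirm the system utilities before installing, a short check along these lines may help. It is a minimal sketch that is not part of the repository; the helper name find_missing_tools is hypothetical, and it only tests whether each executable can be found on PATH, not whether its version is suitable.

```python
import shutil

# Tool names taken from the list above; extend as needed.
REQUIRED_TOOLS = ["bedtools", "samtools", "gzip", "gcc", "g++", "make"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be located on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required system utilities were found.")
```

Run it inside the activated celloracle environment so that tools installed through Conda/Mamba are also visible.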
Use config.yaml as a template and adjust:
- rna_h5ad: AnnData file with RNA counts (raw counts recommended).
- peak_names_file: text file with one genomic peak per line (chr_start_end).
- peak_coaccess_path: TSV with three columns (peak1, peak2, weight). You can compute this with the circe pipeline.
- TG2TF_json_path: JSON mapping target genes to transcription factors.
- genome_dir, reference_dir: location and name of the genome installed via run_install_genome.py.
- cluster_column, embedding, raw_counts: AnnData metadata settings.
- tf_binding_frp, motif_filtering_method, etc.: motif/GRN hyper-parameters.
Paths in the config can be absolute or relative; absolute paths are recommended for reproducibility. The script copies the effective configuration to used_config.yaml inside each run directory.
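If you want to convert relative paths to absolute ones before a run, something like the following could be used. It is only a sketch: absolutize_paths is a hypothetical helper, the list of path-valued keys is assumed from the template above, and anchoring relative paths at a chosen base directory is an assumption rather than a description of how the script itself resolves them.

```python
from pathlib import Path

# Config keys from the template above that are assumed to hold filesystem paths.
PATH_KEYS = [
    "rna_h5ad",
    "peak_names_file",
    "peak_coaccess_path",
    "TG2TF_json_path",
    "genome_dir",
]


def absolutize_paths(config, base_dir):
    """Return a copy of config with relative path entries anchored at base_dir."""
    base = Path(base_dir)
    resolved = dict(config)
    for key in PATH_KEYS:
        value = config.get(key)
        if value is None:
            continue  # key not set in this config
        path = Path(value).expanduser()
        if not path.is_absolute():
            path = base / path
        resolved[key] = str(path)
    return resolved


# Illustrative values only; none of these paths come from the repository.
example = {
    "rna_h5ad": "data/rna.h5ad",
    "genome_dir": "/refs/celloracle",
    "cluster_column": "leiden",
}
resolved_example = absolutize_paths(example, "/home/user/project")
print(resolved_example)
```

Non-path settings such as cluster_column pass through unchanged, so the result can be written back out as a complete config.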
The installer only downloads genomes that CellOracle supports (list defined near the top of run_install_genome.py). Example:
python run_install_genome.py \
--genome_dir /path/to/genomes/celloracle_refs \
--genome_name hg38
- --genome_dir: directory where genome FASTA/motif resources will be cached.
- --genome_name: one of the choices encoded in the script (human, mouse, zebrafish, fly, worm, etc.).
The script checks whether the requested genome already exists and exits early if so. The installation may take tens of minutes depending on download speed.
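The early-exit check can be pictured roughly as below. This is a hedged sketch, not the installer's actual code: genome_is_installed is a hypothetical helper, and the layout genome_dir/genome_name/genome_name.fa is an assumption about how the genome is stored on disk.

```python
import tempfile
from pathlib import Path


def genome_is_installed(genome_dir, genome_name):
    """Check for a FASTA at the ASSUMED layout <genome_dir>/<name>/<name>.fa."""
    fasta = Path(genome_dir) / genome_name / f"{genome_name}.fa"
    return fasta.is_file()


# Demonstration in a throwaway directory with a dummy FASTA file.
with tempfile.TemporaryDirectory() as tmp:
    before = genome_is_installed(tmp, "hg38")
    (Path(tmp) / "hg38").mkdir()
    (Path(tmp) / "hg38" / "hg38.fa").write_text(">chr1\nACGT\n")
    after = genome_is_installed(tmp, "hg38")

print(before, after)
```

A check like this lets you skip the download in batch scripts when the genome is already cached.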
Once genomes and inputs are ready, launch the inference workflow:
python run_celloracle_inference.py \
--config /path/to/experiment_config.yaml \
--n_cpu 16
Key behaviors:
- The script loads your YAML config, resolves paths, and creates an output run folder at output_dir/run_name (run_name defaults to the ISO date if omitted).
- The RNA AnnData is loaded from rna_h5ad. If raw_counts is true and the AnnData object contains a raw_count layer, the script replaces adata.X with that layer for CellOracle training.
- Peak co-accessibility is read from peak_coaccess_path and validated to have exactly three columns.
- CellOraclePipeline steps are executed in order: base GRN construction from motifs, Oracle initialization, and GRN inference.
- The --n_cpu CLI argument overrides the value in the config file.
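Two of the behaviors above, the run folder naming and the --n_cpu override, can be sketched as follows. The function names are hypothetical, the config key n_cpu and its fallback of 1 are assumptions, and "ISO date" is taken here to mean YYYY-MM-DD; the actual script may differ on any of these.

```python
from datetime import date
from pathlib import Path


def resolve_run_dir(config):
    """Build output_dir/run_name, falling back to today's ISO date (YYYY-MM-DD)."""
    run_name = config.get("run_name") or date.today().isoformat()
    return Path(config["output_dir"]) / run_name


def resolve_n_cpu(config, cli_n_cpu=None):
    """Prefer the CLI value when one is given, otherwise use the config value."""
    if cli_n_cpu is not None:
        return cli_n_cpu
    # "n_cpu" as the config key and 1 as the default are assumptions.
    return config.get("n_cpu", 1)


# Illustrative config; the values are made up for the example.
config = {"output_dir": "results", "n_cpu": 4}
print(resolve_run_dir(config))
print(resolve_n_cpu(config, cli_n_cpu=16))
```

Setting run_name explicitly is useful when several runs happen on the same day, since a date-only default would point them all at the same folder.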
Outputs include GRN tables, diagnostics, plots, and the used_config.yaml snapshot. See sarah_test_output/ for an example run layout.
- AnnData column availability: ensure cluster_column exists in adata.obs; the script aborts early if it cannot find this column.
- Peak metadata: peak_names_file must align with the peaks used in your co-accessibility table; mismatches lead to empty GRNs.
- Disk usage: genome installations can exceed multiple GB; keep genome_dir on high-capacity storage.
- Reproducibility: track both the exact config file and the requirements.txt commit hash when sharing results.
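The peak metadata pitfall can be caught before launching a run with a check along these lines. It is a standalone sketch, not code from the repository: both helper names are hypothetical, and the presence of a header row in the co-accessibility TSV is an assumption (pass has_header=False if your file has none).

```python
import csv
import io


def load_peak_names(handle):
    """Read one peak per line (chr_start_end), skipping blank lines."""
    return {line.strip() for line in handle if line.strip()}


def check_coaccess_peaks(handle, known_peaks, has_header=True):
    """Validate a three-column TSV and return peaks absent from known_peaks."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the assumed peak1/peak2/weight header
    first_line = 2 if has_header else 1
    unknown = set()
    for line_no, row in enumerate(reader, start=first_line):
        if len(row) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(row)}")
        peak1, peak2, _weight = row
        unknown.update(p for p in (peak1, peak2) if p not in known_peaks)
    return unknown


# In-memory stand-ins for peak_names_file and peak_coaccess_path.
peaks = load_peak_names(io.StringIO("chr1_100_600\nchr1_900_1400\n"))
table = io.StringIO(
    "peak1\tpeak2\tweight\n"
    "chr1_100_600\tchr1_900_1400\t0.42\n"
    "chr1_100_600\tchr2_50_550\t0.10\n"
)
unknown = check_coaccess_peaks(table, peaks)
print(sorted(unknown))
```

With real files, open peak_names_file and peak_coaccess_path and pass the file handles in place of the io.StringIO objects; a non-empty result points at peaks that would not line up during GRN construction.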
- Check lib/process_celloracle.py for more details on each pipeline stage.