pseuPIRA.py is a pipeline for estimating plasmid copy number (PCN) from microbial genome sequencing data. It combines themisto for pseudoalignment and minimap2 for multiread mapping. This project uses uv to manage dependencies and packaging.
We thank Jia Lu and Zhengqing Zhou for extensive testing on the MIT computing cluster and the Duke Compute Cluster (DCC). Testing on DCC and MacOS done by Rohan Maddamsetti and Irida Shyti.
pseuPIRA.py processes a gzipped GenBank reference file to extract replicon sequences (e.g., chromosomes and plasmids) and then:
-
Extracts Replicon Sequences:
- Reads a GenBank file (
*.gbff.gz) and creates individual FASTA files for each replicon with custom headers containing metadata.
- Reads a GenBank file (
-
Builds an Index & Pseudoaligns Reads:
- Uses themisto to build an index from the replicon FASTA files.
- Pseudoaligns sequencing reads (FASTQ format) to count how many reads map to each replicon.
-
Estimates Plasmid Copy Numbers (Naïve):
- Computes initial PCN estimates as follows. First, we normalize read counts by replicon lengths to estimate sequencing read coverage depth for each replicon (i.e., plasmids and chromosomes). Then, we divide plasmid read coverage depth by chromosome read coverage depth to estimate plasmid copy numbers (relative to chromosome copy number).
-
Refines PCN Estimates (PIRA):
- Filters reads that map to multiple replicons.
- Realigns ambiguous reads using minimap2.
- Applies a Probabilistic Iterative Read Assignment (PIRA) algorithm to update and refine PCN estimates.
-
External Tools:
-
themisto– must be installed and available in your$PATH -
minimap2– must be installed and available in your$PATHFor convenience, MacOS and Linux binaries for
themistoand a Linux binary forminimap2are provided in thebin/directory. You may have to turn these binaries into executables like so:chmod +x ${PWD}/bin/themisto_linux-v3.2.2/themisto ##make the linux themisto binary into an executable
Then, you can add these binaries to the
$PATHfrom the command-line as follows, before runningpseuPIRA.py:export PATH="${PWD}/bin/themisto_linux-v3.2.2:$PATH"
Alternatively, you can install themisto and minimap2 from github using the links above. Note that we have had difficulty compiling themisto from source. The v3.2.2 release works for us on MacOS and Linux.
-
-
Python Libraries:
- Python==3.12.9
- Biopython==1.85
- Polars==1.24.0 IMPORTANT: there is a known bug where Polars 1.30.0 breaks PIRA code. Please install Polars 1.24.0.
- HTSeq==2.0.9 Current version may trigger SyntaxWarnings in Python, but does not cause any errors. See: htseq/htseq#105)
- NumPy==2.2.3
- argparse (standard library)
This project uses uv for dependency management and packaging, similar to Poetry. The configuration is defined in the pyproject.toml file.
-
Install uv (https://docs.astral.sh/uv/getting-started/installation/):
pip install uvIf this command does not work, see the official uv installation directions here: https://docs.astral.sh/uv/getting-started/installation/
-
Install Project Dependences: In the project directory, run:
uv sync
This will install all the required dependencies as specified in the pyproject.toml.
-
Activate the Virtual Environment: uv automatically creates a virtual environment (in .venv/) during installation. To use it:
source .venv/bin/activate ## Linux/MacOSNote: If you use
uv runfor executing scripts, it will automatically handle the environment. For manual execution (python file.py), activation ensures you're using the correct dependencies.
Run the pipeline with:
python src/pseuPIRA.py -r <reference.gbff.gz> <reads1.fastq> [<reads2.fastq.gz> ...]-
-r,--reference
(Required) Path to the gzipped GenBank file (*.gbff.gz) containing the reference genome. -
-j,--num-processors
(Optional) Specifies the number of threads/processors to use.
Note: This option is currently not functional; the code is hardcoded to use 4 threads. -
-o,--output
(Optional) Path to the output directory.
Default:../results/test-run -
-q,--quick
(Optional) Quick mode. Runs the pseudoalignment and naive PCN estimation steps but skips the PIRA refinement. -
reads
(Positional) One or more FASTQ read files (supporting gzipped or plain FASTQ).
Example data for testing pseuPIRA can be found in the ./data/RHBSTW-00316 directory. This dataset originates from NCBI BioProject PRJNA605147.
-
Reference Genome:
- RefSeq ID: GCF_013742375.1
- Assembly Level: Complete genome
- Replicon Composition: 1 chromosome and 9 plasmids
- BioSample: SAMN15148572
- Strain: RHBSTW-00316
- Species: Enterobacter hormaechei
-
Reference Genome Files:
- NCBI Genome Datasets:
View Dataset - FTP Link:
Download from FTP
- NCBI Genome Datasets:
-
Sequencing Data:
- Illumina Data Accession: SRX8493146
- Required FASTQ File:
To test the pipeline, download the fileSRR11948691.fastq.gzfrom the NCBI Trace Archive for SRR11948691.
-
Download the Data:
- Download and save the FASTQ file as:
./data/RHBSTW-00316/SRR11948691.fastq.gz - Save the reference genome file (e.g.,
GCF_013742375.1_ASM1374237v1_genomic.gbff.gz) to the./data/RHBSTW-00316directory.
Note that the reference genome file is already present in this github repository; however, the FASTQ test data must be downloaded as it is too large to store here.
- Download and save the FASTQ file as:
-
Run pseuPIRA.py on the Example Data:
Open a terminal in the project directory and execute:cd src/ python pseuPIRA.py -o ../results/RHBSTW-00316 -r ../data/RHBSTW-00316/GCF_013742375.1_ASM1374237v1_genomic.gbff.gz ../data/RHBSTW-00316/SRR11948691.fastq.gz