pseuPIRA

pseuPIRA.py is a pipeline for estimating plasmid copy number (PCN) from microbial genome sequencing data. It combines themisto for pseudoalignment and minimap2 for multiread mapping. This project uses uv to manage dependencies and packaging.

We thank Jia Lu and Zhengqing Zhou for extensive testing on the MIT computing cluster and the Duke Compute Cluster (DCC). Testing on DCC and MacOS done by Rohan Maddamsetti and Irida Shyti.

Overview

pseuPIRA.py processes a gzipped GenBank reference file to extract replicon sequences (e.g., chromosomes and plasmids) and then:

Extracts Replicon Sequences:
- Reads a GenBank file (*.gbff.gz) and creates individual FASTA files for each replicon with custom headers containing metadata.
Builds an Index & Pseudoaligns Reads:
- Uses themisto to build an index from the replicon FASTA files.
- Pseudoaligns sequencing reads (FASTQ format) to count how many reads map to each replicon.
Estimates Plasmid Copy Numbers (Naïve):
- Computes initial PCN estimates as follows. First, we normalize read counts by replicon lengths to estimate sequencing read coverage depth for each replicon (i.e., plasmids and chromosomes). Then, we divide plasmid read coverage depth by chromosome read coverage depth to estimate plasmid copy numbers (relative to chromosome copy number).
Refines PCN Estimates (PIRA):
- Filters reads that map to multiple replicons.
- Realigns ambiguous reads using minimap2.
- Applies a Probabilistic Iterative Read Assignment (PIRA) algorithm to update and refine PCN estimates.

Requirements

External Tools:
- themisto – must be installed and available in your $PATH
- minimap2 – must be installed and available in your $PATH
  
  For convenience, MacOS and Linux binaries for themisto and a Linux binary for minimap2 are provided in the bin/ directory. You may have to turn these binaries into executables like so:
```
chmod +x ${PWD}/bin/themisto_linux-v3.2.2/themisto  ##make the linux themisto binary into an executable
```
  Then, you can add these binaries to the $PATH from the command-line as follows, before running pseuPIRA.py:
```
export PATH="${PWD}/bin/themisto_linux-v3.2.2:$PATH"
```
  Alternatively, you can install themisto and minimap2 from github using the links above. Note that we have had difficulty compiling themisto from source. The v3.2.2 release works for us on MacOS and Linux.
Python Libraries:
- Python==3.12.9
- Biopython==1.85
- Polars==1.24.0 IMPORTANT: there is a known bug where Polars 1.30.0 breaks PIRA code. Please install Polars 1.24.0.
- HTSeq==2.0.9 Current version may trigger SyntaxWarnings in Python, but does not cause any errors. See: htseq/htseq#105)
- NumPy==2.2.3
- argparse (standard library)

Dependency Management with uv

This project uses uv for dependency management and packaging, similar to Poetry. The configuration is defined in the pyproject.toml file.

Getting Started with uv

Install uv (https://docs.astral.sh/uv/getting-started/installation/):
```
pip install uv
```
If this command does not work, see the official uv installation directions here: https://docs.astral.sh/uv/getting-started/installation/
Install Project Dependences: In the project directory, run:
```
uv sync
```

This will install all the required dependencies as specified in the pyproject.toml.

Activate the Virtual Environment: uv automatically creates a virtual environment (in .venv/) during installation. To use it:
```
source .venv/bin/activate  ## Linux/MacOS
```
Note: If you use uv run for executing scripts, it will automatically handle the environment. For manual execution (python file.py), activation ensures you're using the correct dependencies.

Usage

Run the pipeline with:

python src/pseuPIRA.py -r <reference.gbff.gz> <reads1.fastq> [<reads2.fastq.gz> ...]

Command-Line Options

-r, --reference
(Required) Path to the gzipped GenBank file (*.gbff.gz) containing the reference genome.
-j, --num-processors
(Optional) Specifies the number of threads/processors to use.
Note: This option is currently not functional; the code is hardcoded to use 4 threads.
-o, --output
(Optional) Path to the output directory.
Default: ../results/test-run
-q, --quick
(Optional) Quick mode. Runs the pseudoalignment and naive PCN estimation steps but skips the PIRA refinement.
reads
(Positional) One or more FASTQ read files (supporting gzipped or plain FASTQ).

Example Data

Example data for testing pseuPIRA can be found in the ./data/RHBSTW-00316 directory. This dataset originates from NCBI BioProject PRJNA605147.

Test Data Details

Reference Genome:
- RefSeq ID: GCF_013742375.1
- Assembly Level: Complete genome
- Replicon Composition: 1 chromosome and 9 plasmids
- BioSample: SAMN15148572
- Strain: RHBSTW-00316
- Species: Enterobacter hormaechei
Reference Genome Files:
- NCBI Genome Datasets:
  View Dataset
- FTP Link:
  Download from FTP
Sequencing Data:
- Illumina Data Accession: SRX8493146
- Required FASTQ File:
  To test the pipeline, download the file SRR11948691.fastq.gz from the NCBI Trace Archive for SRR11948691.

Setup Instructions

Download the Data:
- Download and save the FASTQ file as:
  ./data/RHBSTW-00316/SRR11948691.fastq.gz
- Save the reference genome file (e.g., GCF_013742375.1_ASM1374237v1_genomic.gbff.gz) to the ./data/RHBSTW-00316 directory.
Note that the reference genome file is already present in this github repository; however, the FASTQ test data must be downloaded as it is too large to store here.

Run pseuPIRA.py on the Example Data:
Open a terminal in the project directory and execute:

cd src/  
python pseuPIRA.py -o ../results/RHBSTW-00316 -r ../data/RHBSTW-00316/GCF_013742375.1_ASM1374237v1_genomic.gbff.gz ../data/RHBSTW-00316/SRR11948691.fastq.gz

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
bin		bin
data/RHBSTW-00316		data/RHBSTW-00316
src		src
LICENSE.txt		LICENSE.txt
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pseuPIRA

Overview

Requirements

Dependency Management with uv

Getting Started with uv

Usage

Command-Line Options

Example Data

Test Data Details

Setup Instructions

About

Uh oh!

Releases 1

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

pseuPIRA

Overview

Requirements

Dependency Management with uv

Getting Started with uv

Usage

Command-Line Options

Example Data

Test Data Details

Setup Instructions

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages