rohanmaddamsetti/pseuPIRA

pseuPIRA

pseuPIRA.py is a pipeline for estimating plasmid copy number (PCN) from microbial genome sequencing data. It combines themisto for pseudoalignment and minimap2 for multiread mapping. This project uses uv to manage dependencies and packaging.

We thank Jia Lu and Zhengqing Zhou for extensive testing on the MIT computing cluster and the Duke Compute Cluster (DCC). Rohan Maddamsetti and Irida Shyti tested the pipeline on DCC and macOS.


Overview

pseuPIRA.py processes a gzipped GenBank reference file to extract replicon sequences (e.g., chromosomes and plasmids) and then:

  1. Extracts Replicon Sequences:

    • Reads a GenBank file (*.gbff.gz) and creates individual FASTA files for each replicon with custom headers containing metadata.
  2. Builds an Index & Pseudoaligns Reads:

    • Uses themisto to build an index from the replicon FASTA files.
    • Pseudoaligns sequencing reads (FASTQ format) to count how many reads map to each replicon.
  3. Estimates Plasmid Copy Numbers (Naïve):

    • Computes initial PCN estimates as follows. First, we normalize read counts by replicon lengths to estimate sequencing read coverage depth for each replicon (i.e., plasmids and chromosomes). Then, we divide plasmid read coverage depth by chromosome read coverage depth to estimate plasmid copy numbers (relative to chromosome copy number).
  4. Refines PCN Estimates (PIRA):

    • Filters reads that map to multiple replicons.
    • Realigns ambiguous reads using minimap2.
    • Applies a Probabilistic Iterative Read Assignment (PIRA) algorithm to update and refine PCN estimates.
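
The naive estimate in step 3 can be illustrated with a minimal sketch. This is not the code in pseuPIRA.py: the function name, the dictionary inputs, and the choice of the longest chromosome as the reference are all assumptions made for illustration.

```python
def naive_pcn(read_counts, lengths, chromosome_ids):
    """Estimate copy number of each replicon relative to the chromosome.

    read_counts: dict mapping replicon ID -> number of reads assigned to it
    lengths: dict mapping replicon ID -> replicon length in base pairs
    chromosome_ids: IDs of the replicons annotated as chromosomes
    (All names here are hypothetical, not taken from pseuPIRA.py.)
    """
    # Reads per base pair serves as a proxy for coverage depth.
    depth = {rid: read_counts[rid] / lengths[rid] for rid in read_counts}
    # Assumption: if several chromosomes are present, use the longest one.
    ref = max(chromosome_ids, key=lambda rid: lengths[rid])
    if depth[ref] == 0:
        raise ValueError("chromosome has no reads; cannot normalize")
    # Dividing by chromosome depth gives copies per chromosome.
    return {rid: depth[rid] / depth[ref] for rid in depth}


# Toy numbers, not real data.
counts = {"chr": 2_500_000, "pA": 100_000, "pB": 7_500}
sizes = {"chr": 5_000_000, "pA": 50_000, "pB": 5_000}
print(naive_pcn(counts, sizes, chromosome_ids=["chr"]))
```

With these toy numbers, the chromosome has a depth of 0.5 reads per base pair and plasmid pA has 2.0, so pA is estimated at 4 copies per chromosome. Reads that map to more than one replicon are what step 4 then reassigns.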

Requirements

  • External Tools:

    • themisto – must be installed and available in your $PATH

    • minimap2 – must be installed and available in your $PATH

      For convenience, macOS and Linux binaries for themisto and a Linux binary for minimap2 are provided in the bin/ directory. You may need to mark these binaries as executable, like so:

      chmod +x ${PWD}/bin/themisto_linux-v3.2.2/themisto  ## mark the Linux themisto binary as executable

      Then, you can add these binaries to the $PATH from the command-line as follows, before running pseuPIRA.py:

      export PATH="${PWD}/bin/themisto_linux-v3.2.2:$PATH"

      Alternatively, you can install themisto and minimap2 from GitHub using the links above. Note that we have had difficulty compiling themisto from source; the v3.2.2 release works for us on macOS and Linux.

  • Python Libraries:

    • Python==3.12.9
    • Biopython==1.85
    • Polars==1.24.0 (IMPORTANT: a known bug causes Polars 1.30.0 to break the PIRA code. Please install Polars 1.24.0.)
    • HTSeq==2.0.9 (the current version may trigger SyntaxWarnings in Python but does not cause any errors; see htseq/htseq#105)
    • NumPy==2.2.3
    • argparse (standard library)
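
Before a first run, it can help to confirm that the requirements above are actually met. The following is a small sketch using only the standard library; it is not part of the pseuPIRA repository, the function names are hypothetical, and the distribution names in `PINNED` are assumed to match the names the packages are installed under.

```python
import shutil
from importlib import metadata

# External tools that must be discoverable on $PATH.
REQUIRED_TOOLS = ("themisto", "minimap2")

# Pinned versions from the list above (distribution names are assumptions).
PINNED = {
    "biopython": "1.85",
    "polars": "1.24.0",
    "HTSeq": "2.0.9",
    "numpy": "2.2.3",
}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on $PATH."""
    return [tool for tool in tools if which(tool) is None]


def version_mismatches(pinned=PINNED, get_version=metadata.version):
    """Return {package: (wanted, found)} for packages that differ from the pin.

    `found` is None when the package is not installed at all.
    """
    problems = {}
    for name, wanted in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    print("missing tools:", missing_tools())
    print("version mismatches:", version_mismatches())
```

Run it inside the environment you intend to use for the pipeline. An empty list and an empty dictionary mean the tools and pinned versions were found.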

Dependency Management with uv

This project uses uv for dependency management and packaging, similar to Poetry. The configuration is defined in the pyproject.toml file.

Getting Started with uv

  1. Install uv (https://docs.astral.sh/uv/getting-started/installation/):

    pip install uv
    

    If this command does not work, see the official uv installation directions here: https://docs.astral.sh/uv/getting-started/installation/

  2. Install Project Dependencies: In the project directory, run:

    uv sync
    

    This will install all the required dependencies as specified in the pyproject.toml.

  3. Activate the Virtual Environment: uv automatically creates a virtual environment (in .venv/) during installation. To use it:

    source .venv/bin/activate  ## Linux/MacOS
    

    Note: If you use uv run for executing scripts, it will automatically handle the environment. For manual execution (python file.py), activation ensures you're using the correct dependencies.

Usage

Run the pipeline with:

python src/pseuPIRA.py -r <reference.gbff.gz> <reads1.fastq> [<reads2.fastq.gz> ...]

Command-Line Options

  • -r, --reference
    (Required) Path to the gzipped GenBank file (*.gbff.gz) containing the reference genome.

  • -j, --num-processors
    (Optional) Specifies the number of threads/processors to use.
    Note: This option is currently not functional; the code is hardcoded to use 4 threads.

  • -o, --output
    (Optional) Path to the output directory.
    Default: ../results/test-run

  • -q, --quick
    (Optional) Quick mode. Runs the pseudoalignment and naive PCN estimation steps but skips the PIRA refinement.

  • reads
    (Positional) One or more FASTQ read files (supporting gzipped or plain FASTQ).
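
For readers who want to see how these options fit together, here is an argparse sketch that mirrors the documented interface. It is an illustration only, not the parser in src/pseuPIRA.py: the help strings are paraphrased from this README, and the default of 4 for `-j` is an assumption based on the note about the hardcoded thread count.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options documented in this README."""
    parser = argparse.ArgumentParser(
        prog="pseuPIRA.py",
        description="Estimate plasmid copy number from sequencing reads.",
    )
    parser.add_argument(
        "-r", "--reference", required=True,
        help="gzipped GenBank reference file (*.gbff.gz)",
    )
    parser.add_argument(
        "-j", "--num-processors", type=int, default=4,
        help="number of threads (assumed default; see the note above)",
    )
    parser.add_argument(
        "-o", "--output", default="../results/test-run",
        help="output directory",
    )
    parser.add_argument(
        "-q", "--quick", action="store_true",
        help="skip the PIRA refinement step",
    )
    parser.add_argument(
        "reads", nargs="+",
        help="one or more FASTQ files, gzipped or plain",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Because `reads` uses `nargs="+"`, at least one FASTQ file is required, and any number of additional files may follow.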


Example Data

Example data for testing pseuPIRA can be found in the ./data/RHBSTW-00316 directory. This dataset originates from NCBI BioProject PRJNA605147.

Test Data Details

  • Reference Genome:

    • RefSeq ID: GCF_013742375.1
    • Assembly Level: Complete genome
    • Replicon Composition: 1 chromosome and 9 plasmids
    • BioSample: SAMN15148572
    • Strain: RHBSTW-00316
    • Species: Enterobacter hormaechei
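  • Reference Genome Files: GCF_013742375.1_ASM1374237v1_genomic.gbff.gz (included in this repository under ./data/RHBSTW-00316)

  • Sequencing Data: SRR11948691.fastq.gz (not included; must be downloaded as described in the Setup Instructions)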

Setup Instructions

  1. Download the Data:

    • Download and save the FASTQ file as:
      ./data/RHBSTW-00316/SRR11948691.fastq.gz
    • Save the reference genome file (e.g., GCF_013742375.1_ASM1374237v1_genomic.gbff.gz) to the ./data/RHBSTW-00316 directory.

    Note that the reference genome file is already present in this GitHub repository; the FASTQ test data, however, must be downloaded separately because it is too large to store here.

  2. Run pseuPIRA.py on the Example Data:
    Open a terminal in the project directory and execute:

    cd src/  
    python pseuPIRA.py -o ../results/RHBSTW-00316 -r ../data/RHBSTW-00316/GCF_013742375.1_ASM1374237v1_genomic.gbff.gz ../data/RHBSTW-00316/SRR11948691.fastq.gz
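
Since the FASTQ file has to be downloaded separately, a quick pre-flight check can catch a missing input before the pipeline starts. The sketch below is not part of the repository and the function name is hypothetical; the paths are the ones given in the setup instructions, relative to the project directory.

```python
from pathlib import Path

# Example inputs named in the setup instructions above.
EXAMPLE_DIR = Path("data/RHBSTW-00316")
EXAMPLE_INPUTS = (
    EXAMPLE_DIR / "GCF_013742375.1_ASM1374237v1_genomic.gbff.gz",
    EXAMPLE_DIR / "SRR11948691.fastq.gz",
)


def missing_inputs(paths=EXAMPLE_INPUTS, root="."):
    """Return the input paths (as strings) that do not exist under `root`."""
    root = Path(root)
    return [str(p) for p in paths if not (root / p).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files:")
        for path in missing:
            print("  ", path)
    else:
        print("All example inputs found.")
```

Run it from the project directory, or pass `root` explicitly when calling it from elsewhere.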
