Skip to content

HeloiseMuller/phageAnnotation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Phage Annotation

Introduction

This pipeline uses existing tools to detect high quality phage sequences in large datasets, starting with a run of VirSorter2 (VS2). As recommended by VS2's GitHub page, this pipeline follows the Sullivan Lab protocole, which consists in running checkV on the VS2's outputs, and then a second run on VS2 of checkV-trimmed sequences. Finally, this pipeline calculate quality scores based on the recommendations of the Sullivan Lab protocole and scores based on the recommendations of the benchmark published in Hegarty et al. (2024).

Citation

Please cite our paper "Genetic exchange networks bridge mobile DNA vehicles in the bacterial pathogen Listeria monocytogenes" by Muller et al. (2025)

In addition, please cite the tools and benchmarks we used to develop this pipeline:

Requirements

The pipeline was not tested with other versions of the above programs, but other versions probably work.

Installation

In a bash-compatible terminal that can execute git, paste

git clone https://github.com/HeloiseMuller/phageAnnotation.git
cd phageAnnotation/

STEP 1. Run VS2 + checkV + VS2

This first step has been written to run in array on a slurm server.

Get your dataset ready

Firstly, create a dataset composed of one fasta file per genome assembly. Each fasta file names must be of the form PATTERN_XXX_genomic.fna, where XXX can be anything.
Then, prepare the file dataset.lst in which each line contains the PATTERN of the genome. For example:

SRR14908170
SRR14300122
SRR14044052

where the first fasta file name would be SRR14908170_XXX_genomic.fna.

Get phageAnnotation_array.sh ready

Modify the following lines in phageAnnotation_array.sh to activate the conda environments in which VS2 and checkV are located on your server:

source /network/rit/lab/andamlab/bin/miniconda3/etc/profile.d/conda.sh 
conda activate virsorter2
conda activate checkV

If VirSorter2 and checkV are not installed in a conda environment, the above lines should be deleted.

Get phageAnnotation_main.sh ready

This script simply runs phageAnnotation_array.sh, but in an array, allowing to easily handle hundreds/thousands of genomes:

#!/bin/bash
#SBATCH -n 1
#SBATCH --partition=batch
#SBATCH --cpus-per-task=18
#SBATCH --mem=64G
#SBATCH -o /network/rit/lab/andamlab/Heloise/project_Listeria/viralAnnotation/phageAnnotation-slurm-%j.out
#SBATCH -D /network/rit/lab/andamlab/Heloise/project_Listeria/  # working directory
#SBATCH -v

#SBATCH --array=3-458%4

bash phageAnnotation_array.sh $SLURM_ARRAY_TASK_ID genomesPath outputsPath dataset.lst pathDB

In the above header, set the appropriate amount of ressources to ask for, and the appropriate paths.
Also set the appropriate --array; it determines from which genome to which genome of the file dataset.lst the pipeline will run for. %4 means that the pipeline will not run more than four genomes at once.
Tip: try it on just two genomes first.
Finally, replace the arguments of phageAnnotation_array.sh accordingly to your own paths/project:

bash phageAnnotation_array.sh $SLURM_ARRAY_TASK_ID genomesPath outputsPath dataset.lst pathDB

where
genomesPath is the absolute path where all your genomes (of the form PATTERN_XXX_genomic.fna) are located,
outputsPath is the absolute path where all the outputs should be saved at,
dataset.lst is the file you prepared earlier,
pathDB is the absolute path where the checkV's database is located (e.g. /network/rit/lab/andamlab/Databases/checkv-db-v1.5/).

Run step 1

sbatch phageAnnotation_main.sh

Output description of step 1

In the output path you provided, the pipeline will create a directory for each genome, each of which will contain the following directories:

  • vs2-pass1/ all the outputs generated by the first run of VS2.
    The present pipeline add two files, ending with _named.fa and named.tsv, in order to help harmonizing all the outputs.
  • checkv/ all the outputs generated by checkV.
    The present pipeline add the file PATTERN_combined.fna on which the second run of VS2 will run.
  • vs2-pass2/ all the outputs generated by the second run of VS2

STEP 2. Combine outputs

Once the first step is done for all genomes of the project, go in outputsPath and combine their outputs as follow:

cat */vs2-pass1/*_final-viral-boundary_named.tsv > allOutputs_vs2-pass1_final-viral-boundary.tsv
cat */checkv/contamination_named.tsv > allOutputs_contamination_named.tsv
cat */vs2-pass2/*_final-viral-score.tsv > allOutputs_vs2-pass2_final-viral-score.tsv
cat */vs2-pass2/*_final-viral-combined.fa > allOutputs_vs2-pass2_final-viral-combined.fa

All these combined files can be copied on a personal machine at this step.

STEP 3. Run the R script that harmonizes these outputs

This step has not been tested in a slurm server; it is recommended to run it on a personal machine.

To see all options run:

Rscript phageAnnotation.R -h

Options:
	-c CHARACTER, --checkV=CHARACTER
		output generated by checkV

	-v CHARACTER, --virsorter=CHARACTER
		output generated by the second run of Virsorter2

	-w CHARACTER, --wirsorter=CHARACTER
		optional, output generated by the first run of Virsorter2

	-l NUMERIC, --len=NUMERIC
		filter out phage sequences shorter than this threshold [default = 5000]

	-f NUMERIC, --filtering=NUMERIC
		optional, whether should apply tuning removal filters [default = 0 to not apply; 1 applies them]

	-o CHARACTER, --out=CHARACTER
		optional, path where to save the outputs. If none is providing, outputs will be save at the same location as checkV file

	-h, --help
		Show this help message and exit

Example of command line to run this step:

Rscript phageAnnotation.R -c -allOutputs_contamination_named.tsv -v allOutputs_vs2-pass2_final-viral-score.tsv -f 1 -l 5000 -w allOutputs_vs2-pass1_final-viral-boundary.tsv

In this example, phage sequenced under 5000 bp (-l 5000) and fulfilling at least one the three tuning removal rules (-f 1) the will be filtered out. As the optional argument -w is provided, the pipeline will be able to recover the coordinates of the phage sequences in the genome.

Output description of step 3

This script will create a folder filtering/ where all the outputs will be saved:

  • phages_quality.tsv contains one line per phage sequence passing the second run of VS2. Important columns would be:
    • seqname: final name of the phage sequences, as found in allOutputs_vs2-pass2_final-viral-combined.fa
    • length: final length of the phage sequences
    • viral_genes_checkV: number of viral genes identified by checkV
    • hallmark: number of hallmark gene recovered by the second round of VS2
    • keep_Sullivan: classification of the phage sequence based on the Sullivan Lab protocole
    • Hegarty_removal_all: sum of all the tuning removal scores. Any sequence with a score<0 are sequences for which at least one of the three tuning removal rules established by Hegarty et al was fulfilled, i.e. low confidence phage sequences.
    • start_inContig and end_inContig: coordinates of the final phage sequence in the genome. These two columns can only be generated if the output of the first run of VS2 is provided (option -w)
  • phages_filtered_XXX.tsv is the exact same file as above, except that only phage sequences passing filters are present. XXX varies depending on the filters chosen.
  • phages_filtered_XXX.lst is a list of all seqname of phage sequences passing filters.

STEP 4 (optional). Remove filtered phages from the phage fasta file

seqtk subseq allOutputs_vs2-pass2_final-viral-combined.fa  phages_filtered_XXX.lst > phages_filtered_XXX.fasta

About

Pipeline to run existing tools to annotate phage, harmonize the outputs and filter phage sequences

Resources

License

Stars

Watchers

Forks

Packages

No packages published