This pipeline uses existing tools to detect high quality phage sequences in large datasets, starting with a run of VirSorter2 (VS2). As recommended by VS2's GitHub page, this pipeline follows the Sullivan Lab protocole, which consists in running checkV on the VS2's outputs, and then a second run on VS2 of checkV-trimmed sequences. Finally, this pipeline calculate quality scores based on the recommendations of the Sullivan Lab protocole and scores based on the recommendations of the benchmark published in Hegarty et al. (2024).
Please cite our paper "Genetic exchange networks bridge mobile DNA vehicles in the bacterial pathogen Listeria monocytogenes" by Muller et al. (2025)
In addition, please cite the tools and benchmarks we used to develop this pipeline:
- VirSorter2
- checkV
- the Sullivan Lab protocole
- Benchmarking informatics approaches for virus discovery: caution is needed when combining in silico identification methods by Hegarty et al. (2024)
-
R 4.3+ with the following packages:
- data.table 1.16.2
- stringi 1.8.4
- dplyr 1.1.4
(If not found, these packages are installed automatically by the pipeline.)
-
VirSorter2 2.2.4
-
checkV 1.0.3
The pipeline was not tested with other versions of the above programs, but other versions probably work.
In a bash-compatible terminal that can execute git, paste
git clone https://github.com/HeloiseMuller/phageAnnotation.git
cd phageAnnotation/
This first step has been written to run in array on a slurm server.
Firstly, create a dataset composed of one fasta file per genome assembly. Each fasta file names must be of the form PATTERN_XXX_genomic.fna, where XXX can be anything.
Then, prepare the file dataset.lst in which each line contains the PATTERN of the genome. For example:
SRR14908170
SRR14300122
SRR14044052
where the first fasta file name would be SRR14908170_XXX_genomic.fna.
Modify the following lines in phageAnnotation_array.sh to activate the conda environments in which VS2 and checkV are located on your server:
source /network/rit/lab/andamlab/bin/miniconda3/etc/profile.d/conda.sh
conda activate virsorter2
conda activate checkV
If VirSorter2 and checkV are not installed in a conda environment, the above lines should be deleted.
This script simply runs phageAnnotation_array.sh, but in an array, allowing to easily handle hundreds/thousands of genomes:
#!/bin/bash
#SBATCH -n 1
#SBATCH --partition=batch
#SBATCH --cpus-per-task=18
#SBATCH --mem=64G
#SBATCH -o /network/rit/lab/andamlab/Heloise/project_Listeria/viralAnnotation/phageAnnotation-slurm-%j.out
#SBATCH -D /network/rit/lab/andamlab/Heloise/project_Listeria/ # working directory
#SBATCH -v
#SBATCH --array=3-458%4
bash phageAnnotation_array.sh $SLURM_ARRAY_TASK_ID genomesPath outputsPath dataset.lst pathDB
In the above header, set the appropriate amount of ressources to ask for, and the appropriate paths.
Also set the appropriate --array; it determines from which genome to which genome of the file dataset.lst the pipeline will run for. %4 means that the pipeline will not run more than four genomes at once.
Tip: try it on just two genomes first.
Finally, replace the arguments of phageAnnotation_array.sh accordingly to your own paths/project:
bash phageAnnotation_array.sh $SLURM_ARRAY_TASK_ID genomesPath outputsPath dataset.lst pathDB
where
genomesPath is the absolute path where all your genomes (of the form PATTERN_XXX_genomic.fna) are located,
outputsPath is the absolute path where all the outputs should be saved at,
dataset.lst is the file you prepared earlier,
pathDB is the absolute path where the checkV's database is located (e.g. /network/rit/lab/andamlab/Databases/checkv-db-v1.5/).
sbatch phageAnnotation_main.sh
In the output path you provided, the pipeline will create a directory for each genome, each of which will contain the following directories:
vs2-pass1/all the outputs generated by the first run of VS2.
The present pipeline add two files, ending with_named.faandnamed.tsv, in order to help harmonizing all the outputs.checkv/all the outputs generated by checkV.
The present pipeline add the filePATTERN_combined.fnaon which the second run of VS2 will run.vs2-pass2/all the outputs generated by the second run of VS2
Once the first step is done for all genomes of the project, go in outputsPath and combine their outputs as follow:
cat */vs2-pass1/*_final-viral-boundary_named.tsv > allOutputs_vs2-pass1_final-viral-boundary.tsv
cat */checkv/contamination_named.tsv > allOutputs_contamination_named.tsv
cat */vs2-pass2/*_final-viral-score.tsv > allOutputs_vs2-pass2_final-viral-score.tsv
cat */vs2-pass2/*_final-viral-combined.fa > allOutputs_vs2-pass2_final-viral-combined.fa
All these combined files can be copied on a personal machine at this step.
This step has not been tested in a slurm server; it is recommended to run it on a personal machine.
To see all options run:
Rscript phageAnnotation.R -h
Options:
-c CHARACTER, --checkV=CHARACTER
output generated by checkV
-v CHARACTER, --virsorter=CHARACTER
output generated by the second run of Virsorter2
-w CHARACTER, --wirsorter=CHARACTER
optional, output generated by the first run of Virsorter2
-l NUMERIC, --len=NUMERIC
filter out phage sequences shorter than this threshold [default = 5000]
-f NUMERIC, --filtering=NUMERIC
optional, whether should apply tuning removal filters [default = 0 to not apply; 1 applies them]
-o CHARACTER, --out=CHARACTER
optional, path where to save the outputs. If none is providing, outputs will be save at the same location as checkV file
-h, --help
Show this help message and exit
Example of command line to run this step:
Rscript phageAnnotation.R -c -allOutputs_contamination_named.tsv -v allOutputs_vs2-pass2_final-viral-score.tsv -f 1 -l 5000 -w allOutputs_vs2-pass1_final-viral-boundary.tsv
In this example, phage sequenced under 5000 bp (-l 5000) and fulfilling at least one the three tuning removal rules (-f 1) the will be filtered out. As the optional argument -w is provided, the pipeline will be able to recover the coordinates of the phage sequences in the genome.
This script will create a folder filtering/ where all the outputs will be saved:
phages_quality.tsvcontains one line per phage sequence passing the second run of VS2. Important columns would be:seqname: final name of the phage sequences, as found inallOutputs_vs2-pass2_final-viral-combined.falength: final length of the phage sequencesviral_genes_checkV: number of viral genes identified by checkVhallmark: number of hallmark gene recovered by the second round of VS2keep_Sullivan: classification of the phage sequence based on the Sullivan Lab protocoleHegarty_removal_all: sum of all the tuning removal scores. Any sequence with a score<0 are sequences for which at least one of the three tuning removal rules established by Hegarty et al was fulfilled, i.e. low confidence phage sequences.start_inContigandend_inContig: coordinates of the final phage sequence in the genome. These two columns can only be generated if the output of the first run of VS2 is provided (option-w)
phages_filtered_XXX.tsvis the exact same file as above, except that only phage sequences passing filters are present.XXXvaries depending on the filters chosen.phages_filtered_XXX.lstis a list of all seqname of phage sequences passing filters.
seqtk subseq allOutputs_vs2-pass2_final-viral-combined.fa phages_filtered_XXX.lst > phages_filtered_XXX.fasta