- Overview
- Pipeline Schema
- Quick Start
- Features
- Key Improvements
- Prerequisites
- Installation
- Running the Pipeline
- Resource Management on SLURM
- Pipeline Steps
- Output Structure
- Key Output Files
- Configuration
- Docker & Singularity Support
- Citations
- Support
- License
- Authors
MetaNextViro is a robust, modular Nextflow pipeline designed primarily for virus identification and characterization from metagenomic sequencing data. While it also supports bacterial profiling, its main focus is on the detection, classification, and annotation of viral sequences in complex samples such as environmental, clinical, or animal/human microbiome datasets.
The pipeline integrates state-of-the-art tools for:
- Quality control and preprocessing of raw reads (FastQC, fastp, flexbar, trim_galore)
- Assembly of metagenomic data (MEGAHIT, metaSPAdes, or hybrid)
- Taxonomic classification with Kraken2 and visualization with Krona
- Viral genome completeness and quality assessment (CheckV)
- Viral sequence identification (VirFinder with custom filtering)
- BLAST-based annotation for both viral and bacterial contigs
- Automated organization of contigs by taxonomy and family
- Contig-level coverage analysis and visualization for assembled contigs
- Comprehensive reporting with coverage plots and an interactive HTML summary
MetaNextViro is suitable for:
- Discovery of known and novel viruses in metagenomic samples
- Viral diversity and abundance profiling
- Viral genome recovery and annotation
- Comparative virome analysis across samples or conditions
- Integrated viral and bacterial community profiling (optional)
The pipeline is highly portable and reproducible, supporting Conda, Docker, and Singularity environments, and can be run on local workstations, HPC clusters (SLURM), or in the cloud. It is ideal for virome research, outbreak investigations, environmental surveillance, and any project requiring robust viral metagenomics.
graph TD
subgraph "Input"
A[Input Samplesheet]
end
subgraph "1. Pre-processing"
A --> B(Parse Input);
B --> C(FASTQC);
B --> D{Trim Reads};
end
subgraph "2. Core Analysis"
D --> E[Assembly: MEGAHIT / metaSPAdes / Hybrid];
D --> F[Taxonomic Profiling: Kraken2];
end
subgraph "3. Downstream Analysis"
F --> G[Krona Report];
E --> H[QUAST Report];
E --> I[Viral Analysis: CheckV & VirFinder];
E --> J[BLAST Annotation: blastn / blastx];
E --> K[Coverage Analysis: Bowtie2];
end
subgraph "4. Organization & Final Report"
J --> L[Organize Contigs by Taxonomy];
K --> M[Coverage Plots];
G & H & I & J & L & M --> N(Final HTML Report);
end
style A fill:#fff,stroke:#333,stroke-width:2px
style N fill:#fff,stroke:#333,stroke-width:2px
- Nextflow (>=21.10.3)
- Java (>=8)
- Conda, Docker, or Singularity
# Clone the repository
git clone https://github.com/navduhan/metanextviro.git
cd metanextviro
# Run with conda (recommended for first-time users)
nextflow run main.nf \
--input <your_samplesheet>.csv \
--outdir <output_directory> \
--kraken2_db <path_to_your_kraken2_db> \
--checkv_db <path_to_your_checkv_db> \
-profile conda
# Run with singularity (recommended for HPC)
nextflow run main.nf \
--input <your_samplesheet>.csv \
--outdir <output_directory> \
--kraken2_db <path_to_your_kraken2_db> \
--checkv_db <path_to_your_checkv_db> \
-profile singularity,slurm
Create a samplesheet.csv file:
sample,fastq_1,fastq_2
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
- Quality control and adapter/quality trimming (FastQC, fastp, flexbar, trim_galore)
- Multiple assembly options (MEGAHIT, metaSPAdes, or hybrid)
- BLAST-based and Kraken2-based taxonomic annotation
- Flexible BLASTX tool choice (DIAMOND or traditional BLASTX)
- Automated organization of contigs by taxonomy and family
- Viral genome completeness assessment (CheckV) and classification (VirFinder with custom filtering)
- Contig-level coverage analysis and visualization
- Modern, Template-Based Final Report with a clean and interactive design.
- Structured, per-sample output organization
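The samplesheet format shown in the Quick Start can be checked before launching a run. The following is a minimal Python sketch, not the pipeline's own input parser: the function name parse_samplesheet is hypothetical, and it assumes paired-end input where all three columns (sample, fastq_1, fastq_2) are required, as in the example samplesheet.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "fastq_1", "fastq_2"]


def parse_samplesheet(text):
    """Parse samplesheet CSV text and return a list of row dicts.

    Raises ValueError on an unexpected header, an empty field,
    or a duplicated sample name. (Hypothetical helper, for illustration.)
    """
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != REQUIRED_COLUMNS:
        raise ValueError(
            f"expected header {REQUIRED_COLUMNS}, got {reader.fieldnames}"
        )
    rows = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            if not (row[column] or "").strip():
                raise ValueError(f"line {line_no}: empty '{column}' field")
        if row["sample"] in seen:
            raise ValueError(f"line {line_no}: duplicate sample '{row['sample']}'")
        seen.add(row["sample"])
        rows.append(row)
    return rows
```

A check like this only inspects the text of the samplesheet; confirming that the FASTQ paths exist would be a separate step.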
- Custom R script (run_virfinder.R) for improved control over filtering criteria
- Dual output format: full results and high-confidence filtered results (score ≥ 0.9, p-value ≤ 0.05)
- Contig-level coverage calculation instead of nucleotide-level depth
- Custom bash script (calculate_contig_coverage.sh) for efficient processing
- Template-Based: Report is generated using a Nextflow template, making it robust and easy to modify.
- Clean Design: Features a modern, card-based layout for easy navigation.
- Guaranteed Completion: Generated only after all pipeline steps are finished.
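The high-confidence VirFinder filter (score ≥ 0.9, p-value ≤ 0.05) can be illustrated in a few lines. This is a hedged Python sketch and not the pipeline's run_virfinder.R script: the function name is hypothetical, and it assumes the full results are tab-separated with the columns name, length, score, and pvalue, which is an assumption about the file layout rather than something stated here.

```python
import csv
import io


def filter_high_confidence(tsv_text, min_score=0.9, max_pvalue=0.05):
    """Return names of contigs passing both thresholds.

    Assumes a tab-separated table with 'name', 'score' and 'pvalue'
    columns (assumed layout, for illustration only).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["name"]
        for row in reader
        # Both conditions must hold: high score AND low p-value.
        if float(row["score"]) >= min_score
        and float(row["pvalue"]) <= max_pvalue
    ]
```

Keeping the thresholds as arguments makes it easy to compare the default cut-offs with stricter or looser ones on the same full-results file.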
- Nextflow (>=21.10.3)
- Java (>=8)
- Python (>=3.8)
- Conda (recommended)
- Docker or Singularity (optional, for containerized execution)
All dependencies can be installed using the provided environment.yml file or automatically with the conda profile.
- FastQC, fastp, flexbar, trim-galore
- MEGAHIT, SPAdes, BLAST+, DIAMOND, Kraken2, QUAST
- Bowtie2, Samtools, Bedtools
- CheckV, VirFinder (R)
- Python: biopython, pandas, matplotlib, seaborn, ete3
- R: r-base, r-virfinder
- Clone the repository:
    git clone https://github.com/navduhan/metanextviro.git
    cd metanextviro
- Create and activate the conda environment (platform-specific):
| Platform | Command to Use |
|---|---|
| Linux | conda env create -f environment.yml |
| Intel/AMD Mac | conda env create -f environment.yml |
| Apple Silicon (M1/M2) | CONDA_SUBDIR=osx-64 conda env create -f environment.yml |
  - Linux and Intel/AMD Macs:
      conda env create -f environment.yml
      conda activate metanextviro
  - Apple Silicon (M1/M2) Macs:
      CONDA_SUBDIR=osx-64 conda env create -f environment.yml
      conda activate metanextviro
    This tells conda to install Intel-compatible binaries, which work via Rosetta 2 on Apple Silicon.
Or let Nextflow manage it automatically with -profile conda.
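The platform-specific choice of command can also be made programmatically. The snippet below is a small Python sketch under stated assumptions: conda_create_command is a hypothetical helper, not part of the repository, and it assumes a natively running Python on Apple Silicon, where platform.system() reports "Darwin" and platform.machine() reports "arm64".

```python
import platform


def conda_create_command(system=None, machine=None):
    """Return the environment-creation command for a platform.

    Arguments default to the values reported for the current machine.
    (Hypothetical helper, for illustration only.)
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    command = "conda env create -f environment.yml"
    # Apple Silicon needs Intel (osx-64) packages, run through Rosetta 2.
    if system == "Darwin" and machine == "arm64":
        return "CONDA_SUBDIR=osx-64 " + command
    return command
```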
Nextflow can automatically manage all dependencies using the conda profile:
nextflow run main.nf --input <your_samplesheet>.csv --outdir <output_dir> -profile conda
For maximum reproducibility, especially on an HPC, it is recommended to use Docker or Singularity.
- Build the Docker image (optional):
    docker build -t metanextviro:latest .
- Run with Singularity and SLURM:
    nextflow run main.nf \
      --input <your_samplesheet>.csv \
      --outdir <output_dir> \
      --kraken2_db <path_to_kraken2_db> \
      --checkv_db <path_to_checkv_db> \
      -profile singularity,slurm
  Nextflow will automatically pull the Docker image and convert it to a Singularity image.
All parameters can be set on the command line or in nextflow.config.
| Parameter | Description | Default |
|---|---|---|
| --input | Path to input samplesheet (CSV) | (required) |
| --outdir | Output directory | ./results |
| --adapters | Path to adapters file (for trimming) | (provided) |
| --trimming_tool | Trimming tool: fastp, flexbar, trim_galore | trim_galore |
| --assembler | Assembler: megahit, metaspades, hybrid | hybrid |
| --kraken2_db | Path to Kraken2 database | (required) |
| --blastdb_viruses | Path to BLAST viruses database | (optional) |
| --blastdb_nt | Path to BLAST nt database | (optional) |
| --blastdb_nr | Path to BLAST nr database | (optional) |
| --diamonddb | Path to DIAMOND protein database | (optional) |
| --blastx_tool | BLASTX tool: diamond, blastx (DIAMOND is much faster) | diamond |
| --checkv_db | Path to CheckV database (for viral genome completeness assessment) | (required) |
| --min_contig_length | Minimum contig length for assembly | 200 |
| --quality | Quality threshold for trimming | 30 |
| --profile | Nextflow profile (e.g., local, slurm, conda, docker) | slurm |
| --help | Show help message and exit | |
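When runs are launched from scripts, it can help to check the choice-type parameters before building the command line. This is a hedged Python sketch: build_command and ALLOWED_CHOICES are hypothetical names, the allowed values are copied from the parameter table, and the function only assembles a command string without running Nextflow.

```python
import shlex

# Allowed values copied from the parameter table.
ALLOWED_CHOICES = {
    "trimming_tool": {"fastp", "flexbar", "trim_galore"},
    "assembler": {"megahit", "metaspades", "hybrid"},
    "blastx_tool": {"diamond", "blastx"},
}


def build_command(params, profile="conda"):
    """Validate choice parameters and return a nextflow command string.

    `params` maps parameter names (without leading dashes) to values.
    (Hypothetical helper, for illustration only.)
    """
    for name, allowed in ALLOWED_CHOICES.items():
        if name in params and params[name] not in allowed:
            raise ValueError(
                f"--{name} must be one of {sorted(allowed)}, got {params[name]!r}"
            )
    args = ["nextflow", "run", "main.nf"]
    for name, value in params.items():
        args += [f"--{name}", str(value)]
    # Nextflow's own profile option takes a single dash.
    args += ["-profile", profile]
    # shlex.join quotes any argument containing spaces or shell characters.
    return shlex.join(args)
```

For example, build_command({"input": "samplesheet.csv", "outdir": "results"}, profile="singularity,slurm") yields a command in the same shape as the Quick Start examples.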
The pipeline is configured to be efficient on HPC systems using a SLURM executor. Instead of requesting large resources for every job, it uses a label system to assign resources based on the task's requirements.
The defined labels in nextflow/configs/slurm.config are:
| Label | CPUs | Memory | Default Time | Use Case |
|---|---|---|---|---|
| low | 10 | 50 GB | 8h | QC, trimming, reporting, etc. |
| medium | 25 | 100 GB | 12h | Lighter alignments, viral analysis. |
| high | 40 | 200 GB | 24h | Heavy assembly, large database BLAST. |
| vhigh | 40 | 250 GB | 48h | Very memory-intensive assembly (metaSPAdes). |
Any process without a label receives a minimal default of 2 CPUs, 8 GB of RAM, and a 4-hour time limit to prevent resource waste.
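The label lookup with its fallback can be modelled in a few lines. This Python sketch only illustrates the behaviour described here; the real settings live in the Nextflow configuration, and the names LABEL_RESOURCES and resources_for are hypothetical.

```python
# (CPUs, memory in GB, time in hours), copied from the label table.
LABEL_RESOURCES = {
    "low": (10, 50, 8),
    "medium": (25, 100, 12),
    "high": (40, 200, 24),
    "vhigh": (40, 250, 48),
}

# Minimal default applied to any process without a label.
DEFAULT_RESOURCES = (2, 8, 4)


def resources_for(label=None):
    """Return the resource request for a label, or the minimal default."""
    cpus, memory_gb, hours = LABEL_RESOURCES.get(label, DEFAULT_RESOURCES)
    return {"cpus": cpus, "memory": f"{memory_gb} GB", "time": f"{hours}h"}
```

Under this model an unlabelled process requests 2 CPUs, 8 GB, and 4 hours, while a process labelled vhigh requests 40 CPUs, 250 GB, and 48 hours.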
- Input Parsing
- Validates input files and sample sheet
- Preprocessing
- FastQC on raw reads
- Adapter/quality trimming (fastp, flexbar, or trim_galore)
- FastQC on trimmed reads
- Taxonomic Profiling
- Kraken2 classification
- Krona visualization
- Assembly
- MEGAHIT, metaSPAdes, or hybrid assembly
- QUAST quality assessment
- BLAST Annotation
- Taxonomic annotation of contigs against multiple databases (NT, NR, viruses)
- Viral Analysis
- CheckV genome completeness assessment
- VirFinder classification with custom filtering
- Contig Organization
- Organize contigs by taxonomy and family using NT database results
- Coverage Analysis
- Contig-level coverage calculation and statistics
- Enhanced coverage plots with intelligent x-axis labeling
- Final Report Generation
- A comprehensive and modern HTML summary report is generated using a Nextflow template.
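To make the coverage step concrete, the sketch below shows one way to collapse per-base depth into a single mean value per contig. It is an illustration in Python and not the pipeline's calculate_contig_coverage.sh script: the function name is hypothetical, and the input is assumed to be three tab-separated columns (contig, position, depth) with every position reported, as `samtools depth -a` produces.

```python
from collections import defaultdict


def contig_mean_depth(depth_lines):
    """Return {contig: mean depth} from per-base depth lines.

    Each line is 'contig<TAB>position<TAB>depth'. The mean is only the
    true contig-wide mean if zero-depth positions are present in the input.
    (Hypothetical helper, for illustration only.)
    """
    depth_sum = defaultdict(int)
    n_positions = defaultdict(int)
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        contig, _position, depth = line.split("\t")
        depth_sum[contig] += int(depth)
        n_positions[contig] += 1
    return {c: depth_sum[c] / n_positions[c] for c in depth_sum}
```

Summarising at the contig level yields one number per contig, which is the kind of value a per-contig bar plot needs.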
results/
├── fastp/ # Trimmed reads and fastp reports
├── fastqc/ # Raw and trimmed read QC reports
├── assembly/ # Assembly results
├── assembly_stats/ # Assembly quality assessment (QUAST)
├── blast_results/ # BLAST annotation results
├── kraken2_results/ # Kraken2 classification results
├── krona_results/ # Krona HTML visualizations
├── organized_contigs/ # Organized contigs by taxonomy
├── checkv/ # CheckV viral genome completeness assessment
├── virfinder/ # VirFinder results (full and filtered)
├── coverage/ # BAM files and contig-level coverage stats
├── coverage_plots/ # Coverage plots (PNG)
├── final_report/ # Comprehensive HTML report (final step)
└── ... # Other outputs as configured
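After a run, a quick check can confirm that the expected result folders exist. This is a minimal Python sketch with hypothetical names (EXPECTED_DIRS, missing_outputs); the folder names are copied from the tree above, and some of them may legitimately be absent depending on the options used, for example when optional BLAST databases are not supplied.

```python
from pathlib import Path

# Folder names copied from the output tree; which ones appear may
# depend on the options chosen for a given run.
EXPECTED_DIRS = [
    "fastp", "fastqc", "assembly", "assembly_stats", "blast_results",
    "kraken2_results", "krona_results", "organized_contigs", "checkv",
    "virfinder", "coverage", "coverage_plots", "final_report",
]


def missing_outputs(outdir):
    """Return the expected sub-directories that are absent from outdir."""
    root = Path(outdir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```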
- final_report.html: A comprehensive and modern HTML summary report. It includes:
  - A clean, card-based layout for easy navigation.
  - Direct links to key results from all major pipeline steps.
  - A summary of the run and a timestamp.
- virfinder_full_*.txt: Complete VirFinder results for all contigs
- virfinder_filtered_*.txt: High-confidence viral contigs (score ≥ 0.9, p-value ≤ 0.05)
- coverage_*.txt: Contig-level coverage statistics.
- coverage_plot_*.png: Bar plots showing coverage distribution across contigs.
- nextflow.config: Main configuration file.
- environment.yml: Conda environment for all dependencies.
- Dockerfile: (Optional) Build your own container for full reproducibility.
- nextflow/bin/: Custom scripts and templates for enhanced functionality.
- Docker: Use the provided Dockerfile to build an image for full reproducibility.
- Singularity: On HPC, use the singularity profile. Nextflow will automatically convert the Docker image on the fly.
nextflow run main.nf --input <samplesheet> --outdir <results> -profile slurm,singularity
If you use this pipeline, please cite:
- MetaNextViro Pipeline: https://github.com/navduhan/metanextviro
- Nextflow: Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316-319. https://doi.org/10.1038/nbt.3820
- FastQC: Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- MEGAHIT: Li, D., Liu, C. M., Luo, R., Sadakane, K., & Lam, T. W. (2015). MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics, 31(10), 1674-1676. https://doi.org/10.1093/bioinformatics/btv033
- SPAdes: Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., ... & Pevzner, P. A. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19(5), 455-477. https://doi.org/10.1089/cmb.2012.0021
- Kraken2: Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1), 1-13. https://doi.org/10.1186/s13059-019-1891-0
- CheckV: Nayfach, S., Camargo, A. P., Schulz, F., Eloe-Fadrosh, E., Roux, S., & Kyrpides, N. C. (2021). CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nature Biotechnology, 39(5), 578-585. https://doi.org/10.1038/s41587-020-00774-7
- VirFinder: Ren, J., Ahlgren, N. A., Lu, Y. Y., Fuhrman, J. A., & Sun, F. (2017). VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome, 5(1), 1-20. https://doi.org/10.1186/s40168-017-0283-5
- BLAST: Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403-410. https://doi.org/10.1016/S0022-2836(05)80360-2
- Bowtie2: Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357-359. https://doi.org/10.1038/nmeth.1923
- Samtools: Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... & Durbin, R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078-2079. https://doi.org/10.1093/bioinformatics/btp352
- Krona: Ondov, B. D., Bergman, N. H., & Phillippy, A. M. (2011). Interactive metagenomic visualization in a Web browser. BMC Bioinformatics, 12(1), 1-10. https://doi.org/10.1186/1471-2105-12-385
For issues, questions, or suggestions:
- Create an issue on GitHub
- Contact: naveen.duhan@outlook.com
This project is licensed under the MIT License - see the LICENSE file for details.
For more information about the MIT License, visit: https://opensource.org/licenses/MIT
- Naveen Duhan
- [Other contributors]
