A lightweight and modular NGS pipeline from FASTQ → BAM → VCF, designed for small projects and rapid iteration.
Ready for AWS Batch.
- Snakemake version (ideal for development, rule-based execution locally)
- Nextflow version (designed for cloud execution and scalability)
This pipeline automates the following steps:
- Trimming (TrimGalore)
- Quality control (FastQC + MultiQC)
- Read alignment (BWA MEM)
- Sorting and indexing (Samtools)
- Variant calling (GATK Mutect2 - Tumor-Only Mode)
- This pipeline performs variant calling exclusively in tumor-only mode. Matched normal samples are not supported in the current version.
It supports single-end (Ion Torrent, AmpliSeq; Snakemake version only) and paired-end (Illumina) FASTQ files.
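As a rough orientation, the steps listed above correspond to a chain of command-line calls like the one sketched below. This is not taken from the pipeline's own rules or modules: the function name `print_steps`, the reference path, the read-group string, and the TrimGalore output naming (`_val_1.fq` / `_val_2.fq` for uncompressed `.fastq` input) are assumptions made for illustration. The script only prints the commands, so it runs without any of the tools installed.

```shell
#!/usr/bin/env bash
# Sketch only: prints (does not execute) one plausible command per pipeline step
# for a single paired-end sample. Paths and names are hypothetical.

print_steps() {
    local sample="$1" r1="$2" r2="$3" ref="$4"
    local out="results"
    local t1 t2
    # Assumed TrimGalore naming for paired, uncompressed .fastq input
    t1="$out/trimmed_fastq/$(basename "${r1%.fastq}")_val_1.fq"
    t2="$out/trimmed_fastq/$(basename "${r2%.fastq}")_val_2.fq"
    local raw="$out/mapped_reads/$sample.bam"
    local bam="$out/sorted_reads/$sample.bam"

    # Unquoted heredoc: variables expand, \t stays literal for the read group
    cat <<EOF
trim_galore --paired -o $out/trimmed_fastq $r1 $r2
fastqc -o $out/fastqc $t1 $t2
multiqc -o $out/fastqc $out/fastqc
bwa mem -R '@RG\tID:$sample\tSM:$sample' $ref $t1 $t2 | samtools view -b -o $raw -
samtools sort -o $bam $raw
samtools index $bam
gatk Mutect2 -R $ref -I $bam -O $out/variant_calls/$sample.vcf
EOF
}

print_steps SRR35855706 \
    data/fastq_files/SRR35855706_1.fastq \
    data/fastq_files/SRR35855706_2.fastq \
    data/references/genome.fa
```

Only one BAM is passed to Mutect2 and no normal sample is named, which is what tumor-only mode amounts to on the command line.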
- Clone the repo:
git clone https://github.com/Roxicaro/Pypeline.git
cd Pypeline
- Install requirements: Snakemake / Nextflow (Nextflow can be used on any POSIX-compatible system (Linux, macOS, etc.), and on Windows through WSL.)
- Prepare the `data/` directory (FASTQs + reference)
- Edit `workflow/config.yaml` (Snakemake) or `nextflow.config` (Nextflow)
- Add Sample_ID and FASTQ file paths to `data/samplesheet.csv` (Nextflow)
- Run
Before running the pipeline, organize your data as follows:
data/
├── fastq_files/ # FASTQ input files
├── bed/ # BED files for target regions (optional)
└── references/ # Reference genome files (FASTA + index files for BWA MEM and Mutect2)
workflow/
├── envs/ # Environment .yaml files
├── config.yaml # Configuration file where the user sets pipeline parameters and file paths
├── Snakefile # Main workflow file
└── functions.smk # Python functions to get input files
results/ # Output files will be written here

Nextflow/
├── envs/ # Environment .yaml files
├── main.nf # Main workflow file
├── nextflow.config # Configuration file where the user sets pipeline parameters and file paths
├── results/ # Output files will be written here
├── modules/ # Code for the different workflow modules
└── data/
    ├── fastq_files/ # FASTQ input files
    ├── samplesheet.csv # Sample list and file paths
    ├── bed/ # BED files for target regions (optional)
    └── references/ # Reference genome files (FASTA + index files for BWA MEM and Mutect2)

- FASTQ files: Input sequencing reads.
- BED files: Target regions for variant calling (optional).
- References: Reference genome files, including any required index files for `bwa mem` and `Mutect2` (.fa / .fai / .dict / .amb / .ann / .bwt / .pac / .sa).
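A quick way to confirm that a reference is complete before a run is to list which of the files above are missing. The sketch below assumes the default output names of `bwa index` (`<fasta>.amb/.ann/.bwt/.pac/.sa`), `samtools faidx` (`<fasta>.fai`) and `gatk CreateSequenceDictionary` (extension replaced by `.dict`). The function name `missing_ref_files` and the path `data/references/genome.fa` are hypothetical.

```shell
#!/usr/bin/env bash
# Prints every expected reference/index file that does not exist.
# Assumes default tool naming; the reference path used below is hypothetical.

missing_ref_files() {
    local fasta="$1" f
    # genome.fa -> genome.dict (the dictionary replaces the FASTA extension)
    local dict="${fasta%.*}.dict"
    for f in "$fasta" "$fasta.fai" "$dict" \
             "$fasta.amb" "$fasta.ann" "$fasta.bwt" "$fasta.pac" "$fasta.sa"; do
        [ -e "$f" ] || echo "$f"
    done
}

missing_ref_files data/references/genome.fa
```

Empty output means all eight files are present. Missing indexes can typically be created with `bwa index genome.fa`, `samtools faidx genome.fa` and `gatk CreateSequenceDictionary -R genome.fa`.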
To run the Nextflow pipeline, you must provide a samplesheet CSV listing the input FASTQ files.
This file must be located at: `Nextflow/data/samplesheet.csv`
| sample_id | read1 | read2 |
|---|---|---|
| Unique sample name | Path to R1 FASTQ | Path to R2 FASTQ |
Note: Mutect2 runs in tumor-only mode.
The pipeline does not accept or require normal/control samples.
sample_id,read1,read2
SRR35855706,data/fastq_files/SRR35855706_1.fastq,data/fastq_files/SRR35855706_2.fastq

First, edit `config.yaml` (Snakemake) or `data/samplesheet.csv` (Nextflow) to specify the FASTQ file paths.
To run on AWS Batch (Nextflow), edit `nextflow.config` to include the paths to the reference and index files in an S3 bucket.
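Before launching the Nextflow version, the samplesheet format described above can be sanity-checked with a few lines of shell. This is a minimal sketch, not part of the pipeline: the function name `check_samplesheet` is hypothetical, and FASTQ paths are assumed to be relative to the directory the check is run from.

```shell
#!/usr/bin/env bash
# Checks a samplesheet for the expected header, exactly three fields per row,
# unique sample IDs and existing FASTQ files. Prints one line per problem.

check_samplesheet() {
    local sheet="$1" header sample_id read1 read2 extra
    local seen=","
    {
        IFS= read -r header
        header="${header%$'\r'}"          # tolerate Windows line endings
        if [ "$header" != "sample_id,read1,read2" ]; then
            echo "unexpected header: $header"
        fi
        # The "|| [ -n ... ]" part keeps a last line without trailing newline
        while IFS=, read -r sample_id read1 read2 extra || [ -n "$sample_id" ]; do
            read2="${read2%$'\r'}"
            if [ -z "$sample_id$read1$read2" ]; then
                continue                  # skip blank lines
            fi
            if [ -z "$read1" ] || [ -z "$read2" ] || [ -n "$extra" ]; then
                echo "$sample_id: expected exactly 3 fields"
                continue
            fi
            case "$seen" in
                *",$sample_id,"*) echo "$sample_id: duplicate sample_id" ;;
            esac
            seen="$seen$sample_id,"
            [ -f "$read1" ] || echo "$sample_id: missing file $read1"
            [ -f "$read2" ] || echo "$sample_id: missing file $read2"
        done
    } < "$sheet"
}

# Run the check only if the samplesheet is where the README expects it
if [ -f data/samplesheet.csv ]; then
    check_samplesheet data/samplesheet.csv
fi
```

No output means the sheet passed these checks. When the inputs live in S3 for AWS Batch, the file-existence test does not apply and would need to be dropped or replaced.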
docker run -it --rm \
-v $PWD:/pipeline \
-w /pipeline/workflow \
roxicaro/pypeline-snakemake \
snakemake --cores 4

`--cores` specifies the number of cores to be used.
Local (Not recommended. Requires all tools to be locally installed):
nextflow run main.nf

Docker (Recommended):
nextflow run main.nf -profile docker

AWS Batch:
nextflow run main.nf -profile aws_batch

To render the workflow DAG (Snakemake):
docker run -it --rm \
-v $PWD:/pipeline \
-w /pipeline/workflow \
roxicaro/pypeline-snakemake \
snakemake -np --dag | dot -Tsvg > dag.svg

After processing, results are written to:
results/
├── trimmed_fastq/ # FASTQ files after trimming
├── fastqc/ # QC reports (per sample + MultiQC)
├── mapped_reads/ # Unsorted BAM
├── sorted_reads/ # Sorted BAM and BAI files
├── variant_calls/ # `.vcf` files
└── logs/ # Execution logs for troubleshooting

- Snakemake / Nextflow
Note: Nextflow can be used on any POSIX-compatible system (Linux, macOS, etc), and on Windows through WSL.
Paired-end sequencing FASTQs (Illumina):
Single-end sequencing FASTQs (Ion Torrent):