deepTools: tools for exploring deep sequencing data
- Correlation analysis of BAM files
An end-to-end Nextflow workflow now lives in main.nf. It automates:
- Raw-read FastQC
- fastp trimming
- Trimmed-read FastQC
- BWA-MEM mapping with read groups
- samtools duplicate marking and filtered BAM generation with samtools flagstat
- MultiQC summary
- CNV calling with CNVkit, GATK CNV, or both
- config/samples.tsv - one row per sample, with FASTQ paths and CNV reference group.
- nextflow.config - reference genome, targets, output directory, executors, resources, and CNV settings.
- envs/qc_mapping_cnv.yaml - Conda environment for Nextflow and the required tools.
- main.nf - automated pipeline definition.
Edit config/samples.tsv and nextflow.config, then run:
conda env create -f envs/qc_mapping_cnv.yaml
conda activate qc_mapping_cnv
nextflow run main.nf -profile conda -resume
For an HPC run using the built-in SLURM profile:
nextflow run main.nf -profile slurm -resume
Set params.cnv_method in nextflow.config:
- cnvkit - build CNVkit references from samples marked normal, control, or reference, then call CNVs, producing .called.cns files.
- gatk - preprocess target intervals, collect read counts, build a panel of normals, denoise, segment, and call CNVs, producing .called.seg files.
- both - produce both CNVkit and GATK CNV outputs from the same marked BAMs.
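As an alternative to editing the config file, Nextflow generally lets a `--name value` option on the command line override `params.name`. The sketch below assumes main.nf reads params.cnv_method without redefining it; the helper name `cnv_run_command` is hypothetical. It only validates the method and prints the command, so nothing is launched until you choose to run it.

```shell
# Print the nextflow command for a given CNV method (and optional profile).
# Assumption: params.cnv_method can be overridden with --cnv_method on the CLI.
cnv_run_command() {
  method=$1
  profile=${2:-conda}
  case "$method" in
    cnvkit|gatk|both) ;;
    *)
      echo "cnv_method must be cnvkit, gatk, or both (got: $method)" >&2
      return 1
      ;;
  esac
  echo "nextflow run main.nf -profile $profile -resume --cnv_method $method"
}
```

For example, `cnv_run_command both slurm` prints the SLURM command line for running both callers; wrap it in `eval "$( ... )"` to execute it.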
Samples with cnv_role set to normal, control, or reference are used for CNV references. Other roles, such as case or treated, are CNV-called against their cnv_reference_group.
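To check how a sample sheet will be split before launching the pipeline, the role rule above can be sketched with awk. This is an illustration, not the pipeline's own logic: it assumes a tab-separated sheet with a header row, and the `sample` column name is a guess for this sheet (only cnv_role and cnv_reference_group are named above). The function name is hypothetical.

```shell
# Print "<sample><TAB><reference|case>" for each row of a samples sheet.
# Assumptions: tab-separated, header row present, columns named `sample`
# and `cnv_role`; role values are matched exactly (lowercase).
classify_cnv_roles() {
  awk -F'\t' '
    NR == 1 {
      # Map header names to column positions
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("sample" in col) || !("cnv_role" in col)) {
        print "missing sample or cnv_role column" > "/dev/stderr"
        exit 1
      }
      next
    }
    {
      role = $(col["cnv_role"])
      kind = (role == "normal" || role == "control" || role == "reference") ? "reference" : "case"
      print $(col["sample"]) "\t" kind
    }
  ' "$1"
}
```

Running `classify_cnv_roles config/samples.tsv` lists every sample with the group it would fall into, which makes a mistyped role easy to spot.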
For ChIP-seq, a separate generic Nextflow workflow is available in chipseq_main.nf with configuration in chipseq_nextflow.config.
- Input FASTQ files are fully controlled by config/chipseq_samples.tsv (no fixed filename pattern is required).
- Sample names can be any value in the sample column.
- The control_sample column pairs each ChIP sample with the exact Input sample ID for downstream bamCompare output.
- Paths and run parameters are set in chipseq_nextflow.config, including reference genome, mapping filters, bigWig options, and bamCompare settings (ratio mode with duplicate ignoring, no scale-factor normalization).
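Because control_sample must match an Input sample ID exactly, a quick pre-flight check of the sheet can catch typos before a run. The sketch below assumes a tab-separated sheet with a header row and that Input rows leave control_sample empty; neither is confirmed by the workflow itself, and the function name is hypothetical.

```shell
# Report ChIP rows whose control_sample is not present in the sample column.
# Exit status: 0 if all pairings resolve, 1 if any do not, 2 if columns are missing.
check_control_pairing() {
  awk -F'\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("sample" in col) || !("control_sample" in col)) {
        print "sheet needs sample and control_sample columns" > "/dev/stderr"
        err = 2
        exit
      }
      next
    }
    {
      name[NR] = $(col["sample"])
      ctrl[NR] = $(col["control_sample"])
      seen[name[NR]] = 1
    }
    END {
      if (err) exit err
      # Second pass over stored rows, once every sample ID is known
      for (r = 2; r <= NR; r++)
        if (ctrl[r] != "" && !(ctrl[r] in seen)) {
          print name[r] ": control_sample " ctrl[r] " is not a sample ID"
          bad = 1
        }
      exit bad + 0
    }
  ' "$1"
}
```

A typical call is `check_control_pairing config/chipseq_samples.tsv && echo "pairing OK"`.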
- chipseq_main.nf - ChIP-seq QC, trimming, mapping/filtering, optional bigWig generation, optional ChIP/Input bamCompare, and MultiQC.
- chipseq_nextflow.config - ChIP-seq pipeline parameters and runtime profiles.
- config/chipseq_samples.tsv - sample sheet template; edit file paths, sample IDs, and control_sample pairing.
nextflow run chipseq_main.nf -c chipseq_nextflow.config -profile conda -resume
For SLURM:
nextflow run chipseq_main.nf -c chipseq_nextflow.config -profile slurm -resume

GATK & CNVkit Workflows for Targeted and Whole-Exome Sequencing
This repository provides reproducible, HPC-ready workflows for copy number variation (CNV) analysis using two independent pipelines:
- GATK CNV Workflow — Best-practice CNV calling using the Broad Institute's Genome Analysis Toolkit (GATK).
- CNVkit Workflow — Coverage-based CNV detection using CNVkit for targeted and hybrid capture sequencing.
Each workflow includes:
- Ready-to-run SLURM batch scripts for HPC clusters
- Step-by-step setup and execution guides
- Notes on parameters, expected outputs, and biological interpretation
All scripts are fully modular and can be customized per project. Each step includes:
- Input and output definitions
- Environment setup instructions
- Optional parameters for advanced tuning
To rerun or adapt:
- Update paths in the scripts (BAM, REF, TARGETS, etc.)
- Submit each job to the HPC queue using sbatch
- Review logs and resulting CNV tables/plots
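When the steps must run in order, one option is to chain the submissions with SLURM job dependencies instead of submitting each script by hand. This is a sketch rather than part of the provided scripts: the script names in the usage line are hypothetical, and the `SBATCH` variable is an added convenience for dry runs. It relies on `sbatch --parsable` printing the job ID and on `--dependency=afterok:<id>`.

```shell
# Submit SLURM scripts in order; each job starts only if the previous one succeeded.
# Set SBATCH to another command name to dry-run without a cluster.
submit_chain() {
  sbatch_cmd=${SBATCH:-sbatch}
  prev=""
  for script in "$@"; do
    if [ -z "$prev" ]; then
      job=$("$sbatch_cmd" --parsable "$script") || return 1
    else
      job=$("$sbatch_cmd" --parsable --dependency="afterok:$prev" "$script") || return 1
    fi
    prev=${job%%;*}   # --parsable may print "jobid;cluster"; keep the ID only
    echo "submitted $script as job $prev"
  done
}
```

Usage would look like `submit_chain step1.sh step2.sh step3.sh`, with the real script names from the workflow folder substituted in.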
- Manual inspection of fusion genes
- Use the "supplementary", "mates on different chromosomes", and "mates on the same chromosome but more distant than expected" reads.
- Compare them with the normal sample.
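The three read classes above can be pulled out of SAM records with a small awk filter, which helps decide where to look before opening a genome browser. This is a rough sketch: the function name and the 1000 bp default distance are assumptions, and it uses only the standard SAM columns (FLAG bit 2048 for supplementary alignments, RNEXT for the mate's chromosome, TLEN for the template length).

```shell
# Read SAM records on stdin and print "<read name><TAB><class>" for reads that
# may support a fusion or duplication. $1 = maximum expected template length.
classify_fusion_evidence() {
  max_insert=${1:-1000}   # assumed default; tune to the library insert size
  awk -F'\t' -v max="$max_insert" '
    /^@/ { next }                         # skip SAM header lines
    {
      flag = $2
      rnext = $7
      tlen = ($9 < 0) ? -$9 : $9
      same_chrom = (rnext == "=" || rnext == $3)
      if (int(flag / 2048) % 2 == 1)      kind = "supplementary"
      else if (!same_chrom && rnext != "*") kind = "mate_other_chrom"
      else if (same_chrom && tlen > max)  kind = "mate_distant"
      else next
      print $1 "\t" kind
    }
  '
}
```

For example, `samtools view tumor.bam chr1:1000000-2000000 | classify_fusion_evidence 1000 | cut -f2 | sort | uniq -c` summarizes the evidence in a region (the file name and region are placeholders); running the same command on the normal sample gives the comparison.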
- CNVkit: Contains scripts and tools for copy number variation analysis using CNVkit.
- GATK_CNV: Includes files related to the Genome Analysis Toolkit for copy number variations.
- Mapping: Houses the mapping files and scripts used for aligning sequencing data.
- QC: Contains quality control metrics and reports for the datasets.
- ChIP-Seq_Chromatin_analysis: Includes analysis scripts and data related to ChIP-Seq experiments.
- Duplication_fusion_genes: Contains files related to the analysis of gene duplications and fusions.
- deeptools: Houses deepTools scripts for exploring deep sequencing data, such as correlation analysis of BAM files.
- bcftools: Contains tools for variant calling and manipulating VCF files.
- fastq: Houses FASTQ files of raw sequencing data.
Relevant tools:
- GATK CNV – Benjamin et al., Nature Genetics (2013)
- CNVkit – Talevich et al., PLOS Computational Biology (2016)