End-to-end tumor-normal somatic variant calling pipeline for breast cancer sequencing data. The workflow starts from public sequencing accessions, performs FASTQ generation, quality control, read alignment, BAM processing, somatic variant calling with GATK Mutect2, filtering, SnpEff annotation, cancer-gene prioritization, and final visualization of candidate driver mutations.
This repository is designed as a GitHub-ready demonstration of real-world cancer genomics pipeline development. Large raw data files such as FASTQ, SRA, BAM, BAI, compressed VCF, and index files are intentionally excluded. Small result tables, QC reports, alignment metrics, scripts, and final figures are included.
• Processed paired tumor-normal whole genome sequencing data using a reproducible variant-calling workflow.
• Generated high-confidence somatic mutation calls using GATK Mutect2.
• Annotated variants using SnpEff to determine functional consequences and predicted impact.
• Identified 313 HIGH-impact variants and 2,468 MODERATE-impact variants.
• Prioritized mutations occurring in known cancer-associated genes.
• Detected candidate driver alterations in genes including PIK3CA, CDH1 and ATM.
• Classified variants according to functional consequence including missense, nonsense and frameshift mutations.
• Generated mutation burden summaries and variant-impact distributions.
• Produced publication-ready visualizations for mutation prioritization and biological interpretation.
• Constructed a reproducible workflow from raw sequencing reads to candidate driver mutation identification.
Input data are matched tumor-normal sequencing runs from one patient, downloaded from NCBI SRA and processed locally.
Project accessions used in the local pipeline:
- SRR19077323
- SRR19077324
The local sample naming used:
- 11871T for tumor
- 11871N for matched normal
Small metadata/sample information is stored in:
data/breast5.txt
Raw SRA, FASTQ, BAM, and compressed VCF files are not included in this GitHub version.
The main goal was to build and document a complete somatic mutation discovery workflow for a tumor-normal breast cancer sample.
Specific objectives:
- Download / process sequencing reads from public accessions.
- Perform sequencing quality control.
- Align tumor and normal reads to the human reference genome.
- Generate sorted, indexed, duplicate-marked BAM files.
- Call somatic variants using matched tumor-normal analysis.
- Filter candidate variants to retain PASS calls.
- Annotate functional consequences using SnpEff.
- Prioritize high-impact, moderate-impact, and cancer-gene variants.
- Generate interpretable visual summaries and candidate driver tables.
Somatic variant calling identifies mutations present in tumor DNA but absent or much less supported in matched normal DNA. In cancer genomics, these mutations can reveal driver genes, tumor suppressor loss, oncogene activation, DNA repair defects, and clinically relevant candidate targets.
Breast cancer commonly involves alterations in pathways related to:
- PI3K signaling
- DNA damage repair
- cell adhesion and epithelial integrity
- tumor suppressor function
- growth factor signaling
This project demonstrates how a matched tumor-normal workflow can move from sequencing reads to interpretable cancer mutation findings.
flowchart TD
A[NCBI SRA accessions: tumor + matched normal] --> B[FASTQ generation with SRA Toolkit / fasterq-dump]
B --> C[Read QC with FastQC]
C --> D[QC aggregation with MultiQC]
D --> E[Alignment to GRCh38 with BWA-MEM]
E --> F[BAM sorting and indexing with Samtools]
F --> G[Duplicate marking with Picard/GATK]
G --> H[Somatic variant calling with GATK Mutect2]
H --> I[Variant filtering with FilterMutectCalls]
I --> J[PASS-only somatic variants]
J --> K[Functional annotation with SnpEff]
K --> L[Impact classification: HIGH, MODERATE, LOW, MODIFIER]
L --> M[Cancer-gene prioritization]
M --> N[Candidate driver mutation table]
N --> O[Final figures and interpretation]
| Stage | Tool | Purpose |
|---|---|---|
| FASTQ generation | SRA Toolkit / fasterq-dump | Convert SRA accessions to paired FASTQ files |
| QC | FastQC | Per-read quality, adapter, GC, sequence quality checks |
| QC aggregation | MultiQC | Combined QC reports |
| Alignment | BWA-MEM | Align reads to GRCh38 reference genome |
| BAM processing | Samtools | Sort and index BAM files |
| Duplicate handling | Picard / GATK | Mark duplicate reads |
| Variant calling | GATK Mutect2 | Matched tumor-normal somatic variant calling |
| Variant filtering | GATK FilterMutectCalls | Retain high-confidence somatic calls |
| VCF handling | bcftools | PASS extraction, stats, variant table creation |
| Annotation | SnpEff | Functional consequence prediction |
| Prioritization | Python / pandas | High-impact and cancer-gene filtering |
| Visualization | matplotlib | Driver, impact, effect, and gene-level plots |
SRA accessions were converted to paired-end FASTQ files using SRA Toolkit.
Script:
scripts/run_fasterq_11871.sh
Large FASTQ files are excluded from GitHub.
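The contents of scripts/run_fasterq_11871.sh are not reproduced here, so the following is only a minimal sketch of how the two accessions could be turned into fasterq-dump calls. The output directory name and thread count are assumptions, not values taken from the script.

```python
import shlex


def fasterq_dump_command(accession, outdir="fastq", threads=4):
    """Build a fasterq-dump command line for one SRA accession.

    outdir and threads are illustrative defaults (assumptions).
    """
    return [
        "fasterq-dump",
        "--split-files",            # write read 1 and read 2 to separate FASTQ files
        "--threads", str(threads),
        "--outdir", outdir,
        accession,
    ]


# Print the commands rather than running them, so the sketch has no side effects.
for accession in ("SRR19077323", "SRR19077324"):
    print(shlex.join(fasterq_dump_command(accession)))
```

To actually run a command, the list can be passed to `subprocess.run(cmd, check=True)`.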
FastQC was used for per-file quality reports. MultiQC was used to combine QC results.
Included outputs:
qc/fastqc/
qc/multiqc/
These reports document sequencing quality and provide evidence that the reads were inspected before downstream analysis.
Reads were aligned to the human GRCh38 reference genome using BWA-MEM. BAM files were sorted, indexed, and duplicate-marked with Samtools and Picard/GATK utilities.
Script:
scripts/align_11871.sh
Included outputs:
metrics/11871T.metrics.txt
metrics/11871N.metrics.txt
Large BAM/BAI/SAM files are excluded from GitHub.
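One detail matters for the later Mutect2 step: the `-normal 11871N` argument is matched against the sample name (`SM`) in the BAM read group, so the alignment has to set it. The sketch below shows one way to build a `bwa mem` command with such a read group. The FASTQ file names, thread count, and platform tag are assumptions; the actual choices live in scripts/align_11871.sh.

```python
def read_group(sample, platform="ILLUMINA"):
    """Return a read-group string for bwa mem -R.

    bwa expects the two-character sequence backslash + t between fields,
    which it converts to real tabs in the SAM header.
    """
    return "@RG\\tID:{0}\\tSM:{0}\\tPL:{1}".format(sample, platform)


def bwa_mem_command(sample, reference, fastq_1, fastq_2, threads=8):
    """Build a paired-end bwa mem command that tags reads with the sample name."""
    return [
        "bwa", "mem",
        "-t", str(threads),
        "-R", read_group(sample),
        reference, fastq_1, fastq_2,
    ]


# Hypothetical FASTQ paths, for illustration only.
cmd = bwa_mem_command(
    "11871N",
    "reference/fasta/Homo_sapiens_assembly38.fasta",
    "fastq/11871N_1.fastq",
    "fastq/11871N_2.fastq",
)
print(" ".join(cmd))
```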
GATK Mutect2 was used for matched tumor-normal somatic variant calling.
The workflow completed successfully and produced an unfiltered somatic VCF, which was then filtered using FilterMutectCalls.
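FilterMutectCalls does not delete records; it writes a verdict into the FILTER column, and only records marked `PASS` are carried forward. The pipeline does this with bcftools, but the logic can be sketched in plain Python. The records below are toy examples, not variants from this sample.

```python
def pass_only(vcf_lines):
    """Yield header lines plus records whose FILTER column is exactly PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line              # keep meta-information and column header
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th VCF column (index 6).
        if len(fields) > 6 and fields[6] == "PASS":
            yield line


# Toy VCF content (illustrative coordinates and scores).
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t.\tPASS\tTLOD=12.5",
    "chr1\t2000\t.\tC\tT\t.\tweak_evidence\tTLOD=3.1",
    "chr2\t3000\t.\tG\tA\t.\tgermline;normal_artifact\tTLOD=8.0",
]

kept = list(pass_only(toy_vcf))
print(len([l for l in kept if not l.startswith("#")]), "PASS record(s) kept")
```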
Final PASS calls:
- 9,823 PASS variants
- 9,357 SNPs
- 144 MNPs
- 322 indels
PASS-level summary is stored in:
results/11871_somatic_PASS_stats.txt
results/11871_somatic_PASS_variants.tsv
Compressed VCF/index files are excluded from GitHub.
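As a quick consistency check, the three variant classes should add up to the PASS total (9,357 + 144 + 322 = 9,823). The sketch below parses the summary-number (`SN`) lines that `bcftools stats` writes. The excerpt is reconstructed from the counts reported above and is an assumption about what results/11871_somatic_PASS_stats.txt contains, not a copy of it.

```python
# Reconstructed excerpt in bcftools stats "SN" layout (tab-separated).
STATS_EXCERPT = "\n".join([
    "SN\t0\tnumber of records:\t9823",
    "SN\t0\tnumber of SNPs:\t9357",
    "SN\t0\tnumber of MNPs:\t144",
    "SN\t0\tnumber of indels:\t322",
])


def summary_numbers(text):
    """Parse SN lines into a {label: count} dictionary."""
    numbers = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 4 and parts[0] == "SN":
            numbers[parts[2].rstrip(":")] = int(parts[3])
    return numbers


stats = summary_numbers(STATS_EXCERPT)
by_type = (
    stats["number of SNPs"]
    + stats["number of MNPs"]
    + stats["number of indels"]
)
print(by_type == stats["number of records"])
```

In general the per-type counts need not equal the record count (for example, at multiallelic sites), so an exact match is a useful sanity check rather than a guarantee.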
SnpEff was used to annotate predicted functional consequences.
The annotated variant table is stored in:
results/11871_somatic_PASS_snpeff_table.tsv
The SnpEff summary files are stored in:
results/snpEff_summary.html
results/snpEff_genes.txt
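SnpEff writes its predictions into the `ANN` INFO field as comma-separated annotations, each a `|`-delimited record. The sketch below pulls the first annotation out of an INFO string. It is not the code in scripts/extract_snpeff_table.py, the short key names are my own, and the gene and transcript identifiers in the example are placeholders.

```python
# Short names for the first 11 ANN subfields, in SnpEff's documented order.
ANN_KEYS = (
    "allele", "effect", "impact", "gene", "gene_id", "feature_type",
    "feature_id", "biotype", "rank", "hgvs_c", "hgvs_p",
)


def first_annotation(info):
    """Return the first ANN entry of a VCF INFO string as a dict, or None."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            first = entry[len("ANN="):].split(",")[0]
            return dict(zip(ANN_KEYS, first.split("|")))
    return None


# Illustrative INFO string; GENE_ID and TRANSCRIPT_ID are placeholders.
example_info = (
    "ANN=G|missense_variant|MODERATE|PIK3CA|GENE_ID|transcript|TRANSCRIPT_ID|"
    "protein_coding||c.3140A>G|p.His1047Arg;TLOD=301.38"
)

ann = first_annotation(example_info)
print(ann["gene"], ann["effect"], ann["impact"], ann["hgvs_p"])
```

Taking only the first annotation is a simplification: SnpEff reports one entry per affected transcript, ordered with the most severe effects first.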
Variants were grouped by predicted impact.
| category | count |
|---|---|
| HIGH | 313 |
| MODERATE | 2468 |
| Cancer_gene | 22 |
Interpretation:
- HIGH-impact variants include events such as stop-gain, frameshift, splice-disrupting, or other potentially severe changes.
- MODERATE-impact variants include mostly missense or protein-altering changes.
- Cancer-gene variants were prioritized using a curated breast/cancer-relevant gene list.
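The grouping above can be sketched as a simple tally. Here `CANCER_GENES` is a hypothetical four-gene stand-in for the curated list, `GENE_X` is a made-up non-cancer gene, and counting a cancer-gene variant regardless of its impact is an assumption about how the 22 was derived.

```python
from collections import Counter

# Hypothetical stand-in for the curated cancer-gene list.
CANCER_GENES = {"PIK3CA", "CDH1", "ATM", "NF1"}


def priority_counts(rows):
    """Tally HIGH, MODERATE, and cancer-gene variants.

    The cancer-gene category overlaps the impact categories, so the
    three counts are not expected to sum to the number of rows.
    """
    counts = Counter()
    for row in rows:
        if row["impact"] in ("HIGH", "MODERATE"):
            counts[row["impact"]] += 1
        if row["gene"] in CANCER_GENES:
            counts["Cancer_gene"] += 1
    return counts


rows = [
    {"gene": "PIK3CA", "impact": "MODERATE"},
    {"gene": "CDH1", "impact": "HIGH"},
    {"gene": "CDH1", "impact": "LOW"},
    {"gene": "GENE_X", "impact": "MODERATE"},  # made-up non-cancer gene
]

counts = priority_counts(rows)
print(dict(counts))
```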
Candidate driver mutations were prioritized using:
- PASS status
- SnpEff impact
- predicted effect
- known cancer-gene relevance
- TLOD score
- protein-level consequence
Driver candidate summary:
| gene | effect | impact | hgvs_c | hgvs_p | tlod |
|---|---|---|---|---|---|
| PIK3CA | missense_variant | MODERATE | c.3140A>G | p.His1047Arg | 301.38 |
| CDH1 | stop_gained | HIGH | c.1003C>T | p.Arg335* | 228.38 |
| ATM | missense_variant | MODERATE | c.6154G>A | p.Glu2052Lys | 9.31 |
| PIK3CA | missense_variant&splice_region_variant | MODERATE | c.2495G>A | p.Arg832Gln | 7.87 |
| CDH1 | splice_region_variant&intron_variant | LOW | c.1566-7C>T | NA | 7.43 |
| NF1 | missense_variant&splice_region_variant | MODERATE | c.5609G>A | p.Arg1870Gln | 7.41 |
| ATM | missense_variant | MODERATE | c.4397G>A | p.Arg1466Gln | 7.18 |
| CDH1 | missense_variant | MODERATE | c.2254G>A | p.Val752Ile | 6.68 |
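The table is ordered by TLOD, and that ordering can be reproduced with a short sketch. Ranking on TLOD alone is an assumption made for illustration; scripts/prioritize_snpeff_variants.py may weigh the other criteria differently. The rows are copied from the table, with the intronic CDH1 variant identified by its coding-level change.

```python
# (gene, change, impact, tlod) copied from the driver candidate table.
CANDIDATES = [
    ("PIK3CA", "p.His1047Arg", "MODERATE", 301.38),
    ("CDH1", "p.Arg335*", "HIGH", 228.38),
    ("ATM", "p.Glu2052Lys", "MODERATE", 9.31),
    ("PIK3CA", "p.Arg832Gln", "MODERATE", 7.87),
    ("CDH1", "c.1566-7C>T", "LOW", 7.43),
    ("NF1", "p.Arg1870Gln", "MODERATE", 7.41),
    ("ATM", "p.Arg1466Gln", "MODERATE", 7.18),
    ("CDH1", "p.Val752Ile", "MODERATE", 6.68),
]


def rank_by_tlod(rows):
    """Sort candidates from strongest to weakest tumor evidence."""
    return sorted(rows, key=lambda row: row[3], reverse=True)


def best_per_gene(rows):
    """Keep the highest-TLOD variant for each gene."""
    best = {}
    for gene, change, _impact, tlod in rank_by_tlod(rows):
        best.setdefault(gene, (change, tlod))  # first seen is the highest
    return best


for gene, (change, tlod) in best_per_gene(CANDIDATES).items():
    print(gene, change, tlod)
```

The gap between the top two calls (TLOD above 200) and the rest (TLOD below 10) is worth keeping in view when reading the interpretation of the lower-ranked variants.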
The strongest prioritized candidate was:
PIK3CA p.His1047Arg
TLOD = 301.38
Effect = missense_variant
Impact = MODERATE
Interpretation:
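PIK3CA is a key oncogene in breast cancer. The p.His1047Arg mutation is a well-known activating hotspot in the PI3K pathway. TLOD is Mutect2's log-odds score that the variant is truly present in the tumor, so the very high value here makes this the best-supported candidate in the analysis; the driver interpretation rests on the hotspot's known biology rather than on the score itself.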
A high-impact candidate was:
CDH1 p.Arg335*
TLOD = 228.38
Effect = stop_gained
Impact = HIGH
Interpretation:
CDH1 encodes E-cadherin, a key cell-cell adhesion protein. A stop-gain mutation may disrupt epithelial adhesion and tumor suppressor function. This is one of the most biologically important variants in the result set because it is both high-impact and high-confidence.
ATM variants included:
ATM p.Glu2052Lys
ATM p.Arg1466Gln
Interpretation:
ATM is involved in DNA damage response. ATM alterations can be relevant to genome instability and DNA repair defects in tumors.
NF1 and additional cancer-associated genes were also detected among prioritized cancer-gene variants. These variants should be interpreted cautiously because not every cancer-gene mutation is necessarily a driver; however, they provide useful candidates for follow-up.
The folder main_figures/ contains the main visual summary of the variant calling project.
| Figure | Meaning |
|---|---|
| 11871_driver_candidate_table.png | Table-style summary of prioritized driver candidates |
| 11871_driver_mutations.png | TLOD-based bar plot of candidate driver mutations |
| 11871_driver_lollipop_like_plot.png | Lollipop-style view of candidate driver mutations |
| 11871_cancer_gene_effect_matrix.png | Gene-by-effect matrix showing cancer-gene mutation consequences |
| 11871_snpeff_impact_counts.png | Counts of HIGH, MODERATE, LOW, and MODIFIER SnpEff impacts |
| 11871_top_variant_effects.png | Most common variant consequence types |
| 11871_top_mutated_genes.png | Genes with highest number of annotated variants |
| 11871_variant_summary.png | Summary of prioritized variant categories |
The folder supporting_figures/ contains additional copies and supplementary visualizations.
SomaticVariantCalling_BreastCancer
│
├── README.md
│ Main project explanation, workflow, results, and biological interpretation.
│
├── docs/
│ Workflow notes and project documentation.
│
├── scripts/
│ Shell and Python scripts used for alignment, variant calling,
│ annotation, prioritization, and plotting.
│
├── data/
│ Small sample metadata files only.
│ Raw SRA/FASTQ files are excluded.
│
├── qc/
│ Sequencing quality-control reports.
│
│ ├── fastqc/
│ │ Per-FASTQ FastQC HTML reports.
│ │
│ └── multiqc/
│ Aggregated MultiQC report and small MultiQC data tables.
│
├── metrics/
│ Alignment and duplicate metrics for tumor and normal BAMs.
│
├── main_figures/
│ Main figures showing driver candidates, variant consequences,
│ impact categories, and cancer-gene effects.
│
├── supporting_figures/
│ Extra copies and supporting visualizations.
│
└── results/
Small result tables from variant filtering, annotation,
prioritization, and visualization.
| File | Description |
|---|---|
| results/11871_somatic_PASS_variants.tsv | PASS somatic variant table |
| results/11871_somatic_PASS_stats.txt | bcftools PASS VCF summary |
| results/11871_somatic_PASS_snpeff_table.tsv | Parsed SnpEff annotation table |
| results/11871_HIGH_impact_variants.tsv | HIGH-impact variants |
| results/11871_MODERATE_variants.tsv | MODERATE-impact variants |
| results/11871_cancer_gene_variants.tsv | Variants overlapping curated cancer genes |
| results/11871_driver_candidate_variants.tsv | Final prioritized driver candidates |
| results/11871_driver_candidate_summary_table.tsv | Compact driver candidate summary |
| results/11871_variant_priority_summary.tsv | Counts of HIGH, MODERATE, and cancer-gene variants |
| results/11871_cancer_gene_variant_interpretation_table.tsv | Cancer-gene interpretation table |
| results/snpEff_summary.html | SnpEff annotation report |
| metrics/11871T.metrics.txt | Tumor BAM alignment/duplicate metrics |
| metrics/11871N.metrics.txt | Normal BAM alignment/duplicate metrics |
Activate the environment:
conda activate scrna
cd ~/Projects/Cancer_pipeline/Variant_Calling
Generate FASTQ:
bash scripts/run_fasterq_11871.sh
Align and process BAMs:
bash scripts/align_11871.sh
Run Mutect2 and FilterMutectCalls:
gatk Mutect2 \
-R reference/fasta/Homo_sapiens_assembly38.fasta \
-I bam/11871T.markdup.bam \
-I bam/11871N.markdup.bam \
-normal 11871N \
-O vcf/11871_somatic_unfiltered.vcf.gz
gatk FilterMutectCalls \
-R reference/fasta/Homo_sapiens_assembly38.fasta \
-V vcf/11871_somatic_unfiltered.vcf.gz \
-O vcf/11871_somatic_filtered.vcf.gz
Extract PASS variants:
bcftools view -f PASS \
-Oz -o vcf/11871_somatic_PASS.vcf.gz \
vcf/11871_somatic_filtered.vcf.gz
Annotate with SnpEff:
snpEff GRCh38.99 \
vcf/11871_somatic_PASS.vcf.gz \
> results/11871_somatic_PASS_snpeff.vcf
Extract tables and prioritize variants:
python scripts/extract_snpeff_table.py
python scripts/prioritize_snpeff_variants.py
python scripts/plot_full_variant_summary.py
python scripts/plot_publication_variant_figures.py
The following file types are excluded from GitHub:
- .sra
- .fastq
- .fq.gz
- .bam
- .bai
- .sam
- .vcf.gz
- .tbi
- large reference genome files
- cache folders
The repository includes scripts and small outputs needed to understand and reproduce the workflow, but raw data and large alignment files must be regenerated or downloaded separately.
This project demonstrates a complete tumor-normal somatic variant calling pipeline for breast cancer sequencing data. The workflow successfully identified high-confidence somatic variants, annotated their functional effects, prioritized cancer-gene alterations, and produced interpretable visual summaries.
The most important findings were:
- PIK3CA p.His1047Arg, a strong candidate activating oncogenic mutation with the highest TLOD.
- CDH1 p.Arg335*, a high-impact stop-gain mutation affecting an important epithelial adhesion/tumor suppressor gene.
- ATM missense mutations, potentially relevant to DNA damage response.
- 313 HIGH-impact variants, 2,468 MODERATE-impact variants, and 22 cancer-gene variants.
Overall, this repository shows practical experience with real NGS cancer genomics, including QC, alignment, BAM processing, GATK Mutect2, variant filtering, SnpEff annotation, cancer-gene prioritization, and publication-style reporting.
- NGS sequencing workflow development
- Tumor-normal somatic variant calling
- BWA-MEM alignment
- Samtools and Picard/GATK BAM processing
- GATK Mutect2 and FilterMutectCalls
- bcftools VCF processing
- SnpEff annotation
- Cancer-gene prioritization
- Python-based result parsing and visualization
- GitHub-ready cancer genomics project organization