This folder collects the local scripts and notebooks used to go from raw paired-end FASTQs to somatic ClinCNV calls in the WES BedCoverage workflow.
The goal is transparency rather than abstraction: these are copies of the working scripts we used, grouped in pipeline order, plus a stitched notebook that explains how the outputs of one step become the inputs of the next.
-
01_fastq_to_bam/GATK-style preprocessing from paired FASTQs to deduplicated BAMs and BAIs. -
02_somatic_vcf/Tumor-normal somatic calling withMutect2, followed byLearnReadOrientationModel,CalculateContamination, andFilterMutectCalls. -
03_baf_and_coverage/Creation of the cleaned per-sample BAF folder from pair VCFs, generation of the WES off-target BED, and creation of the BedCoverage matrices used by ClinCNV. -
04_clincnv_runs/Final ClinCNV run scripts for the WES BedCoverage analysis: one with--bafFolderand one without.
Starting inputs:
- paired FASTQ files per sample, stored in S3
- GRCh38 reference FASTA, FAI, and dictionary
- capture BED used for the panel/WES setup
Computed outputs:
- one deduplicated BAM per sample
- one BAM index per sample
- per-sample logs and metrics
How the inputs are assembled:
- wrapper scripts read a CSV with sample IDs and S3 locations
- FASTQs are discovered under the sample S3 prefix, or read from an explicit FASTQ list column
- if multiple FASTQ members exist for a mate, they are concatenated into one merged
R1and one mergedR2 - a per-sample
config.yamlis written and passed into the Snakemake workflow
How the BAMs are computed:
FastqToSamcreates an unmapped BAMRevertSamnormalizes the unmapped BAMMarkIlluminaAdaptersmarks adaptersSamToFastq | bwa mem | MergeBamAlignmentperforms alignment and merges aligned reads back with the unmapped BAM metadataMarkDuplicatescreates the final deduplicated BAM and BAI
Starting inputs:
- tumor and normal deduplicated BAMs and BAIs
- GRCh38 reference FASTA, FAI, and dictionary
- target BED
- panel of normals VCF and index
- pair list describing tumor-normal matches
Computed outputs:
- unfiltered paired somatic VCFs
- Mutect2
.statsfiles - F1R2 tarballs for orientation-bias modeling
- filtered VCFs after post-processing
- contamination tables and segmentation outputs
How the inputs are assembled:
vcf_gen_v3.shreads apairs.txtfile containingtumor,normal- for each pair, the script downloads the tumor and normal BAM/BAI files from the BAM S3 prefix
filter_vcfs.shthen scans the VCF output prefix for pair folders and downloads those local results again for post-processing
How the VCFs are computed:
Mutect2is run per tumor-normal pair with the panel BED and a panel of normalsLearnReadOrientationModelis run from the Mutect2 F1R2 tarballGetPileupSummariesis run for tumor and normal against a common-SNP resourceCalculateContaminationestimates contamination and segmentationFilterMutectCallsuses the orientation model and contamination estimates to emit the filtered somatic VCF
Starting inputs:
- one pair VCF per tumor-normal pair
- pair list CSV for all included samples
Computed outputs:
- one TSV per sample in the BAF folder
- a skipped-pairs log
- a README describing the generated folder
How the inputs are assembled:
baf_from_pair_vcfs.shreadspairs_df_filtered.csv- for each pair it expects a file named
unfiltered.<tumor>__<normal>.vcf.gz - by default it reads from
/mnt/myvolume/panel_seq/new_bed_analysis/vcfs
How the BAF values are computed:
- only biallelic SNVs are retained
- each sample record must have non-missing
GT,AD, andDP - a minimum depth threshold of
DP >= 10is applied - BAF is computed as
alt_count / DP - output columns are
chr start end chr_pos baf depth - if the same sample appears in more than one pair, duplicate positions are collapsed by keeping the row with the highest depth
Starting inputs:
- sample BAMs from the BAM S3 prefix
- target BED
- reference FASTA and FAI
- WES off-target BED definition
Computed outputs:
- on-target coverage matrix for tumors
- on-target coverage matrix for normals
- off-target coverage matrix for tumors
- off-target coverage matrix for normals
- cached per-sample BedCoverage outputs
- logs for matrix construction
How the inputs are assembled:
gen_wes_offtarget_bed_100kb.shderives the WES off-target BED from the genome minus merged target regions- the script bins off-target space into 100 kb windows and removes windows smaller than 50 kb
- GC is then computed per off-target window using
bedtools nuc build_wes_bedcoverage_matrices.shdownloads each sample BAM from S3 and runsBedCoverage
How the coverage matrices are computed:
- on-target bins use
BedCoveragewithmin_mapq 0 - off-target bins use
BedCoveragewithmin_mapq 10 - per-sample outputs are cached and later merged into matrix-style
.covfiles - tumor and normal sample order is reconstructed from header source coverage files
Starting inputs:
- tumor and normal on-target coverage matrices
- tumor and normal off-target coverage matrices
- on-target BED with GC column
- off-target BED with GC column
- tumor-normal pair CSV
- optionally, the per-sample BAF folder
Computed outputs:
- ClinCNV cohort reanalysis output
- final pair-level somatic ClinCNV calls
- ClinCNV logs and summary tables in the run directory
How the inputs are assembled:
- both run scripts point to the same BedCoverage matrices and BED files
- both use the same pair list and the same ClinCNV R script
- the only branch point is whether
--bafFolderis supplied
How the final calls are computed:
- ClinCNV is run in WES mode with off-target support
- key parameters are:
--colNum 4--lengthS 9--filterStep 2--scoreS 200--reanalyseCohort
snakemake_wf.py
- Main GATK-style preprocessing workflow from paired FASTQs to final deduplicated BAM/BAI outputs.
- Expects per-sample FASTQs named
<sample>_R1.fastq.gzand<sample>_R2.fastq.gzin the sample input folder. - Reads tool and reference locations from
config.yaml. - Produces rule-specific logs and metrics for every sample.
config.yaml
- Template configuration for the Snakemake workflow.
- Defines input and output roots, sample IDs, GRCh38 reference files, target BED, padding, and tool names.
- In the wrappers, a per-sample config is generated dynamically from this structure.
run_all.py
- General wrapper for samples with one or more FASTQ members per mate under an S3 prefix.
- Reads a CSV with
sampleands3_dir. - Discovers all FASTQs under the prefix, merges all R1 files together and all R2 files together, writes a per-sample config, runs Snakemake, uploads final BAM/BAI files to S3, uploads non-FASTQ run artifacts, and deletes the local work directory.
double_csv.py
- Variant wrapper for samples whose FASTQs may be defined more explicitly in a CSV.
- Can either read an explicit FASTQ list column or discover FASTQs under an S3 prefix.
- Handles cases where several R1 or R2 files should be concatenated before running the Snakemake workflow.
double_fastq.py
- More restrictive wrapper for cases where the CSV explicitly provides exactly one R1 and one R2 URI per sample.
- Resolves FASTQ paths, downloads the two mates, writes a per-sample config, runs Snakemake, and preserves logs in a dedicated local logs directory.
- Useful when the FASTQ layout is irregular and direct discovery under a prefix is not reliable.
vcf_gen_v3.sh
- Paired
Mutect2caller. - Reads
pairs.txtto define tumor-normal matches. - Downloads the deduplicated BAMs and BAIs for each pair from the BAM S3 prefix.
- Runs
Mutect2with the target BED, panel of normals, and GRCh38 reference. - Emits unfiltered VCFs,
.stats, F1R2 tarballs, and logs, then uploads them back to the VCF S3 prefix.
filter_vcfs.sh
- Post-processing script for the paired Mutect2 outputs.
- Scans the VCF S3 prefix for pair folders, downloads the unfiltered VCFs and F1R2 tarballs, fetches the corresponding BAMs, and runs the downstream GATK filtering steps.
- Produces artifact priors, pileup summaries, contamination tables, contamination segments, and filtered VCFs.
baf_from_pair_vcfs.sh
- Converts pair VCFs into a ClinCNV-compatible BAF folder.
- Reads each pair VCF, extracts per-sample SNV allele fractions from
ADandDP, and writes one TSV per sample. - Also writes a skipped-pairs report and a README describing the generated folder.
gen_wes_offtarget_bed_100kb.sh
- Builds the WES off-target BED used by ClinCNV.
- Starts from the genome, subtracts the merged target intervals, windows the remaining space into 100 kb bins, removes bins smaller than 50 kb, and computes GC content for each bin.
- Produces the GC-annotated off-target BED consumed by the BedCoverage and ClinCNV steps.
build_wes_bedcoverage_matrices.sh
- Generates WES coverage matrices from BAMs using
BedCoverage. - Creates sorted versions of the target and off-target BEDs, downloads each sample BAM, computes on-target and off-target depth profiles, caches the per-sample outputs, and merges them into final tumor and normal matrix files.
- Uses MAPQ 0 on-target and MAPQ 10 off-target.
bedcoverage_wes_full_transparency.ipynb
- Existing explanatory notebook focused on the WES BedCoverage and ClinCNV portion of the workflow.
- Documents how the cleaned BAF folder, off-target BED, coverage matrices, and final ClinCNV runs were created.
run_clincnv_baf_bedcoverage_wes.sh
- Final ClinCNV command for the WES BedCoverage run with BAF support.
- Uses the on-target and off-target BedCoverage matrices together with the cleaned BAF folder.
- Writes outputs to
/mnt/myvolume/panel_seq/new_bed_analysis/new_clincnv_runs/full_wes_s200_l9_f2_baf_bedcoverage_wes.
run_clincnv_no_baf_bedcoverage_wes.sh
- Final ClinCNV command for the matching WES BedCoverage run without BAF support.
- Uses the same matrices, BED files, pair list, and core ClinCNV parameters, but omits
--bafFolder. - Writes outputs to
/mnt/myvolume/panel_seq/new_bed_analysis/new_clincnv_runs/full_wes_s200_l9_f2_no_baf_bedcoverage_wes.
MANIFEST.tsv
- Maps each bundled file to its original source path and gives a one-line purpose.
fastq_to_clincnv_end_to_end_transparency.ipynb
- Narrative notebook that ties the copied scripts together and spells out the full flow from FASTQs to ClinCNV.
The two final ClinCNV runs are identical except for BAF usage:
04_clincnv_runs/run_clincnv_baf_bedcoverage_wes.shIncludes--bafFolder /mnt/myvolume/panel_seq/new_bed_analysis/baf_from_pair_vcfs_clean04_clincnv_runs/run_clincnv_no_baf_bedcoverage_wes.shOmits--bafFolder
- These scripts point at the real local paths, references, S3 locations, and working directories that were used here.
- Some stages depend on external infrastructure being available, especially:
- AWS/S3 access
- local references under
/home/ubuntu/reference - Conda environment
mamba-env BedCoverage,bedtools,samtools,bwa, and GATK in the expected locations
- The notebook is intentionally explanatory and reproducibility-oriented; it is not meant to rerun the whole pipeline automatically inside a single notebook session.