This repository provides a Snakemake workflow that converts AnchorWave MAFs into per-contig merged gVCFs, splits them into clean/filtered/invariant sets, and produces mask bedfiles for downstream ARG inference (see `logic.md` for details).
- Conda (you may need to `module load conda` on your cluster)
- TASSEL (provided via the `tassel-5-standalone` submodule)
- GATK, Picard, htslib (installed in the conda env defined below)
Clone with submodules so TASSEL is available:

```bash
git clone --recurse-submodules <repo>
```

Create and activate the environment (do this before running Snakemake):

```bash
module load conda
conda env create -f argprep.yml
conda activate argprep
```

Edit `config.yaml` to point at your MAF directory and reference FASTA. At minimum you must set:
- `maf_dir`: directory containing `*.maf` or `*.maf.gz`
- `reference_fasta`: reference FASTA path (plain `.fa`/`.fasta` or bgzipped `.fa.gz`)
If your reference FASTA does not have an index (`.fai`), either create one (`samtools faidx`) or set `contigs:` explicitly in `config.yaml`, as in the sketch below.
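For orientation, here is a minimal `config.yaml` sketch. The keys come from the list above; the paths and contig names are illustrative placeholders, not defaults:

```yaml
# Minimal config.yaml sketch -- paths and contig names are illustrative.
maf_dir: /path/to/mafs            # directory containing *.maf or *.maf.gz
reference_fasta: /path/to/ref.fa  # plain .fa/.fasta or bgzipped .fa.gz
# Optional: only needed if the reference FASTA has no .fai index.
contigs:
  - chr1
  - chr2
```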
The pipeline can be run in one of two ways, both from the repo root. It is recommended that you first run it on the example data provided to ensure the pipeline works on your system.
A default SLURM profile is provided under `profiles/slurm/`. Edit `profiles/slurm/config.yaml` to customize sbatch options if needed.
Defaults for account/partition and baseline resources are set in `config.yaml` (`slurm_account`, `slurm_partition`, `default_*`).
SLURM stdout/stderr logs are written to `logs/slurm/` by default.
On SLURM:

```bash
snakemake --profile profiles/slurm
```

Locally:

```bash
snakemake -j 8
```

Common options:

- `-j <N>`: number of parallel jobs
- `--rerun-incomplete`: clean up partial outputs
- `--printshellcmds`: show executed commands
- `--notemp`: keep temporary intermediates (see Notes)
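For example, the options above can be combined; a hypothetical invocation that resumes a SLURM run while echoing each command:

```bash
# Resume a SLURM run, cleaning up partial outputs and printing shell commands.
snakemake --profile profiles/slurm --rerun-incomplete --printshellcmds
```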
By default the workflow uses these locations (override in `config.yaml`):
- `gvcf/`: TASSEL gVCFs (`*.gvcf.gz`) from MAFs
- `gvcf/cleangVCF/`: cleaned gVCFs from `scripts/dropSV.py`
- `gvcf/cleangVCF/dropped_indels.bed`: bedfile of large indels
- `gvcf/cleangVCF/split_gvcf/`: per-contig gVCFs for merging
- `results/combined/combined.<contig>.gvcf.gz`: merged gVCF per contig
- `results/split/combined.<contig>.inv`: invariant sites
- `results/split/combined.<contig>.filtered`: filtered sites
- `results/split/combined.<contig>.clean`: clean sites
- `results/split/combined.<contig>.missing.bed`: missing positions
- `results/split/combined.<contig>.filtered.bed`: merged mask bed
- `results/split/combined.<contig>.coverage.txt`: split coverage validation summary
- `results/split/combined.<contig>.accessible.npz`: boolean accessibility array (union of clean + invariant sites), for scikit-allel statistics
- `results/summary.html`: HTML summary of jobs run, outputs created, and warnings
- If `bgzip_output: true`, the `.inv`, `.filtered`, `.clean`, and `.missing.bed` files will have a `.gz` suffix.
- All gzipped outputs in this pipeline use bgzip (required for `tabix`).
- `scripts/dropSV.py` removes indels larger than `drop_cutoff` (if set in `config.yaml`).
- `scripts/split.py` supports `--filter-multiallelic` and `--bgzip-output` (toggle via `config.yaml`).
- `scripts/filt_to_bed.py` merges `<prefix>.filtered`, `<prefix>.missing.bed`, and `dropped_indels.bed` into a final mask bed.
- `make_accessibility` builds a per-contig accessibility array from the union of `combined.<contig>.clean` and `combined.<contig>.inv`, using the reference `.fai` to size the array. The output is a compressed NumPy archive containing a boolean array named `mask`, intended for scikit-allel statistics.
- Ploidy is inferred from MAF block structure by default (max non-reference `s` lines per block, typically `1` for pairwise MAFs). You can override with `ploidy` in `config.yaml`.
- Optional: enable `vt_normalize: true` in `config.yaml` to normalize merged gVCFs with `vt normalize` after `SelectVariants`.
- If GenomicsDBImport fails with a buffer-size error, increase `genomicsdb_vcf_buffer_size` and `genomicsdb_segment_size` in `config.yaml` (set them above your longest gVCF line length).
- Large intermediate files are marked as temporary and removed after a successful run (per-sample gVCFs, cleaned gVCFs, per-contig split gVCFs, and the GenomicsDB workspace). Use `snakemake --notemp` if you want to preserve them for debugging or reruns.
- Resource knobs (memory/threads/time) and GenomicsDB buffer sizes are configurable in `config.yaml` (e.g., `merge_contig_mem_mb`, `maf_to_gvcf_*`, `genomicsdb_*`); see the sketch after this list.
- To cap concurrent contig-merge jobs on SLURM, set `merge_gvcf_max_jobs` in `config.yaml` (used by the profile as a global `merge_gvcf_jobs` resource limit).
- To cap the SLURM array concurrency for `scripts/maf_to_gvcf.sh`, set `maf_to_gvcf_array_max_jobs` in `config.yaml` (default 4).
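A sketch of the optional tuning keys mentioned in the notes above. Every key name appears in this README; the values shown are illustrative only, not recommendations:

```yaml
# Optional tuning knobs in config.yaml -- values below are illustrative only.
drop_cutoff: 50                    # scripts/dropSV.py drops indels larger than this
bgzip_output: true                 # bgzip the .inv/.filtered/.clean/.missing.bed outputs
ploidy: 1                          # override ploidy inferred from MAF block structure
vt_normalize: true                 # run vt normalize on merged gVCFs after SelectVariants
genomicsdb_vcf_buffer_size: 65536  # raise above your longest gVCF line length
genomicsdb_segment_size: 65536
merge_gvcf_max_jobs: 4             # cap concurrent contig-merge jobs on SLURM
maf_to_gvcf_array_max_jobs: 4      # cap SLURM array concurrency (default 4)
```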
- Added HTML summary report with embedded SVG histograms and expanded output details.
- Split logic tightened: clean sites now require all samples called; missing GTs are routed to filtered.
- Invariant/filtered/clean outputs are enforced as mutually exclusive per position; filtered BED spans now respect END/REF lengths and subtract inv/clean.
- Merged gVCFs are produced via GATK SelectVariants with genotype calling; TASSEL `outputJustGT` default set to `false` to retain likelihoods for calling.
- Added accessibility mask generation (`combined.<contig>.accessible.npz`) for scikit-allel workflows.
- New/expanded validation and tests: split coverage checks, filtered-bed tests, integration tests gated by `RUN_INTEGRATION=1`.
- Example data regenerated via msprime with indels, missing data, and AnchorWave-style MAF formatting.
- `check_split_coverage.py` now reports overlap intervals with file names to aid debugging.
- `filt_to_bed.py` filters masks to the target contig, preventing cross-contig lines in `combined.<contig>.filtered.bed`.
- SLURM default resources now read `default_*` from `config.yaml` instead of hardcoded profile values.
- Moved HTML summary generation into `scripts/summary_report.py` and simplified `Snakefile`.
- Corrected example MAF inputs so `example_data/*.maf.gz` are valid gzip files.
- Updated split classification so `ALT=<NON_REF>`-only records are treated as invariant.
- Updated SINGER clean-output formatting to strip `<NON_REF>` while preserving genotype/sample fields.
- Added `logic.md` with detailed site-routing/filtering logic and concrete examples.
- Added `results/split/combined.<contig>.coverage.txt` to documented workflow outputs.
- Added split-test coverage for invariant/nonref and genotype-preserving clean formatting.
- Updated SLURM profile default resources to numeric values to avoid resource conversion/submission errors.
- Added `merge_gvcf_max_jobs` pipeline concurrency control.
Use Nate Pope's SINGER Snakemake pipeline with `combined.<contig>.clean` and `combined.<contig>.filtered.bed` as inputs.
If you use scikit-allel, you can use the `combined.<contig>.clean` VCF and load the accessibility mask like this:
```python
import numpy as np

mask = np.load("results/split/combined.1.accessible.npz")["mask"]
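```

As a further sketch (not part of the pipeline), the mask can be passed to scikit-allel's summary statistics via `is_accessible`. This assumes the clean output for contig `1` is a plain-text VCF at the default path; adjust the filename if `bgzip_output: true`:

```python
import allel
import numpy as np

# Load the per-contig accessibility mask (one boolean per reference bp).
mask = np.load("results/split/combined.1.accessible.npz")["mask"]

# Read the clean per-contig VCF; read_vcf returns POS and GT by default.
callset = allel.read_vcf("results/split/combined.1.clean")
pos = callset["variants/POS"]
gt = allel.GenotypeArray(callset["calldata/GT"])

# Nucleotide diversity computed over accessible sites only.
pi = allel.sequence_diversity(pos, gt.count_alleles(), is_accessible=mask)
print(f"pi = {pi:.6g}")
```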