Note: This repository is under active development.
GitHub repository containing the code to apply stochastic character mapping (SCM) to somatic single-nucleotide variants (SNVs). The Snakemake workflow here is designed for data stratified by single cell, comparable to those generated from, e.g., isogenic hematopoietic clones or single-cell whole-genome sequencing. This approach is part of a larger study investigating the frequency, predictors, and consequences of somatic mutations that evolve in violation of the infinite sites model of evolution.
An early implementation of SCM for modeling the evolutionary history of an arbitrary somatic variant is presented in the shortTL_hematopoiesis GitHub repository, available here.
The implementation of SCM in this workflow leverages the phytools comparative phylogenetics toolkit. SCM summaries are stored as HDF5 files, which can accommodate the variation in data size and structure across SCM summary statistics.
The data prerequisites for the Snakemake workflow are specified in a config.yaml (a template can be found here).
In brief, the workflow requires the following data per donor:
- A multi-sample VCF file containing biallelic somatic SNVs called across all samples (e.g., single cells) per donor. A Phred-scaled genotype likelihood (`PL`) element is required in the `FORMAT` field, as these values are used to compute scaled genotype probabilities for use as genotype-state priors in SCM. An example of a `FORMAT` field containing `PL` scores is shown below:
FORMAT clone1 clone2 ...
GT:AD:DP:GQ:PL:VAF 0/1:5,3:8:60:60,0,117:0.375 0/0:10,0:12:30:0,30,286:0 ...
GT:AD:DP:GQ:PL:VAF 0/1:4,4:8:50:50,0,100:0.5 0/1:6,6:12:60:60,0,120:0.5 ...
... ... ... ...
- A molecular phylogeny (Newick format) with branch lengths measured in genotype substitutions per site. At present, we recommend `CellPhy` for phylogenetic inference, as it is designed to reconstruct somatic phylogenies from somatic SNVs while accounting for technical artifacts (e.g., allelic dropout) that are common in single-cell data.
(clone2:0.020321,clone1:0.029519)100:0.019519,(((clone3:0.036173,...
- An unphased genotype substitution model that captures both the relative rates and genotype state frequencies of the data. Using `CellPhy`, both the molecular phylogeny and genotype substitution model can be inferred simultaneously from the multi-sample VCF file.
GT10{0.001000/1.979766/2.598794/3.482508/1.000000/2.315514/1.900595}+FU{0.157963/0.287340/0.280840/0.158375/0.008751/0.037105/0.010944/0.010282/0.040406/0.007993}, noname = 1-10000
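
As an illustration of how `PL` values can be turned into genotype probabilities, the sketch below parses one sample entry from the VCF example above and normalizes the implied likelihoods. It assumes the standard Phred relationship (likelihood proportional to `10^(-PL/10)`) and simple normalization to sum to 1; the exact scaling used by this workflow may differ. The function names `parse_pl` and `pl_to_probs` are hypothetical and are not part of this repository.

```python
def parse_pl(format_keys, sample_field):
    """Extract the PL values for one sample from colon-delimited VCF fields.

    Hypothetical helper for illustration; returns None if PL is missing.
    """
    keys = format_keys.split(":")
    values = sample_field.split(":")
    pl_raw = values[keys.index("PL")]
    if pl_raw == ".":
        return None
    return [int(x) for x in pl_raw.split(",")]


def pl_to_probs(pl):
    """Convert Phred-scaled genotype likelihoods to normalized probabilities.

    Assumes likelihood is proportional to 10^(-PL/10), then rescales so the
    values sum to 1 (order for a biallelic site: 0/0, 0/1, 1/1).
    """
    likelihoods = [10 ** (-p / 10) for p in pl]
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]


# First sample (clone1) of the first variant in the example above.
fmt = "GT:AD:DP:GQ:PL:VAF"
sample = "0/1:5,3:8:60:60,0,117:0.375"

pl = parse_pl(fmt, sample)   # PL values for 0/0, 0/1, 1/1
probs = pl_to_probs(pl)      # heterozygous state carries nearly all the mass
print(pl, probs)
```

Under these assumptions, the heterozygous genotype (`PL` of 0) receives almost all of the probability mass, which matches the `0/1` call in the `GT` element.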
The code in this repository is organized into the following directories:
somatic_mutation_scm/
├── LICENSE                 # License for the repository
├── README.md               # This file
├── docs/                   # Documentation and notebooks related to code/theory development and testing
│   ├── figs/               # Figures for the documentation
│   └── notebooks/          # Jupyter notebooks for code/theory development and testing
└── smk/                    # Snakemake workflow for applying stochastic character mapping to somatic SNVs
    ├── config/             # Configuration files for the Snakemake workflow
    ├── example_data/       # Example input files for testing purposes
    ├── workflow/           # Snakemake rules, scripts, and environments for the workflow
    │   ├── bin/            # Code for running Snakefile rules and applying stochastic character mapping to somatic SNVs
    │   ├── envs/           # Conda environments for the Snakemake workflow
    │   └── Snakefile       # Snakemake workflow for applying stochastic character mapping to somatic SNVs
    └── smk7_scm.yaml       # Conda environment for running the Snakemake workflow

