mtassia/somatic_mutation_scm

Stochastic character mapping of somatic mutations

Note: This repository is under active development.

1. About

[Figure: Example SCM summaries]

GitHub repository containing the code to apply stochastic character mapping (SCM) to somatic single-nucleotide variants (SNVs). The Snakemake workflow is designed for single-cell-stratified data, such as those generated from isogenic hematopoietic clones or single-cell whole-genome sequencing. This approach is part of a larger study investigating the frequency, predictors, and consequences of somatic mutations that evolve in violation of the infinite sites model of evolution.

An early implementation of SCM to model the evolutionary history of an arbitrary somatic variant is presented in the shortTL_hematopoiesis GitHub repository, available here.

The implementation of SCM in this workflow uses the phytools comparative phylogenetics toolkit. SCM summaries are stored as HDF5 files, which accommodate the variation in data size and structure across SCM summary statistics.

2. Data prerequisites

The data prerequisites for the Snakemake workflow are specified in a config.yaml (a template can be found here).
In brief, the workflow requires the following data per donor:

  1. A multi-sample VCF file containing biallelic somatic SNVs called across all samples (e.g., single cells) per donor. A Phred-scaled genotype likelihood (PL) element is required in the FORMAT field, as these values are used to compute scaled genotype probabilities that serve as genotype-state priors in SCM. An example of a FORMAT field containing PL scores is shown below:
FORMAT                  clone1                          clone2                          ...
GT:AD:DP:GQ:PL:VAF      0/1:5,3:8:60:60,0,117:0.375     0/0:10,0:12:30:0,30,286:0       ...
GT:AD:DP:GQ:PL:VAF      0/1:4,4:8:50:50,0,100:0.5       0/1:6,6:12:60:60,0,120:0.5      ...
...                     ...                             ...                             ...
  2. A molecular phylogeny (Newick format) with branch lengths measured in genotype substitutions per site. At present, we recommend the use of CellPhy for phylogenetic inference, as it is designed to reconstruct somatic phylogenies from somatic SNVs while accounting for technical artifacts (e.g., allelic dropout) that are common in single-cell data.
(clone2:0.020321,clone1:0.029519)100:0.019519,(((clone3:0.036173,...
  3. An unphased genotype substitution model that captures both the relative rates and genotype state frequencies of the data. Using CellPhy, both the molecular phylogeny and genotype substitution model can be inferred simultaneously from the multi-sample VCF file.
GT10{0.001000/1.979766/2.598794/3.482508/1.000000/2.315514/1.900595}+FU{0.157963/0.287340/0.280840/0.158375/0.008751/0.037105/0.010944/0.010282/0.040406/0.007993}, noname = 1-10000 
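The inputs listed above can be spot-checked before running the workflow. The sketch below is illustrative only and is not taken from the workflow's code: the function names (`sample_pl`, `pl_to_probs`, `parse_model`) are hypothetical, the PL conversion uses the standard Phred relationship (likelihood = 10^(-PL/10), normalized to sum to 1) and may differ from the exact scaling the workflow applies, and the model parser assumes the first brace group holds relative rates and the second holds genotype state frequencies.

```python
import re


def sample_pl(format_field, sample_field):
    """Return the PL string for one sample, given the VCF FORMAT key string."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    return values[keys.index("PL")]


def pl_to_probs(pl_field):
    """Convert a PL string such as "60,0,117" to normalized genotype probabilities.

    For a biallelic site the order is 0/0, 0/1, 1/1 (VCF convention).
    """
    pls = [int(x) for x in pl_field.split(",")]
    # PL is -10 * log10(likelihood), shifted so the best genotype has PL = 0.
    likelihoods = [10 ** (-pl / 10) for pl in pls]
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]


def parse_model(model):
    """Split a "GT10{...}+FU{...}" string into (rates, frequencies).

    Assumption: first brace group = relative rates, second = state frequencies.
    """
    groups = re.findall(r"\{([^}]*)\}", model)
    rates = [float(x) for x in groups[0].split("/")]
    freqs = [float(x) for x in groups[1].split("/")]
    return rates, freqs


if __name__ == "__main__":
    fmt = "GT:AD:DP:GQ:PL:VAF"
    clone1 = "0/1:5,3:8:60:60,0,117:0.375"
    print(pl_to_probs(sample_pl(fmt, clone1)))

    model = (
        "GT10{0.001000/1.979766/2.598794/3.482508/1.000000/2.315514/1.900595}"
        "+FU{0.157963/0.287340/0.280840/0.158375/0.008751/0.037105/0.010944/"
        "0.010282/0.040406/0.007993}"
    )
    rates, freqs = parse_model(model)
    print(len(rates), len(freqs), sum(freqs))
```

For the example model string, the ten frequencies should sum to approximately 1 (up to rounding in the printed values); a sum far from 1 would suggest the model string was truncated or mis-copied.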

3. Repository organization

The code in this repository is organized into the following directories:

somatic_mutation_scm/
├── LICENSE                        # License for the repository
├── README.md                      # This file
├── docs/                          # Documentation and notebooks related to code/theory development and testing
│   ├── figs/                      # Figures for the documentation
│   └── notebooks/                 # Jupyter notebooks for code/theory development and testing
└── smk/                           # Snakemake workflow for applying stochastic character mapping to somatic SNVs
    ├── config/                    # Configuration files for the Snakemake workflow
    ├── example_data/              # Example input files for testing purposes
    ├── workflow/                  # Snakemake rules, scripts, and environments for the workflow
    │   ├── bin/                   # Code for running Snakefile rules and applying stochastic character mapping to somatic SNVs
    │   ├── envs/                  # Conda environments for the Snakemake workflow
    │   └── Snakefile              # Snakemake workflow for applying stochastic character mapping to somatic SNVs
    └── smk7_scm.yaml              # Conda environment for running the Snakemake workflow

4. Rulegraph

[Figure: Rulegraph of the Snakemake workflow]
