miXer: a Machine-learning method to detect genomic Imbalances exploiting X-chromosome Exome Reads

miXer (with a capital X) is a lightweight machine learning tool designed to detect genomic deletions and duplications by exploiting the natural difference in X-chromosome copy number between male and female exomes. It builds on EXCAVATOR2 for data preprocessing and combines two key components: a single-exon Copy Number (CN) classifier based on Support Vector Machines (SVM), trained on data from six widely used exome sequencing kits; and a post-classification step that uses a Hidden Markov Model (HMM) for filtering and aggregation.
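As a rough illustration of the X-chromosome idea (an idealized, noise-free sketch, not miXer's actual code): outside the pseudoautosomal regions a male carries one copy of the X chromosome and a female carries two, so comparing a male sample against a female control looks like a single-copy deletion, and the reverse comparison looks like a single-copy duplication. The function name `expected_log2_ratio` is hypothetical.

```python
import math


def expected_log2_ratio(test_cn: int, control_cn: int) -> float:
    """Idealized log2 read-depth ratio between a test and a control region."""
    if control_cn <= 0:
        raise ValueError("control copy number must be positive")
    if test_cn == 0:
        return float("-inf")  # no copies in the test sample, no reads expected
    return math.log2(test_cn / control_cn)


# Non-pseudoautosomal X chromosome: 1 copy in males, 2 copies in females.
male_vs_female = expected_log2_ratio(1, 2)  # behaves like a one-copy deletion
female_vs_male = expected_log2_ratio(2, 1)  # behaves like a one-copy duplication
```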

flowchart TD
    subgraph A_PREPROC["preprocessing"]
        EXC2["EXCAVATOR2"]
    end
    subgraph B_INFER["inference"]
        SVM["SVM"]
        HMM["HMM"]
    end

    A[/"BAM files"/] --> A_PREPROC --> B[/"ReadCount data"/]
    C[/"ReadCount data"/] --> B_INFER --> D[/"VCFs"/]

Setup

To run miXer, one of Docker, Apptainer or Singularity must be installed on the machine.

Clone the repository:
git clone https://github.com/ctglab/miXer

Download the support files from the Zenodo repository and place them in a folder with read and write permissions. The archive contains the following files:

CentromerePosition_hgVersion.txt: Coordinates of the centromeres for the chosen human genome assembly.
ChromosomeCoordinate_hgVersion.txt: Coordinates of the chromosomes.
Gap_hgVersion.txt: Gap annotations.
mappability_track_hgVersion.bw: Genome-wide mappability track in BigWig format.
GRC_pseudoautosomal_regions_hgVersion.gz: Coordinates of the pseudoautosomal regions (PARs) for the chosen human genome assembly.

The BigWig files encode genome-wide mappability for hg19 and hg38 assemblies and are required in order to annotate each target region. They were produced with the GEM mapper (Derrien et al., 2012) from the GEM suite (https://gemlibrary.sourceforge.net/), using 100 bp sliding windows and allowing up to two mismatches.

Running miXer: Requirements and Configuration Steps

miXer requires the following resources to be configured before running:

config.json
sample_sheet.tsv

A draft of both files can be found in the utils/ folder of the repository.

Executing miXer

miXer analysis is composed of two main steps:

preprocessing the input data with EXCAVATOR2
running the miXer CNV calling algorithm

To run the preprocessing using Docker:

docker run --rm \
  -v /path/to/bam_files:/path/to/bam_files \
  -v /path/to/resources_folder:/path/to/resources_folder \
  -v /path/to/output_directory:/path/to/output_directory \
  ctglabcnr/mixer:latest preprocessing \
  -j /path/to/config.json

To run the preprocessing using Apptainer/Singularity:

apptainer run \
  -B /path/to/bam_files \
  -B /path/to/resources_folder \
  -B /path/to/output_directory \
  docker://ctglabcnr/mixer:latest preprocessing \
  -j /path/to/config.json

To run the miXer inference step using Docker:

docker run --rm \
  -v /path/to/bam_files:/path/to/bam_files \
  -v /path/to/resources_folder:/path/to/resources_folder \
  -v /path/to/output_directory:/path/to/output_directory \
  ctglabcnr/mixer:latest inference \
  -j /path/to/config.json

To run the miXer inference step using Apptainer/Singularity:

apptainer run \
  -B /path/to/bam_files \
  -B /path/to/resources_folder \
  -B /path/to/output_directory \
  docker://ctglabcnr/mixer:latest inference \
  -j /path/to/config.json

Some additional arguments can be passed to the inference command:

-bw : Baum-Welch iterations to run for the HMM. Default is 20.
-delta : Baum-Welch delta value. Default is 1e-9.

An end-to-end analysis can also be run in place of the two separate steps above.

With Docker:

docker run --rm \
  -v /path/to/bam_files:/path/to/bam_files \
  -v /path/to/resources_folder:/path/to/resources_folder \
  -v /path/to/output_directory:/path/to/output_directory \
  ctglabcnr/mixer:latest end2end \
  -j /path/to/config.json

With Apptainer/Singularity:

apptainer run \
  -B /path/to/bam_files \
  -B /path/to/resources_folder \
  -B /path/to/output_directory \
  docker://ctglabcnr/mixer:latest end2end \
  -j /path/to/config.json
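When the three modes are launched from a script, the Docker invocations above can be assembled programmatically. The following is a minimal sketch, not part of miXer: the helper name `build_docker_command` is hypothetical, each host folder is assumed to be bind-mounted at the same path inside the container, and `-bw` / `-delta` are only allowed for `inference` because that is the only command they are documented for here.

```python
from typing import List, Optional, Sequence

IMAGE = "ctglabcnr/mixer:latest"
MODES = ("preprocessing", "inference", "end2end")


def build_docker_command(
    mode: str,
    config_json: str,
    bind_dirs: Sequence[str],
    bw: Optional[int] = None,
    delta: Optional[float] = None,
) -> List[str]:
    """Return the argument list for one miXer Docker run (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if (bw is not None or delta is not None) and mode != "inference":
        raise ValueError("-bw and -delta are documented for the inference command only")
    cmd = ["docker", "run", "--rm"]
    for directory in bind_dirs:
        # Assumption: mount each host folder at the identical container path.
        cmd += ["-v", f"{directory}:{directory}"]
    cmd += [IMAGE, mode, "-j", config_json]
    if bw is not None:
        cmd += ["-bw", str(bw)]
    if delta is not None:
        cmd += ["-delta", str(delta)]
    return cmd


example_cmd = build_docker_command(
    "inference",
    "/path/to/config.json",
    ["/path/to/bam_files", "/path/to/resources_folder", "/path/to/output_directory"],
    bw=20,
)
```

The resulting list can be passed to `subprocess.run(example_cmd, check=True)` on a machine where Docker is installed.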

Set up the config file

The config.json file must be filled in with the following information:

JSON Variable Name	Value	Meaning
`exp_id`	string: `experiment_name`	Experiment identifier, used as name for output folder.
`threads`	int: `12`	Number of threads to use for parallel execution.
`sample_list`	string: `sampleList.tsv`	Name of the sample sheet file used by miXer.
`target`	string: `TargetFile.bed`	Name of target BED file.
`par`	string: `GRC_pseudoautosomal_regions_hgVersion.gz`	Name of annotation file containing Pseudoautosomal Regions.
`map`	string: `mappability_track_hgVersion.bw`	Name of Mappability track in BigWig format.
`gap`	string: `Gap_hgVersion.txt`	Name of UCSC gap annotation file.
`centro`	string: `CentromerePosition_hgVersion.txt`	Name of Centromere coordinates file.
`chrom`	string: `ChromosomeCoordinate_hgVersion.txt`	Name of file containing chromosome coordinates.
`ref`	string: `reference_genome.fasta`	Name of reference genome file in FASTA format.
`premade_controls`	string: `premadeControl.NRC.RData`	Precomputed RData file containing normalized read counts for control samples.
`main_outdir_host`	string: `/path/to/output_directory/`	Path to output directory, will be created if not existing.
`enable_intermediate_results_output`	bool: `false`	Enable intermediate results output (SVM-ready datasets and SVM-processed output).

A JSON configuration file template can be found in the utils/ folder.
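For orientation, the table above can be turned into a config.json skeleton with a few lines of Python. This is only a sketch: the values are the placeholders from the table (including the literal `hgVersion` suffix, which must be replaced with the real file names), and treating every key as required is an assumption. The template in utils/ remains the reference.

```python
import json

# Placeholder values copied from the table; replace them with real file names.
config = {
    "exp_id": "experiment_name",
    "threads": 12,
    "sample_list": "sampleList.tsv",
    "target": "TargetFile.bed",
    "par": "GRC_pseudoautosomal_regions_hgVersion.gz",
    "map": "mappability_track_hgVersion.bw",
    "gap": "Gap_hgVersion.txt",
    "centro": "CentromerePosition_hgVersion.txt",
    "chrom": "ChromosomeCoordinate_hgVersion.txt",
    "ref": "reference_genome.fasta",
    "premade_controls": "premadeControl.NRC.RData",
    "main_outdir_host": "/path/to/output_directory/",
    "enable_intermediate_results_output": False,
}

# Assumption: every key listed in the table is expected to be present.
EXPECTED_KEYS = set(config)


def missing_keys(cfg: dict) -> list:
    """Return the table keys that are absent from cfg, sorted by name."""
    return sorted(EXPECTED_KEYS - set(cfg))


config_text = json.dumps(config, indent=2)  # text to save as config.json
```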

Control Sample Requirements:

Required control sample pool size: 10 samples.
An all-female (F) control pool is required for CNV calling on the X chromosome.
WARNING: Unpredictable behaviour can arise when a different number of control samples is used.
If reusing control data from a previous EXCAVATOR2/miXer run:
The .RData file generated by EXCAVATOR2's DataPrepare module can be reused.

If control samples are unavailable:
miXer can use the control sample .RData file generated when building the X-chromosome training set of the SVM CN classifier. These files are available on Zenodo.

WARNING: This will only be feasible if the exome sequencing kit is exactly the same as one of those reported below (unpredictable behaviour otherwise):

Kit name	Capture Technology	Bait Size (Mb)	Ref. Build	Platform	Read Length
MGI	MGIEasy Exome Capture V4 (MGI, Shenzhen, China)	58.97	hg19	Illumina HiSeq X	150 PE
MedExome	SeqCap EZ MedExome (Roche NimbleGen Inc, Madison, USA)	46.58	hg19	Illumina NextSeq 550	150 PE
Nextera	Nextera Rapid Capture Exome V1.2 (Illumina Inc, San Diego, USA)	45.33	GRCh37	Illumina HiSeq X	150 PE
SureSelect V6	SureSelect Human All Exon V6 (Agilent Technologies, Santa Clara, USA)	60.46	hg19	Illumina NovaSeq 6000	150 PE
Twist	Human Core Exome + RefSeq Panel V1.3 (Twist Bioscience, San Francisco, USA)	36.71	GRCh38	Illumina HiSeq X	150 PE
SureSelect V2 (1000G)	SureSelect Human All Exon V2 (Agilent Technologies, Santa Clara, USA)	46.00	GRCh38	Illumina HiSeq 2000/2500	76 or 101 PE

Sample Sheet Configuration

The sampleList.tsv file must be filled in with the following information:

ID	bamName	Gender	sampleType
test1	/path/to/bam_files/sample_file.bam	M	T
ctrl1	/path/to/bam_files/control1_file.bam	F	C

Where:

ID: Sample identifier used to name miXer outputs. Must not be an integer-only value.
bamName: Path to .bam file for current sample.
Gender: Sex of the sample, M or F. If unknown, enter F; miXer will correct it automatically.
sampleType: Either T (Test) or C (Control); CNV calling will be done for T samples using C samples as controls.
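The rules above, together with the control pool requirements (10 controls, all F), can be checked before launching a run. The following is a minimal sketch, not part of miXer: the function name `validate_sample_sheet` is hypothetical, and it assumes the file is tab-separated and starts with the header row shown in the table.

```python
import csv
import io

EXPECTED_COLUMNS = ["ID", "bamName", "Gender", "sampleType"]


def validate_sample_sheet(text: str, expected_controls: int = 10) -> list:
    """Return a list of human-readable problems found in a sample sheet."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    rows = list(reader)
    problems = []
    for line_no, row in enumerate(rows, start=2):
        if row["ID"].isdigit():
            problems.append(f"line {line_no}: ID must not be an integer-only value")
        if row["Gender"] not in {"M", "F"}:
            problems.append(f"line {line_no}: Gender must be M or F")
        if row["sampleType"] not in {"T", "C"}:
            problems.append(f"line {line_no}: sampleType must be T or C")
    controls = [r for r in rows if r["sampleType"] == "C"]
    # No C rows is allowed here: a pre-made control .RData may be used instead.
    if controls and len(controls) != expected_controls:
        problems.append(f"expected {expected_controls} controls, found {len(controls)}")
    if any(r["Gender"] != "F" for r in controls):
        problems.append("all control samples should be F")
    if not any(r["sampleType"] == "T" for r in rows):
        problems.append("no test (T) sample found")
    return problems


HEADER = "ID\tbamName\tGender\tsampleType\n"
EXAMPLE = HEADER + "test1\t/path/to/bam_files/sample_file.bam\tM\tT\n" + "".join(
    f"ctrl{i}\t/path/to/bam_files/control{i}_file.bam\tF\tC\n" for i in range(1, 11)
)
example_problems = validate_sample_sheet(EXAMPLE)
```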

miXer Outputs

exca2_output_experiment_name: Folder containing the full output of the EXCAVATOR2 tool:
- If control samples are provided, EXCAVATOR2 CNV calls will also be available. Otherwise, if miXer is run using a pre-made control sample RData, only the output of the DataPrepare module will be present.
mixer_vcfs: Folder containing VCF files (v4.4) for all samples specified as T in the sampleList.tsv
mixer_windows: Folder containing, for all samples specified as T in the sampleList.tsv:
- SVM_guess_ploidy_results.txt: File containing miXer's estimated ploidy status of the X chromosome for each sample
- sampleID_TARGET folders: One folder for each sample, containing:
  - sampleID_TARGET_hmm_bw20_PASS_ONLY_windows.bed: PASS quality CNV windows only (i.e. CNVs with a confidence score higher than 0.9)
  - sampleID_TARGET_hmm_bw20_windows.bed: All CNV windows defined by miXer.

miXer CNV windows

The files sampleID_TARGET_hmm_bw20_PASS_ONLY_windows.bed and sampleID_TARGET_hmm_bw20_windows.bed contain the CNV windows defined by miXer in .bed format, with the following structure:

Field	Meaning
Chr	Chromosome
Start	CNV window start coordinate
End	CNV window end coordinate
State	CNV type (DEL/DUP)
Call	CN state of CNV window (-2,-1,0,1,2+)
CN	Estimated CN of CNV window (0,1,2,3,4+)
ProbCall	CNV window confidence
p_error	Probability of error associated with the CNV call
Median_NRC	Median value of NRC_poolNorm for the target regions (TRs) in the CNV window
Number_of_TR	Number of TRs included in the CNV window
window_length	CNV window length
Mean_TR_length	Mean length of TRs in the CNV window
window_mean_mappability	Mean mappability of the TRs in the CNV window
alt_post_prob_mean	Mean of HMM state probabilities in CNV windows
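A windows file with this layout can be loaded and filtered in a few lines. The sketch below is illustrative only: it assumes the file is tab-separated, starts with a header row, and lists the fields in the order of the table. It also assumes that ProbCall is the confidence score behind the PASS threshold of 0.9. The two data rows are made-up values, not real miXer output.

```python
import csv
import io

COLUMNS = [
    "Chr", "Start", "End", "State", "Call", "CN", "ProbCall", "p_error",
    "Median_NRC", "Number_of_TR", "window_length", "Mean_TR_length",
    "window_mean_mappability", "alt_post_prob_mean",
]


def read_windows(text: str, has_header: bool = True) -> list:
    """Parse tab-separated CNV windows into a list of dicts keyed by COLUMNS."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    if has_header:
        rows = rows[1:]
    return [dict(zip(COLUMNS, r)) for r in rows]


def pass_windows(windows: list, min_prob: float = 0.9) -> list:
    """Keep windows whose ProbCall is strictly higher than min_prob."""
    return [w for w in windows if float(w["ProbCall"]) > min_prob]


# Made-up rows for illustration only.
EXAMPLE = "\n".join([
    "\t".join(COLUMNS),
    "chrX\t1000\t2000\tDEL\t-1\t1\t0.95\t0.05\t0.5\t3\t1000\t120.0\t0.98\t0.93",
    "chr1\t5000\t9000\tDUP\t1\t3\t0.70\t0.30\t1.5\t4\t4000\t150.0\t0.90\t0.72",
])
windows = read_windows(EXAMPLE)
confident = pass_windows(windows)
```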
