The Organelle Nuclear Similarity Mapper.
onsm attempts to classify NUMTs versus NIMTs with some certainty. NUMTs are insertions of mitochondrial DNA into the nuclear genome; NIMTs are the reverse, insertions of nuclear DNA into the mitochondrial genome.
Some features so far:
- Detects candidate homologous loci between nuclear and mitochondrial assemblies.
- Scores each locus for NUMT vs NIMT signal using long read support.
- Reports classification (Likely_NUMT, Likely_NIMT, Ambiguous) with confidence scores.
- Provides genome-level summary metrics (percent nuclear genome NUMT, percent mitochondrial genome NIMT, etc.).
- Works with HiFi or ONT long reads (I've not tested with ONT).
Not on crates.io and no releases yet, so compile from this repo:
git clone https://github.com/ARU-life-sciences/onsm
cd onsm
cargo build --release
Requires:
- minimap2 >2.24
- samtools >1.16
These are super easy to install via:
One-shot classification:
onsm classify \
--mito data/Arabidopsis_thaliana.fasta \
--nuclear data/GCA_933208065.1.fasta.gz \
--reads data/Arabidopsis_thaliana.ccs.fastq.gz \
--platform hifi \
--out results_dir \
--keep-tmp
If you kept temp files (as above), you can use reuse (mainly for dev):
onsm reuse \
--mito mito.fasta \
--nuclear nuclear.fasta.gz \
--platform hifi \
--paf-mito-to-nuc mito_to_nuc.paf \
--paf-nuc-to-mito nuc_to_mito.paf \
--bam-reads-to-nuc reads_to_nuc.bam \
--bam-reads-to-mito reads_to_mito.bam \
--out reuse_results
You may also want to run onsm syscheck, which dumps some information about your system and checks that you're ready to run. minimap2 and samtools must either be in PATH or be specified when running onsm syscheck; otherwise you'll get an error.
When you run onsm classify or onsm reuse, three main result files are created in the output directory:
classification.tsv is the main call table. Each row corresponds to one candidate locus (a mito↔nuclear alignment pair). Columns:
- pair_id – unique identifier for the candidate locus.
- call – classification of the locus:
  - Likely_NUMT: locus looks like a mitochondrial sequence inserted into the nuclear genome.
  - Likely_NIMT: locus looks like a nuclear sequence inserted into the mitochondrial genome.
  - Ambiguous: insufficient evidence to decide.
- confidence – a scaled score difference between the NUMT and NIMT models (higher = more confident).
- reason_codes – why a call was made (e.g. score_difference, delta_below_threshold).
pairs.tsv holds detailed per-locus statistics from alignments and read support. Columns:
- pair_id – matches classification.tsv.
- nuc_contig / nuc_start / nuc_end – coordinates of the nuclear locus.
- mito_contig / mito_start / mito_end – coordinates of the mitochondrial locus.
- aln_len / aln_ident – alignment length and identity (fraction).
- rnuc / rmito – normalized read depths (coverage ratios).
- s_nuc / s_mito – span fractions: proportion of reads spanning the locus in nuclear vs. mito references.
- score_numt / score_nimt – composite scores used by the classifier for NUMT vs NIMT hypotheses.
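Example row:
P000004 OZ173161.1 0 43942 u104 0 43942 43942 1.0000 0.768 0.703 0.001 0.001 0.4747 0.2947
Reading the fields in the column order listed above, the alignment covers ~44 kb at 100% identity (aln_ident = 1.0000). Normalized nuclear and mito coverages are close (rnuc = 0.768, rmito = 0.703), but the scoring leaned toward NUMT (score_numt = 0.4747 > score_nimt = 0.2947).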
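To make the field-to-column mapping explicit, here is a minimal Python sketch that parses the example row into named fields. It assumes the columns appear in the order listed above and are whitespace- or tab-separated; `PAIR_COLUMNS` and `parse_pairs_row` are hypothetical helper names, not part of onsm, and the real file layout (for example, whether it has a header line) should be checked against actual output.

```python
# Column order ASSUMED from the column list in this README, not read from onsm itself.
PAIR_COLUMNS = [
    "pair_id", "nuc_contig", "nuc_start", "nuc_end",
    "mito_contig", "mito_start", "mito_end",
    "aln_len", "aln_ident", "rnuc", "rmito",
    "s_nuc", "s_mito", "score_numt", "score_nimt",
]
INT_COLUMNS = {"nuc_start", "nuc_end", "mito_start", "mito_end", "aln_len"}
FLOAT_COLUMNS = {"aln_ident", "rnuc", "rmito", "s_nuc", "s_mito",
                 "score_numt", "score_nimt"}


def parse_pairs_row(line):
    """Split one data row into a dict keyed by the assumed column names."""
    fields = line.split()
    if len(fields) != len(PAIR_COLUMNS):
        raise ValueError(
            f"expected {len(PAIR_COLUMNS)} fields, got {len(fields)}"
        )
    row = dict(zip(PAIR_COLUMNS, fields))
    for key in INT_COLUMNS:
        row[key] = int(row[key])
    for key in FLOAT_COLUMNS:
        row[key] = float(row[key])
    return row


# The example row shown above.
example = ("P000004 OZ173161.1 0 43942 u104 0 43942 43942 "
           "1.0000 0.768 0.703 0.001 0.001 0.4747 0.2947")
row = parse_pairs_row(example)
print(row["aln_len"], row["aln_ident"], row["rnuc"], row["rmito"])
```

Parsing into named fields rather than indexing by position makes it harder to misread one column for its neighbour when inspecting rows by eye.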
A high-level overview across all pairs:
- n_pairs – number of candidate loci found.
- n_numt / n_nimt – number of loci called as NUMT / NIMT.
- nuclear_bp_total – size of the nuclear assembly.
- nuclear_bp_numt – number of nuclear bases overlapping called NUMTs.
- nuclear_pct_numt – % of nuclear genome spanned by NUMTs.
- mito_bp_total – size of the mitochondrial assembly.
- mito_bp_nimt – number of mitochondrial bases overlapping called NIMTs.
- mito_pct_nimt – % of mitochondrial genome spanned by NIMTs.
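As a rough illustration of how the bp and percent columns relate, the Python sketch below counts the bases covered by a set of called loci and converts that to a percentage of the assembly. It assumes half-open coordinates and that overlapping loci are counted once; onsm may count differently. `covered_bp`, `pct_covered`, and the interval values are hypothetical and for illustration only.

```python
def covered_bp(loci):
    """Bases covered by (contig, start, end) loci, counting overlaps once.

    Coordinates are ASSUMED half-open: start inclusive, end exclusive.
    """
    total = 0
    current = None  # (contig, start, end) of the interval being extended
    for contig, start, end in sorted(loci):
        if current and current[0] == contig and start <= current[2]:
            # Overlaps or touches the current interval: extend it.
            current = (contig, current[1], max(current[2], end))
        else:
            if current:
                total += current[2] - current[1]
            current = (contig, start, end)
    if current:
        total += current[2] - current[1]
    return total


def pct_covered(loci, genome_bp):
    """Percent of a genome of size genome_bp covered by the loci."""
    if genome_bp <= 0:
        raise ValueError("genome_bp must be positive")
    return 100.0 * covered_bp(loci) / genome_bp


# Made-up loci on a made-up 1,000 bp assembly.
loci = [("chr1", 0, 100), ("chr1", 50, 150), ("chr2", 10, 20)]
numt_bp = covered_bp(loci)         # plays the role of nuclear_bp_numt
numt_pct = pct_covered(loci, 1000)  # plays the role of nuclear_pct_numt
print(numt_bp, numt_pct)
```

The same calculation would apply to mito_bp_nimt and mito_pct_nimt, using the mitochondrial coordinates and mito_bp_total.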
The classifier compares evidence from both the assembly alignments (mito to nuclear) and the read support (long reads mapped to each assembly). For each candidate locus (row in pairs.tsv), the following features are considered:
- Alignment features
  - aln_len and aln_ident are combined into an alignment score (long, high-identity alignments are weighted higher).
  - Alignments are paired in both directions (mito→nuc and nuc→mito) to define a candidate locus.
- Coverage ratios
  - rnuc = coverage in nuclear locus ÷ median nuclear coverage.
  - rmito = coverage in mitochondrial locus ÷ median mitochondrial coverage.
  - Intuition:
    - If the nuclear copy has depth consistent with the nuclear genome (rnuc ≈ 1), and the mitochondrial locus is depleted (rmito << 1), this supports a NUMT (mito → nuc).
    - If the mitochondrial copy has depth consistent with mitochondria, and the nuclear copy is depleted, this supports a NIMT (nuc → mito).
- Span fractions
  - s_nuc and s_mito = fraction of reads spanning across the candidate locus in each reference.
  - Guards against partial alignments or collapsed repeats: true insertions are well-spanned in the “host” genome, but not in the “donor” genome.
- Scoring model
  Each locus is scored under two hypotheses:
  - NUMT score
    score_numt = w_a * aln_score + w_l * length_score + w_d * depth_term(rnuc, rmito) + w_s * span_term(s_nuc, s_mito)
  - NIMT score
    score_nimt = w_a * aln_score + w_l * length_score + w_d * depth_term(rmito, rnuc) + w_s * span_term(s_mito, s_nuc)
  where
  - w_a, w_l, w_d, w_s are weights (default: 0.25, 0.15, 0.25, 0.25; configurable via CLI).
  - depth_term and span_term penalize deviations from expected coverage/span in the host genome and low values in the donor genome.
Decision rule
- Compute the difference:
Δ = score_numt – score_nimt - If
Δ >= call_threshold→ Likely_NUMT - If
Δ <= –call_threshold→ Likely_NIMT - Else → Ambiguous
- Default
call_threshold = 0.15. - A stricter cutoff (
highconf_threshold = 0.30) highlights particularly confident calls.
- Compute the difference:
- Confidence value
  - Reported in classification.tsv as confidence = |Δ|
  - The reason_codes column records why the call was made:
    - score_difference: one hypothesis clearly scored higher.
    - delta_below_threshold: both scores too close → ambiguous.
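The decision rule and confidence value described above can be sketched in a few lines of Python. This is an illustration of the rule as documented here, not onsm's actual implementation: the function name `classify` is hypothetical, and comparing |Δ| against highconf_threshold with `>=` is an assumption.

```python
def classify(score_numt, score_nimt,
             call_threshold=0.15, highconf_threshold=0.30):
    """Apply the documented decision rule to a pair of scores.

    Returns (call, confidence, reason_code, high_confidence).
    """
    delta = score_numt - score_nimt
    confidence = abs(delta)  # confidence = |delta|, as described above
    if delta >= call_threshold:
        call, reason = "Likely_NUMT", "score_difference"
    elif delta <= -call_threshold:
        call, reason = "Likely_NIMT", "score_difference"
    else:
        call, reason = "Ambiguous", "delta_below_threshold"
    # ASSUMPTION: the stricter cutoff is inclusive, like the call threshold.
    high_confidence = confidence >= highconf_threshold
    return call, confidence, reason, high_confidence


# Scores taken from the example pairs.tsv row (0.4747 vs 0.2947).
call, confidence, reason, high_confidence = classify(0.4747, 0.2947)
print(call, round(confidence, 4), reason, high_confidence)
```

For the example row, Δ is about 0.18, which clears the default call_threshold of 0.15 but not highconf_threshold of 0.30. Note also that, as the two score formulas are written above, aln_score and length_score enter both scores with the same weights and cancel in Δ, so the call itself depends only on the depth and span terms.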