Skip to content

Implementation Details

Haoliang Xue edited this page Mar 11, 2025 · 2 revisions

1. Organisation of source code

The source code for highest-level entry of KaMRaT locates at apps/kamrat.cpp.

Other source codes are findable within src/ folder:

src/
├── cmds
│   ├── kamratFilter.cpp         # FilterMain function        <=        |- Entry of filter module
│   ├── kamratIndex.cpp          # IndexMain function         <=        |- Entry of index module
│   ├── kamratMask.cpp           # MaskMain function          <=        |- Entry of mask module
│   ├── kamratMerge.cpp          # MergeMain function         <=        |- Entry of merge module
│   ├── kamratQuery.cpp          # QueryMain function         <=        |- Entry of query module
│   ├── kamratScore.cpp          # ScoreMain function         <=        |- Entry of score module
├── include
│   ├── kamratFilter.hpp
│   ├── kamratIndex.hpp
│   ├── kamratMask.hpp
│   ├── kamratMerge.hpp
│   ├── kamratQuery.hpp
│   ├── kamratScore.hpp
├── data_struct
│   ├── contig_elem.cpp          # ContigElem object implementation, used in merge module
│   ├── contig_elem.hpp          # ContigElem object declaration, used in merge module
│   ├── feature_elem.cpp         # FeatureElem object implementation, used in score module
│   ├── feature_elem.hpp         # FeatureElem object declaration, used in score module
│   ├── merge_knot.cpp           # MergeKnot object implementation, used in merge module
│   ├── merge_knot.hpp           # MergeKnot object declaration, used in merge module
│   ├── scorer.cpp               # Scorer object implementation, used in score module
│   └── scorer.hpp               # Scorer object declaration, used in score module
├── runinfo_files
│   ├── filter_runinfo.hpp       # filter helper and argument parsing functions
│   ├── index_runinfo.hpp        # index helper and argument parsing functions
│   ├── mask_runinfo.hpp         # mask helper and argument parsing functions
│   ├── merge_runinfo.hpp        # merge helper and argument parsing functions
│   ├── query_runinfo.hpp        # query helper and argument parsing functions
│   └── score_runinfo.hpp        # score helper and argument parsing functions
└── utils
    ├── FeatureStreamer.cpp      # FeatureStreamer class 
    ├── FeatureStreamer.hpp
    ├── IndexRandomAccess.cpp    # IndexRandomAccess class
    ├── IndexRandomAccess.hpp
    ├── index_loading.cpp        # functions for load/write indexes and count parsing
    ├── index_loading.hpp
    ├── seq_coding.cpp           # functions for dealing with k-mer uint_64 code
    ├── seq_coding.hpp
    ├── vect_opera.cpp           # functions for dealing with vector operations (Pearson, Spearman, and MAC distances)
    └── vect_opera.hpp

2. KaMRaT merge implementation

KaMRaT merge inherits the idea in DE-kupl mergeTags, with a novel intervention step to improve extension correctness. The procedure is:

  1. Scan the k-mers/contigs waiting to be merged and index both ends for each with the given prefix/suffix length. This constructs a series of "knots" that binds k-mers/contigs with matched prefix/suffix overlap together (MakeOverlapKnots function);
  2. Scan the overlap knots, and merge the binded k-mers/contigs (DoExtension function) under certain conditions (see below);
  3. Iteratively do the previous two steps till no further extension can be made.

These steps simplify O(n^2) k-mer/contig comparisons to several times of O(n) scanning.

2.1 MakeOverlapKnots Function

The "overlap knot" is defined by an uint64 code, computed from a character string of length i_ovlp < k. During the merging process, each knot may attach 0 or 1 or multiple k-mers/contigs on either side. The k-mers/contigs attached on the left are the "predecessors" of the knot, while those attached on the right are the "successors". The predecessors and successors of the same knot are candidates of the final merging.

To establish the overlap knot, each k-mer/contig waiting to be processed is scanned with the function MakeOverlapKnots. This function indexes the leading and ending i_ovlp (i_ovlp < k) nucleotides (denoted as "prefix" and "suffix") by converting their character strings to uint64 codes (computed by the utility function Seq2Int).

Each k-mer/contig is attached to one or two overlap knots:

  • the k-mer/contig is attached to one knot if its prefix and suffix are same;
  • the k-mer/contig is attached to two different knots if its prefix and suffix are different (one for prefix and one for suffix).

The k-mer/contig may be attached as the predecessor (left k-mer/contig) or successor (right k-mer/contig) of the overlap knot, depending on whether prefix' or suffix' need to be reversed in the unstranded dataset:

  • if the dataset is stranded, the prefix' and suffix' uint64 code are computed directly by the sequence per se (Seq2Int function), and the k-mer/contig is attached as a successor to the knot of its prefix and as a predecessor to the knot of its suffix;
  • if the dataset is unstranded, the prefix' and suffix' uint64 code are computed as the minimum code between the character string per se and its reverse-complement counterpart (Seq2Int function):
    • prefix:
      • if the prefix has not been reverse-complement transformed, the k-mer/contig is attached as the successor to the knot;
      • if the prefix has been reverse-complement transformed, the k-mer/contig is attached as the predecessor to the knot and is marked that it should be RC-transformed before extension;
    • suffix:
      • if the suffix has not been reverse-complement transformed, the k-mer/contig is attached as the predecessor to the knot;
      • if the suffix has been reverse-complement transformed, the k-mer/contig is attached as the successor to the knot and is marked that it should be RC-transformed before extension.

When attaching the k-mer/contig to the knot:

  • if the k-mer/contig has different prefix and suffix, it is straightforwardly attached to the two knots by its prefix and suffix;
  • if the k-mer/contig has same prefix and suffix, it is attached according to the knot's status:
    • if the knot has no predecessor yet (no matter it has successor or not), attach the k-mer/contig as predecessor;
    • if the knot has predecessor already, but no successor yet, attach the k-mer/contig as successor;
    • otherwise, attach the k-mer/contig according to its prefix, but this is not important since the knot will be marked as has_ambiguity.

2.2 DoExtension Function

The merging is done by scanning the established overlap knot list (Function DoExtension). The k-mers/contigs attached to a overlap knot can be merged if:

  • There has and only has one k-mer/contig on either side of the knot;
  • The k-mers/contigs bound on each side are not a same one (to avoid the case where a k-mer/contig has same prefix and suffix, and thus merged as a circle);
  • The k-mers/contigs bound on each side have compatible count profile across samples (new in KaMRaT).

The first two conditions are checked by IsMergeable function in the MergeKnot class. The third condition is estimated and checked by one of the three functions:

  • CalcPearsonDist: estimate count profile distance by 0.5 x (1 - corr.pearson);
  • CalcSpearmanDist: estimate count profile distance by 0.5 x (1 - corr.spearman);
  • CalcMACDist: estimate count profile distance by mean absolute contrast.

At each overlap knot with unique predecessor and unique successor, one k-mer/contig is considered as 'representative' while the other is considered 'subordinate'. This is defined either by a special "rep_value" column, or by the input order.

We always keep the representative k-mer/contig as the base, and prevent it from the reverse-complement transformation:

  • if the representative contig is a predecessor:
    • if it is not needed to be RC-transformed, then straightforwardly do RightExtend function (with subordinate contig RC-transformed if needed);
    • if it is needed to be RC-transformed, then don't do the RC-transformation on this representative contig, but do LeftExtend function with the subordinate contig:
      • RC-transformed if originally not needed;
      • non-RC-transformed if originally needed;
  • if the representative contig is a successor:
    • if it is not needed to be RC-transformed, then straightforwardly do LeftExtend function (with subordinate contig RC-transformed if needed);
    • if it is needed to be RC-transformed, then don't do the RC-transformation on this representative contig, but do RightExtend function with the subordinate contig:
      • RC-transformed if originally not needed;
      • non-RC-transformed if originally needed.

Clone this wiki locally