-
Notifications
You must be signed in to change notification settings - Fork 2
Implementation Details
The source code for highest-level entry of KaMRaT locates at apps/kamrat.cpp.
Other source codes are findable within src/ folder:
src/
├── cmds
│ ├── kamratFilter.cpp # FilterMain function <= |- Entry of filter module
│ ├── kamratIndex.cpp # IndexMain function <= |- Entry of index module
│ ├── kamratMask.cpp # MaskMain function <= |- Entry of mask module
│ ├── kamratMerge.cpp # MergeMain function <= |- Entry of merge module
│ ├── kamratQuery.cpp # QueryMain function <= |- Entry of query module
│ ├── kamratScore.cpp # ScoreMain function <= |- Entry of score module
├── include
│ ├── kamratFilter.hpp
│ ├── kamratIndex.hpp
│ ├── kamratMask.hpp
│ ├── kamratMerge.hpp
│ ├── kamratQuery.hpp
│ ├── kamratScore.hpp
├── data_struct
│ ├── contig_elem.cpp # ContigElem object implementation, used in merge module
│ ├── contig_elem.hpp # ContigElem object declaration, used in merge module
│ ├── feature_elem.cpp # FeatureElem object implementation, used in score module
│ ├── feature_elem.hpp # FeatureElem object declaration, used in score module
│ ├── merge_knot.cpp # MergeKnot object implementation, used in merge module
│ ├── merge_knot.hpp # MergeKnot object declaration, used in merge module
│ ├── scorer.cpp # Scorer object implementation, used in score module
│ └── scorer.hpp # Scorer object declaration, used in score module
├── runinfo_files
│ ├── filter_runinfo.hpp # filter helper and argument parsing functions
│ ├── index_runinfo.hpp # index helper and argument parsing functions
│ ├── mask_runinfo.hpp # mask helper and argument parsing functions
│ ├── merge_runinfo.hpp # merge helper and argument parsing functions
│ ├── query_runinfo.hpp # query helper and argument parsing functions
│ └── score_runinfo.hpp # score helper and argument parsing functions
└── utils
├── FeatureStreamer.cpp # FeatureStreamer class
├── FeatureStreamer.hpp
├── IndexRandomAccess.cpp # IndexRandomAccess class
├── IndexRandomAccess.hpp
├── index_loading.cpp # functions for load/write indexes and count parsing
├── index_loading.hpp
├── seq_coding.cpp # functions for dealing with k-mer uint_64 code
├── seq_coding.hpp
├── vect_opera.cpp # functions for dealing with vector operations (Pearson, Spearman, and MAC distances)
└── vect_opera.hpp
KaMRaT merge inherits the idea in DE-kupl mergeTags, with a novel intervention step to improve extension correctness. The procedure is:
- Scan the k-mers/contigs waiting to be merged and index both ends for each with the given prefix/suffix length. This constructs a series of "knots" that binds k-mers/contigs with matched prefix/suffix overlap together (
MakeOverlapKnotsfunction); - Scan the overlap knots, and merge the binded k-mers/contigs (
DoExtensionfunction) under certain conditions (see below); - Iteratively do the previous two steps till no further extension can be made.
These steps simplify O(n^2) k-mer/contig comparisons to several times of O(n) scanning.
The "overlap knot" is defined by an uint64 code, computed from a character string of length i_ovlp < k. During the merging process, each knot may attach 0 or 1 or multiple k-mers/contigs on either side. The k-mers/contigs attached on the left are the "predecessors" of the knot, while those attached on the right are the "successors". The predecessors and successors of the same knot are candidates of the final merging.
To establish the overlap knot, each k-mer/contig waiting to be processed is scanned with the function MakeOverlapKnots. This function indexes the leading and ending i_ovlp (i_ovlp < k) nucleotides (denoted as "prefix" and "suffix") by converting their character strings to uint64 codes (computed by the utility function Seq2Int).
Each k-mer/contig is attached to one or two overlap knots:
- the k-mer/contig is attached to one knot if its prefix and suffix are same;
- the k-mer/contig is attached to two different knots if its prefix and suffix are different (one for prefix and one for suffix).
The k-mer/contig may be attached as the predecessor (left k-mer/contig) or successor (right k-mer/contig) of the overlap knot, depending on whether prefix' or suffix' need to be reversed in the unstranded dataset:
- if the dataset is stranded, the prefix' and suffix' uint64 code are computed directly by the sequence per se (
Seq2Intfunction), and the k-mer/contig is attached as a successor to the knot of its prefix and as a predecessor to the knot of its suffix; - if the dataset is unstranded, the prefix' and suffix' uint64 code are computed as the minimum code between the character string per se and its reverse-complement counterpart (
Seq2Intfunction):- prefix:
- if the prefix has not been reverse-complement transformed, the k-mer/contig is attached as the successor to the knot;
- if the prefix has been reverse-complement transformed, the k-mer/contig is attached as the predecessor to the knot and is marked that it should be RC-transformed before extension;
- suffix:
- if the suffix has not been reverse-complement transformed, the k-mer/contig is attached as the predecessor to the knot;
- if the suffix has been reverse-complement transformed, the k-mer/contig is attached as the successor to the knot and is marked that it should be RC-transformed before extension.
- prefix:
When attaching the k-mer/contig to the knot:
- if the k-mer/contig has different prefix and suffix, it is straightforwardly attached to the two knots by its prefix and suffix;
- if the k-mer/contig has same prefix and suffix, it is attached according to the knot's status:
- if the knot has no predecessor yet (no matter it has successor or not), attach the k-mer/contig as predecessor;
- if the knot has predecessor already, but no successor yet, attach the k-mer/contig as successor;
- otherwise, attach the k-mer/contig according to its prefix, but this is not important since the knot will be marked as
has_ambiguity.
The merging is done by scanning the established overlap knot list (Function DoExtension). The k-mers/contigs attached to a overlap knot can be merged if:
- There has and only has one k-mer/contig on either side of the knot;
- The k-mers/contigs bound on each side are not a same one (to avoid the case where a k-mer/contig has same prefix and suffix, and thus merged as a circle);
- The k-mers/contigs bound on each side have compatible count profile across samples (new in
KaMRaT).
The first two conditions are checked by IsMergeable function in the MergeKnot class. The third condition is estimated and checked by one of the three functions:
-
CalcPearsonDist: estimate count profile distance by 0.5 x (1 - corr.pearson); -
CalcSpearmanDist: estimate count profile distance by 0.5 x (1 - corr.spearman); -
CalcMACDist: estimate count profile distance by mean absolute contrast.
At each overlap knot with unique predecessor and unique successor, one k-mer/contig is considered as 'representative' while the other is considered 'subordinate'. This is defined either by a special "rep_value" column, or by the input order.
We always keep the representative k-mer/contig as the base, and prevent it from the reverse-complement transformation:
- if the representative contig is a predecessor:
- if it is not needed to be RC-transformed, then straightforwardly do
RightExtendfunction (with subordinate contig RC-transformed if needed); - if it is needed to be RC-transformed, then don't do the RC-transformation on this representative contig, but do
LeftExtendfunction with the subordinate contig:- RC-transformed if originally not needed;
- non-RC-transformed if originally needed;
- if it is not needed to be RC-transformed, then straightforwardly do
- if the representative contig is a successor:
- if it is not needed to be RC-transformed, then straightforwardly do
LeftExtendfunction (with subordinate contig RC-transformed if needed); - if it is needed to be RC-transformed, then don't do the RC-transformation on this representative contig, but do
RightExtendfunction with the subordinate contig:- RC-transformed if originally not needed;
- non-RC-transformed if originally needed.
- if it is not needed to be RC-transformed, then straightforwardly do