dot calling state of the art

This issue briefly describes the current state of dot-calling in `cooltools`.

### Intro
We have been working for some time on re-implementing the HiCCUPS dot-calling algorithm under the `cooltools` umbrella. It is still under active development, and right now the code is scattered across forks and branches.


### master
Our initial progress with dot-calling, a convolution-based calculation of the locally adjusted expected for the `donut`, `lowleft`, `vertical`, and `horizontal` kernels, is in the `master` branch of this repository. The post-processing steps in `master` are closest to the so-called BH-FDR version of dot-calling in the original HiCCUPS paper (Rao et al. 2014), in the sense that we do not do lambda-chunking to perform multiple hypothesis testing. Moreover, this implementation simply ends by dumping the pre-calculated adjusted expected for each of the kernels, and leaves the post-processing in bad shape. Thus it isn't very usable for now, at least not for final dot-calling.
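To make the "locally adjusted expected" concrete, here is a minimal pure-Python sketch for a single pixel and the `donut` kernel. The function names `donut_kernel` and `adjusted_expected` are hypothetical, the kernel shape (a square of half-width `w` with the central square of half-width `p` and the central row and column masked out) is an assumption based on the HiCCUPS description, and the pixel is assumed to be far enough from the matrix edges for the whole footprint to fit. The actual code does this for all pixels at once via convolution.

```python
def donut_kernel(w, p):
    """Return a (2w+1) x (2w+1) 0/1 mask for a HiCCUPS-style donut (assumed shape)."""
    size = 2 * w + 1
    kernel = [[1] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            dy, dx = r - w, c - w
            in_center = abs(dx) <= p and abs(dy) <= p  # central square around the pixel
            on_cross = dx == 0 or dy == 0              # central row and column
            if in_center or on_cross:
                kernel[r][c] = 0
    return kernel


def adjusted_expected(observed, expected, i, j, kernel):
    """Rescale expected[i][j] by the observed/expected ratio inside the kernel footprint."""
    w = len(kernel) // 2
    obs_sum = 0.0
    exp_sum = 0.0
    for r, row in enumerate(kernel):
        for c, weight in enumerate(row):
            if weight:
                obs_sum += observed[i + r - w][j + c - w]
                exp_sum += expected[i + r - w][j + c - w]
    return expected[i][j] * obs_sum / exp_sum


# Toy usage: a distance-dependent expected and an observed that is uniformly 2x higher.
toy_expected = [[1.0 + 0.1 * abs(r - c) for c in range(11)] for r in range(11)]
toy_observed = [[2.0 * v for v in row] for row in toy_expected]
local_expected = adjusted_expected(toy_observed, toy_expected, 5, 5, donut_kernel(5, 2))
```

The other kernels (`lowleft`, `vertical`, `horizontal`) would differ only in which cells of the mask are set to 1.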

### new-dotfinder
The lambda-chunking procedure is implemented in the `new-dotfinder` branch of the `dekkerlab` fork of `cooltools`, which is `pip`-installable:
```
pip install git+https://github.com/dekkerlab/cooltools@new-dotfinder
pip install -e git+https://github.com/dekkerlab/cooltools@new-dotfinder#egg=cooltools
```
The second command installs the package in editable mode, which lets you modify the source code, whereas the first one simply installs it. You would want to modify the source code if, for instance, you need to change the enrichment thresholds, as those are not implemented as CLI options just yet. A typical run would be:
```
cooltools call_dots -n {cores} \
    -o {signif_dots} -v --fdr 0.1 \
    --max-nans-tolerated 7 \
    --max-loci-separation 20000000 \
    --dots-clustering-radius 21000 \
    --tile-size 10000000 \
    {input_cooler} {input_expected}
```
This produces a list of dots that passed multiple hypothesis testing (the lambda-chunking step itself) but haven't been post-processed, i.e. clustered or filtered by enrichment. The post-processed list of dots shows up in the same folder as `signif_dots` but with the prefix `final_`; we'll fix this ugliness later on, of course. The `call_dots` CLI determines the resolution and picks the correct kernel parameters `w` and `p` accordingly (all defaults are kept the same as in HiCCUPS).
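For readers unfamiliar with lambda-chunking, the idea can be sketched roughly as follows: pixels are grouped by their locally adjusted expected into logarithmically spaced bins ("chunks"), a Poisson p-value is computed for each pixel using the upper edge of its chunk as the rate, and a Benjamini-Hochberg FDR correction is applied within each chunk separately. The sketch below handles a single kernel only; all function names are hypothetical, the bin base of `2 ** (1 / 3)` is an assumption taken from the HiCCUPS description, and none of this is the actual `cooltools` code.

```python
import math

BASE = 2 ** (1 / 3)  # assumed spacing of the chunk edges: 1, BASE, BASE**2, ...


def lambda_chunk(expected):
    """Index k of the first chunk edge BASE**k that is >= expected (chunk 0 covers [0, 1])."""
    k, edge = 0, 1.0
    while edge < expected:
        k += 1
        edge *= BASE
    return k


def poisson_sf(count, lam):
    """P(X >= count) for X ~ Poisson(lam); naive summation, imprecise for tiny p-values."""
    term = math.exp(-lam)
    cdf = 0.0
    for x in range(count):
        cdf += term
        term *= lam / (x + 1)
    return max(0.0, 1.0 - cdf)


def benjamini_hochberg(pvalues, fdr):
    """Return one boolean per p-value: True if rejected at the given FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda idx: pvalues[idx])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= fdr * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def call_significant(pixels, fdr=0.1):
    """pixels: list of (observed_count, adjusted_expected) pairs for one kernel."""
    chunks = {}
    for n, (_, expected) in enumerate(pixels):
        chunks.setdefault(lambda_chunk(expected), []).append(n)
    significant = [False] * len(pixels)
    for k, members in chunks.items():
        lam = BASE ** k  # upper edge of the chunk, a conservative rate for its pixels
        pvals = [poisson_sf(pixels[n][0], lam) for n in members]
        for n, ok in zip(members, benjamini_hochberg(pvals, fdr)):
            significant[n] = ok
    return significant
```

For example, `call_significant([(30, 1.5), (1, 1.5), (2, 1.4)])` puts all three pixels into the same chunk and tests them against each other only. In HiCCUPS a pixel has to pass this test for every kernel to be kept.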
This dot-caller, albeit very close to the HiCCUPS implementation, deviates from it in some regards. The most important differences are:
 1. fixed kernel size for every pixel (in HiCCUPS, "donuts" are shrunk near the diagonal and enlarged as needed based on the value of `lowleft`).
 2. clustering is slightly different: we use the off-the-shelf `Birch`, whereas HiCCUPS implements a special greedy algorithm for that. The results are very close anyway.
 3. a really minor detail in the way we treat pixels near bad rows/columns (ones filled with `NaNs` after balancing): HiCCUPS disregards pixels that are within `~5` pixels of bad rows/columns, whereas we simply check the number of `NaNs` in a kernel footprint against `--max-nans-tolerated`. Given the resolution and the size of the kernels, one can work out which pixels would be discarded.
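The `NaN` check from item 3 can be illustrated with a small sketch. The function names are hypothetical, the footprint is assumed to be the full square window of half-width `w` around the pixel, and "tolerated" is assumed to mean "less than or equal to the threshold"; the actual implementation may differ on both points.

```python
import math


def count_nans_in_footprint(matrix, i, j, w):
    """Count NaN entries in the (2w+1) x (2w+1) window centered on pixel (i, j)."""
    count = 0
    for r in range(i - w, i + w + 1):
        for c in range(j - w, j + w + 1):
            if math.isnan(matrix[r][c]):
                count += 1
    return count


def pixel_is_testable(matrix, i, j, w, max_nans_tolerated):
    """Keep the pixel only if its footprint has at most max_nans_tolerated NaNs."""
    return count_nans_in_footprint(matrix, i, j, w) <= max_nans_tolerated


# Toy balanced matrix in which bin 2 is "bad": its whole row and column are NaN.
toy = [[1.0] * 11 for _ in range(11)]
for n in range(11):
    toy[2][n] = float("nan")
    toy[n][2] = float("nan")

near_bad_bin = pixel_is_testable(toy, 5, 5, w=3, max_nans_tolerated=7)
far_from_bad_bin = pixel_is_testable(toy, 7, 7, w=3, max_nans_tolerated=7)
```

Under these assumptions, a single bad row or column crossing the window contributes `2w + 1` `NaNs`, so the threshold effectively decides how close to a bad bin a pixel may sit.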

... I might add more details here later on, and edit/elaborate more on this here ...

### shrink-donut-dotfinder
This branch builds on `new-dotfinder` by dealing with discrepancy `#1`, dynamic kernel resizing. As the name suggests, here we implemented only the near-diagonal kernel shrinkage, arguably the most important aspect of the dynamic kernels: its absence was preventing us from calling dots really close to the diagonal and was driving the deviation between `cooltools` dot-calling and HiCCUPS. There are no additional parameters that the user needs to control in this case, as everything is done the same way as in HiCCUPS. This kernel shrinkage is data-independent, as opposed to the kernel enlargement, which is based on the value of the `lowleft` kernel for each pixel tested.

This branch also eliminates another difference between HiCCUPS and `cooltools`, which is related to the way lambda-chunking is implemented and is too technical to describe here. To give some numbers, for the Rao et al. 2014 GM... primary dataset we get ~7700 dots vs ~8050 from HiCCUPS, with an overlap of ~7600.

