RANN::nn2 is too inefficient

The main `dobin` procedure actually turns out to be unusable for me because it's simply too slow. My current data aren't huge, 100,000s of rows, around 100 columns, but clearly way too big for `dobin`. The main "bottleneck" is the calculation of `RANN::nn2`, which has to be re-calculated on every iteratively reduced matrix:

https://github.com/sevvandi/dobin/blob/04453c1249cdaf0952b2b1823240957f436b5790/R/y_space.R#L11

The following code illustrates just one of several available alternatives that is a lot more efficient:
``` r
nrow <- 10000
ncol <- 50
x <- array (runif (nrow * ncol), dim = c (nrow, ncol))

n <- floor (10 ^ (8:16 / 4))
k <- 20
res <- vapply (n, function (i) {
    xtest <- x [seq (i), ]
    bench::mark (
        nn_obj <- RANN::nn2 (xtest, xtest, k = k),
        nn_obj <- dbscan::kNN (xtest, k = k),
        check = FALSE,
        time_unit = "s")$median },
               numeric (2))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.

#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
res <- data.frame (n = n,
                   RANN = res [1, ],
                   dbscan = res [2, ])
res <- tidyr::gather (res,
                      key = "method",
                      value = "duration",
                      RANN, dbscan)
library (ggplot2)
ggplot (res, aes (x = n, y = duration, colour = method)) +
    geom_line () +
    geom_point ()
```

![](https://i.imgur.com/pR7X7ho.png)

<sup>Created on 2021-09-02 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0.9000)</sup>

`dbscan:kNN` is at least twice as fast as `RANN::nn2`, and scales much better.

---

That is nevertheless unlikely to make `dobin` useable at scale. I suspect it may be necessary to reconsider the brute-force `knn` calls, and hand-code some sort of transformation of former neighbour relationships into your new `B`-basis. Updated neighbour relationships change very little, especially in the early (high-dimensional) stages, so there's a lot of unnecessary processing going on recalculating those from scratch each time. Happy to discuss approaches if and when things get that far, but at least dropping `RANN` will help us along the way. Thanks!


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RANN::nn2 is too inefficient #4

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

RANN::nn2 is too inefficient #4

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions