Implementation of local outlier factor #1825
Open
Hakdag97 wants to merge 282 commits into main from features/1758-Implementation_of_local_outlier_factor
Commits
4da69fd
merge branch release/1.2.x
ClaudiaComito 27ea911
Update ubuntu
ClaudiaComito d0fb6c8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 0e704d4
switch back to ubuntu 20.04
ClaudiaComito f5d7850
pull
ClaudiaComito acfe9bd
Upgrade CI to ubuntu 22.04 and cuda 11.7.1
ClaudiaComito 0fd3d87
avoid unnecessary gathering of test DNDarrays
ClaudiaComito 3c4c07c
early out for resplit of non-distributed DNDarrays
ClaudiaComito 989e0f4
match split of comparison array to expected output
ClaudiaComito 6d66fad
avoid MPI calls in non-distributed cases
ClaudiaComito a37b4d3
avoid MPI calls in non-distributed resplit
ClaudiaComito 8eebe10
set default to None
ClaudiaComito 22c5c68
remove print statement
ClaudiaComito c692bff
upgrade torch version
ClaudiaComito df6a4e5
copy to cpu before comparing
ClaudiaComito af0e721
use ht.allclose instead of np.allclose
ClaudiaComito bac6d4e
cast different dtype operands to promoted dtype within torch call
ClaudiaComito c0c6362
compare local tensors to corresponding slice of expected_array only
ClaudiaComito 587bc05
expand tests
ClaudiaComito 24239a1
remove redundant code
ClaudiaComito cd65b37
Implement slicing with negative step
ClaudiaComito 86e8801
test slicing with negative step
ClaudiaComito 6779010
merge branch bugs/#1057-Allgatherv-contiguity-mismatch
ClaudiaComito 3b1f46d
Fix single-element indexing within mixed-type key
ClaudiaComito 1a4bf97
Non-ordered indexing, split != 0
ClaudiaComito 9e42156
generalize negative step slicing to all splits, loss of dims
ClaudiaComito 1a310a9
loop over active ranks only when key in descending order
ClaudiaComito c2ba0d9
replace list-on-list mapping with argsort mapping for non-ordered key
ClaudiaComito f6bb5c3
replace list-on-list mapping with argsort mapping for boolean indexing
ClaudiaComito cad9975
fix advanced indexing via list, remove last key-mapping bottleneck fo…
ClaudiaComito 83e6950
fix local slices, expand tests
ClaudiaComito 28ab925
fix and test dimensional indexing
ClaudiaComito bc226fc
Fix same-dim advanced indexing, expand tests
ClaudiaComito c48c66e
[skip ci] implement single-element indexing along split axis w/ Itera…
ClaudiaComito 18329a1
[skip ci] generalize advanced indexing incl. distributed DNDarray key
ClaudiaComito f024ebb
[skip ci] Expand tests combined advanced / basic indexing
ClaudiaComito 6ae2788
[skip ci] fix advanced dimensional indexing on non-distributed array
ClaudiaComito 178d7f8
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito 09e586c
fix distr advanced indexing with broadcasted shape
ClaudiaComito c56ebf4
transpose without copying
ClaudiaComito 86f704a
[skip ci] document __process_key(), clean up code
ClaudiaComito 68ead71
[skip ci] docs edits
ClaudiaComito 252995c
fix Ellipsis dimensions
ClaudiaComito c2a7e20
fix shape and split bookkeeping within advanced indexing
ClaudiaComito 235a7b8
test adv indexing on non consecutive dims
ClaudiaComito 4e936e8
abstract scalar key checks for both getitem and setitem
ClaudiaComito 8a74cd9
setitem scalar key
ClaudiaComito 8cf3ff1
DRAFT - abstraction common utilities for getitem and setitem
ClaudiaComito b45578a
handle all single-element indexing along split axis in same block
ClaudiaComito cec4bb9
resolve send/recv dimensions mismatch in a few edge cases
ClaudiaComito cc49a49
transpose self back to original shape after indexing
ClaudiaComito fe26ae8
add setitem tests
ClaudiaComito 611b46d
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito affdb60
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito 7e5be66
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito 6d2e369
do not index input unnecessarily for sanitation
ClaudiaComito f528356
test named split dimension for torch_proxy
ClaudiaComito 01a1140
value broadcasting abstraction
ClaudiaComito f8264a9
introduce distr sanitation for value when key is ordered
ClaudiaComito b1cd02f
keep track of original key
ClaudiaComito 31bdb34
fix value broadcasting for advanced setitem
ClaudiaComito c4d6749
match broadcasting to numpy
ClaudiaComito 5782d6e
finalize broadcast_value and fix test
ClaudiaComito 2174e84
assignment to negative slice along split axis
ClaudiaComito 782bde2
getitem: index underlying tensor with processed key in non-distr case
ClaudiaComito 084371d
setitem: test neg step slice along non-zero split axis
ClaudiaComito b1aa7aa
allow for nominal value/self split mismatch
ClaudiaComito 1c2b71e
expand test negative step along split axis
ClaudiaComito 7201a89
allow value.ndim > indexed_dims if extra dims are singletons
ClaudiaComito dfc7266
BROKEN: expand negative step tests
ClaudiaComito 8bbe242
squeeze out singleton dimensions when broadcasting value
ClaudiaComito 00a17e6
fix negative step slicing on 1 process
ClaudiaComito bdd2dd8
setitem w. dimensional indexing, add tests
ClaudiaComito 1fbd4d6
setitem w. advanced indexing on first dim
ClaudiaComito 95d3c92
setitem: test boolean indexing, local and split=0
ClaudiaComito f335aa8
fix output shape for boolean indexing w. split>0
ClaudiaComito d520ddf
setitem with non-ordered, mask-like key and non-distr value
ClaudiaComito d754a9c
allow for partial boolean indexing on first key.ndim dims of array
ClaudiaComito 5e69fe6
remove unnecessary check
ClaudiaComito 8d9849e
add tests for partial boolean indexing
ClaudiaComito 66ae371
set w. single-tensor key and non-distr value
ClaudiaComito 980e8f0
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito ae4d423
non-ordered, non-mask-like key and local value
ClaudiaComito b695e5a
broken: set up comm map for full distributed setitem
ClaudiaComito e6c1e10
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] d42f1cb
implement setitem w. distributed non-ordered key
ClaudiaComito 7868fa0
[skip ci] broken: add tests for distr value non-ordered key
ClaudiaComito f8055ff
Merge branch '914_adv-indexing-outshape-outsplit' of github.com:helmh…
ClaudiaComito 2944903
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 1bffd26
__process_key(): refactor adv indexing tensor extraction
ClaudiaComito 83842ec
working: setitem w. mask-like adv indexing, non-ordered split key
ClaudiaComito 366aaf9
adapt tests
ClaudiaComito bbe0a7b
refactor __process_key(): address boolean ind within adv ind
ClaudiaComito 24c6cd1
Merge branch '914_adv-indexing-outshape-outsplit' of github.com:helmh…
ClaudiaComito 1c47b42
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 4ee9b96
getitem: address mask-like key
ClaudiaComito 3c88f8b
Merge branch '914_adv-indexing-outshape-outsplit' of github.com:helmh…
ClaudiaComito 54db23d
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito 15cce44
define nonzero_size in non-distr case
ClaudiaComito 09fb199
handle split_bookkeeping when key is mask-like
ClaudiaComito 9c8d051
fix key type mismatch in advanced indexing
ClaudiaComito 41fba0a
getitem: address n-D key along split axis, free memory
ClaudiaComito e4a90de
balance indexed array before eq()
ClaudiaComito c8967e7
remove print statements
ClaudiaComito 2d443d8
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito 95eaaeb
test adv ind on non-consecutive dims
ClaudiaComito 835a13f
remove print statement
ClaudiaComito 216a1a0
setitem: mixed indexing w. shape broadcasting
ClaudiaComito b62bad2
expand tests for mixed indexing w. broadcasting
ClaudiaComito 7ea2abe
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito 435ff0c
reinstate tests for specific bugs
ClaudiaComito 30efe59
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito e6679a0
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito ad96822
prep send_buffer - expand value dimension if necessary
ClaudiaComito c9d44ae
fix send_indices dims when key is not mask-like
ClaudiaComito cc70400
test split mismatch on comm.size > 1
ClaudiaComito b78de30
broadcasting assignment along split axis
ClaudiaComito a8f2d57
expand tests
ClaudiaComito 62b2142
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito ae10038
Created a test file mytest.py
Hakdag97 3b62d4f
Implementation of parallel initialization
0889330
Refined comments for better readability
f826b7e
Merge branch 'main' of github.com:helmholtz-analytics/heat into featu…
1a89328
Created skeleton for lof.
Hakdag97 672d32e
Added file for quick-testing parts of the implementation.
Hakdag97 6043489
Created a first draft of the distance matrix with reduced memory cons…
Hakdag97 a45bf7c
Added index tracking to cdist_small function. Validation still
Hakdag97 6bbbdba
Validated results of reduced distance matrix (cdist_small)
Hakdag97 d895adb
Implemented fit routine for lof
Hakdag97 7334f36
Test skeleton for reachability distance
Hakdag97 7bc14d3
Building communication for reachability distance v.0
Hakdag97 1d901b9
Built communication for reachability distance
Hakdag97 6d76b0c
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito 1686b0a
Validated results
Hakdag97 8dfe79f
Added unit tests.
Hakdag97 ef6385f
Refined exceptions
Hakdag97 57d9dbb
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito 86e89ee
Merge branch 'main' into features/1758-Implementation_of_local_outlie…
mrfh92 58823f6
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito 4a5ef0f
edits
ClaudiaComito 7388eb7
Merge branch 'main' into features/1758-Implementation_of_local_outlie…
mrfh92 243084e
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito 4525d54
Started implementation of fully distributed version
Hakdag97 51150af
Merge branch 'features/1758-Implementation_of_local_outlier_factor' o…
Hakdag97 d2c0fc4
Merge branch '914_adv-indexing-outshape-outsplit' into features/1758-…
Hakdag97 941c28e
get rid of torch.tensor warning
ClaudiaComito fbb3fe5
fix dimension loss
ClaudiaComito 0a120d7
add edge case for boolean mask
ClaudiaComito 994c997
Merge branch 'main' into features/1758-Implementation_of_local_outlie…
mrfh92 9657746
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito c4e9421
.
Hakdag97 0b7ca83
Merge branch 'features/1758-Implementation_of_local_outlier_factor' o…
Hakdag97 98dfc18
Merge branch '914_adv-indexing-outshape-outsplit' into features/1758-…
Hakdag97 b0bfa08
do not index scalar value
ClaudiaComito dfb0667
debugging
ClaudiaComito 2e8001a
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito 6b588df
Merge branch '914_adv-indexing-outshape-outsplit' into features/1758…
Hakdag97 1d95084
.
Hakdag97 606b837
Adjustments according to most recent changes in available advanced in…
Hakdag97 ae2a5e8
Corrected Deadlock problem with large data sets
Hakdag97 39ab011
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito 5b64ff0
Added test cases for cdist_small
Hakdag97 cd6838e
Added option for chunk-wise computation to reduce memory consumption
Hakdag97 b64b63b
Bug fixes
Hakdag97 93894fc
adapted communication pattern in cdist_small
Hakdag97 ecb6feb
Added non-blocking sending and receiving in cdist_small
Hakdag97 5815e98
Bug fix in _chunk_wise_topk
Hakdag97 6f1ec62
Merge branch 'main' into features/1758-Implementation_of_local_outlie…
mrfh92 03a7981
Merge branch 'main' into features/1758-Implementation_of_local_outlie…
mrfh92 b8de0c6
Added parameter to speed-up computation using pytorch's advanced inde…
Hakdag97 8301a81
Merge branch 'features/1758-Implementation_of_local_outlier_factor' o…
Hakdag97 dc70e29
.
Hakdag97 332d49d
.
Hakdag97 cbb6e97
Added test case
Hakdag97 3c29366
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] c9f0514
Made list of nearest neighbors accesible as a class attribute
Hakdag97 6ff8a04
Merge branch 'features/1758-Implementation_of_local_outlier_factor' o…
Hakdag97 cb03cb7
Merge branch 'main' into features/1758-Implementation_of_local_outlie…
Hakdag97 6d848d6
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
ClaudiaComito cae4670
Merge branch '914_adv-indexing-outshape-outsplit' of github.com:helmh…
ClaudiaComito 9d74da2
I already hate talisman after 1 day
ClaudiaComito b204589
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
Hakdag97 5a5ae6e
Fixed __setitem__ bug for unordered split_key
Hakdag97 0aa3ee0
Fixed bugs causing errors in test_getitem_boolean_fewer_dims
Hakdag97 36855d7
Bug fixes for test_setitem_edge_cases
Hakdag97 151d2b2
Further bug fixes
Hakdag97 960c5dd
All tests are running
Hakdag97 d67c5a9
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
Hakdag97 3e0e6a6
Edge case handling for test_indexing intermediate results
Hakdag97 25e1b34
Fixed test_indexing.py
Hakdag97 95c72a2
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
Hakdag97 0f09e7e
Bug fixes in function where()
Hakdag97 9956639
Merge branch '914_adv-indexing-outshape-outsplit' of github.com:helmh…
Hakdag97 aadcf35
Edge case handling for slice type keys in __getitem__
Hakdag97 b0147ce
Debugging tests for clustering - intermediate results
Hakdag97 f2e168c
Fixed edge case in indexing causing deadlock in kmedoids clustering
Hakdag97 638d1f8
Delete bug prints
Hakdag97 466c1f0
Edge case handling for keys like [:, -1], in order to fix test_basics
Hakdag97 047488c
Bug fixes for test_factories.py
Hakdag97 376cbb1
Fixed bug in test_cov (wrong balance)
Hakdag97 50f0ad1
Fixed bug in test_manipulations.py (function tile)
Hakdag97 c2ce57e
Drop tensor names in function tile
Hakdag97 595f84a
Handle edge case for test_svd and test_eigh
Hakdag97 28e46a1
Fix test_knn.py
Hakdag97 3eac84d
Merge branch 'main' into features/1758-Implementation_of_local_outlie…
Hakdag97 2c34b36
Added edge case neccessary for local outlier factor
Hakdag97 d790865
.
Hakdag97 b9f132e
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
Hakdag97 ba2a87a
Merge branch '914_adv-indexing-outshape-outsplit' into features/1758-…
Hakdag97 8a373c9
Fixed device mismatch in process_key
Hakdag97 3c9ea98
Merge branch '914_adv-indexing-outshape-outsplit' of github.com:helmh…
Hakdag97 bc6616b
Refine test_dndarray
Hakdag97 43f73c3
Handling of duplicate advanced indices
Hakdag97 1fc4f1e
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
Hakdag97 2240117
.
Hakdag97 e6256d5
Merge branch '914_adv-indexing-outshape-outsplit' of github.com:helmh…
Hakdag97 c6332a4
Merge branch '914_adv-indexing-outshape-outsplit' into features/1758-…
Hakdag97 550f792
Avoid float64 tests in test_basic for mps
Hakdag97 6da8259
Improved code coverage in tests
Hakdag97 bde7e69
Merge branch '914_adv-indexing-outshape-outsplit' into features/1758-…
Hakdag97 3c9b989
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
mrfh92 1673b00
Merge branch '914_adv-indexing-outshape-outsplit' into features/1758-…
Hakdag97 0636e03
Robustified tests for lof
Hakdag97 e9e220c
Raised tolerances for LOF tests
Hakdag97 343fcee
Debug Ci fails with reduced tolerance in tests for lof and cdist_small
Hakdag97 6790b6a
Debug prints for CI
Hakdag97 96092b7
Remove bug prints
Hakdag97 a68a80a
.
Hakdag97 dd3762b
Measure against missing CUDA awareness of MPI
Hakdag97 d0a1324
Fixed typo
Hakdag97 ca5bb04
Bug fix
Hakdag97 9961791
Merge branch 'main' into features/1758-Implementation_of_local_outlie…
Hakdag97 162e85a
Added comment
Hakdag97 33b2c32
Merge branch 'features/1758-Implementation_of_local_outlier_factor' o…
Hakdag97 879aa2a
test
Hakdag97 850e5a9
.
Hakdag97 cba9f58
Test more memory efficent implementation of cdist_small
Hakdag97 39c20d1
Refined comments
Hakdag97 6cfa7db
Adjusted Documentation and test according to review
Hakdag97 2bfdbc5
Test debugging advanced indexing for dmd
Hakdag97 0149260
Merge branch 'main' into 914_adv-indexing-outshape-outsplit
Hakdag97 17446a2
Fixed bug in process_key leading to failing dmd test
Hakdag97 9aa581e
Robustified edge cases in __process_key
Hakdag97 0ad0418
Merge branch '914_adv-indexing-outshape-outsplit' into features/1758-…
Hakdag97 fb07f9c
Consistent tie-break behaviour for arbitrary arbitrary number of MPI …
Hakdag97 e08dfaf
Extended tests in test_lof.py and test_distance.py
Hakdag97 2efe6df
Increase test coverage of test_lof.py
Hakdag97 f6a9447
Merge branch 'main' into features/1758-Implementation_of_local_outlie…
Hakdag97 bdcc09a
Refined test
Hakdag97 457b175
Refined test.lof
Hakdag97
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| fileignoreconfig: | ||
| - filename: heat/core/dndarray.py | ||
| checksum: 6f686fc92dc83c619144cfcde577b8f195213d3c02e9ba63b26760dd799e144d | ||
| version: "1.0" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,3 +1,4 @@ | ||
| """Provides classification algorithms.""" | ||
|
|
||
| from .kneighborsclassifier import * | ||
| from .localoutlierfactor import * |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,304 @@ | ||
| """Implementation of the Local Outlier Factor (LOF) algorithm""" | ||
|
|
||
| import heat as ht | ||
| import torch | ||
| import warnings | ||
| from heat.core import types | ||
| from mpi4py import MPI | ||
| from heat.core.dndarray import DNDarray | ||
| from heat.spatial.distance import cdist, cdist_small, _euclidian, _manhattan, _gaussian | ||
|
|
||
| __all__ = ["LocalOutlierFactor"] | ||
|
|
||
|
|
||
| class LocalOutlierFactor: | ||
| """ | ||
| Class for the Local Outlier Factor (LOF) algorithm. The LOF algorithm is a density-based outlier detection method. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| n_neighbors : int, optional (default=20) | ||
| Number of neighbors used to calculate the density of points in the lof algorithm. Denoted as MinPts in [1]. | ||
| metric : str, optional (default="euclidian") | ||
| The distance metric to use. Can be "euclidian", "manhattan", or "gaussian". | ||
| binary_decision : string, optional | ||
| Defines which classification method should be used: | ||
| - "threshold": everything greater or equal to the specified threshold is considered an outlier. | ||
| - "top_n": the data points with the ``top_n`` largest outlier scores are considered outliers. | ||
| Default is "threshold". | ||
| threshold : float, optional | ||
| The threshold value for the "threshold" method. Default is 1.5. | ||
| top_n : int, optional | ||
| The number of top outliers for the "top_n" method. Default is None; it must be specified if ``binary_decision`` is "top_n". | ||
|
|
||
| Attributes | ||
| ---------- | ||
| n_neighbors : int | ||
| Number of neighbors used to calculate the density of points in the lof algorithm. Denoted as MinPts in [1]. | ||
| binary_decision: string | ||
| Method that converts lof score into a binary decision of outlier and non-outlier. Can be "threshold" or "top_n". | ||
| metric : str | ||
| The measure of the distance. Can be "euclidian", "manhattan", or "gaussian". | ||
| threshold : float | ||
| The threshold value for the "threshold" method used for binary classification. | ||
| top_n : int | ||
| The number of top outliers for the "top_n" method used for binary classification. | ||
| lof_scores : DNDarray | ||
| The local outlier factor for each sample in the data set. | ||
| anomaly : DNDarray | ||
| Array with binary outlier classification (1 -> outlier, -1 -> inlier). | ||
| chunks : int | ||
| Compute the distance matrix iteratively in chunks to reduce memory consumption (but with larger runtime). | ||
| For ``chunks=2``: first compute one half of the distance matrix and then the second half. | ||
| fully_distributed : bool | ||
| Decides whether to distribute auxiliary vectors during the computation among all MPI processes. | ||
| Only set to True for a very large number of data points that may already cause memory issues on their own. | ||
| True is more memory efficient, but much slower than False due to large communication overhead. | ||
| idx_n_neighbors : DNDarray | ||
| Indices of nearest neighbors for each sample in the data set. | ||
|
|
||
| References | ||
| ---------- | ||
| [1] Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000). LOF: identifying density-based local outliers. | ||
| """ | ||
|
|
||
| def __init__( | ||
| self, | ||
| n_neighbors=20, | ||
| metric="euclidian", | ||
| binary_decision="threshold", | ||
| threshold=1.5, | ||
| chunks=1, | ||
| top_n=None, | ||
| fully_distributed=False, | ||
| ): | ||
| self.n_neighbors = n_neighbors | ||
| self.binary_decision = binary_decision | ||
| self.threshold = threshold | ||
| self.top_n = top_n | ||
| self.lof_scores = None | ||
| self.anomaly = None | ||
| self.metric = metric | ||
| self.chunks = chunks | ||
| self.fully_distributed = fully_distributed | ||
| self.idx_n_neighbors = None | ||
|
|
||
| self._input_sanitation() | ||
|
|
||
| def fit(self, X: DNDarray): | ||
| """ | ||
| Fit the LOF model to the data. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| X : DNDarray | ||
| Data points. | ||
| """ | ||
| # Compute the LOF for each sample in X | ||
| self._local_outlier_factor(X) | ||
| # Classifying the data points as outliers or inliers | ||
| self._binary_classifier() | ||
|
|
||
| def _local_outlier_factor(self, X: DNDarray): | ||
| """ | ||
| Compute the LOF for each sample in X. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| X : DNDarray | ||
| Data points. | ||
| """ | ||
| # number of data points | ||
| length = X.shape[0] | ||
|
|
||
| # input sanitation | ||
| # If n_neighbors is larger than or equal to the number of samples, continue with the whole sample when evaluating the LOF | ||
| if self.n_neighbors >= length: | ||
| self.n_neighbors = length - 1 # length of data is n_neighbors + the point itself | ||
| # [1] suggests a minimum of 10 neighbors | ||
| if length <= 10: | ||
| raise ValueError( | ||
| f"The data set is too small for a reasonable LOF evaluation. The number of samples should be larger than 10, but was {X.shape[0]}." | ||
| ) | ||
|
|
||
| # Compute the distance matrix for the n_neighbors nearest neighbors of each point and the corresponding indices | ||
| # (only these are needed for the LOF computation). | ||
| size = X.comm.Get_size() | ||
|
|
||
| # If the amount of chosen neighbors is larger than the number of samples per process, one can use the classic cdist function | ||
| if self.n_neighbors + 1 > length // size: | ||
| dist, idx = ht.topk( | ||
| cdist(X), k=self.n_neighbors + 1, sorted=True, largest=False | ||
| ) # cdist stores also the distance of each point to itself, therefore use n_neighbors+1 | ||
| else: | ||
| # Note that cdist_small sorts from the lowest to the highest distance | ||
| dist, idx = cdist_small( | ||
| X, X, metric=self.metric, n_smallest=self.n_neighbors + 1, chunks=self.chunks | ||
| ) # cdist_small stores also the distance of each point to itself, therefore use n_neighbors+1 | ||
|
|
||
| # Extract the k-distance and the indices of the k-nearest neighbors | ||
| k_dist = dist[:, -1] | ||
| idx_neighbors = idx[:, 1 : self.n_neighbors + 1] | ||
| # Make the indices of the n-nearest neighbors available for use outside this function | ||
| self.idx_n_neighbors = idx_neighbors | ||
|
|
||
| k_dist_neighbors = self._advanced_indexing(k_dist, idx_neighbors) | ||
|
|
||
| # Compute the reachability distance for each point | ||
| reachability_dist = ht.maximum(k_dist_neighbors, dist[:, 1 : self.n_neighbors + 1]) | ||
|
|
||
| # Compute the local reachability density (lrd) for each point | ||
| lrd = 1 / ( | ||
| ht.mean(reachability_dist, axis=1) + 1e-10 | ||
| ) # add 1e-10 to avoid division by zero (important for many duplicates in data) | ||
|
|
||
| # Look up the local reachability density (lrd) of each point's neighbors | ||
| lrd_neighbors = self._advanced_indexing(lrd, idx[:, 1 : self.n_neighbors + 1]) | ||
|
|
||
| lof = ht.mean(lrd_neighbors, axis=1) / lrd | ||
|
|
||
| self.lof_scores = lof | ||
|
|
||
| def _binary_classifier(self): | ||
| """ | ||
| Binary classification of the data points as outliers (1) or inliers (-1) based on their non-binary LOF. According to the method, | ||
| the data points are classified as outliers if their LOF is greater or equal to a specified threshold or if they have one | ||
| of the top_n largest LOF scores. | ||
| """ | ||
| if self.binary_decision == "top_n": | ||
| # Determine the threshold based on the top_n largest LOF scores | ||
| self.threshold = ht.topk(self.lof_scores, k=self.top_n, sorted=True, largest=True)[0][ | ||
| -1 | ||
| ] | ||
| # Classify anomalies based on the threshold value | ||
| self.anomaly = ht.where(self.lof_scores >= self.threshold, 1, -1) | ||
|
|
||
| def _advanced_indexing(self, A: DNDarray, idx: DNDarray) -> DNDarray: | ||
| """ | ||
| Perform advanced indexing on a distributed DNDarray, allowing for optional runtime optimization. | ||
|
|
||
| This function handles advanced indexing for distributed DNDarrays. It supports two modes: | ||
| 1. Fully distributed mode (`fully_distributed=True`): handles indexing in a completely distributed manner. | ||
| This mode is memory safe but rather slow. | ||
| 2. Local mode (`fully_distributed=False`): uses local arrays (torch tensors) to perform indexing | ||
| efficiently, assuming that local arrays of dimension (A.shape[0], `n_neighbors`) fit into memory. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| A : DNDarray | ||
| The input DNDarray to be indexed. | ||
| idx : DNDarray | ||
| The indices used for advanced indexing. | ||
|
|
||
| Returns | ||
| ------- | ||
| indexed_A : DNDarray | ||
| The result of advanced indexing on the input array. | ||
| """ | ||
| # Using heat's advanced indexing for large data set | ||
| if self.fully_distributed is True: | ||
| indexed_A = A[idx] | ||
| # Use local arrays, i.e., torch.tensors, to reduce runtime while indexing | ||
| # (only possible if all local arrays defined below fit into memory) | ||
| else: | ||
| split = A.split | ||
| dtype = A.dtype  # avoid shadowing the built-in ``type`` | ||
| # Use non-split arrays to reduce communication overhead | ||
| A_ = A.resplit_(None).larray.contiguous() | ||
| idx_ = idx.resplit_(None).larray.contiguous() | ||
| # Apply standard advanced indexing | ||
| indexed_A_ = A_[idx_] | ||
| # Convert the result back to a distributed DNDarray | ||
| indexed_A = ht.array(indexed_A_, split=split, dtype=dtype) | ||
| return indexed_A | ||
|
|
||
| def _map_idx_to_proc(self, idx, comm): | ||
| """ | ||
| Auxiliary function to map indices to the corresponding MPI process ranks. | ||
|
|
||
| This function takes an array of indices and determines which MPI process | ||
| each index belongs to, based on the distribution of data across processes. | ||
| It returns an array where each index is replaced by the rank of the process | ||
| that contains the corresponding data. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| idx : DNDarray | ||
| The array of indices to be mapped to MPI process ranks. The array should | ||
| be distributed along the first axis (split=0). | ||
| comm: MPI.COMM_WORLD | ||
| The MPI communicator. | ||
|
|
||
| Returns | ||
| ------- | ||
| mapped_idx : DNDarray | ||
| An array of the same shape as `idx`, where each index is replaced by the | ||
| rank of the MPI process that contains the corresponding data. | ||
| """ | ||
| size = comm.Get_size() | ||
| _, displ, _ = comm.counts_displs_shape(idx.shape, idx.split) | ||
| mapped_idx = ht.zeros_like(idx) | ||
| for rank in range(size): | ||
| lower_bound = displ[rank] | ||
| if rank == size - 1: # size-1 is the last rank | ||
| upper_bound = idx.shape[0] | ||
| else: | ||
| upper_bound = displ[rank + 1] | ||
| mask = (idx >= lower_bound) & (idx < upper_bound) | ||
| mapped_idx[mask] = rank | ||
| return mapped_idx | ||
|
|
||
| def _input_sanitation(self): | ||
| """ | ||
| Check if the input parameters are valid and raise warnings or exceptions. | ||
| """ | ||
| # check number of neighbors, [1] suggests n_neighbors >= 10 | ||
| if self.n_neighbors < 1: | ||
| raise ValueError(f"n_neighbors must be at least one, but was {self.n_neighbors}.") | ||
| if self.n_neighbors < 10 or self.n_neighbors > 100: | ||
| warnings.warn( | ||
| f"For reasonable results n_neighbors is expected between 10 and 100, but was {self.n_neighbors}.", | ||
| UserWarning, | ||
| ) | ||
|
|
||
| # check for a valid binary decision method | ||
| if self.binary_decision not in ["threshold", "top_n"]: | ||
| raise ValueError( | ||
| f"Unknown method for binary decision: {self.binary_decision}. Use 'threshold' or 'top_n'." | ||
| ) | ||
|
|
||
| # check if the top_n parameter is specified when using the top_n method | ||
| if self.binary_decision == "top_n": | ||
| if self.threshold != 1.5: | ||
| warnings.warn( | ||
| "You are specifying the parameter threshold, although binary_decision is set to 'top_n'. The threshold will be ignored.", | ||
| UserWarning, | ||
| ) | ||
| if self.top_n is None: | ||
| raise ValueError( | ||
| "For binary decision='top_n', the parameter 'top_n' has to be specified." | ||
| ) | ||
| if self.top_n < 1: | ||
| raise ValueError("The number of top outliers should be at least one.") | ||
|
|
||
| if self.binary_decision == "threshold": | ||
| if self.threshold is None or self.threshold <= 1: | ||
| raise ValueError("The threshold should be greater than one.") | ||
| if self.top_n is not None: | ||
| warnings.warn( | ||
| "You are specifying the parameter top_n, although binary_decision is set to 'threshold'. The value of top_n will be ignored.", | ||
| UserWarning, | ||
| ) | ||
|
|
||
| # check for valid metric | ||
| valid_metrics = ["euclidian", "gaussian", "manhattan"] | ||
| if self.metric not in valid_metrics: | ||
| raise ValueError(f"Invalid metric '{self.metric}'. Must be one of {valid_metrics}.") | ||
|
|
||
| # replace the name of the metric with the corresponding function | ||
| if self.metric == "gaussian": | ||
| self.metric = _gaussian | ||
| elif self.metric == "manhattan": | ||
| self.metric = _manhattan | ||
| elif self.metric == "euclidian": | ||
| self.metric = _euclidian | ||
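For readers following the diff, the steps in `_local_outlier_factor` and `_binary_classifier` (k-distance, reachability distance, local reachability density, LOF, then thresholding) can be sketched without Heat or MPI. This is a minimal single-process illustration in plain Python, not the code from this PR: the names `lof_scores` and `classify` are hypothetical, only the Euclidean metric is used, and each point is excluded from its own neighbor list explicitly instead of dropping the first column of the sorted distances.

```python
import math


def lof_scores(points, n_neighbors):
    """Return one LOF score per point (plain-Python sketch, O(n^2) distances)."""
    n = len(points)
    k = min(n_neighbors, n - 1)  # same clamp as in the diff: at most n - 1 neighbors

    # For every point, the k nearest other points as sorted (distance, index) pairs.
    neighbors = []
    for i, p in enumerate(points):
        row = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append(row[:k])

    # k-distance: distance to the k-th nearest neighbor.
    k_dist = [row[-1][0] for row in neighbors]

    # Local reachability density: inverse of the mean reachability distance,
    # where reach(p, o) = max(k_dist(o), d(p, o)). The 1e-10 guards against
    # division by zero when the data contains many duplicates.
    lrd = []
    for row in neighbors:
        reach = [max(k_dist[j], d) for d, j in row]
        lrd.append(1.0 / (sum(reach) / k + 1e-10))

    # LOF: mean density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for _, j in row) / k / lrd[i] for i, row in enumerate(neighbors)]


def classify(scores, threshold=1.5, top_n=None):
    """Label outliers as 1 and inliers as -1, by threshold or by the top_n scores."""
    if top_n is not None:
        # The top_n-th largest score becomes the threshold.
        threshold = sorted(scores, reverse=True)[top_n - 1]
    return [1 if s >= threshold else -1 for s in scores]


# A 4 x 3 unit grid plus one point far away from it.
points = [(float(x), float(y)) for x in range(4) for y in range(3)] + [(10.0, 10.0)]
scores = lof_scores(points, n_neighbors=3)
labels = classify(scores, threshold=1.5)
```

The distributed version in the diff computes the same quantities, but obtains the neighbor distances via `cdist_small` or `ht.topk(cdist(X), ...)` and looks up neighbor values through `_advanced_indexing`.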
Usually, we don't use a Returns section
(this also applies to the other functions)
Adapted this in all functions