ANN DreaMS benchmark

A benchmark for approximate nearest neighbor (ANN) search over large-scale DreaMS embeddings, using the matchms embedding similarity backends.

Setup

Environment preparation

The environment requires a matchms installation and two additional libraries, tqdm and h5py, which can be installed with pip install tqdm h5py.
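
Before running anything, it can be useful to confirm that the three packages named above are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of this repository.

```python
from importlib.util import find_spec

# Packages named in the setup instructions above.
REQUIRED = ("matchms", "tqdm", "h5py")


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

An empty list means the environment has everything the setup section asks for.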

Data download

# 1k query DreaMS embeddings
wget -P data https://huggingface.co/datasets/roman-bushuiev/ANN_DreaMS_benchmark/resolve/main/data/MassSpecGym_DreaMS_rand1k.npy

# 50k reference DreaMS embeddings
wget -P data https://huggingface.co/datasets/roman-bushuiev/ANN_DreaMS_benchmark/resolve/main/data/GeMS_A1_DreaMS_rand50k.npy
wget -P data https://huggingface.co/datasets/roman-bushuiev/ANN_DreaMS_benchmark/resolve/main/data/GeMS_A1_DreaMS_rand50k.benchmark.npy

# 500k reference DreaMS embeddings
wget -P data https://huggingface.co/datasets/roman-bushuiev/ANN_DreaMS_benchmark/resolve/main/data/GeMS_A1_DreaMS_rand500k.npy
wget -P data https://huggingface.co/datasets/roman-bushuiev/ANN_DreaMS_benchmark/resolve/main/data/GeMS_A1_DreaMS_rand500k.benchmark.npy

# 5M reference DreaMS embeddings
wget -P data https://huggingface.co/datasets/roman-bushuiev/ANN_DreaMS_benchmark/resolve/main/data/GeMS_A1_DreaMS_rand5M.npy
wget -P data https://huggingface.co/datasets/roman-bushuiev/ANN_DreaMS_benchmark/resolve/main/data/GeMS_A1_DreaMS_rand5M.benchmark.npy
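
Once downloaded, the embedding files can be inspected with NumPy. The sketch below assumes each embedding file stores a 2-D array of shape (number of spectra, embedding dimension); this layout is an assumption, and the contents of the `.benchmark.npy` files are not described here, so they are left out. The helper name `load_embeddings` is hypothetical.

```python
import numpy as np


def load_embeddings(path):
    """Memory-map a .npy file so large reference sets are not read fully into RAM.

    Assumes a 2-D array of shape (n_spectra, dim); raises otherwise.
    """
    emb = np.load(path, mmap_mode="r")
    if emb.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {emb.shape}")
    return emb
```

For example, `load_embeddings("data/MassSpecGym_DreaMS_rand1k.npy").shape` would report the number of query embeddings and their dimensionality. Memory mapping is chosen here because the 5M reference file may not fit comfortably in memory.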

Running a benchmark

To run a benchmark, specify an ANN backend and a benchmarking dataset:

python3 benchmark.py --ann_backend pynndescent --dataset_name GeMS_A1_DreaMS_rand50k --index_kwargs '{"k": 100}'

Expected output:

Benchmark results:
index_backend: pynndescent
dataset_name: GeMS_A1_DreaMS_rand50k
Index construction memory [MB]: 1765.4844
Index construction time [s]: 29.3165
Recall @ 1 mean: 0.9680
Recall @ 1 std: 0.1760
Recall @ 10 mean: 0.9385
Recall @ 10 std: 0.0956
Query time mean [s]: 0.0099
Query time std [s]: 0.3067
index_kwargs: {'k': 100}
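
To help interpret the "Recall @ k" lines, here is a minimal sketch of one common definition: for each query, the fraction of the exact top-k neighbors that also appear in the approximate top-k. Whether benchmark.py uses exactly this definition is an assumption; the function name `recall_at_k` and the toy indices are illustrative only.

```python
import numpy as np


def recall_at_k(approx_ids, exact_ids, k):
    """Per-query recall@k: share of the exact top-k found in the approximate top-k."""
    approx_ids = np.asarray(approx_ids)[:, :k]
    exact_ids = np.asarray(exact_ids)[:, :k]
    hits = [len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids)]
    return np.array(hits, dtype=float) / k


# Toy example with 2 queries and k = 2 (indices are made up for illustration).
exact = [[0, 1], [2, 3]]
approx = [[0, 1], [2, 9]]
per_query = recall_at_k(approx, exact, k=2)
print(per_query.mean(), per_query.std())
```

Under this definition, the reported mean and standard deviation are taken over the per-query recall values.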

Implementing and benchmarking a new `matchms` ANN backend

The benchmark.py code executes BaseEmbeddingSimilarity(similarity="cosine", index_backend={ann_backend}) from matchms. To evaluate a new backend, one therefore needs to implement it within the BaseEmbeddingSimilarity class, for example in a locally installed fork of the matchms repository.
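
When developing a new backend, an exact brute-force search can serve as a sanity check on small inputs. The sketch below is independent of matchms and does not reflect how BaseEmbeddingSimilarity is implemented; the function name `exact_cosine_top_k` is hypothetical, and all embeddings are assumed to have non-zero norm.

```python
import numpy as np


def exact_cosine_top_k(queries, references, k):
    """Exact top-k reference indices per query, ordered by decreasing cosine similarity.

    Assumes 2-D float arrays with non-zero rows and k <= number of references.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    r = references / np.linalg.norm(references, axis=1, keepdims=True)
    sims = q @ r.T
    # argpartition selects the k most similar references per row without a full sort.
    top = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    # Sort only those k candidates from most to least similar.
    order = np.argsort(-np.take_along_axis(sims, top, axis=1), axis=1)
    return np.take_along_axis(top, order, axis=1)
```

Comparing the indices returned by a new backend against this exact result on a few thousand embeddings gives a quick correctness check before running the full benchmark. The dense similarity matrix makes this approach impractical for the larger reference sets.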

TODO list:

[] Evaluate FAISS.
[] Evaluate annoy.
[] Evaluate Voyager.
[x] Evaluate pynndescent (the only backend implemented in matchms so far).
[] Explore other packages.
