A reusable Python pipeline for turning downloaded `.npy` embedding datasets into clean ANN benchmark artifacts.
Current outputs include:

- base vectors in `.fvecs`
- query vectors in `.fvecs`
- ground truth in `.ivecs`
The pipeline is designed for large embedding datasets and supports a staged workflow with logs, a run summary, and cleanup of intermediate files.
This repository currently supports .npy input files only.
More input readers and formats may be added later, but the current code is focused on a working .npy-to-benchmark-artifacts pipeline.
Given one or more `.npy` embedding files, this project can:

- extract vectors into a single base `.fvecs` file
- remove exact zero vectors
- normalize vectors when needed
- deduplicate vectors
- sample query vectors without replacement from the cleaned vector set
- generate exact ground-truth nearest neighbors for the final base/query split
- log progress, output stats, and errors at each stage
- clean up intermediate large `.fvecs` files after successful downstream stages
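The cleaning steps above can be sketched in a few lines of NumPy. This is an in-memory illustration of the logic under assumed semantics (exact-duplicate removal, unconditional L2 normalization), not the repository's staged, file-based implementation:

```python
import numpy as np

def clean_and_split(vectors, num_query, seed=0):
    """Sketch of the cleaning stages: drop exact zero vectors,
    L2-normalize, deduplicate, then sample queries without replacement."""
    rng = np.random.default_rng(seed)
    v = np.asarray(vectors, dtype=np.float32)
    v = v[np.any(v != 0, axis=1)]                      # remove exact zero vectors
    v = v / np.linalg.norm(v, axis=1, keepdims=True)   # normalize (sketch: always)
    v = np.unique(v, axis=0)                           # drop exact duplicates
    idx = rng.permutation(len(v))                      # shuffle once, then split:
    return v[idx[num_query:]], v[idx[:num_query]]      # (base, query) disjoint sets
```

Sampling queries from the cleaned set and keeping the remainder as the base guarantees the two splits are disjoint, which is what the ground-truth stage expects.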
The final outputs are named using a common dataset prefix and the actual final vector counts produced.
```
.
├── config.py
├── pipeline.py
├── readers.py
├── fvecs_writer.py
├── fvecs_remove_zeros.py
├── fvecs_normalize.py
├── fvecs_deduplicator.py
├── fvecs_split.py
├── ivecs_check.py
├── knn_utils.py
└── runs/
```
After editing `config.py`, run the pipeline from the repository root with:

```bash
python3 pipeline.py
```

Or with a specific Python interpreter:

```bash
/path/to/python pipeline.py
```

At a minimum, you should set the following values in `config.py` before running the pipeline.
- `RUN_NAME`: name of the run directory under `runs/`
- `FILE_PREFIX`: common prefix used to name output artifacts

Example:

```python
RUN_NAME = "wiki_mpnet_en_trial"
FILE_PREFIX = "wiki_mpnet_embeddings"
```

- `SOURCE_TYPE`: currently should be set to `"npy"`
- `INPUT_FILES`: a list of `.npy` embedding files to process

Example:

```python
SOURCE_TYPE = "npy"
INPUT_FILES = [DATASET_DIR / f"emb_{i:03d}.npy" for i in range(10)]
```

Be sure to create a `.env` file in the project's root directory and define `DATASET_ROOT`.
- `NUM_BASE`: requested final number of base vectors
- `NUM_QUERY`: requested final number of query vectors

The pipeline initially extracts at least `NUM_BASE + NUM_QUERY` vectors from the input files.

Example:

```python
NUM_BASE = 100000
NUM_QUERY = 10000
```
- `GT_K`: number of nearest neighbors to compute
- `GT_METRIC`: `"ip"` or `"l2"`
- `GT_SHUFFLE`: whether to let `knn_utils.py` shuffle before ground-truth generation
- `GT_GPUS`: `"-1"` for CPU, or values such as `"0"` or `"0,1"` for GPU execution

Example:

```python
GT_K = 100
GT_METRIC = "ip"
GT_SHUFFLE = False
GT_GPUS = "-1"
```
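As a reference for what the exact ground-truth stage computes, here is a brute-force sketch in NumPy. The repository's `knn_utils.py` is assumed to batch this work and optionally run on GPUs via `GT_GPUS`; this illustration holds everything in memory:

```python
import numpy as np

def exact_ground_truth(base, query, k, metric="ip"):
    """Brute-force exact k-NN: returns an (n_query, k) int32 array of
    indices into `base`, nearest neighbor first."""
    base = np.asarray(base, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    if metric == "ip":
        # Larger inner product means closer; negate so argsort ranks ascending.
        dist = -(query @ base.T)
    elif metric == "l2":
        # Expand ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2; the ||q||^2 term is
        # constant per query row, so it can be dropped for ranking purposes.
        dist = (base ** 2).sum(axis=1)[None, :] - 2.0 * (query @ base.T)
    else:
        raise ValueError(f"unknown GT_METRIC: {metric}")
    return np.argsort(dist, axis=1, kind="stable")[:, :k].astype(np.int32)
```

Note that on normalized vectors the `"ip"` and `"l2"` rankings coincide, which is why either metric is a valid choice after the normalization stage.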
- `CLEANUP_INTERMEDIATE_FVECS`: if `True`, intermediate `.fvecs` files are deleted after successful downstream stages
- `OVERWRITE`: if `False`, stages with existing outputs are skipped

Example:

```python
CLEANUP_INTERMEDIATE_FVECS = True
OVERWRITE = False
```

A complete example `config.py`:

```python
from pathlib import Path
import os
import sys

RUN_NAME = "wiki_mpnet_en_trial"
FILE_PREFIX = "wiki_mpnet_embeddings"
CLEANUP_INTERMEDIATE_FVECS = True
OVERWRITE = False

RUN_DIR = Path("runs") / RUN_NAME

NUM_BASE = 100000
NUM_QUERY = 10000

GT_K = 100
GT_METRIC = "ip"
GT_SHUFFLE = False
GT_GPUS = "-1"

SOURCE_TYPE = "npy"

dataset_root = os.environ.get("DATASET_ROOT")
if not dataset_root:
    raise RuntimeError(
        "DATASET_ROOT is not set. "
        "Example: export DATASET_ROOT=/path/to/your/datasets"
    )
DATASET_ROOT = Path(dataset_root)

DATASET_NAME = "mpnet-43m"  # Just an example. Put whatever you'd like
EMBED_SUBDIR = "data/en/embs"
DATASET_DIR = DATASET_ROOT / DATASET_NAME / EMBED_SUBDIR
INPUT_FILES = [DATASET_DIR / f"emb_{i:03d}.npy" for i in range(10)]  # Match the file naming conventions from your download
```

The pipeline writes outputs into:
`runs/<RUN_NAME>/`

At the end of a successful run, the final artifacts are renamed using:

- `FILE_PREFIX`
- the actual final base count
- the actual final query count
- the ground-truth metric and `k`

Typical final outputs look like:

```
<prefix>_base_<actual_count>.fvecs
<prefix>_query_<actual_count>.fvecs
<prefix>_gt_<metric>_<k>.ivecs
```