Vector Embedding Dataset Processing Pipeline

A reusable Python pipeline that turns downloaded embedding datasets, supplied as .npy files, into clean ANN benchmark artifacts.

Current outputs include:

base vectors in .fvecs
query vectors in .fvecs
ground truth in .ivecs
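
For readers unfamiliar with these formats: .fvecs and .ivecs files store each vector as a 4-byte integer dimension followed by that many 4-byte values (float32 for .fvecs, int32 for .ivecs). The sketch below shows one way to write and read that layout with NumPy. The names write_vecs and read_vecs are hypothetical and are not taken from fvecs_writer.py or ivecs_check.py, and the sketch loads the whole file into memory, which the real pipeline may not do.

```python
import numpy as np


def write_vecs(path, array, dtype):
    """Write a 2-D array as [int32 dim][dim values] records (hypothetical helper)."""
    dtype = np.dtype(dtype)
    if dtype.itemsize != 4:
        raise ValueError("only 4-byte element types (float32, int32) are supported")
    array = np.ascontiguousarray(array, dtype=dtype)
    n, dim = array.shape
    # Every record is (dim + 1) 4-byte words: the dimension, then the values.
    records = np.empty((n, dim + 1), dtype=np.int32)
    records[:, 0] = dim
    records[:, 1:] = array.view(np.int32)  # reinterpret bits, no conversion
    records.tofile(path)


def read_vecs(path, dtype):
    """Read a .fvecs/.ivecs file back into a 2-D array (hypothetical helper)."""
    raw = np.fromfile(path, dtype=np.int32)
    if raw.size == 0:
        return np.empty((0, 0), dtype=dtype)
    dim = int(raw[0])
    records = raw.reshape(-1, dim + 1)
    if not np.all(records[:, 0] == dim):
        raise ValueError("inconsistent vector dimensions in file")
    return records[:, 1:].copy().view(dtype)
```

With this layout, a file holding n vectors of dimension d is exactly n * (d + 1) * 4 bytes, which is a quick sanity check on any output file.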

The pipeline is designed for large embedding datasets and supports a staged workflow with logs, a run summary, and cleanup of intermediate files.

Current scope

This repository currently supports .npy input files only.

More input readers and formats may be added later, but the current code is focused on a working .npy-to-benchmark-artifacts pipeline.

What this project does

Given one or more .npy embedding files, this project can:

extract vectors into a single base .fvecs file
remove exact zero vectors
normalize vectors when needed
deduplicate vectors
sample query vectors without replacement from the cleaned vector set
generate exact ground-truth nearest neighbors for the final base/query split
log progress, output stats, and errors at each stage
clean up intermediate large .fvecs files after successful downstream stages
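
The cleaning and splitting steps listed above can be illustrated with a small in-memory sketch. This is not the code in fvecs_remove_zeros.py, fvecs_normalize.py, fvecs_deduplicator.py, or fvecs_split.py: the function names, the seed parameter, and the choice to remove sampled queries from the base set are all assumptions made for illustration, and the real stages operate on .fvecs files rather than on arrays held in memory.

```python
import numpy as np


def clean_vectors(vectors, normalize=True):
    """Drop all-zero rows, optionally L2-normalize, then drop exact duplicates."""
    vectors = np.asarray(vectors, dtype=np.float32)
    # Keep only rows that have at least one non-zero entry.
    vectors = vectors[np.any(vectors != 0, axis=1)]
    if normalize:
        # Safe to divide: zero vectors were removed above.
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        vectors = vectors / norms
    # Exact deduplication that keeps the first occurrence and the original order.
    _, first_idx = np.unique(vectors, axis=0, return_index=True)
    return vectors[np.sort(first_idx)]


def split_base_query(vectors, num_query, seed=0):
    """Sample query rows without replacement; the remaining rows form the base set.

    Assumption: base and query sets are disjoint. The source only states that
    queries are sampled without replacement from the cleaned set.
    """
    rng = np.random.default_rng(seed)
    query_idx = rng.choice(len(vectors), size=num_query, replace=False)
    is_query = np.zeros(len(vectors), dtype=bool)
    is_query[query_idx] = True
    return vectors[~is_query], vectors[is_query]
```

Note that deduplicating after normalization also collapses vectors that differ only by scale, which is one reason the final counts can be lower than the requested ones.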

The final outputs are named using a common dataset prefix and the actual final vector counts produced.

Repository structure

.
├── config.py
├── pipeline.py
├── readers.py
├── fvecs_writer.py
├── fvecs_remove_zeros.py
├── fvecs_normalize.py
├── fvecs_deduplicator.py
├── fvecs_split.py
├── ivecs_check.py
├── knn_utils.py
└── runs/

Example run

After editing config.py, run the pipeline from the repository root with:

python3 pipeline.py

Or with a specific Python interpreter:

/path/to/python pipeline.py

Required `config.py` settings

At a minimum, you should set the following values in config.py before running the pipeline.

Run and naming settings

RUN_NAME
Name of the run directory under runs/
FILE_PREFIX
Common prefix used to name output artifacts

Example:

RUN_NAME = "wiki_mpnet_en_trial"
FILE_PREFIX = "wiki_mpnet_embeddings"

Input settings

SOURCE_TYPE
Must be set to "npy", the only input type currently supported
INPUT_FILES
A list of .npy embedding files to process

Example:

SOURCE_TYPE = "npy"

INPUT_FILES = [DATASET_DIR / f"emb_{i:03d}.npy" for i in range(10)]

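DATASET_DIR is built from the DATASET_ROOT environment variable, as shown in the minimal example config.py section. Be sure to create a .env file in the project's root directory that defines DATASET_ROOT, or export the variable in your shell before running the pipeline.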
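
How the repository loads the .env file is not shown here, so the following is only a sketch of one dependency-free way to do it. The function load_env_file is hypothetical; the project may rely on a library or on your shell instead. A matching .env file would contain a single line such as DATASET_ROOT=/path/to/your/datasets.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE lines into os.environ without overriding existing values."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    loaded = {}
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")
        loaded[key] = value
        # Values already exported in the shell take precedence.
        os.environ.setdefault(key, value)
    return loaded
```

Calling load_env_file() before config.py reads DATASET_ROOT would make the value available through os.environ.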

Requested dataset sizes

NUM_BASE
Requested final number of base vectors
NUM_QUERY
Requested final number of query vectors

The pipeline initially extracts at least:

NUM_BASE + NUM_QUERY

vectors from the input files.

Example:

NUM_BASE = 100000
NUM_QUERY = 10000
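
As a rough illustration of the extraction step, the sketch below reads rows from a list of .npy files until a target count is reached. The function extract_rows is hypothetical and is not the logic in readers.py. It stops at exactly the target, whereas the pipeline extracts at least NUM_BASE + NUM_QUERY vectors, and raising an error when the inputs are too small is an assumption.

```python
import numpy as np


def extract_rows(input_files, num_needed):
    """Collect the first num_needed rows across the given .npy files."""
    chunks = []
    collected = 0
    for path in input_files:
        if collected >= num_needed:
            break
        # Memory-map the file so only the rows actually taken are read.
        arr = np.load(path, mmap_mode="r")
        take = min(len(arr), num_needed - collected)
        chunks.append(np.array(arr[:take], dtype=np.float32))
        collected += take
    if collected < num_needed:
        raise ValueError(
            f"needed {num_needed} vectors but the inputs only contain {collected}"
        )
    return np.concatenate(chunks, axis=0)
```

With the example values above, the target would be NUM_BASE + NUM_QUERY = 110000 rows.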

Ground truth settings

GT_K
Number of nearest neighbors to compute
GT_METRIC
"ip" or "l2"
GT_SHUFFLE
Whether to let knn_utils.py shuffle before ground truth generation
GT_GPUS
"-1" for CPU, or values such as "0" or "0,1" for GPU execution

Example:

GT_K = 100
GT_METRIC = "ip"
GT_SHUFFLE = False
GT_GPUS = "-1"
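
To make GT_K and GT_METRIC concrete, here is a minimal brute-force sketch of exact nearest-neighbor search on the CPU with NumPy. It is not the implementation in knn_utils.py: the function exact_knn is hypothetical, it ignores GT_SHUFFLE and GT_GPUS, and it materializes the full score matrix, so it is only practical for small inputs.

```python
import numpy as np


def exact_knn(base, queries, k, metric="ip"):
    """Return an int32 array of shape (num_queries, k) with neighbor indices."""
    base = np.asarray(base, dtype=np.float32)
    queries = np.asarray(queries, dtype=np.float32)
    dots = queries @ base.T
    if metric == "ip":
        # Larger inner product means a closer neighbor.
        ranking_key = -dots
    elif metric == "l2":
        # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2; smaller is closer.
        q_sq = (queries ** 2).sum(axis=1, keepdims=True)
        b_sq = (base ** 2).sum(axis=1)
        ranking_key = q_sq - 2.0 * dots + b_sq
    else:
        raise ValueError(f"unsupported metric: {metric!r}")
    # Stable sort so ties are broken by the lower base index.
    order = np.argsort(ranking_key, axis=1, kind="stable")[:, :k]
    return order.astype(np.int32)
```

For unit-normalized vectors, ||q - b||^2 = 2 - 2 q.b, so "ip" and "l2" produce the same neighbor ranking.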

Pipeline behavior settings

CLEANUP_INTERMEDIATE_FVECS
If True, intermediate .fvecs files are deleted after successful downstream stages
OVERWRITE
If False, stages with existing outputs are skipped

Example:

CLEANUP_INTERMEDIATE_FVECS = True
OVERWRITE = False
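
The two settings above can be pictured with the following sketch. The helpers should_run_stage and cleanup_intermediate are hypothetical names, not functions from pipeline.py, and the assumption that OVERWRITE = True reruns a stage whose output already exists is an inference from the description.

```python
from pathlib import Path


def should_run_stage(output_path, overwrite):
    """Run a stage if its output is missing, or always when overwrite is True."""
    return overwrite or not Path(output_path).exists()


def cleanup_intermediate(paths, enabled):
    """Delete intermediate files once downstream stages succeeded.

    Returns the list of paths that were actually removed.
    """
    removed = []
    if not enabled:
        return removed
    for path in map(Path, paths):
        if path.exists():
            path.unlink()
            removed.append(path)
    return removed
```

A driver would call should_run_stage before each stage and cleanup_intermediate only after the stage that consumes those files has finished without errors.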

Minimal example `config.py` section

import os
from pathlib import Path

RUN_NAME = "wiki_mpnet_en_trial"
FILE_PREFIX = "wiki_mpnet_embeddings"
CLEANUP_INTERMEDIATE_FVECS = True
OVERWRITE = False

RUN_DIR = Path("runs") / RUN_NAME

NUM_BASE = 100000
NUM_QUERY = 10000

GT_K = 100
GT_METRIC = "ip"
GT_SHUFFLE = False
GT_GPUS = "-1"

SOURCE_TYPE = "npy"

dataset_root = os.environ.get("DATASET_ROOT")
if not dataset_root:
    raise RuntimeError(
        "DATASET_ROOT is not set. "
        "Example: export DATASET_ROOT=/path/to/your/datasets"
    )

DATASET_ROOT = Path(dataset_root)
DATASET_NAME = "mpnet-43m"  # Example only; use the directory name of your own dataset
EMBED_SUBDIR = "data/en/embs"

DATASET_DIR = DATASET_ROOT / DATASET_NAME / EMBED_SUBDIR
INPUT_FILES = [DATASET_DIR / f"emb_{i:03d}.npy" for i in range(10)] # Match the file naming conventions from your download

Notes on output

The pipeline writes outputs into:

runs/<RUN_NAME>/

At the end of a successful run, the final artifacts are renamed using:

FILE_PREFIX
the actual final base count
the actual final query count
the ground truth metric and k

Typical final outputs look like:

<prefix>_base_<actual_count>.fvecs
<prefix>_query_<actual_count>.fvecs
<prefix>_gt_<metric>_<k>.ivecs
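
The naming pattern above can be expressed as a small helper. The function final_artifact_names is hypothetical, and the exact formatting of counts in the real pipeline (for example, whether they are abbreviated) is not confirmed by this README. The counts used in the usage note are made up for illustration.

```python
def final_artifact_names(prefix, num_base, num_query, metric, k):
    """Build the three output file names from the pattern shown above."""
    return {
        "base": f"{prefix}_base_{num_base}.fvecs",
        "query": f"{prefix}_query_{num_query}.fvecs",
        "ground_truth": f"{prefix}_gt_{metric}_{k}.ivecs",
    }
```

For instance, final_artifact_names("wiki_mpnet_embeddings", 99987, 10000, "ip", 100) yields wiki_mpnet_embeddings_base_99987.fvecs, wiki_mpnet_embeddings_query_10000.fvecs, and wiki_mpnet_embeddings_gt_ip_100.ivecs.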
