T-BLAEQ (Tensor BLAS-based CAE Query) is the official implementation for our submitted paper to VLDB 2026 titled:
"T-BLAEQ: A Tensor-Based Multigrid Index for GPU-Accelerated Multi-dimensional Query Processing"
The code is fully implemented in CUDA C/C++.
T-BLAEQ is a GPU-accelerated hierarchical mesh index designed for fast range and K-nearest-neighbour (KNN) queries over large high-dimensional datasets.
It organises the dataset into a coarsening hierarchy of meshes connected by sparse prolongation tensors (P-tensors). Each query traverses the hierarchy from coarsest to finest, pruning irrelevant clusters at every level before expanding survivors via a sparse matrix-vector multiply (SpTSpM). A greedy memory-policy scheduler automatically decides how aggressively to cache index data on the GPU, adapting to whatever device memory is available.
- NVIDIA GPU with CUDA Compute Capability 7.0 or higher (Volta+)
- CUDA Toolkit: 12.0 or higher
- GCC / G++: 12.0 or higher
- CMake: 3.20 or higher
- cuVS / RAFT: CUDA Vector Search library (https://github.com/rapidsai/cuvs)
- Doxygen (optional, for building documentation): 1.9 or higher
- Graphviz (optional, for call/class graphs): any recent version
To build the project, use CMake. It is highly recommended to build in Release mode for optimal GPU performance:
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)After building, run the program from the build directory:
./T-BLAEQ [OPTIONS]--build-index- Build and save index for the specified dataset--test-query- Test queries on the specified dataset-d, --dataset TEXT- Path to the dataset file-i, --index-path TEXT- Path to save/load index (default:indexes/)-f, --query-file TEXT- Path to the query file-q, --max-queries INT- Maximum number of queries to process (default: 10)-t, --query-type INT- Type of query to test:0for Range Query,1for KNN Query-k, --K INT- Number of neighbors for KNN query (default: 10)-r TEXT- Path to save query results (e.g.,results.csv)--force-policy TEXT- Force a specific memory policy for benchmarking (e.g.,L0,L1,L2,L3)--height INT- Hierarchy height for index building (both--build-indexand--gen-index) (default: 4)--ratios TEXT- Comma-separated coarsening ratios for index building (both--build-indexand--gen-index) (default:100,50,20)--gen-seed INT- Random seed for RandomKmeans synthetic hierarchy (default: 12345)--gen-sigma-divisor FLOAT- Random spread control in RandomKmeans:sigma = spacing/divisor(default: 3.0)
Build an index:
./T-BLAEQ --build-index -d dataset.txt -i indexes/my_index/Generate a random index with controllable hierarchy/noise:
./T-BLAEQ --gen-index --gen-N 500000 --gen-D 5 --height 4 --ratios 80,40,20 --gen-seed 20260330 --gen-sigma-divisor 2.2 -i indexes/random_index/Run range queries:
./T-BLAEQ --test-query -i indexes/my_index/ -f queries/range_50.txt -t 0 -q 100 -r results.csvRun KNN queries:
./T-BLAEQ --test-query -i indexes/my_index/ -f queries/knn_points.txt -t 1 --K 20 -q 100 -r results.csvForce a specific memory policy (for benchmarking):
./T-BLAEQ --test-query -i indexes/my_index/ -f queries/range_50.txt -t 0 -r results.csv --force-policy L1Build a JSONL index (with binary payloads) and a co-located T-BLAEQ index:
./T-BLAEQ-JsonIndex build <dataset_root> <coord.csv> <index_root> --height 4 --ratios 100,50,20Query the combined indexes (T-BLAEQ coarse, JSONL refine, payload dump):
./T-BLAEQ-JsonIndex query <index_root> --mode knn --timestamp 1715064123.123456 --seq seq0001 --knn-k 10
./T-BLAEQ-JsonIndex query <index_root> --mode range --start 1715064100.0 --end 1715064200.0 --seq 1,2Note: The JSONL group size (N) and fixed file-name bytes are compile-time constants in
src/file_index/IndexJsonTblaeqTypes.cuh (kGroupSize, kFileNameBytes).
Start the server (JSONL index + T-BLAEQ coarse search; clients read payloads locally):
./T-BLAEQ-JsonIndex-Server --host 0.0.0.0 --port 8090 --index-cache ./tempJsonIndexCacheBuild an index via HTTP:
curl -X POST http://localhost:8090/build-index \
-H "Content-Type: application/json" \
-d '{"dataset_name":"demo","dataset_root":"/data/demo","csv_path":"/data/demo/coord.csv","height":4,"ratios":"100,50,20"}'Query (returns JSON index entries for matched slots):
curl -X POST http://localhost:8090/query \
-H "Content-Type: application/json" \
-d '{"dataset_name":"demo","mode":"knn","timestamp":1715064123.123456,"seq":"seq0001","knn_k":10}'The response fields mirror the HTTP server (success, dataset_name, query_type, result_count, result,
memory_policy, level_original_sizes, level_pruned_sizes, query_runtime). Each result entry contains
the minimal JSON index fields needed to locate payloads locally: data_file, slot, index_copy_offset,
binary_index_record_size, data_offsets, index_vector, timestamps, and slot_count. Payload offset can
be computed as:
data_base_offset = index_copy_offset + binary_index_record_size
payload_offset = data_base_offset + data_offsets[slot]
payload_size = data_offsets[slot + 1] - data_offsets[slot]
You can also integrate T-BLAEQ directly into your own C++ applications:
#include "QueryHandler.h"
// Load a pre-built index and run a range query
QueryHandler handler("indexes/my_index/", /*loadFromIndex=*/true);
// Auto-selects the optimal memory policy based on available GPU VRAM
handler.prepareForQuery();
// Execute query
QueryResult result = handler.performQuery("queries/range_50.txt", QueryType::RANGE);
saveQueryResult(result, "output.csv");QueryHandler (Public facade: build/load, prepare, query)
└── QueryEngine (Outer query loop, timing, inter-level grid lifetime)
└── IQueryStrategy (Per-level strategy interface)
├── L0Strategy (Zero cudaMalloc hot path)
├── L1Strategy (Full permanent cache)
├── L2Strategy (Selective lazy-load)
└── L3Strategy (Tiled fallback)
- IndexData is the central data container. It owns all host-side index arrays and all device buffers whose lifetimes are managed by the active strategy.
- PolicyScheduler analyses available device memory, assigns the most aggressive feasible policy to each hierarchy level, and instantiates the appropriate strategy object.
src/
core/ IndexData, QueryEngine, QueryHandler, MemoryPolicy
strategies/ L0–L3 strategy implementations and shared helpers
Data_Structures/ PointCloud, SparseGrid, SparseTensorCscFormat, Query
Kernel/ SpMSpV kernels and strategy-specific wrappers
Kmeans/ GPU KMeans clustering (RAFT/cuVS wrapper)
MergeSort/ Block merge sort kernel and host merge utilities
Query/ Pruning kernels, compact, refactor, CUDA error macros
Setup/ Index construction helpers (P-tensor, max-radius)
utils/ NVTXProfiler, path string utilities
docs/ Doxygen configuration and Markdown documentation pages
To build the full Doxygen documentation (including Memory Policies and Query Pipeline details):
cd docs
./buildDoc.shThe generated documentation will be available at docs/html/index.html.
See LICENSE file for details.
This project uses third-party libraries. See NOTICE and the THIRD_PARTY_LICENSES directory for details.