An extension for VillageSQL Server that adds a vector data type with external columnar storage (SVECTOR), enabling efficient vector operations and laying the foundation for future ANN search and indexing.
This extension is under active development and is not stable. It depends on VillageSQL experimental extension APIs that are subject to breaking changes without notice. It is not recommended for production use.
- SVECTOR Type: A float32 vector type with declared dimension and external columnar storage (up to 3072 dimensions)
- Distance Functions: L1 (Manhattan) and L2 (Euclidean) distance metrics
- Similarity Functions: Inner product, and angular distance via cosine distance
- Utility Functions: Norm computation, dimension query, hex dump, and formatted output
- Native InnoDB Integration: Columnar storage implemented via VillageSQL/InnoDB exposed storage APIs, inheriting ACID guarantees, MVCC, and buffer pool–based caching
- Efficient Storage: External columnar storage with heap-style organization (non-clustered), providing stable vector addresses suitable for ANN index structures
- High Performance: C++ implementation with bitmap-managed slot arrays and insert load distribution across pages via multiple free lists
- VillageSQL build tree (specified via
VillageSQL_BUILD_DIR) - CMake 3.16 or higher
- C++17 compatible compiler
-
Clone the repository:
git clone https://github.com/villagesql/vsql-vector.git cd vsql-vector -
Configure and build:
mkdir -p build && cd build cmake .. -DVillageSQL_BUILD_DIR=/path/to/villagesql/build make -j$(nproc)
This produces
vsql_vector.vebin the build directory.To build with debug symbols and assertions (no optimization):
cmake .. -DVillageSQL_BUILD_DIR=/path/to/villagesql/build -DWITH_DEBUG=ON
-
Install the VEB into the VillageSQL build tree (optional):
make install
This copies
vsql_vector.vebto theveb_output_directoryconfigured in the VillageSQL build tree, making it discoverable by the server without specifying a full path.
Before using any SVECTOR features, load the extension in your session:
INSTALL EXTENSION vsql_vector;To unload:
UNINSTALL EXTENSION vsql_vector;SVECTOR(N) declares a fixed length float32 vector column with dimension N. Every row in that column stores exactly N floats. The maximum supported dimension is 3072.
-- Create a table with a vector column
CREATE TABLE embeddings (
id INT PRIMARY KEY,
vec SVECTOR(4) NOT NULL
) ENGINE=InnoDB;
-- Add a vector column to an existing table
ALTER TABLE embeddings ADD COLUMN vec2 SVECTOR(4) NULL;
-- Insert vectors (number of elements must match the declared dimension)
INSERT INTO embeddings VALUES (1, '[1.0, 2.0, 3.0, 4.0]', NULL); -- reference
INSERT INTO embeddings VALUES (2, '[1.0, 2.0, 3.0, 5.0]', NULL); -- L1 dist from id=1: 1
INSERT INTO embeddings VALUES (3, '[1.0, 2.0, 5.0, 4.0]', NULL); -- L1 dist from id=1: 2
INSERT INTO embeddings VALUES (4, '[2.0, 4.0, 6.0, 8.0]', NULL); -- L1 dist from id=1: 10
INSERT INTO embeddings VALUES (5, '[9.0, 8.0, 7.0, 6.0]', NULL); -- L1 dist from id=1: 20| Function | Returns | Description |
|---|---|---|
SVECTOR_DIMENSION(v) |
INT | Declared dimension of the vector |
SVECTOR_MAX_DIMENSION() |
INT | Maximum supported dimension (3072) |
SVECTOR_NORM(v) |
REAL | L2 (Euclidean) norm |
SVECTOR_FORMAT(v, precision) |
STRING | Vector as a fixed-precision decimal string |
SVECTOR_HEX(v) |
STRING | Raw float bytes as uppercase hex (useful for debugging) |
| Function | Returns | Description |
|---|---|---|
SVECTOR_DISTANCE_L1(v1, v2) |
REAL | L1 (Manhattan) distance — sum of absolute differences |
SVECTOR_DISTANCE_L2(v1, v2) |
REAL | L2 (Euclidean) distance — square root of sum of squared differences |
SVECTOR_DISTANCE_COSINE(v1, v2) |
REAL | Cosine distance — 1 - cosine_similarity; range [0, 2] |
SVECTOR_INNER_PRODUCT(v1, v2) |
REAL | Dot product (similarity, not a metric); higher means more similar |
Both arguments to a distance/similarity function must have the same dimension. All functions return NULL if either argument is NULL.
-- Norm of a stored vector
SELECT id, SVECTOR_NORM(vec) AS norm FROM embeddings ORDER BY id;
-- L2 distance between two stored vectors
SELECT SVECTOR_DISTANCE_L2(a.vec, b.vec) AS dist
FROM embeddings a, embeddings b
WHERE a.id = 1 AND b.id = 2;
-- Nearest-neighbour search by L1 distance (full table scan)
-- Note: HNSW index support is planned; today this performs a sequential scan.
-- The query vector is stored in a table row and joined in as a workaround
-- for the current limitation on inline constant vectors (see Known Limitations).
SELECT id, SVECTOR_DISTANCE_L1(vec, query.ref_vec) AS dist
FROM embeddings,
(SELECT vec AS ref_vec FROM embeddings WHERE id = 1) AS query
ORDER BY dist ASC
LIMIT 2;
-- Expected result: id=1 dist=0, id=2 dist=1
-- TODO: once inline constant vector support is added, the intended syntax is:
-- SELECT id, SVECTOR_DISTANCE_L1(vec, '[1.0, 2.0, 3.0, 4.0]') AS dist
-- FROM embeddings
-- ORDER BY dist ASC
-- LIMIT 2;
-- Update a vector value
UPDATE embeddings SET vec = '[0.5, 0.5, 0.5, 0.5]' WHERE id = 1;
-- Delete a row containing a vector
DELETE FROM embeddings WHERE id = 2;- Inline constant vectors: Using
SVECTOR::FROM_STRING(...)as a direct function argument (e.g.,SVECTOR_DISTANCE_L2(col, SVECTOR::FROM_STRING('[1,2,3,4]'))) currently fails because constant folding for parameterized custom types is not yet supported. As a workaround, store the query vector in a table row and join against it (as shown in the nearest-neighbour example above).
The extension includes tests using the MySQL Test Runner (MTR) framework.
Option 1 (Default): Using installed VEB
cd /path/to/villagesql/build/mysql-test
perl mysql-test-run.pl --suite=/path/to/vsql-vector/testOption 2: Using a specific VEB file
cd /path/to/villagesql/build/mysql-test
VSQL_VECTOR_VEB=/path/to/vsql-vector/build/vsql_vector.veb \
perl mysql-test-run.pl --suite=/path/to/vsql-vector/testThe build also produces svector_page_dump, a standalone tool for inspecting SVECTOR storage pages in InnoDB tablespace files. See src/storage/tools/README.md for full usage.
The following example shows output after inserting 5 rows into a SVECTOR(4) column and then deleting two of them (rows with id=2 and id=4) before the InnoDB purge thread has run. The delete-marked slots remain physically present in the page until purge.
$ svector_page_dump embeddings.ibd 4 -d 5 -rIBD File: embeddings.ibd
Root Page Number: 4
SVECTOR Root Page
=================
Version: 1
Page Type: 1 (ROOT_PAGE)
Creator: SVECTOR
Column Size: 16 bytes (4-dim float vector)
Data Pages:
Total: 1
Free: 1
Head: Page #5
Tail: Page #5
Free Slot Array:
Max Capacity: 2048 slots
Current Size: 1 slots
Non-empty Slots: 1
================================================================================
SVECTOR Data Page
=================
Version: 1
Page Type: 2 (DATA_PAGE)
Free Slot Number: 0
SVECTOR Data Page Links:
Previous: Page #4294967295 (NULL)
Next: Page #4294967295 (NULL)
SVECTOR Free Page Links:
Previous: Page #4294967295 (NULL)
Next: Page #4294967295 (NULL)
Capacity:
Max Records: 673
Free Records: 668 (99.3%)
Allocated: 5 (0.7%)
Active: 3
Deleted: 2
Record Bitmap:
AADAD...................................
........................................
(remaining 633 free slots omitted)
(. = Free, A = Active, D = Deleted)
Records (showing from slot 0, up to 10 records):
[ 0] Trx ID: 1001 Data:[0.10, 0.20, 0.30, 0.40]
[ 1] Trx ID: 1002 Data:[0.90, 0.80, 0.70, 0.60] (DELETED)
[ 2] Trx ID: 1003 Data:[0.50, 0.50, 0.50, 0.50]
[ 3] Trx ID: 1004 Data:[3.14, 2.72, 1.41, 1.73] (DELETED)
[ 4] Trx ID: 1005 Data:[0.11, 0.22, 0.33, 0.44]
Key observations:
DELETEDrecords (slots 1 and 3) are still physically present and visible to concurrent transactions that started before the DELETE committed (MVCC). They are reclaimed by the purge thread once no active transaction can see them.- Record Bitmap encodes each slot's state in 2 bits:
A= active (occupied, not deleted),D= delete-marked (occupied, pending purge),.= free (available for insert). - Free Slot Number
0means this data page is tracked at index 0 in the root page's free slot array, making it eligible for the next insert without a root page scan. - Max Records
673is the page capacity forSVECTOR(4)on a 16 KB InnoDB page:54 (header) + ⌈673×2/8⌉ (bitmap) + 673×24 (records) + 8 (trailer) = 16383 bytes.
vsql-vector/
├── src/
│ ├── native_vector.h # Vector type definitions and distance functions
│ ├── native_vector.cc # Encoding/decoding implementations
│ ├── vector.cc # VDF implementations and extension registration
│ └── storage/
│ ├── storage.h # ColumnStorage class declarations
│ ├── storage.cc # Insert/delete/purge/fetch operations
│ ├── root_page.h # Root page structure and free slot management
│ ├── root_page.cc # Root page operations
│ ├── data_page.h # Data page structure and slot management
│ ├── data_page.cc # Record insert/purge and bitmap operations
│ └── tools/
│ ├── README.md # Page dump tool documentation
│ ├── svector_page_dump.cc # Page dump tool driver
│ ├── page_reader.h/cc # IBD file reader
│ ├── root_page_parser.h/cc # Root page parser
│ └── data_page_parser.h/cc # Data page parser
├── cmake/
│ └── FindVillageSQL.cmake # CMake module to locate VillageSQL SDK
├── test/
│ ├── t/ # MTR test files
│ └── r/ # MTR expected results
├── manifest.json # VEB package manifest
└── CMakeLists.txt # Build configuration
- Support for inline constant vectors and bound parameter vectors in distance functions
- HNSW index for approximate nearest-neighbour (ANN) search
If you encounter a bug or have a feature request, please open an issue using GitHub Issues.
License information can be found in the LICENSE file.
VillageSQL welcomes contributions from the community. For more information, please see the VillageSQL Contributing Guide.
- File a bug or issue and we will review
- Start a discussion in the project discussions
- Join the Discord channel