Skip to content

feat: Add C bindings#230

Closed
kdy1 wants to merge 3 commits intoalibaba:mainfrom
ZephyrCloudIO:kdy1/pr
Closed

feat: Add C bindings#230
kdy1 wants to merge 3 commits intoalibaba:mainfrom
ZephyrCloudIO:kdy1/pr

Conversation

@kdy1
Copy link

@kdy1 kdy1 commented Mar 16, 2026

This is a try to add C bindings to zvec.
It can be used with other language bindings, too.

Greptile Summary

This PR adds a C shared-library binding (libzvec_c) for zvec, enabling usage from Rust via FFI and other languages. It introduces a JSON-based API surface covering the full lifecycle (create, open, insert, upsert, update, delete, query, DDL), a CMake build option BUILD_C_BINDINGS, and a companion improvement to File::open/File::create that sets FD_CLOEXEC on all opened file descriptors to prevent handle leakage across exec calls.

Key findings from the review:

  • Memory leak in zvec_collection_destroy — the Collection::Ptr* wrapper allocated on the heap during zvec_create_and_open/zvec_open is never freed after Destroy() is called, contradicting the documented contract "free the handle." Only zvec_collection_free correctly deletes the wrapper.
  • Missing ARRAY_UINT32 / ARRAY_UINT64 serialization — these two types are handled in set_doc_field (write path) but absent from the field_to_json switch (read path), silently returning null for those fields in query/fetch responses.
  • VECTOR_FP64 documented but unimplemented — listed in the header's supported data-type table but missing from parse_data_type, set_doc_field, and field_to_json; conversely VECTOR_BINARY32/VECTOR_BINARY64 are implemented but undocumented.
  • #include directives inside extern "C" block — minor style issue; headers should be included before the linkage-specification to avoid potential ODR issues with transitively included C++ code.

Confidence Score: 2/5

  • Not safe to merge as-is — the memory leak in zvec_collection_destroy and the silent data loss for ARRAY_UINT32/ARRAY_UINT64 fields in read responses are correctness bugs that need to be fixed first.
  • Two logic bugs affect observable correctness: (1) zvec_collection_destroy leaks the heap-allocated Collection::Ptr* wrapper every time it is called, and (2) ARRAY_UINT32/ARRAY_UINT64 field values are silently dropped during serialization of query/fetch results. The FD_CLOEXEC refactor and the CMake scaffolding are solid, but the core binding implementation needs these fixes before it can be considered correct.
  • src/binding/c/zvec_c.cc requires the most attention for the memory-management and serialization bugs; src/binding/c/zvec_c.h needs the include placement fix and documentation alignment.

Important Files Changed

Filename Overview
src/binding/c/zvec_c.cc Core C binding implementation — contains a memory leak in zvec_collection_destroy (wrapper pointer not freed), missing ARRAY_UINT32/ARRAY_UINT64 cases in field_to_json, and undocumented/unimplemented VECTOR_FP64 handling.
src/binding/c/zvec_c.h Public C API header — well-structured and documented, but #include directives are placed inside the extern "C" block and VECTOR_FP64/VECTOR_BINARY32/VECTOR_BINARY64 support is inconsistent with the implementation.
src/binding/c/CMakeLists.txt Build configuration for the shared library — correctly replicates the Python binding's whole-archive linker strategy to include all algorithm implementations; install targets look correct.
src/ailego/io/file.cc Adds SetCloseOnExec and OpenWithCloseOnExec helpers to ensure file descriptors are not leaked across exec calls; change is self-contained and correct.
tests/ailego/io/file_test.cc Adds CreateAndOpen_SetsCloseOnExecFlag test that verifies the new FD_CLOEXEC behaviour via fcntl; test is guarded by the expected Unix platform check.
tests/db/collection_test.cc Adds a fork-based test verifying that open collection file descriptors are not inherited by child processes; test uses FD_CLOEXEC semantics and is correctly gated on non-Windows platforms.
src/binding/CMakeLists.txt Adds the c subdirectory conditionally when BUILD_C_BINDINGS is ON; correctly separated from the Python binding guard.
CMakeLists.txt Adds top-level BUILD_C_BINDINGS CMake option (default OFF); straightforward and follows existing conventions.

Sequence Diagram

sequenceDiagram
    participant Caller as C/FFI Caller
    participant API as zvec_c API
    participant Col as Collection::Ptr* (heap)
    participant DB as zvec::Collection

    Caller->>API: zvec_create_and_open(path, schema_json, &out)
    API->>DB: Collection::CreateAndOpen(path, schema, options)
    DB-->>API: Result<Collection::Ptr>
    API->>Col: new Collection::Ptr(move(*result))
    API-->>Caller: zvec_status_t + out = Col*

    Caller->>API: zvec_insert(col, docs_json, &results_json)
    API->>DB: collection->Schema()
    DB-->>API: CollectionSchema
    API->>DB: collection->Insert(docs)
    DB-->>API: Result<WriteResults>
    API-->>Caller: zvec_status_t + *results_json (heap string)

    Caller->>API: zvec_query(col, query_json, &results_json)
    API->>DB: collection->Query(query)
    DB-->>API: Result<vector<Doc*>>
    API-->>Caller: zvec_status_t + *results_json (heap string)

    Caller->>API: zvec_free_string(results_json)
    API->>API: delete[] s

    Caller->>API: zvec_collection_destroy(col)
    API->>DB: collection->Destroy()
    DB-->>API: Status
    Note over API,Col: ⚠️ Col* wrapper is never deleted (memory leak)
    API-->>Caller: zvec_status_t

    Caller->>API: zvec_collection_free(col)
    API->>Col: delete Col*
    API-->>Caller: void
Loading

Last reviewed commit: 27d34ac

Greptile also left 4 inline comments on this PR.

kdy1 added 3 commits March 17, 2026 08:14
## Summary
- set close-on-exec for file open/create paths to prevent lock FD
inheritance across exec
- add `zvec_open_read_only(const char* path, zvec_collection_t* out)` C
API and open collection with `read_only_=true`
- add regression tests for CLOEXEC behavior and lock FD inheritance
across fork+exec
- fix the lock-inheritance regression test to move ownership from the
create result and avoid shared ownership false negatives

## Testing
- `./build-1349/bin/file_test
--gtest_filter=File.CreateAndOpen_SetsCloseOnExecFlag`
- `./build-1349/bin/collection_test
--gtest_filter=CollectionTest.Feature_LockFdIsNotInheritedAcrossExec`
@CLAassistant
Copy link

CLAassistant commented Mar 16, 2026

CLA assistant check
All committers have signed the CLA.

@kdy1 kdy1 marked this pull request as ready for review March 16, 2026 23:21
Comment on lines +474 to +481
zvec_status_t zvec_collection_destroy(zvec_collection_t col) {
if (!col) return make_error(Status::InvalidArgument("null collection"));
try {
return make_error((*static_cast<Collection::Ptr *>(col))->Destroy());
} catch (const std::exception &e) {
return make_exception_error(e);
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Memory leak in zvec_collection_destroy

The function calls Destroy() on the collection to delete its on-disk data, but it never deletes the Collection::Ptr* wrapper that was heap-allocated in zvec_create_and_open / zvec_open / zvec_open_read_only (e.g. new Collection::Ptr(std::move(*result))). Compare with zvec_collection_free, which correctly calls delete static_cast<Collection::Ptr *>(col).

After this call returns, the caller still holds a zvec_collection_t pointing to the now-leaked allocation, which is also a potential use-after-free if the handle is accessed again.

The fix is to free the wrapper after calling Destroy():

Suggested change
zvec_status_t zvec_collection_destroy(zvec_collection_t col) {
if (!col) return make_error(Status::InvalidArgument("null collection"));
try {
return make_error((*static_cast<Collection::Ptr *>(col))->Destroy());
} catch (const std::exception &e) {
return make_exception_error(e);
}
}
zvec_status_t zvec_collection_destroy(zvec_collection_t col) {
if (!col) return make_error(Status::InvalidArgument("null collection"));
try {
auto *ptr = static_cast<Collection::Ptr *>(col);
auto status = make_error((*ptr)->Destroy());
delete ptr;
return status;
} catch (const std::exception &e) {
return make_exception_error(e);
}
}

Comment on lines +361 to +388
case DataType::ARRAY_BOOL: {
auto v = doc.get<std::vector<bool>>(name);
return v ? json(*v) : json(nullptr);
}
case DataType::ARRAY_INT32: {
auto v = doc.get<std::vector<int32_t>>(name);
return v ? json(*v) : json(nullptr);
}
case DataType::ARRAY_INT64: {
auto v = doc.get<std::vector<int64_t>>(name);
return v ? json(*v) : json(nullptr);
}
case DataType::ARRAY_FLOAT: {
auto v = doc.get<std::vector<float>>(name);
return v ? json(*v) : json(nullptr);
}
case DataType::ARRAY_DOUBLE: {
auto v = doc.get<std::vector<double>>(name);
return v ? json(*v) : json(nullptr);
}
case DataType::ARRAY_STRING: {
auto v = doc.get<std::vector<std::string>>(name);
return v ? json(*v) : json(nullptr);
}
default:
return json(nullptr);
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARRAY_UINT32 and ARRAY_UINT64 missing from field_to_json

Both DataType::ARRAY_UINT32 and DataType::ARRAY_UINT64 are handled correctly in set_doc_field (lines 249–253) for writes, but are absent from the field_to_json switch. When a document containing these field types is read back through zvec_query, zvec_fetch, or zvec_sparse_query, those fields silently fall through to the default: return json(nullptr) branch, returning null instead of the actual array values.

Add the two missing cases:

    case DataType::ARRAY_UINT32: {
      auto v = doc.get<std::vector<uint32_t>>(name);
      return v ? json(*v) : json(nullptr);
    }
    case DataType::ARRAY_UINT64: {
      auto v = doc.get<std::vector<uint64_t>>(name);
      return v ? json(*v) : json(nullptr);
    }

Comment on lines +16 to +21
#ifdef __cplusplus
extern "C" {
#endif

#include <stddef.h>
#include <stdint.h>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#include directives placed inside extern "C" block

The #include <stddef.h> and #include <stdint.h> directives appear after the opening extern "C" {, meaning they are textually included inside the linkage-specification block. While these two C headers are safe in practice, placing includes inside extern "C" is considered bad practice — any C++ headers transitively pulled in would have their symbols incorrectly declared with C linkage, which can cause ODR violations or link errors.

The conventional fix is to move the includes before the extern "C" guard:

Suggested change
#ifdef __cplusplus
extern "C" {
#endif
#include <stddef.h>
#include <stdint.h>
#pragma once
#include <stddef.h>
#include <stdint.h>
#ifdef __cplusplus
extern "C" {
#endif

Comment on lines +66 to +96
static DataType parse_data_type(const std::string &s) {
if (s == "BINARY") return DataType::BINARY;
if (s == "STRING") return DataType::STRING;
if (s == "BOOL") return DataType::BOOL;
if (s == "INT32") return DataType::INT32;
if (s == "INT64") return DataType::INT64;
if (s == "UINT32") return DataType::UINT32;
if (s == "UINT64") return DataType::UINT64;
if (s == "FLOAT") return DataType::FLOAT;
if (s == "DOUBLE") return DataType::DOUBLE;
if (s == "VECTOR_BINARY32") return DataType::VECTOR_BINARY32;
if (s == "VECTOR_BINARY64") return DataType::VECTOR_BINARY64;
if (s == "VECTOR_FP16") return DataType::VECTOR_FP16;
if (s == "VECTOR_FP32") return DataType::VECTOR_FP32;
if (s == "VECTOR_FP64") return DataType::VECTOR_FP64;
if (s == "VECTOR_INT4") return DataType::VECTOR_INT4;
if (s == "VECTOR_INT8") return DataType::VECTOR_INT8;
if (s == "VECTOR_INT16") return DataType::VECTOR_INT16;
if (s == "SPARSE_VECTOR_FP16") return DataType::SPARSE_VECTOR_FP16;
if (s == "SPARSE_VECTOR_FP32") return DataType::SPARSE_VECTOR_FP32;
if (s == "ARRAY_BINARY") return DataType::ARRAY_BINARY;
if (s == "ARRAY_STRING") return DataType::ARRAY_STRING;
if (s == "ARRAY_BOOL") return DataType::ARRAY_BOOL;
if (s == "ARRAY_INT32") return DataType::ARRAY_INT32;
if (s == "ARRAY_INT64") return DataType::ARRAY_INT64;
if (s == "ARRAY_UINT32") return DataType::ARRAY_UINT32;
if (s == "ARRAY_UINT64") return DataType::ARRAY_UINT64;
if (s == "ARRAY_FLOAT") return DataType::ARRAY_FLOAT;
if (s == "ARRAY_DOUBLE") return DataType::ARRAY_DOUBLE;
return DataType::UNDEFINED;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VECTOR_FP64 documented but not handled; VECTOR_BINARY32/VECTOR_BINARY64 handled but not documented

There is a mismatch between the public API docs in zvec_c.h and the implementation here:

  • VECTOR_FP64 is listed in the header's Supported data_type values comment but has no branch in parse_data_type, set_doc_field, or field_to_json. Users passing "VECTOR_FP64" will silently get DataType::UNDEFINED, likely causing unexpected errors or silent data loss downstream.
  • VECTOR_BINARY32 and VECTOR_BINARY64 are parsed correctly (lines 76–77) but are absent from the header documentation, so callers have no way to know they can be used.

Consider either adding proper handler cases for VECTOR_FP64 in parse_data_type/set_doc_field/field_to_json, or removing it from the documentation and adding VECTOR_BINARY32/VECTOR_BINARY64 in its place.

@kdy1
Copy link
Author

kdy1 commented Mar 17, 2026

Closing as the JSON approach is not good enough.
Please see #167 and #234

@kdy1 kdy1 closed this Mar 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants