Conversation
## Summary - set close-on-exec for file open/create paths to prevent lock FD inheritance across exec - add `zvec_open_read_only(const char* path, zvec_collection_t* out)` C API and open collection with `read_only_=true` - add regression tests for CLOEXEC behavior and lock FD inheritance across fork+exec - fix the lock-inheritance regression test to move ownership from the create result and avoid shared ownership false negatives ## Testing - `./build-1349/bin/file_test --gtest_filter=File.CreateAndOpen_SetsCloseOnExecFlag` - `./build-1349/bin/collection_test --gtest_filter=CollectionTest.Feature_LockFdIsNotInheritedAcrossExec`
| zvec_status_t zvec_collection_destroy(zvec_collection_t col) { | ||
| if (!col) return make_error(Status::InvalidArgument("null collection")); | ||
| try { | ||
| return make_error((*static_cast<Collection::Ptr *>(col))->Destroy()); | ||
| } catch (const std::exception &e) { | ||
| return make_exception_error(e); | ||
| } | ||
| } |
There was a problem hiding this comment.
Memory leak in zvec_collection_destroy
The function calls Destroy() on the collection to delete its on-disk data, but it never deletes the Collection::Ptr* wrapper that was heap-allocated in zvec_create_and_open / zvec_open / zvec_open_read_only (e.g. new Collection::Ptr(std::move(*result))). Compare with zvec_collection_free, which correctly calls delete static_cast<Collection::Ptr *>(col).
After this call returns, the caller still holds a zvec_collection_t pointing to the now-leaked allocation, which is also a potential use-after-free if the handle is accessed again.
The fix is to free the wrapper after calling Destroy():
| zvec_status_t zvec_collection_destroy(zvec_collection_t col) { | |
| if (!col) return make_error(Status::InvalidArgument("null collection")); | |
| try { | |
| return make_error((*static_cast<Collection::Ptr *>(col))->Destroy()); | |
| } catch (const std::exception &e) { | |
| return make_exception_error(e); | |
| } | |
| } | |
| zvec_status_t zvec_collection_destroy(zvec_collection_t col) { | |
| if (!col) return make_error(Status::InvalidArgument("null collection")); | |
| try { | |
| auto *ptr = static_cast<Collection::Ptr *>(col); | |
| auto status = make_error((*ptr)->Destroy()); | |
| delete ptr; | |
| return status; | |
| } catch (const std::exception &e) { | |
| return make_exception_error(e); | |
| } | |
| } |
| case DataType::ARRAY_BOOL: { | ||
| auto v = doc.get<std::vector<bool>>(name); | ||
| return v ? json(*v) : json(nullptr); | ||
| } | ||
| case DataType::ARRAY_INT32: { | ||
| auto v = doc.get<std::vector<int32_t>>(name); | ||
| return v ? json(*v) : json(nullptr); | ||
| } | ||
| case DataType::ARRAY_INT64: { | ||
| auto v = doc.get<std::vector<int64_t>>(name); | ||
| return v ? json(*v) : json(nullptr); | ||
| } | ||
| case DataType::ARRAY_FLOAT: { | ||
| auto v = doc.get<std::vector<float>>(name); | ||
| return v ? json(*v) : json(nullptr); | ||
| } | ||
| case DataType::ARRAY_DOUBLE: { | ||
| auto v = doc.get<std::vector<double>>(name); | ||
| return v ? json(*v) : json(nullptr); | ||
| } | ||
| case DataType::ARRAY_STRING: { | ||
| auto v = doc.get<std::vector<std::string>>(name); | ||
| return v ? json(*v) : json(nullptr); | ||
| } | ||
| default: | ||
| return json(nullptr); | ||
| } | ||
| } |
There was a problem hiding this comment.
ARRAY_UINT32 and ARRAY_UINT64 missing from field_to_json
Both DataType::ARRAY_UINT32 and DataType::ARRAY_UINT64 are handled correctly in set_doc_field (lines 249–253) for writes, but are absent from the field_to_json switch. When a document containing these field types is read back through zvec_query, zvec_fetch, or zvec_sparse_query, those fields silently fall through to the default: return json(nullptr) branch, returning null instead of the actual array values.
Add the two missing cases:
case DataType::ARRAY_UINT32: {
auto v = doc.get<std::vector<uint32_t>>(name);
return v ? json(*v) : json(nullptr);
}
case DataType::ARRAY_UINT64: {
auto v = doc.get<std::vector<uint64_t>>(name);
return v ? json(*v) : json(nullptr);
}
| #ifdef __cplusplus | ||
| extern "C" { | ||
| #endif | ||
|
|
||
| #include <stddef.h> | ||
| #include <stdint.h> |
There was a problem hiding this comment.
#include directives placed inside extern "C" block
The #include <stddef.h> and #include <stdint.h> directives appear after the opening extern "C" {, meaning they are textually included inside the linkage-specification block. While these two C headers are safe in practice, placing includes inside extern "C" is considered bad practice — any C++ headers transitively pulled in would have their symbols incorrectly declared with C linkage, which can cause ODR violations or link errors.
The conventional fix is to move the includes before the extern "C" guard:
| #ifdef __cplusplus | |
| extern "C" { | |
| #endif | |
| #include <stddef.h> | |
| #include <stdint.h> | |
| #pragma once | |
| #include <stddef.h> | |
| #include <stdint.h> | |
| #ifdef __cplusplus | |
| extern "C" { | |
| #endif |
| static DataType parse_data_type(const std::string &s) { | ||
| if (s == "BINARY") return DataType::BINARY; | ||
| if (s == "STRING") return DataType::STRING; | ||
| if (s == "BOOL") return DataType::BOOL; | ||
| if (s == "INT32") return DataType::INT32; | ||
| if (s == "INT64") return DataType::INT64; | ||
| if (s == "UINT32") return DataType::UINT32; | ||
| if (s == "UINT64") return DataType::UINT64; | ||
| if (s == "FLOAT") return DataType::FLOAT; | ||
| if (s == "DOUBLE") return DataType::DOUBLE; | ||
| if (s == "VECTOR_BINARY32") return DataType::VECTOR_BINARY32; | ||
| if (s == "VECTOR_BINARY64") return DataType::VECTOR_BINARY64; | ||
| if (s == "VECTOR_FP16") return DataType::VECTOR_FP16; | ||
| if (s == "VECTOR_FP32") return DataType::VECTOR_FP32; | ||
| if (s == "VECTOR_FP64") return DataType::VECTOR_FP64; | ||
| if (s == "VECTOR_INT4") return DataType::VECTOR_INT4; | ||
| if (s == "VECTOR_INT8") return DataType::VECTOR_INT8; | ||
| if (s == "VECTOR_INT16") return DataType::VECTOR_INT16; | ||
| if (s == "SPARSE_VECTOR_FP16") return DataType::SPARSE_VECTOR_FP16; | ||
| if (s == "SPARSE_VECTOR_FP32") return DataType::SPARSE_VECTOR_FP32; | ||
| if (s == "ARRAY_BINARY") return DataType::ARRAY_BINARY; | ||
| if (s == "ARRAY_STRING") return DataType::ARRAY_STRING; | ||
| if (s == "ARRAY_BOOL") return DataType::ARRAY_BOOL; | ||
| if (s == "ARRAY_INT32") return DataType::ARRAY_INT32; | ||
| if (s == "ARRAY_INT64") return DataType::ARRAY_INT64; | ||
| if (s == "ARRAY_UINT32") return DataType::ARRAY_UINT32; | ||
| if (s == "ARRAY_UINT64") return DataType::ARRAY_UINT64; | ||
| if (s == "ARRAY_FLOAT") return DataType::ARRAY_FLOAT; | ||
| if (s == "ARRAY_DOUBLE") return DataType::ARRAY_DOUBLE; | ||
| return DataType::UNDEFINED; | ||
| } |
There was a problem hiding this comment.
VECTOR_FP64 documented but not handled; VECTOR_BINARY32/VECTOR_BINARY64 handled but not documented
There is a mismatch between the public API docs in zvec_c.h and the implementation here:
VECTOR_FP64is listed in the header'sSupported data_type valuescomment but has no branch inparse_data_type,set_doc_field, orfield_to_json. Users passing"VECTOR_FP64"will silently getDataType::UNDEFINED, likely causing unexpected errors or silent data loss downstream.VECTOR_BINARY32andVECTOR_BINARY64are parsed correctly (lines 76–77) but are absent from the header documentation, so callers have no way to know they can be used.
Consider either adding proper handler cases for VECTOR_FP64 in parse_data_type/set_doc_field/field_to_json, or removing it from the documentation and adding VECTOR_BINARY32/VECTOR_BINARY64 in its place.
This is a try to add C bindings to zvec.
It can be used with other language bindings, too.
Greptile Summary
This PR adds a C shared-library binding (
libzvec_c) for zvec, enabling usage from Rust via FFI and other languages. It introduces a JSON-based API surface covering the full lifecycle (create, open, insert, upsert, update, delete, query, DDL), a CMake build optionBUILD_C_BINDINGS, and a companion improvement toFile::open/File::createthat setsFD_CLOEXECon all opened file descriptors to prevent handle leakage acrossexeccalls.Key findings from the review:
zvec_collection_destroy— theCollection::Ptr*wrapper allocated on the heap duringzvec_create_and_open/zvec_openis never freed afterDestroy()is called, contradicting the documented contract "free the handle." Onlyzvec_collection_freecorrectly deletes the wrapper.ARRAY_UINT32/ARRAY_UINT64serialization — these two types are handled inset_doc_field(write path) but absent from thefield_to_jsonswitch (read path), silently returningnullfor those fields in query/fetch responses.VECTOR_FP64documented but unimplemented — listed in the header's supported data-type table but missing fromparse_data_type,set_doc_field, andfield_to_json; converselyVECTOR_BINARY32/VECTOR_BINARY64are implemented but undocumented.#includedirectives insideextern "C"block — minor style issue; headers should be included before the linkage-specification to avoid potential ODR issues with transitively included C++ code.Confidence Score: 2/5
zvec_collection_destroyand the silent data loss forARRAY_UINT32/ARRAY_UINT64fields in read responses are correctness bugs that need to be fixed first.zvec_collection_destroyleaks the heap-allocatedCollection::Ptr*wrapper every time it is called, and (2)ARRAY_UINT32/ARRAY_UINT64field values are silently dropped during serialization of query/fetch results. The FD_CLOEXEC refactor and the CMake scaffolding are solid, but the core binding implementation needs these fixes before it can be considered correct.src/binding/c/zvec_c.ccrequires the most attention for the memory-management and serialization bugs;src/binding/c/zvec_c.hneeds the include placement fix and documentation alignment.Important Files Changed
zvec_collection_destroy(wrapper pointer not freed), missingARRAY_UINT32/ARRAY_UINT64cases infield_to_json, and undocumented/unimplementedVECTOR_FP64handling.#includedirectives are placed inside theextern "C"block andVECTOR_FP64/VECTOR_BINARY32/VECTOR_BINARY64support is inconsistent with the implementation.SetCloseOnExecandOpenWithCloseOnExechelpers to ensure file descriptors are not leaked acrossexeccalls; change is self-contained and correct.CreateAndOpen_SetsCloseOnExecFlagtest that verifies the newFD_CLOEXECbehaviour viafcntl; test is guarded by the expected Unix platform check.FD_CLOEXECsemantics and is correctly gated on non-Windows platforms.csubdirectory conditionally whenBUILD_C_BINDINGSis ON; correctly separated from the Python binding guard.BUILD_C_BINDINGSCMake option (default OFF); straightforward and follows existing conventions.Sequence Diagram
sequenceDiagram participant Caller as C/FFI Caller participant API as zvec_c API participant Col as Collection::Ptr* (heap) participant DB as zvec::Collection Caller->>API: zvec_create_and_open(path, schema_json, &out) API->>DB: Collection::CreateAndOpen(path, schema, options) DB-->>API: Result<Collection::Ptr> API->>Col: new Collection::Ptr(move(*result)) API-->>Caller: zvec_status_t + out = Col* Caller->>API: zvec_insert(col, docs_json, &results_json) API->>DB: collection->Schema() DB-->>API: CollectionSchema API->>DB: collection->Insert(docs) DB-->>API: Result<WriteResults> API-->>Caller: zvec_status_t + *results_json (heap string) Caller->>API: zvec_query(col, query_json, &results_json) API->>DB: collection->Query(query) DB-->>API: Result<vector<Doc*>> API-->>Caller: zvec_status_t + *results_json (heap string) Caller->>API: zvec_free_string(results_json) API->>API: delete[] s Caller->>API: zvec_collection_destroy(col) API->>DB: collection->Destroy() DB-->>API: Status Note over API,Col: ⚠️ Col* wrapper is never deleted (memory leak) API-->>Caller: zvec_status_t Caller->>API: zvec_collection_free(col) API->>Col: delete Col* API-->>Caller: voidLast reviewed commit: 27d34ac