The obvious way to convert a katana::LargeArray into an arrow::Buffer leaks memory because Buffer often doesn't own it's underlying data and assumes some other object owns it and that the Buffer lifetime is contained in the data's lifetime. Specifically, the following code leaks two blocks of memory:
|
auto numeric_array_out_indices = |
|
std::make_shared<arrow::NumericArray<arrow::UInt64Type>>( |
|
static_cast<int64_t>(num_nodes), |
|
arrow::MutableBuffer::Wrap(out_indices.release()->data(), num_nodes)); |
This code, and other code like it that probably exists in the codebase, leaks:
- The LargeBuffer instance
out_indices because of the call to release without passing ownership of the instance to some other object.
- The buffer
out_indices->data() is leaking because arrow::MutableBuffer does not own it's data by default, so it will never deallocate the buffer passed to arrow::MutableBuffer::Wrap.
Further, arrow would not be able to correctly deallocate the LargeArray data pointer anyway, since arrow doesn't make assumptions about the allocator used for any given buffer. So there is no simple take_ownership flag.
The solution will be to do the following:
- Implement a
KatanaMemoryPool subclass of arrow::MemoryPool, which would use Katana's NUMA aware allocator and deallocator. Our memory pool can be passed to arrow::ArrayBuilders, arrow::AllocateBuffer, and even parquet to control how it's memory is allocated and freed.
- Make
katana::LargeArray into a subclass of arrow::MutableBuffer, so that it can be passed directly into Arrow without conversion. The semantics of LargeArray are simple enough they should fit into the Buffer interface just fine.
In fixing this, be careful of the ownership assumptions in Arrow. Buffer often doesn't own it's data. ArrayData and Array share their data with other instances (to implement zero-copy slicing). These objects generally use shared_ptr to manage the underlying data, but be a little careful.
The obvious way to convert a
katana::LargeArrayinto anarrow::Bufferleaks memory becauseBufferoften doesn't own it's underlying data and assumes some other object owns it and that theBufferlifetime is contained in the data's lifetime. Specifically, the following code leaks two blocks of memory:katana/libgalois/src/analytics/subgraph_extraction/subgraph_extraction.cpp
Lines 92 to 95 in 5a8c284
This code, and other code like it that probably exists in the codebase, leaks:
out_indicesbecause of the call to release without passing ownership of the instance to some other object.out_indices->data()is leaking becausearrow::MutableBufferdoes not own it's data by default, so it will never deallocate the buffer passed toarrow::MutableBuffer::Wrap.Further, arrow would not be able to correctly deallocate the
LargeArraydata pointer anyway, since arrow doesn't make assumptions about the allocator used for any given buffer. So there is no simpletake_ownershipflag.The solution will be to do the following:
KatanaMemoryPoolsubclass ofarrow::MemoryPool, which would use Katana's NUMA aware allocator and deallocator. Our memory pool can be passed toarrow::ArrayBuilders,arrow::AllocateBuffer, and even parquet to control how it's memory is allocated and freed.katana::LargeArrayinto a subclass ofarrow::MutableBuffer, so that it can be passed directly into Arrow without conversion. The semantics ofLargeArrayare simple enough they should fit into theBufferinterface just fine.In fixing this, be careful of the ownership assumptions in Arrow.
Bufferoften doesn't own it's data.ArrayDataandArrayshare their data with other instances (to implement zero-copy slicing). These objects generally useshared_ptrto manage the underlying data, but be a little careful.