Commit 8950913 is a large architectural change. It introduced at least one regression (#645), which is getting fixed. But there are several TODOs/FIXMEs.
Major Incomplete Aspects from the Segmentation Commit
1. Segment splitting can produce over-sized segments (FIXME)
ColumnChunkData::split() splits into chunks of 64 values per pass, but doesn't handle the case where a single batch of 64 values still exceeds MAX_SEGMENT_SIZE. The FIXME reads:
│ "we either need to split recursively, or detect individual values which bring the size above MAX_SEGMENT_SIZE, since this will still sometimes produce segments larger than MAX_SEGMENT_SIZE"
So for wide rows (e.g., a struct column with many large string children), segments can still exceed the configured cap.
2. CSR (adjacency-list) node groups lack segmentation (FIXME)
│ "this needs segmentation. Maybe this should use (or share code with) checkpointOutOfPlace"
The segmentation was implemented for regular ChunkedNodeGroup columns, but CSR chunked node groups — which store the adjacency lists for relationships — still flush all their data chunks as monolithic blocks without splitting. This is a significant gap since CSR data can be very large.
3. shouldSplit() computes on-disk size serially (TODO)
│ "this should use the inMemoryStats to avoid scanning the data, however not all functions update them"
To decide whether a segment needs splitting, shouldSplit() calls getSizeOnDisk(), which serializes the entire segment to measure its byte size. This is expensive for in-memory data. It should use cached in-memory statistics instead, but those aren't reliably maintained by all code paths.
4. O(N²) ALP metadata in split (fixed post-commit)
The initial split() implementation recomputed ALP compression metadata from scratch for every 64-value block, causing quadratic behavior for large segments. This was fixed in commit 8102088 (after the segmentation commit).
5. Linear segment scan (TODO)
genericFindSegment does a linear walk through the segment list:
│ "try binary search (might only make a difference for a very large number of segments)"
For tables with many small segments, this adds O(N) overhead per lookup. The TODO was left to switch to binary search if profiling showed it mattered.
6. getSegments() is a workaround (TODO)
getSegments() builds a new std::vector<const ColumnChunkData*> by collecting raw pointers from the internal std::vector<unique_ptr> each time, just to satisfy storage_info queries. The TODO suggests providing a proper std::span instead.
7. Const-correctness workaround for ALP (TODO)
rangeSegments has a const overload that does const_cast because ALP exception chunks modify state during scanning. The TODO notes this should be fixed.
Commit 8950913 is a large architectural change. It introduced at least one regression (#645), which is getting fixed. But there are several TODOs/FIXMEs.
Major Incomplete Aspects from the Segmentation Commit
1. Segment splitting can produce over-sized segments (FIXME)
ColumnChunkData::split() splits into chunks of 64 values per pass, but doesn't handle the case where a single batch of 64 values still exceeds MAX_SEGMENT_SIZE. The FIXME reads:
│ "we either need to split recursively, or detect individual values which bring the size above MAX_SEGMENT_SIZE, since this will still sometimes produce segments larger than MAX_SEGMENT_SIZE"
So for wide rows (e.g., a struct column with many large string children), segments can still exceed the configured cap.
2. CSR (adjacency-list) node groups lack segmentation (FIXME)
│ "this needs segmentation. Maybe this should use (or share code with) checkpointOutOfPlace"
The segmentation was implemented for regular ChunkedNodeGroup columns, but CSR chunked node groups — which store the adjacency lists for relationships — still flush all their data chunks as monolithic blocks without splitting. This is a significant gap since CSR data can be very large.
3. shouldSplit() computes on-disk size serially (TODO)
│ "this should use the inMemoryStats to avoid scanning the data, however not all functions update them"
To decide whether a segment needs splitting, shouldSplit() calls getSizeOnDisk(), which serializes the entire segment to measure its byte size. This is expensive for in-memory data. It should use cached in-memory statistics instead, but those aren't reliably maintained by all code paths.
4. O(N²) ALP metadata in split (fixed post-commit)
The initial split() implementation recomputed ALP compression metadata from scratch for every 64-value block, causing quadratic behavior for large segments. This was fixed in commit 8102088 (after the segmentation commit).
5. Linear segment scan (TODO)
genericFindSegment does a linear walk through the segment list:
│ "try binary search (might only make a difference for a very large number of segments)"
For tables with many small segments, this adds O(N) overhead per lookup. The TODO was left to switch to binary search if profiling showed it mattered.
6. getSegments() is a workaround (TODO)
getSegments() builds a new std::vector<const ColumnChunkData*> by collecting raw pointers from the internal std::vector<unique_ptr> each time, just to satisfy storage_info queries. The TODO suggests providing a proper std::span instead.
7. Const-correctness workaround for ALP (TODO)
rangeSegments has a const overload that does const_cast because ALP exception chunks modify state during scanning. The TODO notes this should be fixed.