Fix O(N²) ALP metadata recomputation in ColumnChunkData::split() by adsharma · Pull Request #649 · LadybugDB/ladybug

adsharma · 2026-07-04T00:05:46Z

Fixes: #645

# kuzu 0.11.3
--- 7. Create REL table (Owns) with 500,000 edges via COPY ---
  COPY completed in 0.25s (0.49 µs/edge)

vs

# https://github.com/kuzudb/kuzu/commit/9f5acab2e  + this commit
--- 7. Create REL table (Owns) with 500,000 edges via COPY ---
  COPY completed in 2.19s (4.37 µs/edge)

So COPY is still slower (about 9x in this run) because of segmentation and better estimation of how the DOUBLE column would compress, but not by 450x.

The Segmentation commit (8950913) introduced split(), which called getSizeOnDiskInMemoryStats() every 64 values in a tight inner loop. For DOUBLE/FLOAT columns, each call triggers the full ALP compression analysis (find_top_k_combinations + createFloatMetadata), creating O(N²) behavior: ~512 metadata recomputations per 256KB segment instead of one.
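To make the "~512 recomputations" figure concrete, here is a minimal cost model, not code from the repository. It assumes 8-byte DOUBLE values (so a 256KB segment holds 32,768 of them) and assumes each size check re-analyzes everything accumulated in the segment so far. `SplitCost`, `costOfPeriodicChecks`, and `costOfSingleAnalysis` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bookkeeping for the model; not a LadybugDB type.
struct SplitCost {
    uint64_t analysisCalls;  // how many times the size estimate is recomputed
    uint64_t valuesScanned;  // total values visited across all of those calls
};

// Models the old behavior: after every `checkInterval` appended values, the
// estimator re-analyzes all values accumulated in the segment so far
// (an assumption of this sketch).
SplitCost costOfPeriodicChecks(uint64_t valuesInSegment, uint64_t checkInterval) {
    assert(checkInterval > 0);
    SplitCost cost{0, 0};
    for (uint64_t seen = checkInterval; seen <= valuesInSegment; seen += checkInterval) {
        cost.analysisCalls += 1;
        cost.valuesScanned += seen;  // each call looks at everything seen so far
    }
    return cost;
}

// Models the fixed behavior: one analysis pass over the segment.
SplitCost costOfSingleAnalysis(uint64_t valuesInSegment) {
    return SplitCost{1, valuesInSegment};
}
```

Under these assumptions, `costOfPeriodicChecks(32768, 64)` reports 512 analysis calls and 8,404,992 values scanned, against 1 call and 32,768 values for `costOfSingleAnalysis(32768)`. The scanned-value total grows with the square of the segment length, which is where the O(N²) comes from.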

Fix this by computing the compression metadata once for the whole chunk, then deriving a count-based bound (maxPages × valuesPerPage) to use as the inner loop condition for fixed-size types. The bound respects page boundaries and avoids the integer-division bias of a simple avg-bytes-per-value approach. A single verification call per segment confirms the segment fits within targetSize.

Variable-size types (strings, lists) don't trigger expensive ALP analysis, so they keep the original metadata-based check.
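The count-based bound can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `ChunkLayout`, `maxValuesPerSegment`, `naiveMaxValues`, and `segmentLengths` are hypothetical names, and the page size and values-per-page figures in the usage note are made-up inputs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for what the one-time compression analysis yields.
struct ChunkLayout {
    uint64_t pageSize;       // bytes per on-disk page
    uint64_t valuesPerPage;  // compressed values that fit in one page
};

// Count-based bound: whole pages that fit in targetSize, times values per page.
// Clamping to at least one page is a choice made for this sketch.
uint64_t maxValuesPerSegment(const ChunkLayout& layout, uint64_t targetSize) {
    assert(layout.pageSize > 0 && layout.valuesPerPage > 0);
    uint64_t maxPages = std::max<uint64_t>(1, targetSize / layout.pageSize);
    return maxPages * layout.valuesPerPage;
}

// The simpler alternative: integer division truncates the average bytes per
// value (1.5 becomes 1), so the resulting bound can overshoot targetSize.
uint64_t naiveMaxValues(uint64_t totalBytes, uint64_t numValues, uint64_t targetSize) {
    assert(numValues > 0);
    uint64_t avgBytes = std::max<uint64_t>(1, totalBytes / numValues);
    return targetSize / avgBytes;
}

// Splits numValues into segment lengths using only the precomputed bound,
// so the inner loop never re-runs the compression analysis.
std::vector<uint64_t> segmentLengths(uint64_t numValues, uint64_t bound) {
    assert(bound > 0);
    std::vector<uint64_t> lengths;
    for (uint64_t remaining = numValues; remaining > 0;) {
        uint64_t take = std::min(remaining, bound);
        lengths.push_back(take);
        remaining -= take;
    }
    return lengths;
}
```

For example, with 4096-byte pages, values that compress to 12 bits each (2730 per page), and a 256KB target, `maxValuesPerSegment` gives 64 × 2730 = 174,720 values. `naiveMaxValues(1500000, 1000000, 262144)` truncates 1.5 bytes per value to 1 and returns 262,144 values, which would need about 384KB. In the real fix, each segment produced this way would still get the single verification call described above.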


adsharma · 2026-07-04T01:06:17Z

Why take a 10x hit on COPY? Here's the likely motivation for introducing segmentation:

Data is now stored in independently manageable segments rather than one monolithic chunk. This enables:

- More granular memory management: segments can be individually spilled to disk, flushed, or checkpointed.
- Better I/O: checkpointing can write only the affected segments instead of the entire column.
- Finer control over data size: MAX_SEGMENT_SIZE prevents any single segment from growing unbounded, which is important for databases handling large tables with wide columns.

adsharma force-pushed the fix/split-alp-performance branch from a8991cd to 8102088 on July 4, 2026 01:00

adsharma merged commit e179d72 into main Jul 4, 2026
4 checks passed

adsharma deleted the fix/split-alp-performance branch July 4, 2026 01:23
