Fix O(N²) ALP metadata recomputation in ColumnChunkData::split() #649

Merged
adsharma merged 1 commit into main from fix/split-alp-performance
Jul 4, 2026
Conversation

@adsharma

@adsharma adsharma commented Jul 4, 2026

Contributor

Fixes: #645

# kuzu 0.11.3
--- 7. Create REL table (Owns) with 500,000 edges via COPY ---
  COPY completed in 0.25s (0.49 µs/edge)

vs

# https://github.com/kuzudb/kuzu/commit/9f5acab2e  + this commit
--- 7. Create REL table (Owns) with 500,000 edges via COPY ---
  COPY completed in 2.19s (4.37 µs/edge)

So COPY is still about 9x slower than 0.11.3 (2.19s vs 0.25s) because of segmentation and the more accurate estimate of how the DOUBLE column would compress, but no longer 450x slower.

The Segmentation commit (8950913) introduced split(), which called getSizeOnDiskInMemoryStats() every 64 values in a tight inner loop. For DOUBLE/FLOAT columns, each call triggers the full ALP compression analysis (find_top_k_combinations + createFloatMetadata), creating O(N²) behavior: ~512 metadata recomputations per 256KB segment instead of one.
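
To make the cost concrete, here is a rough back-of-the-envelope sketch. It is not Kuzu code: the constants and function names are hypothetical, and it assumes a simplified cost model in which every size probe re-analyzes all values accumulated in the segment so far.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constants chosen to match the figures quoted above.
constexpr uint64_t kSegmentBytes = 256 * 1024; // 256KB target segment
constexpr uint64_t kProbeStride = 64;          // size check every 64 values

// Number of size probes issued while filling one segment with
// uncompressed fixed-width values of the given width.
constexpr uint64_t probesPerSegment(uint64_t bytesPerValue) {
    return kSegmentBytes / bytesPerValue / kProbeStride;
}

// Total values analyzed when every probe rescans everything accumulated
// so far (assumed cost model, used only to show the quadratic growth).
uint64_t valuesScannedWithReprobing(uint64_t numValues) {
    assert(numValues % kProbeStride == 0);
    uint64_t scanned = 0;
    for (uint64_t filled = kProbeStride; filled <= numValues; filled += kProbeStride) {
        scanned += filled; // one full analysis over `filled` values
    }
    return scanned;
}
```

Under these assumptions a segment of 8-byte DOUBLE values holds 32,768 values, so it is probed 512 times, and the probes together touch roughly 8.4 million values where a single pass would touch 32,768.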

Fix this by computing the compression metadata once for the whole chunk, then deriving a count-based bound (maxPages × valuesPerPage) to use as the inner loop condition for fixed-size types. The bound respects page boundaries and avoids the integer-division bias of a simple avg-bytes-per-value approach. A single verification call per segment confirms the segment fits within targetSize.
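
The bound can be sketched as follows. This is an illustrative sketch rather than the code in this PR: the function names are hypothetical, and it assumes values are bit-packed at a fixed width taken from the one-time metadata computation, with no per-page header or exception overhead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Values that fit in one page when each value is bit-packed at `bitWidth` bits.
// Values are assumed not to straddle page boundaries.
uint64_t valuesPerPage(uint64_t pageSize, uint64_t bitWidth) {
    assert(pageSize > 0 && bitWidth > 0);
    return pageSize * 8 / bitWidth;
}

// Count-based bound: whole pages that fit in targetSize, times values per page.
uint64_t maxValuesPerSegment(uint64_t targetSize, uint64_t pageSize, uint64_t bitWidth) {
    const uint64_t maxPages = targetSize / pageSize;
    return maxPages * valuesPerPage(pageSize, bitWidth);
}

// For comparison: a naive bound from the average bytes per value. The integer
// division truncates (12 bits becomes 1 byte), so the result overshoots.
uint64_t naiveMaxValues(uint64_t targetSize, uint64_t bitWidth) {
    const uint64_t avgBytes = std::max<uint64_t>(1, bitWidth / 8);
    return targetSize / avgBytes;
}
```

With an assumed 4096-byte page, a 256KB target, and 12-bit values, the page-based bound gives 64 × 2730 = 174,720 values, while the naive average claims 262,144 values fit, which would overflow the target by about 50%.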

Variable-size types (strings, lists) don't trigger expensive ALP analysis, so they keep the original metadata-based check.
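
The two code paths might look roughly like the sketch below. Everything here is hypothetical (`SizeFn`, `splitLengths`, and the parameters are not Kuzu APIs), and the shrink-by-one-stride fallback when verification fails is an assumption, since the description does not say what happens in that case.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical size oracle: on-disk bytes for values [start, start + count).
// For FLOAT/DOUBLE this stands in for the expensive metadata computation.
using SizeFn = std::function<uint64_t(uint64_t start, uint64_t count)>;

// Returns the length of each segment.
// fixedBound > 0: fixed-size type, jump straight to the count bound and verify once.
// fixedBound == 0: variable-size type, grow the segment and probe every `stride` values.
std::vector<uint64_t> splitLengths(uint64_t numValues, uint64_t targetSize,
    uint64_t fixedBound, uint64_t stride, const SizeFn& sizeOf) {
    assert(stride > 0);
    std::vector<uint64_t> lengths;
    uint64_t start = 0;
    while (start < numValues) {
        const uint64_t remaining = numValues - start;
        uint64_t count = 0;
        if (fixedBound > 0) {
            count = std::min(fixedBound, remaining);
            // Usually a single verification call; shrinking is an assumed fallback.
            while (sizeOf(start, count) > targetSize && count > stride) {
                count -= stride;
            }
        } else {
            count = std::min(stride, remaining);
            while (count < remaining) {
                const uint64_t next = std::min(count + stride, remaining);
                if (sizeOf(start, next) > targetSize) {
                    break;
                }
                count = next;
            }
        }
        lengths.push_back(count);
        start += count;
    }
    return lengths;
}
```

For 1000 eight-byte values, a 2048-byte target, and a stride of 64, both paths produce the same segments (256, 256, 256, 232), but the bounded path calls the size oracle 4 times and the probing path 15 times. The probing path is acceptable only when each call is cheap, which is the case the description makes for strings and lists.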

@adsharma adsharma force-pushed the fix/split-alp-performance branch from a8991cd to 8102088 Compare July 4, 2026 01:00
@adsharma

adsharma commented Jul 4, 2026

Contributor Author

Why take a 10x hit on COPY? Here's the likely motivation for introducing segmentation:

Data is now stored in independently-manageable segments rather than one monolithic chunk. This enables:

  1. More granular memory management — segments can be individually spilled to disk, flushed, or checkpointed.
  2. Better I/O — checkpointing can write only the affected segments instead of the entire column.
  3. Finer control over data size — MAX_SEGMENT_SIZE prevents any single segment from growing unbounded, which is important for databases handling large tables with wide columns.

@adsharma adsharma merged commit e179d72 into main Jul 4, 2026
4 checks passed
@adsharma adsharma deleted the fix/split-alp-performance branch July 4, 2026 01:23

Development

Successfully merging this pull request may close these issues.

Importing FLOAT/DOUBLE extremely slow (450x) due to alp::AlpEncode<> being slow
