Skip to content

Conversation

@zeroshade
Copy link
Member

@zeroshade zeroshade commented Feb 6, 2026

Rationale for this change

Writing large byte array values (e.g., 50 values x 50MB each = 2.5GB total) caused a panic due to exceeding max page size.

This happened because the writer accumulates the values in batches before checking the page size limits:

  1. WriteBatch() calls writeValues() which adds ALL values to the encoder buffer
  2. commitWriteAndCheckPageLimit() checks if the buffer exceeds the limit
  3. PROBLEM: At this point, the buffer alraedy contains > 2GB of data if we hit the limit
  4. FlushCurrentPage() attempts to do int32(values.Len()) which overflows: 2,500,000,000 -> -1,794,967,296
  5. bytes.Buffer.Grow(-1,794,967,296) panics

See #622 (comment)

What changes are included in this PR?

Modified writeValues() and writeValuesSpaced() for ByteArray and FixedLenByteArray types to check the buffer size beore adding the values and proactively flush when approaching the 2GB limit (parquet uses an int32 for page size).

Are these changes tested?

Yes, new tests are added, including some benchmarks to ensure that the new changes don't cause any performance impacts.

Performance Impact

TL;DR: <1% overhead for typical workloads, 0% for fixed-size types

Benchmarks

Benchmark                            Time       Data        Throughput
─────────────────────────────────────────────────────────────────────
WriteSmallByteArrayValues (100B)    2.19 ms    1 MB        457 MB/s
WriteMediumByteArrayValues (10KB)   18.0 ms   10 MB        556 MB/s
WriteLargeByteArrayValues (1MB)      137 ms  100 MB        730 MB/s
WriteInt32Values (control)          0.15 ms 0.04 MB        267 MB/s (unchanged)

Impact by Data Type

Data Type Overhead Notes
Int32, Int64, Float, Boolean 0% Unchanged code paths
ByteArray (small, <1KB) <1% Batched processing
ByteArray (large, >1MB) <0.01% I/O dominates, checking negligible

Per-Value Overhead

Value Size Encoding Time Added Overhead % Impact
100 bytes 200 ns ~10 ns ~5%
1 KB 2,000 ns ~10 ns ~0.5%
100 KB 200,000 ns ~10 ns ~0.005%
1 MB+ 2,000,000 ns ~120 ns ~0.006%

Are there any user-facing changes?

Only the fix to the previous situation that would panic.

@zeroshade zeroshade requested review from kou and lidavidm February 6, 2026 02:00
@zeroshade zeroshade merged commit bbf7ab7 into apache:main Feb 6, 2026
57 of 59 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants