fix(parquet/file): write large string values #655
Merged
Rationale for this change
Writing large byte array values (e.g., 50 values x 50MB each = 2.5GB total) caused a panic due to exceeding max page size.
This happened because the writer accumulates the values in batches before checking the page size limits:
1. WriteBatch() calls writeValues(), which adds ALL values to the encoder buffer
2. commitWriteAndCheckPageLimit() checks if the buffer exceeds the limit
3. FlushCurrentPage() attempts to do int32(values.Len()), which overflows: 2,500,000,000 -> -1,794,967,296
4. bytes.Buffer.Grow(-1,794,967,296) panics

See #622 (comment)
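For reviewers who want to see the arithmetic, here is a minimal standalone sketch of the overflow and the resulting panic. It does not use the parquet writer at all; truncateToInt32 and growPanics are hypothetical helpers written only for this illustration, and the 50 MB value size is taken as 50,000,000 bytes to match the numbers above.

```go
package main

import (
	"bytes"
	"fmt"
)

// truncateToInt32 mimics an int32(values.Len()) style conversion.
// Hypothetical helper, not part of the parquet package.
func truncateToInt32(n int64) int32 {
	return int32(n) // wraps around when n > math.MaxInt32
}

// growPanics reports whether bytes.Buffer.Grow panics for the given count.
// Hypothetical helper, not part of the parquet package.
func growPanics(n int) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	var buf bytes.Buffer
	buf.Grow(n) // Grow panics if n is negative
	return false
}

func main() {
	const valueSize = int64(50_000_000) // 50 MB per value (decimal megabytes)
	total := 50 * valueSize             // 50 values buffered before any flush

	wrapped := truncateToInt32(total)
	fmt.Println(total, "->", wrapped) // 2500000000 -> -1794967296

	fmt.Println("Grow panics:", growPanics(int(wrapped))) // Grow panics: true
}
```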
What changes are included in this PR?
Modified writeValues() and writeValuesSpaced() for ByteArray and FixedLenByteArray types to check the buffer size before adding the values and proactively flush when approaching the 2GB limit (parquet uses an int32 for page size).
Are these changes tested?
Yes, new tests are added, including some benchmarks to ensure that the new changes don't cause any performance impacts.
Performance Impact
TL;DR: <1% overhead for typical workloads, 0% for fixed-size types
Benchmarks
Impact by Data Type
Per-Value Overhead
Are there any user-facing changes?
No, apart from the fix itself: writing large values that previously caused a panic no longer panics.