Add regression tests for Parquet large binary offset overflow #9361
Conversation
Pull request overview
This PR adds regression tests for a Parquet reader issue where reading files with more than 2GB of binary data in a single RecordBatch causes an offset overflow panic. The tests document the failure mode across multiple Parquet encodings and are marked as ignored to avoid excessive memory usage in CI.
Changes:
- Added test file large_string_overflow.rs with four regression tests covering different Parquet encodings (PLAIN, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY, and RLE_DICTIONARY) for large binary data
- Tests create Parquet files exceeding 2GB when decoded into a single BinaryArray
- All tests are marked with #[ignore] to document the existing failure without causing CI issues
fn large_binary_rle_dictionary_encoding_overflow() {
    let array = make_large_binary_array();
    let file =
        write_parquet_with_encoding(array, Encoding::RLE_DICTIONARY);
Copilot AI (Feb 5, 2026)
This test will panic when run. The WriterProperties builder at line 40 calls set_encoding(encoding) with Encoding::RLE_DICTIONARY, but according to the parquet properties implementation (parquet/src/file/properties.rs:1087-1089), calling set_encoding with RLE_DICTIONARY or PLAIN_DICTIONARY will panic with "Dictionary encoding can not be used as fallback encoding".
To enable dictionary encoding, you should only use set_dictionary_enabled(true) without calling set_encoding(Encoding::RLE_DICTIONARY). The dictionary encoding will be used automatically when dictionary is enabled.
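For illustration, a minimal sketch of that setup, assuming the same WriterProperties builder used in this test helper:

// Sketch only: enable dictionary encoding without specifying a fallback encoding;
// the writer then uses RLE_DICTIONARY for dictionary-encoded pages automatically.
let props = WriterProperties::builder()
    .set_dictionary_enabled(true)
    .build();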
let props = WriterProperties::builder()
    .set_dictionary_enabled(true)
    .set_encoding(encoding)
    .build();
Copilot AI (Feb 5, 2026)
The write_parquet_with_encoding function sets both set_dictionary_enabled(true) and set_encoding(encoding). According to the WriterProperties documentation, when dictionary is enabled and a specific encoding is set, the specified encoding becomes the "fallback" encoding (see parquet/src/file/properties.rs:1079-1081). This means these tests may not actually be testing the specified encodings directly, but rather testing dictionary encoding with these as fallbacks.
For more accurate testing of specific encodings, consider setting set_dictionary_enabled(false) for non-dictionary encodings (PLAIN, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY). The dictionary test should only use set_dictionary_enabled(true) without calling set_encoding.
Suggested change:

-    let props = WriterProperties::builder()
-        .set_dictionary_enabled(true)
-        .set_encoding(encoding)
-        .build();
+    let mut builder = WriterProperties::builder();
+    // For non-dictionary encodings, disable dictionary and set the encoding directly.
+    // For dictionary encoding, enable dictionary and rely on the default dictionary behavior.
+    let builder = match encoding {
+        Encoding::RLE_DICTIONARY => builder.set_dictionary_enabled(true),
+        _ => builder
+            .set_dictionary_enabled(false)
+            .set_encoding(encoding),
+    };
+    let props = builder.build();
I think this is an important comment to resolve
Thank you for this PR @vigneshsiva11 -- it is a good start. I left some comments -- let me know what you think
let props = WriterProperties::builder()
    .set_dictionary_enabled(true)
    .set_encoding(encoding)
    .build();
I think this is an important comment to resolve
@@ -0,0 +1,116 @@
// Licensed to the Apache Software Foundation (ASF) under one
In order for this test to actually be run, I think you need to add it to mod.rs
As it is, the file is ignored and the tests are not run:
index ffc36655b3..33fc59d494 100644
--- a/parquet/tests/arrow_reader/mod.rs
+++ b/parquet/tests/arrow_reader/mod.rs
@@ -48,6 +48,7 @@ mod io;
 mod predicate_cache;
 mod row_filter;
 mod statistics;
+mod large_string_overflow;

 // returns a struct array with columns "int32_col", "float32_col" and "float64_col" with the specified values
 fn struct_array(input: Vec<(Option<i32>, Option<f32>, Option<f64>)>) -> ArrayRef {

When I add that, this file doesn't compile.
I had to change the array building to:
fn make_large_binary_array() -> ArrayRef {
    let value = vec![b'a'; VALUE_SIZE];
    let mut builder = BinaryBuilder::new();
    for _ in 0..ROWS {
        builder.append_value(&value);
    }
    Arc::new(builder.finish())
}

And then it seems to work well.
}

#[test]
#[ignore = "regression test for >2GB binary offset overflow"]
Rather than ignoring the tests, can you please
- add an assertion that it should panic?
- Add a comment with a link to the ticket?
That way we are sure the test properly covers what you want
For example
#[test]
// Panics until https://github.com/apache/arrow-rs/issues/7973 is fixed
#[should_panic(expected = "byte array offset overflow")]
fn large_binary_plain_encoding_overflow() {
Hi @alamb, Thanks for the feedback. I’ve updated the PR to address your comments:
alamb left a comment
Thanks @vigneshsiva11 -- getting close
}

#[test]
// Panics until https://github.com/apache/arrow-rs/issues/7973 is fixed
Can you please apply this same pattern to the other tests in this file (so they all use should_panic and have a comment)?
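For example, applied to another encoding it could look roughly like this (the helper names follow the snippets above; the read-back helper and the exact panic text are assumptions):

#[test]
// Panics until https://github.com/apache/arrow-rs/issues/7973 is fixed
#[should_panic(expected = "byte array offset overflow")]
fn large_binary_delta_length_encoding_overflow() {
    let array = make_large_binary_array();
    let file = write_parquet_with_encoding(array, Encoding::DELTA_LENGTH_BYTE_ARRAY);
    // Hypothetical helper that reads the file back into RecordBatches,
    // triggering the offset overflow when building the BinaryArray.
    read_all_batches(file);
}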
Hi @alamb, thanks for the review!
I’ve updated the PR to apply the same pattern to all tests in this file:
- Converted the remaining cases to use #[should_panic(expected = "byte array offset overflow")]
- Added a comment on each test linking to #7973 to document why the panic is currently expected
- Removed the #[ignore] markers so all cases are explicitly asserted
I’ve also reformatted the code and confirmed CI passes. Please let me know if this looks good or if you’d like any further adjustments.
Thanks again!
This failure looks like a transient CI issue on Windows while downloading protoc ('protoc-21.4-win64.zip' unzip error). There are no changes in this PR related to CI or protoc. Could someone please re-run the failing job?

Thanks @vigneshsiva11 -- I restarted the windows job
alamb left a comment
Thank you @vigneshsiva11 🙏
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
#[test]
// Panics until https://github.com/apache/arrow-rs/issues/7973 is fixed
#[should_panic(expected = "byte array offset overflow")]
fn large_binary_plain_encoding_overflow() {
Copilot AI (Feb 10, 2026)
These tests allocate/write ~3GB of binary data and currently lack #[ignore], which will likely cause OOM/timeouts in CI. This also contradicts the PR description stating the tests are ignored. Consider marking these tests #[ignore] (or gating behind an env flag), and make the assertions reflect the intended post-fix behavior (i.e., successful RecordBatch decoding) rather than using #[should_panic].
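One possible shape for that, sketched under the assumption that the file's existing helpers and a ROWS constant are available (the read-back helper here is hypothetical):

#[test]
// Allocates roughly 3GB; run manually with `cargo test -- --ignored`
#[ignore = "allocates >2GB of binary data"]
fn large_binary_plain_encoding_roundtrip() {
    let array = make_large_binary_array();
    let file = write_parquet_with_encoding(array, Encoding::PLAIN);
    // Hypothetical helper returning the total number of rows read back.
    let total_rows = read_all_batches(file);
    // Post-fix expectation: decoding succeeds instead of panicking.
    assert_eq!(total_rows, ROWS);
}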
// Panics until https://github.com/apache/arrow-rs/issues/7973 is fixed
#[should_panic(expected = "byte array offset overflow")]
fn large_binary_plain_encoding_overflow() {
Copilot AI (Feb 10, 2026)
#[should_panic(expected = "byte array offset overflow")] doesn't appear to match the actual overflow error text produced by the Parquet arrow reader (e.g. "index overflow decoding byte array" from parquet/src/arrow/buffer/offset_buffer.rs). As written, these tests may fail even when the overflow occurs. If you keep a panic-based test, align the expected substring with the real message, or (preferably) assert on the returned Err explicitly instead of relying on unwrap() panics.
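A sketch of the Err-based variant, assuming the overflow surfaces as an error from the reader rather than an internal panic (the batch-size choice and the "overflow" error text are assumptions, not taken from this PR):

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

// Decode all rows into a single batch so the >2GB of binary data lands in one BinaryArray.
let reader = ParquetRecordBatchReaderBuilder::try_new(file)
    .unwrap()
    .with_batch_size(ROWS)
    .build()
    .unwrap();
let result: Result<Vec<_>, _> = reader.collect();
let err = result.expect_err("decoding >2GB into a single BinaryArray should fail");
assert!(err.to_string().contains("overflow"), "unexpected error: {err}");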
mod checksum;
mod int96_stats_roundtrip;
mod io;
mod large_string_overflow;
Copilot AI (Feb 10, 2026)
The test module/file name (large_string_overflow) is misleading: this code is exercising DataType::Binary / BinaryBuilder, not strings. Renaming to something like large_binary_overflow (and updating the mod declaration) would make the purpose clearer and align with the PR title.
Suggested change:

-mod large_string_overflow;
+mod large_binary_overflow;
Which issue does this PR close?
Rationale for this change
This PR adds regression coverage for an offset overflow panic encountered when reading Parquet files with large binary columns using the Arrow Parquet reader.
The issue occurs when a single RecordBatch contains more than 2GB of binary data and is decoded into a 'BinaryArray' using 32-bit offsets. Similar failures are observed across multiple Parquet encodings. Adding regression tests helps document the failure mode and provides a foundation for validating a follow-up fix.
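As a rough illustration of the arithmetic (the concrete row count and value size are assumptions, not the constants used in this PR):

// i32 offsets can address at most 2^31 - 1 (~2.1 billion) bytes in a single BinaryArray.
// With, say, 3,000 values of 1 MiB each, the final offset would be
// 3,000 * 1,048,576 = 3,145,728,000 bytes, which no longer fits in an i32.
const ROWS: usize = 3_000;
const VALUE_SIZE: usize = 1_048_576; // 1 MiB per value
assert!(ROWS * VALUE_SIZE > i32::MAX as usize);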
What changes are included in this PR?
This PR introduces a new test file under 'parquet/tests/arrow_reader' that adds regression tests covering large binary columns for the following Parquet encodings:
- PLAIN
- DELTA_LENGTH_BYTE_ARRAY
- DELTA_BYTE_ARRAY
- RLE_DICTIONARY
The tests construct Parquet files that exceed the 32-bit offset limit when decoded into a single Arrow 'BinaryArray' and assert successful RecordBatch decoding.
Are these changes tested?
Yes. This PR consists entirely of regression tests.
The tests are currently marked as ignored to avoid excessive memory usage in CI and to document the existing failure mode. They are intended to be enabled once the underlying reader logic is updated to safely handle large binary data without overflowing offsets.
Are there any user-facing changes?
No. This PR only adds tests and does not introduce any user-facing changes.