Parquet: prevent binary offset overflow by stopping batch early #9362
Which issue does this PR close?
Rationale for this change
When reading Parquet files containing very large binary or string values, the Arrow Parquet reader can attempt to construct a RecordBatch whose concatenated value buffer exceeds the maximum offset representable by 32-bit offsets (i32::MAX for non-large binary/string arrays). This leads to an overflow error or a panic during decoding.
Instead of allowing the buffer to overflow and failing late, the reader should detect this condition early and stop decoding the current batch before the offset exceeds the representable limit. This behavior is consistent with other Arrow implementations (for example, PyArrow), which emit smaller batches when encountering very large row groups.
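To make the intent concrete, below is a minimal, hypothetical sketch of the kind of check this relies on; the names (`would_overflow`, `pending_bytes`, `value_len`) are illustrative and are not taken from the actual reader code. The idea is to test, before appending the next value, whether doing so would push the 32-bit value offsets past `i32::MAX`, and to cut the current batch short if it would.

```rust
/// Maximum byte length addressable by 32-bit Arrow offsets.
const MAX_OFFSET: usize = i32::MAX as usize;

/// Returns true if appending `value_len` more bytes to the batch's value
/// buffer (currently holding `pending_bytes`) would push the 32-bit offsets
/// past i32::MAX, in which case the in-progress batch should be returned
/// as-is and decoding should resume in a fresh batch.
fn would_overflow(pending_bytes: usize, value_len: usize) -> bool {
    pending_bytes.saturating_add(value_len) > MAX_OFFSET
}

fn main() {
    // A batch already holding close to 2 GiB of string data cannot accept
    // another 100 MiB value without overflowing the offsets.
    let pending = (i32::MAX as usize) - 50 * 1024 * 1024;
    let large_value = 100 * 1024 * 1024;
    assert!(would_overflow(pending, large_value));

    // A small value still fits, so decoding can continue in the same batch.
    assert!(!would_overflow(pending, 1024));
}
```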
What changes are included in this PR?
Are these changes tested?
Yes.
Note: Some Parquet and Arrow integration tests require external test data provided via git submodules (parquet-testing and testing). These submodules are not present in a minimal local checkout but are initialized in CI.
Are there any user-facing changes?
Yes. For row groups containing very large binary or string values, the reader may now return smaller batches than the configured batch size so that offsets stay within the representable limit, rather than erroring or panicking. There are no breaking changes to public APIs.