Fix skip_records over-counting when partial record precedes num_rows page skip #9374
Conversation
Thanks for tracking this down @jonded94! Rows spanning page boundaries have been the bane of my existence 😬. I want to give this a good look this afternoon.
I've looked at the code and the test, and I'm not quite sure this is the correct fix. Maybe something like:

```rust
pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
    let mut remaining_records = num_records;
    let mut all_v2_pages = true; // << add this boolean
    while remaining_records != 0 {
        if self.num_buffered_values == self.num_decoded_values {
            ...
            if let Some(rows) = rows {
                // this short-circuits decoding levels, but can be incorrect
                // if V1 pages are also present
                if rows <= remaining_records && all_v2_pages { // << add to this test
                    self.page_reader.skip_next_page()?;
                    remaining_records -= rows;
                    continue;
                }
            } else {
                all_v2_pages = false; // << set to false if we encounter a V1 page
            }
```
Hey @etseidl 👋 Thanks for your review! I'm not super well versed in data page decoding, but I agree. I'm really sorry I can't invest more time into this right now, but I hope what I committed is now finally correct!
etseidl left a comment:
```rust
// offset index), its records are self-contained, so the
// partial from the previous page is complete at this boundary.
if let Some(decoder) = self.rep_level_decoder.as_mut() {
    if decoder.flush_partial() {
```
I think this is correct. If all pages are V2, then has_partial will never be true, because V2 pages must start on a new record (R=0). If all pages are V1 this will never trigger because num_rows will be None. This case only applies when switching from V1 to V2, in which case it's appropriate to call flush_partial because, as said above, V2 pages must start at a new record.
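The invariant argued above can be sketched as a toy check (hypothetical helper names, not the crate's API): a V2 page must begin a new record, so its first repetition level is 0, which is exactly the condition under which a pending partial record from the previous page is complete:

```rust
/// A V2 data page is required to start at a record boundary, i.e. its
/// first repetition level must be 0. (Toy check, not the crate's API.)
fn v2_page_is_valid(rep_levels: &[i16]) -> bool {
    rep_levels.first().map_or(true, |&r| r == 0)
}

/// If the next page starts a new record, any partial record carried over
/// from the previous page is complete and can be flushed.
fn partial_complete_at_boundary(has_partial: bool, next_page_rep: &[i16]) -> bool {
    has_partial && next_page_rep.first() == Some(&0)
}

fn main() {
    assert!(v2_page_is_valid(&[0, 1, 1]));
    assert!(!v2_page_is_valid(&[1, 0])); // starts mid-record: V1 only
    assert!(partial_complete_at_boundary(true, &[0, 1]));
    assert!(!partial_complete_at_boundary(true, &[1, 1]));
}
```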
So that means that this bug requires a parquet file with a column chunk that has both V1 and V2 pages?
I made such a file (that triggers the error with a RowSelection): 🤯 #9395
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
run benchmark arrow_reader

I would just like to make sure this isn't causing some major regression in performance (I don't think it is), and then it looks good to me. It would be nice to add an "end to end" reproducer with an actual parquet file -- I made one for your consideration in #9395

🤖: Benchmark completed

Some of the benchmarks look like they got slower, but it might be noise. I will run them again to see.

run benchmark arrow_reader

🤖: Benchmark completed
Actually, what I stated (and tested for) here was wrong: this isn't about mixing v1 and v2 data pages, but can be reproduced with v2 data pages only. That's what my new commit 365bd9a is about. Additionally, I verified that the original file I had this issue with also only contains v2 data pages. Please read this comment for a very detailed overview of all data pages in the affected row group: #9370 (comment)

I also added a test in PR #9399 that currently (correctly) panics, and which I validated no longer panics once the bugfix of this PR is introduced.

Regarding the actual bugfix introduced in this PR: it seems to lead to some quite hefty performance regressions, especially in the benchmarks launched by @alamb.
Hey, unfortunately this is a bit out of my skill set/expertise, but here is a slightly different fix: I validated that the bugfix still lets the test in this PR and also my end-to-end regression test in #9399 succeed (i.e., for the latter I had to remove the […]). If you think the new bugfix would make more sense, I'll commit it. Claude output about why this new bugfix is potentially faster: […]
Hmm, I had assumed V2 pages always respect record boundaries. I think maybe we should only perform this short-circuit when no lists are present. Something like:

```rust
// If the page has fewer rows than the remaining records to be skipped,
// skip the entire page.
// If there are no repetition levels, num_levels == num_rows.
// We cannot trust V2 pages to always respect record boundaries, so we'll
// only perform this short-circuit when no lists are present.
if self.rep_level_decoder.is_none() {
    if let Some(rows) = metadata.num_levels {
        if rows <= remaining_records {
            self.page_reader.skip_next_page()?;
            remaining_records -= rows;
            continue;
        }
    }
}
```
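The comment in the snippet above ("if no repetition levels, num_levels == num_rows") can be illustrated with a small sketch (hypothetical helper, not the crate's API):

```rust
/// For a flat (non-list) column there are no repetition levels, so the
/// number of levels in a page equals its number of rows, and a whole page
/// can safely be skipped by level count. For a list column, one row may
/// span many levels, so the level count over-counts rows.
fn rows_in_page(rep_levels: Option<&[i16]>, num_levels: usize) -> usize {
    match rep_levels {
        None => num_levels, // flat column: one level per row
        Some(levels) => levels.iter().filter(|&&r| r == 0).count(),
    }
}

fn main() {
    assert_eq!(rows_in_page(None, 5), 5);
    // A list column: 6 levels but only 2 rows (rows start at rep level 0).
    assert_eq!(rows_in_page(Some(&[0, 1, 1, 0, 1, 1][..]), 6), 2);
}
```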
# Which issue does this PR close?

Related to #9374 and #9370.

# Rationale for this change

In #9374, I only came up with a unit test that didn't actually throw the "Not all children array length are the same!" error that issue #9370 is about. In this PR, tests are introduced that reproduce this issue end-to-end and are expected to still panic. In test `test_list_struct_page_boundary_desync_produces_length_mismatch`, we get exactly the error message the original issue is about, while `test_row_selection_list_column_v2_page_boundary_skip` shows a slightly different error message.

# What changes are included in this PR?

Two tests are introduced, which currently are expected to panic:

- test_row_selection_list_column_v2_page_boundary_skip
- test_list_struct_page_boundary_desync_produces_length_mismatch

# Are there any user-facing changes?

No
I merged up from main and "should_panic"'d the tests. I need to devote more time to studying this PR (and trying to reproduce the benchmark differences locally) tomorrow.
This is great work @jonded94 |
Hmm, never mind. This causes some async tests to fail (I suspect due to skipping the fetching/reading of pages via the page indexes). I'm testing a different approach in #9406. |
# Which issue does this PR close?

Closes #9370.

# Rationale for this change

The bug occurs when using `RowSelection` with nested types (like `List`) when: […]

The issue was in `arrow-rs/parquet/src/column/reader.rs:287-382` (the `skip_records` function).

Root cause: when `skip_records` completed successfully after crossing page boundaries, the `has_partial` state in the `RepetitionLevelDecoder` could incorrectly remain true. This happened when: […]

For a more descriptive explanation, look here: #9370 (comment)

# What changes are included in this PR?

Added code at the end of `skip_records` to reset the partial record state when all requested records have been successfully skipped. This ensures that after `skip_records` completes, we're at a clean record boundary with no lingering partial record state, fixing the array length mismatch in `StructArrayReader`.
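A minimal sketch of that idea (toy type, not the actual `RepetitionLevelDecoder`): once all requested records have been skipped, any pending partial-record flag is cleared so no stale state leaks into the next read.

```rust
/// Toy stand-in for the partial-record bookkeeping described above.
struct PartialState {
    has_partial: bool,
}

impl PartialState {
    /// Clears the pending-partial flag and reports whether one was pending,
    /// mirroring the role `flush_partial` plays in the fix described above.
    fn flush_partial(&mut self) -> bool {
        std::mem::take(&mut self.has_partial)
    }
}

fn main() {
    let mut state = PartialState { has_partial: true };
    // After skip_records has consumed all requested records, the fix
    // flushes any lingering partial so the reader sits at a clean
    // record boundary.
    assert!(state.flush_partial());
    assert!(!state.flush_partial()); // subsequent flushes are no-ops
}
```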
# Are these changes tested?

Commit 365bd9a introduces a test showcasing this issue with v2 data pages only at the unit-test level. PR #9399 could be used to showcase the issue in an end-to-end way.

Previously wrong assumption, which held that it had to do with mixing v1 and v2 data pages: […]