Optimize data page statistics conversion (up to 4x) #9303
Conversation
run benchmark arrow_statistics
let mut b = UInt8Builder::with_capacity(capacity);
for (len, index) in chunks {
    match index {
        ColumnIndexMetaData::INT32(index) => {
👍 this is great
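The snippet above dispatches on the column index variant per chunk, appending either converted values or a run of nulls into a single preallocated builder. A minimal plain-Rust sketch of that pattern, using `Vec<Option<u8>>` as a stand-in for arrow's `UInt8Builder` and a hypothetical `ChunkIndex` enum standing in for `ColumnIndexMetaData`:

```rust
// Hypothetical stand-in for parquet's ColumnIndexMetaData variants.
enum ChunkIndex {
    Int32(Vec<i32>),
    Other,
}

fn convert(chunks: Vec<(usize, ChunkIndex)>) -> Vec<Option<u8>> {
    // Preallocate once from the summed chunk lengths, mirroring
    // UInt8Builder::with_capacity(capacity) in the PR.
    let capacity: usize = chunks.iter().map(|(len, _)| *len).sum();
    let mut b: Vec<Option<u8>> = Vec::with_capacity(capacity);
    for (len, index) in chunks {
        match index {
            // Matching variant: append one converted value per page.
            ChunkIndex::Int32(values) => {
                b.extend(values.into_iter().take(len).map(|v| Some(v as u8)));
            }
            // Mismatched variant: append `len` nulls, like append_nulls(len).
            _ => b.extend(std::iter::repeat(None).take(len)),
        }
    }
    b
}

fn main() {
    let out = convert(vec![
        (2, ChunkIndex::Int32(vec![1, 2])),
        (2, ChunkIndex::Other),
    ]);
    assert_eq!(out, vec![Some(1), Some(2), None, None]);
}
```

The key point is that both arms write into one buffer sized up front, so no per-chunk arrays are allocated and then concatenated.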
🤖: Benchmark completed Details

Almost 4x even!
run benchmark arrow_statistics |
|
🤖 |
alamb left a comment
Makes sense to me -- thank you @Dandandan
        _ => b.append_nulls(len),
    }
}
Ok(Arc::new(b.finish()))
Seeing all these calls to finish with a builder that is not re-used looks wasteful to me -- I also tried to code up a PR to make this faster (both finish as well as adding a new build):
Cool, in this case it probably does not help that much (most of the time is spent during allocations / capacity checks / copies... while building the values).
Yeah, I agree the win is likely pretty small, but every little allocation helps
FWIW we could probably make it even better by not decoding into the ParquetMetadata structures at all (and directly decode Thrift to arrow arrays 🤔 )
Yes, the latter would be best (and preferably also do validation directly on read / convert the values to the right type, so it could just pass them on)
For anyone else following along, this is tracked here
Ok(UInt64Array::from_iter(iter))
let chunks: Vec<_> = iterator.collect();
let total_capacity: usize = chunks.iter().map(|(len, _)| *len).sum();
let mut builder = UInt64Builder::with_capacity(total_capacity);
this seems like it saves an allocation too
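The reason collecting the iterator first pays off: the total length becomes known, so the output buffer can be sized once instead of growing (and reallocating) as values arrive, the way `from_iter` on an iterator without a useful size hint would. A small plain-Rust sketch of this, using `Vec<u64>` as a stand-in for `UInt64Builder` and verifying via the buffer pointer that no reallocation occurs:

```rust
// Sum the chunk lengths first so the output is allocated exactly once,
// mirroring the PR's `UInt64Builder::with_capacity(total_capacity)`.
fn concat_preallocated(chunks: Vec<(usize, Vec<u64>)>) -> Vec<u64> {
    let total_capacity: usize = chunks.iter().map(|(len, _)| *len).sum();
    let mut out: Vec<u64> = Vec::with_capacity(total_capacity);
    let ptr_before = out.as_ptr();
    for (_, values) in chunks {
        out.extend(values);
    }
    // The backing buffer never moved: no reallocation happened.
    assert_eq!(out.as_ptr(), ptr_before);
    out
}

fn main() {
    let out = concat_preallocated(vec![(3, vec![1, 2, 3]), (2, vec![4, 5])]);
    assert_eq!(out, vec![1, 2, 3, 4, 5]);
}
```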
Benchmark script failed with exit code 101.
alamb left a comment
Very impressive @Dandandan -- thank you
I have one small comment (found by codex and I verified) about a potential change in FixedSizeBinary stats handling, but otherwise this looks great
}
/// Returns the null pages.
///
/// Values may be `None` when [`ColumnIndex::is_null_page()`] is `true`.
I don't understand this comment -- this API returns a vector of what pages are nulls. What is None? Or is it trying to say calling Self::min_value and Self::max_value will return None when is_null_page is true?
Oh I should have removed that one (I think the comment was autogenerated).
I thought I could pass the values slice and null_pages slice to the builder (it is another 30% or so faster) instead of the iterator code, but the null_pages unfortunately has the valid/invalid reversed compared to "validity".
but the null_pages unfortunately has the valid/invalid reversed compared to "validity".
Maybe it is time (another PR) to actually implement something like unary_mut for boolean buffers (so we could invert the bits without a new allocation)
Perhaps - although I think here we ideally leave the original in place and copy it (or avoid the copy altogether by returning a vec/iterator, preferably...)
And now I see you have already filed 👍
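The inversion discussed above (parquet's `null_pages` has 1 = null page, while arrow validity has 1 = valid) can be done without a new allocation by flipping the packed bits in place, which is the spirit of the proposed `unary_mut`-style API. A hedged sketch on a raw LSB-first bitmap; the function name and layout are illustrative, not an existing arrow-rs API:

```rust
/// Invert a little-endian packed bitmap of `len` bits in place, so a
/// parquet-style `null_pages` mask (1 = null page) becomes an
/// arrow-style validity mask (1 = valid) without a new allocation.
fn invert_bitmap_in_place(buf: &mut [u8], len: usize) {
    for b in buf.iter_mut() {
        *b = !*b;
    }
    // Clear the unused trailing bits of the last byte so they stay zero.
    let trailing = len % 8;
    if trailing != 0 {
        if let Some(last) = buf.last_mut() {
            *last &= (1u8 << trailing) - 1;
        }
    }
}

fn main() {
    // null_pages = [true, false, false, true, false], packed LSB-first.
    let mut bits = vec![0b0000_1001u8];
    invert_bitmap_in_place(&mut bits, 5);
    // validity = [false, true, true, false, true]
    assert_eq!(bits, vec![0b0001_0110]);
}
```

As the thread notes, whether in-place mutation or copying is right depends on whether the original `null_pages` buffer must stay intact for other callers.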
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
🚀 |
Which issue does this PR close?
Rationale for this change
Loading statistics is notably inefficient. This makes the conversion from the metadata structures to arrow arrays a bit faster by avoiding allocations, until we get a more efficient encoding directly (#9296)
Details
What changes are included in this PR?
Converts the inefficient iterator-based code (which doesn't really fit the iterator pattern well) to just traverse the values and use the builders. (I think it's just converting a bunch of ugly code to another bunch of ugly code).
Additionally tries to preallocate where possible.
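The per-page traversal described here boils down to: walk the decoded statistics once and push either a value or a null into a preallocated builder. A plain-Rust sketch (hypothetical `collect_stats` helper, `Vec<Option<u64>>` standing in for an arrow builder with `append_value`/`append_null`):

```rust
// Each page either carries a value (append_value) or is a null page
// (append_null); the output is sized once from the page count.
fn collect_stats(pages: &[(bool, u64)]) -> Vec<Option<u64>> {
    let mut builder = Vec::with_capacity(pages.len());
    for &(is_null_page, value) in pages {
        if is_null_page {
            builder.push(None);
        } else {
            builder.push(Some(value));
        }
    }
    builder
}

fn main() {
    let out = collect_stats(&[(false, 7), (true, 0), (false, 9)]);
    assert_eq!(out, vec![Some(7), None, Some(9)]);
}
```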
Are these changes tested?
Existing tests
Are there any user-facing changes?