Optimize from_bitwise_unary_op
#9297
Conversation
Need to address the issues (might be code that does not expect the extra padding).
arrow-buffer/src/buffer/boolean.rs
Outdated
return result;
let (prefix, aligned_u64s, suffix) =
    unsafe { aligned_start.as_ref().align_to::<u64>() };
if prefix.is_empty() && suffix.is_empty() {
Handling aligned + suffix could maybe be a bit better for x86 (I couldn't measure it on an Apple M2; I believe there is no performance difference there).
Handling both prefix + suffix was slightly slower than the unaligned version.
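As a rough sketch (not the PR's actual code; the function name is made up), the case split being discussed looks roughly like this: take the word-at-a-time path only when align_to::<u64>() yields neither a prefix nor a suffix, and otherwise fall back to the byte-oriented path:

fn apply_words_or_fallback(bytes: &[u8], op: impl Fn(u64) -> u64) -> Option<Vec<u64>> {
    // SAFETY: any bit pattern is a valid u64, so reinterpreting aligned bytes is fine.
    let (prefix, words, suffix) = unsafe { bytes.align_to::<u64>() };
    if prefix.is_empty() && suffix.is_empty() {
        // Fully aligned: apply the op one 64-bit word at a time.
        Some(words.iter().map(|w| op(*w)).collect())
    } else {
        // Prefix and/or suffix present: the caller uses the unaligned/byte path.
        None
    }
}

Handling the "aligned + non-empty suffix" case would mean also processing the suffix bytes separately instead of bailing out here.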
FYI @alamb I think it's as good as it can be now.
let aligned_start = &src.as_ref()[aligned_offset / 8..slice_end];
let (prefix, aligned_u64s, suffix) = unsafe { aligned_start.as_ref().align_to::<u64>() };
I think previous benchmark results are sometimes noisy because the underlying buffer is not always aligned to 64 bits (not 100% sure, but it would explain it) and then takes the slower (unaligned) path.
Now the slower path is not much slower. We probably also want to make sure the array creation path aligns to u64 in most cases, and that kernels also keep the alignment.
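For illustration, a hypothetical helper (not something in this PR or in arrow-buffer) showing what decides which path is taken: whether the buffer's data pointer starts on an 8-byte boundary, i.e. whether align_to::<u64>() over its bytes would have an empty prefix.

use arrow_buffer::Buffer;

// Hypothetical helper: true if the buffer's data starts on an 8-byte boundary,
// so align_to::<u64>() over its bytes would have an empty prefix.
fn is_u64_aligned(buffer: &Buffer) -> bool {
    (buffer.as_ptr() as usize) % std::mem::align_of::<u64>() == 0
}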
@jhorstmann perhaps you want to take a look?
Changed the title from from_bitwise_unary_op to from_bitwise_unary_op for byte aligned case
alamb
left a comment
This is very clever @Dandandan -- thank you
I don't understand the changes to the binary operations, and I do wonder if the "not creating aligned output" change is a concern.
arrow-buffer/src/buffer/boolean.rs
Outdated
bit_offset: 0,
bit_len: self.bit_len,
}
BooleanBuffer::from_bitwise_binary_op(
This change seems unrelated to the improvements in bitwise binary op and is perhaps the source of the 50% reported slowdown of and?
group main optimize_from_bitwise_unary_op
----- ---- ------------------------------
and 1.00 209.3±2.38ns ? ?/sec 1.48 310.3±1.11ns ? ?/sec
Hm, not sure if the change is due to this, but I agree the changes look unneeded for this PR.
I think the results for and might be noisy, just like the earlier results for not were noisy - it sometimes hits the aligned case and sometimes not (depending on whether the buffer happens to be allocated aligned or not).
(See an example of the same perf from an earlier run:)
and 1.01 209.9±3.67ns ? ?/sec 1.00 208.4±0.58ns ? ?/sec
Also, the implementation of buffer_bin_and is currently as follows (which suggests the difference should indeed be due to noise):
BooleanBuffer::from_bitwise_binary_op(
left,
left_offset_in_bits,
right,
right_offset_in_bits,
len_in_bits,
|a, b| a & b,
)
.into_inner()
(We can make the same change for the binary case, I think the speedup there might be even ~5x)
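For what that could look like, a minimal hedged sketch (names invented, not the PR's code) of applying the same aligned-word trick to a binary op, assuming both inputs are fully 8-byte aligned and of equal word length:

fn binary_words(left: &[u8], right: &[u8], op: impl Fn(u64, u64) -> u64) -> Option<Vec<u64>> {
    // SAFETY: any bit pattern is a valid u64.
    let (lp, lw, ls) = unsafe { left.align_to::<u64>() };
    let (rp, rw, rs) = unsafe { right.align_to::<u64>() };
    // Only take the fast path when both sides are fully aligned with no leftovers.
    if lp.is_empty() && ls.is_empty() && rp.is_empty() && rs.is_empty() && lw.len() == rw.len() {
        Some(lw.iter().zip(rw).map(|(l, r)| op(*l, *r)).collect())
    } else {
        None // fall back to the existing byte/bit path
    }
}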
}
BooleanBuffer::from_bits(self.as_slice(), offset, len).into_inner()
let chunks = self.bit_chunks(offset, len);
This change also seems unrelated -- perhaps we can pull it into its own PR
This is required to make it work/tests pass as into_inner throws away the bit offset and length.
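To illustrate the point, a minimal sketch with made-up values (assuming arrow-buffer's public BooleanBuffer and Buffer types): into_inner() returns only the underlying Buffer, so the bit offset and length are lost.

use arrow_buffer::{BooleanBuffer, Buffer};

fn main() {
    let data = Buffer::from(vec![0b1111_0000u8]);
    // Logical view: 4 bits starting at bit offset 4 within the byte.
    let bools = BooleanBuffer::new(data, 4, 4);
    assert_eq!((bools.offset(), bools.len()), (4, 4));
    // into_inner() hands back the raw Buffer; the offset/length are discarded.
    let raw: Buffer = bools.into_inner();
    assert_eq!(raw.len(), 1); // one raw byte, no record of the 4-bit offset
}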
let remainder = chunks.remainder();
let iter = chunks.map(|c| u64::from_le_bytes(c.try_into().unwrap()));
let vec_u64s: Vec<u64> = if remainder.is_empty() {
    iter.map(&mut op).collect()
In theory the remainder should never be empty, right? Otherwise the aligned path above would be hit.
Hm, I think the buffer itself (the address at offset 0) could still not be aligned to 64 bits and have a prefix in the path above (and thus fall through to this path), while the entire buffer from beginning to end is still a multiple of 64 bits, so the remainder is empty.
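A small standalone illustration of that situation (plain std Rust, not the PR's code): the slice starts at a misaligned address, so the align_to fast path is skipped, yet its length is a whole number of u64 words, so chunks_exact(8) leaves an empty remainder.

fn main() {
    let backing = vec![0u8; 17];
    // Deliberately misaligned start: 16 bytes beginning one byte into the allocation.
    let misaligned = &backing[1..];
    let (prefix, _, _) = unsafe { misaligned.align_to::<u64>() };
    // The prefix is usually non-empty here, so the aligned path would be skipped...
    println!("prefix bytes: {}", prefix.len());
    // ...yet the byte-chunked remainder is empty because 16 is a multiple of 8.
    assert!(misaligned.chunks_exact(8).remainder().is_empty());
}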
let result_u64s: Vec<u64> = aligned_u64s.iter().map(|l| op(*l)).collect();
let buffer = Buffer::from(result_u64s);
Some(BooleanBuffer::new(buffer, 0, len_in_bits))
BooleanBuffer::new(vec_u64s.into(), offset_in_bits % 64, len_in_bits)
This is a key difference in the two approaches -- the current code on main will produce an output buffer that is aligned (offset is 0), but this code will produce an output buffer that is not aligned (same as the input)
That is probably why the benchmark results can be so much better in this case -- because the output is different (though still correct)
This is probably ok, but I wanted to point it out as a potential side effect
Yes, that's indeed the main reason (not bitshifting to create an offset of 0, which is a ~3.5x speedup). The other part (~15% or so) is aligning to 8 bytes instead of 1 byte as much as possible, to be able to take the fast path as often as possible.
I also found that the combination of collect/from_trusted_len_iterator with either iterator is slow due to a non-existent implementation of fold / not being able to use it in from_trusted_len_iterator (which probably still makes sense to PR separately), but with chunks_exact it's not required.
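For context, a minimal hedged sketch (function name invented, not the PR's exact code) of the chunks_exact pattern being described: read 8 bytes at a time as little-endian u64 words, apply the unary op per word, and handle any trailing bytes separately.

fn unary_words(bytes: &[u8], op: impl Fn(u64) -> u64) -> Vec<u64> {
    let chunks = bytes.chunks_exact(8);
    // remainder() is known up front: the trailing bytes that don't fill a full word.
    let remainder = chunks.remainder();
    let mut out: Vec<u64> = chunks
        .map(|c| op(u64::from_le_bytes(c.try_into().unwrap())))
        .collect();
    if !remainder.is_empty() {
        // Zero-pad the last partial word before applying the op.
        let mut last = [0u8; 8];
        last[..remainder.len()].copy_from_slice(remainder);
        out.push(op(u64::from_le_bytes(last)));
    }
    out
}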
I'll mark this PR as an API change to account for this
bit_len: len_in_bits,
}
}
// align to byte boundaries
Ah I made the other path too unlikely by "aligning" input to 64 bits, let's add a case for this.
I ran coverage locally again and it still seems uncovered
cargo llvm-cov --html test -p arrow -p arrow-arith -p arrow-array -p arrow-buffer -p arrow-cast -p arrow-csv -p arrow-data -p arrow-ipc -p arrow-ord -p arrow-schema -p arrow-select -p arrow-string
I made a PR (with codex) to add appropriate test coverage
Changed the title from from_bitwise_unary_op for byte aligned case to from_bitwise_unary_op
alamb
left a comment
Thanks @Dandandan -- I will keep reviewing this in the morning.
I think there are a few more doc changes needed and I want to give this another review with a fresh pair of eyes
Which issue does this PR close?
Closes #9364 (from_bitwise_unary_op).
Rationale for this change
This is way better on non-byte offsets (not_sliced_1). It also rounds down to 64 bits instead of to a byte, so it's more likely that the aligned path is taken (not_slice_24):
What changes are included in this PR?
Use chunks_exact (stable since Rust 1.31).
Are these changes tested?
cargo test -p arrow-buffer
cargo llvm-cov --html test -p arrow-buffer
Are there any user-facing changes?
Yes (api-change). BooleanBuffer::from_bits, BooleanBuffer::from_bitwise_unary_op, and the unary not kernel now preserve a non-zero bit offset (offset % 64) when applicable, instead of always producing offset 0.
Upgrade note: if downstream code assumed unary outputs always had offset() == 0, or consumed values() directly as fully-normalized data, switch to logical access (value(i), iterators, offset() + len()), or normalize explicitly when needed.