
Conversation

@Dandandan (Contributor) commented Jan 29, 2026

Which issue does this PR close?

Rationale for this change

This is way better for non-byte-aligned offsets (not_sliced_1). It also rounds the offset down to a 64-bit boundary instead of a byte boundary, so it is more likely the aligned path is taken (not_slice_24):

main                                                    optimize_from_bitwise_unary_op
not_sliced_1     3.57    621.1±3.75ns        ? ?/sec    1.00    174.2±0.37ns        ? ?/sec
not_slice_24     1.13    194.2±0.64ns        ? ?/sec    1.00    172.2±0.78ns        ? ?/sec
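
As a rough sketch of the rounding idea (names are hypothetical, not the PR's code): instead of rounding the source offset down to the nearest byte, round it down to the nearest 64-bit boundary and carry the remaining offset_in_bits % 64 into the output's bit offset, so the 64-bit aligned path applies more often.

fn split_offset(offset_in_bits: usize) -> (usize, usize) {
    // Round down to a multiple of 64 bits; the leftover becomes the output's bit offset.
    let aligned_offset = offset_in_bits & !63;
    let bit_offset = offset_in_bits % 64;
    (aligned_offset, bit_offset)
}

fn main() {
    assert_eq!(split_offset(70), (64, 6));
    assert_eq!(split_offset(63), (0, 63));
}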

What changes are included in this PR?

  • Change the code to use the 64-bit aligned (or aligned + suffix) path as much as possible
  • Speed up the non-aligned path using chunks_exact (stable since Rust 1.31)
  • Avoid truncation so the suffix does not have to be handled later
  • Update code that used the inner buffer and assumed truncation
  • Update docs/tests to reflect non-zero unary offsets and add explicit fallback-path coverage

Are these changes tested?

  • cargo test -p arrow-buffer
  • cargo llvm-cov --html test -p arrow-buffer
  • Added targeted unit test coverage for unaligned unary fallback branches

Are there any user-facing changes?

Yes (api-change).

  • BooleanBuffer::from_bits, BooleanBuffer::from_bitwise_unary_op, and unary Not now preserve a non-zero bit offset (offset % 64) when applicable, instead of always producing offset 0
  • Unary outputs may retain padding bytes outside the logical bit range in values()

Upgrade note:

If downstream code assumed unary outputs always had offset() == 0 or consumed values() directly as fully-normalized data, switch to logical access (value(i), iterators, offset() + len()), or normalize explicitly when needed.
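
A minimal sketch of what "logical access" means here, using only public BooleanBuffer accessors (the helper is illustrative, assuming a sliced buffer with a non-zero offset):

use arrow_buffer::BooleanBuffer;

// Logical access: value(i) and len() already account for offset(),
// so padding bytes in values() are never observed.
fn count_set(buf: &BooleanBuffer) -> usize {
    (0..buf.len()).filter(|&i| buf.value(i)).count()
}

fn main() {
    let buf = BooleanBuffer::new_set(20).slice(3, 10); // non-zero bit offset
    assert_eq!(count_set(&buf), 10);
}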

@github-actions bot added the arrow (Changes to the arrow crate) label Jan 29, 2026
@Dandandan Dandandan marked this pull request as draft January 29, 2026 20:19
@Dandandan (Contributor Author) commented Jan 29, 2026

Need to address the remaining issues (there might be code that does not expect the extra padding).
We could perhaps reintroduce the truncation.

@Dandandan (Contributor Author)

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (585b9f8) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.00    208.7±5.12ns        ? ?/sec    1.01    211.5±5.11ns        ? ?/sec
and_sliced_1     1.01  1104.4±41.22ns        ? ?/sec    1.00   1095.5±2.44ns        ? ?/sec
and_sliced_24    1.00    245.8±1.88ns        ? ?/sec    1.36    335.4±0.46ns        ? ?/sec
not              1.01    145.9±0.42ns        ? ?/sec    1.00    144.8±0.28ns        ? ?/sec
not_slice_24     1.01    195.0±2.04ns        ? ?/sec    1.00    193.6±2.00ns        ? ?/sec
not_sliced_1     3.41    621.0±6.17ns        ? ?/sec    1.00    182.2±0.19ns        ? ?/sec
or               1.00    197.8±4.69ns        ? ?/sec    1.01    199.7±0.28ns        ? ?/sec
or_sliced_1      1.00  1101.3±19.05ns        ? ?/sec    1.03   1136.4±1.73ns        ? ?/sec
or_sliced_24     1.00    247.0±1.67ns        ? ?/sec    1.16    285.8±2.09ns        ? ?/sec

@Dandandan (Contributor Author)

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (ccc9fe2) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.01    208.9±2.37ns        ? ?/sec    1.00    207.5±0.38ns        ? ?/sec
and_sliced_1     1.00   1095.8±1.65ns        ? ?/sec    1.00   1096.6±6.02ns        ? ?/sec
and_sliced_24    1.00    245.7±3.72ns        ? ?/sec    1.37    335.8±1.56ns        ? ?/sec
not              1.03    146.8±2.26ns        ? ?/sec    1.00    142.0±0.71ns        ? ?/sec
not_slice_24     1.04    195.6±2.39ns        ? ?/sec    1.00    188.3±0.33ns        ? ?/sec
not_sliced_1     3.48    620.1±2.75ns        ? ?/sec    1.00    178.0±5.03ns        ? ?/sec
or               1.00    197.3±0.53ns        ? ?/sec    1.01    198.8±2.65ns        ? ?/sec
or_sliced_1      1.00   1096.2±1.38ns        ? ?/sec    1.04   1135.9±3.96ns        ? ?/sec
or_sliced_24     1.00    246.7±0.50ns        ? ?/sec    1.17    289.1±3.16ns        ? ?/sec

@Dandandan Dandandan marked this pull request as ready for review February 5, 2026 19:40
@Dandandan Dandandan requested a review from alamb February 5, 2026 19:51
return result;
let (prefix, aligned_u64s, suffix) =
unsafe { aligned_start.as_ref().align_to::<u64>() };
if prefix.is_empty() && suffix.is_empty() {
@Dandandan (Contributor Author)

Handling aligned + suffix might be a bit better on x86 (I couldn't measure it on an Apple M2; I believe there is no performance difference there).
Handling both prefix + suffix was slightly slower than the unaligned version.
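
For illustration, a hypothetical sketch of the "aligned + suffix" variant (not this PR's code): process the aligned u64 words with the op and fold the trailing bytes into one extra zero-padded word.

fn unary_aligned_with_suffix(
    aligned: &[u64],
    suffix: &[u8],
    mut op: impl FnMut(u64) -> u64,
) -> Vec<u64> {
    // Apply the op to each aligned word.
    let mut out: Vec<u64> = aligned.iter().map(|&w| op(w)).collect();
    if !suffix.is_empty() {
        // Zero-pad the trailing bytes into one full little-endian word.
        let mut last = [0u8; 8];
        last[..suffix.len()].copy_from_slice(suffix);
        out.push(op(u64::from_le_bytes(last)));
    }
    out
}

fn main() {
    // One full word plus a 2-byte suffix, NOT-ed per word.
    let out = unary_aligned_with_suffix(&[u64::MAX], &[0xFF, 0xFF], |w| !w);
    assert_eq!(out, vec![0, !0xFFFFu64]);
}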

@Dandandan (Contributor Author)

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (df25192) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.02    212.4±3.72ns        ? ?/sec    1.00    207.2±0.88ns        ? ?/sec
and_sliced_1     1.01   1101.7±5.11ns        ? ?/sec    1.00   1091.6±1.36ns        ? ?/sec
and_sliced_24    1.00    248.1±4.08ns        ? ?/sec    1.34    332.7±1.07ns        ? ?/sec
not              1.04    148.9±3.67ns        ? ?/sec    1.00    143.0±0.99ns        ? ?/sec
not_slice_24     1.03    197.0±4.09ns        ? ?/sec    1.00    191.6±0.48ns        ? ?/sec
not_sliced_1     3.57    621.1±3.75ns        ? ?/sec    1.00    174.2±0.37ns        ? ?/sec
or               1.00    199.4±3.54ns        ? ?/sec    1.00    199.8±0.71ns        ? ?/sec
or_sliced_1      1.00  1112.4±44.84ns        ? ?/sec    1.02   1139.1±1.96ns        ? ?/sec
or_sliced_24     1.00    251.9±8.50ns        ? ?/sec    1.14    286.5±1.98ns        ? ?/sec

@Dandandan (Contributor Author)

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (6e95b3a) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.02    209.7±5.22ns        ? ?/sec    1.00    206.1±0.60ns        ? ?/sec
and_sliced_1     1.00   1096.4±3.50ns        ? ?/sec    1.00  1092.0±20.97ns        ? ?/sec
and_sliced_24    1.00    245.4±1.05ns        ? ?/sec    1.34    329.6±1.54ns        ? ?/sec
not              1.01    146.2±0.59ns        ? ?/sec    1.00    144.6±2.28ns        ? ?/sec
not_slice_24     1.13    194.2±0.64ns        ? ?/sec    1.00    172.2±0.78ns        ? ?/sec
not_sliced_1     3.60    619.4±2.46ns        ? ?/sec    1.00    172.1±0.73ns        ? ?/sec
or               1.00    196.5±1.35ns        ? ?/sec    1.01    197.5±0.76ns        ? ?/sec
or_sliced_1      1.00  1100.6±14.77ns        ? ?/sec    1.04   1139.3±8.90ns        ? ?/sec
or_sliced_24     1.00    247.2±1.04ns        ? ?/sec    1.16    286.2±2.74ns        ? ?/sec

@Dandandan (Contributor Author)

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (cf32fcb) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.01    209.9±3.67ns        ? ?/sec    1.00    208.4±0.58ns        ? ?/sec
and_sliced_1     1.01   1098.0±9.70ns        ? ?/sec    1.00   1088.3±8.18ns        ? ?/sec
and_sliced_24    1.00    245.0±1.25ns        ? ?/sec    1.34    329.3±2.20ns        ? ?/sec
not              1.67    239.3±2.81ns        ? ?/sec    1.00    143.1±1.07ns        ? ?/sec
not_slice_24     1.31    227.1±0.56ns        ? ?/sec    1.00    173.2±2.35ns        ? ?/sec
not_sliced_1     3.70    641.0±5.75ns        ? ?/sec    1.00    173.4±4.10ns        ? ?/sec
or               1.15    229.2±4.99ns        ? ?/sec    1.00    199.6±1.35ns        ? ?/sec
or_sliced_1      1.00  1123.3±11.82ns        ? ?/sec    1.02  1141.6±16.14ns        ? ?/sec
or_sliced_24     1.00    282.8±1.55ns        ? ?/sec    1.01    286.4±1.70ns        ? ?/sec

@Dandandan (Contributor Author)

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (cf32fcb) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.00    209.3±2.38ns        ? ?/sec    1.48    310.3±1.11ns        ? ?/sec
and_sliced_1     1.01   1096.9±4.72ns        ? ?/sec    1.00   1088.9±8.76ns        ? ?/sec
and_sliced_24    1.00    244.8±0.94ns        ? ?/sec    1.35    330.2±3.40ns        ? ?/sec
not              1.02    146.4±1.28ns        ? ?/sec    1.00    143.2±1.52ns        ? ?/sec
not_slice_24     1.12    194.0±0.38ns        ? ?/sec    1.00    172.8±0.68ns        ? ?/sec
not_sliced_1     3.58    619.9±7.97ns        ? ?/sec    1.00    173.0±1.64ns        ? ?/sec
or               1.00    196.7±1.83ns        ? ?/sec    1.01    199.4±0.89ns        ? ?/sec
or_sliced_1      1.00   1098.2±3.17ns        ? ?/sec    1.04  1141.9±19.01ns        ? ?/sec
or_sliced_24     1.00    244.3±0.89ns        ? ?/sec    1.18    287.6±2.47ns        ? ?/sec

@Dandandan (Contributor Author)

FYI @alamb I think it's as good as it can be now.


let aligned_start = &src.as_ref()[aligned_offset / 8..slice_end];

let (prefix, aligned_u64s, suffix) = unsafe { aligned_start.as_ref().align_to::<u64>() };
@Dandandan (Contributor Author) commented Feb 6, 2026

I think the previous benchmark results are sometimes noisy because the underlying buffer is not always aligned to 64 bits (not 100% sure, but it would explain it), so they sometimes take the slower (unaligned) path.
Now the slower path is not much slower. We probably also want to make sure the array creation path aligns to u64 in most cases, and that kernels keep the alignment.

@Dandandan (Contributor Author)

@jhorstmann perhaps you want to take a look?

@alamb changed the title from "Optimize from_bitwise_unary_op" to "Optimize from_bitwise_unary_op for byte aligned case" Feb 8, 2026
@alamb (Contributor) left a comment

This is very clever @Dandandan -- thank you

I don't understand the changes to the binary operations, and I do wonder if the "not creating aligned output" change is a concern.

bit_offset: 0,
bit_len: self.bit_len,
}
BooleanBuffer::from_bitwise_binary_op(
Contributor

This change seems unrelated to the improvements in bitwise binary op and is perhaps the source of the 50% reported slowdown of and?

group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.00    209.3±2.38ns        ? ?/sec    1.48    310.3±1.11ns        ? ?/sec

@Dandandan (Contributor Author)

Hm, not sure if the slowdown is due to this change, but I agree the changes look unneeded for this PR.

@Dandandan (Contributor Author) commented Feb 8, 2026

I think the results for and might be noisy, just like the earlier results for not were noisy: it sometimes hits the aligned case and sometimes not (depending on whether the buffer happens to be allocated aligned).

(See the same performance in an earlier run:)

and              1.01    209.9±3.67ns        ? ?/sec    1.00    208.4±0.58ns        ? ?/sec

Also, the implementation of buffer_bin_and is currently as follows (suggesting the difference should indeed be noise):

BooleanBuffer::from_bitwise_binary_op(
    left,
    left_offset_in_bits,
    right,
    right_offset_in_bits,
    len_in_bits,
    |a, b| a & b,
)
.into_inner()

@Dandandan (Contributor Author)

(We can make the same change for the binary case; I think the speedup there might even be ~5x.)

}

BooleanBuffer::from_bits(self.as_slice(), offset, len).into_inner()
let chunks = self.bit_chunks(offset, len);
Contributor

This change also seems unrelated -- perhaps we can pull it into its own PR

@Dandandan (Contributor Author)

This is required to make the tests pass, as into_inner throws away the bit offset and length.
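
A small sketch of that point (illustrative values, not this PR's code): into_inner returns only the packed bytes, so the bit offset and bit length have to be carried along separately.

use arrow_buffer::BooleanBuffer;

fn main() {
    let sliced = BooleanBuffer::new_set(32).slice(5, 20);
    assert_eq!((sliced.offset(), sliced.len()), (5, 20));
    // into_inner drops the offset/length metadata and yields the raw Buffer.
    let raw = sliced.into_inner();
    assert_eq!(raw.len(), 4); // 32 bits of backing storage = 4 bytes
}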

let remainder = chunks.remainder();
let iter = chunks.map(|c| u64::from_le_bytes(c.try_into().unwrap()));
let vec_u64s: Vec<u64> = if remainder.is_empty() {
iter.map(&mut op).collect()
Contributor

In theory the remainder should never be empty, right? Otherwise the aligned path above would be hit.

@Dandandan (Contributor Author) commented Feb 8, 2026

Hm, I think the buffer itself (the address at offset 0) could still be unaligned to 64 bits and have a non-empty prefix in the path above (and thus fall through to this path). The buffer could still span a multiple of 64 bits from beginning to end, in which case the remainder here is empty.
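
A standalone sketch of that situation (assumed setup, not the PR's code): align_to::<u64> splits a byte slice into a prefix (bytes before the next 8-byte boundary), an aligned middle of u64s, and a suffix. A non-empty prefix sends us to the fallback path even when the slice length is a multiple of 8 bytes, in which case chunks_exact(8) has an empty remainder.

fn main() {
    let words = vec![0u64; 3]; // 8-byte aligned backing storage
    let bytes = unsafe {
        std::slice::from_raw_parts(words.as_ptr().cast::<u8>(), words.len() * 8)
    };
    let view = &bytes[1..17]; // 16 bytes, deliberately starting off-boundary
    let (prefix, aligned, suffix) = unsafe { view.align_to::<u64>() };
    // With std's implementation this typically yields prefix=7, aligned=1, suffix=1.
    println!("prefix={} aligned={} suffix={}", prefix.len(), aligned.len(), suffix.len());
    // The fallback's chunks_exact(8) remainder is empty here: 16 is a multiple of 8.
    assert!(view.chunks_exact(8).remainder().is_empty());
}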

let result_u64s: Vec<u64> = aligned_u64s.iter().map(|l| op(*l)).collect();
let buffer = Buffer::from(result_u64s);
Some(BooleanBuffer::new(buffer, 0, len_in_bits))
BooleanBuffer::new(vec_u64s.into(), offset_in_bits % 64, len_in_bits)
Contributor

This is a key difference in the two approaches -- the current code on main will produce an output buffer that is aligned (offset is 0), but this code will produce an output buffer that is not aligned (same as the input)

That is probably why the benchmark results can be so much better in this case -- because the output is different (though still correct)

This is probably ok, but I wanted to point it out as a potential side effect

@Dandandan (Contributor Author) commented Feb 8, 2026

Yes, that's indeed the main reason (not bit-shifting to create an offset of 0 gives the ~3.5x speedup). The other part (~15% or so) is aligning to 8 bytes instead of 1 byte as much as possible, so the fast path can be used as often as possible.

I also found that the combination of collect/from_trusted_len_iterator with either iterator is slow, due to a missing fold implementation / not being able to use it in from_trusted_len_iterator (which probably still makes sense to PR separately), but with chunks_exact it's not required.
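
A minimal sketch of the chunks_exact fallback described above (names are illustrative, not the PR's exact code): read the unaligned bytes as little-endian u64 words, apply the unary op per word, and zero-pad the tail into one final word instead of truncating.

fn unary_words(bytes: &[u8], mut op: impl FnMut(u64) -> u64) -> Vec<u64> {
    let chunks = bytes.chunks_exact(8);
    let remainder = chunks.remainder();
    let mut out: Vec<u64> = chunks
        .map(|c| u64::from_le_bytes(c.try_into().unwrap()))
        .map(&mut op)
        .collect();
    if !remainder.is_empty() {
        // Zero-pad the tail into a full word before applying the op.
        let mut last = [0u8; 8];
        last[..remainder.len()].copy_from_slice(remainder);
        out.push(op(u64::from_le_bytes(last)));
    }
    out
}

fn main() {
    // 10 bytes -> one full word plus a 2-byte tail.
    let result = unary_words(&[0xFF; 10], |w| !w);
    assert_eq!(result.len(), 2);
}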

Contributor

I'll mark this PR as an API change to account for this

bit_len: len_in_bits,
}
}
// align to byte boundaries
Contributor

This codepath appears untested by unit tests

cargo llvm-cov --html test -p arrow-buffer
[coverage report screenshot]

@Dandandan (Contributor Author)

Ah, I made the other path too unlikely by "aligning" the input to 64 bits; let's add a test case for this.

Contributor

I ran coverage locally again and it still seems uncovered

cargo llvm-cov --html test -p arrow -p arrow-arith -p arrow-array -p arrow-buffer -p arrow-cast -p arrow-csv -p arrow-data  -p arrow-ipc -p arrow-ord -p arrow-schema -p arrow-select -p arrow-string
[coverage report screenshot]

I made a PR (with codex) to add appropriate test coverage

@Dandandan changed the title from "Optimize from_bitwise_unary_op for byte aligned case" back to "Optimize from_bitwise_unary_op" Feb 8, 2026
@alamb (Contributor) left a comment

Thanks @Dandandan -- I will keep reviewing this in the morning.

I think there are a few more doc changes needed and I want to give this another review with a fresh pair of eyes

@alamb added the api-change (Changes to the arrow API) label Feb 10, 2026

Labels

api-change (Changes to the arrow API), arrow (Changes to the arrow crate)

Development

Successfully merging this pull request may close these issues: Optimize from_bitwise_unary_op