Segmented prime sieve for fun and profit#87
Conversation
|
Is there a documentation or a paper this algorithm is based on? I would like to help review it. |
|
I think primesieve.org is a good reference. The wheel is a generalization of the idea of skipping multiples of 2 (which is easy to code), see https://en.wikipedia.org/wiki/Wheel_factorization for a good overview. If you take a {2, 3, 5}-wheel, then consider any number The next optimization after compressing the number line by a constant factor is to work on constant size segments of the number line. So instead of for (a, b) in segments # sieve a:b
fill!(segment, 1)
for p in primes_so_far
# find the smallest number q st p * q >= a.
# then remove p * q, p * (q + 1), p * (q + 2), ... until p * (q + m) > b from segment.
end
# find all primes in `segment` and push to `primes_so_far`
endthe inner loop is finicky, cause (a) you don't want to always compute the smallest q s.t. p * q >= a, instead you already know what it is thanks to the last iteration in the previous segment. and (b) you aren't going to remove multiples q, q+1, q+2, ... as you're skipping all p * (q+i)'s that aren't in the {1, 7, 11, 13, 17, 19, 23, 29} spokes -- and this is where the loop is unrolled by hand, and where all the pain and suffering comes from. There's going to be 8 loops for all possible p mod 30's, and 8 steps inside that loop for all q mod 30 values, leading to 64 entrypoints into the loops, and the main optimization is going to be here that instead of doing "cross off q * p, increment q, check if p * q is inside the segment" you're adding a hot loop where you're checking if you can cross off e.g. p * [q, q + 6, q + 10, q + 12, q + 16, q + 18, q + 22, q + 28] at once. |
|
Thank you. I am already familiar with wheel factorization and the the Segmented Sieve (Bays-Hudson). I have to think a bit about combining the two. I appreciate the long response and I will think about it. I really like the idea of combining the 8 bits into one byte. |
|
I have not thought about it very much, but this paper has some other little optimizations, under Algorithms B and C. https://link.springer.com/content/pdf/10.1007/BF01932283.pdf Is your segmented sieve similar to this? Or are you using a completely different approach which is incompatible. Maybe it is incompatible, because this paper basically uses the usual 2-wheel, but I really have to think about it. |
|
Want to merge update and this? It's a lot of extra code, but I've become convinced that something like this is necessary for optimal performance. |
1 similar comment
|
Want to merge update and this? It's a lot of extra code, but I've become convinced that something like this is necessary for optimal performance. |
Closes #45.
I have this code lying around for a bit now, but never turned it into a PR. Now I'm turning it into a PR to force myself to finish it 😅. It could use some tests and maybe some code style improvements / cleanup.
What it brings is an iterator:
which makes some operations very fast, as they operate on O(1) memory instead of O(n). For reference:
this is 16x faster.
There's some neat optimizations: you can pick a buffer size to make things fit in L1 cache (above it's 32kb); the inner loop is unrolled by hand such that 8 multiples are removed per iteration; memory is compressed s.t. 1 byte equals an interval of 30 numbers (through a {2, 3, 5}-wheel); multiples of {7, 11, 13, 17} are "pre-sieved" in the sense that a buffer of size 7 * 11 * 13 * 17 is sieved and repeatedly copied over, instead of actually sieving 7, 11, 13 and 17 each over the full range.
You can also print the segment if you like:
So, here it has a buffer of 2 bytes, resembling an interval of 60 numbers, meaning it needs 3 iterations. Every
xmarks a prime number.