-
Notifications
You must be signed in to change notification settings - Fork 7
Description
The reservoir sampling algorithms will be implemented using a generic Iterator in the new QueryResultIterator (or Entities) code. The real performance advantage of this algorithm is that it skips over elements of the sequence it doesn't need to consider for inserting into the reservoir. However, the way we are implementing it, we still iterate over every element. For the specific case that the Iterator has a constant time nth method implementation (e.g. Vec<T>), this leaves a ton of performance on the table unnecessarily.
Instead of computing next_pick_position, then iterating over each element and checking if position == next_pick_position, we should compute skip_count and set item = iter.nth(skip_count) directly. (Beware off-by-one, as iter.nth(0) is equivalent to iter.next().)
Unfortunately, this change won't give us much of an advantage in our most interesting use-case of sampling from index sets, because HashSet's Iterator::nth implementation has to iterate over each "slot" to account for the fact that some slots might be empty. (This is why we haven't bothered to make this change yet.) However, this could be very relevant if we use alternative data structures for different kinds of indexes.