Skip to content

Use Iterator::nth() for reservoir sampling algorithms. #667

@RobertJacobsonCDC

Description

@RobertJacobsonCDC

The reservoir sampling algorithms will be implemented using a generic Iterator in the new QueryResultIterator (or Entities) code. The real performance advantage of this algorithm is that it skips over elements of the sequence it doesn't need to consider for inserting into the reservoir. However, the way we are implementing it, we still iterate over every element. For the specific case that the Iterator has a constant time nth method implementation (e.g. Vec<T>), this leaves a ton of performance on the table unnecessarily.

Instead of computing next_pick_position, then iterating over each element and checking if position == next_pick_position, we should compute skip_count and set item = iter.nth(skip_count) directly. (Beware off-by-one, as iter.nth(0) is equivalent to iter.next().)

Unfortunately, this change won't give us much of an advantage in our most interesting use-case of sampling from index sets, because HashSet's Iterator::nth implementation has to iterate over each "slot" to account for the fact that some slots might be empty. (This is why we haven't bothered to make this change yet.) However, this could be very relevant if we use alternative data structures for different kinds of indexes.

Metadata

Metadata

Labels

small taskGood for new contributors

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions