
chore: Sampling algorithms now use nth method. Fixes #667. #760

Merged
RobertJacobsonCDC merged 3 commits into main from RobertJacobsonCDC_667_sample_with_nth
Feb 13, 2026

Conversation

@RobertJacobsonCDC
Collaborator

This PR

  1. modifies our built-in sampling algorithms to use Iterator::nth instead of iterating over every element (see the sketch after this list).
  2. adds a missing check along one execution path to dispatch to a faster sampling algorithm, giving an ~8x speedup for that particular path.
  3. adds two benchmarks to the suite of sampling benchmarks to fill a gap in coverage.
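
For concreteness, the known-length path in item 1 works roughly like the sketch below: draw the target indices up front, sort them, and use `Iterator::nth` to jump from one selected element to the next. This is a minimal illustration (it borrows rand's `index::sample` to draw distinct indices); the function name and signature are not ixa's actual code.

```rust
use rand::seq::index::sample;
use rand::Rng;

/// Illustrative sketch (not ixa's exact code): sample `amount` distinct
/// elements from an iterator of known length by drawing the target indices
/// up front and skipping between them with `Iterator::nth`.
fn sample_from_known_length<I, R>(
    rng: &mut R,
    mut iter: I,
    len: usize,
    amount: usize,
) -> Vec<I::Item>
where
    I: Iterator,
    R: Rng + ?Sized,
{
    // Draw `amount` distinct indices in [0, len) and visit them in order.
    let mut indexes = sample(rng, len, amount).into_vec();
    indexes.sort_unstable();

    let mut selected = Vec::with_capacity(amount);
    let mut consumed = 0; // elements already pulled from `iter`
    for idx in indexes {
        // `nth(k)` discards `k` elements and yields the next one, so this
        // jumps straight to position `idx`.
        if let Some(item) = iter.nth(idx - consumed) {
            selected.push(item);
        }
        consumed = idx + 1;
    }
    selected
}
```

With an O(1) `nth` (Vec, slice, IndexSet) each jump is constant time; with an O(n) `nth` (e.g. a filtering iterator) the same loop still walks every skipped element, which is why only the indexed paths show real improvements in the benchmarks below.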

There are some dramatic speedups that do not represent actual performance improvements in the code, and there are small apparent regressions that are not real in the sense that the code in their execution path is unchanged. See notes on benchmarks below for details.

`EntitySetIterator` now selects sampling algorithm based on size hint (like `sample_entity` already does).
@RobertJacobsonCDC RobertJacobsonCDC linked an issue Feb 11, 2026 that may be closed by this pull request
@RobertJacobsonCDC
Collaborator Author

Sampling algorithm benchmarks

Summary

Some algorithm benches (which apply sampling functions directly to a Vec) show extreme speedups (up to 60×), but these are misleading: ixa never samples from Vec iterators. They are still useful, though, as they isolate the overhead of the reservoir algorithm itself from the cost of nth. The only real-world improvement is the ~8× speedup for sampling_multiple_known_length, where IndexSet's O(1) nth lets the algorithm skip between selected indices in constant time. The reservoir path shows no meaningful improvement (≤1.23×), because the filtering iterators that trigger reservoir sampling have O(n) nth regardless.

Benchmarks

Smaller (faster) time in bold, relative speedup = main ÷ dev

  • > 1 → dev is faster

  • < 1 → dev is slower

Algorithm & Sampling Benchmarks (main vs dev)

| benchmark | main | dev | relative speedup (main ÷ dev) |
|:---|---:|---:|---:|
| algorithm_sampling_single_known_length | **6.0698 ns** | 6.0833 ns | 1.00× |
| algorithm_sampling_single_l_reservoir | 30.041 µs | **498.42 ns** | ≈60.3× |
| algorithm_sampling_single_rand_reservoir | **155.04 µs** | 157.62 µs | 0.98× |
| algorithm_sampling_multiple_known_length | 32.474 µs | **1.2966 µs** | ≈25.1× |
| algorithm_sampling_multiple_l_reservoir | 69.347 µs | **17.879 µs** | 3.88× |

sample_entity Benchmarks (main vs dev)

| benchmark | main | dev | relative speedup (main ÷ dev) |
|:---|---:|---:|---:|
| whole_population / 1 000 | **11.380 ns** | 12.313 ns | 0.92× |
| whole_population / 10 000 | **11.420 ns** | 12.374 ns | 0.92× |
| whole_population / 100 000 | **11.395 ns** | 12.314 ns | 0.93× |
| single_property_indexed / 1 000 | **82.356 ns** | 83.488 ns | 0.99× |
| single_property_indexed / 10 000 | **81.921 ns** | 82.532 ns | 0.99× |
| single_property_indexed / 100 000 | **81.593 ns** | 83.125 ns | 0.98× |
| multi_property_indexed / 1 000 | 194.19 ns | **186.21 ns** | 1.04× |
| multi_property_indexed / 10 000 | **188.18 ns** | 193.71 ns | 0.97× |
| multi_property_indexed / 100 000 | 194.23 ns | **186.61 ns** | 1.04× |

The apparent regression in whole_population* isn't real: the code on this execution path is identical between the two branches. (Link order alone can cause ±40% swings in benchmark timings; see Emery Berger's work on performance measurement bias.)

sampling Benchmarks (main vs dev)

| benchmark | main | dev | relative speedup (main ÷ dev) |
|:---|---:|---:|---:|
| sampling_single_known_length | **83.469 µs** | 84.998 µs | 0.98× |
| sampling_single_l_reservoir | 6.0256 ms | **4.8935 ms** | 1.23× |
| sampling_multiple_known_length | 6.5298 ms | **778.81 µs** | ≈8.39× |
| sampling_multiple_l_reservoir | 6.7035 ms | **6.6157 ms** | 1.01× |

Running the benchmarks

# On main:
cargo bench -p ixa-bench 'sampl' -- --save-baseline 'main'
# On the dev branch:
cargo bench -p ixa-bench 'sampl' -- --baseline 'main'

@github-actions

Benchmark Results

Hyperfine

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `large_sir::baseline` | 2.9 ± 0.0 | 2.8 | 3.0 | 1.00 |
| `large_sir::entities` | 13.3 ± 0.5 | 12.9 | 15.9 | 4.65 ± 0.17 |

Criterion

Note: A comparison could not be generated. Maybe you added new benchmarks?

@RobertJacobsonCDC
Collaborator Author

A dump of some of my notes for posterity.

Sampling Benchmark Coverage Analysis

After switching the sampling algorithms to use Iterator::nth for skipping (instead of
iterating over every element and comparing positions), performance depends on two
independent properties of the iterator being sampled from:

  1. Known vs. unknown length. If the iterator is an ExactSizeIterator, we use the
    sample_*_from_known_length functions, which select random indices up front and
    collect them in one pass. Otherwise, we fall back to reservoir sampling
    (sample_*_l_reservoir). (See the dispatch sketch after this list.)

  2. Fast vs. slow nth. If the iterator supports O(1) nth (e.g. Vec, slice, or
    PopulationIterator), the skip steps are genuinely constant-time. If nth is O(n)
    (e.g. a filtering iterator, or HashSet's iterator), the skips still have to walk
    every element, and the nth optimization has no effect.
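
As a rough illustration of how property 1 drives the dispatch, a single-element version might look like the sketch below. The function is hypothetical (per the PR, the known-length functions are keyed on ExactSizeIterator and `EntitySetIterator` consults its size hint); the fallback branch here is just a plain single-item reservoir.

```rust
use rand::Rng;

/// Illustrative dispatch sketch (the function name is hypothetical, not
/// ixa's API): choose the sampling strategy from the iterator's size hint.
fn sample_one<I, R>(rng: &mut R, mut iter: I) -> Option<I::Item>
where
    I: Iterator,
    R: Rng + ?Sized,
{
    match iter.size_hint() {
        // Exact, nonzero length: draw one index and jump straight to it.
        // (This trusts the size hint; the known-length functions themselves
        // are keyed on ExactSizeIterator.)
        (lo, Some(hi)) if lo == hi && lo > 0 => iter.nth(rng.gen_range(0..lo)),
        // Length unknown (e.g. a filtering iterator): single-item reservoir,
        // keeping the i-th element with probability 1/i.
        _ => {
            let mut chosen = iter.next()?;
            let mut seen = 1usize;
            for item in iter {
                seen += 1;
                if rng.gen_range(0..seen) == 0 {
                    chosen = item;
                }
            }
            Some(chosen)
        }
    }
}
```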

These two properties combine into three execution paths that matter in practice:

| Execution path | When it arises | Benchmark coverage |
|:---|:---|:---|
| Known length + fast nth | Query is fully resolved by a single index (or multi-property index), and the backing store supports O(1) skip. | algorithm_sampling_single_known_length, algorithm_sampling_multiple_known_length (algorithm benches on Vec); all sample_entity_scaling benchmarks; sampling_single_known_length_entities, sampling_multiple_known_length_entities (sample_people) |
| Unknown length + fast nth | The reservoir algorithm is applied to an iterator with O(1) nth. This path never arises in practice: if the length is unknown, it is because the iterator filters, which makes nth linear. However, it is useful for isolating reservoir overhead from nth cost. | algorithm_sampling_single_l_reservoir, algorithm_sampling_multiple_l_reservoir (reservoir algorithm applied directly to a Vec) |
| Unknown length + slow nth | Query involves properties that are not jointly indexed, requiring an EntitySetIterator that filters one source against others. Each nth call must evaluate the filter for every skipped element. | sampling_single_l_reservoir_entities, sampling_multiple_l_reservoir_entities (sample_people: query on (Property10, Property100) where each property is individually indexed but the pair is not) |
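
The `*_l_reservoir` paths use a skip-based reservoir scheme: instead of visiting every element, the algorithm computes how many elements to pass over before the next reservoir replacement, and that jump is now a single `nth` call. The sketch below is the textbook form of that idea (Algorithm L), shown for illustration only and not as ixa's exact implementation.

```rust
use rand::Rng;

/// Illustrative sketch of Algorithm L-style reservoir sampling over an
/// iterator of unknown length, using `Iterator::nth` for the skips.
fn reservoir_sample_l<I, R>(rng: &mut R, mut iter: I, k: usize) -> Vec<I::Item>
where
    I: Iterator,
    R: Rng + ?Sized,
{
    // Fill the reservoir with the first k elements.
    let mut reservoir: Vec<I::Item> = iter.by_ref().take(k).collect();
    if reservoir.len() < k {
        return reservoir; // the whole input had fewer than k elements
    }

    let mut w: f64 = (rng.gen::<f64>().ln() / k as f64).exp();
    loop {
        // How many elements to skip before the next replacement.
        let skip = (rng.gen::<f64>().ln() / (1.0 - w).ln()).floor() as usize;
        // `nth(skip)` discards `skip` elements and yields the one after them.
        // With an O(1) `nth` this jump is constant time; a filtering iterator
        // still walks every skipped element, which is why the slow-nth rows
        // above show no improvement.
        match iter.nth(skip) {
            Some(item) => {
                let slot = rng.gen_range(0..k);
                reservoir[slot] = item;
                w *= (rng.gen::<f64>().ln() / k as f64).exp();
            }
            None => break,
        }
    }
    reservoir
}
```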

Gap in benchmark coverage

There are two ways to end up with an EntitySetIterator that has unknown length and
slow nth, corresponding to different SourceIterator variants:

  1. SourceIterator::IndexIter with nonempty sources -- The source is an indexed
    set, but additional unindexed filter constraints remain in EntitySetIterator::sources.
    The iterator walks the index set and filters each candidate against the remaining
    constraints. Both sampling_*_l_reservoir_entities benchmarks in sample_people.rs
    exercise this path.

  2. SourceIterator::PropertyVecIter -- No index is available at all. The source
    iterates over a property's value vector (via ConcretePropertySource or
    DerivedPropertySource), checking every entity. No benchmark currently exercises
    this path.

We add two benchmarks to cover case 2: sampling_single_unindexed_entities and sampling_multiple_unindexed_entities.

Running the benchmarks

# On main:
cargo bench -p ixa-bench 'sampl' -- --save-baseline 'main'
# On the dev branch:
cargo bench -p ixa-bench 'sampl' -- --baseline 'main'

// index `idx` we skip `idx - consumed` where `consumed` tracks how many
// elements have already been consumed.
for idx in indexes {
    if let Some(item) = iter.nth(idx - consumed) {
Collaborator


I'm confused by this change: why don't you break early anymore?

Collaborator Author


Previously we iterated over the elements of `iter` (the source set), checked whether we had landed on the next target index, and, if we had, pulled the next index from the list of precomputed indexes. We broke out of the loop once there were no more indexes.

In the new version, we iterate over the precomputed indexes instead, so there's no need to break: termination is implicit in `for idx in indexes`. We move through `iter` by calling `iter.nth` on it.
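
For contrast, here is a rough reconstruction of the old shape, pieced together from the description above rather than from the actual old code: walk every element, compare positions against the precomputed indexes, and break once no indexes are left.

```rust
/// Hypothetical reconstruction of the previous shape (not the exact old
/// code): walk every element of `iter`, compare positions against the
/// sorted, precomputed `indexes`, and break early once all are matched.
fn collect_at_indexes_old<I: Iterator>(iter: I, indexes: &[usize]) -> Vec<I::Item> {
    let mut wanted = indexes.iter().copied().peekable();
    let mut selected = Vec::new();
    for (pos, item) in iter.enumerate() {
        match wanted.peek() {
            // Reached the next target position: keep the element.
            Some(&next) if next == pos => {
                selected.push(item);
                wanted.next();
            }
            // Not a target position yet: keep walking.
            Some(_) => {}
            // No targets left: the early break the old code relied on.
            None => break,
        }
    }
    selected
}
```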

Collaborator

@k88hudson-cfa left a comment


LGTM, with the extra comment removed and a comment added about whether the size is known being runtime-dependent.

@github-actions

Benchmark Results

Hyperfine

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `large_sir::baseline` | 3.0 ± 0.0 | 3.0 | 3.2 | 1.00 |
| `large_sir::entities` | 11.8 ± 0.3 | 11.1 | 12.7 | 3.88 ± 0.11 |

Criterion

Note: A comparison could not be generated. Maybe you added new benchmarks?

@RobertJacobsonCDC RobertJacobsonCDC merged commit 752e12d into main Feb 13, 2026
3 checks passed


Development

Successfully merging this pull request may close these issues.

Use Iterator::nth() for reservoir sampling algorithms.
