Skip to content

Refactor pybigtools values() rasterization#98

Open
nvictus wants to merge 17 commits into
jackh726:masterfrom
nvictus:refactor-pbt-values
Open

Refactor pybigtools values() rasterization#98
nvictus wants to merge 17 commits into
jackh726:masterfrom
nvictus:refactor-pbt-values

Conversation

@nvictus
Copy link
Copy Markdown
Collaborator

@nvictus nvictus commented May 13, 2026

This PR reimplements the binned/base-level rasterization in pybigtools around a cleaner API, with explicit semantics distinguishing the inclusion of uncovered bases (resolves #94) in summary statistic calculations and the post-rasterization filling of empty bins or positions. Internally, it also introduces a clean sweepline-based iterator for generating depth-of-coverage intervals from BigBed entries (the prerequisite for BigBed rasterization) and exposes the full set of internal summary stats through BBIReader.values().

1. Uncovered-base inclusion/exclusion in statistics and postfill

Previously, the missing argument served the purpose of post-rasterization fill-in but not inclusion of uncovered bases in summary calculations, and it's naming was ambiguous.

There is demand for summary statistics that consider uncovered bases as zeros (summarize over the entire bin) rather than only over covered bases, which is not yet exposed via values() #94. UCSC exposes some of these with names like mean0 in averageOverBed but this is all somewhat obscure and many people don't even realize that plain mean excludes uncovered bases.

The goal here is to make both functionality possible via two arguments:

  • uncovered: float | None -- determines whether uncovered bases are included in statistics. By default (None) they are excluded and any empty positions or summary bins are returned as NaN. You can opt-in to treating all uncovered bases as having value 0.0 via uncovered=0.
  • fillna: float | None -- replaces missing and determines how any in-bounds NaN values are filled after rasterization. The previous default behavior was to replace them with 0. The new default fillna=None is to leave them as is (which is coincidentally what pyBigWig does). Users can opt in to the old behavior with fillna=0, which is much more ergonomic than post-filling using numpy primitives.

Through the combination of summary and uncovered, we expose the equivalents of mean0, std0, min0, max0. As a bonus, you can insert any finite background value for uncovered bases you like.

This PR also exposes the full set of internal summary stats, including sum, sum_squares, bases_covered (number of bases covered by features), and bin_covered (fraction of the bin covered by features). The latter two are not affected by the uncovered argument by definition.

Deprecation cycle

Default behavior change: when fillna is not passed, empty bins / uncovered positions are now returned as NaN instead of being filled with 0. A DeprecationWarning is raised on the first call without an explicit fillna to flag the change for existing users; pass fillna=0 to keep prior behavior or fillna=None to silence the warning.

In order to emit warnings, a thin Python wrapper was introduced around the Rust BBIReader to detect "user did not pass fillna" (a distinction PyO3 can't make at the Rust boundary) and emit the DeprecationWarning. It can be removed in a future release when the deprecation is done.

API changes (Python BBIReader.values())

Parameter Status Semantics Default
uncovered new Per-base value for uncovered bases used during the summary computation. None (default) excludes them (classical stats); 0.0 reproduces UCSC's *0 stats; any finite value is a well-defined generalization. None (exclude)
fillna new Post-rasterization fill for in-bounds NaN positions/bins. OOB positions are not touched by fillna. None (keep NaNs)
missing deprecated Accepted as an alias for fillna with a DeprecationWarning. old default was 0.0; now opt-in with fillna=0
oob unchanged Fill for positions outside the chromosome's valid range. NaN

2. UCSC parity in zoom summary interpolation (ceil + rescale normalization)

The zoom-interpolation code path now matches UCSC's bbiSummarySlice normalization: per-bin bases_covered is rounded up to the next integer, with sum and sum_squares rescaled by ceil(bc)/bc so that mean = sum / bases_covered is invariant. The ceil direction preserves the invariant that bases_covered ≥ 1 whenever any data overlaps the bin (a floor would erase data near zoom-record boundaries). std is not algebraically preserved under the rescale, but matching UCSC's reported std requires applying the same step.

3. Coverage iterator

CoverageIterator, a 1-D sweep-line over endpoint events that converts sorted BED entries to a run-length-encoded coverage track (matching bedtools genomecov -bg semantics).

Tests

  • 31 Rust unit tests (up from 4): CoverageIterator (15 cases including sort-violation, malformed-input, stress, batching), fill_binned (8 cases covering every Summary variant plus the integer-bin-width edge case).
  • 715 Python tests (up from ~580): parity vs pybbi for sum/mean/ std/min/max/bin_covered/bases_covered across many start/end/bin combinations and both base-level and binned modes. Adds zoom-interpolation parity tests (test_bbi_consistent_binned_zoom_bw/bb) againstpybbi. Plus tests for fillna/uncovered semantics, the deprecation warnings, the arr= parameter, invalid arguments, and fully out-of-bounds queries.
  • pyproject.toml: filterwarnings ignores DeprecationWarning from pybigtools during the test run; deprecation tests opt back in locally via pytest.warns(...).

Migration notes for users

# old
bw.values("chr1", 0, 100, bins=10, missing=-1)                # → DeprecationWarning
bw.values("chr1", 0, 100, bins=10)                            # → DeprecationWarning, NaN on empty (changed)

# new
bw.values("chr1", 0, 100, bins=10, summary="mean", uncovered=0)  # translates to "mean0"
bw.values("chr1", 0, 100, bins=10, fillna=-1)                    # = old missing=-1
bw.values("chr1", 0, 100, bins=10, fillna=None)                  # opt in, no warning
bw.values("chr1", 0, 100, bins=10, fillna=0)                     # opt back into old default

@jackh726
Copy link
Copy Markdown
Owner

I'm trying to remember - I had thought a bunch about this, and there was some reason (or some difference) in choosing missing versus the fillna parameter. I'll have to think a bit about it and try to recall. Maybe splitting into uncovered and fillna fixes this.

@nvictus
Copy link
Copy Markdown
Collaborator Author

nvictus commented May 13, 2026

missing is the name I used in pybbi which I now regret given the confusing semantics. Whatever we do here I may backport there.

@nvictus
Copy link
Copy Markdown
Collaborator Author

nvictus commented May 18, 2026

Fixed the nightly build issue @jackh726.

nvictus added 17 commits May 18, 2026 13:59
UCSC's bbiIntervalSlice/bbiSummarySlice round per-bin bases_covered up
to the next integer and rescale sum and sum_squares by ceil(bc)/bc so
that mean = sum / bases_covered is preserved. This matters for the
zoom path, where a partial overlap of a zoom record contributes a
fractional share of its validCount. The Value path is unaffected
since overlaps are integer-valued.

The ceil direction (vs. floor) preserves the invariant that
bases_covered >= 1 whenever any data overlaps the bin -- a tiny
boundary clip of a zoom record won't spuriously erase known-existing
data. std is not algebraically preserved under the rescale, but
matching UCSC's reported std requires applying the same step.
The output is now a run-length-encoded coverage track: adjacent
intervals with the same depth are merged, matching the conventional
`bedtools genomecov -bg` semantics. Previously the iterator emitted
at every event position, producing many same-depth intervals in
plateau regions.

The internal algorithm now uses a Peekable iterator for start events
and an ends-only heap (one entry per active interval, half the size
of the previous start+end heap). The next event position is the
min of the next entry's start and the smallest pending end. The
two-phase batching collapses to a single drain-starts + drain-ends
pass per event position. `new()` no longer eagerly pulls the first
entry.

Renames: `heap` -> `ends`, `prev_pos` -> `run_start`.

Behavior change for downstream consumers: `fill_binned` is
unaffected (same total contribution per bin) but advances the
iterator fewer times in dense regions. Two tests were updated to
match the new output (adjacent same-coverage intervals merge);
four new tests cover malformed input, sort violations, the
many-same-position batching case, and a 10k-entry stress check.
…in values()

Reshape the `values()` API around two orthogonal concerns:

  - `uncovered: Option[float]`: per-base value assigned to uncovered
    bases during stat computation. `None` (default) excludes them
    from the summary (classical statistics over only the covered
    bases); `0.0` reproduces UCSC's `*0` statistics; any other finite
    value is a well-defined generalization. For `bases_covered` and
    `bin_covered` this is ignored — those always report counts of
    actual data bases.

  - `fillna: Option[float]`: post-rasterization fill for any NaN
    positions or bins that remain after stat computation. Independent
    of `oob`: out-of-chromosome positions take the `oob` value and
    are not touched by `fillna`, so the two parameters control
    different concerns.

Drops the `Mean0`, `Std0`, `Min0`, `Max0` enum variants and string
mappings — they're now expressible as the regular summaries with
`uncovered=0.0`. Drops the previous `missing` parameter, which was
overloaded between "per-base fill" (base-level) and "empty-bin fill"
(binned) semantics.
Wrap the Rust `BBIReader` in a thin Python class (in `__init__.py`) that
stages a deprecation cycle for `values()`:

  - `missing` is accepted as a deprecated alias for `fillna`. Using it
    raises a `DeprecationWarning`.
  - Not passing `fillna` at all raises a `DeprecationWarning` noting
    that the default behavior has changed: empty bins / uncovered
    positions are now returned as NaN instead of filled with 0. The
    Rust-level default (`fillna=None`) is applied. Users who want the
    old behavior pass `fillna=0` explicitly; users who want to silence
    the warning pass `fillna=None`.

The wrapper uses a sentinel to distinguish "user did not pass `fillna`"
from "user passed `fillna=None`" — a distinction PyO3 can't make at
the Rust boundary. Other attributes (`chroms`, `info`, `values`'s
non-deprecated kwargs, context-manager protocol, etc.) are delegated
to the underlying Rust object via `__getattr__` and explicit forwarders.

Tests:
  - Two new tests pin the deprecation behavior:
    `test_values_missing_alias_deprecated` and
    `test_values_default_fillna_deprecated`.
  - Existing `.values(...)` calls in tests have `fillna=...` added
    explicitly (53 in test_raster.py, 38 in test_api.py) so tests don't
    rely on the default and don't trigger spurious warnings.
  - `pyproject.toml`: ignore `DeprecationWarning` from pybigtools
    during test runs so warnings don't flood output. Tests that
    exercise the deprecation path opt back in via `pytest.warns(...)`.
@nvictus nvictus force-pushed the refactor-pbt-values branch from 638eae7 to c85a934 Compare May 18, 2026 17:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Adding more summary options to values binning

2 participants