Refactor pybigtools values() rasterization#98
Open
nvictus wants to merge 17 commits into
Open
Conversation
Owner
|
I'm trying to remember - I had thought a bunch about this, and there was some reason (or some difference) in choosing |
Collaborator
Author
|
|
Collaborator
Author
|
Fixed the nightly build issue @jackh726. |
UCSC's bbiIntervalSlice/bbiSummarySlice round per-bin bases_covered up to the next integer and rescale sum and sum_squares by ceil(bc)/bc so that mean = sum / bases_covered is preserved. This matters for the zoom path, where a partial overlap of a zoom record contributes a fractional share of its validCount. The Value path is unaffected since overlaps are integer-valued. The ceil direction (vs. floor) preserves the invariant that bases_covered >= 1 whenever any data overlaps the bin -- a tiny boundary clip of a zoom record won't spuriously erase known-existing data. std is not algebraically preserved under the rescale, but matching UCSC's reported std requires applying the same step.
The output is now a run-length-encoded coverage track: adjacent intervals with the same depth are merged, matching the conventional `bedtools genomecov -bg` semantics. Previously the iterator emitted at every event position, producing many same-depth intervals in plateau regions. The internal algorithm now uses a Peekable iterator for start events and an ends-only heap (one entry per active interval, half the size of the previous start+end heap). The next event position is the min of the next entry's start and the smallest pending end. The two-phase batching collapses to a single drain-starts + drain-ends pass per event position. `new()` no longer eagerly pulls the first entry. Renames: `heap` -> `ends`, `prev_pos` -> `run_start`. Behavior change for downstream consumers: `fill_binned` is unaffected (same total contribution per bin) but advances the iterator fewer times in dense regions. Two tests were updated to match the new output (adjacent same-coverage intervals merge); four new tests cover malformed input, sort violations, the many-same-position batching case, and a 10k-entry stress check.
…in values()
Reshape the `values()` API around two orthogonal concerns:
- `uncovered: Option[float]`: per-base value assigned to uncovered
bases during stat computation. `None` (default) excludes them
from the summary (classical statistics over only the covered
bases); `0.0` reproduces UCSC's `*0` statistics; any other finite
value is a well-defined generalization. For `bases_covered` and
`bin_covered` this is ignored — those always report counts of
actual data bases.
- `fillna: Option[float]`: post-rasterization fill for any NaN
positions or bins that remain after stat computation. Independent
of `oob`: out-of-chromosome positions take the `oob` value and
are not touched by `fillna`, so the two parameters control
different concerns.
Drops the `Mean0`, `Std0`, `Min0`, `Max0` enum variants and string
mappings — they're now expressible as the regular summaries with
`uncovered=0.0`. Drops the previous `missing` parameter, which was
overloaded between "per-base fill" (base-level) and "empty-bin fill"
(binned) semantics.
Wrap the Rust `BBIReader` in a thin Python class (in `__init__.py`) that
stages a deprecation cycle for `values()`:
- `missing` is accepted as a deprecated alias for `fillna`. Using it
raises a `DeprecationWarning`.
- Not passing `fillna` at all raises a `DeprecationWarning` noting
that the default behavior has changed: empty bins / uncovered
positions are now returned as NaN instead of filled with 0. The
Rust-level default (`fillna=None`) is applied. Users who want the
old behavior pass `fillna=0` explicitly; users who want to silence
the warning pass `fillna=None`.
The wrapper uses a sentinel to distinguish "user did not pass `fillna`"
from "user passed `fillna=None`" — a distinction PyO3 can't make at
the Rust boundary. Other attributes (`chroms`, `info`, `values`'s
non-deprecated kwargs, context-manager protocol, etc.) are delegated
to the underlying Rust object via `__getattr__` and explicit forwarders.
Tests:
- Two new tests pin the deprecation behavior:
`test_values_missing_alias_deprecated` and
`test_values_default_fillna_deprecated`.
- Existing `.values(...)` calls in tests have `fillna=...` added
explicitly (53 in test_raster.py, 38 in test_api.py) so tests don't
rely on the default and don't trigger spurious warnings.
- `pyproject.toml`: ignore `DeprecationWarning` from pybigtools
during test runs so warnings don't flood output. Tests that
exercise the deprecation path opt back in via `pytest.warns(...)`.
638eae7 to
c85a934
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR reimplements the binned/base-level rasterization in
pybigtoolsaround a cleaner API, with explicit semantics distinguishing the inclusion of uncovered bases (resolves #94) in summary statistic calculations and the post-rasterization filling of empty bins or positions. Internally, it also introduces a clean sweepline-based iterator for generating depth-of-coverage intervals from BigBed entries (the prerequisite for BigBed rasterization) and exposes the full set of internal summary stats throughBBIReader.values().1. Uncovered-base inclusion/exclusion in statistics and postfill
Previously, the
missingargument served the purpose of post-rasterization fill-in but not inclusion of uncovered bases in summary calculations, and it's naming was ambiguous.There is demand for summary statistics that consider uncovered bases as zeros (summarize over the entire bin) rather than only over covered bases, which is not yet exposed via
values()#94. UCSC exposes some of these with names likemean0in averageOverBed but this is all somewhat obscure and many people don't even realize that plainmeanexcludes uncovered bases.The goal here is to make both functionality possible via two arguments:
uncovered: float | None-- determines whether uncovered bases are included in statistics. By default (None) they are excluded and any empty positions or summary bins are returned asNaN. You can opt-in to treating all uncovered bases as having value 0.0 viauncovered=0.fillna: float | None-- replacesmissingand determines how any in-boundsNaNvalues are filled after rasterization. The previous default behavior was to replace them with 0. The new defaultfillna=Noneis to leave them as is (which is coincidentally what pyBigWig does). Users can opt in to the old behavior withfillna=0, which is much more ergonomic than post-filling using numpy primitives.Through the combination of
summaryanduncovered, we expose the equivalents ofmean0,std0,min0,max0. As a bonus, you can insert any finite background value for uncovered bases you like.This PR also exposes the full set of internal summary stats, including
sum,sum_squares,bases_covered(number of bases covered by features), andbin_covered(fraction of the bin covered by features). The latter two are not affected by theuncoveredargument by definition.Deprecation cycle
Default behavior change: when
fillnais not passed, empty bins / uncovered positions are now returned as NaN instead of being filled with 0. ADeprecationWarningis raised on the first call without an explicitfillnato flag the change for existing users; passfillna=0to keep prior behavior orfillna=Noneto silence the warning.In order to emit warnings, a thin Python wrapper was introduced around the Rust
BBIReaderto detect "user did not passfillna" (a distinction PyO3 can't make at the Rust boundary) and emit theDeprecationWarning. It can be removed in a future release when the deprecation is done.API changes (Python
BBIReader.values())uncoveredNone(default) excludes them (classical stats);0.0reproduces UCSC's*0stats; any finite value is a well-defined generalization.fillnafillna.missingfillnawith aDeprecationWarning.fillna=0oob2. UCSC parity in zoom summary interpolation (ceil + rescale normalization)
The zoom-interpolation code path now matches UCSC's
bbiSummarySlicenormalization: per-binbases_coveredis rounded up to the next integer, withsumandsum_squaresrescaled byceil(bc)/bcso thatmean = sum / bases_coveredis invariant. Theceildirection preserves the invariant thatbases_covered ≥ 1whenever any data overlaps the bin (afloorwould erase data near zoom-record boundaries).stdis not algebraically preserved under the rescale, but matching UCSC's reportedstdrequires applying the same step.3. Coverage iterator
CoverageIterator, a 1-D sweep-line over endpoint events that converts sorted BED entries to a run-length-encoded coverage track (matchingbedtools genomecov -bgsemantics).Tests
CoverageIterator(15 cases including sort-violation, malformed-input, stress, batching),fill_binned(8 cases covering everySummaryvariant plus the integer-bin-width edge case).pybbiforsum/mean/std/min/max/bin_covered/bases_coveredacross many start/end/bin combinations and both base-level and binned modes. Adds zoom-interpolation parity tests (test_bbi_consistent_binned_zoom_bw/bb) againstpybbi. Plus tests forfillna/uncoveredsemantics, the deprecation warnings, thearr=parameter, invalid arguments, and fully out-of-bounds queries.pyproject.toml:filterwarningsignoresDeprecationWarningfrompybigtoolsduring the test run; deprecation tests opt back in locally viapytest.warns(...).Migration notes for users