Add binned summary statistic aggregation for genomic intervals — Closes #61 #62
Open
conradbzura wants to merge 49 commits into main from 61-binned-coverage-operator
Commits
0f00f06 feat: Add GIQLCoverage expression node and parser registration (conradbzura)
08ffb4d feat: Add CoverageTransformer for binned genome coverage (conradbzura)
a97f829 test: Add parsing and transpilation tests for COVERAGE operator (conradbzura)
38f9ac0 docs: Add COVERAGE operator reference and recipes (conradbzura)
76d36ba feat: Support => (standard SQL) named parameter syntax in COVERAGE (conradbzura)
75bfd14 fix: Stop treating = as named parameter syntax in COVERAGE (conradbzura)
9a5a1fd refactor: Remove dead code and fix LATERAL syntax for DuckDB compat (conradbzura)
8b8eaee feat: Add target parameter and default alias to COVERAGE operator (conradbzura)
462e436 fix: Move COVERAGE WHERE clause into LEFT JOIN ON condition (conradbzura)
6e7b21b test: Rewrite COVERAGE tests to spec with full API coverage (conradbzura)
4ddb5de test: Add unit tests for bedtools test utilities (conradbzura)
ecf2b1a test: Add unit tests for GIQL parsing, generation, and transpilation (conradbzura)
4a09eb7 test: Add bedtools integration tests for operator correctness (conradbzura)
76a988f docs: Clarify score column reference and add sample output table (conradbzura)
67f8459 test: Add property-based tests for COVERAGE transpilation (conradbzura)
185b716 fix: Align unit tests with := named parameter syntax and fix CTE pres… (conradbzura)
1fba22a fix: Compare only coordinates in merge-then-intersect workflow test (conradbzura)
c25a2ff fix: Count non-null source column to preserve zero-coverage bins (conradbzura)
1adfd5d fix: Propagate table alias into chroms subquery (conradbzura)
23205ed fix: Preserve user CTEs in CoverageTransformer output (conradbzura)
2faa7c4 fix: Reject non-positive COVERAGE resolution at transpile time (conradbzura)
0966f27 refactor: Reuse _split_named_and_positional in GIQLCoverage (conradbzura)
47a5dd3 refactor: Delegate table and column lookup to ClusterTransformer (conradbzura)
b0c4507 style: Move public transform above private helpers in CoverageTransfo… (conradbzura)
368e812 fix: Clamp generate_series upper bound to avoid trailing empty bin (conradbzura)
63e3ac5 fix: Raise when COVERAGE FROM clause is not a named table (conradbzura)
2b698ab fix: Require stat and target to be string literals in COVERAGE (conradbzura)
b427c53 docs: Clarify supported COVERAGE FROM clauses and CTE workaround (conradbzura)
e5297c0 docs: List COVERAGE in the dialect aggregation operators table (conradbzura)
97c7cd4 docs: Quote reserved column identifiers in 5' end counting recipe (conradbzura)
e82ae47 test: Make adjacent-neighbor nearest test honest about what it verifies (conradbzura)
1278d27 test: Execute full intersect/merge/nearest pipeline through GIQL (conradbzura)
3f2ced4 test: Register and propagate integration marker (conradbzura)
8a2f29b test: Apply BDD naming, GWT docstrings, and AAA comments to integrati… (conradbzura)
806511e test: Move bedtools helper tests next to the helpers they cover (conradbzura)
837927c test: Consolidate COVERAGE tests into tests/unit/ and drop root-level… (conradbzura)
0df62b1 test: Move test_data_models alongside its target module (conradbzura)
91c19c4 test: Apply BDD naming, GWT docstrings, and AAA comments across unit … (conradbzura)
a3a8611 test: Set explicit max_examples on all Hypothesis property tests (conradbzura)
d1ac6be test: Rename _transform_and_sql helper to reflect full pipeline scope (conradbzura)
e1d01c5 style: Apply small hygiene fixes to COVERAGE source files (conradbzura)
52f092a refactor: Use typed SQLGlot aggregate nodes in COVERAGE transformer (conradbzura)
3d5f682 docs: Polish COVERAGE operator reference and recipes (conradbzura)
1f3fdd4 test: Strengthen integration-test rigor and extract random-interval h… (conradbzura)
54b8815 test: Tighten unit-test rigor for COVERAGE and bedtools helpers (conradbzura)
5ae4536 refactor: Scope COVERAGE to count statistic only (conradbzura)
93c1ff9 docs: Trim COVERAGE reference and recipes to count-only scope (conradbzura)
e3dd879 refactor: Rename COVERAGE operator to RASTERIZE (conradbzura)
539d5a7 docs: Rename COVERAGE references to RASTERIZE (conradbzura)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,145 @@ | ||
| Rasterize | ||
| ========= | ||
|
|
||
| This section covers patterns for projecting interval data onto a fixed-resolution bin grid using GIQL's ``RASTERIZE`` operator. | ||
|
|
||
| Basic Usage | ||
| ----------- | ||
|
|
||
| Rasterized counts underpin most genome-wide signal summaries — read-pileup plots for ChIP-seq, exon-level depth in RNA-seq, and peak-density overviews across megabases. The recipes below start from a canonical per-bin count and build toward more specialised variants. | ||
|
|
||
| Count Overlapping Features | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| Count the number of features overlapping each 1 kb bin across the genome: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| SELECT RASTERIZE(interval, 1000) AS depth | ||
| FROM features | ||
|
|
||
| **Sample output:** | ||
|
|
||
| .. code-block:: text | ||
|
|
||
| ┌────────┬────────┬────────┬───────┐ | ||
| │ chrom │ start │ end │ depth │ | ||
| ├────────┼────────┼────────┼───────┤ | ||
| │ chr1 │ 0 │ 1000 │ 3 │ | ||
| │ chr1 │ 1000 │ 2000 │ 1 │ | ||
| │ chr1 │ 2000 │ 3000 │ 0 │ | ||
| │ ... │ ... │ ... │ ... │ | ||
| └────────┴────────┴────────┴───────┘ | ||
|
|
||
| Each row represents one genomic bin. Bins with no overlapping features appear with a count of zero. An interval that spans more than one bin is counted once in each bin it overlaps (the ``bedtools coverage`` convention), so the sum of bin counts can exceed the number of source intervals. | ||
|
|
||
| **Use case:** Compute read depth or feature density at a fixed resolution. | ||
|
|
||
| Custom Bin Size | ||
| ~~~~~~~~~~~~~~~ | ||
|
|
||
| Use a finer resolution of 100 bp: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| SELECT RASTERIZE(interval, 100) AS depth | ||
| FROM reads | ||
|
|
||
| **Use case:** High-resolution count tracks for visualisation. | ||
|
|
||
| Named Resolution Parameter | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| The resolution can also be supplied by name: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| SELECT RASTERIZE(interval, resolution := 500) AS depth | ||
| FROM features | ||
|
|
||
| Both ``:=`` and ``=>`` are accepted for named parameters. | ||
|
|
||
| .. note:: | ||
|
|
||
| Weighted summary statistics (mean, sum, min, max over interval values, with bin-boundary-aware weighting) are not yet implemented. See the project tracker for the follow-up. | ||
|
|
||
| Filtered Rasterization | ||
| ---------------------- | ||
|
|
||
| Strand-Specific Counts | ||
| ~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| Compute per-bin counts for each strand separately by filtering: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| -- Plus strand | ||
| SELECT RASTERIZE(interval, 1000) AS depth | ||
| FROM features | ||
| WHERE strand = '+' | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| -- Minus strand | ||
| SELECT RASTERIZE(interval, 1000) AS depth | ||
| FROM features | ||
| WHERE strand = '-' | ||
|
|
||
| **Use case:** Strand-specific signal tracks for RNA-seq or stranded assays. | ||
|
|
||
| High-Scoring Features | ||
| ~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| Restrict counts to features above a quality threshold: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| SELECT RASTERIZE(interval, 1000) AS depth | ||
| FROM features | ||
| WHERE score > 10 | ||
|
|
||
| 5' End Counting | ||
| ~~~~~~~~~~~~~~~ | ||
|
|
||
| To count only the 5' ends of features (e.g. TSS or read starts), first | ||
| create a view or CTE that trims each interval to its 5' end, then apply | ||
| ``RASTERIZE``: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| WITH five_prime AS ( | ||
| SELECT chrom, "start", "start" + 1 AS "end" | ||
| FROM features | ||
| WHERE strand = '+' | ||
| UNION ALL | ||
| SELECT chrom, "end" - 1 AS "start", "end" | ||
| FROM features | ||
| WHERE strand = '-' | ||
| ) | ||
| SELECT RASTERIZE(interval, 1000) AS tss_count | ||
| FROM five_prime | ||
|
|
||
| Normalised Counts | ||
| ----------------- | ||
|
|
||
| RPM Normalisation | ||
| ~~~~~~~~~~~~~~~~~ | ||
|
|
||
| Normalise bin counts to reads per million (RPM) by dividing by the total | ||
| number of reads: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| WITH bins AS ( | ||
| SELECT RASTERIZE(interval, 1000) AS depth | ||
| FROM reads | ||
| ), | ||
| total AS ( | ||
| SELECT COUNT(*) AS n FROM reads | ||
| ) | ||
| SELECT | ||
| bins.chrom, | ||
| bins."start", | ||
| bins."end", | ||
| bins.depth * 1000000.0 / total.n AS rpm | ||
| FROM bins, total |
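
The counting semantics described in the recipes above (zero-filled bins, one count per overlapped bin, RPM scaling) can be sketched outside SQL as a reference for reasoning about expected output. This is a minimal illustration, not the transformer's implementation: `rasterize_counts`, `to_rpm`, and the `chrom_sizes` argument are hypothetical names, coordinates are assumed to be 0-based and half-open, and the last bin is assumed not to be clamped to the chromosome length.

```python
def rasterize_counts(intervals, chrom_sizes, resolution):
    """Count intervals overlapping each fixed-width bin.

    Hypothetical reference helper; assumes 0-based, half-open
    [start, end) coordinates and a known length per chromosome.
    """
    if resolution <= 0:
        raise ValueError("resolution must be positive")

    # One zero-initialised counter per bin, so empty bins are kept.
    counts = {}
    for chrom, size in chrom_sizes.items():
        n_bins = (size + resolution - 1) // resolution  # ceiling division
        counts[chrom] = [0] * n_bins

    for chrom, start, end in intervals:
        bins = counts.get(chrom)
        if bins is None or end <= start:
            continue  # unknown chromosome or empty interval
        first = max(start // resolution, 0)
        last = min((end - 1) // resolution, len(bins) - 1)
        # An interval spanning several bins is counted once in each.
        for b in range(first, last + 1):
            bins[b] += 1

    return [
        (chrom, i * resolution, (i + 1) * resolution, depth)
        for chrom, bins in counts.items()
        for i, depth in enumerate(bins)
    ]


def to_rpm(rows, total_reads):
    """Scale per-bin counts to reads per million."""
    return [
        (chrom, start, end, depth * 1_000_000.0 / total_reads)
        for chrom, start, end, depth in rows
    ]


# Illustrative data only: three features on a 3 kb chromosome.
features = [("chr1", 100, 200), ("chr1", 500, 1500), ("chr1", 900, 950)]
rows = rasterize_counts(features, {"chr1": 3000}, 1000)
rpm_rows = to_rpm(rows, total_reads=len(features))

for row in rows:
    print(row)
```

With this toy input the three bins receive counts of 3, 1 and 0, and the counts sum to 4 for 3 source features because the interval 500-1500 crosses a bin boundary.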