feat!: use MolecularAnnotation spec and library#107
Conversation
The molecular_annotation crate moved strand from AnnotationType to per-Annotation and replaced from_ma_record/write_ma_record with from_record + from_tags / to_record. Update src/utils/ma_io.rs and tests/molecular_annotation.rs to call sites. read_ma_tags now extracts MA/AL/AQ/AN aux fields explicitly and feeds them through MolecularAnnotations::from_tags, then attaches AlignedBlocks for liftover.
Updates the module-level doc comment to reflect the post-Phase-0 API shape, the planned producer/consumer split for nuc/msp/fire, and the in-memory-only m6a/cpg story (MM/ML stays the on-disk truth).
Adds FIRE_TYPE constant, build_annotations (nuc/msp/fire-aware), and extract_fire_arrays. build_nuc_msp_annotations is preserved as a thin wrapper so existing producers keep compiling until they migrate to the fire-aware builder.
This wires up nuc,msp, and fire annotations to go thru MA spec. Does not cover m6a/cpg.
Adopt MA spec's inward liftover semantics for query-space ranges. Legacy liftover_closest used inclusive endpoint matching (q_pos <= q_en), which mapped positions at block ends back into the previous block, over-extending ranges across deletions by 1bp. Inward semantics treat blocks as half-open, so a query position at the edge of a deletion lifts to the next aligned reference base. Nucleosome and MSP lengths now match the read's actual aligned footprint. Snapshot regenerated.
Replace direct field access (fiber.msp, fiber.nuc, etc.) with method accessors that return AnnotationTypeView<'_> derived from the spec's MolecularAnnotations. read_annotations + BaseMods::populate_ma now populate the spec object at construction with nuc/msp/fire from MA tags and m6a/cpg from ML/MM, so the spec is the source of truth for read-side coordinates. AnnotationTypeView delegates to the library's iter_type / AnnotationInfo for per-annotation iteration and ref-coord lifting, and provides Vec<i64> / Vec<Option<i64>> column accessors for the legacy column-oriented call sites. No bespoke per-annotation wrapper struct; call sites read AnnotationInfo fields directly. Legacy Ranges fields on FiberseqData are retained for the centering path, which still mutates them; they'll be dropped after centering migrates to the spec's translate. All read-site call sites (decorator, fire, footprint, pileup, qc, validate, utils/fire) updated to call fiber.X() instead of fiber.X. Centering deliberately left on field access.
0b62961 to
0262039
Compare
ee4e71a to
b9b8ac8
Compare
Add `primary_qual(&[u8], &str) -> u8` and `AnnotationTypeView::qual_at(idx)` to make the single-quality assumption explicit. `qual()` now delegates to `qual_at(0)` with `debug_assert!(qualities.len() <= 1, …)`, so a future multi-quality annotation type can't silently lose qualities through `first().copied().unwrap_or(0)`. Migrates the existing single-quality call sites in ma_io (3 sites), footprint, and qc to the helper. Same byte-for-byte output; the assert fires only in debug builds if a multi-quality type appears.
CenteredFiberData no longer mutates a cloned FiberseqData. Instead it stashes both offsets and projects on-demand via the spec library's project_query / project_reference (which return ProjectedAnnotation with the centered coords already computed). grab_data collapses to a single helper per type. validate_msp_m6a_alignment moved onto CenteredFiberData, also using project_query. FiberseqData drops msp/nuc/m6a/cpg Ranges fields; read sites already go through the AnnotationTypeView accessors over self.annotations. apply_filter_fsd switched to MolecularAnnotations::retain. Call sites in pileup, add_nucleosomes, fiber_hmm, qc, fire, input_bam adjusted to the new field-less FiberseqData; non-regression test files (extract_test, m6a_prediction_test) also migrated. Dead code left for a follow-up: FiberAnnotations::apply_offset / apply_offset_helper, and bamlift.rs (still used by Ranges::new 1bp path and CenteredFiberData::find_offset).
0c66b9f to
650a2bf
Compare
Adds `read_fibertig_tags` / `write_fibertig_tags` helpers and a `FIBERTIG_TYPE = "fibertig"` annotation type name. The fs/fl/fa BAM tag wire format is unchanged (positions in molecular orientation, fa as `|`-separated chunks with `;`-joined columns), so existing fibertig-annotated BAMs decode byte-for-byte identically. In memory the pipeline now uses MolecularAnnotations / Annotation throughout: - read_bed_annotations returns HashMap<String, MolecularAnnotations>. - approximately_divide_annotations_by_window_size operates on MolecularAnnotations. - create_annotated_records_from_splits shifts coords record-relative and writes via write_fibertig_tags. - extract_to_bed reads via read_fibertig_tags and lifts ref coords through the record's aligned blocks. The Annotation.name field carries the BED column 4+ payload as a single semicolon-joined string, so each annotation owns its own extras — the historical fa-pairing concern around shared/overlapping peaks (and the `forward_extras` plumbing in FiberAnnotations) is structurally impossible now. Tests in tests/fibertig_test.rs updated to the new API. Storage is molecular ascending (matches fs input order), and iter_type yields BAM-orient query coords + lifted ref coords. Remaining users of FiberAnnotations: BaseMods (ML/MM-derived ranges held internally). `FiberAnnotations::from_bam_tags` / `write_to_bam_tags` on bamannotations.rs are now unused dead code; left in place for a followup that also migrates BaseMods and lets us delete FiberAnnotations + bamlift.rs entirely.
Delete the BaseMods / BaseMod / ModCall intermediate. MM/ML is now pure I/O on a MolecularAnnotations object: parse_mm_ml_into_ma fills it, write_mm_ml emits it. The legacy FiberAnnotation / FiberAnnotations / Ranges types and their iterators are deleted entirely. Behavior change - MM tag canonicalization Canonical fiberseq groups (mod_type 'a' -> m6a, 'm' -> cpg) have their per-group (base, strand_sign) reconstructed from the forward sequence on write. Non-canonical groups (e.g. C+h, T+f, ChEBI numeric IDs) are preserved verbatim under an annotation type named after the MM group header and round-trip unchanged. | Input MM tag | Output MM tag | Note | |-----------------------------|---------------------|--------------------------------------| | A+a,5;T-a,3;C+m,2; | A+a,5;C+m,2;T-a,3; | unchanged; emitted in sorted order | | N+a,3,2,1; (calls on A/T/G) | A+a,3;G+a,0;T-a,2; | regrouped from forward seq | | C+h,5;C+m,2; | C+h,5;C+m,2; | non-canonical group preserved | Positions and ML qualities round-trip exactly in every case.
bamlift's lift_query_range / lift_reference_positions are gone - the
spec library's AlignedBlocks::{lift_to_reference, lift_to_query} is
the single liftover path. center::find_offset is rewired onto
lift_to_query.
The boundary-bug regression (fiberseq#90) and coordinate-lift test are kept
and migrated, not deleted. Both now exercise AlignedBlocks. The
spec's inward-snap semantics natively fix fiberseq#90: an MSP that ends
exactly on a block boundary now lifts to the end of the current
block, not the start of the next one.
Collapses MA reads/writes to one spot each, addressing collaborator feedback that nuc/fire were doing their own from_record + write round-trips: - Add FiberseqData::serialize_annotations(legacy) as the single write entry point for code paths that have a FiberseqData. Wraps ma_io::write_annotations. - nucleosome::add_nucleosomes_to_record -> add_nucleosomes_to_annotations: takes &mut MolecularAnnotations and mutates in place; caller serializes. - fire::add_fire_to_rec uses rec.annotations directly instead of re-reading the record; emits via serialize_annotations. - subcommands/add_nucleosomes.rs constructs FiberseqData once per record and calls serialize_annotations at the end. - mock_fire.rs keeps its direct ma_io::write_annotations call (no FiberseqData since records are synthesized from BED); commented. While here: ma_io::write_annotations now strips m6a / cpg before emission as a safety net. The module already documented that base-mod types live in MM/ML, never in the MA tag set, but the writer wasn't enforcing it. Without this guard the new centralized flow would double-emit m6a (once via MM/ML, once via MA). Read path unchanged: FiberseqData::new is still the only place that reads MA tags + MM/ML.
Finishes the fibertig migration: the BED ↔ BAM pipeline now goes through
ma_io::{read,write}_annotations on the `fibertig` type, so fibertig data
lives on the MA/AL/AN tag set like every other fibertools annotation
type. Deletes read_fibertig_tags / write_fibertig_tags and the fs/fl/fa
wire format entirely - no fallback, no legacy flag.
Percent-encode `,` in BED extras
The MA-spec AN tag uses `,` as its separator with no escape mechanism
defined, but BED column 4+ payload can legally contain commas. The
fibertig BED reader percent-encodes `,` (-> %2C) and `%` (-> %25) into
Annotation.name on the way in; extract_to_bed reverses it on the way
out. AN on disk stays spec-compliant; round-trip is lossless for any
ASCII input.
This is a breaking change for any external tool that reads fs/fl/fa
from fibertig-annotated BAMs.
Post-MA-migration `add_fire_to_rec` writes FIRE precisions to a separate `fire+P` annotation type and leaves MSP with `QualitySpec::none()`. But several consumers still read `msp.qual()` expecting FIRE precision, the legacy convention from when fibertools stored precisions on the MSP `aq` tag.
Three producers loaded the existing annotation set from the record and appended their fresh output without first dropping the type they were about to re-emit. Re-running the command on its own output silently doubled the annotation list (or, in the case of nuc+msp where the existing type had a different QualitySpec, panicked inside a par_iter_mut worker): - nucleosome::add_nucleosomes_to_annotations Now drops pre-existing nuc, msp, AND fire types before adding. fire is included because it's paired 1:1 with msp by index - leaving fire behind while replacing msp would strand fire precisions against unrelated MSPs. Affects `ft add-nucleosomes` and `ft predict-m6a` (which also runs nucleosome calling on its output). - fire::add_fire_to_rec Drops the pre-existing fire type before adding. Affects `ft fire` re-runs (e.g. re-running with a different model). - ddda_to_m6a::ddda_to_m6a_record Drops pre-existing m6a before appending the Y/R-derived calls. The old code would emit duplicate MM positions for any Y/R base that also had a prior m6a call. Tests: two new integration tests in tests/molecular_annotation.rs exercise re-run idempotency for nucleosomes (including the stale-fire case) and fire. Both fail without these fixes.
Two latent correctness bugs in `parse_mm_ml_into_ma`: 1. **Multi-char alpha mod types collided with canonical dispatch.** The bucket selector matched the first character of `mod_type_str` (`'a'`/`'m'`), so any non-canonical alpha code starting with one of those — `C+ac` (acetyl), hypothetical `A+myloride`, etc. — silently routed into the `m6a` / `cpg` bucket and was rewritten as fiberseq canonical on the next write. The SAM regex permits multi-char alpha mod codes (`[A-Za-z]+`), so the input is valid. Now matches the exact string `"a"` / `"m"`; everything else lands in the verbatim non-canonical bucket and round-trips unchanged. 2. **Pre-existing basemod types in `annot` silently merged.** `add_calls` ultimately calls `MolecularAnnotations::add_annotation_type`, which returns the existing type if the name+QualitySpec match, and then appends. Two paths could hit this: repeated calls to `parse_mm_ml_into_ma` on the same `annot`, or a third-party BAM that wrongly emits `m6a` into both the MA tag set AND MM/ML — `FiberseqData::new` runs `read_annotations` before `parse_mm_ml_into_ma`, so the latter would merge on top instead of overwriting. MM/ML is the on-disk source of truth for basemods; now the parser clears any m6a/cpg/non-canonical type off `annot` first, making the function idempotent and ensuring MM/ML always wins on conflict. Tests: - `test_multichar_alpha_modtype_not_treated_as_canonical` synthesizes a `C+ac` MM tag and asserts it lands in its own type, not m6a. - `test_parse_mm_ml_idempotent_on_rerun` confirms double-invocation is a no-op and stale pre-seeded m6a is replaced, not merged. Both tests fail without these fixes.
`parse_mm_ml_into_ma` previously aborted on any MM group that claimed more `mod_base` positions than the forward sequence contained — the `assert_eq!` after the resolver loop panicked the whole worker thread. A single bad-actor BAM record could take down a `par_iter` chunk. Now warns and truncates the resolved positions. Critically, the ML offset arithmetic still advances by the *on-disk* slot count (`mod_dists.len()`), not the resolved count — otherwise the next group's ML qualities would be shifted by the truncated remainder and silently re-align with the wrong positions. Test: synthesizes a record whose first MM group `A+a` claims `total_a + 2` positions (impossible), followed by a legitimate `C+m` group with a sentinel quality. Asserts (a) no panic, (b) the m6a count is truncated to actual `A` count, (c) the C+m sentinel quality survives — proving the ML offset stayed correctly aligned.
| linear-map = "1.2.0" | ||
| regex = "1.9.1" | ||
| rust-htslib = "0.46" | ||
| rust-htslib = "0.47" |
There was a problem hiding this comment.
We were having lots of breaks in the static binary builds when we tried this last. Worth checking out
There was a problem hiding this comment.
Interesting, I will look into that.
| if starts.is_empty() { | ||
| return; | ||
| } | ||
| let t = annot.add_annotation_type(FIRE_TYPE, "P".parse::<QualitySpec>().expect("P parses")); |
| return; | ||
| } | ||
|
|
||
| let starts: Vec<u32> = msp.annotations.iter().map(|a| a.start).collect(); |
There was a problem hiding this comment.
Not all msps should become a fire annotation. Only ones with non zero scores. But maybe I am not understanding the code.
There was a problem hiding this comment.
big changes. Fine, but not sure I'm tracking all the logic.
There was a problem hiding this comment.
Yea, base mods need more thinking through. The MA library doesn't quite have the API to represent base mods in a way that makes it simple to read them from MM tags AND emit them in MM tags. Will think more about how to fix this from the library side.
There was a problem hiding this comment.
we might need a couple tests to make sure we aren't changing/breaking anything in fibertigs. maybe you already have.
| /// When `legacy` is not set, any pre-existing legacy tags are stripped so | ||
| /// the output is MA-only. | ||
| pub fn write_annotations(record: &mut bam::Record, annot: &MolecularAnnotations, legacy: bool) { | ||
| for tag in LEGACY_NUC_MSP_TAGS { |
There was a problem hiding this comment.
we only want to remove these tags if we know they were fibertools tags. otherwise we are removing other's user defined tags.
Which I think means it would be best/easier to drop these in the reader after we make the MA tag(if it did not exist)
| pub cpg: Ranges, | ||
| pub base_mods: BaseMods, | ||
| pub annotations: MolecularAnnotations, | ||
| pub ec: f32, |
There was a problem hiding this comment.
I think we should have a reference to the "accessibility" molecular annotation here, which will differ depending on whether we have a fiber-seq or daf-seq set of tags in the MA tags. And then tools like add_nucs and fire will pull counts from the accesibility tag rather than m6a directly, so it can work across techs without faking m6a calls.
FIRE is now the filtered subset of MSPs with non-zero XGB precision —
its own coords and qualities, not a sidecar to MSP. Consumers that
previously asked "what's the FIRE qual for this MSP" via the legacy
1:1 model are rewritten to iterate fiber.fire() directly.
* add_fire_to_rec filters precisions to p > 0 before building fire+Q
* fire_qual() helper deleted — its semantics conflated MSP iteration
with FIRE quality lookup
* pileup, validate, decorator/fire_decorators iterate fiber.fire()
* footprint splits the conflated msp_fire_scores into independent
has_spanning_msp + fire_quals tracked over fiber.fire() spanning
* BED12 (fibertools extract --all) and centered TSV writers emit
FIRE as its own coord group (starts/lengths/qual/ref_starts/
ref_lengths) parallel to nuc/m6a/cpg, not wedged into MSP columns
Obsolete tests removed: legacy_write_emits_both_tag_sets,
legacy_write_inverse_roundtrip, fire_qual_matches_legacy_and_ma_paths.
extract_all_napa snapshot regenerated to reflect new column shape.
Replace the BaseMods/BaseMod struct pair with parse_mm_ml_into_ma and write_mm_ml, which read and write MM/ML directly against a shared MolecularAnnotations object. Each call's MM group header (e.g. "A+a", "T-a", "N+a") is stored verbatim on Annotation.name so it round-trips through the writer byte-for-byte — including non-standard encodings like N+a that the old path silently normalized to A+a;T-a.
|
@mrvollger We read basemods (from edit: This is only really a "problem" if a common use case is for users of fibertools/MA want to look at basemods frequently. Which I'm unsure about Two possible options I'm considering and could use feedback on:
Not a blocker here but something to think about |
|
I feel fairly strongly that we should do option 2 and make the basemod reading and writing part of the MA lib. We could make it optional parts of the builder and writer in the lib. |
|
We could even make it a feature flag, that works with rust htslib and another for noodles |
This PR migrates fibertools to use the MolecularAnnotation (MA) Spec and library.
Motivation (no particular order):
Behavior changes worth flagging:
Deferred / follow-up
generic MA CLI.