feat!: use MolecularAnnotation spec and library by cademirch · Pull Request #107 · fiberseq/fibertools-rs

cademirch · 2026-05-12T04:02:35Z

This PR migrates fibertools to use the MolecularAnnotation (MA) Spec and library.

Motivation (no particular order):

Conflicting / non-portable BAM tags. Fibertools emitted custom lowercase aux tags (ns/nl for nucleosomes, as/al/aq for MSPs, etc.). This led to: collisions with other custom aux tags (from other tools) and aux tag bloat (fibertigs). Also custom aux tags are mostly unreadable by any other tool without a fibertools-specific parser. After this PR, all extra tags are MA-spec compliant and any MA-aware reader can consume fibertools output without custom code.
Singular in memory representations of annotations. Before this PR, each annotation type was represented by its own field on FiberseqData. Now, all annotations live in one field in the MolecularAnnotations container type which will allow for simple addition of future annotations. This also makes it simpler to iterate all annotations for whatever operations needed.
Single liftover code path. Pre PR, there were 3 lift functions (with subtly different behavior at block boundaries). Now fibertools defers to the MA library liftover module. Any future issues that arise with liftovers should be simpler to track down.
Simpler base-mod handling. BaseMods/BaseMod is gone. MM/ML is pure I/O against the same MA object. ddda-to-m6a no longer has to fake up BaseMod intermediates, synthesized m6a is just another annotation appended to the MA object, written out by the same path as predict-m6a.
Centralized MA I/O. Reads happen in exactly one place (FiberseqData::new); writes happen in exactly one place per code path (FiberseqData::serialize_annotations, with mock_fire as a documented exception since it synthesizes records from BED).

Behavior changes worth flagging:

Legacy lowercase tags (ns/nl, as/al/aq, fire-specific tags, etc.) are no longer emitted by default. Anything reading them needs to switch to the MA tag set, or callers can pass --legacy-tags to keep emitting them alongside MA during the migration.
relating to Possible reference_ends bug? #90, block-boundary lifts no longer stretch MSPs/nucleosomes into adjacent blocks. This is consequence of MA library's inward-snap semantics.
MM tag canonicalization: groups are emitted in canonical fiberseq form (A+a/T-a for m6a, C+m/G+m for 5mC) reconstructed from the forward sequence. Positions and ML qualities round-trip exactly; MM tag bytes for non-canonical encodings (e.g. N+a wildcard groups, which the SAM spec allows but fiberseq tools don't emit in practice) are not byte-identical to input. @mrvollger could use your thoughts here especially. I'm not super familiar with MM tags and if this would be a problem.

Deferred / follow-up

Eliminate ddda-to-m6a as a separate subcommand by reading DddA Y/R sequence directly into m6a annotations on the read path. Plumbing change here makes this a straightforward follow-up.
Make extract --all, center, and pileup discover arbitrary user-defined annotation types in the MA tag set and emit columns/rows for them automatically — so fibertools can serve as a
generic MA CLI.

The molecular_annotation crate moved strand from AnnotationType to per-Annotation and replaced from_ma_record/write_ma_record with from_record + from_tags / to_record. Update src/utils/ma_io.rs and tests/molecular_annotation.rs to call sites. read_ma_tags now extracts MA/AL/AQ/AN aux fields explicitly and feeds them through MolecularAnnotations::from_tags, then attaches AlignedBlocks for liftover.

Updates the module-level doc comment to reflect the post-Phase-0 API shape, the planned producer/consumer split for nuc/msp/fire, and the in-memory-only m6a/cpg story (MM/ML stays the on-disk truth).

Adds FIRE_TYPE constant, build_annotations (nuc/msp/fire-aware), and extract_fire_arrays. build_nuc_msp_annotations is preserved as a thin wrapper so existing producers keep compiling until they migrate to the fire-aware builder.

This wires up nuc,msp, and fire annotations to go thru MA spec. Does not cover m6a/cpg.

Adopt MA spec's inward liftover semantics for query-space ranges. Legacy liftover_closest used inclusive endpoint matching (q_pos <= q_en), which mapped positions at block ends back into the previous block, over-extending ranges across deletions by 1bp. Inward semantics treat blocks as half-open, so a query position at the edge of a deletion lifts to the next aligned reference base. Nucleosome and MSP lengths now match the read's actual aligned footprint. Snapshot regenerated.

Replace direct field access (fiber.msp, fiber.nuc, etc.) with method accessors that return AnnotationTypeView<'_> derived from the spec's MolecularAnnotations. read_annotations + BaseMods::populate_ma now populate the spec object at construction with nuc/msp/fire from MA tags and m6a/cpg from ML/MM, so the spec is the source of truth for read-side coordinates. AnnotationTypeView delegates to the library's iter_type / AnnotationInfo for per-annotation iteration and ref-coord lifting, and provides Vec<i64> / Vec<Option<i64>> column accessors for the legacy column-oriented call sites. No bespoke per-annotation wrapper struct; call sites read AnnotationInfo fields directly. Legacy Ranges fields on FiberseqData are retained for the centering path, which still mutates them; they'll be dropped after centering migrates to the spec's translate. All read-site call sites (decorator, fire, footprint, pileup, qc, validate, utils/fire) updated to call fiber.X() instead of fiber.X. Centering deliberately left on field access.

Add `primary_qual(&[u8], &str) -> u8` and `AnnotationTypeView::qual_at(idx)` to make the single-quality assumption explicit. `qual()` now delegates to `qual_at(0)` with `debug_assert!(qualities.len() <= 1, …)`, so a future multi-quality annotation type can't silently lose qualities through `first().copied().unwrap_or(0)`. Migrates the existing single-quality call sites in ma_io (3 sites), footprint, and qc to the helper. Same byte-for-byte output; the assert fires only in debug builds if a multi-quality type appears.

CenteredFiberData no longer mutates a cloned FiberseqData. Instead it stashes both offsets and projects on-demand via the spec library's project_query / project_reference (which return ProjectedAnnotation with the centered coords already computed). grab_data collapses to a single helper per type. validate_msp_m6a_alignment moved onto CenteredFiberData, also using project_query. FiberseqData drops msp/nuc/m6a/cpg Ranges fields; read sites already go through the AnnotationTypeView accessors over self.annotations. apply_filter_fsd switched to MolecularAnnotations::retain. Call sites in pileup, add_nucleosomes, fiber_hmm, qc, fire, input_bam adjusted to the new field-less FiberseqData; non-regression test files (extract_test, m6a_prediction_test) also migrated. Dead code left for a follow-up: FiberAnnotations::apply_offset / apply_offset_helper, and bamlift.rs (still used by Ranges::new 1bp path and CenteredFiberData::find_offset).

Adds `read_fibertig_tags` / `write_fibertig_tags` helpers and a `FIBERTIG_TYPE = "fibertig"` annotation type name. The fs/fl/fa BAM tag wire format is unchanged (positions in molecular orientation, fa as `|`-separated chunks with `;`-joined columns), so existing fibertig-annotated BAMs decode byte-for-byte identically. In memory the pipeline now uses MolecularAnnotations / Annotation throughout: - read_bed_annotations returns HashMap<String, MolecularAnnotations>. - approximately_divide_annotations_by_window_size operates on MolecularAnnotations. - create_annotated_records_from_splits shifts coords record-relative and writes via write_fibertig_tags. - extract_to_bed reads via read_fibertig_tags and lifts ref coords through the record's aligned blocks. The Annotation.name field carries the BED column 4+ payload as a single semicolon-joined string, so each annotation owns its own extras — the historical fa-pairing concern around shared/overlapping peaks (and the `forward_extras` plumbing in FiberAnnotations) is structurally impossible now. Tests in tests/fibertig_test.rs updated to the new API. Storage is molecular ascending (matches fs input order), and iter_type yields BAM-orient query coords + lifted ref coords. Remaining users of FiberAnnotations: BaseMods (ML/MM-derived ranges held internally). `FiberAnnotations::from_bam_tags` / `write_to_bam_tags` on bamannotations.rs are now unused dead code; left in place for a followup that also migrates BaseMods and lets us delete FiberAnnotations + bamlift.rs entirely.

Delete the BaseMods / BaseMod / ModCall intermediate. MM/ML is now pure I/O on a MolecularAnnotations object: parse_mm_ml_into_ma fills it, write_mm_ml emits it. The legacy FiberAnnotation / FiberAnnotations / Ranges types and their iterators are deleted entirely. Behavior change - MM tag canonicalization Canonical fiberseq groups (mod_type 'a' -> m6a, 'm' -> cpg) have their per-group (base, strand_sign) reconstructed from the forward sequence on write. Non-canonical groups (e.g. C+h, T+f, ChEBI numeric IDs) are preserved verbatim under an annotation type named after the MM group header and round-trip unchanged. | Input MM tag | Output MM tag | Note | |-----------------------------|---------------------|--------------------------------------| | A+a,5;T-a,3;C+m,2; | A+a,5;C+m,2;T-a,3; | unchanged; emitted in sorted order | | N+a,3,2,1; (calls on A/T/G) | A+a,3;G+a,0;T-a,2; | regrouped from forward seq | | C+h,5;C+m,2; | C+h,5;C+m,2; | non-canonical group preserved | Positions and ML qualities round-trip exactly in every case.

bamlift's lift_query_range / lift_reference_positions are gone - the spec library's AlignedBlocks::{lift_to_reference, lift_to_query} is the single liftover path. center::find_offset is rewired onto lift_to_query. The boundary-bug regression (fiberseq#90) and coordinate-lift test are kept and migrated, not deleted. Both now exercise AlignedBlocks. The spec's inward-snap semantics natively fix fiberseq#90: an MSP that ends exactly on a block boundary now lifts to the end of the current block, not the start of the next one.

Collapses MA reads/writes to one spot each, addressing collaborator feedback that nuc/fire were doing their own from_record + write round-trips: - Add FiberseqData::serialize_annotations(legacy) as the single write entry point for code paths that have a FiberseqData. Wraps ma_io::write_annotations. - nucleosome::add_nucleosomes_to_record -> add_nucleosomes_to_annotations: takes &mut MolecularAnnotations and mutates in place; caller serializes. - fire::add_fire_to_rec uses rec.annotations directly instead of re-reading the record; emits via serialize_annotations. - subcommands/add_nucleosomes.rs constructs FiberseqData once per record and calls serialize_annotations at the end. - mock_fire.rs keeps its direct ma_io::write_annotations call (no FiberseqData since records are synthesized from BED); commented. While here: ma_io::write_annotations now strips m6a / cpg before emission as a safety net. The module already documented that base-mod types live in MM/ML, never in the MA tag set, but the writer wasn't enforcing it. Without this guard the new centralized flow would double-emit m6a (once via MM/ML, once via MA). Read path unchanged: FiberseqData::new is still the only place that reads MA tags + MM/ML.

Finishes the fibertig migration: the BED ↔ BAM pipeline now goes through ma_io::{read,write}_annotations on the `fibertig` type, so fibertig data lives on the MA/AL/AN tag set like every other fibertools annotation type. Deletes read_fibertig_tags / write_fibertig_tags and the fs/fl/fa wire format entirely - no fallback, no legacy flag. Percent-encode `,` in BED extras The MA-spec AN tag uses `,` as its separator with no escape mechanism defined, but BED column 4+ payload can legally contain commas. The fibertig BED reader percent-encodes `,` (-> %2C) and `%` (-> %25) into Annotation.name on the way in; extract_to_bed reverses it on the way out. AN on disk stays spec-compliant; round-trip is lossless for any ASCII input. This is a breaking change for any external tool that reads fs/fl/fa from fibertig-annotated BAMs.

Post-MA-migration `add_fire_to_rec` writes FIRE precisions to a separate `fire+P` annotation type and leaves MSP with `QualitySpec::none()`. But several consumers still read `msp.qual()` expecting FIRE precision, the legacy convention from when fibertools stored precisions on the MSP `aq` tag.

Three producers loaded the existing annotation set from the record and appended their fresh output without first dropping the type they were about to re-emit. Re-running the command on its own output silently doubled the annotation list (or, in the case of nuc+msp where the existing type had a different QualitySpec, panicked inside a par_iter_mut worker): - nucleosome::add_nucleosomes_to_annotations Now drops pre-existing nuc, msp, AND fire types before adding. fire is included because it's paired 1:1 with msp by index - leaving fire behind while replacing msp would strand fire precisions against unrelated MSPs. Affects `ft add-nucleosomes` and `ft predict-m6a` (which also runs nucleosome calling on its output). - fire::add_fire_to_rec Drops the pre-existing fire type before adding. Affects `ft fire` re-runs (e.g. re-running with a different model). - ddda_to_m6a::ddda_to_m6a_record Drops pre-existing m6a before appending the Y/R-derived calls. The old code would emit duplicate MM positions for any Y/R base that also had a prior m6a call. Tests: two new integration tests in tests/molecular_annotation.rs exercise re-run idempotency for nucleosomes (including the stale-fire case) and fire. Both fail without these fixes.

Two latent correctness bugs in `parse_mm_ml_into_ma`: 1. **Multi-char alpha mod types collided with canonical dispatch.** The bucket selector matched the first character of `mod_type_str` (`'a'`/`'m'`), so any non-canonical alpha code starting with one of those — `C+ac` (acetyl), hypothetical `A+myloride`, etc. — silently routed into the `m6a` / `cpg` bucket and was rewritten as fiberseq canonical on the next write. The SAM regex permits multi-char alpha mod codes (`[A-Za-z]+`), so the input is valid. Now matches the exact string `"a"` / `"m"`; everything else lands in the verbatim non-canonical bucket and round-trips unchanged. 2. **Pre-existing basemod types in `annot` silently merged.** `add_calls` ultimately calls `MolecularAnnotations::add_annotation_type`, which returns the existing type if the name+QualitySpec match, and then appends. Two paths could hit this: repeated calls to `parse_mm_ml_into_ma` on the same `annot`, or a third-party BAM that wrongly emits `m6a` into both the MA tag set AND MM/ML — `FiberseqData::new` runs `read_annotations` before `parse_mm_ml_into_ma`, so the latter would merge on top instead of overwriting. MM/ML is the on-disk source of truth for basemods; now the parser clears any m6a/cpg/non-canonical type off `annot` first, making the function idempotent and ensuring MM/ML always wins on conflict. Tests: - `test_multichar_alpha_modtype_not_treated_as_canonical` synthesizes a `C+ac` MM tag and asserts it lands in its own type, not m6a. - `test_parse_mm_ml_idempotent_on_rerun` confirms double-invocation is a no-op and stale pre-seeded m6a is replaced, not merged. Both tests fail without these fixes.

`parse_mm_ml_into_ma` previously aborted on any MM group that claimed more `mod_base` positions than the forward sequence contained — the `assert_eq!` after the resolver loop panicked the whole worker thread. A single bad-actor BAM record could take down a `par_iter` chunk. Now warns and truncates the resolved positions. Critically, the ML offset arithmetic still advances by the *on-disk* slot count (`mod_dists.len()`), not the resolved count — otherwise the next group's ML qualities would be shifted by the truncated remainder and silently re-align with the wrong positions. Test: synthesizes a record whose first MM group `A+a` claims `total_a + 2` positions (impossible), followed by a legitimate `C+m` group with a sentinel quality. Asserts (a) no panic, (b) the m6a count is truncated to actual `A` count, (c) the C+m sentinel quality survives — proving the ML offset stayed correctly aligned.

mrvollger · 2026-05-12T10:34:34Z

 linear-map = "1.2.0"
 regex = "1.9.1"
-rust-htslib = "0.46"
+rust-htslib = "0.47"


We were having lots of breaks in the static binary builds when we tried this last. Worth checking out

Interesting, I will look into that.

mrvollger · 2026-05-12T10:54:47Z

+    if starts.is_empty() {
+        return;
+    }
+    let t = annot.add_annotation_type(FIRE_TYPE, "P".parse::<QualitySpec>().expect("P parses"));


Quality is linear here

mrvollger · 2026-05-12T11:03:05Z

+        return;
+    }
+
+    let starts: Vec<u32> = msp.annotations.iter().map(|a| a.start).collect();


Not all msps should become a fire annotation. Only ones with non zero scores. But maybe I am not understanding the code.

mrvollger · 2026-05-14T01:29:51Z

big changes. Fine, but not sure I'm tracking all the logic.

Yea, base mods need more thinking through. The MA library doesn't quite have the API to represent base mods in a way that makes it simple to read them from MM tags AND emit them in MM tags. Will think more about how to fix this from the library side.

mrvollger · 2026-05-14T01:30:52Z

we might need a couple tests to make sure we aren't changing/breaking anything in fibertigs. maybe you already have.

mrvollger · 2026-05-14T01:34:14Z

+/// When `legacy` is not set, any pre-existing legacy tags are stripped so
+/// the output is MA-only.
+pub fn write_annotations(record: &mut bam::Record, annot: &MolecularAnnotations, legacy: bool) {
+    for tag in LEGACY_NUC_MSP_TAGS {


we only want to remove these tags if we know they were fibertools tags. otherwise we are removing other's user defined tags.

Which I think means it would be best/easier to drop these in the reader after we make the MA tag(if it did not exist)

mrvollger · 2026-05-14T01:38:52Z

-    pub cpg: Ranges,
-    pub base_mods: BaseMods,
+    pub annotations: MolecularAnnotations,
    pub ec: f32,


I think we should have a reference to the "accessibility" molecular annotation here, which will differ depending on whether we have a fiber-seq or daf-seq set of tags in the MA tags. And then tools like add_nucs and fire will pull counts from the accesibility tag rather than m6a directly, so it can work across techs without faking m6a calls.

We can save for a future PR.

FIRE is now the filtered subset of MSPs with non-zero XGB precision — its own coords and qualities, not a sidecar to MSP. Consumers that previously asked "what's the FIRE qual for this MSP" via the legacy 1:1 model are rewritten to iterate fiber.fire() directly. * add_fire_to_rec filters precisions to p > 0 before building fire+Q * fire_qual() helper deleted — its semantics conflated MSP iteration with FIRE quality lookup * pileup, validate, decorator/fire_decorators iterate fiber.fire() * footprint splits the conflated msp_fire_scores into independent has_spanning_msp + fire_quals tracked over fiber.fire() spanning * BED12 (fibertools extract --all) and centered TSV writers emit FIRE as its own coord group (starts/lengths/qual/ref_starts/ ref_lengths) parallel to nuc/m6a/cpg, not wedged into MSP columns Obsolete tests removed: legacy_write_emits_both_tag_sets, legacy_write_inverse_roundtrip, fire_qual_matches_legacy_and_ma_paths. extract_all_napa snapshot regenerated to reflect new column shape.

Replace the BaseMods/BaseMod struct pair with parse_mm_ml_into_ma and write_mm_ml, which read and write MM/ML directly against a shared MolecularAnnotations object. Each call's MM group header (e.g. "A+a", "T-a", "N+a") is stored verbatim on Annotation.name so it round-trips through the writer byte-for-byte — including non-standard encodings like N+a that the old path silently normalized to A+a;T-a.

cademirch · 2026-05-20T23:31:56Z

@mrvollger We read basemods (from MM/ML) into MolecularAnnotations (in memory) but write them back out as MM/ML. This is nice because MM/ML stays on disk source of truth for base modifications, but then downstream users need to do their own dance to read in MM/ML into MolecularAnnotations.

edit: This is only really a "problem" if a common use case is for users of fibertools/MA want to look at basemods frequently. Which I'm unsure about

Two possible options I'm considering and could use feedback on:

Write basemods to MA tags in addition to MM. We do have to decide what is the source of truth for when we encounter bams with both MA and MM for the same signal (ie m6a)
Provide some MM/ML reading api in the MA library. Not quite sure what this looks like yet.

Not a blocker here but something to think about

mrvollger · 2026-05-21T00:00:03Z

I feel fairly strongly that we should do option 2 and make the basemod reading and writing part of the MA lib.

We could make it optional parts of the builder and writer in the lib.

mrvollger · 2026-05-21T02:19:06Z

We could even make it a feature flag, that works with rust htslib and another for noodles

cademirch added 6 commits May 11, 2026 22:05

docs: refresh ma_io module overview

63ceda8

Updates the module-level doc comment to reflect the post-Phase-0 API shape, the planned producer/consumer split for nuc/msp/fire, and the in-memory-only m6a/cpg story (MM/ML stays the on-disk truth).

feat: ma_io fire+P builder and extractor

4d1b766

Adds FIRE_TYPE constant, build_annotations (nuc/msp/fire-aware), and extract_fire_arrays. build_nuc_msp_annotations is preserved as a thin wrapper so existing producers keep compiling until they migrate to the fire-aware builder.

feat: integrate MA spec into all subcommands

451efa4

This wires up nuc,msp, and fire annotations to go thru MA spec. Does not cover m6a/cpg.

cademirch force-pushed the molecular-annotation branch from 0b62961 to 0262039 Compare May 12, 2026 04:06

chore: update ma library dep

b9b8ac8

cademirch force-pushed the molecular-annotation branch from ee4e71a to b9b8ac8 Compare May 12, 2026 04:19

cademirch added 2 commits May 12, 2026 10:58

cademirch force-pushed the molecular-annotation branch from 0c66b9f to 650a2bf Compare May 12, 2026 17:00

cademirch added 4 commits May 12, 2026 11:10

cademirch marked this pull request as ready for review May 12, 2026 23:14

cademirch added 5 commits May 13, 2026 14:08

mrvollger reviewed May 14, 2026

View reviewed changes

cademirch added 4 commits May 14, 2026 11:22

fix: correct FIRE quality spec; use strand::unknown for nuc/msp/fire

91d7646

refactor: drop legacy tag emission across all writers

06e70ea

cademirch mentioned this pull request May 24, 2026

chore: merge molecular annotation #109

Merged

Conversation

cademirch commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cademirch commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mrvollger commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mrvollger commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cademirch commented May 12, 2026 •

edited

Loading

cademirch commented May 20, 2026 •

edited

Loading

mrvollger commented May 21, 2026 •

edited

Loading