Skip to content

feat!: use MolecularAnnotation spec and library#107

Open
cademirch wants to merge 22 commits into
fiberseq:mainfrom
cademirch:molecular-annotation
Open

feat!: use MolecularAnnotation spec and library#107
cademirch wants to merge 22 commits into
fiberseq:mainfrom
cademirch:molecular-annotation

Conversation

@cademirch
Copy link
Copy Markdown
Contributor

@cademirch cademirch commented May 12, 2026

This PR migrates fibertools to use the MolecularAnnotation (MA) Spec and library.

Motivation (no particular order):

  1. Conflicting / non-portable BAM tags. Fibertools emitted custom lowercase aux tags (ns/nl for nucleosomes, as/al/aq for MSPs, etc.). This led to: collisions with other custom aux tags (from other tools) and aux tag bloat (fibertigs). Also custom aux tags are mostly unreadable by any other tool without a fibertools-specific parser. After this PR, all extra tags are MA-spec compliant and any MA-aware reader can consume fibertools output without custom code.
  2. Singular in memory representations of annotations. Before this PR, each annotation type was represented by its own field on FiberseqData. Now, all annotations live in one field in the MolecularAnnotations container type which will allow for simple addition of future annotations. This also makes it simpler to iterate all annotations for whatever operations needed.
  3. Single liftover code path. Pre PR, there were 3 lift functions (with subtly different behavior at block boundaries). Now fibertools defers to the MA library liftover module. Any future issues that arise with liftovers should be simpler to track down.
  4. Simpler base-mod handling. BaseMods/BaseMod is gone. MM/ML is pure I/O against the same MA object. ddda-to-m6a no longer has to fake up BaseMod intermediates, synthesized m6a is just another annotation appended to the MA object, written out by the same path as predict-m6a.
  5. Centralized MA I/O. Reads happen in exactly one place (FiberseqData::new); writes happen in exactly one place per code path (FiberseqData::serialize_annotations, with mock_fire as a documented exception since it synthesizes records from BED).

Behavior changes worth flagging:

  • Legacy lowercase tags (ns/nl, as/al/aq, fire-specific tags, etc.) are no longer emitted by default. Anything reading them needs to switch to the MA tag set, or callers can pass --legacy-tags to keep emitting them alongside MA during the migration.
  • relating to Possible reference_ends bug? #90, block-boundary lifts no longer stretch MSPs/nucleosomes into adjacent blocks. This is consequence of MA library's inward-snap semantics.
  • MM tag canonicalization: groups are emitted in canonical fiberseq form (A+a/T-a for m6a, C+m/G+m for 5mC) reconstructed from the forward sequence. Positions and ML qualities round-trip exactly; MM tag bytes for non-canonical encodings (e.g. N+a wildcard groups, which the SAM spec allows but fiberseq tools don't emit in practice) are not byte-identical to input. @mrvollger could use your thoughts here especially. I'm not super familiar with MM tags and if this would be a problem.

Deferred / follow-up

  • Eliminate ddda-to-m6a as a separate subcommand by reading DddA Y/R sequence directly into m6a annotations on the read path. Plumbing change here makes this a straightforward follow-up.
  • Make extract --all, center, and pileup discover arbitrary user-defined annotation types in the MA tag set and emit columns/rows for them automatically — so fibertools can serve as a
    generic MA CLI.

cademirch added 6 commits May 11, 2026 22:05
The molecular_annotation crate moved strand from AnnotationType to per-Annotation
and replaced from_ma_record/write_ma_record with from_record + from_tags / to_record.
Update src/utils/ma_io.rs and tests/molecular_annotation.rs to call sites.

read_ma_tags now extracts MA/AL/AQ/AN aux fields explicitly and feeds them through
MolecularAnnotations::from_tags, then attaches AlignedBlocks for liftover.
Updates the module-level doc comment to reflect the post-Phase-0 API
shape, the planned producer/consumer split for nuc/msp/fire, and the
in-memory-only m6a/cpg story (MM/ML stays the on-disk truth).
Adds FIRE_TYPE constant, build_annotations (nuc/msp/fire-aware), and
extract_fire_arrays. build_nuc_msp_annotations is preserved as a thin
wrapper so existing producers keep compiling until they migrate to the
fire-aware builder.
This wires up nuc,msp, and fire annotations to go thru MA spec. Does not cover m6a/cpg.
Adopt MA spec's inward liftover semantics for query-space ranges.
Legacy liftover_closest used inclusive endpoint matching (q_pos <= q_en),
which mapped positions at block ends back into the previous block,
over-extending ranges across deletions by 1bp.

Inward semantics treat blocks as half-open, so a query position at the
edge of a deletion lifts to the next aligned reference base. Nucleosome
and MSP lengths now match the read's actual aligned footprint.

Snapshot regenerated.
Replace direct field access (fiber.msp, fiber.nuc, etc.) with method
accessors that return AnnotationTypeView<'_> derived from the spec's
MolecularAnnotations. read_annotations + BaseMods::populate_ma now
populate the spec object at construction with nuc/msp/fire from MA
tags and m6a/cpg from ML/MM, so the spec is the source of truth for
read-side coordinates.

AnnotationTypeView delegates to the library's iter_type /
AnnotationInfo for per-annotation iteration and ref-coord lifting,
and provides Vec<i64> / Vec<Option<i64>> column accessors for the
legacy column-oriented call sites. No bespoke per-annotation wrapper
struct; call sites read AnnotationInfo fields directly.

Legacy Ranges fields on FiberseqData are retained for the centering
path, which still mutates them; they'll be dropped after centering
migrates to the spec's translate.

All read-site call sites (decorator, fire, footprint, pileup, qc,
validate, utils/fire) updated to call fiber.X() instead of fiber.X.
Centering deliberately left on field access.
@cademirch cademirch force-pushed the molecular-annotation branch from 0b62961 to 0262039 Compare May 12, 2026 04:06
@cademirch cademirch force-pushed the molecular-annotation branch from ee4e71a to b9b8ac8 Compare May 12, 2026 04:19
cademirch added 2 commits May 12, 2026 10:58
Add `primary_qual(&[u8], &str) -> u8` and `AnnotationTypeView::qual_at(idx)`
to make the single-quality assumption explicit. `qual()` now delegates to
`qual_at(0)` with `debug_assert!(qualities.len() <= 1, …)`, so a future
multi-quality annotation type can't silently lose qualities through
`first().copied().unwrap_or(0)`.

Migrates the existing single-quality call sites in ma_io (3 sites),
footprint, and qc to the helper. Same byte-for-byte output; the assert
fires only in debug builds if a multi-quality type appears.
CenteredFiberData no longer mutates a cloned FiberseqData. Instead it
stashes both offsets and projects on-demand via the spec library's
project_query / project_reference (which return ProjectedAnnotation
with the centered coords already computed). grab_data collapses to a
single helper per type. validate_msp_m6a_alignment moved onto
CenteredFiberData, also using project_query.

FiberseqData drops msp/nuc/m6a/cpg Ranges fields; read sites already
go through the AnnotationTypeView accessors over self.annotations.

apply_filter_fsd switched to MolecularAnnotations::retain.

Call sites in pileup, add_nucleosomes, fiber_hmm, qc, fire, input_bam
adjusted to the new field-less FiberseqData; non-regression test files
(extract_test, m6a_prediction_test) also migrated.

Dead code left for a follow-up: FiberAnnotations::apply_offset /
apply_offset_helper, and bamlift.rs (still used by Ranges::new 1bp
path and CenteredFiberData::find_offset).
@cademirch cademirch force-pushed the molecular-annotation branch from 0c66b9f to 650a2bf Compare May 12, 2026 17:00
cademirch added 4 commits May 12, 2026 11:10
Adds `read_fibertig_tags` / `write_fibertig_tags` helpers and a
`FIBERTIG_TYPE = "fibertig"` annotation type name. The fs/fl/fa BAM
tag wire format is unchanged (positions in molecular orientation,
fa as `|`-separated chunks with `;`-joined columns), so existing
fibertig-annotated BAMs decode byte-for-byte identically.

In memory the pipeline now uses MolecularAnnotations / Annotation
throughout:
- read_bed_annotations returns HashMap<String, MolecularAnnotations>.
- approximately_divide_annotations_by_window_size operates on
  MolecularAnnotations.
- create_annotated_records_from_splits shifts coords record-relative
  and writes via write_fibertig_tags.
- extract_to_bed reads via read_fibertig_tags and lifts ref coords
  through the record's aligned blocks.

The Annotation.name field carries the BED column 4+ payload as a
single semicolon-joined string, so each annotation owns its own
extras — the historical fa-pairing concern around shared/overlapping
peaks (and the `forward_extras` plumbing in FiberAnnotations) is
structurally impossible now.

Tests in tests/fibertig_test.rs updated to the new API. Storage is
molecular ascending (matches fs input order), and iter_type yields
BAM-orient query coords + lifted ref coords.

Remaining users of FiberAnnotations: BaseMods (ML/MM-derived ranges
held internally). `FiberAnnotations::from_bam_tags` /
`write_to_bam_tags` on bamannotations.rs are now unused dead code;
left in place for a followup that also migrates BaseMods and lets
us delete FiberAnnotations + bamlift.rs entirely.
Delete the BaseMods / BaseMod / ModCall intermediate. MM/ML is now
pure I/O on a MolecularAnnotations object: parse_mm_ml_into_ma fills
it, write_mm_ml emits it. The legacy FiberAnnotation /
FiberAnnotations / Ranges types and their iterators are deleted
entirely.

Behavior change - MM tag canonicalization

Canonical fiberseq groups (mod_type 'a' -> m6a, 'm' -> cpg) have
their per-group (base, strand_sign) reconstructed from the forward
sequence on write. Non-canonical groups (e.g. C+h, T+f, ChEBI numeric
IDs) are preserved verbatim under an annotation type named after the
MM group header and round-trip unchanged.

| Input MM tag                | Output MM tag       | Note                                 |
|-----------------------------|---------------------|--------------------------------------|
| A+a,5;T-a,3;C+m,2;          | A+a,5;C+m,2;T-a,3;  | unchanged; emitted in sorted order   |
| N+a,3,2,1; (calls on A/T/G) | A+a,3;G+a,0;T-a,2;  | regrouped from forward seq           |
| C+h,5;C+m,2;                | C+h,5;C+m,2;        | non-canonical group preserved        |

Positions and ML qualities round-trip exactly in every case.
bamlift's lift_query_range / lift_reference_positions are gone - the
spec library's AlignedBlocks::{lift_to_reference, lift_to_query} is
the single liftover path. center::find_offset is rewired onto
lift_to_query.

The boundary-bug regression (fiberseq#90) and coordinate-lift test are kept
and migrated, not deleted. Both now exercise AlignedBlocks. The
spec's inward-snap semantics natively fix fiberseq#90: an MSP that ends
exactly on a block boundary now lifts to the end of the current
block, not the start of the next one.
Collapses MA reads/writes to one spot each, addressing collaborator
feedback that nuc/fire were doing their own from_record + write
round-trips:

- Add FiberseqData::serialize_annotations(legacy) as the single write
  entry point for code paths that have a FiberseqData. Wraps
  ma_io::write_annotations.
- nucleosome::add_nucleosomes_to_record -> add_nucleosomes_to_annotations:
  takes &mut MolecularAnnotations and mutates in place; caller
  serializes.
- fire::add_fire_to_rec uses rec.annotations directly instead of
  re-reading the record; emits via serialize_annotations.
- subcommands/add_nucleosomes.rs constructs FiberseqData once per
  record and calls serialize_annotations at the end.
- mock_fire.rs keeps its direct ma_io::write_annotations call (no
  FiberseqData since records are synthesized from BED); commented.

While here: ma_io::write_annotations now strips m6a / cpg before
emission as a safety net. The module already documented that
base-mod types live in MM/ML, never in the MA tag set, but the
writer wasn't enforcing it. Without this guard the new centralized
flow would double-emit m6a (once via MM/ML, once via MA).

Read path unchanged: FiberseqData::new is still the only place that
reads MA tags + MM/ML.
@cademirch cademirch marked this pull request as ready for review May 12, 2026 23:14
cademirch added 5 commits May 13, 2026 14:08
Finishes the fibertig migration: the BED ↔ BAM pipeline now goes through
ma_io::{read,write}_annotations on the `fibertig` type, so fibertig data
lives on the MA/AL/AN tag set like every other fibertools annotation
type. Deletes read_fibertig_tags / write_fibertig_tags and the fs/fl/fa
wire format entirely - no fallback, no legacy flag.

Percent-encode `,` in BED extras

The MA-spec AN tag uses `,` as its separator with no escape mechanism
defined, but BED column 4+ payload can legally contain commas. The
fibertig BED reader percent-encodes `,` (-> %2C) and `%` (-> %25) into
Annotation.name on the way in; extract_to_bed reverses it on the way
out. AN on disk stays spec-compliant; round-trip is lossless for any
ASCII input.

This is a breaking change for any external tool that reads fs/fl/fa
from fibertig-annotated BAMs.
  Post-MA-migration `add_fire_to_rec` writes FIRE precisions to a separate
  `fire+P` annotation type and leaves MSP with `QualitySpec::none()`. But
  several consumers still read `msp.qual()` expecting FIRE precision, the
  legacy convention from when fibertools stored precisions on the MSP `aq`
  tag.
Three producers loaded the existing annotation set from the record and
appended their fresh output without first dropping the type they were
about to re-emit. Re-running the command on its own output silently
doubled the annotation list (or, in the case of nuc+msp where the
existing type had a different QualitySpec, panicked inside a
par_iter_mut worker):

- nucleosome::add_nucleosomes_to_annotations
  Now drops pre-existing nuc, msp, AND fire types before adding. fire
  is included because it's paired 1:1 with msp by index - leaving fire
  behind while replacing msp would strand fire precisions against
  unrelated MSPs. Affects `ft add-nucleosomes` and `ft predict-m6a`
  (which also runs nucleosome calling on its output).

- fire::add_fire_to_rec
  Drops the pre-existing fire type before adding. Affects `ft fire`
  re-runs (e.g. re-running with a different model).

- ddda_to_m6a::ddda_to_m6a_record
  Drops pre-existing m6a before appending the Y/R-derived calls. The
  old code would emit duplicate MM positions for any Y/R base that
  also had a prior m6a call.

Tests: two new integration tests in tests/molecular_annotation.rs
exercise re-run idempotency for nucleosomes (including the stale-fire
case) and fire. Both fail without these fixes.
Two latent correctness bugs in `parse_mm_ml_into_ma`:

1. **Multi-char alpha mod types collided with canonical dispatch.**
   The bucket selector matched the first character of `mod_type_str`
   (`'a'`/`'m'`), so any non-canonical alpha code starting with one
   of those — `C+ac` (acetyl), hypothetical `A+myloride`, etc. —
   silently routed into the `m6a` / `cpg` bucket and was rewritten
   as fiberseq canonical on the next write. The SAM regex permits
   multi-char alpha mod codes (`[A-Za-z]+`), so the input is valid.
   Now matches the exact string `"a"` / `"m"`; everything else lands
   in the verbatim non-canonical bucket and round-trips unchanged.

2. **Pre-existing basemod types in `annot` silently merged.**
   `add_calls` ultimately calls `MolecularAnnotations::add_annotation_type`,
   which returns the existing type if the name+QualitySpec match,
   and then appends. Two paths could hit this: repeated calls to
   `parse_mm_ml_into_ma` on the same `annot`, or a third-party BAM
   that wrongly emits `m6a` into both the MA tag set AND MM/ML —
   `FiberseqData::new` runs `read_annotations` before
   `parse_mm_ml_into_ma`, so the latter would merge on top instead
   of overwriting. MM/ML is the on-disk source of truth for
   basemods; now the parser clears any m6a/cpg/non-canonical type
   off `annot` first, making the function idempotent and ensuring
   MM/ML always wins on conflict.

Tests:
- `test_multichar_alpha_modtype_not_treated_as_canonical` synthesizes
  a `C+ac` MM tag and asserts it lands in its own type, not m6a.
- `test_parse_mm_ml_idempotent_on_rerun` confirms double-invocation
  is a no-op and stale pre-seeded m6a is replaced, not merged.
Both tests fail without these fixes.
`parse_mm_ml_into_ma` previously aborted on any MM group that claimed
more `mod_base` positions than the forward sequence contained — the
`assert_eq!` after the resolver loop panicked the whole worker thread.
A single bad-actor BAM record could take down a `par_iter` chunk.

Now warns and truncates the resolved positions. Critically, the ML
offset arithmetic still advances by the *on-disk* slot count
(`mod_dists.len()`), not the resolved count — otherwise the next
group's ML qualities would be shifted by the truncated remainder and
silently re-align with the wrong positions.

Test: synthesizes a record whose first MM group `A+a` claims
`total_a + 2` positions (impossible), followed by a legitimate `C+m`
group with a sentinel quality. Asserts (a) no panic, (b) the m6a
count is truncated to actual `A` count, (c) the C+m sentinel quality
survives — proving the ML offset stayed correctly aligned.
Comment thread Cargo.toml
linear-map = "1.2.0"
regex = "1.9.1"
rust-htslib = "0.46"
rust-htslib = "0.47"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were having lots of breaks in the static binary builds when we tried this last. Worth checking out

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I will look into that.

Comment thread src/utils/ma_io.rs Outdated
if starts.is_empty() {
return;
}
let t = annot.add_annotation_type(FIRE_TYPE, "P".parse::<QualitySpec>().expect("P parses"));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quality is linear here

Comment thread src/utils/ma_io.rs Outdated
Comment thread src/fiber.rs Outdated
Comment thread src/subcommands/fire.rs Outdated
return;
}

let starts: Vec<u32> = msp.annotations.iter().map(|a| a.start).collect();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all msps should become a fire annotation. Only ones with non zero scores. But maybe I am not understanding the code.

Comment thread src/utils/basemods.rs
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

big changes. Fine, but not sure I'm tracking all the logic.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, base mods need more thinking through. The MA library doesn't quite have the API to represent base mods in a way that makes it simple to read them from MM tags AND emit them in MM tags. Will think more about how to fix this from the library side.

Comment thread src/utils/fibertig.rs
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might need a couple tests to make sure we aren't changing/breaking anything in fibertigs. maybe you already have.

Comment thread src/utils/ma_io.rs Outdated
/// When `legacy` is not set, any pre-existing legacy tags are stripped so
/// the output is MA-only.
pub fn write_annotations(record: &mut bam::Record, annot: &MolecularAnnotations, legacy: bool) {
for tag in LEGACY_NUC_MSP_TAGS {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we only want to remove these tags if we know they were fibertools tags. otherwise we are removing other's user defined tags.

Which I think means it would be best/easier to drop these in the reader after we make the MA tag(if it did not exist)

Comment thread src/utils/ma_io.rs Outdated
Comment thread src/fiber.rs
pub cpg: Ranges,
pub base_mods: BaseMods,
pub annotations: MolecularAnnotations,
pub ec: f32,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have a reference to the "accessibility" molecular annotation here, which will differ depending on whether we have a fiber-seq or daf-seq set of tags in the MA tags. And then tools like add_nucs and fire will pull counts from the accesibility tag rather than m6a directly, so it can work across techs without faking m6a calls.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can save for a future PR.

cademirch added 4 commits May 14, 2026 11:22
  FIRE is now the filtered subset of MSPs with non-zero XGB precision —
  its own coords and qualities, not a sidecar to MSP. Consumers that
  previously asked "what's the FIRE qual for this MSP" via the legacy
  1:1 model are rewritten to iterate fiber.fire() directly.

    * add_fire_to_rec filters precisions to p > 0 before building fire+Q
    * fire_qual() helper deleted — its semantics conflated MSP iteration
      with FIRE quality lookup
    * pileup, validate, decorator/fire_decorators iterate fiber.fire()
    * footprint splits the conflated msp_fire_scores into independent
      has_spanning_msp + fire_quals tracked over fiber.fire() spanning
    * BED12 (fibertools extract --all) and centered TSV writers emit
      FIRE as its own coord group (starts/lengths/qual/ref_starts/
      ref_lengths) parallel to nuc/m6a/cpg, not wedged into MSP columns

  Obsolete tests removed: legacy_write_emits_both_tag_sets,
  legacy_write_inverse_roundtrip, fire_qual_matches_legacy_and_ma_paths.
  extract_all_napa snapshot regenerated to reflect new column shape.
  Replace the BaseMods/BaseMod struct pair with parse_mm_ml_into_ma and
  write_mm_ml, which read and write MM/ML directly against a shared
  MolecularAnnotations object. Each call's MM group header (e.g. "A+a",
  "T-a", "N+a") is stored verbatim on Annotation.name so it round-trips
  through the writer byte-for-byte — including non-standard encodings
  like N+a that the old path silently normalized to A+a;T-a.
@cademirch
Copy link
Copy Markdown
Contributor Author

cademirch commented May 20, 2026

@mrvollger We read basemods (from MM/ML) into MolecularAnnotations (in memory) but write them back out as MM/ML. This is nice because MM/ML stays on disk source of truth for base modifications, but then downstream users need to do their own dance to read in MM/ML into MolecularAnnotations.

edit: This is only really a "problem" if a common use case is for users of fibertools/MA want to look at basemods frequently. Which I'm unsure about

Two possible options I'm considering and could use feedback on:

  1. Write basemods to MA tags in addition to MM. We do have to decide what is the source of truth for when we encounter bams with both MA and MM for the same signal (ie m6a)
  2. Provide some MM/ML reading api in the MA library. Not quite sure what this looks like yet.

Not a blocker here but something to think about

@mrvollger
Copy link
Copy Markdown
Member

mrvollger commented May 21, 2026

I feel fairly strongly that we should do option 2 and make the basemod reading and writing part of the MA lib.

We could make it optional parts of the builder and writer in the lib.

@mrvollger
Copy link
Copy Markdown
Member

We could even make it a feature flag, that works with rust htslib and another for noodles

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants