Intelligent audio offset detection and correction during multi-track muxing (fixed-offset sync)

## Summary

Add an **optional, settings-gated** capability to MeedyaConverter that detects and corrects **fixed audio offset** between a designated *base* audio track and one or more *candidate* audio tracks being muxed into the same output container. This is the "**Issue 1 — Offset Sync**" feature in the collector/archive workflow discussed in [this Blu-ray forum thread](https://forum.blu-ray.com/showthread.php?t=309955) (members manually computing offsets to align imported audio dubs against a base track).

When enabled, MeedyaConverter analyses each candidate track against the base, computes the offset, presents the user with a **confidence score** and **diagnostic report**, and (on approval) applies the offset via container-level audio delay — no audio re-encoding required for offset-only corrections, so **passthrough remains intact** (including for spatial audio formats).

This issue covers **fixed offset only**. Drift correction, scene-cut handling, and the spatial-audio policy are tracked separately in a companion issue (link in **Related issues** once filed).

## Target users

- **Collectors** muxing imported foreign-language dubs into a domestic Blu-ray rip
- **Archivists** consolidating multiple releases of the same content
- **Audiophiles** muxing higher-bitrate audio (e.g. lossless from a different release) into a video they already own
- **Preservationists** assembling restored audio against archival video

All of these workflows currently require either Subtitle Edit + manual offset trial-and-error, MKVToolNix + a calculator, or DVDFab/MakeMKV's paid sync features. MeedyaConverter would replace that workflow.

## Default state and gating

**The feature is OFF by default.** Analysis adds 5-30 seconds per audio track to the conversion process (chromaprint fingerprinting + cross-correlation on a few-minute sample). Users who don't need it shouldn't pay that cost.

When OFF:
- Audio tracks are muxed with their declared timestamps unchanged (current behaviour)

When ON (via Settings → Audio Sync → Enable Offset Detection):
- For every multi-track mux job, each non-base audio track is analysed against the base
- A "Sync Analysis" step appears in the conversion progress UI
- Results are surfaced before the encode step commits

## Approach (technical)

### Detection: chromaprint fingerprints + refinement

1. **Stage 1 — Coarse offset (chromaprint)**: Decode the first ~3 minutes of dialogue/M&E from both base and candidate to PCM. Generate chromaprint fingerprints. Cross-correlate fingerprints → coarse offset estimate (resolution: ~100 ms).
2. **Stage 2 — Refinement (FFT cross-correlation)**: Around the coarse-offset region, run sample-accurate FFT cross-correlation on a 10-second window → sub-sample offset (resolution: <1 ms).
3. **Stage 3 — Verification**: Take a second sample from a different part of the timeline (e.g. 60% in) and verify the offset still holds. If it does, this is a fixed-offset case and we're done. **If the offset has changed → this is a drift case** → defer to the companion drift-correction issue.

### Correction: container-level delay (no re-encode)

The detected offset is applied as a container-level audio delay:
- For MKV / MP4: write the audio stream with an adjusted timestamp / track-level delay
- For MOV: edit list (`elst`) atom
- No audio sample re-encoding — passthrough preserved for AC3, E-AC3, DTS, DTS-HD MA, TrueHD, **Atmos, DTS:X, Auro-3D, PCM, FLAC** etc.
- This is critical for the spatial-audio case — fixed-offset correction does NOT destroy spatial metadata.

### Cross-language alignment via M&E features

When base and candidate are different languages, dialogue differs but the M&E (music and effects) stem is typically shared. Default behaviour: **cross-correlate on chromagram + onset envelope features**, not raw audio. These are robust to dialogue differences and detect alignment on shared music/effect cues.

A user-overridable setting allows forcing raw-PCM cross-correlation when languages match.

## Scope (phased)

### Phase A — Detection engine + diagnostic CLI
- [ ] Integrate **chromaprint** library (LGPL — link as shared library, attribute correctly). See [Open questions](#open-questions) regarding LGPL acceptability.
- [ ] Audio decoder path (FFmpeg) to PCM for chromaprint input — covers all standard codecs
- [ ] FFT-based cross-correlation implementation (Accelerate.framework on Apple platforms; FFTW or kissfft as cross-platform fallback)
- [ ] Chromagram + onset-envelope feature extraction (cross-language robustness path)
- [ ] CLI tool: `meedyaconvert audio-sync analyse --base <track-spec> --candidate <track-spec> <input-file>` — outputs JSON with offset, confidence, sample-points used, residual correlation
- [ ] Unit test: feed a known-offset pair (the same audio file, one delayed by exactly 500 ms) → must detect 500 ± 1 ms with confidence > 0.99

### Phase B — Pipeline integration + container-level delay
- [ ] Integrate the analyser into the muxing pipeline as a pre-encode stage
- [ ] CLI flags:
  - `--audio-sync-detect` (boolean — enable analysis)
  - `--audio-sync-base <track-index>` (which track is the reference; default: heuristic — see Open questions)
  - `--audio-sync-mode auto|prompt|reject` (auto = apply if confidence ≥ threshold; prompt = always ask; reject = analyse-only, never apply)
  - `--audio-sync-min-confidence <0.0-1.0>` (default 0.85; sub-floor = reject)
  - `--audio-sync-feature-set raw|chroma|onset|auto` (default: auto, picks based on language tag match)
  - `--audio-sync-sample-points <comma-list>` (default: 0.1,0.5,0.9 — fractions of duration to sample)
- [ ] Container-level delay application: MKV, MP4/MOV, MKA, M4A — verify timestamps on output and verify the audio truly aligns post-mux
- [ ] Integration test: take a real Blu-ray rip + a known-offset alternate dub, run end-to-end, verify both audio is in sync AND passthrough was preserved (no quality loss, codec unchanged)

### Phase C — Confidence reporting + verification UI
- [ ] **Confidence score model**: combine correlation peak height + peak sharpness + cross-sample-point agreement → single score 0.0-1.0
- [ ] Diagnostic report (JSON + human-readable text) — per analysis:
  - Detected offset (ms, samples)
  - Confidence score (0.0-1.0)
  - Per-sample-point correlation values
  - Feature set used (raw / chroma / onset)
  - Sample window durations
  - Warnings (e.g. "candidate is shorter than base; head trim suspected")
- [ ] Sanity check unit tests:
  - Sync a track against itself → offset = 0, confidence = 1.0
  - Sync silence against music → confidence < 0.1, refuse
  - Sync white noise against music → confidence < 0.2, refuse
- [ ] GUI: pre-mux confirmation dialog showing the offset, confidence, "Apply", "Skip this track", "Manual override" buttons. Includes a small **waveform overlay** of the first 5 seconds of dialogue after sync is applied so the user can eyeball it.

### Phase D — Settings, defaults, and polish
- [ ] Settings panel: Audio Sync section with:
  - Master enable toggle
  - Default behaviour: *Always ask*, *Auto-apply if confidence ≥ X*, *Auto-reject if confidence < Y*, *Analyse only (never apply)*
  - Confidence thresholds (auto-apply, hard reject)
  - Base-track selection strategy (see Open questions)
  - Feature-set preference (auto / force chromagram / force raw)
- [ ] Persistent "Remember my choice for this session / always" on per-job prompts
- [ ] Diagnostic export: "Save sync report" produces a self-contained `.json` + `.txt` bundle for support cases (no audio data leaks — just metadata + correlation curves)
- [ ] Progress UI during analysis (per-track progress bars; sub-stage labels: "fingerprinting", "correlating", "verifying")
- [ ] Cancellation: analysis must be cancellable mid-run without leaving partial state

## Settings UX (specification)

A new Settings panel section: **Audio Sync** (top-level, separate from Audio encoding settings).

| Setting | Type | Default | Notes |
|---------|------|---------|-------|
| Enable offset detection | toggle | OFF | Master switch |
| Default behaviour | radio | Always ask | Always ask / Auto-apply ≥ threshold / Auto-reject < threshold / Analyse only |
| Auto-apply threshold | slider | 0.95 | Only when "Auto-apply" selected; range 0.75-1.00 |
| Auto-reject threshold | slider | 0.70 | Below this, refuse without prompting; range 0.50-0.90 |
| Base track selection | radio | First lossless track / User picks per job / By language preference | See Open questions |
| Feature set preference | radio | Auto (recommended) | Auto / Force chromagram / Force raw PCM |
| Sample analysis points | multi-select | 10%, 50%, 90% | Fractions of duration to test |
| Show diagnostic report | toggle | ON | Show the per-track sync report after analysis |

## CLI behaviour

```
# Default — analyse, prompt, apply if approved
meedyaconvert mux --audio-sync-detect input.mkv

# Fully automated for batch jobs — apply when confidence ≥ 0.95, skip otherwise
meedyaconvert mux --audio-sync-detect --audio-sync-mode auto \
  --audio-sync-min-confidence 0.95 input.mkv

# Analyse only, report, don't touch the audio
meedyaconvert mux --audio-sync-detect --audio-sync-mode reject input.mkv
# → writes JSON report alongside, output file gets uncorrected audio + warning

# Force a specific base track
meedyaconvert mux --audio-sync-detect --audio-sync-base 0 input.mkv
```

JSON report schema (proposed):

```json
{
  "version": 1,
  "input_file": "input.mkv",
  "base_track": {"index": 0, "language": "eng", "codec": "truehd", "channels": 8},
  "analyses": [
    {
      "candidate_track": {"index": 1, "language": "ita", "codec": "dts", "channels": 6},
      "result": "applied",
      "offset_ms": 412.7,
      "offset_samples": 19811,
      "confidence": 0.94,
      "feature_set": "chroma",
      "sample_points": [
        {"position": 0.1, "correlation": 0.93},
        {"position": 0.5, "correlation": 0.95},
        {"position": 0.9, "correlation": 0.94}
      ],
      "warnings": []
    }
  ]
}
```

## Acceptance criteria

- [ ] Phase A complete: synthetic-test sync (same audio + known offset) succeeds at sub-millisecond accuracy
- [ ] Phase B complete and shipped: end-to-end test on at least three real-world cases:
  - Same-language fixed offset (e.g. PCM English vs DTS-HD MA English, different masters)
  - Cross-language fixed offset (e.g. English TrueHD base + Italian DTS dub)
  - Spatial audio passthrough (e.g. Atmos candidate with fixed offset against TrueHD Atmos base — must preserve Atmos metadata in output)
- [ ] Phase C complete: confidence score correlates with subjective sync quality (validated against a manual A/B listening test for at least five sync results)
- [ ] Hard-reject floor is enforced: a sync attempt with confidence < 0.70 default never applies without explicit user override
- [ ] CLI API documentation (per standing task #11) updated with all `--audio-sync-*` flags and the JSON report schema
- [ ] In-app help (`Resources/Help/`) and the GitHub wiki document:
  - What the feature does
  - The difference between offset (this issue) and drift (companion issue)
  - **What is explicitly NOT supported**: drift correction (separate feature), scene-cut handling (separate feature), automatic base-track selection by content (only by metadata heuristics)
  - Performance impact estimate (typical 5-30 sec/track)
  - Privacy: analysis happens locally; no audio leaves the machine
- [ ] Security review (standing task #4) confirms: (a) malformed input audio cannot crash the analyser, (b) the chromaprint dependency is pinned to a known-good version, (c) JSON report output cannot leak filesystem paths beyond the input file, (d) cancellation cleanly tears down without leaving temp files

## Technical / security notes

1. **License question — chromaprint is LGPL.** Linking dynamically and shipping the library as a separate framework is the safe path for a proprietary app. See Open questions below — we need to confirm this is acceptable before committing to the dependency.
2. **Performance**: chromaprint is fast (~real-time at decode speed for fingerprinting). FFT cross-correlation on a 10-second window is sub-second. Total analysis time per candidate track on a typical multi-hour file: 5-30 seconds. Acceptable.
3. **Memory**: PCM decoding consumes ~10 MB/min of stereo 16-bit audio. Three sample windows × 10 seconds × 6 channels = ~3 MB. Negligible.
4. **No audio leaves the device**: this is a local-analysis feature, no telemetry, no cloud calls. Worth marketing.
5. **Base track selection**: see Open questions — heuristics vs user-pick is a real product decision.
6. **Stream copy preserved**: offset-only correction never requires audio decode/re-encode. This is the feature's biggest engineering advantage.

## Open questions

These need decisions before Phase A starts:

1. **LGPL chromaprint acceptable?** chromaprint is LGPL v2.1+. Linked dynamically and bundled as a framework, this is generally fine for proprietary apps (the LGPL allows it with attribution + a path for users to swap the library). But if the policy is "no copyleft dependencies", we need an alternative. Permissive alternatives: write our own perceptual fingerprinter (significant work), use raw FFT correlation only (less robust to codec differences), or license a commercial fingerprinter (rare).

2. **Base track selection heuristic — which is right?** Candidates:
   - **(a) First lossless audio track in the file** — simple, predictable, usually right
   - **(b) Track matching the user's preferred language (from system locale or settings)** — most semantically correct
   - **(c) Track with the most channels / highest bitrate** — assumes "best quality = base"
   - **(d) User picks per job** — most flexible but interrupts batch workflows
   - **(e) Combination**: heuristic default + user override per job
   - Recommendation: **(e)** with **(a) + (b)** combined as the default heuristic ("first lossless track matching the user's preferred language; if none, first lossless track of any language; if none, first track of any kind"). Want to confirm.

3. **What happens when the input has only one audio track?** The feature is silently a no-op. Should it surface a "no candidate tracks to sync" message, or just stay quiet?

4. **Multi-candidate base-relative sync?** When there are 4 audio tracks (1 base + 3 candidates), should they be synced *independently* against the base, or should we also check that the candidates agree with each other? Recommendation: independent against base only (simpler, faster, correct).

5. **What sample positions are optimal?** Default 10/50/90% of duration. But films often have low-content opening logos and closing credits — those sample points may give weak correlations even on perfectly-aligned tracks. Better defaults might be 15/50/85%? Worth measuring on a corpus.

6. **Granularity of the "Auto-apply" threshold**: default 0.95. Is this too conservative? Too permissive? Worth validating against the same corpus.

## References

- **Chromaprint** (proposed fingerprinting library): <https://github.com/acoustid/chromaprint> — LGPL v2.1+, used by AcoustID, MusicBrainz Picard, beets
- **AcoustID** (the service that uses chromaprint): <https://acoustid.org/> — same fingerprint algorithm, public reference for "is this the same audio?"
- **Apple Accelerate framework** (for FFT on Apple platforms): <https://developer.apple.com/documentation/accelerate>
- **DaVinci Resolve Auto Sync Audio** (reference UX): <https://www.blackmagicdesign.com/products/davinciresolve> — Edit page, Audio panel, "Auto Align Audio" command
- **Audacity Align Tracks** (simpler reference implementation): <https://manual.audacityteam.org/man/tracks_menu_align_tracks.html>
- **The Blu-ray forum thread that motivated this feature**: <https://forum.blu-ray.com/showthread.php?t=309955>

## Effort estimate

- Phase A: 2-3 weeks (chromaprint integration + FFT correlation + diagnostic CLI)
- Phase B: 2-3 weeks (pipeline integration + container-level delay + integration tests)
- Phase C: 2-3 weeks (confidence model + verification UI + waveform overlay)
- Phase D: 2 weeks (settings UI + persistence + polish)

Conservative total: **8-11 weeks of focused work.** Smaller than OFX (#419) and OCIO (#420), and more obviously valuable in the short term to the target user base.

## Related issues

- _(to be filed in this batch)_ Audio drift correction, scene-cut handling, and spatial-audio policy — the companion to this issue covering Issue 2 in full
- #419 — OpenFX (OFX) plugin host: unrelated but in the same Phase 9 pro-tier track
- #420 — OpenColorIO (OCIO) integration: unrelated but in the same Phase 9 pro-tier track
- _(to be filed)_ Premium-tier feature gating mechanism — this feature is a candidate for the paid tier per the original discussion

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Intelligent audio offset detection and correction during multi-track muxing (fixed-offset sync) #421

Summary

Target users

Default state and gating

Approach (technical)

Detection: chromaprint fingerprints + refinement

Correction: container-level delay (no re-encode)

Cross-language alignment via M&E features

Scope (phased)

Phase A — Detection engine + diagnostic CLI

Phase B — Pipeline integration + container-level delay

Phase C — Confidence reporting + verification UI

Phase D — Settings, defaults, and polish

Settings UX (specification)

CLI behaviour

Acceptance criteria

Technical / security notes

Open questions

References

Effort estimate

Related issues

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Setting	Type	Default	Notes
Enable offset detection	toggle	OFF	Master switch
Default behaviour	radio	Always ask	Always ask / Auto-apply ≥ threshold / Auto-reject < threshold / Analyse only
Auto-apply threshold	slider	0.95	Only when "Auto-apply" selected; range 0.75-1.00
Auto-reject threshold	slider	0.70	Below this, refuse without prompting; range 0.50-0.90
Base track selection	radio	First lossless track / User picks per job / By language preference	See Open questions
Feature set preference	radio	Auto (recommended)	Auto / Force chromagram / Force raw PCM
Sample analysis points	multi-select	10%, 50%, 90%	Fractions of duration to test
Show diagnostic report	toggle	ON	Show the per-track sync report after analysis

Intelligent audio offset detection and correction during multi-track muxing (fixed-offset sync) #421

Description

Summary

Target users

Default state and gating

Approach (technical)

Detection: chromaprint fingerprints + refinement

Correction: container-level delay (no re-encode)

Cross-language alignment via M&E features

Scope (phased)

Phase A — Detection engine + diagnostic CLI

Phase B — Pipeline integration + container-level delay

Phase C — Confidence reporting + verification UI

Phase D — Settings, defaults, and polish

Settings UX (specification)

CLI behaviour

Acceptance criteria

Technical / security notes

Open questions

References

Effort estimate

Related issues

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions