Intelligent audio offset detection and correction during multi-track muxing (fixed-offset sync) #421

@Salem874

Summary

Add an optional, settings-gated capability to MeedyaConverter that detects and corrects fixed audio offset between a designated base audio track and one or more candidate audio tracks being muxed into the same output container. This is the "Issue 1 — Offset Sync" feature in the collector/archive workflow discussed in this Blu-ray forum thread (members manually computing offsets to align imported audio dubs against a base track).

When enabled, MeedyaConverter analyses each candidate track against the base, computes the offset, presents the user with a confidence score and diagnostic report, and (on approval) applies the offset via container-level audio delay — no audio re-encoding required for offset-only corrections, so passthrough remains intact (including for spatial audio formats).

This issue covers fixed offset only. Drift correction, scene-cut handling, and the spatial-audio policy are tracked separately in a companion issue (link in Related issues once filed).

Target users

  • Collectors muxing imported foreign-language dubs into a domestic Blu-ray rip
  • Archivists consolidating multiple releases of the same content
  • Audiophiles muxing higher-bitrate audio (e.g. lossless from a different release) into a video they already own
  • Preservationists assembling restored audio against archival video

All of these workflows currently require either Subtitle Edit + manual offset trial-and-error, MKVToolNix + a calculator, or DVDFab/MakeMKV's paid sync features. MeedyaConverter would replace that workflow.

Default state and gating

The feature is OFF by default. Analysis adds 5-30 seconds per audio track to the conversion process (chromaprint fingerprinting + cross-correlation on a few-minute sample). Users who don't need it shouldn't pay that cost.

When OFF:

  • Audio tracks are muxed with their declared timestamps unchanged (current behaviour)

When ON (via Settings → Audio Sync → Enable Offset Detection):

  • For every multi-track mux job, each non-base audio track is analysed against the base
  • A "Sync Analysis" step appears in the conversion progress UI
  • Results are surfaced before the encode step commits

Approach (technical)

Detection: chromaprint fingerprints + refinement

  1. Stage 1 — Coarse offset (chromaprint): Decode the first ~3 minutes of dialogue/M&E from both base and candidate to PCM. Generate chromaprint fingerprints. Cross-correlate fingerprints → coarse offset estimate (resolution: ~100 ms).
  2. Stage 2 — Refinement (FFT cross-correlation): Around the coarse-offset region, run FFT cross-correlation on a 10-second window of full-rate PCM → sample-accurate offset (resolution: <1 ms; one sample at 48 kHz is ~0.02 ms).
  3. Stage 3 — Verification: Take a second sample from a different part of the timeline (e.g. 60% in) and verify the offset still holds. If it does, this is a fixed-offset case and we're done. If the offset has changed → this is a drift case → defer to the companion drift-correction issue.
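
A minimal sketch of the Stage 1 idea, assuming the fingerprints are available as lists of 32-bit integers (the form chromaprint exposes). It slides one list against the other and keeps the lag with the lowest bit-error rate. The function name `coarse_lag`, the `min_overlap` guard and the `ITEM_SECONDS` spacing (about 0.124 s per item at what I believe are chromaprint's default settings) are assumptions for illustration and should be checked against the pinned library version. The fingerprints here are synthetic random values, not real audio.

```python
import random

# Assumed spacing between fingerprint items; verify against the pinned chromaprint build.
ITEM_SECONDS = 0.1238


def coarse_lag(base_fp, cand_fp, max_lag, min_overlap=50):
    """Return (lag, bit_error_rate) such that cand_fp[i + lag] best matches base_fp[i]."""
    best = (0, 1.0)
    for lag in range(-max_lag, max_lag + 1):
        bits = pairs = 0
        for i, b in enumerate(base_fp):
            j = i + lag
            if 0 <= j < len(cand_fp):
                # Hamming distance between two 32-bit fingerprint items.
                bits += bin(b ^ cand_fp[j]).count("1")
                pairs += 1
        if pairs >= min_overlap:
            err = bits / (32 * pairs)
            if err < best[1]:
                best = (lag, err)
    return best


# Synthetic check: the candidate is the base with 5 extra items prepended.
rng = random.Random(0)
base_fp = [rng.getrandbits(32) for _ in range(200)]
cand_fp = [rng.getrandbits(32) for _ in range(5)] + base_fp
lag, err = coarse_lag(base_fp, cand_fp, max_lag=20)
print(lag, err, lag * ITEM_SECONDS)
```

A positive lag here means the candidate's content starts later than the base's. The coarse estimate only narrows the search window for the refinement stage.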

Correction: container-level delay (no re-encode)

The detected offset is applied as a container-level audio delay:

  • For MKV: shift the audio track's timestamps (track-level delay)
  • For MP4 / MOV: edit list (elst) atom
  • No audio sample re-encoding — passthrough preserved for AC3, E-AC3, DTS, DTS-HD MA, TrueHD, Atmos, DTS:X, Auro-3D, PCM, FLAC etc.
  • This is critical for the spatial-audio case — fixed-offset correction does NOT destroy spatial metadata.
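
For comparison, this is roughly what the equivalent manual step looks like today with MKVToolNix, which the feature would automate. The sketch only builds an argument list; it assumes mkvmerge's `--sync TID:d` option with `d` in whole milliseconds, so the sub-millisecond part of the detected offset is rounded away. The helper name `mkvmerge_sync_args` and the file names are hypothetical, and MeedyaConverter's own muxer may apply the delay differently.

```python
def mkvmerge_sync_args(base_path, dub_path, out_path, delay_ms, dub_track_id=0):
    """Build an mkvmerge command that muxes dub_path into base_path with a delay.

    delay_ms is the delay to APPLY to the dub track: positive plays it later,
    negative plays it earlier. No audio is decoded or re-encoded.
    """
    d = int(round(delay_ms))  # assumption: --sync takes integer milliseconds
    return [
        "mkvmerge", "-o", out_path,
        base_path,
        "--sync", f"{dub_track_id}:{d}",  # applies to the next input file
        dub_path,
    ]


args = mkvmerge_sync_args("movie.mkv", "dub_ita.mka", "out.mkv", delay_ms=-412.7)
print(" ".join(args))
```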

Cross-language alignment via M&E features

When base and candidate are different languages, dialogue differs but the M&E (music and effects) stem is typically shared. Default behaviour: cross-correlate on chromagram + onset envelope features, not raw audio. These are robust to dialogue differences and detect alignment on shared music/effect cues.

A user-overridable setting allows forcing raw-PCM cross-correlation when languages match.
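
To illustrate why an onset envelope is less sensitive to dialogue than raw samples, here is a deliberately simplified, energy-based version: it keeps only the frame-to-frame rise in log energy, so the exact waveform no longer matters, only when something loud starts. A production onset detector would more likely use spectral flux; the function name `onset_envelope` and the 256-sample frame size are assumptions for the sketch.

```python
import math


def onset_envelope(samples, frame=256):
    """Half-wave-rectified rise in log frame energy, one value per frame."""
    energies = []
    for start in range(0, len(samples) - frame + 1, frame):
        e = sum(s * s for s in samples[start:start + frame]) / frame
        energies.append(math.log10(e + 1e-10))  # small floor avoids log(0) on silence
    # Only increases count as onsets; decays are clipped to zero.
    return [0.0] + [max(0.0, b - a) for a, b in zip(energies, energies[1:])]


# Synthetic check: silence, a 512-sample burst, then silence again.
signal = [0.0] * 1024 + [1.0, -1.0] * 256 + [0.0] * 1024
env = onset_envelope(signal)
onset_frame = env.index(max(env))
print(onset_frame, len(env))
```

Two tracks that share the same effects cue would both show a peak in the same frame, even if the dialogue mixed over it differs, and it is these envelopes that get cross-correlated.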

Scope (phased)

Phase A — Detection engine + diagnostic CLI

  • Integrate chromaprint library (LGPL — link as shared library, attribute correctly). See Open questions regarding LGPL acceptability.
  • Audio decoder path (FFmpeg) to PCM for chromaprint input — covers all standard codecs
  • FFT-based cross-correlation implementation (Accelerate.framework on Apple platforms; FFTW or kissfft as cross-platform fallback)
  • Chromagram + onset-envelope feature extraction (cross-language robustness path)
  • CLI tool: meedyaconvert audio-sync analyse --base <track-spec> --candidate <track-spec> <input-file> — outputs JSON with offset, confidence, sample-points used, residual correlation
  • Unit test: feed a known-offset pair (the same audio file, one delayed by exactly 500 ms) → must detect 500 ± 1 ms with confidence > 0.99
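
A toy version of the known-offset unit test, as a sketch only. It uses a pure-Python radix-2 FFT and a 2 kHz sample rate so it runs quickly without dependencies; the real implementation would use Accelerate, FFTW or kissfft on full-rate PCM as listed above. The names `fft` and `xcorr_lag` are hypothetical, and the signal is seeded random noise rather than real audio.

```python
import cmath
import random


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. Inverse is unnormalised."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def xcorr_lag(base, cand):
    """Lag in samples by which cand trails base (negative if cand leads)."""
    n = 1
    while n < len(base) + len(cand):
        n *= 2  # zero-pad so circular correlation equals linear correlation
    fb = fft([complex(v) for v in base] + [0j] * (n - len(base)))
    fc = fft([complex(v) for v in cand] + [0j] * (n - len(cand)))
    corr = fft([c * b.conjugate() for b, c in zip(fb, fc)], inverse=True)
    peak = max(range(n), key=lambda k: corr[k].real)
    return peak if peak <= n // 2 else peak - n


RATE = 2000  # toy sample rate in Hz
rng = random.Random(0)
base = [rng.uniform(-1.0, 1.0) for _ in range(2048)]
cand = [0.0] * (RATE // 2) + base  # same audio delayed by exactly 500 ms
detected = xcorr_lag(base, cand)
print(detected, detected / RATE * 1000.0)
```

Because the inverse transform is left unnormalised, the correlation values are only useful for locating the peak; a confidence score would need a normalised correlation.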

Phase B — Pipeline integration + container-level delay

  • Integrate the analyser into the muxing pipeline as a pre-encode stage
  • CLI flags:
    • --audio-sync-detect (boolean — enable analysis)
    • --audio-sync-base <track-index> (which track is the reference; default: heuristic — see Open questions)
    • --audio-sync-mode auto|prompt|reject (auto = apply if confidence ≥ threshold; prompt = always ask; reject = analyse-only, never apply)
    • --audio-sync-min-confidence <0.0-1.0> (default 0.85; sub-floor = reject)
    • --audio-sync-feature-set raw|chroma|onset|auto (default: auto, picks based on language tag match)
    • --audio-sync-sample-points <comma-list> (default: 0.1,0.5,0.9 — fractions of duration to sample)
  • Container-level delay application: MKV, MP4/MOV, MKA, M4A — verify timestamps on output and verify the audio truly aligns post-mux
  • Integration test: take a real Blu-ray rip + a known-offset alternate dub, run end-to-end, and verify both that the audio is in sync AND that passthrough was preserved (no quality loss, codec unchanged)
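
The mode and threshold flags above imply a small decision table. A sketch of one possible reading, assuming the 0.85 default for --audio-sync-min-confidence and a separate 0.70 hard-reject floor that applies even in prompt mode; the function name `decide` and the returned action labels are hypothetical.

```python
def decide(mode, confidence, min_confidence=0.85, hard_floor=0.70):
    """Map an analysis result to an action: 'apply', 'ask', 'skip' or 'report-only'."""
    if mode == "reject":
        return "report-only"  # analyse-only: never touch the audio
    if confidence < hard_floor:
        return "skip"  # below the floor: refuse without prompting
    if mode == "prompt":
        return "ask"
    if mode == "auto":
        return "apply" if confidence >= min_confidence else "skip"
    raise ValueError(f"unknown mode: {mode!r}")


for mode, conf in [("auto", 0.94), ("auto", 0.80), ("prompt", 0.94), ("prompt", 0.50), ("reject", 0.99)]:
    print(mode, conf, decide(mode, conf))
```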

Phase C — Confidence reporting + verification UI

  • Confidence score model: combine correlation peak height + peak sharpness + cross-sample-point agreement → single score 0.0-1.0
  • Diagnostic report (JSON + human-readable text) — per analysis:
    • Detected offset (ms, samples)
    • Confidence score (0.0-1.0)
    • Per-sample-point correlation values
    • Feature set used (raw / chroma / onset)
    • Sample window durations
    • Warnings (e.g. "candidate is shorter than base; head trim suspected")
  • Sanity check unit tests:
    • Sync a track against itself → offset = 0, confidence = 1.0
    • Sync silence against music → confidence < 0.1, refuse
    • Sync white noise against music → confidence < 0.2, refuse
  • GUI: pre-mux confirmation dialog showing the offset, confidence, "Apply", "Skip this track", "Manual override" buttons. Includes a small waveform overlay of the first 5 seconds of dialogue after sync is applied so the user can eyeball it.
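
The cross-sample-point agreement term can be sketched independently of the rest of the score: if the offsets measured at each sample point agree within a tolerance, the case is a fixed offset; if they spread out, it looks like drift and belongs to the companion issue. The function name `classify_offsets` and the 1 ms tolerance are assumptions for illustration, not part of the proposal.

```python
def classify_offsets(offsets_ms, tolerance_ms=1.0):
    """Return ('fixed' | 'drift', spread_ms) for offsets measured at several sample points."""
    spread = max(offsets_ms) - min(offsets_ms)
    label = "fixed" if spread <= tolerance_ms else "drift"
    return label, spread


# Offsets measured at 10%, 50% and 90% of the timeline (illustrative values).
print(classify_offsets([412.7, 412.9, 412.6]))
print(classify_offsets([412.7, 431.0, 449.5]))
```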

Phase D — Settings, defaults, and polish

  • Settings panel: Audio Sync section with:
    • Master enable toggle
    • Default behaviour: Always ask, Auto-apply if confidence ≥ X, Auto-reject if confidence < Y, Analyse only (never apply)
    • Confidence thresholds (auto-apply, hard reject)
    • Base-track selection strategy (see Open questions)
    • Feature-set preference (auto / force chromagram / force raw)
  • Persistent "Remember my choice for this session / always" on per-job prompts
  • Diagnostic export: "Save sync report" produces a self-contained .json + .txt bundle for support cases (no audio data leaks — just metadata + correlation curves)
  • Progress UI during analysis (per-track progress bars; sub-stage labels: "fingerprinting", "correlating", "verifying")
  • Cancellation: analysis must be cancellable mid-run without leaving partial state

Settings UX (specification)

A new Settings panel section: Audio Sync (top-level, separate from Audio encoding settings).

| Setting | Type | Default | Notes |
|---|---|---|---|
| Enable offset detection | toggle | OFF | Master switch |
| Default behaviour | radio | Always ask | Always ask / Auto-apply ≥ threshold / Auto-reject < threshold / Analyse only |
| Auto-apply threshold | slider | 0.95 | Only when "Auto-apply" selected; range 0.75-1.00 |
| Auto-reject threshold | slider | 0.70 | Below this, refuse without prompting; range 0.50-0.90 |
| Base track selection | radio | First lossless track / User picks per job / By language preference | See Open questions |
| Feature set preference | radio | Auto (recommended) | Auto / Force chromagram / Force raw PCM |
| Sample analysis points | multi-select | 10%, 50%, 90% | Fractions of duration to test |
| Show diagnostic report | toggle | ON | Show the per-track sync report after analysis |

CLI behaviour

# Default — analyse, prompt, apply if approved
meedyaconvert mux --audio-sync-detect input.mkv

# Fully automated for batch jobs — apply when confidence ≥ 0.95, skip otherwise
meedyaconvert mux --audio-sync-detect --audio-sync-mode auto \
  --audio-sync-min-confidence 0.95 input.mkv

# Analyse only, report, don't touch the audio
meedyaconvert mux --audio-sync-detect --audio-sync-mode reject input.mkv
# → writes JSON report alongside, output file gets uncorrected audio + warning

# Force a specific base track
meedyaconvert mux --audio-sync-detect --audio-sync-base 0 input.mkv

JSON report schema (proposed):

{
  "version": 1,
  "input_file": "input.mkv",
  "base_track": {"index": 0, "language": "eng", "codec": "truehd", "channels": 8},
  "analyses": [
    {
      "candidate_track": {"index": 1, "language": "ita", "codec": "dts", "channels": 6},
      "result": "applied",
      "offset_ms": 412.7,
      "offset_samples": 19811,
      "confidence": 0.94,
      "feature_set": "chroma",
      "sample_points": [
        {"position": 0.1, "correlation": 0.93},
        {"position": 0.5, "correlation": 0.95},
        {"position": 0.9, "correlation": 0.94}
      ],
      "warnings": []
    }
  ]
}
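
A sketch of how a batch script might consume this report, assuming the proposed schema lands as written. It lists candidate tracks that were not applied or that fall below a chosen confidence threshold. The function name `needs_review` and the 0.95 threshold are hypothetical.

```python
import json


def needs_review(report_text, threshold=0.95):
    """Return (track_index, offset_ms, confidence) for analyses worth a manual look."""
    report = json.loads(report_text)
    return [
        (a["candidate_track"]["index"], a["offset_ms"], a["confidence"])
        for a in report["analyses"]
        if a["result"] != "applied" or a["confidence"] < threshold
    ]


sample = """
{
  "version": 1,
  "input_file": "input.mkv",
  "base_track": {"index": 0, "language": "eng", "codec": "truehd", "channels": 8},
  "analyses": [
    {
      "candidate_track": {"index": 1, "language": "ita", "codec": "dts", "channels": 6},
      "result": "applied",
      "offset_ms": 412.7,
      "offset_samples": 19811,
      "confidence": 0.94,
      "feature_set": "chroma",
      "sample_points": [],
      "warnings": []
    }
  ]
}
"""
print(needs_review(sample))
```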

Acceptance criteria

  • Phase A complete: synthetic-test sync (same audio + known offset) succeeds at sub-millisecond accuracy
  • Phase B complete and shipped: end-to-end test on at least three real-world cases:
    • Same-language fixed offset (e.g. PCM English vs DTS-HD MA English, different masters)
    • Cross-language fixed offset (e.g. English TrueHD base + Italian DTS dub)
    • Spatial audio passthrough (e.g. Atmos candidate with fixed offset against TrueHD Atmos base — must preserve Atmos metadata in output)
  • Phase C complete: confidence score correlates with subjective sync quality (validated against a manual A/B listening test for at least five sync results)
  • Hard-reject floor is enforced: a sync attempt with confidence below the floor (default 0.70) is never applied without explicit user override
  • CLI API documentation (per standing task [Phase 0.5] GitHub Project Board #11) updated with all --audio-sync-* flags and the JSON report schema
  • In-app help (Resources/Help/) and the GitHub wiki document:
    • What the feature does
    • The difference between offset (this issue) and drift (companion issue)
    • What is explicitly NOT supported: drift correction (separate feature), scene-cut handling (separate feature), automatic base-track selection by content (only by metadata heuristics)
    • Performance impact estimate (typical 5-30 sec/track)
    • Privacy: analysis happens locally; no audio leaves the machine
  • Security review (standing task [Phase 5.13] Matrix encoding metadata on downmix — embed Pro Logic II/Dolby Surround for AVR unfold #4) confirms: (a) malformed input audio cannot crash the analyser, (b) the chromaprint dependency is pinned to a known-good version, (c) JSON report output cannot leak filesystem paths beyond the input file, (d) cancellation cleanly tears down without leaving temp files

Technical / security notes

  1. License question — chromaprint is LGPL. Linking dynamically and shipping the library as a separate framework is the safe path for a proprietary app. See Open questions below — we need to confirm this is acceptable before committing to the dependency.
  2. Performance: chromaprint is fast (fingerprinting keeps pace with audio decode, so decode speed is the limit). FFT cross-correlation on a 10-second window is sub-second. Total analysis time per candidate track on a typical multi-hour file: 5-30 seconds. Acceptable.
  3. Memory: decoded PCM costs ~10 MB/min for stereo 16-bit audio (11.5 MB/min at 48 kHz). Three sample windows × 10 seconds × 6 channels of 16-bit PCM at 48 kHz = ~17 MB (~3 MB if downmixed to mono first). Negligible.
  4. No audio leaves the device: this is a local-analysis feature, no telemetry, no cloud calls. Worth marketing.
  5. Base track selection: see Open questions — heuristics vs user-pick is a real product decision.
  6. Stream copy preserved: offset-only correction never requires audio decode/re-encode. This is the feature's biggest engineering advantage.
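
The PCM figures in the memory note can be checked with a few lines of arithmetic. Assuming 48 kHz, 16-bit samples (both assumptions; the issue does not fix a sample rate), three 10-second windows come to about 17 MB for six channels and about 3 MB after a mono downmix. The helper name `pcm_bytes` is hypothetical.

```python
def pcm_bytes(seconds, channels, sample_rate=48000, bytes_per_sample=2):
    """Size in bytes of uncompressed interleaved PCM."""
    return seconds * channels * sample_rate * bytes_per_sample


stereo_minute = pcm_bytes(60, 2)       # one minute of stereo
windows_6ch = pcm_bytes(3 * 10, 6)     # three 10-second windows, 5.1
windows_mono = pcm_bytes(3 * 10, 1)    # same windows after a mono downmix
print(stereo_minute / 1e6, windows_6ch / 1e6, windows_mono / 1e6)
```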

Open questions

These need decisions before Phase A starts:

  1. LGPL chromaprint acceptable? chromaprint is LGPL v2.1+. Linked dynamically and bundled as a framework, this is generally fine for proprietary apps (the LGPL allows it with attribution + a path for users to swap the library). But if the policy is "no copyleft dependencies", we need an alternative. Permissive alternatives: write our own perceptual fingerprinter (significant work), use raw FFT correlation only (less robust to codec differences), or license a commercial fingerprinter (rare).

  2. Base track selection heuristic — which is right? Candidates:

    • (a) First lossless audio track in the file — simple, predictable, usually right
    • (b) Track matching the user's preferred language (from system locale or settings) — most semantically correct
    • (c) Track with the most channels / highest bitrate — assumes "best quality = base"
    • (d) User picks per job — most flexible but interrupts batch workflows
    • (e) Combination: heuristic default + user override per job
    • Recommendation: (e) with (a) + (b) combined as the default heuristic ("first lossless track matching the user's preferred language; if none, first lossless track of any language; if none, first track of any kind"). Want to confirm.
  3. What happens when the input has only one audio track? The feature is silently a no-op. Should it surface a "no candidate tracks to sync" message, or just stay quiet?

  4. Multi-candidate base-relative sync? When there are 4 audio tracks (1 base + 3 candidates), should they be synced independently against the base, or should we also check that the candidates agree with each other? Recommendation: independent against base only (simpler, faster, correct).

  5. What sample positions are optimal? Default 10/50/90% of duration. But films often have low-content opening logos and closing credits — those sample points may give weak correlations even on perfectly-aligned tracks. Better defaults might be 15/50/85%? Worth measuring on a corpus.

  6. Granularity of the "Auto-apply" threshold: default 0.95. Is this too conservative? Too permissive? Worth validating against the same corpus.
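
For question 2, the recommended default heuristic is small enough to sketch directly: first lossless track in the preferred language, else first lossless track, else first track. The track dictionaries mirror the JSON report fields; the `LOSSLESS` codec names and the function name `pick_base` are assumptions, and the real codec identifiers would come from the demuxer.

```python
# Assumed codec labels for lossless formats; real identifiers depend on the demuxer.
LOSSLESS = {"truehd", "dts-hd ma", "flac", "pcm", "alac"}


def pick_base(tracks, preferred_lang):
    """Pick the base track by metadata only; returns None if there are no tracks."""
    lossless = [t for t in tracks if t["codec"] in LOSSLESS]
    for t in lossless:
        if t["language"] == preferred_lang:
            return t
    if lossless:
        return lossless[0]
    return tracks[0] if tracks else None


tracks = [
    {"index": 0, "language": "ita", "codec": "dts"},
    {"index": 1, "language": "fra", "codec": "flac"},
    {"index": 2, "language": "eng", "codec": "truehd"},
]
print(pick_base(tracks, "eng")["index"], pick_base(tracks, "deu")["index"])
```

A per-job user override, as in option (e), would simply bypass this function.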

References

Effort estimate

  • Phase A: 2-3 weeks (chromaprint integration + FFT correlation + diagnostic CLI)
  • Phase B: 2-3 weeks (pipeline integration + container-level delay + integration tests)
  • Phase C: 2-3 weeks (confidence model + verification UI + waveform overlay)
  • Phase D: 2 weeks (settings UI + persistence + polish)

Conservative total: 8-11 weeks of focused work. Smaller than OFX (#419) and OCIO (#420), and more obviously valuable in the short term to the target user base.

Related issues
