Add an optional, settings-gated capability to MeedyaConverter that detects and corrects fixed audio offset between a designated base audio track and one or more candidate audio tracks being muxed into the same output container. This is the "Issue 1 — Offset Sync" feature in the collector/archive workflow discussed in this Blu-ray forum thread (members manually computing offsets to align imported audio dubs against a base track).
When enabled, MeedyaConverter analyses each candidate track against the base, computes the offset, presents the user with a confidence score and diagnostic report, and (on approval) applies the offset via container-level audio delay — no audio re-encoding required for offset-only corrections, so passthrough remains intact (including for spatial audio formats).
This issue covers fixed offset only. Drift correction, scene-cut handling, and the spatial-audio policy are tracked separately in a companion issue (link in Related issues once filed).
Target users
Collectors muxing imported foreign-language dubs into a domestic Blu-ray rip
Archivists consolidating multiple releases of the same content
Audiophiles muxing higher-bitrate audio (e.g. lossless from a different release) into a video they already own
Preservationists assembling restored audio against archival video
All of these workflows currently require either Subtitle Edit + manual offset trial-and-error, MKVToolNix + a calculator, or DVDFab/MakeMKV's paid sync features. MeedyaConverter would replace that workflow.
Default state and gating
The feature is OFF by default. Analysis adds 5-30 seconds per audio track to the conversion process (chromaprint fingerprinting + cross-correlation on a few-minute sample). Users who don't need it shouldn't pay that cost.
When OFF:
Audio tracks are muxed with their declared timestamps unchanged (current behaviour)
When ON (via Settings → Audio Sync → Enable Offset Detection):
For every multi-track mux job, each non-base audio track is analysed against the base
A "Sync Analysis" step appears in the conversion progress UI
Results are surfaced before the encode step commits
Approach (technical)
Detection: chromaprint fingerprints + refinement
Stage 1 — Coarse offset (chromaprint): Decode the first ~3 minutes of dialogue/M&E from both base and candidate to PCM. Generate chromaprint fingerprints. Cross-correlate fingerprints → coarse offset estimate (resolution: ~100 ms).
Stage 2 — Refinement (FFT cross-correlation): Around the coarse-offset region, run sample-accurate FFT cross-correlation on a 10-second window → offset to the nearest sample (resolution: <1 ms; one sample at 48 kHz is about 0.02 ms).
Stage 3 — Verification: Take a second sample from a different part of the timeline (e.g. 60% in) and verify the offset still holds. If it does, this is a fixed-offset case and we're done. If the offset has changed → this is a drift case → defer to the companion drift-correction issue.
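The three stages can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: a block-wise amplitude envelope stands in for chromaprint fingerprints, a brute-force dot-product search stands in for FFT cross-correlation, and every function name here is hypothetical rather than part of MeedyaConverter.

```python
import random

def best_lag(base, cand, lags):
    """Lag in `lags` maximising sum(base[i] * cand[i + lag]).
    A positive result means `cand` starts later than `base`."""
    def score(lag):
        lo = max(0, -lag)
        hi = min(len(base), len(cand) - lag)
        return sum(base[i] * cand[i + lag] for i in range(lo, hi))
    return max(lags, key=score)

def envelope(samples, hop):
    """Mean absolute amplitude per block of `hop` samples, mean-removed.
    Assumption: a crude stand-in for a real fingerprint feature."""
    env = [sum(abs(s) for s in samples[i:i + hop]) / hop
           for i in range(0, len(samples) - hop + 1, hop)]
    mean = sum(env) / len(env)
    return [e - mean for e in env]

def estimate_offset(base, cand, hop, max_blocks):
    # Stage 1 (coarse): correlate low-resolution envelopes, one value per block.
    coarse_blocks = best_lag(envelope(base, hop), envelope(cand, hop),
                             range(-max_blocks, max_blocks + 1))
    coarse = coarse_blocks * hop
    # Stage 2 (fine): sample-accurate search within one block of the coarse lag.
    return best_lag(base, cand, range(coarse - hop, coarse + hop + 1))

def is_fixed_offset(offsets, tolerance):
    """Stage 3: offsets measured at different timeline positions must agree."""
    return max(offsets) - min(offsets) <= tolerance

# Synthetic check: noise with block-varying loudness, delayed by 130 samples.
rng = random.Random(7)
base = []
for _ in range(200):
    gain = rng.random()
    base.extend(gain * rng.uniform(-1, 1) for _ in range(50))
cand = [0.0] * 130 + base
offset = estimate_offset(base, cand, hop=50, max_blocks=10)
print("estimated offset in samples:", offset)
```

If `is_fixed_offset` returns false for offsets taken from different parts of the timeline, the case would be handed to the drift-correction work rather than corrected here.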
Correction: container-level delay (no re-encode)
The detected offset is applied as a container-level audio delay:
For MKV / MP4: write the audio stream with an adjusted timestamp / track-level delay
For MOV: edit list (elst) atom
No audio sample re-encoding — passthrough preserved for AC3, E-AC3, DTS, DTS-HD MA, TrueHD, Atmos, DTS:X, Auro-3D, PCM, FLAC etc.
This is critical for the spatial-audio case — fixed-offset correction does NOT destroy spatial metadata.
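As an illustration of what a container-level delay looks like in practice, the sketch below builds an `mkvmerge` command line using its `--sync TID:d` option, which shifts a track's timestamps by `d` milliseconds while copying the stream as-is. This is an assumption-laden stand-in: the issue does not say MeedyaConverter will call `mkvmerge`, the helper name is hypothetical, and the sign convention (positive means "play later") is chosen here for illustration.

```python
def mkvmerge_delay_args(base_file, dub_file, dub_track_id, delay_ms, output):
    """Build an mkvmerge argument list that muxes `dub_file` into `base_file`
    with one of its tracks shifted by `delay_ms` (negative = earlier).
    Hypothetical helper; only the mkvmerge options themselves are real."""
    delay = round(delay_ms)  # rounded to whole milliseconds to stay conservative
    return [
        "mkvmerge", "-o", output,
        base_file,
        "--sync", f"{dub_track_id}:{delay}",  # applies to the next input file
        dub_file,
    ]

args = mkvmerge_delay_args("base.mkv", "dub_ita.mka", 0, -412.7, "out.mkv")
print(" ".join(args))
```

The audio payload is never decoded in this path, which is why the codec and any spatial metadata carried in the stream survive unchanged.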
Cross-language alignment via M&E features
When base and candidate are different languages, dialogue differs but the M&E (music and effects) stem is typically shared. Default behaviour: cross-correlate on chromagram + onset envelope features, not raw audio. These are robust to dialogue differences and detect alignment on shared music/effect cues.
A user-overridable setting allows forcing raw-PCM cross-correlation when languages match.
Scope (phased)
Phase A — Detection engine + diagnostic CLI
Integrate chromaprint library (LGPL — link as shared library, attribute correctly). See Open questions regarding LGPL acceptability.
Audio decoder path (FFmpeg) to PCM for chromaprint input — covers all standard codecs
FFT-based cross-correlation implementation (Accelerate.framework on Apple platforms; FFTW or kissfft as cross-platform fallback)
--audio-sync-feature-set raw|chroma|onset|auto (default: auto, picks based on language tag match)
--audio-sync-sample-points <comma-list> (default: 0.1,0.5,0.9 — fractions of duration to sample)
Container-level delay application: MKV, MP4/MOV, MKA, M4A — verify timestamps on output and verify the audio truly aligns post-mux
Integration test: take a real Blu-ray rip + a known-offset alternate dub, run end-to-end, verify both audio is in sync AND passthrough was preserved (no quality loss, codec unchanged)
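The `--audio-sync-sample-points` value listed above is a comma-separated list of fractions of the duration. A hedged sketch of turning it into window start times follows; centring each window on its sample point and clamping it to the file bounds are assumptions, as is the function name.

```python
def sample_window_starts(spec, duration_s, window_s):
    """Parse a spec such as "0.1,0.5,0.9" into window start times in seconds.

    Each window is centred on its sample point and clamped so that it
    stays inside [0, duration_s] (assumed behaviour, not from the issue).
    """
    fractions = [float(part) for part in spec.split(",")]
    if any(not 0.0 <= f <= 1.0 for f in fractions):
        raise ValueError("sample points must be fractions between 0 and 1")
    latest_start = max(duration_s - window_s, 0.0)
    starts = []
    for fraction in sorted(fractions):
        start = fraction * duration_s - window_s / 2
        starts.append(min(max(start, 0.0), latest_start))
    return starts

# Two-hour film, 10-second analysis windows, default sample points.
print(sample_window_starts("0.1,0.5,0.9", 7200.0, 10.0))
```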
Diagnostic report (JSON + human-readable text) — per analysis:
Detected offset (ms, samples)
Confidence score (0.0-1.0)
Per-sample-point correlation values
Feature set used (raw / chroma / onset)
Sample window durations
Warnings (e.g. "candidate is shorter than base; head trim suspected")
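The report fields above could be assembled roughly as follows. The key names are borrowed from the JSON report schema proposed in this issue; the helper itself, and the choice to derive milliseconds from the sample count and the sample rate, are assumptions for illustration.

```python
import json

def build_analysis(offset_samples, sample_rate, confidence,
                   feature_set, correlations, warnings=()):
    """Assemble one per-candidate report entry (hypothetical helper).

    correlations maps a sample position (fraction of duration) to the
    correlation measured there.
    """
    return {
        "offset_ms": round(offset_samples * 1000 / sample_rate, 1),
        "offset_samples": offset_samples,
        "confidence": confidence,
        "feature_set": feature_set,
        "sample_points": [
            {"position": position, "correlation": correlation}
            for position, correlation in sorted(correlations.items())
        ],
        "warnings": list(warnings),
    }

entry = build_analysis(19811, 48000, 0.94, "chroma",
                       {0.1: 0.93, 0.5: 0.95, 0.9: 0.94})
print(json.dumps(entry, indent=2))
```

Keeping both milliseconds and samples in the report lets a reader cross-check the two against the track's sample rate.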
Sanity check unit tests:
Sync a track against itself → offset = 0, confidence = 1.0
Sync silence against music → confidence < 0.1, refuse
Sync white noise against music → confidence < 0.2, refuse
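The sanity checks above can be exercised against a toy confidence measure. The sketch below uses the peak of the normalised cross-correlation as "confidence"; the issue does not define how confidence is computed, so this definition, the synthetic test signals, and the function name are all assumptions.

```python
import math
import random

def peak_correlation(base, cand, max_lag):
    """Return (lag, confidence), where confidence is the peak normalised
    cross-correlation, clipped at zero. Silence yields confidence 0.0."""
    norm = math.sqrt(sum(b * b for b in base) * sum(c * c for c in cand))
    if norm == 0.0:
        return 0, 0.0  # nothing to correlate against
    best_lag, best_score = 0, 0.0
    for lag in range(-max_lag, max_lag + 1):
        lo = max(0, -lag)
        hi = min(len(base), len(cand) - lag)
        score = sum(base[i] * cand[i + lag] for i in range(lo, hi)) / norm
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score

n = 2000
music = [math.sin(2 * math.pi * 5 * i / n) + 0.5 * math.sin(2 * math.pi * 13 * i / n)
         for i in range(n)]
silence = [0.0] * n
rng = random.Random(1)
noise = [rng.uniform(-1, 1) for _ in range(n)]

self_result = peak_correlation(music, music, max_lag=20)
silence_result = peak_correlation(silence, music, max_lag=20)
noise_result = peak_correlation(noise, music, max_lag=20)
```

A real implementation would run the same three cases through the production feature pipeline; the point here is only that the thresholds in the list are checkable with synthetic inputs.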
GUI: pre-mux confirmation dialog showing the offset, confidence, "Apply", "Skip this track", "Manual override" buttons. Includes a small waveform overlay of the first 5 seconds of dialogue after sync is applied so the user can eyeball it.
Phase D — Settings, defaults, and polish
Settings panel: Audio Sync section with:
Master enable toggle
Default behaviour: Always ask, Auto-apply if confidence ≥ X, Auto-reject if confidence < Y, Analyse only (never apply)
Confidence thresholds (auto-apply, hard reject)
Base-track selection strategy (see Open questions)
Feature-set preference (auto / force chromagram / force raw)
Persistent "Remember my choice for this session / always" on per-job prompts
Diagnostic export: "Save sync report" produces a self-contained .json + .txt bundle for support cases (no audio data leaks — just metadata + correlation curves)
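The default-behaviour options and thresholds above amount to a small decision function. The sketch below is one possible reading: mode names follow the `--audio-sync-mode` values, the default numbers are the figures quoted in this issue (0.95 auto-apply, 0.70 hard-reject floor), and both the function and its return labels are hypothetical.

```python
def decide(confidence, mode, auto_apply=0.95, hard_reject=0.70):
    """Map an analysis result to an action.

    mode: "auto" (apply when confident), "prompt" (always ask),
          "reject" (analyse only, never apply).
    Returns one of "apply", "prompt", "reject", "analyse-only".
    """
    if mode == "reject":
        return "analyse-only"
    if confidence < hard_reject:
        return "reject"      # below the floor: needs an explicit override
    if mode == "auto" and confidence >= auto_apply:
        return "apply"
    return "prompt"          # ask the user in every remaining case

for confidence in (0.97, 0.80, 0.60):
    print(confidence, decide(confidence, "auto"))
```

Keeping both thresholds as parameters matters because the right defaults are still an open question in this issue.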
Phase A complete: synthetic-test sync (same audio + known offset) succeeds at sub-millisecond accuracy
Phase B complete and shipped: end-to-end test on at least three real-world cases:
Same-language fixed offset (e.g. PCM English vs DTS-HD MA English, different masters)
Cross-language fixed offset (e.g. English TrueHD base + Italian DTS dub)
Spatial audio passthrough (e.g. Atmos candidate with fixed offset against TrueHD Atmos base — must preserve Atmos metadata in output)
Phase C complete: confidence score correlates with subjective sync quality (validated against a manual A/B listening test for at least five sync results)
Hard-reject floor is enforced: a sync attempt with confidence below the floor (default 0.70) is never applied without an explicit user override
In-app help (Resources/Help/) and the GitHub wiki document:
What the feature does
The difference between offset (this issue) and drift (companion issue)
What is explicitly NOT supported: drift correction (separate feature), scene-cut handling (separate feature), automatic base-track selection by content (only by metadata heuristics)
License question — chromaprint is LGPL. Linking dynamically and shipping the library as a separate framework is the safe path for a proprietary app. See Open questions below — we need to confirm this is acceptable before committing to the dependency.
Performance: chromaprint is fast (~real-time at decode speed for fingerprinting). FFT cross-correlation on a 10-second window is sub-second. Total analysis time per candidate track on a typical multi-hour file: 5-30 seconds. Acceptable.
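Memory: decoded PCM takes roughly 10 MB per minute of stereo 16-bit audio (about 11.5 MB at 48 kHz). Three sample windows × 10 seconds × 6 channels of 16-bit 48 kHz PCM come to about 17 MB, or about 3 MB if each window is downmixed to mono first. Negligible either way.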
No audio leaves the device: this is a local-analysis feature, no telemetry, no cloud calls. Worth marketing.
Base track selection: see Open questions — heuristics vs user-pick is a real product decision.
Stream copy preserved: offset-only correction never requires audio decode/re-encode. This is the feature's biggest engineering advantage.
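The memory figures in the notes above are easy to check with a worked calculation. The sketch assumes 48 kHz, 16-bit PCM (typical for Blu-ray audio, but an assumption here); under that assumption the roughly 3 MB figure matches three 10-second windows downmixed to mono, while keeping all six channels costs about 17 MB.

```python
def pcm_bytes(seconds, channels, sample_rate=48_000, bytes_per_sample=2):
    """Size in bytes of uncompressed PCM audio (assumed 48 kHz, 16-bit)."""
    return int(seconds * sample_rate) * channels * bytes_per_sample

one_minute_stereo = pcm_bytes(60, 2)
three_windows_6ch = 3 * pcm_bytes(10, 6)
three_windows_mono = 3 * pcm_bytes(10, 1)

for label, size in [("1 min stereo", one_minute_stereo),
                    ("3 x 10 s, 6 ch", three_windows_6ch),
                    ("3 x 10 s, mono", three_windows_mono)]:
    print(f"{label}: {size / 1e6:.2f} MB")
```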
Open questions
These need decisions before Phase A starts:
LGPL chromaprint acceptable? chromaprint is LGPL v2.1+. Linked dynamically and bundled as a framework, this is generally fine for proprietary apps (the LGPL allows it with attribution + a path for users to swap the library). But if the policy is "no copyleft dependencies", we need an alternative. Permissive alternatives: write our own perceptual fingerprinter (significant work), use raw FFT correlation only (less robust to codec differences), or license a commercial fingerprinter (rare).
Base track selection heuristic — which is right? Candidates:
(a) First lossless audio track in the file — simple, predictable, usually right
(b) Track matching the user's preferred language (from system locale or settings) — most semantically correct
(c) Track with the most channels / highest bitrate — assumes "best quality = base"
(d) User picks per job — most flexible but interrupts batch workflows
(e) Combination: heuristic default + user override per job
Recommendation: (e) with (a) + (b) combined as the default heuristic ("first lossless track matching the user's preferred language; if none, first lossless track of any language; if none, first track of any kind"). Want to confirm.
What happens when the input has only one audio track? The feature is silently a no-op. Should it surface a "no candidate tracks to sync" message, or just stay quiet?
Multi-candidate base-relative sync? When there are 4 audio tracks (1 base + 3 candidates), should they be synced independently against the base, or should we also check that the candidates agree with each other? Recommendation: independent against base only (simpler, faster, correct).
What sample positions are optimal? Default 10/50/90% of duration. But films often have low-content opening logos and closing credits — those sample points may give weak correlations even on perfectly-aligned tracks. Better defaults might be 15/50/85%? Worth measuring on a corpus.
Granularity of the "Auto-apply" threshold: default 0.95. Is this too conservative? Too permissive? Worth validating against the same corpus.
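The recommended base-track heuristic from option (e) above can be written down directly. In this sketch the track dictionaries mirror the `base_track` shape in the proposed JSON report, while the codec identifiers in `LOSSLESS_CODECS` and the function name are illustrative assumptions rather than MeedyaConverter's real values.

```python
# Illustrative codec identifiers only (assumption, not the app's real tags).
LOSSLESS_CODECS = {"truehd", "dts-hd ma", "flac", "pcm"}

def pick_base_track(tracks, preferred_language):
    """First lossless track in the preferred language; else the first
    lossless track of any language; else the first track of any kind.
    Returns None when there are no audio tracks at all."""
    lossless = [t for t in tracks if t["codec"] in LOSSLESS_CODECS]
    for track in lossless:
        if track["language"] == preferred_language:
            return track
    if lossless:
        return lossless[0]
    return tracks[0] if tracks else None

tracks = [
    {"index": 0, "language": "ita", "codec": "dts", "channels": 6},
    {"index": 1, "language": "fra", "codec": "flac", "channels": 2},
    {"index": 2, "language": "eng", "codec": "truehd", "channels": 8},
]
print(pick_base_track(tracks, "eng")["index"])
```

A per-job override would then only need to replace the returned track, which keeps batch runs uninterrupted by default.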
Conservative total: 8-11 weeks of focused work. Smaller than OFX (#419) and OCIO (#420), and more obviously valuable in the short term to the target user base.
Related issues
(to be filed in this batch) Audio drift correction, scene-cut handling, and spatial-audio policy — the companion to this issue covering Issue 2 in full
Phase A — Detection engine + diagnostic CLI
meedyaconvert audio-sync analyse --base <track-spec> --candidate <track-spec> <input-file>: outputs JSON with offset, confidence, sample points used, residual correlation
Phase B — Pipeline integration + container-level delay
--audio-sync-detect (boolean; enables analysis)
--audio-sync-base <track-index> (which track is the reference; default: heuristic, see Open questions)
--audio-sync-mode auto|prompt|reject (auto = apply if confidence ≥ threshold; prompt = always ask; reject = analyse-only, never apply)
--audio-sync-min-confidence <0.0-1.0> (default 0.85; sub-floor = reject)
--audio-sync-feature-set raw|chroma|onset|auto (default: auto, picks based on language tag match)
--audio-sync-sample-points <comma-list> (default: 0.1,0.5,0.9; fractions of duration to sample)
Phase C — Confidence reporting + verification UI
Settings UX (specification)
A new Settings panel section: Audio Sync (top-level, separate from Audio encoding settings).
CLI behaviour
JSON report schema (proposed):
{
  "version": 1,
  "input_file": "input.mkv",
  "base_track": {"index": 0, "language": "eng", "codec": "truehd", "channels": 8},
  "analyses": [
    {
      "candidate_track": {"index": 1, "language": "ita", "codec": "dts", "channels": 6},
      "result": "applied",
      "offset_ms": 412.7,
      "offset_samples": 19811,
      "confidence": 0.94,
      "feature_set": "chroma",
      "sample_points": [
        {"position": 0.1, "correlation": 0.93},
        {"position": 0.5, "correlation": 0.95},
        {"position": 0.9, "correlation": 0.94}
      ],
      "warnings": []
    }
  ]
}
Acceptance criteria
--audio-sync-* flags and the JSON report schema
Technical / security notes