QwenVoice Dev Validation Report — April 18, 2026
Executive Summary
The April 18, 2026 dev validation pass is mostly green but not clean. The only maintained automated source-lane failure was the audio lane, where inter_chunk_timing_jitter exceeded its threshold (CV 1.0675 vs 0.5). Manual Computer Use coverage found two user-visible soft defects: inconsistent clone batch progress text during active generation, and a stale sidebar engine error card that can appear on fresh launch before any new interaction. Performance evidence attributes the clone-path slowdown to the current helper overlay/refactor rather than the mlx-audio runtime move. Backend generation remains the dominant wall-time cost, and pro_custom prewarmed latency artifacts are still variable enough to merit follow-up.
Validation Matrix
Passed
- ./scripts/check_project_inputs.sh
- python3 scripts/harness.py validate
- python3 scripts/harness.py diagnose
- Fresh Debug build: xcodebuild -project QwenVoice.xcodeproj -scheme QwenVoice -configuration Debug -derivedDataPath /tmp/qwenvoice-dev-derived build
- contract lane: python3 scripts/harness.py test --layer contract
- server lane: python3 scripts/harness.py test --layer server
- rpc lane: python3 scripts/harness.py test --layer rpc
- pipeline lane: python3 scripts/harness.py test --layer pipeline
- swift lane: python3 scripts/harness.py test --layer swift
- Live native smoke: QWENVOICE_ENABLE_NATIVE_ENGINE_LIVE_TESTS=1 xcodebuild -project QwenVoice.xcodeproj -scheme QwenVoice -destination 'platform=macOS' -only-testing:QwenVoiceTests/NativeMLXMacEngineLiveTests test
- load benchmark: python3 scripts/harness.py bench --category load --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
- latency benchmark: python3 scripts/harness.py bench --category latency --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
- clone_regression benchmark: python3 scripts/harness.py bench --category clone_regression --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
- perf benchmark: python3 scripts/harness.py bench --category perf --tier all --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
- Manual Computer Use pass:
  - light mode: main window chrome, sidebar/search/toolbar, Models, Custom Voice single + batch, Voice Design single + save to Saved Voices, Saved Voices preview + Open in Cloning, Voice Cloning single + batch, History, Preferences
  - dark mode: main chrome, Models, Preferences
Failed
- audio lane: python3 scripts/harness.py test --layer audio
Skipped / Retired
- quality benchmark: python3 scripts/harness.py bench --category quality --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18 (retired)
- tts_roundtrip benchmark: python3 scripts/harness.py bench --category tts_roundtrip --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18 (skipped: no local ASR evaluator)
Hard Failures
The only automated failure from the maintained source gates was the audio lane. The failing check was inter_chunk_timing_jitter, which reported CV 1.0675 against a threshold of 0.5. Optional audio dependency skips observed during the run were context only and are not counted as failures.
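For context on the failing metric, inter_chunk_timing_jitter is a coefficient-of-variation (CV) style check. A minimal sketch of how such a CV is computed, using hypothetical chunk timestamps (the real audio lane uses its own captured timings and thresholds):

```python
import statistics

def inter_chunk_cv(chunk_timestamps_s):
    """CV (population stdev / mean) of gaps between consecutive chunk arrivals."""
    gaps = [b - a for a, b in zip(chunk_timestamps_s, chunk_timestamps_s[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)

# Hypothetical timings: evenly spaced chunks give a CV near 0 (no jitter),
# while a single long stall between chunks pushes the CV well past 0.5.
steady = [0.0, 0.1, 0.2, 0.3, 0.4]
stalled = [0.0, 0.1, 0.2, 0.9, 1.0]
print(f"steady CV:  {inter_chunk_cv(steady):.4f}")
print(f"stalled CV: {inter_chunk_cv(stalled):.4f}")
```

A CV of 1.0675 means the spread of inter-chunk gaps exceeded their mean, i.e. delivery pacing was highly irregular, not merely slow.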
Soft UI Defects
- Clone batch status reporting is inconsistent during processing. During manual validation, the UI showed “Generating item 2/2... 0 of 2 clips completed · Item 1 active” before eventually finishing successfully. This is user-visible, but it should not be treated as release-blocking unless it reproduces in packaged or release validation.
- Fresh launch can show a stale sidebar engine error card before any new interaction. On fresh launches during manual validation, the sidebar surfaced an empty-custom-prompt style engine error before the user took any action. This is user-visible, but it should not be treated as a confirmed release regression unless it reproduces in packaged or release validation.
Performance Findings
clone_regression.json classifies the slowdown source as overlay_refactor, not the mlx-audio runtime move. The recorded reason is: “The current helper is materially slower than the old helper on the same current mlx-audio runtime, which points to the overlay refactor rather than the runtime move.” Total time was 30.988s for current_runtime_current_helper versus 26.3212s for current_runtime_old_helper.
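As a quick sanity check on those figures, the helper-to-helper delta on the same runtime works out to roughly an 18% slowdown (numbers copied from the clone_regression.json totals above):

```python
# Back-of-envelope check on the clone_regression.json figures: same
# mlx-audio runtime in both rows, only the helper differs.
current_helper_s = 30.988    # current_runtime_current_helper total time
old_helper_s = 26.3212       # current_runtime_old_helper total time

slowdown = (current_helper_s - old_helper_s) / old_helper_s
print(f"current helper is {slowdown:.1%} slower on the same runtime")
```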
perf_results.json shows generation-path work dominating wall time. Top bottlenecks were total_backend (33296.8ms, 100.0% of wall), generation (33252.1ms, 99.8%), and collect_generation (33251.8ms, 99.8%). Mean RPC overhead was only 5.5ms, which reinforces that server-side generation dominates.
pro_custom_prewarmed_samples.json remains noisy enough to merit follow-up even though the latency lane passed. Recorded wall time for the prewarmed scenario had mean 45290.91ms, min 23843.57ms, max 85212.12ms, and CV 0.6238.
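Even without the raw samples, the recorded summary figures alone show how wide the prewarmed spread is:

```python
# Spread check on the pro_custom_prewarmed_samples.json summary figures.
mean_ms, min_ms, max_ms = 45290.91, 23843.57, 85212.12

spread = max_ms / min_ms
print(f"max/min wall-time ratio: {spread:.2f}x")
print(f"max exceeds the mean by {max_ms / mean_ms:.2f}x")
```

A better-than-3.5x swing between the fastest and slowest prewarmed run is what makes single-run latency comparisons in this scenario hard to interpret.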
Evidence And Artifacts
- Built dev app: /tmp/qwenvoice-dev-derived/Build/Products/Debug/QwenVoice.app
- Benchmark root: /tmp/qwenvoice-dev-bench-2026-04-18
- Key artifacts:
/tmp/qwenvoice-dev-bench-2026-04-18/perf_results.json
/tmp/qwenvoice-dev-bench-2026-04-18/clone_regression.json
/tmp/qwenvoice-dev-bench-2026-04-18/tts_roundtrip_summary.json
- Commands run:
./scripts/check_project_inputs.sh
python3 scripts/harness.py validate
python3 scripts/harness.py diagnose
xcodebuild -project QwenVoice.xcodeproj -scheme QwenVoice -configuration Debug -derivedDataPath /tmp/qwenvoice-dev-derived build
python3 scripts/harness.py test --layer contract
python3 scripts/harness.py test --layer server
python3 scripts/harness.py test --layer rpc
python3 scripts/harness.py test --layer pipeline
python3 scripts/harness.py test --layer audio
python3 scripts/harness.py test --layer swift
QWENVOICE_ENABLE_NATIVE_ENGINE_LIVE_TESTS=1 \
xcodebuild -project QwenVoice.xcodeproj -scheme QwenVoice \
-destination 'platform=macOS' \
-only-testing:QwenVoiceTests/NativeMLXMacEngineLiveTests test
python3 scripts/harness.py bench --category load --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
python3 scripts/harness.py bench --category latency --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
python3 scripts/harness.py bench --category clone_regression --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
python3 scripts/harness.py bench --category quality --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
python3 scripts/harness.py bench --category tts_roundtrip --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
python3 scripts/harness.py bench --category perf --tier all --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
Recommended Next Fixes
- Fix audio inter-chunk jitter regression in the maintained audio lane
Problem: the maintained audio lane currently fails on inter_chunk_timing_jitter, which makes the automated source gate non-green.
Evidence: the April 18 run recorded CV 1.0675 against a threshold of 0.5, and this was the only maintained source-lane failure.
Acceptance: python3 scripts/harness.py test --layer audio passes cleanly, with inter_chunk_timing_jitter back under threshold and optional dependency skips still treated as non-failing context.
- Correct clone batch progress/status text during active generation
Problem: the clone batch UI can report contradictory progress while work is still in flight.
Evidence: manual validation showed “Generating item 2/2... 0 of 2 clips completed · Item 1 active” before eventual success.
Acceptance: during clone batch generation, the visible active item, completed count, and overall item index stay internally consistent from first clip through completion.
- Suppress stale startup error state in the sidebar on fresh launch
Problem: the app can surface an old engine error card before the user performs any new interaction.
Evidence: fresh launches in manual validation showed a stale empty-custom-prompt style sidebar error before any new action.
Acceptance: a fresh launch only shows current startup or runtime state, and the sidebar does not surface stale generation errors unless a new action actually produces one.
- Investigate generation-path latency variance and backend-dominant perf cost
Problem: performance remains dominated by backend generation, and pro_custom prewarmed latency artifacts are still noisy enough to obscure regression interpretation.
Evidence: perf_results.json shows total_backend, generation, and collect_generation at effectively all wall time, while clone_regression.json points to the helper overlay/refactor and pro_custom_prewarmed_samples.json shows wall-time CV 0.6238.
Acceptance: isolate whether the remaining variance is prompt-, warm-state-, helper-, or scheduling-driven, then show improvement or at least a stable reproducible explanation in the smallest relevant benchmark category.
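For the clone batch fix, the acceptance criterion amounts to an internal-consistency invariant on the progress state. A hypothetical sketch, assuming sequential item processing (field names are illustrative, not the app's real model):

```python
from dataclasses import dataclass

@dataclass
class BatchProgress:
    active_item: int   # 1-based index of the item currently generating
    completed: int     # clips finished so far
    total: int         # total clips in the batch

def is_consistent(p: BatchProgress) -> bool:
    # Processing sequentially, generating item N implies exactly N-1 items
    # are done, and the active index must stay inside the batch bounds.
    return (1 <= p.active_item <= p.total
            and p.completed == p.active_item - 1)

# The state observed in manual validation fails the invariant:
# "Generating item 2/2 ... 0 of 2 clips completed · Item 1 active".
observed = BatchProgress(active_item=2, completed=0, total=2)
print(is_consistent(observed))   # the contradictory state is rejected
```

Any fix that keeps the visible status derived from a single state object satisfying this kind of invariant would meet the acceptance criterion.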
Recommended Rerun Order
- python3 scripts/harness.py validate
- python3 scripts/harness.py test --layer audio
- The smallest relevant benchmark category for the fix under test:
  - latency for pro_custom prewarm variability
  - clone_regression for helper-overlay work
- Another directly touched category if the fix is narrower
- A scoped Computer Use pass for the touched screen or flow
This report captures the completed April 18, 2026 validation pass and does not imply any repo changes were made as part of the report itself.