
Dev validation report: April 18, 2026 #22

@PowerBeef

QwenVoice Dev Validation Report — April 18, 2026

Executive Summary

The April 18, 2026 dev validation pass is mostly green but not clean. The only maintained automated source-lane failure was the audio lane, where inter_chunk_timing_jitter exceeded its threshold (CV 1.0675 vs 0.5). Manual Computer Use coverage found two user-visible soft defects: inconsistent clone batch progress text during active generation, and a stale sidebar engine error card that can appear on fresh launch before any new interaction. Performance evidence points to the current helper overlay/refactor rather than the mlx-audio runtime move; backend generation remains the dominant wall-time cost, and pro_custom prewarmed latency artifacts are still variable enough to merit follow-up.

Validation Matrix

Passed

  • ./scripts/check_project_inputs.sh
  • python3 scripts/harness.py validate
  • python3 scripts/harness.py diagnose
  • Fresh Debug build: xcodebuild -project QwenVoice.xcodeproj -scheme QwenVoice -configuration Debug -derivedDataPath /tmp/qwenvoice-dev-derived build
  • contract lane: python3 scripts/harness.py test --layer contract
  • server lane: python3 scripts/harness.py test --layer server
  • rpc lane: python3 scripts/harness.py test --layer rpc
  • pipeline lane: python3 scripts/harness.py test --layer pipeline
  • swift lane: python3 scripts/harness.py test --layer swift
  • Live native smoke: QWENVOICE_ENABLE_NATIVE_ENGINE_LIVE_TESTS=1 xcodebuild -project QwenVoice.xcodeproj -scheme QwenVoice -destination 'platform=macOS' -only-testing:QwenVoiceTests/NativeMLXMacEngineLiveTests test
  • load benchmark: python3 scripts/harness.py bench --category load --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
  • latency benchmark: python3 scripts/harness.py bench --category latency --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
  • clone_regression benchmark: python3 scripts/harness.py bench --category clone_regression --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
  • perf benchmark: python3 scripts/harness.py bench --category perf --tier all --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
  • Manual Computer Use pass:
    • light mode: main window chrome, sidebar/search/toolbar, Models, Custom Voice single + batch, Voice Design single + save to Saved Voices, Saved Voices preview + Open in Cloning, Voice Cloning single + batch, History, Preferences
    • dark mode: main chrome, Models, Preferences

Failed

  • audio lane: python3 scripts/harness.py test --layer audio

Skipped / Retired

  • quality benchmark: python3 scripts/harness.py bench --category quality --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18 (retired)
  • tts_roundtrip benchmark: python3 scripts/harness.py bench --category tts_roundtrip --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18 (skipped: no local ASR evaluator)

Hard Failures

The only automated failure from the maintained source gates was the audio lane. The failing check was inter_chunk_timing_jitter, which reported CV 1.0675 against a threshold of 0.5. Optional audio dependency skips observed during the run were context only and are not counted as failures.
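For context on reading the failing number: the metric can be sketched as the coefficient of variation (standard deviation divided by mean) of the gaps between consecutive chunk timestamps. This is a minimal illustration with made-up timestamps, not the actual implementation in the harness's audio lane.

```python
# Sketch of an inter-chunk timing jitter metric: the coefficient of
# variation (pstdev / mean) of the gaps between chunk arrival timestamps.
# Timestamps and the 0.5 threshold are illustrative only.
import statistics

def inter_chunk_timing_jitter(arrival_times_s):
    """Return the CV of inter-chunk gaps for a list of arrival timestamps."""
    gaps = [b - a for a, b in zip(arrival_times_s, arrival_times_s[1:])]
    mean = statistics.mean(gaps)
    return statistics.pstdev(gaps) / mean if mean else float("inf")

THRESHOLD = 0.5  # a CV above this would fail the check

# Evenly spaced chunks: CV near 0, passes.
assert inter_chunk_timing_jitter([0.0, 0.1, 0.2, 0.3]) < THRESHOLD
# Highly uneven chunks: CV well above 0.5, fails.
assert inter_chunk_timing_jitter([0.0, 0.05, 0.10, 1.0]) > THRESHOLD
```

A CV of 1.0675 means the gap spread is larger than the average gap itself, i.e. chunk pacing is highly irregular rather than mildly noisy.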

Soft UI Defects

  • Clone batch status reporting is inconsistent during processing. During manual validation, the UI showed “Generating item 2/2... 0 of 2 clips completed · Item 1 active” before eventually finishing successfully. This is user-visible but should not be treated as release-blocking unless it reproduces in packaged or release validation.
  • A fresh launch can show a stale sidebar engine error card before any new interaction. On fresh launches, the sidebar displayed an empty-custom-prompt-style engine error before any new action. This is user-visible but should not be treated as a confirmed release regression unless it reproduces in packaged or release validation.
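The clone batch contradiction above can be phrased as a simple invariant that the status text should never violate. The field names below (active_index, completed, total) are hypothetical stand-ins for whatever the view model actually exposes, not the app's real API.

```python
# Hedged sketch of an invariant for clone batch progress text of the form
# "Generating item A/T... C of T clips completed". If item A is active,
# every item before it should already be counted as completed.
def progress_text_consistent(active_index, completed, total):
    """Return True when the displayed progress fields agree with each other."""
    return (
        1 <= active_index <= total
        and 0 <= completed <= total
        and completed >= active_index - 1  # items before the active one are done
    )

# The state observed in manual validation: item 2/2 active, 0 of 2 completed.
assert not progress_text_consistent(active_index=2, completed=0, total=2)
# A consistent mid-batch state: item 2/2 active, 1 of 2 completed.
assert progress_text_consistent(active_index=2, completed=1, total=2)
```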

Performance Findings

  • clone_regression.json classifies the slowdown source as overlay_refactor, not the mlx-audio runtime move. The recorded reason is: “The current helper is materially slower than the old helper on the same current mlx-audio runtime, which points to the overlay refactor rather than the runtime move.” Total time was 30.988s for current_runtime_current_helper versus 26.3212s for current_runtime_old_helper.
  • perf_results.json shows generation-path work dominating wall time. Top bottlenecks were total_backend (33296.8ms, 100.0% of wall), generation (33252.1ms, 99.8%), and collect_generation (33251.8ms, 99.8%). Mean RPC overhead was only 5.5ms, which reinforces that server-side generation dominates.
  • pro_custom_prewarmed_samples.json remains noisy enough to merit follow-up even though the latency lane passed. Recorded wall time for the prewarmed scenario had mean 45290.91ms, min 23843.57ms, max 85212.12ms, and CV 0.6238.
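One way to start the suggested follow-up is to recompute the spread directly from the benchmark artifact. The JSON shape assumed below (a list of objects with a wall_ms key) is a guess for illustration, not the verified schema of pro_custom_prewarmed_samples.json.

```python
# Sketch: summarize wall-time samples from a benchmark artifact so that
# mean/min/max/CV can be compared across runs. The "wall_ms" field name
# is an assumed schema, not the harness's confirmed output format.
import json
import statistics

def wall_time_spread(path):
    """Return mean, min, max, and CV of wall-time samples in a JSON file."""
    with open(path) as f:
        samples = [s["wall_ms"] for s in json.load(f)]
    mean = statistics.mean(samples)
    return {
        "mean_ms": mean,
        "min_ms": min(samples),
        "max_ms": max(samples),
        "cv": statistics.pstdev(samples) / mean,
    }
```

With the reported numbers (mean 45290.91ms, max 85212.12ms), the worst sample runs nearly twice the mean, which is why the CV of 0.6238 obscures regression interpretation even though the lane passed.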

Evidence And Artifacts

  • Built dev app: /tmp/qwenvoice-dev-derived/Build/Products/Debug/QwenVoice.app
  • Benchmark root: /tmp/qwenvoice-dev-bench-2026-04-18
  • Key artifacts:
    • /tmp/qwenvoice-dev-bench-2026-04-18/perf_results.json
    • /tmp/qwenvoice-dev-bench-2026-04-18/clone_regression.json
    • /tmp/qwenvoice-dev-bench-2026-04-18/tts_roundtrip_summary.json
Command Log

./scripts/check_project_inputs.sh
python3 scripts/harness.py validate
python3 scripts/harness.py diagnose
xcodebuild -project QwenVoice.xcodeproj -scheme QwenVoice -configuration Debug -derivedDataPath /tmp/qwenvoice-dev-derived build

python3 scripts/harness.py test --layer contract
python3 scripts/harness.py test --layer server
python3 scripts/harness.py test --layer rpc
python3 scripts/harness.py test --layer pipeline
python3 scripts/harness.py test --layer audio
python3 scripts/harness.py test --layer swift

QWENVOICE_ENABLE_NATIVE_ENGINE_LIVE_TESTS=1 \
xcodebuild -project QwenVoice.xcodeproj -scheme QwenVoice \
-destination 'platform=macOS' \
-only-testing:QwenVoiceTests/NativeMLXMacEngineLiveTests test

python3 scripts/harness.py bench --category load --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
python3 scripts/harness.py bench --category latency --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
python3 scripts/harness.py bench --category clone_regression --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
python3 scripts/harness.py bench --category quality --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
python3 scripts/harness.py bench --category tts_roundtrip --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
python3 scripts/harness.py bench --category perf --tier all --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18

Recommended Next Fixes

  1. Fix audio inter-chunk jitter regression in the maintained audio lane
    Problem: the maintained audio lane currently fails on inter_chunk_timing_jitter, which makes the automated source gate non-green.
    Evidence: the April 18 run recorded CV 1.0675 against a threshold of 0.5, and this was the only maintained source-lane failure.
    Acceptance: python3 scripts/harness.py test --layer audio passes cleanly, with inter_chunk_timing_jitter back under threshold and optional dependency skips still treated as non-failing context.

  2. Correct clone batch progress/status text during active generation
    Problem: the clone batch UI can report contradictory progress while work is still in flight.
    Evidence: manual validation showed “Generating item 2/2... 0 of 2 clips completed · Item 1 active” before eventual success.
    Acceptance: during clone batch generation, the visible active item, completed count, and overall item index stay internally consistent from first clip through completion.

  3. Suppress stale startup error state in the sidebar on fresh launch
    Problem: the app can surface an old engine error card before the user performs any new interaction.
    Evidence: fresh launches in manual validation showed a stale empty-custom-prompt-style sidebar error before any new action.
    Acceptance: a fresh launch only shows current startup or runtime state, and the sidebar does not surface stale generation errors unless a new action actually produces one.

  4. Investigate generation-path latency variance and backend-dominant perf cost
    Problem: performance remains dominated by backend generation, and pro_custom prewarmed latency artifacts are still noisy enough to obscure regression interpretation.
    Evidence: perf_results.json shows total_backend, generation, and collect_generation at effectively all wall time, while clone_regression.json points to the helper overlay/refactor and pro_custom_prewarmed_samples.json shows wall-time CV 0.6238.
    Acceptance: isolate whether the remaining variance is prompt-, warm-state-, helper-, or scheduling-driven, then show improvement or at least a stable reproducible explanation in the smallest relevant benchmark category.

Recommended Rerun Order

  1. python3 scripts/harness.py validate
  2. python3 scripts/harness.py test --layer audio
  3. The smallest relevant benchmark category for the fix under test:
    • latency for pro_custom prewarm variability
    • clone_regression for helper-overlay work
    • another directly touched category if the fix is narrower
  4. A scoped Computer Use pass for the touched screen or flow

This report captures the completed April 18, 2026 validation pass and does not imply any repo changes were made as part of the report itself.
