
Dev validation report: April 18, 2026 #22

@PowerBeef

QwenVoice Dev Validation Report — April 18, 2026

Executive Summary

The April 18, 2026 dev validation pass is mostly green but not clean. The only maintained automated source-lane failure was the audio lane, where inter_chunk_timing_jitter exceeded its threshold (CV 1.0675 vs 0.5). Manual Computer Use coverage found two user-visible soft defects: inconsistent clone batch progress text during active generation, and a stale sidebar engine error card that can appear on fresh launch before any new interaction. Performance evidence points to the current helper overlay/refactor rather than the mlx-audio runtime move; backend generation remains the dominant wall-time cost, and pro_custom prewarmed latency artifacts are still variable enough to merit follow-up.

Validation Matrix

Passed

  • ./scripts/check_project_inputs.sh
  • python3 scripts/harness.py validate
  • python3 scripts/harness.py diagnose
  • Fresh Debug build: xcodebuild -project QwenVoice.xcodeproj -scheme QwenVoice -configuration Debug -derivedDataPath /tmp/qwenvoice-dev-derived build
  • contract lane: python3 scripts/harness.py test --layer contract
  • server lane: python3 scripts/harness.py test --layer server
  • rpc lane: python3 scripts/harness.py test --layer rpc
  • pipeline lane: python3 scripts/harness.py test --layer pipeline
  • swift lane: python3 scripts/harness.py test --layer swift
  • Live native smoke: QWENVOICE_ENABLE_NATIVE_ENGINE_LIVE_TESTS=1 xcodebuild -project QwenVoice.xcodeproj -scheme QwenVoice -destination 'platform=macOS' -only-testing:QwenVoiceTests/NativeMLXMacEngineLiveTests test
  • load benchmark: python3 scripts/harness.py bench --category load --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
  • latency benchmark: python3 scripts/harness.py bench --category latency --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
  • clone_regression benchmark: python3 scripts/harness.py bench --category clone_regression --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
  • perf benchmark: python3 scripts/harness.py bench --category perf --tier all --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
  • Manual Computer Use pass:
    • light mode: main window chrome, sidebar/search/toolbar, Models, Custom Voice single + batch, Voice Design single + save to Saved Voices, Saved Voices preview + Open in Cloning, Voice Cloning single + batch, History, Preferences
    • dark mode: main chrome, Models, Preferences

Failed

  • audio lane: python3 scripts/harness.py test --layer audio

Skipped / Retired

  • quality benchmark: python3 scripts/harness.py bench --category quality --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18 (retired)
  • tts_roundtrip benchmark: python3 scripts/harness.py bench --category tts_roundtrip --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18 (skipped: no local ASR evaluator)

Hard Failures

The only automated failure from the maintained source gates was the audio lane. The failing check was inter_chunk_timing_jitter, which reported CV 1.0675 against a threshold of 0.5. Optional audio dependency skips observed during the run were context only and are not counted as failures.
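For context on reading the failing number: the metric can be sketched as the coefficient of variation (standard deviation divided by mean) of the gaps between consecutive chunk timestamps. This is a minimal illustration with made-up timestamps, not the actual implementation in the harness's audio lane.

```python
# Sketch of an inter-chunk timing jitter metric: the coefficient of
# variation (pstdev / mean) of the gaps between chunk arrival timestamps.
# Timestamps and the 0.5 threshold are illustrative only.
import statistics

def inter_chunk_timing_jitter(arrival_times_s):
    """Return the CV of inter-chunk gaps for a list of arrival timestamps."""
    gaps = [b - a for a, b in zip(arrival_times_s, arrival_times_s[1:])]
    mean = statistics.mean(gaps)
    return statistics.pstdev(gaps) / mean if mean else float("inf")

THRESHOLD = 0.5  # a CV above this would fail the check

# Evenly spaced chunks: CV near 0, passes.
assert inter_chunk_timing_jitter([0.0, 0.1, 0.2, 0.3]) < THRESHOLD
# Highly uneven chunks: CV well above 0.5, fails.
assert inter_chunk_timing_jitter([0.0, 0.05, 0.10, 1.0]) > THRESHOLD
```

A CV of 1.0675 means the gap spread is larger than the average gap itself, i.e. chunk pacing is highly irregular rather than mildly noisy.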

Soft UI Defects

  • Clone batch status reporting is inconsistent during processing. During manual validation, the UI showed “Generating item 2/2... 0 of 2 clips completed · Item 1 active” before eventually finishing successfully. This is user-visible but should not be treated as release-blocking unless it reproduces in packaged or release validation.
  • A fresh launch can show a stale sidebar engine error card before any new interaction. On fresh launches, the sidebar displayed an empty-custom-prompt-style engine error before any new action. This is user-visible but should not be treated as a confirmed release regression unless it reproduces in packaged or release validation.
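The clone batch contradiction above can be phrased as a simple invariant that the status text should never violate. The field names below (active_index, completed, total) are hypothetical stand-ins for whatever the view model actually exposes, not the app's real API.

```python
# Hedged sketch of an invariant for clone batch progress text of the form
# "Generating item A/T... C of T clips completed". If item A is active,
# every item before it should already be counted as completed.
def progress_text_consistent(active_index, completed, total):
    """Return True when the displayed progress fields agree with each other."""
    return (
        1 <= active_index <= total
        and 0 <= completed <= total
        and completed >= active_index - 1  # items before the active one are done
    )

# The state observed in manual validation: item 2/2 active, 0 of 2 completed.
assert not progress_text_consistent(active_index=2, completed=0, total=2)
# A consistent mid-batch state: item 2/2 active, 1 of 2 completed.
assert progress_text_consistent(active_index=2, completed=1, total=2)
```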

Performance Findings

  • clone_regression.json classifies the slowdown source as overlay_refactor, not the mlx-audio runtime move. The recorded reason is: “The current helper is materially slower than the old helper on the same current mlx-audio runtime, which points to the overlay refactor rather than the runtime move.” Total time was 30.988s for current_runtime_current_helper versus 26.3212s for current_runtime_old_helper.
  • perf_results.json shows generation-path work dominating wall time. Top bottlenecks were total_backend (33296.8ms, 100.0% of wall), generation (33252.1ms, 99.8%), and collect_generation (33251.8ms, 99.8%). Mean RPC overhead was only 5.5ms, which reinforces that server-side generation dominates.
  • pro_custom_prewarmed_samples.json remains noisy enough to merit follow-up even though the latency lane passed. Recorded wall time for the prewarmed scenario had mean 45290.91ms, min 23843.57ms, max 85212.12ms, and CV 0.6238.
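One way to start the suggested follow-up is to recompute the spread directly from the benchmark artifact. The JSON shape assumed below (a list of objects with a wall_ms key) is a guess for illustration, not the verified schema of pro_custom_prewarmed_samples.json.

```python
# Sketch: summarize wall-time samples from a benchmark artifact so that
# mean/min/max/CV can be compared across runs. The "wall_ms" field name
# is an assumed schema, not the harness's confirmed output format.
import json
import statistics

def wall_time_spread(path):
    """Return mean, min, max, and CV of wall-time samples in a JSON file."""
    with open(path) as f:
        samples = [s["wall_ms"] for s in json.load(f)]
    mean = statistics.mean(samples)
    return {
        "mean_ms": mean,
        "min_ms": min(samples),
        "max_ms": max(samples),
        "cv": statistics.pstdev(samples) / mean,
    }
```

With the reported numbers (mean 45290.91ms, max 85212.12ms), the worst sample runs nearly twice the mean, which is why the CV of 0.6238 obscures regression interpretation even though the lane passed.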

Evidence And Artifacts

  • Built dev app: /tmp/qwenvoice-dev-derived/Build/Products/Debug/QwenVoice.app
  • Benchmark root: /tmp/qwenvoice-dev-bench-2026-04-18
  • Key artifacts:
    • /tmp/qwenvoice-dev-bench-2026-04-18/perf_results.json
    • /tmp/qwenvoice-dev-bench-2026-04-18/clone_regression.json
    • /tmp/qwenvoice-dev-bench-2026-04-18/tts_roundtrip_summary.json
Command Log

./scripts/check_project_inputs.sh
python3 scripts/harness.py validate
python3 scripts/harness.py diagnose
xcodebuild -project QwenVoice.xcodeproj -scheme QwenVoice -configuration Debug -derivedDataPath /tmp/qwenvoice-dev-derived build

python3 scripts/harness.py test --layer contract
python3 scripts/harness.py test --layer server
python3 scripts/harness.py test --layer rpc
python3 scripts/harness.py test --layer pipeline
python3 scripts/harness.py test --layer audio
python3 scripts/harness.py test --layer swift

QWENVOICE_ENABLE_NATIVE_ENGINE_LIVE_TESTS=1 \
xcodebuild -project QwenVoice.xcodeproj -scheme QwenVoice \
-destination 'platform=macOS' \
-only-testing:QwenVoiceTests/NativeMLXMacEngineLiveTests test

python3 scripts/harness.py bench --category load --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
python3 scripts/harness.py bench --category latency --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
python3 scripts/harness.py bench --category clone_regression --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
python3 scripts/harness.py bench --category quality --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
python3 scripts/harness.py bench --category tts_roundtrip --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18
python3 scripts/harness.py bench --category perf --tier all --runs 3 --output-dir /tmp/qwenvoice-dev-bench-2026-04-18

Recommended Next Fixes

  1. Fix audio inter-chunk jitter regression in the maintained audio lane
    Problem: the maintained audio lane currently fails on inter_chunk_timing_jitter, which makes the automated source gate non-green.
    Evidence: the April 18 run recorded CV 1.0675 against a threshold of 0.5, and this was the only maintained source-lane failure.
    Acceptance: python3 scripts/harness.py test --layer audio passes cleanly, with inter_chunk_timing_jitter back under threshold and optional dependency skips still treated as non-failing context.

  2. Correct clone batch progress/status text during active generation
    Problem: the clone batch UI can report contradictory progress while work is still in flight.
    Evidence: manual validation showed “Generating item 2/2... 0 of 2 clips completed · Item 1 active” before eventual success.
    Acceptance: during clone batch generation, the visible active item, completed count, and overall item index stay internally consistent from first clip through completion.

  3. Suppress stale startup error state in the sidebar on fresh launch
    Problem: the app can surface an old engine error card before the user performs any new interaction.
    Evidence: fresh launches in manual validation showed a stale empty-custom-prompt-style sidebar error before any new action.
    Acceptance: a fresh launch only shows current startup or runtime state, and the sidebar does not surface stale generation errors unless a new action actually produces one.

  4. Investigate generation-path latency variance and backend-dominant perf cost
    Problem: performance remains dominated by backend generation, and pro_custom prewarmed latency artifacts are still noisy enough to obscure regression interpretation.
    Evidence: perf_results.json shows total_backend, generation, and collect_generation at effectively all wall time, while clone_regression.json points to the helper overlay/refactor and pro_custom_prewarmed_samples.json shows wall-time CV 0.6238.
    Acceptance: isolate whether the remaining variance is prompt-, warm-state-, helper-, or scheduling-driven, then show improvement or at least a stable reproducible explanation in the smallest relevant benchmark category.

Recommended Rerun Order

  1. python3 scripts/harness.py validate
  2. python3 scripts/harness.py test --layer audio
  3. The smallest relevant benchmark category for the fix under test:
    • latency for pro_custom prewarm variability
    • clone_regression for helper-overlay work
    • another directly touched category if the fix is narrower
  4. A scoped Computer Use pass for the touched screen or flow

This report captures the completed April 18, 2026 validation pass and does not imply any repo changes were made as part of the report itself.
