This is the next layer below the cross-layer wall-clock probe (commits
958567d, b46b2d7, 37c21c3). When infer_ms per-chunk numbers
have been fully attributed to engine sub-stages but you still need to
know what the GPU is actually doing inside each stage, this is the
tool.
| Question | Right tool |
|---|---|
| "How long does the user wait between Cmd+Return and audio playback?" | `scripts/bench_ui_generation.sh` (desktop-UI bench, full pipeline) |
| "How does that wall break down across engine / XPC / UI layers?" | `[Probe.Engine` |
| "How does per-chunk infer_ms break down across talker / code predictor / decoder / eval cadence?" | Phase 1 + Phase 2a fields in the same probe |
| "Inside talker_forward_ms, which Metal kernels run, and is the GPU saturated or idle-waiting?" | Instruments / xctrace (this doc) |
| "Where does CPU time actually go inside the per-token loop?" | Time Profiler, captured in the same trace |
| "Is eval(audioChunk) waiting on a GPU command buffer, or is it actual compute?" | Metal System Trace, in the same trace |
The probe stops at wall-clock attribution. Instruments crosses into GPU-kernel-level visibility.
Five os_signpost intervals fire on every per-token iteration of the
streaming generation, all under subsystem com.qwenvoice.engine.qwen3,
category generation:
| Signpost | What it brackets | Source |
|---|---|---|
| Talker Forward | `talker(inputEmbeds, cache: cache)` — the LLM forward pass per token | Qwen3TTS.swift per-token loop |
| Code Predictor Loop | The multi-codebook prediction inner loop (15 iterations; see the Phase 2c findings below) | same |
| Step Eval Flush | The `eval(inputEmbeds, isEOS)` synchronous flush at policy `.full` | same |
| Audio Decoder | `speechTokenizer.decoder.streamingStep(...)` | same |
| Audio Chunk Eval | The `eval(audioChunk)` after each streaming decoder run | same |
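Each of these is a plain os_signpost begin/end pair. A minimal sketch of the wrapper shape, assuming the subsystem/category named above — the actual call sites live in the vendored Qwen3TTS.swift, and the helper name here is hypothetical:

```swift
import os

// Subsystem/category match the doc; the wrapper itself is illustrative.
let qwen3Log = OSLog(subsystem: "com.qwenvoice.engine.qwen3", category: "generation")

/// Hypothetical helper: brackets `body` with a signpost interval named `name`.
func withSignpost<T>(_ name: StaticString, _ body: () throws -> T) rethrows -> T {
    let spid = OSSignpostID(log: qwen3Log)
    os_signpost(.begin, log: qwen3Log, name: name, signpostID: spid)
    defer { os_signpost(.end, log: qwen3Log, name: name, signpostID: spid) }
    return try body()
}

// e.g. in the per-token loop (call shape assumed, not the vendored code):
// let hidden = withSignpost("Talker Forward") { talker(inputEmbeds, cache: cache) }
```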
Coarser intervals already exist in NativeStreamingSynthesisSession.swift
under subsystem com.qwenvoice.engine (category generation):
Native Generation Stream, Native First Audio Chunk,
Native Final WAV Finish. Both subsystems show up together when the
"os_signpost" instrument is added in Instruments.
```
./scripts/bench_instruments_trace.sh
```

Default: 30-second System Trace (CPU + Metal + signposts), output to
`build/instruments-traces/vocello-YYYYMMDD-HHMMSS.trace`. Override:

```
./scripts/bench_instruments_trace.sh --seconds 60 \
  --output /tmp/long-vc-trace.trace
```

Workflow:
- The script kills any running Vocello, resets defaults to land on Custom Voice with Smooth OFF, and relaunches a fresh debug build.
- After the engine is Ready, the script starts `xctrace record` with the System Trace template and your chosen time window.
- While recording, the operator triggers ONE generation in the Vocello UI: paste a script, hit Cmd+Return.
- When the trace stops, the script opens the `.trace` bundle in Instruments.
The trace captures the entire system, so the bundled
QwenVoiceEngineService.xpc helper (where the engine actually runs)
is captured alongside the Vocello main process.
In Instruments:
- Add the "os_signpost" instrument if not already present (View → Instruments Library, drag the os_signpost instrument onto the trace).
- Filter by subsystem: `com.qwenvoice.engine.qwen3` for the per-token markers, or `com.qwenvoice.engine` for the coarser generation-lifecycle markers.
- Stack tracks vertically so the os_signpost lane sits between the Time Profiler and the Metal GPU lane. This makes alignment instant.
What to look for:
- `Talker Forward` with GPU idle — kernel launch overhead, sampling, embed lookup happening on CPU. Could potentially overlap with the next prep step.
- `Code Predictor Loop` with sequential GPU bursts, one per codebook iteration — confirms the multi-codebook prediction is unbatched. A single multi-codebook dispatch could collapse those bursts into one.
- `Audio Chunk Eval` much longer than `Audio Decoder` — the decoder kernel runs fast but `eval(audioChunk)` waits on it + flushes. Could pipeline with the next forward pass.
- `Step Eval Flush` with substantial GPU activity — confirms the eval flush IS doing real work (not just synchronizing). The Phase 2a finding about `.deferred` not being a wall-clock win comes from here: the GPU work is happening regardless of which sync point triggers it.
- GPU idle gaps between engine stages — these are pipeline opportunities. Anywhere the GPU sits idle while the CPU works on housekeeping (sampling, list construction, MLX expression building) is amortizable.
Phase 2a delivered wall-clock attribution. Phase 2b used Instruments
traces to identify kernel-level slack. The first clean trace
(/tmp/vocello-phase2b.trace, May 2026, Custom Voice medium cold,
212 token iterations / 27 audio chunks / ~17 s audio) decomposed
per-iteration work as:
| Stage | p50 / iter | total | % of work |
|---|---|---|---|
| Step Eval Flush | 80 ms | 18.2 s | 62 % |
| Code Predictor Loop | 26 ms | 5.6 s | 19 % |
| Audio Chunk Eval | 135 ms / chunk | 3.7 s | 13 % |
| Talker Forward | 6 ms | 1.4 s | 5 % |
| Audio Decoder | 6 ms / chunk | 0.2 s | <1 % |
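The "% of work" column is each stage's total over the 29.1 s of attributed work; a quick arithmetic check (numbers copied from the table above):

```swift
// Stage totals in seconds, from the Phase 2b trace table.
let stageTotals: [(stage: String, seconds: Double)] = [
    ("Step Eval Flush", 18.2),
    ("Code Predictor Loop", 5.6),
    ("Audio Chunk Eval", 3.7),
    ("Talker Forward", 1.4),
    ("Audio Decoder", 0.2),
]
let attributed = stageTotals.reduce(0) { $0 + $1.seconds }           // 29.1 s total
let shares = stageTotals.map { ($0.stage, $0.seconds / attributed * 100) }
// Step Eval Flush ≈ 62.5 %, Code Predictor Loop ≈ 19.2 %, Audio Chunk Eval ≈ 12.7 %
```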
Step Eval Flush at 80 ms p50 was confirmed irreducible by the
.deferred policy experiment (commit 730e569) — moving the sync
point doesn't reduce the underlying forward-pass GPU work.
Code Predictor Loop at 26 ms p50 was the next clear wall-clock target, but architectural analysis showed parallel codebook prediction is not viable in this Swift port:
- Each codebook's input is built from `codeTokens.last!` — feeding back the previously-sampled codebook token. The autoregressive dependency is real, not optional.
- `lmHead[generationStep]` is a per-step `Linear(1024, 2048)` head (15 of them, one per codebook). They cannot share a hidden-state output.
- The codebook KV cache (`codeCache`) builds sequentially via `cache.update(keys: k, values: v)`. State accumulates across inner-loop iterations. `num_code_groups = 16` (15 inner-loop iterations, not 7 as the Phase 2c planning doc had assumed).
- The upstream Python reference (Blaizzy/mlx-audio) doesn't expose a parallel-codebook config flag — it treats Qwen3 as a flat token stream via `mlx_lm.stream_generate`, so there's no published reference to port.
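The hard dependency is visible in miniature: each inner-loop step consumes the token sampled at the previous step. A stripped-down sketch of that chain, with tensors simplified to Int and the `predict` closure standing in for the per-step head plus sampling (all names here are illustrative, not the vendored code):

```swift
/// Sketch of the codebook inner loop's autoregressive dependency.
/// `predict(step, previousToken)` stands in for lmHead[step] + sampling.
func predictCodebooks(steps: Int, firstToken: Int, predict: (Int, Int) -> Int) -> [Int] {
    var codeTokens = [firstToken]
    for step in 0..<steps {
        // The input to step N is the token sampled at step N-1 (codeTokens.last!),
        // so the 15 steps cannot be dispatched as one batched GPU call.
        codeTokens.append(predict(step, codeTokens.last!))
    }
    return codeTokens
}
```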
With the primary 12 % RTF target unreachable, Phase 2c shipped the
plan's next-priority alternative: Audio Chunk Eval pipelining via
asyncEval at the in-loop chunk boundary only. The change replaces
blocking eval(audioChunk) with non-blocking asyncEval(audioChunk)
inside the per-token loop's chunk emission so the engine returns to
the loop without CPU-blocking on Metal command-buffer drain. The
consumer's samples.asArray(Float.self) triggers materialisation off
the engine's critical path. The Audio Chunk Eval signpost interval
collapses from ~135 ms / chunk to near-zero for in-loop chunks
(asyncEval enqueue cost only) — that collapse IS the success signal.
The trailing chunk stays on blocking eval. The first cut of
Phase 2c asyncEval'd that one too; live playback then truncated
mid-script because the engine returned with the final chunk still
lazy, the awaited generation result raced ahead of the broker's
MainActor chunk-publication tasks, and AudioPlayerViewModel's
liveFinalFilePath got set before the last few chunks had been
scheduled in the AVAudioEngine queue. As soon as the next buffer
drained, handleLiveBufferPlaybackCompletion's liveScheduledCount == 0 && liveFinalFilePath != nil branch fired an early file-playback
handoff. Blocking on the trailing chunk closes that race. See
mlx-audio-swift-patching.md for the
patch baseline.
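The in-loop vs trailing split can be sketched as a small loop, with `eval`/`asyncEval` closures standing in for MLX Swift's functions of the same names (the loop shape and the `emit` callback are assumptions, not the vendored code):

```swift
/// Sketch of the Phase 2c policy: non-blocking eval for in-loop chunks,
/// blocking eval for the trailing chunk so the engine never returns lazy audio.
func emitAudioChunks<Chunk>(
    _ chunks: [Chunk],
    eval: (Chunk) -> Void,       // blocking: waits on Metal command-buffer drain
    asyncEval: (Chunk) -> Void,  // non-blocking: enqueue cost only
    emit: (Chunk) -> Void
) {
    for (i, chunk) in chunks.enumerated() {
        if i == chunks.count - 1 {
            eval(chunk)          // trailing chunk: must be concrete, closes the handoff race
        } else {
            asyncEval(chunk)     // in-loop chunks: signpost interval collapses to ~0 ms
        }
        emit(chunk)
    }
}
```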
If the trace shows the GPU is fully saturated during the per-token loop (no idle gaps), the conclusion is that the M1's GPU is the bottleneck and further wall-clock optimization requires either model quantization (already at 4-bit Speed) or hardware (M2/M3/M4 with more GPU cores).
Phase 2c's pipelining still leaves Step Eval Flush (62 % of work) as the dominant per-iteration cost. The work itself is the per-token forward sync; reducing it requires either a faster talker forward (quantization, kernel-fusion of the SwiGLU + attention path, or talker-layer batching across tokens) or a fundamentally different generation strategy (e.g. speculative decoding). All of these are research-grade, not local optimisations.
Qwen3TTS.swift gains an `import os` and a private `Qwen3Signposts`
namespace. The signpost calls live in the per-token loop inside
`generateVoiceDesign(...)` (the inner generation function). This is a
local patch on top of the vendored MLXAudioTTS — preserve on rebase.
See mlx-audio-swift-patching.md for
the patch baseline.