perf: investigate prewarmed latency variance and backend-dominant generation cost #26

@PowerBeef

This performance follow-up should isolate the remaining latency variance and the backend-dominant generation cost, and attribute each to the correct subsystem.

Source report: #22

Problem

Prewarmed latency is still noisy enough that regressions are hard to interpret, and benchmark evidence still shows backend generation dominating wall time.

Evidence

  • Dev validation report: April 18, 2026 (#22)
  • clone_regression.json classified the slowdown source as overlay_refactor
  • current_runtime_current_helper_total_s = 30.988
  • current_runtime_old_helper_total_s = 26.3212
  • perf_results.json identified total_backend, generation, and collect_generation as the top bottlenecks
  • pro_custom_prewarmed_samples.json recorded wall-time CV 0.6238
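
The two helper totals and the reported CV can be related with a short calculation. This is a sketch, not the harness's own code: the function names are hypothetical, and it assumes "CV" means sample standard deviation divided by the mean, which the report does not state.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (assumed CV definition)."""
    mean = statistics.fmean(samples)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(samples) / mean


def relative_slowdown(current_s, baseline_s):
    """Fractional increase of current_s over baseline_s."""
    return (current_s - baseline_s) / baseline_s


# Values copied from the Evidence list above.
current_helper_total_s = 30.988
old_helper_total_s = 26.3212

delta_s = current_helper_total_s - old_helper_total_s
slowdown = relative_slowdown(current_helper_total_s, old_helper_total_s)
print(f"{delta_s:.4f} s slower ({slowdown:.1%})")  # 4.6668 s slower (17.7%)
```

`coefficient_of_variation` could be applied to the raw wall-time samples in pro_custom_prewarmed_samples.json to check the 0.6238 figure, provided the file's layout is known; that layout is not shown here.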

Current ownership

  • scripts/harness_lib/bench_runner.py
  • Sources/QwenVoiceNativeRuntime/NativeMLXMacEngine.swift
  • Sources/QwenVoiceNative/XPCNativeEngineClient.swift

Acceptance

The team can show either a stable, reproducible explanation for the remaining variance or a measurable improvement in the smallest relevant benchmark category touched by the fix.
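
One way to judge whether an improvement is "measurable" at a given run count is to compare it with the relative standard error of the mean. The sketch below assumes independent samples and that the 0.6238 CV from pro_custom_prewarmed_samples.json is representative of the category being rerun. Both are assumptions, and the helper names are hypothetical.

```python
import math


def relative_standard_error(cv, runs):
    """Relative standard error of the mean for `runs` independent samples."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return cv / math.sqrt(runs)


def runs_for_target(cv, target_rse):
    """Smallest run count whose relative standard error is at most target_rse."""
    if target_rse <= 0:
        raise ValueError("target_rse must be positive")
    return math.ceil((cv / target_rse) ** 2)


reported_cv = 0.6238  # wall-time CV from pro_custom_prewarmed_samples.json

print(f"{relative_standard_error(reported_cv, 3):.3f}")  # 0.360
print(runs_for_target(reported_cv, 0.10))  # 39
```

Under those assumptions, three runs leave the mean uncertain by about 36%, so a small improvement would need more runs or a lower-variance measurement to stand out.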

Focused rerun

  • python3 scripts/harness.py validate
  • python3 scripts/harness.py bench --category latency --runs 3 --output-dir <dir>
  • python3 scripts/harness.py bench --category clone_regression --runs 3 --output-dir <dir> (only if helper-overlay work is touched)
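
The rerun steps above could be scripted so that the conditional clone_regression run is not forgotten. This is a minimal sketch: the wrapper functions and the `helper_overlay_touched` flag are hypothetical, and only the subcommands and flags listed above are used.

```python
import subprocess


def focused_rerun_commands(output_dir, helper_overlay_touched=False):
    """Build the argv lists for the focused rerun, in order."""
    harness = ["python3", "scripts/harness.py"]
    commands = [
        harness + ["validate"],
        harness + ["bench", "--category", "latency",
                   "--runs", "3", "--output-dir", output_dir],
    ]
    # The clone_regression benchmark is only needed for helper-overlay changes.
    if helper_overlay_touched:
        commands.append(
            harness + ["bench", "--category", "clone_regression",
                       "--runs", "3", "--output-dir", output_dir]
        )
    return commands


def run_focused_rerun(output_dir, helper_overlay_touched=False):
    """Run each command from the repository root, stopping at the first failure."""
    for argv in focused_rerun_commands(output_dir, helper_overlay_touched):
        subprocess.run(argv, check=True)
```

Calling `run_focused_rerun("<dir>", helper_overlay_touched=True)` with a real output directory would execute all three commands; building the list separately keeps the command selection easy to inspect without running anything.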

Metadata

  • Labels: bug