Skip to content

perf(kmeans): replace plain-Swift argmax with vDSP_maxvi to fix performanceSmoke#100

Merged
totalslacker merged 4 commits into
mainfrom
fabrik/issue-97
May 7, 2026
Merged

perf(kmeans): replace plain-Swift argmax with vDSP_maxvi to fix performanceSmoke#100
totalslacker merged 4 commits into
mainfrom
fabrik/issue-97

Conversation

@totalslacker
Copy link
Copy Markdown
Owner

Summary

Fixes IndexerTests.performanceSmoke (P0, issue #97), which was failing because the k-means argmax inner loop in KMeans.assignAll() ran 33.93M bounds-checked iterations per flush in debug mode. With vDSP_maxvi from Accelerate, the argmax is now always SIMD-compiled (~15ms regardless of build config).

  • Root cause: host-load / debug-budget mismatch (case b). Pre-fix: 11 runs median 6.515s, p95 9.611s — essentially no headroom against the 10s limit. The Validate-stage 36.5s observation was a 6.6× load spike on top of that baseline.
  • Fix: replace the for row { for c { ... } } argmax loop in KMeans.assignAll() with vDSP_maxvi (Accelerate). Pre-built dylib, always vectorised, same first-occurrence tie-break semantics.
  • Post-fix: all 11 debug runs passed well within the 10s limit (wall-clock median 3.688s). Release suite: all 272 tests pass.

Key Changes

  • Sources/SwitchcraftCore/KMeans/KMeans.swift: assignAll() — replace the nested argmax for-loop with vDSP_maxvi per row inside a scores.withUnsafeBufferPointer block. Same vDSP_Length → Int index conversion.
  • Tests/SwitchcraftTests/KMeansTests.swift: add assignAllPicksMaximum regression test — 3 rows × 3 unit-centroid matrix, expected assignments [2, 0, 1] by inspection.
  • Tests/SwitchcraftTests/IndexerTests.swift: update performanceSmoke comment — correct the "residual encode path" attribution to "argmax loop in k-means (vDSP_maxvi)", add pre/post measurement evidence block.

How to Test

# Confirm performanceSmoke passes in both build configs
swift test --filter performanceSmoke
swift test -c release --filter performanceSmoke

# Confirm KMeans regression test
swift test --filter KMeansTests

# Full suites
swift test
swift test -c release

No skip annotations, no weakened assertions, no budget changed without evidence.

Closes #97

The k-means argmax inner loop in assignAll() ran 5000×1131 = 5.655M
bounds-checked iterations per call (7 calls per flush = 33.93M total).
In debug mode Swift disables vectorisation → median 6.515s elapsed,
p95 9.611s — essentially no headroom against the 10s budget.

Replace the nested for-row/for-c loop with vDSP_maxvi (Accelerate).
vDSP_maxvi lives in a pre-built dylib and is always SIMD-compiled;
debug mode pays the same ~15ms as release. Post-fix all 11 debug runs
pass well within the 10s limit.

Changes:
- KMeans.swift: replace argmax loop (lines 182-194) with vDSP_maxvi
  per-row; identical first-occurrence tie-break semantics confirmed by
  the deterministicBuilds test passing unchanged.
- KMeansTests.swift: add assignAllPicksMaximum regression test — 3×3
  hand-verifiable score matrix, expected assignments [2, 0, 1].
- IndexerTests.swift: correct performanceSmoke comment attribution
  ("residual encode path" → "argmax loop in k-means"); add pre/post
  measurement evidence block (11 runs each, M3 Max, debug mode):
  pre-fix median 6.515s p95 9.611s → post-fix all 11 passed <10s.

Closes #97

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@totalslacker
Copy link
Copy Markdown
Owner Author

🏭 Fabrik — stage: Implement
branch: fabrik/issue-97 | commit: 0ee239b | main: e83d644 | 2026-05-07 21:48 UTC

All tasks complete. Draft PR created at #100.

Root cause confirmed as case (b): host-load / debug-budget mismatch. Pre-fix measurements (11 runs, debug) showed median 6.515s and p95 9.611s — essentially no headroom against the 10s limit, with any load spike causing failure (the 36.5s Validate observation was a 6.6× spike). The fix replaces the 33.93M-iteration bounds-checked argmax loop in KMeans.assignAll() with vDSP_maxvi (Accelerate), which is always SIMD-compiled in debug mode. Post-fix all 11 debug runs pass well within the 10s limit, and the full suite (272 tests) passes in both debug and release with no new failures or skip annotations.


Used 47/250 turns, 2k input / 23k output tokens.

@totalslacker totalslacker marked this pull request as ready for review May 7, 2026 21:48
Copilot AI review requested due to automatic review settings May 7, 2026 21:48
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Optimizes the SwitchcraftCore spherical k-means assignment hot path to make IndexerTests.performanceSmoke reliably pass in debug builds by replacing a bounds-checked Swift argmax loop with Accelerate’s SIMD-compiled vDSP_maxvi.

Changes:

  • Replace per-row argmax in KMeans.assignAll() with vDSP_maxvi over the computed score rows.
  • Add a KMeans regression test to validate assignment picks the maximum-scoring centroid.
  • Update performanceSmoke commentary with corrected root-cause attribution and measurement notes.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
Sources/SwitchcraftCore/KMeans/KMeans.swift Switch argmax implementation to vDSP_maxvi to remove debug-mode bounds-check overhead in the inner loop.
Tests/SwitchcraftTests/KMeansTests.swift Add a regression test for the assign/argmax path after the vDSP change.
Tests/SwitchcraftTests/IndexerTests.swift Update performance test commentary to reflect the new k-means argmax implementation and measurements.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +187 to +190
scores.withUnsafeBufferPointer { scoresPtr in
let base = scoresPtr.baseAddress!
for row in 0..<m {
var maxVal: Float = 0
let assignments = KMeans.assign(data: data, dims: dims, centroids: centroids)
#expect(assignments == [2, 0, 1])
}

Comment on lines +418 to +423
// Acceptance criterion: < 5s (release). Debug limit is 10s.
//
// Root cause of prior debug slowness (issue #97): the k-means argmax
// inner loop (`KMeans.assignAll`) ran 5000×1131=5.655M bounds-checked
// iterations per call (7 calls per flush = 33.93M total). In debug
// mode Swift disables vectorisation → ~5–9s elapsed. It is now replaced
totalslacker and others added 2 commits May 7, 2026 15:00
…rash

The vDSP_maxvi path force-unwraps scoresPtr.baseAddress! — safe today
because both public callers precondition k > 0, but the private helper had
no local guard. A future caller with k == 0 would produce an empty scores
array whose baseAddress is nil, crashing here. Adding guard k > 0 mirrors
the existing guard m > 0 and makes the function self-contained.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rrence semantics

vDSP_maxvi documents returning the first maximum occurrence (lowest index).
The prior plain-Swift left-to-right argmax had the same tie-break. This test
locks in that behavior explicitly: data=[1,0] with centroids c0=[1,0],
c1=[0,1], c2=[1,0] gives a tie between c0 and c2 (both dot=1.0) — the
expected assignment is 0 (first occurrence wins).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@totalslacker
Copy link
Copy Markdown
Owner Author

🏭 Fabrik — stage: Review
branch: fabrik/issue-97 | commit: d431c72 | main: e83d644 | 2026-05-07 22:00 UTC

Pushed. Let me write the final review output:

Review Findings

Fixed

  • Issue: assignAll() in KMeans.swift force-unwraps scoresPtr.baseAddress! but had no local guard for k > 0. Both public callers (cluster() and assign()) already precondition k > 0, so this is currently safe — but a future call with k == 0 would produce an empty scores array (nil baseAddress) and crash. Fix: added guard k > 0 else { return } mirroring the existing guard m > 0 on the line above. Commit: 7ff9e84.

  • Issue: The new assignAllPicksMaximum regression test covers only the unambiguous-maximum path. vDSP_maxvi has documented first-occurrence tie-break semantics, but that was unverified against the prior Swift loop's left-to-right scan. Fix: added assignTieBreakPicksFirst() test with data=[1,0] and two equal-score centroids — asserts assignment is index 0 (first occurrence). This locks in deterministic behavior across Accelerate/OS versions. Commit: d431c72.

Evaluated, No Fix

  • Copilot finding [3]Date() vs ContinuousClock for monotonic timing in performanceSmoke. The timing code (lines 409/414) was pre-existing and unchanged by this PR; multiple other tests in the project (TokenizerTests, SQLiteStorageConcurrencyTests) also use Date(). Switching to ContinuousClock would improve measurement quality but is out of scope for this PR's targeted fix.

Verified

  • Core vDSP_maxvi implementation is correct: pointer advances by row * k Floats (not bytes), stride is 1, count is vDSP_Length(k), returned vDSP_Length index is 0-based within the row sub-array and converts correctly to Int.
  • assignAllPicksMaximum test logic verified by inspection: standard basis vectors give identity-like dot-product matrix, expected assignments [2, 0, 1] are unambiguous.
  • No skip annotations, no weakened assertions, no budget changes without evidence.
  • CI is green on both swift test (macOS) and swift test -c release (macOS) jobs on PR [100].
  • Two local timing failures observed during review (SQLiteStorage WAL Concurrency microbench at 7% over floor; PerformanceTests search p95 at 6% over floor) are pre-existing load-sensitive failures not introduced by this PR — both tests explicitly document their timing sensitivity in comments and CI passes cleanly on both.

Reviewed the vDSP_maxvi argmax optimization for issue #97. Fixed two issues from Copilot review: added guard k > 0 defensive guard in assignAll() to prevent a future nil-baseAddress crash, and added a tie-break regression test to lock in vDSP_maxvi's first-occurrence semantics. CI is green on both debug and release jobs. Two local timing microbenchmark failures are pre-existing load artifacts unrelated to this PR.


Used 33/50 turns, 0k input / 12k output tokens.

@totalslacker
Copy link
Copy Markdown
Owner Author

🏭 Fabrik — stage: Validate
branch: fabrik/issue-97 | commit: b8533da | main: 814f8e3 | 2026-05-07 22:08 UTC

All tests pass. Here is the complete validation report.


Validation Report

Branch state

Merged origin/main (commit 814f8e3) into fabrik/issue-97 — the upstream fix that replaced Date() with ContinuousClock in livenessTest and removed the .disabled(if: CI) skip from that test. No merge conflicts. Pushed successfully.

Requirements: 8/8 verified

  • Root cause identified: Case (b) confirmed — host-load / debug-budget mismatch. The plain Swift argmax inner loop in KMeans.assignAll() ran 33.93M bounds-checked iterations per flush in debug mode (~5–9s). Pre-fix measurements: median 6.515s, p95 9.611s — essentially no headroom against the 10s limit.
  • Fix implemented: Replaced the argmax loop with vDSP_maxvi (Accelerate), which is always SIMD-compiled. Post-fix: argmax completes in ~15ms, total flush elapsed ~4.5s in debug, ~0.3s in release — both well within limits.
  • No skip mechanism added: Confirmed. No .disabled(if: CI), XCTSkipIf, XCTSkipUnless, #if !CI, or equivalent in any file touched by this PR.
  • No assertion deleted/weakened: Confirmed. The < 10.0s and < 5.0s budget assertions are unchanged.
  • Budget change evidence: No budget change was necessary. Comment updated with measurement evidence (median/p95 pre- and post-fix) as required.
  • Regression tests added: assignAllPicksMaximum (3×3 unambiguous argmax) and assignTieBreakPicksFirst (tie-break → first-occurrence semantics) added to KMeansTests.swift.
  • guard k > 0: Added to assignAll() matching the existing guard m > 0, preventing a nil-baseAddress crash on future k=0 calls.
  • All plan tasks checked: Tasks 1–9 all marked complete; docs/Plan.md was not separately itemized for this perf fix (no ADR required, per plan).

Test Suite (debug)

swift test (debug) — 273 tests in 67 suites passed after 6.6s
performanceSmoke: 4.483s (limit 10.0s) ✓
SQLiteStorage WAL Concurrency: all passed ✓

Result: 273 tests, 0 failures, 0 newly skipped

Test Suite (release)

swift test -c release — 273 tests in 67 suites passed after 17.1s
performanceSmoke: 0.307s (limit 5.0s) ✓
PerformanceTests: all passed ✓
SQLiteStorage WAL Concurrency: all passed ✓

Result: 273 tests, 0 failures, 0 newly skipped

CI-skipped tests: local results

  • CoreMLInt8wParity (asset-gated) — skipped locally (no model assets). Not CI-gated by policy; requires local model path.
  • T5CoreMLEmbedder (asset-gated) — same.
  • 33-Fact Corpus Parity (asset-gated) — same.
  • performanceAssertionTest in SQLiteStorage WAL Concurrency.disabled(if: CI) (pre-existing, documented in test comment). Passed locally in release (1.798s); passed in debug run (4.390s). This is the test that was failing earlier under heavy machine load; it passed cleanly on this run.
  • PerformanceTests suite — release-only (skipped in debug). Passed in release.

Regressions: None

All three files changed by this PR:

  • Sources/SwitchcraftCore/KMeans/KMeans.swiftvDSP_maxvi optimization + guard k > 0
  • Tests/SwitchcraftTests/KMeansTests.swift — two new regression tests
  • Tests/SwitchcraftTests/IndexerTests.swift — corrected comment + measurement evidence

No other files modified.

Previous CI (pre-merge push)

Both swift test (macOS) and swift test -c release (macOS) passed on commit d431c72. New CI run is pending for the merge commit b8533da.

Verdict: READY TO MERGE

The target test (performanceSmoke) now passes in both debug (4.5s < 10s) and release (0.3s < 5s). The full suite is 273/273 in both configurations with no new failures, no new skips, and no weakened assertions.

Validated issue #97 fix: vDSP_maxvi argmax optimization in KMeans.assignAll() eliminates the 33.93M-iteration bounds-checked Swift loop that caused performanceSmoke to take 36.5s under load. Local runs confirm 4.5s debug / 0.3s release — both within the original 10s/5s budgets. Full suite passes 273/273 in both debug and release with zero new failures, zero new skip annotations, and two new KMeans regression tests verifying correctness.


Used 36/50 turns, 0k input / 11k output tokens.

@totalslacker totalslacker merged commit b0d8475 into main May 7, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

P0: Investigate LSM Indexer performanceSmoke failure — 36.5s vs 10s debug limit on Validate run

2 participants