Summary
IndexerTests.performanceSmoke ("indexing 5000 single-token 128-dim chunks completes within 5s") took 36.5s against the 10s debug-mode limit (3.65× the budget) on the Validate run for PR #94. The test is at Tests/SwitchcraftTests/IndexerTests.swift:393-429 and is not currently disabled on CI. PR #94's Validate stage dismissed the failure as host load and issued a "READY TO MERGE" verdict anyway, which is a policy violation. This issue investigates the root cause and fixes it without relaxing the test through policy-violating means.
This is P0: it (and its sibling issues) blocks all merges until resolved.
Requirements
Root cause of the 36.5s observation must be identified and documented. Exactly one of the following will be true:
(a) Real performance regression. A recent commit slowed the indexer hot path (Q4 dequant, residual encode, bucket flush, etc.) by 3–7×. Fix the regression; do not relax the test.
(b) Host-load / debug-budget mismatch. The debug path is legitimately 10–20× slower than release (as the test comment acknowledges), and the 10s budget leaves no headroom. Raise the debug budget only with measurement evidence (median of 11+ runs on a quiet machine) recorded in the test comment, and only if the new budget still distinguishes a 3–5× perf regression from normal variance.
(c) CI runner regression. Performance of the macOS-15 runners has degraded. Compare CI runtime against a quiet local workstation. If CI is the outlier, move the test to a release-only configuration with a tighter budget, never via .disabled(if: CI).
The test must pass ≥10 consecutive runs on CI in both debug and release after the fix.
No skip mechanism may be added: no .disabled(if: CI), XCTSkipIf(...), XCTSkipUnless(...), #if !CI, @Test(.disabled(...)), or equivalent.
No assertion may be deleted, weakened to a tautology, or made a no-op.
Any budget change must be justified by measurement evidence (median of 11+ runs on a quiet machine) recorded in a comment next to the new value.
No new failing tests may be introduced anywhere — not in debug, not in release, not intermittently. "Flaky" counts as failing under project policy.
No previously-passing tests may have skip annotations added as a side effect of this fix.
The full test suite after the fix must be at least as healthy as main was before this PR (i.e., one fewer failing test — the target — and no new failures elsewhere).
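The evidence rule above (median of 11+ runs on a quiet machine) could be enforced by a small helper that refuses to compute a median from too few samples. This is a minimal sketch: `EvidenceError`, `medianOfRuns`, and the sample values are hypothetical and do not come from the project.

```swift
/// Raised when there are too few measurements to justify a budget change.
enum EvidenceError: Error {
    case tooFewRuns(got: Int, need: Int)
}

/// Returns the median of `samples` (seconds), requiring at least `minimumRuns`.
/// Hypothetical helper for illustration; not part of the project.
func medianOfRuns(_ samples: [Double], minimumRuns: Int = 11) throws -> Double {
    guard samples.count >= minimumRuns else {
        throw EvidenceError.tooFewRuns(got: samples.count, need: minimumRuns)
    }
    let sorted = samples.sorted()
    let mid = sorted.count / 2
    // Even counts average the two middle values; odd counts take the middle one.
    return sorted.count % 2 == 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
}

// Made-up timings in seconds, only to show the call shape.
let exampleRuns = [6.1, 5.9, 6.4, 6.0, 6.2, 5.8, 6.3, 6.0, 6.1, 7.2, 5.9]
let exampleMedian = try? medianOfRuns(exampleRuns)   // 6.1 for these values
let rejected = try? medianOfRuns([6.1, 5.9])         // nil: fewer than 11 runs
```

The value returned here is what would be recorded in the comment next to any new budget, together with the run count and build configuration.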
Scope
In scope:
Investigate IndexerTests.performanceSmoke at Tests/SwitchcraftTests/IndexerTests.swift:393-429.
Bisect git history (from the last green CI run on main) if a regression is suspected.
Run timing measurements (≥11 runs, quiet machine, both debug and release) to characterize the actual runtime distribution.
Fix the underlying cause — regression fix or evidence-backed budget adjustment.
Verify the full suite is green (no net regression) after the fix.
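The timing-measurement step could be driven by a harness along these lines. It is a sketch under stated assumptions: `timeRuns` and `placeholderWorkload` are hypothetical names, and the workload is a stand-in loop rather than the real indexer, which would be substituted in when measuring.

```swift
import Dispatch

/// Runs `work` `count` times and returns each wall-clock duration in seconds.
func timeRuns(count: Int, _ work: () -> Void) -> [Double] {
    var durations: [Double] = []
    durations.reserveCapacity(count)
    for _ in 0..<count {
        let start = DispatchTime.now().uptimeNanoseconds
        work()
        let end = DispatchTime.now().uptimeNanoseconds
        durations.append(Double(end - start) / 1_000_000_000)
    }
    return durations
}

/// Stand-in workload: NOT the real indexer, just something measurable.
func placeholderWorkload() {
    var sum = 0.0
    for i in 0..<200_000 { sum += Double(i % 128) * 0.5 }
    precondition(sum > 0)  // uses the result so the loop is not dead code
}

// 11 runs, matching the minimum the issue asks for.
let samples = timeRuns(count: 11, placeholderWorkload)
let sortedSamples = samples.sorted()
print("min \(sortedSamples.first!)s, median \(sortedSamples[sortedSamples.count / 2])s, max \(sortedSamples.last!)s")
```

Running the same harness under both `swift build` configurations (debug and `-c release`) gives the two distributions the scope item asks for.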
Out of scope:
Fixing other unrelated test failures discovered during investigation (file separate issues; treat them as P0 incidents per project policy).
Skipping or disabling the performance test on CI for any reason.
Raising the budget without measurement evidence.
Deleting the test (only permissible if the Research stage produces an ADR demonstrating performance assertions of this kind cannot be measured reliably in this project's testing environment).
Prior Art / Context
Tests/SwitchcraftTests/IndexerTests.swift:393-429 — the failing test. Budget: < 5.0s release, < 10.0s debug (doubled per test comment to account for debug-mode loop overhead).
Test uses Indexer(storage: storage, config: .production) with InMemoryStorage, indexing 5000 random rows of 128-dim Q4-quantized embeddings.
Observed: 36.5s on Validate stage local run (2026-05-07).
Sibling WAL skip issues: resolved differently (those were pre-approved skips for a known flaky concurrency microbenchmark); this test has no such approval.
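To make the numbers above concrete, the sketch below shows one way a per-configuration budget could be selected and how far the 36.5s observation sits from each budget. The `#if DEBUG` mechanism is an assumption about how the test chooses its limit (the issue only states the two values), and `overshoot` is a hypothetical helper.

```swift
// Budget in seconds, chosen at compile time. The 5.0 / 10.0 values are the
// ones quoted in this issue; selecting them with `#if DEBUG` is an assumption.
#if DEBUG
let budgetSeconds = 10.0
#else
let budgetSeconds = 5.0
#endif

/// How many times over `budget` an observed runtime is.
func overshoot(observed: Double, budget: Double) -> Double {
    observed / budget
}

let observedSeconds = 36.5  // the Validate-stage observation
let overDebugBudget = overshoot(observed: observedSeconds, budget: 10.0)
let overReleaseBudget = overshoot(observed: observedSeconds, budget: 5.0)
print("budget in this build: \(budgetSeconds)s")
print("over debug budget: \(overDebugBudget)x, over release budget: \(overReleaseBudget)x")
```

The observation is 3.65× the debug budget and 7.3× the release budget, which is the range the regression hypothesis in the requirements has to explain.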
Risks / Dependencies
If the root cause is a real regression, git bisect may reveal the offending commit was already merged to main — the fix may need a clean revert or targeted patch.
If the root cause is a debug budget that is too tight, the evidence requirement (11+ runs, quiet machine) means measurement takes real wall-clock time before any code change can be made.
If CI runner performance has degraded, the fix path (release-only budget or environment-specific budget) must be designed so a 3–5× real regression is still detectable.
Any code change touching the indexer hot path requires regression tests per project policy (storage, index, codec files all have this requirement).
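The requirement that a budget still detects a 3–5× regression can be written down as a check. The sketch below is one possible reading of that requirement, not project code: `budgetIsDefensible` is a hypothetical name, and the values in the usage lines are made up.

```swift
/// Decides whether `budget` separates normal variance from a real slowdown.
/// - `median`, `worstObserved`: measured seconds from quiet-machine runs.
/// - `regressionFactor`: smallest slowdown the test must still catch (3x here).
/// Hypothetical helper; one interpretation of the rule, not the project's own.
func budgetIsDefensible(budget: Double,
                        median: Double,
                        worstObserved: Double,
                        regressionFactor: Double = 3.0) -> Bool {
    // Normal runs, including the slowest one measured, must fit in the budget.
    let toleratesNormalVariance = worstObserved < budget
    // A typical run slowed down by the regression factor must exceed the budget.
    let catchesRegression = median * regressionFactor > budget
    return toleratesNormalVariance && catchesRegression
}

// Made-up measurements: median 6.1s, slowest run 7.2s.
let fits = budgetIsDefensible(budget: 10.0, median: 6.1, worstObserved: 7.2)
let tooLoose = budgetIsDefensible(budget: 40.0, median: 6.1, worstObserved: 7.2)
let tooTight = budgetIsDefensible(budget: 7.0, median: 6.1, worstObserved: 7.2)
```

With these numbers only the 10s budget passes: 40s would let a 3× slowdown through, and 7s would fail on ordinary variance.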
Acceptance Criteria
Root cause identified and documented (real regression / host-load budget / runner change).
If real regression: fix applied and test passes within the original budget.
If budget adjustment: new budget set with measurement evidence (median of 11+ runs) recorded in the test comment, and the budget still distinguishes a 3–5× regression from normal variance.
Test passes ≥10 consecutive runs on CI in both debug and release.
No .disabled(if: CI), XCTSkip, #if !CI, or equivalent skip mechanism added anywhere.
No previously-passing test newly fails or is newly skipped as a side effect.
The full swift test (debug) and swift test -c release suites are at least as healthy as main was before this PR, with exactly one fewer failure (the target test).
CI is green on both jobs (swift test (macOS) and swift test -c release (macOS)) on the PR.
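The consecutive-run criterion can be checked mechanically from a list of CI outcomes. This is a small sketch with hypothetical names (`trailingPassStreak`, `meetsConsecutiveRunCriterion`); how the pass/fail history is collected from CI is left open.

```swift
/// Length of the unbroken run of passes at the end of `results` (oldest first).
func trailingPassStreak(_ results: [Bool]) -> Int {
    var streak = 0
    for passed in results.reversed() {
        guard passed else { break }
        streak += 1
    }
    return streak
}

/// True when the most recent `required` runs all passed.
func meetsConsecutiveRunCriterion(_ results: [Bool], required: Int = 10) -> Bool {
    trailingPassStreak(results) >= required
}

// Made-up histories: one failure followed by ten passes, and a late failure.
let recovered = [false] + Array(repeating: true, count: 10)
let lateFailure = Array(repeating: true, count: 9) + [false, true]
print(meetsConsecutiveRunCriterion(recovered))    // true
print(meetsConsecutiveRunCriterion(lateFailure))  // false
```

The check would be applied separately to the debug and release histories, since the criterion requires both.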
Engineering Policy Reminder
Per .claude/CLAUDE.md: failing tests never merge. The Validate stage on PR #94 violated this by issuing a "READY TO MERGE" verdict with this test red. Do not repeat that pattern on this issue's PR.
A fix that makes the target test pass by introducing fragility elsewhere is not a fix; it is moving the bug. In that case, stop and escalate to a human via a comment instead of merging.