Skip to content

P0: Fix flaky SQLite WAL livenessTest — replace Date() comparison with monotonic clock, remove CI skip #95

@totalslacker

Description

@totalslacker

Summary

SQLiteStorageConcurrencyTests.livenessTest() was disabled on CI runners via @Test(.disabled(if: ProcessInfo.processInfo.environment["CI"] != nil)) because its writeEnd < readEnd assertion uses Date() timestamps that lack the resolution to distinguish ticks on fast release-mode CI runners. This issue removes the skip and replaces the timing primitive with a monotonic, sub-microsecond clock so the assertion is both correct and reliable on CI. Per the project engineering policy (.claude/CLAUDE.md), the CI skip is itself a policy violation. This is P0 — it blocks all merges until resolved.

Requirements

  • Replace Date() timestamp captures in livenessTest() with a monotonic, sub-microsecond clock. Preferred: ContinuousClock / SuspendingClock instants. Acceptable: clock_gettime_nsec_np(CLOCK_UPTIME_RAW) / mach_absolute_time for nanosecond resolution.
  • Compare instants using the chosen clock's native Instant type (e.g. < on ContinuousClock.Instant) — not bridged through TimeInterval or Double.
  • Remove the .disabled(if: CI) annotation. The test must run unconditionally, including on CI.
  • livenessTest() must pass ≥20 consecutive runs locally in both debug and release configurations before the fix is considered complete.
  • If after the clock fix the assertion is still flaky, do not skip again — root-cause the real ordering race in the WAL writer/reader split.
  • No new failing tests, anywhere, for any reason. This includes:
    • No new tests that fail locally (debug or release) on a clean checkout.
    • No new intermittently-failing ("flaky") tests — flaky counts as failing under project policy.
    • No new .disabled(if: ...), XCTSkipIf(...), XCTSkipUnless(...), #if !CI, @Test(.disabled(...)), or equivalent annotations on tests that previously ran.
    • No deleting assertions, weakening them to tautologies, or making any test a no-op.
    • No raising a threshold or budget without measurement evidence (median of ≥11 runs on a quiet machine) recorded in a comment next to the new value.
  • The full test suite after the fix must be at least as healthy as main was before this PR, with exactly one fewer skipped/failing test (the one this issue fixes).
  • If making livenessTest() pass requires introducing fragility elsewhere, stop and escalate to a human comment instead of merging.

Scope

In scope:

  • Tests/SwitchcraftTests/SQLiteStorageConcurrencyTests.swift — clock fix and CI annotation removal in livenessTest().

Out of scope:

  • Changes to performanceAssertionTest() (separate issue, different root cause).
  • Changes to safariUnfuckerRegressionTest() (currently passing, leave alone).
  • Changes to the SQLite storage actor split itself.

Acceptance Criteria

  • livenessTest() no longer uses Date() for the ordering assertion.
  • .disabled(if: CI) annotation removed from livenessTest().
  • Test passes ≥20/20 consecutive runs locally in both debug and release.
  • CI green on the fix branch (both swift test and swift test -c release jobs).
  • No new test skips, XCTSkip, #if !CI, or equivalent introduced anywhere.
  • No other test newly fails or newly becomes flaky as a result of this PR.
  • Validate stage verifies ≥10 consecutive passing runs and confirms no other test regressed.

Prior Art / Context

  • Root cause: Tests/SwitchcraftTests/SQLiteStorageConcurrencyTests.swift:87–116Date() timestamps are microsecond-class; under release-mode CI the seeded FTS scan completes fast enough that both timestamps can land in the same observed tick, causing the strict writeEnd < readEnd comparison to tie.
  • The underlying assertion (writer must finish before slow reader on the WAL split) is correct and must be preserved.
  • CI skip was introduced in commit f43e29e4 (2026-05-06) as a workaround.
  • See also: issue T5CoreMLEmbedder CPU fallback path never recovers — 1,038/1,038 fail in production, dead weight #93 thread for prior discussion of why .disabled(if: CI) is not an acceptable resolution on this project.

Risks / Dependencies

  • If after the clock fix the assertion is still non-deterministic, that indicates a real ordering race in the WAL writer/reader split — a deeper bug requiring separate root-cause analysis. The correct response is escalation, not another skip.
  • Release-mode inlining/optimization may affect observed ordering of timestamp captures; the chosen monotonic clock must be used with appropriate memory ordering to avoid compiler reordering.

Engineering Policy Reminder

Per .claude/CLAUDE.md: failing tests never merge. "Flaky" is a diagnosis, not a resolution. Moving a bug to a different test is not a fix. This issue exists because the prior fix (skip on CI) was the wrong move — do not repeat it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions