T5CoreMLEmbedder CPU fallback path never recovers — 1,038/1,038 fail in production, dead weight #93

@totalslacker

Problem

The CPU fallback path added by #87 (and improved with diagnostics in #90) does not recover any inferences in practice. Every attempt to retry an ANE-failed inference on .cpuOnly also fails, and the embedder rethrows the original IOSurface error.

What's actually keeping bulk-index alive in production is the proactive model reload every reloadInterval calls (also from #87). The CPU fallback is dead weight: it costs an extra (failed) MLModel load + predict per failure, produces no successful results, and contributes nothing to the recovery story.

Production Data

A SafariUnfucker bulk-index run (2026-05-07) processed 5,570 successful inferences and 1,038 failures:

1038  cpu_fallback_failed  (every failure)
1038  Failed to allocate E5 buffer object. E5RT: Failed to allocate memory IOSurface object. (3)  (every failure's reason)
0     recovered_iosurface_exhaustion  (none)

A row with category: "cpu_fallback_failed" is logged only when all three of the following hold: the ANE threw an IOSurface error, the CPU fallback was attempted, and the CPU path also threw. Across 1,038 opportunities, CPU fallback recovered zero inferences.

The ANE pool DID recover during the run — three failure bursts each followed by recovery periods of 5–25 minutes. Recovery timing is consistent with the proactive reload (Layer 2) firing at a 500-call boundary, not with CPU fallback.

Summary

Remove the CPU fallback path (Layer 3b) from T5CoreMLEmbedder entirely. The path has a measured 0% recovery rate and adds per-failure overhead (one extra MLModel load + predict call) without benefit. The proactive reload (Layer 2) is the only mitigation that works; it should be the sole recovery mechanism. This simplifies predictWindow, eliminates cpuPredictorFactory from the actor's internal state, and removes tests for a dead code path.

Note on option (b), logging the CPU-side error: The issue originally proposed logging the CPU-side error as a diagnostic step before removing the fallback (option (a)). Based on source inspection, logCPUFallbackFailed (line 613 of T5CoreMLEmbedder.swift) already captures cpuErrorName, cpuErrorReason, and cpuCallStack in the JSONL row per ADR 021's addendum, so option (b) appears to be complete already. The Research stage should verify this against actual production JSONL rows; if the CPU error fields are absent, option (b) must be implemented before option (a) is shipped.
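The Research-stage check described above could be scripted roughly as follows. This is a minimal sketch, not the project's tooling: the helper name rowsMissingCPUFields is hypothetical, and it assumes the JSONL keys are spelled exactly like the Swift names mentioned in this issue (category, cpuErrorName, cpuErrorReason, cpuCallStack), which should be confirmed against a real log file.

```swift
import Foundation

// Hypothetical helper: returns the 1-based line numbers of cpu_fallback_failed
// rows that lack any of the CPU error fields. The key names are an assumption
// (taken from the property names cited in this issue), not a verified schema.
func rowsMissingCPUFields(inJSONL text: String) -> [Int] {
    let requiredKeys = ["cpuErrorName", "cpuErrorReason", "cpuCallStack"]
    var missing: [Int] = []
    let lines = text.split(separator: "\n", omittingEmptySubsequences: false)
    for (index, line) in lines.enumerated() {
        // Skip blank lines, malformed rows, and rows of any other category.
        guard
            let data = String(line).data(using: .utf8),
            let object = try? JSONSerialization.jsonObject(with: data),
            let row = object as? [String: Any],
            (row["category"] as? String) == "cpu_fallback_failed"
        else { continue }
        // A field counts as present only if the key exists and is not JSON null.
        let hasAllFields = requiredKeys.allSatisfy { key in
            guard let value = row[key] else { return false }
            return !(value is NSNull)
        }
        if !hasAllFields { missing.append(index + 1) }
    }
    return missing
}
```

An empty result over the production files would support treating option (b) as done; any returned line numbers point at rows worth inspecting before the fallback code is deleted.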

Requirements

  • cpuPredictorFactory property is removed from T5CoreMLEmbedder (both the stored let and all assignments in all init paths).
  • The Layer 3b CPU fallback branch in predictWindow (lines ~539–557) is removed. The reactive reload + ANE retry (Layer 3a, lines ~521–537) is retained.
  • When an IOSurface exception is caught and the ANE retry also fails, the embedder rethrows the original error and logs the existing category: "error" JSONL row (the behavior of the current cpuPredictorFactory == nil path). The cpu_fallback_failed category is no longer produced.
  • ADR 021 is amended to reflect Layer 3b's removal and the production evidence that motivated it.
  • Tests for CPU fallback success (testIOSurfaceFallbackLogsRecovery) and CPU fallback failure (testCPUFactoryThrowsLogsDistinctError, testCPUPredictThrowsLogsDistinctError) are removed.
  • All remaining tests continue to pass: proactive reload, autoreleasepool, IOSurface error logging, ANE retry.
  • No change to the public init signatures (cpuPredictorFactory was never a public parameter).
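The control flow these requirements leave behind can be sketched as below. This is an illustration under stated assumptions, not the code in T5CoreMLEmbedder.swift: only reloadInterval and predictWindow are names from this issue, while WindowPredictor, IOSurfaceExhausted, EmbedderSketch, loadANEPredictor, and reloadCount are hypothetical stand-ins. A plain class replaces the actor to keep the sketch synchronous, and resetting the proactive counter on a reactive reload is an assumption.

```swift
// Hypothetical stand-in for the Core ML predictor wrapper.
protocol WindowPredictor {
    func predict(_ tokens: [Int32]) throws -> [Float]
}

// Hypothetical stand-in for the E5RT IOSurface allocation failure.
struct IOSurfaceExhausted: Error {}

final class EmbedderSketch {
    private let reloadInterval: Int
    private let loadANEPredictor: () throws -> any WindowPredictor
    private var predictor: any WindowPredictor
    private var callsSinceReload = 0
    private(set) var reloadCount = 0

    init(reloadInterval: Int,
         loadANEPredictor: @escaping () throws -> any WindowPredictor) throws {
        self.reloadInterval = reloadInterval
        self.loadANEPredictor = loadANEPredictor
        self.predictor = try loadANEPredictor()
    }

    private func reload() throws {
        predictor = try loadANEPredictor()
        callsSinceReload = 0  // assumption: any reload restarts the proactive count
        reloadCount += 1
    }

    func predictWindow(_ tokens: [Int32]) throws -> [Float] {
        // Layer 2: proactive reload once reloadInterval calls have been served.
        if callsSinceReload >= reloadInterval { try reload() }
        callsSinceReload += 1
        do {
            return try predictor.predict(tokens)
        } catch let original as IOSurfaceExhausted {
            // Layer 3a: reactive reload followed by a single ANE retry.
            do {
                try reload()
                return try predictor.predict(tokens)
            } catch {
                // No Layer 3b: surface the original error instead of trying CPU.
                throw original
            }
        }
    }
}
```

With the CPU branch gone, the failure path is one reload and one retry, so a failed inference costs at most one extra model load rather than two.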

Scope

In scope:

  • Remove cpuPredictorFactory property and all wiring in production init paths.
  • Remove cpuPredictorFactory parameter from the test-only internal init.
  • Remove the Layer 3b branch from predictWindow.
  • Remove the logCPUFallbackFailed private method and extractCPUErrorFields helper if they are unused after removal (verify — extractCPUErrorFields may have no other callers).
  • Remove cpu_fallback_failed JSONL category (no longer produced).
  • Update ADR 021 with a second addendum documenting this removal and the production evidence.
  • Delete the three stress-test cases that cover CPU fallback behavior.

Out of scope:

  • Changing reloadInterval default or adding adaptive byte-pressure reload threshold (tracked separately per ADR 021 addendum).
  • Changing any other public API.
  • Changes to MLPredictor, CoreMLFailureLogEntry, or the broader JSONL schema (only the cpu_fallback_failed category is retired).

Prior Art / Context

Risks / Dependencies

  • Option (b) completeness: If the Research stage finds that cpu_fallback_failed JSONL rows do NOT include cpuErrorName/cpuErrorReason in the actual production files (contradicting the source code), then option (b) should be implemented first so future engineers have a record of why CPU fallback failed before the code is deleted.
  • No public API surface change: cpuPredictorFactory was never exposed in a public init — no callers outside the package are affected.
  • Test-only init signature change: The internal factory-based test init accepts cpuPredictorFactory as a parameter. Removing it changes internal-only API — no downstream impact.
