Problem
The CPU fallback path added by #87 (and improved with diagnostics in #90) does not recover any inferences in practice. Every attempt to retry an ANE-failed inference on .cpuOnly also fails, and the embedder rethrows the original IOSurface error.
What's actually keeping bulk-index alive in production is the proactive model reload every reloadInterval calls (also from #87). The CPU fallback is dead weight: it costs an extra (failed) MLModel load + predict per failure, produces no successful results, and contributes nothing to the recovery story.
Production Data
A SafariUnfucker bulk-index run (2026-05-07) processed 5,570 successful inferences and 1,038 failures:
1038 cpu_fallback_failed (every failure)
1038 Failed to allocate E5 buffer object. E5RT: Failed to allocate memory IOSurface object. (3) (every failure's reason)
0 recovered_iosurface_exhaustion (none)
A row with category: "cpu_fallback_failed" is written when the ANE threw an IOSurface error, CPU fallback was attempted, and the CPU path also threw. Across 1,038 opportunities, CPU fallback recovered zero inferences.
The ANE pool DID recover during the run — three failure bursts each followed by recovery periods of 5–25 minutes. Recovery timing is consistent with the proactive reload (Layer 2) firing at a 500-call boundary, not with CPU fallback.
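The zero-recovery figure can be recomputed from the failure log with a tally along the following lines. This is a minimal sketch, not code from the embedder: it assumes each JSONL row is a JSON object with a string category key, and that recovered_iosurface_exhaustion marks a successful CPU fallback. tallyCategories and cpuFallbackRecoveryRate are hypothetical helper names.

```swift
import Foundation

/// Counts JSONL rows per "category" value. Blank or malformed lines are skipped.
/// Assumption: every row is a JSON object carrying a string "category" key.
func tallyCategories(jsonl: String) -> [String: Int] {
    var counts: [String: Int] = [:]
    for line in jsonl.split(separator: "\n") {
        guard let object = try? JSONSerialization.jsonObject(with: Data(line.utf8)),
              let row = object as? [String: Any],
              let category = row["category"] as? String else { continue }
        counts[category, default: 0] += 1
    }
    return counts
}

/// Fraction of CPU fallback attempts that recovered, or nil if none were attempted.
/// Assumption: an attempt ends in exactly one of the two categories below.
func cpuFallbackRecoveryRate(_ counts: [String: Int]) -> Double? {
    let recovered = counts["recovered_iosurface_exhaustion", default: 0]
    let failed = counts["cpu_fallback_failed", default: 0]
    let attempts = recovered + failed
    return attempts == 0 ? nil : Double(recovered) / Double(attempts)
}
```

Returning nil when no fallback was attempted keeps "no data" distinct from a measured 0% rate.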
Summary
Remove the CPU fallback path (Layer 3b) from T5CoreMLEmbedder entirely. The path has a measured 0% recovery rate and adds per-failure overhead (one extra MLModel load + predict call) without benefit. The proactive reload (Layer 2) is the only mitigation that works; it should be the sole recovery mechanism. This simplifies predictWindow, eliminates cpuPredictorFactory from the actor's internal state, and removes tests for a dead code path.
Note on option (b) — log CPU-side error: The issue originally proposed logging the CPU-side error as a diagnostic step before removal. Based on source inspection, logCPUFallbackFailed (line 613 of T5CoreMLEmbedder.swift) already captures cpuErrorName, cpuErrorReason, and cpuCallStack in the JSONL row per ADR 021's addendum. Option (b) appears already complete. The Research stage should verify this against actual production JSONL rows; if the cpu error fields are absent, option (b) must be implemented before option (a) is shipped.
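The Research-stage check could be scripted roughly as follows. This is a sketch under stated assumptions: it takes the JSON keys to be spelled exactly cpuErrorName and cpuErrorReason (the names used in the source) and treats a missing or empty string value as absent. rowsMissingCPUErrorFields is a hypothetical helper, not an existing function.

```swift
import Foundation

/// Returns how many cpu_fallback_failed rows lack a non-empty string
/// for at least one of the required keys.
/// Assumption: JSON key names match the Swift field names from the source.
func rowsMissingCPUErrorFields(
    jsonl: String,
    requiredKeys: [String] = ["cpuErrorName", "cpuErrorReason"]
) -> Int {
    var missing = 0
    for line in jsonl.split(separator: "\n") {
        // Only rows in the cpu_fallback_failed category are inspected.
        guard let object = try? JSONSerialization.jsonObject(with: Data(line.utf8)),
              let row = object as? [String: Any],
              (row["category"] as? String) == "cpu_fallback_failed" else { continue }
        let complete = requiredKeys.allSatisfy { key in
            guard let value = row[key] as? String else { return false }
            return !value.isEmpty
        }
        if !complete { missing += 1 }
    }
    return missing
}
```

A result of 0 over the production files would support treating option (b) as complete; any other value means the fields are missing from at least some rows.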
Requirements
- The cpuPredictorFactory property is removed from T5CoreMLEmbedder (both the stored let and all assignments in all init paths).
- The Layer 3b CPU fallback branch in predictWindow (lines ~539–557) is removed. The reactive reload + ANE retry (Layer 3a, lines ~521–537) is retained.
- When an IOSurface exception is caught and the ANE retry also fails, the embedder rethrows the original error with the existing category: "error" JSONL row (the cpuPredictorFactory == nil path). The cpu_fallback_failed category is no longer produced.
- ADR 021 is amended to reflect Layer 3b's removal and the production evidence that motivated it.
- Tests for CPU fallback success (testIOSurfaceFallbackLogsRecovery) and CPU fallback failure (testCPUFactoryThrowsLogsDistinctError, testCPUPredictThrowsLogsDistinctError) are removed.
- All remaining tests continue to pass: proactive reload, autoreleasepool, IOSurface error logging, ANE retry.
- No change to the public init signatures (cpuPredictorFactory was never a public parameter).
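The control flow these requirements leave behind can be sketched as below. Everything here is a hypothetical stand-in rather than the real implementation: WindowPredictor stands in for MLPredictor, IOSurfaceExhausted for the actual IOSurface error, and SketchEmbedder is a plain class rather than the real actor so the example stays synchronous. The loggedCategories array stands in for JSONL rows.

```swift
// Hypothetical stand-ins: the real MLPredictor protocol, error type, and
// T5CoreMLEmbedder (an actor) are not reproduced here.
struct IOSurfaceExhausted: Error {}

protocol WindowPredictor {
    func predict(_ input: [Int32]) throws -> [Float]
}

final class SketchEmbedder {
    private var predictor: any WindowPredictor
    private let makeANEPredictor: () throws -> any WindowPredictor
    private let reloadInterval: Int
    private var callsSinceReload = 0
    private(set) var reloadCount = 0
    private(set) var loggedCategories: [String] = []  // stands in for JSONL rows

    init(reloadInterval: Int,
         makeANEPredictor: @escaping () throws -> any WindowPredictor) throws {
        self.reloadInterval = reloadInterval
        self.makeANEPredictor = makeANEPredictor
        self.predictor = try makeANEPredictor()
    }

    private func reload() throws {
        predictor = try makeANEPredictor()
        callsSinceReload = 0
        reloadCount += 1
    }

    func predictWindow(_ input: [Int32]) throws -> [Float] {
        // Layer 2: proactive reload every reloadInterval calls.
        if callsSinceReload >= reloadInterval { try reload() }
        callsSinceReload += 1
        do {
            return try predictor.predict(input)
        } catch let original as IOSurfaceExhausted {
            // Layer 3a: reactive reload, then a single ANE retry.
            do {
                try reload()
                return try predictor.predict(input)
            } catch {
                // No Layer 3b: log the existing "error" category and
                // rethrow the original IOSurface error.
                loggedCategories.append("error")
                throw original
            }
        }
    }
}
```

With the CPU branch gone, the sketch needs only one predictor factory, which mirrors the removal of cpuPredictorFactory from the embedder's stored state.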
Scope
In scope:
- Remove the cpuPredictorFactory property and all wiring in production init paths.
- Remove the cpuPredictorFactory parameter from the test-only internal init.
- Remove the Layer 3b branch from predictWindow.
- Remove the logCPUFallbackFailed private method and the extractCPUErrorFields helper if they are unused after removal (verify first: extractCPUErrorFields may have no other callers).
- Remove the cpu_fallback_failed JSONL category (no longer produced).
- Update ADR 021 with a second addendum documenting this removal and the production evidence.
- Delete the three stress-test cases that cover CPU fallback behavior.
Out of scope:
- Changing the reloadInterval default or adding an adaptive byte-pressure reload threshold (tracked separately per ADR 021 addendum).
- Changing any other public API.
- Changes to MLPredictor, CoreMLFailureLogEntry, or the broader JSONL schema (only the cpu_fallback_failed category is retired).
Prior Art / Context
- A prior change added the cpu_fallback_failed category and cpu error fields to JSONL rows, and also added Layer 3a (reactive reload + ANE retry).
Risks / Dependencies
- Option (b) completeness: If the Research stage finds that cpu_fallback_failed JSONL rows do NOT include cpuErrorName/cpuErrorReason in the actual production files (contradicting the source code), then option (b) should be implemented first so future engineers have a record of why CPU fallback failed before the code is deleted.
- No public API surface change: cpuPredictorFactory was never exposed in a public init, so no callers outside the package are affected.
- Test-only init signature change: The internal factory-based test init accepts cpuPredictorFactory as a parameter. Removing it changes internal-only API, with no downstream impact.