v0.9.0 — cross-model cadence + gemma-4 floor + KNOWN_PANIC_MODELS by Harperbot · Pull Request #2 · Harperbot/metal-guard

Harperbot · 2026-04-24T17:06:41Z

Summary

Consolidates panic #7–#11 findings from Harper's production timeline (2026-04-16 → 2026-04-24) into the open-source distribution. Ports three defences from the internal fork and documents the first known-panic model repeat offender: mlx-community/gemma-4-31b-it-8bit.

B1 subprocess_inference_guard(model_id) — per-inference Metal flush contextmanager for subprocess workers (ended a 6-panic streak on the internal MAGI pipeline)
C5 cross-model cadence — CadenceGuard(cross_model_interval_sec=…) + CrossModelCadenceViolation (subclass of CadenceViolation) + 90-second floor for the gemma-4 family, enforced even when base is 0
C7 gemma4_generation_flush(model_id, generate_call_count) — first-generate settle window; renamed from internal gemma4_firstgen_guard ("guard" was misleading — this is a flush, not a block)
KNOWN_PANIC_MODELS advisory registry — check_known_panic_model() + warn_if_known_panic_model() (idempotent, one warning per process per model)
Version: pyproject.toml + __version__ both bumped to 0.9.0

This PR supersedes the b2-subprocess-inference-guard branch by bundling B2 (which was never released) with the new v0.9.0 additions into a single release.

When metal-guard is not enough

A load-bearing new section in the README and CHANGELOG: when every v0.9.0 defence is engaged and a model still kernel-panics in production, the right operational answer is to switch backend (Ollama / llama.cpp) or pivot to an MoE variant. Community data converges on this: Hannecke (M4 Max 64GB) pivoted to Qwen3-Coder-30B-A3B MoE; ronm92130 on mlx#3186 (2026-04-24) pivoted to llama.cpp on M4 base 32GB, explicitly referencing this project's two-trigger-path hypothesis. Harper's own harper-finance project migrated to Ollama on 2026-04-23 and has run zero-panic since.

mlx-community/gemma-4-31b-it-8bit kernel-panicked twice on the same Harper production pipeline, 24 hours apart, same signature (IOGPUMemory.cpp:492 prepare_count_underflow). That is the evidence driving the honest limitations note in this release.

API compatibility

All additions are backwards-compatible.
CadenceGuard() default constructor behaviour unchanged (new cross_model_interval_sec kw-only, defaults to 0.0).
require_cadence_clear() default behaviour unchanged (new cross_model_interval_sec defaults to 0.0; env var METALGUARD_CROSS_MODEL_INTERVAL provides opt-in without code changes).
CadenceGuard.check() now reads the JSON store directly under _CADENCE_FILE_LOCK rather than routing through self.last_ts(). Subclasses that override last_ts() for custom read paths should note that check() no longer invokes the override — this is documented in the v0.9.0 docstring.

Tests

213 passed (166 pre-existing + 47 new in tests/test_v090_cross_model_cadence.py)
2 critic review rounds:
- R1: 3 P0 + 6 P1 + 4 P2 found → all P0/P1 fixed
- R2: 1 P1 (require_cadence_clear hot-path test gap) + 2 P2 found → P1 fixed, 0 P0 → GO

New tests cover: _is_gemma4_family (20 parametrised cases inc. real sizes 1b/4b/12b/27b/31b, MoE variants, negatives like gemma-4b-it, google/gemma-4-31b, gemma-5-*), _resolve_cross_model_interval (default / env / explicit / invalid-fallback / negative-clamp / zero-disable), CrossModelCadenceViolation inheritance, CadenceGuard cross-model check including same-model priority and zero-still-floors-gemma4, gemma4_generation_flush env kill-switch + sleep override + non-gemma no-op, KNOWN_PANIC_MODELS advisory shape + warn_if_known_panic_model idempotence via monkeypatch.setattr, and the require_cadence_clear no-guard hot path that P0-2 addresses.

Documentation

README.md + README.zh-TW.md + README.ja.md all gain:
- Version line bumped to v0.9.0 with feature summary
- ## Known affected models (v0.9.0, 2026-04) — gemma-4-31b-8bit repeat-offender table + cross-community references (Hannecke, mlx#3186 comment 4314204974, mlx-lm#883, lmstudio-bug-tracker#1740)
- ## When MetalGuard is not enough — backend / family pivot guidance with concrete URLs
- SOP note on panic #10 (interactive import cascading into Metal MPS init during cooldown)
CHANGELOG.md — comprehensive v0.9.0 entry incorporating both B2 and the new features; explicit panic timeline #7–#11 with dates, PIDs, signatures, spawn→panic intervals; "When metal-guard is not enough" rationale; upstream references.

Test plan

Existing tests pass unchanged (pytest tests/ 166 → 166 ✓)
New tests/test_v090_cross_model_cadence.py passes (47 tests)
python -c "from metal_guard import CrossModelCadenceViolation, gemma4_generation_flush, check_known_panic_model, warn_if_known_panic_model, KNOWN_PANIC_MODELS" imports cleanly
metal_guard.__version__ == "0.9.0"
2 critic rounds clean, 0 P0 residual
Manual smoke on a real MLX workload — reviewer's environment permitting

Per-inference Metal flush for users doing subprocess isolation of MLX workers. Wraps every gen_fn(...) call with: PRE: mx.clear_cache() — release prior iter cached buffers body: (generate runs) POST: mx.synchronize() — block until GPU command buffer drained POST: mx.clear_cache() — release this iter's buffers + SUBPROC_PRE / SUBPROC_POST breadcrumbs for post-mortem forensics Parent-process PRE/POST hooks (mx.clear_cache / mx.eval) do not reach subprocess memory. Without this guard the GPU command buffer can be still in flight when the worker releases memory, drifting Metal accounting until the next load trips an IOGPUMemory.cpp:492 "completeMemory() prepare count underflow" kernel panic. synchronize() is chosen over mx.eval(result) because generate return types vary across backends (str for mlx-lm, GenerateResult for mlx-vlm). synchronize is return-type agnostic and semantically correct: wait for all pending GPU ops to complete. Design: - Pure addition, no breaking changes - Best-effort: no-ops if mlx.core cannot be imported - PRE failure does NOT block body (broken guard worse than no guard) - POST failure never masks a body exception (finally semantics) - log.warning on hook failure; log.debug on breadcrumb failure Tests: 9 new unit tests covering PRE/POST ordering, finally semantics on exception, graceful degradation without MLX, pre/post failure isolation, breadcrumb forensics, and model_id edge cases. Related upstream reports: - ml-explore/mlx-lm#1128 (prefill guard design) - ml-explore/mlx#3186 (subprocess isolation guidance) - ml-explore/mlx#3346 (kernel panic reproducers)

Release prep for v0.8.1 (subprocess_inference_guard public port). Adds: - .github/workflows/ci.yml with a 3-way matrix: * test-without-mlx: Linux + macOS across Python 3.11/3.12/3.13, no MLX installed. Gating signal — library must degrade gracefully when mlx.core is unimportable (contextmanager no-ops, all 166 tests pass). * test-with-mlx: macos-14 arm64 runner with [mlx] extra installed, Python 3.12. Allowed to fail on runner unavailability — Linux matrix is the gating signal. * lint: ruff syntax check (pyflakes subset E9/F63/F7/F82) advisory only during 0.8.x while legacy lint debt is cleared. - pyproject.toml: [test] optional dependency (pytest>=7.0) so CI has a canonical install command. - CHANGELOG.md: expanded v0.8.1 Fixed section with explicit upstream references (mlx#3186, mlx#3346, mlx-lm#883) and the driver-bug scope disclaimer ('metal-guard does not fix the driver bug — narrows the race window inside the worker so the IOGPU underflow is statistically unreachable on the workloads we've exercised'). No code changes to metal_guard.py or subprocess_inference_guard itself; that landed on 6091278. 166/166 tests still green on py3.14 locally. Not pushed — push + tag deferred until post-4/26 cooldown clears and private 50-round reproducer completes.

Consolidates panic #7-#11 findings from Harper's production timeline (2026-04-16 → 2026-04-24) into the open-source distribution. Ports three defences from Harper's internal fork and documents the first known-panic model repeat offender. Core additions: - subprocess_inference_guard(model_id) [B1] — per-inference Metal flush contextmanager for subprocess workers. Ended a 6-panic streak on Harper's MAGI pipeline when wired into the internal worker loop. - CadenceGuard cross_model_interval_sec [C5] — reject back-to-back loads of different models within a configurable window. Raises CrossModelCadenceViolation (subclass of CadenceViolation, so `except CadenceViolation` continues to catch both variants). Env var METALGUARD_CROSS_MODEL_INTERVAL for process-wide opt-in. - Gemma-4 90-second floor [C5+C7] — mlx-community/unsloth/mlx-models gemma-4-* models always enforce >= 90s cross-model cadence, even when the caller set the base to 0. 8/8 panics in Harper's timeline were in the gemma-4 family; panic #6 landed 66s after prior unload. - gemma4_generation_flush(model_id, generate_call_count) [C7] — first-generate settle window (mx.synchronize + mx.clear_cache + time.sleep(3.0)) before the first forward pass on a gemma-4 worker. Renamed from internal `gemma4_firstgen_guard` — "guard" was misleading; this is a flush, not a block. Env overrides: METALGUARD_GEMMA4_FIRSTGEN_DISABLED=1, METALGUARD_GEMMA4_FIRSTGEN_SLEEP_SEC=<seconds>. - KNOWN_PANIC_MODELS advisory registry + check_known_panic_model() + warn_if_known_panic_model() (idempotent, one warn per process per model). Ships with one entry: mlx-community/gemma-4-31b-it-8bit, which kernel-panicked twice on the same pipeline 24 hours apart. Critical operational guidance documented in README + CHANGELOG: When MetalGuard defences are fully engaged and a model still kernel-panics in production, switch backend (Ollama / llama.cpp — persistent worker architecture bypasses the teardown race) or pivot to an MoE variant (much smaller active-parameter footprint per forward, narrower KV growth trajectory). This release is explicit about that limit because community data (Hannecke, ronm92130 on mlx#3186, lmstudio#1740) converges on the same conclusion. Community citations (all 2026-04): - Hannecke "MLX Crashed My Mac" (Medium) — M4 Max 64GB, same signature, pivoted to Qwen3-Coder-30B-A3B MoE. - ronm92130 on ml-explore/mlx#3186 (2026-04-24) — Mac mini M4 base 32GB, macOS 26.4.1, Qwen3.6-35B-A3B, same panic, pivoted to llama.cpp. Explicitly references this project's two-trigger-path hypothesis. - lmstudio-ai/lmstudio-bug-tracker#1740 — corroborates the hybrid attention KV explosion in the gemma-4 family. API compatibility: - All additions are backwards-compatible. CadenceGuard() unchanged default behaviour; require_cadence_clear() unchanged default behaviour (P0 fix from v0.9.0-rc review: cross-model cadence in require_cadence_clear() defaults to 0.0 disabled, NOT 60s). - CadenceGuard.check() now reads the JSON store directly under _CADENCE_FILE_LOCK instead of calling self.last_ts(); subclasses that override last_ts() for custom read paths should note that check() no longer routes through that override. Tests: 213 passed (166 pre-existing + 47 new in test_v090_cross_model_cadence.py). 2 critic rounds (R1: 3 P0 + 6 P1 found and fixed; R2: 1 P1 test-gap found and fixed, 0 P0 → GO). README available in English, 繁體中文, 日本語.

Harper added 3 commits April 23, 2026 11:45

Harperbot merged commit 05387d2 into main Apr 24, 2026
1 of 8 checks passed

Harperbot deleted the v0.9.0-panic-11-hardening branch April 24, 2026 17:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.9.0 — cross-model cadence + gemma-4 floor + KNOWN_PANIC_MODELS#2

v0.9.0 — cross-model cadence + gemma-4 floor + KNOWN_PANIC_MODELS#2
Harperbot merged 3 commits intomainfrom
v0.9.0-panic-11-hardening

Harperbot commented Apr 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Harperbot commented Apr 24, 2026

Summary

When metal-guard is not enough

API compatibility

Tests

Documentation

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant