v0.9.0 — cross-model cadence + gemma-4 floor + KNOWN_PANIC_MODELS#2
Merged
Conversation
added 3 commits on April 23, 2026 11:45
Per-inference Metal flush for users doing subprocess isolation of MLX workers.

Wraps every gen_fn(...) call with:
- PRE: mx.clear_cache() — release the prior iteration's cached buffers
- body: (generate runs)
- POST: mx.synchronize() — block until the GPU command buffer is drained
- POST: mx.clear_cache() — release this iteration's buffers
plus SUBPROC_PRE / SUBPROC_POST breadcrumbs for post-mortem forensics.

Parent-process PRE/POST hooks (mx.clear_cache / mx.eval) do not reach subprocess memory. Without this guard the GPU command buffer can still be in flight when the worker releases memory, drifting Metal accounting until the next load trips an IOGPUMemory.cpp:492 "completeMemory() prepare count underflow" kernel panic.

synchronize() is chosen over mx.eval(result) because generate return types vary across backends (str for mlx-lm, GenerateResult for mlx-vlm). synchronize() is return-type agnostic and semantically correct: wait for all pending GPU ops to complete.

Design:
- Pure addition, no breaking changes
- Best-effort: no-ops if mlx.core cannot be imported
- PRE failure does NOT block the body (a broken guard is worse than no guard)
- POST failure never masks a body exception (finally semantics)
- log.warning on hook failure; log.debug on breadcrumb failure

Tests: 9 new unit tests covering PRE/POST ordering, finally semantics on exception, graceful degradation without MLX, PRE/POST failure isolation, breadcrumb forensics, and model_id edge cases.

Related upstream reports:
- ml-explore/mlx-lm#1128 (prefill guard design)
- ml-explore/mlx#3186 (subprocess isolation guidance)
- ml-explore/mlx#3346 (kernel panic reproducers)
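The PRE / body / POST shape described above can be sketched as a contextmanager. This is a minimal sketch built only from the names in this commit message (the real implementation also emits the SUBPROC_PRE / SUBPROC_POST breadcrumbs, omitted here):

```python
from contextlib import contextmanager
import logging

log = logging.getLogger("metal_guard")


@contextmanager
def subprocess_inference_guard(model_id: str):
    """Per-inference Metal flush for a subprocess MLX worker (sketch)."""
    try:
        import mlx.core as mx  # best-effort: degrade to a no-op when MLX is absent
    except ImportError:
        mx = None

    if mx is not None:
        try:
            mx.clear_cache()  # PRE: release the prior iteration's cached buffers
        except Exception:
            # PRE failure must not block the body: a broken guard is worse than no guard
            log.warning("PRE flush failed for %s; continuing", model_id, exc_info=True)
    try:
        yield  # body: gen_fn(...) runs here
    finally:
        # finally semantics: POST runs even when the body raised, and a POST
        # failure never masks the body's exception
        if mx is not None:
            try:
                mx.synchronize()  # POST: wait for all pending GPU ops to complete
                mx.clear_cache()  # POST: release this iteration's buffers
            except Exception:
                log.warning("POST flush failed for %s", model_id, exc_info=True)
```

synchronize() rather than mx.eval(result) keeps the guard return-type agnostic, which is why the body is a bare yield instead of yielding the generation result.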
Release prep for v0.8.1 (subprocess_inference_guard public port).
Adds:
- .github/workflows/ci.yml with a 3-way matrix:
* test-without-mlx: Linux + macOS across Python 3.11/3.12/3.13, no
MLX installed. Gating signal — library must degrade gracefully when
mlx.core is unimportable (contextmanager no-ops, all 166 tests pass).
* test-with-mlx: macos-14 arm64 runner with [mlx] extra installed,
Python 3.12. Allowed to fail on runner unavailability — Linux matrix
is the gating signal.
* lint: ruff syntax check (pyflakes subset E9/F63/F7/F82) advisory
only during 0.8.x while legacy lint debt is cleared.
- pyproject.toml: [test] optional dependency (pytest>=7.0) so CI has
a canonical install command.
- CHANGELOG.md: expanded v0.8.1 Fixed section with explicit upstream
references (mlx#3186, mlx#3346, mlx-lm#883) and the driver-bug scope
disclaimer ('metal-guard does not fix the driver bug — narrows the
race window inside the worker so the IOGPU underflow is statistically
unreachable on the workloads we've exercised').
No code changes to metal_guard.py or subprocess_inference_guard itself;
that landed on 6091278. 166/166 tests still green on py3.14 locally.
Not pushed — push + tag deferred until post-4/26 cooldown clears and
private 50-round reproducer completes.
Consolidates panic #7–#11 findings from Harper's production timeline (2026-04-16 → 2026-04-24) into the open-source distribution. Ports three defences from Harper's internal fork and documents the first known-panic model repeat offender.

Core additions:
- subprocess_inference_guard(model_id) [B1] — per-inference Metal flush contextmanager for subprocess workers. Ended a 6-panic streak on Harper's MAGI pipeline when wired into the internal worker loop.
- CadenceGuard cross_model_interval_sec [C5] — reject back-to-back loads of different models within a configurable window. Raises CrossModelCadenceViolation (subclass of CadenceViolation, so `except CadenceViolation` continues to catch both variants). Env var METALGUARD_CROSS_MODEL_INTERVAL for process-wide opt-in.
- Gemma-4 90-second floor [C5+C7] — mlx-community/unsloth/mlx-models gemma-4-* models always enforce >= 90s cross-model cadence, even when the caller set the base to 0. 8/8 panics in Harper's timeline were in the gemma-4 family; panic #6 landed 66s after the prior unload.
- gemma4_generation_flush(model_id, generate_call_count) [C7] — first-generate settle window (mx.synchronize + mx.clear_cache + time.sleep(3.0)) before the first forward pass on a gemma-4 worker. Renamed from internal gemma4_firstgen_guard — "guard" was misleading; this is a flush, not a block. Env overrides: METALGUARD_GEMMA4_FIRSTGEN_DISABLED=1, METALGUARD_GEMMA4_FIRSTGEN_SLEEP_SEC=<seconds>.
- KNOWN_PANIC_MODELS advisory registry + check_known_panic_model() + warn_if_known_panic_model() (idempotent, one warning per process per model). Ships with one entry: mlx-community/gemma-4-31b-it-8bit, which kernel-panicked twice on the same pipeline 24 hours apart.
Critical operational guidance documented in README + CHANGELOG: when MetalGuard defences are fully engaged and a model still kernel-panics in production, switch backend (Ollama / llama.cpp — their persistent worker architecture bypasses the teardown race) or pivot to an MoE variant (much smaller active-parameter footprint per forward pass, narrower KV growth trajectory). This release is explicit about that limit because community data (Hannecke, ronm92130 on mlx#3186, lmstudio#1740) converges on the same conclusion.

Community citations (all 2026-04):
- Hannecke, "MLX Crashed My Mac" (Medium) — M4 Max 64GB, same signature, pivoted to Qwen3-Coder-30B-A3B MoE.
- ronm92130 on ml-explore/mlx#3186 (2026-04-24) — Mac mini M4 base 32GB, macOS 26.4.1, Qwen3.6-35B-A3B, same panic, pivoted to llama.cpp. Explicitly references this project's two-trigger-path hypothesis.
- lmstudio-ai/lmstudio-bug-tracker#1740 — corroborates the hybrid-attention KV explosion in the gemma-4 family.

API compatibility:
- All additions are backwards-compatible. CadenceGuard() default behaviour unchanged; require_cadence_clear() default behaviour unchanged (P0 fix from the v0.9.0-rc review: cross-model cadence in require_cadence_clear() defaults to 0.0, i.e. disabled, NOT 60s).
- CadenceGuard.check() now reads the JSON store directly under _CADENCE_FILE_LOCK instead of calling self.last_ts(); subclasses that override last_ts() for custom read paths should note that check() no longer routes through that override.

Tests: 213 passed (166 pre-existing + 47 new in test_v090_cross_model_cadence.py). 2 critic rounds (R1: 3 P0 + 6 P1 found and fixed; R2: 1 P1 test gap found and fixed, 0 P0 → GO).

README available in English, Traditional Chinese (繁體中文), and Japanese (日本語).
Summary
Consolidates panic #7–#11 findings from Harper's production timeline (2026-04-16 → 2026-04-24) into the open-source distribution. Ports three defences from the internal fork and documents the first known-panic model repeat offender: `mlx-community/gemma-4-31b-it-8bit`.

- `subprocess_inference_guard(model_id)` — per-inference Metal flush contextmanager for subprocess workers (ended a 6-panic streak on the internal MAGI pipeline)
- `CadenceGuard(cross_model_interval_sec=…)` + `CrossModelCadenceViolation` (subclass of `CadenceViolation`) + 90-second floor for the gemma-4 family, enforced even when the base is 0
- `gemma4_generation_flush(model_id, generate_call_count)` — first-generate settle window; renamed from internal `gemma4_firstgen_guard` ("guard" was misleading — this is a flush, not a block)
- `KNOWN_PANIC_MODELS` advisory registry — `check_known_panic_model()` + `warn_if_known_panic_model()` (idempotent, one warning per process per model)
- `pyproject.toml` + `__version__` both bumped to `0.9.0`

This PR supersedes the `b2-subprocess-inference-guard` branch by bundling B2 (which was never released) with the new v0.9.0 additions into a single release.

When metal-guard is not enough
A load-bearing new section in the README and CHANGELOG: when every v0.9.0 defence is engaged and a model still kernel-panics in production, the right operational answer is to switch backend (Ollama / llama.cpp) or pivot to an MoE variant. Community data converges on this: Hannecke (M4 Max 64GB) pivoted to Qwen3-Coder-30B-A3B MoE; ronm92130 on mlx#3186 (2026-04-24) pivoted to llama.cpp on an M4 base 32GB, explicitly referencing this project's two-trigger-path hypothesis. Harper's own `harper-finance` project migrated to Ollama on 2026-04-23 and has run zero-panic since.

`mlx-community/gemma-4-31b-it-8bit` kernel-panicked twice on the same Harper production pipeline, 24 hours apart, same signature (IOGPUMemory.cpp:492 prepare_count_underflow). That is the evidence driving the honest limitations note in this release.

API compatibility
- `CadenceGuard()` default constructor behaviour unchanged (new `cross_model_interval_sec` is kw-only, defaults to `0.0`).
- `require_cadence_clear()` default behaviour unchanged (new `cross_model_interval_sec` defaults to `0.0`; env var `METALGUARD_CROSS_MODEL_INTERVAL` provides opt-in without code changes).
- `CadenceGuard.check()` now reads the JSON store directly under `_CADENCE_FILE_LOCK` rather than routing through `self.last_ts()`. Subclasses that override `last_ts()` for custom read paths should note that `check()` no longer invokes the override — this is documented in the v0.9.0 docstring.

Tests
213 passed (166 pre-existing + 47 new in `tests/test_v090_cross_model_cadence.py`). New tests cover:
- `_is_gemma4_family` (20 parametrised cases inc. real sizes 1b/4b/12b/27b/31b, MoE variants, negatives like `gemma-4b-it`, `google/gemma-4-31b`, `gemma-5-*`)
- `_resolve_cross_model_interval` (default / env / explicit / invalid-fallback / negative-clamp / zero-disable)
- `CrossModelCadenceViolation` inheritance
- `CadenceGuard` cross-model check, including same-model priority and zero-still-floors-gemma4
- `gemma4_generation_flush` env kill-switch + sleep override + non-gemma no-op
- `KNOWN_PANIC_MODELS` advisory shape + `warn_if_known_panic_model` idempotence via `monkeypatch.setattr`
- the `require_cadence_clear` no-guard hot path that P0-2 addresses

Documentation
`README.md` + `README.zh-TW.md` + `README.ja.md` all gain:

- `## Known affected models (v0.9.0, 2026-04)` — gemma-4-31b-8bit repeat-offender table + cross-community references (Hannecke, mlx#3186 comment 4314204974, mlx-lm#883, lmstudio-bug-tracker#1740)
- `## When MetalGuard is not enough` — backend / family pivot guidance with concrete URLs
- (`import` cascading into Metal MPS init during cooldown)

`CHANGELOG.md` — comprehensive v0.9.0 entry incorporating both B2 and the new features; explicit panic timeline #7–#11 with dates, PIDs, signatures, spawn→panic intervals; "When metal-guard is not enough" rationale; upstream references.

Test plan
- `pytest tests/` (166 → 166 ✓)
- `tests/test_v090_cross_model_cadence.py` passes (47 tests)
- `python -c "from metal_guard import CrossModelCadenceViolation, gemma4_generation_flush, check_known_panic_model, warn_if_known_panic_model, KNOWN_PANIC_MODELS"` imports cleanly
- `metal_guard.__version__ == "0.9.0"`