Skip to content

v0.9.0 — cross-model cadence + gemma-4 floor + KNOWN_PANIC_MODELS#2

Merged
Harperbot merged 3 commits intomainfrom
v0.9.0-panic-11-hardening
Apr 24, 2026
Merged

v0.9.0 — cross-model cadence + gemma-4 floor + KNOWN_PANIC_MODELS#2
Harperbot merged 3 commits intomainfrom
v0.9.0-panic-11-hardening

Conversation

@Harperbot
Copy link
Copy Markdown
Owner

Summary

Consolidates panic #7–#11 findings from Harper's production timeline (2026-04-16 → 2026-04-24) into the open-source distribution. Ports three defences from the internal fork and documents the first known-panic model repeat offender: mlx-community/gemma-4-31b-it-8bit.

  • B1 subprocess_inference_guard(model_id) — per-inference Metal flush contextmanager for subprocess workers (ended a 6-panic streak on the internal MAGI pipeline)
  • C5 cross-model cadenceCadenceGuard(cross_model_interval_sec=…) + CrossModelCadenceViolation (subclass of CadenceViolation) + 90-second floor for the gemma-4 family, enforced even when base is 0
  • C7 gemma4_generation_flush(model_id, generate_call_count) — first-generate settle window; renamed from internal gemma4_firstgen_guard ("guard" was misleading — this is a flush, not a block)
  • KNOWN_PANIC_MODELS advisory registrycheck_known_panic_model() + warn_if_known_panic_model() (idempotent, one warning per process per model)
  • Version: pyproject.toml + __version__ both bumped to 0.9.0

This PR supersedes the b2-subprocess-inference-guard branch by bundling B2 (which was never released) with the new v0.9.0 additions into a single release.

When metal-guard is not enough

A load-bearing new section in the README and CHANGELOG: when every v0.9.0 defence is engaged and a model still kernel-panics in production, the right operational answer is to switch backend (Ollama / llama.cpp) or pivot to an MoE variant. Community data converges on this: Hannecke (M4 Max 64GB) pivoted to Qwen3-Coder-30B-A3B MoE; ronm92130 on mlx#3186 (2026-04-24) pivoted to llama.cpp on M4 base 32GB, explicitly referencing this project's two-trigger-path hypothesis. Harper's own harper-finance project migrated to Ollama on 2026-04-23 and has run zero-panic since.

mlx-community/gemma-4-31b-it-8bit kernel-panicked twice on the same Harper production pipeline, 24 hours apart, same signature (IOGPUMemory.cpp:492 prepare_count_underflow). That is the evidence driving the honest limitations note in this release.

API compatibility

  • All additions are backwards-compatible.
  • CadenceGuard() default constructor behaviour unchanged (new cross_model_interval_sec kw-only, defaults to 0.0).
  • require_cadence_clear() default behaviour unchanged (new cross_model_interval_sec defaults to 0.0; env var METALGUARD_CROSS_MODEL_INTERVAL provides opt-in without code changes).
  • CadenceGuard.check() now reads the JSON store directly under _CADENCE_FILE_LOCK rather than routing through self.last_ts(). Subclasses that override last_ts() for custom read paths should note that check() no longer invokes the override — this is documented in the v0.9.0 docstring.

Tests

  • 213 passed (166 pre-existing + 47 new in tests/test_v090_cross_model_cadence.py)
  • 2 critic review rounds:
    • R1: 3 P0 + 6 P1 + 4 P2 found → all P0/P1 fixed
    • R2: 1 P1 (require_cadence_clear hot-path test gap) + 2 P2 found → P1 fixed, 0 P0 → GO

New tests cover: _is_gemma4_family (20 parametrised cases inc. real sizes 1b/4b/12b/27b/31b, MoE variants, negatives like gemma-4b-it, google/gemma-4-31b, gemma-5-*), _resolve_cross_model_interval (default / env / explicit / invalid-fallback / negative-clamp / zero-disable), CrossModelCadenceViolation inheritance, CadenceGuard cross-model check including same-model priority and zero-still-floors-gemma4, gemma4_generation_flush env kill-switch + sleep override + non-gemma no-op, KNOWN_PANIC_MODELS advisory shape + warn_if_known_panic_model idempotence via monkeypatch.setattr, and the require_cadence_clear no-guard hot path that P0-2 addresses.

Documentation

  • README.md + README.zh-TW.md + README.ja.md all gain:
    • Version line bumped to v0.9.0 with feature summary
    • ## Known affected models (v0.9.0, 2026-04) — gemma-4-31b-8bit repeat-offender table + cross-community references (Hannecke, mlx#3186 comment 4314204974, mlx-lm#883, lmstudio-bug-tracker#1740)
    • ## When MetalGuard is not enough — backend / family pivot guidance with concrete URLs
    • SOP note on panic #10 (interactive import cascading into Metal MPS init during cooldown)
  • CHANGELOG.md — comprehensive v0.9.0 entry incorporating both B2 and the new features; explicit panic timeline #7–#11 with dates, PIDs, signatures, spawn→panic intervals; "When metal-guard is not enough" rationale; upstream references.

Test plan

  • Existing tests pass unchanged (pytest tests/ 166 → 166 ✓)
  • New tests/test_v090_cross_model_cadence.py passes (47 tests)
  • python -c "from metal_guard import CrossModelCadenceViolation, gemma4_generation_flush, check_known_panic_model, warn_if_known_panic_model, KNOWN_PANIC_MODELS" imports cleanly
  • metal_guard.__version__ == "0.9.0"
  • 2 critic rounds clean, 0 P0 residual
  • Manual smoke on a real MLX workload — reviewer's environment permitting

Harper added 3 commits April 23, 2026 11:45
Per-inference Metal flush for users doing subprocess isolation of MLX
workers. Wraps every gen_fn(...) call with:

  PRE:  mx.clear_cache()    — release prior iter cached buffers
  body: (generate runs)
  POST: mx.synchronize()    — block until GPU command buffer drained
  POST: mx.clear_cache()    — release this iter's buffers
  +     SUBPROC_PRE / SUBPROC_POST breadcrumbs for post-mortem forensics

Parent-process PRE/POST hooks (mx.clear_cache / mx.eval) do not reach
subprocess memory. Without this guard the GPU command buffer can be
still in flight when the worker releases memory, drifting Metal
accounting until the next load trips an IOGPUMemory.cpp:492
"completeMemory() prepare count underflow" kernel panic.

synchronize() is chosen over mx.eval(result) because generate return
types vary across backends (str for mlx-lm, GenerateResult for
mlx-vlm). synchronize is return-type agnostic and semantically
correct: wait for all pending GPU ops to complete.

Design:
- Pure addition, no breaking changes
- Best-effort: no-ops if mlx.core cannot be imported
- PRE failure does NOT block body (broken guard worse than no guard)
- POST failure never masks a body exception (finally semantics)
- log.warning on hook failure; log.debug on breadcrumb failure

Tests: 9 new unit tests covering PRE/POST ordering, finally semantics
on exception, graceful degradation without MLX, pre/post failure
isolation, breadcrumb forensics, and model_id edge cases.

Related upstream reports:
- ml-explore/mlx-lm#1128 (prefill guard design)
- ml-explore/mlx#3186   (subprocess isolation guidance)
- ml-explore/mlx#3346   (kernel panic reproducers)
Release prep for v0.8.1 (subprocess_inference_guard public port).

Adds:
- .github/workflows/ci.yml with a 3-way matrix:
  * test-without-mlx: Linux + macOS across Python 3.11/3.12/3.13, no
    MLX installed. Gating signal — library must degrade gracefully when
    mlx.core is unimportable (contextmanager no-ops, all 166 tests pass).
  * test-with-mlx: macos-14 arm64 runner with [mlx] extra installed,
    Python 3.12. Allowed to fail on runner unavailability — Linux matrix
    is the gating signal.
  * lint: ruff syntax check (pyflakes subset E9/F63/F7/F82) advisory
    only during 0.8.x while legacy lint debt is cleared.

- pyproject.toml: [test] optional dependency (pytest>=7.0) so CI has
  a canonical install command.

- CHANGELOG.md: expanded v0.8.1 Fixed section with explicit upstream
  references (mlx#3186, mlx#3346, mlx-lm#883) and the driver-bug scope
  disclaimer ('metal-guard does not fix the driver bug — narrows the
  race window inside the worker so the IOGPU underflow is statistically
  unreachable on the workloads we've exercised').

No code changes to metal_guard.py or subprocess_inference_guard itself;
that landed on 6091278. 166/166 tests still green on py3.14 locally.

Not pushed — push + tag deferred until post-4/26 cooldown clears and
private 50-round reproducer completes.
Consolidates panic #7-#11 findings from Harper's production timeline
(2026-04-16 → 2026-04-24) into the open-source distribution. Ports
three defences from Harper's internal fork and documents the first
known-panic model repeat offender.

Core additions:

- subprocess_inference_guard(model_id) [B1] — per-inference Metal
  flush contextmanager for subprocess workers. Ended a 6-panic streak
  on Harper's MAGI pipeline when wired into the internal worker loop.

- CadenceGuard cross_model_interval_sec [C5] — reject back-to-back
  loads of different models within a configurable window. Raises
  CrossModelCadenceViolation (subclass of CadenceViolation, so
  `except CadenceViolation` continues to catch both variants). Env
  var METALGUARD_CROSS_MODEL_INTERVAL for process-wide opt-in.

- Gemma-4 90-second floor [C5+C7] — mlx-community/unsloth/mlx-models
  gemma-4-* models always enforce >= 90s cross-model cadence, even
  when the caller set the base to 0. 8/8 panics in Harper's timeline
  were in the gemma-4 family; panic #6 landed 66s after prior unload.

- gemma4_generation_flush(model_id, generate_call_count) [C7] —
  first-generate settle window (mx.synchronize + mx.clear_cache +
  time.sleep(3.0)) before the first forward pass on a gemma-4 worker.
  Renamed from internal `gemma4_firstgen_guard` — "guard" was
  misleading; this is a flush, not a block. Env overrides:
  METALGUARD_GEMMA4_FIRSTGEN_DISABLED=1,
  METALGUARD_GEMMA4_FIRSTGEN_SLEEP_SEC=<seconds>.

- KNOWN_PANIC_MODELS advisory registry + check_known_panic_model() +
  warn_if_known_panic_model() (idempotent, one warn per process per
  model). Ships with one entry: mlx-community/gemma-4-31b-it-8bit,
  which kernel-panicked twice on the same pipeline 24 hours apart.

Critical operational guidance documented in README + CHANGELOG:

When MetalGuard defences are fully engaged and a model still
kernel-panics in production, switch backend (Ollama / llama.cpp —
persistent worker architecture bypasses the teardown race) or pivot
to an MoE variant (much smaller active-parameter footprint per
forward, narrower KV growth trajectory). This release is explicit
about that limit because community data (Hannecke, ronm92130 on
mlx#3186, lmstudio#1740) converges on the same conclusion.

Community citations (all 2026-04):

- Hannecke "MLX Crashed My Mac" (Medium) — M4 Max 64GB, same
  signature, pivoted to Qwen3-Coder-30B-A3B MoE.
- ronm92130 on ml-explore/mlx#3186 (2026-04-24) — Mac mini M4 base
  32GB, macOS 26.4.1, Qwen3.6-35B-A3B, same panic, pivoted to
  llama.cpp. Explicitly references this project's two-trigger-path
  hypothesis.
- lmstudio-ai/lmstudio-bug-tracker#1740 — corroborates the hybrid
  attention KV explosion in the gemma-4 family.

API compatibility:

- All additions are backwards-compatible. CadenceGuard() unchanged
  default behaviour; require_cadence_clear() unchanged default
  behaviour (P0 fix from v0.9.0-rc review: cross-model cadence in
  require_cadence_clear() defaults to 0.0 disabled, NOT 60s).
- CadenceGuard.check() now reads the JSON store directly under
  _CADENCE_FILE_LOCK instead of calling self.last_ts(); subclasses
  that override last_ts() for custom read paths should note that
  check() no longer routes through that override.

Tests: 213 passed (166 pre-existing + 47 new in
test_v090_cross_model_cadence.py). 2 critic rounds (R1: 3 P0 + 6 P1
found and fixed; R2: 1 P1 test-gap found and fixed, 0 P0 → GO).

README available in English, 繁體中文, 日本語.
@Harperbot Harperbot merged commit 05387d2 into main Apr 24, 2026
1 of 8 checks passed
@Harperbot Harperbot deleted the v0.9.0-panic-11-hardening branch April 24, 2026 17:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant