fix(preflight): reject JANGTQ Mistral 3 / Laguna before vmlx loads #993
…ore vmlx loads
Mistral 3 family (mistral3 / ministral3) and Laguna model classes in
vmlx-swift-lm currently use vanilla MLXNN.Linear. JANGTQ-quantized
bundles for those families ship `.tq_packed` + `.scales` tensors that
the vanilla Linear can't consume — without a JANGTQ-aware shim
(NemotronHJANGTQModel / MiniMaxJANGTQModel pattern), loading would
either crash on weight-shape mismatch or silently load codebook bytes
as raw weights and emit garbage.
Two-layer defense added:
1. Vmlx (paired commit d32e135 on vmlx-swift-lm main): both
LLMModelFactory and VLMModelFactory `mistral3` dispatch closures
now peek `weight_format` BEFORE falling through to vanilla
Mistral3TextModel / Mistral3VLM, throwing a clear
"JANGTQ-quantized Mistral 3 family bundles are not yet supported"
error pointing at MXFP4 as the working alternative.
2. Osaurus host preflight (this commit): extends
`validateJANGTQSidecarIfRequired` with a third check covering
pending-JANGTQ families. Fires only when ALL of:
- jang_config.json exists
- weight_format == "mxtq"
- config.json model_type (or text_config.model_type for VLM
wrappers) ∈ {mistral3, ministral3, laguna}
User gets a friendly remediation message at the host layer
before any vmlx loader runs.
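For reference, a minimal sketch of that third check, assuming the JSON shapes described above. The `pendingJANGTQFamilies` name and the code-4 error come from this PR's text; the function shape and error message are illustrative, not the actual implementation:

```swift
import Foundation

// Families whose JANGTQ shims have not landed in vmlx yet.
let pendingJANGTQFamilies: Set<String> = ["mistral3", "ministral3", "laguna"]

func checkPendingJANGTQFamily(bundleDir: URL) throws {
    let jangConfig = bundleDir.appendingPathComponent("jang_config.json")
    let config = bundleDir.appendingPathComponent("config.json")
    guard FileManager.default.fileExists(atPath: jangConfig.path),    // check 1
          let jang = try? JSONSerialization.jsonObject(
              with: Data(contentsOf: jangConfig)) as? [String: Any],
          (jang["weight_format"] as? String) == "mxtq",               // check 2
          let conf = try? JSONSerialization.jsonObject(
              with: Data(contentsOf: config)) as? [String: Any]
    else { return }

    // Check 3: outer model_type, or text_config.model_type for VLM wrappers.
    let modelType = (conf["model_type"] as? String)
        ?? ((conf["text_config"] as? [String: Any])?["model_type"] as? String)

    if let modelType, pendingJANGTQFamilies.contains(modelType) {
        throw NSError(domain: "ModelRuntime", code: 4, userInfo: [
            NSLocalizedDescriptionKey:
                "JANGTQ-quantized \(modelType) bundles are not yet supported; " +
                "use the MXFP4 quant tier of this model instead."
        ])
    }
}
```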
The MXFP4 quant tier of these same models loads correctly via the
standard MLX dequant path (layout matches mx.quantize), so the error
points users at MXFP4 as the working alternative. Once the JANGTQ
shim ports land in vmlx, simply remove the family from the host's
`pendingJANGTQFamilies` set — no other code change required.
Pin bump: vmlx-swift-lm a196800 → d32e135 carries the engine-side
fail-fast guard.
Test coverage: 9 new MC/DC-shaped tests in
Tests/Service/ValidateJANGTQUnsupportedFamilyTests.swift covering:
- D1/D2 existing branches still fire (no jang_config; non-mxtq)
- D3 outer mistral3 throws code 4 with MXFP4-pointer message
- D3 inner ministral3 (text_config) throws
- D3 laguna throws with laguna-named message
- Boundary: nemotron_h / qwen3_5_moe / minimax_m2 JANGTQ all PASS
(shims exist; new check must NOT trigger)
- Boundary: Mistral 3 family MXFP4 PASSES (only mxtq fires the gate)
OsaurusCore: 1256 / 1256 tests in 167 suites (was 1247 / 166; +9 +1).
…primitive)

Pulls in vmlx@9df5c80 — new JANGTQDenseLinear primitive
(Libraries/MLXLMCommon/JANGTQDenseLinear.swift) that's the foundation for
porting JANGTQ-quantized Mistral 3 / Mistral 3.5 / Mistral 4 / Laguna
bundles.

The existing TurboQuantSwitchLinear is MoE-shaped (n_experts dim, gather
indices); the Mistral 3 family quantizes the entire DENSE text decoder per
the mxtq_bits.text_decoder=2|4 profile — needs a different shim shape.
JANGTQDenseLinear matches the Python converter's tq_quantize_weight output
(2D shapes, no expert dim) and reuses the existing JANGTQKernels.gatherTQ
kernel via singleton-expert degeneration.

7 structural tests covering construction contracts + Mistral 3.5
attention/MLP shape examples + parameter-key pathing + bit-width
packed_cols arithmetic. End-to-end forward verification still gated on a
real Mistral 3 family JANGTQ bundle on disk — the host-side preflight in
this PR continues to fire until that verification phase completes
(defense in depth).

OsaurusCore: 1256 / 1256 against new pin. No behavior change for
non-mxtq Mistral 3 paths.
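One of those structural tests pins the packed-column arithmetic. A back-of-envelope sketch of that rounding, assuming byte-level packing of the quantized codes along the input dimension (the exact layout belongs to the Python converter):

```swift
// Illustrative only: `bits`-wide codes packed along the input dim,
// rounded up to whole bytes.
func packedCols(inputDim: Int, bits: Int) -> Int {
    (inputDim * bits + 7) / 8
}

// e.g. mxtq_bits.text_decoder = 2: four 2-bit codes per byte.
assert(packedCols(inputDim: 12288, bits: 2) == 3072)
assert(packedCols(inputDim: 12288, bits: 4) == 6144)
```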
Pulls in vmlx@cb829b6 — Mistral 3 family LLM JANGTQ port complete.
Vmlx-side (cb829b6):
- Mistral3TextJANGTQModel: parallel architecture to Mistral3TextModel
with JANGTQDenseLinear for attention Q/K/V/O + MLP gate/up/down
- LLMModelFactory mistral3 closure peeks weight_format and routes
mxtq → Mistral3TextJANGTQModel(config, bits, seed) reading
mxtq_bits / mxtq_seed from merged jang_config.json
- VLMModelFactory mistral3 closure: updated message reflects LLM port
complete, VLM port still in-flight (Mistral3VLM with Pixtral
needs paired JANGTQ inner LM)
- 6 dispatch tests + 7 structural JANGTQDenseLinear tests upstream
Host-side preflight relaxation:
- Removes blanket gate on `mistral3` / `ministral3` model_types
- LLM-only Mistral 3 / 3.5 JANGTQ (no vision_config in config.json)
now flows through to vmlx's Mistral3TextJANGTQModel — preflight
PASSES, no error
- VLM-shaped Mistral 3 family (vision_config present, e.g.
Mistral-Medium-3.5-128B with Pixtral) STILL fires gate with a
VLM-specific message until upstream VLM port lands
- Laguna gate stays unchanged — separate engine port pending
Test coverage updated:
- ValidateJANGTQUnsupportedFamilyTests now has 11 cases (was 9):
* D3.mistral3 outer LLM-only PASSES (new)
* D3.mistral3 + vision_config STILL throws with VLM-specific msg (new)
* D3.ministral3 inner LLM-only PASSES (new)
* D3.ministral3 inner + vision_config throws (was throws unconditionally)
* D3.laguna unchanged — still throws
* Boundary: nemotron_h / qwen3_5_moe / minimax_m2 unchanged
OsaurusCore: 1258 / 1258 (was 1256 + 2 new VLM-vs-LLM split tests).
Pulls in vmlx@7fa4940 — Mistral 3 family VLM JANGTQ port complete.
Vmlx-side (7fa4940):
- Mistral3VLMJANGTQ: full JANGTQ inner LM (Mistral3JANGTQAttention/
MLP/TransformerBlock/ModelInner/LanguageModel) + outer wrapper
matching Mistral3VLM's contract. Pixtral vision tower stays
vanilla per mxtq_bits.vision_tower=passthrough_fp16.
- VLMModelFactory mistral3 closure: weight_format=mxtq → routes to
Mistral3VLMJANGTQ(config, bits, seed) with bits + seed read from
config.mxtqBits / config.mxtqSeed.
- ToolCallFormat: laguna → .glm4, ministral3 → .mistral.
- 10 new coverage tests (Mistral3LagunaCoverageTests).
Host preflight changes:
- Drops the Mistral 3 family gate entirely. Both LLM-only AND
VLM-shaped bundles flow through to vmlx now (no_vision →
Mistral3TextJANGTQModel; vision_config present →
Mistral3VLMJANGTQ).
- Laguna gate stays — vmlx Laguna model class is the next port.
- Tests updated: VLM Mistral 3 cases now PASS (used to throw).
OsaurusCore: 1258 / 1258 tests in 167 suites against new pin.
…o 344dda0
vmlx@344dda0 ships LagunaModel — full Poolside Laguna engine class
(40 hybrid layers, per-layer head count, dual RoPE, q_norm/k_norm,
sigmoid+correction-bias routing over 256 routed experts top-8 + shared,
per-layer mixed RotatingKVCache+KVCacheSimple). MXFP4 bundles load via
standard MLX dequant route immediately.
Host preflight third-check fully retired:
- Mistral 3 family LLM (vmlx@cb829b6) ✓
- Mistral 3 family VLM with Pixtral (vmlx@7fa4940) ✓
- Laguna LLM (vmlx@344dda0) ✓
- JANGTQ Linear shim for Laguna is the next incremental piece
(LagunaJANGTQModel paralleling Mistral3TextJANGTQModel) — but no
longer host-side gated; mislabeled bundles get caught by the
existing forward/inverse sidecar checks above.
Tests updated:
- D3.laguna case now PASSES (was throws)
- All other boundary checks unchanged
OsaurusCore: 1258 / 1258 tests in 167 suites against new pin.
Pulls in vmlx@fe1b754 — exhaustive audit pass:
- Mistral3TextJANGTQModel adds sanitize() override (drops
self_attn.rotary_emb.inv_freq, .tq_bits scalars, lm_head.weight
when tied; handles weight_scale_inv FP8 multiply-through;
unwraps language_model. prefix from VLM-converted bundles; see the sketch below)
- Laguna sanitize() now also drops the same HF quirks
- Tests/MLXLMTests/InterleavedReasoningLeakTests.swift (8 new
tests): empty think block, token-split opener/closer, multi-turn
cross-turn isolation, mid-think truncation flushes as reasoning,
stray closer no-channel-open, family stamp matrix lock
OsaurusCore: 1258 / 1258.
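A sketch of that sanitize pass, with key names taken from this message; the generic dictionary walk is an assumption, and the `weight_scale_inv` FP8 multiply-through is elided since it needs tensor math:

```swift
// Illustrative shape of the sanitize() override described above.
func sanitize<T>(_ weights: [String: T], tieWordEmbeddings: Bool) -> [String: T] {
    var out: [String: T] = [:]
    for (rawKey, value) in weights {
        // Unwrap the language_model. prefix from VLM-converted bundles.
        let key = rawKey.hasPrefix("language_model.")
            ? String(rawKey.dropFirst("language_model.".count)) : rawKey
        // Drop HF quirks: rotary inv_freq buffers and .tq_bits scalars.
        if key.hasSuffix("self_attn.rotary_emb.inv_freq") { continue }
        if key.hasSuffix(".tq_bits") { continue }
        // Drop lm_head.weight when embeddings are tied.
        if tieWordEmbeddings && key == "lm_head.weight" { continue }
        out[key] = value
    }
    return out
}
```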
Closes the Laguna JANGTQ port. Vmlx@3a422d7 ships LagunaJANGTQModel
mirroring LagunaModel exactly, except every dense Linear in attention +
MLP + MoE expert + shared expert is replaced with JANGTQDenseLinear.
LLMModelFactory's `laguna` closure now peeks weight_format and routes
mxtq → LagunaJANGTQModel.

Engine status: every JANGTQ family vmlx ships now has a paired JANGTQ
class — NemotronH, Qwen3.5-MoE/Holo3, MiniMax M2, DSV3/Kimi K2, DSV4,
Mistral 3 LLM + VLM, Laguna LLM. Laguna VLM doesn't exist (Laguna is
text-only).

OsaurusCore: 1258 / 1258.
`scanLocalModels` previously walked exactly two levels
(`<root>/<org>/<repo>/`), the HuggingFace shape. Bundles laid out flat
(`<root>/<modelDir>/`) — common when users sync via rsync or place models
on an external drive — were treated as "orgs", and their child files
(config.json, *.safetensors) failed the directory check, so nothing was
detected. Affected real-world bundles include
Nemotron-3-Nano-Omni-30B-A3B-MXFP4, MiniMax-M2.7-JANGTQ4, Kimi-K2.6-*,
DeepSeek-V4-Flash-* and Qwen3.6-35B-A3B-JANGTQ4.

Now each top-level entry is first checked as a model bundle (config.json +
recognised tokenizer + at least one safetensors file). If yes, it's
registered with id = directory basename; otherwise we fall back to the
existing nested descent. Both layouts may coexist under the same root.

`mergeAvailable` now also dedupes by repo tail so a flat `Nemotron-3-...`
entry doesn't surface alongside the curated `OsaurusAI/Nemotron-3-...`
one. The downloaded copy wins on tail collision.
…that needs it

When a JANGTQ bundle's `jang_config.json` declares `weight_format: "mxtq"`
but `jangtq_runtime.safetensors` is absent, vmlx aborts on the first
forward pass because TurboQuantSwitchLinear's runtime cache is empty.
Users who synced a bundle without the sidecar (rsync excluded it, partial
download, etc.) hit a load-time error with no clear remediation.

`ensureJANGTQSidecar` wraps the existing sync validator and, on the
FORWARD-mismatch failure (and only that — code 2 from
`validateJANGTQSidecarIfRequired`), attempts a one-shot download of
`jangtq_runtime.safetensors` from
`https://huggingface.co/<modelId>/resolve/main/jangtq_runtime.safetensors`,
then re-runs validation. The URL is built dynamically from the model id
via `ModelDownloadService.resolveURL` so it always points at the right
repo.

Hard guarantees (covered by tests):

* Sidecar already present → no fetch
* Vanilla (no jang_config.json) model → no fetch
* Stamp says non-mxtq (vanilla or inverse-mislabeled) → no fetch, original
  error propagates unchanged (code 3 still surfaces for inverse mismatch)
* Forward mismatch but flat-layout id (no `/`) → no fetch (no canonical HF
  mapping), original code-2 error surfaces
* Forward mismatch + canonical `<org>/<repo>` id → fetcher fires exactly
  once with the right URL; if it succeeds the validator re-runs and the
  load proceeds; if it fails the error is wrapped as code 4 with the URL
  we tried

The fetcher does an atomic temp → rename and rejects 0-byte responses so a
crashed/cancelled fetch never leaves a partial sidecar that the next
preflight would silently accept.
Pattern + character edge cases that could have made the auto-fetch trigger
on bogus input or skip a real JANGTQ bundle:
* weight_format matching is now case + whitespace insensitive. Every
stamp variant we've seen in the wild (`MXTQ`, ` mxtq `, `\tmxtq`,
`mXtQ`, `mxtq\n`) is recognised as JANGTQ; lookalikes (`mx_tq`,
`mxtq2`, `mxq`, `mxfp4`, etc.) are not. Covers every JANGTQ family —
Qwen, MiniMax, DSV4, Nemotron, Mistral 3 (LLM + VLM), Laguna.
* `isValidHFRepoId` strictly validates the model id BEFORE we build any
URL or hit the network (sketched after this list). Required shape: exactly two non-empty
segments of [A-Za-z0-9._-], 1..96 chars each, no leading or trailing
slash, no whitespace anywhere. Rejected up front: empty string, bare
`/`, leading `/foo`, trailing `foo/`, no-slash flat ids, multi-slash
`a/b/c`, empty middle `a//b`, whitespace, URL meta (`?`, `#`, `&`,
`;`, `:`, `@`), path traversal (`..`), backslashes, quotes, control
characters, non-ASCII, BOM, and segments > 96 chars.
* The resolved URL is double-checked to be `https://huggingface.co/...`
before we trust it.
* Cross-volume safe install: `URLSession.download` writes its temp file
to the system temp dir which is almost always on a different volume
than the model bundle (e.g. user models on `/Volumes/EricsLLMDrive/`).
`moveItem` would fail with EXDEV. Now we copy into a sibling temp
file in the bundle's own directory and use `replaceItemAt`, which
handles "dest already exists" cleanly and stays atomic on the same
volume (install sketched after the test list below).
* Race tolerance: if a concurrent writer produced the sidecar between
our HEAD-of-validate and our install step, we accept their copy
instead of throwing.
* Test injection moved from a `nonisolated(unsafe)` global to a
`@TaskLocal`, so parallel test cases can each scope their own fetcher
override via `withValue` without racing each other (and without
hitting the real network when a parallel case clears the global mid-
flight, which was producing 401s in a prior run).
* `mergeAvailable` tail-dedup now properly removes the curated entry
from `suggestedModels` when a flat-layout local entry replaces it,
instead of leaving both in different lists where the UI rendered them
as duplicates.
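Sketches of the two gates above; the rules mirror this message's accept/reject matrix, while the function shapes themselves are assumptions:

```swift
import Foundation

// Case + whitespace insensitive stamp match: "MXTQ", " mxtq ", "mXtQ",
// "mxtq\n" all count as JANGTQ; "mx_tq", "mxtq2", "mxfp4" do not.
func isJANGTQStamp(_ raw: String) -> Bool {
    raw.trimmingCharacters(in: .whitespacesAndNewlines).lowercased() == "mxtq"
}

// Exactly two non-empty [A-Za-z0-9._-] segments, 1...96 chars each; no
// leading/trailing slash, no whitespace, no URL meta, no `.` / `..`.
func isValidHFRepoId(_ id: String) -> Bool {
    let segments = id.split(separator: "/", omittingEmptySubsequences: false)
    guard segments.count == 2 else { return false }   // rejects flat, a/b/c, a//b
    let allowed = CharacterSet(charactersIn:
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789._-")
    for segment in segments {
        guard (1...96).contains(segment.count),
              segment.unicodeScalars.allSatisfy(allowed.contains),
              segment != ".", segment != ".."          // path-traversal shapes
        else { return false }
    }
    return true
}
```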
Tests added:
* 14 `isValidHFRepoId` cases covering the accept/reject matrix above.
* 8 stamp-variant cases proving every casing/whitespace form trips the
auto-fetch.
* 12 non-mxtq stamp cases proving lookalikes never trip it.
* 13 malformed-id cases proving the network is never hit on bad input.
* One race-tolerance test proving concurrent writers don't break us.
26 new tests; all 29 ModelRuntime + ModelManager tests pass green.
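The cross-volume install described above, in miniature; names are illustrative, and the copy-then-`replaceItemAt` pattern is the point (race tolerance toward concurrent writers is elided):

```swift
import Foundation

struct EmptySidecarDownload: Error {}

func installSidecar(downloaded: URL, into bundleDir: URL) throws {
    let fm = FileManager.default
    // Reject 0-byte responses so a crashed/cancelled fetch never leaves
    // a partial sidecar the next preflight would silently accept.
    let size = (try fm.attributesOfItem(atPath: downloaded.path)[.size] as? Int) ?? 0
    guard size > 0 else { throw EmptySidecarDownload() }

    let dest = bundleDir.appendingPathComponent("jangtq_runtime.safetensors")
    // Copy into a sibling temp file on the bundle's own volume first, so
    // the final step never crosses volumes (URLSession's temp dir usually
    // lives on a different volume; moveItem there would fail with EXDEV).
    let tmp = bundleDir.appendingPathComponent(".jangtq_runtime.safetensors.tmp")
    try? fm.removeItem(at: tmp)
    try fm.copyItem(at: downloaded, to: tmp)
    // replaceItemAt handles "dest already exists" cleanly and stays
    // atomic on the same volume.
    _ = try fm.replaceItemAt(dest, withItemAt: tmp)
}
```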
vmlx pin 3a422d7 → 2ff7a23: picks up the permissive rope_parameters
decode so Laguna mxfp4 / JANGTQ bundles no longer fail at config decode
("Type mismatch at rope_parameters.original_max_position_embeddings").
Flat-layout local ids (e.g. `MiniMax-M2.7-Small-JANGTQ`,
`Nemotron-3-Nano-Omni-30B-A3B-JANGTQ4`) now resolve via a known JANGTQ
publisher org allowlist when the user is missing the sidecar. Ordered
attempts: JANGQ-AI → OsaurusAI → mlx-community. Each candidate is
independently revalidated against `isValidHFRepoId` so illegal chars in
the basename (whitespace, non-ASCII, URL meta) still abort up-front
without hitting the network. `..` / `.` segments are now also rejected
to close a path-traversal-shaped gap in the id gate.
The error message lists every URL we tried so the user sees exactly
where the sidecar should live. Tests cover:
* 15 real-world bundle names (Nemotron-3 family, Holo3, Laguna,
Mistral 3.5, DeepSeek-V4-Flash, Kimi-K2.6, MiniMax-M2.7, Qwen3) —
each must produce OsaurusAI/<name>, JANGQ-AI/<name>, and
mlx-community/<name> as valid candidates
* Canonical `org/repo` ids return only themselves
* Flat-id ordering: JANGQ-AI, OsaurusAI, mlx-community
* Empty / multi-slash / `..` / `.` / illegal-char ids → no candidates
* Flat bundle resolves via OsaurusAI fallback when JANGQ-AI 404s
* Nested OsaurusAI/<repo> hits exactly one URL (no redundant fallback)
* "All candidates 404" surfaces every tried URL in the error message
34 JANGTQ-related tests pass across 6 suites.
Companion to vmlx's MultiTurnFamilyMatrixTests. The engine suite locks
parser/dispatch invariants; this suite locks the osaurus translation
layer: model_id → media-capability → composer drag-drop → which
MessageContentParts even reach the engine.
Sections:
A. Capability detection by model_id (pre-load fast path) — 8 omni
bundle ids, 6 imageVideo, 11 imageOnly, 17 textOnly (incl. flat-
layout local ids), 6 degenerate-id fallbacks
B. Capability detection by bundle directory (post-load refined) —
config_omni.json sidecar trumps model_type, vision_config + 7
video-capable model_types → imageVideo, vision_config + 12 image-
only model_types → imageOnly, 13 dense LLM model_types → textOnly,
unreadable config falls back to model_id matcher
C. Multi-turn capability stability — 5-turn alternating model
switch (omni → text → VL → text → omni) returns expected per-turn
capability with no aliasing; repeated calls deterministic
D. Drag-drop accept matrix — text-only rejects all 3 modalities,
image-only accepts image, imageVideo rejects audio, omni accepts
all 3; 7-turn drag-drop sequence covering switch + accept/reject
per turn
E. End-to-end matrix — 29 (model_id, modality, expected) rows across
every shipping family (Nemotron-3 omni, Qwen 3 VL, Qwen 2.5 VL,
SmolVLM2, Mistral-Medium-3.5, Paligemma, Idefics3, FastVLM,
Pixtral, Holo3 dense, Laguna, MiniMax, DeepSeek-V4, Kimi K2.6,
Qwen3.5/3.6 dense)
18 tests across 5 suites, all green. 1311 total osaurus tests pass,
no regressions. Pin bumped to vmlx 38c93f9 which adds the engine-side
companion (44 tests, 10 suites).
…er reach next prompt
Live audit triggered by user report that MiniMax JANGTQ loops after a
stop+continue. Static analysis ruled out every state-leak path:
* osaurus chat UI builds priorUserMessages from session.turns,
filtering t.role == .user — assistant turns never enter the next
prompt regardless of abort state (ChatView.swift:1288)
* BatchEngine.finishSlot(reason: .cancelled) explicitly skips
coordinator.storeAfterGeneration — the disk/memory cache never
sees a partial entry from a cancelled run (BatchEngine.swift:1384)
* Per-request BatchSlot allocates fresh KV cache via
model.newCache(...) (BatchEngine.swift:610) and a fresh
sampler + PenaltyProcessor + RepetitionContext
(BatchScheduler.swift:148-149) — no per-model mutable state
survives across requests
* ReasoningParser is allocated per-stream via forPrompt(stampName:
promptTail:) and dies with the stream — no parser state crosses
requests
* JANGTQRuntimeCache is a singleton with NSLock-protected
dictionaries; signs/codebook are immutable after loadSidecar
and not mutated during forward
* TurboQuantSwitchLinear / TurboQuantSwitchGLU are pure compute over
@ParameterInfo (read-only Module params) and runtime-cache-resident
codebook arrays — no state path could be corrupted by an aborted
forward pass
This test file codifies the chat-history contract (condensed in the
sketch at the end of this message) so any future drift that reintroduces
assistant content into the prompt is caught:
- aborted assistant turn with partial <think> reasoning → dropped
- aborted assistant turn with partial visible content → dropped
- back-to-back aborts across turns → only user messages survive
- empty user turn filtered out (no degenerate user→user shape)
- user-typed <think> literal preserved verbatim (intent-respecting)
- explicit MiniMax stop-mid-think → continue scenario
6 tests, all green. The remaining viable looping causes are
mlx-swift Metal command-encoder coalescing race (engine-cancel
mid-encode → next request inherits stale buffer) or deterministic
sampling on temp=0 producing similar openings on similar prompts —
both outside this PR's scope.
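The contract those six tests lock, in miniature; `Turn` is a hypothetical stand-in for the real session type:

```swift
struct Turn {
    enum Role { case user, assistant }
    let role: Role
    let text: String
}

// Only user turns ever re-enter the next prompt, regardless of how the
// prior assistant turn ended (mid-think, mid-content, mid-tool-call), and
// empty user turns are dropped so no degenerate user→user shape forms.
func priorUserMessages(_ turns: [Turn]) -> [String] {
    turns.filter { $0.role == .user }
         .map(\.text)
         .filter { !$0.isEmpty }
}
```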
… contracts
User suspected stop-and-continue looping was caused by generation
config / tools / skills state leak across turns. Audited every path:
GENERATION CONFIG (LocalGenerationDefaults):
* Cache keyed by lowercased modelId; multi-turn calls return
byte-identical Defaults
* jang_config primary + generation_config fallback merge precedence
is stable across N invocations (5 iterations × identical output)
* repetition_penalty=1.0 round-trips verbatim (vmlx engine treats
1.0 as no-op via cf8c525, so deterministic sampling)
* Concurrent parse from 32 tasks yields identical results (no
shared mutable state)
TOOLS + SKILLS HISTORY FILTER:
* Aborted assistant turn with pendingToolName / pendingToolArgs is
filtered from next prompt (the user-only role check works on every
assistant variant, including tool-shaped state)
* Completed assistant->tool->assistant exchange is entirely filtered
out of the NEXT user query — only the user message survives
across turns. The tool exchange happens within ONE user query and
never crosses into history visible to a later turn.
* Aborted post-tool answer (tool succeeded, assistant content was
streaming, user clicked Stop) — content is dropped from history
* Skill toggle (on/off/on across 3 turns) doesn't leak skill content
into prior-turn history; skills are a per-turn system-prompt
concern, not a history concern
STOP-AND-CONTINUE DETERMINISM:
* Whether the aborted turn was in reasoning mode, content mode, or
tool-call mode produces IDENTICAL history shape for the next
request. The model has no way to tell what mode the prior turn
was in because the assistant turn never appears in the prompt.
9 new tests across 3 suites, all green. Combined with the existing
27 LocalGenerationDefaultsTests, the multi-turn behavior is provably
deterministic regardless of abort timing or tool/skill state.
…eorder OsaurusAI first
User report: 'jangq-ai/MiniMax-M2.7-Small-JANGTQ' (lowercased upstream)
401s because HF orgs are case-sensitive and 'jangq-ai' isn't the same
as 'JANGQ-AI'. Auto-fetch only tried the verbatim id and never the
canonical-cased fallback.
Fix: even when the supplied modelId is a valid '<org>/<repo>', we now
ALWAYS append canonical-cased variants of the basename. Try the
verbatim id FIRST (custom orgs that genuinely ship at that exact
path), then OsaurusAI/<basename> (curated publisher — most user-
facing JANGTQ + MXFP4 bundles ship here), then JANGQ-AI/<basename>,
then mlx-community/<basename>.
This recovers from BOTH:
* case-mismatch (jangq-ai → JANGQ-AI / OsaurusAI)
* wrong-org-guess (user thinks bundle is on JANGQ-AI but it actually
lives under OsaurusAI)
Tightening: malformed shapes (multi-slash, leading slash, etc.) still
produce zero candidates — basename extraction is only trusted when the
id is either valid <org>/<repo> or pure flat (no slash).
Reorder: OsaurusAI now FIRST in the priority list (was JANGQ-AI). The
curated org ships the bulk of user-facing bundles; trying it first
fixes most missing-sidecar cases on the first network hit.
33 tests across 5 suites, all green. New test
'lowercasedOrgIdRecoveresToCanonicalCasing' covers the exact user
regression.
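The candidate ordering after this commit, as a sketch reusing the `isValidHFRepoId` sketch from earlier; the helper shape is assumed:

```swift
func sidecarCandidates(for modelId: String) -> [String] {
    let publisherOrgs = ["OsaurusAI", "JANGQ-AI", "mlx-community"]
    var candidates: [String] = []
    if isValidHFRepoId(modelId) {
        candidates.append(modelId)                      // verbatim id FIRST
        let basename = String(modelId.split(separator: "/")[1])
        candidates += publisherOrgs.map { "\($0)/\(basename)" }
    } else if !modelId.contains("/"), !modelId.isEmpty {
        // Pure flat id: only the publisher allowlist.
        candidates += publisherOrgs.map { "\($0)/\(modelId)" }
    }
    // Malformed shapes (multi-slash, leading slash, ...) fall through with
    // zero candidates; every candidate is re-validated before any fetch.
    var seen = Set<String>()
    return candidates.filter { isValidHFRepoId($0) && seen.insert($0).inserted }
}
```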
…t dir finds bundles in nested HF-style trees
User's drive layout:
/Volumes/EricsLLMDrive/dealignai/<flat-bundle>/...
/Volumes/EricsLLMDrive/jangq-ai/<org>/<repo>/...
Pointing the osaurus models picker at /Volumes/EricsLLMDrive/ would
find dealignai's bundles (2-level: <root>/dealignai/<flat>) but MISS
jangq-ai's bundles (3-level: <root>/jangq-ai/<org>/<repo>) because the
old scanner only walked exactly 2 levels.
Fix: replace the two separate flat+nested loops with a single
recursive scanDir(maxDepth:) that handles all three layouts:
1. Flat: <root>/<modelDir>/{config.json,...}
2. Nested: <root>/<org>/<repo>/{config.json,...}
3. Multi-org: <root>/<parentOrg>/<org>/<repo>/{config.json,...}
The id is built from the path components joined by /, so a 3-level
discover produces e.g. 'jangq-ai/JANGQ-AI/Laguna-XS.2-JANGTQ' which
MLXModel.localDirectory will round-trip back to the right on-disk path.
Bounded depth=3 keeps the scan from descending into model bundles'
own subdirectories (e.g. shard caches, tokenizer state). A directory
that IS a model bundle stops descent at that level.
No duplicate / zombie code: the recursive scanDir replaces both the
flat-detection branch and the nested-descent branch — single function,
single source of truth.
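A sketch of the bounded recursion, assuming this shape of `scanDir` (the tokenizer check is elided for brevity):

```swift
import Foundation

func scanDir(_ dir: URL, components: [String] = [], maxDepth: Int = 3,
             found: inout [String]) {
    guard maxDepth > 0 else { return }
    let fm = FileManager.default
    guard let entries = try? fm.contentsOfDirectory(
        at: dir, includingPropertiesForKeys: [.isDirectoryKey]) else { return }
    for entry in entries {
        guard (try? entry.resourceValues(forKeys: [.isDirectoryKey]))?
            .isDirectory == true else { continue }
        let hasConfig = fm.fileExists(
            atPath: entry.appendingPathComponent("config.json").path)
        let hasWeights = (try? fm.contentsOfDirectory(atPath: entry.path))?
            .contains { $0.hasSuffix(".safetensors") } ?? false
        let path = components + [entry.lastPathComponent]
        if hasConfig && hasWeights {
            // A directory that IS a model bundle stops descent here; the
            // id is the path components joined by "/".
            found.append(path.joined(separator: "/"))
        } else {
            scanDir(entry, components: path, maxDepth: maxDepth - 1, found: &found)
        }
    }
}
```

Run against the picker root, a 3-level layout yields ids like 'jangq-ai/JANGQ-AI/Laguna-XS.2-JANGTQ', matching the round-trip described above.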
Tests:
* scanLocalModels_detectsThreeLevelMultiOrgLayout (new) — verifies
the user's exact drive shape produces the right ids
* scanLocalModels_detectsFlatAndNestedLayouts (existing) — still
green, no regression on 1- and 2-level layouts
Full osaurus regression: 1328 tests across 182 suites all green.
…ough fix)

Picks up the LLM factory fix that refuses bundles with vision_config on
the mistral3 route, so VLM bundles fall through to VLMModelFactory's
mistral3 route, which handles the language_model + multi_modal_projector +
vision_tower keys correctly via Mistral3VLMJANGTQ.
CI note from current PR sweep: the latest failure matches the fallback DerivedData cache failure class we just fixed on the maintainer-owned #974 branch by wiping restored DerivedData when Actions restores from a broad fallback key rather than an exact source-hash key. A maintainer/admin rerun of the failed job should also take the existing cold-build path.
Local verification on the current head passed the focused core coverage for this PR.
…-vl suffix needed)

Real bundle config for OsaurusAI/Holo3-35B-A3B-mxfp4: outer
model_type=qwen3_5_moe WITH vision_config + pixtral image preprocessor.
The bundle is image+video capable, but the model_id-only matcher required
'-vl' in the name to flag it. Without that flag the chat composer's
drag-drop UI rejected images on Holo3 even though the engine is fully
wired for them (the post-load directory check would catch it via
vision_config, but the pre-load UI gating was wrong).

Add a Holo3 family pattern that matches the bundle name regardless of
'-vl' suffix. The existing 'qwen3.5-6.*-vl' / 'holo3.*-vl' regex stays for
explicit VL-named bundles; the new pattern covers Holo3 base.

Tests updated: Holo3-35B-A3B-mxfp4 moved from textOnly to imageVideo in
the end-to-end matrix; a new comment explains the bundle topology so
future audits don't re-introduce the regression. 11 capability tests
green.
…load errors over wrong-factory sentinels)
…p tq_bits + tied lm_head)
…3VLM model.* sanitize)
…antSwitchGLU + split fused gate_up_proj)
User report: Qwen3.6 27B MXFP4 emits degenerate '!!!!!!!!!' spam in the
thinking channel on a simple prompt. Community issue #995 reports the
same shape: degraded output across multiple models in recent builds,
where the model previously worked fine.

Root cause: cfa1ceb ('Enabled TurboQuant by default') set
CacheCoordinatorConfig.defaultKVMode to .turboQuant(keyBits: 3,
valueBits: 3). Per resolveKVPolicy in vmlx CacheCoordinatorConfig, that
mode is applied to EVERY slot whose request submits 'kvMode: .none',
regardless of prompt length — only the SIZE cap (defaultMaxKVSize) is
prompt-length gated, not the mode itself. 3-bit KV quantization is
aggressive enough to corrupt attention on small reasoning models
(especially when thinking is on, where the 'attention soup' inside
<think> needs precision the most). The sequence is:

- First few tokens decode okay
- Attention error compounds across decode steps because each step reads
  quantized K/V from prior steps
- Output collapses into single-token repetition

Fix: default to .none (fp16). Memory-constrained users can opt into
.turboQuant explicitly per request via GenerateParameters.kvMode. This is
a strict quality-over-memory tradeoff. Smaller models with short prompts
get full attention quality back; long-context users who care about memory
still have the knob via per-request override. The defaultMaxKVSize: 8192
+ longPromptMultiplier: 2.0 size cap stays in place for prompts > 16k
tokens (the rotating window kicks in for ultra-long context regardless of
mode).
…3VLM model.* sanitize)

Picks up the vmlx commits pushed to origin/main today:

- 1135950 fix(laguna): three parity bugs causing garbage-token saturation
  (sigmoid → softplus per-head gate, un-biased routing weights for top-k
  scoring, plain → YaRN-scaled RoPE on full-attention layers)
- babbe34 fix(mistral3vlm): strip leading model. on vision_tower keys
- 0fab91c fix(mistral3vlm): strip model. prefix on multi_modal_projector
  keys

Also fixes a stale MC/DC test: 0a14145 reclassified Holo3 base bundles
(no -vl suffix) as image+video, so the "all dense LLMs → .textOnly"
master-FALSE assertion can no longer include Holo3-35B-A3B-JANGTQ.
…ixes)
Picks up vmlx commit 1173822, which fixes two structural bugs causing
the user-reported "first turn fine, later turns garbage" symptom on
every JANGTQ/MXFP4 model that hit the paged cache:
Bug A — TurboQuantKVCache paged-cache restore compounds quantization
The paged-tier restore previously routed through `tq.state =
[keys, values]`, which transitioned the cache back to .fill phase
with an already-decompressed lossy float as the new prefill. The
next threshold cross then re-compressed that lossy float,
compounding quantization error per turn — the exact symptom on
Qwen3.6 MXFP4 / Mistral 3.5 / every JANGTQ model. Now uses the new
`restoreFromDecodedKV` method that seats the decoded float
DIRECTLY as the compressed-phase prefix without an encode/decode
round trip.
Bug B — hybrid models silently ignore CacheCoordinator.defaultMaxKVSize
`Qwen35.newCache`, `NemotronH.newCache`, and `Qwen3Next.newCache`
returned plain `KVCacheSimple()` for every attention slot, never
reading `parameters?.maxKVSize`. The CacheCoordinator's
`defaultMaxKVSize` contract writes the bound into
`parameters.maxKVSize` at admission, but for the hybrid family
that field was write-only. Long-context Qwen3.5/3.6/Cascade-2
prompts therefore could not be capped via the coordinator. Now
routes attention slots through `RotatingKVCache(maxSize:, keep: 4)`
when the bound is set — same pattern Llama/Mistral already use.
… Qwen35JANGTQ maxKVSize + tied-embeddings hardenings)
Picks up two more vmlx commits on top of 1173822, both from the vmlx
team's real-load verification pass against six bundles on the user's
external drive (Laguna-XS.2, NemotronH-Omni-30B, Mistral-Medium-3.5-128B,
Qwen3.6-27B MXFP4, Qwen3.6-35B-A3B JANGTQ4, MiniMax-M2.7-Small JANGTQ):
3f8a5e9 fix(cache,sanitize): hybrid Qwen35JANGTQ maxKVSize + tied-embeddings hardenings
- Qwen35JANGTQ.newCache now honors parameters?.maxKVSize (parity
with the 1173822 fix for Qwen35/Qwen3Next/NemotronH).
- Mistral3VLM, NemotronH, and Mistral3VLMJANGTQ explicitly drop
redundant lm_head.{weight,scales,biases} when text_config.tie_word_embeddings
is true.
0d85e9d fix(batchengine): defensive EOS widening for common end-of-turn tokens
- BatchEngine init probes 7 common end-of-turn special tokens
(<|im_end|>, <|end|>, <|endoftext|>, <|eot_id|>, <|end_of_turn|>,
<|/s|>, </s>) against the tokenizer vocab and adds any present
ones to the EOS set. Closes the Qwen3.6 MXFP4 <|im_end|> leak
where the tokenizer config only listed <|endoftext|> as the
primary eos_token but the chat template terminated turns on
<|im_end|> (probe sketched below).
Verified on the user's external drive (per vmlx team report):
✅ Laguna-XS.2-JANGTQ — coherent
✅ NemotronH-Omni 30B — all 11 multi-turn rows pass
✅ Mistral-Medium-3.5-128B JANGTQ — load 13s, valid multilingual
✅ Qwen3.6-27B MXFP4 — 3-turn cache reuse, no <|im_end|> leak
✅ Qwen3.6-35B-A3B JANGTQ4 — coherent across 3 turns
✅ MiniMax-M2.7-Small JANGTQ — memory preserved across turns
Test suite remains green: 1328 tests in 182 suites pass.
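The EOS-widening probe from 0d85e9d, in miniature; `vocabId` is a hypothetical tokenizer lookup and the token list is the one named above:

```swift
// The 7 common end-of-turn special tokens the engine probes.
let commonEndOfTurn = ["<|im_end|>", "<|end|>", "<|endoftext|>",
                       "<|eot_id|>", "<|end_of_turn|>", "<|/s|>", "</s>"]

func widenEOS(base: Set<Int>, vocabId: (String) -> Int?) -> Set<Int> {
    var eos = base
    for token in commonEndOfTurn {
        // Only tokens actually present in the vocab widen the EOS set.
        if let id = vocabId(token) { eos.insert(id) }
    }
    return eos
}
```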
Picks up vmlx commits 0756dc0 (close trim-path Metal lifecycle crash on
full disk-cache hit), 71065ca (Laguna chat template fallback + bridge
sniff), and 0e22eba (the upstream's authoritative integration guide for
this pin).

Re-enables `enableDiskCache = diskDirUsable` in
`buildCacheCoordinatorConfig`. The
`notifyExternalReferencesNonZeroOnDealloc` Metal crash that motivated the
temporary `enableDiskCache = false` guard is fixed in 0756dc0: the
trimmed compiled-cache list is now forced to realize before its
underlying Metal buffers go out of scope on the
`Cache disk hit … prefilling 0 remaining` path.

The `eval_http_stability.py` suite remains the regression check; re-run
it on any future pin bump that touches the CacheCoordinator restore path.

Test suite remains green: 1328/1328 pass.
…ok KV)

Per the vmlx-swift-lm 2026-05-01 integration guide
(OSAURUS-INTEGRATION-2026-05-01.md §"3-bit KV verdict"), 4-bit codebook
KV is now the recommended default. The earlier `.turboQuant(3, 3)`
default was reverted to `.none` in commit e202cbb after the `!!!!!!!!!`
repetition spam reported in community issue #995.

The root cause was not the bit width itself but vmlx's
`TurboQuantKVCache` paged-restore path compounding quantization across
multi-turn handoff: the prior `tq.state = [keys, values]` path
transitioned the cache back to `.fill` phase with already-decoded lossy
float as the new prefill, then re-quantized at the next threshold cross —
compounding error per turn (the "first turn fine, later turns garbage"
symptom). That cross-turn handoff bug was fixed in vmlx commit `1173822`
(`restoreFromDecodedKV` keeps the prefix in `.compressed` phase without
round-tripping).

With the bug fixed, 4-bit KV is real-bundle-verified coherent across
multi-turn paths on Qwen3.6 27B MXFP4, Qwen3.6 35B-A3B JANGTQ4, MiniMax
M2.7 Small, Laguna XS.2, Mistral-Medium-3.5-128B, and NemotronH-Omni 30B.
3-bit is also safe post-`1173822` but more error-sensitive and gains less
compression benefit, so 4-bit stays the default. Per-request `kvMode`
still overrides; clients that want fp16 KV can submit `kvMode: .none`
explicitly.
Picks up two vmlx commits since 0e22eba:
- 0c36d01 fix(jinja): pin osaurus-ai/swift-jinja fork with for-iterable
parser fix
- 405bdc6 docs(osaurus): expand integration guide with full sweep results
The swift-jinja fork (osaurus-ai/Jinja@58d21aa5) lifts the for-loop
iterable from "factor + |filter only" to "full binary + comparison +
logical hierarchy, excluding ternary" via a one-line `parseFilter()` →
`parseOr()` change in `Sources/Jinja/Parser.swift:186`. This unblocks
chat templates with iterable expressions like
`{% for message in loop_messages + [{'role': '__sentinel__'}] %}` —
present in Mistral 3.5's native chat template (line 72).
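A paraphrase of the fork's change, per the description above (not verbatim fork source):

```swift
// Sources/Jinja/Parser.swift:186, as described in this message: the
// for-loop iterable moves one level up the expression hierarchy.
//
//   let iterable = try parseFilter()   // before: "factor + |filter only"
//   let iterable = try parseOr()       // after: full binary + comparison +
//                                      //        logical hierarchy, no ternary
//
// which is what lets Mistral 3.5's template line parse:
//   {% for message in loop_messages + [{'role': '__sentinel__'}] %}
```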
Add `osaurus-ai/swift-jinja@58d21aa5` directly to OsaurusCore/Package.swift
so OsaurusCore-level tests + `swift build` resolve to the fork. NOTE:
the App's xcodeproj currently still resolves the upstream
`huggingface/swift-jinja.git` transitively via swift-transformers because
the App's project.pbxproj has no remote SPM package references at all
(every remote pin comes through the local-path packages OsaurusCore /
OsaurusCLI / OsaurusRepository, and SwiftPM's xcodeproj resolver does
not promote the fork's URL when the upstream URL is declared in a
transitive-leaf package). Wiring the fork through the App's xcodeproj
is a separate change; not blocking because Mistral 3.5 itself has a
known model-forward bug (RoPE bundle-metadata mismatch — Python ref uses
plain RoPE base=1e6 but bundle config declares rope_type="yarn") being
fixed by the vmlx team. All non-Mistral-3.5 chat templates render
correctly through the upstream swift-jinja the App currently resolves.
PR #993 — final state for review
| Area | Change | Where |
|---|---|---|
| Cache: KV mode default | .none (fp16) → .turboQuant(4, 4) | ModelRuntime.swift:415 |
| Cache: L2 disk re-enabled | force-disabled → diskDirUsable | ModelRuntime.swift:401 |
| Capability matcher | Holo3 base bundles → .imageVideo | ModelMediaCapabilities.swift:111 |
| Scanner | flat-layout + 3-level recursive HF nest detection | ModelManager.swift (+154L) |
| Sidecar fetch | auto-fetch jangtq_runtime.safetensors from HF on the exact failure that needs it, with strict gating | ModelRuntime.swift:1304-1510 |
| Sidecar URL safety | isValidHFRepoId strict regex, https://huggingface.co/ host-pin, candidate fallback list (osaurus-ai → JANGQ-AI → mlx-community) | ModelRuntime.swift:1388-1532 |
| Multi-turn contracts | aborted assistant turns never reach next prompt; reasoning toggle survives multi-turn | new test files (1896 lines / 81 @Test) |
| Vmlx pin | 01a1194a-era → 0c36d01 | Package.swift, Package.resolved × 2 |
Verified by other agents on the corresponding side
Cache + runtime (vmlx-swift-lm changes picked up via the pins):
- `1135950` — Laguna 3-parity bugs (sigmoid → softplus, biased → un-biased routing weights, plain → YaRN RoPE) — verified coherent: real-load decode "Okay, so I need to figure out how<think>Okay, let's see..." on Laguna-XS.2 JANGTQ.
- `576916b` — Laguna codebook MoE via `TurboQuantSwitchGLU` + split fused `gate_up_proj`.
- `babbe34` / `0fab91c` — Mistral3VLM strip `model.` prefix on `vision_tower` / `multi_modal_projector` keys.
- `1173822` — TWO structural cache bugs that drove the "first turn fine, later turns garbage" symptom across every JANGTQ/MXFP4 multi-turn path:
  - Bug A: `TurboQuantKVCache` paged-restore compounded quantization across turns (re-encoding already-decoded lossy float). Fixed via the new `restoreFromDecodedKV` that seats the prefix in `.compressed` phase without round-tripping.
  - Bug B: hybrid models (`Qwen35`, `Qwen3Next`, `NemotronH`) silently ignored `CacheCoordinator.defaultMaxKVSize`. Now route attention slots through `RotatingKVCache(maxSize:, keep: 4)` when the bound is set.
- `3f8a5e9` — Same `maxKVSize` honor for `Qwen35JANGTQ` + tied-embeddings hardenings (`Mistral3VLM`, `NemotronH`, `Mistral3VLMJANGTQ`).
- `0d85e9d` — Defensive EOS widening: `BatchEngine` init probes 7 common end-of-turn special tokens against the tokenizer vocab and adds present ones to the EOS set. Closes the Qwen3.6 MXFP4 `<|im_end|>` leak.
- `0756dc0` — Disk-tier trim-path Metal lifecycle crash (`notifyExternalReferencesNonZeroOnDealloc`) closed. Disk cache safe to re-enable in osaurus (this PR does).
- `0c36d01` — `osaurus-ai/swift-jinja@58d21aa5` fork pin: lifts the for-loop iterable from "factor + |filter only" to "full binary + comparison + logical hierarchy, excluding ternary" via `parseFilter()` → `parseOr()` (Sources/Jinja/Parser.swift:186, 1-line change). Unblocks `{% for message in loop_messages + [{'role': '__sentinel__'}] %}` in Mistral 3.5's native chat template. 756/756 swift-jinja tests pass + 2 new regression tests (`forLoopIterableAcceptsBinaryPlus`, `mistral3RealNativeTemplateParses`).
Reasoning ON/OFF detection (osaurus side, exhaustively wired):
- `LocalReasoningCapability.detect()` reads `chat_template.jinja` or falls back to `jang_config.json > chat.reasoning` (DSV4-Flash style). Caches per-modelId.
- `Capability.supportsThinking` flips on `<think>` / `</think>` / `<|think|>` (Gemma-4 Harmony marker).
- `ChatView.swift:395,409` reads `activeModelOptions["disableThinking"]?.boolValue == false` as `thinkingEnabled`.
- `MLXBatchAdapter.swift:274-278` translates: `additionalContext = ["enable_thinking": !disableThinking]` (defaults to `true`).
- `ThinkTagScrubber` (Services/ModelRuntime/ThinkTagScrubber.swift) is a defensive post-filter on `.tokens` for thinking-capable models only. Buffers a 7-byte tail for split-token tag detection.
- Multi-turn contract: `ChatView.swift:1288-1291` builds `priorUserMessages` from `t.role == .user` only — assistant `<think>` content cannot leak into the next prompt by construction. Locked by `AbortMidThinkHistoryContractTests.swift` (152 lines).
JANGTQ sidecar auto-fetch (osaurus side):
- Runs on `validateJANGTQSidecarIfRequired` throwing the specific "missing sidecar but stamp says JANGTQ" error code (ModelRuntime/code 2). Any other failure mode propagates immediately (no speculative fetch).
- Candidate repo list: original repo → osaurus-ai → JANGQ-AI → mlx-community (canonical-cased, deduped, `isValidHFRepoId`-validated).
- Each candidate URL is `https://`-scheme-pinned + host-pinned to `huggingface.co`. No arbitrary URL fetch.
- Locked by `EnsureJANGTQSidecarTests.swift` (214 lines), `JANGTQEdgeCaseTests.swift` (496 lines), `ValidateJANGTQUnsupportedFamilyTests.swift` (206 lines), `OsaurusOrgAutoFetchTests`, `WeightFormatNormalizationTests`.
Hybrid SSM warm-pass + cache injection (vmlx-owned, osaurus consumes):
- `BatchEngine.admitPendingRequests` auto-flips `coordinator.isHybrid = true` on first slot admission for any model whose per-layer cache list contains a `MambaCache` or `ArraysCache`.
- Osaurus eager-sets `setHybrid(true)` for known hybrid families in `installCacheCoordinator` (Qwen3.5/3.6, NemotronH, Cascade-2, Jamba, etc.) per OMNI-OSAURUS-HOOKUP.md §5.1. Harmless on any admission path; closes the one-frame stale-flag window.
- Paged-cache reuse is content-addressed via `BlockHashMap` — toggling `enable_thinking` between turns produces a different prompt → different hash → no poisoned hit.
Per-bundle media capability + drag-drop / picker accept-set (osaurus side):
- `ModelMediaCapabilities.from(modelId:)` substring/regex matcher: omni (Nemotron-3-Nano-Omni), imageVideo (Qwen2/2.5/3 VL, Qwen3.5/3.6 -vl, Holo3 base + -vl, SmolVLM 2), imageOnly (Paligemma, Idefics3, FastVLM, Pixtral, GLM-OCR, LFM2-VL, Gemma 3 / 4-it, Mistral 3 / 3.5, Mistral 4 -vl), textOnly otherwise.
- `from(directory:modelId:)` post-load refines via `config.json > vision_config` + `config_omni.json` sidecar (Nemotron-3 omni gate).
- `FloatingInputCard.dropAcceptedTypes` (line 2115): always `.image + .fileURL`, conditionally adds `.audio + .mp3 + .wav + .mpeg4Audio` for `supportsAudio`, `.movie + .video + .quickTimeMovie + .mpeg4Movie` for `supportsVideo`.
- `pickerAllowedTypes` (line 2139): same gating, the picker shows audio/video formats only when the loaded model can actually consume them.
- `attachIfAllowed` (line 2179): routes by extension + capability; drops unsupported. Falsy default (textOnly) when no model selected.
- Locked by `ModelMediaCapabilitiesMCDCTests.swift` (MC/DC coverage including the Holo3 base = `imageVideo` boundary).
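A hypothetical condensed shape of that routing; the real code keys off UTTypes in FloatingInputCard, so both types here are illustrative stand-ins:

```swift
struct MediaCapability {
    let supportsImage: Bool
    let supportsVideo: Bool
    let supportsAudio: Bool

    // Falsy default when no model is selected.
    static let textOnly = MediaCapability(
        supportsImage: false, supportsVideo: false, supportsAudio: false)
}

enum AttachmentKind { case image, video, audio }

// `attachIfAllowed` in miniature: route by kind + capability, drop
// unsupported; text-only rejects all three modalities.
func attachAllowed(_ kind: AttachmentKind, cap: MediaCapability) -> Bool {
    switch kind {
    case .image: return cap.supportsImage
    case .video: return cap.supportsVideo
    case .audio: return cap.supportsAudio
    }
}
```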
Test evidence
- `swift test` (OsaurusCore): 1328/1328 pass in 182 suites at `89f8114`. Includes 81 new `@Test` annotations from this PR — zero `@disabled` / `XCTSkip` markers added.
- `xcodebuild test -workspace osaurus.xcworkspace -scheme OsaurusCoreTests` from cold cache: **TEST SUCCEEDED** — 1323 pass / 0 fail / 5 skipped (the 5 skipped are `SandboxIntegrationTests` — Apple Containerization VM tests, gated on `OSAURUS_RUN_SANDBOX_INTEGRATION_TESTS=1`, untouched by this PR, intentionally `Disabled` in CI per their suite-level annotation).
- `xcodebuild build -scheme osaurus` Release: **BUILD SUCCEEDED** (latest at HEAD `6cfad02a`).
CI status
test-core failed with the EventSource / CAsyncHTTPClient / CNIOLLHTTP / CNIOExtrasZlib / CNIOPosix / _NumericsShims module-resolution errors. This is environmental — a fallback-key restore of DerivedData from a different Package.resolved baseline mismatching the freshly-built C-shim module-map paths. Confirmed by the raw failure logs (the C shims compiled and linked cleanly, then Xcode rejected the cached .swiftmodule whose embedded paths didn't exist). The PR's source builds green from cold cache locally, and the PR doesn't touch .github/workflows/. The workflow has its own escape hatch at ci.yml:123-131: any "Re-run failed jobs" automatically wipes DerivedData (if: github.run_attempt != '1') and forces a cold build.
Verified-coherent bundles end-to-end (per vmlx-swift-lm team's BENCH harness sweep)
| Bundle | Path | Result |
|---|---|---|
| Laguna-XS.2-JANGTQ | native {% generation %} template via fork | ✅ 3-turn coherent |
| NemotronH-Omni 30B JANGTQ | image + video + audio + reasoning toggle | ✅ all 11 OmniBench rows pass |
| Qwen3.6-27B-MXFP4 | BENCH_BATCH_DISK_RESTORE (138/138 hit) + BENCH_BATCH_CHAT 3-turn | ✅ no Metal crash, prompt 0.290s → 0.064s (4.5× cache hit), <|im_end|> clean stop |
| Qwen3.6-35B-A3B-JANGTQ4 | hybrid SSM + codebook MoE 3-turn | ✅ |
| MiniMax-M2.7-Small-JANGTQ | 3-turn with memory across turns | ✅ |
| Gemma-4-26B-A4B-it-JANG_4M | 3-turn coherent | ✅ |
| Holo3-35B-A3B-mxfp4 | 3-turn | ✅ |
| Mistral-Medium-3.5-128B-mxfp4 | post af89da7 (jang bitWidthsUsed: [] default) | ✅ "Okay, the user just sent a blank message." |
Documented open issues (NOT silently swept)
- Mistral-Medium-3.5-128B-JANGTQ model-forward gibberish — root cause hypothesis documented in vmlx commit `89f8114`: per-block codebook calibration mismatch on the all-codebook dense decoder at non-power-of-2 hidden_size 12288. Hadamard rotation produces coordinate variance ~1/8192 + ~1/4096 (per block), but `compute_codebook(d=12288, bits=2)` calibrates 4 entries for variance ~1/12288. Across 88 dense layers the scale mismatch compounds. The fix requires per-block codebooks in BOTH `jang_tools/turboquant/codebook.py` AND `Libraries/MLXLMCommon/JANGTQDenseLinear.swift`. Recommendation: do NOT ship Mistral 3.5 JANGTQ until fixed. The Mistral 3.5 mxfp4 path works fine.
- Mistral 3.5 chat template still fails for the App's release binary — the App's `osaurus.xcodeproj/project.pbxproj` declares no remote SPM packages (every remote pin comes transitively through the local-path packages OsaurusCore / OsaurusCLI / OsaurusRepository). SwiftPM's xcodeproj resolver picks the upstream `huggingface/swift-jinja.git` URL declared by `swift-transformers` over the fork URL declared in OsaurusCore's `Package.swift`. Fix: add `osaurus-ai/swift-jinja@58d21aa5` directly to `project.pbxproj` as an `XCRemoteSwiftPackageReference`. Not blocking because Mistral 3.5 has the JANGTQ codebook bug above anyway, and the mxfp4 path uses a different chat template that doesn't hit the for-iterable parser issue. Tracked separately.
- DSV4 native chat encoder wiring — `Libraries/MLXLMCommon/DeepseekV4ChatEncoder.swift` exists but `DSV4Minimal.jinja` is what's used in the bridge today. Tool-calling renders with a simplified DSML envelope. Not blocking basic chat. Documented in the vmlx integration guide.
What other team members need to know
- The merge is safe. No regression to any non-Mistral-3.5 family. The 33 commits roll up cleanly and have been multi-turn-tested by the vmlx-swift-lm side against real bundles in `/Volumes/EricsLLMDrive`.
- Re-running CI is the documented escape hatch for the cache-pollution `test-core` failure. The wipe-on-re-run logic (ci.yml:123-131) was designed for exactly this class of failure.
- For follow-up PRs:
  - Wire `osaurus-ai/swift-jinja@58d21aa5` directly into `osaurus.xcodeproj/project.pbxproj` so the App resolves the fork (separate change).
  - Wire `DeepseekV4ChatEncoder.swift` through `JangChatConfig.encoder == "encoding_dsv4"` (separate change).
  - Hold Mistral 3.5 JANGTQ until per-block codebook calibration lands in `jang_tools` + `JANGTQDenseLinear.swift`.
🚀 Ready for review.
…laguna-preflight

# Conflicts:
#	osaurus.xcworkspace/xcshareddata/swiftpm/Package.resolved
…l 3.5 mxfp4 RMSNorm fix + JANGTQ Mistral 3.5 root cause docs)
…mode
vmlx pin bump from 89f8114 → ddea384 picks up the iter-12 fix sweep
documented in vmlx commit bc19fc4's OSAURUS-PRODUCTION-REFERENCE-2026-
05-01.md (still local in vmlx; ddea384 is the latest commit pushed to
origin/main and includes the same fixes):
- 38086ca per-block Hadamard kernel rewrite (Mistral 3.5 hidden=12288
root cause complete — closes the non-power-of-2 dim handling
gap in the encode pipeline)
- 6096875 Hadamard H_2n recursion for blocks > 8192 (Mistral 3.5
down_proj support)
- a1bfe65 Hadamard kernel shmem 4096→8192, newv 4→64
- 890e3ed Mistral3VLM patch_conv weight transpose in sanitize
- 227332f hybrid-SSM full-disk-hit unsafeFullHit guard (rolls back to
full prefill instead of emitting 0 tokens — closes the
BENCH_STABILITY S2 silent-empty-stream bug)
- 7389453 Mistral 3.5 compile-ON cache.offset skip when beta=0 (9×
TTFT speedup, byte-identical text)
- 9703b49 Mistral3Text sibling fix (same beta=0 short-circuit)
- 53f7671 Kimi K2.5 text_config unwrap + language_model. weight strip
- a8ac486 + 1125e20 Hadamard L2-preservation regression tests at
d ∈ {4096, 8192, 12288, 28672}
Plus a small osaurus-side hardening on the L2 disk-cache modelKey:
L2 disk cache entries are now keyed by `<modelName>|kv=<mode>` instead
of bare `<modelName>`. This prevents stale entries encoded under one
KV mode from being served against requests using a different mode
(e.g. user flips defaultKVMode mid-session, or a per-request override
diverges from the coordinator default). Without this scoping, a cache
hit returning fp16 KV layout to a TurboQuant decoder (or vice versa)
produces undefined behavior. The L1 paged cache stays per-model
(modelName-scoped) — the kvModeTag only affects disk persistence.
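The key shape in miniature; `KVMode` here is a hypothetical stand-in modeling only the cases this PR names, and `kvModeTag`'s exact spelling is an assumption:

```swift
enum KVMode {
    case none
    case turboQuant(keyBits: Int, valueBits: Int)
}

func kvModeTag(_ mode: KVMode) -> String {
    switch mode {
    case .none: return "none"
    case .turboQuant(let k, let v): return "tq\(k)-\(v)"
    }
}

// L2 disk entries are scoped `<modelName>|kv=<mode>`; the L1 paged cache
// stays modelName-scoped.
func diskCacheKey(modelName: String, kvMode: KVMode) -> String {
    "\(modelName)|kv=\(kvModeTag(kvMode))"
}
```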
The defaultKVMode stays at .none — see the file-level comment for the
3-bit and 4-bit codebook KV degenerate-repetition trail, and the new
reference to vmlx's `OSAURUS-PRODUCTION-REFERENCE-2026-05-01.md` §6
which lists `.turboQuant(3, 3)` as an EXAMPLE coordinator config but
does not bench-test it against thinking-mode preambles (the failure
mode that drives `idea idea idea` and `!!!!!!!!!` repetition).
Build clean Release; OsaurusCore tests 1381/1383 pass (the 2 local-only
ContextBudgetPreviewTests failures are environmental — same source ran
green on PR #993's CI test-core).
…699d3a polymorphic MoE)
Picks up the iter-13 fix sweep from vmlx-swift-lm:
- 4699d3a fix(laguna): polymorphic MoE — affine SwitchGLU for mxfp4 +
codebook for mxtq.
`LagunaMoE.experts` is now a `LagunaSwitchMLPLayer` protocol
type-erased over `TurboQuantSwitchGLU` (codebook/mxtq) and
`SwitchGLU` (affine/mxfp4). Factory dispatches on
`weight_format == "mxtq"` OR `mxtq_bits` presence; sanitize
splits the fused `gate_up_proj` for both formats. The
`OsaurusAI/Laguna-XS.2-mxfp4` bundle now loads natively
(vmlx team verified `"2+2 equals 4."` end-to-end + 12/12
BENCH_STABILITY pass + multi-turn cache reuse).
- bc19fc4 + e33068d docs(osaurus): comprehensive production reference
(14 runtime axes + §15 component invariants the SDK
guarantees and osaurus must respect).
- e9b7e7b docs(osaurus): Mistral 3.5 JANGTQ Python/vmlx-upstream
parity audit.
Drops the now-obsolete Laguna mxfp4 preflight gate (`code 5`) from
`validateJANGTQSidecarIfRequired`. The cryptic "Unhandled keys" error
that motivated the gate (vmlx's hardcoded `TurboQuantSwitchGLU`
expert path rejecting mxfp4 affine keys) is fixed at the source by
4699d3a's polymorphic dispatch — the preflight is no longer needed.
Closes vmlx production-reference §13 item #2.
Test changes:
- Removed `laguna_mxfp4_blocked` + `laguna_jangtq_passes` (the gate
they covered is gone; left a one-line breadcrumb pointing at
4699d3a so future readers understand why two test slots
disappeared).
- Updated `MLXBatchAdapterTests` to lock the new `maxBatchSize == 1`
default (was 4 → flipped in fa694e9 to engage compile path per
§15 invariant 13). Renamed `defaultsToFour` →
`defaultsToOne_forCompileEngagement` so the test name carries the
rationale.
Build clean Release; remaining 11 family-preflight tests pass; both
batch-adapter test failures resolved. The 2 environmental
`ContextBudgetPreviewTests` failures from earlier runs are unchanged
(local-only — same source ran green on PR #993's CI test-core).
… degenerate-repetition looping (#998)

* fix(quality): default KV mode .turboQuant(4, 4) → .none (4-bit codebook KV)

Reverts the 4-bit codebook KV default committed in db3179f (per the vmlx
integration guide §"3-bit KV verdict") after real-bundle testing
reproduced the same degenerate-repetition failure mode that 3-bit KV
produced before e202cbb:

- Gemma-4 31B JANG_4M with thinking=ON emitted `idea idea idea idea idea
  ...` after a few hundred tokens of reasoning preamble.
- Multiple other family bundles drifted into looping after a few
  multi-turn rounds even though turn 1 was coherent ("first turn fine,
  later turns garbage" symptom).
- Thinking=OFF on the same bundles produced coherent output — confirming
  the failure scales with reasoning preamble length.

Vmlx commit `1173822` closed the cross-turn paged-cache re-encoding bug
(state was transitioning back to .fill phase with already-decoded lossy
float, then re-quantizing). But the underlying codebook quantization
error still compounds across long thinking-mode preambles (longer prefix
→ more compression rounds → more accumulated error → attention latches
onto a high-prob low-info token and loops). The vmlx team's BENCH harness
verified 4-bit KV across 6+ bundles but didn't toggle thinking on every
family it covered. The integration guide's 4-bit recommendation
under-tested the failure mode for thinking-capable models.

fp16 KV is the conservative default that matches the user expectation of
"responses look right out of the box" across every family + every turn
count + thinking-on or off. Per-request `kvMode` still overrides; clients
that want memory savings can submit `.turboQuant(...)` explicitly.

* chore(deps): bump vmlx pin → ddea384 + scope L2 disk-cache key by KV mode

vmlx pin bump from 89f8114 → ddea384 picks up the iter-12 fix sweep
documented in vmlx commit bc19fc4's
OSAURUS-PRODUCTION-REFERENCE-2026-05-01.md (still local in vmlx; ddea384
is the latest commit pushed to origin/main and includes the same fixes):

- 38086ca per-block Hadamard kernel rewrite (Mistral 3.5 hidden=12288
  root cause complete — closes the non-power-of-2 dim handling gap in the
  encode pipeline)
- 6096875 Hadamard H_2n recursion for blocks > 8192 (Mistral 3.5
  down_proj support)
- a1bfe65 Hadamard kernel shmem 4096→8192, newv 4→64
- 890e3ed Mistral3VLM patch_conv weight transpose in sanitize
- 227332f hybrid-SSM full-disk-hit unsafeFullHit guard (rolls back to
  full prefill instead of emitting 0 tokens — closes the BENCH_STABILITY
  S2 silent-empty-stream bug)
- 7389453 Mistral 3.5 compile-ON cache.offset skip when beta=0 (9× TTFT
  speedup, byte-identical text)
- 9703b49 Mistral3Text sibling fix (same beta=0 short-circuit)
- 53f7671 Kimi K2.5 text_config unwrap + language_model. weight strip
- a8ac486 + 1125e20 Hadamard L2-preservation regression tests at
  d ∈ {4096, 8192, 12288, 28672}

Plus a small osaurus-side hardening on the L2 disk-cache modelKey: L2
disk cache entries are now keyed by `<modelName>|kv=<mode>` instead of
bare `<modelName>`. This prevents stale entries encoded under one KV mode
from being served against requests using a different mode (e.g. the user
flips defaultKVMode mid-session, or a per-request override diverges from
the coordinator default). Without this scoping, a cache hit returning
fp16 KV layout to a TurboQuant decoder (or vice versa) produces undefined
behavior. The L1 paged cache stays per-model (modelName-scoped) — the
kvModeTag only affects disk persistence.

The defaultKVMode stays at .none — see the file-level comment for the
3-bit and 4-bit codebook KV degenerate-repetition trail, and the new
reference to vmlx's `OSAURUS-PRODUCTION-REFERENCE-2026-05-01.md` §6,
which lists `.turboQuant(3, 3)` as an EXAMPLE coordinator config but does
not bench-test it against thinking-mode preambles (the failure mode that
drives `idea idea idea` and `!!!!!!!!!` repetition).

Build clean Release; OsaurusCore tests 1381/1383 pass (the 2 local-only
ContextBudgetPreviewTests failures are environmental — the same source
ran green on PR #993's CI test-core).

* fix(preflight): reject Laguna mxfp4 bundles with actionable error

Real-bundle repro on `OsaurusAI/Laguna-XS.2-mxfp4` (model_type=laguna,
jang_config.weight_format=mxfp4, quantization.bits=4): vmlx fails
parameter load with the cryptic

    Error: Unhandled keys ["biases", "scales", "weight"] in
    layers.1.mlp.experts.down_proj in
    LagunaModel.LagunaLayer.LagunaMoE.TurboQuantSwitchGLU
    .TurboQuantSwitchLinear

leaving users no path to remediation. Root cause traced into
`Libraries/MLXLLM/Models/Laguna.swift:425`:

    @ModuleInfo(key: "experts") var experts: TurboQuantSwitchGLU

`LagunaMoE.experts` is hardcoded to the codebook switch. Its inner
`TurboQuantSwitchLinear` only knows the JANGTQ keys (`tq_packed`,
`tq_norms`, `tq_bits`) — the bundle ships standard mxfp4 affine keys
(`weight`, `scales`, `biases`), so the parameter loader rejects them. The
vmlx production reference doc §13 item #2 names this as a known issue
with owner `vmlx-swift` ("Laguna mxfp4 expert format mismatch — either
ship JANGTQ-only or add affine MoE class").

Until vmlx makes `LagunaMoE.experts` polymorphic on `weight_format`, the
host preflight catches the failure mode, surfacing a clear remediation:
use the Laguna XS.2 JANGTQ bundle (`weight_format = "mxtq"`), which is
verified-coherent end-to-end.

Detection condition (in `validateJANGTQSidecarIfRequired`):

- `jang_config.weight_format == "mxfp4"` (case-normalised)
- AND `config.json::model_type == "laguna"`

Throws `NSError(domain: "ModelRuntime", code: 5)` so callers can
distinguish it from the existing forward (code 2) / inverse (code 3)
sidecar mismatches and the auto-fetch path (code 4). When vmlx ships the
polymorphic LagunaMoE expert path, drop this check + its tests.

Test coverage:

- `Laguna mxfp4 → throws code 5 with remediation pointing to JANGTQ
  alternative` — locks the new gate
- `Laguna JANGTQ (mxtq) passes preflight — codebook path is supported` —
  boundary check that the gate doesn't false-positive on the working
  Laguna path

* fix(ui): rolling steady-state tok/s instead of single-final-average

Two visible artefacts the prior tok/s display produced, both reported by
users:

1. "Counter doesn't ramp up — needs a long response to show full speed."
   Short answers (50-200 tokens) have first-token amortisation +
   reasoning-parser stamp resolution dominating wall time, so the
   full-generation average reads ~30% slower than steady-state decode.
2. "Reasoning ON shows different tok/s than reasoning OFF on the same
   model." Same decode rate, but thinking-on accumulates 5-10× more
   tokens at steady-state speed, diluting setup costs in the average.
   Users perceive the same model + same hardware as inconsistent.

Both are calculation artefacts of the cumulative-average pattern, not
underlying decode-rate differences.

Replaced with `RollingTokenRate` — a sliding-window estimator that skips
a 0.4s + 4-token warmup, reports steady-state over a 1.5s window, counts
content + reasoning + tool-arg tokens uniformly, and updates the live
ChatTurn rate at ~5Hz during streaming. On finalize it prefers rolling
steady-state and falls back to the full-gen average only when warmup
never elapsed (response too short to converge).

11 unit tests in `RollingTokenRateTests.swift` lock the contract: warmup
gating (time + token), 60-tps steady-state convergence, content vs
reasoning invariance, sliding-window expiration, finalRate fallback,
short-response edge cases. All green.

* fix(perf): default maxBatchSize 4 → 1 to engage vmlx compile path

Per vmlx production reference §15 invariant 13: compile only engages when
`maxBatchSize == 1` (Stage 1B.3 scope; Stage 1B.4 per-bucket shared
buffers — pending). With the prior default of 4, every
`maybePromoteToCompiledDecode` gate failed and the decode loop ran
uncompiled, missing the documented compile-ON speedups:

- Mistral 3.5 BENCH_VL_BATCH_CHAT: 24.8s → 2.7s TTFT (9× speedup)
- Other promotion-eligible families (Qwen 3.5/3.6, MiniMax, NemotronH,
  DSV4 via Compilable* cache classes): ranges from ~1.5× to ~9×
  depending on family — vmlx §8 promotion table.

Osaurus's primary use case is single-user chat through the macOS app
where only one slot is active at a time. For multi-user server
deployments, the existing `defaults write -int N` override remains — at
the cost of compile being permanently disabled for that process (the
trade-off until vmlx ships Stage 1B.4).

* chore(deps): bump vmlx pin → e33068d + drop Laguna mxfp4 preflight
(4699d3a polymorphic MoE)

Picks up the iter-13 fix sweep from vmlx-swift-lm:

- 4699d3a fix(laguna): polymorphic MoE — affine SwitchGLU for mxfp4 +
  codebook for mxtq. `LagunaMoE.experts` is now a `LagunaSwitchMLPLayer`
  protocol type-erased over `TurboQuantSwitchGLU` (codebook/mxtq) and
  `SwitchGLU` (affine/mxfp4). The factory dispatches on
  `weight_format == "mxtq"` OR `mxtq_bits` presence; sanitize splits the
  fused `gate_up_proj` for both formats. The
  `OsaurusAI/Laguna-XS.2-mxfp4` bundle now loads natively (vmlx team
  verified `"2+2 equals 4."` end-to-end + 12/12 BENCH_STABILITY pass +
  multi-turn cache reuse).
- bc19fc4 + e33068d docs(osaurus): comprehensive production reference (14
  runtime axes + §15 component invariants the SDK guarantees and osaurus
  must respect).
- e9b7e7b docs(osaurus): Mistral 3.5 JANGTQ Python/vmlx-upstream parity
  audit.

Drops the now-obsolete Laguna mxfp4 preflight gate (`code 5`) from
`validateJANGTQSidecarIfRequired`. The cryptic "Unhandled keys" error
that motivated the gate (vmlx's hardcoded `TurboQuantSwitchGLU` expert
path rejecting mxfp4 affine keys) is fixed at the source by 4699d3a's
polymorphic dispatch — the preflight is no longer needed. Closes vmlx
production-reference §13 item #2.

Test changes:

- Removed `laguna_mxfp4_blocked` + `laguna_jangtq_passes` (the gate they
  covered is gone; left a one-line breadcrumb pointing at 4699d3a so
  future readers understand why two test slots disappeared).
- Updated `MLXBatchAdapterTests` to lock the new `maxBatchSize == 1`
  default (was 4 → flipped in fa694e9 to engage the compile path per §15
  invariant 13). Renamed `defaultsToFour` →
  `defaultsToOne_forCompileEngagement` so the test name carries the
  rationale.

Build clean Release; the remaining 11 family-preflight tests pass; both
batch-adapter test failures resolved. The 2 environmental
`ContextBudgetPreviewTests` failures from earlier runs are unchanged
(local-only — the same source ran green on PR #993's CI test-core).

* chore(deps): bump vmlx pin → 2e61c12 (Stage 1B.4 design doc + scaffold)

* fix(cache): defaultMaxKVSize 8192 → 65536 to match vmlx production
reference

Per the vmlx production reference §6 example
(`cfg.defaultMaxKVSize = 65536`) and the audit conclusion confirmed by
the vmlx team. The prior 8192 silently truncated long-context prompts:

- 50K-token PDF Q&A → the model only saw the last 8K tokens (84% context
  loss) past the 16K trigger (8192 × longPromptMultiplier=2.0).
- Long thinking-mode reasoning preambles > 16K → the cap kicked in
  mid-reasoning and the model lost earlier context.

Worst-case wired memory at 65K × 88 layers × 8 KV-heads × 128 head_dim ×
2 bytes (fp16) × 2 (K+V) ≈ 2.4 GB per slot on Mistral 3.5 — but
TurboQuant compression at the engine's `min_tokens_for_compression`
threshold (~2K tokens) drops the steady-state cost ~26× to ~95 MB per
slot on `.turboQuant(4,4)`. With osaurus's `.none` default the cold path
stays fp16, but the rotating cap only kicks in for prompts past 131K
(65536 × 2.0) — small/medium chats unaffected. The wired-memory worry is
rounding error on a 16GB+ Mac; the silent-truncation footgun was the
worse failure mode. A per-family overlay would be cleaner long-term but
is premature optimization vs the uniform doc-aligned default.
Summary
Two-layer fail-fast defense for a load-path issue identified post-PR-#967 merge: JANGTQ-quantized Mistral 3 family (incl. Mistral-Medium-3.5-128B-JANGTQ2) and Laguna bundles can't load through vmlx today because their model classes use vanilla MLXNN.Linear instead of a JANGTQ-aware shim.
Without this PR, a user installing those JANGTQ tiers gets either a weight-shape mismatch crash (`.tq_packed` shape != `Linear.weight` flat), or silently-loaded garbage (codebook bytes treated as raw weights).

Layer 1 — engine-side (vmlx-swift-lm `d32e135`)

Both `LLMModelFactory` and `VLMModelFactory` `mistral3` dispatch closures now peek `weight_format` BEFORE falling through to vanilla `Mistral3TextModel` / `Mistral3VLM`, throwing a clear error pointing at MXFP4 as the working alternative.

Layer 2 — osaurus host preflight (this PR)

`validateJANGTQSidecarIfRequired` now has a third check covering pending-JANGTQ families. Fires only when `jang_config.json` + `weight_format == "mxtq"` + `model_type` (or `text_config.model_type`) ∈ `{mistral3, ministral3, laguna}`. Surfaces a friendly remediation message at the host layer before any vmlx loader runs.

When the JANGTQ port lands

Once the shim is ported in vmlx, simply remove the family from the `pendingJANGTQFamilies` set. No other host code change required.

Pin bump

vmlx-swift-lm `a196800` → `d32e135`.

Test coverage

9 new MC/DC-shaped tests in Tests/Service/ValidateJANGTQUnsupportedFamilyTests.swift:

- `mistral3` JANGTQ throws code-4
- `ministral3` (text_config) JANGTQ throws
- `laguna` JANGTQ throws
- `nemotron_h` / `qwen3_5_moe` / `minimax_m2` JANGTQ all PASS (shims exist)

Status

`osaurus/main` at `01a1194a`, 0 commits ahead

Test plan