
fix(preflight): reject JANGTQ Mistral 3 / Laguna before vmlx loads#993

Merged
jjang-ai merged 35 commits into main from fix/jangtq-mistral3-laguna-preflight on May 1, 2026

Conversation

@jjang-ai (Contributor) commented May 1, 2026

Summary

Two-layer fail-fast defense for a load-path issue identified post-PR-#967 merge: JANGTQ-quantized Mistral 3 family (incl. Mistral-Medium-3.5-128B-JANGTQ2) and Laguna bundles can't load through vmlx today because their model classes use vanilla MLXNN.Linear instead of a JANGTQ-aware shim.

Without this PR, a user installing those JANGTQ tiers gets either a weight-shape mismatch crash (.tq_packed shape != Linear.weight flat) or silently loaded garbage (codebook bytes treated as raw weights).

Layer 1 — engine-side (vmlx-swift-lm d32e135)

Both LLMModelFactory and VLMModelFactory mistral3 dispatch closures now peek weight_format BEFORE falling through to vanilla Mistral3TextModel / Mistral3VLM, throwing a clear error pointing at MXFP4 as the working alternative.
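
As a sketch of the guard's shape — assuming a simplified untyped config dictionary and an illustrative error type; the real vmlx-swift-lm closures decode typed configs and construct the actual models:

```swift
import Foundation

// Illustrative error; vmlx's real error type and message differ in detail.
struct UnsupportedJANGTQFamilyError: LocalizedError {
    let family: String
    var errorDescription: String? {
        "JANGTQ-quantized \(family) bundles are not yet supported by vmlx. "
            + "Use the MXFP4 tier of the same model instead."
    }
}

func mistral3Dispatch(config: [String: Any]) throws {
    // Peek weight_format BEFORE constructing the vanilla model so a JANGTQ
    // bundle fails with a clear message instead of a weight-shape crash.
    if (config["weight_format"] as? String) == "mxtq" {
        throw UnsupportedJANGTQFamilyError(family: "Mistral 3")
    }
    // ...fall through to the vanilla Mistral3TextModel / Mistral3VLM path.
}
```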

Layer 2 — osaurus host preflight (this PR)

validateJANGTQSidecarIfRequired now has a third check covering pending-JANGTQ families. It fires only when all three conditions hold: jang_config.json exists, weight_format == "mxtq", and model_type (or text_config.model_type) ∈ {mistral3, ministral3, laguna}. It surfaces a friendly remediation message at the host layer before any vmlx loader runs.
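
A minimal sketch of that check, assuming simplified JSON handling (the real validator's plumbing, error type, and message wording live in OsaurusCore):

```swift
import Foundation

// Families whose JANGTQ shim hasn't been ported to vmlx yet (per this PR).
let pendingJANGTQFamilies: Set<String> = ["mistral3", "ministral3", "laguna"]

struct PreflightError: Error {
    let code: Int
    let message: String
}

func checkPendingJANGTQFamily(bundleDir: URL) throws {
    func json(_ name: String) -> [String: Any]? {
        let url = bundleDir.appendingPathComponent(name)
        return (try? JSONSerialization.jsonObject(with: Data(contentsOf: url)))
            as? [String: Any]
    }

    // Conditions 1 + 2: jang_config.json exists AND stamps weight_format == "mxtq".
    guard let jang = json("jang_config.json"),
          (jang["weight_format"] as? String) == "mxtq" else { return }

    // Condition 3: model_type — or text_config.model_type for VLM wrappers —
    // names a pending family.
    guard let config = json("config.json") else { return }
    let textConfig = config["text_config"] as? [String: Any]
    let family = (config["model_type"] as? String)
        ?? (textConfig?["model_type"] as? String)

    if let family, pendingJANGTQFamilies.contains(family) {
        throw PreflightError(
            code: 4,
            message: "JANGTQ-quantized \(family) bundles can't load yet — "
                + "install the MXFP4 tier of this model instead.")
    }
}
```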

When the JANGTQ port lands

Once the shim is ported in vmlx, simply remove the family from the pendingJANGTQFamilies set. No other host code change is required.

Pin bump

vmlx-swift-lm a196800 → d32e135.

Test coverage

9 new MC/DC-shaped tests in Tests/Service/ValidateJANGTQUnsupportedFamilyTests.swift:

  • D1/D2 existing branches still fire
  • D3 outer mistral3 JANGTQ throws code-4
  • D3 inner ministral3 (text_config) JANGTQ throws
  • D3 laguna JANGTQ throws
  • Boundary: nemotron_h / qwen3_5_moe / minimax_m2 JANGTQ all PASS (shims exist)
  • Boundary: Mistral 3 family MXFP4 PASSES (only mxtq fires the gate)

Status

  • OsaurusCore: 1256 / 1256 (was 1247 + 9 new)
  • Base: osaurus/main at 01a1194a, 0 commits ahead

Test plan

  • OsaurusCore swift test passes (1256 / 1256)
  • vmlx focused tests pass at new pin (72 / 72)

…ore vmlx loads

Mistral 3 family (mistral3 / ministral3) and Laguna model classes in
vmlx-swift-lm currently use vanilla MLXNN.Linear. JANGTQ-quantized
bundles for those families ship `.tq_packed` + `.scales` tensors that
the vanilla Linear can't consume — without a JANGTQ-aware shim
(NemotronHJANGTQModel / MiniMaxJANGTQModel pattern), loading would
either crash on weight-shape mismatch or silently load codebook bytes
as raw weights and emit garbage.

Two-layer defense added:

  1. Vmlx (paired commit d32e135 on vmlx-swift-lm main): both
     LLMModelFactory and VLMModelFactory `mistral3` dispatch closures
     now peek `weight_format` BEFORE falling through to vanilla
     Mistral3TextModel / Mistral3VLM, throwing a clear
     "JANGTQ-quantized Mistral 3 family bundles are not yet supported"
     error pointing at MXFP4 as the working alternative.

  2. Osaurus host preflight (this commit): extends
     `validateJANGTQSidecarIfRequired` with a third check covering
     pending-JANGTQ families. Fires only when ALL of:
        - jang_config.json exists
        - weight_format == "mxtq"
        - config.json model_type (or text_config.model_type for VLM
          wrappers) ∈ {mistral3, ministral3, laguna}
     User gets a friendly remediation message at the host layer
     before any vmlx loader runs.

The MXFP4 quant tier of these same models loads correctly via the
standard MLX dequant path (layout matches mx.quantize), so the error
points users at MXFP4 as the working alternative. Once the JANGTQ
shim ports land in vmlx, simply remove the family from the host's
`pendingJANGTQFamilies` set — no other code change required.

Pin bump: vmlx-swift-lm a196800 → d32e135 carries the engine-side
fail-fast guard.

Test coverage: 9 new MC/DC-shaped tests in
Tests/Service/ValidateJANGTQUnsupportedFamilyTests.swift covering:
  - D1/D2 existing branches still fire (no jang_config; non-mxtq)
  - D3 outer mistral3 throws code 4 with MXFP4-pointer message
  - D3 inner ministral3 (text_config) throws
  - D3 laguna throws with laguna-named message
  - Boundary: nemotron_h / qwen3_5_moe / minimax_m2 JANGTQ all PASS
    (shims exist; new check must NOT trigger)
  - Boundary: Mistral 3 family MXFP4 PASSES (only mxtq fires the gate)

OsaurusCore: 1256 / 1256 tests in 167 suites (was 1247 / 166; +9 +1).
@github-actions bot added the bug label May 1, 2026
jjang-ai added 16 commits April 30, 2026 21:44
…primitive)

Pulls in vmlx@9df5c80 — new JANGTQDenseLinear primitive (Libraries/
MLXLMCommon/JANGTQDenseLinear.swift) that's the foundation for porting
JANGTQ-quantized Mistral 3 / Mistral 3.5 / Mistral 4 / Laguna bundles.

The existing TurboQuantSwitchLinear is MoE-shaped (n_experts dim,
gather indices); the Mistral 3 family quantizes the entire DENSE text
decoder per the mxtq_bits.text_decoder=2|4 profile — needs a different
shim shape. JANGTQDenseLinear matches the Python converter's
tq_quantize_weight output (2D shapes, no expert dim) and reuses the
existing JANGTQKernels.gatherTQ kernel via singleton-expert
degeneration.
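
A shape-level illustration of the degeneration trick, with plain Swift arrays standing in for the real MLXArray tensors (the stand-in kernel here operates on already-dequantized weights; the real gatherTQ consumes tq_packed + scales):

```swift
// Stand-in for the MoE-shaped gather kernel: pick one expert's weight matrix
// per token and apply it. expertWeights: [nExperts][out][in], input: [tokens][in].
func gatherMatmul(expertWeights: [[[Float]]], routing: [Int],
                  input: [[Float]]) -> [[Float]] {
    zip(input, routing).map { token, expert in
        expertWeights[expert].map { row in
            zip(row, token).reduce(0) { $0 + $1.0 * $1.1 }
        }
    }
}

// Singleton-expert degeneration: wrap the dense weight as a 1-expert stack and
// route every token to index 0 — the MoE kernel now computes a plain Linear.
func denseForward(weight: [[Float]], input: [[Float]]) -> [[Float]] {
    gatherMatmul(expertWeights: [weight],
                 routing: Array(repeating: 0, count: input.count),
                 input: input)
}
```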

7 structural tests covering construction contracts + Mistral 3.5
attention/MLP shape examples + parameter-key pathing + bit-width
packed_cols arithmetic.

End-to-end forward verification still gated on a real Mistral 3 family
JANGTQ bundle on disk — the host-side preflight in this PR continues
to fire until that verification phase completes (defense in depth).

OsaurusCore: 1256 / 1256 against new pin. No behavior change for
non-mxtq Mistral 3 paths.
Pulls in vmlx@cb829b6 — Mistral 3 family LLM JANGTQ port complete.

Vmlx-side (cb829b6):
  - Mistral3TextJANGTQModel: parallel architecture to Mistral3TextModel
    with JANGTQDenseLinear for attention Q/K/V/O + MLP gate/up/down
  - LLMModelFactory mistral3 closure peeks weight_format and routes
    mxtq → Mistral3TextJANGTQModel(config, bits, seed) reading
    mxtq_bits / mxtq_seed from merged jang_config.json
  - VLMModelFactory mistral3 closure: updated message reflects LLM port
    complete, VLM port still in-flight (Mistral3VLM with Pixtral
    needs paired JANGTQ inner LM)
  - 6 dispatch tests + 7 structural JANGTQDenseLinear tests upstream

Host-side preflight relaxation:
  - Removes blanket gate on `mistral3` / `ministral3` model_types
  - LLM-only Mistral 3 / 3.5 JANGTQ (no vision_config in config.json)
    now flows through to vmlx's Mistral3TextJANGTQModel — preflight
    PASSES, no error
  - VLM-shaped Mistral 3 family (vision_config present, e.g.
    Mistral-Medium-3.5-128B with Pixtral) STILL fires gate with a
    VLM-specific message until upstream VLM port lands
  - Laguna gate stays unchanged — separate engine port pending

Test coverage updated:
  - ValidateJANGTQUnsupportedFamilyTests now has 11 cases (was 9):
    * D3.mistral3 outer LLM-only PASSES (new)
    * D3.mistral3 + vision_config STILL throws with VLM-specific msg (new)
    * D3.ministral3 inner LLM-only PASSES (new)
    * D3.ministral3 inner + vision_config throws (was throws unconditionally)
    * D3.laguna unchanged — still throws
    * Boundary: nemotron_h / qwen3_5_moe / minimax_m2 unchanged

OsaurusCore: 1258 / 1258 (was 1256 + 2 new VLM-vs-LLM split tests).
Pulls in vmlx@7fa4940 — Mistral 3 family VLM JANGTQ port complete.

Vmlx-side (7fa4940):
  - Mistral3VLMJANGTQ: full JANGTQ inner LM (Mistral3JANGTQAttention/
    MLP/TransformerBlock/ModelInner/LanguageModel) + outer wrapper
    matching Mistral3VLM's contract. Pixtral vision tower stays
    vanilla per mxtq_bits.vision_tower=passthrough_fp16.
  - VLMModelFactory mistral3 closure: weight_format=mxtq → routes to
    Mistral3VLMJANGTQ(config, bits, seed) with bits + seed read from
    config.mxtqBits / config.mxtqSeed.
  - ToolCallFormat: laguna → .glm4, ministral3 → .mistral.
  - 10 new coverage tests (Mistral3LagunaCoverageTests).

Host preflight changes:
  - Drops the Mistral 3 family gate entirely. Both LLM-only AND
    VLM-shaped bundles flow through to vmlx now (no_vision →
    Mistral3TextJANGTQModel; vision_config present →
    Mistral3VLMJANGTQ).
  - Laguna gate stays — vmlx Laguna model class is the next port.
  - Tests updated: VLM Mistral 3 cases now PASS (used to throw).

OsaurusCore: 1258 / 1258 tests in 167 suites against new pin.
…o 344dda0

vmlx@344dda0 ships LagunaModel — full Poolside Laguna engine class
(40 hybrid layers, per-layer head count, dual RoPE, q_norm/k_norm,
sigmoid+correction-bias routing over 256 routed experts top-8 + shared,
per-layer mixed RotatingKVCache+KVCacheSimple). MXFP4 bundles load via
standard MLX dequant route immediately.

Host preflight third-check fully retired:
  - Mistral 3 family LLM (vmlx@cb829b6) ✓
  - Mistral 3 family VLM with Pixtral (vmlx@7fa4940) ✓
  - Laguna LLM (vmlx@344dda0) ✓
  - JANGTQ Linear shim for Laguna is the next incremental piece
    (LagunaJANGTQModel paralleling Mistral3TextJANGTQModel) — but no
    longer host-side gated; mislabeled bundles get caught by the
    existing forward/inverse sidecar checks above.

Tests updated:
  - D3.laguna case now PASSES (was throws)
  - All other boundary checks unchanged

OsaurusCore: 1258 / 1258 tests in 167 suites against new pin.
Pulls in vmlx@fe1b754 — exhaustive audit pass:
  - Mistral3TextJANGTQModel adds sanitize() override (drops
    self_attn.rotary_emb.inv_freq, .tq_bits scalars, lm_head.weight
    when tied; handles weight_scale_inv FP8 multiply-through;
    unwraps language_model. prefix from VLM-converted bundles)
  - Laguna sanitize() now also drops the same HF quirks
  - Tests/MLXLMTests/InterleavedReasoningLeakTests.swift (8 new
    tests): empty think block, token-split opener/closer, multi-turn
    cross-turn isolation, mid-think truncation flushes as reasoning,
    stray closer no-channel-open, family stamp matrix lock

OsaurusCore: 1258 / 1258.
Closes the Laguna JANGTQ port. Vmlx@3a422d7 ships LagunaJANGTQModel
mirroring LagunaModel exactly except every dense Linear in attention
+ MLP + MoE expert + shared expert is replaced with JANGTQDenseLinear.
LLMModelFactory's `laguna` closure now peeks weight_format and routes
mxtq → LagunaJANGTQModel.

Engine status: every JANGTQ family vmlx ships now has a paired JANGTQ
class. NemotronH, Qwen3.5-MoE/Holo3, MiniMax M2, DSV3/Kimi K2, DSV4,
Mistral 3 LLM + VLM, Laguna LLM. Laguna VLM doesn't exist (Laguna is
text-only).

OsaurusCore: 1258 / 1258.
`scanLocalModels` previously walked exactly two levels (`<root>/<org>/<repo>/`),
the HuggingFace shape. Bundles laid out flat (`<root>/<modelDir>/`) — common
when users sync via rsync or place models on an external drive — were treated
as "orgs", and their child files (config.json, *.safetensors) failed the
directory check, so nothing was detected. Affected real-world bundles include
Nemotron-3-Nano-Omni-30B-A3B-MXFP4, MiniMax-M2.7-JANGTQ4, Kimi-K2.6-*,
DeepSeek-V4-Flash-* and Qwen3.6-35B-A3B-JANGTQ4.

Now each top-level entry is first checked as a model bundle (config.json +
recognised tokenizer + at least one safetensors file). If yes it's registered
with id = directory basename; otherwise we fall back to the existing nested
descent. Both layouts may coexist under the same root.
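
A sketch of the bundle probe, assuming an illustrative recognised-tokenizer list (the real scanner's file checks may differ):

```swift
import Foundation

func looksLikeModelBundle(_ dir: URL) -> Bool {
    guard let names = try? FileManager.default.contentsOfDirectory(atPath: dir.path)
    else { return false }
    let hasConfig = names.contains("config.json")
    // Recognised tokenizer shapes — illustrative subset.
    let hasTokenizer = names.contains {
        $0 == "tokenizer.json" || $0 == "tokenizer.model" || $0 == "tokenizer_config.json"
    }
    let hasWeights = names.contains { $0.hasSuffix(".safetensors") }
    return hasConfig && hasTokenizer && hasWeights
}
```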

`mergeAvailable` now also dedupes by repo tail so a flat `Nemotron-3-...`
entry doesn't surface alongside the curated `OsaurusAI/Nemotron-3-...` one.
The downloaded copy wins on tail collision.
…that needs it

When a JANGTQ bundle's `jang_config.json` declares `weight_format: "mxtq"` but
`jangtq_runtime.safetensors` is absent, vmlx aborts on the first forward pass
because TurboQuantSwitchLinear's runtime cache is empty. Users who synced a
bundle without the sidecar (rsync excluded it, partial download, etc.) hit a
load-time error with no clear remediation.

`ensureJANGTQSidecar` wraps the existing sync validator and, on the FORWARD-
mismatch failure (and only that — code 2 from `validateJANGTQSidecarIfRequired`),
attempts a one-shot download of `jangtq_runtime.safetensors` from
`https://huggingface.co/<modelId>/resolve/main/jangtq_runtime.safetensors`,
then re-runs validation. The URL is built dynamically from the model id via
`ModelDownloadService.resolveURL` so it always points at the right repo.

Hard guarantees (covered by tests):
  * Sidecar already present → no fetch
  * Vanilla (no jang_config.json) model → no fetch
  * Stamp says non-mxtq (vanilla or inverse-mislabeled) → no fetch, original
    error propagates unchanged (code 3 still surfaces for inverse mismatch)
  * Forward mismatch but flat-layout id (no `/`) → no fetch (no canonical HF
    mapping), original code-2 error surfaces
  * Forward mismatch + canonical `<org>/<repo>` id → fetcher fires exactly once
    with the right URL; if it succeeds the validator re-runs and the load
    proceeds; if it fails the error is wrapped as code 4 with the URL we tried

The fetcher does an atomic temp → rename and rejects 0-byte responses so a
crashed/cancelled fetch never leaves a partial sidecar that the next preflight
would silently accept.
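
The control flow, sketched with injected validate/fetch hooks and a simplified code-carrying error (the real implementation lives in ModelRuntime and performs the atomic install described above):

```swift
import Foundation

struct SidecarError: Error {
    let code: Int
    let triedURL: URL?
}

func ensureJANGTQSidecar(
    modelId: String,
    validate: () throws -> Void,
    fetchSidecar: (URL) throws -> Void
) throws {
    do {
        try validate()
    } catch let error as SidecarError where error.code == 2 && modelId.contains("/") {
        // Forward mismatch on a canonical <org>/<repo> id: one-shot fetch of
        // jangtq_runtime.safetensors from HF, then re-validate. Every other
        // failure mode (codes 3+, flat-layout ids) propagates unchanged.
        let url = URL(string:
            "https://huggingface.co/\(modelId)/resolve/main/jangtq_runtime.safetensors")!
        do {
            try fetchSidecar(url)
        } catch {
            throw SidecarError(code: 4, triedURL: url) // wraps the URL we tried
        }
        try validate()
    }
}
```
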
Pattern + character edge cases that could have made the auto-fetch trigger
on bogus input or skip a real JANGTQ bundle:

  * weight_format matching is now case + whitespace insensitive.  Every
    stamp variant we've seen in the wild (`MXTQ`, ` mxtq `, `\tmxtq`,
    `mXtQ`, `mxtq\n`) is recognised as JANGTQ; lookalikes (`mx_tq`,
    `mxtq2`, `mxq`, `mxfp4`, etc.) are not.  Covers every JANGTQ family —
    Qwen, MiniMax, DSV4, Nemotron, Mistral 3 (LLM + VLM), Laguna.

  * `isValidHFRepoId` strictly validates the model id BEFORE we build any
    URL or hit the network.  Required shape: exactly two non-empty
    segments of [A-Za-z0-9._-], 1..96 chars each, no leading or trailing
    slash, no whitespace anywhere.  Rejected up front: empty string, bare
    `/`, leading `/foo`, trailing `foo/`, no-slash flat ids, multi-slash
    `a/b/c`, empty middle `a//b`, whitespace, URL meta (`?`, `#`, `&`,
    `;`, `:`, `@`), path traversal (`..`), backslashes, quotes, control
    characters, non-ASCII, BOM, and segments > 96 chars. (See the sketch after this list.)

  * The resolved URL is double-checked to be `https://huggingface.co/...`
    before we trust it.

  * Cross-volume safe install: `URLSession.download` writes its temp file
    to the system temp dir which is almost always on a different volume
    than the model bundle (e.g. user models on `/Volumes/EricsLLMDrive/`).
    `moveItem` would fail with EXDEV.  Now we copy into a sibling temp
    file in the bundle's own directory and use `replaceItemAt`, which
    handles "dest already exists" cleanly and stays atomic on the same
    volume.

  * Race tolerance: if a concurrent writer produced the sidecar between
    our HEAD-of-validate and our install step, we accept their copy
    instead of throwing.

  * Test injection moved from a `nonisolated(unsafe)` global to a
    `@TaskLocal`, so parallel test cases can each scope their own fetcher
    override via `withValue` without racing each other (and without
    hitting the real network when a parallel case clears the global mid-
    flight, which was producing 401s in a prior run).

  * `mergeAvailable` tail-dedup now properly removes the curated entry
    from `suggestedModels` when a flat-layout local entry replaces it,
    instead of leaving both in different lists where the UI rendered them
    as duplicates.
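
Sketches of the first two gates above, assuming the rules as quoted (the shipped character-set and length checks may be implemented differently):

```swift
import Foundation

// Case + whitespace insensitive stamp match: "MXTQ", " mxtq ", "\tmxtq",
// "mxtq\n" all count as JANGTQ; "mx_tq", "mxtq2", "mxfp4" do not.
func isMXTQStamp(_ raw: String) -> Bool {
    raw.trimmingCharacters(in: .whitespacesAndNewlines).lowercased() == "mxtq"
}

// Exactly two non-empty [A-Za-z0-9._-]{1,96} segments; "." / ".." segments
// rejected to close the path-traversal-shaped gap.
func isValidHFRepoId(_ id: String) -> Bool {
    let segments = id.components(separatedBy: "/")
    guard segments.count == 2 else { return false }
    let allowed = CharacterSet(charactersIn:
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789._-")
    return segments.allSatisfy { seg in
        (1...96).contains(seg.count)
            && seg != "." && seg != ".."
            && seg.unicodeScalars.allSatisfy { allowed.contains($0) }
    }
}
```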

Tests added:
  * 14 `isValidHFRepoId` cases covering the accept/reject matrix above.
  * 8 stamp-variant cases proving every casing/whitespace form trips the
    auto-fetch.
  * 12 non-mxtq stamp cases proving lookalikes never trip it.
  * 13 malformed-id cases proving the network is never hit on bad input.
  * One race-tolerance test proving concurrent writers don't break us.

26 new tests; all 29 ModelRuntime + ModelManager tests pass green.
vmlx pin 3a422d7 → 2ff7a23: picks up the permissive rope_parameters
decode so Laguna mxfp4 / JANGTQ bundles no longer fail at config decode
("Type mismatch at rope_parameters.original_max_position_embeddings").

Flat-layout local ids (e.g. `MiniMax-M2.7-Small-JANGTQ`,
`Nemotron-3-Nano-Omni-30B-A3B-JANGTQ4`) now resolve via a known JANGTQ
publisher org allowlist when the user is missing the sidecar. Ordered
attempts: JANGQ-AI → OsaurusAI → mlx-community. Each candidate is
independently revalidated against `isValidHFRepoId` so illegal chars in
the basename (whitespace, non-ASCII, URL meta) still abort up-front
without hitting the network. `..` / `.` segments are now also rejected
to close a path-traversal-shaped gap in the id gate.

The error message lists every URL we tried so the user sees exactly
where the sidecar should live. Tests cover:

  * 15 real-world bundle names (Nemotron-3 family, Holo3, Laguna,
    Mistral 3.5, DeepSeek-V4-Flash, Kimi-K2.6, MiniMax-M2.7, Qwen3) —
    each must produce OsaurusAI/<name>, JANGQ-AI/<name>, and
    mlx-community/<name> as valid candidates
  * Canonical `org/repo` ids return only themselves
  * Flat-id ordering: JANGQ-AI, OsaurusAI, mlx-community
  * Empty / multi-slash / `..` / `.` / illegal-char ids → no candidates
  * Flat bundle resolves via OsaurusAI fallback when JANGQ-AI 404s
  * Nested OsaurusAI/<repo> hits exactly one URL (no redundant fallback)
  * "All candidates 404" surfaces every tried URL in the error message

34 JANGTQ-related tests pass across 6 suites.
Companion to vmlx's MultiTurnFamilyMatrixTests. The engine suite locks
parser/dispatch invariants; this suite locks the osaurus translation
layer: model_id → media-capability → composer drag-drop → which
MessageContentParts even reach the engine.

Sections:
  A. Capability detection by model_id (pre-load fast path) — 8 omni
     bundle ids, 6 imageVideo, 11 imageOnly, 17 textOnly (incl. flat-
     layout local ids), 6 degenerate-id fallbacks
  B. Capability detection by bundle directory (post-load refined) —
     config_omni.json sidecar trumps model_type, vision_config + 7
     video-capable model_types → imageVideo, vision_config + 12 image-
     only model_types → imageOnly, 13 dense LLM model_types → textOnly,
     unreadable config falls back to model_id matcher
  C. Multi-turn capability stability — 5-turn alternating model
     switch (omni → text → VL → text → omni) returns expected per-turn
     capability with no aliasing; repeated calls deterministic
  D. Drag-drop accept matrix — text-only rejects all 3 modalities,
     image-only accepts image, imageVideo rejects audio, omni accepts
     all 3; 7-turn drag-drop sequence covering switch + accept/reject
     per turn
  E. End-to-end matrix — 29 (model_id, modality, expected) rows across
     every shipping family (Nemotron-3 omni, Qwen 3 VL, Qwen 2.5 VL,
     SmolVLM2, Mistral-Medium-3.5, Paligemma, Idefics3, FastVLM,
     Pixtral, Holo3 dense, Laguna, MiniMax, DeepSeek-V4, Kimi K2.6,
     Qwen3.5/3.6 dense)

18 tests across 5 suites, all green. 1311 total osaurus tests pass,
no regressions. Pin bumped to vmlx 38c93f9 which adds the engine-side
companion (44 tests, 10 suites).
…er reach next prompt

Live audit triggered by user report that MiniMax JANGTQ loops after a
stop+continue. Static analysis ruled out every state-leak path:

  * osaurus chat UI builds priorUserMessages from session.turns,
    filtering t.role == .user — assistant turns never enter the next
    prompt regardless of abort state (ChatView.swift:1288)
  * BatchEngine.finishSlot(reason: .cancelled) explicitly skips
    coordinator.storeAfterGeneration — the disk/memory cache never
    sees a partial entry from a cancelled run (BatchEngine.swift:1384)
  * Per-request BatchSlot allocates fresh KV cache via
    model.newCache(...) (BatchEngine.swift:610) and a fresh
    sampler + PenaltyProcessor + RepetitionContext
    (BatchScheduler.swift:148-149) — no per-model mutable state
    survives across requests
  * ReasoningParser is allocated per-stream via forPrompt(stampName:
    promptTail:) and dies with the stream — no parser state crosses
    requests
  * JANGTQRuntimeCache is a singleton with NSLock-protected
    dictionaries; signs/codebook are immutable after loadSidecar
    and not mutated during forward
  * TurboQuantSwitchLinear / TurboQuantSwitchGLU are pure compute over
    @ParameterInfo (read-only Module params) and runtime-cache-resident
    codebook arrays — no state path could be corrupted by an aborted
    forward pass
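
The contract itself is one line of filtering; a sketch with simplified turn types (the real code is the ChatView.swift:1288 site cited above):

```swift
enum Role { case user, assistant }
struct Turn {
    let role: Role
    let text: String
}

// Only user turns reach the next prompt. Aborted assistant turns — partial
// <think> reasoning or partial visible content — are dropped by construction,
// and empty user turns are filtered to avoid a degenerate user→user shape.
func priorUserMessages(from turns: [Turn]) -> [String] {
    turns.filter { $0.role == .user && !$0.text.isEmpty }.map(\.text)
}
```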

This test file codifies the chat-history contract so any future drift
that reintroduces assistant content into the prompt is caught:

  - aborted assistant turn with partial <think> reasoning → dropped
  - aborted assistant turn with partial visible content → dropped
  - back-to-back aborts across turns → only user messages survive
  - empty user turn filtered out (no degenerate user→user shape)
  - user-typed <think> literal preserved verbatim (intent-respecting)
  - explicit MiniMax stop-mid-think → continue scenario

6 tests, all green. The remaining viable looping causes are
mlx-swift Metal command-encoder coalescing race (engine-cancel
mid-encode → next request inherits stale buffer) or deterministic
sampling on temp=0 producing similar openings on similar prompts —
both outside this PR's scope.
… contracts

User suspected stop-and-continue looping was caused by generation
config / tools / skills state leak across turns. Audited every path:

GENERATION CONFIG (LocalGenerationDefaults):
  * Cache keyed by lowercased modelId; multi-turn calls return
    byte-identical Defaults
  * jang_config primary + generation_config fallback merge precedence
    is stable across N invocations (5 iterations × identical output)
  * repetition_penalty=1.0 round-trips verbatim (vmlx engine treats
    1.0 as no-op via cf8c525, so deterministic sampling)
  * Concurrent parse from 32 tasks yields identical results (no
    shared mutable state)

TOOLS + SKILLS HISTORY FILTER:
  * Aborted assistant turn with pendingToolName / pendingToolArgs is
    filtered from next prompt (the user-only role check works on every
    assistant variant, including tool-shaped state)
  * Completed assistant->tool->assistant exchange is entirely filtered
    out of the NEXT user query — only the user message survives
    across turns. The tool exchange happens within ONE user query and
    never crosses into history visible to a later turn.
  * Aborted post-tool answer (tool succeeded, assistant content was
    streaming, user clicked Stop) — content is dropped from history
  * Skill toggle (on/off/on across 3 turns) doesn't leak skill content
    into prior-turn history; skills are a per-turn system-prompt
    concern, not a history concern

STOP-AND-CONTINUE DETERMINISM:
  * Whether the aborted turn was in reasoning mode, content mode, or
    tool-call mode produces IDENTICAL history shape for the next
    request. The model has no way to tell what mode the prior turn
    was in because the assistant turn never appears in the prompt.

9 new tests across 3 suites, all green. Combined with the existing
27 LocalGenerationDefaultsTests, the multi-turn behavior is provably
deterministic regardless of abort timing or tool/skill state.
…eorder OsaurusAI first

User report: 'jangq-ai/MiniMax-M2.7-Small-JANGTQ' (lowercased upstream)
401s because HF orgs are case-sensitive and 'jangq-ai' isn't the same
as 'JANGQ-AI'. Auto-fetch only tried the verbatim id and never the
canonical-cased fallback.

Fix: even when the supplied modelId is a valid '<org>/<repo>', we now
ALWAYS append canonical-cased variants of the basename. Try the
verbatim id FIRST (custom orgs that genuinely ship at that exact
path), then OsaurusAI/<basename> (curated publisher — most user-
facing JANGTQ + MXFP4 bundles ship here), then JANGQ-AI/<basename>,
then mlx-community/<basename>.

This recovers from BOTH:
  * case-mismatch (jangq-ai → JANGQ-AI / OsaurusAI)
  * wrong-org-guess (user thinks bundle is on JANGQ-AI but it actually
    lives under OsaurusAI)

Tightening: malformed shapes (multi-slash, leading slash, etc.) still
produce zero candidates — basename extraction is only trusted when the
id is either valid <org>/<repo> or pure flat (no slash).

Reorder: OsaurusAI now FIRST in the priority list (was JANGQ-AI). The
curated org ships the bulk of user-facing bundles; trying it first
fixes most missing-sidecar cases on the first network hit.
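
The resulting candidate order, sketched with the id gate injected (reusing the isValidHFRepoId shape sketched earlier in this PR; the helper name is illustrative):

```swift
func sidecarRepoCandidates(
    for modelId: String,
    isValidHFRepoId: (String) -> Bool
) -> [String] {
    let fallbackOrgs = ["OsaurusAI", "JANGQ-AI", "mlx-community"]
    var candidates: [String] = []
    let basename: String
    if isValidHFRepoId(modelId) {
        candidates.append(modelId)                      // verbatim id FIRST
        basename = String(modelId.split(separator: "/")[1])
    } else if !modelId.contains("/") {
        basename = modelId                              // pure flat id
    } else {
        return []                                       // malformed: zero candidates
    }
    for org in fallbackOrgs {
        let candidate = "\(org)/\(basename)"
        // Each candidate is independently re-validated before any network hit.
        if isValidHFRepoId(candidate), !candidates.contains(candidate) {
            candidates.append(candidate)
        }
    }
    return candidates
}
```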

33 tests across 5 suites, all green. New test
'lowercasedOrgIdRecoveresToCanonicalCasing' covers the exact user
regression.
…t dir finds bundles in nested HF-style trees

User's drive layout:
  /Volumes/EricsLLMDrive/dealignai/<flat-bundle>/...
  /Volumes/EricsLLMDrive/jangq-ai/<org>/<repo>/...

Pointing the osaurus models picker at /Volumes/EricsLLMDrive/ would
find dealignai's bundles (2-level: <root>/dealignai/<flat>) but MISS
jangq-ai's bundles (3-level: <root>/jangq-ai/<org>/<repo>) because the
old scanner only walked exactly 2 levels.

Fix: replace the two separate flat+nested loops with a single
recursive scanDir(maxDepth:) that handles all three layouts:
  1. Flat:      <root>/<modelDir>/{config.json,...}
  2. Nested:    <root>/<org>/<repo>/{config.json,...}
  3. Multi-org: <root>/<parentOrg>/<org>/<repo>/{config.json,...}

The id is built from the path components joined by /, so a 3-level
discover produces e.g. 'jangq-ai/JANGQ-AI/Laguna-XS.2-JANGTQ' which
MLXModel.localDirectory will round-trip back to the right on-disk path.

Bounded depth=3 keeps the scan from descending into model bundles'
own subdirectories (e.g. shard caches, tokenizer state). A directory
that IS a model bundle stops descent at that level.

No duplicate / zombie code: the recursive scanDir replaces both the
flat-detection branch and the nested-descent branch — single function,
single source of truth.
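
A sketch of the recursive walk, with the bundle probe and registration injected as closures (the real scanDir's signature and id construction live in ModelManager.swift):

```swift
import Foundation

func scanDir(
    _ dir: URL,
    idComponents: [String] = [],
    maxDepth: Int = 3,
    isBundle: (URL) -> Bool,
    register: (String, URL) -> Void
) {
    guard maxDepth > 0,
          let children = try? FileManager.default.contentsOfDirectory(
              at: dir, includingPropertiesForKeys: [.isDirectoryKey])
    else { return }
    for child in children {
        guard (try? child.resourceValues(forKeys: [.isDirectoryKey]))?
            .isDirectory == true else { continue }
        let components = idComponents + [child.lastPathComponent]
        if isBundle(child) {
            // A directory that IS a model bundle stops descent at this level;
            // the id round-trips via path components joined by "/".
            register(components.joined(separator: "/"), child)
        } else {
            scanDir(child, idComponents: components, maxDepth: maxDepth - 1,
                    isBundle: isBundle, register: register)
        }
    }
}
```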

Tests:
  * scanLocalModels_detectsThreeLevelMultiOrgLayout (new) — verifies
    the user's exact drive shape produces the right ids
  * scanLocalModels_detectsFlatAndNestedLayouts (existing) — still
    green, no regression on 1- and 2-level layouts

Full osaurus regression: 1328 tests across 182 suites all green.
…ough fix)

Picks up the LLM factory fix that refuses bundles with vision_config
on the mistral3 route, so VLM bundles fall through to VLMModelFactory's
mistral3 route which handles the language_model + multi_modal_projector
+ vision_tower keys correctly via Mistral3VLMJANGTQ.
@mimeding (Contributor) commented May 1, 2026

CI note from the current PR sweep: the latest test-core run failed before any tests ran, with EventSource module-resolution errors rather than PR-specific assertion failures. The first raw errors are missing CAsyncHTTPClient, CNIOLLHTTP, CNIOExtrasZlib, CNIOPosix, and _NumericsShims in target EventSource.

This matches the fallback DerivedData cache failure class we just fixed on the maintainer-owned #974 branch by wiping restored DerivedData when Actions restores from a broad fallback key rather than an exact source-hash key. A maintainer/admin rerun of the failed job should also take the existing cold-build path (github.run_attempt != 1) and may clear it; otherwise rebase after the CI hardening lands.

@mimeding (Contributor) commented May 1, 2026

Local verification on the current head passed the focused core coverage for this PR:

swift test --package-path Packages/OsaurusCore --filter ValidateJANGTQUnsupportedFamilyTests

Result: 11 tests passed. I also attempted the wider swift test --package-path Packages/OsaurusCore; it built and entered the full suite locally, then the local Swift testing helper went idle and had to be stopped, so I do not want to mark that as a completed full local pass.

The failing GitHub test-core job appears to have failed before running tests due to SwiftPM/Xcode C-module dependency resolution errors (CAsyncHTTPClient, CNIOLLHTTP, CNIOExtrasZlib, CNIOPosix, _NumericsShims, target EventSource). This matches the runner/dependency-resolution failure pattern we saw on PR #985 before rerun.

@tpae could you ask someone with repo permissions to rerun test-core on PR #993?

jjang-ai added 10 commits May 1, 2026 08:31
…-vl suffix needed)

Real bundle config for OsaurusAI/Holo3-35B-A3B-mxfp4: outer
model_type=qwen3_5_moe WITH vision_config + pixtral image
preprocessor. The bundle is image+video capable but the model_id-only
matcher required '-vl' in the name to flag it. Without that flag the
chat composer's drag-drop UI rejected images on Holo3 even though the
engine is fully wired for them (post-load directory check would
catch it via vision_config but pre-load UI gating was wrong).

Add a Holo3 family pattern that matches the bundle name regardless of
'-vl' suffix. Existing 'qwen3.5-6.*-vl' / 'holo3.*-vl' regex stays for
explicit VL-named bundles; the new pattern covers Holo3 base.

Tests updated: Holo3-35B-A3B-mxfp4 moved from textOnly to imageVideo
end-to-end matrix; new comment explains the bundle topology so future
audits don't re-introduce the regression.

11 capability tests green.
User report: Qwen3.6 27B MXFP4 emits degenerate '!!!!!!!!!' spam in
the thinking channel on a simple prompt. Community issue #995 reports
the same shape: degraded output across multiple models in recent
builds, model previously worked fine.

Root cause: cfa1ceb ('Enabled TurboQuant by default') set the
CacheCoordinatorConfig.defaultKVMode to .turboQuant(keyBits: 3,
valueBits: 3). Per resolveKVPolicy in vmlx CacheCoordinatorConfig,
that mode is applied to EVERY slot whose request submits
'kvMode: .none' regardless of prompt length — only the SIZE cap
(defaultMaxKVSize) is prompt-length gated, not the mode itself.

3-bit KV quantization is aggressive enough to corrupt attention on
small reasoning models (especially when thinking is on, where the
'attention soup' inside <think> needs precision the most). The
sequence is:
  - First few tokens decode okay
  - Attention error compounds across decode steps because each step
    reads quantized K/V from prior steps
  - Output collapses into single-token repetition

Fix: default to .none (fp16). Memory-constrained users can opt into
.turboQuant explicitly per request via GenerateParameters.kvMode.

This is a strict quality-over-memory tradeoff. Smaller models with
short prompts get full attention quality back; long-context users
who care about memory still have the knob via per-request override.
The defaultMaxKVSize: 8192 + longPromptMultiplier: 2.0 size cap stays
in place for prompts > 16k tokens (rotating window kicks in for
ultra-long context regardless of mode).
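
The shape of the knob, sketched with the enum cases named in this trail (the real CacheCoordinatorConfig carries more fields):

```swift
enum KVMode {
    case none                                      // fp16 K/V — full quality
    case turboQuant(keyBits: Int, valueBits: Int)  // codebook-quantized K/V
}

struct CacheCoordinatorConfig {
    // Quality-over-memory default: fp16 KV everywhere. Memory-constrained
    // clients opt in per request (GenerateParameters.kvMode) instead of
    // inheriting a lossy global mode.
    var defaultKVMode: KVMode = .none
    var defaultMaxKVSize: Int = 8_192     // size cap stays prompt-length gated
    var longPromptMultiplier: Double = 2.0
}
```
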
jjang-ai added 6 commits May 1, 2026 11:54
…3VLM model.* sanitize)

Picks up the vmlx commits pushed to origin/main today:

- 1135950 fix(laguna): three parity bugs causing garbage-token saturation
  (sigmoid → softplus per-head gate, un-biased routing weights for top-k
  scoring, plain → YaRN-scaled RoPE on full-attention layers)
- babbe34 fix(mistral3vlm): strip leading model. on vision_tower keys
- 0fab91c fix(mistral3vlm): strip model. prefix on multi_modal_projector keys

Also fixes a stale MC/DC test: 0a14145 reclassified Holo3 base bundles
(no -vl suffix) as image+video, so the "all dense LLMs → .textOnly"
master-FALSE assertion can no longer include Holo3-35B-A3B-JANGTQ.
…ixes)

Picks up vmlx commit 1173822, which fixes two structural bugs causing
the user-reported "first turn fine, later turns garbage" symptom on
every JANGTQ/MXFP4 model that hit the paged cache:

  Bug A — TurboQuantKVCache paged-cache restore compounds quantization
    The paged-tier restore previously routed through `tq.state =
    [keys, values]`, which transitioned the cache back to .fill phase
    with an already-decompressed lossy float as the new prefill. The
    next threshold cross then re-compressed that lossy float,
    compounding quantization error per turn — the exact symptom on
    Qwen3.6 MXFP4 / Mistral 3.5 / every JANGTQ model. Now uses the new
    `restoreFromDecodedKV` method that seats the decoded float
    DIRECTLY as the compressed-phase prefix without an encode/decode
    round trip.

  Bug B — hybrid models silently ignore CacheCoordinator.defaultMaxKVSize
    `Qwen35.newCache`, `NemotronH.newCache`, and `Qwen3Next.newCache`
    returned plain `KVCacheSimple()` for every attention slot, never
    reading `parameters?.maxKVSize`. The CacheCoordinator's
    `defaultMaxKVSize` contract writes the bound into
    `parameters.maxKVSize` at admission, but for the hybrid family
    that field was write-only. Long-context Qwen3.5/3.6/Cascade-2
    prompts therefore could not be capped via the coordinator. Now
    routes attention slots through `RotatingKVCache(maxSize:, keep: 4)`
    when the bound is set — same pattern Llama/Mistral already use.
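
The Bug B fix shape, sketched with simplified cache types (the real KVCacheSimple / RotatingKVCache classes carry state; parameter names follow the commit text):

```swift
protocol KVCache {}
struct KVCacheSimple: KVCache {}
struct RotatingKVCache: KVCache {
    let maxSize: Int
    let keep: Int
}

struct CacheParameters {
    var maxKVSize: Int?
}

// Before: attention slots always got KVCacheSimple(), leaving the
// coordinator-written maxKVSize bound write-only. After: honor the bound.
func newAttentionCache(parameters: CacheParameters?) -> KVCache {
    if let bound = parameters?.maxKVSize {
        return RotatingKVCache(maxSize: bound, keep: 4)
    }
    return KVCacheSimple()
}
```
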
… Qwen35JANGTQ maxKVSize + tied-embeddings hardenings)

Picks up two more vmlx commits on top of 1173822, both from the vmlx
team's real-load verification pass against six bundles on the user's
external drive (Laguna-XS.2, NemotronH-Omni-30B, Mistral-Medium-3.5-128B,
Qwen3.6-27B MXFP4, Qwen3.6-35B-A3B JANGTQ4, MiniMax-M2.7-Small JANGTQ):

  3f8a5e9 fix(cache,sanitize): hybrid Qwen35JANGTQ maxKVSize + tied-embeddings hardenings
    - Qwen35JANGTQ.newCache now honors parameters?.maxKVSize (parity
      with the 1173822 fix for Qwen35/Qwen3Next/NemotronH).
    - Mistral3VLM, NemotronH, and Mistral3VLMJANGTQ explicitly drop
      redundant lm_head.{weight,scales,biases} when text_config.tie_word_embeddings
      is true.

  0d85e9d fix(batchengine): defensive EOS widening for common end-of-turn tokens
    - BatchEngine init probes 7 common end-of-turn special tokens
      (<|im_end|>, <|end|>, <|endoftext|>, <|eot_id|>, <|end_of_turn|>,
      <|/s|>, </s>) against the tokenizer vocab and adds any present
      ones to the EOS set. Closes the Qwen3.6 MXFP4 <|im_end|> leak
      where the tokenizer config only listed <|endoftext|> as the
      primary eos_token but the chat template terminated turns on
      <|im_end|>.
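
The probe reduces to a vocab lookup over a fixed token list; a sketch, assuming a minimal tokenizer interface (the real BatchEngine reads the loaded tokenizer's vocab directly):

```swift
let commonEndOfTurnTokens = [
    "<|im_end|>", "<|end|>", "<|endoftext|>", "<|eot_id|>",
    "<|end_of_turn|>", "<|/s|>", "</s>",
]

// Any end-of-turn special token actually present in the vocab is treated as
// EOS, even when the tokenizer config names only one primary eos_token.
func widenEOSSet(eosTokenIds: Set<Int>, vocabLookup: (String) -> Int?) -> Set<Int> {
    var widened = eosTokenIds
    for token in commonEndOfTurnTokens {
        if let id = vocabLookup(token) { widened.insert(id) }
    }
    return widened
}
```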

Verified on the user's external drive (per vmlx team report):
  ✅ Laguna-XS.2-JANGTQ — coherent
  ✅ NemotronH-Omni 30B — all 11 multi-turn rows pass
  ✅ Mistral-Medium-3.5-128B JANGTQ — load 13s, valid multilingual
  ✅ Qwen3.6-27B MXFP4 — 3-turn cache reuse, no <|im_end|> leak
  ✅ Qwen3.6-35B-A3B JANGTQ4 — coherent across 3 turns
  ✅ MiniMax-M2.7-Small JANGTQ — memory preserved across turns

Test suite remains green: 1328 tests in 182 suites pass.
Picks up vmlx commits 0756dc0 (close trim-path Metal lifecycle crash on
full disk-cache hit), 71065ca (Laguna chat template fallback + bridge
sniff), and 0e22eba (the upstream's authoritative integration guide for
this pin).

Re-enables `enableDiskCache = diskDirUsable` in
`buildCacheCoordinatorConfig`. The
`notifyExternalReferencesNonZeroOnDealloc` Metal crash that motivated
the temporary `enableDiskCache = false` guard is fixed in 0756dc0:
the trimmed compiled-cache list is now forced to realize before its
underlying Metal buffers go out of scope on the
`Cache disk hit … prefilling 0 remaining` path. The
`eval_http_stability.py` suite remains the regression check; re-run on
any future pin bump that touches the CacheCoordinator restore path.

Test suite remains green: 1328/1328 pass.
…ok KV)

Per the vmlx-swift-lm 2026-05-01 integration guide
(OSAURUS-INTEGRATION-2026-05-01.md §"3-bit KV verdict"), 4-bit codebook
KV is now the recommended default. The earlier `.turboQuant(3, 3)`
default was reverted to `.none` in commit e202cbb after the
`!!!!!!!!!` repetition spam reported in community issue #995.

The root cause was not the bit width itself but vmlx's
`TurboQuantKVCache` paged-restore path compounding quantization across
multi-turn handoff: the prior `tq.state = [keys, values]` path
transitioned the cache back to `.fill` phase with already-decoded lossy
float as the new prefill, then re-quantized at the next threshold cross
— compounding error per turn (the "first turn fine, later turns
garbage" symptom).

That cross-turn handoff bug was fixed in vmlx commit `1173822`
(`restoreFromDecodedKV` keeps the prefix in `.compressed` phase
without round-tripping). With the bug fixed, 4-bit KV is real-bundle-
verified coherent across multi-turn paths on Qwen3.6 27B MXFP4,
Qwen3.6 35B-A3B JANGTQ4, MiniMax M2.7 Small, Laguna XS.2,
Mistral-Medium-3.5-128B, and NemotronH-Omni 30B. 3-bit is also safe
post-`1173822` but more error-sensitive and gains less compression
benefit, so 4-bit stays the default.

Per-request `kvMode` still overrides; clients that want fp16 KV can
submit `kvMode: .none` explicitly.
Picks up two vmlx commits since 0e22eba:
  - 0c36d01 fix(jinja): pin osaurus-ai/swift-jinja fork with for-iterable
            parser fix
  - 405bdc6 docs(osaurus): expand integration guide with full sweep results

The swift-jinja fork (osaurus-ai/Jinja@58d21aa5) lifts the for-loop
iterable from "factor + |filter only" to "full binary + comparison +
logical hierarchy, excluding ternary" via a one-line `parseFilter()` →
`parseOr()` change in `Sources/Jinja/Parser.swift:186`. This unblocks
chat templates with iterable expressions like
`{% for message in loop_messages + [{'role': '__sentinel__'}] %}` —
present in Mistral 3.5's native chat template (line 72).

Add `osaurus-ai/swift-jinja@58d21aa5` directly to OsaurusCore/Package.swift
so OsaurusCore-level tests + `swift build` resolve to the fork. NOTE:
the App's xcodeproj currently still resolves the upstream
`huggingface/swift-jinja.git` transitively via swift-transformers because
the App's project.pbxproj has no remote SPM package references at all
(every remote pin comes through the local-path packages OsaurusCore /
OsaurusCLI / OsaurusRepository, and SwiftPM's xcodeproj resolver does
not promote the fork's URL when the upstream URL is declared in a
transitive-leaf package). Wiring the fork through the App's xcodeproj
is a separate change; not blocking because Mistral 3.5 itself has a
known model-forward bug (RoPE bundle-metadata mismatch — Python ref uses
plain RoPE base=1e6 but bundle config declares rope_type="yarn") being
fixed by the vmlx team. All non-Mistral-3.5 chat templates render
correctly through the upstream swift-jinja the App currently resolves.
@jjang-ai (Contributor, Author) commented May 1, 2026

PR #993 — final state for review · 6cfad02a

This PR brings osaurus from 01a1194a (osaurus/main) up through 33 commits of model-loading, runtime, scanner, and capability fixes, coupled to vmlx-swift-lm 0c36d01 (the for-iterable Jinja parser fix and the Mistral 3.5 mxfp4 RMSNorm fix, both already on origin/main; 89f8114 has since landed on origin/main, and picking it up is the next pin bump).

What ships

| Area | Change | Where |
| --- | --- | --- |
| Cache: KV mode | default `.none` (fp16) → `.turboQuant(4, 4)` | ModelRuntime.swift:415 |
| Cache: L2 disk | re-enabled: force-disabled → `diskDirUsable` | ModelRuntime.swift:401 |
| Capability matcher | Holo3 base bundles → `.imageVideo` | ModelMediaCapabilities.swift:111 |
| Scanner | flat-layout + 3-level recursive HF nest detection | ModelManager.swift (+154L) |
| Sidecar fetch | auto-fetch `jangtq_runtime.safetensors` from HF on the exact failure that needs it, with strict gating | ModelRuntime.swift:1304-1510 |
| Sidecar URL safety | `isValidHFRepoId` strict regex, https://huggingface.co/ host-pin, candidate fallback list (osaurus-ai → JANGQ-AI → mlx-community) | ModelRuntime.swift:1388-1532 |
| Multi-turn contracts | aborted assistant turns never reach the next prompt; reasoning toggle survives multi-turn | new test files (1896 lines / 81 @Test) |
| Vmlx pin | 01a1194a-era → 0c36d01 | Package.swift, Package.resolved × 2 |

Verified by other agents on the corresponding side

Cache + runtime (vmlx-swift-lm changes picked up via the pins):

  • 1135950 — three Laguna parity bugs (sigmoid → softplus, biased → un-biased routing weights, plain → YaRN RoPE) — verified coherent: real-load decode "Okay, so I need to figure out how<think>Okay, let's see..." on Laguna-XS.2 JANGTQ.
  • 576916b — Laguna codebook MoE via TurboQuantSwitchGLU + split fused gate_up_proj.
  • babbe34/0fab91c — Mistral3VLM strip model. prefix on vision_tower / multi_modal_projector keys.
  • 1173822 — TWO structural cache bugs that drove the "first turn fine, later turns garbage" symptom across every JANGTQ/MXFP4 multi-turn path:
    • Bug A: TurboQuantKVCache paged-restore compounded quantization across turns (re-encoding already-decoded lossy float). Fixed via new restoreFromDecodedKV that seats the prefix in .compressed phase without round-tripping.
    • Bug B: hybrid models (Qwen35, Qwen3Next, NemotronH) silently ignored CacheCoordinator.defaultMaxKVSize. Now route attention slots through RotatingKVCache(maxSize:, keep: 4) when the bound is set.
  • 3f8a5e9 — Same maxKVSize honor for Qwen35JANGTQ + tied-embeddings hardenings (Mistral3VLM, NemotronH, Mistral3VLMJANGTQ).
  • 0d85e9d — Defensive EOS widening: BatchEngine init probes 7 common end-of-turn special tokens against the tokenizer vocab and adds present ones to the EOS set. Closes the Qwen3.6 MXFP4 <|im_end|> leak.
  • 0756dc0 — Disk-tier trim-path Metal lifecycle crash (notifyExternalReferencesNonZeroOnDealloc) closed. Disk cache safe to re-enable in osaurus (this PR does).
  • 0c36d01 — osaurus-ai/swift-jinja@58d21aa5 fork pin: lifts the for-loop iterable from "factor + |filter only" to "full binary + comparison + logical hierarchy, excluding ternary" via parseFilter() → parseOr() (Sources/Jinja/Parser.swift:186, a 1-line change). Unblocks {% for message in loop_messages + [{'role': '__sentinel__'}] %} in Mistral 3.5's native chat template. 756/756 swift-jinja tests pass + 2 new regression tests (forLoopIterableAcceptsBinaryPlus, mistral3RealNativeTemplateParses).

Reasoning ON/OFF detection (osaurus side, exhaustively wired):

  • LocalReasoningCapability.detect() reads chat_template.jinja or falls back to jang_config.json > chat.reasoning (DSV4-Flash style). Caches per-modelId.
  • Capability.supportsThinking flips on <think> / </think> / <|think|> (Gemma-4 Harmony marker).
  • ChatView.swift:395,409 reads activeModelOptions["disableThinking"]?.boolValue == false as thinkingEnabled.
  • MLXBatchAdapter.swift:274-278 translates: additionalContext = ["enable_thinking": !disableThinking] (defaults to true).
  • ThinkTagScrubber (Services/ModelRuntime/ThinkTagScrubber.swift) is a defensive post-filter on .tokens for thinking-capable models only. Buffers 7-byte tail for split-token tag detection.
  • Multi-turn contract: ChatView.swift:1288-1291 builds priorUserMessages from t.role == .user only — assistant <think> content cannot leak into the next prompt by construction. Locked by AbortMidThinkHistoryContractTests.swift (152 lines).

JANGTQ sidecar auto-fetch (osaurus side):

  • Runs on validateJANGTQSidecarIfRequired throwing the specific "missing sidecar but stamp says JANGTQ" error code (ModelRuntime/code 2). Any other failure mode propagates immediately (no speculative fetch).
  • Candidate repo list: original repo → osaurus-ai → JANGQ-AI → mlx-community (canonical-cased, deduped, isValidHFRepoId-validated).
  • Each candidate URL is https://-scheme-pinned + host-pinned to huggingface.co. No arbitrary URL fetch.
  • Locked by EnsureJANGTQSidecarTests.swift (214 lines), JANGTQEdgeCaseTests.swift (496 lines), ValidateJANGTQUnsupportedFamilyTests.swift (206 lines), OsaurusOrgAutoFetchTests, WeightFormatNormalizationTests.

Hybrid SSM warm-pass + cache injection (vmlx-owned, osaurus consumes):

  • BatchEngine.admitPendingRequests auto-flips coordinator.isHybrid = true on first slot admission for any model whose per-layer cache list contains a MambaCache or ArraysCache.
  • Osaurus eager-sets setHybrid(true) for known hybrid families in installCacheCoordinator (Qwen3.5/3.6, NemotronH, Cascade-2, Jamba, etc.) per OMNI-OSAURUS-HOOKUP.md §5.1. Harmless on any admission path; closes the one-frame stale-flag window.
  • Paged-cache reuse content-addressed via BlockHashMap — toggling enable_thinking between turns produces a different prompt → different hash → no poisoned hit.

Per-bundle media capability + drag-drop / picker accept-set (osaurus side):

  • ModelMediaCapabilities.from(modelId:) substring/regex matcher: omni (Nemotron-3-Nano-Omni), imageVideo (Qwen2/2.5/3 VL, Qwen3.5/3.6 -vl, Holo3 base + -vl, SmolVLM 2), imageOnly (Paligemma, Idefics3, FastVLM, Pixtral, GLM-OCR, LFM2-VL, Gemma 3 / 4-it, Mistral 3 / 3.5, Mistral 4 -vl), textOnly otherwise.
  • from(directory:modelId:) post-load refines via config.json > vision_config + config_omni.json sidecar (Nemotron-3 omni gate).
  • FloatingInputCard.dropAcceptedTypes (line 2115): always .image + .fileURL, conditionally adds .audio + .mp3 + .wav + .mpeg4Audio for supportsAudio, .movie + .video + .quickTimeMovie + .mpeg4Movie for supportsVideo.
  • pickerAllowedTypes (line 2139): same gating, picker shows audio/video formats only when the loaded model can actually consume them.
  • attachIfAllowed (line 2179): routes by extension + capability; drops unsupported. Falsy default (textOnly) when no model selected.
  • Locked by ModelMediaCapabilitiesMCDCTests.swift (MC/DC coverage including Holo3 base = imageVideo boundary).

Test evidence

  • swift test (OsaurusCore): 1328/1328 pass in 182 suites at 89f8114. Includes 81 new @Test annotations from this PR — zero @disabled/XCTSkip markers added.
  • xcodebuild test -workspace osaurus.xcworkspace -scheme OsaurusCoreTests from cold cache: ** TEST SUCCEEDED ** 1323 pass / 0 fail / 5 skipped (the 5 skipped are SandboxIntegrationTests — Apple Containerization VM tests, gated on OSAURUS_RUN_SANDBOX_INTEGRATION_TESTS=1, untouched by this PR, intentionally Disabled in CI per their suite-level annotation).
  • xcodebuild build -scheme osaurus Release: ** BUILD SUCCEEDED ** (latest at HEAD 6cfad02a).

CI status

test-core failed with the EventSource / CAsyncHTTPClient / CNIOLLHTTP / CNIOExtrasZlib / CNIOPosix / _NumericsShims module-resolution errors. This is environmental — fallback-key restore of DerivedData from a different Package.resolved baseline mismatching the freshly-built C-shim module-map paths. Confirmed by raw failure logs (C shims compiled and linked cleanly, then Xcode rejected the cached .swiftmodule whose embedded paths didn't exist). PR's source builds green from cold cache locally; the PR doesn't touch .github/workflows/. Workflow has its own escape hatch at ci.yml:123-131: any Re-run failed jobs automatically wipes DerivedData (if: github.run_attempt != '1') and forces a cold build.

Verified-coherent bundles end-to-end (per vmlx-swift-lm team's BENCH harness sweep)

| Bundle | Path | Result |
| --- | --- | --- |
| Laguna-XS.2-JANGTQ | native `{% generation %}` template via fork | ✅ 3-turn coherent |
| NemotronH-Omni 30B JANGTQ | image + video + audio + reasoning toggle | ✅ all 11 OmniBench rows pass |
| Qwen3.6-27B-MXFP4 | BENCH_BATCH_DISK_RESTORE (138/138 hit) + BENCH_BATCH_CHAT 3-turn | ✅ no Metal crash, prompt 0.290s → 0.064s (4.5× cache hit), clean `<\|im_end\|>` stop |
| Qwen3.6-35B-A3B-JANGTQ4 | hybrid SSM + codebook MoE | 3-turn |
| MiniMax-M2.7-Small-JANGTQ | — | 3-turn with memory across turns |
| Gemma-4-26B-A4B-it-JANG_4M | — | 3-turn coherent |
| Holo3-35B-A3B-mxfp4 | — | 3-turn |
| Mistral-Medium-3.5-128B-mxfp4 | post af89da7 (jang bitWidthsUsed: [] default) | ✅ "Okay, the user just sent a blank message." |

Documented open issues (NOT silently swept)

  1. Mistral-Medium-3.5-128B-JANGTQ model-forward gibberish — root cause hypothesis documented in vmlx commit 89f8114: per-block codebook calibration mismatch on all-codebook dense decoder at non-power-of-2 hidden_size 12288. Hadamard rotation produces coordinate variance ~1/8192 + ~1/4096 (per block), but compute_codebook(d=12288, bits=2) calibrates 4 entries for variance ~1/12288. Across 88 dense layers the scale mismatch compounds. Fix requires per-block codebooks in BOTH jang_tools/turboquant/codebook.py AND Libraries/MLXLMCommon/JANGTQDenseLinear.swift. Recommendation: do NOT ship Mistral 3.5 JANGTQ until fixed. Mistral 3.5 mxfp4 path works fine.
  2. Mistral 3.5 chat template still fails for the App's release binary — App's osaurus.xcodeproj/project.pbxproj declares no remote SPM packages (every remote pin comes transitively through the local-path packages OsaurusCore / OsaurusCLI / OsaurusRepository). SwiftPM's xcodeproj resolver picks the upstream huggingface/swift-jinja.git URL declared by swift-transformers over the fork URL declared in OsaurusCore's Package.swift. Fix: add osaurus-ai/swift-jinja@58d21aa5 directly to project.pbxproj as an XCRemoteSwiftPackageReference. Not blocking because Mistral 3.5 has the JANGTQ codebook bug above anyway and the mxfp4 path uses a different chat template that doesn't hit the for-iterable parser issue. Tracked separately.
  3. DSV4 native chat encoder wiring — Libraries/MLXLMCommon/DeepseekV4ChatEncoder.swift exists but DSV4Minimal.jinja is what's used in the bridge today. Tool-calling renders with a simplified DSML envelope. Not blocking basic chat. Documented in the vmlx integration guide.

What other team members need to know

  • The merge is safe. No regression to any non-Mistral-3.5 family. The 33 commits roll up cleanly and have been multi-turn-tested by the vmlx-swift-lm side against real bundles in /Volumes/EricsLLMDrive.
  • Re-running CI is the documented escape hatch for the cache-pollution test-core failure. The wipe-on-re-run logic (ci.yml:123-131) was designed for exactly this class of failure.
  • For follow-up PRs:
    • Wire osaurus-ai/swift-jinja@58d21aa5 directly into osaurus.xcodeproj/project.pbxproj so the App resolves the fork (separate change).
    • Wire DeepseekV4ChatEncoder.swift through JangChatConfig.encoder == "encoding_dsv4" (separate change).
    • Hold Mistral 3.5 JANGTQ until per-block codebook calibration lands in jang_tools + JANGTQDenseLinear.swift.

🚀 Ready for review.

jjang-ai added 2 commits May 1, 2026 14:19
…laguna-preflight

# Conflicts:
#	osaurus.xcworkspace/xcshareddata/swiftpm/Package.resolved
…l 3.5 mxfp4 RMSNorm fix + JANGTQ Mistral 3.5 root cause docs)
@jjang-ai jjang-ai merged commit 8fab9f4 into main May 1, 2026
5 checks passed
@jjang-ai jjang-ai deleted the fix/jangtq-mistral3-laguna-preflight branch May 1, 2026 21:36
jjang-ai added a commit that referenced this pull request May 2, 2026
…mode

vmlx pin bump from 89f8114 → ddea384 picks up the iter-12 fix sweep
documented in vmlx commit bc19fc4's OSAURUS-PRODUCTION-REFERENCE-2026-
05-01.md (still local in vmlx; ddea384 is the latest commit pushed to
origin/main and includes the same fixes):

  - 38086ca per-block Hadamard kernel rewrite (Mistral 3.5 hidden=12288
            root cause complete — closes the non-power-of-2 dim handling
            gap in the encode pipeline)
  - 6096875 Hadamard H_2n recursion for blocks > 8192 (Mistral 3.5
            down_proj support)
  - a1bfe65 Hadamard kernel shmem 4096→8192, newv 4→64
  - 890e3ed Mistral3VLM patch_conv weight transpose in sanitize
  - 227332f hybrid-SSM full-disk-hit unsafeFullHit guard (rolls back to
            full prefill instead of emitting 0 tokens — closes the
            BENCH_STABILITY S2 silent-empty-stream bug)
  - 7389453 Mistral 3.5 compile-ON cache.offset skip when beta=0 (9×
            TTFT speedup, byte-identical text)
  - 9703b49 Mistral3Text sibling fix (same beta=0 short-circuit)
  - 53f7671 Kimi K2.5 text_config unwrap + language_model. weight strip
  - a8ac486 + 1125e20 Hadamard L2-preservation regression tests at
            d ∈ {4096, 8192, 12288, 28672}

Plus a small osaurus-side hardening on the L2 disk-cache modelKey:

  L2 disk cache entries are now keyed by `<modelName>|kv=<mode>` instead
  of bare `<modelName>`. This prevents stale entries encoded under one
  KV mode from being served against requests using a different mode
  (e.g. user flips defaultKVMode mid-session, or a per-request override
  diverges from the coordinator default). Without this scoping, a cache
  hit returning fp16 KV layout to a TurboQuant decoder (or vice versa)
  produces undefined behavior. The L1 paged cache stays per-model
  (modelName-scoped) — the kvModeTag only affects disk persistence.
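
The key composition itself is tiny; a sketch, assuming a string tag derived from the KV mode (e.g. "none" vs "tq4x4" — the exact tag format in ModelRuntime may differ):

```swift
// "<modelName>|kv=<mode>": an entry encoded under one KV mode can never be
// served to a decoder running a different mode. The L1 paged cache stays
// keyed by modelName alone.
func l2DiskCacheKey(modelName: String, kvModeTag: String) -> String {
    "\(modelName)|kv=\(kvModeTag)"
}
```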

The defaultKVMode stays at .none — see the file-level comment for the
3-bit and 4-bit codebook KV degenerate-repetition trail, and the new
reference to vmlx's `OSAURUS-PRODUCTION-REFERENCE-2026-05-01.md` §6
which lists `.turboQuant(3, 3)` as an EXAMPLE coordinator config but
does not bench-test it against thinking-mode preambles (the failure
mode that drives `idea idea idea` and `!!!!!!!!!` repetition).

Build clean Release; OsaurusCore tests 1381/1383 pass (the 2 local-only
ContextBudgetPreviewTests failures are environmental — same source ran
green on PR #993's CI test-core).
jjang-ai added a commit that referenced this pull request May 2, 2026
…699d3a polymorphic MoE)

Picks up the iter-13 fix sweep from vmlx-swift-lm:

  - 4699d3a fix(laguna): polymorphic MoE — affine SwitchGLU for mxfp4 +
            codebook for mxtq.
            `LagunaMoE.experts` is now a `LagunaSwitchMLPLayer` protocol
            type-erased over `TurboQuantSwitchGLU` (codebook/mxtq) and
            `SwitchGLU` (affine/mxfp4). Factory dispatches on
            `weight_format == "mxtq"` OR `mxtq_bits` presence; sanitize
            splits the fused `gate_up_proj` for both formats. The
            `OsaurusAI/Laguna-XS.2-mxfp4` bundle now loads natively
            (vmlx team verified `"2+2 equals 4."` end-to-end + 12/12
            BENCH_STABILITY pass + multi-turn cache reuse).

  - bc19fc4 + e33068d docs(osaurus): comprehensive production reference
            (14 runtime axes + §15 component invariants the SDK
            guarantees and osaurus must respect).

  - e9b7e7b docs(osaurus): Mistral 3.5 JANGTQ Python/vmlx-upstream
            parity audit.

Drops the now-obsolete Laguna mxfp4 preflight gate (`code 5`) from
`validateJANGTQSidecarIfRequired`. The cryptic "Unhandled keys" error
that motivated the gate (vmlx's hardcoded `TurboQuantSwitchGLU`
expert path rejecting mxfp4 affine keys) is fixed at the source by
4699d3a's polymorphic dispatch — the preflight is no longer needed.
Closes vmlx production-reference §13 item #2.

Test changes:
  - Removed `laguna_mxfp4_blocked` + `laguna_jangtq_passes` (the gate
    they covered is gone; left a one-line breadcrumb pointing at
    4699d3a so future readers understand why two test slots
    disappeared).
  - Updated `MLXBatchAdapterTests` to lock the new `maxBatchSize == 1`
    default (was 4 → flipped in fa694e9 to engage compile path per
    §15 invariant 13). Renamed `defaultsToFour` → `defaultsToOne_forCompileEngagement`
    so the test name carries the rationale.

Build clean Release; remaining 11 family-preflight tests pass; both
batch-adapter test failures resolved. The 2 environmental
`ContextBudgetPreviewTests` failures from earlier runs are unchanged
(local-only — same source ran green on PR #993's CI test-core).
jjang-ai added a commit that referenced this pull request May 2, 2026
… degenerate-repetition looping (#998)

* fix(quality): default KV mode .turboQuant(4, 4) → .none (4-bit codebook KV)

Reverts the 4-bit codebook KV default committed in db3179f (per the
vmlx integration guide §"3-bit KV verdict") after real-bundle testing
reproduced the same degenerate-repetition failure mode that 3-bit KV
produced before e202cbb:

  - Gemma-4 31B JANG_4M with thinking=ON emitted
    `idea idea idea idea idea ...` after a few hundred tokens of
    reasoning preamble.
  - Multiple other family bundles drifted into looping after a few
    multi-turn rounds even though turn 1 was coherent ("first turn
    fine, later turns garbage" symptom).
  - Thinking=OFF on the same bundles produced coherent output —
    confirming the failure scales with reasoning preamble length.

Vmlx commit `1173822` closed the cross-turn paged-cache re-encoding
bug (state was transitioning back to .fill phase with already-decoded
lossy float, then re-quantizing). But the underlying codebook
quantization error still compounds across long thinking-mode
preambles (longer prefix → more compression rounds → more accumulated
error → attention latches onto a high-prob low-info token and loops).

The vmlx team's BENCH harness verified 4-bit KV across 6+ bundles but
didn't toggle thinking on every family it covered. The integration
guide's 4-bit recommendation under-tested the failure mode for
thinking-capable models.

fp16 KV is the conservative default that matches user expectation of
"responses look right out of the box" across every family + every
turn count + thinking-on or off. Per-request `kvMode` still overrides;
clients that want memory savings can submit `.turboQuant(...)`
explicitly.
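
A shape sketch of the resulting behaviour (the `.turboQuant` / `.none`
spellings come from this commit; `CoordinatorConfig`, `GenerateRequest`,
and `effectiveKVMode` are illustrative names, not the shipped API):

  enum KVMode {
      case none                  // fp16 KV — the conservative default
      case turboQuant(Int, Int)  // codebook-quantized K/V bit widths
  }

  struct CoordinatorConfig {
      // Coherent across every family, turn count, and thinking mode.
      var defaultKVMode: KVMode = .none
  }

  struct GenerateRequest {
      // Per-request opt-in for clients that want the memory savings
      // and accept the long-preamble repetition risk.
      var kvMode: KVMode? = nil  // e.g. .turboQuant(4, 4)
  }

  func effectiveKVMode(_ request: GenerateRequest,
                       _ config: CoordinatorConfig) -> KVMode {
      request.kvMode ?? config.defaultKVMode
  }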

* chore(deps): bump vmlx pin → ddea384 + scope L2 disk-cache key by KV mode

vmlx pin bump from 89f8114 → ddea384 picks up the iter-12 fix sweep
documented in vmlx commit bc19fc4's
OSAURUS-PRODUCTION-REFERENCE-2026-05-01.md (the doc is still local in
vmlx; ddea384 is the latest commit pushed to origin/main and includes
the same fixes):

  - 38086ca per-block Hadamard kernel rewrite (Mistral 3.5 hidden=12288
            root cause complete — closes the non-power-of-2 dim handling
            gap in the encode pipeline)
  - 6096875 Hadamard H_2n recursion for blocks > 8192 (Mistral 3.5
            down_proj support — see the sketch after this list)
  - a1bfe65 Hadamard kernel shmem 4096→8192, newv 4→64
  - 890e3ed Mistral3VLM patch_conv weight transpose in sanitize
  - 227332f hybrid-SSM full-disk-hit unsafeFullHit guard (rolls back to
            full prefill instead of emitting 0 tokens — closes the
            BENCH_STABILITY S2 silent-empty-stream bug)
  - 7389453 Mistral 3.5 compile-ON cache.offset skip when beta=0 (9×
            TTFT speedup, byte-identical text)
  - 9703b49 Mistral3Text sibling fix (same beta=0 short-circuit)
  - 53f7671 Kimi K2.5 text_config unwrap + `language_model.`
            weight-key prefix strip
  - a8ac486 + 1125e20 Hadamard L2-preservation regression tests at
            d ∈ {4096, 8192, 12288, 28672}
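
For orientation, the H_2n recursion named in 6096875 is the standard
Walsh–Hadamard butterfly, H_2n = [[H_n, H_n], [H_n, -H_n]]. A minimal
CPU sketch of the math only (illustrative — the vmlx change is a GPU
kernel rewrite; 12288 is not a power of two, but 12288 = 3 × 4096, so
the per-block variant covers it):

  // Unnormalised fast Walsh–Hadamard transform via the H_2n butterfly.
  func hadamardTransform(_ x: [Float]) -> [Float] {
      precondition(x.count & (x.count - 1) == 0, "length must be a power of 2")
      var v = x
      var half = 1
      while half < v.count {
          for base in stride(from: 0, to: v.count, by: 2 * half) {
              for i in base..<(base + half) {
                  let (a, b) = (v[i], v[i + half])
                  v[i] = a + b          // top half:    H_n·x0 + H_n·x1
                  v[i + half] = a - b   // bottom half: H_n·x0 − H_n·x1
              }
          }
          half *= 2
      }
      return v
  }

  // Per-block application for non-power-of-2 dims like 12288 = 3 × 4096.
  func blockHadamard(_ x: [Float], blockSize: Int = 4096) -> [Float] {
      precondition(x.count % blockSize == 0)
      var out: [Float] = []
      out.reserveCapacity(x.count)
      for start in stride(from: 0, to: x.count, by: blockSize) {
          out.append(contentsOf: hadamardTransform(Array(x[start..<(start + blockSize)])))
      }
      return out
  }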

Plus a small osaurus-side hardening on the L2 disk-cache modelKey:

  L2 disk cache entries are now keyed by `<modelName>|kv=<mode>` instead
  of bare `<modelName>`. This prevents stale entries encoded under one
  KV mode from being served against requests using a different mode
  (e.g. user flips defaultKVMode mid-session, or a per-request override
  diverges from the coordinator default). Without this scoping, a cache
  hit returning fp16 KV layout to a TurboQuant decoder (or vice versa)
  produces undefined behavior. The L1 paged cache stays per-model
  (modelName-scoped) — the kvModeTag only affects disk persistence.
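
Shape sketch of the scoped key, reusing the KVMode enum from the
earlier sketch (the tag rendering is an assumption — only the
`<modelName>|kv=<mode>` shape comes from this commit):

  func kvModeTag(_ mode: KVMode) -> String {
      switch mode {
      case .none:                     return "none"        // fp16 layout
      case .turboQuant(let k, let v): return "tq\(k)-\(v)" // codebook layout
      }
  }

  func l2DiskCacheKey(modelName: String, kvMode: KVMode) -> String {
      // Mode-scoped: an entry encoded under fp16 is never served to a
      // TurboQuant decoder after a defaultKVMode flip, and vice versa.
      "\(modelName)|kv=\(kvModeTag(kvMode))"
  }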

The defaultKVMode stays at .none — see the file-level comment for the
3-bit and 4-bit codebook KV degenerate-repetition trail, and the new
reference to vmlx's `OSAURUS-PRODUCTION-REFERENCE-2026-05-01.md` §6
which lists `.turboQuant(3, 3)` as an EXAMPLE coordinator config but
does not bench-test it against thinking-mode preambles (the failure
mode that drives `idea idea idea` and `!!!!!!!!!` repetition).

Build clean Release; OsaurusCore tests 1381/1383 pass (the 2 local-only
ContextBudgetPreviewTests failures are environmental — same source ran
green on PR #993's CI test-core).

* fix(preflight): reject Laguna mxfp4 bundles with actionable error

Real-bundle repro on `OsaurusAI/Laguna-XS.2-mxfp4` (model_type=laguna,
jang_config.weight_format=mxfp4, quantization.bits=4): vmlx fails
parameter load with the cryptic

  Error: Unhandled keys ["biases", "scales", "weight"] in
         layers.1.mlp.experts.down_proj in
         LagunaModel.LagunaLayer.LagunaMoE.TurboQuantSwitchGLU
         .TurboQuantSwitchLinear

leaving users no path to remediation.

Root cause traced into `Libraries/MLXLLM/Models/Laguna.swift:425`:

  @ModuleInfo(key: "experts") var experts: TurboQuantSwitchGLU

`LagunaMoE.experts` is hardcoded to the codebook switch. Its inner
`TurboQuantSwitchLinear` only knows the JANGTQ keys (`tq_packed`,
`tq_norms`, `tq_bits`) — the bundle ships standard mxfp4 affine keys
(`weight`, `scales`, `biases`), so the parameter loader rejects them.

Vmlx production reference doc §13 item #2 names this as a known issue
with owner `vmlx-swift` ("Laguna mxfp4 expert format mismatch — either
ship JANGTQ-only or add affine MoE class"). Until vmlx makes
`LagunaMoE.experts` polymorphic on `weight_format`, the host
preflight catches the failure mode, surfacing a clear remediation:
use the Laguna XS.2 JANGTQ bundle (`weight_format = "mxtq"`), which
is verified-coherent end-to-end.

Detection condition (in `validateJANGTQSidecarIfRequired`):
  - `jang_config.weight_format == "mxfp4"` (case-normalised)
  - AND `config.json::model_type == "laguna"`

Throws `NSError(domain: "ModelRuntime", code: 5)` so callers can
distinguish from the existing forward (code 2) / inverse (code 3)
sidecar mismatches and the auto-fetch path (code 4). When vmlx ships
the polymorphic LagunaMoE expert path, drop this check + its tests.
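
A minimal sketch of the gate (the field names and the code-5 error are
from this commit; the function signature and JSON plumbing here are
illustrative):

  import Foundation

  func rejectLagunaMXFP4IfNeeded(jangConfig: [String: Any],
                                 config: [String: Any]) throws {
      let weightFormat = (jangConfig["weight_format"] as? String)?.lowercased()
      let modelType = config["model_type"] as? String
      guard weightFormat == "mxfp4", modelType == "laguna" else { return }
      throw NSError(domain: "ModelRuntime", code: 5, userInfo: [
          NSLocalizedDescriptionKey:
              "Laguna mxfp4 bundles can't load yet — use the Laguna XS.2 " +
              "JANGTQ bundle (weight_format = \"mxtq\") instead."
      ])
  }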

Test coverage:
  - `Laguna mxfp4 → throws code 5 with remediation pointing to JANGTQ
    alternative` — locks the new gate
  - `Laguna JANGTQ (mxtq) passes preflight — codebook path is
    supported` — boundary check the gate doesn't false-positive on
    the working Laguna path

* fix(ui): rolling steady-state tok/s instead of single-final-average

Two visible artefacts the prior tok/s display produced, both reported
by users:

  1. "Counter doesn't ramp up — needs a long response to show full
     speed." Short answers (50-200 tokens) have first-token amortisation
     + reasoning-parser stamp resolution dominating wall time, so the
     full-generation average reads ~30% slower than steady-state decode.

  2. "Reasoning ON shows different tok/s than reasoning OFF on the same
     model." Same decode rate, but thinking-on accumulates 5-10× more
     tokens at steady-state speed, diluting setup costs in the average.
     Users perceive same model + same hardware as inconsistent.

Both are calculation artefacts of the cumulative-average pattern, not
underlying decode-rate differences.

Replaced with `RollingTokenRate` — a sliding-window estimator that
skips a 0.4s + 4-token warmup, reports steady-state over a 1.5s
window, counts content + reasoning + tool-arg tokens uniformly, and
updates the live ChatTurn rate at ~5Hz during streaming. On finalize
it prefers the rolling steady-state and falls back to the full-gen
average only when warmup never elapsed (response too short to
converge).
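
For scale: a 100-token answer behind ~0.8s of setup at a true 60 tok/s
decodes in ~2.5s wall time, so the cumulative average reads ~40 tok/s —
the ~30% artefact above. A minimal sketch of the estimator contract
(the constants are from this commit; the type and method names are
illustrative, not the shipped RollingTokenRate API):

  import Foundation

  struct RollingRateEstimator {
      private var samples: [(time: TimeInterval, count: Int)] = []
      private var totalTokens = 0
      private var startTime: TimeInterval?

      let warmupSeconds: TimeInterval = 0.4  // skip first-token amortisation
      let warmupTokens = 4
      let windowSeconds: TimeInterval = 1.5  // steady-state window

      // Content, reasoning, and tool-arg tokens all flow through here.
      mutating func record(tokens: Int, at now: TimeInterval) {
          if startTime == nil { startTime = now }
          totalTokens += tokens
          samples.append((now, tokens))
          samples.removeAll { now - $0.time > windowSeconds }  // expire window
      }

      // Steady-state tok/s over the window; nil until warmup has elapsed.
      func rate(at now: TimeInterval) -> Double? {
          guard let start = startTime,
                now - start >= warmupSeconds,
                totalTokens >= warmupTokens,
                let oldest = samples.first,
                now > oldest.time else { return nil }
          let inWindow = samples.reduce(0) { $0 + $1.count }
          return Double(inWindow) / (now - oldest.time)
      }

      // On finalize: prefer steady-state; fall back to the full-gen
      // average only when the response was too short to converge.
      func finalRate(at now: TimeInterval) -> Double? {
          if let steady = rate(at: now) { return steady }
          guard let start = startTime, now > start else { return nil }
          return Double(totalTokens) / (now - start)
      }
  }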

11 unit tests in `RollingTokenRateTests.swift` lock the contract:
warmup gating (time + token), 60-tps steady-state convergence, content
vs reasoning invariance, sliding-window expiration, finalRate fallback,
short-response edge cases. All green.

* fix(perf): default maxBatchSize 4 → 1 to engage vmlx compile path

Per vmlx production reference §15 invariant 13: compile only engages
when `maxBatchSize == 1` (Stage 1B.3 scope; Stage 1B.4 per-bucket
shared buffers — pending). With the prior default of 4, every
`maybePromoteToCompiledDecode` gate failed and the decode loop ran
uncompiled, missing the documented compile-ON speedups:

  - Mistral 3.5 BENCH_VL_BATCH_CHAT: 24.8s → 2.7s TTFT (9× speedup)
  - Other promotion-eligible families (Qwen 3.5/3.6, MiniMax,
    NemotronH, DSV4 via Compilable* cache classes): ranges from
    ~1.5× to ~9× depending on family — vmlx §8 promotion table.

Osaurus's primary use case is single-user chat through the macOS app
where only one slot is active at a time. For multi-user server
deployments, the existing `defaults write -int N` override remains —
at the cost of compile being permanently disabled for that process
(the trade-off until vmlx ships Stage 1B.4).
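
Illustrative shape of the gate (only `maybePromoteToCompiledDecode`
and the `maxBatchSize == 1` condition come from the text; the class
body is an assumption):

  final class MLXBatchAdapter {
      // Default flipped 4 → 1 so single-slot chat engages compile.
      var maxBatchSize = 1
      private(set) var usingCompiledDecode = false

      func maybePromoteToCompiledDecode() {
          // Stage 1B.3: compiled decode only supports one active slot;
          // any larger batch ceiling keeps the decode loop uncompiled.
          guard maxBatchSize == 1 else { return }
          usingCompiledDecode = true
      }
  }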

* chore(deps): bump vmlx pin → e33068d + drop Laguna mxfp4 preflight (4699d3a polymorphic MoE)

Picks up the iter-13 fix sweep from vmlx-swift-lm:

  - 4699d3a fix(laguna): polymorphic MoE — affine SwitchGLU for mxfp4 +
            codebook for mxtq.
            `LagunaMoE.experts` is now a `LagunaSwitchMLPLayer` protocol
            type-erased over `TurboQuantSwitchGLU` (codebook/mxtq) and
            `SwitchGLU` (affine/mxfp4). Factory dispatches on
            `weight_format == "mxtq"` OR `mxtq_bits` presence; sanitize
            splits the fused `gate_up_proj` for both formats. The
            `OsaurusAI/Laguna-XS.2-mxfp4` bundle now loads natively
            (vmlx team verified `"2+2 equals 4."` end-to-end + 12/12
            BENCH_STABILITY pass + multi-turn cache reuse).

  - bc19fc4 + e33068d docs(osaurus): comprehensive production reference
            (14 runtime axes + §15 component invariants the SDK
            guarantees and osaurus must respect).

  - e9b7e7b docs(osaurus): Mistral 3.5 JANGTQ Python/vmlx-upstream
            parity audit.

Drops the now-obsolete Laguna mxfp4 preflight gate (`code 5`) from
`validateJANGTQSidecarIfRequired`. The cryptic "Unhandled keys" error
that motivated the gate (vmlx's hardcoded `TurboQuantSwitchGLU`
expert path rejecting mxfp4 affine keys) is fixed at the source by
4699d3a's polymorphic dispatch — the preflight is no longer needed.
Closes vmlx production-reference §13 item #2.
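
Shape of the polymorphic dispatch (the `LagunaSwitchMLPLayer` protocol
name and the `weight_format == "mxtq"` / `mxtq_bits` condition are
quoted above; the factory function itself is illustrative):

  protocol LagunaSwitchMLPLayer { /* shared expert forward interface */ }

  struct TurboQuantSwitchGLU: LagunaSwitchMLPLayer {}  // codebook / mxtq keys
  struct SwitchGLU: LagunaSwitchMLPLayer {}            // affine / mxfp4 keys

  func makeLagunaExperts(weightFormat: String?,
                         hasMXTQBits: Bool) -> LagunaSwitchMLPLayer {
      // Dispatch on weight_format == "mxtq" OR mxtq_bits presence; both
      // paths share the sanitize step that splits the fused gate_up_proj.
      (weightFormat == "mxtq" || hasMXTQBits)
          ? TurboQuantSwitchGLU()
          : SwitchGLU()
  }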

Test changes:
  - Removed `laguna_mxfp4_blocked` + `laguna_jangtq_passes` (the gate
    they covered is gone; left a one-line breadcrumb pointing at
    4699d3a so future readers understand why two test slots
    disappeared).
  - Updated `MLXBatchAdapterTests` to lock the new `maxBatchSize == 1`
    default (was 4 → flipped in fa694e9 to engage compile path per
    §15 invariant 13). Renamed `defaultsToFour` →
    `defaultsToOne_forCompileEngagement` so the test name carries the
    rationale.

Build clean Release; remaining 11 family-preflight tests pass; both
batch-adapter test failures resolved. The 2 environmental
`ContextBudgetPreviewTests` failures from earlier runs are unchanged
(local-only — same source ran green on PR #993's CI test-core).

* chore(deps): bump vmlx pin → 2e61c12 (Stage 1B.4 design doc + scaffold)

* fix(cache): defaultMaxKVSize 8192 → 65536 to match vmlx production reference

Per vmlx production reference §6 example (`cfg.defaultMaxKVSize =
65536`) and the audit conclusion confirmed by the vmlx team. The
prior 8192 silently truncated long-context prompts (trigger
arithmetic sketched after the list):

  - 50K-token PDF Q&A → model only saw the last 8K tokens (84%
    context loss) past the 16K trigger (8192 × longPromptMultiplier=2.0).
  - Long thinking-mode reasoning preambles > 16K → cap kicked in
    mid-reasoning, model lost earlier context.
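
The trigger arithmetic behind both examples (the names follow the
commit text; only the numbers quoted above are used):

  let longPromptMultiplier = 2.0
  let oldTrigger = Int(8192.0  * longPromptMultiplier)  //  16_384 tokens
  let newTrigger = Int(65536.0 * longPromptMultiplier)  // 131_072 tokens
  // A 50K-token prompt blows past the old 16K trigger and keeps only
  // the final 8K tokens: (50_000 - 8_192) / 50_000 ≈ 84% context loss.
  // Under the new cap, rotation only engages past 131K tokens.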

Worst-case wired memory at 65K × 88 layers × 8 KV-heads × 128 head_dim
× 2 bytes (fp16) × 2 (K+V) ≈ 2.4 GB per slot on Mistral 3.5 — but
TurboQuant compression at the engine's `min_tokens_for_compression`
threshold (~2K tokens) drops the steady-state cost ~26× to ~95 MB
per slot on `.turboQuant(4,4)`. With osaurus's `.none` default the
cold path stays fp16 but the rotating cap only kicks in for prompts
past 131K (65536 × 2.0) — small/medium chats unaffected.

The wired-memory worry is rounding error on a 16GB+ Mac; the
silent-truncation footgun was the worse failure mode. A per-family
overlay would be cleaner long-term, but that's premature optimization
versus the uniform doc-aligned default.