
fix(preflight): reject JANGTQ Mistral 3 / Laguna before vmlx loads#993

Merged
jjang-ai merged 35 commits into main from fix/jangtq-mistral3-laguna-preflight on May 1, 2026

Conversation

@jjang-ai (Contributor) commented May 1, 2026

Summary

Two-layer fail-fast defense for a load-path issue identified post-PR-#967 merge: JANGTQ-quantized Mistral 3 family (incl. Mistral-Medium-3.5-128B-JANGTQ2) and Laguna bundles can't load through vmlx today because their model classes use vanilla MLXNN.Linear instead of a JANGTQ-aware shim.

Without this PR, a user installing those JANGTQ tiers gets either a weight-shape mismatch crash (.tq_packed shape != Linear.weight flat) or silently loaded garbage (codebook bytes treated as raw weights).

Layer 1 — engine-side (vmlx-swift-lm d32e135)

Both LLMModelFactory and VLMModelFactory mistral3 dispatch closures now peek weight_format BEFORE falling through to vanilla Mistral3TextModel / Mistral3VLM, throwing a clear error pointing at MXFP4 as the working alternative.
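
As a sketch of the guard's shape — assuming a simplified untyped config dictionary and an illustrative error type; the real vmlx-swift-lm closures decode typed configs and construct the actual models:

```swift
import Foundation

// Illustrative error; vmlx's real error type and message differ in detail.
struct UnsupportedJANGTQFamilyError: LocalizedError {
    let family: String
    var errorDescription: String? {
        "JANGTQ-quantized \(family) bundles are not yet supported by vmlx. "
            + "Use the MXFP4 tier of the same model instead."
    }
}

func mistral3Dispatch(config: [String: Any]) throws {
    // Peek weight_format BEFORE constructing the vanilla model so a JANGTQ
    // bundle fails with a clear message instead of a weight-shape crash.
    if (config["weight_format"] as? String) == "mxtq" {
        throw UnsupportedJANGTQFamilyError(family: "Mistral 3")
    }
    // ...fall through to the vanilla Mistral3TextModel / Mistral3VLM path.
}
```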

Layer 2 — osaurus host preflight (this PR)

validateJANGTQSidecarIfRequired now has a third check covering pending-JANGTQ families. It fires only when all three conditions hold: jang_config.json exists, weight_format == "mxtq", and model_type (or text_config.model_type) ∈ {mistral3, ministral3, laguna}. It surfaces a friendly remediation message at the host layer before any vmlx loader runs.
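
A minimal sketch of that check, assuming simplified JSON handling (the real validator's plumbing, error type, and message wording live in OsaurusCore):

```swift
import Foundation

// Families whose JANGTQ shim hasn't been ported to vmlx yet (per this PR).
let pendingJANGTQFamilies: Set<String> = ["mistral3", "ministral3", "laguna"]

struct PreflightError: Error {
    let code: Int
    let message: String
}

func checkPendingJANGTQFamily(bundleDir: URL) throws {
    func json(_ name: String) -> [String: Any]? {
        let url = bundleDir.appendingPathComponent(name)
        return (try? JSONSerialization.jsonObject(with: Data(contentsOf: url)))
            as? [String: Any]
    }

    // Conditions 1 + 2: jang_config.json exists AND stamps weight_format == "mxtq".
    guard let jang = json("jang_config.json"),
          (jang["weight_format"] as? String) == "mxtq" else { return }

    // Condition 3: model_type — or text_config.model_type for VLM wrappers —
    // names a pending family.
    guard let config = json("config.json") else { return }
    let textConfig = config["text_config"] as? [String: Any]
    let family = (config["model_type"] as? String)
        ?? (textConfig?["model_type"] as? String)

    if let family, pendingJANGTQFamilies.contains(family) {
        throw PreflightError(
            code: 4,
            message: "JANGTQ-quantized \(family) bundles can't load yet — "
                + "install the MXFP4 tier of this model instead.")
    }
}
```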

When the JANGTQ port lands

Once the shim is ported in vmlx, simply remove the family from the pendingJANGTQFamilies set. No other host code change is required.

Pin bump

vmlx-swift-lm a196800 → d32e135.

Test coverage

9 new MC/DC-shaped tests in Tests/Service/ValidateJANGTQUnsupportedFamilyTests.swift:

  • D1/D2 existing branches still fire
  • D3 outer mistral3 JANGTQ throws code-4
  • D3 inner ministral3 (text_config) JANGTQ throws
  • D3 laguna JANGTQ throws
  • Boundary: nemotron_h / qwen3_5_moe / minimax_m2 JANGTQ all PASS (shims exist)
  • Boundary: Mistral 3 family MXFP4 PASSES (only mxtq fires the gate)

Status

  • OsaurusCore: 1256 / 1256 (was 1247 + 9 new)
  • Base: osaurus/main at 01a1194a, 0 commits ahead

Test plan

  • OsaurusCore swift test passes (1256 / 1256)
  • vmlx focused tests pass at new pin (72 / 72)

…ore vmlx loads

Mistral 3 family (mistral3 / ministral3) and Laguna model classes in
vmlx-swift-lm currently use vanilla MLXNN.Linear. JANGTQ-quantized
bundles for those families ship `.tq_packed` + `.scales` tensors that
the vanilla Linear can't consume — without a JANGTQ-aware shim
(NemotronHJANGTQModel / MiniMaxJANGTQModel pattern), loading would
either crash on weight-shape mismatch or silently load codebook bytes
as raw weights and emit garbage.

Two-layer defense added:

  1. Vmlx (paired commit d32e135 on vmlx-swift-lm main): both
     LLMModelFactory and VLMModelFactory `mistral3` dispatch closures
     now peek `weight_format` BEFORE falling through to vanilla
     Mistral3TextModel / Mistral3VLM, throwing a clear
     "JANGTQ-quantized Mistral 3 family bundles are not yet supported"
     error pointing at MXFP4 as the working alternative.

  2. Osaurus host preflight (this commit): extends
     `validateJANGTQSidecarIfRequired` with a third check covering
     pending-JANGTQ families. Fires only when ALL of:
        - jang_config.json exists
        - weight_format == "mxtq"
        - config.json model_type (or text_config.model_type for VLM
          wrappers) ∈ {mistral3, ministral3, laguna}
     User gets a friendly remediation message at the host layer
     before any vmlx loader runs.

The MXFP4 quant tier of these same models loads correctly via the
standard MLX dequant path (layout matches mx.quantize), so the error
points users at MXFP4 as the working alternative. Once the JANGTQ
shim ports land in vmlx, simply remove the family from the host's
`pendingJANGTQFamilies` set — no other code change required.

Pin bump: vmlx-swift-lm a196800 → d32e135 carries the engine-side
fail-fast guard.

Test coverage: 9 new MC/DC-shaped tests in
Tests/Service/ValidateJANGTQUnsupportedFamilyTests.swift covering:
  - D1/D2 existing branches still fire (no jang_config; non-mxtq)
  - D3 outer mistral3 throws code 4 with MXFP4-pointer message
  - D3 inner ministral3 (text_config) throws
  - D3 laguna throws with laguna-named message
  - Boundary: nemotron_h / qwen3_5_moe / minimax_m2 JANGTQ all PASS
    (shims exist; new check must NOT trigger)
  - Boundary: Mistral 3 family MXFP4 PASSES (only mxtq fires the gate)

OsaurusCore: 1256 / 1256 tests in 167 suites (was 1247 / 166; +9 +1).
@github-actions bot added the bug label May 1, 2026
jjang-ai added 16 commits April 30, 2026 21:44
…primitive)

Pulls in vmlx@9df5c80 — new JANGTQDenseLinear primitive (Libraries/
MLXLMCommon/JANGTQDenseLinear.swift) that's the foundation for porting
JANGTQ-quantized Mistral 3 / Mistral 3.5 / Mistral 4 / Laguna bundles.

The existing TurboQuantSwitchLinear is MoE-shaped (n_experts dim,
gather indices); the Mistral 3 family quantizes the entire DENSE text
decoder per the mxtq_bits.text_decoder=2|4 profile — needs a different
shim shape. JANGTQDenseLinear matches the Python converter's
tq_quantize_weight output (2D shapes, no expert dim) and reuses the
existing JANGTQKernels.gatherTQ kernel via singleton-expert
degeneration.
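
A shape-level illustration of the degeneration trick, with plain Swift arrays standing in for the real MLXArray tensors (the stand-in kernel here operates on already-dequantized weights; the real gatherTQ consumes tq_packed + scales):

```swift
// Stand-in for the MoE-shaped gather kernel: pick one expert's weight matrix
// per token and apply it. expertWeights: [nExperts][out][in], input: [tokens][in].
func gatherMatmul(expertWeights: [[[Float]]], routing: [Int],
                  input: [[Float]]) -> [[Float]] {
    zip(input, routing).map { token, expert in
        expertWeights[expert].map { row in
            zip(row, token).reduce(0) { $0 + $1.0 * $1.1 }
        }
    }
}

// Singleton-expert degeneration: wrap the dense weight as a 1-expert stack and
// route every token to index 0 — the MoE kernel now computes a plain Linear.
func denseForward(weight: [[Float]], input: [[Float]]) -> [[Float]] {
    gatherMatmul(expertWeights: [weight],
                 routing: Array(repeating: 0, count: input.count),
                 input: input)
}
```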

7 structural tests covering construction contracts + Mistral 3.5
attention/MLP shape examples + parameter-key pathing + bit-width
packed_cols arithmetic.

End-to-end forward verification still gated on a real Mistral 3 family
JANGTQ bundle on disk — the host-side preflight in this PR continues
to fire until that verification phase completes (defense in depth).

OsaurusCore: 1256 / 1256 against new pin. No behavior change for
non-mxtq Mistral 3 paths.
Pulls in vmlx@cb829b6 — Mistral 3 family LLM JANGTQ port complete.

Vmlx-side (cb829b6):
  - Mistral3TextJANGTQModel: parallel architecture to Mistral3TextModel
    with JANGTQDenseLinear for attention Q/K/V/O + MLP gate/up/down
  - LLMModelFactory mistral3 closure peeks weight_format and routes
    mxtq → Mistral3TextJANGTQModel(config, bits, seed) reading
    mxtq_bits / mxtq_seed from merged jang_config.json
  - VLMModelFactory mistral3 closure: updated message reflects LLM port
    complete, VLM port still in-flight (Mistral3VLM with Pixtral
    needs paired JANGTQ inner LM)
  - 6 dispatch tests + 7 structural JANGTQDenseLinear tests upstream

Host-side preflight relaxation:
  - Removes blanket gate on `mistral3` / `ministral3` model_types
  - LLM-only Mistral 3 / 3.5 JANGTQ (no vision_config in config.json)
    now flows through to vmlx's Mistral3TextJANGTQModel — preflight
    PASSES, no error
  - VLM-shaped Mistral 3 family (vision_config present, e.g.
    Mistral-Medium-3.5-128B with Pixtral) STILL fires gate with a
    VLM-specific message until upstream VLM port lands
  - Laguna gate stays unchanged — separate engine port pending

Test coverage updated:
  - ValidateJANGTQUnsupportedFamilyTests now has 11 cases (was 9):
    * D3.mistral3 outer LLM-only PASSES (new)
    * D3.mistral3 + vision_config STILL throws with VLM-specific msg (new)
    * D3.ministral3 inner LLM-only PASSES (new)
    * D3.ministral3 inner + vision_config throws (was throws unconditionally)
    * D3.laguna unchanged — still throws
    * Boundary: nemotron_h / qwen3_5_moe / minimax_m2 unchanged

OsaurusCore: 1258 / 1258 (was 1256 + 2 new VLM-vs-LLM split tests).
Pulls in vmlx@7fa4940 — Mistral 3 family VLM JANGTQ port complete.

Vmlx-side (7fa4940):
  - Mistral3VLMJANGTQ: full JANGTQ inner LM (Mistral3JANGTQAttention/
    MLP/TransformerBlock/ModelInner/LanguageModel) + outer wrapper
    matching Mistral3VLM's contract. Pixtral vision tower stays
    vanilla per mxtq_bits.vision_tower=passthrough_fp16.
  - VLMModelFactory mistral3 closure: weight_format=mxtq → routes to
    Mistral3VLMJANGTQ(config, bits, seed) with bits + seed read from
    config.mxtqBits / config.mxtqSeed.
  - ToolCallFormat: laguna → .glm4, ministral3 → .mistral.
  - 10 new coverage tests (Mistral3LagunaCoverageTests).

Host preflight changes:
  - Drops the Mistral 3 family gate entirely. Both LLM-only AND
    VLM-shaped bundles flow through to vmlx now (no_vision →
    Mistral3TextJANGTQModel; vision_config present →
    Mistral3VLMJANGTQ).
  - Laguna gate stays — vmlx Laguna model class is the next port.
  - Tests updated: VLM Mistral 3 cases now PASS (used to throw).

OsaurusCore: 1258 / 1258 tests in 167 suites against new pin.
…o 344dda0

vmlx@344dda0 ships LagunaModel — full Poolside Laguna engine class
(40 hybrid layers, per-layer head count, dual RoPE, q_norm/k_norm,
sigmoid+correction-bias routing over 256 routed experts top-8 + shared,
per-layer mixed RotatingKVCache+KVCacheSimple). MXFP4 bundles load via
standard MLX dequant route immediately.

Host preflight third-check fully retired:
  - Mistral 3 family LLM (vmlx@cb829b6) ✓
  - Mistral 3 family VLM with Pixtral (vmlx@7fa4940) ✓
  - Laguna LLM (vmlx@344dda0) ✓
  - JANGTQ Linear shim for Laguna is the next incremental piece
    (LagunaJANGTQModel paralleling Mistral3TextJANGTQModel) — but no
    longer host-side gated; mislabeled bundles get caught by the
    existing forward/inverse sidecar checks above.

Tests updated:
  - D3.laguna case now PASSES (was throws)
  - All other boundary checks unchanged

OsaurusCore: 1258 / 1258 tests in 167 suites against new pin.
Pulls in vmlx@fe1b754 — exhaustive audit pass:
  - Mistral3TextJANGTQModel adds sanitize() override (drops
    self_attn.rotary_emb.inv_freq, .tq_bits scalars, lm_head.weight
    when tied; handles weight_scale_inv FP8 multiply-through;
    unwraps language_model. prefix from VLM-converted bundles)
  - Laguna sanitize() now also drops the same HF quirks
  - Tests/MLXLMTests/InterleavedReasoningLeakTests.swift (8 new
    tests): empty think block, token-split opener/closer, multi-turn
    cross-turn isolation, mid-think truncation flushes as reasoning,
    stray closer no-channel-open, family stamp matrix lock

OsaurusCore: 1258 / 1258.
Closes the Laguna JANGTQ port. Vmlx@3a422d7 ships LagunaJANGTQModel
mirroring LagunaModel exactly except every dense Linear in attention
+ MLP + MoE expert + shared expert is replaced with JANGTQDenseLinear.
LLMModelFactory's `laguna` closure now peeks weight_format and routes
mxtq → LagunaJANGTQModel.

Engine status: every JANGTQ family vmlx ships now has a paired JANGTQ
class. NemotronH, Qwen3.5-MoE/Holo3, MiniMax M2, DSV3/Kimi K2, DSV4,
Mistral 3 LLM + VLM, Laguna LLM. Laguna VLM doesn't exist (Laguna is
text-only).

OsaurusCore: 1258 / 1258.
`scanLocalModels` previously walked exactly two levels (`<root>/<org>/<repo>/`),
the HuggingFace shape. Bundles laid out flat (`<root>/<modelDir>/`) — common
when users sync via rsync or place models on an external drive — were treated
as "orgs", and their child files (config.json, *.safetensors) failed the
directory check, so nothing was detected. Affected real-world bundles include
Nemotron-3-Nano-Omni-30B-A3B-MXFP4, MiniMax-M2.7-JANGTQ4, Kimi-K2.6-*,
DeepSeek-V4-Flash-* and Qwen3.6-35B-A3B-JANGTQ4.

Now each top-level entry is first checked as a model bundle (config.json +
recognised tokenizer + at least one safetensors file). If yes it's registered
with id = directory basename; otherwise we fall back to the existing nested
descent. Both layouts may coexist under the same root.
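
A sketch of the bundle probe, assuming an illustrative recognised-tokenizer list (the real scanner's file checks may differ):

```swift
import Foundation

func looksLikeModelBundle(_ dir: URL) -> Bool {
    guard let names = try? FileManager.default.contentsOfDirectory(atPath: dir.path)
    else { return false }
    let hasConfig = names.contains("config.json")
    // Recognised tokenizer shapes — illustrative subset.
    let hasTokenizer = names.contains {
        $0 == "tokenizer.json" || $0 == "tokenizer.model" || $0 == "tokenizer_config.json"
    }
    let hasWeights = names.contains { $0.hasSuffix(".safetensors") }
    return hasConfig && hasTokenizer && hasWeights
}
```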

`mergeAvailable` now also dedupes by repo tail so a flat `Nemotron-3-...`
entry doesn't surface alongside the curated `OsaurusAI/Nemotron-3-...` one.
The downloaded copy wins on tail collision.
…that needs it

When a JANGTQ bundle's `jang_config.json` declares `weight_format: "mxtq"` but
`jangtq_runtime.safetensors` is absent, vmlx aborts on the first forward pass
because TurboQuantSwitchLinear's runtime cache is empty. Users who synced a
bundle without the sidecar (rsync excluded it, partial download, etc.) hit a
load-time error with no clear remediation.

`ensureJANGTQSidecar` wraps the existing sync validator and, on the FORWARD-
mismatch failure (and only that — code 2 from `validateJANGTQSidecarIfRequired`),
attempts a one-shot download of `jangtq_runtime.safetensors` from
`https://huggingface.co/<modelId>/resolve/main/jangtq_runtime.safetensors`,
then re-runs validation. The URL is built dynamically from the model id via
`ModelDownloadService.resolveURL` so it always points at the right repo.

Hard guarantees (covered by tests):
  * Sidecar already present → no fetch
  * Vanilla (no jang_config.json) model → no fetch
  * Stamp says non-mxtq (vanilla or inverse-mislabeled) → no fetch, original
    error propagates unchanged (code 3 still surfaces for inverse mismatch)
  * Forward mismatch but flat-layout id (no `/`) → no fetch (no canonical HF
    mapping), original code-2 error surfaces
  * Forward mismatch + canonical `<org>/<repo>` id → fetcher fires exactly once
    with the right URL; if it succeeds the validator re-runs and the load
    proceeds; if it fails the error is wrapped as code 4 with the URL we tried

The fetcher does an atomic temp → rename and rejects 0-byte responses so a
crashed/cancelled fetch never leaves a partial sidecar that the next preflight
would silently accept.
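
The control flow, sketched with injected validate/fetch hooks and a simplified code-carrying error (the real implementation lives in ModelRuntime and performs the atomic install described above):

```swift
import Foundation

struct SidecarError: Error {
    let code: Int
    let triedURL: URL?
}

func ensureJANGTQSidecar(
    modelId: String,
    validate: () throws -> Void,
    fetchSidecar: (URL) throws -> Void
) throws {
    do {
        try validate()
    } catch let error as SidecarError where error.code == 2 && modelId.contains("/") {
        // Forward mismatch on a canonical <org>/<repo> id: one-shot fetch of
        // jangtq_runtime.safetensors from HF, then re-validate. Every other
        // failure mode (codes 3+, flat-layout ids) propagates unchanged.
        let url = URL(string:
            "https://huggingface.co/\(modelId)/resolve/main/jangtq_runtime.safetensors")!
        do {
            try fetchSidecar(url)
        } catch {
            throw SidecarError(code: 4, triedURL: url) // wraps the URL we tried
        }
        try validate()
    }
}
```
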
Pattern + character edge cases that could have made the auto-fetch trigger
on bogus input or skip a real JANGTQ bundle:

  * weight_format matching is now case + whitespace insensitive.  Every
    stamp variant we've seen in the wild (`MXTQ`, ` mxtq `, `\tmxtq`,
    `mXtQ`, `mxtq\n`) is recognised as JANGTQ; lookalikes (`mx_tq`,
    `mxtq2`, `mxq`, `mxfp4`, etc.) are not.  Covers every JANGTQ family —
    Qwen, MiniMax, DSV4, Nemotron, Mistral 3 (LLM + VLM), Laguna.

  * `isValidHFRepoId` strictly validates the model id BEFORE we build any
    URL or hit the network.  Required shape: exactly two non-empty
    segments of [A-Za-z0-9._-], 1..96 chars each, no leading or trailing
    slash, no whitespace anywhere.  Rejected up front: empty string, bare
    `/`, leading `/foo`, trailing `foo/`, no-slash flat ids, multi-slash
    `a/b/c`, empty middle `a//b`, whitespace, URL meta (`?`, `#`, `&`,
    `;`, `:`, `@`), path traversal (`..`), backslashes, quotes, control
    characters, non-ASCII, BOM, and segments > 96 chars. (See the sketch after this list.)

  * The resolved URL is double-checked to be `https://huggingface.co/...`
    before we trust it.

  * Cross-volume safe install: `URLSession.download` writes its temp file
    to the system temp dir which is almost always on a different volume
    than the model bundle (e.g. user models on `/Volumes/EricsLLMDrive/`).
    `moveItem` would fail with EXDEV.  Now we copy into a sibling temp
    file in the bundle's own directory and use `replaceItemAt`, which
    handles "dest already exists" cleanly and stays atomic on the same
    volume.

  * Race tolerance: if a concurrent writer produced the sidecar between
    our HEAD-of-validate and our install step, we accept their copy
    instead of throwing.

  * Test injection moved from a `nonisolated(unsafe)` global to a
    `@TaskLocal`, so parallel test cases can each scope their own fetcher
    override via `withValue` without racing each other (and without
    hitting the real network when a parallel case clears the global mid-
    flight, which was producing 401s in a prior run).

  * `mergeAvailable` tail-dedup now properly removes the curated entry
    from `suggestedModels` when a flat-layout local entry replaces it,
    instead of leaving both in different lists where the UI rendered them
    as duplicates.
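
Sketches of the first two gates above, assuming the rules as quoted (the shipped character-set and length checks may be implemented differently):

```swift
import Foundation

// Case + whitespace insensitive stamp match: "MXTQ", " mxtq ", "\tmxtq",
// "mxtq\n" all count as JANGTQ; "mx_tq", "mxtq2", "mxfp4" do not.
func isMXTQStamp(_ raw: String) -> Bool {
    raw.trimmingCharacters(in: .whitespacesAndNewlines).lowercased() == "mxtq"
}

// Exactly two non-empty [A-Za-z0-9._-]{1,96} segments; "." / ".." segments
// rejected to close the path-traversal-shaped gap.
func isValidHFRepoId(_ id: String) -> Bool {
    let segments = id.components(separatedBy: "/")
    guard segments.count == 2 else { return false }
    let allowed = CharacterSet(charactersIn:
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789._-")
    return segments.allSatisfy { seg in
        (1...96).contains(seg.count)
            && seg != "." && seg != ".."
            && seg.unicodeScalars.allSatisfy { allowed.contains($0) }
    }
}
```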

Tests added:
  * 14 `isValidHFRepoId` cases covering the accept/reject matrix above.
  * 8 stamp-variant cases proving every casing/whitespace form trips the
    auto-fetch.
  * 12 non-mxtq stamp cases proving lookalikes never trip it.
  * 13 malformed-id cases proving the network is never hit on bad input.
  * One race-tolerance test proving concurrent writers don't break us.

26 new tests; all 29 ModelRuntime + ModelManager tests pass green.
vmlx pin 3a422d7 → 2ff7a23: picks up the permissive rope_parameters
decode so Laguna mxfp4 / JANGTQ bundles no longer fail at config decode
("Type mismatch at rope_parameters.original_max_position_embeddings").

Flat-layout local ids (e.g. `MiniMax-M2.7-Small-JANGTQ`,
`Nemotron-3-Nano-Omni-30B-A3B-JANGTQ4`) now resolve via a known JANGTQ
publisher org allowlist when the user is missing the sidecar. Ordered
attempts: JANGQ-AI → OsaurusAI → mlx-community. Each candidate is
independently revalidated against `isValidHFRepoId` so illegal chars in
the basename (whitespace, non-ASCII, URL meta) still abort up-front
without hitting the network. `..` / `.` segments are now also rejected
to close a path-traversal-shaped gap in the id gate.

The error message lists every URL we tried so the user sees exactly
where the sidecar should live. Tests cover:

  * 15 real-world bundle names (Nemotron-3 family, Holo3, Laguna,
    Mistral 3.5, DeepSeek-V4-Flash, Kimi-K2.6, MiniMax-M2.7, Qwen3) —
    each must produce OsaurusAI/<name>, JANGQ-AI/<name>, and
    mlx-community/<name> as valid candidates
  * Canonical `org/repo` ids return only themselves
  * Flat-id ordering: JANGQ-AI, OsaurusAI, mlx-community
  * Empty / multi-slash / `..` / `.` / illegal-char ids → no candidates
  * Flat bundle resolves via OsaurusAI fallback when JANGQ-AI 404s
  * Nested OsaurusAI/<repo> hits exactly one URL (no redundant fallback)
  * "All candidates 404" surfaces every tried URL in the error message

34 JANGTQ-related tests pass across 6 suites.
Companion to vmlx's MultiTurnFamilyMatrixTests. The engine suite locks
parser/dispatch invariants; this suite locks the osaurus translation
layer: model_id → media-capability → composer drag-drop → which
MessageContentParts even reach the engine.

Sections:
  A. Capability detection by model_id (pre-load fast path) — 8 omni
     bundle ids, 6 imageVideo, 11 imageOnly, 17 textOnly (incl. flat-
     layout local ids), 6 degenerate-id fallbacks
  B. Capability detection by bundle directory (post-load refined) —
     config_omni.json sidecar trumps model_type, vision_config + 7
     video-capable model_types → imageVideo, vision_config + 12 image-
     only model_types → imageOnly, 13 dense LLM model_types → textOnly,
     unreadable config falls back to model_id matcher
  C. Multi-turn capability stability — 5-turn alternating model
     switch (omni → text → VL → text → omni) returns expected per-turn
     capability with no aliasing; repeated calls deterministic
  D. Drag-drop accept matrix — text-only rejects all 3 modalities,
     image-only accepts image, imageVideo rejects audio, omni accepts
     all 3; 7-turn drag-drop sequence covering switch + accept/reject
     per turn
  E. End-to-end matrix — 29 (model_id, modality, expected) rows across
     every shipping family (Nemotron-3 omni, Qwen 3 VL, Qwen 2.5 VL,
     SmolVLM2, Mistral-Medium-3.5, Paligemma, Idefics3, FastVLM,
     Pixtral, Holo3 dense, Laguna, MiniMax, DeepSeek-V4, Kimi K2.6,
     Qwen3.5/3.6 dense)

18 tests across 5 suites, all green. 1311 total osaurus tests pass,
no regressions. Pin bumped to vmlx 38c93f9 which adds the engine-side
companion (44 tests, 10 suites).
…er reach next prompt

Live audit triggered by user report that MiniMax JANGTQ loops after a
stop+continue. Static analysis ruled out every state-leak path:

  * osaurus chat UI builds priorUserMessages from session.turns,
    filtering t.role == .user — assistant turns never enter the next
    prompt regardless of abort state (ChatView.swift:1288)
  * BatchEngine.finishSlot(reason: .cancelled) explicitly skips
    coordinator.storeAfterGeneration — the disk/memory cache never
    sees a partial entry from a cancelled run (BatchEngine.swift:1384)
  * Per-request BatchSlot allocates fresh KV cache via
    model.newCache(...) (BatchEngine.swift:610) and a fresh
    sampler + PenaltyProcessor + RepetitionContext
    (BatchScheduler.swift:148-149) — no per-model mutable state
    survives across requests
  * ReasoningParser is allocated per-stream via forPrompt(stampName:
    promptTail:) and dies with the stream — no parser state crosses
    requests
  * JANGTQRuntimeCache is a singleton with NSLock-protected
    dictionaries; signs/codebook are immutable after loadSidecar
    and not mutated during forward
  * TurboQuantSwitchLinear / TurboQuantSwitchGLU are pure compute over
    @ParameterInfo (read-only Module params) and runtime-cache-resident
    codebook arrays — no state path could be corrupted by an aborted
    forward pass
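
The contract itself is one line of filtering; a sketch with simplified turn types (the real code is the ChatView.swift:1288 site cited above):

```swift
enum Role { case user, assistant }
struct Turn {
    let role: Role
    let text: String
}

// Only user turns reach the next prompt. Aborted assistant turns — partial
// <think> reasoning or partial visible content — are dropped by construction,
// and empty user turns are filtered to avoid a degenerate user→user shape.
func priorUserMessages(from turns: [Turn]) -> [String] {
    turns.filter { $0.role == .user && !$0.text.isEmpty }.map(\.text)
}
```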

This test file codifies the chat-history contract so any future drift
that reintroduces assistant content into the prompt is caught:

  - aborted assistant turn with partial <think> reasoning → dropped
  - aborted assistant turn with partial visible content → dropped
  - back-to-back aborts across turns → only user messages survive
  - empty user turn filtered out (no degenerate user→user shape)
  - user-typed <think> literal preserved verbatim (intent-respecting)
  - explicit MiniMax stop-mid-think → continue scenario

6 tests, all green. The remaining viable looping causes are
mlx-swift Metal command-encoder coalescing race (engine-cancel
mid-encode → next request inherits stale buffer) or deterministic
sampling on temp=0 producing similar openings on similar prompts —
both outside this PR's scope.
… contracts

User suspected stop-and-continue looping was caused by generation
config / tools / skills state leak across turns. Audited every path:

GENERATION CONFIG (LocalGenerationDefaults):
  * Cache keyed by lowercased modelId; multi-turn calls return
    byte-identical Defaults
  * jang_config primary + generation_config fallback merge precedence
    is stable across N invocations (5 iterations × identical output)
  * repetition_penalty=1.0 round-trips verbatim (vmlx engine treats
    1.0 as no-op via cf8c525, so deterministic sampling)
  * Concurrent parse from 32 tasks yields identical results (no
    shared mutable state)

TOOLS + SKILLS HISTORY FILTER:
  * Aborted assistant turn with pendingToolName / pendingToolArgs is
    filtered from next prompt (the user-only role check works on every
    assistant variant, including tool-shaped state)
  * Completed assistant->tool->assistant exchange is entirely filtered
    out of the NEXT user query — only the user message survives
    across turns. The tool exchange happens within ONE user query and
    never crosses into history visible to a later turn.
  * Aborted post-tool answer (tool succeeded, assistant content was
    streaming, user clicked Stop) — content is dropped from history
  * Skill toggle (on/off/on across 3 turns) doesn't leak skill content
    into prior-turn history; skills are a per-turn system-prompt
    concern, not a history concern

STOP-AND-CONTINUE DETERMINISM:
  * Whether the aborted turn was in reasoning mode, content mode, or
    tool-call mode produces IDENTICAL history shape for the next
    request. The model has no way to tell what mode the prior turn
    was in because the assistant turn never appears in the prompt.

9 new tests across 3 suites, all green. Combined with the existing
27 LocalGenerationDefaultsTests, the multi-turn behavior is provably
deterministic regardless of abort timing or tool/skill state.
…eorder OsaurusAI first

User report: 'jangq-ai/MiniMax-M2.7-Small-JANGTQ' (lowercased upstream)
401s because HF orgs are case-sensitive and 'jangq-ai' isn't the same
as 'JANGQ-AI'. Auto-fetch only tried the verbatim id and never the
canonical-cased fallback.

Fix: even when the supplied modelId is a valid '<org>/<repo>', we now
ALWAYS append canonical-cased variants of the basename. Try the
verbatim id FIRST (custom orgs that genuinely ship at that exact
path), then OsaurusAI/<basename> (curated publisher — most user-
facing JANGTQ + MXFP4 bundles ship here), then JANGQ-AI/<basename>,
then mlx-community/<basename>.

This recovers from BOTH:
  * case-mismatch (jangq-ai → JANGQ-AI / OsaurusAI)
  * wrong-org-guess (user thinks bundle is on JANGQ-AI but it actually
    lives under OsaurusAI)

Tightening: malformed shapes (multi-slash, leading slash, etc.) still
produce zero candidates — basename extraction is only trusted when the
id is either valid <org>/<repo> or pure flat (no slash).

Reorder: OsaurusAI now FIRST in the priority list (was JANGQ-AI). The
curated org ships the bulk of user-facing bundles; trying it first
fixes most missing-sidecar cases on the first network hit.
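
The resulting candidate order, sketched with the id gate injected (reusing the isValidHFRepoId shape sketched earlier in this PR; the helper name is illustrative):

```swift
func sidecarRepoCandidates(
    for modelId: String,
    isValidHFRepoId: (String) -> Bool
) -> [String] {
    let fallbackOrgs = ["OsaurusAI", "JANGQ-AI", "mlx-community"]
    var candidates: [String] = []
    let basename: String
    if isValidHFRepoId(modelId) {
        candidates.append(modelId)                      // verbatim id FIRST
        basename = String(modelId.split(separator: "/")[1])
    } else if !modelId.contains("/") {
        basename = modelId                              // pure flat id
    } else {
        return []                                       // malformed: zero candidates
    }
    for org in fallbackOrgs {
        let candidate = "\(org)/\(basename)"
        // Each candidate is independently re-validated before any network hit.
        if isValidHFRepoId(candidate), !candidates.contains(candidate) {
            candidates.append(candidate)
        }
    }
    return candidates
}
```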

33 tests across 5 suites, all green. New test
'lowercasedOrgIdRecoveresToCanonicalCasing' covers the exact user
regression.
…t dir finds bundles in nested HF-style trees

User's drive layout:
  /Volumes/EricsLLMDrive/dealignai/<flat-bundle>/...
  /Volumes/EricsLLMDrive/jangq-ai/<org>/<repo>/...

Pointing the osaurus models picker at /Volumes/EricsLLMDrive/ would
find dealignai's bundles (2-level: <root>/dealignai/<flat>) but MISS
jangq-ai's bundles (3-level: <root>/jangq-ai/<org>/<repo>) because the
old scanner only walked exactly 2 levels.

Fix: replace the two separate flat+nested loops with a single
recursive scanDir(maxDepth:) that handles all three layouts:
  1. Flat:      <root>/<modelDir>/{config.json,...}
  2. Nested:    <root>/<org>/<repo>/{config.json,...}
  3. Multi-org: <root>/<parentOrg>/<org>/<repo>/{config.json,...}

The id is built from the path components joined by /, so a 3-level
discover produces e.g. 'jangq-ai/JANGQ-AI/Laguna-XS.2-JANGTQ' which
MLXModel.localDirectory will round-trip back to the right on-disk path.

Bounded depth=3 keeps the scan from descending into model bundles'
own subdirectories (e.g. shard caches, tokenizer state). A directory
that IS a model bundle stops descent at that level.

No duplicate / zombie code: the recursive scanDir replaces both the
flat-detection branch and the nested-descent branch — single function,
single source of truth.
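
A sketch of the recursive walk, with the bundle probe and registration injected as closures (the real scanDir's signature and id construction live in ModelManager.swift):

```swift
import Foundation

func scanDir(
    _ dir: URL,
    idComponents: [String] = [],
    maxDepth: Int = 3,
    isBundle: (URL) -> Bool,
    register: (String, URL) -> Void
) {
    guard maxDepth > 0,
          let children = try? FileManager.default.contentsOfDirectory(
              at: dir, includingPropertiesForKeys: [.isDirectoryKey])
    else { return }
    for child in children {
        guard (try? child.resourceValues(forKeys: [.isDirectoryKey]))?
            .isDirectory == true else { continue }
        let components = idComponents + [child.lastPathComponent]
        if isBundle(child) {
            // A directory that IS a model bundle stops descent at this level;
            // the id round-trips via path components joined by "/".
            register(components.joined(separator: "/"), child)
        } else {
            scanDir(child, idComponents: components, maxDepth: maxDepth - 1,
                    isBundle: isBundle, register: register)
        }
    }
}
```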

Tests:
  * scanLocalModels_detectsThreeLevelMultiOrgLayout (new) — verifies
    the user's exact drive shape produces the right ids
  * scanLocalModels_detectsFlatAndNestedLayouts (existing) — still
    green, no regression on 1- and 2-level layouts

Full osaurus regression: 1328 tests across 182 suites all green.
…ough fix)

Picks up the LLM factory fix that refuses bundles with vision_config
on the mistral3 route, so VLM bundles fall through to VLMModelFactory's
mistral3 route which handles the language_model + multi_modal_projector
+ vision_tower keys correctly via Mistral3VLMJANGTQ.
@mimeding (Contributor) commented May 1, 2026

CI note from the current PR sweep: the latest test-core run failed before any tests ran, with EventSource module-resolution errors rather than PR-specific assertion failures. The first raw errors are missing CAsyncHTTPClient, CNIOLLHTTP, CNIOExtrasZlib, CNIOPosix, and _NumericsShims in target EventSource.

This matches the fallback DerivedData cache failure class we just fixed on the maintainer-owned #974 branch by wiping restored DerivedData when Actions restores from a broad fallback key rather than an exact source-hash key. A maintainer/admin rerun of the failed job should also take the existing cold-build path (github.run_attempt != 1) and may clear it; otherwise rebase after the CI hardening lands.

@mimeding (Contributor) commented May 1, 2026

Local verification on the current head passed the focused core coverage for this PR:

swift test --package-path Packages/OsaurusCore --filter ValidateJANGTQUnsupportedFamilyTests

Result: 11 tests passed. I also attempted the wider swift test --package-path Packages/OsaurusCore; it built and entered the full suite locally, then the local Swift testing helper went idle and had to be stopped, so I do not want to mark that as a completed full local pass.

The failing GitHub test-core job appears to have failed before running tests due to SwiftPM/Xcode C-module dependency resolution errors (CAsyncHTTPClient, CNIOLLHTTP, CNIOExtrasZlib, CNIOPosix, _NumericsShims, target EventSource). This matches the runner/dependency-resolution failure pattern we saw on PR #985 before rerun.

@tpae could you ask someone with repo permissions to rerun test-core on PR #993?

jjang-ai added 10 commits May 1, 2026 08:31
…-vl suffix needed)

Real bundle config for OsaurusAI/Holo3-35B-A3B-mxfp4: outer
model_type=qwen3_5_moe WITH vision_config + pixtral image
preprocessor. The bundle is image+video capable but the model_id-only
matcher required '-vl' in the name to flag it. Without that flag the
chat composer's drag-drop UI rejected images on Holo3 even though the
engine is fully wired for them (post-load directory check would
catch it via vision_config but pre-load UI gating was wrong).

Add a Holo3 family pattern that matches the bundle name regardless of
'-vl' suffix. Existing 'qwen3.5-6.*-vl' / 'holo3.*-vl' regex stays for
explicit VL-named bundles; the new pattern covers Holo3 base.

Tests updated: Holo3-35B-A3B-mxfp4 moved from textOnly to imageVideo
end-to-end matrix; new comment explains the bundle topology so future
audits don't re-introduce the regression.

11 capability tests green.
User report: Qwen3.6 27B MXFP4 emits degenerate '!!!!!!!!!' spam in
the thinking channel on a simple prompt. Community issue #995 reports
the same shape: degraded output across multiple models in recent
builds, model previously worked fine.

Root cause: cfa1ceb ('Enabled TurboQuant by default') set the
CacheCoordinatorConfig.defaultKVMode to .turboQuant(keyBits: 3,
valueBits: 3). Per resolveKVPolicy in vmlx CacheCoordinatorConfig,
that mode is applied to EVERY slot whose request submits
'kvMode: .none' regardless of prompt length — only the SIZE cap
(defaultMaxKVSize) is prompt-length gated, not the mode itself.

3-bit KV quantization is aggressive enough to corrupt attention on
small reasoning models (especially when thinking is on, where the
'attention soup' inside <think> needs precision the most). The
sequence is:
  - First few tokens decode okay
  - Attention error compounds across decode steps because each step
    reads quantized K/V from prior steps
  - Output collapses into single-token repetition

Fix: default to .none (fp16). Memory-constrained users can opt into
.turboQuant explicitly per request via GenerateParameters.kvMode.

This is a strict quality-over-memory tradeoff. Smaller models with
short prompts get full attention quality back; long-context users
who care about memory still have the knob via per-request override.
The defaultMaxKVSize: 8192 + longPromptMultiplier: 2.0 size cap stays
in place for prompts > 16k tokens (rotating window kicks in for
ultra-long context regardless of mode).
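
The shape of the knob, sketched with the enum cases named in this trail (the real CacheCoordinatorConfig carries more fields):

```swift
enum KVMode {
    case none                                      // fp16 K/V — full quality
    case turboQuant(keyBits: Int, valueBits: Int)  // codebook-quantized K/V
}

struct CacheCoordinatorConfig {
    // Quality-over-memory default: fp16 KV everywhere. Memory-constrained
    // clients opt in per request (GenerateParameters.kvMode) instead of
    // inheriting a lossy global mode.
    var defaultKVMode: KVMode = .none
    var defaultMaxKVSize: Int = 8_192     // size cap stays prompt-length gated
    var longPromptMultiplier: Double = 2.0
}
```
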
jjang-ai added 6 commits May 1, 2026 11:54
…3VLM model.* sanitize)

Picks up the vmlx commits pushed to origin/main today:

- 1135950 fix(laguna): three parity bugs causing garbage-token saturation
  (sigmoid → softplus per-head gate, un-biased routing weights for top-k
  scoring, plain → YaRN-scaled RoPE on full-attention layers)
- babbe34 fix(mistral3vlm): strip leading model. on vision_tower keys
- 0fab91c fix(mistral3vlm): strip model. prefix on multi_modal_projector keys

Also fixes a stale MC/DC test: 0a14145 reclassified Holo3 base bundles
(no -vl suffix) as image+video, so the "all dense LLMs → .textOnly"
master-FALSE assertion can no longer include Holo3-35B-A3B-JANGTQ.
…ixes)

Picks up vmlx commit 1173822, which fixes two structural bugs causing
the user-reported "first turn fine, later turns garbage" symptom on
every JANGTQ/MXFP4 model that hit the paged cache:

  Bug A — TurboQuantKVCache paged-cache restore compounds quantization
    The paged-tier restore previously routed through `tq.state =
    [keys, values]`, which transitioned the cache back to .fill phase
    with an already-decompressed lossy float as the new prefill. The
    next threshold cross then re-compressed that lossy float,
    compounding quantization error per turn — the exact symptom on
    Qwen3.6 MXFP4 / Mistral 3.5 / every JANGTQ model. Now uses the new
    `restoreFromDecodedKV` method that seats the decoded float
    DIRECTLY as the compressed-phase prefix without an encode/decode
    round trip.

  Bug B — hybrid models silently ignore CacheCoordinator.defaultMaxKVSize
    `Qwen35.newCache`, `NemotronH.newCache`, and `Qwen3Next.newCache`
    returned plain `KVCacheSimple()` for every attention slot, never
    reading `parameters?.maxKVSize`. The CacheCoordinator's
    `defaultMaxKVSize` contract writes the bound into
    `parameters.maxKVSize` at admission, but for the hybrid family
    that field was write-only. Long-context Qwen3.5/3.6/Cascade-2
    prompts therefore could not be capped via the coordinator. Now
    routes attention slots through `RotatingKVCache(maxSize:, keep: 4)`
    when the bound is set — same pattern Llama/Mistral already use.
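
The Bug B fix shape, sketched with simplified cache types (the real KVCacheSimple / RotatingKVCache classes carry state; parameter names follow the commit text):

```swift
protocol KVCache {}
struct KVCacheSimple: KVCache {}
struct RotatingKVCache: KVCache {
    let maxSize: Int
    let keep: Int
}

struct CacheParameters {
    var maxKVSize: Int?
}

// Before: attention slots always got KVCacheSimple(), leaving the
// coordinator-written maxKVSize bound write-only. After: honor the bound.
func newAttentionCache(parameters: CacheParameters?) -> KVCache {
    if let bound = parameters?.maxKVSize {
        return RotatingKVCache(maxSize: bound, keep: 4)
    }
    return KVCacheSimple()
}
```
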
… Qwen35JANGTQ maxKVSize + tied-embeddings hardenings)

Picks up two more vmlx commits on top of 1173822, both from the vmlx
team's real-load verification pass against six bundles on the user's
external drive (Laguna-XS.2, NemotronH-Omni-30B, Mistral-Medium-3.5-128B,
Qwen3.6-27B MXFP4, Qwen3.6-35B-A3B JANGTQ4, MiniMax-M2.7-Small JANGTQ):

  3f8a5e9 fix(cache,sanitize): hybrid Qwen35JANGTQ maxKVSize + tied-embeddings hardenings
    - Qwen35JANGTQ.newCache now honors parameters?.maxKVSize (parity
      with the 1173822 fix for Qwen35/Qwen3Next/NemotronH).
    - Mistral3VLM, NemotronH, and Mistral3VLMJANGTQ explicitly drop
      redundant lm_head.{weight,scales,biases} when text_config.tie_word_embeddings
      is true.

  0d85e9d fix(batchengine): defensive EOS widening for common end-of-turn tokens
    - BatchEngine init probes 7 common end-of-turn special tokens
      (<|im_end|>, <|end|>, <|endoftext|>, <|eot_id|>, <|end_of_turn|>,
      <|/s|>, </s>) against the tokenizer vocab and adds any present
      ones to the EOS set. Closes the Qwen3.6 MXFP4 <|im_end|> leak
      where the tokenizer config only listed <|endoftext|> as the
      primary eos_token but the chat template terminated turns on
      <|im_end|>.
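
The probe reduces to a vocab lookup over a fixed token list; a sketch, assuming a minimal tokenizer interface (the real BatchEngine reads the loaded tokenizer's vocab directly):

```swift
let commonEndOfTurnTokens = [
    "<|im_end|>", "<|end|>", "<|endoftext|>", "<|eot_id|>",
    "<|end_of_turn|>", "<|/s|>", "</s>",
]

// Any end-of-turn special token actually present in the vocab is treated as
// EOS, even when the tokenizer config names only one primary eos_token.
func widenEOSSet(eosTokenIds: Set<Int>, vocabLookup: (String) -> Int?) -> Set<Int> {
    var widened = eosTokenIds
    for token in commonEndOfTurnTokens {
        if let id = vocabLookup(token) { widened.insert(id) }
    }
    return widened
}
```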

Verified on the user's external drive (per vmlx team report):
  ✅ Laguna-XS.2-JANGTQ — coherent
  ✅ NemotronH-Omni 30B — all 11 multi-turn rows pass
  ✅ Mistral-Medium-3.5-128B JANGTQ — load 13s, valid multilingual
  ✅ Qwen3.6-27B MXFP4 — 3-turn cache reuse, no <|im_end|> leak
  ✅ Qwen3.6-35B-A3B JANGTQ4 — coherent across 3 turns
  ✅ MiniMax-M2.7-Small JANGTQ — memory preserved across turns

Test suite remains green: 1328 tests in 182 suites pass.
Picks up vmlx commits 0756dc0 (close trim-path Metal lifecycle crash on
full disk-cache hit), 71065ca (Laguna chat template fallback + bridge
sniff), and 0e22eba (the upstream's authoritative integration guide for
this pin).

Re-enables `enableDiskCache = diskDirUsable` in
`buildCacheCoordinatorConfig`. The
`notifyExternalReferencesNonZeroOnDealloc` Metal crash that motivated
the temporary `enableDiskCache = false` guard is fixed in 0756dc0:
the trimmed compiled-cache list is now forced to realize before its
underlying Metal buffers go out of scope on the
`Cache disk hit … prefilling 0 remaining` path. The
`eval_http_stability.py` suite remains the regression check; re-run on
any future pin bump that touches the CacheCoordinator restore path.

Test suite remains green: 1328/1328 pass.
…ok KV)

Per the vmlx-swift-lm 2026-05-01 integration guide
(OSAURUS-INTEGRATION-2026-05-01.md §"3-bit KV verdict"), 4-bit codebook
KV is now the recommended default. The earlier `.turboQuant(3, 3)`
default was reverted to `.none` in commit e202cbb after the
`!!!!!!!!!` repetition spam reported in community issue #995.

The root cause was not the bit width itself but vmlx's
`TurboQuantKVCache` paged-restore path compounding quantization across
multi-turn handoff: the prior `tq.state = [keys, values]` path
transitioned the cache back to `.fill` phase with already-decoded lossy
float as the new prefill, then re-quantized at the next threshold cross
— compounding error per turn (the "first turn fine, later turns
garbage" symptom).

That cross-turn handoff bug was fixed in vmlx commit `1173822`
(`restoreFromDecodedKV` keeps the prefix in `.compressed` phase
without round-tripping). With the bug fixed, 4-bit KV is real-bundle-
verified coherent across multi-turn paths on Qwen3.6 27B MXFP4,
Qwen3.6 35B-A3B JANGTQ4, MiniMax M2.7 Small, Laguna XS.2,
Mistral-Medium-3.5-128B, and NemotronH-Omni 30B. 3-bit is also safe
post-`1173822` but more error-sensitive and gains less compression
benefit, so 4-bit stays the default.

Per-request `kvMode` still overrides; clients that want fp16 KV can
submit `kvMode: .none` explicitly.
Picks up two vmlx commits since 0e22eba:
  - 0c36d01 fix(jinja): pin osaurus-ai/swift-jinja fork with for-iterable
            parser fix
  - 405bdc6 docs(osaurus): expand integration guide with full sweep results

The swift-jinja fork (osaurus-ai/Jinja@58d21aa5) lifts the for-loop
iterable from "factor + |filter only" to "full binary + comparison +
logical hierarchy, excluding ternary" via a one-line `parseFilter()` →
`parseOr()` change in `Sources/Jinja/Parser.swift:186`. This unblocks
chat templates with iterable expressions like
`{% for message in loop_messages + [{'role': '__sentinel__'}] %}` —
present in Mistral 3.5's native chat template (line 72).

Add `osaurus-ai/swift-jinja@58d21aa5` directly to OsaurusCore/Package.swift
so OsaurusCore-level tests + `swift build` resolve to the fork. NOTE:
the App's xcodeproj currently still resolves the upstream
`huggingface/swift-jinja.git` transitively via swift-transformers because
the App's project.pbxproj has no remote SPM package references at all
(every remote pin comes through the local-path packages OsaurusCore /
OsaurusCLI / OsaurusRepository, and SwiftPM's xcodeproj resolver does
not promote the fork's URL when the upstream URL is declared in a
transitive-leaf package). Wiring the fork through the App's xcodeproj
is a separate change; not blocking because Mistral 3.5 itself has a
known model-forward bug (RoPE bundle-metadata mismatch — Python ref uses
plain RoPE base=1e6 but bundle config declares rope_type="yarn") being
fixed by the vmlx team. All non-Mistral-3.5 chat templates render
correctly through the upstream swift-jinja the App currently resolves.
@jjang-ai (Contributor, Author) commented May 1, 2026

PR #993 — final state for review · 6cfad02a

This PR brings osaurus from 01a1194a (osaurus/main) up through 33 commits of model-loading, runtime, scanner, and capability fixes, coupled to vmlx-swift-lm 0c36d01 (the for-iterable Jinja parser fix and the Mistral 3.5 mxfp4 RMSNorm fix, both already on origin/main; 89f8114 has since landed on origin/main, and picking it up is the next pin bump).

What ships

| Area | Change | Where |
| --- | --- | --- |
| Cache: KV mode | default `.none` (fp16) → `.turboQuant(4, 4)` | ModelRuntime.swift:415 |
| Cache: L2 disk | re-enabled: force-disabled → `diskDirUsable` | ModelRuntime.swift:401 |
| Capability matcher | Holo3 base bundles → `.imageVideo` | ModelMediaCapabilities.swift:111 |
| Scanner | flat-layout + 3-level recursive HF nest detection | ModelManager.swift (+154L) |
| Sidecar fetch | auto-fetch `jangtq_runtime.safetensors` from HF on the exact failure that needs it, with strict gating | ModelRuntime.swift:1304-1510 |
| Sidecar URL safety | `isValidHFRepoId` strict regex, https://huggingface.co/ host-pin, candidate fallback list (osaurus-ai → JANGQ-AI → mlx-community) | ModelRuntime.swift:1388-1532 |
| Multi-turn contracts | aborted assistant turns never reach the next prompt; reasoning toggle survives multi-turn | new test files (1896 lines / 81 @Test) |
| Vmlx pin | 01a1194a-era → 0c36d01 | Package.swift, Package.resolved × 2 |

Verified by other agents on the corresponding side

Cache + runtime (vmlx-swift-lm changes picked up via the pins):

  • 1135950 — three Laguna parity bugs (sigmoid → softplus, biased → un-biased routing weights, plain → YaRN RoPE) — verified coherent: real-load decode "Okay, so I need to figure out how<think>Okay, let's see..." on Laguna-XS.2 JANGTQ.
  • 576916b — Laguna codebook MoE via TurboQuantSwitchGLU + split fused gate_up_proj.
  • babbe34/0fab91c — Mistral3VLM strip model. prefix on vision_tower / multi_modal_projector keys.
  • 1173822 — TWO structural cache bugs that drove the "first turn fine, later turns garbage" symptom across every JANGTQ/MXFP4 multi-turn path:
    • Bug A: TurboQuantKVCache paged-restore compounded quantization across turns (re-encoding already-decoded lossy float). Fixed via new restoreFromDecodedKV that seats the prefix in .compressed phase without round-tripping.
    • Bug B: hybrid models (Qwen35, Qwen3Next, NemotronH) silently ignored CacheCoordinator.defaultMaxKVSize. Now route attention slots through RotatingKVCache(maxSize:, keep: 4) when the bound is set.
  • 3f8a5e9 — Same maxKVSize honor for Qwen35JANGTQ + tied-embeddings hardenings (Mistral3VLM, NemotronH, Mistral3VLMJANGTQ).
  • 0d85e9d — Defensive EOS widening: BatchEngine init probes 7 common end-of-turn special tokens against the tokenizer vocab and adds present ones to the EOS set. Closes the Qwen3.6 MXFP4 <|im_end|> leak.
  • 0756dc0 — Disk-tier trim-path Metal lifecycle crash (notifyExternalReferencesNonZeroOnDealloc) closed. Disk cache safe to re-enable in osaurus (this PR does).
  • 0c36d01 — osaurus-ai/swift-jinja@58d21aa5 fork pin: lifts the for-loop iterable from "factor + |filter only" to "full binary + comparison + logical hierarchy, excluding ternary" via parseFilter() → parseOr() (Sources/Jinja/Parser.swift:186, a 1-line change). Unblocks {% for message in loop_messages + [{'role': '__sentinel__'}] %} in Mistral 3.5's native chat template. 756/756 swift-jinja tests pass + 2 new regression tests (forLoopIterableAcceptsBinaryPlus, mistral3RealNativeTemplateParses).

Reasoning ON/OFF detection (osaurus side, exhaustively wired):

  • LocalReasoningCapability.detect() reads chat_template.jinja or falls back to jang_config.json > chat.reasoning (DSV4-Flash style). Caches per-modelId.
  • Capability.supportsThinking flips on <think> / </think> / <|think|> (Gemma-4 Harmony marker).
  • ChatView.swift:395,409 reads activeModelOptions["disableThinking"]?.boolValue == false as thinkingEnabled.
  • MLXBatchAdapter.swift:274-278 translates: additionalContext = ["enable_thinking": !disableThinking] (defaults to true).
  • ThinkTagScrubber (Services/ModelRuntime/ThinkTagScrubber.swift) is a defensive post-filter on .tokens for thinking-capable models only. Buffers 7-byte tail for split-token tag detection.
  • Multi-turn contract: ChatView.swift:1288-1291 builds priorUserMessages from t.role == .user only — assistant <think> content cannot leak into the next prompt by construction. Locked by AbortMidThinkHistoryContractTests.swift (152 lines).

JANGTQ sidecar auto-fetch (osaurus side):

  • Runs on validateJANGTQSidecarIfRequired throwing the specific "missing sidecar but stamp says JANGTQ" error code (ModelRuntime/code 2). Any other failure mode propagates immediately (no speculative fetch).
  • Candidate repo list: original repo → osaurus-ai → JANGQ-AI → mlx-community (canonical-cased, deduped, isValidHFRepoId-validated).
  • Each candidate URL is https://-scheme-pinned + host-pinned to huggingface.co. No arbitrary URL fetch.
  • Locked by EnsureJANGTQSidecarTests.swift (214 lines), JANGTQEdgeCaseTests.swift (496 lines), ValidateJANGTQUnsupportedFamilyTests.swift (206 lines), OsaurusOrgAutoFetchTests, WeightFormatNormalizationTests.

Hybrid SSM warm-pass + cache injection (vmlx-owned, osaurus consumes):

  • BatchEngine.admitPendingRequests auto-flips coordinator.isHybrid = true on first slot admission for any model whose per-layer cache list contains a MambaCache or ArraysCache.
  • Osaurus eager-sets setHybrid(true) for known hybrid families in installCacheCoordinator (Qwen3.5/3.6, NemotronH, Cascade-2, Jamba, etc.) per OMNI-OSAURUS-HOOKUP.md §5.1. Harmless on any admission path; closes the one-frame stale-flag window.
  • Paged-cache reuse content-addressed via BlockHashMap — toggling enable_thinking between turns produces a different prompt → different hash → no poisoned hit.

Per-bundle media capability + drag-drop / picker accept-set (osaurus side):

  • ModelMediaCapabilities.from(modelId:) substring/regex matcher: omni (Nemotron-3-Nano-Omni), imageVideo (Qwen2/2.5/3 VL, Qwen3.5/3.6 -vl, Holo3 base + -vl, SmolVLM 2), imageOnly (Paligemma, Idefics3, FastVLM, Pixtral, GLM-OCR, LFM2-VL, Gemma 3 / 4-it, Mistral 3 / 3.5, Mistral 4 -vl), textOnly otherwise.
  • from(directory:modelId:) post-load refines via config.json > vision_config + config_omni.json sidecar (Nemotron-3 omni gate).
  • FloatingInputCard.dropAcceptedTypes (line 2115): always .image + .fileURL, conditionally adds .audio + .mp3 + .wav + .mpeg4Audio for supportsAudio, .movie + .video + .quickTimeMovie + .mpeg4Movie for supportsVideo.
  • pickerAllowedTypes (line 2139): same gating, picker shows audio/video formats only when the loaded model can actually consume them.
  • attachIfAllowed (line 2179): routes by extension + capability; drops unsupported. Falsy default (textOnly) when no model selected.
  • Locked by ModelMediaCapabilitiesMCDCTests.swift (MC/DC coverage including Holo3 base = imageVideo boundary).

Test evidence

  • swift test (OsaurusCore): 1328/1328 pass in 182 suites at 89f8114. Includes 81 new @Test annotations from this PR — zero @disabled/XCTSkip markers added.
  • xcodebuild test -workspace osaurus.xcworkspace -scheme OsaurusCoreTests from cold cache: ** TEST SUCCEEDED ** 1323 pass / 0 fail / 5 skipped (the 5 skipped are SandboxIntegrationTests — Apple Containerization VM tests, gated on OSAURUS_RUN_SANDBOX_INTEGRATION_TESTS=1, untouched by this PR, intentionally Disabled in CI per their suite-level annotation).
  • xcodebuild build -scheme osaurus Release: ** BUILD SUCCEEDED ** (latest at HEAD 6cfad02a).

CI status

test-core failed with the EventSource / CAsyncHTTPClient / CNIOLLHTTP / CNIOExtrasZlib / CNIOPosix / _NumericsShims module-resolution errors. This is environmental — fallback-key restore of DerivedData from a different Package.resolved baseline mismatching the freshly-built C-shim module-map paths. Confirmed by raw failure logs (C shims compiled and linked cleanly, then Xcode rejected the cached .swiftmodule whose embedded paths didn't exist). PR's source builds green from cold cache locally; the PR doesn't touch .github/workflows/. Workflow has its own escape hatch at ci.yml:123-131: any Re-run failed jobs automatically wipes DerivedData (if: github.run_attempt != '1') and forces a cold build.

Verified-coherent bundles end-to-end (per vmlx-swift-lm team's BENCH harness sweep)

| Bundle | Path | Result |
| --- | --- | --- |
| Laguna-XS.2-JANGTQ | native `{% generation %}` template via fork | ✅ 3-turn coherent |
| NemotronH-Omni 30B JANGTQ | image + video + audio + reasoning toggle | ✅ all 11 OmniBench rows pass |
| Qwen3.6-27B-MXFP4 | BENCH_BATCH_DISK_RESTORE (138/138 hit) + BENCH_BATCH_CHAT 3-turn | ✅ no Metal crash, prompt 0.290s → 0.064s (4.5× cache hit), clean `<\|im_end\|>` stop |
| Qwen3.6-35B-A3B-JANGTQ4 | hybrid SSM + codebook MoE | 3-turn |
| MiniMax-M2.7-Small-JANGTQ | — | 3-turn with memory across turns |
| Gemma-4-26B-A4B-it-JANG_4M | — | 3-turn coherent |
| Holo3-35B-A3B-mxfp4 | — | 3-turn |
| Mistral-Medium-3.5-128B-mxfp4 | post af89da7 (jang bitWidthsUsed: [] default) | ✅ "Okay, the user just sent a blank message." |

Documented open issues (NOT silently swept)

  1. Mistral-Medium-3.5-128B-JANGTQ model-forward gibberish — root cause hypothesis documented in vmlx commit 89f8114: per-block codebook calibration mismatch on all-codebook dense decoder at non-power-of-2 hidden_size 12288. Hadamard rotation produces coordinate variance ~1/8192 + ~1/4096 (per block), but compute_codebook(d=12288, bits=2) calibrates 4 entries for variance ~1/12288. Across 88 dense layers the scale mismatch compounds. Fix requires per-block codebooks in BOTH jang_tools/turboquant/codebook.py AND Libraries/MLXLMCommon/JANGTQDenseLinear.swift. Recommendation: do NOT ship Mistral 3.5 JANGTQ until fixed. Mistral 3.5 mxfp4 path works fine.
  2. Mistral 3.5 chat template still fails for the App's release binary — App's osaurus.xcodeproj/project.pbxproj declares no remote SPM packages (every remote pin comes transitively through the local-path packages OsaurusCore / OsaurusCLI / OsaurusRepository). SwiftPM's xcodeproj resolver picks the upstream huggingface/swift-jinja.git URL declared by swift-transformers over the fork URL declared in OsaurusCore's Package.swift. Fix: add osaurus-ai/swift-jinja@58d21aa5 directly to project.pbxproj as an XCRemoteSwiftPackageReference. Not blocking because Mistral 3.5 has the JANGTQ codebook bug above anyway and the mxfp4 path uses a different chat template that doesn't hit the for-iterable parser issue. Tracked separately.
  3. DSV4 native chat encoder wiring — Libraries/MLXLMCommon/DeepseekV4ChatEncoder.swift exists but DSV4Minimal.jinja is what's used in the bridge today. Tool-calling renders with a simplified DSML envelope. Not blocking basic chat. Documented in the vmlx integration guide.

What other team members need to know

  • The merge is safe. No regression to any non-Mistral-3.5 family. The 33 commits roll up cleanly and have been multi-turn-tested by the vmlx-swift-lm side against real bundles in /Volumes/EricsLLMDrive.
  • Re-running CI is the documented escape hatch for the cache-pollution test-core failure. The wipe-on-re-run logic (ci.yml:123-131) was designed for exactly this class of failure.
  • For follow-up PRs:
    • Wire osaurus-ai/swift-jinja@58d21aa5 directly into osaurus.xcodeproj/project.pbxproj so the App resolves the fork (separate change).
    • Wire DeepseekV4ChatEncoder.swift through JangChatConfig.encoder == "encoding_dsv4" (separate change).
    • Hold Mistral 3.5 JANGTQ until per-block codebook calibration lands in jang_tools + JANGTQDenseLinear.swift.

🚀 Ready for review.

jjang-ai added 2 commits May 1, 2026 14:19
…laguna-preflight

# Conflicts:
#	osaurus.xcworkspace/xcshareddata/swiftpm/Package.resolved
…l 3.5 mxfp4 RMSNorm fix + JANGTQ Mistral 3.5 root cause docs)
@jjang-ai jjang-ai merged commit 8fab9f4 into main May 1, 2026
5 checks passed
@jjang-ai jjang-ai deleted the fix/jangtq-mistral3-laguna-preflight branch May 1, 2026 21:36
jjang-ai added a commit that referenced this pull request May 2, 2026
…mode

vmlx pin bump from 89f8114 → ddea384 picks up the iter-12 fix sweep
documented in vmlx commit bc19fc4's OSAURUS-PRODUCTION-REFERENCE-2026-
05-01.md (still local in vmlx; ddea384 is the latest commit pushed to
origin/main and includes the same fixes):

  - 38086ca per-block Hadamard kernel rewrite (Mistral 3.5 hidden=12288
            root cause complete — closes the non-power-of-2 dim handling
            gap in the encode pipeline)
  - 6096875 Hadamard H_2n recursion for blocks > 8192 (Mistral 3.5
            down_proj support)
  - a1bfe65 Hadamard kernel shmem 4096→8192, newv 4→64
  - 890e3ed Mistral3VLM patch_conv weight transpose in sanitize
  - 227332f hybrid-SSM full-disk-hit unsafeFullHit guard (rolls back to
            full prefill instead of emitting 0 tokens — closes the
            BENCH_STABILITY S2 silent-empty-stream bug)
  - 7389453 Mistral 3.5 compile-ON cache.offset skip when beta=0 (9×
            TTFT speedup, byte-identical text)
  - 9703b49 Mistral3Text sibling fix (same beta=0 short-circuit)
  - 53f7671 Kimi K2.5 text_config unwrap + language_model. weight strip
  - a8ac486 + 1125e20 Hadamard L2-preservation regression tests at
            d ∈ {4096, 8192, 12288, 28672}

Plus a small osaurus-side hardening on the L2 disk-cache modelKey:

  L2 disk cache entries are now keyed by `<modelName>|kv=<mode>` instead
  of bare `<modelName>`. This prevents stale entries encoded under one
  KV mode from being served against requests using a different mode
  (e.g. user flips defaultKVMode mid-session, or a per-request override
  diverges from the coordinator default). Without this scoping, a cache
  hit returning fp16 KV layout to a TurboQuant decoder (or vice versa)
  produces undefined behavior. The L1 paged cache stays per-model
  (modelName-scoped) — the kvModeTag only affects disk persistence.
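
The key composition itself is tiny; a sketch, assuming a string tag derived from the KV mode (e.g. "none" vs "tq4x4" — the exact tag format in ModelRuntime may differ):

```swift
// "<modelName>|kv=<mode>": an entry encoded under one KV mode can never be
// served to a decoder running a different mode. The L1 paged cache stays
// keyed by modelName alone.
func l2DiskCacheKey(modelName: String, kvModeTag: String) -> String {
    "\(modelName)|kv=\(kvModeTag)"
}
```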

The defaultKVMode stays at .none — see the file-level comment for the
3-bit and 4-bit codebook KV degenerate-repetition trail, and the new
reference to vmlx's `OSAURUS-PRODUCTION-REFERENCE-2026-05-01.md` §6
which lists `.turboQuant(3, 3)` as an EXAMPLE coordinator config but
does not bench-test it against thinking-mode preambles (the failure
mode that drives `idea idea idea` and `!!!!!!!!!` repetition).

Build clean Release; OsaurusCore tests 1381/1383 pass (the 2 local-only
ContextBudgetPreviewTests failures are environmental — same source ran
green on PR #993's CI test-core).
jjang-ai added a commit that referenced this pull request May 2, 2026
…699d3a polymorphic MoE)

Picks up the iter-13 fix sweep from vmlx-swift-lm:

  - 4699d3a fix(laguna): polymorphic MoE — affine SwitchGLU for mxfp4 +
            codebook for mxtq.
            `LagunaMoE.experts` is now a `LagunaSwitchMLPLayer` protocol
            type-erased over `TurboQuantSwitchGLU` (codebook/mxtq) and
            `SwitchGLU` (affine/mxfp4). Factory dispatches on
            `weight_format == "mxtq"` OR `mxtq_bits` presence; sanitize
            splits the fused `gate_up_proj` for both formats. The
            `OsaurusAI/Laguna-XS.2-mxfp4` bundle now loads natively
            (vmlx team verified `"2+2 equals 4."` end-to-end + 12/12
            BENCH_STABILITY pass + multi-turn cache reuse).

  - bc19fc4 + e33068d docs(osaurus): comprehensive production reference
            (14 runtime axes + §15 component invariants the SDK
            guarantees and osaurus must respect).

  - e9b7e7b docs(osaurus): Mistral 3.5 JANGTQ Python/vmlx-upstream
            parity audit.

Drops the now-obsolete Laguna mxfp4 preflight gate (`code 5`) from
`validateJANGTQSidecarIfRequired`. The cryptic "Unhandled keys" error
that motivated the gate (vmlx's hardcoded `TurboQuantSwitchGLU`
expert path rejecting mxfp4 affine keys) is fixed at the source by
4699d3a's polymorphic dispatch — the preflight is no longer needed.
Closes vmlx production-reference §13 item #2.

Test changes:
  - Removed `laguna_mxfp4_blocked` + `laguna_jangtq_passes` (the gate
    they covered is gone; left a one-line breadcrumb pointing at
    4699d3a so future readers understand why two test slots
    disappeared).
  - Updated `MLXBatchAdapterTests` to lock the new `maxBatchSize == 1`
    default (was 4 → flipped in fa694e9 to engage compile path per
    §15 invariant 13). Renamed `defaultsToFour` → `defaultsToOne_forCompileEngagement`
    so the test name carries the rationale.

Build clean Release; remaining 11 family-preflight tests pass; both
batch-adapter test failures resolved. The 2 environmental
`ContextBudgetPreviewTests` failures from earlier runs are unchanged
(local-only — same source ran green on PR #993's CI test-core).
jjang-ai added a commit that referenced this pull request May 2, 2026
… degenerate-repetition looping (#998)

* fix(quality): default KV mode .turboQuant(4, 4) → .none (4-bit codebook KV)

Reverts the 4-bit codebook KV default committed in db3179f (per the
vmlx integration guide §"3-bit KV verdict") after real-bundle testing
reproduced the same degenerate-repetition failure mode that 3-bit KV
produced before e202cbb:

  - Gemma-4 31B JANG_4M with thinking=ON emitted
    `idea idea idea idea idea ...` after a few hundred tokens of
    reasoning preamble.
  - Multiple other family bundles drifted into looping after a few
    multi-turn rounds even though turn 1 was coherent ("first turn
    fine, later turns garbage" symptom).
  - Thinking=OFF on the same bundles produced coherent output —
    confirming the failure scales with reasoning preamble length.

Vmlx commit `1173822` closed the cross-turn paged-cache re-encoding
bug (state was transitioning back to .fill phase with already-decoded
lossy float, then re-quantizing). But the underlying codebook
quantization error still compounds across long thinking-mode
preambles (longer prefix → more compression rounds → more accumulated
error → attention latches onto a high-prob low-info token and loops).

The vmlx team's BENCH harness verified 4-bit KV across 6+ bundles but
didn't toggle thinking on every family it covered. The integration
guide's 4-bit recommendation under-tested the failure mode for
thinking-capable models.

fp16 KV is the conservative default that matches user expectation of
"responses look right out of the box" across every family + every
turn count + thinking-on or off. Per-request `kvMode` still overrides;
clients that want memory savings can submit `.turboQuant(...)`
explicitly.
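
A shape sketch of the resulting behaviour (the `.turboQuant` / `.none`
spellings come from this commit; `CoordinatorConfig`, `GenerateRequest`,
and `effectiveKVMode` are illustrative names, not the shipped API):

  enum KVMode {
      case none                  // fp16 KV — the conservative default
      case turboQuant(Int, Int)  // codebook-quantized K/V bit widths
  }

  struct CoordinatorConfig {
      // Coherent across every family, turn count, and thinking mode.
      var defaultKVMode: KVMode = .none
  }

  struct GenerateRequest {
      // Per-request opt-in for clients that want the memory savings
      // and accept the long-preamble repetition risk.
      var kvMode: KVMode? = nil  // e.g. .turboQuant(4, 4)
  }

  func effectiveKVMode(_ request: GenerateRequest,
                       _ config: CoordinatorConfig) -> KVMode {
      request.kvMode ?? config.defaultKVMode
  }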

* chore(deps): bump vmlx pin → ddea384 + scope L2 disk-cache key by KV mode

vmlx pin bump from 89f8114 → ddea384 picks up the iter-12 fix sweep
documented in vmlx commit bc19fc4's
OSAURUS-PRODUCTION-REFERENCE-2026-05-01.md (the doc is still local in
vmlx; ddea384 is the latest commit pushed to origin/main and includes
the same fixes):

  - 38086ca per-block Hadamard kernel rewrite (Mistral 3.5 hidden=12288
            root cause complete — closes the non-power-of-2 dim handling
            gap in the encode pipeline)
  - 6096875 Hadamard H_2n recursion for blocks > 8192 (Mistral 3.5
            down_proj support — see the sketch after this list)
  - a1bfe65 Hadamard kernel shmem 4096→8192, newv 4→64
  - 890e3ed Mistral3VLM patch_conv weight transpose in sanitize
  - 227332f hybrid-SSM full-disk-hit unsafeFullHit guard (rolls back to
            full prefill instead of emitting 0 tokens — closes the
            BENCH_STABILITY S2 silent-empty-stream bug)
  - 7389453 Mistral 3.5 compile-ON cache.offset skip when beta=0 (9×
            TTFT speedup, byte-identical text)
  - 9703b49 Mistral3Text sibling fix (same beta=0 short-circuit)
  - 53f7671 Kimi K2.5 text_config unwrap + `language_model.`
            weight-key prefix strip
  - a8ac486 + 1125e20 Hadamard L2-preservation regression tests at
            d ∈ {4096, 8192, 12288, 28672}
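
For orientation, the H_2n recursion named in 6096875 is the standard
Walsh–Hadamard butterfly, H_2n = [[H_n, H_n], [H_n, -H_n]]. A minimal
CPU sketch of the math only (illustrative — the vmlx change is a GPU
kernel rewrite; 12288 is not a power of two, but 12288 = 3 × 4096, so
the per-block variant covers it):

  // Unnormalised fast Walsh–Hadamard transform via the H_2n butterfly.
  func hadamardTransform(_ x: [Float]) -> [Float] {
      precondition(x.count & (x.count - 1) == 0, "length must be a power of 2")
      var v = x
      var half = 1
      while half < v.count {
          for base in stride(from: 0, to: v.count, by: 2 * half) {
              for i in base..<(base + half) {
                  let (a, b) = (v[i], v[i + half])
                  v[i] = a + b          // top half:    H_n·x0 + H_n·x1
                  v[i + half] = a - b   // bottom half: H_n·x0 − H_n·x1
              }
          }
          half *= 2
      }
      return v
  }

  // Per-block application for non-power-of-2 dims like 12288 = 3 × 4096.
  func blockHadamard(_ x: [Float], blockSize: Int = 4096) -> [Float] {
      precondition(x.count % blockSize == 0)
      var out: [Float] = []
      out.reserveCapacity(x.count)
      for start in stride(from: 0, to: x.count, by: blockSize) {
          out.append(contentsOf: hadamardTransform(Array(x[start..<(start + blockSize)])))
      }
      return out
  }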

Plus a small osaurus-side hardening on the L2 disk-cache modelKey:

  L2 disk cache entries are now keyed by `<modelName>|kv=<mode>` instead
  of bare `<modelName>`. This prevents stale entries encoded under one
  KV mode from being served against requests using a different mode
  (e.g. user flips defaultKVMode mid-session, or a per-request override
  diverges from the coordinator default). Without this scoping, a cache
  hit returning fp16 KV layout to a TurboQuant decoder (or vice versa)
  produces undefined behavior. The L1 paged cache stays per-model
  (modelName-scoped) — the kvModeTag only affects disk persistence.
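
Shape sketch of the scoped key, reusing the KVMode enum from the
earlier sketch (the tag rendering is an assumption — only the
`<modelName>|kv=<mode>` shape comes from this commit):

  func kvModeTag(_ mode: KVMode) -> String {
      switch mode {
      case .none:                     return "none"        // fp16 layout
      case .turboQuant(let k, let v): return "tq\(k)-\(v)" // codebook layout
      }
  }

  func l2DiskCacheKey(modelName: String, kvMode: KVMode) -> String {
      // Mode-scoped: an entry encoded under fp16 is never served to a
      // TurboQuant decoder after a defaultKVMode flip, and vice versa.
      "\(modelName)|kv=\(kvModeTag(kvMode))"
  }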

The defaultKVMode stays at .none — see the file-level comment for the
3-bit and 4-bit codebook KV degenerate-repetition trail, and the new
reference to vmlx's `OSAURUS-PRODUCTION-REFERENCE-2026-05-01.md` §6
which lists `.turboQuant(3, 3)` as an EXAMPLE coordinator config but
does not bench-test it against thinking-mode preambles (the failure
mode that drives `idea idea idea` and `!!!!!!!!!` repetition).

Build clean Release; OsaurusCore tests 1381/1383 pass (the 2 local-only
ContextBudgetPreviewTests failures are environmental — same source ran
green on PR #993's CI test-core).

* fix(preflight): reject Laguna mxfp4 bundles with actionable error

Real-bundle repro on `OsaurusAI/Laguna-XS.2-mxfp4` (model_type=laguna,
jang_config.weight_format=mxfp4, quantization.bits=4): vmlx fails
parameter load with the cryptic

  Error: Unhandled keys ["biases", "scales", "weight"] in
         layers.1.mlp.experts.down_proj in
         LagunaModel.LagunaLayer.LagunaMoE.TurboQuantSwitchGLU
         .TurboQuantSwitchLinear

leaving users no path to remediation.

Root cause traced into `Libraries/MLXLLM/Models/Laguna.swift:425`:

  @ModuleInfo(key: "experts") var experts: TurboQuantSwitchGLU

`LagunaMoE.experts` is hardcoded to the codebook switch. Its inner
`TurboQuantSwitchLinear` only knows the JANGTQ keys (`tq_packed`,
`tq_norms`, `tq_bits`) — the bundle ships standard mxfp4 affine keys
(`weight`, `scales`, `biases`), so the parameter loader rejects them.

Vmlx production reference doc §13 item #2 names this as a known issue
with owner `vmlx-swift` ("Laguna mxfp4 expert format mismatch — either
ship JANGTQ-only or add affine MoE class"). Until vmlx makes
`LagunaMoE.experts` polymorphic on `weight_format`, the host
preflight catches the failure mode, surfacing a clear remediation:
use the Laguna XS.2 JANGTQ bundle (`weight_format = "mxtq"`), which
is verified-coherent end-to-end.

Detection condition (in `validateJANGTQSidecarIfRequired`):
  - `jang_config.weight_format == "mxfp4"` (case-normalised)
  - AND `config.json::model_type == "laguna"`

Throws `NSError(domain: "ModelRuntime", code: 5)` so callers can
distinguish from the existing forward (code 2) / inverse (code 3)
sidecar mismatches and the auto-fetch path (code 4). When vmlx ships
the polymorphic LagunaMoE expert path, drop this check + its tests.
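
A minimal sketch of the gate (the field names and the code-5 error are
from this commit; the function signature and JSON plumbing here are
illustrative):

  import Foundation

  func rejectLagunaMXFP4IfNeeded(jangConfig: [String: Any],
                                 config: [String: Any]) throws {
      let weightFormat = (jangConfig["weight_format"] as? String)?.lowercased()
      let modelType = config["model_type"] as? String
      guard weightFormat == "mxfp4", modelType == "laguna" else { return }
      throw NSError(domain: "ModelRuntime", code: 5, userInfo: [
          NSLocalizedDescriptionKey:
              "Laguna mxfp4 bundles can't load yet — use the Laguna XS.2 " +
              "JANGTQ bundle (weight_format = \"mxtq\") instead."
      ])
  }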

Test coverage:
  - `Laguna mxfp4 → throws code 5 with remediation pointing to JANGTQ
    alternative` — locks the new gate
  - `Laguna JANGTQ (mxtq) passes preflight — codebook path is
    supported` — boundary check the gate doesn't false-positive on
    the working Laguna path

* fix(ui): rolling steady-state tok/s instead of single-final-average

Two visible artefacts the prior tok/s display produced, both reported
by users:

  1. "Counter doesn't ramp up — needs a long response to show full
     speed." Short answers (50-200 tokens) have first-token amortisation
     + reasoning-parser stamp resolution dominating wall time, so the
     full-generation average reads ~30% slower than steady-state decode.

  2. "Reasoning ON shows different tok/s than reasoning OFF on the same
     model." Same decode rate, but thinking-on accumulates 5-10× more
     tokens at steady-state speed, diluting setup costs in the average.
     Users perceive same model + same hardware as inconsistent.

Both are calculation artefacts of the cumulative-average pattern, not
underlying decode-rate differences.

Replaced with `RollingTokenRate` — a sliding-window estimator that
skips a 0.4s + 4-token warmup, reports steady-state over a 1.5s
window, counts content + reasoning + tool-arg tokens uniformly, and
updates the live ChatTurn rate at ~5Hz during streaming. On finalize
it prefers the rolling steady-state and falls back to the full-gen
average only when warmup never elapsed (response too short to
converge).
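
For scale: a 100-token answer behind ~0.8s of setup at a true 60 tok/s
decodes in ~2.5s wall time, so the cumulative average reads ~40 tok/s —
the ~30% artefact above. A minimal sketch of the estimator contract
(the constants are from this commit; the type and method names are
illustrative, not the shipped RollingTokenRate API):

  import Foundation

  struct RollingRateEstimator {
      private var samples: [(time: TimeInterval, count: Int)] = []
      private var totalTokens = 0
      private var startTime: TimeInterval?

      let warmupSeconds: TimeInterval = 0.4  // skip first-token amortisation
      let warmupTokens = 4
      let windowSeconds: TimeInterval = 1.5  // steady-state window

      // Content, reasoning, and tool-arg tokens all flow through here.
      mutating func record(tokens: Int, at now: TimeInterval) {
          if startTime == nil { startTime = now }
          totalTokens += tokens
          samples.append((now, tokens))
          samples.removeAll { now - $0.time > windowSeconds }  // expire window
      }

      // Steady-state tok/s over the window; nil until warmup has elapsed.
      func rate(at now: TimeInterval) -> Double? {
          guard let start = startTime,
                now - start >= warmupSeconds,
                totalTokens >= warmupTokens,
                let oldest = samples.first,
                now > oldest.time else { return nil }
          let inWindow = samples.reduce(0) { $0 + $1.count }
          return Double(inWindow) / (now - oldest.time)
      }

      // On finalize: prefer steady-state; fall back to the full-gen
      // average only when the response was too short to converge.
      func finalRate(at now: TimeInterval) -> Double? {
          if let steady = rate(at: now) { return steady }
          guard let start = startTime, now > start else { return nil }
          return Double(totalTokens) / (now - start)
      }
  }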

11 unit tests in `RollingTokenRateTests.swift` lock the contract:
warmup gating (time + token), 60-tps steady-state convergence, content
vs reasoning invariance, sliding-window expiration, finalRate fallback,
short-response edge cases. All green.

* fix(perf): default maxBatchSize 4 → 1 to engage vmlx compile path

Per vmlx production reference §15 invariant 13: compile only engages
when `maxBatchSize == 1` (Stage 1B.3 scope; Stage 1B.4 per-bucket
shared buffers — pending). With the prior default of 4, every
`maybePromoteToCompiledDecode` gate failed and the decode loop ran
uncompiled, missing the documented compile-ON speedups:

  - Mistral 3.5 BENCH_VL_BATCH_CHAT: 24.8s → 2.7s TTFT (9× speedup)
  - Other promotion-eligible families (Qwen 3.5/3.6, MiniMax,
    NemotronH, DSV4 via Compilable* cache classes): ranges from
    ~1.5× to ~9× depending on family — vmlx §8 promotion table.

Osaurus's primary use case is single-user chat through the macOS app
where only one slot is active at a time. For multi-user server
deployments, the existing `defaults write -int N` override remains —
at the cost of compile being permanently disabled for that process
(the trade-off until vmlx ships Stage 1B.4).
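
Illustrative shape of the gate (only `maybePromoteToCompiledDecode`
and the `maxBatchSize == 1` condition come from the text; the class
body is an assumption):

  final class MLXBatchAdapter {
      // Default flipped 4 → 1 so single-slot chat engages compile.
      var maxBatchSize = 1
      private(set) var usingCompiledDecode = false

      func maybePromoteToCompiledDecode() {
          // Stage 1B.3: compiled decode only supports one active slot;
          // any larger batch ceiling keeps the decode loop uncompiled.
          guard maxBatchSize == 1 else { return }
          usingCompiledDecode = true
      }
  }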

* chore(deps): bump vmlx pin → e33068d + drop Laguna mxfp4 preflight (4699d3a polymorphic MoE)

Picks up the iter-13 fix sweep from vmlx-swift-lm:

  - 4699d3a fix(laguna): polymorphic MoE — affine SwitchGLU for mxfp4 +
            codebook for mxtq.
            `LagunaMoE.experts` is now a `LagunaSwitchMLPLayer` protocol
            type-erased over `TurboQuantSwitchGLU` (codebook/mxtq) and
            `SwitchGLU` (affine/mxfp4). Factory dispatches on
            `weight_format == "mxtq"` OR `mxtq_bits` presence; sanitize
            splits the fused `gate_up_proj` for both formats. The
            `OsaurusAI/Laguna-XS.2-mxfp4` bundle now loads natively
            (vmlx team verified `"2+2 equals 4."` end-to-end + 12/12
            BENCH_STABILITY pass + multi-turn cache reuse).

  - bc19fc4 + e33068d docs(osaurus): comprehensive production reference
            (14 runtime axes + §15 component invariants the SDK
            guarantees and osaurus must respect).

  - e9b7e7b docs(osaurus): Mistral 3.5 JANGTQ Python/vmlx-upstream
            parity audit.

Drops the now-obsolete Laguna mxfp4 preflight gate (`code 5`) from
`validateJANGTQSidecarIfRequired`. The cryptic "Unhandled keys" error
that motivated the gate (vmlx's hardcoded `TurboQuantSwitchGLU`
expert path rejecting mxfp4 affine keys) is fixed at the source by
4699d3a's polymorphic dispatch — the preflight is no longer needed.
Closes vmlx production-reference §13 item #2.
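
Shape of the polymorphic dispatch (the `LagunaSwitchMLPLayer` protocol
name and the `weight_format == "mxtq"` / `mxtq_bits` condition are
quoted above; the factory function itself is illustrative):

  protocol LagunaSwitchMLPLayer { /* shared expert forward interface */ }

  struct TurboQuantSwitchGLU: LagunaSwitchMLPLayer {}  // codebook / mxtq keys
  struct SwitchGLU: LagunaSwitchMLPLayer {}            // affine / mxfp4 keys

  func makeLagunaExperts(weightFormat: String?,
                         hasMXTQBits: Bool) -> LagunaSwitchMLPLayer {
      // Dispatch on weight_format == "mxtq" OR mxtq_bits presence; both
      // paths share the sanitize step that splits the fused gate_up_proj.
      (weightFormat == "mxtq" || hasMXTQBits)
          ? TurboQuantSwitchGLU()
          : SwitchGLU()
  }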

Test changes:
  - Removed `laguna_mxfp4_blocked` + `laguna_jangtq_passes` (the gate
    they covered is gone; left a one-line breadcrumb pointing at
    4699d3a so future readers understand why two test slots
    disappeared).
  - Updated `MLXBatchAdapterTests` to lock the new `maxBatchSize == 1`
    default (was 4 → flipped in fa694e9 to engage compile path per
    §15 invariant 13). Renamed `defaultsToFour` →
    `defaultsToOne_forCompileEngagement` so the test name carries the
    rationale.

Build clean Release; remaining 11 family-preflight tests pass; both
batch-adapter test failures resolved. The 2 environmental
`ContextBudgetPreviewTests` failures from earlier runs are unchanged
(local-only — same source ran green on PR #993's CI test-core).

* chore(deps): bump vmlx pin → 2e61c12 (Stage 1B.4 design doc + scaffold)

* fix(cache): defaultMaxKVSize 8192 → 65536 to match vmlx production reference

Per vmlx production reference §6 example (`cfg.defaultMaxKVSize =
65536`) and the audit conclusion confirmed by the vmlx team. The
prior 8192 silently truncated long-context prompts (trigger
arithmetic sketched after the list):

  - 50K-token PDF Q&A → model only saw the last 8K tokens (84%
    context loss) past the 16K trigger (8192 × longPromptMultiplier=2.0).
  - Long thinking-mode reasoning preambles > 16K → cap kicked in
    mid-reasoning, model lost earlier context.
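
The trigger arithmetic behind both examples (the names follow the
commit text; only the numbers quoted above are used):

  let longPromptMultiplier = 2.0
  let oldTrigger = Int(8192.0  * longPromptMultiplier)  //  16_384 tokens
  let newTrigger = Int(65536.0 * longPromptMultiplier)  // 131_072 tokens
  // A 50K-token prompt blows past the old 16K trigger and keeps only
  // the final 8K tokens: (50_000 - 8_192) / 50_000 ≈ 84% context loss.
  // Under the new cap, rotation only engages past 131K tokens.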

Worst-case wired memory at 65K × 88 layers × 8 KV-heads × 128 head_dim
× 2 bytes (fp16) × 2 (K+V) ≈ 2.4 GB per slot on Mistral 3.5 — but
TurboQuant compression at the engine's `min_tokens_for_compression`
threshold (~2K tokens) drops the steady-state cost ~26× to ~95 MB
per slot on `.turboQuant(4,4)`. With osaurus's `.none` default the
cold path stays fp16 but the rotating cap only kicks in for prompts
past 131K (65536 × 2.0) — small/medium chats unaffected.

The wired-memory worry is rounding error on a 16GB+ Mac; the
silent-truncation footgun was the worse failure mode. A per-family
overlay would be cleaner long-term, but that's premature optimization
versus the uniform doc-aligned default.