Skip to content

WIP: Improve model config processing to support Qwen 35b, 122b and 397b models#1

Open
itanka9 wants to merge 9 commits into
Ma-Dan:mainfrom
itanka9:qwen-universal
Open

WIP: Improve model config processing to support Qwen 35b, 122b and 397b models#1
itanka9 wants to merge 9 commits into
Ma-Dan:mainfrom
itanka9:qwen-universal

Conversation

@itanka9

@itanka9 itanka9 commented Jun 22, 2026

Copy link
Copy Markdown

This PR remove hardcoded patch from infer and python scripts and grabs neccessery params from models config.json file.

Now tested on 35b and 122b models. Test in 397b model pending.

Co-authored by opus 4.6

Dan and others added 9 commits April 12, 2026 12:13
  Update all architecture constants, expert layout, and tooling to support
  Qwen3.5-122B-A10B-4bit (48 layers, 256 experts, hidden_size=3072) loaded
  from ~/.cache/modelscope.

  Changes:
  - infer.m: update HIDDEN_DIM, NUM_LAYERS, NUM_EXPERTS, NUM_EXPERTS_PER_TOK,
    NUM_FULL_ATTN_LAYERS, NUM_LINEAR_LAYERS, all 4-bit/2-bit expert byte
    offsets, and MODEL_PATH_DEFAULT for 122B
  - extract_weights.py: update model config and default path for 122B
  - repack_experts.py: update COMPONENTS layout, EXPERT_SIZE, NUM_EXPERTS,
    NUM_LAYERS, and fix verify loop (was hardcoded to expert index 511)
  - generate_expert_index.py: new script — scans safetensors headers and
    writes expert_index.json mapping each layer's stacked expert tensors
    to their file offsets and strides
  - export_vocab.py: new script — exports vocab.bin with proper GPT-2
    byte-level BPE decoding so Chinese, Arabic, and all non-ASCII tokens
    render correctly in output
  - usage.txt: new file — complete step-by-step command reference

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
  Update repack_experts_2bit.py for Qwen3.5-122B-A10B-4bit:
  - EXPERT_SIZE_4BIT 7,077,888 → 5,308,416 (hidden 4096→3072)
  - NUM_EXPERTS 512 → 256, NUM_LAYERS 60 → 48
  - Recalculate all 4-bit and 2-bit offsets for 3072 hidden dim
  - EXPERT_SIZE_2BIT 3,932,160 → 2,949,120
  - Default path updated to modelscope/mlx-community/122B

  Add Step 4b to usage.txt covering 2-bit repack commands (single-layer
  verify, full run) with note that 2-bit breaks JSON/tool calling.
  Previously the server always sent SSE (text/event-stream) regardless of
  the stream parameter. Now:

  - Parse "stream" from the request body (default true)
  - stream:true — existing SSE behaviour unchanged
  - stream:false — buffer all tokens, send a single application/json
    chat.completion object with Content-Length when generation finishes

  Token accumulation was already happening for session persistence, so
  non-streaming just skips the per-token SSE writes and emits one response.
…hanges: tools injection into system prompt, parse_tool_call (JSON formats), tool_calls response shape, cold-prefill bypass for tool requests,

  temperature parameter, reasoning_content extraction, and debug logging.
  Key changes: build_multiturn_prompt replays full message history into the Qwen3.5 chat template for stateless clients, role:tool result turns, and auto-continuation detection (skips cold prefill when the last assistant message matches g_last_assistant_content).
Update all architecture constants for 35B: hidden=2048, 40 layers, 256 experts, K=8, MOE_INTERMEDIATE=512, LINEAR_NUM_V_HEADS=32.
Fix expert byte offsets in infer.m (replace hardcoded 122B values with #defines for 35B layout). Add cpu_dequant_matvec_8bit for MoE routing gate, which mlx-community quantizes at bits=8 rather than bits=4.
Update extract_weights.py, generate_expert_index.py, repack_experts.py, and repack_experts_2bit.py with 35B shapes, layer counts, and paths.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant