Set up and run flash-moe with:
mlx-community/Qwen3.5-35B-A3B-4bit
This guide covers the working flow from generating the expert index through launching the chat client.
- macOS with Metal support
- Xcode command line tools
- Python 3 in a virtual environment
- Local Hugging Face model snapshot for mlx-community/Qwen3.5-35B-A3B-4bit
In the commands below, this model snapshot is used:
MODEL=/Users/sbaruwal/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit/snapshots/1e20fd8d42056f870933bf98ca6211024744f7ec
Update that variable if your snapshot path differs.
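If the snapshot hash is not known up front, the path can be looked up instead of typed by hand. The sketch below assumes the usual Hugging Face hub cache layout (models--ORG--NAME/snapshots/HASH) and picks the most recently modified snapshot. The function name find_snapshot is illustrative and is not part of flash-moe.

```shell
#!/bin/sh
# Hypothetical helper (not part of flash-moe): print the newest snapshot
# directory for a repo id such as mlx-community/Qwen3.5-35B-A3B-4bit.
# Assumes the cache layout <cache>/models--ORG--NAME/snapshots/<hash>.
find_snapshot() {
    repo_id=$1
    cache_dir=${2:-"$HOME/.cache/huggingface/hub"}
    # "org/name" becomes "models--org--name" inside the cache
    repo_dir="$cache_dir/models--$(printf '%s' "$repo_id" | sed 's|/|--|g')"
    [ -d "$repo_dir/snapshots" ] || return 1
    # ls -td sorts newest first; keep only the first entry
    newest=$(ls -td "$repo_dir"/snapshots/*/ 2>/dev/null | head -n 1)
    [ -n "$newest" ] || return 1
    # Strip the trailing slash left by the glob
    printf '%s\n' "${newest%/}"
}
```

A possible use is MODEL=$(find_snapshot mlx-community/Qwen3.5-35B-A3B-4bit), followed by a check that MODEL is non-empty before running the later steps.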
From the repo root:
cd /Users/sbaruwal/Repo/flash-moe-1
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install tokenizers
pip install numpy
cd /Users/sbaruwal/Repo/flash-moe-1
source .venv/bin/activate
python3 build_expert_index_35b.py --model-path "$MODEL" --out expert_index_35b.json
Output:
expert_index_35b.json
Validate first with a dry run:
python3 repack_experts_35b.py --index expert_index_35b.json --layers 11 --dry-run
Then repack all expert layers:
python3 repack_experts_35b.py --index expert_index_35b.json
python3 metal_infer/extract_weights_35b.py --output metal_infer/out_35b
Outputs:
metal_infer/out_35b/model_weights.bin
metal_infer/out_35b/model_weights.json
Two files are required:
- tokenizer.bin for prompt/chat tokenization
- vocab.bin for token decoding
python3 metal_infer/export_tokenizer_35b.py "$MODEL/tokenizer.json" metal_infer/tokenizer.bin
python3 metal_infer/export_vocab_35b.py "$MODEL/tokenizer.json" metal_infer/vocab.bin
Verify both files exist:
ls -lah metal_infer/tokenizer.bin metal_infer/vocab.bin
cd /Users/sbaruwal/Repo/flash-moe-1/metal_infer
clang -O2 -Wall -fobjc-arc -framework Metal -framework Foundation -framework Accelerate -lpthread infer.m -o infer
#### If your Xcode command line tools are an older version, add the -lcompression option to the clang command.
From the repo root:
cd /Users/sbaruwal/Repo/flash-moe-1
./metal_infer/infer --model "$MODEL" --weights metal_infer/out_35b/model_weights.bin --manifest metal_infer/out_35b/model_weights.json --vocab metal_infer/vocab.bin --prompt "Mount Everest" --tokens 32
A healthy startup should show:
- 40 layers
- 256 experts
- hidden size 2048
- MoE intermediate 512
- shared intermediate 512
Start the server from the repo root:
cd /Users/sbaruwal/Repo/flash-moe-1
./metal_infer/infer --model "$MODEL" --weights metal_infer/out_35b/model_weights.bin --manifest metal_infer/out_35b/model_weights.json --vocab metal_infer/vocab.bin --serve 8000
The server listens on:
http://0.0.0.0:8000
Quick API test:
curl -N http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Mount Everest"}],"max_tokens":32,"stream":true}'
From metal_infer:
cd /Users/sbaruwal/Repo/flash-moe-1/metal_infer
clang -O2 -Wall -fobjc-arc -framework Foundation chat.m linenoise.c -o chat
If your Makefile supports it, this is also fine:
make chat
With the server already running in another terminal:
cd /Users/sbaruwal/Repo/flash-moe-1/metal_infer
./chat
The chat client connects to:
http://localhost:8000
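Because the client expects the server to be up already, it can help to wait for the port before launching it. The following is a minimal sketch, assuming the server returns some HTTP response to a plain GET on its root URL (not confirmed by this guide). wait_for_server is a hypothetical helper, not part of flash-moe.

```shell
#!/bin/sh
# Hypothetical helper (not part of flash-moe): poll a URL until something
# answers, or give up after a number of one-second attempts.
wait_for_server() {
    url=${1:-http://127.0.0.1:8000/}
    attempts=${2:-60}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Without -f, curl exits 0 on any HTTP response (even 404);
        # it fails only on connection errors or the 2-second timeout.
        if curl -s -o /dev/null --max-time 2 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}
```

For example, wait_for_server http://127.0.0.1:8000/ 120 && ./chat starts the client only once the port answers. The attempt count is a guess and may need raising if loading the model takes longer.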
Supported commands in the client:
- /quit
- /exit
- /clear
- /sessions
cd /Users/sbaruwal/Repo/flash-moe-1
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install tokenizers
pip install numpy
MODEL=/Users/sbaruwal/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit/snapshots/1e20fd8d42056f870933bf98ca6211024744f7ec
python3 build_expert_index_35b.py --model-path "$MODEL" --out expert_index_35b.json
python3 repack_experts_35b.py --index expert_index_35b.json
python3 metal_infer/extract_weights_35b.py --output metal_infer/out_35b
python3 metal_infer/export_tokenizer_35b.py "$MODEL/tokenizer.json" metal_infer/tokenizer.bin
python3 metal_infer/export_vocab_35b.py "$MODEL/tokenizer.json" metal_infer/vocab.bin
cd metal_infer
clang -O2 -Wall -fobjc-arc -framework Metal -framework Foundation -framework Accelerate -lpthread infer.m -o infer
clang -O2 -Wall -fobjc-arc -framework Foundation chat.m linenoise.c -o chat
Then start the server:
cd /Users/sbaruwal/Repo/flash-moe-1
./metal_infer/infer --model "$MODEL" --weights metal_infer/out_35b/model_weights.bin --manifest metal_infer/out_35b/model_weights.json --vocab metal_infer/vocab.bin --serve 8000
And in a second terminal:
cd /Users/sbaruwal/Repo/flash-moe-1/metal_infer
./chat
expert_index_35b.json
metal_infer/out_35b/model_weights.bin
metal_infer/out_35b/model_weights.json
metal_infer/tokenizer.bin
metal_infer/vocab.bin
metal_infer/infer
metal_infer/chat
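The list above can be turned into a quick pre-flight check before starting the server. The sketch below only tests that each listed file exists and is non-empty and does not validate the contents. check_artifacts is a hypothetical name, not a script shipped with flash-moe.

```shell
#!/bin/sh
# Hypothetical helper (not part of flash-moe): report which expected
# artifacts are missing or empty under the given repo root.
check_artifacts() {
    root=${1:-.}
    missing=0
    for f in \
        expert_index_35b.json \
        metal_infer/out_35b/model_weights.bin \
        metal_infer/out_35b/model_weights.json \
        metal_infer/tokenizer.bin \
        metal_infer/vocab.bin \
        metal_infer/infer \
        metal_infer/chat
    do
        # -s is true only for files that exist and are non-empty
        if [ ! -s "$root/$f" ]; then
            echo "missing or empty: $f"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing was reported
    [ "$missing" -eq 0 ]
}
```

Run from the repo root, check_artifacts . prints one line per missing file and returns a non-zero status if any step still needs to be repeated.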
This 35B adaptation uses:
- 40 layers
- 256 experts
- hidden size 2048
- MoE intermediate size 512
- shared expert intermediate size 512
Also note:
- tokenizer.bin and vocab.bin are different files
- tokenizer.bin is used for prompt/chat tokenization
- vocab.bin is used for token decoding