A minimal SwiftUI iPhone app that runs Bonsai-1.7B (1-bit, Q1_0) on-device via
upstream llama.cpp with the Metal GPU
backend. No forks, no custom kernels.
Real-device only. The bundled llama.xcframework contains just the ios-arm64 (device) slice, so this project builds and runs on a physical iPhone/iPad, not on the iOS Simulator or macOS. (The simulator's Metal support is incomplete anyway; a real device is what matters.) To re-add simulator/macOS slices, rebuild the framework; see the bottom of this file.
Bonsai's Q1_0 1-bit format (originally Q1_0_g128) was merged into upstream llama.cpp
in April 2026 (PR #21273, CPU+format;
Metal in follow-up #21528). GGML_TYPE_Q1_0 = 41 ships upstream. The Hugging Face model card
still says "use the PrismML fork" — that's out of date.
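If you are unsure whether a given llama.cpp checkout is recent enough, one quick check is to look for the type constant in the ggml header. This is a minimal sketch: the helper name has_q1_0 is hypothetical, and the header path in the example comment assumes the enum is declared in ggml/include/ggml.h, as in recent upstream trees.

```shell
# Hypothetical helper (not part of this repo): succeed if the given header
# declares GGML_TYPE_Q1_0 as a whole identifier. The pattern rejects longer
# names such as GGML_TYPE_Q1_0_g128.
has_q1_0() {
    grep -Eq 'GGML_TYPE_Q1_0([^A-Za-z0-9_]|$)' "$1"
}

# Example (path is an assumption about the upstream layout):
# has_q1_0 vendor/llama.cpp/ggml/include/ggml.h && echo "Q1_0 is available"
```

A failing check suggests the checkout predates the merge and should be updated before rebuilding the framework.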
project.yml                     XcodeGen config (generates BonsaiChat.xcodeproj)
Frameworks/
  llama.xcframework             Prebuilt llama.cpp engine, Metal, ios-arm64 device slice (~5 MB)
BonsaiChat/
  LlamaWrapper.swift            Thin Swift wrapper over the llama.cpp C API (the integration unit)
  ChatViewModel.swift           Model loading, streaming, bundled-model auto-load, benchmark
  ContentView.swift             SwiftUI UI
  BonsaiChatApp.swift           @main entry
scripts/
  build-llama-xcframework.sh    Rebuild llama.xcframework from upstream llama.cpp
models/                         (gitignored) put your .gguf here; too large for GitHub
The model .gguf is not in the repo (exceeds GitHub's 100 MB limit). Download it into
models/ before building (see below).
- Xcode 16+ (built on Xcode 26), a physical iPhone/iPad, and an Apple Developer Team for signing.
- XcodeGen: brew install xcodegen (the .xcodeproj is generated, not committed).
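Before going further, it can help to confirm that the command-line tools used in this README are on your PATH. The following is a small sketch; missing_tools is a hypothetical helper name, not something shipped in this repo.

```shell
# Hypothetical helper: print each named tool that is not found on PATH,
# one per line. Prints nothing when every tool is present.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: missing_tools xcodegen xcodebuild hf
```

An empty result means all listed tools were found; anything printed needs installing first.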
mkdir -p models
# Official Bonsai 1.7B Q1_0 (~237 MB):
pip install -U "huggingface_hub[cli]"
hf download prism-ml/Bonsai-1.7B-gguf Bonsai-1.7B-Q1_0.gguf --local-dir models
project.yml bundles models/bonsai-1.7b-multitask-7tasks-Q1_0.gguf into the app. The file downloaded above has a different name, so either copy it to that path or edit the buildPhase: resources line to point at the file you downloaded. ChatViewModel.loadBundledModel() auto-loads whatever .gguf is bundled.
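Because the model is bundled at build time, a missing or truncated download only shows up once Xcode runs. A pre-build check along these lines may catch it earlier. This is a sketch under assumptions: check_model is a hypothetical helper, and the size floor in the example comment is an arbitrary illustrative value. The only format fact relied on is that GGUF files begin with the four ASCII bytes GGUF.

```shell
# Hypothetical pre-build check (not part of this repo): succeed only if the
# file exists, starts with the GGUF magic bytes, and is at least min_bytes long.
check_model() {
    path=$1
    min_bytes=$2
    [ -f "$path" ] || { echo "missing: $path" >&2; return 1; }
    [ "$(head -c 4 "$path")" = "GGUF" ] || { echo "not a GGUF file: $path" >&2; return 1; }
    size=$(wc -c < "$path" | tr -d ' ')
    [ "$size" -ge "$min_bytes" ] || { echo "too small: $path ($size bytes)" >&2; return 1; }
}

# Example, using the path project.yml expects and an illustrative 200 MB floor:
# check_model models/bonsai-1.7b-multitask-7tasks-Q1_0.gguf 200000000
```

The function prints a reason to stderr and returns non-zero on failure, so it can gate xcodegen generate in a script.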
xcodegen generate
open BonsaiChat.xcodeproj
In Xcode: scheme BonsaiChat-iOS → select your iPhone → target Signing & Capabilities → set your Team → ⌘R. The app auto-loads the bundled model and runs on Metal (status bar shows ⚡ Metal). Tap Benchmark for prefill/generate tok/s.
- Drag Frameworks/llama.xcframework into your target → Embed & Sign (ensure LD_RUNPATH_SEARCH_PATHS has @executable_path/Frameworks).
- Add LlamaWrapper.swift. It uses import llama (no bridging header; the xcframework ships a module map).
- Use it:
    LlamaWrapper.bootstrap()
    let llama = try LlamaWrapper(modelPath: path)
    for try await e in llama.generate(prompt: "Hello") {
        if case .token(let t) = e { print(t, terminator: "") }
    }
LlamaWrapper already handles Metal offload (n_gpu_layers = 99), the ChatML chat template, <think></think> suppression, EOS stop (<|im_end|>), and performance-core thread counts.
Native macOS CPU ~63 tok/s · macOS Metal ~217 tok/s. On-device numbers: use the in-app Benchmark. (The iOS Simulator runs CPU-only at ~9 tok/s and is not a valid benchmark.)
The committed Frameworks/llama.xcframework is device-only. To rebuild from upstream:
git clone --depth 1 https://github.com/ggml-org/llama.cpp.git vendor/llama.cpp
bash scripts/build-llama-xcframework.sh # → vendor/llama.cpp/build-apple/llama.xcframework (iOS device+sim, macOS)
That produces a multi-slice framework. To keep only the device slice (as shipped here):
xcrun xcodebuild -create-xcframework \
-framework vendor/llama.cpp/build-apple/llama.xcframework/ios-arm64/llama.framework \
-output Frameworks/llama.xcframework
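To confirm which slices ended up in the result, you can list the subdirectories of the .xcframework, since each slice lives in a directory named after its platform and architectures (for example ios-arm64) next to Info.plist. The helper below is a hypothetical sketch, not part of this repo; note that it lists every subdirectory, so a signature folder, if present, would appear too.

```shell
# Hypothetical helper: print the names of the directories directly inside an
# .xcframework, one per line. Plain files such as Info.plist are skipped.
list_slices() {
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue
        basename "$dir"
    done
}

# Example: list_slices Frameworks/llama.xcframework
```

For the device-only framework described above, the expected listing is the single ios-arm64 entry; the multi-slice build should show additional simulator and macOS directories.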