A personalized iOS keyboard that autocompletes in your voice — lowercase, slang, the sentence shape you actually use — instead of generic AI output. Built on a custom logit-bias layer over Gemma 4 E2B, served from a FastAPI cloud backend.
- iOS keyboard extensions are capped at roughly 48 MB of resident memory, and a 4-bit Gemma is about 3 GB. Running the model inside the extension is not feasible, so the keyboard is necessarily a thin client and inference lives in the cloud.
- LoRA fine-tuning didn't produce usable quality. Tried r=64 full-corpus, r=64 DoRA, a tiny r=8, and r=8 on a scrubbed corpus. None of the runs beat the bias-layer approach in side-by-side evaluation, and the smaller adapters showed the usual fine-tune brittleness on out-of-distribution inputs. Replaced with the sample-time logit-bias layer below (full postmortem at experimental/docs/LORA_TRAINING_LOG.md).
- Sub-500 ms suggestion latency on a 4B-param model required KV-cache reuse for the persona preamble, retrieval-augmented exemplars from the user's own message corpus, per-request mode dispatch, and a short-TTL dedup cache to absorb KeyboardKit's rapid re-fires.
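The short-TTL dedup cache mentioned in the last bullet can be sketched roughly as follows. This is an illustration only: the class name `TTLDedupCache`, the helper `suggest_with_dedup`, the `(device_id, context)` key, and the 0.5 s TTL are all assumptions, not the code in server/main.py.

```python
import time
from typing import Callable, Dict, Hashable, List, Optional, Tuple


class TTLDedupCache:
    """Keep recent results for a short window so identical re-fires skip inference."""

    def __init__(self, ttl_seconds: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds          # assumed value, tune to the re-fire cadence
        self.clock = clock              # injectable so the cache can be tested
        self._entries: Dict[Hashable, Tuple[float, List[str]]] = {}

    def get(self, key: Hashable) -> Optional[List[str]]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl:
            del self._entries[key]      # expired: drop it and report a miss
            return None
        return value

    def put(self, key: Hashable, value: List[str]) -> None:
        self._entries[key] = (self.clock(), value)


def suggest_with_dedup(cache: TTLDedupCache, device_id: str, context: str,
                       generate: Callable[[str], List[str]]) -> List[str]:
    """Return a cached suggestion for an identical recent request, else generate one."""
    key = (device_id, context)
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = generate(context)
    cache.put(key, result)
    return result
```

With this shape, two identical requests arriving within the TTL trigger only one call to `generate`; a request after the TTL expires runs inference again.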
For every next token, Gemma produces a length-262,144 logits vector. The bias layer adds a second length-262,144 vector before sampling:
adjusted = logits
+ strength × bias_vector # unigram voice
+ ngram_alpha × bigram_bonus[prev] # 2-gram context
adjusted[logits < (max_logit - margin)] = -1e30 # admissibility filter
Where bias_vector[t] = log(1 + count[t]) − log(1 + total/unique), clipped to ±MAX_BIAS, with a ~9k-token dampen set (numeric-led + pure-punctuation) zeroed out so digits and punctuation in the user's corpus don't contaminate ordinary suggestion contexts.
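A minimal NumPy sketch of that definition, assuming `unique` means the number of distinct tokens that appear in the user's corpus and that `MAX_BIAS` is some small positive constant (the value 4.0 below is a placeholder, not the project's setting). The function name `build_bias_vector` is hypothetical.

```python
import numpy as np

MAX_BIAS = 4.0  # placeholder; the real clip value is not stated here


def build_bias_vector(counts, dampen_ids, max_bias=MAX_BIAS):
    """bias[t] = log(1 + count[t]) - log(1 + total/unique), clipped, dampen set zeroed."""
    counts = np.asarray(counts, dtype=np.float64)
    total = counts.sum()
    unique = np.count_nonzero(counts)           # assumption: distinct tokens seen
    mean_count = total / unique if unique else 0.0
    bias = np.log1p(counts) - np.log1p(mean_count)
    bias = np.clip(bias, -max_bias, max_bias)
    idx = np.fromiter(dampen_ids, dtype=np.intp)
    if idx.size:
        bias[idx] = 0.0                         # numeric-led / punctuation tokens stay neutral
    return bias
```

Tokens the user types more often than the corpus average get a positive bias, rarer or unseen tokens get a negative one, and dampened tokens are left exactly where the base model put them.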
The admissibility margin is the safety story: bias can reorder tokens the raw model already considered plausible, but it can't resurrect implausible ones. That's why the model never says "<NAME_A>" the way LoRA did: the base model's logit for <NAME_A> falls more than margin below its top logit, so the token is masked out and no amount of bias can push it through.
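The adjustment and the admissibility filter can be written as runnable NumPy, shown here as a sketch rather than the code in cotypist.py. The function name `apply_voice_bias` is hypothetical, and `bigram_bonus_prev` stands for the row of the bigram bonus table selected by the previous token.

```python
import numpy as np


def apply_voice_bias(logits, bias_vector, bigram_bonus_prev,
                     strength=0.4, ngram_alpha=0.4, margin=5.0):
    """Add unigram and bigram bias, then mask tokens the raw model found implausible."""
    logits = np.asarray(logits, dtype=np.float64)
    adjusted = (logits
                + strength * np.asarray(bias_vector, dtype=np.float64)
                + ngram_alpha * np.asarray(bigram_bonus_prev, dtype=np.float64))
    # The mask is computed on the RAW logits, so bias cannot change admissibility.
    inadmissible = logits < (logits.max() - margin)
    adjusted[inadmissible] = -1e30
    return adjusted


# Token 2 is far below the top raw logit, so even a huge bias cannot revive it.
raw = [10.0, 8.0, 2.0]
no_bigram = [0.0, 0.0, 0.0]
blocked = apply_voice_bias(raw, [0.0, 1.0, 50.0], no_bigram)
# Token 1 is within the margin, so a moderate bias can promote it past token 0.
promoted = apply_voice_bias(raw, [0.0, 6.0, 0.0], no_bigram)
```

Because the mask depends only on the unbiased logits, raising `strength` changes the ordering among admissible tokens but never the set of admissible tokens.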
Hyperparameters (T=0.5, top_p=0.92, top_k=64, min_p=0.05, repeat_penalty=1.15, max_tokens=10, margin=5.0, strength=0.4, ngram_alpha=0.4) came out of a 744-run sweep across realistic typing contexts. The harness and fixtures are at experimental/server/tools/voice_eval_v2/.
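One way to keep these values in a single place is a pair of frozen dataclasses, sketched below. The class and function names are hypothetical; only the numbers come from the sweep result quoted above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SamplingConfig:
    """Sampler settings passed to the model at generation time."""
    temperature: float = 0.5
    top_p: float = 0.92
    top_k: int = 64
    min_p: float = 0.05
    repeat_penalty: float = 1.15
    max_tokens: int = 10


@dataclass(frozen=True)
class BiasConfig:
    """Settings consumed by the logit-bias layer, not by the sampler."""
    margin: float = 5.0
    strength: float = 0.4
    ngram_alpha: float = 0.4


def completion_kwargs(cfg: SamplingConfig) -> dict:
    """Flatten the sampler settings into keyword arguments."""
    return asdict(cfg)
```

If the server calls llama-cpp-python directly, these keyword names are meant to line up with the sampling arguments of `Llama.create_completion`; verify that against the installed version before relying on it.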
App Group (UserDefaults suite)
────────────────────────────────
device_id, server_url, snippets,
blocklist, mode, retention_pref
│
┌────────────────────────────────────┼────────────────────────────────────┐
▼ ▼ ▼
┌─────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Glide app │ │ GlideKeyboard │ │ ActionExtension │
│ (SwiftUI │ │ (KeyboardKit │ │ (selectedText │
│ settings, │ │ extension — │ │ capture for │
│ onboarding, │ │ ~48MB ceiling) │ │ Smart Reply) │
│ blocklist, │ └────────┬─────────┘ └──────────────────┘
│ retention) │ │
└─────────────┘ │ HTTPS
▼
┌────────────────────────────────────┐
│ FastAPI server (server/main.py) │
│ ┌──────────────────────────────┐ │
│ │ /api/suggest → cotypist.py │ │
│ │ • Gemma 4 E2B Q4_K_M │ │
│ │ • UserProfile (bias layer) │ │
│ │ • Retrieval exemplars │ │
│ │ • Admissibility filter │ │
│ ├──────────────────────────────┤ │
│ │ /api/autocorrect │ │
│ │ • SymSpell + QWERTY-DL │ │
│ │ • Bigram rescore │ │
│ ├──────────────────────────────┤ │
│ │ /api/keystrokes /api/flush │ │
│ │ /api/accept /api/smart_reply│ │
│ └──────────────────────────────┘ │
│ SQLite: keystrokes, messages, │
│ device_preferences │
└────────────────────────────────────┘
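The "Retrieval exemplars" box in the diagram can be pictured as a nearest-neighbour lookup over embeddings of the user's own messages. The sketch below is a brute-force cosine-similarity stand-in, not the code in retrieval.py; the function name `top_k_exemplars`, the toy 2-d vectors, and the choice of cosine similarity are all assumptions.

```python
import numpy as np


def top_k_exemplars(query_vec, corpus_vecs, messages, k=3):
    """Return the k messages whose embeddings are most cosine-similar to the query."""
    q = np.asarray(query_vec, dtype=np.float64)
    c = np.asarray(corpus_vecs, dtype=np.float64)
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    scores = c @ q                              # cosine similarity per message
    order = np.argsort(-scores)[:k]             # highest similarity first
    return [(messages[i], float(scores[i])) for i in order]


# Toy corpus: real embeddings would come from an embedding model.
messages = ["omw", "lol ok", "see u soon"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
hits = top_k_exemplars([1.0, 0.0], vectors, messages, k=2)
```

The retrieved messages would then be placed in the prompt as style exemplars; how many to include and how to format them is a design choice this sketch does not cover.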
| Layer | Stack |
|---|---|
| iOS keyboard | Swift 5.9, SwiftUI, KeyboardKit, App Groups, AVFoundation (dictation) |
| iOS main app | SwiftUI, App Intents (Back-Tap context capture), Action Extension (selected-text capture) |
| Server | FastAPI, llama-cpp-python, SQLite |
| Model | Gemma 4 E2B Q4_K_M (~3 GB, ~2.3B effective params via Per-Layer Embeddings) |
| Autocorrect | SymSpell (Damerau-Levenshtein) + QWERTY-weighted edit distances + bigram rescore |
| Build | xcodegen + Xcode 15 / iOS 16+ deployment target |
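To illustrate the "QWERTY-weighted edit distances" entry in the Autocorrect row, here is a small sketch of a restricted Damerau-Levenshtein distance in which substituting a neighbouring key costs less than substituting a distant one. The key geometry, the 1.5 adjacency radius, and the 0.5 neighbour cost are assumptions chosen for the example, and candidate generation (SymSpell in this project) is not shown.

```python
import math

# Approximate key centres: (row, column) with an assumed stagger per row.
_ROWS = [("qwertyuiop", 0.0), ("asdfghjkl", 0.5), ("zxcvbnm", 1.5)]
KEY_POS = {ch: (r, c + offset)
           for r, (row, offset) in enumerate(_ROWS)
           for c, ch in enumerate(row)}


def sub_cost(a: str, b: str) -> float:
    """0 for a match, 0.5 for neighbouring keys, 1.0 otherwise (assumed weights)."""
    if a == b:
        return 0.0
    if a in KEY_POS and b in KEY_POS:
        (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
        if math.hypot(r1 - r2, c1 - c2) <= 1.5:
            return 0.5
    return 1.0


def qwerty_distance(s: str, t: str) -> float:
    """Optimal-string-alignment distance with keyboard-aware substitution cost."""
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1.0,                       # deletion
                          d[i][j - 1] + 1.0,                       # insertion
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)      # transposition
    return d[n][m]


best = min(["hello", "hollo"], key=lambda w: qwerty_distance("hwllo", w))
```

Here "hwllo" ranks "hello" ahead of "hollo" because w and e are neighbours while w and o are not; a bigram rescore would then break remaining ties using the preceding word.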
# Server
python3.11 -m venv .venv && source .venv/bin/activate
pip install -r server/requirements.txt
export GLIDE_GEMMA_GGUF=/path/to/gemma-4-e2b-q4_k_m.gguf
uvicorn --app-dir server main:app --host 0.0.0.0 --port 8000
# Smoke test
curl -s -X POST localhost:8000/api/suggest \
-H 'content-type: application/json' \
-d '{"context":"hey wha","device_id":"local-dev"}' | jq .
# iOS
xcodegen # generates Glide.xcodeproj from project.yml
open Glide.xcodeproj # build & run on a real device (App Group entitlements)
In the Glide app's Settings → Server, point at your Mac's local address (http://<your-mac>.local:8000) or your cloud URL. The keyboard reads the URL from the App Group, so no rebuild is needed when you switch.
glide/
├── Glide/ iOS main app (SwiftUI — settings, blocklist, retention toggle)
├── GlideKeyboard/ iOS keyboard extension (KeyboardKit, ~48MB ceiling)
├── GlidePolishExtension/ Action extension for selected-text → Smart Reply
├── server/
│ ├── main.py FastAPI app, endpoints, retention cleanup
│ ├── cotypist.py Logit-bias completer + retrieval exemplars + KV cache
│ ├── autocorrect.py SymSpell + QWERTY-weighted DL + bigram rescore
│ ├── retrieval.py FAISS-style embedding retrieval
│ ├── blocklist.py Suggestion content moderation
│ └── importers/ iMessage chat.db → corpus importer
├── experimental/ LoRA training pipeline + voice_eval_v2 sweep + Qwen 14B
│ rewrite tools. Parked, kept for the postmortem.
├── docs/
│ ├── AUTOCORRECT.md Autocorrect design (SymSpell + QWERTY-DL + bigram)
│ ├── COTYPIST_LOG.md Algorithm reference for the suggestion path
│ ├── DEPLOY.md Fly.io deployment walkthrough
│ ├── PERFORMANCE.md Suggestion-path latency reference
│ ├── SNIPPETS.md Text-snippet feature design
│ └── EMOJI_SUPPORT.md Emoji key + emoji-suggestion design notes
└── project.yml xcodegen input — generates Glide.xcodeproj
Tried LoRA fine-tuning, then DoRA, then a tiny LoRA, then a tiny LoRA on a scrubbed dataset. None of them produced output quality that beat the bias-layer approach in side-by-side comparisons, and the smaller adapters showed the usual fine-tune brittleness on out-of-distribution inputs. Full negative-result writeup at experimental/docs/LORA_TRAINING_LOG.md. The logit-bias layer described above replaced all of it.