Local LLM gateway for Apple Silicon
Run GGUF and MLX models locally. Proxy OpenAI & Anthropic API requests. Real-time monitoring. Built for coding agents.
Install · Features · Quick Start · API · Architecture
Flow LLM is a local LLM gateway for macOS. It manages GGUF and MLX models on Apple Silicon, proxies OpenAI- and Anthropic-compatible API requests, and exposes real-time monitoring — so tools like OpenClaw, Hermes, Claude Code, and Codex (via AIRun) can talk to local models without Ollama or LM Studio.
- JIT Model Loading — Models auto-load on first request and auto-unload after idle cooldown. Togglable in Settings (defaults ON, 5 min cooldown). Circuit breaker prevents memory exhaustion
- Real-time Monitor — Per-request lifecycle tracking (queued → prefilling → generating → completed), odometer-style token counter, WebSocket push, idle waveform
- OpenAI & Anthropic APIs — Drop-in proxy for
/v1/chat/completionsand/v1/messages. Streaming and non-streaming, tool calling, system prompts, reasoning/thinking block translation - GGUF & MLX — Run llama.cpp GGUF models or MLX models (text-only and vision) on Apple Silicon with sensible defaults (100K context, flash attention, q4_0 KV cache). Speculative decoding for MLX
- Agent-Ready — Parallel slot support, Anthropic streaming SSE adapter, input token estimation fallback, stuck request pruning
- Connect External — Adopt an already-running llama-server without restarting it. Auto-detects model name
- HuggingFace Browser — Search and download models directly from the UI. Scan local directories for unregistered GGUF files
- Telemetry — TTFT, throughput, token counts per request. Card-based history with color-coded metrics
- Template Validation — Validates chat templates before loading (Jinja syntax, system role, tool calling)
- Single Binary —
pip install -e . && flow. One process, one port (3377). Frontend bundled in the package
curl -fsSL https://raw.githubusercontent.com/styles01/flow-llm/main/setup.sh | bashOr clone and run:
git clone https://github.com/styles01/flow-llm.git
cd flow-llm && ./setup.sh
flowOpen http://localhost:3377 — API and UI from a single process.
Flow requires inference backends. Install at least one:
# Required — GGUF models
brew install llama.cpp
# Optional — MLX models (text + vision)
pip install mlx_lm mlx_vlmflowIn the UI: Models → search HuggingFace, download and register any model. Or connect a running backend.
Or via API:
curl -X POST http://localhost:3377/api/register-local \
-H "Content-Type: application/json" \
-d '{"gguf_path": "/path/to/model.gguf"}'With JIT (default): Your first inference request auto-loads the model. Nothing else to do.
Without JIT: Turn it off in Settings, then load models explicitly via the Models page or POST /api/models/{id}/load.
{
"models": {
"providers": {
"flow": {
"baseUrl": "http://127.0.0.1:3377/v1",
"apiKey": "flow-local",
"api": "openai-completions"
}
}
}
}Flow also exposes POST /v1/messages for Claude Code and other Anthropic API tools.
Flow ships with sensible defaults for Apple Silicon:
| Setting | Default | Why |
|---|---|---|
| Context window | 100,000 tokens | Coding agents need long context |
| Flash attention | On | Critical for long context performance |
| KV cache | q4_0 | 75% memory savings, enables 100K on 48GB |
| GPU layers | -1 (all) | Metal acceleration |
| Parallel slots | 2 | Concurrent agent requests |
| JIT loading | On | Auto-load models on first request |
| JIT cooldown | 300s (5 min) | Auto-unload idle models |
| Auto-update | On | Checks backend versions on startup |
Configurable in Settings page, persisted to ~/.flow/settings.json.
cd server && pip install -e .
cd ../web && npm install && npm run devFrontend dev server at http://localhost:5173 proxies API requests to the backend. Rebuild bundled frontend:
cd web && npm run build| Dependency | Purpose | Install |
|---|---|---|
| Python 3.11+ | Runtime | System |
| llama.cpp | GGUF inference backend | brew install llama.cpp |
| Node.js 18+ | Frontend build | brew install node |
| Package | Purpose |
|---|---|
| fastapi | Management server and API |
| uvicorn | ASGI server |
| httpx | Async HTTP proxy |
| sqlalchemy | Model registry (SQLite) |
| huggingface-hub | Model search and download |
| jinja2 | Chat template validation |
| psutil | Hardware detection |
| pydantic | Request/response models |
| websockets | Real-time updates |
| Dependency | Purpose | Install |
|---|---|---|
| mlx_lm / mlx_vlm | MLX inference backends (text + vision) | pip install mlx_lm mlx_vlm |
Flow can adopt an already-running backend without restarting it:
curl -X POST http://localhost:3377/api/connect-external \
-H "Content-Type: application/json" \
-d '{"base_url": "http://127.0.0.1:8081"}'Auto-detects the model name. Unloading kills the backend process and frees memory.
| Port | Service |
|---|---|
| 3377 | Flow management server |
| 5173 | Frontend dev server (Vite) |
| 8081+ | llama.cpp backend processes |
| 8100+ | mlx-openai-server backend processes |
| Method | Endpoint | Purpose |
|---|---|---|
| GET | /api/hardware |
Hardware info (chip, memory, Metal) |
| GET | /api/models |
List all registered models |
| GET | /api/models/{id} |
Get model details |
| GET | /api/models/running |
List running models |
| POST | /api/models/{id}/load |
Load a model |
| POST | /api/models/{id}/unload |
Unload a model |
| DELETE | /api/models/{id} |
Delete a model |
| POST | /api/models/download |
Download from HuggingFace |
| POST | /api/models/scan |
Scan for unregistered GGUF files |
| POST | /api/register-local |
Register a local GGUF file |
| POST | /api/connect-external |
Connect to a running backend |
| GET | /api/settings |
Get default loading settings |
| PUT | /api/settings |
Update settings |
| GET | /api/downloads |
Download progress |
| GET | /api/hf/search?q= |
Search HuggingFace |
| GET | /api/telemetry |
Request telemetry records |
| GET | /api/requests |
Active request tracker |
| POST | /api/requests/clear-stuck |
Clear stuck requests |
| GET | /api/logs |
Backend logs |
| GET | /api/model-activity |
Per-slot activity and metrics |
| GET | /api/health |
Health check |
| Method | Endpoint | Purpose |
|---|---|---|
| POST | /v1/chat/completions |
Chat completions (streaming + non-streaming) |
| POST | /v1/messages |
Anthropic Messages API |
| GET | /v1/models |
List available models |
| Endpoint | Purpose |
|---|---|
/ws |
Real-time updates (request lifecycle, slot state, metrics, model events) |
JIT (Just-In-Time) model loading is the headline feature — models auto-load when an inference request arrives, then auto-unload after a configurable idle cooldown. It's on by default but fully optional (turn it off in Settings for explicit control). Circuit breaker prevents memory exhaustion by estimating requirements and evicting idle models oldest-first. Cooldown tasks check for active in-flight requests before unloading, so streaming responses are never interrupted. Works with both GGUF and MLX backends.
Also in this release:
- Speculative decoding for MLX — Draft model path and num draft tokens fields in the Load dialog, forwarded to mlx-openai-server for 2-3x throughput on supported models.
- Anthropic thinking blocks —
reasoning_contentfrom OpenAI backends is now translated to Anthropicthinkingcontent blocks in both streaming SSE and non-streaming responses. - Monitor polling fix — Polling fallback no longer overwrites fresher WebSocket-pushed request state, preventing stage regression.
- ProcessManager thread-safety —
asyncio.Lockaround all mutations, preventing races during compound operations.
Qwen 3.6 MLX support (unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit)
Getting Qwen 3.6 working reliably with Hermes-style agents (38 tools, 100+ message sessions, streaming) required fixing a cascade of interacting bugs:
- XML tool call format — this unsloth quantization generates
<function=name><parameter=x>v</parameter></function>inside<tool_call>tags (Qwen3-Coder format), not Hermes JSON. The proxy now parses both formats and normalises totool_calls[]. - HTML entity cascade — when a tool call leaked as text, Telegram HTML-escaped it (
<tool_call>→<tool_call>) in session storage. The model then mimicked the escaped format on every subsequent turn, snowballing until the session was unrecoverable. Fixed withhtml.unescape()as the first step in rescue. - Truncated responses — Hermes sends
max_tokens=4096; with the reasoning parser active, thinking tokens consumed the budget leaving 2–5 tokens for the actual reply. The proxy now dropsmax_tokenswhen a reasoning parser is active, letting the full context window (262K) be the cap. </think>bleeding into content — the reasoning parser only activates when it sees a<think>opening tag in the stream, but Qwen3's generation prefix omits it. Any thinking content was flowing intocontent. Fixed with a stripping pass that moves…\n</think>\ntoreasoning_content.- Half-warm model poisoning — requests arriving during weight loading returned malformed responses that agents stored permanently in session history. The proxy now returns 503 until the backend health check passes.
Preset and template handling
- Built-in "Qwen3.6 — Tools (stable)" preset covers all required load params: 262K context,
qwen3reasoning/tool parsers, Hermes JSON chat template, trust-remote-code. - Qwen chat template auto-fill now runs regardless of whether
tool_call_parserwas explicitly provided (was previously skipped).
Warmup UX improvements
- Monitor page shows a real loading percentage bar (
Loading weights: 42%) parsed from mlx-lm's stderr during weight loading, so you know exactly where the model is instead of just an amber "warming up" badge. - Backend-ready guard: the proxy rejects requests with 503 while weights are still loading — agents get a clean retryable error instead of a malformed response that corrupts their context.
- Per-request lifecycle tracking: queued → prefilling → generating → sending → completed
- WebSocket push for real-time monitor updates
- LM Studio-style odometer token counter
- PWA manifest, app icons, theme-color meta tag
- Telemetry page redesigned with card layout and color-coded TTFT
