Flow LLM

Local LLM gateway for Apple Silicon

Run GGUF and MLX models locally. Proxy OpenAI & Anthropic API requests. Real-time monitoring. Built for coding agents.

Install · Features · Quick Start · API · Architecture

Flow LLM is a local LLM gateway for macOS. It manages GGUF and MLX models on Apple Silicon, proxies OpenAI- and Anthropic-compatible API requests, and exposes real-time monitoring — so tools like OpenClaw, Hermes, Claude Code, and Codex (via AIRun) can talk to local models without Ollama or LM Studio.

Features

JIT Model Loading — Models auto-load on first request and auto-unload after idle cooldown. Togglable in Settings (defaults ON, 5 min cooldown). Circuit breaker prevents memory exhaustion
Real-time Monitor — Per-request lifecycle tracking (queued → prefilling → generating → completed), odometer-style token counter, WebSocket push, idle waveform
OpenAI & Anthropic APIs — Drop-in proxy for /v1/chat/completions and /v1/messages. Streaming and non-streaming, tool calling, system prompts, reasoning/thinking block translation
GGUF & MLX — Run llama.cpp GGUF models or MLX models (text-only and vision) on Apple Silicon with sensible defaults (100K context, flash attention, q4_0 KV cache). Speculative decoding for MLX
Agent-Ready — Parallel slot support, Anthropic streaming SSE adapter, input token estimation fallback, stuck request pruning
Connect External — Adopt an already-running llama-server without restarting it. Auto-detects model name
HuggingFace Browser — Search and download models directly from the UI. Scan local directories for unregistered GGUF files
Telemetry — TTFT, throughput, token counts per request. Card-based history with color-coded metrics
Template Validation — Validates chat templates before loading (Jinja syntax, system role, tool calling)
Single Binary — pip install -e . && flow. One process, one port (3377). Frontend bundled in the package

Quick Install

curl -fsSL https://raw.githubusercontent.com/styles01/flow-llm/main/setup.sh | bash

Or clone and run:

git clone https://github.com/styles01/flow-llm.git
cd flow-llm && ./setup.sh
flow

Open http://localhost:3377 — API and UI from a single process.

Prerequisites

Flow requires inference backends. Install at least one:

# Required — GGUF models
brew install llama.cpp

# Optional — MLX models (text + vision)
pip install mlx_lm mlx_vlm

Quick Start

1. Start Flow

flow

2. Add a model (one-time setup)

In the UI: Models → search HuggingFace, download and register any model. Or connect a running backend.

Or via API:

curl -X POST http://localhost:3377/api/register-local \
  -H "Content-Type: application/json" \
  -d '{"gguf_path": "/path/to/model.gguf"}'

3. Point your agent

With JIT (default): Your first inference request auto-loads the model. Nothing else to do.

Without JIT: Turn it off in Settings, then load models explicitly via the Models page or POST /api/models/{id}/load.

{
  "models": {
    "providers": {
      "flow": {
        "baseUrl": "http://127.0.0.1:3377/v1",
        "apiKey": "flow-local",
        "api": "openai-completions"
      }
    }
  }
}

Flow also exposes POST /v1/messages for Claude Code and other Anthropic API tools.

Model Loading Defaults

Flow ships with sensible defaults for Apple Silicon:

Setting	Default	Why
Context window	100,000 tokens	Coding agents need long context
Flash attention	On	Critical for long context performance
KV cache	q4_0	75% memory savings, enables 100K on 48GB
GPU layers	-1 (all)	Metal acceleration
Parallel slots	2	Concurrent agent requests
JIT loading	On	Auto-load models on first request
JIT cooldown	300s (5 min)	Auto-unload idle models
Auto-update	On	Checks backend versions on startup

Configurable in Settings page, persisted to ~/.flow/settings.json.

Development

cd server && pip install -e .
cd ../web && npm install && npm run dev

Frontend dev server at http://localhost:5173 proxies API requests to the backend. Rebuild bundled frontend:

cd web && npm run build

Dependencies

Required

Dependency	Purpose	Install
Python 3.11+	Runtime	System
llama.cpp	GGUF inference backend	`brew install llama.cpp`
Node.js 18+	Frontend build	`brew install node`

Python packages (installed via `pip install -e .`)

Package	Purpose
fastapi	Management server and API
uvicorn	ASGI server
httpx	Async HTTP proxy
sqlalchemy	Model registry (SQLite)
huggingface-hub	Model search and download
jinja2	Chat template validation
psutil	Hardware detection
pydantic	Request/response models
websockets	Real-time updates

Optional

Dependency	Purpose	Install
mlx_lm / mlx_vlm	MLX inference backends (text + vision)	`pip install mlx_lm mlx_vlm`

Connect External Backend

Flow can adopt an already-running backend without restarting it:

curl -X POST http://localhost:3377/api/connect-external \
  -H "Content-Type: application/json" \
  -d '{"base_url": "http://127.0.0.1:8081"}'

Auto-detects the model name. Unloading kills the backend process and frees memory.

Port Layout

Port	Service
3377	Flow management server
5173	Frontend dev server (Vite)
8081+	llama.cpp backend processes
8100+	mlx-openai-server backend processes

API Endpoints

Management API

Method	Endpoint	Purpose
GET	`/api/hardware`	Hardware info (chip, memory, Metal)
GET	`/api/models`	List all registered models
GET	`/api/models/{id}`	Get model details
GET	`/api/models/running`	List running models
POST	`/api/models/{id}/load`	Load a model
POST	`/api/models/{id}/unload`	Unload a model
DELETE	`/api/models/{id}`	Delete a model
POST	`/api/models/download`	Download from HuggingFace
POST	`/api/models/scan`	Scan for unregistered GGUF files
POST	`/api/register-local`	Register a local GGUF file
POST	`/api/connect-external`	Connect to a running backend
GET	`/api/settings`	Get default loading settings
PUT	`/api/settings`	Update settings
GET	`/api/downloads`	Download progress
GET	`/api/hf/search?q=`	Search HuggingFace
GET	`/api/telemetry`	Request telemetry records
GET	`/api/requests`	Active request tracker
POST	`/api/requests/clear-stuck`	Clear stuck requests
GET	`/api/logs`	Backend logs
GET	`/api/model-activity`	Per-slot activity and metrics
GET	`/api/health`	Health check

OpenAI-Compatible Proxy

Method	Endpoint	Purpose
POST	`/v1/chat/completions`	Chat completions (streaming + non-streaming)
POST	`/v1/messages`	Anthropic Messages API
GET	`/v1/models`	List available models

WebSocket

Endpoint	Purpose
`/ws`	Real-time updates (request lifecycle, slot state, metrics, model events)

Changelog

v1.5.0 — JIT model loading

JIT (Just-In-Time) model loading is the headline feature — models auto-load when an inference request arrives, then auto-unload after a configurable idle cooldown. It's on by default but fully optional (turn it off in Settings for explicit control). Circuit breaker prevents memory exhaustion by estimating requirements and evicting idle models oldest-first. Cooldown tasks check for active in-flight requests before unloading, so streaming responses are never interrupted. Works with both GGUF and MLX backends.

Also in this release:

Speculative decoding for MLX — Draft model path and num draft tokens fields in the Load dialog, forwarded to mlx-openai-server for 2-3x throughput on supported models.
Anthropic thinking blocks — reasoning_content from OpenAI backends is now translated to Anthropic thinking content blocks in both streaming SSE and non-streaming responses.
Monitor polling fix — Polling fallback no longer overwrites fresher WebSocket-pushed request state, preventing stage regression.
ProcessManager thread-safety — asyncio.Lock around all mutations, preventing races during compound operations.

v1.1.0 — Qwen 3.6 MLX tool calling + warmup UX

Qwen 3.6 MLX support (unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit)

Getting Qwen 3.6 working reliably with Hermes-style agents (38 tools, 100+ message sessions, streaming) required fixing a cascade of interacting bugs:

XML tool call format — this unsloth quantization generates <function=name><parameter=x>v</parameter></function> inside <tool_call> tags (Qwen3-Coder format), not Hermes JSON. The proxy now parses both formats and normalises to tool_calls[].
HTML entity cascade — when a tool call leaked as text, Telegram HTML-escaped it (<tool_call> → <tool_call>) in session storage. The model then mimicked the escaped format on every subsequent turn, snowballing until the session was unrecoverable. Fixed with html.unescape() as the first step in rescue.
Truncated responses — Hermes sends max_tokens=4096; with the reasoning parser active, thinking tokens consumed the budget leaving 2–5 tokens for the actual reply. The proxy now drops max_tokens when a reasoning parser is active, letting the full context window (262K) be the cap.
</think> bleeding into content — the reasoning parser only activates when it sees a <think> opening tag in the stream, but Qwen3's generation prefix omits it. Any thinking content was flowing into content. Fixed with a stripping pass that moves …\n</think>\n to reasoning_content.
Half-warm model poisoning — requests arriving during weight loading returned malformed responses that agents stored permanently in session history. The proxy now returns 503 until the backend health check passes.

Preset and template handling

Built-in "Qwen3.6 — Tools (stable)" preset covers all required load params: 262K context, qwen3 reasoning/tool parsers, Hermes JSON chat template, trust-remote-code.
Qwen chat template auto-fill now runs regardless of whether tool_call_parser was explicitly provided (was previously skipped).

Warmup UX improvements

Monitor page shows a real loading percentage bar (Loading weights: 42%) parsed from mlx-lm's stderr during weight loading, so you know exactly where the model is instead of just an amber "warming up" badge.
Backend-ready guard: the proxy rejects requests with 503 while weights are still loading — agents get a clean retryable error instead of a malformed response that corrupts their context.

v1.0.0 — Real-time Monitor + request lifecycle tracking

Per-request lifecycle tracking: queued → prefilling → generating → sending → completed
WebSocket push for real-time monitor updates
LM Studio-style odometer token counter
PWA manifest, app icons, theme-color meta tag
Telemetry page redesigned with card layout and color-coded TTFT

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 88 Commits
.github		.github
docs		docs
screenshots		screenshots
server		server
templates		templates
web		web
.gitignore		.gitignore
.notes-mais.md		.notes-mais.md
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
build_frontend.sh		build_frontend.sh
setup.sh		setup.sh
start.sh		start.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Flow LLM

Features

Quick Install

Prerequisites

Quick Start

1. Start Flow

2. Add a model (one-time setup)

3. Point your agent

Model Loading Defaults

Development

Dependencies

Required

Python packages (installed via `pip install -e .`)

Optional

Connect External Backend

Port Layout

API Endpoints

Management API

OpenAI-Compatible Proxy

WebSocket

Changelog

v1.5.0 — JIT model loading

v1.1.0 — Qwen 3.6 MLX tool calling + warmup UX

v1.0.0 — Real-time Monitor + request lifecycle tracking

License

About

Uh oh!

Releases

Sponsor this project

Uh oh!

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Flow LLM

Features

Quick Install

Prerequisites

Quick Start

1. Start Flow

2. Add a model (one-time setup)

3. Point your agent

Model Loading Defaults

Development

Dependencies

Required

Python packages (installed via pip install -e .)

Optional

Connect External Backend

Port Layout

API Endpoints

Management API

OpenAI-Compatible Proxy

WebSocket

Changelog

v1.5.0 — JIT model loading

v1.1.0 — Qwen 3.6 MLX tool calling + warmup UX

v1.0.0 — Real-time Monitor + request lifecycle tracking

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Sponsor this project

Uh oh!

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Python packages (installed via `pip install -e .`)

Packages