Skip to content

styles01/flow-llm

Repository files navigation

Flow LLM

Flow LLM

Local LLM gateway for Apple Silicon

Run GGUF and MLX models locally. Proxy OpenAI & Anthropic API requests. Real-time monitoring. Built for coding agents.

License: MIT Python 3.11+ Platform: Apple Silicon Buy me a coffee

Install · Features · Quick Start · API · Architecture


Flow LLM is a local LLM gateway for macOS. It manages GGUF and MLX models on Apple Silicon, proxies OpenAI- and Anthropic-compatible API requests, and exposes real-time monitoring — so tools like OpenClaw, Hermes, Claude Code, and Codex (via AIRun) can talk to local models without Ollama or LM Studio.

Flow LLM Monitor

Features

  • JIT Model Loading — Models auto-load on first request and auto-unload after idle cooldown. Togglable in Settings (defaults ON, 5 min cooldown). Circuit breaker prevents memory exhaustion
  • Real-time Monitor — Per-request lifecycle tracking (queued → prefilling → generating → completed), odometer-style token counter, WebSocket push, idle waveform
  • OpenAI & Anthropic APIs — Drop-in proxy for /v1/chat/completions and /v1/messages. Streaming and non-streaming, tool calling, system prompts, reasoning/thinking block translation
  • GGUF & MLX — Run llama.cpp GGUF models or MLX models (text-only and vision) on Apple Silicon with sensible defaults (100K context, flash attention, q4_0 KV cache). Speculative decoding for MLX
  • Agent-Ready — Parallel slot support, Anthropic streaming SSE adapter, input token estimation fallback, stuck request pruning
  • Connect External — Adopt an already-running llama-server without restarting it. Auto-detects model name
  • HuggingFace Browser — Search and download models directly from the UI. Scan local directories for unregistered GGUF files
  • Telemetry — TTFT, throughput, token counts per request. Card-based history with color-coded metrics
  • Template Validation — Validates chat templates before loading (Jinja syntax, system role, tool calling)
  • Single Binarypip install -e . && flow. One process, one port (3377). Frontend bundled in the package

Quick Install

curl -fsSL https://raw.githubusercontent.com/styles01/flow-llm/main/setup.sh | bash

Or clone and run:

git clone https://github.com/styles01/flow-llm.git
cd flow-llm && ./setup.sh
flow

Open http://localhost:3377 — API and UI from a single process.

Prerequisites

Flow requires inference backends. Install at least one:

# Required — GGUF models
brew install llama.cpp

# Optional — MLX models (text + vision)
pip install mlx_lm mlx_vlm

Quick Start

1. Start Flow

flow

2. Add a model (one-time setup)

In the UI: Models → search HuggingFace, download and register any model. Or connect a running backend.

Or via API:

curl -X POST http://localhost:3377/api/register-local \
  -H "Content-Type: application/json" \
  -d '{"gguf_path": "/path/to/model.gguf"}'

3. Point your agent

With JIT (default): Your first inference request auto-loads the model. Nothing else to do.

Without JIT: Turn it off in Settings, then load models explicitly via the Models page or POST /api/models/{id}/load.

{
  "models": {
    "providers": {
      "flow": {
        "baseUrl": "http://127.0.0.1:3377/v1",
        "apiKey": "flow-local",
        "api": "openai-completions"
      }
    }
  }
}

Flow also exposes POST /v1/messages for Claude Code and other Anthropic API tools.

Model Loading Defaults

Flow ships with sensible defaults for Apple Silicon:

Setting Default Why
Context window 100,000 tokens Coding agents need long context
Flash attention On Critical for long context performance
KV cache q4_0 75% memory savings, enables 100K on 48GB
GPU layers -1 (all) Metal acceleration
Parallel slots 2 Concurrent agent requests
JIT loading On Auto-load models on first request
JIT cooldown 300s (5 min) Auto-unload idle models
Auto-update On Checks backend versions on startup

Configurable in Settings page, persisted to ~/.flow/settings.json.

Development

cd server && pip install -e .
cd ../web && npm install && npm run dev

Frontend dev server at http://localhost:5173 proxies API requests to the backend. Rebuild bundled frontend:

cd web && npm run build

Dependencies

Required

Dependency Purpose Install
Python 3.11+ Runtime System
llama.cpp GGUF inference backend brew install llama.cpp
Node.js 18+ Frontend build brew install node

Python packages (installed via pip install -e .)

Package Purpose
fastapi Management server and API
uvicorn ASGI server
httpx Async HTTP proxy
sqlalchemy Model registry (SQLite)
huggingface-hub Model search and download
jinja2 Chat template validation
psutil Hardware detection
pydantic Request/response models
websockets Real-time updates

Optional

Dependency Purpose Install
mlx_lm / mlx_vlm MLX inference backends (text + vision) pip install mlx_lm mlx_vlm

Connect External Backend

Flow can adopt an already-running backend without restarting it:

curl -X POST http://localhost:3377/api/connect-external \
  -H "Content-Type: application/json" \
  -d '{"base_url": "http://127.0.0.1:8081"}'

Auto-detects the model name. Unloading kills the backend process and frees memory.

Port Layout

Port Service
3377 Flow management server
5173 Frontend dev server (Vite)
8081+ llama.cpp backend processes
8100+ mlx-openai-server backend processes

API Endpoints

Management API

Method Endpoint Purpose
GET /api/hardware Hardware info (chip, memory, Metal)
GET /api/models List all registered models
GET /api/models/{id} Get model details
GET /api/models/running List running models
POST /api/models/{id}/load Load a model
POST /api/models/{id}/unload Unload a model
DELETE /api/models/{id} Delete a model
POST /api/models/download Download from HuggingFace
POST /api/models/scan Scan for unregistered GGUF files
POST /api/register-local Register a local GGUF file
POST /api/connect-external Connect to a running backend
GET /api/settings Get default loading settings
PUT /api/settings Update settings
GET /api/downloads Download progress
GET /api/hf/search?q= Search HuggingFace
GET /api/telemetry Request telemetry records
GET /api/requests Active request tracker
POST /api/requests/clear-stuck Clear stuck requests
GET /api/logs Backend logs
GET /api/model-activity Per-slot activity and metrics
GET /api/health Health check

OpenAI-Compatible Proxy

Method Endpoint Purpose
POST /v1/chat/completions Chat completions (streaming + non-streaming)
POST /v1/messages Anthropic Messages API
GET /v1/models List available models

WebSocket

Endpoint Purpose
/ws Real-time updates (request lifecycle, slot state, metrics, model events)

Changelog

v1.5.0 — JIT model loading

JIT (Just-In-Time) model loading is the headline feature — models auto-load when an inference request arrives, then auto-unload after a configurable idle cooldown. It's on by default but fully optional (turn it off in Settings for explicit control). Circuit breaker prevents memory exhaustion by estimating requirements and evicting idle models oldest-first. Cooldown tasks check for active in-flight requests before unloading, so streaming responses are never interrupted. Works with both GGUF and MLX backends.

Also in this release:

  • Speculative decoding for MLX — Draft model path and num draft tokens fields in the Load dialog, forwarded to mlx-openai-server for 2-3x throughput on supported models.
  • Anthropic thinking blocksreasoning_content from OpenAI backends is now translated to Anthropic thinking content blocks in both streaming SSE and non-streaming responses.
  • Monitor polling fix — Polling fallback no longer overwrites fresher WebSocket-pushed request state, preventing stage regression.
  • ProcessManager thread-safetyasyncio.Lock around all mutations, preventing races during compound operations.

v1.1.0 — Qwen 3.6 MLX tool calling + warmup UX

Qwen 3.6 MLX support (unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit)

Getting Qwen 3.6 working reliably with Hermes-style agents (38 tools, 100+ message sessions, streaming) required fixing a cascade of interacting bugs:

  • XML tool call format — this unsloth quantization generates <function=name><parameter=x>v</parameter></function> inside <tool_call> tags (Qwen3-Coder format), not Hermes JSON. The proxy now parses both formats and normalises to tool_calls[].
  • HTML entity cascade — when a tool call leaked as text, Telegram HTML-escaped it (<tool_call>&lt;tool_call&gt;) in session storage. The model then mimicked the escaped format on every subsequent turn, snowballing until the session was unrecoverable. Fixed with html.unescape() as the first step in rescue.
  • Truncated responses — Hermes sends max_tokens=4096; with the reasoning parser active, thinking tokens consumed the budget leaving 2–5 tokens for the actual reply. The proxy now drops max_tokens when a reasoning parser is active, letting the full context window (262K) be the cap.
  • </think> bleeding into content — the reasoning parser only activates when it sees a <think> opening tag in the stream, but Qwen3's generation prefix omits it. Any thinking content was flowing into content. Fixed with a stripping pass that moves …\n</think>\n to reasoning_content.
  • Half-warm model poisoning — requests arriving during weight loading returned malformed responses that agents stored permanently in session history. The proxy now returns 503 until the backend health check passes.

Preset and template handling

  • Built-in "Qwen3.6 — Tools (stable)" preset covers all required load params: 262K context, qwen3 reasoning/tool parsers, Hermes JSON chat template, trust-remote-code.
  • Qwen chat template auto-fill now runs regardless of whether tool_call_parser was explicitly provided (was previously skipped).

Warmup UX improvements

  • Monitor page shows a real loading percentage bar (Loading weights: 42%) parsed from mlx-lm's stderr during weight loading, so you know exactly where the model is instead of just an amber "warming up" badge.
  • Backend-ready guard: the proxy rejects requests with 503 while weights are still loading — agents get a clean retryable error instead of a malformed response that corrupts their context.

v1.0.0 — Real-time Monitor + request lifecycle tracking

  • Per-request lifecycle tracking: queued → prefilling → generating → sending → completed
  • WebSocket push for real-time monitor updates
  • LM Studio-style odometer token counter
  • PWA manifest, app icons, theme-color meta tag
  • Telemetry page redesigned with card layout and color-coded TTFT

License

MIT

About

Local LLM gateway for Apple Silicon. Works with OpenClaw, Hermes Agent, Claude Code, and Codex (AIRun). No Ollama or LM Studio required.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Sponsor this project

Packages

 
 
 

Contributors