GitHub - nareshnavinash/bonsai: CLI for running prism-ml's 1-bit Bonsai models locally. Auto-manages llama.cpp server, downloads from HuggingFace, exposes OpenAI-compatible API.

Quick Start · Models · Commands · Configuration

Bonsai is a CLI for running prism-ml's Bonsai 1-bit quantized models locally with llama.cpp. These models are about 14x smaller than their FP16 equivalents, use 4-5x less energy per token, and deliver fast inference on consumer hardware. Bonsai handles model downloads from HuggingFace, server lifecycle management, and an OpenAI-compatible API, with no configuration required beyond installing llama.cpp.

Bonsai Models

The prism-ml Bonsai models use true 1-bit quantization across all layers -- embeddings, attention, MLP, and output head. No escape hatches, no mixed-precision workarounds.

Model	Parameters	Size	Pull Command
bonsai-8b	8B	~1.2 GB	`bonsai pull bonsai-8b`
bonsai-4b	4B	~572 MB	`bonsai pull bonsai-4b`
bonsai-1.7b	1.7B	~248 MB	`bonsai pull bonsai-1.7b`

Why 1-Bit?

14x smaller than FP16 equivalents
4-5x lower energy consumption per token
Fast inference: 40 tok/s on iPhone, 131 tok/s on M4 Pro, 368 tok/s on RTX 4090
Intelligence density: 1.06 intelligence/GB vs 0.10 for full precision -- 10x more capability per byte
GGUF format -- runs directly with llama.cpp, no conversion needed

Models by prism-ml -- explore the collection on HuggingFace.
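As a rough sanity check on the size claim, the arithmetic can be reproduced from the model table. This is only a sketch: it assumes nominal parameter counts (8B = 8e9), FP16 at 2 bytes per weight, decimal units (1 GB = 1e9 bytes), and the rounded file sizes listed above, so the results are approximate.

```python
# Approximate figures copied from the model table; sizes are rounded in the source.
MODELS = {
    "bonsai-8b": (8.0e9, 1.2e9),      # (parameters, file size in bytes)
    "bonsai-4b": (4.0e9, 572e6),
    "bonsai-1.7b": (1.7e9, 248e6),
}

FP16_BYTES_PER_WEIGHT = 2


def compression_ratio(params: float, size_bytes: float) -> float:
    """How many times smaller the file is than an FP16 copy of the weights."""
    return params * FP16_BYTES_PER_WEIGHT / size_bytes


def bits_per_weight(params: float, size_bytes: float) -> float:
    """Average storage cost per parameter, including any file overhead."""
    return size_bytes * 8 / params


if __name__ == "__main__":
    for name, (params, size) in MODELS.items():
        print(f"{name}: {compression_ratio(params, size):.1f}x smaller than FP16, "
              f"{bits_per_weight(params, size):.2f} bits/weight")
```

Under these assumptions the ratios land between roughly 13x and 14x, in line with the 14x figure, and the files average a little over 1 bit per weight.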

Features

Zero-config inference -- `bonsai run` auto-starts the server, loads the model, and starts chatting
Built-in Bonsai registry -- pull models by shortname (`bonsai pull bonsai-4b`), downloaded directly from HuggingFace
Full model management -- pull, list, show, run, stop, remove, copy
Interactive chat -- multi-turn conversations with streaming output
One-shot prompts -- `bonsai run bonsai-4b "explain monads"`
Smart model resolution -- auto-selects the best available Bonsai model
OpenAI-compatible API -- works with any OpenAI SDK, LangChain, etc.
Server lifecycle management -- auto-start, PID tracking, health checks
Progress tracking -- download progress bars
Lightweight -- single binary, ~1,500 LOC, two dependencies (cobra + uuid)
No Ollama required -- talks directly to llama.cpp server via OpenAI-compatible API

Why llama.cpp Instead of Ollama?

Bonsai v1 used Ollama as its inference backend. In v2 we moved to direct llama.cpp integration, which cut response latency and gave us control over the chat template and model storage:

Aspect	Ollama	llama.cpp (direct)
Response time (simple query)	4,585 ms	56 ms (78x faster)
Forced thinking mode	Yes -- Qwen3 template injects `<think>` tags	No -- clean responses
Wasted tokens	160-265 thinking tokens per response	0
Dependencies	Ollama daemon + Go SDK + 8 transitive deps	Single `llama-server` binary
Model storage	Opaque blob store (`~/.ollama/models/blobs/`)	Plain GGUF files you control
Template control	Locked to Ollama's per-family templates	Full control, no forced behavior

The core problem

The Bonsai models are based on Qwen3. Ollama's Qwen3 chat template unconditionally injects a `<think>` tag at the start of every assistant response:

<|im_start|>assistant
<think>

This forces the model into chain-of-thought reasoning mode on every single query -- even "What is 2+2?" generates 160-265 internal reasoning tokens before producing the actual answer. This template is baked into Ollama and cannot be overridden per-request.
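To make the mechanism concrete, here is a simplified ChatML-style prompt renderer. It is not Ollama's or llama.cpp's actual template code; `render_prompt` and its `inject_think` flag are hypothetical names, used only to show how a pre-filled `<think>` tag changes the text the model is asked to continue.

```python
def render_prompt(messages, inject_think=False):
    """Render chat messages into a ChatML-style prompt string (simplified)."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    # The prompt ends with an open assistant turn for the model to continue.
    parts.append("<|im_start|>assistant\n")
    if inject_think:
        # Pre-filling this tag means the model's first tokens are reasoning text.
        parts.append("<think>\n")
    return "".join(parts)


if __name__ == "__main__":
    chat = [{"role": "user", "content": "What is 2+2?"}]
    print(render_prompt(chat, inject_think=True))
```

With `inject_think=False` the assistant turn starts empty, so the model can answer directly.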

What llama.cpp gives us

Direct GGUF inference -- no middleware, no template injection, no abstraction tax
OpenAI-compatible API -- `llama-server` exposes `/v1/chat/completions` natively, the same protocol every OpenAI SDK speaks
Transparent model files -- GGUF files in `~/.bonsai/models/` that you can inspect, copy, or share
One fewer dependency -- no need to install and run a separate Ollama daemon
Full control -- inference parameters, threading, GPU layers, batch size all configurable

Note: Bonsai still works with any OpenAI-compatible server. If you prefer Ollama, vLLM, or another backend, point `BONSAI_HOST` at it.

Quick Start

Prerequisites

llama.cpp server must be available:

# macOS
brew install llama.cpp

# Or build from source
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build -j

Note: Bonsai auto-detects `llama-server` in your `PATH` or common locations. You can also set `LLAMA_SERVER_BIN` to the full path of the binary.
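The detection described in the note could look roughly like the sketch below. The precedence (override first, then `PATH`, then fallbacks) and the entries in `COMMON_LOCATIONS` are assumptions for illustration; the locations Bonsai actually checks are not listed in this README.

```python
import os
import shutil
from pathlib import Path

# Hypothetical fallback locations; the list Bonsai actually checks is not documented here.
COMMON_LOCATIONS = (
    Path("/opt/homebrew/bin/llama-server"),
    Path("/usr/local/bin/llama-server"),
)


def _is_executable(path) -> bool:
    return os.path.isfile(path) and os.access(path, os.X_OK)


def find_llama_server(env=None, candidates=COMMON_LOCATIONS):
    """Return a path to llama-server, or None if nothing usable is found."""
    env = os.environ if env is None else env

    # 1. Explicit override (assumed to take precedence over auto-detection).
    override = env.get("LLAMA_SERVER_BIN")
    if override:
        return override if _is_executable(override) else None

    # 2. Search the directories listed in PATH.
    on_path = shutil.which("llama-server", path=env.get("PATH"))
    if on_path:
        return on_path

    # 3. Fall back to well-known install locations.
    for candidate in candidates:
        if _is_executable(candidate):
            return str(candidate)
    return None
```

Passing `env` explicitly keeps the function easy to exercise without touching the real environment.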

Install Bonsai

go install github.com/nareshnavinash/bonsai@latest

Or download a binary from Releases.

Run

# Pull a model (~572 MB)
bonsai pull bonsai-4b

# Start chatting (auto-starts the server)
bonsai run

# Or one-shot
bonsai run bonsai-4b "what is quantum computing?"

Commands

Command	Description
`bonsai run [model] [prompt]`	Start a chat session or run a one-shot prompt
`bonsai pull <model>`	Download a model from HuggingFace
`bonsai list`	List installed models
`bonsai show <model>`	Show model details
`bonsai models`	List available Bonsai models
`bonsai ps`	Show running server status
`bonsai stop`	Stop the server
`bonsai rm <model>`	Remove a model
`bonsai cp <src> <dest>`	Copy a model file
`bonsai serve [model]`	Start the llama-server (foreground)
`bonsai api`	Start OpenAI-compatible API server
`bonsai status`	Show server status

API Server

Bonsai can expose an OpenAI-compatible HTTP API, letting any application that speaks the OpenAI format interact with your local Bonsai models.

# Start the API server
bonsai api                    # localhost:8080
bonsai api --port 3000        # custom port
bonsai api --host 0.0.0.0    # all interfaces

Endpoints

Method	Path	Description
POST	`/v1/chat/completions`	Chat completions (streaming & non-streaming)
GET	`/v1/models`	List available models
GET	`/health`	Health check
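The two GET endpoints can be probed with nothing but the Python standard library. This sketch assumes the server is on the default `localhost:8080` and that `/v1/models` returns the usual OpenAI list shape (`{"data": [{"id": ...}]}`); the body of `/health` is not documented here, so only the status code is checked. The helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # default address of `bonsai api`


def is_healthy(base_url=BASE_URL, timeout=2.0):
    """True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection refused, timeouts, and HTTP errors
        return False


def list_model_ids(base_url=BASE_URL, timeout=5.0):
    """Return model ids from GET /v1/models (assumes the OpenAI list shape)."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


if __name__ == "__main__":
    if is_healthy():
        print("models:", list_model_ids())
    else:
        print("server not reachable; start it with `bonsai api`")
```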

Usage with OpenAI SDK

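from openai import OpenAI

# Point the SDK at the local Bonsai API server. The SDK requires an api_key
# value, so a placeholder is passed.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

# Send a single-turn chat request to the locally served model.
response = client.chat.completions.create(
    model="bonsai-4b",
    messages=[{"role": "user", "content": "hello"}]
)
print(response.choices[0].message.content)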

Usage with curl

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"bonsai-4b","messages":[{"role":"user","content":"hello"}]}'

Works with LangChain, LlamaIndex, Continue.dev, Cursor, and any OpenAI-compatible client.
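For streaming without an SDK, a client can read the response line by line. The sketch below assumes the server follows the OpenAI server-sent-events convention (`data: {json}` lines, ending with `data: [DONE]`) and the default `localhost:8080` address; `iter_stream_text` and `stream_chat` are illustrative names, not part of Bonsai.

```python
import json
import urllib.request


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_chat(prompt, model="bonsai-4b", base_url="http://localhost:8080"):
    """POST a streaming chat request and yield text as it arrives."""
    body = json.dumps({
        "model": model,
        "stream": True,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        yield from iter_stream_text(resp)


if __name__ == "__main__":
    for piece in stream_chat("hello"):
        print(piece, end="", flush=True)
    print()
```

Keeping the parser separate from the HTTP call makes it possible to check the parsing logic against canned lines without a running server.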

Configuration

Variable	Default	Description
`BONSAI_MODEL`	`bonsai-8b`	Preferred model
`BONSAI_HOST`	`http://127.0.0.1:8081`	Server URL
`BONSAI_PORT`	`8081`	Server port
`BONSAI_THREADS`	CPU count	Inference threads
`BONSAI_MODELS_DIR`	`~/.bonsai/models/`	Model storage directory
`LLAMA_SERVER_BIN`	auto-detect	Path to llama-server binary
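As an illustration of how these variables and defaults fit together, here is a small loader that mirrors the table. It is a sketch, not Bonsai's implementation (which is written in Go); `BonsaiConfig` and `load_config` are hypothetical names, and how `BONSAI_HOST` and `BONSAI_PORT` interact when both are set is not specified here, so they are simply read independently.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class BonsaiConfig:
    model: str
    host: str
    port: int
    threads: int
    models_dir: Path
    llama_server_bin: Optional[str]  # None means auto-detect


def load_config(env: Optional[Mapping[str, str]] = None) -> BonsaiConfig:
    """Read settings from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return BonsaiConfig(
        model=env.get("BONSAI_MODEL", "bonsai-8b"),
        host=env.get("BONSAI_HOST", "http://127.0.0.1:8081"),
        port=int(env.get("BONSAI_PORT", "8081")),
        threads=int(env.get("BONSAI_THREADS", os.cpu_count() or 1)),
        models_dir=Path(env.get("BONSAI_MODELS_DIR", "~/.bonsai/models/")).expanduser(),
        llama_server_bin=env.get("LLAMA_SERVER_BIN"),
    )


if __name__ == "__main__":
    print(load_config())
```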

Model Resolution Order

1. `BONSAI_MODEL` environment variable (if set)
2. Any locally installed model with "bonsai" in its name
3. Any locally installed GGUF model
4. If none of the above is found, an error message with pull instructions
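The resolution order above can be sketched as a single function. Several details are assumptions rather than documented behavior: models are taken to be `<name>.gguf` files directly inside the models directory, ties are broken alphabetically, and `BONSAI_MODEL` is returned without checking that the model is installed.

```python
import os
from pathlib import Path


def resolve_model(models_dir, env=None) -> str:
    """Pick a model name following the documented resolution order."""
    env = os.environ if env is None else env

    # 1. An explicit preference wins.
    preferred = env.get("BONSAI_MODEL")
    if preferred:
        return preferred

    # Assumption: each installed model is a "<name>.gguf" file in models_dir.
    installed = sorted(p.stem for p in Path(models_dir).glob("*.gguf"))

    # 2. Prefer anything with "bonsai" in its name.
    bonsai = [name for name in installed if "bonsai" in name.lower()]
    if bonsai:
        return bonsai[0]

    # 3. Otherwise accept any GGUF model.
    if installed:
        return installed[0]

    # 4. Nothing installed: fail with a hint.
    raise LookupError("no models installed; try `bonsai pull bonsai-4b`")
```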

Chat Commands

In interactive mode (`bonsai run`):

Command	Description
`/bye` or `/exit`	Exit the chat
`/clear`	Clear conversation history
`/model <name>`	Switch to a different model
`/set temperature <value>`	Adjust creativity (0.0-2.0)
`/set top_p <value>`	Adjust nucleus sampling
`"""`	Start multi-line input (end with `"""`)
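One way to think about these commands is as a small dispatcher that classifies each input line before anything is sent to the model. The sketch below is illustrative only: `parse_chat_line` and its return values are hypothetical, and only the temperature range from the table is validated, since no range is given for `top_p`.

```python
def parse_chat_line(line: str):
    """Classify one line of interactive input as a (kind, payload) pair."""
    text = line.strip()
    if text == '"""':
        return ("multiline", None)  # caller collects lines until the closing """
    if not text.startswith("/"):
        return ("message", line)    # ordinary chat text

    parts = text.split()
    command, args = parts[0], parts[1:]
    if command in ("/bye", "/exit") and not args:
        return ("exit", None)
    if command == "/clear" and not args:
        return ("clear", None)
    if command == "/model" and len(args) == 1:
        return ("model", args[0])
    if command == "/set" and len(args) == 2 and args[0] in ("temperature", "top_p"):
        value = float(args[1])  # raises ValueError on non-numeric input
        if args[0] == "temperature" and not 0.0 <= value <= 2.0:
            raise ValueError("temperature must be between 0.0 and 2.0")
        return ("set", (args[0], value))
    raise ValueError(f"unknown or malformed command: {text}")
```

A chat loop would call this on every line and branch on the returned kind, treating `ValueError` as a usage hint to print rather than a fatal error.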

Architecture

bonsai run "hello"
    │
    ├── Resolve model name → find GGUF file
    ├── Auto-start llama-server (if not running)
    ├── Send request via OpenAI-compatible HTTP API
    └── Stream response tokens to terminal

Bonsai manages the full lifecycle:

Models stored as GGUF files in `~/.bonsai/models/`
Server process tracked via PID file at `~/.bonsai/server.pid`
Logs written to `~/.bonsai/server.log`
Compatible with any OpenAI-compatible server via `BONSAI_HOST`
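PID-file tracking of this kind usually comes down to two steps: read the recorded PID, then ask the OS whether that process still exists. The sketch below shows the idea for POSIX systems only; it assumes the PID file holds a plain decimal number, which this README does not confirm, and the function names are illustrative.

```python
import os
from pathlib import Path

PID_FILE = Path.home() / ".bonsai" / "server.pid"  # location given in the README


def read_pid(pid_file=PID_FILE):
    """Return the recorded PID, or None if the file is missing or malformed."""
    try:
        return int(Path(pid_file).read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def is_process_alive(pid: int) -> bool:
    """POSIX-only liveness check: signal 0 tests existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the PID file is stale
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


def server_running(pid_file=PID_FILE) -> bool:
    pid = read_pid(pid_file)
    return pid is not None and pid > 0 and is_process_alive(pid)
```

A liveness check on the PID alone cannot tell whether the process is actually `llama-server`, which is one reason to pair it with a health check over HTTP.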

Contributing

Contributions are welcome. Please open an issue first to discuss what you would like to change.

git clone https://github.com/nareshnavinash/bonsai.git
cd bonsai
go build -o bonsai .
./bonsai status

License

MIT

Acknowledgments

prism-ml for the Bonsai 1-bit quantized model family
llama.cpp for the inference engine
Cobra for the CLI framework
