Digital products and their users need privacy, reliability, cost control, and an option to be independent from closed-source model providers.
Paddler is an open-source LLM load balancer and serving platform. It allows you to run inference, deploy, and scale LLMs on your own infrastructure, providing a great developer experience along the way.
- Inference through a built-in llama.cpp engine
- LLM-specific load balancing
- Works through agents that can be added dynamically, allowing integration with autoscaling tools
- Request buffering, enabling scaling from zero hosts
- Dynamic model swapping
- Built-in web admin panel for management, monitoring, and testing
- Observability metrics
- Product teams that need LLM inference and embeddings in their features
- DevOps/LLMOps teams that need to run and deploy LLMs at scale
- Organizations handling sensitive data with high compliance and privacy requirements (medical, financial, etc.)
- Organizations wanting to achieve predictable LLM costs instead of being exposed to per-token pricing
- Product leaders who need reliable model performance to maintain a consistent user experience of their AI-based features
- Discord https://discord.gg/92x3Z8a4gj
- Reddit (just started a subreddit, we will see how it goes :)) https://www.reddit.com/r/paddler/
Paddler is self-contained in a single binary file, so all you need to do to start using it is obtain the paddler binary and make it available on your system.
You can obtain the binary by:
- Option 1: Downloading the latest release from our GitHub releases
- Option 2: Building Paddler from source (the minimum supported Rust version, MSRV, is 1.88.0)
Once you have made the binary available on your system, you can start using Paddler. The entire Paddler functionality is available through the paddler command (running paddler --help will list all available commands).
There are only two deployable components, the balancer (which distributes the incoming requests), and the agent (which generates tokens and embeddings through slots).
To start the balancer, run:
paddler balancer --inference-addr 127.0.0.1:8061 --management-addr 127.0.0.1:8060 --web-admin-panel-addr 127.0.0.1:8062
The --web-admin-panel-addr flag is optional, but it will allow you to view your setup in a web browser.
And to start an agent with, for example, 4 slots, run:
paddler agent --management-addr 127.0.0.1:8060 --slots 4
Read more about the installation and setting up a basic cluster.
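The two commands above can be combined into one small launcher script. This is only a sketch: the PADDLER_BIN, SLOTS, and DRY_RUN variables and the balancer_cmd, agent_cmd, and launch helpers are hypothetical names introduced for illustration, not part of Paddler. It uses only the flags and addresses shown above, and by default it prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: bring up a local balancer and one agent with the flags shown above.
# Hypothetical helpers (not part of Paddler): PADDLER_BIN, SLOTS, DRY_RUN,
# balancer_cmd, agent_cmd, launch.

PADDLER_BIN="${PADDLER_BIN:-paddler}"
MANAGEMENT_ADDR="127.0.0.1:8060"
INFERENCE_ADDR="127.0.0.1:8061"
WEB_ADMIN_PANEL_ADDR="127.0.0.1:8062"
SLOTS="${SLOTS:-4}"

# Print the balancer command line.
balancer_cmd() {
    echo "$PADDLER_BIN balancer --inference-addr $INFERENCE_ADDR --management-addr $MANAGEMENT_ADDR --web-admin-panel-addr $WEB_ADMIN_PANEL_ADDR"
}

# Print the agent command line for the given number of slots.
agent_cmd() {
    echo "$PADDLER_BIN agent --management-addr $MANAGEMENT_ADDR --slots $1"
}

# Print the command when DRY_RUN=1 (the default); otherwise run it in the background.
launch() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$1"
    else
        # Unquoted on purpose so the string splits into command and arguments.
        $1 &
    fi
}

launch "$(balancer_cmd)"
launch "$(agent_cmd "$SLOTS")"

# In a real run, keep the script attached to both background processes.
if [ "${DRY_RUN:-1}" != "1" ]; then
    wait
fi
```

Running the script as is prints both command lines so you can check them first; setting DRY_RUN=0 starts both processes in the background. Whether the agent needs the balancer to be listening before it starts is not covered here, so you may want to launch them in separate terminals instead.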
- Visit our documentation page to install Paddler and get started with it.
- API documentation is also available.
- Video overview
- FOSDEM 2026 talk - From Infrastructure to Production: A Year of Self-Hosted LLMs.
Paddler is built for an easy setup. It comes as a self-contained binary with only two deployable components, the balancer and the agents.
The balancer exposes the following:
- Inference service (used by applications that connect to it to obtain tokens or embeddings)
- Management service, which manages Paddler's setup internally
- Web admin panel that lets you view and test your Paddler setup
Agents are usually deployed on separate instances. They further distribute the incoming requests to slots, which are responsible for generating tokens and embeddings.
Paddler uses a built-in llama.cpp engine for inference, but has its own implementation of llama.cpp slots, which keep their own context and KV cache.
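To make the balancer, agent, and slot relationship more concrete, here is a toy model of the request flow. It is not Paddler's code, and the selection rule (pick the agent with the most free slots) is an assumption chosen purely for illustration; Paddler's actual policy is not described here. The "name:free_slots" pairs and the pick_agent and route_request functions are hypothetical. The only behavior taken from this document is that requests go to agents with slots and are buffered when none is available, including when no agents exist yet.

```shell
#!/bin/sh
# Toy model only: this is NOT Paddler's code or its actual balancing policy.
# Each agent is a hypothetical "name:free_slots" pair.

# Print the name of the agent with the most free slots, or an empty line
# if every agent is busy or no agents are registered at all.
pick_agent() {
    best_name=""
    best_free=0
    for entry in "$@"; do
        name=${entry%%:*}
        free=${entry##*:}
        if [ "$free" -gt "$best_free" ]; then
            best_name=$name
            best_free=$free
        fi
    done
    echo "$best_name"
}

# Route one request: forward it if some agent has a free slot,
# otherwise report it as buffered.
route_request() {
    request=$1
    shift
    agent=$(pick_agent "$@")
    if [ -n "$agent" ]; then
        echo "$request -> $agent"
    else
        echo "$request -> buffered"
    fi
}

route_request req-1 agent-a:1 agent-b:3   # prints "req-1 -> agent-b"
route_request req-2 agent-a:0 agent-b:0   # prints "req-2 -> buffered"
route_request req-3                       # no agents yet: "req-3 -> buffered"
```

The last call mirrors the scale-from-zero case: with no agents registered, the request waits in the buffer instead of failing.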
Paddler comes with a built-in web admin panel.
You can use it to monitor your Paddler fleet:

Add and update your model and customize the chat template and inference parameters:

And use a GUI to test the inference:

Paddler comes in two versions: a command-line interface for infrastructure use, and a desktop application for more casual use cases, like using multiple laptops and PCs in a local AI cluster or setting up an office-wide company second brain, without using a console.
You can also mix both; for example, you can set up a Paddler balancer on your server rack, and ask a colleague in the office with an RTX 5090 to plug in ad hoc as an agent if they do not need their entire compute.
The world is your oyster with this one. :)
See the desktop app docs to get started.
- Set up a basic LLM cluster
- Use Paddler's web admin panel
- Generate tokens and embeddings
- Use function calling
- Use grammars
- Use multimodal models
- Create a multi-agent fleet
- Go beyond a single device
All code in the project is human-reviewed, and most is handcrafted. We have been experimenting with using AI to generate some code, and so far, we have had success with:
- coding and maintaining the HTTP client that connects to the core library
- creating an integration test harness for Paddler, where we were able to consolidate all the existing tests to use the new, improved harness almost automatically
If you successfully generate something, you can submit it. We will still need to review it, so make sure you understand what you are doing.
You can try, though. :) We have even added CLAUDE.md with some code style and other basic instructions.
We initially wanted to use the Raft consensus algorithm (thus Paddler, because it paddles on a Raft), but eventually dropped that idea. The name stayed, though.
Later, people started sending us the "that's a paddlin'" clip from The Simpsons, and we just embraced it.
