edge-llama 🚀

Native LLM inference on Jetson — no Ollama, no cloud, just shared libraries.

Links libllama.so (llama.cpp) directly into your process. Loads GGUF models, generates text at 19 t/s on CPU, and can be embedded into MUDs, gateways, or mesh agents via a 51KB shared library (libedge-cuda.so).

Quick Start

# Build
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j4

# Run test (points to your deepseek-r1 GGUF)
./edge_test

# Start the interactive MUD on port 4003
./flato 4003 /tmp/edge-native.sock

Status: v0.6.0 — Production CPU Inference ✅

Works

Native CPU inference at 19 t/s on deepseek-r1:1.5b (Q4_K_M, 1.04GB GGUF)
libedge-cuda.so — 51KB shared library, links libllama.so directly
edge_native.py — Python ctypes wrapper (with EdgeModel("model.gguf") as e: e.generate(...))
Streaming API — callback-based per-token generation (edge_generate_stream)
Python singleton — EdgePlatoModel loaded once, re-entrant via threading.Lock
flato MUD — C17 telnet server on :4003 with /think, /status, /gpu, /cuda, /peers
Edge gateway integration — ?native=true routes through native inference at 18 t/s
SSE streaming through OpenAI-compatible /v1/chat/completions?native=true&stream=true

Architecture

Your App (Python/Evennia/C)
        │
        ▼
libedge-cuda.so  (51KB shared library)
        │
        ▼
libllama.so      (llama.cpp C API — GGUF, tokenize, sample, generate)
        │
        ▼
    model.gguf    (deepseek-r1:1.5b, ~1.0GB on disk)

No HTTP. No subprocess. No Python loops. Just dlopen() and call.

Commands

Command	Where	What
`@infer <prompt>`	Evennia (Plato MUD)	Native inference, streaming
`@think <prompt>`	Evennia	Ship AI — ship persona response
`/think <prompt>`	flato (telnet :4003)	Same, via Unix socket
`/gpu`	flato	nvidia-smi status
`/cuda`	flato	CUDA toolkit + device + CMA info
`?native=true`	Edge gateway (:11435)	Native backend instead of Ollama

Blocked (GPU)

GPU inference — CMA pool depleted (NVIDIA driver allocates during first CUDA call, never frees). Workaround: CUDA_VISIBLE_DEVICES="" for CPU-only. Fix: reboot with cma=1024M.

Why Not Just Use Ollama?

Ollama is great. edge-llama is different:

	Ollama	edge-llama
API	HTTP (subprocess)	Shared library (in-process)
Latency	~50ms overhead	~2μs function call
Embedding	Web server	`dlopen()` into your app
Fleet agent	Separate daemon	Link into MUD process
Binary size	~200MB	51KB

The edge gateway (edge-gateway.py) uses both: Ollama by default, native fallback when Ollama is down, forced native via ?native=true.

Build Dependencies

C++17 compiler (GCC or Clang)
libllama.so — installable via pip install llama-cpp-python (includes .so)
CUDA 12.6 (optional — auto-detected at runtime)

Files

build/libedge-cuda.so      # Shared library — 51KB
build/edge_test            # Test binary
edge_native.py             # Python ctypes wrapper
flato.c                    # C17 MUD telnet server
src/
  edge-cuda.h              # Public API header
  edge-cuda.c              # C stub → C++ impl
  edge-cuda-impl.cpp       # Links libllama.so, calls llama_eval

Design

The wedge. edge-llama is the first brick in the wall. From here: full GPU inference (after reboot), mesh-native embeddings, in-MUD training.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
__pycache__		__pycache__
docs		docs
messages		messages
src		src
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md
edge_native.py		edge_native.py
flato		flato
flato.c		flato.c
fleet-agent		fleet-agent
fleet-agent.c		fleet-agent.c
full_forward.cpp		full_forward.cpp
jc1-research-agent.py		jc1-research-agent.py
plato-server.service		plato-server.service
plato-sync.py		plato-sync.py
plato-synthesizer.py		plato-synthesizer.py
sensor-pipeline.py		sensor-pipeline.py
sensor-pipeline.service		sensor-pipeline.service
sensor-pipeline.timer		sensor-pipeline.timer
test_edge		test_edge
test_edge.cpp		test_edge.cpp
true_lambda.py		true_lambda.py
warp-room.service		warp-room.service
warp-room.timer		warp-room.timer
warp_room.py		warp_room.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

edge-llama 🚀

Quick Start

Status: v0.6.0 — Production CPU Inference ✅

Works

Architecture

Commands

Blocked (GPU)

Why Not Just Use Ollama?

Build Dependencies

Files

Design

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

edge-llama 🚀

Quick Start

Status: v0.6.0 — Production CPU Inference ✅

Works

Architecture

Commands

Blocked (GPU)

Why Not Just Use Ollama?

Build Dependencies

Files

Design

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages