A custom minimal AI agent with persistent memory, MCP tool integration, Discord/CLI interface, and structured observability.
Runs entirely on local hardware — no cloud API costs.
~2300 lines of Python. No frameworks.
We evaluated existing agent frameworks and rejected them all:
- OpenClaw: 400K lines of code, 42K exposed instances on Shodan. Too large to audit, too large to trust.
- ZeroClaw: 9 days old at time of evaluation. Too immature.
- NanoClaw: Too thin — would need to rebuild most of it anyway.
The core needs (memory, tools, chat, logging) are individually well-solved problems. No 400K-line framework needed.
Discord (discord.py) CLI REPL (no token)
| |
v v
+---------------------------------+
| Luna Agent Core |
| |
| agent.py | agent loop: msg → memory → prompt → LLM → tools → respond
| ├── llm.py | single LLM client, configurable endpoint
| ├── memory.py | SQLite + FTS5 + sqlite-vec hybrid search
| ├── tools.py | native tools: bash, files, web, delegate, code_task
| ├── tool_output.py | smart output pipeline for large results
| ├── mcp_manager.py | MCP client for community tool servers
| └── observe.py | structured JSON logging
| |
+---------------------------------+
|
v
llama-server Qwen3.5-35B-A3B on 2x RTX 3090
All LLM traffic flows through a single LLMClient with a configurable endpoint URL. Today it points at localhost:8001 (llama-server). To insert an AI firewall later, change the URL to localhost:9000 — zero code changes required.
Thinking model support: Luna handles reasoning models (Qwen3.5, etc.) automatically — extracting reasoning_content, falling back to cleaned reasoning when content is empty, and stripping leaked markup (<thinking>, <tool_call>, etc.) from output.
- Intel i7-13700K, 64GB DDR4
- 2x NVIDIA RTX 3090 (24GB each, 48GB total)
- Qwen3.5-35B-A3B Q8_0 via llama-server with layer split across both GPUs
- 131K context window, Q8_0 KV cache
cd ~/luna-agent
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Run without Discord (interactive CLI REPL)
python -m luna
# Run with Discord
DISCORD_TOKEN=your-token-here python -m lunaluna-agent/
├── config.toml # All configuration
├── mcp_servers.json # MCP server registry
├── pyproject.toml # Dependencies
├── luna/
│ ├── __main__.py # Entry point (python -m luna)
│ ├── agent.py # Core agent loop
│ ├── llm.py # LLM client (OpenAI-compatible)
│ ├── memory.py # Memory (SQLite + FTS5 + sqlite-vec)
│ ├── tools.py # Native tools (bash, files, web)
│ ├── tool_output.py # Large output persistence + filtering
│ ├── mcp_manager.py # MCP tool client
│ ├── discord_bot.py # Discord interface
│ ├── observe.py # Structured JSON logging
│ └── config.py # Config loader
├── tests/
│ ├── test_agent.py # Agent loop tests
│ ├── test_llm.py # LLM client tests
│ ├── test_memory.py # Memory system tests
│ ├── test_tools.py # Native tool tests
│ └── test_tool_output.py # Output pipeline tests
├── luna-agent.service # systemd unit for the agent
├── worker-agent.service # systemd unit for llama-server (Qwen3.5-35B-A3B)
└── data/ # Created at runtime
├── memory.db # SQLite database
├── logs/ # JSON log files
│ └── luna-YYYY-MM-DD.jsonl
└── tool_outputs/ # Persisted large tool outputs
All settings live in config.toml. Environment variables override for secrets:
| Env Var | Overrides | Required |
|---|---|---|
DISCORD_TOKEN |
Discord bot token | Yes (for Discord) |
LLM_ENDPOINT |
[llm] endpoint |
No |
LLM_MODEL |
[llm] model |
No |
MEMORY_DB_PATH |
[memory] db_path |
No |
LOG_DIR |
[observe] log_dir |
No |
See config.toml for all available settings and their defaults.
The orchestrator. Receives a message and session ID, then:
- Saves the user message to memory
- Searches for relevant memories (hybrid FTS + vector)
- Retrieves the session summary (if any)
- Builds a system prompt with memories, summary, and current time
- Loads the last 20 messages for context
- Calls the LLM with all available tools (native + MCP)
- Enters a tool call loop (max 25 rounds):
- Executes each tool call (native or MCP)
- Feeds results back to the LLM
- Repeats until the LLM responds without tool calls
- Saves the assistant response
- Triggers conversation summarization if enough messages have accumulated
Thin async wrapper around the OpenAI-compatible API. Single chat() method that handles tool calls, thinking model output, and per-call temperature overrides. This is the only code that talks to the LLM — the AI firewall insertion point.
Returns structured LLMResponse objects with content, reasoning, tool calls, and token usage.
SQLite-based persistent memory with three search strategies combined via Reciprocal Rank Fusion:
- FTS5 keyword search — fast exact/stemmed term matching (Porter stemmer + Unicode61 tokenizer)
- sqlite-vec cosine similarity — semantic search via nomic-embed-text-v1.5 embeddings
- Recency + importance weighting — recent and important memories rank higher
Scoring formula:
final_score = rrf_score + (recency_weight × 2^(-age_days / 7)) + (importance / 10 × 0.1)
Database tables:
| Table | Purpose |
|---|---|
messages |
Every message persisted per session |
memories |
Extracted facts with embeddings and importance scores |
summaries |
LLM-generated compression of old message blocks |
memories_fts |
FTS5 virtual table for keyword search |
memories_vec |
sqlite-vec virtual table for vector search |
Conversation compression: Every N messages (default 50), the LLM summarizes the conversation and extracts facts with importance scores (1-10). Facts above the threshold (default 3.0) are stored as memories. This enables effectively infinite conversations — the agent always has a summary of what came before plus searchable memory of key facts.
All retrieval parameters (top_k, RRF k, recency weight, importance threshold, etc.) are in config.toml for experimentation.
Built-in tools that don't require external MCP servers:
| Tool | Description |
|---|---|
bash |
Execute shell commands with safety guardrails |
read_file |
Read files with optional offset/limit for large files |
write_file |
Write or append to files, creates parent directories |
list_directory |
List files/directories, optional recursion with depth limits |
web_fetch |
Fetch a URL and convert HTML to markdown via html2text |
web_search |
Search the web via DuckDuckGo, returns structured results |
delegate |
Hand off a self-contained subtask to a sub-agent with its own tool loop |
code_task |
Delegate a coding task to a sub-agent with a write-run-fix loop |
summarize_paper |
Fetch and summarize an arXiv paper |
list_available_tools |
Discover MCP tools available from connected servers |
use_tool |
Call a specific MCP tool by name |
Bash safety: Commands are checked against blocked patterns before execution:
rm -rf /,mkfs,dd if=,shutdown,reboot, fork bombs, writes to/dev/sda- Timeout enforcement: default 30s, max 120s
- Output capped at 50KB
Handles large tool outputs so they don't overwhelm the LLM context:
- Small outputs (< 10KB) — passed through directly
- Large outputs — processed through a pipeline:
- Persist — full output saved to
data/tool_outputs/with a deterministic filename (content hash + source label) - Python filter — keyword matching against the user's query context, with structural detection (headers, code blocks). Includes 1 line of surrounding context per match.
- LLM extraction — if the Python filter finds fewer than 5 keyword matches, the LLM extracts relevant parts from the raw output
- File reference — a footer with the persisted file path is appended so the agent can inspect the full output later
- Persist — full output saved to
Connects to community MCP servers via stdio transport. On startup it spawns configured servers, discovers their tools, and converts schemas to OpenAI function-calling format. Tool calls from the LLM are routed to the correct server automatically.
Tool namespacing: Tools are prefixed with the server name (browser__navigate, filesystem__read_file) to avoid collisions between servers.
Configure servers in mcp_servers.json:
{
"servers": {
"browser": {
"command": "npx",
"args": ["-y", "@playwright/mcp"],
"transport": "stdio"
}
}
}Adding a new tool is editing JSON — no code changes.
Responds to DMs, @mentions, and replies in threads it created. Shows a typing indicator while the agent is processing.
Session isolation: Session IDs are derived from message context to keep memory separate:
| Context | Session ID |
|---|---|
| Thread | thread-{thread_id} |
| DM | dm-{user_id} |
| Channel | ch-{channel_id}-{user_id} |
Long responses are split at newlines (preferred), spaces, or hard-split at 2000 characters to stay within Discord's limit.
Every LLM call, tool execution, memory operation, and Discord message is logged as structured JSON.
Dual output:
- File —
data/logs/luna-YYYY-MM-DD.jsonl, one file per day, machine-parseable - Console — human-readable format for development
What's logged:
| Component | Events |
|---|---|
| LLM | llm_call, llm_response (tokens, latency, tools used) |
| Memory | memory_search (hits, method breakdown), memory_stored, summary_stored |
| Tools | tool_executing, native_tool_call, tool_call (server, tool, duration, errors) |
| Discord | discord_ready, discord_message (session, author, channel) |
| MCP | server_connected, tools_refreshed, mcp_shutdown |
| Agent | agent_process (latency), agent_response (memory hits, tool rounds) |
| Output | output_persisted, llm_extraction_triggered |
Inspection:
# Watch logs in real-time
tail -f data/logs/luna-*.jsonl
# Search with jq
jq 'select(.event == "llm_response")' data/logs/luna-*.jsonl
jq 'select(.latency_ms > 5000)' data/logs/luna-*.jsonlDataclass-based configuration loaded from config.toml with environment variable overrides. Relative paths are resolved against the project root. All fields have sensible defaults — the agent starts with zero configuration if a config.toml is present.
Copy the systemd service files and enable them:
sudo cp luna-agent.service /etc/systemd/system/
sudo cp worker-agent.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now worker-agent # Start LLM server (Qwen3.5-35B-A3B) first
sudo systemctl enable --now luna-agent # Then the agent (depends on worker-agent)Monitor:
journalctl -u luna-agent -f
journalctl -u worker-agent -fCLI mode (no Discord token): The agent starts an interactive REPL where tool calls print inline as they execute, then the final response prints below. Useful for testing without Discord.
8 runtime packages, no heavy frameworks:
| Package | Purpose |
|---|---|
discord.py |
Discord API client |
openai |
OpenAI-compatible HTTP client |
mcp[cli] |
Model Context Protocol SDK |
sentence-transformers |
Embedding model runtime |
einops |
Tensor operations for embeddings |
sqlite-vec |
Vector search in SQLite |
html2text |
HTML to markdown conversion |
duckduckgo-search |
Web search |
Dev: pytest, pytest-asyncio
Python: >= 3.11
- No AI firewall (future — just don't block the insertion point)
- No web dashboard (future phase of observability)
- No multi-user auth (single user)
- No cloud LLM fallback (local only)
- No containers for the agent (systemd is simpler)
See CONTRIBUTING.md for development setup, testing, and pull request guidelines.
MIT — Fabio Nonato, 2026