Autonomous Responsible Intelligence Architecture
An AI that's truly yours.
Using capable AI today means renting it from a handful of cloud providers who own the model, the compute, your data, and the memory of every conversation you have. ARIA gives that back. It's an open-source, peer-to-peer protocol that runs AI on hardware you already own: the model you choose, the compute you control, and the data and memory that never leave your machine.
ARIA runs 1-bit/ternary models CPU-first, scales up through standard quantized and specialist models when you want more, and links nodes into a peer-to-peer network. That network gives you access to models bigger than your own hardware can hold, plus a resilient, license-gated library of open weights.
Sovereignty here means ownership across the four things cloud AI centralizes:
- The model: open weights you choose and can inspect, not a black box behind an API.
- The compute: your hardware, CPU-first, GPU-optional.
- The data & memory: they stay on your device; nothing is uploaded to be mined.
- The governance: MIT-licensed, auditable, community-governed, with a public no-rug-pull commitment.
Private, offline inference is table stakes; tools like llama.cpp already give you
that. Sovereignty is the larger claim: ownership and control end to end, including
the parts the local-AI world still leaves centralized, namely model distribution,
long-term memory, and governance.
ARIA runs the same protocol across tiers, so a vintage laptop and a modern workstation each get the best model they can actually run:
| Tier | Models (representative) | Backend | Memory floor | Use case |
|---|---|---|---|---|
| 🌱 Efficiency | BitNet b1.58 2B-4T · Falcon-E 1B/3B · Falcon3 1.58-bit 1B–10B | bitnet.cpp | 0.4 GB RAM | Always-on chat on any CPU; low-power laptops; background nodes |
| ⚡ Quality | Gemma 4 · Qwen 3.5 · GLM-4 9B · Granite 4.0 · Phi-4 mini · SmolLM3 · OLMo 3 · Ministral 3 | mainline llama.cpp | 1.6 GB RAM | Multilingual chat, longer context, multimodal (vision + audio) |
| 🛠️ Specialist | Qwen2.5-Coder 7B · DeepSeek-R1 0528 8B · Granite 3.3 8B · Phi-4 mini reasoning · Qwen3-VL 2B | mainline llama.cpp | 1.5 GB RAM | Code, reasoning, math, vision; routed on demand |
| 🚀 Accelerated | Qwen3 8B/14B · Phi-4 14B · Mistral Nemo 12B · DeepSeek-R1 14B | llama.cpp + GPU offload | 4 GB VRAM | Dense models on a discrete GPU / Apple Silicon (≈6–12 GB VRAM) |
A node operator picks one of the profile presets (minimal · efficient
(default) · balanced · full · specialist_only), and the profile decides which
tiers light up. The default keeps nodes lean (Efficiency only); heavier profiles
add Quality and Specialist; the Accelerated tier engages automatically when a
supported GPU is detected.
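As a rough illustration of how a profile could decide which tiers are active, here is a minimal Python sketch. Only the default (efficient means Efficiency only) and the GPU rule are stated in the text. The tier sets for minimal, balanced, full and specialist_only, along with every name in the code, are assumptions made for illustration and are not ARIA's actual configuration code.

```python
from enum import Enum


class Tier(Enum):
    EFFICIENCY = "efficiency"
    QUALITY = "quality"
    SPECIALIST = "specialist"
    ACCELERATED = "accelerated"


# Hypothetical mapping: only "efficient" (Efficiency only) is stated outright
# in the text; the other rows are guesses made for illustration.
PROFILE_TIERS = {
    "minimal": {Tier.EFFICIENCY},
    "efficient": {Tier.EFFICIENCY},
    "balanced": {Tier.EFFICIENCY, Tier.QUALITY},
    "full": {Tier.EFFICIENCY, Tier.QUALITY, Tier.SPECIALIST},
    "specialist_only": {Tier.SPECIALIST},
}


def active_tiers(profile: str = "efficient", gpu_detected: bool = False) -> set[Tier]:
    """Return the tiers a node would enable for a profile (illustrative only)."""
    if profile not in PROFILE_TIERS:
        raise ValueError(f"unknown profile: {profile!r}")
    tiers = set(PROFILE_TIERS[profile])
    # The text says the Accelerated tier engages when a supported GPU is detected.
    if gpu_detected:
        tiers.add(Tier.ACCELERATED)
    return tiers
```

Keeping the GPU check separate from the profile table mirrors the description: the operator chooses the profile, while acceleration is a property of the detected hardware.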
Full catalog, license matrix and model URLs: docs/MODELS.md.
Honest about the limits:
- CPU-first means accessible and efficient, not "no GPU, ever." Most people have some GPU, and ARIA uses it when present (the Accelerated tier). We don't pretend the CPU is always fastest; we make sure running on the CPU is enough.
- 1-bit models are great for chat and simple tasks, but not yet reliable for agentic tool-use. That's a known capability gap we're actively working on, not a solved feature. We'd rather tell you up front than oversell it.
The peer-to-peer layer gives you two things a purely local setup can't:
- Models bigger than your hardware: distributed inference lets a model that wouldn't fit on your machine run across willing peers.
- A resilient, community-hosted library of open models: no central registry to throttle, paywall, or take down.
The license gate is a feature, not an afterthought. Only green-licensed weights
propagate across the network: MIT, Apache 2.0, BSD, and the TII Falcon license (see
docs/MODELS.md for the full green/yellow/red breakdown). A
transparent denylist blocks illegal content. That's what keeps a community-hosted
library both legal and trustworthy.
Built on infrastructure already in the codebase: Kademlia DHT, NAT traversal, Ed25519 identities, TLS transport.
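Kademlia organizes peers by the XOR distance between node IDs. The sketch below shows only that metric and a nearest-peer lookup over a flat list. Deriving a node ID by hashing an Ed25519 public key with SHA-256, the placeholder key bytes, and the function names are assumptions for illustration, not ARIA's documented scheme.

```python
import hashlib


def node_id(public_key: bytes) -> int:
    """Derive an integer node ID from a public key (assumed scheme: SHA-256)."""
    return int.from_bytes(hashlib.sha256(public_key).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance metric: bitwise XOR of two IDs, read as an integer."""
    return a ^ b


def closest_peers(target: int, peers: list[int], k: int = 3) -> list[int]:
    """Return the k peer IDs closest to target under the XOR metric."""
    return sorted(peers, key=lambda p: xor_distance(target, p))[:k]


# Placeholder 32-byte values stand in for real Ed25519 public keys.
peers = [node_id(bytes([i]) * 32) for i in range(1, 6)]
target = node_id(b"model:BitNet-b1.58-2B-4T")
nearest = closest_peers(target, peers)
```

A real DHT keeps peers in k-buckets rather than sorting a flat list; the sort here only makes the metric easy to see.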
Honest about the limits: the network's value is real even at small scale, because a resilient library and bigger-than-your-hardware models don't need thousands of nodes. We deliberately under-promise on network size; ARIA is built to be useful to one person on one laptop first.
Status: roadmap. This describes where ARIA is going, not what it does today.
The goal: an AI that actually remembers you (your context, your preferences, the things you asked it to follow up on), entirely on your device, with nothing in the cloud. It maps to the five types of human memory, including prospective memory (remembering to do things later), which no major open-source agent framework ships.
It's sequenced after the core (inference + library) lands, and we'll describe it as planned until it's actually built. Memory is the most personal data a system holds; keeping it 100% local is the strongest expression of "an AI that's yours."

    pip install aria-protocol

    # Default profile is "efficient": 1-bit only, low RAM, any CPU
    aria node start --port 8765
    # OpenAI-compatible API
    aria api start --port 3000

    # Efficiency tier (default)
    aria model download BitNet-b1.58-2B-4T
    # Quality tier (after switching profile)
    aria node profile set balanced
    aria model download Gemma-4-E2B
    # Specialist tier (after switching profile)
    aria node profile set full
    aria model download Qwen2.5-Coder-7B-Instruct

Any OpenAI-compatible client can then talk to the local API:

    from openai import OpenAI

    # The local API listens on the port passed to "aria api start".
    client = OpenAI(base_url="http://localhost:3000/v1", api_key="aria")
    response = client.chat.completions.create(
        model="BitNet-b1.58-2B-4T",
        messages=[{"role": "user", "content": "What is quantum computing?"}],
    )
    print(response.choices[0].message.content)

The router picks a model automatically when none is specified; pass a catalog ID
when you want a specific one. Full walkthrough:
docs/getting-started.md.
ARIA v0.9.5 ships with 30 active models across four tiers. Every entry
passes a strict license gate at import: the catalog only contains models under MIT,
Apache 2.0, or TII Falcon licenses, so peer-to-peer redistribution stays
friction-free. Models considered and rejected on licensing grounds (e.g. Llama 3.x,
Gemma 3, Mistral research) are listed in docs/MODELS.md with the
rejection reasoning.
| Tier | # models | License surface |
|---|---|---|
| 🌱 Efficiency | 8 | MIT · TII Falcon 2.0 |
| ⚡ Quality | 10 | Apache 2.0 · MIT |
| 🛠️ Specialist | 5 | Apache 2.0 · MIT |
| 🚀 Accelerated | 7 | Apache 2.0 · MIT |
Counts are active models (30 total). The catalog source
(aria/model_catalog.py) carries 33 entries. The three extras are two superseded
models (hidden in the desktop UI) and Whisper Large v3 Turbo, which is reserved
for the v1.0 audio backend.
Adding a model is a pull request against aria/model_catalog.py. The gate refuses
non-permissive licenses at import time, so the roster cannot drift.
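A minimal sketch of what an import-time license gate could look like. The real ModelEntry in aria/model_catalog.py may have different fields and identifiers; the class below, its field names, the license strings and the two example entries are hypothetical.

```python
from dataclasses import dataclass

# License families the text lists for the catalog; the exact strings are assumed.
ALLOWED_LICENSES = frozenset({"MIT", "Apache-2.0", "TII-Falcon-2.0"})


class LicenseGateError(ValueError):
    """Raised when a catalog entry carries a license the gate does not accept."""


@dataclass(frozen=True)
class ModelEntry:
    model_id: str
    tier: str
    license: str

    def __post_init__(self) -> None:
        # Validation runs when the entry is constructed, so a module-level
        # catalog with a bad entry fails as soon as the module is imported.
        if self.license not in ALLOWED_LICENSES:
            raise LicenseGateError(
                f"{self.model_id}: license {self.license!r} is not on the allow-list"
            )


# Illustrative entries only; see docs/MODELS.md for the real roster.
CATALOG = [
    ModelEntry("BitNet-b1.58-2B-4T", "efficiency", "MIT"),
    ModelEntry("Qwen2.5-Coder-7B-Instruct", "specialist", "Apache-2.0"),
]
```

Because the check lives in the entry's constructor, a pull request that adds a non-permissive model cannot even be imported, which is one way the roster could be kept from drifting.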
ARIA detects the local CPU/GPU/NPU at startup and ships the snapshot in the peer handshake, so routers can prefer hardware-friendly peers:

    aria hardware info

CPU detection covers Intel / AMD / Apple Silicon / Qualcomm Snapdragon, including
the AVX-512 capability bitnet.cpp uses for native ternary kernels. Discrete GPUs
are used by the Accelerated tier.
NPU: detection ships today; acceleration is on the roadmap. ARIA recognizes AMD
XDNA/XDNA2, Intel NPU, Qualcomm Hexagon and Apple ANE devices, but inference still
runs on CPU/GPU. Real NPU acceleration is a v1.0 target (OpenVINO, QNN, Core ML).
See docs/NPU_SUPPORT.md.
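To make the handshake snapshot more concrete, here is a hedged sketch of a hardware summary serialized as JSON. The HardwareSnapshot class and its field names are invented for illustration; the actual output of aria hardware info and the wire format in docs/protocol-spec.md may differ. The AVX-512 check looks for the avx512f flag as Linux reports it in /proc/cpuinfo; other platforms need a different probe.

```python
import json
import platform
from dataclasses import asdict, dataclass, field


def has_avx512(cpu_flags: str) -> bool:
    """True if the whitespace-separated flag list includes AVX-512 Foundation."""
    return "avx512f" in cpu_flags.split()


@dataclass
class HardwareSnapshot:
    # Hypothetical fields; not ARIA's real handshake schema.
    arch: str
    avx512: bool
    gpus: list[str] = field(default_factory=list)
    npus: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def snapshot_from_flags(cpu_flags: str) -> HardwareSnapshot:
    """Build a snapshot from a CPU flag string (e.g. the Linux 'flags' line)."""
    return HardwareSnapshot(arch=platform.machine(), avx512=has_avx512(cpu_flags))


snap = snapshot_from_flags("fpu sse2 avx2 avx512f avx512vl")
payload = snap.to_json()
```

Passing the flag string in as an argument keeps the parsing testable on any machine, independent of where the flags were read from.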
Real numbers, reproducible from the repo. All measurements on a single host so they're comparable to each other; treat absolute throughput as indicative across hosts.
Hardware: AMD Ryzen 9 7845HX (12C/24T, Zen 4, 64 GB DDR5)
Build: bitnet.cpp + Clang, AVX-512 enabled · 8 threads, 256 tokens, 5 runs, median
| Model | Params | Type | tok/s |
|---|---|---|---|
| BitNet-b1.58-large | 0.7B | post-quantized | 118.25 |
| Falcon-E-1B-Instruct | 1.0B | native 1-bit | 80.19 |
| Falcon3-1B-Instruct | 1.0B | post-quantized | 56.31 |
| Falcon-E-3B-Instruct | 3.0B | native 1-bit | 49.80 |
| BitNet-b1.58-2B-4T | 2.4B | native 1-bit | 37.76 |
| Falcon3-3B-Instruct | 3.0B | post-quantized | 33.21 |
| Falcon3-7B-Instruct | 7.0B | post-quantized | 19.89 |
| Falcon3-10B-Instruct | 10.0B | post-quantized | 15.12 |
Key finding: on identical hardware, models natively trained in 1-bit (Falcon-E) generate tokens 42% faster at 1B and 50% faster at 3B than their post-training-quantized counterparts (Falcon3). Native ternary training matters more than parameter count below 7B.
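The two percentages in the key finding can be re-derived from the throughput table. A small Python check, using only the median tok/s values reported above (the dictionary and function names are just for this example):

```python
# Median tok/s copied from the Zen 4 results table.
TOK_S = {
    "Falcon-E-1B-Instruct": 80.19,
    "Falcon3-1B-Instruct": 56.31,
    "Falcon-E-3B-Instruct": 49.80,
    "Falcon3-3B-Instruct": 33.21,
}


def speedup_pct(native: str, post_quantized: str) -> float:
    """Relative throughput gain of the native model, in percent."""
    return (TOK_S[native] / TOK_S[post_quantized] - 1.0) * 100.0


gain_1b = speedup_pct("Falcon-E-1B-Instruct", "Falcon3-1B-Instruct")
gain_3b = speedup_pct("Falcon-E-3B-Instruct", "Falcon3-3B-Instruct")
print(f"1B: +{gain_1b:.0f}%  3B: +{gain_3b:.0f}%")  # prints: 1B: +42%  3B: +50%
```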
AMD Ryzen AI 9 HX 370 (Zen 5, native 512-bit AVX-512). Average improvement: +35% across 7 models.
| Model | Zen 4 (t/s) | Zen 5 (t/s) | Δ |
|---|---|---|---|
| Falcon-E-1B | 80.19 | 103.59 | +29% |
| Falcon3-1B | 56.31 | 78.16 | +39% |
| BitNet-2B-4T | 37.76 | 51.82 | +37% |
| Falcon-E-3B | 49.80 | 65.19 | +31% |
| Falcon3-3B | 33.21 | 46.77 | +41% |
| Falcon3-7B | 19.89 | 28.45 | +43% |
| Falcon3-10B | 15.12 | 19.39 | +28% |
Big.LITTLE CPUs need model-size-aware thread tuning: 1B peaks around 6 threads, 7B
around 20. Full results and reproduction harness: benchmarks/.
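The per-model deltas and the +35% average in the Zen 4 vs Zen 5 table follow directly from the two throughput columns. A short Python check, using only the numbers in that table (variable names are just for this example):

```python
from statistics import mean

# (Zen 4 tok/s, Zen 5 tok/s) copied from the comparison table.
RESULTS = {
    "Falcon-E-1B": (80.19, 103.59),
    "Falcon3-1B": (56.31, 78.16),
    "BitNet-2B-4T": (37.76, 51.82),
    "Falcon-E-3B": (49.80, 65.19),
    "Falcon3-3B": (33.21, 46.77),
    "Falcon3-7B": (19.89, 28.45),
    "Falcon3-10B": (15.12, 19.39),
}

# Relative gain of Zen 5 over Zen 4 for each model, in percent.
deltas = {
    name: (zen5 / zen4 - 1.0) * 100.0 for name, (zen4, zen5) in RESULTS.items()
}
average = mean(deltas.values())

for name, delta in deltas.items():
    print(f"{name:<14} +{delta:.0f}%")
print(f"average        +{average:.0f}%")
```

The average here is the unweighted mean of the seven per-model gains, which reproduces the stated figure.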

    ┌──────────────────────────────────────────────────────────────────────┐
    │                         ARIA PROTOCOL v0.9.5                         │
    ├──────────────────────────────────────────────────────────────────────┤
    │ SERVICE    OpenAI-compatible API · Desktop App · CLI · Dashboard     │
    ├──────────────────────────────────────────────────────────────────────┤
    │ CONSENSUS  Provenance Ledger · Proof of Useful Work ·                │
    │            Proof of Sobriety · Consent Contracts                     │
    ├──────────────────────────────────────────────────────────────────────┤
    │ COMPUTE    SmartRouter -> BackendDispatcher -> two llama-servers     │
    │            - bitnet.cpp :8081  Efficiency (1.58-bit ternary)         │
    │            - llama.cpp  :8082  Quality / Specialist (CPU);           │
    │                                Accelerated = + CUDA/GPU offload      │
    ├──────────────────────────────────────────────────────────────────────┤
    │ NETWORK    P2P · WebSocket · Kademlia DHT · NAT traversal ·          │
    │            Ed25519 identities · TLS                                  │
    └──────────────────────────────────────────────────────────────────────┘

The router is a pure function: it takes a query, classifies it, picks
(tier, model_id) from a routing table, and returns a routing decision with a
fallback chain. Backends are independent processes the router dispatches to by
model ID. Detailed view: docs/architecture.md. P2P wire
format: docs/protocol-spec.md.
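The router description above (classify, look up a routing table, return a decision with a fallback chain) can be sketched as a pure function. Everything here is illustrative: the routing table contents, the keyword classifier, and the RoutingDecision type are assumptions, not the code in aria/smart_router.py.

```python
from dataclasses import dataclass

# Hypothetical routing table: query class -> ordered (tier, model_id) candidates.
ROUTING_TABLE = {
    "code": [
        ("specialist", "Qwen2.5-Coder-7B-Instruct"),
        ("efficiency", "BitNet-b1.58-2B-4T"),
    ],
    "chat": [("efficiency", "BitNet-b1.58-2B-4T")],
}

CODE_HINTS = ("def ", "class ", "traceback", "compile", "function")


@dataclass(frozen=True)
class RoutingDecision:
    query_class: str
    tier: str
    model_id: str
    fallbacks: tuple[tuple[str, str], ...]


def classify(query: str) -> str:
    """Toy keyword classifier; a real one would be far more careful."""
    lowered = query.lower()
    return "code" if any(hint in lowered for hint in CODE_HINTS) else "chat"


def route(query: str) -> RoutingDecision:
    """Pure function: the same query always yields the same decision, with no I/O."""
    query_class = classify(query)
    (tier, model_id), *rest = ROUTING_TABLE[query_class]
    return RoutingDecision(query_class, tier, model_id, tuple(rest))


decision = route("Write a function that reverses a list")
```

Because route() does no I/O, a dispatcher can call it freely and decide separately whether the chosen backend is reachable, walking the fallbacks in order if it is not.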
| Document | Description |
|---|---|
| Getting Started | Install, first node, per-tier examples |
| Models | Full catalog, license matrix, model URLs |
| NPU Support | Detection today, acceleration roadmap |
| Architecture | Tiers, backends, P2P |
| Protocol Spec | Peer-to-peer wire protocol |
| API Reference | OpenAI-compatible HTTP endpoints |
| Threat Model | Security analysis |
| Smart Router | Routing table, classifier, fallback |
| Benchmarks | Methodology and full result sets |
| Roadmap | All versions and tasks |
Download latest release: Windows, macOS (Intel + Apple Silicon), Linux.
A chat-centric interface with a tier badge, profile-preset switcher, hardware panel,
and layout presets; a mode switch separates Chat from Node (Dashboard, Models,
Energy, Network). See desktop/README.md for build instructions.
| Problem | What ARIA does |
|---|---|
| You don't own cloud AI; you rent it | Runs on hardware you own; the model, data and memory are yours |
| Capable AI seems to require expensive GPUs | 1-bit/ternary models run efficiently CPU-first; GPU is optional |
| Your hardware caps what you can run | The P2P network reaches models bigger than your machine |
| Model distribution is centralized and gatekept | A resilient, license-gated peer library of open weights |
| Outputs are untraceable | Every inference is recorded on the provenance ledger |
| One provider means one dependency | An open catalog from 8+ organizations under permissive licenses |
CPU-first · explicit consent · energy sobriety · useful work · provenance · transparency (MIT, auditable) · accessibility (one-click node).
No token. Ever. ARIA is founded on reputation and contribution, not on a token or any financial instrument. Contributors earn a contribution score for useful work; standing on the network is a reputation requirement, not a stake.
Pull requests welcome. To keep ARIA trustworthy as it grows, the project commits up front to: a permissive license (MIT/Apache-2.0) on the core, contributions under a DCO (not a CLA, so contributors keep their rights), upstream attribution, and the enforced license gate on the model catalog.
Areas where help is most useful:
- New models: add a ModelEntry to aria/model_catalog.py (the license gate enforces P2P-compatible licenses at import).
- Routing improvements: better classifiers or routing tables in aria/smart_router.py.
- Backends: wiring acceleration for GPU/NPU under aria/backends/.
- Docs and examples: every example in examples/ is welcome.

    git clone https://github.com/spmfrance-cloud/aria-protocol.git
    cd aria-protocol
    pip install -e ".[dev]"
    make test

Code style: PEP 8, type hints on public APIs, focused functions, tests alongside the change.
MIT. See LICENSE.

    @misc{aria2026,
      author = {Anthony MURGO},
      title = {ARIA: Autonomous Responsible Intelligence Architecture},
      year = {2026},
      url = {https://github.com/spmfrance-cloud/aria-protocol}
    }

- Microsoft Research BitNet: 1-bit ternary research and bitnet.cpp
- TII Falcon: Falcon-Edge and Falcon3 1.58-bit families
- ggml-org: mainline llama.cpp and GGUF builds
- Qwen team: Qwen families
- DeepSeek: distilled reasoning weights
- OpenBMB: MiniCPM-V vision