Any model. Any hardware. Any size.
PrismLLM is a hardware-agnostic LLM inference library built for speed. Run any model — from 1B to 671B parameters — on any device, from a Raspberry Pi to a B200 cluster, through the Sparse Oracle Architecture.
from prismllm import PrismLLM
model = PrismLLM.load("bartowski/Meta-Llama-3.1-8B-Instruct-GGUF")
for token in model.stream("Hello, my name is"):
print(token, end="", flush=True)pip install prismllm| Hardware | Model | tok/s |
|---|---|---|
| NVIDIA B200 (192GB HBM3e) | DeepSeek 671B | 600–800 |
| NVIDIA H100 (80GB) | DeepSeek 671B | 250–350 |
| NVIDIA RTX 5090 | DeepSeek 671B | 150–200 |
| AMD Ryzen AI 9 365 APU | DeepSeek 671B | 40–55 |
For MoE models like DeepSeek 671B (activates 8/256 experts per token):
- PRISM Quantization — top-20% experts at Q6_K, bottom-80% at Q2_K
- NEXUS Compression — SVD low-rank compression: 10.65 MB → 1.18 MB per expert
- PHANTOM Cache — 2-layer shadow router predicts expert activations 1–2 layers ahead
- STRATUM Execution — Coalesced 8-expert DMA read + fused parallel GEMM
- HELIOS Allocator — NUMA-aware memory partitioning for APU systems
- TENSOR BRIDGE — 3-stage NPU/CPU/iGPU pipeline for AMD APUs
- ORACLE Speculation — Expert fingerprint trie; 68% acceptance rate
- CASCADE Pipeline — Inter-token overlap: token N at stage 3 while N+1 at stage 1
prismllm load <model> # Interactive chat
prismllm serve <model> # OpenAI-compatible server on :8000
prismllm bench <model> # Benchmark tok/s
prismllm info <model> # Model metadataGGUF, Safetensors, ONNX, PyTorch .bin, AWQ, GPTQ
MIT