Skip to content

AshtonVaughan/prismllm

Repository files navigation

PrismLLM

Any model. Any hardware. Any size.

PrismLLM is a hardware-agnostic LLM inference library built for speed. Run any model — from 1B to 671B parameters — on any device, from a Raspberry Pi to a B200 cluster, through the Sparse Oracle Architecture.

Quick Start

from prismllm import PrismLLM

model = PrismLLM.load("bartowski/Meta-Llama-3.1-8B-Instruct-GGUF")

for token in model.stream("Hello, my name is"):
    print(token, end="", flush=True)

Install

pip install prismllm

Performance

Hardware Model tok/s
NVIDIA B200 (192GB HBM3e) DeepSeek 671B 600–800
NVIDIA H100 (80GB) DeepSeek 671B 250–350
NVIDIA RTX 5090 DeepSeek 671B 150–200
AMD Ryzen AI 9 365 APU DeepSeek 671B 40–55

Sparse Oracle Architecture

For MoE models like DeepSeek 671B (activates 8/256 experts per token):

  • PRISM Quantization — top-20% experts at Q6_K, bottom-80% at Q2_K
  • NEXUS Compression — SVD low-rank compression: 10.65 MB → 1.18 MB per expert
  • PHANTOM Cache — 2-layer shadow router predicts expert activations 1–2 layers ahead
  • STRATUM Execution — Coalesced 8-expert DMA read + fused parallel GEMM
  • HELIOS Allocator — NUMA-aware memory partitioning for APU systems
  • TENSOR BRIDGE — 3-stage NPU/CPU/iGPU pipeline for AMD APUs
  • ORACLE Speculation — Expert fingerprint trie; 68% acceptance rate
  • CASCADE Pipeline — Inter-token overlap: token N at stage 3 while N+1 at stage 1

CLI

prismllm load <model>       # Interactive chat
prismllm serve <model>      # OpenAI-compatible server on :8000
prismllm bench <model>      # Benchmark tok/s
prismllm info <model>       # Model metadata

Supported Formats

GGUF, Safetensors, ONNX, PyTorch .bin, AWQ, GPTQ

License

MIT

About

Any model. Any hardware. Any size. Hardware-agnostic LLM inference engine with the Sparse Oracle Architecture.

Topics

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors