## Description
Integrate llmfit (MIT-licensed, Rust) into DreamServer's installer pipeline to replace the current hardcoded tier-based model selection with dynamic, hardware-aware model recommendations.
DreamServer currently uses a 3-stage bash/python pipeline (detect-hardware.sh → classify-hardware.sh → build-capability-profile.sh) that maps hardware to fixed tiers (T1–T4, SH_LARGE/SH_COMPACT, AP_BASE/PRO/ULTRA), each with one hardcoded model. This works for common setups but doesn't scale — it can't evaluate quantization trade-offs, doesn't estimate speed, and requires manual updates when new models are released.
llmfit detects hardware (NVIDIA multi-GPU, AMD, Intel Arc, Apple Silicon, Ascend NPU), scores 200+ models across Quality/Speed/Fit/Context dimensions, dynamically selects the best quantization that fits, and estimates token/sec throughput — all in a single cross-platform binary.
| Capability | DreamServer today | llmfit |
|---|---|---|
| GPU detection | NVIDIA, AMD sysfs, Apple | NVIDIA (multi-GPU), AMD, Intel Arc, Apple Silicon, Ascend NPU |
| Model database | ~8 hardcoded models | 206 models from HuggingFace |
| Quantization | Fixed GGUF per tier | Dynamic — picks highest quality that fits (Q8_0 → Q2_K) |
| MoE support | Basic tier split | Full expert offloading with reduced VRAM calculation |
| Speed estimation | None | Bandwidth-based token/sec (~80 GPU profiles) |
| Scoring | VRAM thresholds only | Multi-dimensional with use-case weights |
| Runtime providers | Docker compose | Ollama, llama.cpp, MLX, Docker Model Runner |
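The "picks highest quality that fits" row can be illustrated with a minimal sketch. The quantization names are real GGUF types, but the per-model sizes, the overhead budget, and the selection order are rough assumptions for a hypothetical 8B model, not llmfit's actual data or logic:

```python
# Approximate on-disk sizes (GB) for a hypothetical 8B-parameter model,
# ordered from highest to lowest quality. Figures are illustrative only.
QUANT_SIZES_GB = [
    ("Q8_0", 8.5),
    ("Q6_K", 6.6),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.9),
    ("Q3_K_M", 4.0),
    ("Q2_K", 3.2),
]

def best_fitting_quant(vram_gb: float, overhead_gb: float = 1.5):
    """Return the highest-quality quant whose weights plus a fixed
    KV-cache/overhead budget fit in available VRAM, or None."""
    for name, size_gb in QUANT_SIZES_GB:
        if size_gb + overhead_gb <= vram_gb:
            return name
    return None

print(best_fitting_quant(12.0))  # 12 GB card -> Q8_0
print(best_fitting_quant(6.0))   # 6 GB card  -> Q3_K_M
```

The same hardware thus gets a graceful quality gradient instead of a single hardcoded GGUF per tier.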
## Integration Options

**Option A: Use llmfit as a pre-install advisor (recommended)**

- Run `llmfit --json` during install to get the best model recommendation
- Replace the hardcoded tier map with llmfit's dynamic scoring
- Single binary (~5 MB), cross-platform (Linux, macOS, Windows)
- Could ship as an optional dependency or vendor the binary in releases
**Option B: Use llmfit's model database only**

- Import `hf_models.json` (206 models with parameter counts, quantization sizes, categories)
- Keep DreamServer's own detection but use the richer model data for selection
- No binary dependency — just a JSON file updated periodically
**Option C: Use llmfit's REST API at runtime**

- `llmfit serve` exposes a REST API for model recommendations
- Could query it dynamically for upgrade recommendations or model switching
## Key Benefits
- Better model selection for edge cases — 24GB VRAM, unusual AMD cards, multi-GPU setups
- Dynamic quantization — instead of "you get this one GGUF", pick the best quality that fits
- Speed estimates — users see expected token/sec before committing
- Automatic model database updates — llmfit scrapes HuggingFace, so new models are picked up without code changes
- MoE-aware recommendations — properly handles Mixtral, DeepSeek-V3, Qwen3 MoE models
- Reduced maintenance — offload hardware detection and model scoring to a maintained upstream tool
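The bandwidth-based speed estimate mentioned above follows from a standard rule of thumb: each decoded token streams the full model weights through memory, so memory bandwidth divided by model size bounds tokens/sec. A minimal sketch, where the efficiency factor and the example figures are rough assumptions rather than llmfit's actual per-GPU profiles:

```python
def estimated_tok_per_sec(bandwidth_gb_s: float, model_gb: float,
                          efficiency: float = 0.6) -> float:
    """Bandwidth-bound decode speed: upper bound (bandwidth / model size)
    scaled by an assumed real-world efficiency factor."""
    return efficiency * bandwidth_gb_s / model_gb

# e.g. a ~1000 GB/s card running a ~5 GB Q4 model
print(round(estimated_tok_per_sec(1000, 5.0)))  # ~120 tok/s
```

Even a crude estimate like this lets the installer warn users when a "fits in VRAM" model would still be unpleasantly slow.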
## Use Case
DreamServer is an open-source, self-hosted AI server that auto-installs on Linux, macOS, and Windows. During installation, it must detect the user's GPU (NVIDIA/AMD/Apple Silicon), measure available VRAM/RAM, and automatically select the best LLM model and Docker Compose stack — with zero user input.
Today we solve this with a custom pipeline that maps hardware to a fixed tier and picks one hardcoded model per tier. This breaks down for edge cases: unusual VRAM sizes (e.g., 24GB falls awkwardly between tiers), multi-GPU rigs, MoE models that would fit but that our tier map doesn't consider, and new models that require manual updates to our tier config.
llmfit solves exactly the problems we hit:
- Dynamic model selection — instead of maintaining a static tier→model map, llmfit scores 200+ models against actual hardware and picks the best fit with the best quantization.
- Better hardware coverage — multi-GPU setups, Intel Arc, and Ascend NPUs that we don't currently handle.
- Speed estimation — our installer has no way to tell users how fast inference will be; llmfit provides bandwidth-based token/sec estimates.
- MoE-aware fitting — we hardcode tiers for MoE models, but llmfit dynamically calculates expert offloading for any MoE architecture.
- Reduced maintenance — llmfit's model database is scraped from HuggingFace automatically, eliminating manual tier config updates when new models ship.
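The MoE-aware fitting point comes down to simple arithmetic: only the routed-active experts (plus shared attention weights) must be resident in VRAM per token, while inactive experts can sit in system RAM. A sketch with rough, assumed figures for a Mixtral-8x7B-like model at ~4-bit quantization — llmfit's actual offloading math is more detailed:

```python
def moe_vram_gb(shared_gb: float, expert_gb: float,
                n_experts: int, n_active: int) -> tuple[float, float]:
    """Return (VRAM needed with expert offloading, VRAM for the full model).

    With offloading, only the shared weights plus the active experts
    must be resident; the remaining experts live in system RAM.
    """
    with_offload = shared_gb + n_active * expert_gb
    full = shared_gb + n_experts * expert_gb
    return with_offload, full

offload, full = moe_vram_gb(shared_gb=2.0, expert_gb=3.0,
                            n_experts=8, n_active=2)
print(offload, full)  # 8 GB resident vs 26 GB for the whole model
```

This is exactly the case a static VRAM-threshold tier map cannot express: the model "doesn't fit" by total size but runs fine with offloading.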
## Proposed Solution

Integrate `llmfit --json` as the model recommendation engine in our installer pipeline, called between hardware detection and capability profile generation.
### Integration approach

- Ship the `llmfit` binary alongside DreamServer (or install it during setup via `cargo install llmfit` / a prebuilt release binary)
- Call `llmfit --json --category general` during install to get scored model recommendations
- Parse the JSON output to extract the top-ranked model, its quantization, and estimated speed
- Feed the result into `build-capability-profile.sh` to generate `.capabilities.json`
- Use the recommended model to configure llama-server in the Docker Compose stack
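The parsing step could look roughly like this. The field names (`model`, `quant`, `tok_per_sec`) and the sample payload are placeholders, not llmfit's real schema — the actual `llmfit --json` output needs to be checked before wiring this in:

```python
import json

def pick_top_model(llmfit_json: str) -> tuple[str, str, float]:
    """Extract (model, quantization, est. tok/s) from a best-first
    recommendation list. Field names are assumed, not llmfit's schema."""
    ranked = json.loads(llmfit_json)
    top = ranked[0]
    return top["model"], top["quant"], top["tok_per_sec"]

# Hypothetical payload shaped like a best-first recommendation list.
sample = '''[
  {"model": "llama-3.1-8b-instruct", "quant": "Q5_K_M", "tok_per_sec": 74.2},
  {"model": "qwen2.5-7b-instruct",  "quant": "Q6_K",   "tok_per_sec": 61.0}
]'''
print(pick_top_model(sample))
```

In the installer this would run between the `llmfit --json --category general` call and `build-capability-profile.sh`, emitting the chosen model name for the compose configuration.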
### What changes in DreamServer

- `resolve_tier_config()` in `scripts/resolve-tier-config.sh` would delegate to llmfit instead of the static tier map
- `classify-hardware.sh` could be simplified or removed (llmfit handles classification internally)
- `detect-hardware.sh` could remain as a fallback, or be replaced entirely by llmfit's detection
- `config/gpu-database.json` would no longer need manual maintenance
### What stays the same

- The capability profile schema (`.capabilities.json`)
- Docker Compose overlay selection (amd/nvidia/apple)
- The installer flow (`install-core.sh` → preflight → detect → configure → compose up)
## Alternatives Considered

1. **Keep extending the current tier map** — Add more tiers and models manually. Simple but doesn't scale; every new model or GPU requires code changes. No quantization flexibility or speed estimation.
2. **Build our own model database from HuggingFace** — Write a scraper to build `hf_models.json` ourselves and add scoring logic. Duplicates work that llmfit already does well. Higher maintenance burden.
3. **Use llmfit's model database only (JSON import)** — Import `data/hf_models.json` (206 models) without the binary. Keeps our own detection but gains model coverage. Less integration effort, but we'd still need to build scoring/quantization logic ourselves.
4. **Use llmfit's REST API at runtime** — Run `llmfit serve` as a sidecar service for dynamic model recommendations. More complex operationally, but enables runtime model switching and upgrade suggestions.
We recommend Option 1 (binary integration via `--json`) as the best balance of capability, simplicity, and maintenance. llmfit is MIT-licensed (same as DreamServer), actively maintained (61 releases), cross-platform, and already covers all our target platforms.