
feat: integrate llmfit for better hardware detection and model selection #226

@evereq

Description

Integrate llmfit (MIT-licensed, Rust) into DreamServer's installer pipeline to replace the current hardcoded tier-based model selection with dynamic, hardware-aware model recommendations.

DreamServer currently uses a 3-stage bash/python pipeline (detect-hardware.sh → classify-hardware.sh → build-capability-profile.sh) that maps hardware to fixed tiers (T1–T4, SH_LARGE/SH_COMPACT, AP_BASE/PRO/ULTRA), each with one hardcoded model. This works for common setups but doesn't scale: it can't evaluate quantization trade-offs, doesn't estimate speed, and requires manual updates when new models are released.

llmfit detects hardware (NVIDIA multi-GPU, AMD, Intel Arc, Apple Silicon, Ascend NPU), scores 200+ models across Quality/Speed/Fit/Context dimensions, dynamically selects the best quantization that fits, and estimates token/sec throughput — all in a single cross-platform binary.

| Capability | DreamServer today | llmfit |
| --- | --- | --- |
| GPU detection | NVIDIA, AMD sysfs, Apple | NVIDIA (multi-GPU), AMD, Intel Arc, Apple Silicon, Ascend NPU |
| Model database | ~8 hardcoded models | 206 models from HuggingFace |
| Quantization | Fixed GGUF per tier | Dynamic: picks highest quality that fits (Q8_0 → Q2_K) |
| MoE support | Basic tier split | Full expert offloading with reduced VRAM calculation |
| Speed estimation | None | Bandwidth-based token/sec (~80 GPU profiles) |
| Scoring | VRAM thresholds only | Multi-dimensional with use-case weights |
| Runtime providers | Docker Compose | Ollama, llama.cpp, MLX, Docker Model Runner |

Integration Options

Option A: Use llmfit as a pre-install advisor (recommended)

  • Run llmfit --json during install to get the best model recommendation
  • Replace hardcoded tier map with llmfit's dynamic scoring
  • Single binary (~5MB), cross-platform (Linux, macOS, Windows)
  • Could ship as an optional dependency or vendor the binary in releases
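A minimal sketch of what the advisor step could look like on the DreamServer side. Note the JSON field names below (`recommendations`, `model`, `quantization`, `estimated_tps`) are illustrative assumptions about llmfit's output schema, not its documented format; the sample payload is mocked, whereas in the installer the raw string would come from running `llmfit --json`:

```python
import json

def parse_recommendation(raw: str) -> dict:
    """Extract the top-ranked model from llmfit --json output.

    Field names here are assumed for illustration; verify against
    llmfit's actual JSON schema before integrating.
    """
    data = json.loads(raw)
    top = data["recommendations"][0]  # assume list is sorted best-first
    return {
        "model": top["model"],
        "quantization": top["quantization"],
        "tokens_per_sec": top.get("estimated_tps"),
    }

# Mocked payload standing in for real `llmfit --json` output:
sample = ('{"recommendations": [{"model": "llama-3.1-8b", '
          '"quantization": "Q6_K", "estimated_tps": 42.5}]}')
print(parse_recommendation(sample))
```

Keeping the parsing behind one small function like this also makes it easy to fall back to the current static tier map if the binary is missing.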

Option B: Use llmfit's model database only

  • Import hf_models.json (206 models with parameter counts, quantization sizes, categories)
  • Keep DreamServer's own detection but use richer model data for selection
  • No binary dependency — just a JSON file updated periodically

Option C: Use llmfit's REST API at runtime

  • llmfit serve exposes a REST API for model recommendations
  • Could query it dynamically for upgrade recommendations or model switching
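A sketch of how DreamServer might build such a query. The endpoint path and payload fields below are entirely hypothetical (llmfit's real REST schema would need to be confirmed from its docs); this only constructs the request rather than sending it:

```python
import json

def build_request(vram_gb: float, category: str = "general") -> tuple[str, str]:
    """Build a (url, body) pair for a hypothetical llmfit serve endpoint.

    Both the /recommend path and the payload keys are assumptions
    for illustration only.
    """
    url = "http://localhost:8080/recommend"  # hypothetical endpoint
    body = json.dumps({"vram_gb": vram_gb, "category": category})
    return url, body

url, body = build_request(24)
print(url, body)
```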

Key Benefits

  1. Better model selection for edge cases — 24GB VRAM, unusual AMD cards, multi-GPU setups
  2. Dynamic quantization — instead of "you get this one GGUF", pick the best quality that fits
  3. Speed estimates — users see expected token/sec before committing
  4. Automatic model database updates — llmfit scrapes HuggingFace, so new models are picked up without code changes
  5. MoE-aware recommendations — properly handles Mixtral, DeepSeek-V3, Qwen3 MoE models
  6. Reduced maintenance — offload hardware detection and model scoring to a maintained upstream tool

Use Case

DreamServer is an open-source, self-hosted AI server that auto-installs on Linux, macOS, and Windows. During installation, it must detect the user's GPU (NVIDIA/AMD/Apple Silicon), measure available VRAM/RAM, and automatically select the best LLM model and Docker Compose stack — with zero user input.

Today we solve this with a custom pipeline that maps hardware to a fixed tier and picks one hardcoded model per tier. This breaks down for edge cases: unusual VRAM sizes (e.g., 24GB falls awkwardly between tiers), multi-GPU rigs, MoE models that could fit but our tier map doesn't consider, and new models that require manual updates to our tier config.

llmfit solves exactly the problems we hit:

  • Dynamic model selection — instead of maintaining a static tier→model map, llmfit scores 200+ models against actual hardware and picks the best fit with the best quantization.
  • Better hardware coverage — multi-GPU setups, Intel Arc, and Ascend NPUs that we don't currently handle.
  • Speed estimation — our installer has no way to tell users how fast inference will be; llmfit provides bandwidth-based token/sec estimates.
  • MoE-aware fitting — we hardcode tiers for MoE models, but llmfit dynamically calculates expert offloading for any MoE architecture.
  • Reduced maintenance — llmfit's model database is scraped from HuggingFace automatically, eliminating manual tier config updates when new models ship.

Proposed Solution

Integrate llmfit --json as the model recommendation engine in our installer pipeline, called between hardware detection and capability profile generation.

Integration approach

  1. Ship the llmfit binary alongside DreamServer (or install it during setup via cargo install llmfit / prebuilt release binary)
  2. Call llmfit --json --category general during install to get scored model recommendations
  3. Parse the JSON output to extract the top-ranked model, its quantization, and estimated speed
  4. Feed the result into build-capability-profile.sh to generate .capabilities.json
  5. Use the recommended model to configure llama-server in the Docker Compose stack
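Steps 3–4 above could be glued together with a small translation layer that turns a parsed recommendation into the fields build-capability-profile.sh consumes. The capability keys below are illustrative; the real .capabilities.json schema (which this proposal keeps unchanged) is the source of truth:

```python
import json

def to_capabilities(rec: dict) -> str:
    """Render a parsed llmfit recommendation as a .capabilities.json
    fragment. Key names are assumed for illustration; align them with
    the existing schema during integration."""
    caps = {
        "model": rec["model"],
        "quantization": rec["quantization"],
        "estimated_tokens_per_sec": rec.get("tokens_per_sec"),
        "source": "llmfit",
    }
    return json.dumps(caps, indent=2)

print(to_capabilities({"model": "llama-3.1-8b",
                       "quantization": "Q6_K",
                       "tokens_per_sec": 40}))
```

Recording a `"source"` field would also make it easy to tell at a glance whether a profile came from llmfit or from the static-tier fallback.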

What changes in DreamServer

  • resolve_tier_config() in scripts/resolve-tier-config.sh would delegate to llmfit instead of the static tier map
  • classify-hardware.sh could be simplified or removed (llmfit handles classification internally)
  • detect-hardware.sh could remain as a fallback, or be replaced entirely by llmfit's detection
  • config/gpu-database.json would no longer need manual maintenance

What stays the same

  • The capability profile schema (.capabilities.json)
  • Docker Compose overlay selection (amd/nvidia/apple)
  • The installer flow (install-core.sh → preflight → detect → configure → compose up)

Alternatives Considered

  1. Keep extending the current tier map — Add more tiers and models manually. Simple but doesn't scale; every new model or GPU requires code changes. No quantization flexibility or speed estimation.

  2. Build our own model database from HuggingFace — Write a scraper to build hf_models.json ourselves and add scoring logic. Duplicates work that llmfit already does well. Higher maintenance burden.

  3. Use llmfit's model database only (JSON import) — Import data/hf_models.json (206 models) without the binary. Keeps our own detection but gains model coverage. Less integration effort, but we'd still need to build scoring/quantization logic ourselves.

  4. Use llmfit's REST API at runtime — Run llmfit serve as a sidecar service for dynamic model recommendations. More complex operationally, but enables runtime model switching and upgrade suggestions.

We recommend Option A (binary integration via --json) as the best balance of capability, simplicity, and maintenance. llmfit is MIT-licensed (same as DreamServer), actively maintained (61 releases), cross-platform, and already covers all our target platforms.
