
feat: integrate llmfit for better hardware detection and model selection #226

@evereq

Description

Integrate llmfit (MIT-licensed, Rust) into DreamServer's installer pipeline to replace the current hardcoded tier-based model selection with dynamic, hardware-aware model recommendations.

DreamServer currently uses a 3-stage bash/python pipeline (detect-hardware.sh → classify-hardware.sh → build-capability-profile.sh) that maps hardware to fixed tiers (T1–T4, SH_LARGE/SH_COMPACT, AP_BASE/PRO/ULTRA), each with one hardcoded model. This works for common setups but doesn't scale: it can't evaluate quantization trade-offs, doesn't estimate speed, and requires manual updates when new models are released.

llmfit detects hardware (NVIDIA multi-GPU, AMD, Intel Arc, Apple Silicon, Ascend NPU), scores 200+ models across Quality/Speed/Fit/Context dimensions, dynamically selects the best quantization that fits, and estimates token/sec throughput — all in a single cross-platform binary.

| Capability | DreamServer today | llmfit |
| --- | --- | --- |
| GPU detection | NVIDIA, AMD sysfs, Apple | NVIDIA (multi-GPU), AMD, Intel Arc, Apple Silicon, Ascend NPU |
| Model database | ~8 hardcoded models | 206 models from HuggingFace |
| Quantization | Fixed GGUF per tier | Dynamic: picks highest quality that fits (Q8_0 → Q2_K) |
| MoE support | Basic tier split | Full expert offloading with reduced VRAM calculation |
| Speed estimation | None | Bandwidth-based token/sec (~80 GPU profiles) |
| Scoring | VRAM thresholds only | Multi-dimensional with use-case weights |
| Runtime providers | Docker Compose | Ollama, llama.cpp, MLX, Docker Model Runner |

Integration Options

Option A: Use llmfit as a pre-install advisor (recommended)

  • Run llmfit --json during install to get the best model recommendation
  • Replace hardcoded tier map with llmfit's dynamic scoring
  • Single binary (~5MB), cross-platform (Linux, macOS, Windows)
  • Could ship as an optional dependency or vendor the binary in releases
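A minimal sketch of what the advisor step could look like on the DreamServer side. Note the JSON field names below (`recommendations`, `model`, `quantization`, `estimated_tps`) are illustrative assumptions about llmfit's output schema, not its documented format; the sample payload is mocked, whereas in the installer the raw string would come from running `llmfit --json`:

```python
import json

def parse_recommendation(raw: str) -> dict:
    """Extract the top-ranked model from llmfit --json output.

    Field names here are assumed for illustration; verify against
    llmfit's actual JSON schema before integrating.
    """
    data = json.loads(raw)
    top = data["recommendations"][0]  # assume list is sorted best-first
    return {
        "model": top["model"],
        "quantization": top["quantization"],
        "tokens_per_sec": top.get("estimated_tps"),
    }

# Mocked payload standing in for real `llmfit --json` output:
sample = ('{"recommendations": [{"model": "llama-3.1-8b", '
          '"quantization": "Q6_K", "estimated_tps": 42.5}]}')
print(parse_recommendation(sample))
```

Keeping the parsing behind one small function like this also makes it easy to fall back to the current static tier map if the binary is missing.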

Option B: Use llmfit's model database only

  • Import hf_models.json (206 models with parameter counts, quantization sizes, categories)
  • Keep DreamServer's own detection but use richer model data for selection
  • No binary dependency — just a JSON file updated periodically

Option C: Use llmfit's REST API at runtime

  • llmfit serve exposes a REST API for model recommendations
  • Could query it dynamically for upgrade recommendations or model switching
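A sketch of how DreamServer might build such a query. The endpoint path and payload fields below are entirely hypothetical (llmfit's real REST schema would need to be confirmed from its docs); this only constructs the request rather than sending it:

```python
import json

def build_request(vram_gb: float, category: str = "general") -> tuple[str, str]:
    """Build a (url, body) pair for a hypothetical llmfit serve endpoint.

    Both the /recommend path and the payload keys are assumptions
    for illustration only.
    """
    url = "http://localhost:8080/recommend"  # hypothetical endpoint
    body = json.dumps({"vram_gb": vram_gb, "category": category})
    return url, body

url, body = build_request(24)
print(url, body)
```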

Key Benefits

  1. Better model selection for edge cases — 24GB VRAM, unusual AMD cards, multi-GPU setups
  2. Dynamic quantization — instead of "you get this one GGUF", pick the best quality that fits
  3. Speed estimates — users see expected token/sec before committing
  4. Automatic model database updates — llmfit scrapes HuggingFace, so new models are picked up without code changes
  5. MoE-aware recommendations — properly handles Mixtral, DeepSeek-V3, Qwen3 MoE models
  6. Reduced maintenance — offload hardware detection and model scoring to a maintained upstream tool

Use Case

DreamServer is an open-source, self-hosted AI server that auto-installs on Linux, macOS, and Windows. During installation, it must detect the user's GPU (NVIDIA/AMD/Apple Silicon), measure available VRAM/RAM, and automatically select the best LLM model and Docker Compose stack — with zero user input.

Today we solve this with a custom pipeline that maps hardware to a fixed tier and picks one hardcoded model per tier. This breaks down for edge cases: unusual VRAM sizes (e.g., 24GB falls awkwardly between tiers), multi-GPU rigs, MoE models that could fit but our tier map doesn't consider, and new models that require manual updates to our tier config.

llmfit solves exactly the problems we hit:

  • Dynamic model selection — instead of maintaining a static tier→model map, llmfit scores 200+ models against actual hardware and picks the best fit with the best quantization.
  • Better hardware coverage — multi-GPU setups, Intel Arc, and Ascend NPUs that we don't currently handle.
  • Speed estimation — our installer has no way to tell users how fast inference will be; llmfit provides bandwidth-based token/sec estimates.
  • MoE-aware fitting — we hardcode tiers for MoE models, but llmfit dynamically calculates expert offloading for any MoE architecture.
  • Reduced maintenance — llmfit's model database is scraped from HuggingFace automatically, eliminating manual tier config updates when new models ship.

Proposed Solution

Integrate llmfit --json as the model recommendation engine in our installer pipeline, called between hardware detection and capability profile generation.

Integration approach

  1. Ship the llmfit binary alongside DreamServer (or install it during setup via cargo install llmfit / prebuilt release binary)
  2. Call llmfit --json --category general during install to get scored model recommendations
  3. Parse the JSON output to extract the top-ranked model, its quantization, and estimated speed
  4. Feed the result into build-capability-profile.sh to generate .capabilities.json
  5. Use the recommended model to configure llama-server in the Docker Compose stack
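Steps 3–4 above could be glued together with a small translation layer that turns a parsed recommendation into the fields build-capability-profile.sh consumes. The capability keys below are illustrative; the real .capabilities.json schema (which this proposal keeps unchanged) is the source of truth:

```python
import json

def to_capabilities(rec: dict) -> str:
    """Render a parsed llmfit recommendation as a .capabilities.json
    fragment. Key names are assumed for illustration; align them with
    the existing schema during integration."""
    caps = {
        "model": rec["model"],
        "quantization": rec["quantization"],
        "estimated_tokens_per_sec": rec.get("tokens_per_sec"),
        "source": "llmfit",
    }
    return json.dumps(caps, indent=2)

print(to_capabilities({"model": "llama-3.1-8b",
                       "quantization": "Q6_K",
                       "tokens_per_sec": 40}))
```

Recording a `"source"` field would also make it easy to tell at a glance whether a profile came from llmfit or from the static-tier fallback.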

What changes in DreamServer

  • resolve_tier_config() in scripts/resolve-tier-config.sh would delegate to llmfit instead of the static tier map
  • classify-hardware.sh could be simplified or removed (llmfit handles classification internally)
  • detect-hardware.sh could remain as a fallback, or be replaced entirely by llmfit's detection
  • config/gpu-database.json would no longer need manual maintenance

What stays the same

  • The capability profile schema (.capabilities.json)
  • Docker Compose overlay selection (amd/nvidia/apple)
  • The installer flow (install-core.sh → preflight → detect → configure → compose up)

Alternatives Considered

  1. Keep extending the current tier map — Add more tiers and models manually. Simple but doesn't scale; every new model or GPU requires code changes. No quantization flexibility or speed estimation.

  2. Build our own model database from HuggingFace — Write a scraper to build hf_models.json ourselves and add scoring logic. Duplicates work that llmfit already does well. Higher maintenance burden.

  3. Use llmfit's model database only (JSON import) — Import data/hf_models.json (206 models) without the binary. Keeps our own detection but gains model coverage. Less integration effort, but we'd still need to build scoring/quantization logic ourselves.

  4. Use llmfit's REST API at runtime — Run llmfit serve as a sidecar service for dynamic model recommendations. More complex operationally, but enables runtime model switching and upgrade suggestions.

We recommend Option A (binary integration via --json) as the best balance of capability, simplicity, and maintenance. llmfit is MIT-licensed (same as DreamServer), actively maintained (61 releases), cross-platform, and already covers all our target platforms.
