Why
README.md shows a 3-step CLI block (lm-eval → dry-run → publish) but skips the setup questions a real first run raises:
- How do I bring up vllm-mlx + llama-swap? (link to nix-ai)
- What does max_gen_toks=4096 mean vs other gen_kwargs?
- How do I handle slow models / timeouts?
- What's the difference between dry-run and a real publish?
- How do I publish a partial run if my model crashed mid-suite?
Suggested approach
New docs/first-benchmark-run.md that walks through a complete first run end-to-end, with explanation:
- Set up the inference stack (link to nix-ai).
- Verify model availability (curl localhost:11434/v1/models).
- Pick a task and run lm-eval with limit=10 (cheap smoke).
- Inspect results_*.json (what the converter will see).
- Dry-run publish, read the validation output.
- Real publish, then view the shard on the HF dataset.
- Open the HF Space viewer to see it.
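For the "verify model availability" and "inspect results_*.json" steps, the doc could include a small helper along these lines. This is only a sketch under assumptions: it assumes the endpoint returns an OpenAI-style model list (a "data" array of objects with an "id" field) and that the results file has a top-level "results" object keyed by task name. The function names are hypothetical and not part of this repo.

```python
import json
import urllib.request
from pathlib import Path


def list_model_ids(payload: dict) -> list[str]:
    """Return model ids from an OpenAI-style /v1/models response body.

    Assumption: the body looks like {"data": [{"id": "..."}, ...]}.
    """
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(models_url: str = "http://localhost:11434/v1/models") -> list[str]:
    """GET the models endpoint and return the advertised model ids."""
    with urllib.request.urlopen(models_url, timeout=10) as response:
        return list_model_ids(json.load(response))


def summarize_results(path: Path) -> dict[str, dict]:
    """Map each task name to its numeric metrics from a results_*.json file.

    Assumption: the file has a top-level "results" object keyed by task name.
    Non-numeric fields (for example a task alias) are dropped.
    """
    data = json.loads(path.read_text())
    return {
        task: {k: v for k, v in metrics.items() if isinstance(v, (int, float))}
        for task, metrics in data.get("results", {}).items()
    }
```

Calling fetch_model_ids() before the lm-eval run and summarize_results(...) after it gives a new user a quick check that the model name they pass is actually served and that the run produced metrics before they attempt a dry-run publish.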
Add a "Common pitfalls" section: wrong base_url (must end in /v1/chat/completions), unavailable model in catalog, missing HF_TOKEN.
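The three pitfalls listed here could also be illustrated with a preflight check. The following is a minimal sketch, not an existing tool in this repo: the function name, its parameters, and the idea of passing the catalog as a plain list of model ids are all assumptions made for illustration.

```python
import os

# From the pitfall list: base_url must end in this path.
REQUIRED_SUFFIX = "/v1/chat/completions"


def preflight(base_url: str, model: str, catalog: list[str], env=None) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed.

    `catalog` is a hypothetical list of known model ids; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    problems = []
    if not base_url.endswith(REQUIRED_SUFFIX):
        problems.append(f"base_url must end in {REQUIRED_SUFFIX}: got {base_url!r}")
    if model not in catalog:
        problems.append(f"model {model!r} is not in the catalog")
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")
    return problems
```

Returning a list of problems rather than raising on the first one lets a new user see every misconfiguration in a single pass.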
Acceptance
- A new user with a working vllm-mlx setup can complete a real benchmark publish using only this doc.
- Cross-linked from README "Usage" section and from docs.jacobpevans.com/tools/mlx-benchmarks.
Filed as follow-up to the docs(presentation): polish work in #49.