docs: write 'First Benchmark Run' step-by-step walkthrough #51

## Why

`README.md` shows a 3-step CLI block (lm-eval → dry-run → publish) but skips the setup questions a first-time user actually runs into:

- How do I bring up `vllm-mlx` + `llama-swap`? (link to nix-ai)
- What does `max_gen_toks=4096` mean vs other gen_kwargs?
- How do I handle slow models / timeouts?
- What's the difference between dry-run and a real publish?
- How do I publish a partial run if my model crashed mid-suite?
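For the `max_gen_toks` question, a minimal sketch of how the smoke-run command could be assembled. It assumes the lm-eval CLI flags `--model local-chat-completions`, `--model_args`, `--tasks`, `--apply_chat_template`, `--limit`, `--gen_kwargs`, and `--output_path`. The model name, task, and output directory are hypothetical placeholders, and the README's actual command may differ. In lm-eval, `max_gen_toks` caps how many tokens the model may generate per request.

```python
import shlex


def build_lm_eval_argv(model, base_url, task, limit=10,
                       max_gen_toks=4096, output_path="results"):
    """Return an argv list for a small lm-eval smoke run (illustrative only)."""
    # gen_kwargs is a comma-separated key=value string; only max_gen_toks is set here.
    gen_kwargs = f"max_gen_toks={max_gen_toks}"
    return [
        "lm_eval",
        "--model", "local-chat-completions",
        "--model_args", f"model={model},base_url={base_url}",
        "--tasks", task,
        "--apply_chat_template",
        "--limit", str(limit),          # evaluate only the first N examples
        "--gen_kwargs", gen_kwargs,
        "--output_path", output_path,   # where results_*.json is written
    ]


if __name__ == "__main__":
    argv = build_lm_eval_argv(
        model="my-local-model",  # hypothetical model name
        base_url="http://localhost:11434/v1/chat/completions",
        task="gsm8k",            # hypothetical task choice
    )
    # Print a copy-pastable shell command instead of running it.
    print(shlex.join(argv))
```

Printing the command rather than executing it keeps the sketch safe to run without an inference stack.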

## Suggested approach

Add a new `docs/first-benchmark-run.md` that walks through a complete first run end to end and explains each step:

1. Set up the inference stack (link to nix-ai).
2. Verify model availability (`curl localhost:11434/v1/models`).
3. Pick a task and run lm-eval with `limit=10` (a cheap smoke test).
4. Inspect `results_*.json` (what the converter will see).
5. Dry-run publish, read the validation output.
6. Run the real publish, then view the shard in the HF dataset.
7. Open the HF Space viewer to see the published result.
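Steps 2 and 4 could be illustrated in the walkthrough with something like the following sketch. It assumes the server returns the OpenAI-compatible `/v1/models` shape (`{"data": [{"id": ...}]}`) and that `results_*.json` has a top-level `results` object keyed by task name, as lm-eval writes it. The function names are hypothetical and are not part of this repo.

```python
import json
from urllib.request import urlopen


def list_model_ids(payload):
    """Extract model ids from an OpenAI-compatible /v1/models payload."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(models_url="http://localhost:11434/v1/models", timeout=5.0):
    """Query the local server (step 2). Needs the inference stack to be running."""
    with urlopen(models_url, timeout=timeout) as response:
        return list_model_ids(json.load(response))


def summarize_results(results_doc):
    """Keep only the numeric metrics per task from a results_*.json document (step 4)."""
    summary = {}
    for task, metrics in results_doc.get("results", {}).items():
        summary[task] = {
            name: value
            for name, value in metrics.items()
            # Skip non-numeric fields such as aliases or "N/A" stderr placeholders.
            if isinstance(value, (int, float)) and not isinstance(value, bool)
        }
    return summary
```

A reader would call `fetch_model_ids()` before the run, then pass `json.load(open(path))` for the newest results file to `summarize_results` afterwards.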

Add a "Common pitfalls" section covering: a wrong `base_url` (it must end in `/v1/chat/completions`), a model that is unavailable in the catalog, and a missing `HF_TOKEN`.
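The pitfalls above lend themselves to a small preflight check the doc could include. This is a sketch under stated assumptions: `preflight_problems` is a hypothetical helper, and "unavailable in the catalog" is read here as "the model id is not in the list of ids the server reports".

```python
import os

REQUIRED_SUFFIX = "/v1/chat/completions"


def preflight_problems(base_url, model, available_models, env=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    if not base_url.endswith(REQUIRED_SUFFIX):
        problems.append(f"base_url must end in {REQUIRED_SUFFIX}: {base_url}")
    if model not in available_models:
        problems.append(f"model not available: {model}")
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")
    return problems


if __name__ == "__main__":
    issues = preflight_problems(
        base_url="http://localhost:11434/v1",  # deliberately wrong for the demo
        model="my-local-model",                 # hypothetical model name
        available_models=["my-local-model"],
    )
    for issue in issues:
        print("problem:", issue)
```

Returning a list rather than raising on the first failure lets a new user fix every setup issue in one pass.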

## Acceptance

- A new user with a working vllm-mlx setup can complete a real benchmark publish using only this doc.
- The doc is cross-linked from the README "Usage" section and from docs.jacobpevans.com/tools/mlx-benchmarks.

Filed as follow-up to the `docs(presentation):` polish work in #49.
