docs: write 'First Benchmark Run' step-by-step walkthrough #51

@JacobPEvans-personal

Why

README.md shows a 3-step CLI block (lm-eval → dry-run → publish) but skips the setup questions that a real first run raises:

  • How do I bring up vllm-mlx + llama-swap? (link to nix-ai)
  • What does max_gen_toks=4096 mean vs other gen_kwargs?
  • How do I handle slow models / timeouts?
  • What's the difference between dry-run and a real publish?
  • How do I publish a partial run if my model crashed mid-suite?

Suggested approach

Add a new docs/first-benchmark-run.md that walks through a complete first run end to end, explaining each step:

  1. Set up the inference stack (link to nix-ai).
  2. Verify model availability (curl localhost:11434/v1/models).
  3. Pick a task and run lm-eval with limit=10 (a cheap smoke test).
  4. Inspect results_*.json (what the converter will see).
  5. Dry-run publish, read the validation output.
  6. Real publish, then view the shard in the HF dataset.
  7. Open the HF Space viewer to see the published result.
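As a starting point for steps 2 and 3, here is a minimal Python sketch, not the project's actual tooling. It assumes the server returns an OpenAI-style `/v1/models` payload (a `data` list of objects with an `id`), that the chat endpoint lives on the same host and port as the models endpoint, and that lm-eval is driven through its `local-chat-completions` model type. The helper names, the model id `my-model`, and the task `gsm8k` are hypothetical placeholders.

```python
import json
import urllib.request

# Step 2 URL is taken from the issue; CHAT_URL is an assumption (same host/port).
MODELS_URL = "http://localhost:11434/v1/models"
CHAT_URL = "http://localhost:11434/v1/chat/completions"


def model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(url: str = MODELS_URL, timeout: float = 10.0) -> list[str]:
    """Query the running server and return the ids it reports."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return model_ids(json.load(resp))


def smoke_command(
    model: str,
    task: str,
    limit: int = 10,
    max_gen_toks: int = 4096,
    output_path: str = "results",
) -> list[str]:
    """Build an argv list for a cheap lm-eval smoke run (step 3).

    The flags are standard lm-eval CLI options; whether this project's
    README uses exactly this combination is an assumption.
    """
    return [
        "lm-eval",
        "--model", "local-chat-completions",
        "--model_args", f"model={model},base_url={CHAT_URL}",
        "--tasks", task,
        "--limit", str(limit),
        "--gen_kwargs", f"max_gen_toks={max_gen_toks}",
        "--apply_chat_template",
        "--output_path", output_path,
    ]
```

With the server running, `fetch_model_ids()` should list the model you intend to benchmark, and `subprocess.run(smoke_command("my-model", "gsm8k"), check=True)` would start the smoke run. Building the command as a list keeps the `limit` and `max_gen_toks` values visible in one place, which is where the doc can explain what each one does.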

Add a "Common pitfalls" section covering a wrong base_url (it must end in /v1/chat/completions), a model that is unavailable in the catalog, and a missing HF_TOKEN.

Acceptance

  • A new user with a working vllm-mlx setup can complete a real benchmark publish using only this doc.
  • Cross-linked from README "Usage" section and from docs.jacobpevans.com/tools/mlx-benchmarks.

Filed as a follow-up to the docs(presentation): polish work in #49.

Labels: documentation (Improvements or additions to documentation)
