mlx-benchmarks

A reproducible benchmark harness for MLX-quantized and locally-hosted LLMs on Apple Silicon. One envelope schema, one HuggingFace dataset, one interactive viewer — across every upstream evaluation tool.

Why not just use lm-eval directly? You can. This repo wraps lm-eval (and other harnesses) with:

A single versioned result contract (schema.json) so every shard is comparable across tools, models, and dates.
A publish pipeline (mlx-bench-publish) that validates envelopes against the schema and uploads to the HF dataset with content-addressed filenames.
A Gradio viewer (in space/) auto-deployed to an HF Space on every main push.

Read results as a pandas DataFrame with no tooling beyond huggingface_hub + pyarrow.

Architecture

%%{init: {
  'theme':'base',
  'look':'handDrawn',
  'themeVariables':{
    'fontFamily':'Geist',
    'fontSize':'14px',
    'primaryColor':'#102937',
    'primaryTextColor':'#F4EFE6',
    'primaryBorderColor':'#4FB3A9',
    'lineColor':'#4FB3A9',
    'secondaryColor':'#0B1D2A',
    'tertiaryColor':'#1A2A38',
    'clusterBkg':'rgba(79,179,169,0.08)',
    'clusterBorder':'#4FB3A9'
  }
}}%%
flowchart LR
  NixAI([nix-ai env])
  Serve([vllm-mlx + llama-swap])
  Eval([lm-eval · vllm · framework-eval])
  Bench([mlx-benchmarks])
  Dataset[("HF dataset")]
  Viewer([HF Space viewer])

  NixAI --> Serve
  Serve -->|":11434/v1"| Eval
  Eval -->|"results_*.json"| Bench
  Bench -->|"validate + publish"| Dataset
  Dataset --> Viewer

  classDef source fill:#102937,stroke:#E06B4A,stroke-width:2.5px,color:#F4EFE6;
  classDef stack  fill:#102937,stroke:#4FB3A9,stroke-width:2px,color:#F4EFE6;
  classDef core   fill:#102937,stroke:#4FB3A9,stroke-width:3px,color:#F4EFE6;
  classDef sink   fill:#102937,stroke:#F4EFE6,stroke-width:2.5px,color:#F4EFE6;

  class NixAI source
  class Serve,Eval stack
  class Bench core
  class Dataset,Viewer sink

  linkStyle 0,1,2,3,4 stroke:#4FB3A9,stroke-width:2px;

The repo lives at the mlx-benchmarks node: it owns the envelope contract (schema.json), the publisher (mlx-bench-publish), and the converters that fan in from each upstream evaluation tool. Everything to the left of it (model serving, evaluation drivers) and to the right (storage, viewer) is delegated. See docs/architecture.md for the detailed component breakdown, data-flow, and CI diagrams.

Ecosystem

mlx-benchmarks is the instrumentation layer of a larger Apple-Silicon homelab AI stack. The full stack is documented at docs.jacobpevans.com.

Layer	Repo	Provides
Declarative env	`nix-ai`	`vllm-mlx` LaunchAgent, `llama-swap`, MLX module derivations
macOS host	`nix-darwin`	Composes `nix-ai` + `nix-home` into the system flake
Multi-provider gateway	`orbstack-kubernetes`	Bifrost AI gateway, OTEL collection on local K8s
AI tool policy	`ai-assistant-instructions`	Model routing, permissions, MCP server canon

User-visible workflow: darwin-rebuild switch brings up the inference stack via nix-ai → vllm-mlx serves models on :11434 → benchmark drivers in this repo hit that endpoint → mlx-bench-publish validates and uploads results → the HF Space viewer renders them. Nothing in this repo assumes a specific model registry or routing layer; it talks to any OpenAI-compatible endpoint.

Upstream tools wired in

Tool	Suite(s)	Purpose
lm-evaluation-harness	`coding`, `reasoning`	Standard LLM evals (humaneval, mbpp, gsm8k, arc, ...)
vllm `benchmark_serving`	`throughput`	Cross-check throughput against vllm upstream (install with `[vllm]` extra)
OpenAI + Qwen-Agent + smolagents + ADK	`framework-eval`	Per-framework agent harness in `harness/framework-eval/`

Planned but not wired yet: lighteval (broader tasks), MLXBench (native throughput). The configs/LAYOUT.md is the single source of truth for what is currently implemented vs aspirational.

Repository layout

.
├── README.md                 <- this file
├── CLAUDE.md                 <- agent-facing project notes
├── CONTRIBUTING.md           <- dev workflow
├── SECURITY.md               <- HF token handling, unsafe-code warning
├── LICENSE                   <- Apache-2.0
├── schema.json               <- envelope v1 (authoritative)
├── examples/                 <- known-good + known-bad envelope fixtures
├── pyproject.toml            <- package + lint/type/test config
├── src/mlx_benchmarks/       <- Python package (publisher, converters)
│   ├── cli.py                <-   mlx-bench-publish entry point
│   ├── envelope.py           <-   typed envelope + jsonschema validator
│   ├── publish.py            <-   parquet + HF upload (unique filenames)
│   ├── system.py             <-   runtime detection of os/chip/memory/versions
│   ├── logging_config.py     <-   text + JSON-lines logging
│   └── converters/lm_eval.py <-   lm-eval results.json -> envelope
├── tests/                    <- package tests + fixtures
├── configs/                  <- one TOML per (tool, suite) pair
│   ├── LAYOUT.md
│   ├── lm-eval/{coding.toml, reasoning.toml, qwen3-tasks/}
│   └── vllm/benchmark_serving.toml
├── harness/                  <- inline-Python suites (non-TOML)
│   └── framework-eval/       <-   agent framework evaluations
├── scripts/                  <- one-shot tooling (validator, space deploy)
├── space/                    <- Gradio viewer (deployed to HF Space)
│   ├── app.py
│   ├── requirements.txt
│   ├── README.md             <-   HF Spaces front-matter
│   └── tests/
├── docs/                     <- architecture.md, schema.md, faq.md, journal/
└── .github/workflows/        <- ci-gate (test + lint + scan + dry-run-publish
                                  + schema-validate via paths-filter),
                                  release-please, deploy-space

Installation

Requires macOS on Apple Silicon (for inference) and Python 3.11+. A running vllm-mlx OpenAI-compatible inference server on http://localhost:11434/v1 is assumed by the lm-eval configs.

git clone https://github.com/JacobPEvans/mlx-benchmarks.git
cd mlx-benchmarks

# Plain uv (recommended)
uv sync
# ...or plain pip into a venv
python -m venv .venv && source .venv/bin/activate && pip install -e ".[viewer]"

# Token with write scope on the HF dataset, required for publishing
export HF_TOKEN="hf_..."

# Install pre-commit hooks (optional but encouraged)
.venv/bin/pre-commit install

For Nix users: direnv allow activates the included flake.nix dev shell.

Usage

Run a benchmark and publish

# 1. Run lm-eval against your local vllm-mlx endpoint
BASE="http://localhost:11434/v1/chat/completions"
MODEL="mlx-community/Qwen3.5-9B-MLX-4bit"
.venv/bin/lm_eval --model local-chat-completions \
  --model_args "base_url=$BASE,model=$MODEL,max_length=32768,timeout=3600" \
  --tasks gsm8k_cot_zeroshot \
  --batch_size 1 --num_fewshot 0 --limit 10 \
  --gen_kwargs "max_gen_toks=4096" \
  --apply_chat_template --fewshot_as_multiturn --log_samples \
  --output_path ./run-output

# 2. Dry-run conversion (validates envelope against schema, no upload)
.venv/bin/mlx-bench-publish ./run-output/<model-dir>/results_*.json \
  --kind lm-eval --suite reasoning --dry-run

# 3. Publish to the HF dataset
.venv/bin/mlx-bench-publish ./run-output/<model-dir>/results_*.json \
  --kind lm-eval --suite reasoning

Filenames are deterministic (data/run-<timestamp>-<git_sha>-<suite>-<model_slug>.parquet) so historical shards are never overwritten.

View results

Open the live HF Space: https://huggingface.co/spaces/JacobPEvans/mlx-benchmarks-viewer

Or run the viewer locally:

cd space
pip install -r requirements.txt
python app.py

API

The envelope

See schema.json — it is the authoritative, versioned contract backing every published shard. A minimal valid envelope:

{
  "schema_version": "1",
  "timestamp": "2026-04-24T18:30:00Z",
  "git_sha": "aaa3ff3",
  "trigger": "local",
  "suite": "reasoning",
  "model": "mlx-community/Qwen3.5-9B-MLX-4bit",
  "system": {"os": "macOS 26.4.1", "chip": "Apple M4 Max", "memory_gb": 128},
  "results": [
    {"name": "gsm8k_cot_zeroshot", "metric": "exact_match_flexible",
     "value": 0.8, "unit": "ratio"}
  ]
}

Optional v1 fields (non-breaking additions): seed, gen_kwargs, model_revision, quantization, and on the system object: python_version, mlx_version, mlx_lm_version, lm_eval_version, kernel. The CLI auto-detects all of these at publish time — no hand-curation required.

See docs/schema.md for fields, docs/schema-migration.md for version upgrades, and docs/faq.md for ops questions and troubleshooting.

The publisher

from mlx_benchmarks.converters import get_converter
from mlx_benchmarks.converters.base import ConverterContext
from mlx_benchmarks.publish import publish
from mlx_benchmarks.system import detect_system

ctx = ConverterContext(
    suite="reasoning",
    model="mlx-community/Qwen3.5-9B-MLX-4bit",
    git_sha="aaa3ff3",
    system=detect_system(),
)
envelope = get_converter("lm-eval").build_envelope(raw_results, ctx)
publish(envelope, dry_run=False)  # validates + uploads

Reading the dataset

from datasets import load_dataset
ds = load_dataset("JacobPEvans/mlx-benchmarks")
print(ds["train"][0])

Contributing

See CONTRIBUTING.md for the full developer workflow. Keep orchestration glue thin — if integrating a new upstream tool requires more than ~50 lines of Python, re-read the tool's docs before writing code.

Security

HF tokens, the --confirm_run_unsafe_code lm-eval flag, and the disclosure policy are covered in SECURITY.md.

License

Apache 2.0. See LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mlx-benchmarks

Architecture

Ecosystem

Upstream tools wired in

Repository layout

Installation

Usage

Run a benchmark and publish

View results

API

The envelope

The publisher

Reading the dataset

Contributing

Security

License

About

Uh oh!

Releases 4

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
.claude		.claude
.github		.github
configs		configs
docs		docs
examples		examples
harness/framework-eval		harness/framework-eval
scripts		scripts
space		space
src/mlx_benchmarks		src/mlx_benchmarks
tests		tests
.envrc		.envrc
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.release-please-manifest.json		.release-please-manifest.json
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
flake.lock		flake.lock
flake.nix		flake.nix
osv-scanner.toml		osv-scanner.toml
pyproject.toml		pyproject.toml
release-please-config.json		release-please-config.json
renovate.json		renovate.json
schema.json		schema.json
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

mlx-benchmarks

Architecture

Ecosystem

Upstream tools wired in

Repository layout

Installation

Usage

Run a benchmark and publish

View results

API

The envelope

The publisher

Reading the dataset

Contributing

Security

License

About

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages