A reproducible benchmark harness for MLX-quantized and locally-hosted LLMs on Apple Silicon. One envelope schema, one HuggingFace dataset, one interactive viewer — across every upstream evaluation tool.
Why not just use lm-eval directly? You can. This repo wraps lm-eval (and other harnesses) with:
- A single versioned result contract (
schema.json) so every shard is comparable across tools, models, and dates. - A publish pipeline (
mlx-bench-publish) that validates envelopes against the schema and uploads to the HF dataset with content-addressed filenames. - A Gradio viewer (in
space/) auto-deployed to an HF Space on everymainpush.
Read results as a pandas DataFrame with no tooling beyond huggingface_hub +
pyarrow.
%%{init: {
'theme':'base',
'look':'handDrawn',
'themeVariables':{
'fontFamily':'Geist',
'fontSize':'14px',
'primaryColor':'#102937',
'primaryTextColor':'#F4EFE6',
'primaryBorderColor':'#4FB3A9',
'lineColor':'#4FB3A9',
'secondaryColor':'#0B1D2A',
'tertiaryColor':'#1A2A38',
'clusterBkg':'rgba(79,179,169,0.08)',
'clusterBorder':'#4FB3A9'
}
}}%%
flowchart LR
NixAI([nix-ai env])
Serve([vllm-mlx + llama-swap])
Eval([lm-eval · vllm · framework-eval])
Bench([mlx-benchmarks])
Dataset[("HF dataset")]
Viewer([HF Space viewer])
NixAI --> Serve
Serve -->|":11434/v1"| Eval
Eval -->|"results_*.json"| Bench
Bench -->|"validate + publish"| Dataset
Dataset --> Viewer
classDef source fill:#102937,stroke:#E06B4A,stroke-width:2.5px,color:#F4EFE6;
classDef stack fill:#102937,stroke:#4FB3A9,stroke-width:2px,color:#F4EFE6;
classDef core fill:#102937,stroke:#4FB3A9,stroke-width:3px,color:#F4EFE6;
classDef sink fill:#102937,stroke:#F4EFE6,stroke-width:2.5px,color:#F4EFE6;
class NixAI source
class Serve,Eval stack
class Bench core
class Dataset,Viewer sink
linkStyle 0,1,2,3,4 stroke:#4FB3A9,stroke-width:2px;
The repo lives at the mlx-benchmarks node: it owns the envelope contract
(schema.json), the publisher (mlx-bench-publish), and the converters that
fan in from each upstream evaluation tool. Everything to the left of it
(model serving, evaluation drivers) and to the right (storage, viewer) is
delegated. See docs/architecture.md for the
detailed component breakdown, data-flow, and CI diagrams.
mlx-benchmarks is the instrumentation layer of a larger Apple-Silicon
homelab AI stack. The full stack is documented at
docs.jacobpevans.com.
| Layer | Repo | Provides |
|---|---|---|
| Declarative env | nix-ai |
vllm-mlx LaunchAgent, llama-swap, MLX module derivations |
| macOS host | nix-darwin |
Composes nix-ai + nix-home into the system flake |
| Multi-provider gateway | orbstack-kubernetes |
Bifrost AI gateway, OTEL collection on local K8s |
| AI tool policy | ai-assistant-instructions |
Model routing, permissions, MCP server canon |
User-visible workflow: darwin-rebuild switch brings up the inference
stack via nix-ai → vllm-mlx serves models on :11434 → benchmark
drivers in this repo hit that endpoint → mlx-bench-publish validates
and uploads results → the HF Space viewer renders them. Nothing in this
repo assumes a specific model registry or routing layer; it talks to any
OpenAI-compatible endpoint.
| Tool | Suite(s) | Purpose |
|---|---|---|
| lm-evaluation-harness | coding, reasoning |
Standard LLM evals (humaneval, mbpp, gsm8k, arc, ...) |
vllm benchmark_serving |
throughput |
Cross-check throughput against vllm upstream (install with [vllm] extra) |
| OpenAI + Qwen-Agent + smolagents + ADK | framework-eval |
Per-framework agent harness in harness/framework-eval/ |
Planned but not wired yet: lighteval (broader tasks), MLXBench (native throughput).
The configs/LAYOUT.md is the single source of truth for what is currently
implemented vs aspirational.
.
├── README.md <- this file
├── CLAUDE.md <- agent-facing project notes
├── CONTRIBUTING.md <- dev workflow
├── SECURITY.md <- HF token handling, unsafe-code warning
├── LICENSE <- Apache-2.0
├── schema.json <- envelope v1 (authoritative)
├── examples/ <- known-good + known-bad envelope fixtures
├── pyproject.toml <- package + lint/type/test config
├── src/mlx_benchmarks/ <- Python package (publisher, converters)
│ ├── cli.py <- mlx-bench-publish entry point
│ ├── envelope.py <- typed envelope + jsonschema validator
│ ├── publish.py <- parquet + HF upload (unique filenames)
│ ├── system.py <- runtime detection of os/chip/memory/versions
│ ├── logging_config.py <- text + JSON-lines logging
│ └── converters/lm_eval.py <- lm-eval results.json -> envelope
├── tests/ <- package tests + fixtures
├── configs/ <- one TOML per (tool, suite) pair
│ ├── LAYOUT.md
│ ├── lm-eval/{coding.toml, reasoning.toml, qwen3-tasks/}
│ └── vllm/benchmark_serving.toml
├── harness/ <- inline-Python suites (non-TOML)
│ └── framework-eval/ <- agent framework evaluations
├── scripts/ <- one-shot tooling (validator, space deploy)
├── space/ <- Gradio viewer (deployed to HF Space)
│ ├── app.py
│ ├── requirements.txt
│ ├── README.md <- HF Spaces front-matter
│ └── tests/
├── docs/ <- architecture.md, schema.md, faq.md, journal/
└── .github/workflows/ <- ci-gate (test + lint + scan + dry-run-publish
+ schema-validate via paths-filter),
release-please, deploy-space
Requires macOS on Apple Silicon (for inference) and Python 3.11+. A running
vllm-mlx OpenAI-compatible inference server on http://localhost:11434/v1
is assumed by the lm-eval configs.
git clone https://github.com/JacobPEvans/mlx-benchmarks.git
cd mlx-benchmarks
# Plain uv (recommended)
uv sync
# ...or plain pip into a venv
python -m venv .venv && source .venv/bin/activate && pip install -e ".[viewer]"
# Token with write scope on the HF dataset, required for publishing
export HF_TOKEN="hf_..."
# Install pre-commit hooks (optional but encouraged)
.venv/bin/pre-commit installFor Nix users: direnv allow activates the included flake.nix dev shell.
# 1. Run lm-eval against your local vllm-mlx endpoint
BASE="http://localhost:11434/v1/chat/completions"
MODEL="mlx-community/Qwen3.5-9B-MLX-4bit"
.venv/bin/lm_eval --model local-chat-completions \
--model_args "base_url=$BASE,model=$MODEL,max_length=32768,timeout=3600" \
--tasks gsm8k_cot_zeroshot \
--batch_size 1 --num_fewshot 0 --limit 10 \
--gen_kwargs "max_gen_toks=4096" \
--apply_chat_template --fewshot_as_multiturn --log_samples \
--output_path ./run-output
# 2. Dry-run conversion (validates envelope against schema, no upload)
.venv/bin/mlx-bench-publish ./run-output/<model-dir>/results_*.json \
--kind lm-eval --suite reasoning --dry-run
# 3. Publish to the HF dataset
.venv/bin/mlx-bench-publish ./run-output/<model-dir>/results_*.json \
--kind lm-eval --suite reasoningFilenames are deterministic
(data/run-<timestamp>-<git_sha>-<suite>-<model_slug>.parquet)
so historical shards are never overwritten.
Open the live HF Space: https://huggingface.co/spaces/JacobPEvans/mlx-benchmarks-viewer
Or run the viewer locally:
cd space
pip install -r requirements.txt
python app.pySee schema.json — it is the authoritative, versioned contract
backing every published shard. A minimal valid envelope:
{
"schema_version": "1",
"timestamp": "2026-04-24T18:30:00Z",
"git_sha": "aaa3ff3",
"trigger": "local",
"suite": "reasoning",
"model": "mlx-community/Qwen3.5-9B-MLX-4bit",
"system": {"os": "macOS 26.4.1", "chip": "Apple M4 Max", "memory_gb": 128},
"results": [
{"name": "gsm8k_cot_zeroshot", "metric": "exact_match_flexible",
"value": 0.8, "unit": "ratio"}
]
}Optional v1 fields (non-breaking additions): seed, gen_kwargs,
model_revision, quantization, and on the system object:
python_version, mlx_version, mlx_lm_version, lm_eval_version,
kernel. The CLI auto-detects all of these at publish time —
no hand-curation required.
See docs/schema.md for fields,
docs/schema-migration.md for version upgrades,
and docs/faq.md for ops questions and troubleshooting.
from mlx_benchmarks.converters import get_converter
from mlx_benchmarks.converters.base import ConverterContext
from mlx_benchmarks.publish import publish
from mlx_benchmarks.system import detect_system
ctx = ConverterContext(
suite="reasoning",
model="mlx-community/Qwen3.5-9B-MLX-4bit",
git_sha="aaa3ff3",
system=detect_system(),
)
envelope = get_converter("lm-eval").build_envelope(raw_results, ctx)
publish(envelope, dry_run=False) # validates + uploadsfrom datasets import load_dataset
ds = load_dataset("JacobPEvans/mlx-benchmarks")
print(ds["train"][0])See CONTRIBUTING.md for the full developer workflow.
Keep orchestration glue thin — if integrating a new upstream tool requires
more than ~50 lines of Python, re-read the tool's docs before writing code.
HF tokens, the --confirm_run_unsafe_code lm-eval flag, and the disclosure
policy are covered in SECURITY.md.
Apache 2.0. See LICENSE.