Skip to content

dryvist/mlx-benchmarks

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

52 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

mlx-benchmarks

ci-gate Release Please Schema v1 Python 3.11+ License: Apache 2.0 HF Dataset HF Space

A reproducible benchmark harness for MLX-quantized and locally-hosted LLMs on Apple Silicon. One envelope schema, one HuggingFace dataset, one interactive viewer — across every upstream evaluation tool.

Why not just use lm-eval directly? You can. This repo wraps lm-eval (and other harnesses) with:

  • A single versioned result contract (schema.json) so every shard is comparable across tools, models, and dates.
  • A publish pipeline (mlx-bench-publish) that validates envelopes against the schema and uploads to the HF dataset with content-addressed filenames.
  • A Gradio viewer (in space/) auto-deployed to an HF Space on every main push.

Read results as a pandas DataFrame with no tooling beyond huggingface_hub + pyarrow.

Architecture

%%{init: {
  'theme':'base',
  'look':'handDrawn',
  'themeVariables':{
    'fontFamily':'Geist',
    'fontSize':'14px',
    'primaryColor':'#102937',
    'primaryTextColor':'#F4EFE6',
    'primaryBorderColor':'#4FB3A9',
    'lineColor':'#4FB3A9',
    'secondaryColor':'#0B1D2A',
    'tertiaryColor':'#1A2A38',
    'clusterBkg':'rgba(79,179,169,0.08)',
    'clusterBorder':'#4FB3A9'
  }
}}%%
flowchart LR
  NixAI([nix-ai env])
  Serve([vllm-mlx + llama-swap])
  Eval([lm-eval · vllm · framework-eval])
  Bench([mlx-benchmarks])
  Dataset[("HF dataset")]
  Viewer([HF Space viewer])

  NixAI --> Serve
  Serve -->|":11434/v1"| Eval
  Eval -->|"results_*.json"| Bench
  Bench -->|"validate + publish"| Dataset
  Dataset --> Viewer

  classDef source fill:#102937,stroke:#E06B4A,stroke-width:2.5px,color:#F4EFE6;
  classDef stack  fill:#102937,stroke:#4FB3A9,stroke-width:2px,color:#F4EFE6;
  classDef core   fill:#102937,stroke:#4FB3A9,stroke-width:3px,color:#F4EFE6;
  classDef sink   fill:#102937,stroke:#F4EFE6,stroke-width:2.5px,color:#F4EFE6;

  class NixAI source
  class Serve,Eval stack
  class Bench core
  class Dataset,Viewer sink

  linkStyle 0,1,2,3,4 stroke:#4FB3A9,stroke-width:2px;
Loading

The repo lives at the mlx-benchmarks node: it owns the envelope contract (schema.json), the publisher (mlx-bench-publish), and the converters that fan in from each upstream evaluation tool. Everything to the left of it (model serving, evaluation drivers) and to the right (storage, viewer) is delegated. See docs/architecture.md for the detailed component breakdown, data-flow, and CI diagrams.

Ecosystem

mlx-benchmarks is the instrumentation layer of a larger Apple-Silicon homelab AI stack. The full stack is documented at docs.jacobpevans.com.

Layer Repo Provides
Declarative env nix-ai vllm-mlx LaunchAgent, llama-swap, MLX module derivations
macOS host nix-darwin Composes nix-ai + nix-home into the system flake
Multi-provider gateway orbstack-kubernetes Bifrost AI gateway, OTEL collection on local K8s
AI tool policy ai-assistant-instructions Model routing, permissions, MCP server canon

User-visible workflow: darwin-rebuild switch brings up the inference stack via nix-aivllm-mlx serves models on :11434 → benchmark drivers in this repo hit that endpoint → mlx-bench-publish validates and uploads results → the HF Space viewer renders them. Nothing in this repo assumes a specific model registry or routing layer; it talks to any OpenAI-compatible endpoint.

Upstream tools wired in

Tool Suite(s) Purpose
lm-evaluation-harness coding, reasoning Standard LLM evals (humaneval, mbpp, gsm8k, arc, ...)
vllm benchmark_serving throughput Cross-check throughput against vllm upstream (install with [vllm] extra)
OpenAI + Qwen-Agent + smolagents + ADK framework-eval Per-framework agent harness in harness/framework-eval/

Planned but not wired yet: lighteval (broader tasks), MLXBench (native throughput). The configs/LAYOUT.md is the single source of truth for what is currently implemented vs aspirational.

Repository layout

.
├── README.md                 <- this file
├── CLAUDE.md                 <- agent-facing project notes
├── CONTRIBUTING.md           <- dev workflow
├── SECURITY.md               <- HF token handling, unsafe-code warning
├── LICENSE                   <- Apache-2.0
├── schema.json               <- envelope v1 (authoritative)
├── examples/                 <- known-good + known-bad envelope fixtures
├── pyproject.toml            <- package + lint/type/test config
├── src/mlx_benchmarks/       <- Python package (publisher, converters)
│   ├── cli.py                <-   mlx-bench-publish entry point
│   ├── envelope.py           <-   typed envelope + jsonschema validator
│   ├── publish.py            <-   parquet + HF upload (unique filenames)
│   ├── system.py             <-   runtime detection of os/chip/memory/versions
│   ├── logging_config.py     <-   text + JSON-lines logging
│   └── converters/lm_eval.py <-   lm-eval results.json -> envelope
├── tests/                    <- package tests + fixtures
├── configs/                  <- one TOML per (tool, suite) pair
│   ├── LAYOUT.md
│   ├── lm-eval/{coding.toml, reasoning.toml, qwen3-tasks/}
│   └── vllm/benchmark_serving.toml
├── harness/                  <- inline-Python suites (non-TOML)
│   └── framework-eval/       <-   agent framework evaluations
├── scripts/                  <- one-shot tooling (validator, space deploy)
├── space/                    <- Gradio viewer (deployed to HF Space)
│   ├── app.py
│   ├── requirements.txt
│   ├── README.md             <-   HF Spaces front-matter
│   └── tests/
├── docs/                     <- architecture.md, schema.md, faq.md, journal/
└── .github/workflows/        <- ci-gate (test + lint + scan + dry-run-publish
                                  + schema-validate via paths-filter),
                                  release-please, deploy-space

Installation

Requires macOS on Apple Silicon (for inference) and Python 3.11+. A running vllm-mlx OpenAI-compatible inference server on http://localhost:11434/v1 is assumed by the lm-eval configs.

git clone https://github.com/JacobPEvans/mlx-benchmarks.git
cd mlx-benchmarks

# Plain uv (recommended)
uv sync
# ...or plain pip into a venv
python -m venv .venv && source .venv/bin/activate && pip install -e ".[viewer]"

# Token with write scope on the HF dataset, required for publishing
export HF_TOKEN="hf_..."

# Install pre-commit hooks (optional but encouraged)
.venv/bin/pre-commit install

For Nix users: direnv allow activates the included flake.nix dev shell.

Usage

Run a benchmark and publish

# 1. Run lm-eval against your local vllm-mlx endpoint
BASE="http://localhost:11434/v1/chat/completions"
MODEL="mlx-community/Qwen3.5-9B-MLX-4bit"
.venv/bin/lm_eval --model local-chat-completions \
  --model_args "base_url=$BASE,model=$MODEL,max_length=32768,timeout=3600" \
  --tasks gsm8k_cot_zeroshot \
  --batch_size 1 --num_fewshot 0 --limit 10 \
  --gen_kwargs "max_gen_toks=4096" \
  --apply_chat_template --fewshot_as_multiturn --log_samples \
  --output_path ./run-output

# 2. Dry-run conversion (validates envelope against schema, no upload)
.venv/bin/mlx-bench-publish ./run-output/<model-dir>/results_*.json \
  --kind lm-eval --suite reasoning --dry-run

# 3. Publish to the HF dataset
.venv/bin/mlx-bench-publish ./run-output/<model-dir>/results_*.json \
  --kind lm-eval --suite reasoning

Filenames are deterministic (data/run-<timestamp>-<git_sha>-<suite>-<model_slug>.parquet) so historical shards are never overwritten.

View results

Open the live HF Space: https://huggingface.co/spaces/JacobPEvans/mlx-benchmarks-viewer

Or run the viewer locally:

cd space
pip install -r requirements.txt
python app.py

API

The envelope

See schema.json — it is the authoritative, versioned contract backing every published shard. A minimal valid envelope:

{
  "schema_version": "1",
  "timestamp": "2026-04-24T18:30:00Z",
  "git_sha": "aaa3ff3",
  "trigger": "local",
  "suite": "reasoning",
  "model": "mlx-community/Qwen3.5-9B-MLX-4bit",
  "system": {"os": "macOS 26.4.1", "chip": "Apple M4 Max", "memory_gb": 128},
  "results": [
    {"name": "gsm8k_cot_zeroshot", "metric": "exact_match_flexible",
     "value": 0.8, "unit": "ratio"}
  ]
}

Optional v1 fields (non-breaking additions): seed, gen_kwargs, model_revision, quantization, and on the system object: python_version, mlx_version, mlx_lm_version, lm_eval_version, kernel. The CLI auto-detects all of these at publish time — no hand-curation required.

See docs/schema.md for fields, docs/schema-migration.md for version upgrades, and docs/faq.md for ops questions and troubleshooting.

The publisher

from mlx_benchmarks.converters import get_converter
from mlx_benchmarks.converters.base import ConverterContext
from mlx_benchmarks.publish import publish
from mlx_benchmarks.system import detect_system

ctx = ConverterContext(
    suite="reasoning",
    model="mlx-community/Qwen3.5-9B-MLX-4bit",
    git_sha="aaa3ff3",
    system=detect_system(),
)
envelope = get_converter("lm-eval").build_envelope(raw_results, ctx)
publish(envelope, dry_run=False)  # validates + uploads

Reading the dataset

from datasets import load_dataset
ds = load_dataset("JacobPEvans/mlx-benchmarks")
print(ds["train"][0])

Contributing

See CONTRIBUTING.md for the full developer workflow. Keep orchestration glue thin — if integrating a new upstream tool requires more than ~50 lines of Python, re-read the tool's docs before writing code.

Security

HF tokens, the --confirm_run_unsafe_code lm-eval flag, and the disclosure policy are covered in SECURITY.md.

License

Apache 2.0. See LICENSE.

About

Benchmark harness for MLX and local LLMs on Apple Silicon (results: hf.co/datasets/JacobPEvans/mlx-benchmarks)

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors