feat: production polish — package layout, CI, viewer, docs by JacobPEvans-personal · Pull Request #15 · dryvist/mlx-benchmarks

JacobPEvans-personal · 2026-04-24T16:03:26Z

Summary

Take mlx-benchmarks from early-stage scaffolding to a polished,
senior-reviewer-ready project in four logical commits.

- Publisher — migrate scripts/publish_run.py into a proper
  src/mlx_benchmarks/ package (typed envelope, runtime system
  detection, converter protocol, strict jsonschema validation before
  every publish, structured JSON-lines logging). Expose as
  mlx-bench-publish. Keep the old script as a back-compat shim.
- Schema v1.1 — add $id, optional reproducibility fields
  (python_version, mlx_version, mlx_lm_version, lm_eval_version,
  kernel, seed, gen_kwargs, model_revision, quantization,
  duration_seconds), and ship canonical valid + invalid envelope
  examples. Non-breaking.
- Quality gates — ruff + mypy strict + pytest + pre-commit.
  18-test pytest suite with schema round-trip, publisher dry-run,
  runtime system smoke. Zero findings, zero warnings.
- Viewer as a first-class artifact — move app.py to space/ with
  HF Spaces front-matter, add viewer chart tests (5 cases) covering
  empty-data path, bar/trend/pivot rendering.
- CI — six new workflows (test, dry-run-publish, codeql,
  dependency-review, deploy-space, release-please) + rewrite of
  validate-schema to call committed script instead of heredoc. All
  user-controllable inputs flow through env: blocks; trusted-org
  actions pinned to major-version tags.
- Docs — full README rewrite with hero + real badges + correct
  quickstart, plus CONTRIBUTING, SECURITY, architecture, schema prose,
  CODEOWNERS, .editorconfig, bug/benchmark-request issue templates, PR
  template. Session notes moved to docs/journal/.

Out of scope (follow-ups)

- Converter coverage beyond lm-eval (vllm throughput, framework-eval)
  — architecture is in place (get_converter registry), implementation
  deferred.
- Viewer bells-and-whistles (heatmap, leaderboard, sample drill-down)
  — the current three tabs ship first; enhancements deferred.
- F8-sweep historical rescue — already published as
  data/run-rescue-*.parquet shards; verified in Phase 0.
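The `get_converter` registry mentioned above could look roughly like the following. This is a minimal sketch, not the package's code: the `Converter` protocol shape, the `_REGISTRY` name, and the stand-in `LmEvalConverter` body are assumptions for illustration. Only the idea of a kind-to-converter lookup with an explicit unknown-kind error comes from this PR.

```python
from typing import Any, Protocol


class Converter(Protocol):
    """Anything that turns a raw results payload into an envelope dict."""

    def convert(self, raw: dict[str, Any]) -> dict[str, Any]: ...


class LmEvalConverter:
    """Stand-in for the real lm-eval converter; the output shape is illustrative."""

    def convert(self, raw: dict[str, Any]) -> dict[str, Any]:
        return {"kind": "lm-eval", "results": raw.get("results") or {}}


# Hypothetical registry: kind string -> converter class.
_REGISTRY: dict[str, type] = {"lm-eval": LmEvalConverter}


def get_converter(kind: str) -> Converter:
    """Return a converter for `kind`, or fail loudly with the known kinds."""
    try:
        factory = _REGISTRY[kind]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(
            f"unknown converter kind {kind!r}; known kinds: {known}"
        ) from None
    return factory()
```

Under this layout, adding a deferred converter (for example a throughput one) would mean registering one more class, without touching the callers.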

Test plan

- `.venv/bin/ruff check .` — all checks passed
- `.venv/bin/ruff format --check .` — 25 files formatted
- `.venv/bin/mypy src/mlx_benchmarks` — strict, 0 errors
- `.venv/bin/pytest tests space/tests` — 23 pass
- `.venv/bin/python scripts/validate_schema.py` — schema + 3 TOMLs OK
- `mlx-bench-publish tests/fixtures/lm_eval_results_sample.json --kind lm-eval --suite reasoning --dry-run` — green
- Live MLX smoke — BLOCKED on wedged llama-swap backend (see below)
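For readers unfamiliar with the CLI shape used in the dry-run command above, here is a hedged argparse sketch. The flags `--kind`, `--suite`, `--dry-run`, and `--log-level` and the `PublishError` name appear in this PR. The positional name `results`, which flags are required, the `publish` callback parameter, and the exit code are assumptions made for illustration.

```python
import argparse
import sys
from typing import Callable, Optional


class PublishError(Exception):
    """Single error type for conversion, Parquet, and upload failures."""


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mlx-bench-publish")
    parser.add_argument("results", help="path to the raw results JSON")
    parser.add_argument("--kind", required=True, help="converter kind, e.g. lm-eval")
    parser.add_argument("--suite", required=True, help="suite label, e.g. reasoning")
    parser.add_argument("--dry-run", action="store_true", help="validate, do not upload")
    # Explicit choices: a typo becomes a usage error instead of a traceback.
    parser.add_argument(
        "--log-level",
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
    )
    return parser


def main(
    argv: Optional[list[str]] = None,
    publish: Optional[Callable[[argparse.Namespace], None]] = None,
) -> int:
    """Parse arguments, run the (injected) publish step, map errors to an exit code."""
    args = build_parser().parse_args(argv)
    try:
        if publish is not None:
            publish(args)
    except PublishError as exc:
        # One clean exit path for every publish-layer failure.
        print(f"error: {exc}", file=sys.stderr)
        return 1
    return 0
```

Injecting `publish` as a callable is only a convenience for this sketch; it keeps the example runnable without the Hugging Face client.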

Live-smoke status

Pipeline fully validated via fixture round-trip (the full code path a live
benchmark would take: raw JSON → LmEvalConverter → typed envelope →
runtime system detection → schema validation → Parquet → deterministic
HF filename).
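The last step of that path, the deterministic filename, can be sketched roughly as follows. This is a minimal sketch, not the code in publish.py: the `data/run-...` layout and the exact `slugify` rules are assumptions for illustration. Only the idea of an 8-character sha256 suffix derived from the Parquet payload (added by a later commit in this PR, and skipped when no payload is given) comes from the PR text.

```python
import hashlib
import re
from typing import Optional


def slugify(text: str) -> str:
    """Lowercase and collapse every non-alphanumeric run into a single hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def target_path(model: str, timestamp: str, payload: Optional[bytes] = None) -> str:
    """Build a shard path that is stable for identical inputs.

    The directory and naming scheme here are hypothetical.
    """
    stem = f"data/run-{slugify(model)}-{slugify(timestamp)}"
    if payload is not None:
        # Two runs in the same second with different bytes get different names.
        stem += "-" + hashlib.sha256(payload).hexdigest()[:8]
    return stem + ".parquet"
```

Because the suffix depends only on the payload bytes, re-publishing the same shard yields the same path, while a different payload produced in the same second cannot silently overwrite it.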

Live MLX smoke could not complete this session: llama-swap on :11434 is
wedged after the pinned default Qwen3.5-27B-4bit got SIGKILL'd. The
vllm-mlx log shows 30+ minutes of 502s on POST /v1/chat/completions.
Any lm_eval run hangs until the service is restarted
(launchctl kickstart -k gui/$(id -u)/dev.vllm-mlx.server). I chose not
to take that shared-system action without approval.

Once the backend is back, the 3-model sweep drafted in
~/.claude/plans/i-never-know-vivid-sonnet.md Phase 8 is ready to run
as-is — pointing at Qwen3.5-9B-MLX-4bit, gemma-4-e4b-it-4bit, and
DeepSeek-R1-0528-Qwen3-8B-4bit for gsm8k_cot_zeroshot --limit 3.

Migration notes

scripts/publish_run.py keeps its old CLI shape via shim; no
breaking change for existing runbooks.
uv sync picks up the new required deps (jsonschema, psutil,
huggingface-hub, pyarrow) automatically. pyproject.toml no
longer declares [tool.uv] package = false.
Schema additions are all optional → existing consumers unaffected.

Release-please

Merge will bump to 0.3.0 (minor; feat + breaking-package-layout is
arguably major, but the public contract holds — published envelopes
remain valid, old CLI invocation still works via shim).

- Migrate scripts/publish_run.py into src/mlx_benchmarks/ package with typed envelope, runtime system detection, converter protocol, strict jsonschema validation before every publish, and structured logging.
- Add mlx-bench-publish CLI exposed via project.scripts. Keep the original scripts/publish_run.py as a thin back-compat shim.
- Extend schema.json with $id, optional reproducibility fields (python_version, mlx_version, mlx_lm_version, lm_eval_version, kernel, seed, gen_kwargs, model_revision, quantization, duration_seconds) and ship canonical valid + invalid envelope examples.
- Introduce ruff + mypy (strict) + pytest + pre-commit quality gates. First run is green across 18 pytest cases, 0 ruff findings, 0 mypy errors.
- Replace hardcoded SYSTEM dict with detect_system() so published envelopes reflect whoever ran the benchmark, not one specific laptop.
- Fix incidental ruff findings in harness/framework-eval/* and configs/lm-eval/qwen3-tasks/utils.py without silencing them.
- Add scripts/validate_schema.py runnable locally (replaces heredoc embedded in validate-schema.yml).
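The structured logging mentioned in this commit is described elsewhere in the PR as JSON-lines output. A formatter along the following lines would produce it. This is a sketch using only the standard library; the class name, the `configure_logging` signature, and the field names (`ts`, `level`, `logger`, `message`) are assumptions, not the contents of logging_config.py.

```python
import json
import logging
from datetime import datetime, timezone


class JsonLinesFormatter(logging.Formatter):
    """Render each log record as one JSON object on a single line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        # json.dumps escapes embedded newlines, so the output stays one line.
        return json.dumps(payload, ensure_ascii=False)


def configure_logging(level: str = "INFO", json_lines: bool = False) -> None:
    """Install a single console handler, optionally with JSON-lines output."""
    handler = logging.StreamHandler()
    if json_lines:
        handler.setFormatter(JsonLinesFormatter())
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level)
```

One object per line keeps the output easy to filter with line-oriented tools while remaining parseable by any JSON reader.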

- Add space/README.md with proper HF Spaces front-matter (sdk: gradio, app_file: app.py, pinned, tags, linked dataset) plus Installation and Usage sections satisfying the readme validator.
- Add space/tests/test_charts.py covering bar/trend/table builders against a small in-memory DataFrame, plus an empty-data path. Guarantees the viewer cannot ship with a trivially-broken chart.
- Pair with the upcoming deploy-space.yml workflow that auto-syncs this directory to huggingface.co/spaces/JacobPEvans/mlx-benchmarks-viewer on every main push.

…e, release-please

- test.yml: 3.11 + 3.12 matrix running ruff + ruff-format + mypy strict + pytest (package + viewer). Concurrency-cancelled per ref.
- dry-run-publish.yml: round-trip the canonical lm-eval fixture through mlx-bench-publish on every PR that touches the publisher. Guards the envelope contract at PR time, not just locally.
- validate-schema.yml rewritten to call the committed scripts/validate_schema.py instead of a heredoc Python snippet. Same behavior, reviewable locally.
- codeql.yml: Python SAST on PR + weekly schedule.
- dependency-review.yml: fail PRs introducing high-severity advisories.
- deploy-space.yml: on main pushes touching space/**, sync to HF Space JacobPEvans/mlx-benchmarks-viewer via scripts/deploy_space.py (no inline Python). Uses HF_TOKEN from the huggingface-space environment.
- release-please.yml + config + manifest: conventional-commits driven Python releases. feat=minor, fix=patch; manual major bumps only.
- All workflow inputs that would ordinarily carry user-controllable data are piped through env: blocks, not interpolated into run: strings.

…ocs + templates

- README: new hero, badges (test, validate-schema, CodeQL, HF dataset, HF Space, license, Python), accurate quickstart using mlx-bench-publish, real repo tree, differentiator paragraph ("why not just lm-eval?").
- CLAUDE.md: agent-facing summary aligned with README; zero drift.
- Add CONTRIBUTING.md (dev setup, quality-gate checklist, converter guide, what-not-to-add).
- Add SECURITY.md (HF token handling, `--confirm_run_unsafe_code` disclosure, third-party action pinning policy, non-vuln clarifications).
- Add CODEOWNERS, .editorconfig, .github/PULL_REQUEST_TEMPLATE.md, and two issue templates (bug, benchmark/converter request) with a contact_links config pointing to the viewer + discussions.
- Add docs/architecture.md (ASCII diagram + per-component notes) and docs/schema.md (prose walk-through of the authoritative contract).
- Update configs/LAYOUT.md to match what actually ships and mark lighteval/mlxbench/extra lm-eval suites as planned, not present.
- Update harness/framework-eval/README.md to reference the benchmark-request issue template instead of claiming a "follow-up PR" tracked nowhere.

Copilot

Pull request overview

This PR polishes mlx-benchmarks into a production-ready project by packaging the publisher as mlx_benchmarks, extending/validating the envelope schema, adding a tested Gradio viewer artifact, and introducing CI + repo hygiene docs/templates.

Changes:

Introduces a typed mlx_benchmarks package with converter(s), system detection, schema validation, CLI entrypoint, and HF publish path.
Extends schema.json (non-breaking optional fields) and ships canonical valid/invalid envelope examples + schema tests.
Adds CI workflows + pre-commit/mypy/ruff/pytest gates and promotes the dataset viewer under space/ with tests and deploy tooling.

Reviewed changes

Copilot reviewed 55 out of 58 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
tests/test_system.py	Adds smoke tests for runtime system detection shape/content.
tests/test_schema.py	Adds Draft-07 schema self-validation + example fixture validation tests.
tests/test_publish.py	Adds unit tests for publish helpers (slugify/path/Parquet/dry-run).
tests/test_lm_eval_converter.py	Adds end-to-end lm-eval fixture → envelope → schema validation test.
tests/test_cli.py	Adds CLI argparse/dispatch smoke tests (dry-run, tag validation, JSON errors).
tests/fixtures/lm_eval_results_sample.json	Adds canonical lm-eval raw fixture used by converter/CLI tests.
tests/conftest.py	Adds shared fixtures for lm-eval + valid/invalid envelope examples.
src/mlx_benchmarks/system.py	Implements best-effort system metadata detection (os/chip/memory/versions).
src/mlx_benchmarks/publish.py	Implements envelope→rows→Parquet and HF dataset publish with deterministic shard paths.
src/mlx_benchmarks/logging_config.py	Adds console logging configuration with optional JSON-lines formatter.
src/mlx_benchmarks/envelope.py	Adds typed envelope definitions + cached jsonschema validator and schema lookup.
src/mlx_benchmarks/converters/lm_eval.py	Adds lm-eval results.json → envelope converter.
src/mlx_benchmarks/converters/base.py	Defines converter protocol + `ConverterContext` dataclass.
src/mlx_benchmarks/converters/__init__.py	Adds converter registry + `get_converter()` with explicit unknown-kind error.
src/mlx_benchmarks/cli.py	Adds `mlx-bench-publish` CLI for conversion + validation + publish/dry-run.
src/mlx_benchmarks/__init__.py	Exposes public API symbols and package version.
space/tests/test_charts.py	Adds Plotly chart-builder tests and empty-data behavior tests.
space/requirements.txt	Declares Space runtime dependencies for viewer deployment.
space/app.py	Viewer polish/formatting adjustments and small refactors.
space/README.md	Adds HF Spaces front-matter + viewer usage/deploy documentation.
scripts/validate_schema.py	Adds standalone schema + TOML validation script for CI/local use.
scripts/publish_run.py	Replaces legacy script with back-compat shim to the new CLI.
scripts/deploy_space.py	Adds script to atomically sync `space/` contents to HF Space.
schema.json	Adds `$id`, optional reproducibility fields, and embedded examples.
pyproject.toml	Converts repo into an installable package, adds scripts/extras, and configures ruff/mypy/pytest/coverage.
harness/framework-eval/eval_smolagents.py	Minor hardening: switch to `Path(...).open()` for fixture reads.
harness/framework-eval/eval_qwen_agent.py	Minor hardening/typing: `Path(...).open()`, `ClassVar`, kwargs naming, formatting.
harness/framework-eval/eval_openai_tool_calling.py	Minor hardening: switch to `Path(...).open()` for fixture reads.
harness/framework-eval/eval_google_adk.py	Minor hardening: switch to `Path(...).open()` for fixture reads.
harness/framework-eval/README.md	Updates status text and links to follow-up benchmark request template.
examples/envelope.valid.json	Adds canonical valid envelope example fixture.
examples/envelope.invalid.json	Adds canonical invalid envelope example fixture.
docs/schema.md	Adds prose schema walkthrough and versioning guidance.
docs/journal/2026-04-19-qwen36-benchmark-session.md	Adds archived benchmark session notes.
docs/journal/2026-04-10-bifrost-benchmark-session.md	Adds archived benchmark session notes.
docs/architecture.md	Adds architecture diagram and component/CI overview.
configs/lm-eval/qwen3-tasks/utils.py	Improves task utils exports and validation (e.g., strict zip, parity checks).
configs/LAYOUT.md	Updates current vs planned config layout and links to request template.
SECURITY.md	Adds security policy, token handling guidance, and unsafe-code warnings.
README.md	Major rewrite: badges, quickstart, layout, schema/publisher usage, viewer instructions.
CONTRIBUTING.md	Adds developer workflow, quality gates, and contribution conventions.
CODEOWNERS	Adds repo-wide code owner.
CLAUDE.md	Updates agent-facing conventions and common tasks.
.release-please-manifest.json	Adds release-please manifest with current version.
.release-please-config.json	Adds release-please configuration for Python releases/changelog sections.
.pre-commit-config.yaml	Adds pre-commit hooks for formatting/lint/type/actionlint.
.github/workflows/validate-schema.yml	Reworks schema validation workflow to call committed validator script.
.github/workflows/test.yml	Adds CI workflow for ruff/mypy/pytest across Python 3.11/3.12.
.github/workflows/release-please.yml	Adds automated release PR workflow via release-please.
.github/workflows/dry-run-publish.yml	Adds CI workflow to dry-run publisher on canonical fixture.
.github/workflows/deploy-space.yml	Adds workflow to deploy `space/` to HF Space using `scripts/deploy_space.py`.
.github/workflows/dependency-review.yml	Adds dependency review workflow gate.
.github/workflows/codeql.yml	Adds CodeQL workflow for Python security-and-quality queries.
.github/PULL_REQUEST_TEMPLATE.md	Adds PR template with required local test checklist.
.github/ISSUE_TEMPLATE/config.yml	Adds issue template config and contact links.
.github/ISSUE_TEMPLATE/bug.yml	Adds structured bug report template.
.github/ISSUE_TEMPLATE/benchmark-request.yml	Adds structured benchmark/converter request template.
.editorconfig	Adds repository-wide editor defaults.

gemini-code-assist

Code Review

This pull request introduces a comprehensive project structure for the mlx-benchmarks harness, including configuration files, issue templates, a pre-commit configuration, and documentation. The changes establish a standardized envelope schema for benchmark results and a publisher CLI to upload these results to a HuggingFace dataset. I have identified a few issues: the pre-commit configuration uses invalid revision tags for hooks, there is a typo in the repository URL within the contribution guide, and the security policy references a non-existent configuration file.

- test.yml + dry-run-publish.yml: pin astral-sh/setup-uv@v8.1.0 (v8 major tag does not exist; CI was failing to resolve).
- Drop .github/workflows/codeql.yml — the repo has GitHub's default CodeQL setup enabled, which conflicts with advanced configurations ("CodeQL analyses from advanced configurations cannot be processed when the default setup is enabled"). Default setup already covers Python + actions scanning.
- CONTRIBUTING.md: fix typo meh-benchmarks.git -> mlx-benchmarks.git and simplify to a single-line clone command. (reported by Copilot, Gemini)
- CLAUDE.md: scope the "no uv run / uvx" convention to the main publisher workflow; acknowledge that harness/framework-eval uses `uv run --with ...` because each script carries PEP 723 inline metadata. (reported by Copilot)
- src/mlx_benchmarks/system.py: docstring now accurately describes fallback behavior (schema-required fields always populated via "unknown"/0 fallback; optional fields omitted when detection fails). (reported by Copilot)
- space/tests/test_charts.py: replace `import app # noqa: E402` with `importlib.import_module("app")` after sys.path insert, so ruff stays clean without suppressions. (reported by Copilot)

JacobPEvans-personal

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

If this code review was useful, please react with 👍. Otherwise, react with 👎.

…hema format checker

Addresses blockers and high-priority findings from the senior-review pass that the automated reviewers missed.

BLOCKERS

- pyproject.toml: schema.json was shipped via [shared-data] which lands it outside the package dir. Installed wheel -> FileNotFoundError on every publish. Switched to [force-include] so it sits at `site-packages/mlx_benchmarks/schema.json`, where importlib.resources resolves it. Verified by building the wheel + pip-installing into a clean venv + running mlx-bench-publish --dry-run.
- scripts/deploy_space.py: SKIP_PARTS now also excludes `tests` so the HF Space sync does not accidentally ship pytest files into the Space repo.

HIGH

- envelope.py: Draft7Validator was constructed without a format_checker, meaning `format: date-time` was purely decorative — any string passed validation. Now uses Draft7Validator.FORMAT_CHECKER. `jsonschema` dep upgraded to `jsonschema[format]>=4.23.0` so rfc3339-validator is available. New test: bad timestamp on an otherwise-valid envelope fails with error path `[timestamp]`.
- publish.py: target_path now appends an 8-char sha256 prefix of the parquet payload (backwards-compatible when payload=None). Prevents silent overwrite when two runs in the same second produce different bytes. Policy is documented in-code.
- publish.py: introduce PublishError. rows_to_parquet and HF upload both raise it, and CLI catches PublishError instead of RuntimeError so HfHubHTTPError propagates through a single clean exit path (no more tracebacks on auth / rate-limit / network errors).
- converters/lm_eval.py: removed the `group_subtasks` duration fallback — that field maps group names to subtask name lists, not durations. Zero durations are now preserved (explicit `is None` check instead of `or`). All .get-chains are null-safe so an lm-eval output with an explicit `null` config.gen_kwargs no longer AttributeErrors.
- converters/lm_eval.py: _extract_timestamp validates ISO-8601 via regex before passing strings through; malformed timestamps log a warning and fall back to UTC now.
- .release-please-config.json + extra-files: add x-release-please-version anchor comments in src/mlx_benchmarks/__init__.py AND pyproject.toml so release-please actually bumps the version instead of skipping.
- docs/architecture.md, README.md, CLAUDE.md: remove references to the now-deleted codeql.yml workflow (the broken CodeQL badge on the README is replaced; the architecture doc notes CodeQL comes from the repo's default setup).

MEDIUM

- cli.py: --log-level now has explicit choices, typo no longer crashes with a traceback.
- test.yml: cancel-in-progress only on PRs so back-to-back main pushes keep their green signals.
- validate-schema.yml: push path filter mirrors PR filter so TOML changes merged to main still get validated.
- validate-schema.yml: installs `jsonschema[format]` so format-checker assertions the validator will soon rely on are available there too.
- dry-run-publish.yml: now builds the wheel and installs into a clean venv before running the CLI — catches packaging-layer bugs that editable installs hide (the fix in this very commit would not have been caught by the previous in-place variant).
- dry-run-publish.yml: concurrency group + cancel-in-progress added.

TESTS

- New test_format_checker_rejects_non_iso_timestamp covers the format checker gap.
- New test_target_path_includes_payload_hash asserts the collision-proof suffix.
- test_rows_to_parquet_rejects_empty and test_publish_skipping_validation_still_rejects_empty now expect PublishError (tighter than plain ValueError).

…--python

Consolidate CI to follow the same conventions as nix-darwin, nix-ai, nix-home, terraform-proxmox, ansible-proxmox, and ai-workflows.

Workflow consolidation:

- Replace 6 standalone workflows (test, dependency-review, dry-run-publish, validate-schema, plus existing release-please and deploy-space) with a single ci-gate.yml orchestrator that uses dorny/paths-filter for change detection and conditionally calls central reusable workflows from JacobPEvans/.github:
  * _python-security.yml — pip-audit on resolved deps
  * _osv-scan.yml — multi-ecosystem OSV lockfile scan
  * _markdown-lint.yml — markdownlint on .md files
  * _file-size.yml — repo-wide size guard
- Inline jobs (python-test matrix, schema-validate, dry-run-publish) participate in the same merge gate via re-actors/alls-green.
- Convert release-please.yml to a thin wrapper around _release-please.yml@main with GH_ACTION_JACOBPEVANS_APP_ID + GH_APP_PRIVATE_KEY so release PRs trigger downstream pull_request workflows (GITHUB_TOKEN does not).
- Drop dependency-review.yml outright; OSV + pip-audit cover the same surface (full lockfile, multi-ecosystem) more comprehensively than the PR-diff-only GHSA scan.

Repo hygiene cleanup:

- Delete inherited templates (.github/ISSUE_TEMPLATE/bug.yml, config.yml, PULL_REQUEST_TEMPLATE.md) — they fall back to the JacobPEvans/.github community-health-files inheritance.
- Delete CODEOWNERS and .editorconfig — absent from every other surveyed repo.
- Update benchmark-request.yml to use canonical org label (type:feature) and add priority:* / size:* dropdowns so auto-label-issues.yml can extract them.

Config alignments:

- Rename .release-please-config.json -> release-please-config.json (no leading dot — the central reusable expects this filename).
- Extend renovate.json with local>JacobPEvans/.github:renovate-presets to inherit the trusted-org allow-list, automerge rules, and Nix/uv custom managers.
- Pre-commit check-yaml now uses --unsafe to accept lm-eval task files with custom !function tags (still validates YAML syntax).

Docs:

- README badges, README repo-shape diagram, docs/architecture.md CI table, and CLAUDE.md repo-shape block all updated to match.

(claude)

JacobPEvans-personal added 4 commits April 24, 2026 11:45

Copilot AI review requested due to automatic review settings April 24, 2026 16:03

Copilot started reviewing on behalf of JacobPEvans-personal April 24, 2026 16:04

Copilot AI reviewed Apr 24, 2026

Comment thread CONTRIBUTING.md Outdated

Comment thread src/mlx_benchmarks/system.py Outdated

Comment thread CLAUDE.md Outdated

Comment thread space/tests/test_charts.py Outdated

gemini-code-assist Bot reviewed Apr 24, 2026

Comment thread .pre-commit-config.yaml

Comment thread .pre-commit-config.yaml

Comment thread CONTRIBUTING.md Outdated

Comment thread SECURITY.md

JacobPEvans-personal commented Apr 24, 2026

JacobPEvans-personal added 3 commits April 24, 2026 15:50

fix(ci): dry-run-publish — uv venv has no pip by default, use uv pip …

4e49a94

…--python

JacobPEvans-personal mentioned this pull request Apr 25, 2026

feat(renovate): auto-merge pip_requirements + onboarding docs JacobPEvans-personal/.github#228

Merged

4 tasks

JacobPEvans-personal merged commit 0876689 into main Apr 25, 2026
9 of 13 checks passed

JacobPEvans-personal deleted the feat/production-polish branch April 25, 2026 03:00

JacobPEvans-personal mentioned this pull request Apr 25, 2026

fix: pin pillow + orjson in space/, bump base version to 0.4.0 #19

Merged

7 tasks

This was referenced Apr 26, 2026

chore: release 0.5.0 #24

Merged

chore(main): release 0.6.0 #27

Merged

JacobPEvans-personal merged 8 commits into main from feat/production-polish

2 participants
