feat: production polish — package layout, CI, viewer, docs #15
- Migrate scripts/publish_run.py into src/mlx_benchmarks/ package with typed envelope, runtime system detection, converter protocol, strict jsonschema validation before every publish, and structured logging.
- Add mlx-bench-publish CLI exposed via project.scripts. Keep the original scripts/publish_run.py as a thin back-compat shim.
- Extend schema.json with $id, optional reproducibility fields (python_version, mlx_version, mlx_lm_version, lm_eval_version, kernel, seed, gen_kwargs, model_revision, quantization, duration_seconds) and ship canonical valid + invalid envelope examples.
- Introduce ruff + mypy (strict) + pytest + pre-commit quality gates. First run is green across 18 pytest cases, 0 ruff findings, 0 mypy errors.
- Replace hardcoded SYSTEM dict with detect_system() so published envelopes reflect whoever ran the benchmark, not one specific laptop.
- Fix incidental ruff findings in harness/framework-eval/* and configs/lm-eval/qwen3-tasks/utils.py without silencing them.
- Add scripts/validate_schema.py runnable locally (replaces heredoc embedded in validate-schema.yml).
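The converter protocol and registry mentioned in this commit could be sketched roughly as follows. This is an illustration under stated assumptions, not the package's actual code: the names `get_converter`, `ConverterContext`, and `LmEvalConverter` appear in this PR, but the `Converter` protocol, the context fields, the `_REGISTRY` dict, and the envelope keys below are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional, Protocol


@dataclass(frozen=True)
class ConverterContext:
    """Run-level facts the raw results file does not carry (hypothetical fields)."""

    suite: str
    tag: Optional[str] = None


class Converter(Protocol):
    """Anything that turns a raw results dict into an envelope dict."""

    def convert(self, raw: dict[str, Any], ctx: ConverterContext) -> dict[str, Any]: ...


class LmEvalConverter:
    """Toy stand-in: copies per-task metrics into a flat results list."""

    def convert(self, raw: dict[str, Any], ctx: ConverterContext) -> dict[str, Any]:
        results = [
            {"task": task, "metrics": metrics}
            for task, metrics in (raw.get("results") or {}).items()
        ]
        return {"suite": ctx.suite, "results": results}


# Hypothetical registry: maps the CLI's --kind value to a converter class.
_REGISTRY: dict[str, type] = {"lm-eval": LmEvalConverter}


def get_converter(kind: str) -> Converter:
    """Look up a converter, failing loudly and listing the kinds that do exist."""
    try:
        return _REGISTRY[kind]()
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"unknown converter kind {kind!r}; known kinds: {known}") from None
```

Under this shape, adding a new harness would mean writing one class with a `convert` method and registering it; the CLI and publisher would not need to change.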
- Add space/README.md with proper HF Spaces front-matter (sdk: gradio, app_file: app.py, pinned, tags, linked dataset) plus Installation and Usage sections satisfying the readme validator.
- Add space/tests/test_charts.py covering bar/trend/table builders against a small in-memory DataFrame, plus an empty-data path. Guarantees the viewer cannot ship with a trivially-broken chart.
- Pair with the upcoming deploy-space.yml workflow that auto-syncs this directory to huggingface.co/spaces/JacobPEvans/mlx-benchmarks-viewer on every main push.
…e, release-please
- test.yml: 3.11 + 3.12 matrix running ruff + ruff-format + mypy strict + pytest (package + viewer). Concurrency-cancelled per ref.
- dry-run-publish.yml: round-trip the canonical lm-eval fixture through mlx-bench-publish on every PR that touches the publisher. Guards the envelope contract at PR time, not just locally.
- validate-schema.yml rewritten to call the committed scripts/validate_schema.py instead of a heredoc Python snippet. Same behavior, reviewable locally.
- codeql.yml: Python SAST on PR + weekly schedule.
- dependency-review.yml: fail PRs introducing high-severity advisories.
- deploy-space.yml: on main pushes touching space/**, sync to HF Space JacobPEvans/mlx-benchmarks-viewer via scripts/deploy_space.py (no inline Python). Uses HF_TOKEN from the huggingface-space environment.
- release-please.yml + config + manifest: conventional-commits driven Python releases. feat=minor, fix=patch; manual major bumps only.
- All workflow inputs that would ordinarily carry user-controllable data are piped through env: blocks, not interpolated into run: strings.
…ocs + templates
- README: new hero, badges (test, validate-schema, CodeQL, HF dataset, HF
Space, license, Python), accurate quickstart using mlx-bench-publish,
real repo tree, differentiator paragraph ("why not just lm-eval?").
- CLAUDE.md: agent-facing summary aligned with README; zero drift.
- Add CONTRIBUTING.md (dev setup, quality-gate checklist, converter
guide, what-not-to-add).
- Add SECURITY.md (HF token handling, `--confirm_run_unsafe_code`
disclosure, third-party action pinning policy, non-vuln clarifications).
- Add CODEOWNERS, .editorconfig, .github/PULL_REQUEST_TEMPLATE.md, and
two issue templates (bug, benchmark/converter request) with a
contact_links config pointing to the viewer + discussions.
- Add docs/architecture.md (ASCII diagram + per-component notes) and
docs/schema.md (prose walk-through of the authoritative contract).
- Update configs/LAYOUT.md to match what actually ships and mark
lighteval/mlxbench/extra lm-eval suites as planned, not present.
- Update harness/framework-eval/README.md to reference the
benchmark-request issue template instead of claiming a "follow-up PR"
tracked nowhere.
Pull request overview
This PR polishes mlx-benchmarks into a production-ready project by packaging the publisher as mlx_benchmarks, extending/validating the envelope schema, adding a tested Gradio viewer artifact, and introducing CI + repo hygiene docs/templates.
Changes:
- Introduces a typed `mlx_benchmarks` package with converter(s), system detection, schema validation, CLI entrypoint, and HF publish path.
- Extends `schema.json` (non-breaking optional fields) and ships canonical valid/invalid envelope examples + schema tests.
- Adds CI workflows + pre-commit/mypy/ruff/pytest gates and promotes the dataset viewer under `space/` with tests and deploy tooling.
Reviewed changes
Copilot reviewed 55 out of 58 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_system.py | Adds smoke tests for runtime system detection shape/content. |
| tests/test_schema.py | Adds Draft-07 schema self-validation + example fixture validation tests. |
| tests/test_publish.py | Adds unit tests for publish helpers (slugify/path/Parquet/dry-run). |
| tests/test_lm_eval_converter.py | Adds end-to-end lm-eval fixture → envelope → schema validation test. |
| tests/test_cli.py | Adds CLI argparse/dispatch smoke tests (dry-run, tag validation, JSON errors). |
| tests/fixtures/lm_eval_results_sample.json | Adds canonical lm-eval raw fixture used by converter/CLI tests. |
| tests/conftest.py | Adds shared fixtures for lm-eval + valid/invalid envelope examples. |
| src/mlx_benchmarks/system.py | Implements best-effort system metadata detection (os/chip/memory/versions). |
| src/mlx_benchmarks/publish.py | Implements envelope→rows→Parquet and HF dataset publish with deterministic shard paths. |
| src/mlx_benchmarks/logging_config.py | Adds console logging configuration with optional JSON-lines formatter. |
| src/mlx_benchmarks/envelope.py | Adds typed envelope definitions + cached jsonschema validator and schema lookup. |
| src/mlx_benchmarks/converters/lm_eval.py | Adds lm-eval results.json → envelope converter. |
| src/mlx_benchmarks/converters/base.py | Defines converter protocol + ConverterContext dataclass. |
| src/mlx_benchmarks/converters/__init__.py | Adds converter registry + get_converter() with explicit unknown-kind error. |
| src/mlx_benchmarks/cli.py | Adds mlx-bench-publish CLI for conversion + validation + publish/dry-run. |
| src/mlx_benchmarks/__init__.py | Exposes public API symbols and package version. |
| space/tests/test_charts.py | Adds Plotly chart-builder tests and empty-data behavior tests. |
| space/requirements.txt | Declares Space runtime dependencies for viewer deployment. |
| space/app.py | Viewer polish/formatting adjustments and small refactors. |
| space/README.md | Adds HF Spaces front-matter + viewer usage/deploy documentation. |
| scripts/validate_schema.py | Adds standalone schema + TOML validation script for CI/local use. |
| scripts/publish_run.py | Replaces legacy script with back-compat shim to the new CLI. |
| scripts/deploy_space.py | Adds script to atomically sync space/ contents to HF Space. |
| schema.json | Adds $id, optional reproducibility fields, and embedded examples. |
| pyproject.toml | Converts repo into an installable package, adds scripts/extras, and configures ruff/mypy/pytest/coverage. |
| harness/framework-eval/eval_smolagents.py | Minor hardening: switch to Path(...).open() for fixture reads. |
| harness/framework-eval/eval_qwen_agent.py | Minor hardening/typing: Path(...).open(), ClassVar, kwargs naming, formatting. |
| harness/framework-eval/eval_openai_tool_calling.py | Minor hardening: switch to Path(...).open() for fixture reads. |
| harness/framework-eval/eval_google_adk.py | Minor hardening: switch to Path(...).open() for fixture reads. |
| harness/framework-eval/README.md | Updates status text and links to follow-up benchmark request template. |
| examples/envelope.valid.json | Adds canonical valid envelope example fixture. |
| examples/envelope.invalid.json | Adds canonical invalid envelope example fixture. |
| docs/schema.md | Adds prose schema walkthrough and versioning guidance. |
| docs/journal/2026-04-19-qwen36-benchmark-session.md | Adds archived benchmark session notes. |
| docs/journal/2026-04-10-bifrost-benchmark-session.md | Adds archived benchmark session notes. |
| docs/architecture.md | Adds architecture diagram and component/CI overview. |
| configs/lm-eval/qwen3-tasks/utils.py | Improves task utils exports and validation (e.g., strict zip, parity checks). |
| configs/LAYOUT.md | Updates current vs planned config layout and links to request template. |
| SECURITY.md | Adds security policy, token handling guidance, and unsafe-code warnings. |
| README.md | Major rewrite: badges, quickstart, layout, schema/publisher usage, viewer instructions. |
| CONTRIBUTING.md | Adds developer workflow, quality gates, and contribution conventions. |
| CODEOWNERS | Adds repo-wide code owner. |
| CLAUDE.md | Updates agent-facing conventions and common tasks. |
| .release-please-manifest.json | Adds release-please manifest with current version. |
| .release-please-config.json | Adds release-please configuration for Python releases/changelog sections. |
| .pre-commit-config.yaml | Adds pre-commit hooks for formatting/lint/type/actionlint. |
| .github/workflows/validate-schema.yml | Reworks schema validation workflow to call committed validator script. |
| .github/workflows/test.yml | Adds CI workflow for ruff/mypy/pytest across Python 3.11/3.12. |
| .github/workflows/release-please.yml | Adds automated release PR workflow via release-please. |
| .github/workflows/dry-run-publish.yml | Adds CI workflow to dry-run publisher on canonical fixture. |
| .github/workflows/deploy-space.yml | Adds workflow to deploy space/ to HF Space using scripts/deploy_space.py. |
| .github/workflows/dependency-review.yml | Adds dependency review workflow gate. |
| .github/workflows/codeql.yml | Adds CodeQL workflow for Python security-and-quality queries. |
| .github/PULL_REQUEST_TEMPLATE.md | Adds PR template with required local test checklist. |
| .github/ISSUE_TEMPLATE/config.yml | Adds issue template config and contact links. |
| .github/ISSUE_TEMPLATE/bug.yml | Adds structured bug report template. |
| .github/ISSUE_TEMPLATE/benchmark-request.yml | Adds structured benchmark/converter request template. |
| .editorconfig | Adds repository-wide editor defaults. |
Code Review
This pull request introduces a comprehensive project structure for the mlx-benchmarks harness, including configuration files, issue templates, a pre-commit configuration, and documentation. The changes establish a standardized envelope schema for benchmark results and a publisher CLI to upload these results to a HuggingFace dataset. I have identified a few issues: the pre-commit configuration uses invalid revision tags for hooks, there is a typo in the repository URL within the contribution guide, and the security policy references a non-existent configuration file.
- test.yml + dry-run-publish.yml: pin astral-sh/setup-uv@v8.1.0
(v8 major tag does not exist; CI was failing to resolve).
- Drop .github/workflows/codeql.yml — the repo has GitHub's default
CodeQL setup enabled, which conflicts with advanced configurations
("CodeQL analyses from advanced configurations cannot be processed
when the default setup is enabled"). Default setup already covers
Python + actions scanning.
- CONTRIBUTING.md: fix typo meh-benchmarks.git -> mlx-benchmarks.git
and simplify to a single-line clone command. (reported by Copilot,
Gemini)
- CLAUDE.md: scope the "no uv run / uvx" convention to the main
publisher workflow; acknowledge that harness/framework-eval uses
`uv run --with ...` because each script carries PEP 723 inline
metadata. (reported by Copilot)
- src/mlx_benchmarks/system.py: docstring now accurately describes
fallback behavior (schema-required fields always populated via
"unknown"/0 fallback; optional fields omitted when detection fails).
(reported by Copilot)
- space/tests/test_charts.py: replace `import app # noqa: E402` with
`importlib.import_module("app")` after sys.path insert, so ruff
stays clean without suppressions. (reported by Copilot)
JacobPEvans-personal left a comment
Code review
No issues found. Checked for bugs and CLAUDE.md compliance.
🤖 Generated with Claude Code
- If this code review was useful, please react with 👍. Otherwise, react with 👎.
…hema format checker
Addresses blockers and high-priority findings from the senior-review pass that the automated reviewers missed.
BLOCKERS
- pyproject.toml: schema.json was shipped via [shared-data] which lands it outside the package dir. Installed wheel -> FileNotFoundError on every publish. Switched to [force-include] so it sits at `site-packages/mlx_benchmarks/schema.json`, where importlib.resources resolves it. Verified by building the wheel + pip-installing into a clean venv + running mlx-bench-publish --dry-run.
- scripts/deploy_space.py: SKIP_PARTS now also excludes `tests` so the HF Space sync does not accidentally ship pytest files into the Space repo.
HIGH
- envelope.py: Draft7Validator was constructed without a format_checker, meaning `format: date-time` was purely decorative: any string passed validation. Now uses Draft7Validator.FORMAT_CHECKER. `jsonschema` dep upgraded to `jsonschema[format]>=4.23.0` so rfc3339-validator is available. New test: bad timestamp on an otherwise-valid envelope fails with error path `[timestamp]`.
- publish.py: target_path now appends an 8-char sha256 prefix of the parquet payload (backwards-compatible when payload=None). Prevents silent overwrite when two runs in the same second produce different bytes. Policy is documented in-code.
- publish.py: introduce PublishError. rows_to_parquet and HF upload both raise it, and CLI catches PublishError instead of RuntimeError so HfHubHTTPError propagates through a single clean exit path (no more tracebacks on auth / rate-limit / network errors).
- converters/lm_eval.py: removed the `group_subtasks` duration fallback; that field maps group names to subtask name lists, not durations. Zero durations are now preserved (explicit `is None` check instead of `or`). All .get-chains are null-safe so an lm-eval output with an explicit `null` config.gen_kwargs no longer AttributeErrors.
- converters/lm_eval.py: _extract_timestamp validates ISO-8601 via regex before passing strings through; malformed timestamps log a warning and fall back to UTC now.
- .release-please-config.json + extra-files: add x-release-please-version anchor comments in src/mlx_benchmarks/__init__.py AND pyproject.toml so release-please actually bumps the version instead of skipping.
- docs/architecture.md, README.md, CLAUDE.md: remove references to the now-deleted codeql.yml workflow (the broken CodeQL badge on the README is replaced; the architecture doc notes CodeQL comes from the repo's default setup).
MEDIUM
- cli.py: --log-level now has explicit choices, typo no longer crashes with a traceback.
- test.yml: cancel-in-progress only on PRs so back-to-back main pushes keep their green signals.
- validate-schema.yml: push path filter mirrors PR filter so TOML changes merged to main still get validated.
- validate-schema.yml: installs `jsonschema[format]` so format-checker assertions the validator will soon rely on are available there too.
- dry-run-publish.yml: now builds the wheel and installs into a clean venv before running the CLI, which catches packaging-layer bugs that editable installs hide (the fix in this very commit would not have been caught by the previous in-place variant).
- dry-run-publish.yml: concurrency group + cancel-in-progress added.
TESTS
- New test_format_checker_rejects_non_iso_timestamp covers the format checker gap.
- New test_target_path_includes_payload_hash asserts the collision-proof suffix.
- test_rows_to_parquet_rejects_empty and test_publish_skipping_validation_still_rejects_empty now expect PublishError (tighter than plain ValueError).
Consolidate CI to follow the same conventions as nix-darwin, nix-ai,
nix-home, terraform-proxmox, ansible-proxmox, and ai-workflows.
Workflow consolidation:
- Replace 6 standalone workflows (test, dependency-review,
dry-run-publish, validate-schema, plus existing release-please and
deploy-space) with a single ci-gate.yml orchestrator that uses
dorny/paths-filter for change detection and conditionally calls
central reusable workflows from JacobPEvans/.github:
* _python-security.yml — pip-audit on resolved deps
* _osv-scan.yml — multi-ecosystem OSV lockfile scan
* _markdown-lint.yml — markdownlint on .md files
* _file-size.yml — repo-wide size guard
- Inline jobs (python-test matrix, schema-validate, dry-run-publish)
participate in the same merge gate via re-actors/alls-green.
- Convert release-please.yml to a thin wrapper around
_release-please.yml@main with GH_ACTION_JACOBPEVANS_APP_ID +
GH_APP_PRIVATE_KEY so release PRs trigger downstream pull_request
workflows (GITHUB_TOKEN does not).
- Drop dependency-review.yml outright; OSV + pip-audit cover the
same surface (full lockfile, multi-ecosystem) more comprehensively
than the PR-diff-only GHSA scan.
Repo hygiene cleanup:
- Delete inherited templates (.github/ISSUE_TEMPLATE/bug.yml,
config.yml, PULL_REQUEST_TEMPLATE.md) — they fall back to the
JacobPEvans/.github community-health-files inheritance.
- Delete CODEOWNERS and .editorconfig — absent from every other
surveyed repo.
- Update benchmark-request.yml to use canonical org label
(type:feature) and add priority:* / size:* dropdowns so
auto-label-issues.yml can extract them.
Config alignments:
- Rename .release-please-config.json -> release-please-config.json
(no leading dot — the central reusable expects this filename).
- Extend renovate.json with
local>JacobPEvans/.github:renovate-presets to inherit the
trusted-org allow-list, automerge rules, and Nix/uv custom managers.
- Pre-commit check-yaml now uses --unsafe to accept lm-eval task
files with custom !function tags (still validates YAML syntax).
Docs:
- README badges, README repo-shape diagram, docs/architecture.md
CI table, and CLAUDE.md repo-shape block all updated to match.
(claude)
Summary
Take `mlx-benchmarks` from early-stage scaffolding to a polished, senior-reviewer-ready project in four logical commits.
- … `scripts/publish_run.py` into a proper `src/mlx_benchmarks/` package (typed envelope, runtime system detection, converter protocol, strict jsonschema validation before every publish, structured JSON-lines logging). Expose as `mlx-bench-publish`. Keep the old script as a back-compat shim.
- … `$id`, optional reproducibility fields (`python_version`, `mlx_version`, `mlx_lm_version`, `lm_eval_version`, `kernel`, `seed`, `gen_kwargs`, `model_revision`, `quantization`, `duration_seconds`), and ship canonical valid + invalid envelope examples. Non-breaking.
- … 18-test pytest suite with schema round-trip, publisher dry-run, runtime system smoke. Zero findings, zero warnings.
- … `app.py` to `space/` with HF Spaces front-matter, add viewer chart tests (5 cases) covering empty-data path, bar/trend/pivot rendering.
- … `test`, `dry-run-publish`, `codeql`, `dependency-review`, `deploy-space`, `release-please`) + rewrite of `validate-schema` to call committed script instead of heredoc. All user-controllable inputs flow through `env:` blocks; trusted-org actions pinned to major-version tags.
- … quickstart, plus CONTRIBUTING, SECURITY, architecture, schema prose, CODEOWNERS, .editorconfig, bug/benchmark-request issue templates, PR template. Session notes moved to `docs/journal/`.
Out of scope (follow-ups)
- … architecture is in place (`get_converter` registry), implementation deferred.
- … the current three tabs ship first; enhancements deferred.
- … `data/run-rescue-*.parquet` shards; verified in Phase 0.
Test plan
- `.venv/bin/ruff check .`: all checks passed
- `.venv/bin/ruff format --check .`: 25 files formatted
- `.venv/bin/mypy src/mlx_benchmarks`: strict, 0 errors
- `.venv/bin/pytest tests space/tests`: 23 pass
- `.venv/bin/python scripts/validate_schema.py`: schema + 3 TOMLs OK
- `mlx-bench-publish tests/fixtures/lm_eval_results_sample.json --kind lm-eval --suite reasoning --dry-run`: green
Live-smoke status
Pipeline fully validated via fixture round-trip (the full code path a live
benchmark would take: raw JSON → LmEvalConverter → typed envelope →
runtime system detection → schema validation → Parquet → deterministic
HF filename).
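The last step of that path, the deterministic HF filename, is where the payload-hash policy from the review-fix commit applies. A minimal sketch, assuming a hypothetical `data/run-<UTC stamp>-<model slug>` layout (only the `target_path` name, the 8-character sha256 prefix, and the `payload=None` back-compat case come from this PR):

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def target_path(
    model_slug: str, timestamp: datetime, payload: Optional[bytes] = None
) -> str:
    """Build a shard path; with a payload, append a short content hash.

    `timestamp` is expected to be timezone-aware; it is normalized to UTC.
    """
    stamp = timestamp.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    stem = f"data/run-{stamp}-{model_slug}"  # hypothetical layout
    if payload is None:
        # Back-compat: callers without bytes get the un-suffixed name.
        return f"{stem}.parquet"
    digest = hashlib.sha256(payload).hexdigest()[:8]
    return f"{stem}-{digest}.parquet"
```

With this scheme the same bytes always map to the same path, so re-publishing is idempotent, while two different payloads produced within the same second land in different files instead of one silently overwriting the other.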
Live MLX smoke could not complete this session: `llama-swap` on :11434 is wedged after the pinned default `Qwen3.5-27B-4bit` got SIGKILL'd. The vllm-mlx log shows 30+ minutes of 502s on `POST /v1/chat/completions`. Any lm_eval run hangs until the service is restarted (`launchctl kickstart -k gui/$(id -u)/dev.vllm-mlx.server`). I chose not to take that shared-system action without approval.
Once the backend is back, the 3-model sweep drafted in `~/.claude/plans/i-never-know-vivid-sonnet.md` Phase 8 is ready to run as-is, pointing at `Qwen3.5-9B-MLX-4bit`, `gemma-4-e4b-it-4bit`, and `DeepSeek-R1-0528-Qwen3-8B-4bit` for `gsm8k_cot_zeroshot --limit 3`.
Migration notes
- `scripts/publish_run.py` keeps its old CLI shape via shim; no breaking change for existing runbooks.
- `uv sync` picks up the new required deps (`jsonschema`, `psutil`, `huggingface-hub`, `pyarrow`) automatically.
- `pyproject.toml` no longer declares `[tool.uv] package = false`.
Release-please
Merge will bump to 0.3.0 (minor; feat + breaking-package-layout is
arguably major, but the public contract holds — published envelopes
remain valid, old CLI invocation still works via shim).