
feat: production polish — package layout, CI, viewer, docs#15

Merged
JacobPEvans-personal merged 8 commits into main from feat/production-polish on Apr 25, 2026


@JacobPEvans-personal JacobPEvans-personal commented Apr 24, 2026

Summary

Take mlx-benchmarks from early-stage scaffolding to a polished,
senior-reviewer-ready project in four logical commits.

  • Publisher — migrate scripts/publish_run.py into a proper
    src/mlx_benchmarks/ package (typed envelope, runtime system
    detection, converter protocol, strict jsonschema validation before
    every publish, structured JSON-lines logging). Expose as
    mlx-bench-publish. Keep the old script as a back-compat shim.
  • Schema v1.1 — add $id, optional reproducibility fields
    (python_version, mlx_version, mlx_lm_version, lm_eval_version,
    kernel, seed, gen_kwargs, model_revision, quantization,
    duration_seconds), and ship canonical valid + invalid envelope
    examples. Non-breaking.
  • Quality gates — ruff + mypy strict + pytest + pre-commit.
    18-test pytest suite with schema round-trip, publisher dry-run,
    runtime system smoke. Zero findings, zero warnings.
  • Viewer as a first-class artifact — move app.py to space/ with
    HF Spaces front-matter, add viewer chart tests (5 cases) covering
    empty-data path, bar/trend/pivot rendering.
  • CI — six new workflows (test, dry-run-publish, codeql,
    dependency-review, deploy-space, release-please) + rewrite of
    validate-schema to call committed script instead of heredoc. All
    user-controllable inputs flow through env: blocks; trusted-org
    actions pinned to major-version tags.
  • Docs — full README rewrite with hero + real badges + correct
    quickstart, plus CONTRIBUTING, SECURITY, architecture, schema prose,
    CODEOWNERS, .editorconfig, bug/benchmark-request issue templates, PR
    template. Session notes moved to docs/journal/.

Out of scope (follow-ups)

  • Converter coverage beyond lm-eval (vllm throughput, framework-eval)
    — architecture is in place (get_converter registry), implementation
    deferred.
  • Viewer bells-and-whistles (heatmap, leaderboard, sample drill-down)
    — the current three tabs ship first; enhancements deferred.
  • F8-sweep historical rescue — already published as
    data/run-rescue-*.parquet shards; verified in Phase 0.
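
The get_converter registry mentioned in the first follow-up could be sketched roughly as below. This is a minimal illustration, not the package's actual code: the ConverterContext fields, the reduced envelope shape, and the error wording are all assumptions. Only the names get_converter, LmEvalConverter, ConverterContext, and the "lm-eval" kind come from this PR.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Protocol


@dataclass(frozen=True)
class ConverterContext:
    """Run-level metadata handed to every converter (fields are hypothetical)."""

    suite: str
    model: str


class Converter(Protocol):
    """Structural type: any class with a matching convert() qualifies."""

    def convert(self, raw: dict[str, Any], ctx: ConverterContext) -> dict[str, Any]: ...


class LmEvalConverter:
    def convert(self, raw: dict[str, Any], ctx: ConverterContext) -> dict[str, Any]:
        # Hypothetical envelope, reduced to three keys for illustration.
        return {"suite": ctx.suite, "model": ctx.model, "results": raw.get("results", {})}


# A new benchmark kind would be added by registering one more class here.
_REGISTRY: dict[str, type[Converter]] = {"lm-eval": LmEvalConverter}


def get_converter(kind: str) -> Converter:
    """Return a converter for `kind`, failing loudly on unknown kinds."""
    try:
        return _REGISTRY[kind]()
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"unknown converter kind {kind!r}; known kinds: {known}") from None
```

With this shape, a deferred converter (for example vllm throughput) would only need a new class plus one registry entry; callers keep using get_converter(kind).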

Test plan

  • .venv/bin/ruff check . — all checks passed
  • .venv/bin/ruff format --check . — 25 files formatted
  • .venv/bin/mypy src/mlx_benchmarks — strict, 0 errors
  • .venv/bin/pytest tests space/tests — 23 pass
  • .venv/bin/python scripts/validate_schema.py — schema + 3 TOMLs OK
  • mlx-bench-publish tests/fixtures/lm_eval_results_sample.json --kind lm-eval --suite reasoning --dry-run — green
  • Live MLX smoke — BLOCKED on wedged llama-swap backend (see below)

Live-smoke status

Pipeline fully validated via fixture round-trip (the full code path a live
benchmark would take: raw JSON → LmEvalConverter → typed envelope →
runtime system detection → schema validation → Parquet → deterministic
HF filename).
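
The first stages of that code path can be approximated with the standard library alone. The sketch below is illustrative only: the envelope keys, the REQUIRED_KEYS tuple, and the row layout are assumptions made for the example, and the required-key check is a stand-in for the real jsonschema validation against schema.json.

```python
from __future__ import annotations

from typing import Any

# Hypothetical subset of required keys; the authoritative contract is schema.json.
REQUIRED_KEYS = ("timestamp", "suite", "model", "results")


def to_envelope(raw: dict[str, Any], *, suite: str, model: str, timestamp: str) -> dict[str, Any]:
    """Stage 1: raw harness output -> envelope (shape is illustrative)."""
    return {
        "timestamp": timestamp,
        "suite": suite,
        "model": model,
        "results": raw.get("results") or {},
    }


def validate(envelope: dict[str, Any]) -> None:
    """Stage 2: stand-in for schema validation; checks required keys only."""
    missing = [key for key in REQUIRED_KEYS if key not in envelope]
    if missing:
        raise ValueError(f"envelope missing required keys: {missing}")


def to_rows(envelope: dict[str, Any]) -> list[dict[str, Any]]:
    """Stage 3: one flat row per (task, metric), ready for a columnar writer."""
    rows = []
    for task, metrics in sorted(envelope["results"].items()):
        for metric, value in sorted(metrics.items()):
            rows.append(
                {"model": envelope["model"], "task": task, "metric": metric, "value": value}
            )
    return rows


raw = {"results": {"gsm8k": {"exact_match": 0.67}}}
envelope = to_envelope(raw, suite="reasoning", model="demo-4bit", timestamp="2026-04-24T16:03:00Z")
validate(envelope)
rows = to_rows(envelope)
print(rows)
```

Validating before flattening means a malformed envelope fails before any rows are produced, which matches the "validation before every publish" ordering described in the summary.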

Live MLX smoke could not complete this session: llama-swap on :11434 is
wedged after the pinned default Qwen3.5-27B-4bit got SIGKILL'd. The
vllm-mlx log shows 30+ minutes of 502s on POST /v1/chat/completions.
Any lm_eval run hangs until the service is restarted
(launchctl kickstart -k gui/$(id -u)/dev.vllm-mlx.server). I chose not
to take that shared-system action without approval.

Once the backend is back, the 3-model sweep drafted in
~/.claude/plans/i-never-know-vivid-sonnet.md Phase 8 is ready to run
as-is — pointing at Qwen3.5-9B-MLX-4bit, gemma-4-e4b-it-4bit, and
DeepSeek-R1-0528-Qwen3-8B-4bit for gsm8k_cot_zeroshot --limit 3.

Migration notes

  • scripts/publish_run.py keeps its old CLI shape via shim; no
    breaking change for existing runbooks.
  • uv sync picks up the new required deps (jsonschema, psutil,
    huggingface-hub, pyarrow) automatically. pyproject.toml no
    longer declares [tool.uv] package = false.
  • Schema additions are all optional → existing consumers unaffected.

Release-please

Merge will bump to 0.3.0 (minor; feat + breaking-package-layout is
arguably major, but the public contract holds — published envelopes
remain valid, old CLI invocation still works via shim).

- Migrate scripts/publish_run.py into src/mlx_benchmarks/ package with
  typed envelope, runtime system detection, converter protocol, strict
  jsonschema validation before every publish, and structured logging.
- Add mlx-bench-publish CLI exposed via project.scripts. Keep the original
  scripts/publish_run.py as a thin back-compat shim.
- Extend schema.json with $id, optional reproducibility fields
  (python_version, mlx_version, mlx_lm_version, lm_eval_version, kernel,
  seed, gen_kwargs, model_revision, quantization, duration_seconds) and
  ship canonical valid + invalid envelope examples.
- Introduce ruff + mypy (strict) + pytest + pre-commit quality gates.
  First run is green across 18 pytest cases, 0 ruff findings, 0 mypy errors.
- Replace hardcoded SYSTEM dict with detect_system() so published envelopes
  reflect whoever ran the benchmark, not one specific laptop.
- Fix incidental ruff findings in harness/framework-eval/* and
  configs/lm-eval/qwen3-tasks/utils.py without silencing them.
- Add scripts/validate_schema.py runnable locally (replaces heredoc
  embedded in validate-schema.yml).
- Add space/README.md with proper HF Spaces front-matter (sdk: gradio,
  app_file: app.py, pinned, tags, linked dataset) plus Installation and
  Usage sections satisfying the readme validator.
- Add space/tests/test_charts.py covering bar/trend/table builders against
  a small in-memory DataFrame, plus an empty-data path. Guarantees the
  viewer cannot ship with a trivially-broken chart.
- Pair with the upcoming deploy-space.yml workflow that auto-syncs this
  directory to huggingface.co/spaces/JacobPEvans/mlx-benchmarks-viewer
  on every main push.
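
For the detect_system() change in the list above, a best-effort detector might look like the following sketch. It uses only the standard library, whereas the package lists psutil as a dependency, so treat this as a simplified stand-in. The key names (os, chip, memory_gb) are assumptions; python_version and mlx_version are the optional schema fields named in the summary.

```python
from __future__ import annotations

import os
import platform
from importlib import metadata
from typing import Any


def _total_memory_gb() -> int:
    """Physical memory in GiB, or 0 when the platform cannot report it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # AttributeError: no os.sysconf (e.g. Windows); ValueError: unknown name.
        return 0
    if pages <= 0 or page_size <= 0:
        return 0
    return int(pages * page_size / 2**30)


def detect_system() -> dict[str, Any]:
    """Describe the machine running the benchmark instead of a hardcoded dict."""
    system: dict[str, Any] = {
        # Required-style fields always get a value, falling back to "unknown" / 0.
        "os": platform.system() or "unknown",
        "chip": platform.processor() or platform.machine() or "unknown",
        "memory_gb": _total_memory_gb(),
        "python_version": platform.python_version(),
    }
    # Optional field: omitted entirely when the package is not installed.
    try:
        system["mlx_version"] = metadata.version("mlx")
    except metadata.PackageNotFoundError:
        pass
    return system


print(detect_system())
```
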
…e, release-please

- test.yml: 3.11 + 3.12 matrix running ruff + ruff-format + mypy strict +
  pytest (package + viewer). Concurrency-cancelled per ref.
- dry-run-publish.yml: round-trip the canonical lm-eval fixture through
  mlx-bench-publish on every PR that touches the publisher. Guards the
  envelope contract at PR time, not just locally.
- validate-schema.yml rewritten to call the committed
  scripts/validate_schema.py instead of a heredoc Python snippet. Same
  behavior, reviewable locally.
- codeql.yml: Python SAST on PR + weekly schedule.
- dependency-review.yml: fail PRs introducing high-severity advisories.
- deploy-space.yml: on main pushes touching space/**, sync to HF Space
  JacobPEvans/mlx-benchmarks-viewer via scripts/deploy_space.py (no
  inline Python). Uses HF_TOKEN from the huggingface-space environment.
- release-please.yml + config + manifest: conventional-commits driven
  Python releases. feat=minor, fix=patch; manual major bumps only.
- All workflow inputs that would ordinarily carry user-controllable data
  are piped through env: blocks, not interpolated into run: strings.
…ocs + templates

- README: new hero, badges (test, validate-schema, CodeQL, HF dataset, HF
  Space, license, Python), accurate quickstart using mlx-bench-publish,
  real repo tree, differentiator paragraph ("why not just lm-eval?").
- CLAUDE.md: agent-facing summary aligned with README; zero drift.
- Add CONTRIBUTING.md (dev setup, quality-gate checklist, converter
  guide, what-not-to-add).
- Add SECURITY.md (HF token handling, `--confirm_run_unsafe_code`
  disclosure, third-party action pinning policy, non-vuln clarifications).
- Add CODEOWNERS, .editorconfig, .github/PULL_REQUEST_TEMPLATE.md, and
  two issue templates (bug, benchmark/converter request) with a
  contact_links config pointing to the viewer + discussions.
- Add docs/architecture.md (ASCII diagram + per-component notes) and
  docs/schema.md (prose walk-through of the authoritative contract).
- Update configs/LAYOUT.md to match what actually ships and mark
  lighteval/mlxbench/extra lm-eval suites as planned, not present.
- Update harness/framework-eval/README.md to reference the
  benchmark-request issue template instead of claiming a "follow-up PR"
  tracked nowhere.
Copilot AI review requested due to automatic review settings April 24, 2026 16:03

Copilot AI left a comment


Pull request overview

This PR polishes mlx-benchmarks into a production-ready project by packaging the publisher as mlx_benchmarks, extending/validating the envelope schema, adding a tested Gradio viewer artifact, and introducing CI + repo hygiene docs/templates.

Changes:

  • Introduces a typed mlx_benchmarks package with converter(s), system detection, schema validation, CLI entrypoint, and HF publish path.
  • Extends schema.json (non-breaking optional fields) and ships canonical valid/invalid envelope examples + schema tests.
  • Adds CI workflows + pre-commit/mypy/ruff/pytest gates and promotes the dataset viewer under space/ with tests and deploy tooling.

Reviewed changes

Copilot reviewed 55 out of 58 changed files in this pull request and generated 4 comments.

File Description
tests/test_system.py Adds smoke tests for runtime system detection shape/content.
tests/test_schema.py Adds Draft-07 schema self-validation + example fixture validation tests.
tests/test_publish.py Adds unit tests for publish helpers (slugify/path/Parquet/dry-run).
tests/test_lm_eval_converter.py Adds end-to-end lm-eval fixture → envelope → schema validation test.
tests/test_cli.py Adds CLI argparse/dispatch smoke tests (dry-run, tag validation, JSON errors).
tests/fixtures/lm_eval_results_sample.json Adds canonical lm-eval raw fixture used by converter/CLI tests.
tests/conftest.py Adds shared fixtures for lm-eval + valid/invalid envelope examples.
src/mlx_benchmarks/system.py Implements best-effort system metadata detection (os/chip/memory/versions).
src/mlx_benchmarks/publish.py Implements envelope→rows→Parquet and HF dataset publish with deterministic shard paths.
src/mlx_benchmarks/logging_config.py Adds console logging configuration with optional JSON-lines formatter.
src/mlx_benchmarks/envelope.py Adds typed envelope definitions + cached jsonschema validator and schema lookup.
src/mlx_benchmarks/converters/lm_eval.py Adds lm-eval results.json → envelope converter.
src/mlx_benchmarks/converters/base.py Defines converter protocol + ConverterContext dataclass.
src/mlx_benchmarks/converters/__init__.py Adds converter registry + get_converter() with explicit unknown-kind error.
src/mlx_benchmarks/cli.py Adds mlx-bench-publish CLI for conversion + validation + publish/dry-run.
src/mlx_benchmarks/__init__.py Exposes public API symbols and package version.
space/tests/test_charts.py Adds Plotly chart-builder tests and empty-data behavior tests.
space/requirements.txt Declares Space runtime dependencies for viewer deployment.
space/app.py Viewer polish/formatting adjustments and small refactors.
space/README.md Adds HF Spaces front-matter + viewer usage/deploy documentation.
scripts/validate_schema.py Adds standalone schema + TOML validation script for CI/local use.
scripts/publish_run.py Replaces legacy script with back-compat shim to the new CLI.
scripts/deploy_space.py Adds script to atomically sync space/ contents to HF Space.
schema.json Adds $id, optional reproducibility fields, and embedded examples.
pyproject.toml Converts repo into an installable package, adds scripts/extras, and configures ruff/mypy/pytest/coverage.
harness/framework-eval/eval_smolagents.py Minor hardening: switch to Path(...).open() for fixture reads.
harness/framework-eval/eval_qwen_agent.py Minor hardening/typing: Path(...).open(), ClassVar, kwargs naming, formatting.
harness/framework-eval/eval_openai_tool_calling.py Minor hardening: switch to Path(...).open() for fixture reads.
harness/framework-eval/eval_google_adk.py Minor hardening: switch to Path(...).open() for fixture reads.
harness/framework-eval/README.md Updates status text and links to follow-up benchmark request template.
examples/envelope.valid.json Adds canonical valid envelope example fixture.
examples/envelope.invalid.json Adds canonical invalid envelope example fixture.
docs/schema.md Adds prose schema walkthrough and versioning guidance.
docs/journal/2026-04-19-qwen36-benchmark-session.md Adds archived benchmark session notes.
docs/journal/2026-04-10-bifrost-benchmark-session.md Adds archived benchmark session notes.
docs/architecture.md Adds architecture diagram and component/CI overview.
configs/lm-eval/qwen3-tasks/utils.py Improves task utils exports and validation (e.g., strict zip, parity checks).
configs/LAYOUT.md Updates current vs planned config layout and links to request template.
SECURITY.md Adds security policy, token handling guidance, and unsafe-code warnings.
README.md Major rewrite: badges, quickstart, layout, schema/publisher usage, viewer instructions.
CONTRIBUTING.md Adds developer workflow, quality gates, and contribution conventions.
CODEOWNERS Adds repo-wide code owner.
CLAUDE.md Updates agent-facing conventions and common tasks.
.release-please-manifest.json Adds release-please manifest with current version.
.release-please-config.json Adds release-please configuration for Python releases/changelog sections.
.pre-commit-config.yaml Adds pre-commit hooks for formatting/lint/type/actionlint.
.github/workflows/validate-schema.yml Reworks schema validation workflow to call committed validator script.
.github/workflows/test.yml Adds CI workflow for ruff/mypy/pytest across Python 3.11/3.12.
.github/workflows/release-please.yml Adds automated release PR workflow via release-please.
.github/workflows/dry-run-publish.yml Adds CI workflow to dry-run publisher on canonical fixture.
.github/workflows/deploy-space.yml Adds workflow to deploy space/ to HF Space using scripts/deploy_space.py.
.github/workflows/dependency-review.yml Adds dependency review workflow gate.
.github/workflows/codeql.yml Adds CodeQL workflow for Python security-and-quality queries.
.github/PULL_REQUEST_TEMPLATE.md Adds PR template with required local test checklist.
.github/ISSUE_TEMPLATE/config.yml Adds issue template config and contact links.
.github/ISSUE_TEMPLATE/bug.yml Adds structured bug report template.
.github/ISSUE_TEMPLATE/benchmark-request.yml Adds structured benchmark/converter request template.
.editorconfig Adds repository-wide editor defaults.


Comment thread CONTRIBUTING.md Outdated
Comment thread src/mlx_benchmarks/system.py Outdated
Comment thread CLAUDE.md Outdated
Comment thread space/tests/test_charts.py Outdated

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a comprehensive project structure for the mlx-benchmarks harness, including configuration files, issue templates, a pre-commit configuration, and documentation. The changes establish a standardized envelope schema for benchmark results and a publisher CLI to upload these results to a HuggingFace dataset. I have identified a few issues: the pre-commit configuration uses invalid revision tags for hooks, there is a typo in the repository URL within the contribution guide, and the security policy references a non-existent configuration file.

Comment thread .pre-commit-config.yaml
Comment thread .pre-commit-config.yaml
Comment thread CONTRIBUTING.md Outdated
Comment thread SECURITY.md
- test.yml + dry-run-publish.yml: pin astral-sh/setup-uv@v8.1.0
  (v8 major tag does not exist; CI was failing to resolve).
- Drop .github/workflows/codeql.yml — the repo has GitHub's default
  CodeQL setup enabled, which conflicts with advanced configurations
  ("CodeQL analyses from advanced configurations cannot be processed
  when the default setup is enabled"). Default setup already covers
  Python + actions scanning.
- CONTRIBUTING.md: fix typo meh-benchmarks.git -> mlx-benchmarks.git
  and simplify to a single-line clone command. (reported by Copilot,
  Gemini)
- CLAUDE.md: scope the "no uv run / uvx" convention to the main
  publisher workflow; acknowledge that harness/framework-eval uses
  `uv run --with ...` because each script carries PEP 723 inline
  metadata. (reported by Copilot)
- src/mlx_benchmarks/system.py: docstring now accurately describes
  fallback behavior (schema-required fields always populated via
  "unknown"/0 fallback; optional fields omitted when detection fails).
  (reported by Copilot)
- space/tests/test_charts.py: replace `import app # noqa: E402` with
  `importlib.import_module("app")` after sys.path insert, so ruff
  stays clean without suppressions. (reported by Copilot)
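
The test_charts.py change in the last item follows a common pattern: extend sys.path first, then import by name through importlib so no import statement appears below executable code (the situation ruff's E402 flags). A self-contained sketch, with a throwaway module standing in for space/app.py and a hypothetical helper name import_from_dir:

```python
import importlib
import sys
import tempfile
from pathlib import Path
from types import ModuleType


def import_from_dir(directory: Path, module_name: str) -> ModuleType:
    """Import `module_name` from `directory` without a late import statement."""
    path = str(directory)
    if path not in sys.path:
        sys.path.insert(0, path)
    return importlib.import_module(module_name)


# Demo: write a tiny module to a temp dir and import it by name.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo_viewer_app.py").write_text("TITLE = 'viewer'\n")
    demo = import_from_dir(Path(tmp), "demo_viewer_app")
    print(demo.TITLE)
```
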

@JacobPEvans-personal JacobPEvans-personal left a comment


Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

…hema format checker

Addresses blockers and high-priority findings from the senior-review pass
that the automated reviewers missed.

BLOCKERS
- pyproject.toml: schema.json was shipped via [shared-data] which lands it
  outside the package dir. Installed wheel -> FileNotFoundError on every
  publish. Switched to [force-include] so it sits at
  `site-packages/mlx_benchmarks/schema.json`, where importlib.resources
  resolves it. Verified by building the wheel + pip-installing into a clean
  venv + running mlx-bench-publish --dry-run.
- scripts/deploy_space.py: SKIP_PARTS now also excludes `tests` so the HF
  Space sync does not accidentally ship pytest files into the Space repo.
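
The SKIP_PARTS exclusion in the second blocker can be pictured as a filter over path components. The sketch below is an assumption about the shape of that filter, not the contents of scripts/deploy_space.py: the helper name files_to_sync and every SKIP_PARTS entry other than "tests" are hypothetical.

```python
import tempfile
from pathlib import Path

# "tests" comes from the commit message; "__pycache__" is an illustrative extra.
SKIP_PARTS = frozenset({"tests", "__pycache__"})


def files_to_sync(root: Path) -> list[Path]:
    """Relative paths under `root`, minus anything inside a skipped directory."""
    selected = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        if path.is_file() and not SKIP_PARTS.intersection(relative.parts):
            selected.append(relative)
    return sorted(selected)


# Demo tree: one app file that should ship, one test file that should not.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tests").mkdir()
    (root / "tests" / "test_charts.py").write_text("")
    (root / "app.py").write_text("")
    synced = files_to_sync(root)
    print([p.as_posix() for p in synced])
```

Matching on path components rather than string prefixes also skips nested directories with a listed name, such as a `__pycache__` folder several levels down.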

HIGH
- envelope.py: Draft7Validator was constructed without a format_checker,
  meaning `format: date-time` was purely decorative; any string passed
  validation. Now uses Draft7Validator.FORMAT_CHECKER. `jsonschema` dep
  upgraded to `jsonschema[format]>=4.23.0` so rfc3339-validator is
  available. New test: bad timestamp on an otherwise-valid envelope fails
  with error path `[timestamp]`.
- publish.py: target_path now appends an 8-char sha256 prefix of the
  parquet payload (backwards-compatible when payload=None). Prevents
  silent overwrite when two runs in the same second produce different
  bytes. Policy is documented in-code.
- publish.py: introduce PublishError. rows_to_parquet and HF upload both
  raise it, and CLI catches PublishError instead of RuntimeError so
  HfHubHTTPError propagates through a single clean exit path (no more
  tracebacks on auth / rate-limit / network errors).
- converters/lm_eval.py: removed the \`group_subtasks\` duration fallback —
  that field maps group names to subtask name lists, not durations. Zero
  durations are now preserved (explicit `is None` check instead of
  `or`). All .get-chains are null-safe so an lm-eval output with an
  explicit `null` config.gen_kwargs no longer raises AttributeError.
- converters/lm_eval.py: _extract_timestamp validates ISO-8601 via regex
  before passing strings through; malformed timestamps log a warning and
  fall back to UTC now.
- .release-please-config.json + extra-files: add x-release-please-version
  anchor comments in src/mlx_benchmarks/__init__.py AND pyproject.toml so
  release-please actually bumps the version instead of skipping.
- docs/architecture.md, README.md, CLAUDE.md: remove references to the
  now-deleted codeql.yml workflow (the broken CodeQL badge on the README
  is replaced; the architecture doc notes CodeQL comes from the repo's
  default setup).
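
The payload-hash suffix described for target_path in the list above can be sketched as follows. The filename layout (data/run-<timestamp>-<slug>.parquet) and the parameter names are assumptions for illustration; only the 8-character sha256 prefix and the payload=None fallback come from the commit message.

```python
from __future__ import annotations

import hashlib
from datetime import datetime, timezone


def target_path(model_slug: str, timestamp: datetime, payload: bytes | None = None) -> str:
    """Deterministic shard path; identical inputs always give the same name."""
    stamp = timestamp.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    stem = f"data/run-{stamp}-{model_slug}"
    if payload is not None:
        # Different bytes written in the same second get different names.
        stem += "-" + hashlib.sha256(payload).hexdigest()[:8]
    return stem + ".parquet"


when = datetime(2026, 4, 24, 16, 3, 0, tzinfo=timezone.utc)
print(target_path("demo", when))
print(target_path("demo", when, b"first run"))
print(target_path("demo", when, b"second run"))
```

Because the suffix is derived from the content, re-publishing identical bytes maps to the same path, while differing bytes no longer overwrite each other silently.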

MEDIUM
- cli.py: --log-level now has explicit choices, so a typo no longer
  crashes with a traceback.
- test.yml: cancel-in-progress only on PRs so back-to-back main pushes
  keep their green signals.
- validate-schema.yml: push path filter mirrors PR filter so TOML changes
  merged to main still get validated.
- validate-schema.yml: installs `jsonschema[format]` so format-checker
  assertions the validator will soon rely on are available there too.
- dry-run-publish.yml: now builds the wheel and installs into a clean
  venv before running the CLI — catches packaging-layer bugs that
  editable installs hide (the fix in this very commit would not have
  been caught by the previous in-place variant).
- dry-run-publish.yml: concurrency group + cancel-in-progress added.
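
The --log-level fix in the first item relies on argparse's `choices` parameter, which rejects unknown values with a usage message and exit code 2 instead of letting them reach the logging setup. A minimal sketch; the level list, default, and build_parser name are assumptions, not the CLI's actual definition:

```python
import argparse

# Assumed set of accepted levels; the real CLI may expose a different list.
LOG_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mlx-bench-publish")
    parser.add_argument(
        "--log-level",
        choices=LOG_LEVELS,
        default="INFO",
        help="logging verbosity",
    )
    return parser


args = build_parser().parse_args(["--log-level", "DEBUG"])
print(args.log_level)
```
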

TESTS
- New test_format_checker_rejects_non_iso_timestamp covers the format
  checker gap.
- New test_target_path_includes_payload_hash asserts the collision-proof
  suffix.
- test_rows_to_parquet_rejects_empty and
  test_publish_skipping_validation_still_rejects_empty now expect
  PublishError (tighter than plain ValueError).
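
The single-exit-path idea behind PublishError can be illustrated without the HF client. In this sketch OSError stands in for HfHubHTTPError, JSON bytes stand in for Parquet, and the function names rows_to_payload, upload, and main are hypothetical; only the PublishError name and the "CLI catches one exception type" behavior come from the commit message.

```python
from __future__ import annotations

import json
import sys
from typing import Any, Callable


class PublishError(Exception):
    """One failure type for everything that can go wrong while publishing."""


def rows_to_payload(rows: list[dict[str, Any]]) -> bytes:
    if not rows:
        raise PublishError("refusing to publish zero rows")
    return json.dumps(rows).encode()


def upload(payload: bytes, uploader: Callable[[bytes], None]) -> None:
    try:
        uploader(payload)
    except OSError as exc:  # stand-in for the hub's HTTP error type
        raise PublishError(f"upload failed: {exc}") from exc


def main(rows: list[dict[str, Any]], uploader: Callable[[bytes], None]) -> int:
    """CLI-style entry point: one except clause, one clean exit code."""
    try:
        upload(rows_to_payload(rows), uploader)
    except PublishError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    return 0
```
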
Consolidate CI to follow the same conventions as nix-darwin, nix-ai,
nix-home, terraform-proxmox, ansible-proxmox, and ai-workflows.

Workflow consolidation:
- Replace 6 standalone workflows (test, dependency-review,
  dry-run-publish, validate-schema, plus existing release-please and
  deploy-space) with a single ci-gate.yml orchestrator that uses
  dorny/paths-filter for change detection and conditionally calls
  central reusable workflows from JacobPEvans/.github:
    * _python-security.yml — pip-audit on resolved deps
    * _osv-scan.yml         — multi-ecosystem OSV lockfile scan
    * _markdown-lint.yml    — markdownlint on .md files
    * _file-size.yml        — repo-wide size guard
- Inline jobs (python-test matrix, schema-validate, dry-run-publish)
  participate in the same merge gate via re-actors/alls-green.
- Convert release-please.yml to a thin wrapper around
  _release-please.yml@main with GH_ACTION_JACOBPEVANS_APP_ID +
  GH_APP_PRIVATE_KEY so release PRs trigger downstream pull_request
  workflows (GITHUB_TOKEN does not).
- Drop dependency-review.yml outright; OSV + pip-audit cover the
  same surface (full lockfile, multi-ecosystem) more comprehensively
  than the PR-diff-only GHSA scan.

Repo hygiene cleanup:
- Delete inherited templates (.github/ISSUE_TEMPLATE/bug.yml,
  config.yml, PULL_REQUEST_TEMPLATE.md) — they fall back to the
  JacobPEvans/.github community-health-files inheritance.
- Delete CODEOWNERS and .editorconfig — absent from every other
  surveyed repo.
- Update benchmark-request.yml to use canonical org label
  (type:feature) and add priority:* / size:* dropdowns so
  auto-label-issues.yml can extract them.

Config alignments:
- Rename .release-please-config.json -> release-please-config.json
  (no leading dot — the central reusable expects this filename).
- Extend renovate.json with
  local>JacobPEvans/.github:renovate-presets to inherit the trusted-org
  allow-list, automerge rules, and Nix/uv custom managers.
- Pre-commit check-yaml now uses --unsafe to accept lm-eval task
  files with custom !function tags (still validates YAML syntax).

Docs:
- README badges, README repo-shape diagram, docs/architecture.md
  CI table, and CLAUDE.md repo-shape block all updated to match.

(claude)
@JacobPEvans-personal JacobPEvans-personal merged commit 0876689 into main Apr 25, 2026
9 of 13 checks passed
@JacobPEvans-personal JacobPEvans-personal deleted the feat/production-polish branch April 25, 2026 03:00
This was referenced Apr 26, 2026
2 participants