diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..c1a56e4 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,98 @@ +# mlx-benchmarks — AI Agent Documentation + +Agent-facing notes for AI coding sessions in this repo. For the +human-facing overview, install instructions, and contribution guide see +[`README.md`](README.md), [`CONTRIBUTING.md`](CONTRIBUTING.md), and +[`docs/architecture.md`](docs/architecture.md). + +## Project overview + +Benchmark harness for MLX-quantized and locally-hosted LLMs on Apple Silicon. +Orchestration configs and the envelope v1 schema live here; results publish +to the [`JacobPEvans/mlx-benchmarks`](https://huggingface.co/datasets/JacobPEvans/mlx-benchmarks) +HF dataset, visualized at the +[`mlx-benchmarks-viewer`](https://huggingface.co/spaces/JacobPEvans/mlx-benchmarks-viewer) +HF Space. + +## Repository shape (short) + +```text +src/mlx_benchmarks/ Python package (envelope, publish, converters, CLI) +scripts/ Small one-shot tools (validator, space deploy, shim) +configs/ TOML per (tool, suite) pair — see configs/LAYOUT.md +harness/framework-eval/ Inline Python suites (agent frameworks) +schema.json Envelope v1 authoritative contract +examples/ Canonical valid + invalid envelope fixtures +tests/ Pytest suite with fixtures +space/ Gradio viewer (deployed to HF Space) +docs/ architecture.md, schema.md, journal/ (session notes) +.github/workflows/ ci-gate (test + lint + scan + dry-run-publish + + schema-validate via paths-filter), release-please + (wraps JacobPEvans/.github reusable), deploy-space + (CodeQL is via GitHub's default setup, not a workflow) +``` + +## Key conventions (non-negotiable) + +- **Envelope contract**: every published result validates against + `schema.json` inside `publish()`. Do not bypass with `validate=False` in + real runs. Breaking changes require a `$id` bump + `schema_version` + increment. +- **Unique filenames**: `data/run----.parquet`. + Never overwrite. Use `target_path(envelope)` to compute. +- **System detection**: always use `detect_system()`. Never hardcode system + metadata. +- **Main workflow uses the venv directly**: for the publisher and primary + evaluation commands, use `.venv/bin/lm_eval`, `.venv/bin/mlx-bench-publish`, + etc. rather than `uv run` / `uvx`. Exception: `harness/framework-eval/` + uses `uv run --with ...` because each per-framework script declares its + own dependencies via PEP 723 inline metadata. +- **Conventional commits**: `release-please` consumes them. `feat:` minor, + `fix:` patch. Never manually edit `CHANGELOG.md`. +- **Pre-commit must pass**: `.venv/bin/pre-commit run --all-files`. CI + re-runs ruff + ruff-format + mypy + pytest. Zero tolerance for `# noqa` + suppressions — fix the underlying issue. + +## Common tasks + +```bash +# Quality gates (run all before committing) +.venv/bin/ruff check . && .venv/bin/ruff format --check . +.venv/bin/mypy src/mlx_benchmarks +.venv/bin/pytest tests space/tests +.venv/bin/python scripts/validate_schema.py + +# Publish a run (dry-run first!) +.venv/bin/mlx-bench-publish run-output/<...>/results_*.json \ + --kind lm-eval --suite reasoning --dry-run +.venv/bin/mlx-bench-publish run-output/<...>/results_*.json \ + --kind lm-eval --suite reasoning + +# Run lm-eval smoke (adjust model + task) +BASE="http://localhost:11434/v1/chat/completions" +.venv/bin/lm_eval --model local-chat-completions \ + --model_args "base_url=$BASE,model=$MODEL,max_length=32768,timeout=3600" \ + --tasks gsm8k_cot_zeroshot --batch_size 1 --num_fewshot 0 --limit 10 \ + --gen_kwargs "max_gen_toks=4096" \ + --apply_chat_template --fewshot_as_multiturn --log_samples \ + --output_path ./run-output +``` + +## Environment requirements + +- macOS on Apple Silicon (inference); CI runs publisher on ubuntu-latest. +- `vllm-mlx` (via `llama-swap`) on `http://localhost:11434/v1`. The + `base_url` for lm-eval must include the full `/v1/chat/completions` + path — not just `/v1`. +- Python 3.11+. +- `HF_TOKEN` with write scope on the dataset namespace (for publish) and + on the space namespace (for deploy, stored as a repo secret). + +## Gotchas learned the hard way + +- **Model names**: verify against the live catalog + (`curl http://localhost:11434/v1/models`). Don't trust docs. +- **Never discard completed runs**: publish with `tags.caveat=...` and file + an issue rather than throwing away benchmark time. +- **Coding benchmarks execute model-generated code** — see + [`SECURITY.md`](SECURITY.md) before running outside a sandbox. diff --git a/CLAUDE.md b/CLAUDE.md index eb73edc..5a1f43a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,98 +1,3 @@ -# CLAUDE.md +# AI Agents Configuration -Agent-facing notes for Claude Code sessions in this repo. For the -human-facing overview, install instructions, and contribution guide see -[`README.md`](README.md), [`CONTRIBUTING.md`](CONTRIBUTING.md), and -[`docs/architecture.md`](docs/architecture.md). - -## Project overview - -Benchmark harness for MLX-quantized and locally-hosted LLMs on Apple Silicon. -Orchestration configs and the envelope v1 schema live here; results publish -to the [`JacobPEvans/mlx-benchmarks`](https://huggingface.co/datasets/JacobPEvans/mlx-benchmarks) -HF dataset, visualized at the -[`mlx-benchmarks-viewer`](https://huggingface.co/spaces/JacobPEvans/mlx-benchmarks-viewer) -HF Space. - -## Repository shape (short) - -```text -src/mlx_benchmarks/ Python package (envelope, publish, converters, CLI) -scripts/ Small one-shot tools (validator, space deploy, shim) -configs/ TOML per (tool, suite) pair — see configs/LAYOUT.md -harness/framework-eval/ Inline Python suites (agent frameworks) -schema.json Envelope v1 authoritative contract -examples/ Canonical valid + invalid envelope fixtures -tests/ Pytest suite with fixtures -space/ Gradio viewer (deployed to HF Space) -docs/ architecture.md, schema.md, journal/ (session notes) -.github/workflows/ ci-gate (test + lint + scan + dry-run-publish + - schema-validate via paths-filter), release-please - (wraps JacobPEvans/.github reusable), deploy-space - (CodeQL is via GitHub's default setup, not a workflow) -``` - -## Key conventions (non-negotiable) - -- **Envelope contract**: every published result validates against - `schema.json` inside `publish()`. Do not bypass with `validate=False` in - real runs. Breaking changes require a `$id` bump + `schema_version` - increment. -- **Unique filenames**: `data/run----.parquet`. - Never overwrite. Use `target_path(envelope)` to compute. -- **System detection**: always use `detect_system()`. Never hardcode system - metadata. -- **Main workflow uses the venv directly**: for the publisher and primary - evaluation commands, use `.venv/bin/lm_eval`, `.venv/bin/mlx-bench-publish`, - etc. rather than `uv run` / `uvx`. Exception: `harness/framework-eval/` - uses `uv run --with ...` because each per-framework script declares its - own dependencies via PEP 723 inline metadata. -- **Conventional commits**: `release-please` consumes them. `feat:` minor, - `fix:` patch. Never manually edit `CHANGELOG.md`. -- **Pre-commit must pass**: `.venv/bin/pre-commit run --all-files`. CI - re-runs ruff + ruff-format + mypy + pytest. Zero tolerance for `# noqa` - suppressions — fix the underlying issue. - -## Common tasks - -```bash -# Quality gates (run all before committing) -.venv/bin/ruff check . && .venv/bin/ruff format --check . -.venv/bin/mypy src/mlx_benchmarks -.venv/bin/pytest tests space/tests -.venv/bin/python scripts/validate_schema.py - -# Publish a run (dry-run first!) -.venv/bin/mlx-bench-publish run-output/<...>/results_*.json \ - --kind lm-eval --suite reasoning --dry-run -.venv/bin/mlx-bench-publish run-output/<...>/results_*.json \ - --kind lm-eval --suite reasoning - -# Run lm-eval smoke (adjust model + task) -BASE="http://localhost:11434/v1/chat/completions" -.venv/bin/lm_eval --model local-chat-completions \ - --model_args "base_url=$BASE,model=$MODEL,max_length=32768,timeout=3600" \ - --tasks gsm8k_cot_zeroshot --batch_size 1 --num_fewshot 0 --limit 10 \ - --gen_kwargs "max_gen_toks=4096" \ - --apply_chat_template --fewshot_as_multiturn --log_samples \ - --output_path ./run-output -``` - -## Environment requirements - -- macOS on Apple Silicon (inference); CI runs publisher on ubuntu-latest. -- `vllm-mlx` (via `llama-swap`) on `http://localhost:11434/v1`. The - `base_url` for lm-eval must include the full `/v1/chat/completions` - path — not just `/v1`. -- Python 3.11+. -- `HF_TOKEN` with write scope on the dataset namespace (for publish) and - on the space namespace (for deploy, stored as a repo secret). - -## Gotchas learned the hard way - -- **Model names**: verify against the live catalog - (`curl http://localhost:11434/v1/models`). Don't trust docs. -- **Never discard completed runs**: publish with `tags.caveat=...` and file - an issue rather than throwing away benchmark time. -- **Coding benchmarks execute model-generated code** — see - [`SECURITY.md`](SECURITY.md) before running outside a sandbox. +@AGENTS.md