A Claude Code skill (coptimize) that runs a measurement-driven optimization loop on a specific function, algorithm, or pipeline — one change per iteration, kept only if it improves the metric above the observed noise floor, reproduces on a second benchmark invocation, and preserves correctness against an immutable oracle. Every experiment, including discards, is logged with a reason.
Methodology distilled from 110+ iterations across primality-testing implementations in Java and C (5.3× over Google Guava, 3.0× over FLINT).
Clone anywhere, then symlink the repo root (which contains SKILL.md) into one of the Claude Code skill locations:
# Personal (all projects):
mkdir -p ~/.claude/skills && ln -s "$(pwd)" ~/.claude/skills/coptimize
# Project-scoped (committed with the repo):
mkdir -p <project>/.claude/skills && ln -s "$(pwd)" <project>/.claude/skills/coptimizeRun from inside the cloned repo so $(pwd) points at the directory with SKILL.md.
Or, once published as a plugin marketplace:
/plugin marketplace add <owner>/<repo>
/plugin install coptimize
Ask Claude Code to optimize something specific:
optimize the hot loop in src/parser/tokenizer.go — minimize ns/op
The skill auto-triggers on optimize / speed-up / reduce-memory / benchmark requests pointed at a concrete piece of code. You can also invoke it explicitly:
/coptimize
Then:
- Claude analyses the target, drafts a correctness test and a benchmark, and presents both — listing what is checked and how — for your approval. Nothing is written until you OK it.
- It snapshots the original code as the immutable oracle, dumps the environment, and runs a null experiment to prove the harness isn't biased.
- It iterates: one change, correctness against the oracle, benchmark, keep if
improvement_pct ≥ max(2%, 2×stddev)and reproduced on a second process, else discard. Every row logged. - It stops after 5 consecutive discards, a seed-revalidation gap, a known lower bound, or your
stop.
You can redirect at any time (try X next, skip to structural changes, stop).
- A specific function or module to optimize.
- A scalar metric (ns/op, ops/sec, peak RSS, p99, binary size — one number).
- A realistic input distribution, or enough context for Claude to pick one.
Claude provides the harness, the loop, the log, and the discipline.
claude-optimizer-skill/
├── SKILL.md main workflow: setup → baseline → loop → stop
├── correctness-rules.md what a trustworthy correctness harness covers
├── benchmark-rules.md what a trustworthy benchmark measures, incl. stratified inputs
├── log-template.md starting point for autoresearch-log.md
├── reproducibility.md env dumps + platform hardening (required reading)
└── templates/
├── c.c C / C++ harness (SplitMix64, do_not_optimize, perf-stat pointer)
├── python.py Python harness (gc-disabled measurement, pyperf pointer)
├── java-jmh.java JMH harness (pinned JVM args, Mode.SampleTime for p99)
├── csharp.cs BenchmarkDotNet harness (Xoshiro256**, MemoryDiagnoser)
└── other-languages.md 5-step analysis checklist for Go, Rust, JS/TS, Swift, Zig, Ruby, ...
- One change per iteration. Bundling hides regressions.
- Correctness before timing. A correctness failure is an instant discard — no debugging inside the loop.
- Oracle is immutable. It does not mutate across keeps. The correctness target at iter 50 is the same as at iter 0.
- Noise floor, not magic number. The
≥ 2%rule is only a lower bound; the real threshold is2 × stddev. - Reproducibility gate. A win counts only if a second cold benchmark process confirms it.
- Discards are logged with reasons. The rejection log is the most valuable artifact of a session.
MIT — see LICENSE.