Self-discovery optimization engine for AI skills. Autonomously finds weaknesses, generates evaluation criteria, runs experiments, and validates improvements — no manual eval definition required.
Inspired by karpathy/autoresearch.
Autoresearch implements a Science Loop — a 4-phase research cycle that iterates until the skill is optimized or budget is exhausted:
ANALYZE → HYPOTHESIZE → EXPERIMENT → VALIDATE → (loop or stop)
   ↑                                    │
   └──── RE-ANALYZE (every 3 exps) ─────┘
- Analyze — LLM reads the skill file, discovers weaknesses, and auto-generates binary eval criteria
- Hypothesize — ranks weaknesses by severity and failure frequency, generates a targeted fix hypothesis
- Experiment — applies the mutation, runs the skill against test scenarios, scores with LLM-as-judge and rule-based evals
- Validate — keeps the change only if score strictly improves with no per-eval regression (>20% drop triggers rollback)
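The Validate rule can be sketched as a small gate function. This is a minimal illustration, not the project's actual code: the `EvalScores` shape, the `shouldKeep` name, the use of a mean as the overall score, and the reading of ">20% drop" as an absolute drop of more than 0.2 in a per-eval pass rate are all assumptions.

```typescript
// Hypothetical shape: pass rate per eval name, each value in [0, 1].
type EvalScores = Record<string, number>;

// Assumed threshold, taken from the ">20% drop" wording and read as an
// absolute drop in pass rate (an interpretation, not confirmed by the source).
const MAX_EVAL_DROP = 0.2;

// Overall score assumed to be the unweighted mean of per-eval pass rates.
function mean(scores: EvalScores): number {
  const values = Object.values(scores);
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}

// Keep a mutation only if the overall score strictly improves and
// no single eval drops by more than the threshold.
function shouldKeep(before: EvalScores, after: EvalScores): boolean {
  if (mean(after) <= mean(before)) return false; // must strictly improve
  for (const [name, prev] of Object.entries(before)) {
    const next = after[name] ?? 0; // an eval missing after the change counts as 0
    if (prev - next > MAX_EVAL_DROP) return false; // per-eval regression: roll back
  }
  return true;
}
```

Under these assumptions, a change that raises the mean but drops one eval from 1.0 to 0.7 is rejected, which matches the "no per-eval regression" wording above.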
Key features:
- Self-discovery — finds what to optimize, not just how
- Re-analysis feedback loop — discovers new weaknesses from experiment results
- Eval calibration — automatically drops too-easy evals after baseline
- Rule-based evals — deterministic checks (regex, contains, word count) alongside LLM-as-judge
- Health Scan — batch-analyze a skill directory and auto-optimize the weakest
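To make the rule-based evals and eval calibration features concrete, here is a rough sketch. The `RuleEval` type, its field names, and both function names are hypothetical, and "too easy" is assumed to mean "already passes on every baseline output"; the project's real eval schema may differ.

```typescript
// Hypothetical rule shapes covering the three check kinds named above.
type RuleEval =
  | { kind: "regex"; pattern: string; flags?: string }
  | { kind: "contains"; text: string }
  | { kind: "wordCount"; min?: number; max?: number };

// Deterministic check: same output always gives the same verdict.
function runRuleEval(rule: RuleEval, output: string): boolean {
  switch (rule.kind) {
    case "regex":
      return new RegExp(rule.pattern, rule.flags).test(output);
    case "contains":
      return output.includes(rule.text); // case-sensitive substring match
    case "wordCount": {
      const words = output.trim().split(/\s+/).filter(Boolean).length;
      return words >= (rule.min ?? 0) && words <= (rule.max ?? Infinity);
    }
  }
}

// Calibration sketch: drop rules that already pass on every baseline output,
// since they cannot register an improvement (assumed definition of "too easy").
function dropTooEasy(rules: RuleEval[], baselineOutputs: string[]): RuleEval[] {
  if (baselineOutputs.length === 0) return rules; // nothing to calibrate against
  return rules.filter(
    (rule) => !baselineOutputs.every((out) => runRuleEval(rule, out)),
  );
}
```

Because these checks are deterministic, they cost no LLM calls and give the same score on every run, which is why they can sit alongside the noisier LLM-as-judge scores.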
# Prerequisites: Bun runtime + Claude CLI
bun install
# Optimize a skill file
bun run src/cli.ts optimize path/to/skill.md
# Optimize with custom evals
bun run src/cli.ts optimize skill.md --evals evals.json --max-experiments 10
# Scan all your skills for weaknesses
bun run src/cli.ts scan --scope own
# Scan and auto-optimize the weakest skill
bun run src/cli.ts scan --scope all --auto
Each optimization run generates:
- dashboard.html — interactive score chart
- diff.html — before/after diff with mutation rationale
- CHANGELOG.md — experiment log
- results.json — structured results for programmatic use
- TypeScript on Bun runtime
- Claude CLI (claude --print) as the LLM backend
- Zero external dependencies beyond Bun
bun install # Install dependencies
bun run build # Build the project
bun test # Run all tests (97 pass)
bun test <file> # Run a single test file
See CLAUDE.md for detailed architecture, file structure, and design decisions.
MIT