fshaan/Autoresearch


Autoresearch

Self-discovery optimization engine for AI skills. Autonomously finds weaknesses, generates evaluation criteria, runs experiments, and validates improvements — no manual eval definition required.

Inspired by karpathy/autoresearch.

How It Works

Autoresearch implements a Science Loop — a 4-phase research cycle that iterates until the skill is optimized or budget is exhausted:

ANALYZE → HYPOTHESIZE → EXPERIMENT → VALIDATE → (loop or stop)
    ↑                                    │
    └──── RE-ANALYZE (every 3 exps) ─────┘
  1. Analyze — LLM reads the skill file, discovers weaknesses, and auto-generates binary eval criteria
  2. Hypothesize — ranks weaknesses by severity and failure frequency, generates a targeted fix hypothesis
  3. Experiment — applies the mutation, runs the skill against test scenarios, scores with LLM-as-judge and rule-based evals
  4. Validate — keeps the change only if the overall score strictly improves and no individual eval's score drops by more than 20%; a larger per-eval drop triggers a rollback of the mutation
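The Validate gate in step 4 can be sketched as a small pure function. This is a minimal illustration of the acceptance rule described above; the interface and function names are assumptions, not the actual Autoresearch API:

```typescript
// Sketch of the Validate gate: keep a mutation only if the overall score
// strictly improves AND no individual eval regresses by more than 20%.
// (Illustrative only: names and shapes are assumptions, not the real API.)
interface EvalScore {
  name: string;
  before: number; // eval score before the mutation, in [0, 1]
  after: number;  // eval score after the mutation, in [0, 1]
}

function shouldKeepMutation(evals: EvalScore[]): boolean {
  const total = (key: "before" | "after") =>
    evals.reduce((sum, e) => sum + e[key], 0);

  // Overall score must strictly improve.
  if (total("after") <= total("before")) return false;

  // A relative drop greater than 20% on any single eval forces a rollback.
  return evals.every(
    (e) => e.before === 0 || (e.before - e.after) / e.before <= 0.2,
  );
}
```

Requiring strict improvement (not just non-regression) is what lets the loop terminate instead of accepting an endless stream of neutral mutations.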

Key features:

  • Self-discovery — finds what to optimize, not just how
  • Re-analysis feedback loop — discovers new weaknesses from experiment results
  • Eval calibration — automatically drops too-easy evals after baseline
  • Rule-based evals — deterministic checks (regex, contains, word count) alongside LLM-as-judge
  • Health Scan — batch-analyze a skill directory and auto-optimize the weakest
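The rule-based evals mentioned above can be pictured as deterministic predicates over a skill's output. A minimal sketch, assuming a simple discriminated-union rule shape (the `Rule` schema here is illustrative, not the project's actual format):

```typescript
// Deterministic rule-based evals: cheap pass/fail checks that run alongside
// LLM-as-judge scoring. The rule shapes below are assumptions for illustration.
type Rule =
  | { kind: "regex"; pattern: string }
  | { kind: "contains"; text: string }
  | { kind: "maxWords"; limit: number };

function runRule(rule: Rule, output: string): boolean {
  switch (rule.kind) {
    case "regex":
      return new RegExp(rule.pattern).test(output);
    case "contains":
      return output.includes(rule.text);
    case "maxWords":
      return output.trim().split(/\s+/).length <= rule.limit;
  }
}
```

Because these checks are deterministic, they cost nothing per experiment and give the judge a stable floor that LLM scoring noise cannot move.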

Quick Start

# Prerequisites: Bun runtime + Claude CLI
bun install

# Optimize a skill file
bun run src/cli.ts optimize path/to/skill.md

# Optimize with custom evals
bun run src/cli.ts optimize skill.md --evals evals.json --max-experiments 10

# Scan all your skills for weaknesses
bun run src/cli.ts scan --scope own

# Scan and auto-optimize the weakest skill
bun run src/cli.ts scan --scope all --auto

Output

Each optimization run generates:

  • dashboard.html — interactive score chart
  • diff.html — before/after diff with mutation rationale
  • CHANGELOG.md — experiment log
  • results.json — structured results for programmatic use
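Since results.json is intended for programmatic use, consuming it might look like the sketch below. The field names (`baselineScore`, `finalScore`, `experiments`) are hypothetical stand-ins, not the file's guaranteed schema:

```typescript
// Hypothetical shape of results.json -- the real field names may differ.
interface RunResults {
  baselineScore: number;
  finalScore: number;
  experiments: { id: number; kept: boolean }[];
}

// Parse the results payload and produce a one-line summary of the run.
function summarize(json: string): string {
  const r: RunResults = JSON.parse(json);
  const kept = r.experiments.filter((e) => e.kept).length;
  const delta = ((r.finalScore - r.baselineScore) * 100).toFixed(1);
  return `${kept}/${r.experiments.length} mutations kept, score change ${delta}%`;
}
```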

Tech Stack

  • TypeScript on Bun runtime
  • Claude CLI (claude --print) as the LLM backend
  • Zero external dependencies beyond Bun

Development

bun install            # Install dependencies
bun run build          # Build the project
bun test               # Run all tests (97 pass)
bun test <file>        # Run a single test file

Architecture

See CLAUDE.md for detailed architecture, file structure, and design decisions.

License

MIT
