Autoresearch — General-Purpose Self-Improving Optimization

Based on the Karpathy autoresearch pattern. While the original optimizes LLM training code and autoresearch-diagrams optimizes diagram prompts, this version works for any task with measurable output quality.

The Pattern

LOOP FOREVER:
  1. Generate N variations using current best prompt
  2. Score each against user-defined criteria (1-10)
  3. Keep if batch score improves, discard if not
  4. Mutate the prompt targeting weakest criteria
  5. Log everything → repeat

You define what to optimize and how to score it. The system evolves the prompt itself.

Use Cases

Cold email copy — optimize subject lines, body copy, CTAs
LinkedIn outreach — connection requests, DMs, comment hooks
Sales scripts — call openers, objection handling, closing lines
Landing page copy — headlines, hero text, value propositions
Content hooks — social media post openers, video intros
Prompt engineering — any prompt you want to systematically improve
Code generation — optimize code-gen prompts for quality/correctness

Quick Start

As a Claude Code Skill (Recommended)

# Install as a skill
git clone https://github.com/jbellsolutions/autoresearch-general.git ~/.claude/skills/autoresearch-general

# Then in any Claude Code session:
/autoresearch

Tell it what to optimize:

"Optimize my cold email subject lines for SaaS founders"
"Improve my LinkedIn connection request messages"
"Optimize my sales call opening script"

As a Standalone Script

# Install dependencies
pip install anthropic python-dotenv

# Set your API key
export ANTHROPIC_API_KEY="your-key"

# Interactive setup
python3 autoresearch.py --setup

# Run a single cycle (test)
python3 autoresearch.py --once

# Run 10 cycles
python3 autoresearch.py --cycles 10

# Run continuously (leave it overnight)
python3 autoresearch.py

Setup Questions

When you run setup, you'll configure 4 things:

What are you optimizing? — The task description
Eval criteria — Up to 5 criteria, each scored 1-10
Batch size — Variations per cycle (default 5)
Starting prompt — Your initial approach to improve from

Example Run

🔬 AUTORESEARCH — Cold email to SaaS founders about AI agent deployment
   Criteria: ['persuasiveness', 'personalization', 'brevity', 'conversion_potential']
   Batch size: 5

============================================================
RUN 1 | Best score: 0.0/40
============================================================
  Generating 5 variations...
  Evaluating against 4 criteria...
  Scores by criterion:
    persuasiveness      : 6.2/10 ██████░░░░
    personalization     : 5.8/10 █████░░░░░
    brevity             : 7.1/10 ███████░░░
    conversion_potential: 5.4/10 █████░░░░░

  ✅ KEPT — 24.5 > 0.0 (+24.5)
  Mutating prompt...

============================================================
RUN 2 | Best score: 24.5/40
============================================================
  ...
  ✅ KEPT — 28.3 > 24.5 (+3.8)

============================================================
RUN 3 | Best score: 28.3/40
============================================================
  ...
  ❌ DISCARDED — 26.1 ≤ 28.3

File Structure

autoresearch-general/
  SKILL.md              # Claude Code skill definition
  autoresearch.py       # Main generate → eval → mutate loop
  README.md             # This file
  requirements.txt      # Python dependencies
  data/                 # Created at runtime
    config.json         # Task config
    prompt.txt          # Current prompt being optimized
    best_prompt.txt     # Best prompt found so far
    state.json          # Loop state
    results.jsonl       # Experiment log
    outputs/
      run_001/          # Generated outputs per run
      run_002/

How Mutation Works

After each evaluation cycle:

Identify weakest criteria — which scored lowest on average?
Find failure patterns — what do low-scoring variations have in common?
Targeted rewrite — modify the prompt to specifically address weak areas
Preserve strengths — don't lose what's working well
One major change per mutation — isolate variables to know what worked

Cost

~$0.01-0.03 per evaluation (Claude scoring text outputs)
~$0.01 per mutation (prompt rewriting)
~$0.05-0.15 per cycle (5 variations)
At continuous running: ~$3-9/hour
Overnight (8hrs): ~$24-72

The Karpathy Autoresearch Family

Repo	Optimizes	Metric	Requires
karpathy/autoresearch	LLM training code	val_bpb	NVIDIA GPU
autoresearch-diagrams	Diagram prompts	Vision score	Gemini + Claude API
autoresearch-general (this repo)	Any text/prompt	User-defined criteria	Claude API

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
infrastructure/templates		infrastructure/templates
rules		rules
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
README.md		README.md
SKILL.md		SKILL.md
autoresearch.py		autoresearch.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Autoresearch — General-Purpose Self-Improving Optimization

The Pattern

Use Cases

Quick Start

As a Claude Code Skill (Recommended)

As a Standalone Script

Setup Questions

Example Run

File Structure

How Mutation Works

Cost

The Karpathy Autoresearch Family

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Autoresearch — General-Purpose Self-Improving Optimization

The Pattern

Use Cases

Quick Start

As a Claude Code Skill (Recommended)

As a Standalone Script

Setup Questions

Example Run

File Structure

How Mutation Works

Cost

The Karpathy Autoresearch Family

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages