Search. Evaluate. Evolve.
Mission-driven autonomous optimization for AlphaEvolve-style research and engineering loops.
🇺🇸 English · 🇨🇳 中文 · 🇯🇵 日本語 · 🇰🇷 한국어 · 🇪🇸 Español · 🇫🇷 Français
OpenResearch is a local autonomous optimization runtime for iterative research and engineering loops.
It lets you define an objective, constrain the editable surface, run bounded experiments, score the result, and keep a replayable state trail. In one repo, it supports both a generic mission-driven runtime and the original single-GPU train.py improvement loop.
Use these standalone landing pages if you prefer a localized overview:
- `docs/i18n/README.zh-CN.md`
- `docs/i18n/README.ja.md`
- `docs/i18n/README.ko.md`
- `docs/i18n/README.es.md`
- `docs/i18n/README.fr.md`
The English README.md remains the canonical source of truth, and the deeper technical docs are still primarily in English.
- Single repo, two surfaces: a general `openresearch` runtime plus the original OpenResearch Legacy training loop
- Objective-driven iteration: define a candidate surface, evaluator, budget, and optimization goal
- Search with memory: preserve round history, frontier state, artifacts, and reports under `.openresearch/`
- Multiple task families: benchmark tuning, prompt iteration, offline eval, repo patch eval, artifact transformation, and real GPU-backed training search
- Inspectable outputs: inspect saved state, replay artifacts, and generate shareable HTML reports
- Release-grade validation: packaged smoke paths and a real hard gate are part of the shipped workflow
- Mission-driven runtime: encode a task as a JSON mission instead of wiring a one-off script
- AlphaEvolve-style search: propose changes, run experiments, score them, and exploit the best frontier
- Replayable outputs: persist state, artifacts, reports, and proposal journals under `.openresearch/`
- Multiple adapter families: command benchmark, offline eval, prompt eval, repo patch eval, artifact transform, and real `train.py` search
- Real release gates: smoke and hard-gate scripts validate the shipped operator paths
| Dimension | AutoResearch | OpenResearch | Upgrade |
|---|---|---|---|
| Project shape | Original single-path repo centered on the `train.py` improvement loop | Dual-surface system: OpenResearch Legacy plus the generic `openresearch` runtime | Keeps the original path, but lifts it into a broader runtime/platform |
| Main user input | `program.md`, `prepare.py`, `train.py`, and direct repo edits | Mission JSON for the generic runtime, plus the legacy `train.py` path when needed | Adds a structured runtime input without removing the original workflow |
| Editable surface | Primarily `train.py` in one domain | Mission-defined `editable_paths`, multiple adapters, and still supports the real `train.py` loop | Moves from one fixed surface to configurable search surfaces |
| Execution model | Repeated short training runs in a focused single workflow | `mission run`, `mission orchestrate`, and `mission search` with bounded runtime/search control | Expands one loop into a reusable execution/search runtime |
| Search behavior | Improve-or-discard loop around model training | Outer search with `bootstrap`, `exploit_best`, `diversify`, and parent-refinement strategies | More explicit AlphaEvolve-style search policy instead of a single implicit loop |
| Search memory | Logs and results are the main operational memory | Persists `policy_state`, `population_entries`, `frontier_entries`, and `search_history` | Adds durable frontier/population state across search rounds |
| Proposal tracking | No comparable generic proposal journal surface | `proposal_journal.jsonl`, selected-target context, and replayable round records | Makes proposal lineage and target history inspectable |
| Reports and visualization | Manual log/result inspection | `mission inspect` plus HTML output from `mission visualize` | Adds shareable reporting and post-run diagnosis surfaces |
| Extensibility | Focused on the original training domain | Multiple adapter families: benchmark, prompt eval, offline eval, repo patch eval, artifact transform, and real `openresearch` | Turns the original loop into a more general optimization runtime |
| Release validation | Mostly operator-driven/manual | Packaged smoke gate plus real hard gate assertions | Adds explicit release-grade validation instead of only manual confidence |
In short: OpenResearch is a platformized, AlphaEvolve-style extension of the original AutoResearch workflow, while still preserving the original single-GPU path as OpenResearch Legacy.
Compared with the original AutoResearch loop, the current OpenResearch runtime adds several concrete search-plane behaviors:
- Outer-loop search progression: runs move from `bootstrap` into `exploit_best`, and release gates explicitly reject bootstrap-only search histories
- Frontier and population memory: the runtime persists frontier candidates, population entries, and search history instead of relying only on logs/results
- Target-aware exploitation: `refine_parent` can adapt target selection from prior proposal history rather than always choosing the next fixed mutation target
- Cooldown on stale or overused targets: recently non-improving or overused targets can be deprioritized so search does not get stuck repeating the same exploit
- Step shrinking after repeated regressions: repeated same-target failures or unstable depth-style changes trigger more conservative follow-up proposals
- Replay and inspection surfaces: proposal journals, saved search state, and HTML reports make the search process auditable after the run
This is not a claim that OpenResearch is literally AlphaEvolve. The point is that it now behaves much more like a reusable AlphaEvolve-style local search runtime than the original single-path AutoResearch loop.
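The cooldown and step-shrinking behaviors can be pictured as a small selection policy. The sketch below is purely illustrative: `select_target`, `next_step_size`, and the per-target `history` layout are hypothetical names, not APIs from the OpenResearch codebase.

```python
# Illustrative sketch of target cooldown and step shrinking.
# All names (select_target, next_step_size, history layout) are hypothetical.

COOLDOWN_ROUNDS = 3        # assumed: rounds a non-improving target is skipped
SHRINK_AFTER_FAILURES = 2  # assumed: failures on one target before shrinking


def select_target(targets, history, current_round):
    """Pick the next mutation target, skipping recently non-improving ones."""
    def eligible(target):
        last_fail = history.get(target, {}).get("last_failed_round")
        return last_fail is None or current_round - last_fail > COOLDOWN_ROUNDS

    # Fall back to all targets if everything is in cooldown.
    candidates = [t for t in targets if eligible(t)] or list(targets)
    # Among eligible targets, prefer the least-failed one.
    return min(candidates, key=lambda t: history.get(t, {}).get("failures", 0))


def next_step_size(target, history, base_step=1.0):
    """Halve the proposal step for each failure beyond the threshold."""
    failures = history.get(target, {}).get("failures", 0)
    if failures < SHRINK_AFTER_FAILURES:
        return base_step
    return base_step * 0.5 ** (failures - SHRINK_AFTER_FAILURES + 1)
```

The point of the sketch is the shape of the policy, not its constants: deprioritize what recently failed, and propose smaller changes where repeated regressions occurred.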
OpenResearch is useful when you have:
- a candidate surface you can edit or generate
- an evaluator that produces a numeric metric
- a bounded loop where better candidates can be accepted or discarded
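These three preconditions can be read as a minimal loop contract. The sketch below is not the OpenResearch API; `propose` and `evaluate` are hypothetical callables standing in for the candidate surface and the evaluator.

```python
# Minimal improve-or-discard loop over the contract described above.
# `propose` and `evaluate` are hypothetical stand-ins, not OpenResearch APIs.

def optimize(baseline, propose, evaluate, max_iterations, goal="maximize"):
    """Bounded loop: better candidates are kept, worse ones discarded."""
    best, best_score = baseline, evaluate(baseline)
    for _ in range(max_iterations):
        candidate = propose(best)
        score = evaluate(candidate)
        improved = score > best_score if goal == "maximize" else score < best_score
        if improved:
            best, best_score = candidate, score
    return best, best_score
```

Everything OpenResearch layers on top (adapters, frontier memory, journals) specializes some part of this contract; if your task cannot fill in `propose` and `evaluate`, it is probably a poor fit.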
Current examples in this repo include:
- improving a command-driven benchmark
- iterating on prompts and rubric-driven evaluation
- running offline evaluation harnesses
- evaluating repo-local patches with replayable artifacts
- transforming artifacts through a candidate generation loop
- running the real OpenResearch `train.py` search loop on GPU
OpenResearch is a good fit when you want to:
- improve a workflow through repeated, metric-driven trials instead of one-shot prompting
- optimize prompts, configs, or small editable repo surfaces with bounded execution
- run a patch-evaluate-decide loop and keep a full artifact trail
- compare candidate outputs against a fixed harness or benchmark
- explore AlphaEvolve-style local search on a real training workload
OpenResearch is a poor fit when:
- there is no executable evaluator
- success cannot be expressed as a metric or decision rule
- the editable surface is so broad that bounded iteration loses meaning
Requirements:
- Python `>=3.10`
- `uv`
- NVIDIA GPU, only if you want the real `openresearch` training search

Install dependencies:

```sh
uv sync
```

Run the safest end-to-end example:
```sh
uv run openresearch mission orchestrate examples/missions/command-benchmark-basic.json
uv run openresearch mission inspect examples/missions/command-benchmark-basic.json
uv run openresearch mission visualize examples/missions/command-benchmark-basic.json
```

You should get:
- persisted state under `.openresearch/`
- artifacts under the mission workspace
- a shareable HTML report such as `reports/command-benchmark-basic.html`
Use this when you want a generic, mission-driven runtime with:
`mission run`, `mission orchestrate`, `mission search`, `mission inspect`, `mission validate`, and `mission visualize`
Main paths:
- `packages/openresearch`
- `docs/openresearch/README.md`
- `examples/missions/`
- `scripts/release/generic_v1_smoke.sh`
- `scripts/release/openresearch_real_run_gate.sh`
Use this when you want the original single-GPU loop:
- prepare data
- edit `train.py`
- run short training jobs
- compare `val_bpb`
- keep or discard changes

Main paths:
- `apps/prepare.py`
- `apps/train.py`
- `apps/program.md`
- `packages/openresearch_legacy`
- `docs/openresearch-legacy/README.md`
- `examples/legacy/README.md`
The main user input to `openresearch` is a mission JSON file.
Minimal shape:

```json
{
  "id": "command-benchmark-basic",
  "adapter": {
    "kind": "command_benchmark"
  },
  "workspace": {
    "repo_root": ".",
    "editable_paths": ["prompts/**", "configs/**"]
  },
  "evaluation": {
    "primary_metric": "score",
    "goal": "maximize"
  },
  "budget": {
    "max_iterations": 3
  }
}
```

See the shipped examples in `examples/missions/`:
- `examples/missions/command-benchmark-basic.json`
- `examples/missions/offline-eval-harness-basic.json`
- `examples/missions/prompt-eval-harness-basic.json`
- `examples/missions/repo-patch-eval-harness-basic.json`
- `examples/missions/artifact-transform-harness-basic.json`
- `examples/missions/openresearch-openai-compatible.json`
- `examples/missions/template-basic.json`
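A quick way to sanity-check a mission file before running it is to load it and assert the minimal shape shown above. This is a hedged sketch, not the runtime's real validator: the field list mirrors only the minimal example, and `check_mission` is a hypothetical helper name.

```python
import json

# Minimal structural check mirroring the example mission shape above.
# Not the runtime's actual schema validation; it covers only the shown fields.
REQUIRED = {
    "id": str,
    "adapter": dict,
    "workspace": dict,
    "evaluation": dict,
    "budget": dict,
}


def check_mission(path):
    """Load a mission JSON and assert the minimal documented shape."""
    with open(path, encoding="utf-8") as f:
        mission = json.load(f)
    for field, kind in REQUIRED.items():
        assert isinstance(mission.get(field), kind), f"missing or bad {field!r}"
    assert mission["evaluation"].get("goal") in ("maximize", "minimize")
    assert mission["budget"].get("max_iterations", 0) >= 1
    return mission
```

For authoritative validation, prefer the shipped `mission validate` command; a check like this is only useful as a fast local pre-flight.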
```sh
uv run openresearch mission orchestrate examples/missions/command-benchmark-basic.json
uv run openresearch mission search examples/missions/command-benchmark-basic.json
uv run openresearch mission inspect examples/missions/command-benchmark-basic.json
uv run openresearch mission visualize examples/missions/command-benchmark-basic.json
```

```sh
uv run openresearch-prepare
uv run openresearch-train
```

Compatibility note:
- `uv run python apps/prepare.py`
- `uv run python apps/train.py`
- root `prepare.py`, `train.py`, `attention_kernel.py`, and `program.md`
Prerequisites:
- NVIDIA GPU
- `uv sync`
- `uv run openresearch-prepare`
- exported provider credentials: `OPENRESEARCH_LLM_BASE_URL` and `OPENRESEARCH_LLM_API_KEY`

Run:

```sh
uv run openresearch mission search examples/missions/openresearch-openai-compatible.json --fresh
```

For release-grade validation:

```sh
OPENRESEARCH_LLM_BASE_URL="$OPENAI_BASE_URL" \
OPENRESEARCH_LLM_API_KEY="$OPENAI_API_KEY" \
./scripts/release/openresearch_real_run_gate.sh examples/missions/openresearch-openai-compatible.json
```

Typical outputs include:
- persisted mission state under `.openresearch/<mission>.json`
- replayable artifacts under `artifacts/`
- HTML reports under `reports/`
- proposal histories such as `proposal_journal.jsonl`
- legacy run outputs such as `run.log` and `results.tsv`
Checked-in showcase artifacts:
- `release_examples/alphaevolve-long-horizon/real-run-summary.json`
- `release_examples/alphaevolve-long-horizon/results.tsv`
The public README now highlights one stronger showcase artifact instead of three weak short runs: a real historical AlphaEvolve-style search campaign with 87 rounds. The chart below plots best-so-far val_bpb across the full campaign, so the curve is monotonic by design while the raw per-round results remain linked below.
- Scenario: real historical `autoresearch/` OpenResearch Legacy search campaign with `87` proposal rounds
- Primary metric: `val_bpb`, goal `minimize`
- Frontier result: `best_val_bpb=5.82925`, `6` frontier improvements, `56` successful metric-bearing experiments
- Robustness signal: `8` crash rounds and `18` degraded rounds without losing the tracked frontier
- Display rule: the chart shows best-so-far `val_bpb`, not raw per-round values; this is the optimization frontier, not a claim that every raw round improved
- Artifacts: `release_examples/alphaevolve-long-horizon/real-run-summary.json`, `release_examples/alphaevolve-long-horizon/results.tsv`
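The best-so-far display rule is just a running minimum over the raw per-round metric (the goal here is minimize). The sketch below assumes nothing about the `results.tsv` layout beyond one `val_bpb` value per round; rounds without a metric (crashed or degraded) are represented as `None` and carry the previous frontier forward.

```python
def best_so_far(round_metrics, goal="minimize"):
    """Map raw per-round metrics to the running frontier value.

    `round_metrics` is an iterable of floats or None (None = no metric,
    e.g. a crashed round); None rounds keep the previous frontier.
    """
    frontier, curve = None, []
    for value in round_metrics:
        if value is not None:
            if frontier is None:
                frontier = value
            elif goal == "minimize":
                frontier = min(frontier, value)
            else:
                frontier = max(frontier, value)
        curve.append(frontier)
    return curve
```

This is why the published chart is monotonic by design: regressions and crashes change the raw series, but never move the frontier backward.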
```mermaid
flowchart TD
    A[User goal] --> B[Mission JSON]
    B --> C[Adapter kind]
    C --> D[mission run / orchestrate / search]
    D --> E[Generate candidate]
    E --> F[Execute experiment]
    F --> G[Evaluate metric]
    G --> H{Keep or discard?}
    H -->|keep| I[Persist state and artifacts]
    H -->|discard| I
    I --> J[Inspect / Replay / Visualize]
```
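The "persist state and artifacts" step in the flow above can be sketched as appending each round to a journal plus rewriting a state snapshot. The file names echo outputs this README mentions (`.openresearch/` state, `proposal_journal.jsonl`), but `persist_round` and the record fields are hypothetical, not the runtime's real persistence layer.

```python
import json
from pathlib import Path

# Hypothetical sketch of round persistence; not the OpenResearch internals.

def persist_round(state_dir, mission_id, record):
    state_dir = Path(state_dir)
    state_dir.mkdir(parents=True, exist_ok=True)
    # Append to a replayable journal: one JSON object per line.
    with open(state_dir / "proposal_journal.jsonl", "a", encoding="utf-8") as journal:
        journal.write(json.dumps(record) + "\n")
    # Rewrite the mission state snapshot with the full round history.
    state_path = state_dir / f"{mission_id}.json"
    state = json.loads(state_path.read_text()) if state_path.exists() else {"rounds": []}
    state["rounds"].append(record)
    state_path.write_text(json.dumps(state, indent=2))
    return state_path
```

Keeping both an append-only journal and a snapshot is what makes the keep-and-discard decisions auditable after the run: the journal preserves lineage, the snapshot gives the current view.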
For the real `openresearch` adapter, the AlphaEvolve-style behavior appears in the outer search loop:
- LLM-backed roles propose edits to `train.py`
- the runtime executes bounded training experiments
- `val_bpb` and runtime metrics are recorded
- the search policy exploits the best known frontier across rounds
- accepted history is persisted for replay and inspection
| Adapter | Typical candidate | Primary metric examples | Typical output |
|---|---|---|---|
| `openresearch` | `train.py` patch | `val_bpb` | results table, run log, proposal journal |
| `command_benchmark` | command/config change | `score`, `latency_ms` | benchmark metrics, run log |
| `offline_eval_harness` | eval candidate | `score`, `pass_rate` | eval summary, predictions |
| `prompt_eval_harness` | prompt/rubric change | `score`, `pass_rate` | summary, outputs |
| `repo_patch_eval_harness` | repo-local patch | task-defined metric | summary, diff artifact |
| `artifact_transform_harness` | transformed artifact | task-defined metric | transformed files, summary |
| `template` | adapter scaffold candidate | mission-defined metric | starter output surface |
Current validated snapshot:
- milestone notes: `docs/openresearch/release-notes-milestone1-v1.md`
- release validation guide: `docs/openresearch/release-validation.md`

Latest checked validation evidence in this repo includes:
- `437` passing tests
- packaged smoke gate passing
- real hard gate passing with `3` rounds and a numeric `best_val_bpb`
- Current milestone: `milestone1/` complete, version `V1`
- Stable user surfaces: `openresearch` runtime and OpenResearch Legacy
- Validated paths: mission runtime, inspect/visualize flow, packaged smoke gate, and real `openresearch` hard gate
- Hard requirements for the real search path: NVIDIA GPU plus OpenAI-compatible provider credentials
- Current boundaries: no old-name CLI compatibility shim, no formal citation metadata yet, and package-level standardized license metadata is still pending
Near-term improvements most aligned with the current codebase:
- harden public packaging and repository metadata
- add formal citation metadata and refine packaging/provenance metadata
- expand public-facing release examples and visualization assets
- continue improving generic adapter ergonomics and validation surfaces
- tighten contributor workflows and release documentation
**What is the difference between `openresearch` and OpenResearch Legacy?**
`openresearch` is the generic mission-driven runtime. OpenResearch Legacy is the original single-GPU workflow centered on preparing data, editing `train.py`, and comparing `val_bpb`.
**Do I need a GPU?**
Not for most shipped examples. Command benchmark, offline eval, prompt eval, repo patch eval, and artifact transform missions can be used without the real GPU-backed training path. A GPU is required for the real `openresearch` training search.
**What is the main user input?**
For the generic runtime, the main input is a mission JSON file under `examples/missions/` or your own equivalent mission spec. For the legacy path, the effective operator surface is `train.py` plus the legacy policy in `apps/program.md`.
**What kinds of tasks work best?**
Tasks work best when you have a constrained editable surface, an executable evaluation harness, and a metric that can decide whether a candidate is better or worse than a baseline.
**Is this AlphaEvolve?**
No. OpenResearch is an AlphaEvolve-style local search runtime: it proposes changes, runs bounded experiments, evaluates results, and iterates across rounds with persisted history. It is inspired by that style of optimization loop, not a literal reproduction of any proprietary internal system.
There is no formal `CITATION.cff` or paper release in this repository yet.
If you need to cite the project today, use a software-style reference such as:
```
OpenResearch contributors. OpenResearch: a local autonomous optimization runtime
for iterative research and engineering loops. Software, 2026.
```
If a canonical public repository URL, paper, or citation metadata is published later, this section should be updated to point to that authoritative source.
This repository now ships a root LICENSE file.
Important scope note:
- the `LICENSE` applies to original OpenResearch-authored contributions in this repository
- it does not grant additional rights for third-party or upstream-derived material beyond the terms granted by their original authors
- OpenResearch builds on the upstream `AutoResearch` project by Andrej Karpathy
- as of March 19, 2026, the upstream `AutoResearch` repository page did not appear to publish a `LICENSE` file, so downstream users should review upstream rights separately before reusing upstream-derived portions
If upstream licensing terms are later published or clarified, this repository LICENSE and this section should be updated accordingly.
We want to explicitly thank the upstream open-source project that helped make OpenResearch possible:
OpenResearch preserves the original single-GPU legacy loop as OpenResearch Legacy, while extending that line of work into a broader mission-driven runtime for replayable, AlphaEvolve-style local search.
If you want to contribute:
- start with `docs/user-quickstart.md` and `docs/openresearch/README.md`
- keep changes scoped to one surface when possible: `openresearch` runtime or OpenResearch Legacy
- update stable docs when you change the public operator path
Minimum validation before opening a patch:
```sh
uv run pytest -q
./scripts/release/generic_v1_smoke.sh
```

For runtime/docs-only changes, a smaller validation loop is usually enough while iterating:

```sh
uv run pytest tests/integration/test_docs_navigation.py -q
git diff --check
```

Start here:
- `docs/user-quickstart.md`
- `docs/openresearch/README.md`
- `docs/openresearch/quickstart.md`
- `docs/openresearch/mission-runtime.md`
Runtime docs:
- `docs/openresearch/adapter-authoring.md`
- `docs/openresearch/replay-and-inspect.md`
- `docs/openresearch/release-validation.md`
- `docs/openresearch/troubleshooting.md`
Legacy docs:
- `docs/openresearch-legacy/README.md`
- `docs/openresearch-legacy/operator-guide.md`
- `docs/openresearch-legacy/setup-and-data.md`
- `docs/openresearch-legacy/troubleshooting.md`
Historical archive:
- `docs/plans/README.md`