Search. Evaluate. Evolve.
Mission-driven autonomous optimization for AlphaEvolve-style research and engineering loops.
🇺🇸 English · 🇨🇳 中文 · 🇯🇵 日本語 · 🇰🇷 한국어 · 🇪🇸 Español · 🇫🇷 Français
OpenResearch is a local autonomous optimization runtime for iterative research and engineering loops.
It lets you define an objective, constrain the editable surface, run bounded experiments, score the result, and keep a replayable state trail. In one repo, it supports both a generic mission-driven runtime and the original single-GPU train.py improvement loop.
Use these standalone landing pages if you prefer a localized overview:
- `docs/i18n/README.zh-CN.md`
- `docs/i18n/README.ja.md`
- `docs/i18n/README.ko.md`
- `docs/i18n/README.es.md`
- `docs/i18n/README.fr.md`
The English README.md remains the canonical source of truth, and the deeper technical docs are still primarily in English.
- Single repo, two surfaces: a general `openresearch` runtime plus the original OpenResearch Legacy training loop
- Objective-driven iteration: define a candidate surface, evaluator, budget, and optimization goal
- Search with memory: preserve round history, frontier state, artifacts, and reports under `.openresearch/`
- Multiple task families: benchmark tuning, prompt iteration, offline eval, repo patch eval, artifact transformation, and real GPU-backed training search
- Inspectable outputs: inspect saved state, replay artifacts, and generate shareable HTML reports
- Release-grade validation: packaged smoke paths and a real hard gate are part of the shipped workflow
- Mission-driven runtime: encode a task as a JSON mission instead of wiring a one-off script
- AlphaEvolve-style search: propose changes, run experiments, score them, and exploit the best frontier
- Replayable outputs: persist state, artifacts, reports, and proposal journals under `.openresearch/`
- Multiple adapter families: command benchmark, offline eval, prompt eval, repo patch eval, artifact transform, and real `train.py` search
- Real release gates: smoke and hard-gate scripts validate the shipped operator paths
| Dimension | AutoResearch | OpenResearch | Upgrade |
|---|---|---|---|
| Project shape | Original single-path repo centered on the `train.py` improvement loop | Dual-surface system: OpenResearch Legacy plus the generic `openresearch` runtime | Keeps the original path, but lifts it into a broader runtime/platform |
| Main user input | `program.md`, `prepare.py`, `train.py`, and direct repo edits | Mission JSON for the generic runtime, plus the legacy `train.py` path when needed | Adds a structured runtime input without removing the original workflow |
| Editable surface | Primarily `train.py` in one domain | Mission-defined `editable_paths`, multiple adapters, and still supports the real `train.py` loop | Moves from one fixed surface to configurable search surfaces |
| Execution model | Repeated short training runs in a focused single workflow | `mission run`, `mission orchestrate`, and `mission search` with bounded runtime/search control | Expands one loop into a reusable execution/search runtime |
| Search behavior | Improve-or-discard loop around model training | Outer search with `bootstrap`, `exploit_best`, `diversify`, and parent-refinement strategies | More explicit AlphaEvolve-style search policy instead of a single implicit loop |
| Search memory | Logs and results are the main operational memory | Persists `policy_state`, `population_entries`, `frontier_entries`, and `search_history` | Adds durable frontier/population state across search rounds |
| Proposal tracking | No comparable generic proposal journal surface | `proposal_journal.jsonl`, selected-target context, and replayable round records | Makes proposal lineage and target history inspectable |
| Reports and visualization | Manual log/result inspection | `mission inspect` plus HTML output from `mission visualize` | Adds shareable reporting and post-run diagnosis surfaces |
| Extensibility | Focused on the original training domain | Multiple adapter families: benchmark, prompt eval, offline eval, repo patch eval, artifact transform, and real `openresearch` | Turns the original loop into a more general optimization runtime |
| Release validation | Mostly operator-driven/manual | Packaged smoke gate plus real hard gate assertions | Adds explicit release-grade validation instead of only manual confidence |
In short: OpenResearch is a platformized, AlphaEvolve-style extension of the original AutoResearch workflow, while still preserving the original single-GPU path as OpenResearch Legacy.
Compared with the original AutoResearch loop, the current OpenResearch runtime adds several concrete search-plane behaviors:
- Outer-loop search progression: runs move from `bootstrap` into `exploit_best`, and release gates explicitly reject bootstrap-only search histories
- Frontier and population memory: the runtime persists frontier candidates, population entries, and search history instead of relying only on logs/results
- Target-aware exploitation: `refine_parent` can adapt target selection from prior proposal history rather than always choosing the next fixed mutation target
- Cooldown on stale or overused targets: recently non-improving or overused targets can be deprioritized so search does not get stuck repeating the same exploit
- Step shrinking after repeated regressions: repeated same-target failures or unstable depth-style changes trigger more conservative follow-up proposals
- Replay and inspection surfaces: proposal journals, saved search state, and HTML reports make the search process auditable after the run
This is not a claim that OpenResearch is literally AlphaEvolve. The point is that it now behaves much more like a reusable AlphaEvolve-style local search runtime than the original single-path AutoResearch loop.
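The cooldown and step-shrinking behaviors can be pictured as a small selection policy. The sketch below is purely illustrative: `select_target`, `next_step_size`, and the per-target `history` layout are hypothetical names, not APIs from the OpenResearch codebase.

```python
# Illustrative sketch of target cooldown and step shrinking.
# All names (select_target, next_step_size, history layout) are hypothetical.

COOLDOWN_ROUNDS = 3        # assumed: rounds a non-improving target is skipped
SHRINK_AFTER_FAILURES = 2  # assumed: failures on one target before shrinking


def select_target(targets, history, current_round):
    """Pick the next mutation target, skipping recently non-improving ones."""
    def eligible(target):
        last_fail = history.get(target, {}).get("last_failed_round")
        return last_fail is None or current_round - last_fail > COOLDOWN_ROUNDS

    # Fall back to all targets if everything is in cooldown.
    candidates = [t for t in targets if eligible(t)] or list(targets)
    # Among eligible targets, prefer the least-failed one.
    return min(candidates, key=lambda t: history.get(t, {}).get("failures", 0))


def next_step_size(target, history, base_step=1.0):
    """Halve the proposal step for each failure beyond the threshold."""
    failures = history.get(target, {}).get("failures", 0)
    if failures < SHRINK_AFTER_FAILURES:
        return base_step
    return base_step * 0.5 ** (failures - SHRINK_AFTER_FAILURES + 1)
```

The point of the sketch is the shape of the policy, not its constants: deprioritize what recently failed, and propose smaller changes where repeated regressions occurred.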
OpenResearch is useful when you have:
- a candidate surface you can edit or generate
- an evaluator that produces a numeric metric
- a bounded loop where better candidates can be accepted or discarded
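These three preconditions can be read as a minimal loop contract. The sketch below is not the OpenResearch API; `propose` and `evaluate` are hypothetical callables standing in for the candidate surface and the evaluator.

```python
# Minimal improve-or-discard loop over the contract described above.
# `propose` and `evaluate` are hypothetical stand-ins, not OpenResearch APIs.

def optimize(baseline, propose, evaluate, max_iterations, goal="maximize"):
    """Bounded loop: better candidates are kept, worse ones discarded."""
    best, best_score = baseline, evaluate(baseline)
    for _ in range(max_iterations):
        candidate = propose(best)
        score = evaluate(candidate)
        improved = score > best_score if goal == "maximize" else score < best_score
        if improved:
            best, best_score = candidate, score
    return best, best_score
```

Everything OpenResearch layers on top (adapters, frontier memory, journals) specializes some part of this contract; if your task cannot fill in `propose` and `evaluate`, it is probably a poor fit.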
Current examples in this repo include:
- improving a command-driven benchmark
- iterating on prompts and rubric-driven evaluation
- running offline evaluation harnesses
- evaluating repo-local patches with replayable artifacts
- transforming artifacts through a candidate generation loop
- running the real OpenResearch `train.py` search loop on GPU
OpenResearch is a good fit when you want to:
- improve a workflow through repeated, metric-driven trials instead of one-shot prompting
- optimize prompts, configs, or small editable repo surfaces with bounded execution
- run a patch-evaluate-decide loop and keep a full artifact trail
- compare candidate outputs against a fixed harness or benchmark
- explore AlphaEvolve-style local search on a real training workload
OpenResearch is a poor fit when:
- there is no executable evaluator
- success cannot be expressed as a metric or decision rule
- the editable surface is so broad that bounded iteration loses meaning
Requirements:
- Python `>=3.10`
- `uv`
- NVIDIA GPU, only if you want the real `openresearch` training search

Install dependencies:

```sh
uv sync
```

Run the safest end-to-end example:
```sh
uv run openresearch mission orchestrate examples/missions/command-benchmark-basic.json
uv run openresearch mission inspect examples/missions/command-benchmark-basic.json
uv run openresearch mission visualize examples/missions/command-benchmark-basic.json
```

You should get:
- persisted state under `.openresearch/`
- artifacts under the mission workspace
- a shareable HTML report such as `reports/command-benchmark-basic.html`
Use this when you want a generic, mission-driven runtime with:
`mission run`, `mission orchestrate`, `mission search`, `mission inspect`, `mission validate`, and `mission visualize`
Main paths:
- `packages/openresearch`
- `docs/openresearch/README.md`
- `examples/missions/`
- `scripts/release/generic_v1_smoke.sh`
- `scripts/release/openresearch_real_run_gate.sh`
Use this when you want the original single-GPU loop:
- prepare data
- edit `train.py`
- run short training jobs
- compare `val_bpb`
- keep or discard changes

Main paths:
- `apps/prepare.py`
- `apps/train.py`
- `apps/program.md`
- `packages/openresearch_legacy`
- `docs/openresearch-legacy/README.md`
- `examples/legacy/README.md`
The main user input to `openresearch` is a mission JSON file.
Minimal shape:

```json
{
  "id": "command-benchmark-basic",
  "adapter": {
    "kind": "command_benchmark"
  },
  "workspace": {
    "repo_root": ".",
    "editable_paths": ["prompts/**", "configs/**"]
  },
  "evaluation": {
    "primary_metric": "score",
    "goal": "maximize"
  },
  "budget": {
    "max_iterations": 3
  }
}
```

See the shipped examples in `examples/missions/`:
- `examples/missions/command-benchmark-basic.json`
- `examples/missions/offline-eval-harness-basic.json`
- `examples/missions/prompt-eval-harness-basic.json`
- `examples/missions/repo-patch-eval-harness-basic.json`
- `examples/missions/artifact-transform-harness-basic.json`
- `examples/missions/openresearch-openai-compatible.json`
- `examples/missions/template-basic.json`
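A quick way to sanity-check a mission file before running it is to load it and assert the minimal shape shown above. This is a hedged sketch, not the runtime's real validator: the field list mirrors only the minimal example, and `check_mission` is a hypothetical helper name.

```python
import json

# Minimal structural check mirroring the example mission shape above.
# Not the runtime's actual schema validation; it covers only the shown fields.
REQUIRED = {
    "id": str,
    "adapter": dict,
    "workspace": dict,
    "evaluation": dict,
    "budget": dict,
}


def check_mission(path):
    """Load a mission JSON and assert the minimal documented shape."""
    with open(path, encoding="utf-8") as f:
        mission = json.load(f)
    for field, kind in REQUIRED.items():
        assert isinstance(mission.get(field), kind), f"missing or bad {field!r}"
    assert mission["evaluation"].get("goal") in ("maximize", "minimize")
    assert mission["budget"].get("max_iterations", 0) >= 1
    return mission
```

For authoritative validation, prefer the shipped `mission validate` command; a check like this is only useful as a fast local pre-flight.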
```sh
uv run openresearch mission orchestrate examples/missions/command-benchmark-basic.json
uv run openresearch mission search examples/missions/command-benchmark-basic.json
uv run openresearch mission inspect examples/missions/command-benchmark-basic.json
uv run openresearch mission visualize examples/missions/command-benchmark-basic.json
```

```sh
uv run openresearch-prepare
uv run openresearch-train
```

Compatibility note:
- `uv run python apps/prepare.py`
- `uv run python apps/train.py`
- root `prepare.py`, `train.py`, `attention_kernel.py`, and `program.md`
Prerequisites:
- NVIDIA GPU
- `uv sync`
- `uv run openresearch-prepare`
- exported provider credentials: `OPENRESEARCH_LLM_BASE_URL` and `OPENRESEARCH_LLM_API_KEY`

Run:

```sh
uv run openresearch mission search examples/missions/openresearch-openai-compatible.json --fresh
```

For release-grade validation:

```sh
OPENRESEARCH_LLM_BASE_URL="$OPENAI_BASE_URL" \
OPENRESEARCH_LLM_API_KEY="$OPENAI_API_KEY" \
./scripts/release/openresearch_real_run_gate.sh examples/missions/openresearch-openai-compatible.json
```

Typical outputs include:
- persisted mission state under `.openresearch/<mission>.json`
- replayable artifacts under `artifacts/`
- HTML reports under `reports/`
- proposal histories such as `proposal_journal.jsonl`
- legacy run outputs such as `run.log` and `results.tsv`
Checked-in showcase artifacts:
- `release_examples/alphaevolve-long-horizon/real-run-summary.json`
- `release_examples/alphaevolve-long-horizon/results.tsv`
The public README now highlights one stronger showcase artifact instead of three weak short runs: a real historical AlphaEvolve-style search campaign with 87 rounds. The chart below plots best-so-far val_bpb across the full campaign, so the curve is monotonic by design while the raw per-round results remain linked below.
- Scenario: real historical `autoresearch/` OpenResearch Legacy search campaign with `87` proposal rounds
- Primary metric: `val_bpb`, goal `minimize`
- Frontier result: `best_val_bpb=5.82925`, `6` frontier improvements, `56` successful metric-bearing experiments
- Robustness signal: `8` crash rounds and `18` degraded rounds without losing the tracked frontier
- Display rule: the chart shows best-so-far `val_bpb`, not raw per-round values; this is the optimization frontier, not a claim that every raw round improved
- Artifacts: `release_examples/alphaevolve-long-horizon/real-run-summary.json`, `release_examples/alphaevolve-long-horizon/results.tsv`
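The best-so-far display rule is just a running minimum over the raw per-round metric (the goal here is minimize). The sketch below assumes nothing about the `results.tsv` layout beyond one `val_bpb` value per round; rounds without a metric (crashed or degraded) are represented as `None` and carry the previous frontier forward.

```python
def best_so_far(round_metrics, goal="minimize"):
    """Map raw per-round metrics to the running frontier value.

    `round_metrics` is an iterable of floats or None (None = no metric,
    e.g. a crashed round); None rounds keep the previous frontier.
    """
    frontier, curve = None, []
    for value in round_metrics:
        if value is not None:
            if frontier is None:
                frontier = value
            elif goal == "minimize":
                frontier = min(frontier, value)
            else:
                frontier = max(frontier, value)
        curve.append(frontier)
    return curve
```

This is why the published chart is monotonic by design: regressions and crashes change the raw series, but never move the frontier backward.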
```mermaid
flowchart TD
    A[User goal] --> B[Mission JSON]
    B --> C[Adapter kind]
    C --> D[mission run / orchestrate / search]
    D --> E[Generate candidate]
    E --> F[Execute experiment]
    F --> G[Evaluate metric]
    G --> H{Keep or discard?}
    H -->|keep| I[Persist state and artifacts]
    H -->|discard| I
    I --> J[Inspect / Replay / Visualize]
```
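The "persist state and artifacts" step in the flow above can be sketched as appending each round to a journal plus rewriting a state snapshot. The file names echo outputs this README mentions (`.openresearch/` state, `proposal_journal.jsonl`), but `persist_round` and the record fields are hypothetical, not the runtime's real persistence layer.

```python
import json
from pathlib import Path

# Hypothetical sketch of round persistence; not the OpenResearch internals.

def persist_round(state_dir, mission_id, record):
    state_dir = Path(state_dir)
    state_dir.mkdir(parents=True, exist_ok=True)
    # Append to a replayable journal: one JSON object per line.
    with open(state_dir / "proposal_journal.jsonl", "a", encoding="utf-8") as journal:
        journal.write(json.dumps(record) + "\n")
    # Rewrite the mission state snapshot with the full round history.
    state_path = state_dir / f"{mission_id}.json"
    state = json.loads(state_path.read_text()) if state_path.exists() else {"rounds": []}
    state["rounds"].append(record)
    state_path.write_text(json.dumps(state, indent=2))
    return state_path
```

Keeping both an append-only journal and a snapshot is what makes the keep-and-discard decisions auditable after the run: the journal preserves lineage, the snapshot gives the current view.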
For the real `openresearch` adapter, the AlphaEvolve-style behavior appears in the outer search loop:
- LLM-backed roles propose edits to `train.py`
- the runtime executes bounded training experiments
- `val_bpb` and runtime metrics are recorded
- the search policy exploits the best known frontier across rounds
- accepted history is persisted for replay and inspection
| Adapter | Typical candidate | Primary metric examples | Typical output |
|---|---|---|---|
| `openresearch` | `train.py` patch | `val_bpb` | results table, run log, proposal journal |
| `command_benchmark` | command/config change | `score`, `latency_ms` | benchmark metrics, run log |
| `offline_eval_harness` | eval candidate | `score`, `pass_rate` | eval summary, predictions |
| `prompt_eval_harness` | prompt/rubric change | `score`, `pass_rate` | summary, outputs |
| `repo_patch_eval_harness` | repo-local patch | task-defined metric | summary, diff artifact |
| `artifact_transform_harness` | transformed artifact | task-defined metric | transformed files, summary |
| `template` | adapter scaffold candidate | mission-defined metric | starter output surface |
Current validated snapshot:
- milestone notes: `docs/openresearch/release-notes-milestone1-v1.md`
- release validation guide: `docs/openresearch/release-validation.md`

Latest checked validation evidence in this repo includes:
- `437` passing tests
- packaged smoke gate passing
- real hard gate passing with `3` rounds and a numeric `best_val_bpb`
- Current milestone: `milestone1/` complete, version `V1`
- Stable user surfaces: `openresearch` runtime and OpenResearch Legacy
- Validated paths: mission runtime, inspect/visualize flow, packaged smoke gate, and real `openresearch` hard gate
- Hard requirements for the real search path: NVIDIA GPU plus OpenAI-compatible provider credentials
- Current boundaries: no old-name CLI compatibility shim, no formal citation metadata yet, and package-level standardized license metadata is still pending
Near-term improvements most aligned with the current codebase:
- harden public packaging and repository metadata
- add formal citation metadata and refine packaging/provenance metadata
- expand public-facing release examples and visualization assets
- continue improving generic adapter ergonomics and validation surfaces
- tighten contributor workflows and release documentation
**What is the difference between `openresearch` and OpenResearch Legacy?**
`openresearch` is the generic mission-driven runtime. OpenResearch Legacy is the original single-GPU workflow centered on preparing data, editing `train.py`, and comparing `val_bpb`.
**Do I need a GPU?**
Not for most shipped examples. Command benchmark, offline eval, prompt eval, repo patch eval, and artifact transform missions can be used without the real GPU-backed training path. A GPU is required for the real `openresearch` training search.
**What is the main user input?**
For the generic runtime, the main input is a mission JSON file under `examples/missions/` or your own equivalent mission spec. For the legacy path, the effective operator surface is `train.py` plus the legacy policy in `apps/program.md`.
**What kinds of tasks work best?**
Tasks work best when you have a constrained editable surface, an executable evaluation harness, and a metric that can decide whether a candidate is better or worse than a baseline.
**Is this AlphaEvolve?**
No. OpenResearch is an AlphaEvolve-style local search runtime: it proposes changes, runs bounded experiments, evaluates results, and iterates across rounds with persisted history. It is inspired by that style of optimization loop, not a literal reproduction of any proprietary internal system.
There is no formal `CITATION.cff` or paper release in this repository yet.
If you need to cite the project today, use a software-style reference such as:
```
OpenResearch contributors. OpenResearch: a local autonomous optimization runtime
for iterative research and engineering loops. Software, 2026.
```
If a canonical public repository URL, paper, or citation metadata is published later, this section should be updated to point to that authoritative source.
This repository now ships a root LICENSE file.
Important scope note:
- the `LICENSE` applies to original OpenResearch-authored contributions in this repository
- it does not grant additional rights for third-party or upstream-derived material beyond the terms granted by their original authors
- OpenResearch builds on the upstream `AutoResearch` project by Andrej Karpathy
- as of March 19, 2026, the upstream `AutoResearch` repository page did not appear to publish a `LICENSE` file, so downstream users should review upstream rights separately before reusing upstream-derived portions
If upstream licensing terms are later published or clarified, this repository LICENSE and this section should be updated accordingly.
We want to explicitly thank the upstream open-source project that helped make OpenResearch possible:
OpenResearch preserves the original single-GPU legacy loop as OpenResearch Legacy, while extending that line of work into a broader mission-driven runtime for replayable, AlphaEvolve-style local search.
If you want to contribute:
- start with `docs/user-quickstart.md` and `docs/openresearch/README.md`
- keep changes scoped to one surface when possible: `openresearch` runtime or OpenResearch Legacy
- update stable docs when you change the public operator path
Minimum validation before opening a patch:
```sh
uv run pytest -q
./scripts/release/generic_v1_smoke.sh
```

For runtime/docs-only changes, a smaller validation loop is usually enough while iterating:

```sh
uv run pytest tests/integration/test_docs_navigation.py -q
git diff --check
```

Start here:
- `docs/user-quickstart.md`
- `docs/openresearch/README.md`
- `docs/openresearch/quickstart.md`
- `docs/openresearch/mission-runtime.md`
Runtime docs:
- `docs/openresearch/adapter-authoring.md`
- `docs/openresearch/replay-and-inspect.md`
- `docs/openresearch/release-validation.md`
- `docs/openresearch/troubleshooting.md`
Legacy docs:
- `docs/openresearch-legacy/README.md`
- `docs/openresearch-legacy/operator-guide.md`
- `docs/openresearch-legacy/setup-and-data.md`
- `docs/openresearch-legacy/troubleshooting.md`
Historical archive:
- `docs/plans/README.md`