πŸ”¬ OpenResearch

Search. Evaluate. Evolve.

Mission-driven autonomous optimization for AlphaEvolve-style research and engineering loops.

πŸ‡ΊπŸ‡Έ English Β· πŸ‡¨πŸ‡³ δΈ­ζ–‡ Β· πŸ‡―πŸ‡΅ ζ—₯本θͺž Β· πŸ‡°πŸ‡· ν•œκ΅­μ–΄ Β· πŸ‡ͺπŸ‡Έ EspaΓ±ol Β· πŸ‡«πŸ‡· FranΓ§ais


OpenResearch is a local autonomous optimization runtime for iterative research and engineering loops.

It lets you define an objective, constrain the editable surface, run bounded experiments, score the result, and keep a replayable state trail. In one repo, it supports both a generic mission-driven runtime and the original single-GPU train.py improvement loop.

🌐 Language Versions

Use these standalone landing pages if you prefer a localized overview:

  • docs/i18n/README.zh-CN.md
  • docs/i18n/README.ja.md
  • docs/i18n/README.ko.md
  • docs/i18n/README.es.md
  • docs/i18n/README.fr.md

The English README.md remains the canonical source of truth, and the deeper technical docs are still primarily in English.

✨ Highlights

  • Single repo, two surfaces: a general openresearch runtime plus the original OpenResearch Legacy training loop
  • Objective-driven iteration: define a candidate surface, evaluator, budget, and optimization goal
  • Search with memory: preserve round history, frontier state, artifacts, and reports under .openresearch/
  • Multiple task families: benchmark tuning, prompt iteration, offline eval, repo patch eval, artifact transformation, and real GPU-backed training search
  • Inspectable outputs: inspect saved state, replay artifacts, and generate shareable HTML reports
  • Release-grade validation: packaged smoke paths and a real hard gate are part of the shipped workflow

πŸš€ Why OpenResearch

  • Mission-driven runtime: encode a task as a JSON mission instead of wiring a one-off script
  • AlphaEvolve-style search: propose changes, run experiments, score them, and exploit the best frontier
  • Replayable outputs: persist state, artifacts, reports, and proposal journals under .openresearch/
  • Multiple adapter families: command benchmark, offline eval, prompt eval, repo patch eval, artifact transform, and real train.py search
  • Real release gates: smoke and hard-gate scripts validate the shipped operator paths

πŸ†š OpenResearch vs. AutoResearch

| Dimension | AutoResearch | OpenResearch | Upgrade |
| --- | --- | --- | --- |
| Project shape | Original single-path repo centered on the train.py improvement loop | Dual-surface system: OpenResearch Legacy plus the generic openresearch runtime | Keeps the original path, but lifts it into a broader runtime/platform |
| Main user input | program.md, prepare.py, train.py, and direct repo edits | Mission JSON for the generic runtime, plus the legacy train.py path when needed | Adds a structured runtime input without removing the original workflow |
| Editable surface | Primarily train.py in one domain | Mission-defined editable_paths, multiple adapters, and still supports the real train.py loop | Moves from one fixed surface to configurable search surfaces |
| Execution model | Repeated short training runs in a focused single workflow | mission run, mission orchestrate, and mission search with bounded runtime/search control | Expands one loop into a reusable execution/search runtime |
| Search behavior | Improve-or-discard loop around model training | Outer search with bootstrap, exploit_best, diversify, and parent-refinement strategies | More explicit AlphaEvolve-style search policy instead of a single implicit loop |
| Search memory | Logs and results are the main operational memory | Persists policy_state, population_entries, frontier_entries, and search_history | Adds durable frontier/population state across search rounds |
| Proposal tracking | No comparable generic proposal journal surface | proposal_journal.jsonl, selected-target context, and replayable round records | Makes proposal lineage and target history inspectable |
| Reports and visualization | Manual log/result inspection | mission inspect plus HTML output from mission visualize | Adds shareable reporting and post-run diagnosis surfaces |
| Extensibility | Focused on the original training domain | Multiple adapter families: benchmark, prompt eval, offline eval, repo patch eval, artifact transform, and real openresearch | Turns the original loop into a more general optimization runtime |
| Release validation | Mostly operator-driven/manual | Packaged smoke gate plus real hard gate assertions | Adds explicit release-grade validation instead of only manual confidence |

In short: OpenResearch is a platformized, AlphaEvolve-style extension of the original AutoResearch workflow, while still preserving the original single-GPU path as OpenResearch Legacy.

🧬 AlphaEvolve-Style Upgrades

Compared with the original AutoResearch loop, the current OpenResearch runtime adds several concrete search-plane behaviors:

  • Outer-loop search progression: runs move from bootstrap into exploit_best, and release gates explicitly reject bootstrap-only search histories
  • Frontier and population memory: the runtime persists frontier candidates, population entries, and search history instead of relying only on logs/results
  • Target-aware exploitation: refine_parent can adapt target selection from prior proposal history rather than always choosing the next fixed mutation target
  • Cooldown on stale or overused targets: recently non-improving or overused targets can be deprioritized so search does not get stuck repeating the same exploit
  • Step shrinking after repeated regressions: repeated same-target failures or unstable depth-style changes trigger more conservative follow-up proposals
  • Replay and inspection surfaces: proposal journals, saved search state, and HTML reports make the search process auditable after the run

This is not a claim that OpenResearch is literally AlphaEvolve. The point is that it now behaves much more like a reusable AlphaEvolve-style local search runtime than the original single-path AutoResearch loop.
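
As a concrete illustration of the cooldown behavior described above, here is a toy target selector: recently non-improving targets are skipped for a few rounds, and otherwise the least-used target is picked. All names (TargetSelector, pick, record) are hypothetical and are not the runtime's actual API.

```python
from collections import defaultdict

# Toy sketch of cooldown-based target selection; names are hypothetical.
class TargetSelector:
    def __init__(self, targets, cooldown_rounds=2):
        self.targets = list(targets)
        self.cooldown_rounds = cooldown_rounds
        self.last_failed_round = {}   # target -> round of last non-improving attempt
        self.uses = defaultdict(int)  # target -> number of attempts so far

    def pick(self, round_idx):
        """Prefer targets that are not cooling down, then the least-used one."""
        def available(t):
            failed_at = self.last_failed_round.get(t)
            return failed_at is None or round_idx - failed_at >= self.cooldown_rounds
        pool = [t for t in self.targets if available(t)] or self.targets
        choice = min(pool, key=lambda t: self.uses[t])
        self.uses[choice] += 1
        return choice

    def record(self, target, round_idx, improved):
        """A non-improving round puts the target on cooldown."""
        if not improved:
            self.last_failed_round[target] = round_idx
```

The effect is that a stale target cannot monopolize exploitation: after a failed attempt it is deprioritized until its cooldown expires.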

🎯 What It Can Run

OpenResearch is useful when you have:

  • a candidate surface you can edit or generate
  • an evaluator that produces a numeric metric
  • a bounded loop where better candidates can be accepted or discarded
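
These three ingredients can be sketched as a tiny accept-or-discard loop. The names below (Candidate, bounded_loop) are illustrative only, not the OpenResearch API; they just show how a surface, an evaluator, and a budget compose.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative names only; this is not the OpenResearch API.
@dataclass
class Candidate:
    payload: str  # the editable surface: a prompt, config, or patch text

Evaluator = Callable[[Candidate], float]  # produces a numeric metric

def bounded_loop(propose: Callable[[Candidate], Candidate],
                 evaluate: Evaluator,
                 baseline: Candidate,
                 max_iterations: int = 3,
                 goal: str = "maximize") -> tuple[Candidate, float]:
    """Accept a candidate only when it beats the current best; otherwise discard."""
    best, best_score = baseline, evaluate(baseline)
    for _ in range(max_iterations):
        cand = propose(best)
        score = evaluate(cand)
        better = score > best_score if goal == "maximize" else score < best_score
        if better:
            best, best_score = cand, score
    return best, best_score
```
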

Current examples in this repo include:

  • improving a command-driven benchmark
  • iterating on prompts and rubric-driven evaluation
  • running offline evaluation harnesses
  • evaluating repo-local patches with replayable artifacts
  • transforming artifacts through a candidate generation loop
  • running the real OpenResearch train.py search loop on GPU

🧭 Use Cases

OpenResearch is a good fit when you want to:

  • improve a workflow through repeated, metric-driven trials instead of one-shot prompting
  • optimize prompts, configs, or small editable repo surfaces with bounded execution
  • run a patch-evaluate-decide loop and keep a full artifact trail
  • compare candidate outputs against a fixed harness or benchmark
  • explore AlphaEvolve-style local search on a real training workload

OpenResearch is a poor fit when:

  • there is no executable evaluator
  • success cannot be expressed as a metric or decision rule
  • the editable surface is so broad that bounded iteration loses meaning

⚑ 30-Second Quickstart

Requirements:

  • Python >=3.10
  • uv
  • NVIDIA GPU only if you want the real openresearch training search

Install dependencies:

uv sync

Run the safest end-to-end example:

uv run openresearch mission orchestrate examples/missions/command-benchmark-basic.json
uv run openresearch mission inspect examples/missions/command-benchmark-basic.json
uv run openresearch mission visualize examples/missions/command-benchmark-basic.json

You should get:

  • persisted state under .openresearch/
  • artifacts under the mission workspace
  • a shareable HTML report such as reports/command-benchmark-basic.html

🧱 Two Product Surfaces

1️⃣ openresearch Runtime

Use this when you want a generic, mission-driven runtime with:

  • mission run
  • mission orchestrate
  • mission search
  • mission inspect
  • mission validate
  • mission visualize

Main paths:

  • packages/openresearch
  • docs/openresearch/README.md
  • examples/missions/
  • scripts/release/generic_v1_smoke.sh
  • scripts/release/openresearch_real_run_gate.sh

2️⃣ OpenResearch Legacy

Use this when you want the original single-GPU loop:

  • prepare data
  • edit train.py
  • run short training jobs
  • compare val_bpb
  • keep or discard changes

Main paths:

  • apps/prepare.py
  • apps/train.py
  • apps/program.md
  • packages/openresearch_legacy
  • docs/openresearch-legacy/README.md
  • examples/legacy/README.md

πŸ“ Mission Input

The main user input to openresearch is a mission JSON file.

Minimal shape:

{
  "id": "command-benchmark-basic",
  "adapter": {
    "kind": "command_benchmark"
  },
  "workspace": {
    "repo_root": ".",
    "editable_paths": ["prompts/**", "configs/**"]
  },
  "evaluation": {
    "primary_metric": "score",
    "goal": "maximize"
  },
  "budget": {
    "max_iterations": 3
  }
}
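
A mission file with this shape can be loaded and sanity-checked in a few lines of Python. The required-field list below mirrors only the minimal shape shown above and is an assumption; the authoritative check is `openresearch mission validate`.

```python
import json

# Minimal sanity check mirroring the shape above; the real validator is
# `openresearch mission validate`. Fields beyond this README are assumptions.
REQUIRED = {
    "id": str,
    "adapter": dict,
    "workspace": dict,
    "evaluation": dict,
    "budget": dict,
}

def check_mission(text: str) -> dict:
    mission = json.loads(text)
    for field, kind in REQUIRED.items():
        if not isinstance(mission.get(field), kind):
            raise ValueError(f"mission.{field} missing or wrong type")
    if mission["evaluation"].get("goal") not in ("maximize", "minimize"):
        raise ValueError("evaluation.goal must be 'maximize' or 'minimize'")
    return mission
```
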

See the shipped examples in examples/missions/:

  • examples/missions/command-benchmark-basic.json
  • examples/missions/offline-eval-harness-basic.json
  • examples/missions/prompt-eval-harness-basic.json
  • examples/missions/repo-patch-eval-harness-basic.json
  • examples/missions/artifact-transform-harness-basic.json
  • examples/missions/openresearch-openai-compatible.json
  • examples/missions/template-basic.json

πŸ› οΈ Common Usage Paths

🟒 Safe Mission Runtime

uv run openresearch mission orchestrate examples/missions/command-benchmark-basic.json
uv run openresearch mission search examples/missions/command-benchmark-basic.json
uv run openresearch mission inspect examples/missions/command-benchmark-basic.json
uv run openresearch mission visualize examples/missions/command-benchmark-basic.json

πŸ§ͺ Legacy Single-GPU Workflow

uv run openresearch-prepare
uv run openresearch-train

Compatibility note: the following older entry points remain supported:

  • uv run python apps/prepare.py
  • uv run python apps/train.py
  • root prepare.py, train.py, attention_kernel.py, and program.md

πŸ”₯ Real OpenResearch Search

Prerequisites:

  • NVIDIA GPU
  • uv sync
  • uv run openresearch-prepare
  • exported provider credentials:
    • OPENRESEARCH_LLM_BASE_URL
    • OPENRESEARCH_LLM_API_KEY

Run:

uv run openresearch mission search examples/missions/openresearch-openai-compatible.json --fresh

For release-grade validation:

OPENRESEARCH_LLM_BASE_URL="$OPENAI_BASE_URL" \
OPENRESEARCH_LLM_API_KEY="$OPENAI_API_KEY" \
./scripts/release/openresearch_real_run_gate.sh examples/missions/openresearch-openai-compatible.json

πŸ“¦ What You Get Back

Typical outputs include:

  • persisted mission state under .openresearch/<mission>.json
  • replayable artifacts under artifacts/
  • HTML reports under reports/
  • proposal histories such as proposal_journal.jsonl
  • legacy run outputs such as run.log and results.tsv
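
Because proposal_journal.jsonl is line-delimited JSON, it can be replayed with a few lines of Python. The reader below assumes only one JSON object per non-blank line; the exact record schema is runtime-defined and not specified here.

```python
import json
from pathlib import Path

def read_journal(path: str):
    """Yield one record per line from a JSONL proposal journal.

    Assumes only that each non-blank line is a JSON object; the exact
    record schema is defined by the runtime, not by this sketch.
    """
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)
```
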

Checked-in showcase artifacts:

  • release_examples/alphaevolve-long-horizon/real-run-summary.json
  • release_examples/alphaevolve-long-horizon/results.tsv

πŸ“Š Long-Horizon Real Run

This README highlights one stronger showcase artifact instead of three weak short runs: a real historical AlphaEvolve-style search campaign with 87 rounds. The chart below plots best-so-far val_bpb across the full campaign, so the curve is monotonic by design; the raw per-round results remain linked below.

Long-Horizon AlphaEvolve-style Search

  • Scenario: real historical autoresearch / OpenResearch Legacy search campaign with 87 proposal rounds
  • Primary metric: val_bpb, goal minimize
  • Frontier result: best_val_bpb=5.82925, 6 frontier improvements, 56 successful metric-bearing experiments
  • Robustness signal: 8 crash rounds and 18 degraded rounds without losing the tracked frontier
  • Display rule: the chart shows best-so-far val_bpb, not raw per-round values; this is the optimization frontier, not a claim that every raw round improved
  • Artifacts: release_examples/alphaevolve-long-horizon/real-run-summary.json, release_examples/alphaevolve-long-horizon/results.tsv
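
The display rule above is just a running minimum: each point on the chart is the best val_bpb seen up to that round. A short sketch, assuming you have already extracted one metric value per round from results.tsv:

```python
def best_so_far(values, goal="minimize"):
    """Turn raw per-round metrics into the monotonic frontier curve."""
    best = None
    curve = []
    for v in values:
        if best is None:
            best = v
        elif goal == "minimize":
            best = min(best, v)
        else:
            best = max(best, v)
        curve.append(best)
    return curve
```

This is why the plotted curve never regresses even though 8 rounds crashed and 18 degraded: regressions leave the frontier unchanged.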

πŸ”„ How It Works

```mermaid
flowchart TD
    A[User goal] --> B[Mission JSON]
    B --> C[Adapter kind]
    C --> D[mission run / orchestrate / search]
    D --> E[Generate candidate]
    E --> F[Execute experiment]
    F --> G[Evaluate metric]
    G --> H{Keep or discard?}
    H -->|keep| I[Persist state and artifacts]
    H -->|discard| I
    I --> J[Inspect / Replay / Visualize]
```

For the real openresearch adapter, the AlphaEvolve-style behavior appears in the outer search loop:

  • LLM-backed roles propose edits to train.py
  • the runtime executes bounded training experiments
  • val_bpb and runtime metrics are recorded
  • the search policy exploits the best known frontier across rounds
  • accepted history is persisted for replay and inspection
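
The steps above can be compressed into a toy outer loop. Here `propose` and `run_experiment` stand in for the LLM-backed roles and the bounded training runs; both names, and the persisted state shape, are illustrative assumptions rather than the runtime's real interfaces.

```python
import json

def outer_search(propose, run_experiment, rounds, state_path=None):
    """Toy AlphaEvolve-style outer loop that exploits the best known candidate.

    propose(best) and run_experiment(candidate) -> val_bpb are stand-ins for
    the LLM roles and bounded training runs; lower val_bpb is better.
    """
    history = []
    best, best_bpb = None, float("inf")
    for r in range(rounds):
        candidate = propose(best)
        val_bpb = run_experiment(candidate)
        kept = val_bpb < best_bpb
        if kept:
            best, best_bpb = candidate, val_bpb
        history.append({"round": r, "val_bpb": val_bpb, "kept": kept})
    if state_path:  # persist for replay and inspection
        with open(state_path, "w") as f:
            json.dump({"best_val_bpb": best_bpb, "history": history}, f)
    return best, best_bpb, history
```
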

🧩 Built-In Adapter Families

| Adapter | Typical candidate | Primary metric examples | Typical output |
| --- | --- | --- | --- |
| openresearch | train.py patch | val_bpb | results table, run log, proposal journal |
| command_benchmark | command/config change | score, latency_ms | benchmark metrics, run log |
| offline_eval_harness | eval candidate | score, pass_rate | eval summary, predictions |
| prompt_eval_harness | prompt/rubric change | score, pass_rate | summary, outputs |
| repo_patch_eval_harness | repo-local patch | task-defined metric | summary, diff artifact |
| artifact_transform_harness | transformed artifact | task-defined metric | transformed files, summary |
| template | adapter scaffold candidate | mission-defined metric | starter output surface |
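
All adapter families share one contract: produce a candidate, run it, and report metrics plus artifacts. The sketch below is a hypothetical version of that contract for orientation only; the real authoring interface is documented in docs/openresearch/adapter-authoring.md and will differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

# Hypothetical adapter contract for illustration only; see
# docs/openresearch/adapter-authoring.md for the real interface.
@dataclass
class EvalResult:
    metrics: dict[str, float]                 # e.g. {"score": 0.8} or {"val_bpb": 5.83}
    artifacts: dict[str, str] = field(default_factory=dict)  # name -> path

class Adapter(ABC):
    @abstractmethod
    def generate_candidate(self, context: dict) -> str:
        """Produce the next candidate (patch, prompt, config, ...)."""

    @abstractmethod
    def evaluate(self, candidate: str) -> EvalResult:
        """Run the bounded experiment and report metrics."""
```
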

βœ… Validated Status

Current validated snapshot:

  • milestone notes: docs/openresearch/release-notes-milestone1-v1.md
  • release validation guide: docs/openresearch/release-validation.md

Latest checked validation evidence in this repo includes:

  • 437 passing tests
  • packaged smoke gate passing
  • real hard gate passing with 3 rounds and a numeric best_val_bpb

πŸ“Œ Project Status

  • Current milestone: milestone1 (V1 complete)
  • Stable user surfaces: openresearch runtime and OpenResearch Legacy
  • Validated paths: mission runtime, inspect/visualize flow, packaged smoke gate, and real openresearch hard gate
  • Hard requirements for the real search path: NVIDIA GPU plus OpenAI-compatible provider credentials
  • Current boundaries: no old-name CLI compatibility shim, no formal citation metadata yet, and package-level standardized license metadata is still pending

πŸ›£οΈ Roadmap

Near-term improvements most aligned with the current codebase:

  • harden public packaging and repository metadata
  • add formal citation metadata and refine packaging/provenance metadata
  • expand public-facing release examples and visualization assets
  • continue improving generic adapter ergonomics and validation surfaces
  • tighten contributor workflows and release documentation

❓ FAQ

🧱 What is the difference between openresearch and OpenResearch Legacy?

openresearch is the generic mission-driven runtime. OpenResearch Legacy is the original single-GPU workflow centered on preparing data, editing train.py, and comparing val_bpb.

πŸ–₯️ Do I need a GPU?

Not for most shipped examples. Command benchmark, offline eval, prompt eval, repo patch eval, and artifact transform missions can be used without the real GPU-backed training path. A GPU is required for the real openresearch training search.

πŸ“ What is the main user input?

For the generic runtime, the main input is a mission JSON file under examples/missions/ or your own equivalent mission spec. For the legacy path, the effective operator surface is train.py plus the legacy policy in apps/program.md.

🎯 What kinds of tasks are a good fit?

Tasks work best when you have a constrained editable surface, an executable evaluation harness, and a metric that can decide whether a candidate is better or worse than a baseline.

🧬 Is this literally AlphaEvolve?

No. OpenResearch is an AlphaEvolve-style local search runtime: it proposes changes, runs bounded experiments, evaluates results, and iterates across rounds with persisted history. It is inspired by that style of optimization loop, not a literal reproduction of any proprietary internal system.

πŸ“š Citation

There is no formal CITATION.cff or paper release in this repository yet.

If you need to cite the project today, use a software-style reference such as:

OpenResearch contributors. OpenResearch: a local autonomous optimization runtime
for iterative research and engineering loops. Software, 2026.

If a canonical public repository URL, paper, or citation metadata is published later, this section should be updated to point to that authoritative source.

βš–οΈ License

This repository now ships a root LICENSE file.

Important scope note:

  • the LICENSE applies to original OpenResearch-authored contributions in this repository
  • it does not grant additional rights for third-party or upstream-derived material beyond the terms granted by their original authors
  • OpenResearch builds on the upstream AutoResearch project by Andrej Karpathy
  • as of March 19, 2026, the upstream AutoResearch repository page did not appear to publish a LICENSE file, so downstream users should review upstream rights separately before reusing upstream-derived portions

If upstream licensing terms are later published or clarified, this repository LICENSE and this section should be updated accordingly.

πŸ™ Acknowledgements

OpenResearch builds on the upstream AutoResearch project by Andrej Karpathy, and we want to explicitly thank that open-source work for making OpenResearch possible.

OpenResearch preserves the original single-GPU legacy loop as OpenResearch Legacy, while extending that line of work into a broader mission-driven runtime for replayable, AlphaEvolve-style local search.

🀝 Contributing

If you want to contribute:

  • start with docs/user-quickstart.md and docs/openresearch/README.md
  • keep changes scoped to one surface when possible: openresearch runtime or OpenResearch Legacy
  • update stable docs when you change the public operator path

Minimum validation before submitting a patch:

uv run pytest -q
./scripts/release/generic_v1_smoke.sh

For runtime/docs-only changes, a smaller validation loop is usually enough while iterating:

uv run pytest tests/integration/test_docs_navigation.py -q
git diff --check

πŸ—‚οΈ Documentation

Start here:

  • docs/user-quickstart.md
  • docs/openresearch/README.md
  • docs/openresearch/quickstart.md
  • docs/openresearch/mission-runtime.md

Runtime docs:

  • docs/openresearch/adapter-authoring.md
  • docs/openresearch/replay-and-inspect.md
  • docs/openresearch/release-validation.md
  • docs/openresearch/troubleshooting.md

Legacy docs:

  • docs/openresearch-legacy/README.md
  • docs/openresearch-legacy/operator-guide.md
  • docs/openresearch-legacy/setup-and-data.md
  • docs/openresearch-legacy/troubleshooting.md

Historical archive:

  • docs/plans/README.md

About

AutoResearch Plus with AlphaEvolve-style search for code, prompts, and training loops.
