ReplicateAI — Roadmap

Last updated: 2026-05-26

Status: One paper replicates end-to-end with MATCH; expanding to a small benchmark suite and turning runs into reusable artifacts.

For architecture see DESIGN.md; for what is intentionally out of scope see DESIGN.md §10; for agent conventions see AGENTS.md.

One-line vision

From paper PDF + public CSV → a referee-style audit of the headline coefficient — automatically, transparently, and reproducibly.

The project is not a general “replicate any paper” tool. It is a focused verification layer for a narrow but real slice: applied micro, public data, one declared estimand at a time.

Where we are (honest gaps)

Shipped (works today):

Host PDF preflight → Modal sandbox → deep agent → auditor → audit markdown
Auditor sub-agent with get_current_date; auto-saves replication_audit.md to the example folder
Textual TUI (phases, scrollable detail pane, run log, headline card) and CLI (--no-tui)
Browser GUI (--gui): launcher (curated packs, file/folder upload), run dashboard, SSE updates (uv sync --group gui)
Five LLM provider options (Anthropic, Cloudflare GLM, Cloudflare Kimi, Gemini, Groq)
Six curated example packs with data scripts and published-coefficient references
80+ unit tests, including mocked Modal runs

Demonstrated MATCH: Card & Krueger (1994); Dehejia & Wahba (1999), experimental NSW sample.
Untested in repo: Imbens-Rubin-Sacerdote, Angrist-Lavy, Autor-Dorn-Hanson, AJR.
Still manual: add paper.pdf, run the agent, interpret failures.

Bets (the explicit choices)

These are not non-goals — they are active decisions that drive the roadmap.

Bet	Why
Python-only sandbox	Half the benefit at 10× less surface area than supporting Stata/R
Public-data-only papers	Avoids credentialed-data infrastructure; covers most teaching/demo cases
Headline coefficient only	Auditor is meaningful with one estimand; multi-table audits are a different product
LLM does spec extraction	If the agent can’t lock a correct `target_specification.json`, the rest is theatre
Curated example packs	Generalizing without proven reliability on knowns is hype
TUI as the demo surface	Better artifact than a Jupyter notebook; doubles as portfolio screen-recording

If any bet turns out wrong (e.g. spec extraction is unreliable across models), the roadmap should change before adding more papers.

Now — measurable next 2–3 weeks

Single goal: prove which packs replicate reliably and tighten the loop around failures.

#	Item	Acceptance criteria
1	Run all 6 packs on Anthropic Sonnet 3× each	Verdicts logged; reproducible from a script
2	`BENCHMARK.md` (or extend test.md)	Table: pack × provider × verdict × runtime × failure tag, regenerated by a script
3	Failure taxonomy	Categories: spec / data / code / audit-parse / timeout / cost; tagged in benchmark rows
4	One mocked end-to-end test per pack	`tests/test_pack_<name>_e2e.py` runs in CI, no live Modal
5	Auditor robustness on SE-only divergence	Welch vs pooled standard errors are covered in the `AUDITOR.md` rubric and in the audit notes (partly there already; codify)
6	Visible audit path in TUI footer	Footer shows last saved path; `s` re-saves and updates path
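
Item 5 concerns runs where the replicated point estimate matches the paper but the standard error does not. A minimal sketch, assuming a plain two-group difference in means (the function names and the toy data are illustrative, not taken from the repo), shows why this happens: the pooled and Welch conventions share the same estimate but give different standard errors once group sizes and variances differ.

```python
import math
from statistics import mean, variance


def diff_in_means(treated, control):
    """Point estimate; identical under both SE conventions."""
    return mean(treated) - mean(control)


def pooled_se(treated, control):
    """SE assuming both groups share one variance (classic two-sample t)."""
    n1, n0 = len(treated), len(control)
    sp2 = ((n1 - 1) * variance(treated) + (n0 - 1) * variance(control)) / (n1 + n0 - 2)
    return math.sqrt(sp2 * (1 / n1 + 1 / n0))


def welch_se(treated, control):
    """SE allowing unequal group variances (Welch)."""
    return math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))


# Toy data, made up for illustration: unequal group sizes and spreads.
treated = [12.0, 16.0, 8.0, 20.0, 4.0]
control = [9.0, 10.0, 11.0, 10.0, 9.0, 11.0, 10.0, 10.0]

estimate = diff_in_means(treated, control)
se_gap = abs(welch_se(treated, control) - pooled_se(treated, control))
```

With equal group sizes the two formulas coincide, so an SE-only divergence is most likely when treatment and control groups are unbalanced.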

Exit: at least 4 / 6 packs reach MATCH on Anthropic without per-pack prompt edits, and BENCHMARK.md makes failures actionable. If under 4, stop adding packs and fix the loop instead.
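
The failure taxonomy and the exit rule can be made mechanical. The sketch below is one possible shape, not the repo's implementation: `FailureTag`, `BenchmarkRow`, and both helper functions are hypothetical names, and treating a pack as "reaching MATCH" only when every one of its runs is MATCH is an assumption, since the roadmap does not define how the three runs per pack are combined.

```python
from collections import defaultdict
from enum import Enum
from typing import NamedTuple, Optional


class FailureTag(str, Enum):
    """Failure categories listed in item 3."""
    SPEC = "spec"
    DATA = "data"
    CODE = "code"
    AUDIT_PARSE = "audit-parse"
    TIMEOUT = "timeout"
    COST = "cost"


class BenchmarkRow(NamedTuple):
    """One run; field names are illustrative."""
    pack: str
    provider: str
    verdict: str                       # e.g. "MATCH", "CLOSE", "MISMATCH"
    runtime_s: float
    failure_tag: Optional[FailureTag]  # None when nothing failed


def packs_reaching_match(rows):
    """Packs whose runs ALL ended in MATCH (assumed aggregation rule)."""
    verdicts = defaultdict(list)
    for row in rows:
        verdicts[row.pack].append(row.verdict)
    return sorted(p for p, vs in verdicts.items() if all(v == "MATCH" for v in vs))


def exit_criterion_met(rows, required=4):
    """True when at least `required` packs reach MATCH (4 of 6 above)."""
    return len(packs_reaching_match(rows)) >= required
```

A looser rule (for example, a majority of runs) would only change the `all(...)` condition; whichever rule is chosen should be stated in `BENCHMARK.md` so the pass count is reproducible.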

Next — once Now lands

Build only after the benchmark loop is honest.

Item	Why it becomes possible now
`run_manifest.json` per run (model, commit, timestamps, verdict, paths)	Benchmarks need a stable artifact format
Optional human-approved spec	Cheapest reliability win for unreliable packs
2-paragraph “what it is / isn’t” README pitch + demo GIF	Now we have evidence to back it up
`demo_transcript.md` (one full annotated run)	Same
Auditor stricter on missing coefficients	Shows up only after running 6 papers
Headline coefficient card always renders when both JSONs exist	Polishing once we’ve seen real failure modes

Exit: an outside economist can run Dehejia–Wahba or Card–Krueger and explain MATCH/CLOSE/MISMATCH from artifacts alone, without reading agent logs.
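
As an illustration of the `run_manifest.json` item, here is a minimal sketch of what such a manifest could look like. Only the field list (model, commit, timestamps, verdict, paths) comes from the table above; the class name, the exact key names, and the choice of ISO 8601 UTC timestamps are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class RunManifest:
    """Hypothetical shape for run_manifest.json; key names are illustrative."""
    model: str
    commit: str
    started_at: str   # ISO 8601, UTC
    finished_at: str  # ISO 8601, UTC
    verdict: str      # e.g. "MATCH", "CLOSE", "MISMATCH", "CANNOT_RUN"
    paths: dict = field(default_factory=dict)  # artifact name -> relative path

    def to_json(self) -> str:
        # Sorted keys keep diffs between runs stable.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    def write(self, run_dir) -> Path:
        out = Path(run_dir) / "run_manifest.json"
        out.write_text(self.to_json() + "\n", encoding="utf-8")
        return out


def utc_now_iso() -> str:
    """Current UTC time as an ISO 8601 string, second precision."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")
```

Keeping the manifest flat and JSON-serializable means a benchmark script can collect verdicts by reading one small file per run instead of parsing agent logs.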

Later — directional, not committed

Each item gets revisited only after the benchmark phase has data.

Replication registry (registry/papers.yaml + stored runs + static pass-rate page). Not built before Now/Next, because a registry without measured pass rates is marketing.
Multi-estimand audits (several coefficients per paper, one audit table)
Author replication zip ingestion (AEA openICPSR packages → /workspace)
Stata in the sandbox (license + image size; only if Python proves insufficient)
Batch mode (replicate-ai --batch registry/ for nightly benchmarks)
Restricted-data documentation path (record verdict CANNOT_RUN with reason; no credential handling)

Won’t do (v1)

(Mirrors DESIGN.md §10.)

Multi-tenant hosting, auth, billing.
Replicating every table or figure in a paper.
Credentialed microdata (PSID, Compustat, IPUMS, Census restricted).
Beating the AEA Data Editor’s replication template on static package analysis.
Generic “any economics paper” claims.

Open questions

These are unresolved enough that the roadmap should bend around the answers.

Spec extraction reliability — Across the six packs, does the agent pick the right table/column without hand-holding? If not, the “locked spec” bet weakens.
Auditor calibration — Is a relative deviation (rel. dev.) of ≤ 5% the right MATCH threshold across all six packs? Or do SEs need their own bucket (SE_DRIFT)?
Cost / latency — What are the median dollar cost and wall-clock time per Anthropic run? Anything above ~$1 or ~5 minutes erodes the “quick check” pitch.
Cross-model parity — Does Cloudflare Kimi get within one verdict bucket of Sonnet on card_krueger? Determines whether non-Anthropic providers are usable beyond smoke tests.
Reproducibility — Two runs with the same example pack: do they produce the same verdict and similar coefficients? If not, “verification” is fragile.

The benchmark phase is partly designed to answer these.
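
To make the calibration and parity questions concrete, the sketch below computes a relative deviation and maps it to a verdict bucket. The 5% MATCH threshold is the one questioned above; the 20% CLOSE cutoff, the function names, and the bucket ordering are placeholders chosen for illustration, not the auditor's actual rubric.

```python
BUCKETS = ["MATCH", "CLOSE", "MISMATCH"]  # ordered from best to worst


def relative_deviation(estimate: float, published: float) -> float:
    """|estimate - published| / |published|."""
    if published == 0:
        raise ValueError("relative deviation is undefined for a published value of 0")
    return abs(estimate - published) / abs(published)


def classify(estimate: float, published: float,
             match_tol: float = 0.05, close_tol: float = 0.20) -> str:
    """Map a replicated estimate to a verdict bucket.

    close_tol is an assumed placeholder, not a value from the project.
    """
    dev = relative_deviation(estimate, published)
    if dev <= match_tol:
        return "MATCH"
    if dev <= close_tol:
        return "CLOSE"
    return "MISMATCH"


def bucket_distance(verdict_a: str, verdict_b: str) -> int:
    """How many buckets apart two verdicts are (0 means same verdict)."""
    return abs(BUCKETS.index(verdict_a) - BUCKETS.index(verdict_b))
```

Under these assumptions, "within one verdict bucket" becomes `bucket_distance(a, b) <= 1`, and the reproducibility question reduces to checking that repeated runs of one pack give a distance of 0 and a small spread in `relative_deviation`.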

Metrics worth tracking (and reporting in `BENCHMARK.md`)

Metric	Why
Pass rate (MATCH share over runs × packs × providers)	Headline trust signal
Headline rel. dev. (estimate vs published)	Quality of replication, not just verdict
Wall-clock time (preflight, agent, total)	Demo viability
Run cost ($ tokens)	Determines who can use it
Failure tag distribution	Tells us what to fix next
Audit-vs-handcheck agreement (sampled)	Catches auditor false MATCHes
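
A script that regenerates `BENCHMARK.md` could derive most of these metrics from one list of run records. The following is a minimal sketch under stated assumptions: the row keys (`pack`, `provider`, `verdict`, `rel_dev`, `runtime_s`, `cost_usd`, `failure_tag`) and both function names are hypothetical, and the audit-vs-handcheck agreement is left out because it needs manually checked samples.

```python
from collections import Counter
from statistics import median


def summarize(rows):
    """Aggregate run records (dicts) into headline metrics.

    rel_dev and failure_tag may be None, e.g. when a run never
    produced an estimate or when nothing failed.
    """
    if not rows:
        raise ValueError("no benchmark rows to summarize")
    rel_devs = [r["rel_dev"] for r in rows if r["rel_dev"] is not None]
    return {
        "pass_rate": sum(r["verdict"] == "MATCH" for r in rows) / len(rows),
        "median_rel_dev": median(rel_devs) if rel_devs else None,
        "median_runtime_s": median(r["runtime_s"] for r in rows),
        "median_cost_usd": median(r["cost_usd"] for r in rows),
        "failure_tags": Counter(r["failure_tag"] for r in rows if r["failure_tag"]),
    }


def to_markdown(rows):
    """Render the pack x provider x verdict x runtime x failure-tag table."""
    lines = [
        "| pack | provider | verdict | runtime (s) | failure tag |",
        "| --- | --- | --- | --- | --- |",
    ]
    for r in rows:
        lines.append(
            f"| {r['pack']} | {r['provider']} | {r['verdict']} | "
            f"{r['runtime_s']:.0f} | {r['failure_tag'] or ''} |"
        )
    return "\n".join(lines)
```

Medians are used rather than means so that a single timeout or runaway-cost run does not dominate the summary.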

Example pack backlog (Phase: Later)

Candidates that fit the bets above. Add only after the current six pass reliably.

Paper	Why interesting
LaLonde (1986) full pseudo-experiment	Extends `dehejia_wahba`; matching/PSID controls
Mincer-style wage regression on a public CPS extract	Canonical OLS; lowest-friction teaching example
Angrist & Krueger (1991) IV — quarter of birth	Heavier data prep; classic IV
Bertrand & Mullainathan (2004)	Audit / field experiment; different design family
Duflo (2001) school construction	Difference-in-differences on a policy natural experiment; AEA package likely available

Each new pack adds: examples/<name>/ + paper link + target_spec_reference.json + a benchmark row.

How to pick something up

Always start with Now. Items 1–6 are the only “committed” work.
Never skip ahead to Later without first adding or updating an entry in the Open questions section above; if you don’t know the answer to one of the five questions, that is the next thing to investigate.
Schemas / architecture changes: update DESIGN.md and AGENTS.md in the same PR.
New example pack: follow the layout in examples/README.md; add a benchmark row in the same PR.

Related documents

Document	Purpose
DESIGN.md	Architecture, `/workspace` contract, auditor rubric
DESIGN_TUI.md	Terminal dashboard UX spec
DESIGN_GUI.md	Browser GUI UX spec
test.md	Agent test suite (expected vs actual per pack)
examples/README.md	Example packs and paper links
replicate_ai/README.md	Setup and CLI
AGENTS.md	Guidance for coding agents
