Last updated: 2026-05-26

Status: One paper replicates end-to-end with MATCH; expanding to a small benchmark suite and turning runs into reusable artifacts.
For architecture see DESIGN.md; for what is intentionally out of scope see DESIGN.md §10; for agent conventions see AGENTS.md.
From paper PDF + public CSV → a referee-style audit of the headline coefficient — automatically, transparently, and reproducibly.
The project is not a general “replicate any paper” tool. It is a focused verification layer for a narrow but real slice: applied micro, public data, one declared estimand at a time.
Shipped (works today):
- Host PDF preflight → Modal sandbox → deep agent → auditor → audit markdown
- Auditor sub-agent with get_current_date; auto-saves replication_audit.md to the example folder
- Textual TUI (phases, scrollable detail pane, run log, headline card) and CLI (--no-tui)
- Browser GUI (--gui): launcher (curated packs, file/folder upload), run dashboard, SSE updates (uv sync --group gui)
- Five LLM providers (Anthropic, Cloudflare GLM/Kimi, Gemini, Groq)
- Six curated example packs with data scripts and published-coefficient references
- 80+ unit tests, including mocked Modal runs
Demonstrated MATCH: Card & Krueger (1994), Dehejia & Wahba (1999) experimental NSW.
Untested in repo: Imbens-Rubin-Sacerdote, Angrist-Lavy, Autor-Dorn-Hanson, AJR.
Still manual: add paper.pdf, run agent, interpret failures.
These are not non-goals — they are active decisions that drive the roadmap.
| Bet | Why |
|---|---|
| Python-only sandbox | Half the benefit at 10× less surface area than supporting Stata/R |
| Public-data-only papers | Avoids credentialed-data infrastructure; covers most teaching/demo cases |
| Headline coefficient only | Auditor is meaningful with one estimand; multi-table audits are a different product |
| LLM does spec extraction | If the agent can’t lock a correct target_specification.json, the rest is theatre |
| Curated example packs | Generalizing without proven reliability on knowns is hype |
| TUI as the demo surface | Better artifact than a Jupyter notebook; doubles as portfolio screen-recording |
If any bet turns out wrong (e.g. spec extraction is unreliable across models), the roadmap should change before adding more papers.
Single goal: prove which packs replicate reliably and tighten the loop around failures.
| # | Item | Acceptance criteria |
|---|---|---|
| 1 | Run all 6 packs on Anthropic Sonnet 3× each | Verdicts logged; reproducible from a script |
| 2 | BENCHMARK.md (or extend test.md) | Table: pack × provider × verdict × runtime × failure tag, regenerated by a script |
| 3 | Failure taxonomy | Categories: spec / data / code / audit-parse / timeout / cost; tagged in benchmark rows |
| 4 | One mocked end-to-end test per pack | tests/test_pack_<name>_e2e.py runs in CI, no live Modal |
| 5 | Auditor robustness on SE-only divergence | Welch vs pooled is mentioned in AUDITOR.md rubric and the audit notes — already partly there; codify |
| 6 | Visible audit path in TUI footer | Footer shows last saved path; s re-saves and updates path |
Exit: at least 4 / 6 packs reach MATCH on Anthropic without per-pack prompt edits, and BENCHMARK.md makes failures actionable. If under 4, stop adding packs and fix the loop instead.
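The benchmark row, failure taxonomy, and exit gate described above could be sketched roughly as follows. The tag values come from item 3; everything else (`BenchmarkRow`, its field names, `exit_gate_met`) is a hypothetical shape for illustration, not the project's actual schema. The sketch also assumes a pack "reaches MATCH" if any of its runs returns MATCH; requiring a majority of the three runs would be an equally plausible reading.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional, Set


class FailureTag(str, Enum):
    # Categories listed in roadmap item 3.
    SPEC = "spec"
    DATA = "data"
    CODE = "code"
    AUDIT_PARSE = "audit-parse"
    TIMEOUT = "timeout"
    COST = "cost"


@dataclass
class BenchmarkRow:
    # Hypothetical row: pack x provider x verdict x runtime x failure tag.
    pack: str
    provider: str
    verdict: str
    runtime_s: float
    failure_tag: Optional[FailureTag] = None


def packs_reaching_match(rows: Iterable[BenchmarkRow], provider: str) -> Set[str]:
    """Packs with at least one MATCH run on the given provider (assumed rule)."""
    return {r.pack for r in rows if r.provider == provider and r.verdict == "MATCH"}


def exit_gate_met(rows: Iterable[BenchmarkRow], provider: str = "anthropic",
                  required: int = 4) -> bool:
    """True when enough distinct packs reach MATCH on the provider."""
    return len(packs_reaching_match(rows, provider)) >= required
```

Keeping the tag as an enum means a benchmark script fails loudly on an unknown category instead of silently adding a seventh one.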
Build only after the benchmark loop is honest.
| Item | Why it becomes possible now |
|---|---|
| run_manifest.json per run (model, commit, timestamps, verdict, paths) | Benchmarks need a stable artifact format |
| Optional human-approved spec | Cheapest reliability win for unreliable packs |
| 2-paragraph “what it is / isn’t” README pitch + demo GIF | Now we have evidence to back it up |
| demo_transcript.md (one full annotated run) | Same |
| Auditor stricter on missing coefficients | Shows up only after running 6 papers |
| Headline coefficient card always renders when both JSONs exist | Polishing once we’ve seen real failure modes |
Exit: an outside economist can run Dehejia–Wahba or Card–Krueger and explain MATCH/CLOSE/MISMATCH from artifacts alone, without reading agent logs.
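As a rough illustration of the per-run manifest, the sketch below writes a run_manifest.json with the fields named in the table (model, commit, timestamps, verdict, paths). The class name, the split of "timestamps" into `started_at` / `finished_at`, and the shape of `paths` are assumptions made for the example; the real format is still to be designed.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict


def utc_now_iso() -> str:
    """Current time as an ISO 8601 string in UTC."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class RunManifest:
    # Field names are hypothetical; only the categories come from the roadmap.
    model: str
    commit: str
    started_at: str
    finished_at: str
    verdict: str
    paths: Dict[str, str] = field(default_factory=dict)


def write_manifest(manifest: RunManifest, run_dir) -> Path:
    """Serialize the manifest with sorted keys so diffs between runs stay stable."""
    out = Path(run_dir) / "run_manifest.json"
    payload = json.dumps(asdict(manifest), indent=2, sort_keys=True)
    out.write_text(payload + "\n", encoding="utf-8")
    return out
```

Sorted keys and a trailing newline keep the file friendly to version control, which matters if manifests later feed a regenerated benchmark table.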
Each item gets revisited only after the registry phase has data.
- Replication registry (registry/papers.yaml + stored runs + static pass-rate page). Not built before Now/Next, because a registry without measured pass rates is marketing.
- Multi-estimand audits (several coefficients per paper, one audit table)
- Author replication zip ingestion (AEA openICPSR packages → /workspace)
- Stata in the sandbox (license + image size; only if Python proves insufficient)
- Batch mode (replicate-ai --batch registry/ for nightly benchmarks)
- Restricted-data documentation path (record verdict CANNOT_RUN with reason; no credential handling)
(Mirrors DESIGN.md §10.)
- Multi-tenant hosting, auth, billing.
- Replicating every table or figure in a paper.
- Credentialed microdata (PSID, Compustat, IPUMS, Census restricted).
- Beating the AEA Data Editor’s replication template on static package analysis.
- Generic “any economics paper” claims.
These are unresolved enough that the roadmap should bend around the answers.
- Spec extraction reliability: Across the six packs, does the agent pick the right table/column without hand-holding? If not, the “locked spec” bet weakens.
- Auditor calibration: Is rel. dev. ≤ 5% the right MATCH threshold across all six? Or do SEs need their own bucket (SE_DRIFT)?
- Cost / latency: What is the median $ cost and wall-clock time per Anthropic run? Above ~$1 or ~5 minutes erodes the “quick check” pitch.
- Cross-model parity: Does Cloudflare Kimi get within one verdict bucket of Sonnet on card_krueger? Determines whether non-Anthropic providers are usable beyond smoke tests.
- Reproducibility: Two runs with the same example pack: do they produce the same verdict and similar coefficients? If not, “verification” is fragile.
The benchmark phase is partly designed to answer these.
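To make the auditor calibration question concrete, here is a minimal sketch of how verdict buckets might be derived from relative deviation. Only the 5% MATCH threshold and the MATCH/CLOSE/MISMATCH and SE_DRIFT labels come from this document; the 15% CLOSE cutoff, the SE tolerance, and both function names are placeholder assumptions, not the auditor's actual rubric.

```python
def relative_deviation(estimate: float, published: float) -> float:
    """Absolute deviation from the published value, relative to its magnitude."""
    if published == 0:
        raise ValueError("relative deviation is undefined for a published value of 0")
    return abs(estimate - published) / abs(published)


def coefficient_verdict(estimate: float, published: float,
                        match_tol: float = 0.05,   # threshold from the roadmap
                        close_tol: float = 0.15    # hypothetical CLOSE cutoff
                        ) -> str:
    dev = relative_deviation(estimate, published)
    if dev <= match_tol:
        return "MATCH"
    if dev <= close_tol:
        return "CLOSE"
    return "MISMATCH"


def audit_verdict(coef: float, coef_pub: float, se: float, se_pub: float,
                  se_tol: float = 0.05) -> str:
    """One possible SE_DRIFT rule: the coefficient matches but the SE does not."""
    base = coefficient_verdict(coef, coef_pub)
    if base == "MATCH" and relative_deviation(se, se_pub) > se_tol:
        return "SE_DRIFT"
    return base
```

With these placeholder tolerances, a replicated coefficient equal to the published one but a standard error of 0.30 against a published 0.25 would be tagged SE_DRIFT rather than MATCH, which is the case the Welch-vs-pooled item is concerned with.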
| Metric | Why |
|---|---|
| Pass rate (MATCH share over runs × packs × providers) | Headline trust signal |
| Headline rel. dev. (estimate vs published) | Quality of replication, not just verdict |
| Wall-clock time (preflight, agent, total) | Demo viability |
| Run cost ($ tokens) | Determines who can use it |
| Failure tag distribution | Tells us what to fix next |
| Audit-vs-handcheck agreement (sampled) | Catches auditor false MATCHes |
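Several of these metrics reduce to simple aggregates over logged runs. The sketch below assumes each run is available as a `(verdict, runtime_s, cost_usd, failure_tag)` tuple; that tuple layout and the function names are illustrative assumptions, not an existing interface.

```python
from collections import Counter
from statistics import median


def pass_rate(verdicts) -> float:
    """Share of runs whose verdict is MATCH."""
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("no runs to score")
    return sum(v == "MATCH" for v in verdicts) / len(verdicts)


def failure_tag_counts(tags) -> Counter:
    """Count failure tags, ignoring runs that carry no tag."""
    return Counter(tag for tag in tags if tag is not None)


def summarize(runs) -> dict:
    """Aggregate hypothetical (verdict, runtime_s, cost_usd, failure_tag) tuples."""
    runs = list(runs)
    return {
        "pass_rate": pass_rate(r[0] for r in runs),
        "median_runtime_s": median(r[1] for r in runs),
        "median_cost_usd": median(r[2] for r in runs),
        "failure_tags": dict(failure_tag_counts(r[3] for r in runs)),
    }
```

Medians rather than means are used for time and cost because a single timed-out run would otherwise dominate the summary.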
Candidates that fit the bets above. Add only after the current six pass reliably.
| Paper | Why interesting |
|---|---|
| LaLonde (1986) full pseudo-experiment | Extends dehejia_wahba; matching/PSID controls |
| Mincer-style wage regression on a public CPS extract | Canonical OLS; lowest-friction teaching example |
| Angrist & Krueger (1991) IV — quarter of birth | Heavier data prep; classic IV |
| Bertrand & Mullainathan (2004) | Audit / field experiment; different design family |
| Duflo (2001) school construction | Natural-experiment difference-in-differences; AEA package likely available |
Each new pack adds: examples/<name>/ + paper link + target_spec_reference.json + a benchmark row.
- Always start with Now. Items 1–6 are the only “committed” work.
- Never skip ahead to Later without first updating the open questions above; if one of the five is still unanswered, that is the next thing to investigate.
- Schemas / architecture changes: update DESIGN.md and AGENTS.md in the same PR.
- New example pack: follow the layout in examples/README.md; add a benchmark row in the same PR.
| Document | Purpose |
|---|---|
| DESIGN.md | Architecture, /workspace contract, auditor rubric |
| DESIGN_TUI.md | Terminal dashboard UX spec |
| DESIGN_GUI.md | Browser GUI UX spec |
| test.md | Agent test suite (expected vs actual per pack) |
| examples/README.md | Example packs and paper links |
| replicate_ai/README.md | Setup and CLI |
| AGENTS.md | Guidance for coding agents |