ReplicateAI — Roadmap

Last updated: 2026-05-26

Status: One paper replicates end-to-end with MATCH; expanding to a small benchmark suite and turning runs into reusable artifacts.

For architecture see DESIGN.md; for what is intentionally out of scope see DESIGN.md §10; for agent conventions see AGENTS.md.


One-line vision

From paper PDF + public CSV → a referee-style audit of the headline coefficient — automatically, transparently, and reproducibly.

The project is not a general “replicate any paper” tool. It is a focused verification layer for a narrow but real slice: applied micro, public data, one declared estimand at a time.


Where we are (honest gaps)

Shipped (works today):

  • Host PDF preflight → Modal sandbox → deep agent → auditor → audit markdown
  • Auditor sub-agent with get_current_date; auto-saves replication_audit.md to the example folder
  • Textual TUI (phases, scrollable detail pane, run log, headline card) and CLI (--no-tui)
  • Browser GUI (--gui): launcher (curated packs, file/folder upload), run dashboard, SSE updates (uv sync --group gui)
  • Five LLM providers (Anthropic, Cloudflare GLM, Cloudflare Kimi, Gemini, Groq)
  • Six curated example packs with data scripts and published-coefficient references
  • 80+ unit tests, including mocked Modal runs

Demonstrated MATCH: Card & Krueger (1994); Dehejia & Wahba (1999), experimental NSW.

Untested in repo: Imbens-Rubin-Sacerdote, Angrist-Lavy, Autor-Dorn-Hanson, AJR.

Still manual: adding paper.pdf, running the agent, interpreting failures.


Bets (the explicit choices)

These are not non-goals — they are active decisions that drive the roadmap.

| Bet | Why |
|---|---|
| Python-only sandbox | Half the benefit at 10× less surface area than supporting Stata/R |
| Public-data-only papers | Avoids credentialed-data infrastructure; covers most teaching/demo cases |
| Headline coefficient only | Auditor is meaningful with one estimand; multi-table audits are a different product |
| LLM does spec extraction | If the agent can’t lock a correct target_specification.json, the rest is theatre |
| Curated example packs | Generalizing without proven reliability on knowns is hype |
| TUI as the demo surface | Better artifact than a Jupyter notebook; doubles as portfolio screen-recording |

If any bet turns out wrong (e.g. spec extraction is unreliable across models), the roadmap should change before adding more papers.


Now — measurable next 2–3 weeks

Single goal: prove which packs replicate reliably and tighten the loop around failures.

| # | Item | Acceptance criteria |
|---|---|---|
| 1 | Run all 6 packs on Anthropic Sonnet 3× each | Verdicts logged; reproducible from a script |
| 2 | BENCHMARK.md (or extend test.md) | Table: pack × provider × verdict × runtime × failure tag, regenerated by a script |
| 3 | Failure taxonomy | Categories: spec / data / code / audit-parse / timeout / cost; tagged in benchmark rows |
| 4 | One mocked end-to-end test per pack | tests/test_pack_<name>_e2e.py runs in CI, no live Modal |
| 5 | Auditor robustness on SE-only divergence | Welch vs pooled SEs are covered in the AUDITOR.md rubric and in the audit notes (already partly there; codify) |
| 6 | Visible audit path in TUI footer | Footer shows last saved path; pressing s re-saves and updates the path |

Exit: at least 4 / 6 packs reach MATCH on Anthropic without per-pack prompt edits, and BENCHMARK.md makes failures actionable. If under 4, stop adding packs and fix the loop instead.
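Items 2 and 3 and the exit criterion can be sketched in a few lines. This is a minimal illustration, not the project's script: the `BenchmarkRow` field names, the provider string `"anthropic"`, and the reading of "reach MATCH" as "at least one MATCH run for that pack" are assumptions. Only the column list and the six failure tags come from the table above.

```python
from dataclasses import dataclass
from typing import Optional

# Failure tags taken from item 3 above.
FAILURE_TAGS = {"spec", "data", "code", "audit-parse", "timeout", "cost"}


@dataclass(frozen=True)
class BenchmarkRow:
    """One run, with the columns proposed for BENCHMARK.md (names are assumed)."""
    pack: str
    provider: str
    verdict: str                      # e.g. "MATCH", "CLOSE", "MISMATCH"
    runtime_s: float
    failure_tag: Optional[str] = None

    def __post_init__(self):
        # Reject tags outside the taxonomy so typos do not create new categories.
        if self.failure_tag is not None and self.failure_tag not in FAILURE_TAGS:
            raise ValueError(f"unknown failure tag: {self.failure_tag!r}")


def render_table(rows):
    """Render rows as a Markdown table, sorted so regeneration is deterministic."""
    lines = [
        "| pack | provider | verdict | runtime (s) | failure tag |",
        "|---|---|---|---|---|",
    ]
    for r in sorted(rows, key=lambda r: (r.pack, r.provider)):
        tag = r.failure_tag or ""
        lines.append(
            f"| {r.pack} | {r.provider} | {r.verdict} | {r.runtime_s:.0f} | {tag} |"
        )
    return "\n".join(lines)


def packs_reaching_match(rows, provider="anthropic"):
    """Packs with at least one MATCH run on the given provider (assumed reading)."""
    return {r.pack for r in rows if r.provider == provider and r.verdict == "MATCH"}


def exit_criterion_met(rows, required=4, provider="anthropic"):
    """True when at least `required` distinct packs reached MATCH."""
    return len(packs_reaching_match(rows, provider)) >= required
```

A stricter reading, such as requiring MATCH on all three runs of a pack, would only change `packs_reaching_match`; the table rendering stays the same.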


Next — once Now lands

Build only after the benchmark loop is honest.

| Item | Why now becomes possible |
|---|---|
| run_manifest.json per run (model, commit, timestamps, verdict, paths) | Benchmarks need a stable artifact format |
| Optional human-approved spec | Cheapest reliability win for unreliable packs |
| 2-paragraph “what it is / isn’t” README pitch + demo GIF | Now we have evidence to back it up |
| demo_transcript.md (one full annotated run) | Same |
| Auditor stricter on missing coefficients | Shows up only after running 6 papers |
| Headline coefficient card always renders when both JSONs exist | Polishing once we’ve seen real failure modes |

Exit: an outside economist can run Dehejia–Wahba or Card–Krueger and explain MATCH/CLOSE/MISMATCH from artifacts alone, without reading agent logs.
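One possible shape for run_manifest.json is sketched below. The five field groups (model, commit, timestamps, verdict, paths) come from the table above; the exact key names, the split of "timestamps" into `started_at` and `finished_at`, and the helper names `RunManifest` and `write_manifest` are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class RunManifest:
    """Per-run record; key names are illustrative, not a fixed schema."""
    model: str
    commit: str
    started_at: str
    finished_at: str
    verdict: str
    paths: dict = field(default_factory=dict)  # e.g. {"audit": "replication_audit.md"}


def utc_now():
    """ISO 8601 timestamp in UTC, so manifests compare across machines."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


def write_manifest(manifest, run_dir):
    """Write run_manifest.json into run_dir and return its path."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    out = run_dir / "run_manifest.json"
    # sort_keys keeps diffs between runs small and stable.
    out.write_text(
        json.dumps(asdict(manifest), indent=2, sort_keys=True) + "\n",
        encoding="utf-8",
    )
    return out
```

Passing the commit hash in as a plain string, rather than shelling out to git inside the helper, keeps the writer usable in mocked runs.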


Later — directional, not committed

Each item gets revisited only after the registry phase has data.

  • Replication registry (registry/papers.yaml + stored runs + static pass-rate page). Not built before Now/Next, because a registry without measured pass rates is marketing.
  • Multi-estimand audits (several coefficients per paper, one audit table)
  • Author replication zip ingestion (AEA openICPSR packages → /workspace)
  • Stata in the sandbox (license + image size; only if Python proves insufficient)
  • Batch mode (replicate-ai --batch registry/ for nightly benchmarks)
  • Restricted-data documentation path (record verdict CANNOT_RUN with reason; no credential handling)

Won’t do (v1)

(Mirrors DESIGN.md §10.)

  • Multi-tenant hosting, auth, billing.
  • Replicating every table or figure in a paper.
  • Credentialed microdata (PSID, Compustat, IPUMS, Census restricted).
  • Beating the AEA Data Editor’s replication template on static package analysis.
  • Generic “any economics paper” claims.

Open questions

These are unresolved enough that the roadmap should bend around the answers.

  1. Spec extraction reliability: Across the six packs, does the agent pick the right table/column without hand-holding? If not, the “locked spec” bet weakens.
  2. Auditor calibration: Is rel. dev. ≤ 5% the right MATCH threshold across all six? Or do SEs need their own bucket (SE_DRIFT)?
  3. Cost / latency: What is the median $ cost and wall-clock time per Anthropic run? Anything above ~$1 or ~5 minutes erodes the “quick check” pitch.
  4. Cross-model parity: Does Cloudflare Kimi get within one verdict bucket of Sonnet on card_krueger? This determines whether non-Anthropic providers are usable beyond smoke tests.
  5. Reproducibility: Do two runs with the same example pack produce the same verdict and similar coefficients? If not, “verification” is fragile.

The benchmark phase is partly designed to answer these.
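Question 2 can be made concrete with a small sketch. It is not the auditor's actual rubric: the 5% MATCH tolerance and the SE_DRIFT name come from the question above, while the CLOSE cutoff (`close_tol`), the SE tolerance (`se_tol`), and all function names are assumptions. The two made-up samples show why SEs might deserve their own bucket: Welch and pooled-variance formulas give the same point estimate but different standard errors.

```python
from math import sqrt
from statistics import variance


def diff_in_means_se(a, b, pooled=False):
    """Standard error of mean(a) - mean(b): Welch by default, pooled on request."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    if pooled:
        sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
        return sqrt(sp2 * (1 / na + 1 / nb))
    return sqrt(va / na + vb / nb)


def relative_deviation(estimate, published):
    """|estimate - published| / |published|; undefined when published is 0."""
    if published == 0:
        raise ValueError("relative deviation is undefined for a published value of 0")
    return abs(estimate - published) / abs(published)


def classify(est_coef, pub_coef, est_se=None, pub_se=None,
             match_tol=0.05, close_tol=0.20, se_tol=0.05):
    """Bucket a run; close_tol and se_tol are placeholder values, not calibrated."""
    coef_dev = relative_deviation(est_coef, pub_coef)
    if coef_dev > close_tol:
        return "MISMATCH"
    if coef_dev > match_tol:
        return "CLOSE"
    # Coefficient agrees; flag the run separately if only the SE diverges.
    if est_se is not None and pub_se is not None:
        if relative_deviation(est_se, pub_se) > se_tol:
            return "SE_DRIFT"
    return "MATCH"


# Illustrative data only, not taken from any paper.
treated = [10, 12, 14, 16]
control = [9, 10, 11, 10, 9, 11, 10, 10]
welch = diff_in_means_se(treated, control)
pooled = diff_in_means_se(treated, control, pooled=True)
print(classify(3.0, 3.0, est_se=pooled, pub_se=welch))  # SE_DRIFT
```

With unequal group sizes and variances, the pooled SE here is roughly 28% below the Welch SE, so a single 5% rule applied to coefficients alone would report MATCH and hide the difference.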


Metrics worth tracking (and reporting in BENCHMARK.md)

| Metric | Why |
|---|---|
| Pass rate (MATCH share over runs × packs × providers) | Headline trust signal |
| Headline rel. dev. (estimate vs published) | Quality of replication, not just verdict |
| Wall-clock time (preflight, agent, total) | Demo viability |
| Run cost ($ tokens) | Determines who can use it |
| Failure tag distribution | Tells us what to fix next |
| Audit-vs-handcheck agreement (sampled) | Catches auditor false MATCHes |
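Several of these metrics fall out of a single pass over logged runs. The sketch below assumes each run is a plain dict with hypothetical keys `verdict`, `runtime_s`, `cost_usd`, and an optional `failure_tag`; the project's real log format may differ.

```python
from collections import Counter
from statistics import median


def summarize(runs):
    """Aggregate pass rate, median time and cost, and failure tag counts."""
    if not runs:
        raise ValueError("no runs to summarize")
    matches = sum(1 for r in runs if r["verdict"] == "MATCH")
    # Only runs that carry a tag contribute to the distribution.
    tags = Counter(r["failure_tag"] for r in runs if r.get("failure_tag"))
    return {
        "pass_rate": matches / len(runs),
        "median_runtime_s": median(r["runtime_s"] for r in runs),
        "median_cost_usd": median(r["cost_usd"] for r in runs),
        "failure_tags": dict(tags),
    }
```

Medians rather than means are used for time and cost because a single timeout or runaway run would otherwise dominate the summary.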

Example pack backlog (Phase: Later)

Candidates that fit the bets above. Add only after the current six pass reliably.

| Paper | Why interesting |
|---|---|
| LaLonde (1986) full pseudo-experiment | Extends dehejia_wahba; matching/PSID controls |
| Mincer-style wage regression on a public CPS extract | Canonical OLS; lowest-friction teaching example |
| Angrist & Krueger (1991) IV, quarter of birth | Heavier data prep; classic IV |
| Bertrand & Mullainathan (2004) | Audit / field experiment; different design family |
| Duflo (2001) school construction | RCT-style; AEA package likely available |

Each new pack adds: examples/<name>/ + paper link + target_spec_reference.json + a benchmark row.
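A pre-PR check for the file-based part of that list might look like the sketch below. It only verifies that target_spec_reference.json exists in the pack folder and parses as JSON; the helper name `check_pack` is hypothetical, and the file's schema, the paper link, and the benchmark row are not checked because their locations and formats are not specified here.

```python
import json
from pathlib import Path


def check_pack(pack_dir):
    """Return a list of problems found in one example pack folder (empty if none)."""
    pack_dir = Path(pack_dir)
    if not pack_dir.is_dir():
        return [f"{pack_dir} is not a directory"]
    problems = []
    spec = pack_dir / "target_spec_reference.json"
    if not spec.is_file():
        problems.append("missing target_spec_reference.json")
    else:
        try:
            json.loads(spec.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"target_spec_reference.json is not valid JSON: {exc}")
    return problems
```

Returning a list of messages instead of raising on the first problem lets a reviewer see everything that is missing in one pass.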


How to pick something up

  1. Always start with Now. Items 1–6 are the only “committed” work.
  2. Never skip ahead to Later without first writing or updating an entry under Open questions above; if you don’t know the answer to one of the five, that’s the next thing to investigate.
  3. Schemas / architecture changes: update DESIGN.md and AGENTS.md in the same PR.
  4. New example pack: follow the layout in examples/README.md; add a benchmark row in the same PR.

Related documents

| Document | Purpose |
|---|---|
| DESIGN.md | Architecture, /workspace contract, auditor rubric |
| DESIGN_TUI.md | Terminal dashboard UX spec |
| DESIGN_GUI.md | Browser GUI UX spec |
| test.md | Agent test suite (expected vs actual per pack) |
| examples/README.md | Example packs and paper links |
| replicate_ai/README.md | Setup and CLI |
| AGENTS.md | Guidance for coding agents |