Comparing Geo-Experiment Tools: A Monte Carlo Simulation Study

Companion code for Recast's study comparing four geo-experiment estimation tools (Meta GeoLift, CausalPy, Google Matched Markets, and Google CausalImpact) across realistic stress-test scenarios. The study runs 1,000 Monte Carlo iterations for each of the 4 scenarios under each of 2 effect conditions and fits all 4 tools to every panel (32,000 total model fits). Panels come from a shared-trend data-generating process (DGP) with AR(1) noise, weekly seasonality, and log-normal geo baselines.

The full technical report and executive summary are available on Recast's platforms.

Repository Structure

geolift-study/
├── config/
│   └── tools.yaml                     # Tool configs equalized at 95% confidence
├── src/
│   ├── R/
│   │   ├── generate_panels.R          # Parameterized DGP panel generation
│   │   ├── run_geolift.R              # GeoLift wrapper (augmented SC + conformal)
│   │   └── run_causalimpact.R         # CausalImpact wrapper (BSTS)
│   └── python/
│       ├── run_tools.py               # Pipeline orchestration (all tools)
│       ├── run_causalpy.py            # CausalPy wrapper (Bayesian SC)
│       ├── run_google_mm.py           # Google Matched Markets wrapper (TBR)
│       ├── compute_att.py             # ATT extraction utilities
│       └── data_converters.py         # Panel format converters
├── analysis/                          # Post-simulation analysis scripts
│   ├── compute_metrics.py             # Coverage, bias, CI width, power/FPR
│   ├── generate_tables.py             # Markdown result tables
│   ├── plot_forest.py                 # ATT forest plot (article figure)
│   ├── plot_ci_gallery.py             # CI gallery plot (article figure)
│   ├── smoke_test.py                  # End-to-end validation
│   ├── eval_against_golden.py         # Regression detection vs golden reference
│   └── audit_metrics.py               # Independent metrics audit
├── tests/
│   ├── test_compute_att.py            # ATT computation unit tests
│   └── test_data_converters.py        # Data format converter unit tests
├── results/
│   ├── raw/results.jsonl              # Per-iteration results (32,000 records)
│   └── aggregated/                    # Summary tables (CSV + markdown)
├── figures/                           # Generated plots (PNG)
├── environment/
│   └── install_packages.R             # R dependency bootstrapping
├── VERSIONS.md                        # Tool & dependency version manifest
├── renv.lock                          # Pinned R package versions
├── pyproject.toml                     # Python dependencies
├── requirements.txt                   # Frozen Python dependencies (exact versions)
└── Makefile                           # Pipeline automation (see below)

Script Reference

| Script | Purpose |
| --- | --- |
| `src/R/generate_panels.R` | Parameterized DGP: generates synthetic geo-level panels with shared trend, AR(1) noise, weekly seasonality, and log-normal baselines. Outputs parquet files per scenario/effect/iteration. |
| `src/python/run_tools.py` | Pipeline orchestrator: loops over all panels, runs all 4 tools concurrently, writes results to JSONL with crash recovery. |
| `src/python/run_causalpy.py` | CausalPy wrapper: Bayesian SC with Dirichlet prior, convergence monitoring (rhat/ESS), retry logic. Extracts posterior predictive (y_hat). |
| `src/python/run_google_mm.py` | Google Matched Markets wrapper: TBR regression via `matched_markets` package. Deterministic (OLS). |
| `src/R/run_geolift.R` | GeoLift wrapper: Augmented SC (Ridge) with block conformal inference. Called as R subprocess. |
| `src/R/run_causalimpact.R` | CausalImpact wrapper: BSTS (local level + spike-and-slab). Called as R subprocess. |
| `src/python/compute_att.py` | ATT normalization: computes true ATT from DGP counterfactuals, checks coverage on level-scale CIs. |
| `src/python/data_converters.py` | Format converters: transforms canonical long-format panels to CausalPy (wide), Google MM (TBR), and GeoLift (location/date/Y) formats. |
| `analysis/compute_metrics.py` | Aggregates raw JSONL results into per-cell metrics: ATT, bias, coverage, FNR, FPR, CI width. |
| `analysis/generate_tables.py` | Generates markdown and CSV result tables per scenario from aggregated metrics. |
| `analysis/plot_forest.py` | ATT forest plot: article figure showing point estimates and CIs across tools and scenarios. |
| `analysis/plot_ci_gallery.py` | CI gallery: article figure showing credible/confidence interval bands for sampled iterations. |
| `analysis/smoke_test.py` | End-to-end validation: checks that all tools produced results, that all artifacts exist, and that no output is 100% null. |
| `analysis/eval_against_golden.py` | Compares current metrics against a golden reference for regression detection. |
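
To make the format-conversion step more concrete, here is a minimal sketch of pivoting a long-format panel into a wide layout and renaming columns to the location/date/Y layout mentioned above. It is an illustration only, not the code in `src/python/data_converters.py`: the input column names `geo`, `date`, and `y` and both function names are assumptions made for this example.

```python
def long_to_wide(rows):
    """Pivot long records {geo, date, y} into (dates, {geo: [y per date]}).

    Column names `geo`, `date`, `y` are hypothetical; ISO date strings
    are assumed so that lexical sort order equals chronological order.
    """
    dates = sorted({r["date"] for r in rows})
    index = {d: i for i, d in enumerate(dates)}
    wide = {}
    for r in rows:
        # One list per geo, aligned to `dates`; missing cells stay None.
        series = wide.setdefault(r["geo"], [None] * len(dates))
        series[index[r["date"]]] = r["y"]
    return dates, wide


def to_location_date_y(rows):
    """Rename the assumed long-format columns to location/date/Y."""
    return [{"location": r["geo"], "date": r["date"], "Y": r["y"]} for r in rows]


long_rows = [
    {"geo": "geo_1", "date": "2024-01-02", "y": 11.0},
    {"geo": "geo_0", "date": "2024-01-01", "y": 20.0},
    {"geo": "geo_1", "date": "2024-01-01", "y": 10.0},
    {"geo": "geo_0", "date": "2024-01-02", "y": 21.0},
]
dates, wide = long_to_wide(long_rows)
```

Keeping a single canonical long format and converting at the edge of each wrapper means every tool sees the same underlying numbers, which is the point of the equalization choices described later in this README.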

How the pipeline works

The simulation is automated through a Makefile, which defines a named target for each step of the study so that you can reproduce it without knowing which scripts to call or in what order. Running make <target> executes the right scripts in the right sequence.

The core of the pipeline is src/python/run_tools.py. It's a pure Python script (no AI involved) that loops through every scenario, iteration, and tool: it loads a synthetic panel, feeds it to CausalPy, Google MM, GeoLift, and CausalImpact, collects the results, and writes them to results/raw/results.jsonl. It also checkpoints progress so a run can resume if interrupted.
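
The checkpoint-and-resume behavior can be sketched as an append-only JSONL log that is re-read on startup. This is a minimal illustration under stated assumptions, not the logic in `run_tools.py`: the record keys (`scenario`, `effect`, `iteration`, `tool`) and the function names are hypothetical.

```python
import json
from pathlib import Path


def completed_keys(results_path):
    """Return the (scenario, effect, iteration, tool) keys already on disk."""
    done = set()
    path = Path(results_path)
    if not path.exists():
        return done
    with path.open() as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # a crash can leave a truncated line; skip it
            done.add((rec["scenario"], rec["effect"], rec["iteration"], rec["tool"]))
    return done


def _ends_mid_line(path):
    """True if the file is non-empty and its last byte is not a newline."""
    if not path.exists() or path.stat().st_size == 0:
        return False
    with path.open("rb") as fh:
        fh.seek(-1, 2)  # jump to the last byte
        return fh.read(1) != b"\n"


def run_with_resume(jobs, fit, results_path):
    """Run fit(job) for each job not yet recorded; append one JSON line per fit."""
    path = Path(results_path)
    done = completed_keys(path)
    repair = _ends_mid_line(path)
    n_run = 0
    with path.open("a") as fh:
        if repair:
            fh.write("\n")  # terminate a truncated line before appending
        for job in jobs:
            key = (job["scenario"], job["effect"], job["iteration"], job["tool"])
            if key in done:
                continue
            fh.write(json.dumps({**job, **fit(job)}) + "\n")
            fh.flush()  # hand each record to the OS before starting the next fit
            n_run += 1
    return n_run
```

Writing one self-contained line per fit is what makes resumption cheap: an interrupted run loses at most the fit that was in progress, and the next run skips everything already logged.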

Reproducing Results

Prerequisites

Python 3.12+
R 4.x with the renv package (install.packages("renv"))
GNU Make

Setup

git clone https://github.com/getrecast/geolift-study.git
cd geolift-study

# Install dependencies (Python venv + R packages via renv)
make env

make env creates a Python virtual environment, installs frozen Python dependencies, and restores pinned R packages from renv.lock.

Run the full pipeline

make all    # panels → run → metrics → tables → figures

This generates panels, runs all 32,000 model fits, computes metrics, and produces result tables and figures. The full run takes several hours depending on hardware.

Pre-computed results

Raw results and figures are committed to this repository. Panels (453 MB) are not — they are regenerated via make panels. To skip the simulation and go straight to analysis:

make metrics   # Compute metrics from existing results
make tables    # Generate summary tables
make figures   # Generate article figures

Available commands

Run make help to see all targets:

| Command | What it does |
| --- | --- |
| `make env` | Install Python and R dependencies |
| `make panels` | Generate synthetic panels |
| `make run` | Run GeoLift, CausalPy, Google MM, CausalImpact on all panels |
| `make metrics` | Compute aggregated metrics from results.jsonl |
| `make tables` | Generate result tables (CSV + markdown) |
| `make figures` | Generate article figures |
| `make smoke` | Quick smoke test (5 iterations, validates all outputs) |
| `make eval` | Compare current metrics against golden reference |
| `make eval-capture` | Capture current metrics as golden reference |
| `make all` | Run the full pipeline end to end |
| `make clean` | Remove all generated data and results |

Scenarios

| Scenario | Name | Stress Test |
| --- | --- | --- |
| A1 | Textbook | Clean data, well-behaved donor pool (baseline) |
| A2 | Outlier (5x) | Treated geo inflated 5x: convex hull violation |
| A3 | Small donor pool | 10 total geos (1 treated + 9 controls): sparse pool |
| A4 | Short pre-treatment | 30 pre-treatment days instead of 90: data scarcity |

Each scenario is run under two effect conditions: null (0%, for false positive rate calibration) and effect (7.5% lift, for power and bias assessment).
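
The 32,000 total follows from the design grid: 4 scenarios, 2 effect conditions, 1,000 iterations, and 4 tools. The short sketch below enumerates that grid to make the arithmetic explicit. The tool labels and the `design_grid` helper are names chosen for this example, not identifiers from the repository.

```python
from itertools import product

SCENARIOS = ["A1", "A2", "A3", "A4"]
EFFECTS = {"null": 0.0, "effect": 0.075}  # relative lift per condition
TOOLS = ["geolift", "causalpy", "google_mm", "causalimpact"]  # hypothetical labels
N_ITERATIONS = 1000


def design_grid(n_iterations=N_ITERATIONS):
    """Yield one (scenario, effect_name, iteration, tool) tuple per model fit."""
    yield from product(SCENARIOS, EFFECTS, range(n_iterations), TOOLS)


n_panels = len(SCENARIOS) * len(EFFECTS) * N_ITERATIONS
n_fits = sum(1 for _ in design_grid())
print(n_panels, n_fits)  # prints: 8000 32000
```

Each panel is shared by all four tools, so the study generates 8,000 panels and records 32,000 fits.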

Tools

| Tool | Method | Implementation |
| --- | --- | --- |
| Meta GeoLift | Augmented Synthetic Control (Ridge + conformal inference) | R package via subprocess |
| CausalPy | Bayesian Synthetic Control (Dirichlet-weighted) | Python, PyMC backend |
| Google Matched Markets | Time-Based Regression (TBR) | Python package |
| Google CausalImpact | BSTS (local level + spike-and-slab regression) | R package via subprocess |

All tools are configured at 95% confidence for fair comparison. See config/tools.yaml for full configuration details.

Configuration

Tool configurations are defined in config/tools.yaml — tool versions, hyperparameters, and inference settings. Scenario definitions and effect sizes are parameterized in src/R/generate_panels.R.
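
For readers who want a feel for the kind of panel being generated, the following is a rough Python sketch of a shared-trend DGP with AR(1) noise, weekly seasonality, and log-normal geo baselines. It is not a port of `generate_panels.R`. The multiplicative functional form and every default value (`n_geos`, `n_post`, `rho`, `noise_sd`, the trend slope, the seasonal amplitude, the baseline scale) are assumptions made for illustration; only the 90-day pre-period and the 7.5% lift come from this README.

```python
import math
import random


def simulate_panel(n_geos=20, n_pre=90, n_post=30, lift=0.075,
                   rho=0.5, noise_sd=0.02, seed=0):
    """Return (observed, counterfactual): dicts mapping geo -> daily values.

    Geo 0 is the treated unit; the lift applies to it in the post period only.
    All defaults except n_pre and lift are illustrative assumptions.
    """
    rng = random.Random(seed)  # per-panel seed keeps the draw reproducible
    n_days = n_pre + n_post
    # Shared trend: one slow linear drift common to every geo.
    trend = [1.0 + 0.001 * t for t in range(n_days)]
    # Weekly seasonality: a 7-day sinusoid common to every geo.
    season = [1.0 + 0.1 * math.sin(2 * math.pi * t / 7) for t in range(n_days)]
    observed, counterfactual = {}, {}
    for g in range(n_geos):
        baseline = rng.lognormvariate(math.log(1000.0), 0.5)  # log-normal geo size
        eps, y0 = 0.0, []
        for t in range(n_days):
            eps = rho * eps + rng.gauss(0.0, noise_sd)  # AR(1) noise
            y0.append(baseline * trend[t] * season[t] * (1.0 + eps))
        counterfactual[g] = y0
        if g == 0:
            observed[g] = [y * (1.0 + lift) if t >= n_pre else y
                           for t, y in enumerate(y0)]
        else:
            observed[g] = list(y0)
    return observed, counterfactual


obs, cf = simulate_panel()
post = range(90, 120)
true_att = sum(obs[0][t] - cf[0][t] for t in post) / len(post)
true_att_pct = true_att / (sum(cf[0][t] for t in post) / len(post))
```

Because the simulator keeps the untreated series for the treated geo, the true ATT is known exactly, which is what allows coverage and bias to be scored against ground truth rather than against any tool's own counterfactual estimate.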

Equalization Protocol

All tools are configured at 95% confidence for fair comparison. The following choices ensure that result differences reflect estimator methodology, not pipeline artifacts:

| Decision | Choice | Rationale |
| --- | --- | --- |
| Significance test | CI exclusion (all tools) | Two-sided: reject if CI excludes zero. GeoLift's conformal p-value is recorded but not used for FPR/FNR. |
| CI width metric | Level-scale (absolute units) | Avoids confound from different counterfactual denominators across tools. |
| Coverage | Level-scale CIs vs true ATT | Computed identically for all tools in `compute_att.py`. |
| ATT % | Unified denominator (true counterfactual) | All tools use the same denominator for ATT%, not their own estimated counterfactual. |
| Conformal inference | Block permutations (GeoLift) | Deterministic; preserves temporal dependence from AR(1) DGP. |
| MCMC seeding | Per-iteration seed (CausalPy, CausalImpact) | Bitwise reproducibility. Google MM is deterministic (OLS). GeoLift block conformal is deterministic. |
| Convergence monitoring | CausalPy only (rhat/ESS gating) | Each tool is used according to its ecosystem's best practices. CausalImpact uses bsts defaults (niter=2000) without post-hoc convergence filtering. |
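
The scoring rules in the table can be sketched in a few lines. The snippet below is an illustration of the stated rules (CI exclusion, level-scale coverage and width, ATT% over the true counterfactual), not the code in `compute_att.py` or `compute_metrics.py`. The `FitResult` type, the function names, and the toy numbers are all made up for this example and are not study results.

```python
from dataclasses import dataclass


@dataclass
class FitResult:
    att: float      # point estimate, level scale (absolute units)
    ci_low: float   # lower 95% bound, level scale
    ci_high: float  # upper 95% bound, level scale


def is_significant(r):
    """Two-sided test by CI exclusion: reject when the interval excludes zero."""
    return r.ci_low > 0.0 or r.ci_high < 0.0


def covers(r, true_att):
    """Coverage check on the level scale against the true ATT."""
    return r.ci_low <= true_att <= r.ci_high


def att_pct(r, true_counterfactual_mean):
    """ATT% over the true counterfactual, the same denominator for every tool."""
    return r.att / true_counterfactual_mean


def summarize(results, true_atts):
    """Aggregate one scenario/effect/tool cell."""
    n = len(results)
    pairs = list(zip(results, true_atts))
    return {
        # Read as FPR under the null condition and as power under the effect
        # condition (FNR = 1 - power).
        "rejection_rate": sum(is_significant(r) for r in results) / n,
        "coverage": sum(covers(r, a) for r, a in pairs) / n,
        "mean_bias": sum(r.att - a for r, a in pairs) / n,
        "mean_ci_width": sum(r.ci_high - r.ci_low for r in results) / n,
    }


# Toy inputs for illustration only.
toy = [FitResult(5.0, 1.0, 9.0), FitResult(2.0, -2.0, 6.0),
       FitResult(-1.0, -3.0, 1.0), FitResult(8.0, 6.0, 10.0)]
summary = summarize(toy, [4.0] * 4)
```

Using one rejection rule and one denominator for all tools means a difference in FPR or ATT% between two tools cannot be traced back to how each tool happens to report its own significance or baseline.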
