CNaught-Inc · jm-cnaught · May 20, 2026 · May 13, 2026
diff --git a/.env.example b/.env.example
@@ -0,0 +1,26 @@
+# Copy this file to .env and fill in real values. .env is gitignored.
+# Each value is required only by the steps that use it; see README.md
+# for which scripts need which credentials.
+
+# --- GitHub Search API (required for github_ai_daily.py and fetch_bot_donors.py) ---
+# Fine-grained personal access token. No scopes needed for public-search reads.
+# Create at: https://github.com/settings/personal-access-tokens/new
+# Rate limit: 30 req/min with token (vs 10 req/min unauthenticated).
+GITHUB_TOKEN=
+
+# --- Google Cloud Platform (required for fetch_branch_activity.py,
+#     fetch_branch_creates.py, fetch_daily_totals.py) ---
+# Your GCP project ID. Must have BigQuery enabled and a billing account
+# attached (the GH Archive dataset is public but query costs are billed
+# to your project). See README.md for setup instructions.
+GCP_PROJECT=
+
+# Authentication is via Application Default Credentials. Run once on your
+# machine:    gcloud auth application-default login
+# Or set the standard variable to point at a service-account JSON:
+# GOOGLE_APPLICATION_CREDENTIALS=/absolute/path/to/sa-key.json
+
+# --- Slack alerting (optional — pipeline runs fine without it) ---
+# Incoming-webhook URL. Pipeline posts anomaly/failure summaries here;
+# clean runs stay quiet. Leave blank to disable.
+SLACK_WEBHOOK_URL=
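
The comments in `.env.example` tie each credential to specific scripts. A minimal Python sketch of that mapping follows, assuming only what those comments state. `REQUIRED_ENV` and `runnable_scripts` are hypothetical names, not code from the repo, and the check looks at environment variables only: it cannot tell whether `gcloud auth application-default login` has been run.

```python
import os

# Script -> required environment variable, as stated in the .env.example comments.
# REQUIRED_ENV and runnable_scripts are illustrative names, not repo code.
REQUIRED_ENV = {
    "github_ai_daily.py": "GITHUB_TOKEN",
    "fetch_bot_donors.py": "GITHUB_TOKEN",
    "fetch_branch_activity.py": "GCP_PROJECT",
    "fetch_branch_creates.py": "GCP_PROJECT",
    "fetch_daily_totals.py": "GCP_PROJECT",
}


def runnable_scripts(env=None):
    """Return the scripts whose required variable is set to a non-empty value."""
    env = os.environ if env is None else env
    return sorted(
        script for script, var in REQUIRED_ENV.items() if env.get(var, "").strip()
    )


if __name__ == "__main__":
    for script in runnable_scripts():
        print("credentials present for", script)
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise without touching the real environment.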
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,29 @@
+name: CI
+
+on:
+  push:
+    branches:
+      - "**"
+  pull_request:
+  workflow_dispatch:
+
+permissions:
+  contents: read
+
+jobs:
+  tests:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+          cache: pip
+
+      - name: Install dependencies
+        run: pip install -r requirements.txt
+
+      - name: Run tests
+        run: python -m pytest tests/ -v
diff --git a/.github/workflows/daily.yml b/.github/workflows/daily.yml
@@ -0,0 +1,55 @@
+name: Daily Pipeline
+
+on:
+  # Schedule intentionally disabled until the first manual run on main passes.
+  # Re-enable in a follow-up PR after verifying BigQuery auth, git push, and
+  # Slack webhook work end-to-end in Actions.
+  #
+  # schedule:
+  #   # 10:00 UTC daily — GitHub indexes have settled for "yesterday" by then.
+  #   - cron: "0 10 * * *"
+  workflow_dispatch:
+    inputs:
+      end_date:
+        description: "Target end date (YYYY-MM-DD). Default: yesterday UTC."
+        required: false
+        type: string
+
+permissions:
+  contents: write
+
+jobs:
+  run:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          # Need a real ref so we can push back.
+          fetch-depth: 1
+          token: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+          cache: pip
+
+      - name: Install dependencies
+        run: pip install -r requirements.txt
+
+      - name: Configure git for bot commits
+        run: |
+          git config user.name "github-actions[bot]"
+          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
+
+      - name: Run daily pipeline
+        env:
+          GITHUB_TOKEN: ${{ secrets.GH_SEARCH_TOKEN }}
+          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
+          GITHUB_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
+          # Pass the input through env instead of expanding it inside the
+          # script, so a crafted value cannot inject shell commands.
+          END_DATE: ${{ inputs.end_date }}
+        run: |
+          # Forward --end-date only when the input was provided.
+          python run_pipeline.py --mode daily ${END_DATE:+--end-date "$END_DATE"}
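
The `end_date` input above is described as `YYYY-MM-DD`, defaulting to yesterday in UTC. How `run_pipeline.py` actually parses `--end-date` is not shown in this diff. The following is a minimal sketch of one way to do it, with `parse_end_date`, `yesterday_utc`, `build_parser`, and `resolve_end_date` as hypothetical names.

```python
import argparse
from datetime import date, datetime, timedelta, timezone


def parse_end_date(value: str) -> date:
    """Parse a YYYY-MM-DD string; raise ArgumentTypeError on anything else."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError as exc:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {value!r}") from exc


def yesterday_utc(now=None) -> date:
    """Return yesterday's date in UTC; `now` can be injected for testing."""
    now = now or datetime.now(timezone.utc)
    return now.date() - timedelta(days=1)


def build_parser() -> argparse.ArgumentParser:
    # Only the two flags the workflows pass; any other flags are out of scope here.
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=["daily", "weekly"], required=True)
    parser.add_argument("--end-date", type=parse_end_date, default=None)
    return parser


def resolve_end_date(argv, now=None) -> date:
    """Use --end-date when given, otherwise fall back to yesterday (UTC)."""
    args = build_parser().parse_args(argv)
    return args.end_date or yesterday_utc(now)
```

Validating the date in the parser means a malformed workflow input fails before any API call is made.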
diff --git a/.github/workflows/weekly.yml b/.github/workflows/weekly.yml
@@ -0,0 +1,61 @@
+name: Weekly Pipeline
+
+on:
+  # Schedule intentionally disabled until the first manual run on main passes.
+  # Re-enable in a follow-up PR after verifying BigQuery auth, git push, and
+  # Slack webhook work end-to-end in Actions.
+  #
+  # schedule:
+  #   # Monday 14:00 UTC — Sunday's BigQuery daily table has landed by then.
+  #   - cron: "0 14 * * 1"
+  workflow_dispatch:
+    inputs:
+      end_date:
+        description: "Target end date (YYYY-MM-DD). Default: yesterday UTC."
+        required: false
+        type: string
+
+permissions:
+  contents: write
+
+jobs:
+  run:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          fetch-depth: 1
+          token: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+          cache: pip
+
+      - name: Install dependencies
+        run: pip install -r requirements.txt
+
+      - name: Auth to Google Cloud
+        env:
+          GCP_SA_KEY: ${{ secrets.GCP_SA_KEY }}
+        run: |
+          echo "$GCP_SA_KEY" > /tmp/gcp-sa-key.json
+          echo "GOOGLE_APPLICATION_CREDENTIALS=/tmp/gcp-sa-key.json" >> "$GITHUB_ENV"
+
+      - name: Configure git for bot commits
+        run: |
+          git config user.name "github-actions[bot]"
+          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
+
+      - name: Run weekly pipeline
+        env:
+          GITHUB_TOKEN: ${{ secrets.GH_SEARCH_TOKEN }}
+          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
+          GITHUB_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
+          # Pass the input through env instead of expanding it inside the
+          # script, so a crafted value cannot inject shell commands.
+          END_DATE: ${{ inputs.end_date }}
+        run: |
+          # Forward --end-date only when the input was provided.
+          python run_pipeline.py --mode weekly ${END_DATE:+--end-date "$END_DATE"}
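
Both workflows hand `SLACK_WEBHOOK_URL` and `GITHUB_RUN_URL` to the pipeline, and `.env.example` says that clean runs stay quiet and a blank URL disables alerting. The alerting code itself is not part of this diff, so the sketch below is only an illustration of that behaviour. `build_payload` and `post_summary` are hypothetical names, and the message layout is an assumption; the `{"text": ...}` JSON body is the standard Slack incoming-webhook format.

```python
import json
import os
import urllib.request


def build_payload(issues, run_url=None):
    """Build a Slack incoming-webhook payload, or None when there is nothing to report."""
    if not issues:
        return None  # clean runs stay quiet
    lines = ["Pipeline reported issues:"] + [f"- {issue}" for issue in issues]
    if run_url:
        lines.append(f"Run: {run_url}")
    return {"text": "\n".join(lines)}


def post_summary(issues, env=None):
    """Post to Slack if a webhook is configured; return False when nothing is sent."""
    env = os.environ if env is None else env
    webhook = env.get("SLACK_WEBHOOK_URL", "").strip()
    payload = build_payload(issues, env.get("GITHUB_RUN_URL"))
    if not webhook or payload is None:
        return False  # alerting disabled, or nothing to say
    request = urllib.request.Request(
        webhook,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status == 200
```

Keeping the payload builder separate from the network call lets the quiet-on-clean-run rule be checked without a real webhook.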
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,46 @@
+# OS
+.DS_Store
+
+# Screenshots and dashboard renders (regenerable)
+*.png
+dashboard.html
+
+# PDFs in root (old location — research papers)
+/*.pdf
+
+# References folder is tracked (source material for methodology)
+!references/*.pdf
+
+# Generated Excel (regenerable from build_carbon_workbook.py)
+carbon_estimation_workbook.xlsx
+
+# GH Archive local data
+gharchive_data/
+
+# Python
+__pycache__/
+*.pyc
+
+# Environment
+.env
+
+# Git worktrees (isolated feature branches)
+.worktrees/
+
+# Claude Code local per-machine settings (may contain session-approved
+# shell snippets with embedded secrets — must never be tracked)
+.claude/settings.local.json
+
+# Superpowers brainstorming companion session dirs
+.superpowers/
+
+# Exploratory pilots (kept local, never tracked)
+pilot_*.py
+pilot_*.csv
+claude_code_model_mix_pilot.csv
+
+# Log files (regenerable from re-running fetchers)
+*.log
+
+# Intervention analysis outputs (regenerable from run_intervention_analysis.py)
+outputs/
diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,153 @@
+# AGENTS.md
+
+Operating manual for AI coding agents working with this repository. Read this file before running any script or modifying any file.
+
+## Project goal
+
+Estimate market share and adoption velocity of AI coding agents (Cursor, Claude Code, OpenAI Codex, GitHub Copilot, Devin, Aider, and ~25 others) from publicly observable GitHub signals. The repo triangulates four data signals (commit attribution, branch-prefix activity, total push volume, total branch-create volume), generates a Plotly dashboard, and includes a worked intervention analysis (the October 20, 2025 Claude Code web launch).
+
+## File map
+
+**Pipeline scripts:**
+- `github_ai_daily.py` — fetches per-tool daily commit counts via GitHub Search API. Reads `GITHUB_TOKEN`. Appends to `daily_ai_commits.csv`.
+- `fetch_branch_activity.py` — BigQuery scan of `githubarchive.day.*` for PushEvents to agent-prefixed branches (`codex/`, `copilot/`, `cursor/`, etc.). Reads `GCP_PROJECT`. Writes `branch_activity_daily.csv`.
+- `fetch_branch_creates.py` — BigQuery scan for `CreateEvent` + `ref_type='branch'`. Writes `branch_creates_daily.csv` and `agent_branch_creates_daily.csv`.
+- `fetch_daily_totals.py` — BigQuery scan for total PushEvents (denominator). Writes `push_events_daily.csv`.
+- `estimate_carbon.py` — applies Jegham et al. (2025) energy/carbon framework. Writes `daily_carbon_estimates.csv`.
+- `dashboard.py` — builds the Plotly HTML dashboard from the CSVs. Writes `dashboard.html`.
+- `run_pipeline.py` — orchestrator. `--mode daily` or `--mode weekly`. Reads `GITHUB_TOKEN`, optionally `SLACK_WEBHOOK_URL`.
+- `anomaly_analysis.py` — detects signature drift in `daily_ai_commits.csv`.
+- `build_carbon_workbook.py` — Excel QA workbook.
+
+**Intervention analysis:**
+- `run_intervention_analysis.py` — orchestrator for the Oct 20 Claude Code intervention study.
+- `intervention_data.py` — shared data loaders.
+- `intervention_car.py` — event-window cumulative abnormal returns (primary estimator).
+- `intervention_breaks.py` — Bai-Perron / BEAST changepoint detection.
+- `intervention_pairs.py` — first-difference ITS on substitution pairs.
+- `intervention_bsts.py` — BSTS local-linear-trend (robustness).
+- `intervention_sdid.py` — synthetic difference-in-differences (robustness).
+- `intervention_var.py` — compositional VAR on ALR-transformed shares.
+- `intervention_sigmoids.py` — descriptive 3-param logistic fits.
+- `intervention_robustness.py` — supporting placebo + sensitivity checks.
+- `fetch_bot_donors.py` — GitHub Search fetcher for never-treated donors.
+- `vendor/synthdid/` — vendored from d2cml-ai/synthdid.py (PyPI build fails on Python 3.14).
+- `events/model_releases.csv` — dated calendar of model releases / agent launches.
+
+**Data files (all shipped with the repo):**
+- `daily_ai_commits.csv` — ~450 days × 32 tools.
+- `branch_activity_daily.csv`, `branch_creates_daily.csv`, `agent_branch_creates_daily.csv`.
+- `push_events_daily.csv`, `daily_carbon_estimates.csv`.
+- `bot_donors_daily.csv` — donor pool for SDID.
+
+**Documentation:**
+- `README.md` — human-facing setup and usage.
+- `METHODOLOGY.md` — full methodology, caveats, signal definitions.
+- `references/` — 9 PDFs of academic papers cited by the methodology.
+
+**Tests:**
+- `tests/` — pytest suite. CI runs it on every push.
+
+## Setup checklist
+
+Before running anything that talks to a network:
+
+1. **Python 3.12 environment.** `python --version` should report 3.12.x. If you're on 3.13 or 3.14, the vendored `synthdid` works but `tfcausalimpact` / `pycausalimpact` will fail — the repo uses `statsmodels.UnobservedComponents` instead and that's fine. On macOS where Python is installed from python.org, you may need to use `python3` and `pip3` instead of bare `python`/`pip`.
+2. **Install dependencies.** `pip install -r requirements.txt`.
+3. **Credentials (only if refreshing data — not needed to explore the shipped CSVs):**
+   - `cp .env.example .env`, then fill in values.
+   - `GITHUB_TOKEN` — fine-grained PAT, no scopes needed for public-search reads.
+   - `GCP_PROJECT` — your GCP project ID. Requires BigQuery enabled and a billing account. The dataset (`githubarchive.day.*`) is public; the query cost (~$5–15 for a full backfill) is billed to your project.
+   - GCP auth via either `gcloud auth application-default login` or `GOOGLE_APPLICATION_CREDENTIALS` pointing at a service-account JSON.
+   - `SLACK_WEBHOOK_URL` — optional. Leave blank to disable alerting.
+
+## First contact: what can you do?
+
+When you're pointed at this repo for the first time, your first job is to check what credentials are available. Different capabilities require different credentials; some paths work with none at all.
+
+Run this check before anything else:
+
+```bash
+python3 -c "import os; print('GITHUB_TOKEN:', bool(os.environ.get('GITHUB_TOKEN'))); print('GCP_PROJECT:', bool(os.environ.get('GCP_PROJECT')))"
+```
+
+Then route based on what's available:
+
+| Credentials | What you can do |
+|-------------|-----------------|
+| None | Explore shipped CSVs · Build dashboard · Run intervention analysis · Run test suite |
+| `GITHUB_TOKEN` only | All of the above · Refresh `daily_ai_commits.csv` via `github_ai_daily.py` · Refresh `bot_donors_daily.csv` via `fetch_bot_donors.py` |
+| `GCP_PROJECT` + GCP auth only | All no-credentials paths · Refresh BigQuery signals via the three BigQuery fetchers (`fetch_branch_activity.py`, `fetch_branch_creates.py`, `fetch_daily_totals.py`) |
+| Both | Full pipeline refresh via `run_pipeline.py --mode daily` or `--mode weekly` |
+
+**If the user has no credentials**, tell them what you can do and offer to start with the dashboard or the intervention analysis (both run end-to-end in under a minute from the shipped CSVs). Don't pressure them to set up credentials — the shipped data covers ~450 days and the headline findings are reproducible from it.
+
+**If the user has GitHub but not GCP**, note that the BigQuery signals (branch activity, branch creates, push totals) will fall out of date but the commit-attribution signal stays current.
+
+**If the user has both**, before running anything with `--backfill`, confirm the user wants to incur BigQuery costs (~$5–15 for a full backfill against their GCP billing account). Default windows in `run_pipeline.py` are bounded — you can always run those safely.
+
+## Reproduce-the-analysis walkthrough (autonomous mode)
+
+Use this when you are running the analysis without a human in the loop. Each step has a verification check; do not proceed past a failed check.
+
+1. **Clone and install.**
+   ```bash
+   git clone <repo-url> && cd <repo-name>
+   pip install -r requirements.txt
+   ```
+   Verify: `python -c "import plotly, statsmodels, numpy, pandas"` exits 0.
+
+2. **Run the test suite.**
+   ```bash
+   python -m pytest tests/ -v
+   ```
+   Verify: all tests pass. If they don't, stop and report — don't continue building on a broken baseline.
+
+3. **Build the dashboard from shipped CSVs.** No credentials needed.
+   ```bash
+   python dashboard.py
+   ```
+   Verify: `dashboard.html` is created. Open it in a browser to confirm the charts render. The dashboard spans the full data history (~450 days for commits, ~392 days for BigQuery signals).
+
+4. **Run the intervention analysis from shipped CSVs.** No credentials needed.
+   ```bash
+   python run_intervention_analysis.py
+   ```
+   Verify: the `outputs/intervention/` directory is created and contains charts and JSON results. The headline finding is in the CAR section: total-market cumulative abnormal commits ≈ +60K over the 8-day window after Oct 20, 2025, with z ≈ 5.6.
+
+5. **(Optional, requires credentials) Refresh the data.**
+   - Daily refresh: `python run_pipeline.py --mode daily --dry-run` (drop `--dry-run` to persist).
+   - Weekly refresh (BigQuery): `python run_pipeline.py --mode weekly --dry-run`. **Warning: BigQuery scans cost real money.** A full re-run is ~$5–15 against your GCP billing account. Default date windows in the orchestrator are bounded (7–14 days back); never invoke `--backfill` on the BigQuery fetchers without confirming the user wants to pay for it.
+
+## Walk-the-user-through-it mode
+
+Use this when a human says "walk me through this repo" or "help me reproduce the analysis." Operate one step at a time, not all at once.
+
+1. **Ask what they want.** Common asks: (a) just explore the data, (b) reproduce the dashboard, (c) reproduce the intervention analysis, (d) refresh from source. Each path has a different setup cost.
+2. **Check their environment.** Python version, whether they have already cloned the repo, and whether they are using a virtual environment.
+3. **For path (a) — data exploration only:** point them at the CSV files and show a 5-line pandas snippet to load one. No installs needed beyond `pandas`.
+4. **For paths (b) and (c) — dashboard or intervention analysis:** `pip install -r requirements.txt`, then run the relevant script. Show them the output file paths. No credentials needed.
+5. **For path (d) — refresh from source:** walk them through `.env.example` → `.env`, the GitHub PAT creation flow, the GCP project + BigQuery enablement flow, and `gcloud auth application-default login`. Warn explicitly about BigQuery costs before running any backfill.
+6. **After each step:** confirm it worked (show the verification command and its expected output) before moving on. Surface anything that didn't behave as expected — the user is the source of truth on whether their environment is healthy, not you.
+
+## Guardrails
+
+- **Never commit `.env`, service-account JSON files, or any file containing a real token.** `.gitignore` already covers `.env`; double-check before any `git add -A`. Prefer `git add <specific-file>` over `git add .`.
+- **Never run unbounded BigQuery scans.** The fetchers accept `--start-date`/`--end-date`. Always bound them. A full backfill is ~$5–15; a typo (e.g., scanning all of 2024) could cost much more.
+- **Never hardcode credentials in code.** Read from environment via `os.environ.get(...)`.
+- **Never commit `dashboard.html` or any large generated artifact.** They're gitignored for a reason — the file churns on every data refresh and bloats history.
+- **Never destructively rewrite tracked CSVs.** The fetchers are designed to append-and-skip-already-fetched-dates. If you find a corrupted row, prefer a targeted patch over a regen-from-scratch, and explain the change in the commit message.
+- **Test before claiming success.** "I ran the script" ≠ "the script worked." Run the verification command and quote the output.
+
+## Known footguns
+
+- **Signature changes.** AI tools change their commit attribution over time. `METHODOLOGY.md` documents three known events: Aider v0.85.0 (May 2025), Copilot SWE Agent rename (March 2026), Warp→Oz (March 2026). When a tool's daily count cliffs to near-zero, suspect a signature change first.
+- **GH Archive data quality.** Three known issues in the BigQuery source data: May 24, 2025 (permanent ~35% drop in push events), Sep 8, 2025 (single-day brownout test), Oct 8–14, 2025 (major outage, 99.5% drop). These dates are nulled in the shipped CSVs.
+- **GitHub Search API limits.** 30 req/min with token. The fetcher is rate-limit-aware but a full backfill takes hours.
+- **`tfcausalimpact` and `pycausalimpact` don't install on Python 3.14.** The repo uses `statsmodels.UnobservedComponents` for the BSTS-style robustness fit. Don't try to "fix" the missing dependency by switching back.
+- **`Rbeast` emits cosmetic SystemErrors on Python 3.14.** During `run_intervention_analysis.py` you may see repeated `SystemError: ...dictobject.c:4172: bad argument to internal function` messages from the `Rbeast` C extension. These are harmless — the script completes successfully and produces all artifacts. They're a known Rbeast/Python-3.14 incompatibility, not a sign of a real failure.
+- **`synthdid` on PyPI doesn't build on Python 3.14.** The repo vendors the source in `vendor/synthdid/` with a one-line pandas 3.x shim. Don't `pip install synthdid`.
+
+## When in doubt
+
+Re-read `METHODOLOGY.md`. It is the canonical reference for what each signal means, why it's structured the way it is, and what its known limitations are.
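
The guardrail in `AGENTS.md` against unbounded BigQuery scans can be enforced in code before any query is issued. The sketch below is a minimal illustration, not the repo's implementation: `bounded_window` is a hypothetical helper, and the 14-day cap is an assumption that borrows the upper end of the 7–14 day default windows mentioned in the file.

```python
from datetime import date, timedelta

MAX_WINDOW_DAYS = 14  # assumed cap, not a value taken from the repo


def bounded_window(start: date, end: date, max_days: int = MAX_WINDOW_DAYS):
    """Return the inclusive list of dates from start to end.

    Raises ValueError for inverted ranges and for ranges longer than max_days,
    so an accidental year-long scan fails before any query is sent.
    """
    if end < start:
        raise ValueError(f"end {end} is before start {start}")
    n_days = (end - start).days + 1
    if n_days > max_days:
        raise ValueError(
            f"window of {n_days} days exceeds the {max_days}-day cap; "
            "confirm the cost with the user before widening it"
        )
    return [start + timedelta(days=i) for i in range(n_days)]
```

A deliberate backfill would pass a larger `max_days` explicitly, which makes the cost decision visible at the call site.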
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1 @@
+@AGENTS.md