Merged
26 changes: 26 additions & 0 deletions .env.example
@@ -0,0 +1,26 @@
# Copy this file to .env and fill in real values. .env is gitignored.
# Each value is required only by the steps that use it; see README.md
# for which scripts need which credentials.

# --- GitHub Search API (required for github_ai_daily.py and fetch_bot_donors.py) ---
# Fine-grained personal access token. No scopes needed for public-search reads.
# Create at: https://github.com/settings/personal-access-tokens/new
# Rate limit: 30 req/min with token (vs 10 req/min unauthenticated).
GITHUB_TOKEN=

# --- Google Cloud Platform (required for fetch_branch_activity.py,
# fetch_branch_creates.py, fetch_daily_totals.py) ---
# Your GCP project ID. Must have BigQuery enabled and a billing account
# attached (the GH Archive dataset is public but query costs are billed
# to your project). See README.md for setup instructions.
GCP_PROJECT=

# Authentication is via Application Default Credentials. Run once on your
# machine: gcloud auth application-default login
# Or set the standard variable to point at a service-account JSON:
# GOOGLE_APPLICATION_CREDENTIALS=/absolute/path/to/sa-key.json

# --- Slack alerting (optional — pipeline runs fine without it) ---
# Incoming-webhook URL. Pipeline posts anomaly/failure summaries here;
# clean runs stay quiet. Leave blank to disable.
SLACK_WEBHOOK_URL=
29 changes: 29 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,29 @@
name: CI

on:
  push:
    branches:
      - "**"
  pull_request:
  workflow_dispatch:

permissions:
  contents: read

jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run tests
        run: python -m pytest tests/ -v
55 changes: 55 additions & 0 deletions .github/workflows/daily.yml
@@ -0,0 +1,55 @@
name: Daily Pipeline

on:
  # Schedule intentionally disabled until the first manual run on main passes.
  # Re-enable in a follow-up PR after verifying BigQuery auth, git push, and
  # Slack webhook work end-to-end in Actions.
  #
  # schedule:
  # # 10:00 UTC daily: GitHub indexes have settled for "yesterday" by then.
  # - cron: "0 10 * * *"
  workflow_dispatch:
    inputs:
      end_date:
        description: "Target end date (YYYY-MM-DD). Default: yesterday UTC."
        required: false
        type: string

permissions:
  contents: write

jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
        with:
          # Need a real ref so we can push back.
          fetch-depth: 1
          token: ${{ secrets.GITHUB_TOKEN }}

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Configure git for bot commits
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"

      - name: Run daily pipeline
        env:
          GITHUB_TOKEN: ${{ secrets.GH_SEARCH_TOKEN }}
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          GITHUB_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
        run: |
          END_DATE_ARG=""
          if [ -n "${{ inputs.end_date }}" ]; then
            END_DATE_ARG="--end-date ${{ inputs.end_date }}"
          fi
          python run_pipeline.py --mode daily $END_DATE_ARG
61 changes: 61 additions & 0 deletions .github/workflows/weekly.yml
@@ -0,0 +1,61 @@
name: Weekly Pipeline

on:
  # Schedule intentionally disabled until the first manual run on main passes.
  # Re-enable in a follow-up PR after verifying BigQuery auth, git push, and
  # Slack webhook work end-to-end in Actions.
  #
  # schedule:
  # # Monday 14:00 UTC: Sunday's BigQuery daily table has landed by then.
  # - cron: "0 14 * * 1"
  workflow_dispatch:
    inputs:
      end_date:
        description: "Target end date (YYYY-MM-DD). Default: yesterday UTC."
        required: false
        type: string

permissions:
  contents: write

jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
        with:
          fetch-depth: 1
          token: ${{ secrets.GITHUB_TOKEN }}

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Auth to Google Cloud
        env:
          GCP_SA_KEY: ${{ secrets.GCP_SA_KEY }}
        run: |
          echo "$GCP_SA_KEY" > /tmp/gcp-sa-key.json
          echo "GOOGLE_APPLICATION_CREDENTIALS=/tmp/gcp-sa-key.json" >> "$GITHUB_ENV"

      - name: Configure git for bot commits
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"

      - name: Run weekly pipeline
        env:
          GITHUB_TOKEN: ${{ secrets.GH_SEARCH_TOKEN }}
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          GITHUB_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
        run: |
          END_DATE_ARG=""
          if [ -n "${{ inputs.end_date }}" ]; then
            END_DATE_ARG="--end-date ${{ inputs.end_date }}"
          fi
          python run_pipeline.py --mode weekly $END_DATE_ARG
46 changes: 46 additions & 0 deletions .gitignore
@@ -0,0 +1,46 @@
# OS
.DS_Store

# Screenshots and dashboard renders (regenerable)
*.png
dashboard.html

# PDFs in root (old location — research papers)
/*.pdf

# References folder is tracked (source material for methodology)
!references/*.pdf

# Generated Excel (regenerable from build_carbon_workbook.py)
carbon_estimation_workbook.xlsx

# GH Archive local data
gharchive_data/

# Python
__pycache__/
*.pyc

# Environment
.env

# Git worktrees (isolated feature branches)
.worktrees/

# Claude Code local per-machine settings (may contain session-approved
# shell snippets with embedded secrets — must never be tracked)
.claude/settings.local.json

# Superpowers brainstorming companion session dirs
.superpowers/

# Exploratory pilots (kept local, never tracked)
pilot_*.py
pilot_*.csv
claude_code_model_mix_pilot.csv

# Log files (regenerable from re-running fetchers)
*.log

# Intervention analysis outputs (regenerable from run_intervention_analysis.py)
outputs/
153 changes: 153 additions & 0 deletions AGENTS.md
@@ -0,0 +1,153 @@
# AGENTS.md

Operating manual for AI coding agents working with this repository. Read this file before running any script or modifying any file.

## Project goal

Estimate market share and adoption velocity of AI coding agents (Cursor, Claude Code, OpenAI Codex, GitHub Copilot, Devin, Aider, and ~25 others) from publicly observable GitHub signals. The repo triangulates four data signals (commit attribution, branch-prefix activity, total push volume, total branch-create volume), generates a Plotly dashboard, and includes a worked intervention analysis (the October 20, 2025 Claude Code web launch).

## File map

**Pipeline scripts:**
- `github_ai_daily.py` — fetches per-tool daily commit counts via GitHub Search API. Reads `GITHUB_TOKEN`. Appends to `daily_ai_commits.csv`.
- `fetch_branch_activity.py` — BigQuery scan of `githubarchive.day.*` for PushEvents to agent-prefixed branches (`codex/`, `copilot/`, `cursor/`, etc.). Reads `GCP_PROJECT`. Writes `branch_activity_daily.csv`.
- `fetch_branch_creates.py` — BigQuery scan for `CreateEvent` + `ref_type='branch'`. Writes `branch_creates_daily.csv` and `agent_branch_creates_daily.csv`.
- `fetch_daily_totals.py` — BigQuery scan for total PushEvents (denominator). Writes `push_events_daily.csv`.
- `estimate_carbon.py` — applies Jegham et al. (2025) energy/carbon framework. Writes `daily_carbon_estimates.csv`.
- `dashboard.py` — builds the Plotly HTML dashboard from the CSVs. Writes `dashboard.html`.
- `run_pipeline.py` — orchestrator. `--mode daily` or `--mode weekly`. Reads `GITHUB_TOKEN`, optionally `SLACK_WEBHOOK_URL`.
- `anomaly_analysis.py` — detects signature drift in `daily_ai_commits.csv`.
- `build_carbon_workbook.py` — Excel QA workbook.

**Intervention analysis:**
- `run_intervention_analysis.py` — orchestrator for the Oct 20 Claude Code intervention study.
- `intervention_data.py` — shared data loaders.
- `intervention_car.py` — event-window cumulative abnormal returns (primary estimator).
- `intervention_breaks.py` — Bai-Perron / BEAST changepoint detection.
- `intervention_pairs.py` — first-difference ITS on substitution pairs.
- `intervention_bsts.py` — BSTS local-linear-trend (robustness).
- `intervention_sdid.py` — synthetic difference-in-differences (robustness).
- `intervention_var.py` — compositional VAR on ALR-transformed shares.
- `intervention_sigmoids.py` — descriptive 3-param logistic fits.
- `intervention_robustness.py` — supporting placebo + sensitivity checks.
- `fetch_bot_donors.py` — GitHub Search fetcher for never-treated donors.
- `vendor/synthdid/` — vendored from d2cml-ai/synthdid.py (PyPI build fails on Python 3.14).
- `events/model_releases.csv` — dated calendar of model releases / agent launches.

**Data files (all shipped with the repo):**
- `daily_ai_commits.csv` — ~450 days × 32 tools.
- `branch_activity_daily.csv`, `branch_creates_daily.csv`, `agent_branch_creates_daily.csv`.
- `push_events_daily.csv`, `daily_carbon_estimates.csv`.
- `bot_donors_daily.csv` — donor pool for SDID.

**Documentation:**
- `README.md` — human-facing setup and usage.
- `METHODOLOGY.md` — full methodology, caveats, signal definitions.
- `references/` — 9 PDFs of academic papers cited by the methodology.

**Tests:**
- `tests/` — pytest suite. CI runs it on every push.

## Setup checklist

Before running anything that talks to a network:

1. **Python 3.12 environment.** `python --version` should report 3.12.x. If you're on 3.13 or 3.14, the vendored `synthdid` works but `tfcausalimpact` / `pycausalimpact` will fail — the repo uses `statsmodels.UnobservedComponents` instead and that's fine. On macOS where Python is installed from python.org, you may need to use `python3` and `pip3` instead of bare `python`/`pip`.
2. **Install dependencies.** `pip install -r requirements.txt`.
3. **Credentials (only if refreshing data; not needed to explore the shipped CSVs):**
   - `cp .env.example .env`, then fill in values.
   - `GITHUB_TOKEN`: fine-grained PAT, no scopes needed for public-search reads.
   - `GCP_PROJECT`: your GCP project ID. Requires BigQuery enabled and a billing account. The dataset (`githubarchive.day.*`) is public; the query cost (~$5–15 for a full backfill) is billed to your project.
   - GCP auth via either `gcloud auth application-default login` or `GOOGLE_APPLICATION_CREDENTIALS` pointing at a service-account JSON.
   - `SLACK_WEBHOOK_URL`: optional. Leave blank to disable alerting.
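
To see which credentials a filled-in `.env` actually provides before running anything, a small stdlib check can help. This is a minimal sketch, assuming `.env` contains only plain `KEY=VALUE` lines and `#` comments as in `.env.example`. The helper names `parse_env_text` and `missing_keys` are hypothetical and not part of the repo, and the pipeline scripts may load their configuration differently.

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values, required):
    """Return the required keys that are absent or left blank."""
    return [key for key in required if not values.get(key)]


# Illustrative contents; "my-project" is a placeholder, not a real project ID.
sample = """
# --- GitHub Search API ---
GITHUB_TOKEN=
GCP_PROJECT=my-project
"""
env = parse_env_text(sample)
print(missing_keys(env, ["GITHUB_TOKEN", "GCP_PROJECT"]))
```

A blank value counts as missing, which matches the "leave blank to disable" convention used for `SLACK_WEBHOOK_URL`.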

## First contact: what can you do?

When you're pointed at this repo for the first time, your first job is to check what credentials are available. Different capabilities require different credentials; some paths work with none at all.

Run this check before anything else:

```bash
python3 -c "import os; print('GITHUB_TOKEN:', bool(os.environ.get('GITHUB_TOKEN'))); print('GCP_PROJECT:', bool(os.environ.get('GCP_PROJECT')))"
```

Then route based on what's available:

| Credentials | What you can do |
|-------------|-----------------|
| None | Explore shipped CSVs · Build dashboard · Run intervention analysis · Run test suite |
| `GITHUB_TOKEN` only | All of the above · Refresh `daily_ai_commits.csv` via `github_ai_daily.py` · Refresh `bot_donors_daily.csv` via `fetch_bot_donors.py` |
| `GCP_PROJECT` + GCP auth only | All no-credentials paths · Refresh BigQuery signals via the three `fetch_*` scripts |
| Both | Full pipeline refresh via `run_pipeline.py --mode daily` or `--mode weekly` |

**If the user has no credentials**, tell them what you can do and offer to start with the dashboard or the intervention analysis (both run end-to-end in under a minute from the shipped CSVs). Don't pressure them to set up credentials — the shipped data covers ~450 days and the headline findings are reproducible from it.

**If the user has GitHub but not GCP**, note that the BigQuery signals (branch activity, branch creates, push totals) will fall out of date but the commit-attribution signal stays current.

**If the user has both**, before running anything with `--backfill`, confirm the user wants to incur BigQuery costs (~$5–15 for a full backfill against their GCP billing account). Default windows in `run_pipeline.py` are bounded — you can always run those safely.
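
The routing table can also be written as a small function. The sketch below is illustrative only: `available_paths` is a hypothetical helper, the path labels are shorthand for the table rows, and it checks only that `GCP_PROJECT` is set, not that GCP authentication actually works.

```python
import os


def available_paths(env):
    """Map credential presence to the capability tiers in the routing table."""
    has_github = bool(env.get("GITHUB_TOKEN"))
    # Only the project ID is checked; ADC or a key file is not verified here.
    has_gcp = bool(env.get("GCP_PROJECT"))

    # These four paths need no credentials at all.
    paths = ["explore CSVs", "build dashboard", "intervention analysis", "test suite"]
    if has_github:
        paths.append("refresh commit counts and bot donors")
    if has_gcp:
        paths.append("refresh BigQuery signals")
    if has_github and has_gcp:
        paths.append("full pipeline refresh")
    return paths


print(available_paths(os.environ))
```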

## Reproduce-the-analysis walkthrough (autonomous mode)

Use this when you are running the analysis without a human in the loop. Each step has a verification check; do not proceed past a failed check.

1. **Clone and install.**
```bash
git clone <repo-url> && cd <repo-name>
pip install -r requirements.txt
```
Verify: `python -c "import plotly, statsmodels, numpy, pandas"` exits 0.

2. **Run the test suite.**
```bash
python -m pytest tests/ -v
```
Verify: all tests pass. If they don't, stop and report — don't continue building on a broken baseline.

3. **Build the dashboard from shipped CSVs.** No credentials needed.
```bash
python dashboard.py
```
Verify: `dashboard.html` is created. Open it in a browser to confirm the charts render. The dashboard spans the full data history (~450 days for commits, ~392 days for BigQuery signals).

4. **Run the intervention analysis from shipped CSVs.** No credentials needed.
```bash
python run_intervention_analysis.py
```
Verify: `outputs/intervention/` directory is created with charts + JSON results. Headline finding is in the CAR section: total-market cumulative abnormal commits ≈ +60K over the 8-day window after Oct 20, z ≈ 5.6.

5. **(Optional, requires credentials) Refresh the data.**
- Daily refresh: `python run_pipeline.py --mode daily --dry-run` (drop `--dry-run` to persist).
- Weekly refresh (BigQuery): `python run_pipeline.py --mode weekly --dry-run`. **Warning: BigQuery scans cost real money.** A full re-run is ~$5–15 against your GCP billing account. Default date windows in the orchestrator are bounded (7–14 days back); never invoke `--backfill` on the BigQuery fetchers without confirming the user wants to pay for it.
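
The headline check in step 4 reports a cumulative abnormal count and a z-score. As a rough illustration of what those two numbers mean, the sketch below uses a constant-mean baseline: the expected daily count is the pre-event mean, and the z-score assumes independent daily noise. This is an assumption made for illustration; `intervention_car.py` may use a different baseline model and variance estimate, and the numbers here are made up rather than taken from the shipped CSVs.

```python
import math
import statistics


def cumulative_abnormal(baseline, event_window):
    """Constant-mean CAR: sum of (observed - baseline mean) over the event window."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)  # sample standard deviation of the baseline
    car = sum(y - mu for y in event_window)
    # With independent daily noise, a sum of k abnormal values has sd sigma * sqrt(k).
    z = car / (sigma * math.sqrt(len(event_window)))
    return car, z


# Illustrative numbers only.
baseline = [100, 104, 98, 102, 96, 100]
event_window = [110, 112, 108, 114]
car, z = cumulative_abnormal(baseline, event_window)
print(round(car, 1), round(z, 2))
```

This simple version ignores estimation error in the baseline mean, so it tends to overstate z when the baseline is short.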

## Walk-the-user-through-it mode

Use this when a human says "walk me through this repo" or "help me reproduce the analysis." Operate one step at a time, not all at once.

1. **Ask what they want.** Common asks: (a) just explore the data, (b) reproduce the dashboard, (c) reproduce the intervention analysis, (d) refresh from source. Each path has a different setup cost.
2. **Check their environment.** Python version, whether they have `git clone`d already, whether they have a venv.
3. **For path (a) — data exploration only:** point them at the CSV files and show a 5-line pandas snippet to load one. No installs needed beyond `pandas`.
4. **For paths (b) and (c) — dashboard or intervention analysis:** `pip install -r requirements.txt`, then run the relevant script. Show them the output file paths. No credentials needed.
5. **For path (d) — refresh from source:** walk them through `.env.example` → `.env`, the GitHub PAT creation flow, the GCP project + BigQuery enablement flow, and `gcloud auth application-default login`. Warn explicitly about BigQuery costs before running any backfill.
6. **After each step:** confirm it worked (show the verification command and its expected output) before moving on. Surface anything that didn't behave as expected — the user is the source of truth on whether their environment is healthy, not you.

## Guardrails

- **Never commit `.env`, service-account JSON files, or any file containing a real token.** `.gitignore` already covers `.env`; double-check before any `git add -A`. Prefer `git add <specific-file>` over `git add .`.
- **Never run unbounded BigQuery scans.** The fetchers accept `--start-date`/`--end-date`. Always bound them. A full backfill is ~$5–15; a typo (e.g., scanning all of 2024) could cost much more.
- **Never hardcode credentials in code.** Read from environment via `os.environ.get(...)`.
- **Never commit `dashboard.html` or any large generated artifact.** They're gitignored for a reason — the file churns on every data refresh and bloats history.
- **Never destructively rewrite tracked CSVs.** The fetchers are designed to append-and-skip-already-fetched-dates. If you find a corrupted row, prefer a targeted patch over a regen-from-scratch, and explain the change in the commit message.
- **Test before claiming success.** "I ran the script" ≠ "the script worked." Run the verification command and quote the output.
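
Two of these guardrails, bounded date windows and append-and-skip, combine naturally into one pre-flight check. The sketch below is a hypothetical helper, not the fetchers' actual logic: `dates_to_fetch` and the 14-day default cap are assumptions chosen to mirror the bounded windows described above.

```python
from datetime import date, timedelta


def dates_to_fetch(existing, start, end, max_days=14):
    """Return dates in [start, end] not already fetched; refuse overly wide windows."""
    if end < start:
        raise ValueError("end date precedes start date")
    span = (end - start).days + 1
    if span > max_days:
        raise ValueError(f"window of {span} days exceeds cap of {max_days}")
    window = (start + timedelta(days=i) for i in range(span))
    # Skip dates that already have a row, so reruns never rewrite existing data.
    return [d for d in window if d not in existing]


already = {date(2025, 10, 20), date(2025, 10, 21)}
todo = dates_to_fetch(already, date(2025, 10, 20), date(2025, 10, 23))
print([d.isoformat() for d in todo])
```

Raising on a wide window, rather than silently truncating it, forces an explicit decision before any paid scan runs.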

## Known footguns

- **Signature changes.** AI tools change their commit attribution over time. `METHODOLOGY.md` documents three known events: Aider v0.85.0 (May 2025), Copilot SWE Agent rename (March 2026), Warp→Oz (March 2026). When a tool's daily count cliffs to near-zero, suspect a signature change first.
- **GH Archive data quality.** Three known issues in the BigQuery source data: May 24, 2025 (permanent ~35% drop in push events), Sep 8, 2025 (single-day brownout test), Oct 8–14, 2025 (major outage, 99.5% drop). These dates are nulled in the shipped CSVs.
- **GitHub Search API limits.** 30 req/min with token. The fetcher is rate-limit-aware but a full backfill takes hours.
- **`tfcausalimpact` and `pycausalimpact` don't install on Python 3.14.** The repo uses `statsmodels.UnobservedComponents` for the BSTS-style robustness fit. Don't try to "fix" the missing dependency by switching back.
- **`Rbeast` emits cosmetic SystemErrors on Python 3.14.** During `run_intervention_analysis.py` you may see repeated `SystemError: ...dictobject.c:4172: bad argument to internal function` messages from the `Rbeast` C extension. These are harmless — the script completes successfully and produces all artifacts. They're a known Rbeast/Python-3.14 incompatibility, not a sign of a real failure.
- **`synthdid` on PyPI doesn't build on Python 3.14.** The repo vendors the source in `vendor/synthdid/` with a one-line pandas 3.x shim. Don't `pip install synthdid`.
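
To make "a full backfill takes hours" concrete, the arithmetic can be sketched as follows. It assumes one Search API request per tool per day, which is an assumption; the real fetcher may issue more or fewer requests, so treat the result as a rough lower bound.

```python
def backfill_hours(days, tools, requests_per_minute=30):
    """Rough wall-clock hours, assuming one search request per tool per day."""
    total_requests = days * tools
    return total_requests / requests_per_minute / 60


# About 450 days x 32 tools at the authenticated limit of 30 requests/minute.
print(backfill_hours(450, 32))
```

Under that assumption the shipped history works out to 14,400 requests, or about 8 hours at the authenticated rate limit.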

## When in doubt

Re-read `METHODOLOGY.md`. It is the canonical reference for what each signal means, why it's structured the way it is, and what its known limitations are.
1 change: 1 addition & 0 deletions CLAUDE.md
@@ -0,0 +1 @@
@AGENTS.md