Multi-signal estimate of AI coding agent adoption from publicly observable GitHub data. Tracks ~30 agents (Cursor, Claude Code, OpenAI Codex, GitHub Copilot, Devin, Aider, Jules, OpenHands, and more) across ~450 days of commit attribution, branch-prefix activity, push volume, and a worked intervention analysis around the October 20, 2025 Claude Code web launch.
Originally built at CNaught for AI-coding-adoption research.
Data (~2 MB, all CSV):
- daily_ai_commits.csv: daily commit counts per tool, ~450 days × 32 tools.
- branch_activity_daily.csv, branch_creates_daily.csv, agent_branch_creates_daily.csv: branch-prefix signals from GH Archive via BigQuery.
- push_events_daily.csv: total daily push volume (denominator for AI share %).
- daily_carbon_estimates.csv: derived carbon/energy estimates (Jegham et al. 2025 framework).
- bot_donors_daily.csv: never-treated donor pool for the intervention analysis (Dependabot, Renovate, pre-commit-ci, etc.).
Pipeline scripts:
- github_ai_daily.py: daily commit-attribution fetcher (GitHub Search API).
- fetch_branch_activity.py, fetch_branch_creates.py, fetch_daily_totals.py: BigQuery fetchers.
- estimate_carbon.py: carbon/energy estimation.
- dashboard.py: Plotly HTML dashboard generator.
- run_pipeline.py: orchestrator (daily / weekly modes, Slack alerting, git commit-back).
- anomaly_analysis.py: signature-drift scanner.
Intervention analysis:
- run_intervention_analysis.py and the intervention_*.py modules: event-window CAR (primary), BEAST changepoint detection, first-difference ITS, BSTS, synthetic DiD. The Oct 20 Claude Code web launch is the worked example.
- vendor/synthdid/: vendored synthdid.py (PyPI build is broken on Python 3.14).
- events/model_releases.csv: dated event calendar.
Documentation:
- METHODOLOGY.md: full methodology, signal definitions, caveats.
- AGENTS.md: operating manual for AI coding agents reproducing the analysis.
- references/: 9 PDFs of cited academic papers.
The CSV files are self-describing — open daily_ai_commits.csv in any spreadsheet, or:
import pandas as pd
df = pd.read_csv("daily_ai_commits.csv")
print(df.groupby("tool")["commits"].sum().sort_values(ascending=False).head(10))

pip install -r requirements.txt
python dashboard.py
open dashboard.html # macOS; on Linux: xdg-open dashboard.html

This works against the shipped CSVs; no GitHub or GCP credentials are needed.
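Because push_events_daily.csv is described as the denominator for AI share %, the ratio can be sketched directly from the CSVs. The snippet below is a minimal illustration using only the standard library and tiny inline stand-in data. The column names `date` and `pushes` are assumptions (only `tool` and `commits` appear in the example above), and the tool labels and numbers are made up, so check the real headers before adapting it.

```python
import csv
import io
from collections import defaultdict

# Inline stand-ins for the shipped files. The "date" and "pushes" column
# names are assumptions; the values are invented for illustration.
COMMITS_CSV = """date,tool,commits
2025-10-19,tool-a,100
2025-10-19,tool-b,50
2025-10-20,tool-a,300
2025-10-20,tool-b,60
"""
PUSHES_CSV = """date,pushes
2025-10-19,10000
2025-10-20,12000
"""

def ai_share_by_day(commits_file, pushes_file):
    """Return {date: attributed AI commits as a percentage of total pushes}."""
    ai_total = defaultdict(int)
    for row in csv.DictReader(commits_file):
        ai_total[row["date"]] += int(row["commits"])
    shares = {}
    for row in csv.DictReader(pushes_file):
        pushes = int(row["pushes"])
        # Skip days with no push data (for example, nulled outage days).
        if pushes > 0 and row["date"] in ai_total:
            shares[row["date"]] = 100.0 * ai_total[row["date"]] / pushes
    return shares

shares = ai_share_by_day(io.StringIO(COMMITS_CSV), io.StringIO(PUSHES_CSV))
for day, pct in sorted(shares.items()):
    print(f"{day}: {pct:.2f}%")
```

To run it against the real files, pass `open("daily_ai_commits.csv")` and `open("push_events_daily.csv")` in place of the `StringIO` objects, after adjusting the column names.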
pip install -r requirements.txt
python run_intervention_analysis.py

Writes results and charts to outputs/intervention/. The headline finding (Oct 20 Claude Code web launch) is in the CAR section: cumulative abnormal commits ≈ +60K over the 8-day window, z ≈ 5.6.
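For readers unfamiliar with event-window CAR, the idea can be sketched in a few lines. This is not the repo's implementation: it assumes a constant-mean baseline and independent daily noise, both chosen here for simplicity, and the function name, parameters, and data are hypothetical.

```python
import math
import statistics

def event_window_car(series, event_index, window, baseline_len):
    """Cumulative abnormal count over `window` days starting at `event_index`.

    The expected daily value is the mean of the `baseline_len` days before
    the event (a constant-mean model; an assumption of this sketch).
    """
    baseline = series[event_index - baseline_len:event_index]
    if len(baseline) < 2:
        raise ValueError("need at least two baseline days")
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)  # sample standard deviation
    event = series[event_index:event_index + window]
    car = sum(x - mu for x in event)
    # If daily abnormal values are independent, their sum over n days
    # has standard deviation sigma * sqrt(n).
    z = car / (sigma * math.sqrt(len(event))) if sigma > 0 else float("inf")
    return car, z

# Synthetic series: 10 quiet days, then a 4-day jump. Not real results.
daily = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100, 130, 135, 128, 140]
car, z = event_window_car(daily, event_index=10, window=4, baseline_len=10)
print(f"CAR = {car:.1f}, z = {z:.2f}")
```

Real commit series have trend and weekly seasonality, so a constant-mean baseline would likely overstate z on actual data; METHODOLOGY.md describes the expected-value model the analysis really uses.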
Required if you want to update the CSVs beyond their shipped end-date.
On macOS, you may need python3 instead of python, and likewise pip3 instead of pip.
python --version # 3.12 recommended; 3.13/3.14 work but see AGENTS.md footguns for caveats
pip install -r requirements.txt

Copy the template and fill it in:
cp .env.example .env

You need a GITHUB_TOKEN and a GCP_PROJECT. Slack is optional.
- Go to https://github.com/settings/personal-access-tokens/new
- Set a name, expiration, and "Public repositories" access.
- No additional permissions needed — public-search reads work without them.
- Copy the token (starts with github_pat_...) into .env as GITHUB_TOKEN=....
Rate limit: 30 req/min with a token (vs 10 req/min unauthenticated). A full historical refresh takes a few hours.
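At 30 requests per minute, a client has to leave about 2 seconds between Search API calls. The sketch below shows one simple way to pace requests; the class name and its parameters are hypothetical and this is not how github_ai_daily.py is known to do it. The clock and sleep functions are injectable so the demonstration runs instantly.

```python
import time

class MinIntervalThrottle:
    """Space calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request made yet

    def wait(self):
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return delay

# Demonstration with a fake clock so nothing actually sleeps.
fake_now = [0.0]

def fake_sleep(seconds):
    fake_now[0] += seconds

throttle = MinIntervalThrottle(30, clock=lambda: fake_now[0], sleep=fake_sleep)
delays = [throttle.wait() for _ in range(3)]
print(delays)  # first call is immediate, later calls wait one interval
```

In real use, construct `MinIntervalThrottle(30)` with the defaults and call `wait()` before each request.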
4. GCP project + BigQuery (for fetch_branch_activity.py, fetch_branch_creates.py, fetch_daily_totals.py)
- Create a GCP project at https://console.cloud.google.com/projectcreate (or use an existing one).
- Enable BigQuery: https://console.cloud.google.com/apis/library/bigquery.googleapis.com (select your project, click Enable).
- Attach a billing account: https://console.cloud.google.com/billing/linkedaccount (BigQuery has a free tier — 1 TB/month — but a billing account must be attached even to use the free tier).
- Put the project ID in .env as GCP_PROJECT=your-project-id.
- Authenticate locally:
gcloud auth application-default login
(Or set GOOGLE_APPLICATION_CREDENTIALS=/abs/path/to/sa-key.json if you'd rather use a service account.)
Cost expectation: the GH Archive dataset (githubarchive.day.*) is public; you pay for the scan, not the storage. A full backfill of all four BigQuery signals is ~$5–15. Incremental weekly runs are pennies. Always bound your date ranges (--start-date / --end-date).
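To illustrate why bounded date ranges matter: the daily GH Archive tables are named by date, so a wildcard query scans only the tables that a `_TABLE_SUFFIX` filter selects. The helper below is a hypothetical sketch that only builds the SQL string. It is not taken from the fetchers, and the selected columns are an assumption about what a push-volume query might look like.

```python
from datetime import date

def bounded_push_count_sql(start, end):
    """Build a query counting PushEvents per day between two dates, inclusive."""
    if end < start:
        raise ValueError("end date precedes start date")
    # Daily tables are named YYYYMMDD. With the `20*` wildcard, _TABLE_SUFFIX
    # is the remainder of the name (YYMMDD), so this works for 2000-2099.
    lo, hi = start.strftime("%y%m%d"), end.strftime("%y%m%d")
    return (
        "SELECT _TABLE_SUFFIX AS day, COUNT(*) AS pushes\n"
        "FROM `githubarchive.day.20*`\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{lo}' AND '{hi}'\n"
        "  AND type = 'PushEvent'\n"
        "GROUP BY day ORDER BY day"
    )

sql = bounded_push_count_sql(date(2025, 10, 1), date(2025, 10, 14))
print(sql)
```

Before running an unfamiliar query, a BigQuery dry run reports the bytes it would scan without incurring query cost, which is a cheap way to confirm the date bound is doing its job.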
If you want anomaly/failure alerts:
- Create an incoming webhook: https://api.slack.com/messaging/webhooks
- Put the URL in .env as SLACK_WEBHOOK_URL=https://hooks.slack.com/services/....
Leave blank to disable. Clean pipeline runs are quiet either way.
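The "blank means disabled" behavior can be sketched as follows. This is an illustrative stand-in, not the code in run_pipeline.py; the function names are hypothetical. It relies only on the fact that Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json
import os
import urllib.request

def build_slack_request(webhook_url, message):
    """Prepare (but do not send) a POST carrying a plain-text Slack message."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def notify(message, environ=os.environ):
    """Send an alert if SLACK_WEBHOOK_URL is set; return whether one was sent."""
    url = environ.get("SLACK_WEBHOOK_URL", "").strip()
    if not url:
        return False  # blank or missing: alerting is disabled
    with urllib.request.urlopen(build_slack_request(url, message), timeout=10):
        pass
    return True

# With no webhook configured, nothing is sent.
sent = notify("anomaly detected", environ={})
print(sent)
```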
python run_pipeline.py --mode daily --dry-run
Fetches the last 7 days of commit data, regenerates the carbon estimates and dashboard, scans for anomalies, and (without --dry-run) commits the updated CSVs back to git.

python run_pipeline.py --mode weekly --dry-run
Adds the three BigQuery fetchers (branch activity, branch creates, daily totals) on a 14-day window before the daily steps. Requires GCP auth.
Each script accepts --help:
python github_ai_daily.py --help
python fetch_branch_creates.py --help
python estimate_carbon.py --help
python dashboard.py

python -m pytest tests/ -v
CI runs the same on every push and PR.
See METHODOLOGY.md for the full treatment. Three things are worth flagging up front:
- Commit attribution measures autonomous agent coding, not "tools developers use." Copilot autocomplete and Cursor editor mode use the developer's git identity and leave zero trace. Only tools that make commits with their own author/committer identity are detectable here. That's why the multi-signal approach (branch prefixes, push volume) exists: to catch tools that are invisible in commit data.
- Tool signatures occasionally change. Aider switched to Co-authored-by trailers in v0.85.0 (May 2025), making ~95% of its commits invisible to this method. Copilot SWE Agent was renamed in March 2026. Warp's CLI was renamed to Oz around the same time. The anomaly scanner flags suspected drift, but new changes will happen; if you notice a cliff in the data and there's no announcement on file, please open an issue.
- GH Archive (BigQuery source data) has three known issues in 2025: a permanent ~35% push-volume drop on May 24, a one-day brownout on Sep 8, and a 99.5% outage on Oct 8–14. Those dates are nulled in the shipped CSVs and flagged on the dashboard.
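If you rebuild any GH Archive series yourself, the same outage dates need masking before analysis. The sketch below shows the idea on plain (day, value) tuples; the helper names and data are hypothetical, and the shipped CSVs already have these dates nulled. The May 24 drop is permanent rather than an outage, so it is deliberately not masked here.

```python
from datetime import date, timedelta

# Dates from the text above: one-day brownout on Sep 8, outage on Oct 8-14 (2025).
OUTAGE_RANGES = [
    (date(2025, 9, 8), date(2025, 9, 8)),
    (date(2025, 10, 8), date(2025, 10, 14)),
]

def is_outage(day):
    """True if `day` falls inside any known outage range (inclusive)."""
    return any(start <= day <= end for start, end in OUTAGE_RANGES)

def null_outages(rows):
    """Return a copy of (day, value) rows with outage-day values set to None."""
    return [(day, None if is_outage(day) else value) for day, value in rows]

# Synthetic rows covering Oct 6 through Oct 15, 2025.
rows = [(date(2025, 10, 6) + timedelta(days=i), 1000 + i) for i in range(10)]
cleaned = null_outages(rows)
nulled = sum(value is None for _, value in cleaned)
print(nulled)
```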
If you point Claude Code, Codex, Cursor, or another agent at this repo, see AGENTS.md — it's the operating manual: file map, setup checklist, reproduction walkthrough, guardrails, and a step-the-user-through-it script for when a human is in the loop.
See CONTRIBUTING.md. The most valuable contributions are flagging tool-signature changes and proposing new tool detectors — see that doc for the format.
MIT. See LICENSE.
Originally built at CNaught for AI-coding-adoption research. If you build on this work, a citation or link is appreciated but not required.