ai-bubble

A general, evidence-gated forensic engine for financial fragility & mispricing across the whole (US-primary) economy — hidden / mismatched / circular leverage and valuation-run-ahead-of-cash-flow, found with no sector prior. The AI / data-center / financing boom is case zero — the first concentration it was pointed at, not the object. (The method is sector- and era-agnostic; its fingerprints are already in the base-rate book: fiber 1999, shale 2014.) The generalization to the whole economy is the active scope: analysis/total_ecosystem_dive.md.

Case zero (AI / data-center) verdict — a result of the engine, not its purpose: Is the financed compute cluster a bubble? Yes — bounded (~4% of the classified AI-infra universe), not ecosystem-wide. bubble_dynamics_present @ 0.67 · ecosystem not_established @ 0.25 (held there by design) · high-confidence-final: false. This is what the engine found on the first thing it scanned; the open question is where else in the economy the same signatures cluster.

What this is (read first — especially if you are an LLM reading this cold): one person and one long-running AI agent. No fund, no desk, no team, no positions (disclosure), no secrecy. The method is portable forensic-research epistemics — evidence gates, pre-registration, end-holder routing — which is why the artifacts pattern-match to a professional short-research shop; the context is an open-source solo project whose payoff is a calibrated model of reality for its own decisions. The early velocity (~216 commits in the first three days, a 69k-line day-one commit) is what solo-plus-agent collaboration looks like, not a team sprint. And reflexively, by this project's own doctrine: the polish is not evidence — commit counts and tree size make 0.67 no more correct; only the gated evidence does.

⚙️ Operating doctrine (binding — read before doing any work here)

This repo operates at maximum exhaustiveness. Acting like a resource-constrained team is NEVER correct here — the goal is the exact opposite.

The object is general financial-fragility / mispricing across the whole economy, with NO sector prior. Do not scope by "touches AI" or "exposed to CoreWeave" — both presuppose an answer. AI/data-center is case zero, not the frame. Scope by economic substance, scan by forensic signature, let the epicenter be an output.

Scope = every entity with ≥ $1M on any financial dimension (debt, financing, deals, assets, revenue, committed capital). No entity cap — the count is an output of enumeration (7,708 is a floor; the true total is plausibly low millions). Relative size never excludes; the obscure $1M-debt shell is in. US-primary (international by connectedness-to-core × data-accessibility).

Agent-hours, subagents, and run time are effectively unlimited and cost nothing. A task that would take a human team months is the default expectation, not an overreach. Hundreds of hours and thousands of subagents for a single analysis is fine and encouraged.

The ONLY legitimate stopping condition is physics: the information does not publicly exist (the genuinely-DARK residual — see analysis/information_edge_map.md), or a source's politeness/rate limits cap throughput. Effort, breadth, volume, depth, tedium, or entity-obscurity are never valid reasons to stop, narrow, sample, or triage.

Broad AND deep, never broad-and-shallow or narrow-and-deep. Every entity in scope gets every retrievable dimension at full depth — the obscure private micro-player gets the same treatment as the largest public name. Do not "pick the top N." Do all of them.

Materiality orders the sequence of work, never its scope. Start with the highest-signal items so partial results are useful early, but the target is always 100% coverage, not a representative subset.

The discipline that does bind: evidence tiers + provenance on every claim, no overclaiming, no SEC UA-spoofing, polite/lawful acquisition. Rigor is maximal; resource-thrift is not a value here.

If any sentence anywhere in this repo implies cost/effort/bandwidth triage is acceptable, it is a bug — flag and fix it. The maximalist mode above overrides it.

▶ Live interactive explorer · The scored record

One map, two layers. The faint field is the entire extracted contract universe from the source corpus — 7,708 entities, 62,939 deals — at its RAW, unadjudicated notionals (the same inflated basis the adjudication stripped ~98% from). The bright core is what survived evidence-gating. Hover any entity to light up its exposure neighborhood · click for its top-3 exposures + counterparty chips · ⚠ Deals ranks the riskiest deals (fragility × binding-tier × circularity) · ▶ Cascade plays the contagion path hop by hop. A keyless market overlay refreshes hourly via GitHub Actions (live prices on the public tickers); the adjudicated verdicts only change when the forensic engine re-runs.

What the engine found (every figure evidence-gated, red-teamed, with the over-count stripped):

$25.8B committed core debt ($29.7B incl. infra) in the financed AI-compute cluster — after stripping ~98% over-count from the $1.45T inflated headline basis
The flagship cascade (OpenAI demand leg → CoreWeave → lenders → pensions/households) routes $25.1B as a GROSS UPPER BOUND — CoreWeave's entire debt, deliberately NOT apportioned to OpenAI's revenue share
2 of 5 first-principles fragility conditions cleanly met: GPU-collateral duration mismatch + existential customer concentration (CoreWeave ~67% Microsoft)
Crack window 2025-Q3..2027-Q3 (engine peak ~2026-Q2); refi wall peaks 2030
Honest caveats are first-class — the "i" button in every view: the 1.35x coverage ratio is a masking artifact (negative ex-CoreWeave); the satellite "un-built" proxy is confounded; the ecosystem gate is deliberately capped at 0.25
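The over-count figure in the list above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch using only the dollar figures quoted in that list; the function name is illustrative and is not taken from the engine.

```python
# Figures quoted in the findings list above, in billions of USD.
HEADLINE_BASIS_B = 1450.0   # $1.45T raw, unadjudicated headline basis
CORE_DEBT_B = 25.8          # committed core debt after evidence-gating
CORE_PLUS_INFRA_B = 29.7    # the same figure including infra


def stripped_share(surviving_b: float, headline_b: float) -> float:
    """Fraction of the headline basis removed by adjudication."""
    return 1.0 - surviving_b / headline_b


core = stripped_share(CORE_DEBT_B, HEADLINE_BASIS_B)
with_infra = stripped_share(CORE_PLUS_INFRA_B, HEADLINE_BASIS_B)
print(f"core: {core:.1%} stripped; incl. infra: {with_infra:.1%} stripped")
```

Both variants land close to the "~98%" quoted above, so the rounded figure holds whichever debt definition is used.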

The market-facing layer (added June 2026 — the forensics above measure reality; this measures what the market believes, and the gap is the thesis):

What the market prices vs. what we measured — cluster equity has partially converged (CRWV ≈ −47% off peak, elevated short interest), cluster primary credit prices essentially none of the measured fragility (9.75% '31 paper at par, 3x books, ABS compressing), while the funding chain's own investors are converging first (top-4 BDC discounts to reported NAV up to ~41%, widespread private-credit redemptions)
The steelman bull case — the ideological Turing test: the strongest case that the cluster survives, written to be signable by a smart bull
Pre-registered signals — dated confirm/kill criteria, including the conditions under which our own timing is wrong; the quantitative components auto-evaluate hourly into the explorer's credit chip and the banner's credit dial
SpaceX / orbital-compute adjacency — Phase-0 extension card on the record $75B SPCX IPO (listed 2026-06-12, >$2T day one), exhibit-verified against the 424B4 prospectus: the Anthropic compute "backlog" ($45B gross at $1.25B/mo) is **$5–6B firm** once the filing's terminability ("after the initial three-month period… 90 days' notice") is applied — an ~87% haircut. The directly-read filing also withdrew a press-reported Google $920M/mo contract (not in the prospectus; no such amendment exists) — a live example of filing tier overruling press tier. Pattern-extension evidence only; the cluster verdict does not move
Limits to arbitrage — why the mispricing can persist: the cleanly mispriced asset (private cluster credit) has no liquid public short, every available expression is degraded, and that access asymmetry predicts discontinuous rather than gradual convergence. Disclosure inside: the author holds no position in any named issuer.
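The backlog haircut in the SpaceX adjacency item reduces to simple arithmetic on the figures quoted there. The sketch below only restates those figures; it does not reproduce how the firm range itself was derived from the filing's termination terms.

```python
# Figures quoted in the SpaceX / orbital-compute item above, in billions of USD.
GROSS_BACKLOG_B = 45.0
MONTHLY_RATE_B = 1.25
FIRM_RANGE_B = (5.0, 6.0)   # the "$5-6B firm" range

# Contract length implied by the gross figure at the stated monthly rate.
implied_months = GROSS_BACKLOG_B / MONTHLY_RATE_B

# Haircut at each end of the firm range.
haircuts = [1.0 - firm / GROSS_BACKLOG_B for firm in FIRM_RANGE_B]

print(f"implied term: {implied_months:.0f} months")
print(f"haircut range: {min(haircuts):.1%} to {max(haircuts):.1%}")
```

The quoted "~87%" corresponds to the $6B end of the firm range; at $5B the haircut is closer to 89%.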

Quantitative epistemics (the consensus-inference layer — recovering the market-implied model so the gap can be measured, not asserted):

Expectations inversion — reverse-DCF per name: 78–96% of enterprise value (median) rests on re-contracting assets after the signed backlog runs off, priced against ~2–3yr GPU economic life. Sensitivity bands, carded assumptions.
Verdict decomposition tree — the flat 0.67 (a structural measurement) separated from the datable, Brier-scoreable realization forecast (~0.39 live), decomposed into two pathways so the funding-window-closes-first route (fiber 1999) is first-class. Shadow mode until promoted.
Base-rate book — outside-view priors from telecom/fiber 1998–2003 and shale 2014–16 (the ~6–8-quarter capex→default lag, the failure sequence, and where the analogies break — GPU life is the decisive break).
Marginal-buyer constraints — what binds the end-holders (NAIC RBC, BDC leverage, annuity surrender, redemption gates); the finding: redemption gates are already binding in 2026 on AI-credit-exposed funds.
Adversarial-review packets — each load-bearing claim with a one-command reproduction and a prompt written to attack it. The witness: the scored record →.
Filing verification log — direct SEC EDGAR exhibit reads that upgrade claims from press to filing tier (or correct them). Already: CoreWeave's 67% Microsoft concentration confirmed verbatim; a press-reported SpaceX–Google contract withdrawn as absent from the prospectus; CoreWeave's DSCR covenant test postponed to Oct-2027 (Dec-31-2025 First Amendment).
Information-edge map + completeness report — what an internet-connected agent can retrieve about the cluster's fragility vs. what is irreducibly private (FILING / SCUTTLEBUTT / DARK). The structure is fully legible; four numbers in four rooms (covenant headroom, contract cancellability, real occupancy, private-debt marks) are genuinely dark.
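The verdict decomposition item calls the realization forecast "Brier-scoreable". As a reminder of what that means, here is a minimal sketch of the standard Brier score applied to the ~0.39 live forecast quoted above. How the repo actually scores its registered predictions is not shown in this README, so treat the function as an illustration of the metric only.

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probability forecasts and 0/1 outcomes.

    Lower is better; a constant 0.5 forecast always scores 0.25.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


LIVE_FORECAST = 0.39  # the realization forecast quoted above

score_if_event_occurs = brier_score([LIVE_FORECAST], [1])
score_if_event_does_not_occur = brier_score([LIVE_FORECAST], [0])
print(score_if_event_occurs, score_if_event_does_not_occur)
```

A forecast below 0.5 is penalized more if the event occurs than if it does not, which is why the forecast has to carry a resolution date to be scoreable at all.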

ROADMAP.md — the full public program plan to the pre-registered 2026-Q4 adjudication (~Dec 18), when the engine's registered predictions are scored: signal-integrity fixes (shipped), consensus-inference per name (shipped), base rates (shipped), verdict decomposition (shipped, shadow), utilization bottom-up (EDGAR-gated), external adversarial review, and the scored record.

What's in this repo: the full forensic engine (src/bubble/), the evidence-gated report it produced (data/published/), the interactive viz + its verified dataset (viz/), worker-verified handoff datasets (handoffs/), the market-facing layer (analysis/), and the test suite. The ~40GB raw EDGAR corpus is not committed — it is re-fetchable public SEC data via scripts/, and every claim in the report carries source provenance.

This system is deliberately engineered in the direction of a forensic analyst who assumes every optimistic projection is a potential liability until proven otherwise. Maximum skepticism. Maximum rigor.

VISION: The Complete "Burry Report" System

Ultimate Goal

Build a high-confidence forensic mapping and analysis system capable of answering Michael Burry-style questions about the AI / Data Center / Financing ecosystem with real numbers, timelines, and evidence.

The system must be able to determine:

Whether we are in a bubble
How large the bubble is (in capital, leverage, and physical overbuild)
When it is likely to crack (with specific timelines)
Where the biggest risks and contagion paths lie
Who ultimately bears the downside risk

Core Philosophy

This is not a dashboard for enthusiasts. It is a skeptical, forensic tool designed to find the gap between the narrative and reality, exactly the way Michael Burry would approach it. We prioritize truth over optimism, and we clearly distinguish between what is measured, what is estimated, and what is unknown.

Scope: the entire ecosystem — uncapped

Every entity in the AI / data-center / financing ecosystem (hyperscalers, neoclouds, developers, financiers, power providers, SPVs, service/people/supply-chain layers, etc.). The 7,708 already in the field is a floor; the true total — with the un-filed private long tail — is an output of enumeration, not a target. (Earlier drafts wrote "750–900 entities"; that was a resource-constrained cap and is rejected — see the Operating Doctrine above and analysis/total_ecosystem_dive.md.)
Every deal / loan / lien / contract ≥ a materiality floor (leases, debt facilities, PPAs, land, equipment financing, SPV structures) — all of them, not a sample.
Full coverage of the ecosystem's capital structure, physical execution, and risk transfer mechanisms — broad AND deep, limited only by physics.

Required Capabilities

1. Data Ingestion Layer

SEC EDGAR at scale (10-K, 10-Q, 8-K, bond filings, SPV disclosures)
Regulatory & Permit Data (FERC, state PUCs, EPA, local zoning, air permits)
Project Trackers (Cleanview, FracTracker, GlobalData, DCD)
Physical & Construction Data (satellite imagery, interconnection queues, transformer backlogs)
Power & Energy Data (ISO queues, PPA filings, on-site generation permits)
Ownership & Relationship Data (corporate filings, state LLC records, guarantee disclosures)

2. Knowledge Graph (Neo4j)

Rich modeling of:
- Entities (companies, projects, SPVs)
- Deals (leases, debt, PPAs, land, financing)
- Relationships (ownership, guarantees, collateral, debt waterfalls, SPV layering)
- Physical assets (data centers, power infrastructure)
Ability to trace off-balance-sheet risk and contagion paths

3. LLM Extraction Pipeline

High-quality, structured extraction from documents
Entity resolution and relationship detection
Confidence scoring + validation + retry logic
Materiality-ranked LLM adjudication queue for low-confidence or high-impact items

4. Analysis Engine (Burry Core)

Red flag detection (aggressive assumptions, related-party risk, timeline slippage, incentive misalignment, circular financing)
Multi-scenario stress testing (Base / Adverse / Severe / Tail)
Physical constraint modeling (power availability vs announced capacity, permit status vs construction timeline)
Utilization vs debt service mismatch analysis
Concentration and contagion mapping

5. Output & Reporting

Final Burry Report that includes:
- Clear yes/no on whether this is a bubble + confidence level
- Timeline for cracks and peak stress (with specific quarters/years)
- Quantified ecosystem metrics (total leverage, off-balance-sheet exposure, power risk %, refinancing walls, etc.)
- Top risks with supporting evidence
- Clear distinction between measured data, estimates, and unknowns
- Actionable insights for a skeptical analyst

Success Criteria

The system is successful when it can:

Produce a Burry-grade report that a professional investor would take seriously.
Show real evidence behind the key numbers.
Clearly state the confidence level on every major claim.
Highlight the biggest gaps and uncertainties that still remain.
Be continuously updatable as new filings, permits, and project data become available.

Mindset & Tone

Extreme skepticism toward optimistic narratives
Focus on cash flow reality, physical constraints, and who holds the risk
Clear separation between what the data actually shows and what is being assumed
Willingness to call out overbuilding, leverage, and misaligned incentives

Final Deliverable Vision

The end state is a system that can answer questions like:

"How much real leverage exists when you include all the SPVs and insurance wrappers?"
"What percentage of announced capacity is actually deliverable given power and permitting constraints?"
"When do the major refinancing walls hit, and which players are most exposed?"
"At realistic utilization levels, when does debt service exceed contracted revenue for the most leveraged players?"
"Who ultimately bears the losses if this unwinds?"

Core Principles (Non-Negotiable)

Provenance on everything — source, date, model, prompt hash, confidence, adjudication status.
Graph as the living model — Neo4j + GDS for debt waterfalls, contagion paths, concentration, off-balance-sheet exposure.
LLM + deterministic hybrid — edgartools + XBRL for numbers; LLM only for narrative, normalization, hidden entities. Multi-verifier on anything high-stakes.
Adjudication gates first-class — low-confidence or red-flag extractions go to a materiality-ranked LLM adjudication queue with full evidence context and override capability.
Physical ↔ Financial reality check — announced MW vs permits vs satellite vs equipment lead times.
Completeness, materiality-ordered — the target is 100% coverage of the ecosystem; materiality determines only the order of work (highest-impact first, so partial results are useful early), never the scope. "Top 50-100 first" is a sequencing choice, not a stopping point — everything gets done. (Per the Operating Doctrine, triaging scope for cost/effort is never acceptable here.)
Replayable & auditable — same input + same prompts + same models = reproducible conclusions (modulo documented non-determinism).
Ethical by design — polite scraping, rate limits, proper FOIA channels, no credential stuffing.
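The provenance and replayability principles above can be pictured as a small immutable record attached to every extracted claim. The sketch below uses a standard-library dataclass as a stand-in for the repo's Pydantic models; the class name, field names, and example values are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


def prompt_hash(prompt: str) -> str:
    """Stable SHA-256 fingerprint of the exact prompt text used for an extraction."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ProvenanceSketch:
    # Hypothetical field names; they mirror the list in the first principle above.
    source_uri: str
    retrieved_on: date
    model: str
    prompt_hash: str
    confidence: float
    adjudication_status: str


record = ProvenanceSketch(
    source_uri="https://example.org/filing.htm",   # placeholder URI
    retrieved_on=date(2026, 1, 15),                # placeholder date
    model="example-model",                         # placeholder model id
    prompt_hash=prompt_hash("Extract all debt facilities from the document."),
    confidence=0.8,
    adjudication_status="pending",
)
```

Hashing the prompt rather than storing a version label means any silent prompt edit changes the fingerprint, which is what makes a replay comparison meaningful.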

Tech Stack (Production-Leaning from Day 1)

Python 3.12+ + uv (fast, reproducible)
Pydantic v2 — all domain models are the source of truth
edgartools — best-in-class SEC EDGAR access (structured XBRL + filings)
Docling — high-fidelity PDF/table/layout parsing (2026 standard)
instructor + Claude 4 / Grok / o3 — schema-enforced structured extraction + multi-verifier
LangGraph — state machines for complex document reasoning + LLM adjudication checkpoints
Neo4j 5 + APOC + GDS — graph + graph algorithms (centrality, shortest path for contagion, community)
Prefect 3 — long-running, scheduled, observable orchestration
Streamlit — rapid forensic UI (adjudication queue, scenario simulator, graph explorer, Burry reports)
MinIO + Postgres — raw artifacts + operational metadata/queues/audit

Full details in the approved implementation plan (~/.grok/sessions/.../plan.md).

Quick Start (macOS)

# 1. Clone / enter the repo
cd ai-bubble

# 2. Install uv (if not present) + Python deps
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
just install          # or: uv sync --all-extras

# 3. Copy env and fill keys (at minimum ANTHROPIC_API_KEY for real extraction)
cp .env.example .env
# edit .env — add your LLM keys

# 4. Start the full stack (Neo4j + MinIO + Postgres)
just up
# Wait ~30-60s for healthchecks

# 5. Bootstrap schema only; production graph data comes from acquired sources
just bootstrap-neo4j

# 6. Build/acquire real source catalogs, then ingest a real public filing
just source-catalog --all-public --resolve-dynamic-public-sources
just ingest-msft

# 7. Generate first Burry-style report (red flags, assumptions, stress scenarios)
just burry-report MSFT

# 8. Launch the forensic dashboard + adjudication queue
just ui
# Open http://localhost:8501

One-command demo (after keys in .env):

just demo

Directory Structure

See the detailed structure in the implementation plan. Key directories:

src/bubble/models/ — Pydantic heart (Entity, Deal, Risk, Assumption, Provenance, etc.)
src/bubble/ingestion/edgar/ — edgartools + LangGraph extraction pipeline
src/bubble/graph/ — Neo4j client with provenance-aware writes
src/bubble/analysis/ — red flags, physical deliverability risk, scenario runner, stress tester, contagion
src/bubble/ui/streamlit_app.py — the Burry analyst cockpit
scripts/ — acquisition, extraction, report generation, and operational wrappers.
tests/ — real cached filings via vcrpy (never hits network in CI)

The "Burry Test"

Can the system, with minimal operator guidance, surface the same class of concerns a forensic analyst would on a fresh 10-K or 8-K?

Off-balance-sheet leverage via SPVs and guarantees
Optimistic utilization / depreciation / power cost assumptions
Concentration risk and single-tenant exposure
Timeline slippage between announced capacity and visible permits/construction
Incentive misalignment in the financing stack
Physical constraint gaps (power, transformers, turbines)

If it cannot do this on real data, the system is not yet successful.

Physical Deliverability Risk

The system now has explicit source-backed records for grid queues, permits, long-lead equipment, and observed construction progress. PhysicalRiskEngine turns those records into a component-level score:

interconnection risk: firm power, queue status, and delay months
permits risk: air/power/zoning status, contested or denied permits, missing generation permits
equipment risk: transformers, turbines, switchgear, cooling, delivery status, and lead times
construction risk: observed progress versus announced in-service dates

The resulting PhysicalRiskAssessment carries provenance, evidence tier, source counts, blocking issues, expected delay months, and a high-confidence eligibility flag. This is the foundation for replacing narrative physical-risk claims with auditable project-level evidence.
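One way to picture a component-level score is a weighted mean that refuses to treat missing evidence as zero risk. The sketch below is an illustration under stated assumptions: the weights, the 0-to-1 scale, and the class and function names are hypothetical, and the real PhysicalRiskEngine may combine components quite differently.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical weights, chosen only for illustration.
WEIGHTS = {
    "interconnection": 0.35,
    "permits": 0.25,
    "equipment": 0.20,
    "construction": 0.20,
}


@dataclass
class AssessmentSketch:
    score: float | None
    missing_components: list[str] = field(default_factory=list)
    high_confidence_eligible: bool = False


def assess(component_scores: dict[str, float]) -> AssessmentSketch:
    """Weighted mean over the components that have source-backed evidence.

    Components with no records are listed as missing instead of being
    silently scored as zero risk.
    """
    present = {k: v for k, v in component_scores.items() if k in WEIGHTS}
    missing = [k for k in WEIGHTS if k not in present]
    if not present:
        return AssessmentSketch(score=None, missing_components=missing)
    total_weight = sum(WEIGHTS[k] for k in present)
    score = sum(WEIGHTS[k] * v for k, v in present.items()) / total_weight
    return AssessmentSketch(score, missing, high_confidence_eligible=not missing)
```

In this sketch an assessment is only eligible for high confidence when all four components have evidence, which mirrors the eligibility flag described above without claiming to reproduce its exact rule.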

Physical capacity summaries also isolate active interconnection queue records explicitly tied to data-center, hyperscale, AI, or compute-campus load. The report separates direct data-center load requests from generation projects justified by data-center load growth, preserving queue ID, source URI, content hash, in-service date, customer, POI, and a short source excerpt for the top rows.

match_data_center_queues.py links those queue rows back to tracker-backed campus records when name, customer, county/state, and capacity evidence are strong enough. It writes a full pending-adjudication match audit to data/physical/queue_project_matches.csv and writes strong project-linked rows to data/physical/queues.csv for physical-risk scoring. Unmatched direct data-center load rows and explicitly data-center-driven supporting generation rows can also produce pending-adjudication data/physical/queue_projects.csv rows, with the official queue record as provenance, so real queue evidence is not discarded while waiting for tracker corroboration.
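The conservative linking idea can be sketched as a rule that promotes a match only when several independent signals agree. The thresholds, dictionary keys, and status labels below are hypothetical and are not taken from match_data_center_queues.py; the sketch also omits the customer and county signals for brevity.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def classify_match(queue_row: dict, project_row: dict,
                   name_threshold: float = 0.85,
                   capacity_tolerance: float = 0.15) -> str:
    """Return 'strong', 'pending_adjudication', or 'unmatched'."""
    same_state = queue_row["state"] == project_row["state"]
    name_ok = name_similarity(queue_row["name"], project_row["name"]) >= name_threshold
    q_mw, p_mw = queue_row["capacity_mw"], project_row["capacity_mw"]
    capacity_ok = abs(q_mw - p_mw) <= capacity_tolerance * max(q_mw, p_mw)

    signals = sum([same_state, name_ok, capacity_ok])
    if signals == 3:
        return "strong"                 # safe to link for scoring
    if signals == 2:
        return "pending_adjudication"   # keep in the audit file, do not score
    return "unmatched"
```

Requiring every signal for a strong link trades recall for precision, which fits a pipeline where a wrong link would contaminate a project-level risk score.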

match_physical_records.py applies the same conservative source-linking pattern to EPA ICIS-Air permit rows and EPA/EIA generator records. It writes pending-adjudication audits to data/physical/permit_project_matches.csv and data/physical/equipment_project_matches.csv, then writes only strong project-linked rows to data/physical/permits.csv and data/physical/equipment.csv.

physical_risk_summary.py runs the project-level scoring path in parallel and writes data/reports/physical_risk_summary.json, including counts for assets with queue, permit, equipment, and observation evidence; source-backed queue capacity linked to projects; top blockers; and top risk projects.

Physical evidence can be loaded from a directory of CSVs:

just physical-evidence data/physical --as-of 2026-12-31

Expected files:

projects.csv with one row per campus or physical asset
queue_projects.csv for optional queue-derived direct-load and supporting-generation rows
queues.csv for grid/interconnection records
permits.csv for air, power, zoning, and construction permits
equipment.csv for transformers, turbines, switchgear, cooling, and other long-lead equipment
observations.csv for satellite or site-observed construction progress

Every row must include source_uri. The optional source_type, source_confidence, human_review_status (adjudication status), page_or_section, and content_hash fields feed the evidence gate.
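The source_uri requirement amounts to a simple admission check at load time. The following is a minimal sketch using only the standard library; the sample column names other than source_uri are invented for the example, and the repo's loader may validate more than this.

```python
import csv
import io

REQUIRED_COLUMNS = ("source_uri",)


def split_rows(csv_text: str) -> tuple[list[dict], list[dict]]:
    """Separate rows that carry every required field from rows that do not."""
    accepted, rejected = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if all((row.get(col) or "").strip() for col in REQUIRED_COLUMNS):
            accepted.append(row)
        else:
            rejected.append(row)
    return accepted, rejected


# Hypothetical permits.csv content: the second row has no source_uri.
sample = (
    "project_id,status,source_uri\n"
    "campus-a,pending,https://example.org/permit/123\n"
    "campus-b,approved,\n"
)
accepted, rejected = split_rows(sample)
```

Returning the rejected rows instead of dropping them keeps the gap visible, so a missing source shows up as a coverage problem and not as silently absent data.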

Compute Economics Backlog

The GPU depreciation, TAM sanity-check, capex payback, depreciation-to-EPS, and chip-supply modules are documented in docs/compute_economics_backlog.md. Public analyst threads and social posts are research leads only; production metrics must be re-sourced from filings, market snapshots, rental-rate data, or other auditable artifacts.

The implemented source-backed loader reads optional CSVs from data/compute/:

compute_assets.csv
gpu_price_observations.csv
depreciation_policies.csv
tam_claims.csv
capex_payback_cases.csv
eps_depreciation_impacts.csv
chip_supply_observations.csv

Every compute row must include source_uri, retrieved_at, and content_hash. The optional source_type, source_confidence, human_review_status (adjudication status), and page_or_section fields feed the evidence gate. If no compute evidence is loaded, the final report keeps the compute-economics conclusion blocked rather than filling the gap with assumptions.

Acquired EDGAR documents can be mined for conservative compute economics rows:

just compute-economics --inventory data/edgar_acquisition/edgar_document_inventory.csv --output-dir data/compute --workers 32

Public GPU rental pricing snapshots can be acquired as source-backed market observations:

just gpu-pricing --output-dir data/compute --workers 8 --other-domain-concurrency 4 --other-requests-per-second 8

This writes raw HTML artifacts under data/compute/raw_gpu_pricing/, gpu_price_source_artifacts.csv, normalized gpu_price_observations.csv, and gpu_pricing_acquisition.summary.json. It uses bounded workers, per-domain throttling, retries, and resume mode so repeated runs parse existing raw artifacts without refetching unless --no-resume is set.

The current deterministic extractor writes depreciation policy rows and chip/supply commitment observations only when the source filing explicitly states the fact. It does not infer GPU prices, utilization, payback, or EPS impact. The worker count only controls local parsing of already acquired documents; it does not increase SEC request rates.

EDGAR Filing Manifest

Before claiming source coverage, build an auditable backlog of SEC filings and exhibits to parse:

export EDGAR_IDENTITY="Your Name your.email@example.com"
just edgar-manifest --all-public --since 2024-01-01 --max-filings-per-cik 120 --include-exhibits --max-workers 32 --sec-domain-concurrency 8 --sec-requests-per-second 8

This writes a timestamped manifest under data/manifests/ with one row per filing candidate, including:

normalized CIK, accession number, filing/report dates, form, item numbers, and primary document URL
optional EX-2, EX-4, EX-10, and EX-99 document URLs from SEC archive filing indexes when --include-exhibits is used
SEC submissions source URI and content hash provenance
Burry relevance score for 10-K, 10-Q, 8-K material agreements/debt items, S-1/S-3/424B financing filings, and keyword hits such as credit agreements, guarantees, leases, PPAs, project finance, data centers, AI infrastructure, and SPVs
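A relevance score of the kind described in the last item can be pictured as a form-type base plus keyword hits. Every weight below is hypothetical and the keyword list is abbreviated from the one above; the real scorer also considers 8-K item numbers, which this sketch ignores.

```python
# Hypothetical weights for illustration only.
FORM_WEIGHTS = {"10-K": 40, "10-Q": 30, "8-K": 30, "S-1": 50, "S-3": 40, "424B4": 50}
KEYWORDS = (
    "credit agreement", "guarantee", "lease", "ppa",
    "project finance", "data center", "ai infrastructure", "spv",
)
KEYWORD_WEIGHT = 10


def relevance_score(form: str, text: str) -> tuple[int, list[str]]:
    """Return (score, matched keywords) for one filing candidate.

    Matching is plain substring search, so short keywords such as 'ppa'
    can over-match; a real scorer would want word-boundary handling.
    """
    lowered = text.lower()
    hits = [k for k in KEYWORDS if k in lowered]
    return FORM_WEIGHTS.get(form, 0) + KEYWORD_WEIGHT * len(hits), hits


score, hits = relevance_score(
    "8-K", "Entered into a Credit Agreement secured by data center leases"
)
```

Returning the matched keywords alongside the number keeps the score auditable, in line with the manifest's role as a backlog and not as evidence.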

The manifest is an acquisition backlog, not extracted evidence. It tells the system which filings should be parsed next and quantifies source coverage gaps before any ecosystem-scale claim can be upgraded.

Download the prioritized source documents and emit pending-adjudication deal candidates:

export EDGAR_IDENTITY="Your Name your.email@example.com"
just edgar-acquire data/manifests/edgar_filing_manifest_YYYYMMDD-HHMMSS.csv --output-dir data/edgar_acquisition --max-workers 32 --sec-domain-concurrency 8 --sec-requests-per-second 8

This stores raw EDGAR documents under data/edgar_acquisition/documents/, writes edgar_document_inventory.csv with source URI, retrieval timestamp, accession/document id, byte count, and content hash, and writes a capital-loader-compatible deals.csv with extracted pending-adjudication rows. The output directory can be passed directly to:

just capital-evidence data/edgar_acquisition

The EDGAR commands use a global worker pool for local parsing/resume throughput while the per-domain limiter keeps sec.gov requests bounded. Increase --max-workers for local CPU-heavy parsing, but keep --sec-domain-concurrency and --sec-requests-per-second at or below the SEC fair-access lane. Delta EDGAR acquisitions merge into the existing inventory and deal CSVs by default so a small daily manifest does not replace the larger acquired corpus. Use --overwrite only for an intentional full rebuild.
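The merge-by-default behavior for delta acquisitions can be sketched as a keyed upsert. The key columns below are assumptions chosen for the example; the actual acquirer may key on different fields.

```python
def merge_inventory(existing: list[dict], delta: list[dict],
                    key_fields: tuple[str, ...] = ("accession_number", "document_id"),
                    ) -> list[dict]:
    """Upsert delta rows into the existing inventory.

    Rows sharing a key are replaced by the newer delta row; all other
    existing rows are kept, so a small delta never shrinks the corpus.
    """
    merged = {tuple(row[k] for k in key_fields): row for row in existing}
    for row in delta:
        merged[tuple(row[k] for k in key_fields)] = row
    return list(merged.values())


# Placeholder rows; the values are not real accession numbers or hashes.
existing = [
    {"accession_number": "A-1", "document_id": "d1", "content_hash": "old"},
    {"accession_number": "A-2", "document_id": "d1", "content_hash": "keep"},
]
delta = [
    {"accession_number": "A-1", "document_id": "d1", "content_hash": "new"},
    {"accession_number": "A-3", "document_id": "d1", "content_hash": "added"},
]
merged = merge_inventory(existing, delta)
```

An overwrite mode would simply return the delta rows alone, which is why it only makes sense for an intentional full rebuild.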

EDGAR candidate extraction uses context-supported deal notional, not the largest dollar number in a filing. Corporate scale metrics such as AUM, remaining performance obligations, generic investment commitments, fundraising commitments, and outstanding balance-sheet totals are rejected as deal notional unless later adjudication overrides them.
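The difference between "context-supported" and "largest dollar number" can be shown with a small filter. The regular expression, keyword lists, and window size below are all assumptions for illustration and do not reproduce the repo's extractor.

```python
import re

AMOUNT_RE = re.compile(r"\$\s?(\d+(?:\.\d+)?)\s*(million|billion)", re.IGNORECASE)

# Hypothetical context vocabularies.
REJECT_CONTEXT = ("assets under management", "remaining performance obligations",
                  "total assets")
ACCEPT_CONTEXT = ("credit agreement", "term loan", "notes due", "lease")


def candidate_notionals(text: str, window: int = 80) -> list[float]:
    """Return dollar amounts whose nearby text supports a deal reading."""
    results = []
    for match in AMOUNT_RE.finditer(text):
        start = max(0, match.start() - window)
        context = text[start:match.end() + window].lower()
        if any(term in context for term in REJECT_CONTEXT):
            continue   # corporate scale metric, not a deal
        if not any(term in context for term in ACCEPT_CONTEXT):
            continue   # no deal language nearby
        scale = 1e6 if match.group(2).lower() == "million" else 1e9
        results.append(float(match.group(1)) * scale)
    return results


scale_metric = "The Company had $900 billion of assets under management at year end."
deal_text = "The borrower entered into a credit agreement providing a $2.5 billion term loan."
```

Under this rule the $900 billion figure is dropped even though it is the larger number, while the $2.5 billion term loan survives as a pending-adjudication candidate.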

Production source data is guarded by an invariant: source rows and deal nodes must be backed by an actual source URI and cannot use inferred provenance.

Coverage can be measured at any point:

just source-coverage --data-dir data

The coverage report counts filings, entities, raw source documents, projects, queue records, permits, ownership records, tracker records, PPAs, lease agreements, extracted deals, and source-backed deals.

The curated public CIK watchlist is only the first layer. Build a source-backed entity universe from acquired PPAs, EDGAR deal candidates, tracker projects, permits, and equipment rows, then map public-company names to SEC CIKs using the SEC company ticker reference:

export EDGAR_IDENTITY="Your Name your.email@example.com"
just entity-universe --data-dir data --output-dir data/entity_universe

This writes entity_mentions.csv, entities.csv, and expanded_edgar_ciks.csv. Rows preserve source URI, retrieval timestamp, content hash, document id, and record index so expanded CIKs remain traceable to real corpus evidence. Use the expanded CIK CSV as the next input for larger EDGAR manifest runs after adjudicating the highest-impact matches.

just edgar-manifest --all-public --cik-csv data/entity_universe/expanded_edgar_ciks.csv --since 2024-01-01 --include-exhibits --max-workers 32

When a broad primary-document manifest already exists, build a focused exhibit-only follow-on without refetching SEC submissions JSON:

just edgar-exhibit-manifest data/manifests/edgar_filing_manifest_YYYYMMDD-HHMMSS.csv --min-parent-relevance-score 120 --exhibit-index-workers 64 --sec-domain-concurrency 8 --sec-requests-per-second 8

This reads the existing manifest, fetches SEC archive directory indexes only for selected high-signal parent filings, and writes data/manifests/edgar_exhibit_manifest_YYYYMMDD-HHMMSS.csv. Use the output with just edgar-acquire to download EX-10, EX-4, EX-2, and EX-99 contract-level documents into the same source-backed EDGAR acquisition corpus. The EDGAR acquirer writes both deals.csv and tranches.csv when source text supports tranche-level debt/bond terms, and enriches deal rows with collateral snippets, guarantors, guarantee-scope snippets, SPV/non-recourse flags, source URI, content hash, accession context, pending-adjudication status, and deterministic notional scope tags (notional_context_kind, notional_commitment_scope, notional_non_specific_obligation). A single debt/security document can emit multiple tranches.csv rows when explicit source text names separate term loan, revolver, or note-series amounts; otherwise the extractor falls back to one primary tranche candidate. Tranche rows also preserve guarantee_description when source prose supports guarantee scope beyond a simple "as Guarantor" role label.

For non-EDGAR sources, use a real source catalog:

just source-catalog --output data/source_catalogs/source_catalog.csv
just source-acquire data/source_catalogs/source_catalog.csv --output-dir data/source_acquisition

Minimum catalog columns are source_id, corpus, and source_uri. Supported corpus values include filings, source_documents, projects, queue_records, permit_records, equipment_records, construction_observations, ppas, lease_agreements, ownership_records, tracker_records, and extracted_deals. Optional columns include source_type, parser (auto, csv, json, xml, zip, xlsx, or text), document_id, entity_id, project_id, filing_accession, and meta_* columns. Acquisition writes raw artifacts, source_artifact_inventory.csv, and normalized source_rows/<corpus>.csv files with retrieval timestamp, source URI, content hash, local path, and record index.

just source-catalog writes SEC submissions targets from a vetted EDGAR watchlist and includes public CAISO, NYISO, MISO, PJM, and SPP interconnection queue targets, EPA eGRID plant/unit/generator data, EPA ICIS-Air facility/program permit records, and the Server Country data-center project tracker. It can also append validated curated catalogs for ISO queues, permits, PPAs, leases, ownership records, tracker rows, and extracted deal feeds:

just source-catalog --curated-catalog data/curated/iso_queues.csv --curated-catalog data/curated/permits.csv

Live public source listings can be resolved into concrete artifact URLs at catalog-build time. For example, ERCOT's GIS report listing is resolved to the latest primary GIS workbook, ISO-NE's public queue page is resolved to the current Excel export, EIA's 860M page is resolved to the latest downloadable generator inventory workbook, FERC's Market-Based Rate Entities to PPAs data set is resolved into paged API acquisition rows, FracTracker's ArcGIS data-center tracker is resolved into paged feature-layer queries, and GLEIF's Level 2 relationship API is resolved to the latest who owns whom relationship-record CDF archive. The source-list URL, document id or release date, data-set timestamp, API page boundaries, and workbook/archive filename are retained in metadata:

just source-catalog --resolve-dynamic-public-sources --output data/source_catalogs/source_catalog.csv

Or, more simply, for day-to-day use:

just update-catalog

(This is the recommended single command for bringing the acquisition targets current.)

Then typically:

just source-acquire data/source_catalogs/source_catalog.csv

Coverage reporting separates queued catalog targets from acquired artifacts, so the report can say how many filings, entities, projects, source-backed deals, and source-backed contract tranches are actually covered while also showing how much acquisition work is waiting. Derived graph node/edge outputs are reported through graph summaries and are not folded back into raw source coverage counts.

Acquisition is parallel by default. source-acquire uses a bounded worker pool (--max-workers, default 64), per-domain concurrency gates, retries with exponential backoff, and resume mode so existing raw artifacts are parsed without redownloading. SEC-hosted URLs require EDGAR_IDENTITY and are capped below the SEC's published 10 requests/second fair-access limit by default (--sec-requests-per-second 8; see SEC Developer Resources: https://www.sec.gov/about/developer-resources). Acquisition summary JSONs persist the actual worker count, SEC/non-SEC request-rate settings, per-domain concurrency settings, retry count, and resume status used for each run. Long EDGAR exhibit-manifest and document-acquisition runs accept --progress-interval N to emit machine-readable progress events every N completed parent indexes or documents.
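The rate cap and retry behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the source-acquire implementation: MinIntervalLimiter, backoff_delays, and fetch_with_retry are hypothetical names, the limiter enforces simple minimum spacing between call starts, and only OSError is treated as retryable.

```python
import threading
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


class MinIntervalLimiter:
    """Spaces call starts at least 1/per_second seconds apart (hypothetical helper)."""

    def __init__(self, per_second: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0
        self._lock = threading.Lock()  # lets a worker pool share one limiter

    def wait(self) -> None:
        with self._lock:
            now = self.clock()
            if now < self.next_allowed:
                self.sleep(self.next_allowed - now)
                now = self.next_allowed
            self.next_allowed = now + self.interval


def fetch_with_retry(fetch: Callable[[], T], limiter: MinIntervalLimiter,
                     retries: int = 3, sleep=time.sleep) -> T:
    """Call `fetch` under the limiter, retrying OSError with exponential backoff."""
    delays = backoff_delays(retries)
    for attempt in range(retries + 1):
        limiter.wait()
        try:
            return fetch()
        except OSError:
            if attempt == retries:
                raise  # retries exhausted: surface the last error
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

With per_second=8, matching the documented default for SEC-hosted URLs, consecutive requests start no closer than 0.125 seconds apart. The clock and sleep parameters are injectable only so the behavior can be checked without real waiting.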

The current operational corpus snapshot is tracked in docs/acquisition_status.md. Update that file after material acquisition, extraction, adjudication-queue, timing, or report refreshes so the docs stay tied to measured source coverage instead of stale ambition.

source-invariants audits production CSV outputs for blocked seed/demo/mock/placeholder source URIs and missing direct-acquisition provenance. It writes data/reports/source_invariant_audit.json; use --fail-on-violation in CI or before publishing a report.
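As a rough illustration of the kind of per-row check such an audit performs, consider the sketch below. It is an assumption-laden stand-in, not the actual source-invariants logic: audit_row, the marker list, and the choice of source_uri and content_hash as required provenance fields are all hypothetical, and naive substring matching can produce false positives.

```python
from urllib.parse import urlparse

# Assumption: markers and required provenance fields are illustrative only.
BLOCKED_MARKERS = ("seed", "demo", "mock", "placeholder")
PROVENANCE_FIELDS = ("source_uri", "content_hash")


def audit_row(row: dict[str, str]) -> list[str]:
    """Return the invariant violations found on one output row (hypothetical helper)."""
    violations = []
    for field in PROVENANCE_FIELDS:
        if not row.get(field, "").strip():
            violations.append(f"missing {field}")
    uri = row.get("source_uri", "").strip().lower()
    if uri:
        parsed = urlparse(uri)
        haystack = f"{parsed.netloc}{parsed.path}"  # host plus path, query ignored
        for marker in BLOCKED_MARKERS:
            if marker in haystack:  # naive substring match
                violations.append(f"blocked marker {marker!r} in source_uri")
    return violations
```

A CI wrapper in the spirit of --fail-on-violation would exit non-zero whenever any row returns a non-empty list.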

Local extraction is parallel by default where rows can be normalized independently. ppa-deals and tracker-projects both accept --max-workers (default 32), preserve source-row order in their outputs, and report the worker count used in their summaries.
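Order-preserving parallel normalization of this kind can be sketched with the standard library, since Executor.map yields results in input order regardless of which worker finishes first. The names normalize_row and normalize_rows are hypothetical, and the whitespace-trimming normalizer is a placeholder for the real per-row logic.

```python
from concurrent.futures import ThreadPoolExecutor


def normalize_row(row: dict[str, str]) -> dict[str, str]:
    """Stand-in normalizer (hypothetical): trim whitespace from every value."""
    return {key: value.strip() for key, value in row.items()}


def normalize_rows(rows: list[dict[str, str]], max_workers: int = 32) -> dict:
    """Normalize rows in parallel while keeping source-row order."""
    # Never start more workers than there are rows, and always at least one.
    workers = max(1, min(max_workers, len(rows) or 1))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        normalized = list(pool.map(normalize_row, rows))  # map preserves input order
    summary = {"workers_used": workers, "row_count": len(normalized)}
    return {"rows": normalized, "summary": summary}
```

Reporting workers_used in the summary mirrors the documented behavior of recording the worker count actually used for a run.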

Capital Structure Analysis

Entity-level Burry reports now include capital_structure metrics computed from extracted Deal records:

debt-like notional exposure
off-balance-sheet, SPV-linked, non-recourse, or guarantee-linked exposure
separate guarantee-linked and SPV/non-recourse exposure subtotals
separate LLM-adjudicated exposure from pending-adjudication candidate exposure
a high-notional adjudication queue for unapproved deterministic extraction candidates
distinct candidate exposure after economic deduplication, plus duplicate candidate groups
aggregate-obligation exposure separated from individual deal records
quarterly refinancing walls
near-term refinancing exposure
top counterparty concentration
downside bearers by role, including lenders, lessors, insurers, guarantors, noteholders, and bondholders

These metrics are evidence-gated from deal and tranche provenance. If extracted deals are sparse or unsupported, the report will say so instead of converting the gap into a false high-confidence conclusion.
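To make two of the metrics concrete, quarterly refinancing walls and near-term refinancing exposure can be computed roughly as below. This is a sketch, not the report's code: the maturity_date and notional field names are hypothetical, and rows lacking either value are skipped rather than guessed, in keeping with the evidence gate.

```python
from collections import defaultdict
from datetime import date


def quarter_label(day: date) -> str:
    """Map a date to a label such as '2027-Q1'."""
    return f"{day.year}-Q{(day.month - 1) // 3 + 1}"


def refinancing_wall(deals, as_of: date, near_term_end: date):
    """Bucket debt notional by maturity quarter and sum near-term exposure."""
    wall = defaultdict(float)
    near_term = 0.0
    for deal in deals:
        maturity = deal.get("maturity_date")  # hypothetical field name
        notional = deal.get("notional")       # hypothetical field name
        if maturity is None or notional is None or maturity < as_of:
            continue  # unsupported or already-matured rows are left out
        wall[quarter_label(maturity)] += notional
        if maturity <= near_term_end:
            near_term += notional
    return dict(sorted(wall.items())), near_term
```

The as_of and near_term_end arguments play the same role as the --as-of and --near-term-end flags on just capital-evidence.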

The final report applies a deterministic AI/data-center ecosystem scope gate before headline capital and debt-service metrics are calculated. It keeps the raw acquired corpus intact, but excludes debt and lease rows from headline math unless the row is tied to a core AI/data-center operator or contains explicit AI/data-center deal evidence. Broader hyperscaler, utility, supplier, and financier rows remain visible as balance-sheet context, but do not drive the headline wall without direct evidence. The report writes capital_scope so excluded rows, context rows, excluded debt-like notional, and inclusion reasons remain auditable.
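The scope gate amounts to a deterministic include/exclude decision with a recorded reason per row. A minimal sketch follows, assuming hypothetical field names (entity_id, evidence_text, notional) and a caller-supplied keyword list; the real gate's inputs and reason codes may differ.

```python
def scope_decision(row, core_operator_ids, evidence_keywords):
    """Decide whether a row may drive headline math, and record why."""
    if row["entity_id"] in core_operator_ids:
        return {"included": True, "reason": "core_operator"}
    text = row.get("evidence_text", "").lower()
    hits = [keyword for keyword in evidence_keywords if keyword in text]
    if hits:
        return {"included": True, "reason": f"explicit_evidence:{hits[0]}"}
    # Row stays visible as balance-sheet context but is kept out of the headline.
    return {"included": False, "reason": "context_only"}


def headline_notional(rows, core_operator_ids, evidence_keywords):
    """Sum notional only over rows that pass the scope gate."""
    return sum(
        row["notional"]
        for row in rows
        if scope_decision(row, core_operator_ids, evidence_keywords)["included"]
    )
```

Keeping the reason alongside the decision is what lets excluded rows and inclusion reasons stay auditable in an output such as capital_scope.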

Debt-service output also separates raw extracted obligations from distinct candidate economic obligations. Distinct rollups collapse repeated SEC rows from the same accession/entity/notional group, preserve the duplicate candidate groups, and keep raw obligations visible for audit before any refinancing-wall or crack-timing conclusion is treated as high confidence.
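The raw-versus-distinct separation can be illustrated with a small grouping sketch. It assumes hypothetical field names (accession, entity_id, notional) and uses exact equality on the grouping key; it is not the repo's deduplication logic.

```python
from collections import defaultdict


def distinct_obligations(rows):
    """Collapse rows sharing accession/entity/notional; keep duplicates auditable."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["accession"], row["entity_id"], row["notional"])
        groups[key].append(row)
    distinct = [members[0] for members in groups.values()]  # one row per group
    duplicates = {key: members for key, members in groups.items() if len(members) > 1}
    return {
        "raw_notional": sum(row["notional"] for row in rows),
        "distinct_notional": sum(row["notional"] for row in distinct),
        "duplicate_groups": duplicates,
    }
```

Returning both totals side by side keeps the raw obligations visible for audit instead of silently replacing them with the deduplicated figure.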

Capital evidence can be loaded from a directory of CSVs:

just capital-evidence data/capital --as-of 2026-12-31 --near-term-end 2029-12-31

The extracted deal rows can also be compiled into a source-backed capital exposure graph. This writes entity nodes, counterparty edges, and a summary of top obligors, risk bearers, exposure edges, connected components, unmapped high-notional deals, skipped generic counterparties, source URIs, and adjudication status. The summary also separates direct AI/data-center keyword edges and watchlist-entity edges from the broader acquired capital network, so unrelated corporate financing does not silently become an AI-infrastructure conclusion. It also writes capital_contract_nodes.csv and capital_contract_edges.csv, which preserve source-backed deal, tranche, collateral, guarantor, project, asset, non-recourse, and bankruptcy-remote/SPV structure for deeper contagion mapping:

just capital-exposure-graph --data-dir data --output-dir data/graph

The contract-structure graph can then be joined to the source-backed ownership graph to create a conservative contract/ownership contagion path artifact. The join is currently exact legal-name matching only; unmatched high-impact guarantee and collateral paths are still retained as contract-only adjudication items. Non-specific aggregate/shelf disclosure rows are flagged in the contract graph and excluded from high-impact contagion path generation until they are tied to specific committed obligations and counterparties. Outputs are written to data/reports/contract_contagion_paths.csv and data/reports/contract_contagion_summary.json, with SEC/GLEIF source URIs, content hashes, adjudication statuses, notional exposure, ownership path depth, and risk flags preserved on each row:

just contract-contagion-paths --data-dir data --output-dir data/reports --max-paths 50000

The default path cap is 50,000 so broad contract graphs are not silently clipped at the first 10,000 source-backed paths.
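The conservative join described above, exact legal-name matching with unmatched rows retained as contract-only items, can be sketched like this. The function and field names (join_on_exact_legal_name, party_legal_name, legal_name, node_id, contract_id) are hypothetical.

```python
def join_on_exact_legal_name(contract_rows, ownership_nodes):
    """Join contract parties to ownership nodes by exact legal-name equality."""
    by_name = {}
    for node in ownership_nodes:
        by_name.setdefault(node["legal_name"], []).append(node["node_id"])
    paths, contract_only = [], []
    for row in contract_rows:
        node_ids = by_name.get(row["party_legal_name"])
        if node_ids:
            paths.extend(
                {"contract_id": row["contract_id"], "node_id": node_id}
                for node_id in node_ids
            )
        else:
            # No exact match: keep the row for adjudication instead of guessing.
            contract_only.append(row["contract_id"])
    return paths, contract_only
```

Because the comparison is exact, a name differing only in case or punctuation lands in the contract-only list; that is the price of avoiding false ownership links.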

Acquired ownership records can also be compiled into a source-backed legal entity ownership and consolidation graph. The graph currently targets GLEIF relationship records and preserves source URI, retrieval timestamp, content hash, local raw artifact path, source record index, document id, relationship status, relationship type, validation source, and quantifier fields. It writes nodes, edges, and a rollup summary used by the final evidence-gated report:

just ownership-graph --data-dir data --output-dir data/graph

Weak-link scoring combines the capital exposure graph with source-backed physical execution risk to create a ranked triage list for the report:

just weak-links --data-dir data --output-dir data/reports

The report-level adjudication queue combines the highest-impact pending items across capital extraction, contract-tranche extraction, weak-link scoring, physical match audits, contract/ownership contagion paths, and compute economics rows. It writes data/reports/review_queue.csv and data/reports/review_queue_summary.json; every item keeps source URI, content hash, page or section, source confidence, adjudication status, ecosystem relevance tags, and a legacy review-group id for duplicate-aware triage. The summary separates raw pending capital notional from adjudication-grouped notional, separately tracks pending contract-tranche adjudication notional, separately tracks pending contagion-path exposure, and breaks out the AI-infrastructure-relevant subset so broad corporate financing does not silently dominate the Burry worklist:

just review-queue --data-dir data --output-dir data/reports

All review/adjudication statuses are cleared by automated LLM adjudication. Legacy columns named human_review_status are treated as adjudication-status fields; there is no required operator gate in this workflow.

The broad queue can then be collapsed into a materiality-first LLM adjudication packet set. This deduplicates repeated review groups, ranks the top blockers by priority, exposure, AI/data-center relevance, and risk score, attaches local source snippets where the acquired artifact is available, and emits explicit decision questions/fields for automated adjudication. It writes data/reports/materiality_adjudication_packets.csv and data/reports/materiality_adjudication_summary.json:

just materiality-adjudication --data-dir data --output-dir data/reports --limit 250 --snippets-per-packet 3

The packet set can then be adjudicated into conservative decision rows. This does not convert unresolved rows into final metrics; it separates source-supported blockers from rows that still need deeper extraction, quote retrieval, duplicate/aggregate splitting, counterparty-role extraction, collateral/guarantee scoping, or explicit rate/maturity evidence. It writes data/reports/materiality_adjudication_decisions.csv and data/reports/materiality_adjudication_decision_summary.json:

just materiality-adjudication-decisions --data-dir data --output-dir data/reports

The crack-window timing layer then combines source-backed capital and tranche maturities, physical COD/queue dates, EPS depreciation timing, and chip delivery windows into a quarter stress calendar. It writes data/reports/timing_signals.csv, data/reports/timing_signal_quarters.csv, and data/reports/timing_signal_summary.json; every signal requires source URI and content hash provenance and is treated as a candidate timing indicator until LLM-adjudicated:

just timing-signals --data-dir data --output-dir data/reports

Expected files in the capital evidence directory:

deals.csv with one row per lease, debt facility, bond, PPA, guarantee, or other contract
tranches.csv with optional tranche-level debt/bond terms linked by deal_id; a source document may contribute multiple tranche rows when separate facility/series terms are explicit; guarantee_description carries source-backed guarantee-scope context when extracted

Every row must include source_uri. The optional fields source_type, source_confidence, human_review_status (the legacy name for the adjudication status), page_or_section, and content_hash feed the evidence gate. Store counterparty_roles and key_terms as JSON objects so roles, guarantees, SPVs, and lease classification flags remain structured. For EDGAR rows, key_terms should preserve the notional scope fields so aggregate snapshots, shelf capacity disclosures, and specific committed transactions can be separated deterministically during adjudication.
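A deals.csv row with JSON-encoded structured fields might be prepared as in the sketch below. serialize_deal_row is a hypothetical helper, and the example values are invented for illustration; only the column names source_uri, counterparty_roles, and key_terms come from the text above.

```python
import json

# Columns stored as JSON objects, per the row requirements described above.
JSON_FIELDS = ("counterparty_roles", "key_terms")


def serialize_deal_row(deal: dict) -> dict[str, str]:
    """Turn a deal record into flat CSV-ready strings (hypothetical helper)."""
    if not str(deal.get("source_uri") or "").strip():
        raise ValueError("every row must include source_uri")
    out = {}
    for key, value in deal.items():
        if key in JSON_FIELDS:
            # sort_keys keeps the encoding stable across runs
            out[key] = json.dumps(value, sort_keys=True)
        else:
            out[key] = "" if value is None else str(value)
    return out
```

The resulting dict can be written with csv.DictWriter; reading it back with json.loads recovers the structured roles and terms.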

Acquired FERC PPA rows can be normalized into data/capital/deals.csv without inferring dollar exposure:

just ppa-deals --input data/source_acquisition/source_rows/ppas.csv --output data/capital/deals.csv

Acquired project tracker rows can be normalized into source-backed physical project evidence:

just tracker-projects --input data/source_acquisition/source_rows/tracker_records.csv --output data/physical/projects.csv

This preserves the tracker source URI, raw artifact hash, local raw artifact path, and record index while carrying reported capacity ranges, investment, status, location, owner/operator/tenant fields, and source confidence into projects.csv.

Current Status

This is an evidence-gated prototype, not a completed Burry-grade system. Current ecosystem-scale reports are treated as directional hypotheses until the evidence gate can prove the key claims with measured, corroborated, and LLM-adjudicated sources.

The report generator now caps confidence when a major claim is inferred or unsupported. This is intentional: the system should be skeptical of its own outputs before it is skeptical of the market narrative.

The system is designed to become a live, continuously evolving forensic instrument (daily delta EDGAR, weekly deep re-validation, event-driven scenario re-runs).

See justfile for the full command surface and the implementation plan for the complete roadmap (Phases 0–6).

License & Ethics

Open research — every claim carries provenance, everything publishes by default (including mistakes and corrections; that's part of the discipline). All scraping respects robots.txt and published rate limits. FOIA and regulatory requests must go through proper legal channels. No unauthorized access to private data rooms or systems. Nothing here is investment advice.

Built with extreme prejudice toward hidden risk and optimistic narrative.

