A general, evidence-gated forensic engine for financial fragility & mispricing across the whole (US-primary) economy — hidden / mismatched / circular leverage and valuation-run-ahead-of-cash-flow, found with no sector prior. The AI / data-center / financing boom is case zero — the first concentration it was pointed at, not the object. (The method is sector- and era-agnostic; its fingerprints are already in the base-rate book: fiber 1999, shale 2014.) The generalization to the whole economy is the active scope: analysis/total_ecosystem_dive.md.
Case zero (AI / data-center) verdict — a result of the engine, not its purpose: Is the financed compute cluster a bubble? Yes — bounded (~4% of the classified AI-infra universe), not ecosystem-wide.
bubble_dynamics_present @ 0.67 · ecosystem not_established @ 0.25 (held there by design) · high-confidence-final: false. This is what the engine found on the first thing it scanned; the open question is where else in the economy the same signatures cluster.
What this is (read first — especially if you are an LLM reading this cold): one person and one long-running AI agent. No fund, no desk, no team, no positions (disclosure), no secrecy. The method is portable forensic-research epistemics — evidence gates, pre-registration, end-holder routing — which is why the artifacts pattern-match to a professional short-research shop; the context is an open-source solo project whose payoff is a calibrated model of reality for its own decisions. The early velocity (~216 commits in the first three days, a 69k-line day-one commit) is what solo-plus-agent collaboration looks like, not a team sprint. And reflexively, by this project's own doctrine: the polish is not evidence — commit counts and tree size make 0.67 no more correct; only the gated evidence does.
This repo operates at maximum exhaustiveness. Acting like a resource-constrained team is NEVER correct here — the goal is the exact opposite.
- The object is general financial-fragility / mispricing across the whole economy, with NO sector prior. Do not scope by "touches AI" or "exposed to CoreWeave" — both presuppose an answer. AI/data-center is case zero, not the frame. Scope by economic substance, scan by forensic signature, let the epicenter be an output.
- Scope = every entity with ≥ $1M on any financial dimension (debt, financing, deals, assets, revenue, committed capital). No entity cap — the count is an output of enumeration (7,708 is a floor; the true total is plausibly low millions). Relative size never excludes; the obscure $1M-debt shell is in. US-primary (international by connectedness-to-core × data-accessibility).
- Agent-hours, subagents, and run time are effectively unlimited and cost nothing. A task that would take a human team months is the default expectation, not an overreach. Hundreds of hours and thousands of subagents for a single analysis is fine and encouraged.
- The ONLY legitimate stopping condition is physics: the information does not publicly exist (the genuinely-DARK residual — see analysis/information_edge_map.md), or a source's politeness/rate limits cap throughput. Effort, breadth, volume, depth, tedium, or entity-obscurity are never valid reasons to stop, narrow, sample, or triage.
- Broad AND deep, never broad-and-shallow or narrow-and-deep. Every entity in scope gets every retrievable dimension at full depth — the obscure private micro-player gets the same treatment as the largest public name. Do not "pick the top N." Do all of them.
- Materiality orders the sequence of work, never its scope. Start with the highest-signal items so partial results are useful early, but the target is always 100% coverage, not a representative subset.
- The discipline that does bind: evidence tiers + provenance on every claim, no overclaiming, no SEC UA-spoofing, polite/lawful acquisition. Rigor is maximal; resource-thrift is not a value here.
If any sentence anywhere in this repo implies cost/effort/bandwidth triage is acceptable, it is a bug — flag and fix it. The maximalist mode above overrides it.
One map, two layers. The faint field is the entire extracted contract universe from the source corpus — 7,708 entities, 62,939 deals — at its RAW, unadjudicated notionals (the same inflated basis the adjudication stripped ~98% from). The bright core is what survived evidence-gating. Hover any entity to light up its exposure neighborhood · click for its top-3 exposures + counterparty chips · ⚠ Deals ranks the riskiest deals (fragility × binding-tier × circularity) · ▶ Cascade plays the contagion path hop by hop. A keyless market overlay refreshes hourly via GitHub Actions (live prices on the public tickers); the adjudicated verdicts only change when the forensic engine re-runs.
What the engine found (every figure evidence-gated, red-teamed, with the over-count stripped):
- $25.8B committed core debt ($29.7B incl. infra) in the financed AI-compute cluster — after stripping ~98% over-count from the $1.45T inflated headline basis
- The flagship cascade (OpenAI demand leg → CoreWeave → lenders → pensions/households) routes $25.1B as a GROSS UPPER BOUND — CoreWeave's entire debt, deliberately NOT apportioned to OpenAI's revenue share
- 2 of 5 first-principles fragility conditions cleanly met: GPU-collateral duration mismatch + existential customer concentration (CoreWeave ~67% Microsoft)
- Crack window 2025-Q3..2027-Q3 (engine peak ~2026-Q2); refi wall peaks 2030
- Honest caveats are first-class — the "i" button in every view: the 1.35x coverage ratio is a masking artifact (negative ex-CoreWeave); the satellite "un-built" proxy is confounded; the ecosystem gate is deliberately capped at 0.25
The market-facing layer (added June 2026 — the forensics above measure reality; this measures what the market believes, and the gap is the thesis):
- What the market prices vs. what we measured — cluster equity has partially converged (CRWV ≈ −47% off peak, elevated short interest), cluster primary credit prices essentially none of the measured fragility (9.75% '31 paper at par, 3x books, ABS compressing), while the funding chain's own investors are converging first (top-4 BDC discounts to reported NAV up to ~41%, widespread private-credit redemptions)
- The steelman bull case — the ideological Turing test: the strongest case that the cluster survives, written to be signable by a smart bull
- Pre-registered signals — dated confirm/kill criteria, including the conditions under which our own timing is wrong; the quantitative components auto-evaluate hourly into the explorer's credit chip and the banner's credit dial
- SpaceX / orbital-compute adjacency — Phase-0 extension card on the record $75B SPCX IPO (listed 2026-06-12, >$2T day one), exhibit-verified against the 424B4 prospectus: the Anthropic compute "backlog" ($45B gross at $1.25B/mo) is **$5–6B firm** once the filing's terminability ("after the initial three-month period… 90 days' notice") is applied — an ~87% haircut. The direct filing read also led to withdrawing a press-reported Google $920M/mo contract (not in the prospectus; no such amendment exists) — a live example of filing tier overruling press tier. Pattern-extension evidence only; the cluster verdict does not move.
- Limits to arbitrage — why the mispricing can persist: the cleanly mispriced asset (private cluster credit) has no liquid public short, every available expression is degraded, and that access asymmetry predicts discontinuous rather than gradual convergence. Disclosure inside: the author holds no position in any named issuer.
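The haircut arithmetic in the SpaceX / orbital-compute bullet above can be checked directly. This is a minimal sketch that uses only the figures quoted there ($45B gross, $1.25B/mo, $5–6B firm). The function name `haircut` is illustrative, and the source does not spell out how the firm range was derived, so the code only confirms that the firm range and the ~87% figure agree with each other.

```python
def haircut(gross: float, firm: float) -> float:
    """Fraction of a gross backlog that is not firm: 1 - firm / gross."""
    if gross <= 0:
        raise ValueError("gross must be positive")
    if not 0 <= firm <= gross:
        raise ValueError("firm must lie between 0 and gross")
    return 1.0 - firm / gross


GROSS_B = 45.0             # $B, gross backlog quoted in the bullet
MONTHLY_B = 1.25           # $B per month, quoted in the bullet
FIRM_RANGE_B = (5.0, 6.0)  # $B, firm range quoted in the bullet

for firm in FIRM_RANGE_B:
    months = firm / MONTHLY_B  # how many months of payments the firm amount covers
    print(f"firm ${firm:.1f}B = {months:.1f} months of payments, "
          f"haircut {haircut(GROSS_B, firm):.1%}")
```

The two ends of the firm range give haircuts of roughly 89% and 87%, which brackets the figure stated in the bullet.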
Quantitative epistemics (the consensus-inference layer — recovering the market-implied model so the gap can be measured, not asserted):
- Expectations inversion — reverse-DCF per name: 78–96% of enterprise value (median) rests on re-contracting assets after the signed backlog runs off, priced against ~2–3yr GPU economic life. Sensitivity bands, carded assumptions.
- Verdict decomposition tree — the flat 0.67 (a structural measurement) separated from the datable, Brier-scoreable realization forecast (~0.39 live), decomposed into two pathways so the funding-window-closes-first route (fiber 1999) is first-class. Shadow mode until promoted.
- Base-rate book — outside-view priors from telecom/fiber 1998–2003 and shale 2014–16 (the ~6–8-quarter capex→default lag, the failure sequence, and where the analogies break — GPU life is the decisive break).
- Marginal-buyer constraints — what binds the end-holders (NAIC RBC, BDC leverage, annuity surrender, redemption gates); the finding: redemption gates are already binding in 2026 on AI-credit-exposed funds.
- Adversarial-review packets — each load-bearing claim with a one-command reproduction and a prompt written to attack it. The witness: the scored record.
- Filing verification log — direct SEC EDGAR exhibit reads that upgrade claims from press to filing tier (or correct them). Already: CoreWeave's 67% Microsoft concentration confirmed verbatim; a press-reported SpaceX–Google contract withdrawn as absent from the prospectus; CoreWeave's DSCR covenant test postponed to Oct-2027 (Dec-31-2025 First Amendment).
- Information-edge map + completeness report — what an internet-connected agent can retrieve about the cluster's fragility vs. what is irreducibly private (FILING / SCUTTLEBUTT / DARK). The structure is fully legible; four numbers in four rooms (covenant headroom, contract cancellability, real occupancy, private-debt marks) are genuinely dark.
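The expectations-inversion bullet above can be illustrated with a toy calculation. This is a minimal sketch, not the engine's model: the function names, the cash-flow figures, and the 10% discount rate are all hypothetical. It discounts the signed backlog's cash flows and reports the share of enterprise value that must therefore come from re-contracting after the backlog runs off.

```python
from collections.abc import Sequence


def present_value(cash_flows: Sequence[float], rate: float) -> float:
    """Discount year-end cash flows (year 1, 2, ...) at a flat annual rate."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows, start=1))


def recontracting_share(enterprise_value: float,
                        backlog_cash_flows: Sequence[float],
                        rate: float) -> float:
    """Share of EV not explained by the discounted signed backlog.

    A negative result would mean the backlog alone is worth more than the EV.
    """
    if enterprise_value <= 0:
        raise ValueError("enterprise_value must be positive")
    backlog_pv = present_value(backlog_cash_flows, rate)
    return 1.0 - backlog_pv / enterprise_value


# Hypothetical name: EV of 1,000 with two years of signed backlog at 100 per year.
share = recontracting_share(1_000.0, [100.0, 100.0], rate=0.10)
print(f"{share:.1%} of EV rests on post-backlog re-contracting")
```

Sensitivity bands of the kind the bullet mentions could be produced by sweeping `rate` and the backlog length, under the same assumptions.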
ROADMAP.md — the full public program plan to the pre-registered 2026-Q4 adjudication (~Dec 18), when the engine's registered predictions are scored: signal-integrity fixes (shipped), consensus-inference per name (shipped), base rates (shipped), verdict decomposition (shipped, shadow), utilization bottom-up (EDGAR-gated), external adversarial review, and the scored record.
What's in this repo: the full forensic engine (src/bubble/), the evidence-gated report it produced (data/published/), the interactive viz + its verified dataset (viz/), worker-verified handoff datasets (handoffs/), the market-facing layer (analysis/), and the test suite. The ~40GB raw EDGAR corpus is not committed — it is re-fetchable public SEC data via scripts/, and every claim in the report carries source provenance.
This system is deliberately engineered to behave like a forensic analyst who assumes every optimistic projection is a potential liability until proven otherwise. Maximum skepticism. Maximum rigor.
Build a high-confidence forensic mapping and analysis system capable of answering Michael Burry-style questions about the AI / Data Center / Financing ecosystem with real numbers, timelines, and evidence.
The system must be able to determine:
- Whether we are in a bubble
- How large the bubble is (in capital, leverage, and physical overbuild)
- When it is likely to crack (with specific timelines)
- Where the biggest risks and contagion paths lie
- Who ultimately bears the downside risk
This is not a dashboard for enthusiasts. It is a skeptical, forensic tool designed to find the gap between the narrative and reality -- exactly the way Michael Burry would approach it. We prioritize truth over optimism, and we clearly distinguish between what is measured, what is estimated, and what is unknown.
- Every entity in the AI / data-center / financing ecosystem (hyperscalers, neoclouds, developers, financiers, power providers, SPVs, service/people/supply-chain layers, etc.). The 7,708 already in the field is a floor; the true total — with the un-filed private long tail — is an output of enumeration, not a target. (Earlier drafts wrote "750–900 entities"; that was a resource-constrained cap and is rejected — see the Operating Doctrine above and analysis/total_ecosystem_dive.md.)
- Every deal / loan / lien / contract ≥ a materiality floor (leases, debt facilities, PPAs, land, equipment financing, SPV structures) — all of them, not a sample.
- Full coverage of the ecosystem's capital structure, physical execution, and risk transfer mechanisms — broad AND deep, limited only by physics.
- SEC EDGAR at scale (10-K, 10-Q, 8-K, bond filings, SPV disclosures)
- Regulatory & Permit Data (FERC, state PUCs, EPA, local zoning, air permits)
- Project Trackers (Cleanview, FracTracker, GlobalData, DCD)
- Physical & Construction Data (satellite imagery, interconnection queues, transformer backlogs)
- Power & Energy Data (ISO queues, PPA filings, on-site generation permits)
- Ownership & Relationship Data (corporate filings, state LLC records, guarantee disclosures)
- Rich modeling of:
- Entities (companies, projects, SPVs)
- Deals (leases, debt, PPAs, land, financing)
- Relationships (ownership, guarantees, collateral, debt waterfalls, SPV layering)
- Physical assets (data centers, power infrastructure)
- Ability to trace off-balance-sheet risk and contagion paths
- High-quality, structured extraction from documents
- Entity resolution and relationship detection
- Confidence scoring + validation + retry logic
- Materiality-ranked LLM adjudication queue for low-confidence or high-impact items
- Red flag detection (aggressive assumptions, related-party risk, timeline slippage, incentive misalignment, circular financing)
- Multi-scenario stress testing (Base / Adverse / Severe / Tail)
- Physical constraint modeling (power availability vs announced capacity, permit status vs construction timeline)
- Utilization vs debt service mismatch analysis
- Concentration and contagion mapping
- Final Burry Report that includes:
- Clear yes/no on whether this is a bubble + confidence level
- Timeline for cracks and peak stress (with specific quarters/years)
- Quantified ecosystem metrics (total leverage, off-balance-sheet exposure, power risk %, refinancing walls, etc.)
- Top risks with supporting evidence
- Clear distinction between measured data, estimates, and unknowns
- Actionable insights for a skeptical analyst
The system is successful when it can:
- Produce a Burry-grade report that a professional investor would take seriously.
- Show real evidence behind the key numbers.
- Clearly state the confidence level on every major claim.
- Highlight the biggest gaps and uncertainties that still remain.
- Be continuously updatable as new filings, permits, and project data become available.
- Extreme skepticism toward optimistic narratives
- Focus on cash flow reality, physical constraints, and who holds the risk
- Clear separation between what the data actually shows and what is being assumed
- Willingness to call out overbuilding, leverage, and misaligned incentives
The end state is a system that can answer questions like:
- "How much real leverage exists when you include all the SPVs and insurance wrappers?"
- "What percentage of announced capacity is actually deliverable given power and permitting constraints?"
- "When do the major refinancing walls hit, and which players are most exposed?"
- "At realistic utilization levels, when does debt service exceed contracted revenue for the most leveraged players?"
- "Who ultimately bears the losses if this unwinds?"
- Provenance on everything — source, date, model, prompt hash, confidence, adjudication status.
- Graph as the living model — Neo4j + GDS for debt waterfalls, contagion paths, concentration, off-balance-sheet exposure.
- LLM + deterministic hybrid — edgartools + XBRL for numbers; LLM only for narrative, normalization, hidden entities. Multi-verifier on anything high-stakes.
- Adjudication gates first-class — low-confidence or red-flag extractions go to a materiality-ranked LLM adjudication queue with full evidence context and override capability.
- Physical ↔ Financial reality check — announced MW vs permits vs satellite vs equipment lead times.
- Completeness, materiality-ordered — the target is 100% coverage of the ecosystem; materiality determines only the order of work (highest-impact first, so partial results are useful early), never the scope. "Top 50-100 first" is a sequencing choice, not a stopping point — everything gets done. (Per the Operating Doctrine, triaging scope for cost/effort is never acceptable here.)
- Replayable & auditable — same input + same prompts + same models = reproducible conclusions (modulo documented non-determinism).
- Ethical by design — polite scraping, rate limits, proper FOIA channels, no credential stuffing.
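A provenance record along the lines of the "Provenance on everything" principle above might look like the following. This is a minimal sketch using stdlib dataclasses rather than the repo's Pydantic models. The class name `ProvenanceSketch`, its field names, the helper `prompt_hash`, and the example values are hypothetical, chosen only to mirror the listed attributes (source, date, model, prompt hash, confidence, adjudication status).

```python
import hashlib
from dataclasses import dataclass
from datetime import date


def prompt_hash(prompt: str) -> str:
    """Stable SHA-256 hex digest of the exact prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ProvenanceSketch:
    source_uri: str
    retrieved_on: date
    model: str
    prompt_sha256: str
    confidence: float
    adjudication_status: str = "pending"

    def __post_init__(self) -> None:
        # Reject records that could not be traced or scored later.
        if not self.source_uri:
            raise ValueError("source_uri is required")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


record = ProvenanceSketch(
    source_uri="https://example.com/filing.htm",  # illustrative URI, not a real source
    retrieved_on=date(2026, 1, 15),
    model="example-model",                        # illustrative model label
    prompt_sha256=prompt_hash("Extract the debt facility terms."),
    confidence=0.72,
)
print(record.adjudication_status, record.prompt_sha256[:12])
```

Hashing the exact prompt text is one way to support the replayability principle: two runs can be compared only if their prompt hashes match.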
- Python 3.12+ + uv (fast, reproducible)
- Pydantic v2 — all domain models are the source of truth
- edgartools — best-in-class SEC EDGAR access (structured XBRL + filings)
- Docling — high-fidelity PDF/table/layout parsing (2026 standard)
- instructor + Claude 4 / Grok / o3 — schema-enforced structured extraction + multi-verifier
- LangGraph — state machines for complex document reasoning + LLM adjudication checkpoints
- Neo4j 5 + APOC + GDS — graph + graph algorithms (centrality, shortest path for contagion, community)
- Prefect 3 — long-running, scheduled, observable orchestration
- Streamlit — rapid forensic UI (adjudication queue, scenario simulator, graph explorer, Burry reports)
- MinIO + Postgres — raw artifacts + operational metadata/queues/audit
Full details in the approved implementation plan (~/.grok/sessions/.../plan.md).
# 1. Clone / enter the repo
cd ai-bubble
# 2. Install uv (if not present) + Python deps
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
just install # or: uv sync --all-extras
# 3. Copy env and fill keys (at minimum ANTHROPIC_API_KEY for real extraction)
cp .env.example .env
# edit .env — add your LLM keys
# 4. Start the full stack (Neo4j + MinIO + Postgres)
just up
# Wait ~30-60s for healthchecks
# 5. Bootstrap schema only; production graph data comes from acquired sources
just bootstrap-neo4j
# 6. Build/acquire real source catalogs, then ingest a real public filing
just source-catalog --all-public --resolve-dynamic-public-sources
just ingest-msft
# 7. Generate first Burry-style report (red flags, assumptions, stress scenarios)
just burry-report MSFT
# 8. Launch the forensic dashboard + adjudication queue
just ui
# Open http://localhost:8501
One-command demo (after keys in .env):
just demo
See the detailed structure in the implementation plan. Key directories:
- src/bubble/models/ — Pydantic heart (Entity, Deal, Risk, Assumption, Provenance, etc.)
- src/bubble/ingestion/edgar/ — edgartools + LangGraph extraction pipeline
- src/bubble/graph/ — Neo4j client with provenance-aware writes
- src/bubble/analysis/ — red flags, physical deliverability risk, scenario runner, stress tester, contagion
- src/bubble/ui/streamlit_app.py — the Burry analyst cockpit
- scripts/ — acquisition, extraction, report generation, and operational wrappers
- tests/ — real cached filings via vcrpy (never hits network in CI)
Can the system, with minimal operator guidance, surface the same class of concerns a forensic analyst would on a fresh 10-K or 8-K?
- Off-balance-sheet leverage via SPVs and guarantees
- Optimistic utilization / depreciation / power cost assumptions
- Concentration risk and single-tenant exposure
- Timeline slippage between announced capacity and visible permits/construction
- Incentive misalignment in the financing stack
- Physical constraint gaps (power, transformers, turbines)
If it cannot do this on real data, the system is not yet successful.
The system now has explicit source-backed records for grid queues, permits, long-lead equipment, and observed construction progress. PhysicalRiskEngine turns those records into a component-level score:
- interconnection risk: firm power, queue status, and delay months
- permits risk: air/power/zoning status, contested or denied permits, missing generation permits
- equipment risk: transformers, turbines, switchgear, cooling, delivery status, and lead times
- construction risk: observed progress versus announced in-service dates
The resulting PhysicalRiskAssessment carries provenance, evidence tier, source counts, blocking issues, expected delay months, and a high-confidence eligibility flag. This is the foundation for replacing narrative physical-risk claims with auditable project-level evidence.
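The component-level scoring described above can be sketched as follows. This is an illustration only, not the PhysicalRiskEngine implementation: the class `AssessmentSketch`, the function `assess`, the unweighted mean, and the 0.8 blocking threshold are all assumptions. Only the four component names come from the text.

```python
from dataclasses import dataclass, field

# Component names follow the text above; everything else here is hypothetical.
COMPONENTS = ("interconnection", "permits", "equipment", "construction")


@dataclass
class AssessmentSketch:
    component_scores: dict[str, float]
    overall: float
    blocking_issues: list[str] = field(default_factory=list)


def assess(component_scores: dict[str, float], block_at: float = 0.8) -> AssessmentSketch:
    """Combine per-component risk scores in [0, 1] into one assessment."""
    scores: dict[str, float] = {}
    for name in COMPONENTS:
        if name not in component_scores:
            raise ValueError(f"missing component score: {name}")
        value = component_scores[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
        scores[name] = value
    overall = sum(scores.values()) / len(COMPONENTS)  # assumed: unweighted mean
    # Assumed rule: any component at or above the threshold blocks the project.
    blockers = [name for name in COMPONENTS if scores[name] >= block_at]
    return AssessmentSketch(scores, overall, blockers)


result = assess(
    {"interconnection": 0.9, "permits": 0.4, "equipment": 0.6, "construction": 0.3}
)
print(result.blocking_issues, round(result.overall, 2))
```

Refusing to score a project with a missing component, instead of defaulting it to zero, keeps an evidence gap from reading as low risk.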
Physical capacity summaries also isolate active interconnection queue records explicitly tied to data-center, hyperscale, AI, or compute-campus load. The report separates direct data-center load requests from generation projects justified by data-center load growth, preserving queue ID, source URI, content hash, in-service date, customer, POI, and a short source excerpt for the top rows.
match_data_center_queues.py links those queue rows back to tracker-backed campus records when name, customer, county/state, and capacity evidence are strong enough. It writes a full pending-adjudication match audit to data/physical/queue_project_matches.csv and writes strong project-linked rows to data/physical/queues.csv for physical-risk scoring. Unmatched direct data-center load rows and explicitly data-center-driven supporting generation rows can also produce pending-adjudication data/physical/queue_projects.csv rows, with the official queue record as provenance, so real queue evidence is not discarded while waiting for tracker corroboration.
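The conservative linking rule described above (name, location, and capacity evidence must all agree before a row counts as a strong match) can be sketched like this. It is a hypothetical illustration, not the logic in match_data_center_queues.py: the record types, field names, thresholds, and the use of `difflib.SequenceMatcher` for name similarity are all assumptions.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class QueueRow:
    queue_id: str
    project_name: str
    state: str
    capacity_mw: float


@dataclass(frozen=True)
class TrackerProject:
    project_id: str
    name: str
    state: str
    capacity_mw: float


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_strong_match(row: QueueRow, project: TrackerProject,
                    min_name: float = 0.85,
                    capacity_tolerance: float = 0.25) -> bool:
    """Require state, name, and capacity to agree; any miss leaves the row pending."""
    if row.state.upper() != project.state.upper():
        return False
    if name_similarity(row.project_name, project.name) < min_name:
        return False
    if project.capacity_mw <= 0:
        return False
    gap = abs(row.capacity_mw - project.capacity_mw) / project.capacity_mw
    return gap <= capacity_tolerance


row = QueueRow("Q-001", "Example Campus Phase 1", "TX", 300.0)
project = TrackerProject("P-001", "example campus phase 1", "tx", 320.0)
print(is_strong_match(row, project))
```

Requiring every signal to pass, instead of summing them into one score, means a strong name match cannot compensate for a capacity or location mismatch.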
match_physical_records.py applies the same conservative source-linking pattern to EPA ICIS-Air permit rows and EPA/EIA generator records. It writes pending-adjudication audits to data/physical/permit_project_matches.csv and data/physical/equipment_project_matches.csv, then writes only strong project-linked rows to data/physical/permits.csv and data/physical/equipment.csv.
physical_risk_summary.py runs the project-level scoring path in parallel and writes data/reports/physical_risk_summary.json, including counts for assets with queue, permit, equipment, and observation evidence; source-backed queue capacity linked to projects; top blockers; and top risk projects.
Physical evidence can be loaded from a directory of CSVs:
just physical-evidence data/physical --as-of 2026-12-31
Expected files:
- projects.csv with one row per campus or physical asset
- queue_projects.csv for optional queue-derived direct-load and supporting-generation rows
- queues.csv for grid/interconnection records
- permits.csv for air, power, zoning, and construction permits
- equipment.csv for transformers, turbines, switchgear, cooling, and other long-lead equipment
- observations.csv for satellite or site-observed construction progress
Every row must include source_uri; the optional source_type, source_confidence, human_review_status (adjudication status), page_or_section, and content_hash fields feed the evidence gate.
The GPU depreciation, TAM sanity-check, capex payback, depreciation-to-EPS, and chip-supply modules are documented in docs/compute_economics_backlog.md. Public analyst threads and social posts are research leads only; production metrics must be re-sourced from filings, market snapshots, rental-rate data, or other auditable artifacts.
The implemented source-backed loader reads optional CSVs from data/compute/:
- compute_assets.csv
- gpu_price_observations.csv
- depreciation_policies.csv
- tam_claims.csv
- capex_payback_cases.csv
- eps_depreciation_impacts.csv
- chip_supply_observations.csv
Every compute row must include source_uri, retrieved_at, and content_hash; the optional source_type, source_confidence, human_review_status (adjudication status), and page_or_section fields feed the evidence gate. If no compute evidence is loaded, the final report keeps the compute-economics conclusion blocked rather than filling the gap with assumptions.
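The required-field rule above (source_uri, retrieved_at, and content_hash on every compute row) can be sketched as a small gate over CSV text. This is a minimal illustration, not the repo's loader: the function name `split_rows`, the `asset_id` column, and the sample rows are hypothetical.

```python
import csv
import io

# Required provenance fields, as stated in the text above.
REQUIRED_FIELDS = ("source_uri", "retrieved_at", "content_hash")


def split_rows(csv_text: str, required: tuple[str, ...] = REQUIRED_FIELDS):
    """Return (accepted_rows, rejected) where rejected pairs a line number
    with the list of required fields that were missing or blank."""
    reader = csv.DictReader(io.StringIO(csv_text))
    accepted, rejected = [], []
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        missing = [name for name in required if not (row.get(name) or "").strip()]
        if missing:
            rejected.append((line_no, missing))
        else:
            accepted.append(row)
    return accepted, rejected


# Hypothetical sample: the second data row has no source_uri and is rejected.
SAMPLE = """asset_id,source_uri,retrieved_at,content_hash
a1,https://example.com/filing,2026-01-15T00:00:00Z,abc123
a2,,2026-01-15T00:00:00Z,def456
"""

accepted, rejected = split_rows(SAMPLE)
print(len(accepted), rejected)
```

Returning the rejected rows with their line numbers, instead of silently dropping them, keeps the gap visible in the same way the report keeps an unsupported conclusion blocked.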
Acquired EDGAR documents can be mined for conservative compute economics rows:
just compute-economics --inventory data/edgar_acquisition/edgar_document_inventory.csv --output-dir data/compute --workers 32
Public GPU rental pricing snapshots can be acquired as source-backed market observations:
just gpu-pricing --output-dir data/compute --workers 8 --other-domain-concurrency 4 --other-requests-per-second 8
This writes raw HTML artifacts under data/compute/raw_gpu_pricing/,
gpu_price_source_artifacts.csv, normalized gpu_price_observations.csv, and
gpu_pricing_acquisition.summary.json. It uses bounded workers, per-domain
throttling, retries, and resume mode so repeated runs parse existing raw
artifacts without refetching unless --no-resume is set.
The current deterministic extractor writes depreciation policy rows and chip/supply commitment observations only when the source filing explicitly states the fact. It does not infer GPU prices, utilization, payback, or EPS impact. The worker count only controls local parsing of already acquired documents; it does not increase SEC request rates.
Before claiming source coverage, build an auditable backlog of SEC filings and exhibits to parse:
export EDGAR_IDENTITY="Your Name your.email@example.com"
just edgar-manifest --all-public --since 2024-01-01 --max-filings-per-cik 120 --include-exhibits --max-workers 32 --sec-domain-concurrency 8 --sec-requests-per-second 8
This writes a timestamped manifest under data/manifests/ with one row per filing candidate, including:
- normalized CIK, accession number, filing/report dates, form, item numbers, and primary document URL
- optional EX-2, EX-4, EX-10, and EX-99 document URLs from SEC archive filing indexes when --include-exhibits is used
- SEC submissions source URI and content hash provenance
- Burry relevance score for 10-K, 10-Q, 8-K material agreements/debt items, S-1/S-3/424B financing filings, and keyword hits such as credit agreements, guarantees, leases, PPAs, project finance, data centers, AI infrastructure, and SPVs
The manifest is an acquisition backlog, not extracted evidence. It tells the system which filings should be parsed next and quantifies source coverage gaps before any ecosystem-scale claim can be upgraded.
Download the prioritized source documents and emit pending-adjudication deal candidates:
export EDGAR_IDENTITY="Your Name your.email@example.com"
just edgar-acquire data/manifests/edgar_filing_manifest_YYYYMMDD-HHMMSS.csv --output-dir data/edgar_acquisition --max-workers 32 --sec-domain-concurrency 8 --sec-requests-per-second 8
This stores raw EDGAR documents under data/edgar_acquisition/documents/, writes edgar_document_inventory.csv with source URI, retrieval timestamp, accession/document id, byte count, and content hash, and writes a capital-loader-compatible deals.csv with extracted pending-adjudication rows. The output directory can be passed directly to:
just capital-evidence data/edgar_acquisition
The EDGAR commands use a global worker pool for local parsing/resume throughput while the per-domain limiter keeps sec.gov requests bounded. Increase --max-workers for local CPU-heavy parsing, but keep --sec-domain-concurrency and --sec-requests-per-second at or below the SEC fair-access lane.
Delta EDGAR acquisitions merge into the existing inventory and deal CSVs by
default so a small daily manifest does not replace the larger acquired corpus.
Use --overwrite only for an intentional full rebuild.
EDGAR candidate extraction uses context-supported deal notional, not the largest dollar number in a filing. Corporate scale metrics such as AUM, remaining performance obligations, generic investment commitments, fundraising commitments, and outstanding balance-sheet totals are rejected as deal notional unless later adjudication overrides them.
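The idea of context-supported notional can be sketched as a classifier over the text surrounding a dollar amount. This is a hypothetical illustration, not the extractor's actual rules: the term lists, the label strings, and the function name `classify_notional_context` are assumptions. It shows only the ordering the text describes, where a scale-metric context rejects a number even when deal vocabulary is nearby.

```python
import re

# Hypothetical term lists; the real extractor's vocabulary is not shown in this README.
SCALE_METRIC_TERMS = (
    "assets under management",
    "aum",
    "remaining performance obligations",
    "fundraising commitments",
)
DEAL_TERMS = ("credit agreement", "term loan", "lease", "facility", "notes due")


def _compile(terms: tuple[str, ...]) -> re.Pattern[str]:
    """Whole-word, case-insensitive pattern matching any of the terms."""
    body = "|".join(re.escape(term) for term in terms)
    return re.compile(rf"\b(?:{body})\b", re.IGNORECASE)


_SCALE_RE = _compile(SCALE_METRIC_TERMS)
_DEAL_RE = _compile(DEAL_TERMS)


def classify_notional_context(snippet: str) -> str:
    """Label the context around a dollar amount; scale metrics are rejected first."""
    if _SCALE_RE.search(snippet):
        return "rejected_scale_metric"
    if _DEAL_RE.search(snippet):
        return "deal_notional_candidate"
    return "unsupported"


for text in (
    "The Company had $900 billion of assets under management.",
    "The borrower entered into a $2.0 billion term loan facility.",
    "The Company paid $5 million during the quarter.",
):
    print(classify_notional_context(text), "|", text)
```

Anything labeled a candidate here would still be pending adjudication; the classifier narrows what is eligible and does not confirm a deal.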
Production source data is guarded by an invariant: source rows and deal nodes must be backed by an actual source URI and cannot use inferred provenance.
Coverage can be measured at any point:
just source-coverage --data-dir data
The coverage report counts filings, entities, raw source documents, projects, queue records, permits, ownership records, tracker records, PPAs, lease agreements, extracted deals, and source-backed deals.
The curated public CIK watchlist is only the first layer. Build a source-backed entity universe from acquired PPAs, EDGAR deal candidates, tracker projects, permits, and equipment rows, then map public-company names to SEC CIKs using the SEC company ticker reference:
export EDGAR_IDENTITY="Your Name your.email@example.com"
just entity-universe --data-dir data --output-dir data/entity_universe
This writes entity_mentions.csv, entities.csv, and expanded_edgar_ciks.csv. Rows preserve source URI, retrieval timestamp, content hash, document id, and record index so expanded CIKs remain traceable to real corpus evidence. Use the expanded CIK CSV as the next input for larger EDGAR manifest runs after adjudicating the highest-impact matches.
just edgar-manifest --all-public --cik-csv data/entity_universe/expanded_edgar_ciks.csv --since 2024-01-01 --include-exhibits --max-workers 32
When a broad primary-document manifest already exists, build a focused exhibit-only follow-on without refetching SEC submissions JSON:
just edgar-exhibit-manifest data/manifests/edgar_filing_manifest_YYYYMMDD-HHMMSS.csv --min-parent-relevance-score 120 --exhibit-index-workers 64 --sec-domain-concurrency 8 --sec-requests-per-second 8
This reads the existing manifest, fetches SEC archive directory indexes only for
selected high-signal parent filings, and writes
data/manifests/edgar_exhibit_manifest_YYYYMMDD-HHMMSS.csv. Use the output with
just edgar-acquire to download EX-10, EX-4, EX-2, and EX-99 contract-level
documents into the same source-backed EDGAR acquisition corpus. The EDGAR
acquirer writes both deals.csv and tranches.csv when source text supports
tranche-level debt/bond terms, and enriches deal rows with collateral snippets,
guarantors, guarantee-scope snippets, SPV/non-recourse flags, source URI,
content hash, accession context, pending-adjudication status, and deterministic
notional scope tags (notional_context_kind, notional_commitment_scope,
notional_non_specific_obligation). A single
debt/security document can emit multiple tranches.csv rows when explicit
source text names separate term loan, revolver, or note-series amounts;
otherwise the extractor falls back to one primary tranche candidate. Tranche
rows also preserve guarantee_description when source prose supports guarantee
scope beyond a simple "as Guarantor" role label.
For non-EDGAR sources, use a real source catalog:
just source-catalog --output data/source_catalogs/source_catalog.csv
just source-acquire data/source_catalogs/source_catalog.csv --output-dir data/source_acquisition
Minimum catalog columns are source_id, corpus, and source_uri. Supported corpus values include filings, source_documents, projects, queue_records, permit_records, equipment_records, construction_observations, ppas, lease_agreements, ownership_records, tracker_records, and extracted_deals. Optional columns include source_type, parser (auto, csv, json, xml, zip, xlsx, or text), document_id, entity_id, project_id, filing_accession, and meta_* columns. Acquisition writes raw artifacts, source_artifact_inventory.csv, and normalized source_rows/<corpus>.csv files with retrieval timestamp, source URI, content hash, local path, and record index.
just source-catalog writes SEC submissions targets from a vetted EDGAR watchlist, includes public CAISO, NYISO, MISO, PJM, and SPP interconnection queue targets, includes EPA eGRID plant/unit/generator data, includes EPA ICIS-Air facility/program permit records, includes the Server Country data-center project tracker, and can append validated curated catalogs for ISO queues, permits, PPAs, leases, ownership records, tracker rows, and extracted deal feeds:
just source-catalog --curated-catalog data/curated/iso_queues.csv --curated-catalog data/curated/permits.csvLive public source listings can be resolved into concrete artifact URLs at catalog-build time. For example, ERCOT's GIS report listing is resolved to the latest primary GIS workbook, ISO-NE's public queue page is resolved to the current Excel export, EIA's 860M page is resolved to the latest downloadable generator inventory workbook, FERC's Market-Based Rate Entities to PPAs data set is resolved into paged API acquisition rows, FracTracker's ArcGIS data-center tracker is resolved into paged feature-layer queries, and GLEIF's Level 2 relationship API is resolved to the latest who owns whom relationship-record CDF archive. The source-list URL, document id or release date, data-set timestamp, API page boundaries, and workbook/archive filename are retained in metadata:
    just source-catalog --resolve-dynamic-public-sources --output data/source_catalogs/source_catalog.csv

Or even simpler for day-to-day:

    just update-catalog

(This is the recommended "easy one command" to bring the acquisition targets current.)
Then typically:
    just source-acquire data/source_catalogs/source_catalog.csv

Coverage reporting separates queued catalog targets from acquired artifacts, so the report can say how many filings, entities, projects, source-backed deals, and source-backed contract tranches are actually covered while also showing how much acquisition work is waiting. Derived graph node/edge outputs are reported through graph summaries and are not folded back into raw source coverage counts.
Acquisition is parallel by default. source-acquire uses a bounded worker pool (--max-workers, default 64), per-domain concurrency gates, retries with exponential backoff, and resume mode so existing raw artifacts are parsed without redownloading. SEC-hosted URLs require EDGAR_IDENTITY and are capped below the SEC's published 10 requests/second fair-access limit by default (--sec-requests-per-second 8; see SEC Developer Resources: https://www.sec.gov/about/developer-resources).
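The acquisition controls described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the source-acquire implementation: `PER_DOMAIN_LIMIT`, `MAX_RETRIES`, and every function name are hypothetical, and a deliberately flaky stand-in replaces real HTTP so the block runs offline.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

MAX_WORKERS = 64       # matches the documented --max-workers default
PER_DOMAIN_LIMIT = 4   # hypothetical; the README does not state a value
MAX_RETRIES = 3        # hypothetical

_gates: dict[str, threading.Semaphore] = {}
_gates_lock = threading.Lock()


def gate_for(url: str) -> threading.Semaphore:
    """Return the shared concurrency gate for the URL's domain."""
    domain = urlparse(url).netloc
    with _gates_lock:
        return _gates.setdefault(domain, threading.Semaphore(PER_DOMAIN_LIMIT))


def fetch_with_retry(url, fetch, base_delay=0.01):
    """Call fetch(url) under the domain gate, backing off exponentially on OSError."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            with gate_for(url):
                return fetch(url)
        except OSError:
            if attempt == MAX_RETRIES:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x the base delay


def acquire_all(urls, fetch):
    """Fetch every URL through a bounded pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(lambda u: fetch_with_retry(u, fetch), urls))


# Stand-in fetcher: fails once per URL, then succeeds (no network involved).
calls: dict[str, int] = {}
calls_lock = threading.Lock()


def flaky_fetch(url):
    with calls_lock:
        calls[url] = calls.get(url, 0) + 1
        first_attempt = calls[url] == 1
    if first_attempt:
        raise OSError("transient failure")
    return f"body of {url}"


urls = [f"https://example.org/doc/{i}" for i in range(10)]
results = acquire_all(urls, flaky_fetch)
print(len(results), "artifacts fetched")
```

The gate is released while the worker sleeps between attempts, so a failing URL does not hold a domain slot during backoff. A request-rate cap such as the SEC limit would need a separate token-bucket or interval limiter on top of this.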
Acquisition summary JSONs persist the actual worker count, SEC/non-SEC request-rate settings, per-domain concurrency settings, retry count, and resume status used for each run.
Long EDGAR exhibit-manifest and document-acquisition runs accept --progress-interval N to emit machine-readable progress events every N completed parent indexes or documents.
The current operational corpus snapshot is tracked in docs/acquisition_status.md.
Update that file after material acquisition, extraction, adjudication-queue, timing, or
report refreshes so the docs stay tied to measured source coverage instead of
stale ambition.
source-invariants audits production CSV outputs for blocked seed/demo/mock/placeholder source URIs and missing direct-acquisition provenance. It writes data/reports/source_invariant_audit.json; use --fail-on-violation in CI or before publishing a report.
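As a rough illustration of this kind of audit (not the actual source-invariants rules, which are not shown here), the sketch below flags rows whose source_uri contains a blocked marker or whose provenance fields are empty. The substring match and the choice of `PROVENANCE_FIELDS` are assumptions.

```python
import json

# Marker words taken from the sentence above; substring matching is an assumption.
BLOCKED_MARKERS = ("seed", "demo", "mock", "placeholder")
# Assumption: these two fields stand in for "direct-acquisition provenance".
PROVENANCE_FIELDS = ("source_uri", "content_hash")


def audit_rows(rows):
    """Return one violation record per blocked URI or missing provenance field."""
    violations = []
    for index, row in enumerate(rows):
        uri = (row.get("source_uri") or "").lower()
        hit = next((m for m in BLOCKED_MARKERS if m in uri), None)
        if hit:
            violations.append({"row": index, "reason": f"blocked marker '{hit}' in source_uri"})
        for field in PROVENANCE_FIELDS:
            if not (row.get(field) or "").strip():
                violations.append({"row": index, "reason": f"missing {field}"})
    return violations


# Illustrative rows only; none of these URIs point at real artifacts.
rows = [
    {"source_uri": "https://example.org/filing.htm", "content_hash": "abc123"},
    {"source_uri": "file://mock/deals.csv", "content_hash": "def456"},
    {"source_uri": "https://example.org/queue.xlsx", "content_hash": ""},
]
violations = audit_rows(rows)
report = {"violation_count": len(violations), "violations": violations}
print(json.dumps(report, indent=2))
```

A plain substring match will over-flag legitimate URIs that happen to contain a marker word, so a real rule set would likely match on path segments or an explicit blocklist instead.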
Local extraction is parallel by default where rows can be normalized independently. ppa-deals and tracker-projects both accept --max-workers (default 32), preserve source-row order in their outputs, and report the worker count used in their summaries.
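One way to get parallel normalization that still preserves source-row order is `Executor.map`, which yields results in input order regardless of completion order. The sketch below assumes a trivial, hypothetical `normalize_row`; the real normalizers are not shown in this README.

```python
from concurrent.futures import ThreadPoolExecutor


def normalize_row(row: dict) -> dict:
    # Hypothetical normalization: trim whitespace and lowercase the column names.
    return {key.strip().lower(): value.strip() for key, value in row.items()}


def normalize_all(rows, max_workers=32):
    """Normalize rows in parallel; output order matches input order."""
    workers = max(1, min(max_workers, len(rows)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        normalized = list(pool.map(normalize_row, rows))
    # The summary records the worker count actually used for the run.
    return normalized, {"rows": len(rows), "max_workers": workers}


raw_rows = [{" Project ": f" site-{i} "} for i in range(5)]
normalized, summary = normalize_all(raw_rows)
print(normalized[0], summary)
```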
Entity-level Burry reports now include capital_structure metrics computed from extracted Deal records:
- debt-like notional exposure
- off-balance-sheet, SPV-linked, non-recourse, or guarantee-linked exposure
- separate guarantee-linked and SPV/non-recourse exposure subtotals
- LLM-adjudicated exposure reported separately from pending-adjudication candidate exposure
- a high-notional adjudication queue for unapproved deterministic extraction candidates
- distinct candidate exposure after economic deduplication, plus duplicate candidate groups
- aggregate-obligation exposure separated from individual deal records
- quarterly refinancing walls
- near-term refinancing exposure
- top counterparty concentration
- downside bearers by role, including lenders, lessors, insurers, guarantors, noteholders, and bondholders
These metrics are evidence-gated from deal and tranche provenance. If extracted deals are sparse or unsupported, the report will say so instead of converting the gap into a false high-confidence conclusion.
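Two of the metrics above, quarterly refinancing walls and near-term refinancing exposure, reduce to bucketing maturities by quarter. The sketch below is a simplified illustration: the field names `maturity_date` and `notional_usd` are hypothetical, and the dates mirror the --as-of and --near-term-end values used in this README's capital-evidence example.

```python
from collections import defaultdict
from datetime import date


def quarter_label(d: date) -> str:
    return f"{d.year}Q{(d.month - 1) // 3 + 1}"


def refinancing_wall(deals, as_of: date, near_term_end: date):
    """Sum notional by maturity quarter and total what falls due by near_term_end."""
    wall = defaultdict(float)
    near_term = 0.0
    for deal in deals:
        maturity = deal["maturity_date"]
        if maturity < as_of:
            continue  # already matured before the as-of date
        wall[quarter_label(maturity)] += deal["notional_usd"]
        if maturity <= near_term_end:
            near_term += deal["notional_usd"]
    return dict(sorted(wall.items())), near_term


# Illustrative figures only, not extracted data.
deals = [
    {"maturity_date": date(2027, 3, 15), "notional_usd": 500.0},
    {"maturity_date": date(2027, 2, 1), "notional_usd": 250.0},
    {"maturity_date": date(2030, 6, 30), "notional_usd": 1000.0},
    {"maturity_date": date(2026, 6, 1), "notional_usd": 75.0},
]
wall, near_term = refinancing_wall(deals, date(2026, 12, 31), date(2029, 12, 31))
print(wall, near_term)
```

A real rollup would also carry each deal's provenance and adjudication status so that unsupported rows can be gated out rather than summed.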
The final report applies a deterministic AI/data-center ecosystem scope gate
before headline capital and debt-service metrics are calculated. It keeps the
raw acquired corpus intact, but excludes debt and lease rows from headline math
unless the row is tied to a core AI/data-center operator or contains explicit
AI/data-center deal evidence. Broader hyperscaler, utility, supplier, and
financier rows remain visible as balance-sheet context, but do not drive the
headline wall without direct evidence. The report writes capital_scope so
excluded rows, context rows, excluded debt-like notional, and inclusion reasons
remain auditable.
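The gate logic described above might look roughly like this. It is a sketch under loud assumptions: the operator watchlist, the keyword list, the row fields, and the key names inside the returned dictionary are all hypothetical, and only the two inclusion criteria come from the text.

```python
# Hypothetical watchlist and keywords; the real gate's inputs are not shown here.
CORE_OPERATORS = {"ExampleCompute Inc"}
AI_KEYWORDS = ("data center", "gpu cluster")


def classify(row):
    """Return (bucket, reason) for one debt or lease row."""
    if row["entity"] in CORE_OPERATORS:
        return "headline", "core_operator"
    text = row.get("description", "").lower()
    for keyword in AI_KEYWORDS:
        if keyword in text:
            return "headline", f"explicit_deal_evidence:{keyword}"
    return "context", "no_direct_ai_datacenter_evidence"


def build_capital_scope(rows):
    """Split notional into headline and excluded totals, keeping every decision."""
    scope = {"headline_notional": 0.0, "excluded_debt_like_notional": 0.0, "decisions": []}
    for row in rows:
        bucket, reason = classify(row)
        key = "headline_notional" if bucket == "headline" else "excluded_debt_like_notional"
        scope[key] += row["notional_usd"]
        scope["decisions"].append({"deal_id": row["deal_id"], "bucket": bucket, "reason": reason})
    return scope


# Illustrative rows only.
rows = [
    {"deal_id": "d1", "entity": "ExampleCompute Inc", "description": "senior notes", "notional_usd": 400.0},
    {"deal_id": "d2", "entity": "Example Utility Co", "description": "Term loan for data center campus substation", "notional_usd": 150.0},
    {"deal_id": "d3", "entity": "Example Utility Co", "description": "general corporate revolver", "notional_usd": 900.0},
]
capital_scope = build_capital_scope(rows)
print(capital_scope["headline_notional"], capital_scope["excluded_debt_like_notional"])
```

Recording a reason for every row, included or not, is what keeps the exclusion auditable: the raw rows stay intact and only the headline math is gated.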
Debt-service output also separates raw extracted obligations from distinct candidate economic obligations. Distinct rollups collapse repeated SEC rows from the same accession/entity/notional group, preserve the duplicate candidate groups, and keep raw obligations visible for audit before any refinancing-wall or crack-timing conclusion is treated as high confidence.
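A minimal sketch of the distinct rollup, assuming the grouping key is exactly (accession, entity, notional) as the paragraph suggests. `filing_accession` and `entity_id` are column names mentioned in this README; `notional_usd` and the function name are hypothetical.

```python
from collections import defaultdict


def distinct_obligations(rows):
    """Collapse repeated rows per (accession, entity, notional); keep duplicate groups."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["filing_accession"], row["entity_id"], row["notional_usd"])
        groups[key].append(row)
    distinct = [members[0] for members in groups.values()]
    duplicate_groups = {key: members for key, members in groups.items() if len(members) > 1}
    return distinct, duplicate_groups


# Illustrative rows; the accession numbers are placeholders, not real filings.
raw = [
    {"filing_accession": "0000000000-26-000001", "entity_id": "e1", "notional_usd": 500.0},
    {"filing_accession": "0000000000-26-000001", "entity_id": "e1", "notional_usd": 500.0},
    {"filing_accession": "0000000000-26-000002", "entity_id": "e1", "notional_usd": 500.0},
]
distinct, duplicate_groups = distinct_obligations(raw)
print(sum(r["notional_usd"] for r in raw), sum(r["notional_usd"] for r in distinct))
```

Here the raw total is 1500.0 while the distinct total is 1000.0; both stay visible, and the gap between them is itself an audit signal.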
Capital evidence can be loaded from a directory of CSVs:
    just capital-evidence data/capital --as-of 2026-12-31 --near-term-end 2029-12-31

The extracted deal rows can also be compiled into a source-backed capital
exposure graph. This writes entity nodes, counterparty edges, and a summary of
top obligors, risk bearers, exposure edges, connected components, unmapped
high-notional deals, skipped generic counterparties, source URIs, and
adjudication status. The summary also separates direct AI/data-center keyword edges and
watchlist-entity edges from the broader acquired capital network, so unrelated
corporate financing does not silently become an AI-infrastructure conclusion.
It also writes capital_contract_nodes.csv and capital_contract_edges.csv,
which preserve source-backed deal, tranche, collateral, guarantor, project,
asset, non-recourse, and bankruptcy-remote/SPV structure for deeper contagion
mapping:
    just capital-exposure-graph --data-dir data --output-dir data/graph

The contract-structure graph can then be joined to the source-backed ownership
graph to create a conservative contract/ownership contagion path artifact. The
join is currently exact legal-name matching only; unmatched high-impact
guarantee and collateral paths are still retained as contract-only adjudication items.
Non-specific aggregate/shelf disclosure rows are flagged in the contract graph
and excluded from high-impact contagion path generation until they are tied to
specific committed obligations and counterparties.
Outputs are written to data/reports/contract_contagion_paths.csv and
data/reports/contract_contagion_summary.json, with SEC/GLEIF source URIs,
content hashes, adjudication statuses, notional exposure, ownership path depth, and
risk flags preserved on each row:
    just contract-contagion-paths --data-dir data --output-dir data/reports --max-paths 50000

The default path cap is 50,000 so broad contract graphs are not silently clipped at the first 10,000 source-backed paths.
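The conservative join described above (exact legal-name matching, with unmatched paths retained as contract-only items) can be illustrated as follows. The edge fields, the parent lookup, and the function name are hypothetical; the sketch only shows why an exact match is conservative.

```python
def join_contract_to_ownership(contract_edges, ownership_parents):
    """Join on exact legal name; unmatched edges are kept for adjudication."""
    paths, contract_only = [], []
    for edge in contract_edges:
        parent = ownership_parents.get(edge["guarantor_name"])  # exact string match only
        if parent is not None:
            paths.append({**edge, "parent_name": parent})
        else:
            contract_only.append(edge)
    return paths, contract_only


# Illustrative names only.
ownership_parents = {"Example Holdings LLC": "Example Parent Corp"}
contract_edges = [
    {"deal_id": "d1", "guarantor_name": "Example Holdings LLC"},
    {"deal_id": "d2", "guarantor_name": "Example Holdings, LLC"},  # punctuation differs
]
paths, contract_only = join_contract_to_ownership(contract_edges, ownership_parents)
print(len(paths), "joined;", len(contract_only), "contract-only")
```

The second edge fails to join because of a single comma. Exact matching accepts that missed link in exchange for never asserting a false ownership path, and the unmatched edge survives as an adjudication item rather than being dropped.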
Acquired ownership records can also be compiled into a source-backed legal entity ownership and consolidation graph. The graph currently targets GLEIF relationship records and preserves source URI, retrieval timestamp, content hash, local raw artifact path, source record index, document id, relationship status, relationship type, validation source, and quantifier fields. It writes nodes, edges, and a rollup summary used by the final evidence-gated report:
    just ownership-graph --data-dir data --output-dir data/graph

Weak-link scoring combines the capital exposure graph with source-backed physical execution risk to create a ranked triage list for the report:

    just weak-links --data-dir data --output-dir data/reports

The report-level adjudication queue combines the highest-impact pending items across
capital extraction, contract-tranche extraction, weak-link scoring, physical
match audits, contract/ownership contagion paths, and compute economics rows.
It writes data/reports/review_queue.csv and
data/reports/review_queue_summary.json; every item keeps source URI, content
hash, page or section, source confidence, adjudication status, ecosystem
relevance tags, and a legacy review-group id for duplicate-aware triage. The summary
separates raw pending capital notional from adjudication-grouped notional, separately
tracks pending contract-tranche adjudication notional, separately tracks pending
contagion-path exposure, and breaks out the
AI-infrastructure-relevant subset so broad corporate financing does not silently
dominate the Burry worklist:
    just review-queue --data-dir data --output-dir data/reports

All review/adjudication statuses are cleared by automated LLM adjudication.
Legacy columns named human_review_status are treated as adjudication-status
fields; there is no required operator gate in this workflow.

The broad queue can then be collapsed into a materiality-first LLM adjudication
packet set. This deduplicates repeated review groups, ranks the top blockers by
priority, exposure, AI/data-center relevance, and risk score, attaches local
source snippets where the acquired artifact is available, and emits explicit
decision questions/fields for automated adjudication. It writes
data/reports/materiality_adjudication_packets.csv and
data/reports/materiality_adjudication_summary.json:
    just materiality-adjudication --data-dir data --output-dir data/reports --limit 250 --snippets-per-packet 3

The packet set can then be adjudicated into conservative decision rows. This
does not convert unresolved rows into final metrics; it separates
source-supported blockers from rows that still need deeper extraction, quote
retrieval, duplicate/aggregate splitting, counterparty-role extraction,
collateral/guarantee scoping, or explicit rate/maturity evidence. It writes
data/reports/materiality_adjudication_decisions.csv and
data/reports/materiality_adjudication_decision_summary.json:
    just materiality-adjudication-decisions --data-dir data --output-dir data/reports

The crack-window timing layer then combines source-backed capital and tranche
maturities, physical COD/queue dates, EPS depreciation timing, and chip delivery
windows into a quarter stress calendar. It writes
data/reports/timing_signals.csv, data/reports/timing_signal_quarters.csv,
and data/reports/timing_signal_summary.json; every signal requires source URI
and content hash provenance and is treated as a candidate timing indicator until
LLM-adjudicated:
    just timing-signals --data-dir data --output-dir data/reports

Expected files in the capital-evidence directory (for example, data/capital):
- deals.csv with one row per lease, debt facility, bond, PPA, guarantee, or other contract
- tranches.csv with optional tranche-level debt/bond terms linked by deal_id; a source document may contribute multiple tranche rows when separate facility/series terms are explicit; guarantee_description carries source-backed guarantee-scope context when extracted
Every row must include source_uri. The optional source_type, source_confidence, human_review_status (a legacy column name that holds the adjudication status), page_or_section, and content_hash fields feed the evidence gate. Encode counterparty_roles and key_terms as JSON objects so roles, guarantees, SPVs, and lease classification flags remain structured.
For EDGAR rows, key_terms should preserve notional scope fields so aggregate
snapshots, shelf capacity disclosures, and specific committed transactions can
be separated deterministically during adjudication.
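To make the JSON-in-CSV convention concrete, here is a small round trip for one deals.csv row. It is a sketch only: the keys inside `counterparty_roles` and `key_terms` (including `notional_scope` and its value) are hypothetical examples rather than the repo's schema, and the URI is a placeholder.

```python
import csv
import io
import json

FIELDS = ["deal_id", "source_uri", "counterparty_roles", "key_terms"]

# Hypothetical structured values; the real key vocabulary is not shown here.
counterparty_roles = {"lender": "Example Bank NA", "guarantor": "Example Parent Corp"}
key_terms = {"notional_scope": "specific_committed_transaction", "spv": True, "non_recourse": True}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow({
    "deal_id": "d-001",
    "source_uri": "https://example.org/filing.htm",  # placeholder, not a real filing
    "counterparty_roles": json.dumps(counterparty_roles),
    "key_terms": json.dumps(key_terms),
})

# Read the row back; the csv module handles quoting of the embedded JSON.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
loaded_roles = json.loads(loaded["counterparty_roles"])
loaded_terms = json.loads(loaded["key_terms"])
print(loaded_roles["guarantor"], loaded_terms["notional_scope"])
```

Keeping these cells as JSON objects means a flag such as `spv` survives as a boolean instead of free text that a later stage has to re-parse.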
Acquired FERC PPA rows can be normalized into data/capital/deals.csv without inferring dollar exposure:
    just ppa-deals --input data/source_acquisition/source_rows/ppas.csv --output data/capital/deals.csv

Acquired project tracker rows can be normalized into source-backed physical project evidence:

    just tracker-projects --input data/source_acquisition/source_rows/tracker_records.csv --output data/physical/projects.csv

This preserves the tracker source URI, raw artifact hash, local raw artifact path, and record index while carrying reported capacity ranges, investment, status, location, owner/operator/tenant fields, and source confidence into projects.csv.
This is an evidence-gated prototype, not a completed Burry-grade system. Current ecosystem-scale reports are treated as directional hypotheses until the evidence gate can prove the key claims with measured, corroborated, and LLM-adjudicated sources.
The report generator now caps confidence when a major claim is inferred or unsupported. This is intentional: the system should be skeptical of its own outputs before it is skeptical of the market narrative.
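The capping rule can be pictured as taking the minimum of the raw score and the weakest support level among major claims. The sketch below is illustrative only: the support labels and every numeric cap are hypothetical, not the report generator's actual thresholds.

```python
# Hypothetical caps per support level; the real thresholds are not shown here.
SUPPORT_CAPS = {"measured": 1.0, "corroborated": 0.9, "inferred": 0.5, "unsupported": 0.25}


def capped_confidence(raw: float, claims) -> float:
    """Cap raw confidence at the weakest support level among major claims."""
    cap = min(
        (SUPPORT_CAPS[claim["support"]] for claim in claims if claim["major"]),
        default=1.0,  # no major claims means no cap is applied
    )
    return min(raw, cap)


claims = [
    {"major": True, "support": "measured"},
    {"major": True, "support": "inferred"},
    {"major": False, "support": "unsupported"},  # minor claims do not cap
]
print(capped_confidence(0.9, claims))
```

With one major claim only inferred, a raw 0.9 is reported as 0.5. The minor unsupported claim is ignored, which matches the text's focus on major claims.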
This is designed to become a live, continuously evolving forensic instrument (daily delta EDGAR, weekly deep re-validation, event-driven scenario re-runs).
See justfile for the full command surface and the implementation plan for the complete roadmap (Phases 0–6).
Open research — every claim carries provenance, everything publishes by default (including mistakes and corrections; that's part of the discipline). All scraping respects robots.txt and published rate limits. FOIA and regulatory requests must go through proper legal channels. No unauthorized access to private data rooms or systems. Nothing here is investment advice.
Built with extreme prejudice toward hidden risk and optimistic narrative.

