CostDNA

The 40–60% of your AWS bill that's untagged, attributed.

CostDNA infers cloud-resource ownership from CloudTrail behaviour and writes the tags back. Your existing FinOps tool — CloudHealth, Vantage, Datadog CCM, Kubecost — suddenly explains 95% of spend instead of 50%.

▶ Try on your AWS bill · Demo · Pricing · Trust · Security

40–60%
Untagged AWS spend on a typical account

13 / 15
Per-resource accuracy on a real labeled AWS environment

90 sec
From dropping your CUR to a per-team breakdown

0 bytes
Customer data that leaves the account

What CostDNA is

A behavioural Graph Neural Network for cloud-resource attribution. Given an AWS account with mostly-untagged resources, CostDNA infers which team owns each resource based on:

CloudTrail behaviour — who calls the API, with what verb mix, at what times
IAM access patterns — which roles touch which resources
VPC network topology — which resources talk to each other
Cost-time-series shape — bursty training vs. flat services vs. periodic batch

The inferred attributions are written back as AWS tags. Every existing FinOps tool then suddenly sees 95% of spend instead of 50%.

Self-hosted, MIT-licensed, no data leaves your account.

Who this is for

Buyer	Pain
Cloud platform / SRE team	You own the AWS bill but can't say who spent what — half the line items are untagged or mis-tagged, and the CFO keeps asking
FinOps engineer	Your tag-enforcement policy catches new resources, but the 5-year backlog of untagged production workload stays a black box
Engineering leader	Per-team chargeback is impossible at your current tag coverage — you can't budget by team if 50% of spend is "untagged"

Compared to existing FinOps tools

Tool	Mechanism	Scope	Untagged-resource handling
AWS Cost Allocation Tags	Reads existing tags	Tagged only (40–60%)	None
AWS Cost Categories	Manual rules	Whatever you wrote	Manual rule per pattern
Kubecost	k8s pod/namespace	Containers only	Out of scope
CloudHealth, Vantage, Apptio	Tags + manual rules	Tagged + rule-matched	Tag-based blind spot
Datadog CCM	Tags + APM correlation	Tagged + instrumented	Limited
CostDNA	Behavioural GNN on CloudTrail + IAM + cost shape	All AWS resources emitting CloudTrail	Inferred with calibrated confidence; written back as tags

CostDNA is the input layer that lets the tool you already pay for work on roughly 95% of your spend instead of 50%.

Why you can trust the inferred tags

Tagged spend is sacred — every FinOps conversation downstream is built on it. So we don't ship inferred tags without methodological rigor. The audit below is the proof.

The audit that turned a 97% headline into a 6.9% honest number

Before claiming any "inferred tags" accuracy number to a customer, the model has to be audited against datasets the community has actually published. The largest publicly available cloud trace is Microsoft Azure's 2.6M-VM Public Dataset. CostDNA hit 97% on 100-class attribution — a number too good to be true on a problem where state-of-the-art rarely beats 95% on much easier benchmarks. So I audited.

The pandas one-liner

# Is the deployment_id graph edge deterministic of the prediction target?
(df.groupby("deployment_id")["subscription_id"].nunique() == 1).mean()
# → 1.0

Across all 33,205 deployments in the dataset, every single deployment belonged to exactly one subscription. The deployment_id graph edge — which I was using as a structural signal — was a perfect lookup of the answer. LabelProp's "97%" was a graph-database join, not learning.

The fix

Remove the leaking edges. Re-run. GraphSAGE on 100 classes: 6.9%. That is still roughly 7× random (random is 1% at 100 classes) and still beats every feature-only baseline including node2vec, but it is a long way from 97%.

Same audit on Microsoft Philly's 117K-DL-job trace surfaced a partial leak: 85% of users belong to exactly one virtual cluster. user_id → vc was near-deterministic. With user edges removed: 15% (still 2× random).

The methodological claim

Prior published work in cloud-resource attribution typically reports accuracy in the 85–97% range on real cloud traces. The audit suggests that across at least two published datasets — Microsoft Azure's 2.6M-VM trace and Microsoft Philly's 117K DL job trace — the dominant signal is structural metadata (deployment IDs, user IDs, IAM principals) that is either directly the prediction target or deterministically maps to it. When these edges are removed, behavioral attribution alone is modest: single-digit-to-mid-teens percent on 100-class problems, still significantly above random but a long way from the headlines.

The field has been measuring leakage rather than learning. A two-line audit — df.groupby(edge)[target].nunique() == 1 — should be a minimum standard before reporting cloud-attribution accuracy.

Audit checklist for your own datasets

List every column / graph edge that isn't the prediction target
For each, run groupby(edge)[target].nunique() == 1 — if mean is 1.0, that edge deterministically encodes the label
For each, also flag a mean above 0.85: partial leaks (Philly's user_id → vc) inflate metrics too
Remove or down-weight any leaking edges
Re-run; report both the inflated and honest numbers; lead with the honest one
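
The checklist above can be sketched as a small pandas helper. This is an illustration under stated assumptions, not the code in costdna.audit: the function name `edge_leak_scores`, the toy column values, and the use of 0.85 as the partial-leak cutoff (taken from the checklist) are chosen for the example.

```python
import pandas as pd

def edge_leak_scores(df: pd.DataFrame, target: str, candidates: list[str]) -> dict[str, float]:
    """For each candidate column, return the share of its groups that map
    to exactly one target value (1.0 means the column encodes the label)."""
    return {
        col: float((df.groupby(col)[target].nunique() == 1).mean())
        for col in candidates
    }

# Toy data, made up for illustration: deployment_id fully determines the
# team, while region does not.
toy = pd.DataFrame({
    "deployment_id": ["d1", "d1", "d2", "d2", "d3", "d3"],
    "region":        ["us", "eu", "us", "eu", "us", "eu"],
    "team":          ["ml", "ml", "data", "data", "ml", "ml"],
})

scores = edge_leak_scores(toy, target="team", candidates=["deployment_id", "region"])
full_leaks = [col for col, s in scores.items() if s == 1.0]
partial_leaks = [col for col, s in scores.items() if 0.85 < s < 1.0]
print(scores, full_leaks, partial_leaks)
```

Any column that lands in `full_leaks` or `partial_leaks` is a candidate for removal or down-weighting before training.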

Primary results — Azure trace, post-audit

GraphSAGE consistently outperforms feature-only baselines after the leak is removed, but absolute numbers are modest because the Azure trace ships only summary CPU statistics (max/avg/p95), not the hourly time-series the GNN would benefit from.

N teams	Random	Majority	LogReg	k-NN	LabelProp	node2vec+LR	GraphSAGE
5	20%	17.7% ± 0.5%	33.3% ± 1.9%	31.2% ± 3.2%	19.1% ± 0.4%	33.3% ± 1.9%	38.0% ± 3.3%
10	10%	8.7% ± 0.4%	17.3% ± 1.4%	16.2% ± 1.3%	9.2% ± 0.6%	17.3% ± 1.4%	20.7% ± 1.0%
25	4%	_pending_¹	_pending_¹	_pending_¹	_pending_¹	_pending_¹	_pending_¹
100	1%	_pending_¹	_pending_¹	_pending_¹	_pending_¹	_pending_¹	_pending_¹

¹ Locally-staged Azure dataset at runs/azure-1/ has 10 subscriptions; the N=25 and N=100 cells stay pending until the full 100-subscription trace is restaged. Reproduction script for N=5 and N=10: scripts/bench-azure.py. Full writeup of the run + the second leak it caught in real time: docs/v2/azure-benchmark.md.

A second leak, caught in real time

The first pass of the re-run scored LabelProp at 98% — the same suspicious number that had originally triggered the audit. Running find_deterministic_edges() on the metadata before training surfaced the issue:

>>> from costdna.audit import find_deterministic_edges
>>> find_deterministic_edges(metadata, target_col="team",
...     candidate_edge_cols=["resource_type", "kind", "iam_role",
...                          "vpc_cidr", "created_at"])
{'vpc_cidr': 1.0, 'created_at': 0.8815545959284392}

vpc_cidr was 100% deterministic of subscription — a second leak, same pattern as the original deployment_id → subscription_id. With VPC edges excluded from the graph, behavioral attribution on Azure is modest but honest (GraphSAGE ~2× random at N=5–10). This is the methodology working in real time: the audit module that documents the original finding catches the same pattern on a different feature, on the same dataset, two months later.

Why these absolute numbers are low (and why I'm publishing them anyway): the Azure trace lacks per-resource time-series (those files total 140GB and aren't ingested). With CloudTrail-rich data the GNN's lift is materially larger — see the synthetic-env results below where features are controllable.

Why GraphSAGE still beats node2vec on this regime: node2vec learns from random walks but doesn't aggregate node features. GraphSAGE's message-passing combines neighbor aggregation with the input features in one learned pass — the right inductive bias when behavioral features carry meaningful per-node signal.
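
To make the contrast concrete, here is a minimal NumPy sketch of one GraphSAGE-style layer with a mean aggregator. It is not the model in src/costdna/model.py: the function name, the identity weight matrices, and the three-node toy graph are assumptions picked so the arithmetic can be followed by hand. In a trained model the two weight matrices are learned.

```python
import numpy as np

def sage_mean_layer(x, adj, w_self, w_neigh):
    """One GraphSAGE-style layer: combine each node's own features with the
    mean of its neighbors' features, then apply ReLU."""
    degree = adj.sum(axis=1, keepdims=True)
    neigh_mean = (adj @ x) / np.maximum(degree, 1.0)  # isolated nodes get zeros
    return np.maximum(x @ w_self + neigh_mean @ w_neigh, 0.0)

# Path graph 0 - 1 - 2, two features per node.
x = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)

# Identity weights keep the example traceable; real weights are learned.
h = sage_mean_layer(x, adj, w_self=np.eye(2), w_neigh=np.eye(2))
print(h)
```

Node 1 ends up with its own features plus the average of nodes 0 and 2, which is the "aggregate neighbors and input features in one pass" behaviour that a walk-based embedding alone does not provide.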

Why behavioral fingerprints work

Every team leaves the same fingerprint on every resource it owns:

Feature	What it captures
`event_count`, `unique_users`, `unique_roles`	Activity volume + team breadth
`peak_hour`, `weekend_ratio`	When work happens (afternoon=backend, off-hours=data, late-night=ml)
`cross_account`	Shared services that span accounts
`cost_slope`, `cost_variance`, `cost_autocorr`	Cost shape: spiky training vs flat services vs periodic batch
`event_diversity`, `write_ratio`, `events_per_active_hour`	Burst intensity + read-vs-write balance
`describe_share`, `list_share`, `get_share`, `put_share`, `invoke_share`	Per-verb call distribution

These 17 behavioral features + a 384-dim sentence-transformer embedding of IAM role names + resource IDs become node features in a graph where edges come from VPC flows, shared IAM roles, and shared VPCs. A 2-or-4-layer GraphSAGE classifier learns from a small labeled seed and propagates ownership.
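
As a rough illustration of how a few of these features could be derived from CloudTrail-like events, consider the sketch below. It is not the code in features.py: the column names (`resource_id`, `user`, `event_name`, `event_time`), the regex that decides what counts as a write, and the toy events are all assumptions.

```python
import pandas as pd

WRITE_VERBS = r"^(Put|Create|Delete|Update|Run)"  # assumption: what counts as a write

def fingerprint(events: pd.DataFrame) -> pd.DataFrame:
    """Compute a handful of per-resource behavioral features."""
    ts = pd.to_datetime(events["event_time"])
    enriched = events.assign(
        hour=ts.dt.hour,
        is_weekend=ts.dt.dayofweek >= 5,  # Saturday=5, Sunday=6
        is_write=events["event_name"].str.match(WRITE_VERBS),
    )
    by_resource = enriched.groupby("resource_id")
    return pd.DataFrame({
        "event_count": by_resource.size(),
        "unique_users": by_resource["user"].nunique(),
        "peak_hour": by_resource["hour"].agg(lambda h: h.mode().iloc[0]),
        "weekend_ratio": by_resource["is_weekend"].mean(),
        "write_ratio": by_resource["is_write"].mean(),
    })

# Toy events, made up for illustration (2024-01-06 is a Saturday).
events = pd.DataFrame({
    "resource_id": ["r1", "r1", "r1", "r2"],
    "user":        ["alice", "alice", "bob", "carol"],
    "event_name":  ["PutObject", "GetObject", "GetObject", "DescribeInstances"],
    "event_time":  ["2024-01-06 02:00", "2024-01-06 02:30",
                    "2024-01-08 14:00", "2024-01-09 15:00"],
})
fp = fingerprint(events)
print(fp)
```

Each row of `fp` is one resource's fingerprint; in the pipeline described above, vectors like this become node features alongside the role-name embedding.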

Why GraphSAGE specifically:

GCN, as typically trained on a single fixed graph, is transductive: it can't generalize to new resources without retraining. Bad for production.
GAT collapsed to random on label sets under 50 in our tests.
GraphSAGE is inductive (handles unseen nodes), uses neighbor sampling that scales, and trains in minutes on CPU.

Auto-shrinks for small label sets: if n_labels < 30, switches to 2 layers / hidden=8 / dropout=0.4 (vs default 4 layers / hidden=16). Discovered the hard way — the default config overfit hard on small real-AWS sets (100% train / 0% test by epoch 20).
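
The auto-shrink rule can be sketched as a small config selector. The numbers come from the paragraph above; the names `SageConfig` and `pick_config` are hypothetical, and the default dropout is left as `None` because it is not stated here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SageConfig:
    num_layers: int
    hidden: int
    dropout: Optional[float]  # None = default value not stated in this README

def pick_config(n_labels: int, small_threshold: int = 30) -> SageConfig:
    """Return a smaller, more regularized model when labels are scarce."""
    if n_labels < small_threshold:
        return SageConfig(num_layers=2, hidden=8, dropout=0.4)
    return SageConfig(num_layers=4, hidden=16, dropout=None)

small = pick_config(15)   # e.g. a 15-label account
large = pick_config(200)
print(small, large)
```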

Controlled experiment — synthetic env

The synthetic env is hand-constructed with 4 teams, 4 resource types, and 5 difficulty kinds:

Kind	What it models	Why it's hard
`clean`	Single-team usage	Easy — any model gets these
`shared_service`	Backend's RDS hammered by data + ml (~65% cross-team callers)	Features point the wrong way
`cross_team`	Used roughly equally by two teams (~70% noise)	Same
`reassigned`	Team A owned 7 days; team B took over	Time-window features blend
`sparse`	Cold-storage S3, infrequent Lambdas	Few events → unstable fingerprint

5-seed mean ± std, 70/30 stratified split:

Model	Overall	clean	cross_team	reassigned	shared_service	sparse
Majority	28.4% ± 4.2%	27% ± 7%	0% ± 0%	40% ± 49%	60% ± 49%	40% ± 49%
LogReg	92.6% ± 4.2%	97% ± 3%	20% ± 40%	80% ± 40%	100% ± 0%	100% ± 0%
k-NN(k=5)	80.0% ± 7.0%	87% ± 11%	0% ± 0%	80% ± 40%	60% ± 49%	80% ± 40%
LabelProp	97.9% ± 2.6%	100% ± 0%	60% ± 49%	100% ± 0%	100% ± 0%	100% ± 0%
node2vec+LR	92.6% ± 4.2%	97% ± 3%	20% ± 40%	80% ± 40%	100% ± 0%	100% ± 0%
GraphSAGE	90.5% ± 7.0%	95% ± 5%	40% ± 49%	100% ± 0%	80% ± 40%	80% ± 40%

The honest reading: GraphSAGE does not dominate node2vec+LR overall. The two are within noise of each other (90.5% ± 7.0% vs 92.6% ± 4.2%), with GraphSAGE about 2 points behind on average. GraphSAGE earns its complexity on the two hardest kinds: alongside LabelProp, it is the only model that reaches reassigned = 100% while scoring above 20% on cross_team (40%, vs 20% for LogReg and node2vec+LR). Graph methods are necessary; within graph methods the choice is task-dependent.

LabelProp's 97.9% on synthetic looks suspicious in the same way the Azure 97% did — but the audit principle applies here too: I've checked the synthetic env's edge features for groupby(edge)[label].nunique() == 1 and they don't trigger. The topology is informative but not deterministic.

Engineering pipeline validation — real AWS

Provisioned a labeled AWS environment via Terraform, ran per-team workload simulators on a 24/7 t3.micro EC2 (see terraform/simulator.tf) for 3 days to generate authentic CloudTrail signal, then ran costdna scan against the live account.

Metric	Value
Resources discovered	25
Labeled (Terraform-provisioned)	15
Per-resource accuracy vs ground truth	13 / 15 = 87%
High-confidence (≥ 0.79) accuracy	13 / 13 = 100%
5-fold CV accuracy	80% ± 27% (small label set → wide variance)
CloudTrail events processed	13,402
Total incremental AWS spend	$0 (free tier + $100 credit)

Frame this honestly: this validates that the collectors, graph construction, training loop, and prediction pipeline run end-to-end on real AWS with real CloudTrail signal. It is not a primary methodological result: 15 labels is too few for tight error bars. The Azure post-audit results (above) are the methodological numbers; these are the engineering ones.

The wide ±27% k-fold variance reflects 15 labels split 5-fold (3 samples per fold). Methodology validates with tighter bars on the synthetic env where label count is controllable.
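
A short worked example shows why three samples per fold produce such wide bars. The per-fold counts below are hypothetical (chosen to be consistent with the reported 80% ± 27%, not the recorded fold results), and using the population standard deviation is an assumption about how the ± was computed.

```python
from statistics import mean, pstdev

fold_size = 15 // 5  # 3 held-out labels per fold

# With 3 samples, a fold's accuracy can only be one of four values.
possible = [k / fold_size for k in range(fold_size + 1)]  # 0, 1/3, 2/3, 1

# Hypothetical correct counts per fold, for illustration only.
correct_per_fold = [3, 3, 2, 3, 1]
fold_acc = [c / fold_size for c in correct_per_fold]

cv_mean = round(mean(fold_acc), 2)
cv_std = round(pstdev(fold_acc), 2)
print(cv_mean, cv_std)  # prints 0.8 0.27
```

A single extra miss in one fold moves that fold's accuracy by 33 points, so the spread says more about the label count than about the model.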

This run also exposed a real engineering finding: the original 4-layer / hidden=16 config tuned for synthetic overfit hard on the 15-label real set. Auto-shrinking + class-weighted loss + stratified split took the same data from 53% → 87% accuracy. See commits 93c0dee through ffec566.

Reproducibility: scan outputs (predictions.csv, executive summary, explanations) are committed under docs/real-aws-evidence/ — the labeled test account was destroyed after capture.

Calibration, anomaly detection, active learning

Calibration. costdna calibrate measures Expected Calibration Error. Our ECE = 0.001 (0 = perfectly calibrated) via post-hoc temperature scaling on validation. When the model says 0.7 confidence, it's right 70% of the time. The active-learning loop and the apply --threshold flag both rely on this being honest.
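
For readers unfamiliar with the metric, here is a minimal sketch of Expected Calibration Error using equal-width confidence bins. It is the standard binned definition, not necessarily what `costdna calibrate` computes; the function name and the choice of 10 bins are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average, over confidence bins, of |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to an equal-width bin; confidence 1.0 goes in the last bin.
    bin_idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in np.unique(bin_idx):
        in_bin = bin_idx == b
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap  # in_bin.mean() is the share of samples in this bin
    return float(ece)

# Ten predictions at 0.7 confidence, seven of them right: well calibrated.
calibrated = expected_calibration_error([0.7] * 10, [1] * 7 + [0] * 3)
# Four predictions at 0.9 confidence, all wrong: badly overconfident.
overconfident = expected_calibration_error([0.9] * 4, [0] * 4)
print(calibrated, overconfident)
```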

Anomaly detection. Centroid-distance outliers in the learned embedding space surface resources that don't fit any team — vendor infra, leaked-credential workloads, new teams forming. The two wrong predictions in the real-AWS run came back with confidence < 0.7 and were flagged by find_anomalies for human review. That's the active-learning workflow by design.
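
The centroid-distance idea can be sketched in a few lines of NumPy. This is an illustration, not the code in anomaly.py: the function name, the 2-D toy embeddings, and the distance threshold of 1.0 are assumptions.

```python
import numpy as np

def nearest_centroid_distance(queries, labeled_emb, labeled_team):
    """Distance from each query embedding to the closest team centroid."""
    queries = np.asarray(queries, dtype=float)
    labeled_emb = np.asarray(labeled_emb, dtype=float)
    labeled_team = np.asarray(labeled_team)
    centroids = np.stack([
        labeled_emb[labeled_team == team].mean(axis=0)
        for team in np.unique(labeled_team)
    ])
    # Pairwise distances with shape (n_queries, n_teams).
    dists = np.linalg.norm(queries[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=1)

# Two tight toy clusters standing in for learned embeddings.
labeled_emb = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
               [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labeled_team = ["backend"] * 3 + ["ml"] * 3

queries = [[0.05, 0.05],   # sits inside the backend cluster
           [2.5, 2.5]]     # fits neither team
distances = nearest_centroid_distance(queries, labeled_emb, labeled_team)
flagged = distances > 1.0  # assumption: threshold would be tuned per account
print(distances, flagged)
```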

Active learning. Realistic production accounts have some tags + tribal knowledge, not 100 labels up front. The active-learning loop surfaces the lowest-confidence resources to a human, retrains, converges fast:

Labels   Test acc   Curve
   4     72.2%      ██████████████████████░░░░░░░░
   6     88.9%      ███████████████████████████░░░
  10     94.4%      ████████████████████████████░░
  12     100.0%     ██████████████████████████████

12 human-provided labels → 100% on 60+ resources. This is the realistic bootstrap path.
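
The selection step of such a loop can be sketched as follows. The two strategies mirror the least_confidence and margin options listed for active.py, but the function names and the toy probabilities are assumptions, and the retraining step is omitted.

```python
import numpy as np

def least_confidence(probs, k):
    """Indices of the k resources whose top predicted probability is lowest."""
    top = np.asarray(probs, dtype=float).max(axis=1)
    return np.argsort(top, kind="stable")[:k]

def smallest_margin(probs, k):
    """Indices of the k resources with the smallest gap between the top two classes."""
    ordered = np.sort(np.asarray(probs, dtype=float), axis=1)
    margin = ordered[:, -1] - ordered[:, -2]
    return np.argsort(margin, kind="stable")[:k]

# Toy class probabilities for three unlabeled resources, two teams.
probs = [[0.90, 0.10],
         [0.55, 0.45],
         [0.70, 0.30]]

ask_first = least_confidence(probs, k=2)
ask_margin = smallest_margin(probs, k=1)
print(ask_first, ask_margin)
```

Resource 1 is surfaced first under both strategies because the model is closest to a coin flip on it; a human label there is the most informative one to request.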

Causal spike explanation. When a deploy precedes a cost spike with Granger-causality p < 0.05, surface it:

"Resource mlops-rds-002 had a $9.43 cost spike at Wed 01:00. Team ml's deploy at Tue 23:28 (commit ae5a13c, repo ml-svc) is the most likely cause (p=0.000)."

Lets you tell a CFO not just "the bill went up" but "this commit made it go up."

Limitations and what doesn't work

The full breakdown is in docs/limitations.md. Highlights:

Behavioral attribution has a natural ceiling on thin features. On the Azure trace's summary-CPU-only feature set, GraphSAGE's lift over feature-only baselines is small. The GNN needs richer per-resource signal (hourly time-series, full CloudTrail) to earn its complexity.
Small label sets give wide error bars. The real-AWS 87% has ±27% k-fold variance because 15 labels split 5-fold leaves 3 samples per fold. Use bigger labeled sets for production deployment decisions.
Homogeneous accounts have no behavioral signal. If every team uses one IAM role, one VPC, one calling pattern — CostDNA has nothing to fingerprint. The model only earns its keep when behavior actually differs across teams.
Accounts under ~100 resources are too sparse for graph methods. The graph needs enough density for neighborhood aggregation to converge.
CostDNA is not a production-deployed tool. "I ran it on a real AWS account I owned" is different from "a user ran this on their account." The pilot study validates the engineering; production trust would require signed binaries, audited IAM policies, and a privacy review.
The synthetic env is hand-constructed. Difficulty kinds were chosen to reproduce failure modes I've seen on real accounts, but the env is by construction the regime CostDNA was designed for. Treat synthetic numbers as ablation, not as the headline.

Pricing

Full rationale in docs/pricing.md. Short version:

Tier	Price	What you get
Self-hosted	$0, MIT, forever	Full CLI, all 10 agent tools, every collector, audit module. Self-hosted; no data leaves your account.
Managed scan	$0.05 / scanned resource, waitlist	Read-only IAM role → monthly PDF report + predictions.csv + Slack drift alerts. SOC 2 Type I in progress.
Enterprise	Talk to us	Continuous attribution in your VPC, custom IAM scope, integration with Vantage/CloudHealth/Datadog CCM, SLA on accuracy bands. Indicative range: $24K–$480K/yr depending on account count.

Value sanity check: for a $500K/mo AWS spender with 40% untagged, correct attribution is worth ~$15K/mo of strategic clarity (the gap between budgeting on truth vs. budgeting on "untagged"). Managed-scan pricing targets ~5% of that value.

Security & compliance

Full document at docs/security.md; responsible disclosure in SECURITY.md. Highlights:

Read-only IAM scope. cloudtrail:LookupEvents, ec2:Describe*, iam:List*, ce:Get*, rds:Describe*, s3:List*. Tag write-back is a separate, explicit grant; costdna apply is a dry run by default and only writes tags when passed --apply.
Self-hosted runs entirely in your environment. Zero outbound network calls, no telemetry, no upstream API call to a CostDNA server.
In-browser scan parses your CUR client-side via PapaParse. Verifiable in your browser's Network tab; zero bytes uploaded.
GDPR: cloud bills contain no PII in the EU sense; CostDNA never persists IAM principals beyond the in-memory scan.
SOC 2 Type I: in progress for the managed-scan tier. For self-hosted, the relevant attestation is your own — CostDNA runs in your security boundary.
Supply chain: every release is GHA-built from a public tag; SHA-256 in CHANGELOG.md. Sigstore signing on the roadmap.
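
The read-only IAM scope listed above could be expressed as a policy document along these lines. This is a sketch that only assembles the actions named in this section; the statement ID and the `"Resource": "*"` scope are assumptions, and docs/security.md remains the authoritative reference.

```python
import json

# Actions copied from the read-only scope described in this section.
READ_ONLY_ACTIONS = [
    "cloudtrail:LookupEvents",
    "ec2:Describe*",
    "iam:List*",
    "ce:Get*",
    "rds:Describe*",
    "s3:List*",
]

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "CostDnaReadOnlyScan",  # hypothetical statement ID
        "Effect": "Allow",
        "Action": READ_ONLY_ACTIONS,
        "Resource": "*",               # assumption: narrow where your org requires it
    }],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Note that no tagging action appears in this list, which matches the statement that tag write-back needs a separate grant.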

Found a vulnerability? See SECURITY.md. Short version: email parth.auti@gmail.com, subject [security] CostDNA.

Methodology thesis

The audit isn't an isolated finding — it's a pattern. I argue:

Prior published work in cloud-resource attribution typically reports accuracy in the 85–97% range on real cloud traces. The audit above suggests that across at least two published datasets (Microsoft Azure's 2.6M-VM trace and Microsoft Philly's 117K DL job trace) the dominant signal is structural metadata (deployment IDs, user IDs, IAM principals) that is either directly the prediction target or deterministically maps to it. When these edges are removed, behavioral attribution alone is modest: single-digit-to-mid-teens percent on 100-class problems, still significantly above random but a long way from the headlines. I argue the field has been measuring leakage rather than learning, and propose a two-line pandas audit (df.groupby(edge)[target].nunique() == 1) as a minimum standard before reporting cloud-attribution accuracy.

This is the position the project takes. Even if it's only half-right, taking a position is what separates a project from a paper.

Multi-cloud architecture

The model + features + agent are cloud-agnostic; only the collector layer is provider-specific. AWS calls cloudtrail:LookupEvents; Azure calls monitor.activity_logs.list; GCP calls cloud_logging.list_entries. All three return identical-shape DataFrames downstream.
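
One way to picture the collector split is a shared interface that every provider-specific collector satisfies. The sketch below is hypothetical: the `EventCollector` protocol, the column list, and the fake collectors are illustrative names and data, not the classes in src/costdna/collectors/.

```python
from typing import Protocol
import pandas as pd

# Assumed shared schema; the real column set may differ.
EVENT_COLUMNS = ["resource_id", "principal", "event_name", "event_time"]

class EventCollector(Protocol):
    def collect(self) -> pd.DataFrame: ...

class FakeAwsCollector:
    """Stand-in for a collector that would call cloudtrail:LookupEvents."""
    def collect(self) -> pd.DataFrame:
        rows = [("i-123", "role/ml", "RunInstances", "2024-01-08T14:00:00Z")]
        return pd.DataFrame(rows, columns=EVENT_COLUMNS)

class FakeAzureCollector:
    """Stand-in for a collector that would list Azure activity logs."""
    def collect(self) -> pd.DataFrame:
        rows = [("vm-9", "sp/data", "virtualMachines/write", "2024-01-08T15:00:00Z")]
        return pd.DataFrame(rows, columns=EVENT_COLUMNS)

def gather(collectors: list[EventCollector]) -> pd.DataFrame:
    """Concatenate events from any mix of providers after checking the schema."""
    frames = [collector.collect() for collector in collectors]
    for frame in frames:
        if list(frame.columns) != EVENT_COLUMNS:
            raise ValueError("collector returned an unexpected schema")
    return pd.concat(frames, ignore_index=True)

all_events = gather([FakeAwsCollector(), FakeAzureCollector()])
print(all_events)
```

Everything downstream of `gather` only sees the shared columns, which is what lets the features and model stay cloud-agnostic.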

Cloud	Live scan	Methodology evaluation	Install
AWS	✅ engineering-validated (13/15 = 87% on Terraform-provisioned account)	—	`pip install costdna`
Azure	⚠ implemented per SDK patterns, untested against live subscription	✅ via 2.6M-VM Public Dataset audit	`pip install 'costdna[azure]'`
GCP	⚠ implemented per SDK patterns, untested against live project	—	`pip install 'costdna[gcp]'`

Honest scope: AWS is the only live-tested path (engineering-validated on one Terraform-provisioned account, not production-deployed); Azure's methodological eval is the headline; GCP collectors await live validation. Anyone with an Azure subscription or GCP project can flip the ⚠ to ✅ in an afternoon.

Optional natural-language interface

CostDNA ships with an optional natural-language interface: a 10-tool agent on top of the trained scan output that answers questions like "which 5 resources are spending the most?" and "why did the bill spike Tuesday?". The agent uses OpenAI's function-calling API; tools are pure data lookups against the scan output, so tool results are fast, deterministic, and auditable, even though the model's wording of an answer can vary.

This is interface convenience, not the core contribution. The methodology audit is.

Live demo: cost-dna.vercel.app. Self-host with costdna serve.

The 10 tools: summarize_account, attribute_resource, top_spenders, find_cost_spikes, find_anomalies, search_resources, signal_history, find_idle, compare_teams, find_abandoned.
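
To show what a pure data lookup means in practice, here is a hypothetical sketch of a tool in the spirit of top_spenders. It is not the implementation in agent.py: the function name, the column names (`resource_id`, `team`, `monthly_cost`), and the toy rows are assumptions.

```python
import pandas as pd

def top_spenders_sketch(predictions: pd.DataFrame, n: int = 5) -> list[dict]:
    """Return the n most expensive resources as plain records.
    No model call happens here, so the same input always gives the same output."""
    columns = ["resource_id", "team", "monthly_cost"]
    top = predictions.nlargest(n, "monthly_cost")[columns]
    return top.to_dict(orient="records")

# Toy scan output, made up for illustration.
predictions = pd.DataFrame({
    "resource_id":  ["s3-logs", "rds-1", "lambda-etl"],
    "team":         ["data", "backend", "data"],
    "monthly_cost": [12.5, 310.0, 48.0],
})

result = top_spenders_sketch(predictions, n=2)
print(result)
```

In this arrangement the language model only picks the tool and its arguments; the numbers in the answer come straight from the scan output and can be checked against predictions.csv.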

Quickstart

Synthetic demo (no AWS account)

pip install -e .
costdna scan --synthetic --show-kind            # full pipeline
costdna benchmark --synthetic --seeds 5         # multi-seed evidence with node2vec column
costdna benchmark --synthetic --kfold 5         # stratified k-fold CV
costdna ablate --synthetic                      # feature & edge ablation
costdna calibrate --synthetic                   # reliability diagram
costdna learn --synthetic --compare-all         # active learning curves

Live AWS scan

costdna doctor --aws-profile prod               # preflight: IAM perms + region availability
costdna scan --aws-profile prod --save-dir runs/$(date +%F)
costdna apply --predictions runs/$(date +%F)/predictions.csv          # dry-run
costdna apply --predictions runs/$(date +%F)/predictions.csv --apply  # live write

Full walkthrough: see DEPLOYMENT.md.

Build the labeled environment yourself

cd terraform && terraform init && terraform apply
# run simulation/* on cron for 3-5 days, then:
costdna scan --aws-profile dev --save-dir runs/first

Docker (no install)

# 30-second synthetic demo
docker run --rm pauti04/costdna scan --synthetic --epochs 50

# Live AWS scan (mount credentials)
docker run --rm -v ~/.aws:/root/.aws pauti04/costdna scan --aws-profile prod

Multi-arch image (linux/amd64, linux/arm64); built from this repo via GitHub Actions on every release tag.

Repo layout

src/costdna/
  collectors/aws.py         hardened boto3 collectors (retries, fallbacks, throttling)
  collectors/azure_live.py  Azure SDK v4 typed query (untested live)
  collectors/gcp.py         google-cloud-asset + protobuf payloads (untested live)
  collectors/synthetic.py   realistic synthetic data with 5 hard-case kinds
  features.py               17-feature behavioral extraction
  graph.py                  NetworkX (VPC + IAM + flow edges) → PyG conversion
  model.py                  GraphSAGE + supervised contrastive head
  train.py                  training loop with stratified split + class weights
  baselines.py              Majority / LogReg / k-NN / LabelProp / node2vec+LR
  benchmark.py              multi-seed + k-fold harness with mean ± std
  ablate.py                 feature & edge ablation
  calibrate.py              ECE + reliability diagram
  anomaly.py                centroid-distance anomaly detection on GNN embeddings
  active.py                 active-learning loop (random / least_confidence / margin)
  explain.py                Granger-causality spike explainer
  summary.py                executive summary builder ($ untagged → newly attributed)
  tagger.py                 AWS tag write-back (dry-run + live)
  drift.py                  diff two scans, surface resources with changed teams
  doctor.py                 preflight checks for live AWS scans
  discover.py               team auto-discovery from IAM role naming patterns
  agent.py                  10-tool LLM agent (OpenAI function-calling)
  cli.py                    14 subcommands wired to the above

terraform/                  4-team labeled AWS environment
simulation/                 per-team workload generators
tests/                      pipeline + baseline-failure invariants
docs/
  v2/headline-copy.md       single source of truth for project framing
  v2/readme-v2-outline.md   the spec this README was rewritten from
  v2/results-phase2.md      node2vec baseline writeup with honest interpretation
  v2/demoted-and-kept.md    triage record of what changed in this restructure
  limitations.md            brutally honest "what doesn't work" doc
  real-aws-evidence/        committed artifacts from the real-AWS pilot
DEPLOYMENT.md               step-by-step runbook for real AWS

License

MIT. See LICENSE.

Built by @pauti04. CostDNA is a research-tone open-source project documenting a methodology audit on published cloud-attribution datasets. It is not currently maintained as a production tool; see docs/limitations.md for the honest scope statement.
