Calibration FPR data: top-1000 npm + top-500 PyPI corpus over 12 months #36

@Metbcy

Context

Several bomdrift defaults are calibration-tunable knobs: --typosquat-similarity-threshold (0.92), --young-maintainer-days (90), --recently-published-days (14), --multi-major-delta (2). The defaults were picked from intuition + small-corpus testing. To pick them rigorously, we need data: what's the false-positive rate across a representative corpus of historical npm / PyPI releases?

This is the data-collection prerequisite for a v1.0+ defaults review.
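For concreteness, a hypothetical single comparison with all four knobs spelled out. The flags are the ones listed above; the positional diff arguments are an assumption about the CLI shape, not verified against the tool:

bomdrift diff old-sbom.json new-sbom.json \
  --typosquat-similarity-threshold 0.92 \
  --young-maintainer-days 90 \
  --recently-published-days 14 \
  --multi-major-delta 2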

Scope

A reproducible offline pipeline that (see the bash sketch after this list):

  • Picks a corpus: the top ~1,000 npm and top ~500 PyPI packages by download count, taking each package's releases from the last 12 months.
  • For each release: fetches the SBOM (or generates one via syft).
  • Runs bomdrift diff between consecutive releases (N → N+1).
  • Captures the --debug-calibration JSONL output to a file.
  • Aggregates: per-signal false-positive rate at the default thresholds, plus 25th/50th/75th-percentile distributions of the underlying scores.
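A minimal sketch of the collection loop. The bomdrift diff / --debug-calibration interface is taken from this issue's wording, and list-recent-versions.sh / fetch-artifact.sh are hypothetical helpers standing in for the registry-specific parts:

#!/usr/bin/env bash
# collect.sh — sketch only; interfaces assumed as noted above.
set -euo pipefail

corpus="data/calibration-corpus-2026-04.txt"   # "<ecosystem>\t<name>" per line
out="calibration-$(date +%F).jsonl"
mkdir -p sboms

while read -r eco name; do
  # Releases from the last 12 months, oldest first (hypothetical helper).
  mapfile -t vers < <(./list-recent-versions.sh "$eco" "$name")
  for ((i = 0; i + 1 < ${#vers[@]}; i++)); do
    for v in "${vers[i]}" "${vers[i+1]}"; do
      f="sboms/${name}-${v}.json"
      # Generate an SBOM with syft when the release doesn't ship one.
      [[ -f "$f" ]] || syft "$(./fetch-artifact.sh "$eco" "$name" "$v")" \
        -o cyclonedx-json > "$f"
    done
    bomdrift diff "sboms/${name}-${vers[i]}.json" "sboms/${name}-${vers[i+1]}.json" \
      --debug-calibration >> "$out"
  done
done < "$corpus"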

Not a CI job — this is offline tooling, probably runs once per month and produces a static report.
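The aggregation can stay in jq as well. A minimal sketch, assuming each --debug-calibration record looks like {"signal": "typosquat", "score": 0.95, "fired": true} (the field names are a guess; the real schema may differ), and treating every firing as a false positive since the corpus is presumed benign:

# aggregate.jq — run as: jq -s -f aggregate.jq calibration-2026-04-01.jsonl
group_by(.signal)
| map({
    signal: .[0].signal,
    n: length,
    # Corpus releases are presumed benign, so every firing counts as a FP.
    fpr: ((map(select(.fired)) | length) / length),
    score_percentiles: (map(.score) | sort
      | { p25: .[(length * 25 / 100 | floor)],
          p50: .[(length * 50 / 100 | floor)],
          p75: .[(length * 75 / 100 | floor)] })
  })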

The output is a markdown page: docs/src/calibration-report.md showing:

  • Corpus methodology (which packages, how selected, how many SBOMs total).
  • Per-signal FPR at current default thresholds.
  • Suggested default-value adjustments (with confidence intervals if feasible; see the note after this list).
  • Reproduction instructions: anyone with a checkout + ~6h compute should be able to re-run.
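On the "confidence intervals if feasible" point: one standard choice for a binomial proportion like a per-signal FPR is the Wilson score interval. This is a suggestion, not something the issue prescribes. With k false positives out of n comparisons and z ≈ 1.96 for 95% coverage:

$$\hat{p} = \frac{k}{n}, \qquad \text{interval} = \frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$$

Unlike the naive Wald interval, it stays sensible at k = 0, which the rarer signals are likely to hit.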

Acceptance criteria

  • scripts/calibration/ directory with the data-collection scripts (any language; bash + jq + gh is fine).
  • docs/src/calibration-report.md with the first run's findings.
  • If the report suggests changing defaults, that's a separate follow-up issue/PR — this issue is data collection only.

Constraints

  • No telemetry from production. This is an offline corpus; users running bomdrift in their own CI don't send data anywhere.
  • Reproducible. Anyone should be able to re-run with cargo run + a corpus list. Pin the corpus to a snapshot date (e.g. data/calibration-corpus-2026-04.txt); a pinning sketch follows this list.
  • Honest about caveats. Top-N-by-downloads is biased toward established packages; FPR may underestimate noisy categories like newly-published-and-quickly-churned scopes.
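For the snapshot pinning, a sketch of the npm half. The download-counts endpoint (api.npmjs.org) is real, but the candidate name list is left as stdin input since there is no official "top 1000" endpoint, and the PyPI half would need a different source (e.g. the public BigQuery download dataset):

#!/usr/bin/env bash
# pin-corpus.sh — reads candidate npm package names on stdin, ranks them by
# last-month downloads, and pins the top 1,000 to a dated snapshot file.
set -euo pipefail
snapshot="data/calibration-corpus-$(date +%Y-%m).txt"
while read -r pkg; do
  dl=$(curl -fsS "https://api.npmjs.org/downloads/point/last-month/${pkg}" \
         | jq -r '.downloads // 0')
  printf '%s\t%s\n' "$dl" "$pkg"
done | sort -rn | head -n 1000 | awk '{print "npm\t" $2}' > "$snapshot"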

A note on commit signing

main requires verified signatures (the repo ships cosign-signed releases — we hold our own commits to the same bar).

You usually don't need to set up signing as a contributor — when a maintainer merges via "Merge" or "Squash", GitHub auto-signs the resulting commit and your unsigned PR-branch commits are fine. That's the friendlier path for everyone.

If you'd like your individual commits to land verbatim on main (so your name shows up in git blame), set up local signing once and your PR can be rebase-merged:

# Sign with an SSH key instead of GPG
git config --global gpg.format ssh
git config --global user.signingkey ~/.ssh/id_ed25519.pub
# Sign all commits by default
git config --global commit.gpgsign true

Then add the same SSH public key under GitHub → Settings → SSH and GPG keys → Signing keys.
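To sanity-check the setup:

git commit --allow-empty -m "signing test"
git log --show-signature -1

Note that verifying SSH signatures locally also needs gpg.ssh.allowedSignersFile configured; the Verified badge on GitHub only needs the key uploaded as above.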

See CONTRIBUTING.md → Commit signing on main for the full picture. Either way, please don't sweat it — if your PR is otherwise great, the maintainer will pick a merge mode that works.
