
feat(extract/microsoft/bulletin): fill xlsx-missing CVE attributions#818

Merged
MaineK00n merged 2 commits into nightly from MaineK00n/bulletin-archive-cve-additions on May 25, 2026

Conversation

@MaineK00n
Owner

@MaineK00n MaineK00n commented May 22, 2026

Summary

Stacked on #816. Fills in CVE attributions that BulletinSearch.xlsx leaves off a bulletin's rows but the bulletin's archive markdown documents.

Symmetric to the existing bulletinArchiveCVECorrections (which fixes wrong CVE tokens already present in xlsx); this PR's bulletinArchiveCVEAdditions fills the gap where xlsx omitted CVEs entirely.

Why

Microsoft occasionally publishes a bulletin to BulletinSearch.xlsx without populating the per-row cves field — every row of the bulletin has an empty cves cell, despite the markdown documenting one or more CVE IDs. Without intervention, the bulletin ID itself leaks into the cveID field downstream (e.g. MS16-137 appearing as a scannedCves key on Win 7 — vuls-only data-shape artifact surfaced by the 26-host vuls-compare benchmark).

This is the xlsx-all-empty case, addressed by union-merging the bulletin's full markdown CVE set into every row at extract time. Per-(KB, CVE) NA precision continues to be enforced by bulletinArchiveKBNotApplicable afterward.
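
The union-merge described above can be sketched roughly as follows. This is a minimal illustration, not the code in bulletin.go: the `row` type, the comma-separated `CVEs` string, the `cveAdditions` and `unionCVEs` names, and the `MS99-999` / `KB0000000` placeholders are all assumptions made for the example. Only the MS16-137 CVE list and KB3198585 come from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// row is a hypothetical, trimmed-down stand-in for one BulletinSearch.xlsx row.
type row struct {
	BulletinID string
	KB         string
	CVEs       string // assumed to be comma-separated; the real delimiter may differ
}

// cveAdditions mirrors the described shape of bulletinArchiveCVEAdditions
// (bulletinID -> []CVE). Only the MS16-137 entry from the PR is shown.
var cveAdditions = map[string][]string{
	"MS16-137": {"CVE-2016-7220", "CVE-2016-7237", "CVE-2016-7238"},
}

// unionCVEs appends every CVE registered for a row's bulletin that the row
// does not already carry. Tokens already present are kept in place, so
// applying the function twice gives the same result as applying it once.
func unionCVEs(rows []row) []row {
	out := make([]row, 0, len(rows))
	for _, r := range rows {
		// Bulletin IDs are matched case-insensitively.
		adds, ok := cveAdditions[strings.ToUpper(strings.TrimSpace(r.BulletinID))]
		if !ok {
			out = append(out, r) // pass-through: bulletin not in the map
			continue
		}
		seen := map[string]bool{}
		var tokens []string
		for _, t := range strings.Split(r.CVEs, ",") {
			t = strings.TrimSpace(t)
			if t == "" || seen[strings.ToUpper(t)] {
				continue
			}
			seen[strings.ToUpper(t)] = true
			tokens = append(tokens, t)
		}
		for _, c := range adds {
			if !seen[strings.ToUpper(c)] {
				seen[strings.ToUpper(c)] = true
				tokens = append(tokens, c)
			}
		}
		r.CVEs = strings.Join(tokens, ",")
		out = append(out, r)
	}
	return out
}

func main() {
	rows := []row{
		{BulletinID: "ms16-137", KB: "KB3198585", CVEs: ""},
		{BulletinID: "MS99-999", KB: "KB0000000", CVEs: ""}, // hypothetical bulletin, not in the map
	}
	// Applied twice on purpose to show that the second pass changes nothing.
	for _, r := range unionCVEs(unionCVEs(rows)) {
		fmt.Printf("%s %s [%s]\n", r.BulletinID, r.KB, r.CVEs)
	}
}
```

Listing the bulletin's full CVE set on every row is deliberately over-inclusive in this sketch; per-row precision is left to the NA filter that runs afterward.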

Scope

This PR covers only the all-empty-xlsx pattern (20 bulletins / 256 (bulletin, CVE) pairs).

A symmetric "partial xlsx" pattern (xlsx carries some CVEs but the markdown documents more) was audited as part of this work and intentionally not included — the audit found that nearly every "markdown extra" in those bulletins is a markdown-side typo (digit transposition, year transposition, URL-anchor missing leading zero, deprecated CVEs in version-history notes, etc.) rather than a real missing attribution. Adding them mechanically would introduce FPs. The handful of real partial-xlsx variants need entry-by-entry manual review and will live in a follow-up.

Commits

| Commit | What | Coverage |
|---|---|---|
| 4042a279 | Initial sweep: 15 all-empty bulletins discovered by xlsx audit. Adds bulletinArchiveCVEAdditions map + applyCVEAdditions extract preprocessor + unit test. | 15 bulletins / 239 (bulletin, CVE) pairs |
| 59e4d15d | Follow-up sweep catches 5 more all-empty bulletins missed in the initial pass: MS02-019, MS02-038, MS06-007, MS08-059, MS16-105. | +5 bulletins / +17 pairs |
| Total | | 20 bulletins / 256 (bulletin, CVE) pairs |

How

  • New map bulletinArchiveCVEAdditions: bulletinID → []CVE storing the bulletin's full markdown CVE set.
  • New extract preprocessor applyCVEAdditions(rows) unions per-bulletin CVEs into matching rows. Idempotent against any tokens already present in row.CVEs.
  • Runs before applyComponentReattributions so synthesized rows also see the additions, and before the per-CVE NA filter loop so the NA filter takes the final word on per-row applicability.
  • Generator extension (gen_static_map.py, local-only): hand-curated bulletin list, with CVE tokens harvested per bulletin from the markdown body. Markdown typos such as CVE-2077-* and 3-digit or 5-digit suffix anomalies are filtered out automatically.
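
The token harvesting with typo filtering could look roughly like the sketch below. The real generator is a local Python script that is not part of this PR, so everything here is an assumption written in Go for consistency with the package: the `harvestCVEs` name, the regular expression, the 1999..maxYear year window, and the "exactly 4 suffix digits" rule are illustrative guesses at the kind of plausibility checks described, not the generator's actual rules.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
)

// cveToken matches CVE-like tokens regardless of case and captures the
// year and the sequence number separately.
var cveToken = regexp.MustCompile(`(?i)CVE-(\d{4})-(\d+)`)

// harvestCVEs extracts CVE tokens from a markdown body and drops those that
// look like typos. Assumed plausibility rules (not taken from gen_static_map.py):
//   - the year must lie between 1999 and maxYear
//   - the sequence number must have exactly 4 digits
func harvestCVEs(markdown string, maxYear int) []string {
	seen := map[string]bool{}
	var out []string
	for _, m := range cveToken.FindAllStringSubmatch(markdown, -1) {
		year, err := strconv.Atoi(m[1])
		if err != nil || year < 1999 || year > maxYear {
			continue // e.g. CVE-2077-*
		}
		if len(m[2]) != 4 {
			continue // 3-digit or 5-digit suffix anomaly
		}
		id := "CVE-" + m[1] + "-" + m[2] // normalize the prefix to upper case
		if !seen[id] {
			seen[id] = true
			out = append(out, id)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical markdown snippet containing two valid tokens and three typos.
	md := "Fixes CVE-2016-7220 and cve-2016-7237; typos: CVE-2077-7238, CVE-2016-722, CVE-2016-72380."
	fmt.Println(harvestCVEs(md, 2017))
}
```

A fixed 4-digit rule only makes sense for a corpus where every legitimate ID has a 4-digit sequence number; a generator covering newer advisories would need a looser check.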

Example: MS16-137

| Configuration | CVE-2016-7220 (Auth Info) | CVE-2016-7237 (LSA DoS) | CVE-2016-7238 (VSM EoP) |
|---|---|---|---|
| Win 10 RTM 1507 (KB3198585) | Important | Important | Important |
| Vista / 2008 / 7 / 8.1 / Server 2012 / 2012 R2 / RT 8.1 / Win 10 1511 / 1607 / Server 2016 | NA | Important | Important |

Unions [7220, 7237, 7238] into every MS16-137 row. The KB-keyed NA filter then drops CVE-2016-7220 from the non-Win10-RTM rows (existing entries in bulletinArchiveKBNotApplicable already cover that), so the final attribution matches the markdown's per-(KB, CVE) table exactly.
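
To make the second step concrete, here is a small sketch of a KB-keyed NA filter applied to rows that already carry the full MS16-137 CVE set. The `row` type, the `notApplicable` map shape (KB to set of CVEs), the `filterNA` name, and the `KB0000000` placeholder for a non-Win10-RTM row are assumptions for illustration; the actual structure of bulletinArchiveKBNotApplicable is not shown in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// row is a hypothetical stand-in for one extracted bulletin row.
type row struct {
	BulletinID string
	KB         string
	CVEs       string // assumed comma-separated
}

// notApplicable is an assumed shape for a KB-keyed NA table:
// KB -> set of CVEs that the bulletin marks as not applicable for that KB.
// KB0000000 is a placeholder for any non-Win10-RTM MS16-137 update.
var notApplicable = map[string]map[string]bool{
	"KB0000000": {"CVE-2016-7220": true},
}

// filterNA removes every CVE that is flagged NA for the row's KB.
// Looking up a KB that has no entry yields a nil set, so nothing is dropped.
func filterNA(r row) row {
	var kept []string
	for _, c := range strings.Split(r.CVEs, ",") {
		if c == "" || notApplicable[r.KB][c] {
			continue
		}
		kept = append(kept, c)
	}
	r.CVEs = strings.Join(kept, ",")
	return r
}

func main() {
	full := "CVE-2016-7220,CVE-2016-7237,CVE-2016-7238" // state after the union step
	rows := []row{
		{BulletinID: "MS16-137", KB: "KB3198585", CVEs: full}, // Win 10 RTM 1507
		{BulletinID: "MS16-137", KB: "KB0000000", CVEs: full}, // placeholder for the other configurations
	}
	for _, r := range rows {
		f := filterNA(r)
		fmt.Printf("%s: %s\n", f.KB, f.CVEs)
	}
}
```

In this sketch the KB3198585 row keeps all three CVEs, while the placeholder row loses CVE-2016-7220, which is the shape of the per-(KB, CVE) table above.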

Testing

  • New unit test TestApplyCVEAdditions covers 5 cases: pass-through, MS16-137 fill, idempotency, case-insensitive bulletin ID match, MS17-023 Flash Player.
  • The MS17-023 golden (the only one of the 20 bulletins currently in the fixture set) is regenerated to carry the 7 Flash Player CVE attributions.
  • GOEXPERIMENT=jsonv2 go test ./pkg/extract/microsoft/bulletin/... — passes
  • GOEXPERIMENT=jsonv2 go test ./... — passes

Test plan

  • Build with GOEXPERIMENT=jsonv2 go build ./...
  • Test with GOEXPERIMENT=jsonv2 go test ./...
  • Spot-check 2-3 entries (e.g. MS16-137 / CVE-2016-7220, MS17-023 / CVE-2017-2997) against the corresponding markdown

🤖 Generated with Claude Code

@MaineK00n MaineK00n force-pushed the MaineK00n/bulletin-archive-cve-additions branch from 8ac9ee7 to 59e4d15 Compare May 22, 2026 07:07
@MaineK00n MaineK00n force-pushed the MaineK00n/bulletin-archive-row-split branch from 87d8c4e to 8f64578 Compare May 22, 2026 07:12
@MaineK00n MaineK00n force-pushed the MaineK00n/bulletin-archive-cve-additions branch from 59e4d15 to ca93a8a Compare May 22, 2026 07:14
@MaineK00n MaineK00n force-pushed the MaineK00n/bulletin-archive-row-split branch from 8f64578 to 1ba8b86 Compare May 22, 2026 07:31
@MaineK00n MaineK00n self-assigned this May 22, 2026
@MaineK00n MaineK00n force-pushed the MaineK00n/bulletin-archive-cve-additions branch from ca93a8a to f0cd9c1 Compare May 22, 2026 07:33
@MaineK00n MaineK00n force-pushed the MaineK00n/bulletin-archive-row-split branch from 1ba8b86 to be9d9a3 Compare May 22, 2026 08:36
@MaineK00n MaineK00n force-pushed the MaineK00n/bulletin-archive-cve-additions branch from f0cd9c1 to 65ded22 Compare May 22, 2026 08:36
@MaineK00n MaineK00n force-pushed the MaineK00n/bulletin-archive-row-split branch from be9d9a3 to 87ba8a7 Compare May 22, 2026 08:55
@MaineK00n MaineK00n force-pushed the MaineK00n/bulletin-archive-cve-additions branch from 65ded22 to 4b031a8 Compare May 22, 2026 08:55
Base automatically changed from MaineK00n/bulletin-archive-row-split to nightly May 22, 2026 09:04
@MaineK00n MaineK00n force-pushed the MaineK00n/bulletin-archive-cve-additions branch from 4b031a8 to 3c2703c Compare May 22, 2026 14:56
@MaineK00n MaineK00n marked this pull request as ready for review May 22, 2026 15:15
Copilot AI review requested due to automatic review settings May 22, 2026 15:15
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an extract-time “CVE fill” mechanism for Microsoft bulletins whose BulletinSearch.xlsx rows have an empty cves cell, by unioning in the authoritative CVE set from the bulletin archive markdown, preventing downstream leakage of bulletin IDs into CVE fields.

Changes:

  • Introduces bulletinArchiveCVEAdditions and applyCVEAdditions to union per-bulletin CVEs into each matching row prior to component reattribution / NA filtering.
  • Updates extraction pipeline ordering to apply CVE additions before applyComponentReattributions.
  • Adds unit coverage for CVE additions behavior and regenerates the MS17-023 golden fixture to include Flash Player CVEs.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| pkg/extract/microsoft/bulletin/bulletin.go | Adds applyCVEAdditions, wires it into extraction, and introduces the bulletinArchiveCVEAdditions static map. |
| pkg/extract/microsoft/bulletin/bulletin_test.go | Adds TestApplyCVEAdditions for pass-through/add/idempotency/case-insensitive bulletin ID scenarios. |
| pkg/extract/microsoft/bulletin/export_test.go | Exposes new map/function to external tests via _test exports. |
| pkg/extract/microsoft/bulletin/testdata/golden/data/17/MS17-023.json | Updates golden data to include newly attributed CVE vulnerabilities for MS17-023. |

Comment thread pkg/extract/microsoft/bulletin/bulletin.go
Comment thread pkg/extract/microsoft/bulletin/bulletin.go
Comment thread pkg/extract/microsoft/bulletin/bulletin.go
Comment thread pkg/extract/microsoft/bulletin/bulletin_test.go
MaineK00n and others added 2 commits May 23, 2026 00:24
…from archive markdown

Some Microsoft bulletins ship to BulletinSearch.xlsx with the cves
cell empty on every row, even though the bulletin's archive markdown
documents one or more CVE IDs. Without intervention, these rows reach
detection with no CVE attribution at all, causing the bulletin ID
itself to leak into the cveID field downstream (e.g. "MS16-137"
appearing as a scannedCves key on Win 7 — a vuls-only data-shape
artifact surfaced by the 26-host vuls-compare benchmark).

Adds:

- New static map `bulletinArchiveCVEAdditions: bulletinID → []CVE`.
  Lists every CVE the bulletin's markdown mentions; the per-(KB, CVE)
  applicability is then enforced by the existing
  `bulletinArchiveKBNotApplicable` filter after the union, so listing
  every CVE here is safe.

- New extract preprocessor `applyCVEAdditions(rows)`. Unions the
  per-bulletin CVE list into each matching row's CVEs string,
  idempotent against any tokens already present. Runs before
  `applyComponentReattributions` so the synthesized rows it produces
  also see the additions, and before the per-CVE NA filter loop so
  the NA filter takes the final word on per-row applicability.

- Generator extension (`gen_static_map.py`, local-only): emits the
  map from a hand-curated bulletin list (15 entries — empty-cves
  detection lives on the xlsx side which the generator doesn't read,
  so the bulletin list is curated; the per-bulletin CVE set is
  harvested from each bulletin's markdown body and regenerates over
  time as the upstream markdown changes).

Map covers 15 bulletins (no CVEs in xlsx, CVEs in markdown):
MS16-022/036/050/064/083/093/117/127/128/137/141/154,
MS17-003/005/023 — totaling 239 (bulletin, CVE) pairs.
MS01-005/017 are pre-CVE-era and have no CVE tokens in the markdown
either; left out of the map.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same pattern as the parent commit (BulletinSearch.xlsx leaves cves
empty on every row of the bulletin, markdown carries the CVE list)
but missed in the initial sweep.

Discovered by a corpus-wide audit of (xlsx.cves union per bulletin)
vs (markdown CVE tokens):

  MS02-019: CVE-2002-0153                                  (11 rows)
  MS02-038: CVE-2002-0644, CVE-2002-0645                   (2 rows)
  MS06-007: CVE-2006-0021                                  (9 rows)
  MS08-059: CVE-2008-3466                                  (8 rows)
  MS16-105: 12 CVEs (CVE-2016-3247/3291/3294/3295/3297/    (6 rows)
            3325/3330/3350/3351/3370/3374/3377)

Total: +17 (bulletin, CVE) pairs across 5 bulletins. No fixture
overlap, so no golden regeneration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MaineK00n MaineK00n force-pushed the MaineK00n/bulletin-archive-cve-additions branch from 3c2703c to fb0b76d Compare May 22, 2026 15:25
@MaineK00n MaineK00n requested a review from Copilot May 22, 2026 15:29
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

Collaborator

@shino shino left a comment


🥂

@MaineK00n MaineK00n merged commit b811faf into nightly May 25, 2026
5 checks passed
@MaineK00n MaineK00n deleted the MaineK00n/bulletin-archive-cve-additions branch May 25, 2026 07:23

3 participants