feat(extract/microsoft/bulletin): fill xlsx-missing CVE attributions #818
Merged
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an extract-time “CVE fill” mechanism for Microsoft bulletins whose BulletinSearch.xlsx rows have an empty cves cell, by unioning in the authoritative CVE set from the bulletin archive markdown, preventing downstream leakage of bulletin IDs into CVE fields.
Changes:
- Introduces `bulletinArchiveCVEAdditions` and `applyCVEAdditions` to union per-bulletin CVEs into each matching row prior to component reattribution / NA filtering.
- Updates extraction pipeline ordering to apply CVE additions before `applyComponentReattributions`.
- Adds unit coverage for CVE additions behavior and regenerates the MS17-023 golden fixture to include Flash Player CVEs.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/extract/microsoft/bulletin/bulletin.go | Adds applyCVEAdditions, wires it into extraction, and introduces the bulletinArchiveCVEAdditions static map. |
| pkg/extract/microsoft/bulletin/bulletin_test.go | Adds TestApplyCVEAdditions for pass-through/add/idempotency/case-insensitive bulletin ID scenarios. |
| pkg/extract/microsoft/bulletin/export_test.go | Exposes new map/function to external tests via _test exports. |
| pkg/extract/microsoft/bulletin/testdata/golden/data/17/MS17-023.json | Updates golden data to include newly attributed CVE vulnerabilities for MS17-023. |
…from archive markdown

Some Microsoft bulletins ship to BulletinSearch.xlsx with the cves cell empty on every row, even though the bulletin's archive markdown documents one or more CVE IDs. Without intervention, these rows reach detection with no CVE attribution at all, causing the bulletin ID itself to leak into the cveID field downstream (e.g. "MS16-137" appearing as a scannedCves key on Win 7, a vuls-only data-shape artifact surfaced by the 26-host vuls-compare benchmark).

Adds:
- New static map `bulletinArchiveCVEAdditions: bulletinID → []CVE`. Lists every CVE the bulletin's markdown mentions; the per-(KB, CVE) applicability is then enforced by the existing `bulletinArchiveKBNotApplicable` filter after the union, so listing every CVE here is safe.
- New extract preprocessor `applyCVEAdditions(rows)`. Unions the per-bulletin CVE list into each matching row's CVEs string, idempotent against any tokens already present. Runs before `applyComponentReattributions` so the synthesized rows it produces also see the additions, and before the per-CVE NA filter loop so the NA filter takes the final word on per-row applicability.
- Generator extension (`gen_static_map.py`, local-only): emits the map from a hand-curated bulletin list of 15 entries. Empty-cves detection lives on the xlsx side, which the generator doesn't read, so the bulletin list is curated; the per-bulletin CVE set is harvested from each bulletin's markdown body and regenerates over time as the upstream markdown changes.

Map covers 15 bulletins (no CVEs in xlsx, CVEs in markdown): MS16-022/036/050/064/083/093/117/127/128/137/141/154, MS17-003/005/023, totaling 239 (bulletin, CVE) pairs. MS01-005/017 are pre-CVE-era and have no CVE tokens in the markdown either; left out of the map.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
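A minimal sketch of the union step this commit message describes, not the merged implementation. The `row` struct, the comma separator for the CVEs string, the lowercase map name `cveAdditions`, the placeholder KB labels, and the unmapped bulletin `MS16-999` with its CVE are assumptions made for illustration; only the MS16-137 CVE set comes from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// row is a hypothetical, trimmed-down stand-in for one BulletinSearch.xlsx row.
type row struct {
	BulletinID string
	KB         string
	CVEs       string // assumed to be a comma-separated token list; may be empty
}

// cveAdditions mirrors the documented shape of bulletinArchiveCVEAdditions
// (bulletinID -> []CVE). Only the MS16-137 entry is shown.
var cveAdditions = map[string][]string{
	"MS16-137": {"CVE-2016-7220", "CVE-2016-7237", "CVE-2016-7238"},
}

// applyCVEAdditions returns a copy of rows in which each row whose bulletin
// has an entry in cveAdditions carries the union of its existing CVE tokens
// and the bulletin's CVE list. Bulletin IDs are matched case-insensitively,
// and tokens already present are not added twice, so the step is idempotent.
func applyCVEAdditions(rows []row) []row {
	out := make([]row, 0, len(rows))
	for _, r := range rows {
		adds, ok := cveAdditions[strings.ToUpper(strings.TrimSpace(r.BulletinID))]
		if !ok {
			out = append(out, r) // pass-through: bulletin not in the map
			continue
		}
		seen := map[string]bool{}
		var merged []string
		// Keep the tokens the row already has, in their original order.
		for _, tok := range strings.Split(r.CVEs, ",") {
			tok = strings.TrimSpace(tok)
			if tok == "" || seen[strings.ToUpper(tok)] {
				continue
			}
			seen[strings.ToUpper(tok)] = true
			merged = append(merged, tok)
		}
		// Append only the additions that are not present yet.
		for _, cve := range adds {
			if !seen[strings.ToUpper(cve)] {
				seen[strings.ToUpper(cve)] = true
				merged = append(merged, cve)
			}
		}
		r.CVEs = strings.Join(merged, ",")
		out = append(out, r)
	}
	return out
}

func main() {
	rows := []row{
		{BulletinID: "ms16-137", KB: "KB-A"},                        // empty cves cell
		{BulletinID: "MS16-999", KB: "KB-B", CVEs: "CVE-2016-0001"}, // hypothetical, not in the map
	}
	// Applying the step twice gives the same result as applying it once.
	for _, r := range applyCVEAdditions(applyCVEAdditions(rows)) {
		fmt.Printf("%s %s %q\n", r.BulletinID, r.KB, r.CVEs)
	}
}
```

Because the step only ever adds tokens, it can list every CVE the markdown mentions and leave per-(KB, CVE) pruning to the later NA filter, which is the division of labour the commit message relies on.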
Same pattern as the parent commit (BulletinSearch.xlsx leaves cves empty on every row of the bulletin, markdown carries the CVE list) but missed in the initial sweep. Discovered by a corpus-wide audit of (xlsx.cves union per bulletin) vs (markdown CVE tokens):

- MS02-019: CVE-2002-0153 (11 rows)
- MS02-038: CVE-2002-0644, CVE-2002-0645 (2 rows)
- MS06-007: CVE-2006-0021 (9 rows)
- MS08-059: CVE-2008-3466 (8 rows)
- MS16-105: 12 CVEs (CVE-2016-3247/3291/3294/3295/3297/3325/3330/3350/3351/3370/3374/3377) (6 rows)

Total: +17 (bulletin, CVE) pairs across 5 bulletins. No fixture overlap, so no golden regeneration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
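The audit mentioned in this commit message could look roughly like the sketch below. It is an illustration only: the function name `auditAllEmpty`, the input shapes, the regular expression, the markdown snippets, and the control bulletin `MS99-001` are assumptions, and the actual audit tooling is not shown in the PR. Only the MS02-038 CVE pair and the MS01-005 pre-CVE-era case come from the commit messages.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// cveToken is a deliberately loose matcher for CVE-looking tokens
// (assumption: the real harvest applies stricter typo filtering).
var cveToken = regexp.MustCompile(`CVE-\d{4}-\d{4,}`)

// auditAllEmpty reports bulletins whose xlsx cves cells are empty on every
// row while the markdown body mentions at least one CVE token.
// xlsxCells maps bulletinID -> the cves cell of each row;
// markdown maps bulletinID -> the bulletin's archive markdown body.
// The result maps bulletinID -> sorted, de-duplicated markdown CVE tokens.
func auditAllEmpty(xlsxCells map[string][]string, markdown map[string]string) map[string][]string {
	missing := map[string][]string{}
	for id, cells := range xlsxCells {
		populated := false
		for _, cell := range cells {
			if strings.TrimSpace(cell) != "" {
				populated = true
				break
			}
		}
		if populated {
			continue // partial or full xlsx coverage is out of scope here
		}
		seen := map[string]bool{}
		var cves []string
		for _, tok := range cveToken.FindAllString(markdown[id], -1) {
			if !seen[tok] {
				seen[tok] = true
				cves = append(cves, tok)
			}
		}
		if len(cves) > 0 {
			sort.Strings(cves)
			missing[id] = cves
		}
	}
	return missing
}

func main() {
	xlsx := map[string][]string{
		"MS02-038": {"", ""},        // all-empty: candidate for the map
		"MS01-005": {""},            // all-empty, but pre-CVE-era markdown
		"MS99-001": {"CVE-1999-0001"}, // hypothetical bulletin with a populated cell
	}
	md := map[string]string{
		"MS02-038": "Addresses CVE-2002-0645 and CVE-2002-0644 (see CVE-2002-0644).",
		"MS01-005": "No CVE identifiers in this body.",
		"MS99-001": "Mentions CVE-1999-0001 and CVE-1999-0002.",
	}
	fmt.Println(auditAllEmpty(xlsx, md))
}
```

Bulletins with at least one populated cell are skipped on purpose, which keeps the sketch aligned with the all-empty scope of these commits.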
Summary
Stacked on #816. Fills in CVE attributions that BulletinSearch.xlsx leaves off a bulletin's rows but the bulletin's archive markdown documents.
Symmetric to the existing `bulletinArchiveCVECorrections` (which fixes wrong CVE tokens already present in xlsx); this PR's `bulletinArchiveCVEAdditions` fills the gap where xlsx omitted CVEs entirely.
Why
Microsoft occasionally publishes a bulletin to BulletinSearch.xlsx without populating the per-row `cves` field: every row of the bulletin has an empty `cves` cell, despite the markdown documenting one or more CVE IDs. Without intervention, the bulletin ID itself leaks into the `cveID` field downstream (e.g. `MS16-137` appearing as a `scannedCves` key on Win 7, a vuls-only data-shape artifact surfaced by the 26-host vuls-compare benchmark).
This is the xlsx-all-empty case, addressed by union-merging the bulletin's full markdown CVE set into every row at extract time. Per-(KB, CVE) NA precision continues to be enforced by `bulletinArchiveKBNotApplicable` afterward.
Scope
This PR covers only the all-empty-xlsx pattern (20 bulletins / 256 (bulletin, CVE) pairs).
A symmetric "partial xlsx" pattern (xlsx carries some CVEs but the markdown documents more) was audited as part of this work and intentionally not included — the audit found that nearly every "markdown extra" in those bulletins is a markdown-side typo (digit transposition, year transposition, URL-anchor missing leading zero, deprecated CVEs in version-history notes, etc.) rather than a real missing attribution. Adding them mechanically would introduce FPs. The handful of real partial-xlsx variants need entry-by-entry manual review and will live in a follow-up.
Commits
- 4042a279 `bulletinArchiveCVEAdditions` map + `applyCVEAdditions` extract preprocessor + unit test.
- 59e4d15d
How
- `bulletinArchiveCVEAdditions: bulletinID → []CVE` storing the bulletin's full markdown CVE set.
- `applyCVEAdditions(rows)` unions per-bulletin CVEs into matching rows. Idempotent against any tokens already present in `row.CVEs`.
- Runs before `applyComponentReattributions` so synthesized rows also see the additions, and before the per-CVE NA filter loop so the NA filter takes the final word on per-row applicability.
- Generator extension (`gen_static_map.py`, local-only): hand-curated bulletin list, CVE tokens harvested per-bulletin from the markdown body (markdown typos like `CVE-2077-*`, 5-digit suffix anomalies, or 3-digit suffix anomalies are filtered out automatically).
Example: MS16-137
Unions `[7220, 7237, 7238]` (that is, CVE-2016-7220, CVE-2016-7237 and CVE-2016-7238) into every MS16-137 row. The KB-keyed NA filter then drops CVE-2016-7220 from the non-Win10-RTM rows (existing entries in `bulletinArchiveKBNotApplicable` already cover that), so the final attribution matches the markdown's per-(KB, CVE) table exactly.
Testing
- `TestApplyCVEAdditions` covers 5 cases: pass-through, MS16-137 fill, idempotency, case-insensitive bulletin ID match, MS17-023 Flash Player.
- `GOEXPERIMENT=jsonv2 go test ./pkg/extract/microsoft/bulletin/...`: passes
- `GOEXPERIMENT=jsonv2 go test ./...`: passes
Test plan
- `GOEXPERIMENT=jsonv2 go build ./...`
- `GOEXPERIMENT=jsonv2 go test ./...`
🤖 Generated with Claude Code
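For readers following the MS16-137 example, the second half of the pipeline (the KB-keyed NA filter running after the union) can be sketched as below. This is a hedged illustration, not the repository's code: the `row` and `kbCVE` types, the lowercase `notApplicable` map standing in for `bulletinArchiveKBNotApplicable`, the comma-separated CVEs string, and the placeholder KB labels are all assumptions. The rows are shown as they would look after the union has already been applied.

```go
package main

import (
	"fmt"
	"strings"
)

// row is a hypothetical stand-in for an extracted xlsx row.
type row struct {
	BulletinID string
	KB         string
	CVEs       string // assumed comma-separated
}

// kbCVE is a hypothetical key type for per-(KB, CVE) not-applicable entries.
type kbCVE struct {
	KB  string
	CVE string
}

// notApplicable stands in for bulletinArchiveKBNotApplicable; its real shape
// is not shown in the PR. KB labels are placeholders, not real KB numbers.
var notApplicable = map[kbCVE]bool{
	{KB: "KB-WIN10-1511", CVE: "CVE-2016-7220"}: true,
	{KB: "KB-WIN10-1607", CVE: "CVE-2016-7220"}: true,
}

// filterNotApplicable drops every CVE token that is marked not applicable
// for the row's KB. It runs after the union, so it has the final word on
// which CVEs each row keeps.
func filterNotApplicable(rows []row) []row {
	out := make([]row, 0, len(rows))
	for _, r := range rows {
		var kept []string
		for _, cve := range strings.Split(r.CVEs, ",") {
			cve = strings.TrimSpace(cve)
			if cve != "" && !notApplicable[kbCVE{KB: r.KB, CVE: cve}] {
				kept = append(kept, cve)
			}
		}
		r.CVEs = strings.Join(kept, ",")
		out = append(out, r)
	}
	return out
}

func main() {
	const all = "CVE-2016-7220,CVE-2016-7237,CVE-2016-7238"
	// Rows after the union step: every MS16-137 row carries the full set.
	rows := []row{
		{BulletinID: "MS16-137", KB: "KB-WIN10-RTM", CVEs: all},
		{BulletinID: "MS16-137", KB: "KB-WIN10-1511", CVEs: all},
		{BulletinID: "MS16-137", KB: "KB-WIN10-1607", CVEs: all},
	}
	for _, r := range filterNotApplicable(rows) {
		fmt.Printf("%s %s %s\n", r.BulletinID, r.KB, r.CVEs)
	}
}
```

Running the filter after the union is what allows the additions map to stay coarse (one CVE list per bulletin) while the final per-row attribution stays precise.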