fix(extract/microsoft/bulletin): refresh bulletin archive NA filtering from current markdown extractor by MaineK00n · Pull Request #821 · MaineK00n/vuls-data-update

MaineK00n · 2026-05-23T01:45:16Z

Summary

Stacked on #820. Regenerates the corpus-wide per-bulletin NA filtering data (bulletinArchiveAmendments[*].CVEAdjustments) using the current archive markdown extractor output, replacing the legacy committed data produced by an older extractor with a looser per-(KB, CVE) classification rule.

Why

The legacy bulletinArchiveKBNotApplicable map carried two classes of over-aggressive NA filtering:

KB-uniform Drops applied where only some xlsx rows are NA per the markdown matrix: the older extractor treated a (KB, CVE) pair as KB-uniformly NA if ANY row carried NA. The current extractor uses the stricter all(statuses) condition, so a pair is KB-uniformly NA only when every row of the KB is NA.
No per-product Component-Drop entries for OS-level / IE Cumulative bulletins where the markdown matrix narrows specific CVEs to specific product rows.
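The difference between the two classification rules can be illustrated with a minimal Go sketch. The `Status` type and both function names are hypothetical and are not taken from the extractor; the handling of an empty row set is an assumption.

```go
package main

import "fmt"

// Status is a hypothetical per-row applicability value read from the
// archive markdown matrix table for one (KB, CVE) pair.
type Status string

const (
	StatusNA       Status = "NA"
	StatusAffected Status = "Affected"
)

// legacyKBUniformNA mirrors the older, looser rule described above:
// a single NA row is enough to treat the whole KB as NA for the CVE.
func legacyKBUniformNA(rows []Status) bool {
	for _, s := range rows {
		if s == StatusNA {
			return true
		}
	}
	return false
}

// strictKBUniformNA mirrors the current all(statuses) rule: every row
// must be NA. An empty row set is treated as not-NA here (assumption),
// so nothing is dropped without evidence.
func strictKBUniformNA(rows []Status) bool {
	if len(rows) == 0 {
		return false
	}
	for _, s := range rows {
		if s != StatusNA {
			return false
		}
	}
	return true
}

func main() {
	// One NA row among affected rows: the per-product NA case.
	mixed := []Status{StatusNA, StatusAffected, StatusAffected}
	fmt.Println("legacy drops KB-wide:", legacyKBUniformNA(mixed))
	fmt.Println("strict drops KB-wide:", strictKBUniformNA(mixed))
}
```

With mixed rows the legacy rule drops the CVE for the whole KB, while the strict rule keeps it and leaves the single NA row to a per-product Component-Drop.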

Visible failure modes surfaced by the 2026-05-23 gost-vs-vuls2 benchmark:

Bulletin-ID-as-CVE artifacts ("MS11-015", "MS13-069" appearing in scannedCves)
Server 2008 R2's 637-CVE gost-only cluster (CVEs silently under-attributed by the legacy over-Drop)

Commits

Commit	Scope
`30574643`	KB-keyed NA filter regenerate — 158 over-aggressive KB-Drop entries removed, 131 updated. MS06-060 / MS16-068 manual exceptions preserved.
`68f8aa60`	OS-level per-product Component-Drop — 743 per-(bulletin, xlsx product) entries across 97 bulletins, 76 new normalize switch cases.
`a9481413`	IE Cumulative per-product Component-Drop — 989 per-(bulletin, "IE_X for OS_Y") entries across 37 pre-MS14 IE Cum bulletins. New `ieCumCombinedKey` helper constructs canonical combined key from xlsx (component, product).
`9c29884b`	Office app Component-Drops + TODO markers — 95 entries for 27 Office mixed-shape bulletins (component-keyed) with conditional normalize (return component when set, else product). Plus TODO comment blocks for 175 bulletins documenting 741 remaining unmapped archive markdown labels for future manual review.
`84ab1dea`	Inline-comment terminology cleanup — replace generator-side jargon ("Format A", "Format B", `gen_static_map.py`) in TODO blocks and other inline comments with shape-describing terms ("KB-keyed NA filter", "per-product Component-Drop", "archive markdown extractor / matrix table"). No data/code semantics change.

Mapping coverage (cumulative)

Category	Total markdown labels	Mapped	Coverage
OS-level (Windows-prefix / Microsoft-prefix per-product)	1163	734 (incl. 95 Office app)	63%
IE Cumulative pre-MS14 (combined IE+OS)	1334	989	74%
Total per-product labels	2497	1723	69%

The remaining 741 unmapped labels are documented inline with TODO comments per bulletin — they fall into patterns that require per-bulletin manual review:

Combined feature+OS labels ("DirectX 9.0 on Windows XP SP2" — need feature+OS combined normalize)
SP-less product variants ("Windows Vista" without SP — need prefix-expansion)
Compound "X with Y" inverted forms
Comma-separated DirectX variant lists

Detection benchmark impact

Measured against the 2026-05-23 vuls.db / 26-host benchmark across this stack:

State	gost-only	vuls-only	Hosts matching
Pre-stack (legacy committed data)	1334	4334	0/26
After KB-keyed NA filter regenerate only	624	3355	6/26
After full per-product Component-Drop (this PR)	TBD — pending rebuild

Manual exceptions (preserved)

MS06-060 (KB923088, KB923089, KB923090, KB924998, KB924999): bulletin's "Vulnerability Identifiers" header layout is not yet recognised by the archive markdown extractor.
MS16-068 (KB3163017 / CVE-2016-3215): markdown narrowing is in natural-language rather than a per-CVE matrix table.

Testing

GOEXPERIMENT=jsonv2 go test -count=1 ./pkg/extract/microsoft/... — passes (all 5 subpkgs).
TestBulletinArchiveNotApplicable — passes (KB-keyed + per-bulletin inner-key).

Test plan

Build with GOEXPERIMENT=jsonv2 go build ./...
Test with GOEXPERIMENT=jsonv2 go test ./...
Rebuild vuls.db with this version; re-run 26-host benchmark; verify additional hosts converge after per-product Component-Drop is applied.
Review the 175 TODO blocks in bulletin.go (grep "TODO: the per-product NA entries below" pkg/extract/microsoft/bulletin/bulletin.go) and resolve any high-impact unmapped labels as follow-up.

🤖 Generated with Claude Code

…urrent archive markdown extractor

The legacy bulletinArchiveKBNotApplicable map (now under bulletinArchiveAmendments after #820's per-bulletin refactor) was generated by an older version of the archive markdown extractor whose per-(KB, CVE) classification logic was looser. The older heuristic treated a (KB, CVE) cell as KB-uniformly NA if ANY xlsx row for that KB carried NA for the CVE; the current derivation uses the stricter all(statuses) condition so a (KB, CVE) is only KB-uniformly NA when ALL rows of the KB are NA. The legacy committed data therefore carried over-aggressive KB-uniform Drops across many bulletins.

These over-Drops have two visible failure modes:

1. Bulletin-ID leak (e.g. "MS11-015", "MS13-069" appearing as cveID in downstream scannedCves) when every CVE attribution of a KB is dropped, leaving the bulletin without any vulnerability entries.
2. Silent CVE under-attribution where some xlsx rows that should be vuln (per markdown matrix table) get filtered alongside the truly-NA rows.

Surfaced by the 2026-05-23 gost-vs-vuls2 detection comparison benchmark. After this regenerate, 6 of 26 hosts converge to full match with gost (Win 7, 8.1, Server 2008, 2008 R2, 2012, 2012 R2). In particular, Server 2008 R2's 637-CVE gost-only cluster collapses to 0 because the under-attribution is corrected.

Scope: regenerates the KB-keyed Drop adjustments (CVEAdjustments where KB != "") across all bulletins from the current archive markdown matrix table output. Other adjustment kinds (Add, Remap, Component-Drop) and RowSplits / Supersedes / IECumChain are preserved unchanged.

Two manually-curated bulletins are preserved as exceptions where the current extractor does not produce equivalent output:

- MS06-060 (KB923088/923089/923090/924998/924999): the bulletin's "Vulnerability Identifiers" header layout is not yet recognised by the archive markdown extractor. Manual entries are documented inline.
- MS16-068 (KB3163017 / CVE-2016-3215): the markdown matrix narrowing is in natural language ("Only Windows 10 Version 1511 is affected") rather than a per-CVE matrix table, so the KB-keyed NA derivation does not surface this NA from MS16-068's markdown. It is manually restated under MS16-068's amendment.

Trade-off: the regenerate removes the legacy KB-uniform NA filtering that wrongly applied to per-product NA cases. The bulletin-ID leak is eliminated, but per-product NA cases (e.g. MS11-015 CVE-2011-0032 NA only on XP rows; MS11-015 CVE-2011-0042 NA only on Server 2008 R2) will now flow through with the full CVE list, producing small per-product false positives. The follow-up to this commit adds per-product Component-Drop entries plus the corresponding normalizeArchiveComponentKey special-cases so those per-product NAs are correctly suppressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…es for OS-level bulletins

Follow-up to the KB-keyed NA filter regenerate. The legacy KB-uniform NA filtering that was wrongly applied to per-product NA cases is now corrected for OS-level / App-level bulletins via per-product Component-Drop adjustments derived from the current archive markdown matrix table output.

Scope: 97 bulletins whose archive markdown labels are predominantly OS-level ("Windows ...") or App-level ("Microsoft ..."). The per-product narrowing from the markdown per-CVE matrix table is now suppressed at the per-(bulletin, xlsx affected_product) level. Each Component-Drop entry's selector matches the row's xlsx affected_product via normalizeArchiveComponentKey returning `product` verbatim (76 new bulletins added to the switch's "return product" branch alongside the existing MS11-015 / MS12-* / MS13-046 / MS13-081 / MS15-* / MS16-* / MS17-* cases).

Mapping markdown column-header labels to xlsx affected_product:

- direct (md_label == xlsx_product as-is): 425 entries
- stripped-"Microsoft " prefix: 109 entries
- compound "X and Y" split into per-part entries: 104 entries (+ expanded to 188 per-xlsx-product entries)
- substring match (md_label inside xlsx_product): 1 entry

Total: 743 per-(bulletin, xlsx_product) Component-Drop entries (739 new + 4 already-present for MS11-015 from the preceding fix commit).

Deferred / not in this commit:

- 524 markdown labels with no straightforward xlsx mapping (combined product+feature labels like "DirectX 9.0 on Microsoft Windows ...", itanium/SP variants not in xlsx, etc.), which need per-bulletin manual review.
- 71 IE Cumulative bulletins with combined "Internet Explorer X for Windows Y" markdown labels, handled in the next commit since they require a different normalize key construction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
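The label-to-product mapping rules listed in this commit message (direct match, "Microsoft " prefix stripping, compound "X and Y" splitting) could be sketched roughly as below. `mapLabel` and its signature are hypothetical and only illustrate the rule order; the substring-match rule is omitted, and the real generator may behave differently.

```go
package main

import (
	"fmt"
	"strings"
)

// mapLabel resolves one markdown column-header label to the xlsx
// affected_product values it covers. It returns nil when no rule applies,
// which corresponds to a label left for manual review.
// Hypothetical helper: the name and signature are not from the repository.
func mapLabel(label string, xlsxProducts []string) []string {
	known := make(map[string]bool, len(xlsxProducts))
	for _, p := range xlsxProducts {
		known[p] = true
	}

	// resolve tries the direct form first, then the form with the
	// "Microsoft " prefix stripped.
	resolve := func(s string) (string, bool) {
		s = strings.TrimSpace(s)
		if known[s] {
			return s, true
		}
		if t := strings.TrimPrefix(s, "Microsoft "); t != s && known[t] {
			return t, true
		}
		return "", false
	}

	if p, ok := resolve(label); ok {
		return []string{p}
	}

	// Compound "X and Y": every part must resolve, otherwise give up so
	// that a partial mapping is never emitted.
	parts := strings.Split(label, " and ")
	if len(parts) < 2 {
		return nil
	}
	out := make([]string, 0, len(parts))
	for _, part := range parts {
		p, ok := resolve(part)
		if !ok {
			return nil
		}
		out = append(out, p)
	}
	return out
}

func main() {
	products := []string{"Windows XP Service Pack 3", "Windows Vista Service Pack 2"}
	fmt.Println(mapLabel("Microsoft Windows XP Service Pack 3", products))
	fmt.Println(mapLabel("Windows XP Service Pack 3 and Windows Vista Service Pack 2", products))
	fmt.Println(mapLabel("DirectX 9.0 on Windows XP Service Pack 2", products) == nil)
}
```

Returning nil for anything that does not fully resolve keeps the sketch conservative: an unmapped label produces no Component-Drop entry rather than a wrong one.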

…es for IE Cumulative bulletins

Follow-up to the OS-level per-product Component-Drop fix. Pre-MS14 IE Cumulative bulletins (MS07-* through MS13-*) use markdown column-header labels that combine the IE version and OS into a single label (e.g. "Internet Explorer 6 for Windows XP Service Pack 3"), with varied connector words ("for", "in", "on", "when installed on") in the upstream markdown.

This commit:

1. Adds `ieCumCombinedKey(component, product)` helper to normalizeArchiveComponentKey. The helper constructs the canonical "IE_X (Service Pack Y) for OS" key from the xlsx row's (affected_component, affected_product) pair:
   - Strips "Microsoft "/"Windows " prefix from affected_component
   - Drops ".0" minor-version suffix where present (so xlsx-form "Internet Explorer 6.0" matches markdown-form "Internet Explorer 6")
   - Strips "Microsoft " prefix from affected_product
   - Joins with " for " (the canonical connector)
2. Routes 37 pre-MS14 IE Cumulative bulletins through the helper: MS07-069, MS08-031/058/073, MS09-014/019/034/054/072, MS10-002/018/035/053/071/090, MS11-003/018/050/057/081/099, MS12-010/023/037/052/063/077, MS13-009/021/037/047/055/059/069/080/088/097
3. Adds 989 Component-Drop entries to bulletinArchiveAmendments with canonical "IE_X for OS" keys generated by transforming the archive markdown matrix table output:
   - Connector normalization: "on" / "in" / "when installed on" → "for"
   - Compound splitting: "for X and Y" → two entries with X and Y respectively
   - "Microsoft " prefix stripping on the OS portion

Together with the preceding two commits (KB-keyed NA regenerate + OS-level per-product Component-Drop), this restores the per-(KB, component, CVE) NA precision for the bulletin archive corpus, eliminating both the bulletin-ID-as-CVE artifact and the per-product over-attribution that the legacy KB-uniform NA introduced.

Coverage: 989 of the 1334 IE Cumulative archive markdown labels (74%). 345 unmatched labels are deferred. These are mostly newer-format IE 10/11 labels and Itanium/embedded edge cases where the markdown ↔ xlsx product mapping requires per-bulletin manual review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
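The key construction and connector normalization described in this commit message can be sketched as follows. This is an illustration under assumptions, not the repository's `ieCumCombinedKey`: the function names, the regular expressions, and the exact handling of the ".0" suffix are all hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// minorZeroRe matches a trailing ".0" on a version number ("6.0" -> "6")
// while leaving versions such as "5.01" untouched (assumed behaviour).
var minorZeroRe = regexp.MustCompile(`(\d+)\.0\b`)

// connectorRe splits a markdown label into its IE part and OS part around
// one of the connector words listed in the commit message.
var connectorRe = regexp.MustCompile(
	`^(Internet Explorer \S+(?: Service Pack \d+)?) (?:when installed on|for|on|in) (.+)$`)

// ieCumCombinedKeySketch builds the canonical "IE_X for OS" key from an
// xlsx row's (affected_component, affected_product) pair.
func ieCumCombinedKeySketch(component, product string) string {
	c := strings.TrimPrefix(component, "Microsoft ")
	c = strings.TrimPrefix(c, "Windows ")
	c = minorZeroRe.ReplaceAllString(c, "$1")
	p := strings.TrimPrefix(product, "Microsoft ")
	return c + " for " + p
}

// canonicalLabelSketch rewrites a markdown column-header label into the
// same canonical form. Labels that do not match are returned unchanged.
func canonicalLabelSketch(label string) string {
	m := connectorRe.FindStringSubmatch(label)
	if m == nil {
		return label
	}
	return m[1] + " for " + strings.TrimPrefix(m[2], "Microsoft ")
}

func main() {
	key := ieCumCombinedKeySketch("Microsoft Internet Explorer 6.0", "Microsoft Windows XP Service Pack 3")
	label := canonicalLabelSketch("Internet Explorer 6 when installed on Microsoft Windows XP Service Pack 3")
	fmt.Println(key)
	fmt.Println(key == label)
}
```

Normalizing both sides to one canonical string means a Component-Drop entry generated from the markdown label and a key computed from the xlsx row can be compared with plain string equality.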

…O markers for unmapped archive markdown labels

Follow-up to the OS-level and IE Cumulative per-product Component-Drop coverage. This commit closes two gaps:

1. Office app mixed-shape bulletins (27 bulletins, 95 new Component-Drop entries). Bulletins like MS10-079, MS12-030, MS15-070 etc. carry the Office app identity (e.g. "Microsoft Word 2003 Service Pack 3") in xlsx affected_component for some rows and leave it empty for the OS-level rows of the same bulletin. normalizeArchiveComponentKey now routes these 27 bulletins to a new case that returns `component` when set and falls back to `product`, so the per-product Component-Drop entries keyed by app name match the component-set rows and the existing OS-level entries continue to match the component-null rows.
2. TODO comment blocks above 175 bulletins listing the remaining 741 archive markdown labels that could not be mapped automatically. These labels fall into several patterns the automatic mapper cannot resolve:
   - Combined feature+OS labels like "DirectX 9.0 on Windows XP Service Pack 2" (need a per-bulletin normalize that constructs the combined key from xlsx (component, product), similar to ieCumCombinedKey but for DirectX / Media Services / etc.)
   - SP-less product variants like "Windows Vista" (xlsx only carries SP1/SP2 variants; the markdown rolls up multiple SPs into one column and needs prefix-expansion to all SP variants in xlsx)
   - Compound "X with Y" inverted order ("Windows 2000 SP4 with DirectX 7.0"), the same as the combined feature+OS pattern but with reversed parts
   - Compound "X and Y, A, B, or C" with comma-separated tails

Each affected bulletin gets a TODO block above its amendment entry documenting the missing labels. A future contributor (or the maintainer manually) can resolve these by:

- Extending the per-bulletin normalize case to construct the missing combined key, or expanding to all matching xlsx products
- Adding Component-Drop entries below the TODO marker

The benchmark impact of the unmapped 741 labels is limited because they cluster on older bulletins (MS07-* through MS13-*) whose relevant xlsx products are not heavily represented on the 26-host benchmark fixture set. The TODO markers preserve the information for future review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
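The conditional normalize for Office mixed-shape bulletins ("return component when set, else product") amounts to a small fallback. A minimal sketch, with a hypothetical function name and a hypothetical OS product string chosen only for illustration:

```go
package main

import "fmt"

// officeMixedKeySketch returns the key used to look up per-product
// Component-Drop entries for a mixed-shape bulletin: the Office app name
// when xlsx affected_component is set, otherwise the affected_product.
// Hypothetical name; the real logic lives in a switch case of
// normalizeArchiveComponentKey.
func officeMixedKeySketch(component, product string) string {
	if component != "" {
		return component
	}
	return product
}

func main() {
	// Component-set row: keyed by the Office app identity.
	fmt.Println(officeMixedKeySketch("Microsoft Word 2003 Service Pack 3", "Microsoft Office 2003 Service Pack 3"))
	// Component-null row: falls back to the product (illustrative value).
	fmt.Println(officeMixedKeySketch("", "Windows 7 Service Pack 1"))
}
```

Because both row shapes resolve to a single key, app-keyed and OS-keyed Component-Drop entries can coexist under the same bulletin without separate lookup paths.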

…argon terminology

The earlier commits in this stack used internal generator-side jargon ("Format A", "Format B", "gen_static_map.py") in TODO blocks above 175 bulletins and in five other inline comments. The jargon does not explain itself to a reader who has not seen the local generator source. Rewrite those comments to describe the shape directly:

- "Format A" → "KB-keyed NA filter" / "KB-keyed NA derivation"
- "Format B" → "per-product Component-Drop"
- "Format B labels" → "archive markdown labels"
- "gen_static_map.py" → "the archive markdown extractor" / "the archive markdown matrix table"

No data/code semantics change. Pure documentation refresh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MaineK00n force-pushed the MaineK00n/bulletin-archive-amendments-cleanup branch from 7dbb6fc to f4c7d5e Compare May 23, 2026 02:21

MaineK00n changed the title ~~fix(extract/microsoft/bulletin): remove over-aggressive KB-Drop entries for MS11-015 and MS13-069~~ fix(extract/microsoft/bulletin): regenerate Format A (KB-keyed NA) from current gen_static_map.py May 23, 2026

MaineK00n force-pushed the MaineK00n/bulletin-archive-amendments-cleanup branch from 9ca9a1a to 48ffdab Compare May 23, 2026 02:47

MaineK00n changed the title ~~fix(extract/microsoft/bulletin): regenerate Format A (KB-keyed NA) from current gen_static_map.py~~ fix(extract/microsoft/bulletin): regenerate Format A + Format B from current gen_static_map.py May 23, 2026

MaineK00n and others added 5 commits May 24, 2026 22:16

MaineK00n changed the title ~~fix(extract/microsoft/bulletin): regenerate Format A + Format B from current gen_static_map.py~~ fix(extract/microsoft/bulletin): refresh bulletin archive NA filtering from current markdown extractor May 24, 2026

MaineK00n force-pushed the MaineK00n/bulletin-archive-amendments-cleanup branch from 123d31c to 84ab1de Compare May 24, 2026 13:18

MaineK00n wants to merge 5 commits into MaineK00n/bulletin-archive-amendments-refactor from MaineK00n/bulletin-archive-amendments-cleanup