
fix(extract/microsoft/bulletin): refresh bulletin archive NA filtering from current markdown extractor #821

Draft
MaineK00n wants to merge 5 commits into MaineK00n/bulletin-archive-amendments-refactor from MaineK00n/bulletin-archive-amendments-cleanup

Conversation


@MaineK00n MaineK00n commented May 23, 2026

Summary

Stacked on #820. Regenerates the corpus-wide per-bulletin NA filtering data (bulletinArchiveAmendments[*].CVEAdjustments) using the current archive markdown extractor output, replacing the legacy committed data produced by an older extractor with a looser per-(KB, CVE) classification rule.

Why

The legacy bulletinArchiveKBNotApplicable map had two classes of NA-filtering defects:

  1. KB-uniform Drops applied where only some xlsx rows are NA per the markdown matrix: the older extractor treated a (KB, CVE) as KB-uniformly NA if ANY row carried NA. The current extractor uses the stricter all(statuses) condition, so a (KB, CVE) is KB-uniformly NA only when every row of the KB is NA.

  2. No per-product Component-Drop entries for OS-level / IE Cumulative bulletins where the markdown matrix narrows specific CVEs to specific product rows.
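The difference between the two classification rules in item 1 can be sketched as follows. This is a minimal illustration, not the extractor's code: the function names, the status strings, and the handling of an empty row list are all assumptions.

```go
package main

import "fmt"

// Hypothetical stand-ins for the per-row cells of the markdown matrix.
const (
	statusNA       = "NA"
	statusAffected = "affected"
)

// legacyKBUniformNA mirrors the older, looser rule:
// a single NA row is enough to mark the whole (KB, CVE) as NA.
func legacyKBUniformNA(statuses []string) bool {
	for _, s := range statuses {
		if s == statusNA {
			return true
		}
	}
	return false
}

// strictKBUniformNA mirrors the all(statuses) rule:
// the (KB, CVE) is NA only when every row is NA.
// Assumption: a KB with no rows is treated as not-NA, so it is never dropped.
func strictKBUniformNA(statuses []string) bool {
	if len(statuses) == 0 {
		return false
	}
	for _, s := range statuses {
		if s != statusNA {
			return false
		}
	}
	return true
}

func main() {
	// One NA row (e.g. an XP row) next to rows that are affected.
	mixed := []string{statusNA, statusAffected, statusAffected}
	fmt.Println("legacy:", legacyKBUniformNA(mixed))
	fmt.Println("strict:", strictKBUniformNA(mixed))
}
```

With a mixed row set the legacy rule drops the CVE for the whole KB, while the strict rule keeps it and leaves the per-row NA to a per-product Component-Drop.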

Visible failure modes surfaced by the 2026-05-23 gost-vs-vuls2 benchmark:

  • Bulletin-ID-as-CVE artifacts ("MS11-015", "MS13-069" appearing in scannedCves)
  • Server 2008 R2's 637-CVE gost-only cluster (CVEs silently under-attributed by the legacy over-Drop)

Commits

| Commit | Scope |
| --- | --- |
| 30574643 | KB-keyed NA filter regenerate: 158 over-aggressive KB-Drop entries removed, 131 updated. MS06-060 / MS16-068 manual exceptions preserved. |
| 68f8aa60 | OS-level per-product Component-Drop: 743 per-(bulletin, xlsx product) entries across 97 bulletins, 76 new normalize switch cases. |
| a9481413 | IE Cumulative per-product Component-Drop: 989 per-(bulletin, "IE_X for OS_Y") entries across 37 pre-MS14 IE Cum bulletins. New ieCumCombinedKey helper constructs canonical combined key from xlsx (component, product). |
| 9c29884b | Office app Component-Drops + TODO markers: 95 entries for 27 Office mixed-shape bulletins (component-keyed) with conditional normalize (return component when set, else product). Plus TODO comment blocks for 175 bulletins documenting 741 remaining unmapped archive markdown labels for future manual review. |
| 84ab1dea | Inline-comment terminology cleanup: replace generator-side jargon ("Format A", "Format B", gen_static_map.py) in TODO blocks and other inline comments with shape-describing terms ("KB-keyed NA filter", "per-product Component-Drop", "archive markdown extractor / matrix table"). No data/code semantics change. |

Mapping coverage (cumulative)

| Category | Total markdown labels | Mapped | Coverage |
| --- | --- | --- | --- |
| OS-level (Windows-prefix / Microsoft-prefix per-product) | 1163 | 734 (incl. 95 Office app) | 63% |
| IE Cumulative pre-MS14 (combined IE+OS) | 1334 | 989 | 74% |
| Total per-product labels | 2497 | 1723 | 69% |

The remaining 741 unmapped labels are documented inline with TODO comments per bulletin — they fall into patterns that require per-bulletin manual review:

  • Combined feature+OS labels ("DirectX 9.0 on Windows XP SP2" — need feature+OS combined normalize)
  • SP-less product variants ("Windows Vista" without SP — need prefix-expansion)
  • Compound "X with Y" inverted forms
  • Comma-separated DirectX variant lists

Detection benchmark impact

Measured against the 2026-05-23 vuls.db / 26-host benchmark across this stack:

| State | gost-only | vuls-only | Hosts matching |
| --- | --- | --- | --- |
| Pre-stack (legacy committed data) | 1334 | 4334 | 0/26 |
| After KB-keyed NA filter regenerate only | 624 | 3355 | 6/26 |
| After full per-product Component-Drop (this PR) | TBD (pending rebuild) | | |
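For readers unfamiliar with the column names, here is a sketch of how such counts could be derived per host. It assumes that "gost-only" and "vuls-only" are plain set differences over detected IDs and that a host "matches" when both differences are empty. The helper name and the sample IDs are illustrative and are not taken from the benchmark harness.

```go
package main

import (
	"fmt"
	"sort"
)

// onlyIn returns the IDs present in a but absent from b, sorted.
// Assumption: each input list is already de-duplicated.
func onlyIn(a, b []string) []string {
	seen := make(map[string]struct{}, len(b))
	for _, id := range b {
		seen[id] = struct{}{}
	}
	var out []string
	for _, id := range a {
		if _, ok := seen[id]; !ok {
			out = append(out, id)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Illustrative detections for one host, including a bulletin-ID-as-CVE artifact.
	gost := []string{"CVE-2011-0032", "CVE-2011-0042"}
	vuls := []string{"CVE-2011-0042", "MS11-015"}

	gostOnly := onlyIn(gost, vuls)
	vulsOnly := onlyIn(vuls, gost)

	fmt.Println("gost-only:", gostOnly)
	fmt.Println("vuls-only:", vulsOnly)
	fmt.Println("host matches:", len(gostOnly) == 0 && len(vulsOnly) == 0)
}
```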

Manual exceptions (preserved)

  • MS06-060 (KB923088, KB923089, KB923090, KB924998, KB924999): bulletin's "Vulnerability Identifiers" header layout is not yet recognised by the archive markdown extractor.
  • MS16-068 (KB3163017 / CVE-2016-3215): markdown narrowing is in natural-language rather than a per-CVE matrix table.

Testing

  • GOEXPERIMENT=jsonv2 go test -count=1 ./pkg/extract/microsoft/... — passes (all 5 subpkgs).
  • TestBulletinArchiveNotApplicable — passes (KB-keyed + per-bulletin inner-key).

Test plan

  • Build with GOEXPERIMENT=jsonv2 go build ./...
  • Test with GOEXPERIMENT=jsonv2 go test ./...
  • Rebuild vuls.db with this version; re-run 26-host benchmark; verify additional hosts converge after per-product Component-Drop is applied.
  • Review the 175 TODO blocks in bulletin.go (grep "TODO: the per-product NA entries below" pkg/extract/microsoft/bulletin/bulletin.go) and resolve any high-impact unmapped labels as follow-up.

🤖 Generated with Claude Code

@MaineK00n MaineK00n force-pushed the MaineK00n/bulletin-archive-amendments-cleanup branch from 7dbb6fc to f4c7d5e Compare May 23, 2026 02:21
@MaineK00n MaineK00n changed the title from "fix(extract/microsoft/bulletin): remove over-aggressive KB-Drop entries for MS11-015 and MS13-069" to "fix(extract/microsoft/bulletin): regenerate Format A (KB-keyed NA) from current gen_static_map.py" May 23, 2026
@MaineK00n MaineK00n force-pushed the MaineK00n/bulletin-archive-amendments-cleanup branch from 9ca9a1a to 48ffdab Compare May 23, 2026 02:47
@MaineK00n MaineK00n changed the title from "fix(extract/microsoft/bulletin): regenerate Format A (KB-keyed NA) from current gen_static_map.py" to "fix(extract/microsoft/bulletin): regenerate Format A + Format B from current gen_static_map.py" May 23, 2026
MaineK00n and others added 5 commits May 24, 2026 22:16
…urrent archive markdown extractor

The legacy bulletinArchiveKBNotApplicable map (now under bulletinArchiveAmendments
after #820's per-bulletin refactor) was generated by an older version of the
archive markdown extractor whose per-(KB, CVE) classification logic was looser.
The older heuristic treated a (KB, CVE) cell as KB-uniformly NA if ANY xlsx row
for that KB carried NA for the CVE; the current derivation uses the stricter
all(statuses) condition so a (KB, CVE) is only KB-uniformly NA when ALL
rows of the KB are NA.

The legacy committed data therefore carried over-aggressive KB-uniform Drops
across many bulletins. These over-Drops have two visible failure modes:

1. Bulletin-ID leak (e.g. "MS11-015", "MS13-069" appearing as cveID in
   downstream scannedCves) when every CVE attribution of a KB is dropped,
   leaving the bulletin without any vulnerability entries.

2. Silent CVE under-attribution where some xlsx rows that should be vuln
   (per markdown matrix table) get filtered alongside the truly-NA rows.

Surfaced by the 2026-05-23 gost-vs-vuls2 detection comparison benchmark.
After this regenerate, 6 of 26 hosts converge to full match with gost
(Win 7, 8.1, Server 2008, 2008 R2, 2012, 2012 R2) — Server 2008 R2's
637-CVE gost-only cluster collapses to 0 in particular, because the
under-attribution is corrected.

Scope: regenerates the KB-keyed Drop adjustments (CVEAdjustments where
KB != "") across all bulletins from the current archive markdown matrix
table output. Other adjustment kinds (Add, Remap, Component-Drop) and
RowSplits / Supersedes / IECumChain are preserved unchanged.

Two manually-curated bulletins are preserved as exceptions where the
current extractor does not produce equivalent output:

- MS06-060 (KB923088/923089/923090/924998/924999): the bulletin's
  "Vulnerability Identifiers" header layout is not yet recognised by
  the archive markdown extractor — manual entries documented inline.

- MS16-068 (KB3163017 / CVE-2016-3215): the markdown matrix narrowing
  is in natural language ("Only Windows 10 Version 1511 is affected")
  rather than a per-CVE matrix table, so the KB-keyed NA derivation
  does not surface this NA from MS16-068's markdown — manually
  restated under MS16-068's amendment.

Trade-off: the regenerate removes the legacy KB-uniform NA filtering
that wrongly applied to per-product NA cases. The bulletin-ID leak is
eliminated, but per-product NA cases (e.g. MS11-015 CVE-2011-0032 NA
only on XP rows; MS11-015 CVE-2011-0042 NA only on Server 2008 R2)
will now flow through with the full CVE list, producing small per-
product false positives. The follow-up to this commit adds per-product
Component-Drop entries plus the corresponding
normalizeArchiveComponentKey special-cases so those per-product NAs
are correctly suppressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es for OS-level bulletins

Follow-up to the KB-keyed NA filter regenerate. The legacy KB-uniform
NA filtering that was wrongly applied to per-product NA cases is now
corrected for OS-level / App-level bulletins via per-product
Component-Drop adjustments derived from the current archive markdown
matrix table output.

Scope: 97 bulletins whose archive markdown labels are predominantly
OS-level ("Windows ...") or App-level ("Microsoft ..."). The per-product
NA cells from the markdown per-CVE matrix table are now suppressed at
the per-(bulletin, xlsx affected_product) level.

Each Component-Drop entry's selector matches the row's xlsx
affected_product via normalizeArchiveComponentKey returning `product`
verbatim (76 new bulletins added to the switch's "return product"
branch alongside the existing MS11-015 / MS12-* / MS13-046 / MS13-081
/ MS15-* / MS16-* / MS17-* cases).

Mapping markdown column-header labels to xlsx affected_product:
- direct (md_label == xlsx_product as-is)         — 425 entries
- stripped-"Microsoft " prefix                    — 109 entries
- compound "X and Y" split into per-part entries  — 104 entries (+
                                                    expanded to 188
                                                    per-xlsx-product
                                                    entries)
- substring match (md_label inside xlsx_product)  — 1 entry

Total: 743 per-(bulletin, xlsx_product) Component-Drop entries (739
new + 4 already-present for MS11-015 from the preceding fix commit).
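The four mapping rules above could be sketched as a single resolver. This is an illustration only: mapLabel, its rule ordering, the all-or-nothing handling of compound labels, and the "exactly one substring hit" condition are assumptions, not the generator's actual logic.

```go
package main

import (
	"fmt"
	"strings"
)

// mapLabel resolves one markdown column-header label to xlsx affected_product
// values. It returns nil when no rule applies (the "deferred" case).
func mapLabel(label string, xlsxProducts []string) []string {
	known := make(map[string]bool, len(xlsxProducts))
	for _, p := range xlsxProducts {
		known[p] = true
	}
	// Rule 1: direct match.
	if known[label] {
		return []string{label}
	}
	// Rule 2: match after stripping a leading "Microsoft ".
	if s, ok := strings.CutPrefix(label, "Microsoft "); ok && known[s] {
		return []string{s}
	}
	// Rule 3: compound "X and Y", resolved part by part.
	// Assumption: if any part fails to resolve, the whole label is deferred.
	if parts := strings.Split(label, " and "); len(parts) > 1 {
		var out []string
		for _, part := range parts {
			m := mapLabel(strings.TrimSpace(part), xlsxProducts)
			if m == nil {
				return nil
			}
			out = append(out, m...)
		}
		return out
	}
	// Rule 4: label is a substring of exactly one xlsx product.
	var hits []string
	for _, p := range xlsxProducts {
		if strings.Contains(p, label) {
			hits = append(hits, p)
		}
	}
	if len(hits) == 1 {
		return hits
	}
	return nil
}

func main() {
	// Illustrative product strings, not copied from the xlsx.
	products := []string{
		"Windows XP Service Pack 3",
		"Windows Vista Service Pack 2",
		"Windows Server 2008 R2 for x64-based Systems",
	}
	for _, label := range []string{
		"Microsoft Windows XP Service Pack 3",
		"Windows XP Service Pack 3 and Windows Vista Service Pack 2",
		"Windows Server 2008 R2",
		"DirectX 9.0 on Windows XP Service Pack 2",
	} {
		fmt.Printf("%q -> %q\n", label, mapLabel(label, products))
	}
}
```

The last label falls through every rule and comes back nil, which corresponds to the combined product+feature labels listed as deferred below.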

Deferred / not in this commit:
- 524 markdown labels with no straightforward xlsx mapping (combined
  product+feature labels like "DirectX 9.0 on Microsoft Windows ...",
  itanium/SP variants not in xlsx, etc.) — needs per-bulletin manual
  review.
- 71 IE Cumulative bulletins with combined "Internet Explorer X for
  Windows Y" markdown labels — handled in the next commit since they
  require a different normalize key construction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es for IE Cumulative bulletins

Follow-up to the OS-level per-product Component-Drop fix. Pre-MS14 IE
Cumulative bulletins (MS07-* through MS13-*) use markdown column-header
labels that combine the IE version and OS into a single label
(e.g. "Internet Explorer 6 for Windows XP Service Pack 3"), with
varied connector words ("for", "in", "on", "when installed on")
in the upstream markdown.

This commit:

1. Adds `ieCumCombinedKey(component, product)` helper to
   normalizeArchiveComponentKey. The helper constructs the canonical
   "IE_X (Service Pack Y) for OS" key from the xlsx row's
   (affected_component, affected_product) pair:
   - Strips "Microsoft "/"Windows " prefix from affected_component
   - Drops ".0" minor-version suffix where present (so xlsx-form
     "Internet Explorer 6.0" matches markdown-form "Internet Explorer 6")
   - Strips "Microsoft " prefix from affected_product
   - Joins with " for " (the canonical connector)

2. Routes 37 pre-MS14 IE Cumulative bulletins through the helper:
   MS07-069, MS08-031/058/073, MS09-014/019/034/054/072,
   MS10-002/018/035/053/071/090, MS11-003/018/050/057/081/099,
   MS12-010/023/037/052/063/077, MS13-009/021/037/047/055/059/069/080/088/097

3. Adds 989 Component-Drop entries to bulletinArchiveAmendments
   with canonical "IE_X for OS" keys generated by transforming the
   archive markdown matrix table output:
   - Connector normalization: "on" / "in" / "when installed on" → "for"
   - Compound splitting: "for X and Y" → two entries with X and Y respectively
   - "Microsoft " prefix stripping on the OS portion
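The key construction in step 1 and the connector normalization in step 3 could look roughly like the sketch below. The function names carry a Sketch suffix or are otherwise hypothetical to make clear that this is not the helper added in this commit. Dropping ".0" per whitespace-separated token and rewriting only the earliest connector are assumptions about details the commit message does not spell out.

```go
package main

import (
	"fmt"
	"strings"
)

// trimDotZero drops a trailing ".0" from each whitespace-separated token,
// so "Internet Explorer 6.0" becomes "Internet Explorer 6" while "5.01" is kept.
func trimDotZero(s string) string {
	fields := strings.Fields(s)
	for i, f := range fields {
		fields[i] = strings.TrimSuffix(f, ".0")
	}
	return strings.Join(fields, " ")
}

// ieCumCombinedKeySketch builds an "IE_X for OS" key from an xlsx row's
// (affected_component, affected_product) pair.
func ieCumCombinedKeySketch(component, product string) string {
	c := strings.TrimPrefix(component, "Microsoft ")
	c = strings.TrimPrefix(c, "Windows ")
	c = trimDotZero(c)
	p := strings.TrimPrefix(product, "Microsoft ")
	return c + " for " + p
}

// normalizeConnector rewrites the earliest connector in a markdown label to
// " for ". Later occurrences (e.g. "for 32-bit Systems") are left untouched.
func normalizeConnector(label string) string {
	connectors := []string{" when installed on ", " for ", " on ", " in "}
	best, bestLen := -1, 0
	for _, c := range connectors {
		if i := strings.Index(label, c); i >= 0 && (best < 0 || i < best) {
			best, bestLen = i, len(c)
		}
	}
	if best < 0 {
		return label
	}
	return label[:best] + " for " + label[best+bestLen:]
}

func main() {
	// Illustrative inputs; the exact xlsx strings may differ.
	key := ieCumCombinedKeySketch("Microsoft Internet Explorer 6.0", "Microsoft Windows XP Service Pack 3")
	label := normalizeConnector("Internet Explorer 6 when installed on Windows XP Service Pack 3")
	fmt.Println(key)
	fmt.Println(label)
	fmt.Println("match:", key == label)
}
```

Normalizing both sides to the same connector is what lets a Component-Drop entry derived from the markdown label match the key built from the xlsx row.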

Together with the preceding two commits (KB-keyed NA regenerate +
OS-level per-product Component-Drop), this restores the
per-(KB, component, CVE) NA precision for the bulletin archive corpus,
eliminating both the bulletin-ID-as-CVE artifact of the legacy KB-uniform
NA and the per-product over-attribution left by its removal.

Coverage: 989 of the 1334 IE Cumulative archive markdown labels (74%).
345 unmatched labels are deferred — these are mostly newer-format
IE 10/11 labels and Itanium/embedded edge cases where the
markdown ↔ xlsx product mapping requires per-bulletin manual
review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…O markers for unmapped archive markdown labels

Follow-up to the OS-level and IE Cumulative per-product Component-Drop
coverage. This commit closes two gaps:

1. Office app mixed-shape bulletins (27 bulletins, 95 new
   Component-Drop entries). Bulletins like MS10-079, MS12-030,
   MS15-070 etc. carry the Office app identity
   (e.g. "Microsoft Word 2003 Service Pack 3") in xlsx
   affected_component for some rows and leave it empty for the
   OS-level rows of the same bulletin. normalizeArchiveComponentKey
   now routes these 27 bulletins to a new case that returns
   `component` when set and falls back to `product` — so the
   per-product Component-Drop entries keyed by app name match the
   component-set rows and the existing OS-level entries continue to
   match the component-null rows.

2. TODO comment blocks above 175 bulletins listing the remaining
   741 archive markdown labels that could not be mapped
   automatically. These labels fall into several patterns the
   automatic mapper cannot resolve:

   - Combined feature+OS labels like "DirectX 9.0 on Windows XP
     Service Pack 2" (need a per-bulletin normalize that constructs
     the combined key from xlsx (component, product), similar to
     ieCumCombinedKey but for DirectX / Media Services / etc.)
   - SP-less product variants like "Windows Vista" (xlsx only
     carries SP1/SP2 variants — markdown is rolling up multiple
     SPs into one column, needs prefix-expansion to all SP
     variants in xlsx)
   - Compound "X with Y" inverted order ("Windows 2000 SP4 with
     DirectX 7.0"): same as the combined feature+OS labels above, but
     with reversed parts
   - Compound "X and Y, A, B, or C" with comma-separated tails

   Each affected bulletin gets a TODO block above its amendment
   entry documenting the missing labels. A future contributor (or
   the maintainer manually) can resolve these by:
   - Extending the per-bulletin normalize case to construct the
     missing combined key, or expanding to all matching xlsx
     products
   - Adding Component-Drop entries below the TODO marker
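The conditional normalize described in item 1 (return `component` when set, else `product`) amounts to a small fallback. The sketch below uses a hypothetical row type, a hypothetical drop table, and placeholder CVE IDs; only the fallback rule itself comes from the commit message.

```go
package main

import "fmt"

// row is a hypothetical, trimmed-down view of one xlsx row.
type row struct {
	Component string // affected_component; empty for OS-level rows
	Product   string // affected_product
}

// componentOrProduct returns the lookup key for a mixed-shape bulletin:
// the Office app name when the row carries one, otherwise the OS product.
func componentOrProduct(r row) string {
	if r.Component != "" {
		return r.Component
	}
	return r.Product
}

func main() {
	// Placeholder CVE IDs, not taken from any bulletin.
	drops := map[string][]string{
		"Microsoft Word 2003 Service Pack 3": {"CVE-0000-0001"},
		"Windows XP Service Pack 3":          {"CVE-0000-0002"},
	}
	rows := []row{
		{Component: "Microsoft Word 2003 Service Pack 3", Product: "Microsoft Office 2003 Service Pack 3"},
		{Product: "Windows XP Service Pack 3"},
	}
	for _, r := range rows {
		key := componentOrProduct(r)
		fmt.Printf("%s -> drop %v\n", key, drops[key])
	}
}
```

Keying on one value per row keeps the existing OS-level entries working unchanged while letting app-keyed entries match the component-set rows of the same bulletin.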

The benchmark impact of the unmapped 741 labels is limited because
they cluster on older bulletins (MS07-* through MS13-*) whose
relevant xlsx products are not heavily represented on the 26-host
benchmark fixture set. The TODO markers preserve the information
for future review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…argon terminology

The earlier commits in this stack used internal generator-side jargon
("Format A", "Format B", "gen_static_map.py") in TODO blocks above
175 bulletins and in five other inline comments. The jargon does not
explain itself to a reader who has not seen the local generator
source. Rewrite those comments to describe the shape directly:

- "Format A" → "KB-keyed NA filter" / "KB-keyed NA derivation"
- "Format B" → "per-product Component-Drop"
- "Format B labels" → "archive markdown labels"
- "gen_static_map.py" → "the archive markdown extractor" /
  "the archive markdown matrix table"

No data/code semantics change. Pure documentation refresh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MaineK00n MaineK00n changed the title from "fix(extract/microsoft/bulletin): regenerate Format A + Format B from current gen_static_map.py" to "fix(extract/microsoft/bulletin): refresh bulletin archive NA filtering from current markdown extractor" May 24, 2026
@MaineK00n MaineK00n force-pushed the MaineK00n/bulletin-archive-amendments-cleanup branch from 123d31c to 84ab1de Compare May 24, 2026 13:18