fix(extract/microsoft/bulletin): refresh bulletin archive NA filtering from current markdown extractor #821
Draft
MaineK00n wants to merge 5 commits into
…urrent archive markdown extractor
The legacy bulletinArchiveKBNotApplicable map (now under
bulletinArchiveAmendments after #820's per-bulletin refactor) was
generated by an older version of the archive markdown extractor whose
per-(KB, CVE) classification logic was looser. The older heuristic
treated a (KB, CVE) cell as KB-uniformly NA if ANY xlsx row for that KB
carried NA for the CVE; the current derivation uses the stricter
all(statuses) condition so a (KB, CVE) is only KB-uniformly NA when ALL
rows of the KB are NA. The legacy committed data therefore carried
over-aggressive KB-uniform Drops across many bulletins.
These over-Drops have two visible failure modes:
1. Bulletin-ID leak (e.g. "MS11-015", "MS13-069" appearing as cveID in
downstream scannedCves) when every CVE attribution of a KB is dropped,
leaving the bulletin without any vulnerability entries.
2. Silent CVE under-attribution where some xlsx rows that should be
vuln (per markdown matrix table) get filtered alongside the truly-NA
rows.
Surfaced by the 2026-05-23 gost-vs-vuls2 detection comparison benchmark.
After this regenerate, 6 of 26 hosts converge to full match with gost
(Win 7, 8.1, Server 2008, 2008 R2, 2012, 2012 R2). In particular,
Server 2008 R2's 637-CVE gost-only cluster collapses to 0, because the
under-attribution is corrected.
Scope: regenerates the KB-keyed Drop adjustments (CVEAdjustments where
KB != "") across all bulletins from the current archive markdown matrix
table output. Other adjustment kinds (Add, Remap, Component-Drop) and
RowSplits / Supersedes / IECumChain are preserved unchanged.
Two manually-curated bulletins are preserved as exceptions where the
current extractor does not produce equivalent output:
- MS06-060 (KB923088/923089/923090/924998/924999): the bulletin's
"Vulnerability Identifiers" header layout is not yet recognised by the
archive markdown extractor; manual entries documented inline.
- MS16-068 (KB3163017 / CVE-2016-3215): the markdown matrix narrowing is
in natural language ("Only Windows 10 Version 1511 is affected") rather
than a per-CVE matrix table, so the KB-keyed NA derivation does not
surface this NA from MS16-068's markdown; manually restated under
MS16-068's amendment.
Trade-off: the regenerate removes the legacy KB-uniform NA filtering
that wrongly applied to per-product NA cases. The bulletin-ID leak is
eliminated, but per-product NA cases (e.g. MS11-015 CVE-2011-0032 NA
only on XP rows; MS11-015 CVE-2011-0042 NA only on Server 2008 R2) will
now flow through with the full CVE list, producing small per-product
false positives. The follow-up to this commit adds per-product
Component-Drop entries plus the corresponding
normalizeArchiveComponentKey special-cases so those per-product NAs are
correctly suppressed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
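The difference between the older ANY heuristic and the current all(statuses) condition can be illustrated with a minimal Go sketch. The type and function names below (rowStatus, legacyKBUniformNA, strictKBUniformNA) are hypothetical and are not taken from the extractor; treating an empty row set as not-NA is also an assumption.

```go
package main

import "fmt"

// rowStatus is a hypothetical per-xlsx-row status for one (KB, CVE) pair.
type rowStatus int

const (
	statusVulnerable rowStatus = iota
	statusNotApplicable
)

// legacyKBUniformNA mirrors the older heuristic described above:
// the (KB, CVE) pair is treated as NA if ANY row is NA.
func legacyKBUniformNA(rows []rowStatus) bool {
	for _, s := range rows {
		if s == statusNotApplicable {
			return true
		}
	}
	return false
}

// strictKBUniformNA mirrors the current rule: the (KB, CVE) pair is NA
// only if ALL rows are NA. An empty row set is treated as not NA
// (assumption made for this sketch).
func strictKBUniformNA(rows []rowStatus) bool {
	if len(rows) == 0 {
		return false
	}
	for _, s := range rows {
		if s != statusNotApplicable {
			return false
		}
	}
	return true
}

func main() {
	// One NA row (e.g. an XP row) next to two vulnerable rows.
	mixed := []rowStatus{statusNotApplicable, statusVulnerable, statusVulnerable}
	fmt.Println(legacyKBUniformNA(mixed), strictKBUniformNA(mixed)) // true false
}
```

With the mixed input, the legacy rule drops the CVE for every row of the KB, which is the over-Drop described above; the strict rule leaves it in place for a later per-product decision.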
…es for OS-level bulletins
Follow-up to the KB-keyed NA filter regenerate. The legacy KB-uniform
NA filtering that was wrongly applied to per-product NA cases is now
corrected for OS-level / App-level bulletins via per-product
Component-Drop adjustments derived from the current archive markdown
matrix table output.
Scope: 97 bulletins whose archive markdown labels are predominantly
OS-level ("Windows ...") or App-level ("Microsoft ..."). The per-product
NA narrowing from the markdown per-CVE matrix table is now applied as a
suppression at the per-(bulletin, xlsx affected_product) level.
Each Component-Drop entry's selector matches the row's xlsx
affected_product via normalizeArchiveComponentKey returning `product`
verbatim (76 new bulletins added to the switch's "return product"
branch alongside the existing MS11-015 / MS12-* / MS13-046 / MS13-081
/ MS15-* / MS16-* / MS17-* cases).
Mapping markdown column-header labels to xlsx affected_product:
- direct (md_label == xlsx_product as-is): 425 entries
- stripped-"Microsoft " prefix: 109 entries
- compound "X and Y" split into per-part entries: 104 entries
(expanded to 188 per-xlsx-product entries)
- substring match (md_label inside xlsx_product): 1 entry
Total: 743 per-(bulletin, xlsx_product) Component-Drop entries (739
new + 4 already-present for MS11-015 from the preceding fix commit).
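The four mapping cases above can be sketched roughly as follows. This is an illustration only: mapLabel is a hypothetical function, the order in which the rules are tried is an assumption, and the real generator may expand one compound part into several xlsx products, which this sketch does not do.

```go
package main

import (
	"fmt"
	"strings"
)

// mapLabel is a hypothetical mapper from one markdown column-header label
// to the xlsx affected_product values it selects. It returns nil when no
// rule applies, i.e. the label would be deferred for manual review.
func mapLabel(label string, xlsxProducts []string) []string {
	known := make(map[string]bool, len(xlsxProducts))
	for _, p := range xlsxProducts {
		known[p] = true
	}

	// exact tries the label as-is, then with a "Microsoft " prefix stripped.
	exact := func(s string) (string, bool) {
		if known[s] {
			return s, true
		}
		if t := strings.TrimPrefix(s, "Microsoft "); t != s && known[t] {
			return t, true
		}
		return "", false
	}

	if p, ok := exact(label); ok {
		return []string{p}
	}

	// Compound "X and Y": accept only if every part resolves.
	if parts := strings.Split(label, " and "); len(parts) > 1 {
		out := make([]string, 0, len(parts))
		for _, part := range parts {
			p, ok := exact(strings.TrimSpace(part))
			if !ok {
				break
			}
			out = append(out, p)
		}
		if len(out) == len(parts) {
			return out
		}
	}

	// Substring match: accept only an unambiguous single hit.
	var hits []string
	for _, p := range xlsxProducts {
		if strings.Contains(p, label) {
			hits = append(hits, p)
		}
	}
	if len(hits) == 1 {
		return hits
	}
	return nil
}

func main() {
	products := []string{"Windows XP Service Pack 3", "Windows Vista Service Pack 2"}
	label := "Microsoft Windows XP Service Pack 3 and Windows Vista Service Pack 2"
	fmt.Println(mapLabel(label, products))
}
```

Returning nil for anything ambiguous keeps the sketch conservative: a label that matches no rule, or more than one product by substring, is left for manual review rather than guessed.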
Deferred / not in this commit:
- 524 markdown labels with no straightforward xlsx mapping (combined
product+feature labels like "DirectX 9.0 on Microsoft Windows ...",
itanium/SP variants not in xlsx, etc.) — needs per-bulletin manual
review.
- 71 IE Cumulative bulletins with combined "Internet Explorer X for
Windows Y" markdown labels — handled in the next commit since they
require a different normalize key construction.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es for IE Cumulative bulletins
Follow-up to the OS-level per-product Component-Drop fix. Pre-MS14 IE
Cumulative bulletins (MS07-* through MS13-*) use markdown column-header
labels that combine the IE version and OS into a single label
(e.g. "Internet Explorer 6 for Windows XP Service Pack 3"), with
varied connector words ("for", "in", "on", "when installed on")
in the upstream markdown.
This commit:
1. Adds `ieCumCombinedKey(component, product)` helper to
normalizeArchiveComponentKey. The helper constructs the canonical
"IE_X (Service Pack Y) for OS" key from the xlsx row's
(affected_component, affected_product) pair:
- Strips "Microsoft "/"Windows " prefix from affected_component
- Drops ".0" minor-version suffix where present (so xlsx-form
"Internet Explorer 6.0" matches markdown-form "Internet Explorer 6")
- Strips "Microsoft " prefix from affected_product
- Joins with " for " (the canonical connector)
2. Routes 37 pre-MS14 IE Cumulative bulletins through the helper:
MS07-069, MS08-031/058/073, MS09-014/019/034/054/072,
MS10-002/018/035/053/071/090, MS11-003/018/050/057/081/099,
MS12-010/023/037/052/063/077, MS13-009/021/037/047/055/059/069/080/088/097
3. Adds 989 Component-Drop entries to bulletinArchiveAmendments
with canonical "IE_X for OS" keys generated by transforming the
archive markdown matrix table output:
- Connector normalization: "on" / "in" / "when installed on" → "for"
- Compound splitting: "for X and Y" → two entries with X and Y respectively
- "Microsoft " prefix stripping on the OS portion
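The key construction in step 1 and the connector normalization in step 3 can be sketched as below. Both functions are hypothetical reconstructions from the description in this commit message, not the committed code: ieCumCombinedKeySketch stands in for the helper, canonicalLabel for the generator-side label transform, and compound splitting is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// ieCumCombinedKeySketch builds an "IE_X (Service Pack Y) for OS" key from
// the xlsx (affected_component, affected_product) pair, following the four
// bullet points of step 1. It is a sketch, not the real helper.
func ieCumCombinedKeySketch(component, product string) string {
	c := strings.TrimPrefix(component, "Microsoft ")
	c = strings.TrimPrefix(c, "Windows ")
	// Drop a ".0" minor version, either at the end of the string or
	// directly before a " Service Pack N" tail.
	if strings.HasSuffix(c, ".0") {
		c = strings.TrimSuffix(c, ".0")
	} else {
		c = strings.Replace(c, ".0 ", " ", 1)
	}
	p := strings.TrimPrefix(product, "Microsoft ")
	return c + " for " + p
}

// canonicalLabel rewrites a markdown column-header label so that its
// connector is " for " and the OS portion has no "Microsoft " prefix.
// The longest connector is tried first so that " when installed on "
// is not mistaken for a bare " on ".
func canonicalLabel(label string) string {
	for _, conn := range []string{" when installed on ", " for ", " on ", " in "} {
		if i := strings.Index(label, conn); i >= 0 {
			iePart := label[:i]
			osPart := strings.TrimPrefix(label[i+len(conn):], "Microsoft ")
			return iePart + " for " + osPart
		}
	}
	return label
}

func main() {
	key := ieCumCombinedKeySketch("Internet Explorer 6.0", "Microsoft Windows XP Service Pack 3")
	label := canonicalLabel("Internet Explorer 6 when installed on Windows XP Service Pack 3")
	fmt.Println(key)
	fmt.Println(key == label)
}
```

The point of normalizing both sides is that the xlsx-derived key and the markdown-derived key meet at the same canonical string, so a Component-Drop entry can be matched by plain string equality.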
Together with the preceding two commits (KB-keyed NA regenerate +
OS-level per-product Component-Drop), this restores the
per-(KB, component, CVE) NA precision for the bulletin archive corpus,
eliminating both the bulletin-ID-as-CVE artifact and the per-product
over-attribution that the legacy KB-uniform NA introduced.
Coverage: 989 of the 1334 IE Cumulative archive markdown labels (74%).
345 unmatched labels are deferred — these are mostly newer-format
IE 10/11 labels and Itanium/embedded edge cases where the
markdown ↔ xlsx product mapping requires per-bulletin manual
review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…O markers for unmapped archive markdown labels
Follow-up to the OS-level and IE Cumulative per-product Component-Drop
coverage. This commit closes two gaps:
1. Office app mixed-shape bulletins (27 bulletins, 95 new
Component-Drop entries). Bulletins like MS10-079, MS12-030,
MS15-070 etc. carry the Office app identity
(e.g. "Microsoft Word 2003 Service Pack 3") in xlsx
affected_component for some rows and leave it empty for the
OS-level rows of the same bulletin. normalizeArchiveComponentKey
now routes these 27 bulletins to a new case that returns
`component` when set and falls back to `product` — so the
per-product Component-Drop entries keyed by app name match the
component-set rows and the existing OS-level entries continue to
match the component-null rows.
2. TODO comment blocks above 175 bulletins listing the remaining
741 archive markdown labels that could not be mapped
automatically. These labels fall into several patterns the
automatic mapper cannot resolve:
- Combined feature+OS labels like "DirectX 9.0 on Windows XP
Service Pack 2" (need a per-bulletin normalize that constructs
the combined key from xlsx (component, product), similar to
ieCumCombinedKey but for DirectX / Media Services / etc.)
- SP-less product variants like "Windows Vista" (xlsx only
carries SP1/SP2 variants — markdown is rolling up multiple
SPs into one column, needs prefix-expansion to all SP
variants in xlsx)
- Compound "X with Y" inverted order ("Windows 2000 SP4 with
DirectX 7.0"): same as the combined feature+OS case above, but with
reversed parts
- Compound "X and Y, A, B, or C" with comma-separated tails
Each affected bulletin gets a TODO block above its amendment
entry documenting the missing labels. A future contributor (or
the maintainer manually) can resolve these by:
- Extending the per-bulletin normalize case to construct the
missing combined key, or expanding to all matching xlsx
products
- Adding Component-Drop entries below the TODO marker
The benchmark impact of the unmapped 741 labels is limited because
they cluster on older bulletins (MS07-* through MS13-*) whose
relevant xlsx products are not heavily represented on the 26-host
benchmark fixture set. The TODO markers preserve the information
for future review.
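The component-with-fallback routing described in item 1 can be sketched as follows. The names officeMixedKey, dropSet and isDropped are hypothetical, and the CVE IDs are placeholders rather than real bulletin data.

```go
package main

import "fmt"

// officeMixedKey returns the xlsx affected_component when it is set and
// falls back to affected_product otherwise, as described for the 27
// mixed-shape Office bulletins. Sketch only; not the committed switch case.
func officeMixedKey(component, product string) string {
	if component != "" {
		return component
	}
	return product
}

// dropSet maps a normalized component key to the CVEs suppressed for it.
type dropSet map[string]map[string]bool

// isDropped reports whether a row's CVE is suppressed by a per-product
// Component-Drop entry. Looking up a missing key yields false.
func isDropped(drops dropSet, component, product, cve string) bool {
	return drops[officeMixedKey(component, product)][cve]
}

func main() {
	// Placeholder entries: one keyed by app name, one by OS name.
	drops := dropSet{
		"Microsoft Word 2003 Service Pack 3": {"CVE-0000-0001": true},
		"Windows XP Service Pack 3":          {"CVE-0000-0002": true},
	}
	// Component-set row: matched through the app-name key.
	fmt.Println(isDropped(drops, "Microsoft Word 2003 Service Pack 3", "Microsoft Office 2003", "CVE-0000-0001"))
	// Component-null row: matched through the product key.
	fmt.Println(isDropped(drops, "", "Windows XP Service Pack 3", "CVE-0000-0002"))
}
```

A single key function serving both row shapes is what lets the app-keyed entries and the existing OS-level entries coexist within one bulletin.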
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…argon terminology
The earlier commits in this stack used internal generator-side jargon
("Format A", "Format B", "gen_static_map.py") in TODO blocks above
175 bulletins and in five other inline comments. The jargon does not
explain itself to a reader who has not seen the local generator
source. Rewrite those comments to describe the shape directly:
- "Format A" → "KB-keyed NA filter" / "KB-keyed NA derivation"
- "Format B" → "per-product Component-Drop"
- "Format B labels" → "archive markdown labels"
- "gen_static_map.py" → "the archive markdown extractor" /
"the archive markdown matrix table"
No data/code semantics change. Pure documentation refresh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Stacked on #820. Regenerates the corpus-wide per-bulletin NA filtering data (`bulletinArchiveAmendments[*].CVEAdjustments`) using the current archive markdown extractor output, replacing the legacy committed data produced by an older extractor with a looser per-(KB, CVE) classification rule.
Why
The legacy `bulletinArchiveKBNotApplicable` map carried two classes of over-aggressive NA filtering:
- KB-uniform Drops applied where only some xlsx rows are NA per the markdown matrix: the older extractor treated a (KB, CVE) as KB-uniformly NA if ANY row carried NA. The current extractor uses `all(statuses)`, which is strict.
- No per-product Component-Drop entries for OS-level / IE Cumulative bulletins where the markdown matrix narrows specific CVEs to specific product rows.
Visible failure modes surfaced by the 2026-05-23 gost-vs-vuls2 benchmark:
Commits
- 30574643
- 68f8aa60
- a9481413: `ieCumCombinedKey` helper constructs canonical combined key from xlsx (component, product).
- 9c29884b
- 84ab1dea: … `gen_static_map.py`) in TODO blocks and other inline comments with shape-describing terms ("KB-keyed NA filter", "per-product Component-Drop", "archive markdown extractor / matrix table"). No data/code semantics change.
Mapping coverage (cumulative)
The remaining 741 unmapped labels are documented inline with TODO comments per bulletin — they fall into patterns that require per-bulletin manual review:
Detection benchmark impact
Measured against the 2026-05-23 vuls.db / 26-host benchmark across this stack:
Manual exceptions (preserved)
Testing
`GOEXPERIMENT=jsonv2 go test -count=1 ./pkg/extract/microsoft/...`: passes (all 5 subpkgs).
Test plan
- `GOEXPERIMENT=jsonv2 go build ./...`
- `GOEXPERIMENT=jsonv2 go test ./...`
- Review `bulletin.go` (`grep "TODO: the per-product NA entries below" pkg/extract/microsoft/bulletin/bulletin.go`) and resolve any high-impact unmapped labels as follow-up.
🤖 Generated with Claude Code