fix(meta-analyzer): LLM-confirmed findings dropped when model returns end_line #67

@JiayingHuang

Meta-analyzer silently drops LLM-confirmed findings when the model returns end_line (security: false negatives, e.g. dropped CVEs)

Summary

LLMMetaAnalyzer.apply_filter (src/skillspector/nodes/meta_analyzer.py) matches each static finding against the LLM's confirmation using a key that includes end_line. When the LLM confirms a finding as a real vulnerability but returns a non-null end_line (e.g. end_line == start_line) while the static finding carries end_line = None, none of the three lookup keys matches, and the confirmed finding is silently dropped (the loop hits continue).

Because the meta-analyzer is a drop-by-default whitelist filter, the failure mode is a false negative: real, LLM-confirmed findings — including live OSV/CVE supply-chain findings — disappear from the report, and the skill's risk score can collapse from CRITICAL to SAFE.

Impact / Severity

  • Security-relevant false negative. A skill with known-vulnerable dependencies scored 100 / CRITICAL in static-only mode but 0 / SAFE once LLM analysis was enabled. All 7 supply-chain findings (5 of them live OSV CVEs, e.g. PyYAML==5.1) were dropped by the filter, even though the LLM confirmed 6 of them with is_vulnerability=True, confidence≈1.0.
  • This is not a model-quality problem — the LLM classified correctly. It is a key-matching defect in apply_filter that is triggered by a legitimate, schema-valid response shape.

Environment

  • SkillSpector v2.1.4 (commit cff7ecc)
  • Python 3.13
  • Triggered when the configured LLM populates end_line in MetaAnalyzerFinding. Observed with DeepSeek via the OpenAI-compatible path (deepseek-chat); any model that fills end_line for single-line findings will trigger it. Stock OpenAI models tend to leave end_line unset, which is why this has stayed latent.

Root cause

MetaAnalyzerFinding.end_line is optional (the schema explicitly allows the model to provide it). Static analyzers commonly emit findings with end_line = None (single-line or dependency findings).

In apply_filter, the confirmation index is built with end_line (current meta_analyzer.py:301-312):

if start_line is not None:
    end_line = item.get("end_line")
    confirmed_granular[(file_path, pattern_id, int(start_line),
                        int(end_line) if end_line is not None else None)] = enrichment
else:
    confirmed_coarse[(file_path, pattern_id)] = enrichment

and looked up with three keys (meta_analyzer.py:316-326):

exact_key      = (f.file, f.rule_id, f.start_line, f.end_line)   # static end_line is None
start_only_key = (f.file, f.rule_id, f.start_line, None)         # forces None
coarse_key     = (f.file, f.rule_id)                             # only used if LLM omitted start_line
if   exact_key      in confirmed_granular: ...
elif start_only_key in confirmed_granular: ...
elif coarse_key     in confirmed_coarse:   ...
else:
    continue   # <-- finding dropped

Concrete mismatch for a SC4 finding on line 4:

| source | tuple |
| --- | --- |
| static finding | ("requirements.txt", "SC4", 4, None) |
| stored in confirmed_granular (LLM filled end_line=4) | ("requirements.txt", "SC4", 4, 4) |
| exact_key lookup | (..., 4, None) → miss |
| start_only_key lookup | (..., 4, None) → miss |
| coarse_key lookup | confirmed_coarse is empty (LLM provided start_line) → miss |

The three branches cover "LLM end_line equals the static end_line" (exact_key), "LLM omitted end_line" (start_only_key), and "LLM omitted start_line" (coarse_key), but not "LLM provided an end_line while the static finding's is None".
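The miss can also be seen with plain dictionaries, independent of SkillSpector. The following is a minimal sketch that mirrors only the key shapes described above; the `lookup` helper and the `"enrichment"` placeholder value are illustrative assumptions, not the project's code.

```python
# Stand-alone model of the key shapes used in apply_filter (not the project's code).
confirmed_granular = {}
confirmed_coarse = {}

# The LLM confirmed SC4 on line 4 and filled end_line=4.
llm_item = {"file": "requirements.txt", "pattern_id": "SC4",
            "start_line": 4, "end_line": 4}
start, end = llm_item["start_line"], llm_item["end_line"]
if start is not None:
    key = (llm_item["file"], llm_item["pattern_id"], int(start),
           int(end) if end is not None else None)
    confirmed_granular[key] = "enrichment"  # placeholder value
else:
    confirmed_coarse[(llm_item["file"], llm_item["pattern_id"])] = "enrichment"


def lookup(file, rule_id, start_line, end_line):
    """Return the name of the branch that matched, or None for a drop."""
    if (file, rule_id, start_line, end_line) in confirmed_granular:
        return "exact"
    if (file, rule_id, start_line, None) in confirmed_granular:
        return "start_only"
    if (file, rule_id) in confirmed_coarse:
        return "coarse"
    return None


# The static finding carries end_line=None, so every branch misses.
print(lookup("requirements.txt", "SC4", 4, None))  # None
# Had the static finding carried end_line=4, the exact branch would match.
print(lookup("requirements.txt", "SC4", 4, 4))     # exact
```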

Minimal reproduction (no API key / network)

apply_filter can be exercised directly with a simulated LLM response:

from skillspector.models import Finding
from skillspector.nodes.meta_analyzer import LLMMetaAnalyzer


class _FakeBatch:
    def __init__(self, file_path):
        self.file_path = file_path


def make_finding(rule_id, start_line):
    return Finding(rule_id=rule_id, message=f"Vuln ({rule_id})", severity="CRITICAL",
                   confidence=0.9, file="requirements.txt",
                   start_line=start_line, end_line=None, remediation="")


findings = [make_finding("SC4", 4), make_finding("SC4", 5)]
llm_items = [
    {"pattern_id": "SC4", "is_vulnerability": True, "confidence": 1.0,
     "start_line": 4, "end_line": 4, "_file": "requirements.txt"},
    {"pattern_id": "SC4", "is_vulnerability": True, "confidence": 1.0,
     "start_line": 5, "end_line": 5, "_file": "requirements.txt"},
]
batch_results = [(_FakeBatch("requirements.txt"), llm_items)]

analyzer = LLMMetaAnalyzer.__new__(LLMMetaAnalyzer)  # skip __init__ (no LLM needed)
kept = analyzer.apply_filter(findings, batch_results)
print(f"confirmed={sum(i['is_vulnerability'] for i in llm_items)} kept={len(kept)}")

Output on v2.1.4:

confirmed=2 kept=0      # both LLM-confirmed findings dropped

Expected:

confirmed=2 kept=2

Suggested fix

Add an end_line-agnostic fallback keyed by (file, rule_id, start_line). It only relaxes the line-matching; the is_vulnerability / confidence >= 0.6 gating upstream is unchanged, so it cannot resurrect findings the LLM rejected (verified: a finding the LLM marked is_vulnerability=False stays dropped).

         confirmed_granular: dict[tuple[str, str, int, int | None], _enrichment] = {}
+        # end_line-agnostic index: some models populate end_line==start_line while
+        # static findings carry end_line=None, which made all three lookups miss
+        # and silently drop confirmed findings.
+        confirmed_by_start: dict[tuple[str, str, int], _enrichment] = {}
         confirmed_coarse: dict[tuple[str, str], _enrichment] = {}
@@
                     ] = enrichment
+                    confirmed_by_start[(file_path, pattern_id, int(start_line))] = enrichment
                 else:
                     confirmed_coarse[(file_path, pattern_id)] = enrichment
@@
             coarse_key = (f.file, f.rule_id)
+            start_key = (f.file, f.rule_id, f.start_line) if f.start_line is not None else None
             if exact_key in confirmed_granular:
                 expl, rem, conf = confirmed_granular[exact_key]
             elif start_only_key in confirmed_granular:
                 expl, rem, conf = confirmed_granular[start_only_key]
+            elif start_key is not None and start_key in confirmed_by_start:
+                expl, rem, conf = confirmed_by_start[start_key]
             elif coarse_key in confirmed_coarse:
                 expl, rem, conf = confirmed_coarse[coarse_key]
             else:
                 continue

After the fix the reproduction prints confirmed=2 kept=2, and the end-to-end supply-chain scan returns to 100 / CRITICAL (the LLM-rejected typosquatting finding correctly stays dropped, so the result is the 6 confirmed findings, not a blanket pass-through).
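The effect of the extra index can be sketched outside the codebase. The model below is a simplified stand-in, not the patched apply_filter: `filter_findings`, the tuple-based findings, and the placement of the gate inside the same function are assumptions made to keep the example self-contained. It illustrates that the fallback relaxes only the line matching, while rejected items never enter any index.

```python
MIN_CONFIDENCE = 0.6  # threshold quoted in this issue; applied upstream in the real code


def filter_findings(findings, llm_items):
    """Keep static findings the LLM confirmed, ignoring end_line on the fallback.

    findings:  list of (file, rule_id, start_line, end_line) tuples
    llm_items: list of dicts with file, pattern_id, start_line, end_line,
               is_vulnerability, confidence
    """
    granular, by_start, coarse = set(), set(), set()
    for item in llm_items:
        # Gate first: rejected or low-confidence items never enter any index.
        if not item["is_vulnerability"] or item["confidence"] < MIN_CONFIDENCE:
            continue
        file, rule = item["file"], item["pattern_id"]
        start, end = item.get("start_line"), item.get("end_line")
        if start is not None:
            granular.add((file, rule, int(start), int(end) if end is not None else None))
            by_start.add((file, rule, int(start)))  # end_line-agnostic index
        else:
            coarse.add((file, rule))

    kept = []
    for file, rule, start, end in findings:
        if ((file, rule, start, end) in granular
                or (file, rule, start, None) in granular
                or (start is not None and (file, rule, start) in by_start)
                or (file, rule) in coarse):
            kept.append((file, rule, start, end))
    return kept


findings = [("requirements.txt", "SC4", 4, None),
            ("requirements.txt", "SC4", 5, None)]
llm_items = [
    {"file": "requirements.txt", "pattern_id": "SC4", "start_line": 4,
     "end_line": 4, "is_vulnerability": True, "confidence": 1.0},
    {"file": "requirements.txt", "pattern_id": "SC4", "start_line": 5,
     "end_line": 5, "is_vulnerability": False, "confidence": 1.0},
]
# Line 4 is confirmed and kept; line 5 was rejected by the LLM and stays dropped.
print(filter_findings(findings, llm_items))
# [('requirements.txt', 'SC4', 4, None)]
```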

Suggested hardening (optional, separate from the fix)

Given that this is a security tool, consider making high-assurance static findings (OSV/CVE supply-chain, secrets, dangerous-code/AST) non-suppressible by the LLM filter — i.e. the LLM may add or annotate findings but never remove a deterministic static finding. That would bound the blast radius of any future matching defect to "extra noise" rather than "dropped CVE".
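One possible shape for this hardening is sketched below. Everything here is hypothetical: the `StaticFinding` class, the `PROTECTED_PREFIXES` tuple, the `STYLE1` rule ID, and the `apply_llm_filter` name are illustrative only, and SkillSpector's real rule IDs and categories may be organized differently. Only the `SC` prefix is taken from the SC4 rule in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticFinding:
    rule_id: str
    file: str
    start_line: int


# Hypothetical prefixes for deterministic, high-assurance rules.
# "SC" matches the SC4 supply-chain rule in this issue; the others are placeholders.
PROTECTED_PREFIXES = ("SC", "SECRET", "AST")


def apply_llm_filter(findings, llm_kept):
    """Return findings the LLM kept plus every protected finding, in input order."""
    kept = set(llm_kept)
    return [f for f in findings
            if f in kept or f.rule_id.startswith(PROTECTED_PREFIXES)]


cve = StaticFinding("SC4", "requirements.txt", 4)
style = StaticFinding("STYLE1", "main.py", 10)  # made-up low-assurance rule

# Simulate a matching defect: the LLM filter confirms nothing.
print(apply_llm_filter([cve, style], llm_kept=[]))
# [StaticFinding(rule_id='SC4', file='requirements.txt', start_line=4)]
```

With this arrangement, a defect in key matching can still drop the suppressible finding, but the supply-chain finding survives regardless of what the LLM filter returns.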

Suggested regression test

A unit test asserting that apply_filter keeps a confirmed finding when the LLM returns a non-None end_line and the static finding has end_line = None would lock this in. The minimal reproduction above can serve as the basis.
