[REVIEW] post-incident-review: add root cause depth scoring, blast radius metrics, and detection engineering feedback loop

## Skill Being Reviewed
**Skill name:** post-incident-review
**Skill path:** `skills/incident-response/post-incident-review/`

## False Positive Analysis

**Benign PIR that can be scored higher than warranted despite a shallow root cause analysis:**
```yaml
root_cause_analysis:
  method: "5 Whys"
  chain:
    - "Why: Attacker exploited unpatched Apache Struts CVE-2023-XXXXX"
    - "Because: Vulnerability was published 30 days before exploitation"
    - "Because: Patch was not applied within the 7-day SLA"
    - "Because: The system was excluded from the automated patch cycle"
    - "Because: The system owner classified it as 'low priority' in the CMDB"
  root_cause_statement: "Apache Struts vulnerability not patched within SLA due to incorrect CMDB classification."
remediation:
  - Ensure all Struts systems are correctly classified in CMDB
  - Reduce patching SLA from 7 to 3 days for internet-facing systems
```
**Why this is a false positive:**
The 5 Whys analysis stops at "incorrect CMDB classification" without asking the next-level questions. Why was the classification incorrect? Was the CMDB inaccurate because there is no automated discovery? Was there a process gap in system onboarding? Did the patch management team have visibility into the correct classification but no enforcement authority? The root cause statement names a proximate cause (misclassification) rather than a systemic root cause (for example, missing automated discovery and classification governance). The skill's RCA guidance says "Stop when you reach a cause that is within the organization's control to change," but this stopping rule is ambiguous: an organization can change a single CMDB entry without preventing the same misclassification pattern from recurring on other systems. The skill should require the 5 Whys chain to demonstrate that addressing the identified root cause prevents recurrence of the class of incident, not just this specific instance.
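
A minimal sketch of how the chain above might continue toward a systemic cause. The two added steps are illustrative assumptions only (the original PIR does not establish why the classification was wrong), and the `recurrence_prevention_evidence` field name follows the recommendation later in this review:

```yaml
# Hypothetical continuation of the 5 Whys chain. Lines marked "assumed"
# are illustrative, not findings from the PIR above.
root_cause_analysis:
  method: "5 Whys"
  chain:
    - "Because: The system owner classified it as 'low priority' in the CMDB"
    - "Because: Classification is entered manually at onboarding and never re-validated"  # assumed
    - "Because: No automated discovery flags internet-facing systems marked low priority"  # assumed
  root_cause_statement: "No automated discovery or classification governance for CMDB entries."
  recurrence_prevention_evidence: "Remediation applies to every CMDB-classified system, not only the Struts host."
```

Whether the real systemic cause is discovery, onboarding, or enforcement authority would have to come from the incident itself. The point is only that the chain ends at a cause whose fix covers the whole class of systems.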

**Benign PIR with comprehensive remediation but no blast radius quantification:**
```yaml
remediation_plan:
  - Finding: Missing EDR on Linux servers
    Action: Deploy CrowdStrike Falcon to all Linux servers
    Owner: Platform Team
    Priority: P1
    Deadline: 30 days
  - Finding: Insufficient log retention
    Action: Extend CloudTrail retention to 365 days
    Owner: Security Engineering
    Priority: P2
    Deadline: 90 days
metrics:
  dwell_time: "28 days"
  mttd: "28 days"
  mttc: "4 hours"
  mttr: "48 hours"
```
**Why this is a false positive:**
The skill's metrics section (Step 4) covers MTTD, MTTC, MTTR, dwell time, and related metrics, but it does not include blast radius quantification. Without a blast radius metric (number of systems affected, data records exposed, business processes disrupted, revenue impact), the PIR cannot distinguish a low-severity incident from a high-severity incident when their detection and response times are similar. Remediation priority also depends on this context: a P1 for a critical-system incident is more urgent than a P1 for a sandbox incident, but the output format does not capture the blast radius that justifies priority assignments.
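
As a rough sketch, a blast radius block could sit alongside the existing timing metrics. The `blast_radius` key and its field names are hypothetical, derived from the dimensions listed above, and the values are left as `null` placeholders because the example PIR does not supply them:

```yaml
# Hypothetical extension of the metrics section; field names are illustrative.
metrics:
  dwell_time: "28 days"
  mttd: "28 days"
  mttc: "4 hours"
  mttr: "48 hours"
  blast_radius:
    systems_affected: null              # count of confirmed-affected systems
    data_records_exposed: null          # records confirmed or presumed exposed
    business_processes_disrupted: []    # list of affected processes
    revenue_impact: null                # estimate, if assessed
    environment: null                   # e.g. production vs sandbox
```

With such a block present, a reviewer could check that each P1 or P2 assignment is consistent with the recorded impact.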

## Coverage Gaps

**Missed variant 1: Root cause depth score — 5 Whys without recurrence prevention evidence.**
```yaml
root_cause_analysis:
  method: "5 Whys"
  chain:
    - "Why: Attacker exploited vulnerable library in production"
    - "Because: Library version was 2 years old"
    - "Because: Dependabot PR was created but never merged"
    - "Because: PR required manual approval and was deprioritized"
    - "Because: No SLA enforcement for dependency update PRs"
  root_cause: "Missing automated dependency update policy with SLA enforcement"
  recurrence_risk: |
    The remediations address Dependabot auto-merge for this specific repo,
    but 15 other repos have similar Dependabot PRs waiting for manual approval.
    No org-wide dependency update policy has been created.
```
**Why it should be caught:**
The skill's root cause analysis guidance and output format do not require a recurrence assessment or a scope field that identifies whether the root cause is specific to one system, team, or process or is an organizational pattern. Without such fields, remediation actions may be scoped too narrowly (fixing one CMDB entry or one repo's merge process) while the same pattern exists in many other places. The skill should require every root cause analysis output to include a `scope` field (single-instance / team-pattern / org-wide) and a `recurrence_likelihood` assessment.
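
Applied to the example above, the two fields might look like the sketch below. The `low / medium / high` scale for `recurrence_likelihood` is an assumption, since this review does not define one, and the `high` rating is an illustrative reading of the example rather than a stated finding:

```yaml
# Sketch only: the scale for recurrence_likelihood is assumed.
root_cause_analysis:
  root_cause: "Missing automated dependency update policy with SLA enforcement"
  scope: org-wide                 # one of: single-instance, team-pattern, org-wide
  recurrence_likelihood: high     # assumed scale: low / medium / high
  recurrence_risk: "15 other repos have similar Dependabot PRs waiting for manual approval."
```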

**Missed variant 2: Detection engineering feedback loop not documented in PIR output.**
```yaml
detection_improvement:
  new_rules_created:
    - rule_name: "Apache Struts CVE-2023-XXXXX exploitation attempt"
      status: deployed (SIEM + WAF)
  existing_rules_tuned:
    - rule: "Outbound SMB connection detection"
      action: threshold lowered from 100MB to 10MB
  rules_not_updated:
    - reason: "Rule for this attack technique already exists but was evaded; no update identified"
  detection_coverage_map:
    mitre_ttps_covered: ["T1190", "T1505", "T1078", "T1021"]
    mitre_ttps_missed: []
```
**Why it should be caught:**
NIST SP 800-61 Rev 2 Section 3.4.2 ("Using Collected Incident Data") recommends using post-incident data to improve detection capability. The current PIR output format includes a "Detection Rule Updates Required" checkbox in the Follow-Up Schedule, but it does not have a structured section for the detection engineering feedback loop — what specific rules were created/tuned as a result of this incident, and what ATT&CK techniques were covered or remain uncovered. Without this structured section, the PIR may identify detection gaps but not translate them into verifiable detection engineering actions.

**Missed variant 3: Cross-team communication and escalation path not evaluated.**
```yaml
communication_failures:
  - event: SOC identified suspicious activity at T+2h
    delay: T+8h before contacting system owner
    cause: SOC did not have on-call contact for the affected system
    escalation_matrix: outdated (system ownership changed 3 months ago)
  - event: Legal notification required due to data breach notification law
    delay: T+24h after confirmation
    cause: No pre-established notification template or legal contact workflow
```
**Why it should be caught:**
The PIR process includes communication logs and escalation times in the timeline, and the common pitfalls list mentions communication failure (stakeholders not notified, or notified late). However, the structured output format does not require a dedicated communication and coordination section that evaluates notification timeliness, escalation matrix accuracy, external notification SLA compliance, and coordination quality across teams. Without this section, the PIR may document communication delays but not capture the systemic pattern (e.g., an out-of-date escalation matrix) or the compliance implications (e.g., missing the GDPR 72-hour breach notification deadline).
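
One possible shape for such a section, populated from the example above. The `communication_assessment` key and all field names are hypothetical. Values the example does not provide are left as `null`:

```yaml
# Hypothetical assessment section; field names are illustrative.
communication_assessment:
  escalation_matrix:
    accurate: false
    finding: "System ownership changed 3 months ago; matrix not updated"
  internal_notification:
    soc_detection: "T+2h"
    owner_contact_delay: "T+8h"       # copied from the example as written
    cause: "SOC did not have on-call contact for the affected system"
  external_notification:
    legal_notification_delay: "T+24h after confirmation"
    regulatory_deadline: null         # e.g. 72 hours where GDPR applies
    deadline_met: null                # not determinable from the example
  systemic_pattern: "Escalation matrix not maintained when ownership changes"
```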

## Edge Cases

- Incidents involving third-party or managed security service providers (MSSP) where handoff between internal and external teams creates detection/response gaps not captured in a single-organization PIR format.
- Incidents that span multiple cloud providers or jurisdictions where data localization, privacy law, and law enforcement access create coordination complexity not reflected in the communication assessment.
- PIR for a "near miss" (incident prevented by defense-in-depth) where no actual compromise occurred — the current PIR format assumes a confirmed incident with measurable dwell time and blast radius.

## Remediation Quality

- [x] Fix resolves the vulnerability
- [x] Fix doesn't introduce new security issues
- [x] Fix doesn't break functionality
- **Issues found:** Add root cause depth/scope scoring, blast radius quantification, detection engineering feedback loop section, and communication coordination assessment.

Recommended additions:
1. Add a root cause `scope` field (single-instance / team-pattern / org-wide) and `recurrence_prevention_evidence` requirement to ensure RCA goes beyond proximate cause.
2. Add blast radius metrics: affected system count, data records exposed, business process impact, regulatory notification requirement.
3. Add a dedicated "Detection Engineering Feedback Loop" section with new rules created, existing rules tuned, and ATT&CK coverage map.
4. Add a "Communication and Coordination Assessment" section with escalation matrix accuracy, notification SLA compliance, and cross-team coordination quality evaluation.

## Comparison to Other Tools

| Tool | Catches this? | Notes |
|------|:---:|-------|
| NIST SP 800-61 Rev 2 | Partial | Recommends using incident data for detection improvement but does not define a structured output format |
| Google SRE postmortem culture | Partial | Emphasizes blamelessness but does not require blast radius metrics or recurrence scope scoring |
| Jeli / FireHydrant (incident analysis platforms) | Partial | Commercial tools offer timeline and action tracking but leave root cause depth and detection feedback to reviewer judgment |
| PagerDuty Incident Response | Partial | Focuses on response coordination; post-incident analysis depth depends on reviewer |

## Overall Assessment

**Strengths:**
- Strong adherence to NIST SP 800-61 Rev 2 methodology with blameless retrospective, timeline reconstruction, RCA, and metrics.
- Good control failure mapping with common pattern reference table.
- Clear remediation prioritization (P0-P3) with SLA deadlines.

**Needs improvement:**
- Root cause analysis lacks a depth/scope scoring mechanism. The 5 Whys stopping rule is ambiguous and can produce proximate-cause-level RCA.
- Blast radius quantification is absent from the metrics and remediation sections.
- Detection engineering feedback loop is limited to a checkbox rather than a structured output section.
- Communication and coordination assessment is not a distinct section in the PIR output format.

**Priority recommendations:**
1. Add root cause scope and recurrence prevention evidence requirements to the RCA output format.
2. Add blast radius metrics to the incident metrics section.
3. Add a dedicated detection engineering feedback loop output section with rule creation/tuning and ATT&CK coverage mapping.
4. Add a communication and coordination assessment section with escalation matrix accuracy and notification SLAs.

## Sources Checked
- Current skill reviewed locally: `skills/incident-response/post-incident-review/SKILL.md`
- Existing reviews: #1398 (remediation verification gates), #1370 (require remediation closure)
- NIST SP 800-61 Rev 2: https://csrc.nist.gov/publications/detail/sp/800-61/rev-2/final
- Google SRE book — Postmortem Culture: https://sre.google/sre-book/postmortem-culture/

## Bounty Info
- [x] I have read and agree to the [CONTRIBUTING.md](../../CONTRIBUTING.md) bounty terms
- **Preferred payment method:** PayPal, to be provided privately after acceptance

