False-positive success ticket filed when pod eviction prevents artifact upload #278

@amp-rh

Summary

When a test pod is evicted (e.g., due to node DiskPressure), the GCS upload sidecar is also killed, resulting in no finished.json or JUnit XML artifacts for that step. Firewatch treats the absence of failure artifacts as "no failure" and files a success ticket, even though Prow reports the overall job as failed.

Observed Behavior

  • Job: periodic-ci-stolostron-policy-collection-main-ocp4.22-interop-opp-aws
  • Build ID: 2053568376499867648
  • Prow result: FAILURE (posted "ended with failure" to Slack)
  • Firewatch result: Filed CSPIT-3198 as "Job ... passed" with success label, immediately closed

The quay-tests-quay-interop-test pod was evicted due to DiskPressure after 0s. Because the sidecar never uploaded artifacts, firewatch's _find_pod_failures and _find_test_failures found zero failures, triggering report_success().

Expected Behavior

Firewatch should detect that the overall job failed and NOT file a success ticket. Ideally it should file a failure ticket indicating an infrastructure issue (pod eviction).

Root Cause

In src/objects/job.py, failure detection relies solely on:

  1. _find_pod_failures: checks finished.json per step (a step whose file is missing is skipped, not flagged)
  2. _find_test_failures: parses JUnit XML per step (a step whose file is missing is skipped, not flagged)

Neither method detects the case where a step's artifacts are entirely absent due to pod eviction.
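
To illustrate the gap, here is a simplified model of the behavior described above. It is not firewatch's actual code: the function body, the input shape (step name mapped to a parsed finished.json or None), and the `ipi-install` step name are assumptions made for the sketch.

```python
from typing import Dict, List, Optional


def find_pod_failures(step_artifacts: Dict[str, Optional[dict]]) -> List[str]:
    """Return the names of steps whose finished.json reports a failure.

    step_artifacts maps a step name to its parsed finished.json,
    or to None when the file was never uploaded (hypothetical shape).
    """
    failures = []
    for step, finished in step_artifacts.items():
        if finished is None:
            # Missing artifact: the step is skipped rather than flagged.
            continue
        if not finished.get("passed", False):
            failures.append(step)
    return failures


# The evicted step has no finished.json, so it contributes no failure.
steps = {
    "ipi-install": {"passed": True},  # hypothetical step name
    "quay-tests-quay-interop-test": None,  # evicted pod, nothing uploaded
}
failures = find_pod_failures(steps)
```

With this input `failures` is empty, which is the condition that leads to report_success() being called.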

Suggested Fix

Before calling report_success() in src/report/report.py, check the top-level Prow job finished.json at:

gs://test-platform-results/logs/{job_name}/{build_id}/finished.json

This file is written by Prow (not the step sidecar) and contains the authoritative "passed": true/false. If it shows passed: false but zero per-step failures were found, do NOT file a success ticket.
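
A minimal sketch of that guard, assuming the top-level finished.json has already been downloaded and is passed in as text. The function names `job_passed` and `should_report_success` are hypothetical and do not exist in firewatch; treating a missing or unparsable file as "not passed" is a conservative choice made for this sketch, not something the project has decided.

```python
import json
from typing import Optional, Sequence


def job_passed(finished_json_text: Optional[str]) -> bool:
    """Return True only if the job-level finished.json says "passed": true."""
    if not finished_json_text:
        # File missing or empty: do not assume the job passed.
        return False
    try:
        data = json.loads(finished_json_text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("passed") is True


def should_report_success(
    finished_json_text: Optional[str], step_failures: Sequence[object]
) -> bool:
    """File a success ticket only when no step failed AND Prow agrees."""
    return len(step_failures) == 0 and job_passed(finished_json_text)
```

For the build above, the per-step scan finds no failures but the job-level file reports a failure, so `should_report_success('{"passed": false}', [])` returns `False` and no success ticket would be filed.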

Alternative or complementary fix: compare discovered step artifacts against expected steps and flag any step with no finished.json as suspicious.
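
The comparison could look roughly like the following. This is a sketch under assumptions: `steps_missing_finished` is a hypothetical helper, and how firewatch would obtain the list of expected steps or the per-step file listing is left open.

```python
from typing import Dict, Iterable, List, Set


def steps_missing_finished(
    expected_steps: Iterable[str], discovered: Dict[str, Set[str]]
) -> List[str]:
    """Return expected steps that have no finished.json among their artifacts.

    discovered maps a step name to the set of file names found for it
    (hypothetical shape; a step with no artifacts may be absent entirely).
    """
    return [
        step
        for step in expected_steps
        if "finished.json" not in discovered.get(step, set())
    ]


suspicious = steps_missing_finished(
    ["ipi-install", "quay-tests-quay-interop-test"],  # first name is made up
    {"ipi-install": {"finished.json", "build-log.txt"}},
)
```

Here `suspicious` contains only the evicted step, which could then be reported as an infrastructure failure instead of being ignored.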

Impact

This causes false confidence in job health. Downstream consumers (watchers, Kanban boards, reporting) see "passed" and skip investigation, while the actual test was never executed.

Labels: bug
