Summary
When a test pod is evicted (e.g., due to node DiskPressure), the GCS upload sidecar is also killed, resulting in no finished.json or JUnit XML artifacts for that step. Firewatch treats the absence of failure artifacts as "no failure" and files a success ticket, even though Prow reports the overall job as failed.
Observed Behavior
- Job: periodic-ci-stolostron-policy-collection-main-ocp4.22-interop-opp-aws
- Build ID: 2053568376499867648
- Prow result: FAILURE (posted "ended with failure" to Slack)
- Firewatch result: Filed CSPIT-3198 as "Job ... passed" with success label, immediately closed
The quay-tests-quay-interop-test pod was evicted due to DiskPressure after 0s. Because the sidecar never uploaded artifacts, firewatch's _find_pod_failures and _find_test_failures found zero failures, triggering report_success().
Expected Behavior
Firewatch should detect that the overall job failed and NOT file a success ticket. Ideally it should file a failure ticket indicating an infrastructure issue (pod eviction).
Root Cause
In src/objects/job.py, failure detection relies solely on:
- _find_pod_failures: checks finished.json per step (missing file = not checked)
- _find_test_failures: parses JUnit XML per step (missing file = not checked)
Neither method detects the case where a step's artifacts are entirely absent due to pod eviction.
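The gap can be illustrated with a toy model. This is not the actual firewatch code: the function name, the dict-based input, and the assumption that a per-step finished.json carries a "passed" boolean are all hypothetical and only meant to show why an evicted step yields zero failures.

```python
from typing import Dict, List, Optional


def find_pod_failures_sketch(step_finished: Dict[str, Optional[dict]]) -> List[str]:
    """Toy model: map of step name -> parsed finished.json, or None if missing."""
    failures = []
    for step, finished in step_finished.items():
        if finished is None:
            # Missing finished.json: the step is skipped, never flagged.
            continue
        if not finished.get("passed", False):
            failures.append(step)
    return failures


# An evicted step has no finished.json, so it contributes no failure.
evicted_only = {"quay-tests-quay-interop-test": None}
print(find_pod_failures_sketch(evicted_only))
```

With only the evicted step present, the list of failures is empty, which is the condition that leads to report_success().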
Suggested Fix
Before calling report_success() in src/report/report.py, check the top-level Prow job finished.json at:
gs://test-platform-results/logs/{job_name}/{build_id}/finished.json
This file is written by Prow (not the step sidecar) and contains the authoritative "passed": true/false. If it shows passed: false but zero per-step failures were found, do NOT file a success ticket.
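A minimal sketch of such a guard, assuming the caller has already downloaded the top-level finished.json as text (the download itself is left out). The helper names top_level_finished_path, prow_job_passed, and should_report_success are hypothetical and do not exist in firewatch. Treating a missing or unreadable file as "not a success" is also an assumption, chosen to be conservative.

```python
import json
from typing import Optional


def top_level_finished_path(job_name: str, build_id: str) -> str:
    """Object path of the job-level finished.json inside the results bucket."""
    return f"logs/{job_name}/{build_id}/finished.json"


def prow_job_passed(finished_json_text: Optional[str]) -> Optional[bool]:
    """Return the top-level "passed" value, or None if it cannot be determined."""
    if not finished_json_text:
        return None
    try:
        data = json.loads(finished_json_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    passed = data.get("passed")
    return passed if isinstance(passed, bool) else None


def should_report_success(step_failure_count: int,
                          finished_json_text: Optional[str]) -> bool:
    """Allow a success ticket only if no step failed AND Prow says the job passed."""
    if step_failure_count > 0:
        return False
    return prow_job_passed(finished_json_text) is True
```

In the eviction case described above, step_failure_count is 0 but the top-level file contains "passed": false, so should_report_success returns False and the success ticket is not filed.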
Alternative or complementary fix: compare discovered step artifacts against expected steps and flag any step with no finished.json as suspicious.
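The complementary check could look roughly like the sketch below. It assumes that the list of expected steps is available from somewhere (the issue does not say where) and that a step's name is the directory that directly contains its finished.json. Both are assumptions, and steps_missing_finished_json is a hypothetical helper, not part of firewatch.

```python
from typing import Iterable, List, Set


def steps_missing_finished_json(expected_steps: Iterable[str],
                                artifact_paths: Iterable[str]) -> List[str]:
    """Return expected steps for which no <step>/finished.json was uploaded."""
    found: Set[str] = set()
    for path in artifact_paths:
        parts = path.strip("/").split("/")
        # Assumption: the parent directory of finished.json is the step name.
        if len(parts) >= 2 and parts[-1] == "finished.json":
            found.add(parts[-2])
    return sorted(step for step in set(expected_steps) if step not in found)


# Illustrative paths only; "some-other-step" is a made-up step name.
paths = [
    "artifacts/interop-opp-aws/some-other-step/finished.json",
    "artifacts/interop-opp-aws/some-other-step/build-log.txt",
]
suspicious = steps_missing_finished_json(
    ["some-other-step", "quay-tests-quay-interop-test"], paths
)
print(suspicious)
```

Any step returned here would be flagged as suspicious instead of silently counting as "no failure".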
Impact
This causes false confidence in job health. Downstream consumers (watchers, Kanban boards, reporting) see "passed" and skip investigation, while the actual test was never executed.
References