diff --git a/.claude/commands/analyze-ci-results.md b/.claude/commands/analyze-ci-results.md new file mode 100644 index 000000000..1c6574b23 --- /dev/null +++ b/.claude/commands/analyze-ci-results.md @@ -0,0 +1,280 @@ +--- +name: analyze-ci-results +description: Analyze OpenShift CI (Prow) test results from a gcsweb URL - identifies infra vs test/code failures and correlates with git commits +parameters: + - name: ci-url + description: > + The gcsweb URL for a CI run. Can be any level of the artifact tree: + - Job root: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_monitoring-plugin/{PR}/{JOB}/{RUN_ID}/ + - Test artifacts: .../{RUN_ID}/artifacts/e2e-incidents/monitoring-plugin-tests-incidents-ui/ + - Prow UI: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_monitoring-plugin/{PR}/{JOB}/{RUN_ID} + required: true + - name: focus + description: "Optional: focus analysis on specific test file or area (e.g., 'regression', '01.incidents', 'filtering')" + required: false +--- + +# Analyze OpenShift CI Test Results + +Fetch, parse, and classify failures from an OpenShift CI (Prow) test run. This skill is designed to be the **first step** in an agentic test iteration workflow — it produces a structured diagnosis that the orchestrator can act on. + +## Instructions + +### Step 1: Normalize the URL + +The user may provide a URL at any level. 
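The Prow UI and gcsweb forms point at the same GCS path and differ only in the host prefix. A minimal, hypothetical helper (the function name is illustrative; the prefix strings are copied from the URL forms listed above) sketches the conversion:

```python
# Hypothetical sketch of the URL normalization this command performs.
# Prefix strings come straight from the URL forms in the parameter docs.
PROW_PREFIX = "https://prow.ci.openshift.org/view/gs/"
GCSWEB_PREFIX = "https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/"


def normalize_ci_url(url: str) -> str:
    """Swap the Prow UI prefix for the gcsweb prefix and ensure a trailing slash."""
    if url.startswith(PROW_PREFIX):
        url = GCSWEB_PREFIX + url[len(PROW_PREFIX):]
    if not url.endswith("/"):
        url += "/"
    return url
```
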
Normalize it to the **job root**: + +``` +https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_monitoring-plugin/{PR}/{JOB}/{RUN_ID}/ +``` + +If the user provides a Prow UI URL (`prow.ci.openshift.org/view/gs/...`), convert it: +- Replace `https://prow.ci.openshift.org/view/gs/` with `https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/` +- Append trailing `/` if missing + +Derive these base paths: +- **Job root**: `{normalized_url}` +- **Test artifacts root**: `{normalized_url}artifacts/e2e-incidents/monitoring-plugin-tests-incidents-ui/` +- **Screenshots root**: `{test_artifacts_root}artifacts/screenshots/` +- **Videos root**: `{test_artifacts_root}artifacts/videos/` + +### Step 2: Fetch Job Metadata (parallel) + +Fetch these files from the **job root** using WebFetch: + +| File | What to extract | +|------|----------------| +| `started.json` | `timestamp`, `pull` (PR number), `repos` (commit SHAs) | +| `finished.json` | `passed` (bool), `result` ("SUCCESS"/"FAILURE"), `revision` (PR HEAD SHA) | +| `prowjob.json` | PR title, PR author, PR branch, base branch, base SHA, PR SHA, job name, cluster, duration | + +From `started.json` `repos` field, extract: +- **Base commit**: the SHA after `main:` (before the comma) +- **PR commit**: the SHA after `{PR_NUMBER}:` + +Present a summary: +``` +CI Run Summary: + PR: #{PR_NUMBER} - {PR_TITLE} + Author: {AUTHOR} + Branch: {PR_BRANCH} -> {BASE_BRANCH} + PR commit: {PR_SHA} (short: first 7 chars) + Base commit: {BASE_SHA} (short: first 7 chars) + Result: PASSED / FAILED + Duration: {DURATION} + Job: {JOB_NAME} +``` + +### Step 3: Fetch and Parse Test Results + +Fetch `{test_artifacts_root}build-log.txt` using WebFetch. + +#### Cypress Output Format + +The build log contains Cypress console output. 
Parse these sections: + +**Per-spec results block** — appears after each spec file runs: +``` + (Results) + + ┌──────────────────────────────────────────────────────────┐ + │ Tests: N │ + │ Passing: N │ + │ Failing: N │ + │ Pending: N │ + │ Skipped: N │ + │ Screenshots: N │ + │ Video: true │ + │ Duration: X minutes, Y seconds │ + │ Spec Ran: {spec-file-name}.cy.ts │ + └──────────────────────────────────────────────────────────┘ +``` + +**Final summary table** — appears at the very end: +``` + (Run Finished) + + ┌──────────────────────────────────────────────────────────┐ + │ Spec Tests Passing Failing Pending │ + ├──────────────────────────────────────────────────────────┤ + │ ✓ spec-file.cy.ts 5 5 0 0 │ + │ ✗ other-spec.cy.ts 3 1 2 0 │ + └──────────────────────────────────────────────────────────┘ +``` + +**Failure details** — appear inline during test execution: +``` + 1) Suite Name + "before all" hook for "test description": + ErrorType: error message + > detailed error + at stack trace... + + N failing +``` + +Or for test-level (not hook) failures: +``` + 1) Suite Name + test description: + AssertionError: Timed out retrying after Nms: Expected to find element: .selector +``` + +Extract per-spec: +- Spec file name +- Pass/fail/skip counts +- For failures: test name, error type, error message, whether it was in a hook + +### Step 4: Fetch Failure Screenshots + +For each failing spec, navigate to `{screenshots_root}{spec-file-name}/` and list available screenshots. + +**Screenshot naming convention:** +``` +{Suite Name} -- {Test Title} -- before all hook (failed).png +{Suite Name} -- {Test Title} (failed).png +``` + +Fetch each screenshot URL and **read it using the Read tool** (multimodal) to understand the visual state at failure time. Describe what you see: +- What page/view is shown? +- Are there error dialogs, loading spinners, empty states? +- Is the expected UI element visible? If not, what's in its place? 
+- Are there console errors visible in the browser? + +### Step 5: Classify Each Failure + +For every failing test, classify it into one of these categories: + +#### Infrastructure Failures (not actionable by test code changes) + +| Classification | Indicators | +|---------------|------------| +| `INFRA_CLUSTER` | Certificate expired, API server unreachable, node not ready, cluster version mismatch | +| `INFRA_OPERATOR` | COO/CMO installation timeout, operator pod not running, CRD not found | +| `INFRA_PLUGIN` | Plugin deployment unavailable, dynamic plugin chunk loading error, console not accessible | +| `INFRA_AUTH` | Login failed, kubeconfig invalid, RBAC permission denied (for expected operations) | +| `INFRA_CI` | Pod eviction, OOM killed, timeout at infrastructure level (not test timeout) | + +**Key signals for infra issues:** +- Errors in `before all` hooks related to cluster setup +- Certificate/TLS errors +- `oc` command failures with connection errors +- Element `.co-clusterserviceversion-install__heading` not found (operator install UI) +- Errors mentioning pod names, namespaces, or k8s resources +- `e is not a function` or similar JS errors from the console application itself (not test code) + +#### Test/Code Failures (actionable) + +| Classification | Indicators | +|---------------|------------| +| `TEST_BUG` | Wrong selector, incorrect assertion logic, race condition / timing issue, test assumes wrong state | +| `FIXTURE_ISSUE` | Mock data doesn't match expected structure, missing alerts/incidents in fixture, edge case not covered | +| `PAGE_OBJECT_GAP` | Page object method missing, selector outdated, doesn't match current DOM | +| `MOCK_ISSUE` | cy.intercept not matching the actual API call, response shape incorrect, query parameter mismatch | +| `CODE_REGRESSION` | Test was passing before, UI behavior genuinely changed — the source code has a bug | + +**Key signals for test/code issues:** +- `AssertionError: Timed out retrying` on 
application-specific selectors (not infra selectors) +- `Expected X to equal Y` where the assertion logic is wrong +- Failures only in specific test scenarios, not across the board +- Screenshot shows the UI rendered correctly but test expected something different + +### Step 6: Correlate with Git Commits + +Using the PR commit SHA and base commit SHA from Step 2: + +1. **Check local git history**: Run `git log {base_sha}..{pr_sha} --oneline` to see what changed in the PR +2. **Identify relevant changes**: Run `git diff {base_sha}..{pr_sha} --stat` to see which files were modified +3. **For CODE_REGRESSION failures**: Check if the failing component's source code was modified in the PR +4. **For TEST_BUG failures**: Check if the test itself was modified in the PR (new test might have a bug) + +Present the correlation: +``` +Commit correlation for {test_name}: + PR modified: src/components/incidents/IncidentChart.tsx (+45, -12) + Test file: cypress/e2e/incidents/01.incidents.cy.ts (unchanged) + Verdict: CODE_REGRESSION - chart rendering changed but test expectations not updated +``` + +Or: +``` +Commit correlation for {test_name}: + PR modified: cypress/e2e/incidents/regression/01.reg_filtering.cy.ts (+30, -5) + Source code: src/components/incidents/ (unchanged) + Verdict: TEST_BUG - new test code has incorrect assertion +``` + +### Step 7: Produce Structured Report + +Output a structured report with this format: + +``` +# CI Analysis Report + +## Run: PR #{PR} - {TITLE} +- Commit: {SHORT_SHA} by {AUTHOR} +- Branch: {BRANCH} +- Result: {RESULT} +- Duration: {DURATION} + +## Summary +- Total specs: N +- Passed: N +- Failed: N (M infra, K test/code) + +## Infrastructure Issues (not actionable via test changes) + +### INFRA_CLUSTER: Certificate expired +- Affected: ALL tests (cascade failure) +- Detail: x509 certificate expired at {timestamp} +- Action needed: Cluster certificate renewal (outside test scope) + +## Test/Code Issues (actionable) + +### TEST_BUG: Selector 
timeout in filtering test +- Spec: regression/01.reg_filtering.cy.ts +- Test: "should filter incidents by severity" +- Error: Timed out retrying after 80000ms: Expected to find element: [data-test="severity-filter"] +- Screenshot: [description of what screenshot shows] +- Commit correlation: Test file was modified in this PR (+30 lines) +- Suggested fix: Update selector to match current DOM structure + +### CODE_REGRESSION: Chart not rendering after component refactor +- Spec: regression/02.reg_ui_charts_comprehensive.cy.ts +- Test: "should display incident bars in chart" +- Error: Expected 5 bars, found 0 +- Screenshot: Chart area is empty, no error messages visible +- Commit correlation: src/components/incidents/IncidentChart.tsx was refactored +- Suggested fix: Investigate chart rendering logic in the refactored component + +## Flakiness Indicators +- If a test failed with a timing-related error but similar tests in the same suite passed, + flag it as potentially flaky +- If the error message contains "Timed out retrying" on an element that should exist, + it may be a race condition rather than a missing element + +## Recommendations +- List prioritized next steps +- For infra issues: what needs to happen before tests can run +- For test/code issues: which fixes to attempt first (quick wins vs complex) +- Whether local reproduction is recommended +``` + +### Step 8: If `focus` parameter is provided + +Filter the analysis to only the relevant tests. For example: +- `focus=regression` -> only analyze `regression/*.cy.ts` specs +- `focus=filtering` -> only analyze tests with "filter" in their name +- `focus=01.incidents` -> only analyze `01.incidents.cy.ts` + +Still fetch all metadata and provide the full context, but limit detailed diagnosis to the focused area. + +## Notes for the Orchestrator + +When this skill is used as the first step of `/iterate-incident-tests`: + +1. **If all failures are INFRA_***: Report to user and STOP. No test changes will help. +2. 
**If mixed INFRA_* and TEST/CODE**: Report infra issues to user, proceed with test/code fixes only. +3. **If all failures are TEST/CODE**: Proceed with the full iteration loop. +4. **The commit correlation** tells the orchestrator whether to focus on fixing tests or investigating source code changes. +5. **Screenshots** give the Diagnosis Agent a head start — it can reference the CI screenshot analysis instead of reproducing the failure locally first. diff --git a/.claude/commands/cypress/scripts/notify-slack.py b/.claude/commands/cypress/scripts/notify-slack.py new file mode 100644 index 000000000..49e3c6bfd --- /dev/null +++ b/.claude/commands/cypress/scripts/notify-slack.py @@ -0,0 +1,305 @@ +#!/usr/bin/env python3 +"""Send Slack notifications for agentic test iteration loops. + +Supports two modes based on environment variables: + +Option A (Webhook — one-way): + SLACK_WEBHOOK_URL="https://hooks.slack.com/services/T.../B.../..." + +Option B (Bot with thread replies — two-way): + SLACK_BOT_TOKEN="xoxb-..." + SLACK_CHANNEL_ID="C0123456789" + +If neither is set, prints the message to stdout and exits cleanly. 
+
+Usage:
+    # Send a notification (both modes)
+    python3 notify-slack.py send <event_type> <message> [options]
+
+    # Wait for thread reply (Option B only)
+    python3 notify-slack.py wait <message_ts> [--timeout 600]
+
+Event types:
+    fix_applied, ci_started, ci_complete, ci_failed,
+    review_needed, iteration_done, flaky_found, blocked
+
+Options:
+    --pr          PR number (adds link to message)
+    --branch      Branch name
+    --url         CI run URL
+    --thread-ts   Reply in a thread (Option B)
+    --timeout     Review window timeout for 'wait' command (default: 600)
+"""
+
+import argparse
+import json
+import os
+import subprocess
+import sys
+import time
+import urllib.request
+import urllib.error
+
+
+EMOJI = {
+    "fix_applied": ":wrench:",
+    "ci_started": ":hourglass_flowing_sand:",
+    "ci_complete": ":white_check_mark:",
+    "ci_failed": ":x:",
+    "review_needed": ":eyes:",
+    "iteration_done": ":checkered_flag:",
+    "flaky_found": ":warning:",
+    "blocked": ":octagonal_sign:",
+}
+
+
+def build_blocks(event_type, message, pr=None, branch=None, url=None):
+    """Build Slack Block Kit blocks for the notification."""
+    emoji = EMOJI.get(event_type, ":robot_face:")
+
+    blocks = [
+        {
+            "type": "section",
+            "text": {
+                "type": "mrkdwn",
+                "text": f"{emoji} *Agent: {event_type.replace('_', ' ').title()}*",
+            },
+        },
+        {
+            "type": "section",
+            "text": {"type": "mrkdwn", "text": message},
+        },
+    ]
+
+    context_parts = []
+    if pr:
+        # Repo is assumed to match DEFAULT_REPO in review-github.py
+        context_parts.append(
+            f"<https://github.com/openshift/monitoring-plugin/pull/{pr}|PR #{pr}>"
+        )
+    if branch:
+        context_parts.append(f"Branch: `{branch}`")
+    if url:
+        context_parts.append(f"<{url}|CI Run>")
+
+    if context_parts:
+        blocks.append(
+            {
+                "type": "context",
+                "elements": [
+                    {"type": "mrkdwn", "text": " | ".join(context_parts)}
+                ],
+            },
+        )
+
+    return blocks
+
+
+def send_webhook(webhook_url, blocks):
+    """Option A: Send via incoming webhook."""
+    payload = json.dumps({"blocks": blocks}).encode("utf-8")
+
+    req = urllib.request.Request(
+        webhook_url,
+        data=payload,
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+
+    try:
+        with 
urllib.request.urlopen(req) as resp: + return {"ok": True, "status": resp.status} + except urllib.error.HTTPError as e: + print(f"Webhook failed: HTTP {e.code} — {e.read().decode()}", file=sys.stderr) + return {"ok": False, "error": str(e)} + + +def slack_api(token, method, payload): + """Call a Slack Web API method.""" + url = f"https://slack.com/api/{method}" + data = json.dumps(payload).encode("utf-8") + + req = urllib.request.Request( + url, + data=data, + headers={ + "Content-Type": "application/json; charset=utf-8", + "Authorization": f"Bearer {token}", + }, + method="POST", + ) + + try: + with urllib.request.urlopen(req) as resp: + return json.loads(resp.read().decode()) + except urllib.error.HTTPError as e: + body = e.read().decode() + print(f"Slack API {method} failed: HTTP {e.code} — {body}", file=sys.stderr) + return {"ok": False, "error": str(e)} + + +def send_bot(token, channel, blocks, thread_ts=None): + """Option B: Send via bot token.""" + payload = { + "channel": channel, + "blocks": blocks, + } + if thread_ts: + payload["thread_ts"] = thread_ts + + result = slack_api(token, "chat.postMessage", payload) + + if result.get("ok"): + ts = result.get("ts", "") + print(f"MESSAGE_TS={ts}") + return {"ok": True, "ts": ts} + else: + print(f"Bot send failed: {result.get('error')}", file=sys.stderr) + return {"ok": False, "error": result.get("error")} + + +def wait_for_reply(token, channel, message_ts, timeout=600, poll_interval=30): + """Option B: Poll for thread replies within a review window. + + Returns the latest user reply text, or None if no reply within timeout. 
+ Output format: + REPLY= + NO_REPLY + """ + # Get bot's own user ID to filter out its own messages + auth_result = slack_api(token, "auth.test", {}) + bot_user_id = auth_result.get("user_id", "") + + deadline = time.time() + timeout + seen_messages = set() + + # Seed with the original message to ignore it + seen_messages.add(message_ts) + + print(f"Waiting up to {timeout}s for reply in thread {message_ts}...", flush=True) + + while time.time() < deadline: + result = slack_api( + token, + "conversations.replies", + {"channel": channel, "ts": message_ts}, + ) + + if result.get("ok"): + messages = result.get("messages", []) + for msg in messages: + msg_ts = msg.get("ts", "") + user = msg.get("user", "") + + if msg_ts in seen_messages: + continue + seen_messages.add(msg_ts) + + # Skip bot's own messages + if user == bot_user_id: + continue + + # Found a user reply + reply_text = msg.get("text", "") + print(f"REPLY={reply_text}") + return reply_text + + remaining = int(deadline - time.time()) + if remaining > 0: + print( + f"No reply yet, {remaining}s remaining...", + file=sys.stderr, + flush=True, + ) + + time.sleep(min(poll_interval, max(1, remaining))) + + print("NO_REPLY") + return None + + +def cmd_send(args): + """Handle the 'send' subcommand.""" + webhook_url = os.environ.get("SLACK_WEBHOOK_URL", "") + bot_token = os.environ.get("SLACK_BOT_TOKEN", "") + channel_id = os.environ.get("SLACK_CHANNEL_ID", "") + + blocks = build_blocks( + args.event_type, args.message, pr=args.pr, branch=args.branch, url=args.url + ) + + # Option B: Bot token takes priority (supports two-way) + if bot_token and channel_id: + result = send_bot(bot_token, channel_id, blocks, thread_ts=args.thread_ts) + return 0 if result.get("ok") else 1 + + # Option A: Webhook (one-way) + if webhook_url: + result = send_webhook(webhook_url, blocks) + return 0 if result.get("ok") else 1 + + # No Slack configured — print to stdout and exit cleanly + emoji = EMOJI.get(args.event_type, "") + 
print(f"[slack-skip] {emoji} {args.event_type}: {args.message}") + return 0 + + +def cmd_wait(args): + """Handle the 'wait' subcommand.""" + bot_token = os.environ.get("SLACK_BOT_TOKEN", "") + channel_id = os.environ.get("SLACK_CHANNEL_ID", "") + + if not bot_token or not channel_id: + print( + "NO_REPLY (Option B not configured — SLACK_BOT_TOKEN and SLACK_CHANNEL_ID required)" + ) + return 0 + + reply = wait_for_reply( + bot_token, channel_id, args.message_ts, timeout=args.timeout + ) + return 0 + + +def main(): + parser = argparse.ArgumentParser( + description="Slack notifications for agentic test iteration" + ) + subparsers = parser.add_subparsers(dest="command", required=True) + + # 'send' subcommand + send_parser = subparsers.add_parser("send", help="Send a notification") + send_parser.add_argument( + "event_type", + choices=list(EMOJI.keys()), + help="Event type", + ) + send_parser.add_argument("message", help="Message text (Slack mrkdwn supported)") + send_parser.add_argument("--pr", help="PR number") + send_parser.add_argument("--branch", help="Branch name") + send_parser.add_argument("--url", help="CI run URL") + send_parser.add_argument( + "--thread-ts", help="Thread timestamp to reply in (Option B)" + ) + + # 'wait' subcommand + wait_parser = subparsers.add_parser( + "wait", help="Wait for thread reply (Option B only)" + ) + wait_parser.add_argument("message_ts", help="Message timestamp to watch") + wait_parser.add_argument( + "--timeout", + type=int, + default=600, + help="Seconds to wait for reply (default: 600)", + ) + + args = parser.parse_args() + + if args.command == "send": + return cmd_send(args) + elif args.command == "wait": + return cmd_wait(args) + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/.claude/commands/cypress/scripts/poll-ci-status.py b/.claude/commands/cypress/scripts/poll-ci-status.py new file mode 100644 index 000000000..226399074 --- /dev/null +++ b/.claude/commands/cypress/scripts/poll-ci-status.py @@ -0,0 +1,92 
@@
+#!/usr/bin/env python3
+"""Poll OpenShift CI (Prow) job status for a PR until completion.
+
+Usage:
+    python3 poll-ci-status.py <pr_number> [job_substring] [max_attempts] [interval_seconds]
+
+Arguments:
+    pr_number          GitHub PR number to poll
+    job_substring      Substring to match in job name (default: e2e-incidents)
+    max_attempts       Maximum polling attempts (default: 30)
+    interval_seconds   Sleep between polls in seconds (default: 300)
+
+Output on completion:
+    CI_COMPLETE state=SUCCESS url=<link>
+    CI_COMPLETE state=FAILURE url=<link>
+    CI_TIMEOUT (if max_attempts reached)
+
+Requires: gh CLI authenticated with access to the repo.
+"""
+
+import subprocess
+import json
+import time
+import sys
+
+
+def poll(pr, job_substring="e2e-incidents", max_attempts=30, interval=300):
+    for attempt in range(max_attempts):
+        result = subprocess.run(
+            ["gh", "pr", "checks", pr, "--json", "name,state,link"],
+            capture_output=True,
+            text=True,
+        )
+
+        if result.returncode != 0:
+            print(
+                f"gh pr checks failed (attempt {attempt + 1}/{max_attempts}): {result.stderr.strip()}",
+                flush=True,
+            )
+            time.sleep(interval)
+            continue
+
+        try:
+            checks = json.loads(result.stdout)
+        except json.JSONDecodeError:
+            print(
+                f"Invalid JSON from gh pr checks (attempt {attempt + 1}/{max_attempts})",
+                flush=True,
+            )
+            time.sleep(interval)
+            continue
+
+        found = False
+        for check in checks:
+            if job_substring in check.get("name", ""):
+                found = True
+                state = check["state"]
+                url = check.get("link", "")
+
+                if state in ("SUCCESS", "FAILURE"):
+                    print(f"CI_COMPLETE state={state} url={url}")
+                    return 0
+
+                print(
+                    f"CI_PENDING state={state}, attempt {attempt + 1}/{max_attempts}, sleeping {interval}s...",
+                    flush=True,
+                )
+                break
+
+        if not found:
+            print(
+                f"Job '{job_substring}' not found yet, attempt {attempt + 1}/{max_attempts}, sleeping {interval}s...",
+                flush=True,
+            )
+
+        time.sleep(interval)
+
+    print("CI_TIMEOUT")
+    return 1
+
+
+if __name__ == "__main__":
+    if len(sys.argv) < 2:
+        print(f"Usage: {sys.argv[0]} <pr_number> [job_substring] [max_attempts] [interval_seconds]")
+        sys.exit(2)
+
+    pr = sys.argv[1]
+    job = sys.argv[2] if len(sys.argv) > 2 else "e2e-incidents"
+    attempts = int(sys.argv[3]) if len(sys.argv) > 3 else 30
+    interval = int(sys.argv[4]) if len(sys.argv) > 4 else 300
+
+    sys.exit(poll(pr, job, attempts, interval))
diff --git a/.claude/commands/cypress/scripts/review-github.py b/.claude/commands/cypress/scripts/review-github.py
new file mode 100644
index 000000000..57877c103
--- /dev/null
+++ b/.claude/commands/cypress/scripts/review-github.py
@@ -0,0 +1,232 @@
+#!/usr/bin/env python3
+"""GitHub PR comment-based review flow for agentic test iteration.
+
+Posts fix details as PR comments and polls for author replies within a
+timed review window. Designed to work alongside Slack webhook notifications
+(one-way) — GitHub PR comments provide the two-way interaction channel.
+
+Usage:
+    # Post a review comment on a PR
+    python3 review-github.py post <pr> <message> [--repo owner/repo]
+
+    # Wait for author reply within a review window
+    python3 review-github.py wait <pr> <since_timestamp> [--timeout 600] [--repo owner/repo]
+
+Output formats:
+    post: COMMENT_ID=<comment_id> COMMENT_TIME=<created_at>
+    wait: REPLY=<text>  (author replied)
+          NO_REPLY      (timeout reached, no author reply)
+
+Requires: gh CLI authenticated with comment access to the target repo.
+
+Security: Author filtering is enforced deterministically in code —
+the PR author's login is fetched via API and only comments from that
+user are considered. This is not instruction-based filtering.
+""" + +import argparse +import json +import subprocess +import sys +import time +from datetime import datetime, timezone + + +DEFAULT_REPO = "openshift/monitoring-plugin" +MAGIC_PREFIX = "/agent" + + +def gh_api(endpoint, method="GET", body=None, repo=None): + """Call GitHub API via gh CLI.""" + cmd = ["gh", "api"] + if repo: + endpoint = endpoint.replace("{repo}", repo) + if method != "GET": + cmd.extend(["--method", method]) + if body: + for key, value in body.items(): + cmd.extend(["-f", f"{key}={value}"]) + cmd.append(endpoint) + + result = subprocess.run(cmd, capture_output=True, text=True) + if result.returncode != 0: + print(f"gh api failed: {result.stderr.strip()}", file=sys.stderr) + return None + + if not result.stdout.strip(): + return {} + + try: + return json.loads(result.stdout) + except json.JSONDecodeError: + print(f"Invalid JSON from gh api: {result.stdout[:200]}", file=sys.stderr) + return None + + +def get_pr_author(pr, repo): + """Fetch the PR author's login.""" + data = gh_api(f"repos/{repo}/pulls/{pr}") + if data and "user" in data: + return data["user"]["login"] + return None + + +def post_comment(pr, message, repo): + """Post a comment on a PR. Returns (comment_id, created_at).""" + data = gh_api( + f"repos/{repo}/issues/{pr}/comments", + method="POST", + body={"body": message}, + ) + if data and "id" in data: + comment_id = data["id"] + created_at = data.get("created_at", "") + print(f"COMMENT_ID={comment_id}") + print(f"COMMENT_TIME={created_at}") + return comment_id, created_at + + print("Failed to post comment", file=sys.stderr) + return None, None + + +def wait_for_author_reply(pr, since_timestamp, repo, timeout=600, poll_interval=30): + """Poll PR comments for a reply from the PR author. + + Only considers comments that: + 1. Were posted AFTER since_timestamp (time-scoped) + 2. Were authored by the PR author (deterministic .user.login check) + 3. 
Optionally start with the magic prefix /agent (if present, stripped from reply) + + Args: + pr: PR number + since_timestamp: ISO 8601 timestamp — only comments after this are considered + repo: owner/repo string + timeout: seconds to wait before giving up + poll_interval: seconds between polls + + Returns: + Reply text if found, None otherwise. + """ + # Fetch PR author login — deterministic, code-enforced filter + pr_author = get_pr_author(pr, repo) + if not pr_author: + print("Could not determine PR author. Proceeding without review.", file=sys.stderr) + print("NO_REPLY") + return None + + print(f"Waiting up to {timeout}s for reply from @{pr_author} on PR #{pr}...", flush=True) + + deadline = time.time() + timeout + seen_ids = set() + + while time.time() < deadline: + # Fetch comments created after since_timestamp + comments = gh_api( + f"repos/{repo}/issues/{pr}/comments?since={since_timestamp}&per_page=50" + ) + + if comments is None: + remaining = int(deadline - time.time()) + if remaining > 0: + print(f"API error, retrying in {poll_interval}s ({remaining}s remaining)...", + file=sys.stderr, flush=True) + time.sleep(min(poll_interval, max(1, remaining))) + continue + + for comment in comments: + comment_id = comment.get("id") + if comment_id in seen_ids: + continue + seen_ids.add(comment_id) + + # Deterministic author filter — code-enforced, not instruction-based + commenter = comment.get("user", {}).get("login", "") + if commenter != pr_author: + continue + + body = comment.get("body", "").strip() + + # If magic prefix is used, strip it; otherwise accept any author comment + if body.startswith(MAGIC_PREFIX): + body = body[len(MAGIC_PREFIX):].strip() + + if body: + print(f"REPLY={body}") + return body + + remaining = int(deadline - time.time()) + if remaining > 0: + print( + f"No reply yet from @{pr_author}, {remaining}s remaining...", + file=sys.stderr, + flush=True, + ) + time.sleep(min(poll_interval, max(1, remaining))) + + print("NO_REPLY") + return None + 
+ +def format_fix_comment(message): + """Wrap the agent's message in a standard comment format.""" + return ( + "### Agent: Fix Applied\n\n" + f"{message}\n\n" + "---\n" + f"*Reply to this comment (or prefix with `{MAGIC_PREFIX}`) to provide feedback. " + "The agent will incorporate your input before pushing, or proceed automatically " + "after the review window expires.*" + ) + + +def cmd_post(args): + """Handle the 'post' subcommand.""" + formatted = format_fix_comment(args.message) + comment_id, created_at = post_comment(args.pr, formatted, args.repo) + return 0 if comment_id else 1 + + +def cmd_wait(args): + """Handle the 'wait' subcommand.""" + wait_for_author_reply( + args.pr, args.since, args.repo, timeout=args.timeout + ) + return 0 + + +def main(): + parser = argparse.ArgumentParser( + description="GitHub PR comment-based review for agentic test iteration" + ) + parser.add_argument( + "--repo", default=DEFAULT_REPO, + help=f"GitHub repo (default: {DEFAULT_REPO})" + ) + subparsers = parser.add_subparsers(dest="command", required=True) + + # 'post' subcommand + post_parser = subparsers.add_parser("post", help="Post a review comment on a PR") + post_parser.add_argument("pr", help="PR number") + post_parser.add_argument("message", help="Comment body (markdown supported)") + + # 'wait' subcommand + wait_parser = subparsers.add_parser( + "wait", help="Wait for author reply on a PR" + ) + wait_parser.add_argument("pr", help="PR number") + wait_parser.add_argument("since", help="ISO 8601 timestamp — only consider comments after this") + wait_parser.add_argument( + "--timeout", type=int, default=600, + help="Seconds to wait for reply (default: 600)" + ) + + args = parser.parse_args() + + if args.command == "post": + return cmd_post(args) + elif args.command == "wait": + return cmd_wait(args) + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/.claude/commands/diagnose-test-failure.md b/.claude/commands/diagnose-test-failure.md new file mode 100644 index 
000000000..6c8185e49 --- /dev/null +++ b/.claude/commands/diagnose-test-failure.md @@ -0,0 +1,167 @@ +--- +name: diagnose-test-failure +description: Diagnose a Cypress test failure using error output, screenshots, and codebase analysis +parameters: + - name: test-name + description: "Full title of the failing test (from mochawesome 'fullTitle' or Cypress output)" + required: true + - name: spec-file + description: "Path to the spec file (e.g., cypress/e2e/incidents/regression/01.reg_filtering.cy.ts)" + required: true + - name: error-message + description: "The error message from the test failure" + required: true + - name: screenshot-path + description: "Absolute path to the failure screenshot (will be read with multimodal vision)" + required: false + - name: stack-trace + description: "The error stack trace (estack from mochawesome)" + required: false + - name: ci-context + description: "Optional context from /analyze-ci-results (commit correlation, infra status)" + required: false +--- + +# Diagnose Test Failure + +Analyze a Cypress test failure to determine root cause and recommend a fix. This skill is used by the `/iterate-incident-tests` orchestrator but can also be invoked standalone. + +## Diagnosis Protocol + +**IMPORTANT**: Follow this order. Visual evidence first, then code analysis. + +### Step 1: Read the Screenshot (if available) + +If `screenshot-path` is provided, read it using the Read tool (multimodal). + +Describe what you see: +- What page/view is displayed? +- Is the expected UI element visible? If not, what's in its place? +- Are there error dialogs, loading spinners, empty states, or overlays? +- Is the page fully loaded or still loading? +- Are there any browser console errors visible? +- Does the layout look correct (no overlapping elements, correct positioning)? + +This visual context often reveals the root cause faster than reading code. + +### Step 2: Read the Test Code + +Read the spec file at `spec-file`. 
Find the failing test by matching `test-name`. + +Identify: +- What the test is trying to do (user actions + assertions) +- Which page object methods it calls +- Which fixture it loads (look at `before`/`beforeEach` hooks) +- The specific assertion or command that failed +- Whether the failure is in a `before all` hook (affects all tests in suite) or a specific `it()` block + +### Step 3: Read the Page Object + +Read `web/cypress/views/incidents-page.ts`. + +For each page object method used by the failing test: +- Check the selector — does it match current DOM conventions? +- Check for hardcoded waits vs proper Cypress chaining +- Look for methods that might be missing or outdated + +### Step 4: Read the Fixture (if applicable) + +If the test uses `cy.mockIncidentFixture('...')`, read the fixture YAML file. + +Check: +- Does the fixture have the incidents/alerts the test expects? +- Are severities, states, components, timelines correct? +- Are there edge cases (empty arrays, missing fields, zero-duration timelines)? + +### Step 5: Read the Mock Layer (if relevant) + +If the error suggests an API/intercept issue, read relevant files in `cypress/support/incidents_prometheus_query_mocks/`: +- `prometheus-mocks.ts` — intercept setup and route matching +- `mock-generators.ts` — response data generation +- `types.ts` — type definitions for fixtures + +Check: +- Does the intercept URL pattern match the actual API call? +- Is the response shape what the UI code expects? +- Are query parameters (group_id, alertname, severity) handled correctly? + +### Step 6: Cross-reference with Error + +Now combine visual evidence + code analysis + error message to determine root cause. 
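As a rough illustration, the message-matching part of this cross-referencing could be sketched as a lookup over the error signatures discussed in this skill (the function name, the category granularity, and the exact regexes are assumptions; real diagnosis also weighs the screenshot and code analysis, not just the message):

```python
import re

# Illustrative only: a few error signatures from this skill mapped to the
# classifications used in the diagnosis output. Unknown messages fall through.
ERROR_PATTERNS = [
    (r"x509: certificate|Unable to connect", "INFRA_ISSUE"),
    (r"cy\.intercept\(\) matched no requests", "MOCK_ISSUE"),
    (r"detached from the DOM", "TEST_BUG"),
    (r"Timed out retrying .*Expected to find element", "TEST_BUG"),
]


def classify_error(message: str) -> str:
    """Return a first-pass classification for an error message."""
    for pattern, classification in ERROR_PATTERNS:
        if re.search(pattern, message):
            return classification
    return "UNCLASSIFIED"
```
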
+ +**Common patterns:** + +| Error Pattern | Likely Cause | +|--------------|--------------| +| `Timed out retrying after Nms: Expected to find element: .selector` | Selector wrong, element not rendered, or page not loaded | +| `Expected N to equal M` (counts) | Fixture doesn't have enough data, or filter state is wrong | +| `expected true to be false` / vice versa | Assertion logic inverted | +| `Cannot read properties of undefined` | Page object method returns wrong element, or DOM structure changed | +| `cy.intercept() matched no requests` | Mock intercept URL doesn't match actual API call | +| `Timed out retrying` on `.should('be.visible')` | Element exists but hidden (z-index, opacity, overflow, display:none) | +| `before all hook` failure | Setup issue — fixture load, navigation, or login failed | +| `detached from the DOM` | Element re-rendered between find and action — needs `.should('exist')` guard | +| `e is not a function` / runtime JS error | Application code bug, not test issue | +| `x509: certificate` / `Unable to connect` | Infrastructure issue | + +### Step 7: Classify and Recommend + +Output your diagnosis in this exact format: + +``` +## Diagnosis + +**Classification**: TEST_BUG | FIXTURE_ISSUE | PAGE_OBJECT_GAP | MOCK_ISSUE | REAL_REGRESSION | INFRA_ISSUE + +**Confidence**: HIGH | MEDIUM | LOW + +**Root Cause**: +[1-3 sentence explanation of what's wrong and why] + +**Evidence**: +- Screenshot: [what the screenshot showed] +- Error: [what the error message tells us] +- Code: [what the code analysis revealed] + +**Recommended Fix**: +- File: [path to file that needs editing] +- Change: [specific description of what to change] +- [If multiple files need changing, list each] + +**Risk Assessment**: +- Will this fix affect other tests? [yes/no and why] +- Could this mask a real bug? 
[yes/no and why] + +**Alternative Hypotheses**: +- [If confidence is MEDIUM or LOW, list other possible causes] +``` + +## Classification Reference + +### Auto-fixable (proceed with Fix Agent) + +| Classification | Description | Examples | +|---------------|-------------|----------| +| `TEST_BUG` | Test code is wrong | Wrong selector, incorrect assertion value, missing wait, wrong test order dependency | +| `FIXTURE_ISSUE` | Test data is wrong | Missing incident in fixture, wrong severity, timeline doesn't cover test's time window | +| `PAGE_OBJECT_GAP` | Page object needs update | Selector targets old class name, method missing for new UI element, method returns wrong element | +| `MOCK_ISSUE` | API mock is wrong | Intercept URL pattern outdated, response missing required field, query filter not handled | + +### Not auto-fixable (report to user) + +| Classification | Description | Examples | +|---------------|-------------|----------| +| `REAL_REGRESSION` | UI code has a bug | Component doesn't render, wrong data displayed, broken interaction | +| `INFRA_ISSUE` | Environment problem | Cluster down, cert expired, operator not installed, console unreachable | + +### Distinguishing TEST_BUG from REAL_REGRESSION + +This is the hardest classification. Use these heuristics: + +1. **Was the test ever passing?** If it's a new test, lean toward `TEST_BUG`. If it was passing before, check what changed. +2. **Does the screenshot show the UI working correctly but the test expecting something different?** → `TEST_BUG` +3. **Does the screenshot show the UI broken (empty state, error, wrong data)?** → Likely `REAL_REGRESSION` +4. **Do other tests in the same suite pass?** If yes, the infra/app is fine → `TEST_BUG` or `FIXTURE_ISSUE` +5. **If CI context is available**: Check if the source code was modified in the PR. 
Modified source + broken test = likely `REAL_REGRESSION` + +When in doubt, classify as `REAL_REGRESSION` — it's safer to report a false positive to the user than to silently "fix" a test that was correctly catching a bug. diff --git a/.claude/commands/iterate-ci-flaky.md b/.claude/commands/iterate-ci-flaky.md new file mode 100644 index 000000000..9f99418da --- /dev/null +++ b/.claude/commands/iterate-ci-flaky.md @@ -0,0 +1,416 @@ +--- +name: iterate-ci-flaky +description: Iterate on flaky Cypress tests against OpenShift CI presubmit jobs — push fixes, trigger CI, analyze results, repeat +parameters: + - name: pr + description: "PR number to iterate on (e.g., 857)" + required: true + - name: max-iterations + description: "Maximum fix-push-wait cycles (default: 3)" + required: false + - name: confirm-runs + description: "Number of green CI runs required to declare stable (default: 2)" + required: false + - name: job + description: "Prow job name to target (default: pull-ci-openshift-monitoring-plugin-main-e2e-incidents)" + required: false + - name: focus + description: "Optional: focus analysis on specific test area (e.g., 'regression', 'filtering')" + required: false + - name: review-window + description: "Seconds to wait for user feedback after posting fix to Slack before pushing (default: 0 = no wait). Requires Option B Slack setup." + required: false +--- + +# Iterate CI Flaky Tests + +Fix flaky Cypress tests by iterating against real OpenShift CI presubmit jobs. Pushes fixes, triggers CI, waits for results, analyzes failures, and repeats until stable. + +## Prerequisites + +### 1. GitHub CLI Authentication + +```bash +gh auth status +``` + +Must be logged in with comment access to `openshift/monitoring-plugin` (for `/test` comments to trigger Prow CI). + +**Recommended auth method**: `gh auth login --web` (OAuth via browser). This uses your GitHub user's existing org permissions — no PAT scope management needed. 
Revocable anytime at GitHub → Settings → Applications. + +**Why not a PAT?** +- Fine-grained PATs can only scope repos you own — you can't add `openshift/monitoring-plugin` as a contributor. +- Classic PATs with `public_repo` scope work but grant broader access than needed. +- OAuth via `--web` uses the GitHub CLI OAuth app which requests only the permissions it needs and inherits your org membership. + +**Push access**: Git push to your fork uses SSH (`origin` remote) — this is independent of the `gh` token. + +**Fallback**: If the token lacks upstream comment permissions, the agent will report the blocker and ask you to post the `/test` comment manually on the PR page. + +### 2. Permissions + +Required in `.claude/settings.local.json`: + +```json +{ + "permissions": { + "allow": [ + "Bash(gh auth:*)", + "Bash(gh api:*)", + "Bash(gh pr:*)", + "Bash(git push:*)", + "Bash(git add:*)", + "Bash(git commit:*)", + "Bash(git status:*)", + "Bash(git diff:*)", + "Bash(git log:*)", + "Bash(git rev-parse:*)", + "Bash(git -C:*)", + "Bash(git checkout:*)", + "Bash(git fetch:*)", + "Bash(python3:*)", + "Bash(find screenshots:*)", + "Bash(find cypress/screenshots:*)", + "Bash(find cypress/videos:*)", + "WebFetch(domain:gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com)" + ] + } +} +``` + +### 3. Notifications & Review (optional) + +Notifications and review are optional — if not configured, the script prints to stdout and the loop continues normally. + +**Slack Notifications (one-way):** +```bash +export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/T.../B.../..." +``` +Setup: Slack → Apps → Incoming Webhooks → create webhook for your channel. 5 minutes. +Provides one-way status notifications at key events (ci_started, ci_failed, fix_applied, etc.). + +**GitHub PR Comment Review (two-way):** + +The `review-window` parameter enables a two-way review flow using GitHub PR comments. When a fix is ready: + +1. Agent posts fix details as a PR comment (via `review-github.py post`) +2. 
Agent also sends a Slack webhook notification (if configured) +3. Agent waits `review-window` seconds for a reply from the **PR author only** +4. If the author replies on the PR — agent reads the feedback and adjusts the fix +5. If no reply within the window — agent proceeds autonomously + +**Security**: Author filtering is **code-enforced** in `review-github.py` — only comments where `.user.login` matches the PR author are considered. This is deterministic, not instruction-based. + +**How to reply**: Post a regular comment on the PR. The agent only reads comments from the PR author posted after the agent's notification. Optionally prefix with `/agent` for clarity. + +No additional setup needed beyond `gh auth` (Step 1) — the same token used for `/test` comments is used for posting and reading review comments. + +Both Slack webhook URL and review-window can be set in `cypress/export-env.sh` or `~/.zshrc`. + +### 4. Unsigned Commits + +Same as `/iterate-incident-tests` — all commits use `--no-gpg-sign`. They live on a PR branch and are squash-merged by the user. + +## Instructions + +**IMPORTANT — Autonomous Execution Rules:** +- **Never chain commands** with `&&` or `|` — use separate Bash calls for each operation. Compound commands and pipes trigger security prompts that block autonomous execution. +- **Never combine `cd` with other commands** — `cd && git` triggers an unskippable security prompt. +- When you need to process command output (e.g., parse JSON), capture it with a Bash call first, then process it in a second call or read the output directly. 
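The capture-then-process rule above can be sketched in Python: one Bash call saves `gh pr view {pr} --json statusCheckRollup > pr.json` (the filename is illustrative), a second step extracts what is needed without pipes. This is a sketch, not part of the workflow's scripts; the field names (`name`/`conclusion` for check runs, `context`/`state` for commit statuses) follow the shapes `gh` returns under `statusCheckRollup`:

```python
import json

def check_state(path: str, job_substring: str) -> str:
    """Return the conclusion/state of the first check whose name contains
    job_substring, read from a saved `gh pr view --json statusCheckRollup`
    dump. Returns NOT_FOUND if no check matches."""
    with open(path) as f:
        rollup = json.load(f).get("statusCheckRollup", [])
    for check in rollup:
        # Check runs carry "name"/"conclusion"; commit statuses carry
        # "context"/"state" -- handle both shapes.
        name = check.get("name") or check.get("context") or ""
        if job_substring in name:
            return check.get("conclusion") or check.get("state") or "PENDING"
    return "NOT_FOUND"
```

Because the JSON is read from a file produced by a prior call, no `gh ... | jq` pipeline is needed, and no security prompt is triggered.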
+ +### Step 1: Gather PR Context + +Fetch PR metadata: +```bash +gh pr view {pr} --json headRefName,headRefOid,baseRefName,number,title,url,author,statusCheckRollup +``` + +Extract: +- **Branch**: `headRefName` +- **HEAD SHA**: `headRefOid` +- **Check runs**: from `statusCheckRollup`, find the job matching `{job}` (default: `pull-ci-openshift-monitoring-plugin-main-e2e-incidents`) + +Check out the PR branch locally: +```bash +git fetch origin {headRefName} +``` +```bash +git checkout {headRefName} +``` + +Present summary: +``` +PR #{pr}: {title} +Branch: {headRefName} +HEAD: {short_sha} +CI job: {job} +Latest run status: {SUCCESS|FAILURE|PENDING|none} +``` + +### Step 2: Determine Current CI State + +From the status check rollup, determine the state of the target job: + +- **SUCCESS**: Skip to Step 5 (flakiness confirmation — was it truly stable?) +- **FAILURE**: Proceed to Step 3 (analyze the failure) +- **PENDING / IN_PROGRESS**: Skip to Step 4 (wait for it) +- **No run found**: Trigger one in Step 3 + +### Step 3: Trigger CI Run (if needed) + +If there's no recent run, or a fix was just pushed: + +```bash +gh api repos/openshift/monitoring-plugin/issues/{pr}/comments -f body="/test e2e-incidents" +``` + +**IMPORTANT**: The `/test` command uses the **short alias** (`e2e-incidents`), not the full Prow job name. Using the full name will fail with "specified target(s) for /test were not found." + +Note: If you just pushed a commit in Step 6, the push automatically triggers Prow — you can skip the `/test` comment. Only use `/test` for: +- Retriggering without code changes (flakiness retry) +- The initial run if none exists + +After triggering, notify and proceed to Step 4: +```bash +python3 .claude/commands/cypress/scripts/notify-slack.py send ci_started "CI triggered for PR #{pr}. Polling for results (~2h)." 
--pr {pr} --branch {headRefName} +``` + +### Step 4: Wait for CI Completion + +Use the polling script at `.claude/commands/cypress/scripts/poll-ci-status.py`: + +```bash +python3 .claude/commands/cypress/scripts/poll-ci-status.py {pr} +``` + +Arguments: `{pr} [job_substring] [max_attempts] [interval_seconds]` +- Default job substring: `e2e-incidents` +- Default max attempts: 30 (150 minutes at 5-minute intervals) +- Default interval: 300 seconds + +Run this with `run_in_background: true` and a timeout of 9000000ms (150 minutes). + +When the background task completes, parse the output line starting with `CI_COMPLETE`: +- Extract `state` (SUCCESS or FAILURE) +- Extract `url` (Prow URL for the run) + +### Step 5: Analyze CI Results + +Convert the Prow URL to a gcsweb URL: +- Replace `https://prow.ci.openshift.org/view/gs/` with `https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/` + +Run `/analyze-ci-results` (or follow its instructions inline): + +1. Fetch `started.json`, `finished.json`, `prowjob.json` for metadata +2. Fetch `build-log.txt` from the test artifacts path +3. List and fetch failure screenshots +4. Classify each failure + +**Classification outcomes:** + +| Classification | Action | +|---------------|--------| +| `INFRA_*` | Report to user. Optionally retrigger with `/retest` (Step 3). Do NOT attempt code fixes. | +| `TEST_BUG` | Diagnose and fix locally (Step 6) | +| `FIXTURE_ISSUE` | Diagnose and fix locally (Step 6) | +| `PAGE_OBJECT_GAP` | Diagnose and fix locally (Step 6) | +| `MOCK_ISSUE` | Diagnose and fix locally (Step 6) | +| `CODE_REGRESSION` | Report to user and **STOP** | + +Notify after analysis: + +If failures: +```bash +python3 .claude/commands/cypress/scripts/notify-slack.py send ci_failed "{N} failures found: {test_names}. Diagnosing..." --pr {pr} --branch {headRefName} --url {ci_url} +``` + +If all green: +```bash +python3 .claude/commands/cypress/scripts/notify-slack.py send ci_complete "All tests passed. 
Starting flakiness confirmation." --pr {pr} --branch {headRefName} --url {ci_url} +``` + +If `CODE_REGRESSION` or `INFRA_*` blocks the loop: +```bash +python3 .claude/commands/cypress/scripts/notify-slack.py send blocked "{classification}: {description}. Agent stopped — needs human input." --pr {pr} --branch {headRefName} +``` + +If **all green** (SUCCESS): Proceed to Step 7 (flakiness confirmation). + +### Step 6: Fix and Push + +For each fixable failure: + +1. **Diagnose** using `/diagnose-test-failure` (read screenshots, test code, fixtures, page object) +2. **Fix** — edit the relevant files. Same constraints as `/iterate-incident-tests`: + - May edit: `cypress/e2e/incidents/**`, `cypress/fixtures/incident-scenarios/**`, `cypress/views/incidents-page.ts`, `cypress/support/incidents_prometheus_query_mocks/**` + - Must NOT edit: `src/**`, non-incident tests, cypress config +3. **Validate locally** (optional but recommended if cluster is accessible): + ```bash + source cypress/export-env.sh && npx cypress run --spec "{SPEC}" --env grep="{TEST_NAME}" + ``` +4. **Commit**: + ```bash + git add {files} + ``` + ```bash + git commit --no-gpg-sign -m "fix(tests): {summary} + + CI run: {prow_url} + Classifications: {list} + + Co-Authored-By: Claude Opus 4.6 " + ``` + +5. 
**Notify and review window** (before pushing): + + **a) Slack notification** (one-way, if configured): + ```bash + python3 .claude/commands/cypress/scripts/notify-slack.py send fix_applied "*What changed:*\n• {file}: {change_description}\n\n*Why:* {diagnosis_summary}\n*Classification:* {classification} (confidence: {confidence})\n\n`git diff HEAD~1` on branch `{headRefName}`" --pr {pr} --branch {headRefName} + ``` + + **b) GitHub PR review comment** (two-way, if `review-window` > 0): + + Post fix details as a PR comment: + ```bash + python3 .claude/commands/cypress/scripts/review-github.py post {pr} "**What changed:**\n• {file}: {change_description}\n\n**Why:** {diagnosis_summary}\n**Classification:** {classification} (confidence: {confidence})\n\n\`git diff HEAD~1\` on branch \`{headRefName}\`" + ``` + + Capture `COMMENT_TIME` from the output, then wait for author reply: + ```bash + python3 .claude/commands/cypress/scripts/review-github.py wait {pr} {COMMENT_TIME} --timeout {review-window} + ``` + + Parse the output: + - `REPLY=`: PR author provided feedback. Read the reply text and adjust the fix accordingly. This may mean: + - Reverting the commit (`git reset --soft HEAD~1`), applying the user's suggestion, and re-committing + - Or making an additional commit on top with the adjustment + - `NO_REPLY`: No feedback within the window. Proceed with push. + + **Note**: The `wait` command only considers comments from the PR author (`.user.login` match, code-enforced). Comments from other users or bots are ignored. + +6. **Push**: + ```bash + git push origin {headRefName} + ``` + +The push automatically triggers a new Prow run. Go to **Step 4** (wait for CI). + +Track iteration count. If `current_iteration >= max-iterations`: Report remaining failures and **STOP**. + +### Step 7: Flakiness Confirmation + +A single green CI run doesn't prove stability. Trigger `confirm-runs` additional runs (default: 2) to confirm. + +For each confirmation run: + +1. 
Trigger via `/test` comment (no code changes): + ```bash + gh api repos/openshift/monitoring-plugin/issues/{pr}/comments -f body="/test e2e-incidents" + ``` + +2. Wait for completion (Step 4) + +3. Analyze results (Step 5) + +4. If failures found: + - If same test fails across runs → likely a real bug, diagnose and fix (Step 6) + - If different tests fail across runs → environment-dependent flakiness, harder to fix + - Report flakiness pattern to user + +Track results across all runs: +``` +Stability Report: + Run 1 (fix iteration): {SHA} — PASSED + Run 2 (confirm #1): {SHA} — PASSED + Run 3 (confirm #2): {SHA} — PASSED (or FAILED: test X) +``` + +### Step 8: Final Report + +``` +# CI Flaky Test Iteration Report + +## PR: #{pr} - {title} +## Branch: {headRefName} +## Iterations: {N} + +## Timeline +1. [SHA] Initial state — CI FAILURE + - {N} failures: {test names} +2. [SHA] fix(tests): {summary} — pushed, CI triggered +3. [SHA] CI result: PASSED +4. Confirmation run 1: PASSED +5. Confirmation run 2: PASSED + +## Fixes Applied +1. [commit] fix(tests): {summary} + - {file}: {change} + CI run: {prow_url} + +## Stability Assessment +- Tests stable: {N}/{total} (passed all runs) +- Tests flaky: {N} (intermittent failures) +- Tests broken: {N} (failed every run) + +## Flaky Test Details (if any) +- "test name": passed 2/3 runs + Failure pattern: {timing issue / element not found / etc.} + Fix attempted: {yes/no} + +## Remaining Issues +- {any unresolved items} + +## Recommendations +- {merge / needs more investigation / etc.} +``` + +After generating the report, send the final notification: +```bash +python3 .claude/commands/cypress/scripts/notify-slack.py send iteration_done "Iteration complete: {passed}/{total} passed, {flaky} flaky, {iterations} cycles.\n\n{short_summary}" --pr {pr} --branch {headRefName} +``` + +### Step 9: Update Stability Ledger + +After the final report, update `web/cypress/reports/test-stability.md`. 
+ +Read the file and update both sections: + +**1. Current Status table** — for each test in this run: +- If test already in table: update pass rate, update trend +- If test is new: add a row +- Pass rate = total passes / total runs across all recorded iterations +- Trend: compare last 3 runs — improving / stable / degrading + +**2. Run History log** — append a new row: +``` +| {next_number} | {YYYY-MM-DD} | ci | {branch} | {total_tests} | {passed} | {failed} | {flaky} | {commit_sha} | +``` + +**3. Machine-readable data** — update the JSON block between `STABILITY_DATA_START` and `STABILITY_DATA_END` with the new run data. + +Commit: +```bash +git add web/cypress/reports/test-stability.md +``` +```bash +git commit --no-gpg-sign -m "docs: update test stability ledger — {passed}/{total} passed, {flaky} flaky (CI)" +``` + +## Error Handling + +- **Push rejected** (branch protection, force push required): Report to user. Do NOT force push. +- **`/test` comment ignored by Prow**: User may lack `ok-to-test` permission. Check if the label exists on the PR: `gh pr view {pr} --json labels`. +- **CI timeout** (>150 min): Report timeout, check if the job is stuck. Suggest manual inspection. +- **Multiple CI jobs running**: Only track the latest run. Use the `detailsUrl` from the most recent check run. +- **Merge conflicts after push**: Report to user. The PR branch may need rebasing — do NOT rebase automatically. +- **Rate limiting on gh api**: GitHub allows 5000 requests/hour for authenticated users. Polling every 5 min = 12/hour, well within limits. 
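The pass-rate and trend rules from Step 9 can be sketched as below. This is a sketch under two assumptions: per-run results are recorded as `"pass"`/`"fail"` strings (as in the ledger's machine-readable block), and "compare last 3 runs" is read as comparing the last three runs against the three before them:

```python
def pass_rate(results: list[str]) -> float:
    """Pass rate = total passes / total runs across all recorded iterations."""
    return results.count("pass") / len(results) if results else 0.0

def trend(results: list[str]) -> str:
    """Compare the pass rate of the last 3 runs with the runs just before
    them: improving / stable / degrading."""
    if len(results) <= 3:
        return "stable"  # not enough history to call a trend
    recent = pass_rate(results[-3:])
    prior = pass_rate(results[-6:-3])
    if recent > prior:
        return "improving"
    if recent < prior:
        return "degrading"
    return "stable"
```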
+ +## Guardrails + +- **Never force-push** — always additive commits +- **Never push to main** — only to the PR branch +- **Never edit source code** (`src/`) — only test infrastructure +- **Never close or merge the PR** — that's the user's decision +- **Max 3 `/test` comments per hour** — avoid spamming the PR +- **Always include the CI run URL** in commit messages for traceability +- **Stop on CODE_REGRESSION** — if the UI is genuinely broken, that's not a flaky test diff --git a/.claude/commands/iterate-incident-tests.md b/.claude/commands/iterate-incident-tests.md new file mode 100644 index 000000000..246848a9a --- /dev/null +++ b/.claude/commands/iterate-incident-tests.md @@ -0,0 +1,465 @@ +--- +name: iterate-incident-tests +description: Autonomously run, diagnose, fix, and verify incident detection Cypress tests with flakiness probing +parameters: + - name: target + description: > + What to test. Options: + - "all" — all incident tests (excluding @e2e-real) + - "regression" — only regression/ directory tests + - a specific spec file path (e.g., "cypress/e2e/incidents/01.incidents.cy.ts") + - a grep pattern for a specific test (e.g., "should filter by severity") + required: true + - name: max-iterations + description: "Maximum fix-and-retry cycles (default: 3)" + required: false + - name: ci-url + description: "Optional: gcsweb or Prow URL for CI results to use as starting context (triggers /analyze-ci-results first)" + required: false + - name: flakiness-runs + description: "Number of flakiness probe runs (default: 3). Set to 0 to skip flakiness probing" + required: false + - name: skip-branch + description: "If 'true', work on current branch instead of creating a new one (default: false)" + required: false +--- + +# Iterate Incident Tests + +Autonomous test iteration loop: run tests, diagnose failures, apply fixes, verify, and probe for flakiness. + +## Prerequisites + +### 1. 
Cypress Environment + +Run `/cypress-setup` first to ensure `web/cypress/export-env.sh` exists with cluster credentials. + +### 2. Permissions + +This skill runs autonomously and needs pre-approved permissions in `.claude/settings.local.json` to avoid interactive approval prompts blocking the loop. Required permissions: + +```json +{ + "permissions": { + "allow": [ + "Bash(git stash:*)", + "Bash(git checkout:*)", + "Bash(git checkout -b:*)", + "Bash(git branch:*)", + "Bash(git add:*)", + "Bash(git commit:*)", + "Bash(git status:*)", + "Bash(git diff:*)", + "Bash(git log:*)", + "Bash(rm -f screenshots/cypress_report_*.json:*)", + "Bash(rm -f screenshots/merged-report.json:*)", + "Bash(rm -rf cypress/screenshots/*:*)", + "Bash(rm -rf cypress/videos/*:*)", + "Bash(npx cypress run:*)", + "Bash(npx mochawesome-merge:*)", + "Bash(source cypress/export-env.sh:*)", + "Bash(cd /home/drajnoha/Code/monitoring-plugin:*)", + "Bash(find /home/drajnoha/Code/monitoring-plugin/web/cypress:*)", + "Bash(ls:*)" + ] + } +} +``` + +The `rm` permissions are scoped to test artifact directories only (mochawesome reports, screenshots, videos) — these are regenerated every run. + +### 3. Unsigned Commits + +All commits in this workflow use `--no-gpg-sign` to avoid GPG passphrase prompts blocking the loop. These unsigned commits live on a working branch and are intended to be **squash-merged** by the user with their own signature when approved. Never push unsigned commits directly to main. + +If using CI analysis, also add to `web/.claude/settings.local.json`: +```json +"WebFetch(domain:gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com)" +``` + +## Instructions + +Execute the following steps in order. This is the main orchestrator — it coordinates sub-agents and manages the iteration loop. + +### Step 0: CI Context (optional) + +If `ci-url` is provided, run `/analyze-ci-results` first to get CI failure context. 
+ +Capture the CI analysis output: +- If **all failures are INFRA_***: Report the infrastructure issues to the user and **STOP**. No test changes will help. +- If **mixed infra + test/code**: Note the infra issues for the user, but proceed with the test/code failures only. +- If **all test/code**: Proceed. Use the CI diagnosis (commit correlation, screenshots) as context for the local iteration. + +Store the CI analysis as `ci_context` for later reference by diagnosis agents. + +### Step 1: Branch Setup + +First, check the current branch: +```bash +git rev-parse --abbrev-ref HEAD +``` + +**Decision logic:** +- If `skip-branch` is "true": Stay on the current branch, skip to Step 2. +- If already on a `test/incident-robustness-*` branch: Stay on it, skip to Step 2. +- If on any other non-main working branch (e.g., `agentic-test-iteration`, a feature branch): Ask the user whether to create a child branch or work on the current one. +- If on `main`: Create a new branch. + +To create a branch (only when needed): +```bash +git checkout -b test/incident-robustness-$(date +%Y-%m-%d) +``` + +If that branch name already exists, append a suffix: `-2`, `-3`, etc. + +**IMPORTANT**: Do NOT combine `cd` and `git` in the same command — compound `cd && git` commands trigger a security approval prompt that blocks autonomous execution. Always use separate Bash calls, or set the working directory before running git. 
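The suffix rule above (`-2`, `-3`, etc.) can be pinned down with a small helper. A sketch only; `next_branch_name` is an illustrative name, and the agent would feed it the output of `git branch --list 'test/incident-robustness-*'`:

```python
def next_branch_name(base: str, existing: list[str]) -> str:
    """Return base unchanged if it is free, otherwise the first free
    suffixed variant: base-2, base-3, ..."""
    if base not in existing:
        return base
    n = 2
    while f"{base}-{n}" in existing:
        n += 1
    return f"{base}-{n}"
```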
+ +### Step 2: Resolve Target + +Based on the `target` parameter, determine the Cypress run command: + +| Target | Spec | Grep Tags | +|--------|------|-----------| +| `all` | `cypress/e2e/incidents/**/*.cy.ts` | `@incidents --@e2e-real --@flaky --@demo` | +| `regression` | `cypress/e2e/incidents/regression/**/*.cy.ts` | `@incidents --@e2e-real --@flaky` | +| specific file | `cypress/e2e/incidents/{target}` | (none) | +| grep pattern | `cypress/e2e/incidents/**/*.cy.ts` | (none, use `--env grep="{target}"`) | + +### Step 3: Clean Previous Results + +**IMPORTANT**: Never chain commands with `&&`. Use separate Bash calls for each operation — compound commands trigger security prompts that block autonomous execution. + +From the `web/` directory: +```bash +rm -f screenshots/cypress_report_*.json +``` +```bash +rm -f screenshots/merged-report.json +``` +```bash +rm -rf cypress/screenshots/* +``` +```bash +rm -rf cypress/videos/* +``` + +### Step 4: Run Tests + +Execute Cypress inline (NOT in a separate terminal). From the `web/` directory: + +```bash +source cypress/export-env.sh && npx cypress run --spec "{SPEC}" {GREP_ARGS} +``` + +Note: `source && npx` is one logical operation (env setup + run) and is acceptable as a single command. + +**IMPORTANT**: This command may take several minutes. Use a timeout of 600000ms (10 minutes). + +Capture the exit code: +- `0` = all passed +- non-zero = failures occurred + +### Step 5: Parse Results + +Merge mochawesome reports and parse. 
From the `web/` directory: + +```bash +npx mochawesome-merge screenshots/cypress_report_*.json -o screenshots/merged-report.json +``` + +Read `screenshots/merged-report.json` and extract: + +For each test: +``` +{ + spec_file: string, // from results[].fullFile + suite: string, // from suites[].title + test_name: string, // from tests[].title + full_title: string, // from tests[].fullTitle + state: "passed" | "failed" | "skipped", + error_message: string, // from tests[].err.message (if failed) + stack_trace: string, // from tests[].err.estack (if failed) + duration_ms: number // from tests[].duration +} +``` + +Build a failure list and a pass list. + +**Note**: Mochawesome JSON has nested suites. Walk the tree recursively: +``` +results[] -> suites[] -> tests[] + -> suites[] -> tests[] (nested suites) +``` + +### Step 6: Identify Screenshots + +For each failure, find the corresponding screenshot: + +```bash +find /home/drajnoha/Code/monitoring-plugin/web/cypress/screenshots -name "*.png" -type f +``` + +Match screenshots to failures using the naming convention: +``` +{Suite Name} -- {Test Title} (failed).png +{Suite Name} -- {Test Title} -- before all hook (failed).png +``` + +### Step 7: Diagnosis Loop + +**If no failures** (exit code 0): Skip to Step 10 (flakiness probe). + +**If failures exist**: For each failing test, spawn a **Diagnosis Agent** (Explore-type sub-agent). + +Use the `/diagnose-test-failure` skill prompt. Provide: +- `test-name`: the full title +- `spec-file`: the spec file path +- `error-message`: the error message +- `screenshot-path`: absolute path to the failure screenshot +- `stack-trace`: the error stack trace +- `ci-context`: any relevant context from Step 0 + +**Parallelization**: If failures are in **different spec files**, spawn diagnosis agents in parallel. If they're in the **same spec file**, diagnose sequentially (they may share root causes like a broken `before all` hook). 
+ +**Before-all hook failures**: If a `before all` hook failed, all tests in that suite were skipped. Diagnose only the hook failure — fixing it will unblock all skipped tests. + +Collect all diagnoses. Separate into: +- **Fixable**: `TEST_BUG`, `FIXTURE_ISSUE`, `PAGE_OBJECT_GAP`, `MOCK_ISSUE` +- **Blocking**: `REAL_REGRESSION`, `INFRA_ISSUE` + +If any **blocking** issues found: Report them to the user. Continue fixing the fixable issues. + +### Step 8: Fix Loop + +For each fixable failure, spawn a **Fix Agent** (general-purpose sub-agent). + +Provide the Fix Agent with: +1. The full diagnosis from Step 7 +2. The test file content (read it) +3. The page object content (read `cypress/views/incidents-page.ts`) +4. The fixture content (if relevant) +5. These constraints: + +``` +## Fix Constraints + +You may ONLY edit files in these paths: +- web/cypress/e2e/incidents/**/*.cy.ts (test files) +- web/cypress/fixtures/incident-scenarios/*.yaml (fixtures) +- web/cypress/views/incidents-page.ts (page object) +- web/cypress/support/incidents_prometheus_query_mocks/** (mock layer) + +You must NOT edit: +- web/src/** (source code — that's Phase 2) +- Non-incident test files +- Cypress config or support infrastructure +- Any file outside the web/ directory + +## Fix Guidelines + +- Prefer the minimal change that fixes the issue +- Don't refactor surrounding code — only fix the failing test +- If adding a wait/timeout, prefer Cypress retry-ability (.should()) over cy.wait() +- If fixing a selector, check that the new selector exists in the current DOM + by reading the relevant React component in src/ (read-only, don't edit) +- If fixing a fixture, validate it against the fixture schema + (run /validate-incident-fixtures mentally or reference the schema) +- If adding a page object method, follow existing naming conventions +``` + +After the Fix Agent returns, verify the fix makes sense: +- Does the edit address the diagnosed root cause? +- Could the edit break other tests? 
+- Is it the minimal change needed? + +If the fix looks wrong, re-diagnose with additional context. + +### Step 9: Validate Fixes + +After applying fixes, re-run **only the previously failing tests**: + +From the `web/` directory: +```bash +source cypress/export-env.sh && npx cypress run --spec "{SPEC}" --env grep="{FAILING_TEST_NAME}" +``` + +For each test: +- **Now passes**: Stage the fix files with `git add` +- **Still fails**: Re-diagnose (increment retry counter). Max 2 retries per test. +- **After 2 retries still failing**: Mark as `UNRESOLVED` and report to user + +### Step 10: Commit Batch + +After all fixable failures are addressed (or max retries reached): + +Stage and commit as separate commands (never chain `cd && git`): +```bash +git add {files} +``` +```bash +git commit --no-gpg-sign -m "{message}" +``` + +Commit message format: +``` +fix(tests): {summary} + +- {file}: {change} +- {file}: {change} + +Classifications: N TEST_BUG, N FIXTURE_ISSUE, N PAGE_OBJECT_GAP, N MOCK_ISSUE +Unresolved: N (if any) + +Co-Authored-By: Claude Opus 4.6 +``` + +Track commit count. If commit count reaches **5**: Notify the user that the review threshold has been reached and ask whether to continue or pause for review. + +### Step 11: Iterate + +If there were failures and `current_iteration < max-iterations`: +- Increment iteration counter +- Go back to **Step 3** (clean results and re-run) + +This catches cascading fixes — e.g., fixing a `before all` hook unblocks skipped tests that may have their own issues. + +If all tests pass: Proceed to Step 12. + +### Step 12: Flakiness Probe + +Run the full target test suite `flakiness-runs` times (default: 3), even if everything is green. + +For each run: +1. Clean previous results (Step 3) +2. Run tests (Step 4) +3. Parse results (Step 5) +4. 
Record per-test pass/fail

After all runs, compute flakiness:

```
Flakiness Report:
  Total tests: N
  Stable (all runs passed): N
  Flaky (some runs failed): N
  Broken (all runs failed): N

  Flaky tests:
  - "test name" — passed 2/3 runs
    Error on failure: {error message}
  - "test name" — passed 1/3 runs
    Error on failure: {error message}
```

For each **flaky** test:
- Diagnose it using `/diagnose-test-failure` with the context that it's intermittent
- Common flaky patterns: race conditions, animation timing, network mock timing, DOM detach/reattach
- Apply fix if confident (add `.should('exist')` guards, use `{ timeout: N }`, avoid `.eq(N)` on dynamic lists)
- Re-run flakiness probe on just the fixed tests to verify

### Step 13: Final Report

Output a summary:

```
# Iteration Complete

## Branch: test/incident-robustness-YYYY-MM-DD
## Commits: N
## Iterations: N

## Results
- Tests run: N
- Passing: N
- Fixed in this session: N
- Unresolved: N (details below)
- Flaky (stabilized): N
- Flaky (remaining): N

## Fixes Applied
1. [commit-sha] fix(tests): {summary}
   - {test name}: {what was fixed}

2. [commit-sha] fix(tests): {summary}
   - {test name}: {what was fixed}

## Unresolved Issues
- "test name": REAL_REGRESSION — {description}. Source file X was modified in PR #N.
- "test name": UNRESOLVED after 2 retries — {last error}

## Remaining Flakiness
- "test name": 2/3 passed — timing issue in chart rendering, needs investigation

## Recommendations
- [Next steps for unresolved issues]
- [Whether to merge current fixes or wait]
```

### Step 14: Update Stability Ledger

After the final report, update `web/cypress/reports/test-stability.md`.

Read the file and update the following sections:

**1. Current Status table** — for each test in this run:
- If test already in table: update pass rate (rolling average across all recorded runs), update trend
- If test is new: add a row
- Pass rate = total passes / total runs across all recorded iterations
- Trend: compare last 3 runs — improving / stable / degrading

**2. 
Run History log** — append a new row: +``` +| {next_number} | {YYYY-MM-DD} | local | {branch} | {total_tests} | {passed} | {failed} | {flaky} | {commit_sha} | +``` + +**3. Machine-readable data** — update the JSON block between `STABILITY_DATA_START` and `STABILITY_DATA_END`: +```json +{ + "tests": { + "test full title": { + "results": ["pass", "pass", "fail", "pass"], + "last_failure_reason": "Timed out...", + "last_failure_date": "2026-03-23", + "fixed_by": "abc1234" + } + }, + "runs": [ + { + "date": "2026-03-23", + "type": "local", + "branch": "test/incident-robustness-2026-03-23", + "total": 15, + "passed": 15, + "failed": 0, + "flaky": 0, + "commit": "abc1234" + } + ] +} +``` + +Commit the ledger update together with the final batch of fixes if any, or as a standalone commit: +```bash +git add web/cypress/reports/test-stability.md +``` +```bash +git commit --no-gpg-sign -m "docs: update test stability ledger — {passed}/{total} passed, {flaky} flaky" +``` + +### Error Handling + +- **Cypress crashes** (not just test failures): Check if it's an OOM issue (`--max-old-space-size`), a missing dependency, or a config problem. Report to user. +- **No `export-env.sh`**: Remind user to run `/cypress-setup` first. +- **No mochawesome reports generated**: Check if the reporter config is correct. Fall back to parsing Cypress console output. +- **Git conflicts**: If the working branch has conflicts with changes, report to user and stop. +- **Sub-agent failure**: If a Diagnosis or Fix agent fails, log the error and skip that test. Don't let one broken agent block the whole loop. 
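The Step 14 merge into the machine-readable block can be sketched as follows. This is a hypothetical helper (`record_run` and `pass_rate` are illustrative names); reading the block out of `test-stability.md` and re-embedding it between the `STABILITY_DATA_START`/`STABILITY_DATA_END` markers is left out.

```python
# Hypothetical helper for the Step 14 ledger update. `data` has the shape
# of the JSON block shown above ("tests" and "runs" keys).

def record_run(data: dict, run: dict, results: dict) -> dict:
    """Merge one run into the ledger. `results` maps a full test title
    to "pass" or "fail"; `run` is one entry for the "runs" list."""
    for title, outcome in results.items():
        entry = data["tests"].setdefault(title, {"results": []})
        entry["results"].append(outcome)
        if outcome == "fail":
            entry["last_failure_date"] = run.get("date")
    data["runs"].append(run)
    return data

def pass_rate(data: dict, title: str) -> float:
    """Rolling pass rate across all recorded runs, as used in the
    Current Status table."""
    results = data["tests"][title]["results"]
    return results.count("pass") / len(results)
```

The trend column could be derived the same way by comparing the last three entries of each `results` list.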
+ +### Guardrails + +- **Never edit source code** (`src/`) in Phase 1 +- **Never disable a test** — if a test can't be fixed, mark it as unresolved, don't add `.skip()` +- **Never add `@flaky` tag** to a test — that's a human decision +- **Never change test assertions to match wrong behavior** — if the UI is wrong, it's a REAL_REGRESSION +- **Max 2 retries per test** to avoid infinite loops +- **Max 5 commits before pausing** for user review +- **Always run flakiness probe** before declaring success diff --git a/docs/agentic-test-iteration-ideas.md b/docs/agentic-test-iteration-ideas.md new file mode 100644 index 000000000..3101473bd --- /dev/null +++ b/docs/agentic-test-iteration-ideas.md @@ -0,0 +1,464 @@ +# Agentic Test Iteration — Ideas & Future Improvements + +Ideas and potential enhancements for the agentic test iteration system. These are not committed plans — they're options to explore when the core workflow is stable. + +## Authentication: GitHub App for CI Triggering + +**Problem**: The CI iteration skill (`/iterate-ci-flaky`) needs to comment `/test` on upstream PRs to trigger Prow. Current options (PATs, OAuth) are tied to a personal GitHub account. + +**Idea**: Create a dedicated GitHub App installed on `openshift/monitoring-plugin`. + +### How it would work + +1. Create a GitHub App with minimal permissions: `Issues: Write`, `Pull requests: Read`, `Checks: Read` +2. An org admin approves installation on `openshift/monitoring-plugin` +3. The app authenticates via a private key (`.pem` file) → short-lived installation tokens (1h expiry, auto-rotated) +4. 
Comments appear as `my-ci-bot[bot]` instead of a personal user + +### Tradeoffs vs OAuth + +| Aspect | OAuth (`gh auth login --web`) | GitHub App | +|--------|-------------------------------|------------| +| Setup effort | Minimal | Moderate (create app, org admin approval) | +| Tied to a person | Yes | No — bot identity | +| Survives user leaving org | No | Yes | +| Token management | Manual refresh | Automatic (1h expiry from private key) | +| Audit trail | Personal user | Dedicated bot account | +| Team sharing | Each person needs own auth | One app, anyone's agent can use it | + +### When to pursue + +- When multiple team members want to use the CI iteration skill +- When you want a persistent bot identity for test automation comments +- When you want to remove personal account dependency + +### Blocker + +Requires an `openshift` org admin to approve the app installation. + +--- + +## CI Iteration: Fully Automated Job Triggering + +**Problem**: Currently the CI loop requires either a `/test` comment (needs upstream write access) or a `git push` (triggers automatically). The push path works but creates noise commits. + +**Ideas**: +- **Empty commits**: `git commit --allow-empty -m "retrigger CI"` — triggers Prow without code changes, but pollutes history +- **Prow API**: Prow may have a direct API for retriggering jobs without GitHub comments — investigate `https://prow.ci.openshift.org/` endpoints +- **GitHub Actions bridge**: A lightweight GitHub Action on the fork that comments `/test` on the upstream PR when triggered via `workflow_dispatch` + +--- + +## Parallel CI Runs for Flakiness Detection + +**Problem**: Flakiness probing requires N sequential CI runs (~2h each). 3 runs = 6 hours. + +**Idea**: Open N temporary PRs from the same branch, each triggers its own CI run in parallel. Collect all results, then close the temporary PRs. + +**Tradeoff**: Consumes N times the CI resources. May not be acceptable for shared CI infrastructure. 
+ +**Alternative**: Ask if Prow supports multiple runs of the same job on the same PR — some CI systems allow this. + +--- + +## Local Mock Tests + CI Real Tests as Two-Phase Validation + +**Problem**: Local iteration is fast but uses mocked data. CI uses real clusters but is slow (~2h). + +**Idea**: Formalize a two-phase approach: +1. **Phase A** (`/iterate-incident-tests`): Fast local iteration with mocks — fix all mock-testable issues +2. **Phase B** (`/iterate-ci-flaky`): Push to CI — catch environment-specific flakiness + +The orchestrator could automatically transition from Phase A to Phase B when local tests are green. + +--- + +## Agent Fork with Deploy Key + +**Problem**: The agent creates unsigned commits on the user's working branch. Push access, GPG signing, and branch management all create friction. + +**Idea**: A dedicated fork (`monitoring-plugin-agent` or similar) with: +- A passwordless deploy key for push access +- No GPG signing requirement +- Agent creates PRs from the fork to the upstream repo +- User reviews and merges — clean separation of human vs agent work + +**Benefits**: +- No unsigned commits in the user's fork +- Agent can push freely without SSH key access to user's account +- Clear audit trail: all agent work comes from the agent fork +- Multiple agents (different team members) can share the same fork + +--- + +## Screenshot Diffing for Visual Regression + +**Problem**: The diagnosis agent reads failure screenshots to understand UI state, but has no reference for "what it should look like." + +**Idea**: Capture baseline screenshots from passing tests and store them. On failure, the agent can compare the failure screenshot against the baseline to identify visual differences. + +**Implementation**: Cypress has plugins for visual regression testing (`cypress-image-snapshot`). The agent could: +1. Generate baselines from a known-good run +2. On failure, diff the failure screenshot against baseline +3. 
Highlight visual changes to speed up diagnosis + +--- + +## Test Stability Ledger + +**Status**: Partially implemented. Ledger file created at `web/cypress/reports/test-stability.md`. Update step added to `/iterate-incident-tests` (Step 14). Still needs to be wired into `/iterate-ci-flaky`. + +**Problem**: Flakiness data is ephemeral — it exists in the agent's report from one run and is lost. Next time the agent runs, it has no memory of previous results. + +**Design**: A markdown file with embedded machine-readable JSON, updated by both skills after each run. + +**Location**: `web/cypress/reports/test-stability.md` — committed to the working branch, travels with the fixes. + +**Contents**: +- Human-readable table: per-test pass rate, trend, last failure reason, fix commit +- Run history log: date, type (local/CI), branch, pass/fail counts +- Machine-readable JSON block for programmatic parsing by the agent + +**Agent behavior**: +- Reads the ledger at the start of each run to prioritize — "this test was flaky in last 3 runs, focus here" +- Updates the ledger after each run with new results +- Commits the ledger update alongside fixes + +--- + +## Slack Notifications for Long-Running Loops + +**Status**: Implemented. Slack webhook notifications (Option A) integrated into `/iterate-ci-flaky`. GitHub PR comment-based review flow implemented as the two-way interaction channel (`review-github.py`). Option B (Slack bot with thread replies) documented but deprioritized due to internal setup complexity. + +### The Problem + +The CI iteration loop (`/iterate-ci-flaky`) runs for hours — each CI run takes ~2h, and the loop may do 3-5 fix-push-wait cycles. 
During that time: + +- The user has no visibility into what the agent decided to fix or how +- By the time the loop finishes, multiple commits may have been pushed with no chance to course-correct +- A wrong fix in cycle 1 wastes 2+ hours of CI time before the agent discovers it didn't work +- The user may have domain context ("that test is flaky because of animation timing, not the selector") that would save cycles + +The core tension: **autonomy vs oversight**. The agent should run independently, but the user needs the ability to intervene at natural pause points. + +### Natural Pause Points + +The CI loop has built-in pauses where user input is most valuable: + +``` +Push fix ──→ [PAUSE: fix_applied] ──→ CI runs (~2h) ──→ [PAUSE: ci_complete] ──→ Analyze ──→ ... +``` + +1. **After fix, before CI runs** (`fix_applied`): The agent committed a fix and is about to push (or just pushed). This is the highest-value notification — the user can review the approach and say "redo" before a 2-hour CI cycle starts. + +2. **After CI completes** (`ci_complete`): Results are in. The agent is about to diagnose. User might have context about known issues. + +3. **When blocked** (`blocked`): Agent can't continue — needs human decision. + +### Review Window + +For the `fix_applied` event, the agent could optionally **wait before pushing**, giving the user a time window to respond: + +``` +Agent: "I'm about to push this fix. Waiting 10 minutes for feedback before proceeding." + [Shows diff summary in Slack] + +User (within 10 min): "Don't change the selector, the issue is timing. Add a cy.wait(500) instead." + +Agent: Reverts fix, applies user's suggestion, pushes that instead. +``` + +Or if no response within the window, the agent proceeds autonomously. + +Configuration: `review-window=10m` parameter on `/iterate-ci-flaky`. Set to `0` for fully autonomous (no waiting). 
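A minimal sketch of turning the `review-window` value into seconds. Only the `10m` and `0` forms appear above; the `s` suffix and bare-number handling are assumptions.

```python
# Hypothetical parser for the review-window parameter ("10m", "30s", "0").

def parse_review_window(value: str) -> int:
    """Return the review window in seconds; 0 disables waiting."""
    value = value.strip().lower()
    if value.endswith("m"):
        return int(value[:-1]) * 60
    if value.endswith("s"):
        return int(value[:-1])
    return int(value)  # bare number: treat as seconds
```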
+ +### Notification Content — What Makes Each Message Actionable + +**`fix_applied`** — the most important notification: +``` +:wrench: Agent: Fix Applied + +*What changed:* +• `cypress/views/incidents-page.ts:45` — selector `.severity-filter` → `[data-test="severity-filter"]` +• `cypress/e2e/incidents/regression/01.reg_filtering.cy.ts:78` — added `.should('exist')` guard before click + +*Why:* Screenshot showed the filter dropdown existed but had a different class. The `data-test` attribute is stable across builds. + +*Classification:* PAGE_OBJECT_GAP (confidence: HIGH) + +*Diff:* `git diff HEAD~1` on branch `test/incident-robustness-2026-03-24` + +*Next:* CI will trigger automatically on push. Reply in the agent session to change approach. + +PR #860 | Branch: test/incident-robustness-2026-03-24 +``` + +The key: show **what** changed, **why** the agent chose that fix, and **how confident** it is. This lets the user quickly decide "looks good, let it run" vs "wrong approach, let me intervene." + +**`ci_complete`** — actionable status: +``` +:white_check_mark: Agent: CI Complete — PASSED (run 2/5) + +*Results:* 15/15 tests passed in 1h 47m +*Flakiness probe:* 2 of 5 confirmation runs complete, all green so far + +*Next:* Triggering confirmation run 3. No action needed. + +PR #860 | Branch: test/incident-robustness-2026-03-24 | CI Run +``` + +Or on failure: +``` +:x: Agent: CI Complete — FAILED (iteration 2/3) + +*Results:* 13/15 passed, 2 failed +*Failures:* +• "should filter by severity" — Timed out on `[data-test="severity-chip"]` (same as last run) +• "should display chart bars" — new failure, `Expected 5 bars, found 0` + +*Assessment:* +• severity filter: same fix didn't work, will try different approach +• chart bars: new failure — possibly caused by previous fix (will investigate) + +*Next:* Diagnosing and fixing. Will notify before pushing. 
+ +PR #860 | Branch: test/incident-robustness-2026-03-24 | CI Run +``` + +**`blocked`** — requires user action: +``` +:octagonal_sign: Agent: Blocked — REAL_REGRESSION + +*Test:* "should display incident bars in chart" +*Issue:* Chart component renders empty. Screenshot shows the chart area with no bars, no error, no loading state. +*Commit correlation:* `src/components/incidents/IncidentChart.tsx` was modified in this PR (+45, -12) + +*This is not a test issue* — the chart rendering logic appears broken. Agent cannot fix source code in Phase 1. + +*Action needed:* Investigate the chart component refactor. Agent will stop iterating on this test. + +PR #860 | Branch: test/incident-robustness-2026-03-24 +``` + +### Implementation Options + +**Option A: Slack Incoming Webhook** (recommended starting point) +- Setup: Slack → Apps → Incoming Webhooks → create webhook for your channel. 5 minutes. +- Set `SLACK_WEBHOOK_URL` in `export-env.sh` or `~/.zshrc` +- Agent posts via `curl` in a standalone `notify-slack.py` script +- Messages formatted with Slack Block Kit (sections, context, code blocks) +- Pro: No Slack app, no server, no OAuth. Just a URL. +- Con: One-way — user sees notifications but must respond in the Claude Code session, not in Slack + +**Option B: Slack Bot with thread-based interaction** (no callback server needed) +- Create a Slack App with bot token (`chat:write`, `channels:history`) +- Agent posts messages to a channel, capturing the message `ts` (timestamp/ID) +- Before proceeding at pause points, agent **reads thread replies** via `conversations.replies` API +- If user replied in the Slack thread → agent reads the reply and adjusts +- If no reply within the review window → agent proceeds + +``` +Agent posts: "Fix applied. Reply in this thread to change approach. Proceeding in 10 min." 
User replies: "Use data-test attributes instead of class selectors"
Agent reads: conversations.replies → sees user feedback → adjusts fix
```

- Pro: Two-way interaction without a callback server. User stays in Slack.
- Con: Needs a Slack App (not just a webhook). Polling for replies adds complexity. Bot token needs to be stored securely.

**Implementation sketch for Option B:**
```python
import os
import time

from slack_sdk import WebClient  # requires the slack_sdk package

slack_client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def wait_for_feedback(blocks, channel, bot_user_id, review_window_seconds):
    # Post notification and get message timestamp
    response = slack_client.chat_postMessage(channel=channel, blocks=blocks)
    message_ts = response["ts"]

    # Wait for review window, polling for replies
    deadline = time.time() + review_window_seconds
    while time.time() < deadline:
        replies = slack_client.conversations_replies(channel=channel, ts=message_ts)
        user_replies = [r for r in replies["messages"] if r.get("user") != bot_user_id]
        if user_replies:
            return user_replies[-1]["text"]  # Return latest user feedback
        time.sleep(30)

    return None  # No feedback, proceed autonomously
```

**Option C: Claude Code hooks → Slack bridge**
- Configure a Claude Code hook that fires on `git commit` or specific tool calls
- The hook runs a shell script that posts to Slack
- Pro: Zero changes to the skills — hooks are external
- Con: Less control over notification content and timing. Can't implement review windows. Hooks are local config, not portable.

**Option D: GitHub PR comments as notification channel**
- Instead of Slack, the agent posts status updates as PR comments
- User replies directly on the PR
- Agent reads PR comments via `gh api` before proceeding
- Pro: No Slack setup at all. Everything stays in GitHub. Natural for code review context.
- Con: Noisier PR history. Not real-time (no push notifications unless GitHub notifications are configured).

### Recommended Progression

1. **Start with Option A** — get visibility. User monitors passively, intervenes in Claude Code session when needed.
2. 
**Upgrade to Option B** when the review window pattern proves valuable — adds two-way interaction within Slack. +3. **Option D** is a good alternative if you prefer keeping everything in GitHub — especially for team use where the PR is the natural communication hub. + +### Configuration + +```bash +# Option A: Webhook only (one-way) +export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/T.../B.../..." + +# Option B: Bot with thread interaction (two-way) +export SLACK_BOT_TOKEN="xoxb-..." +export SLACK_CHANNEL_ID="C0123456789" +export SLACK_REVIEW_WINDOW="600" # seconds to wait for feedback (0 = no wait) +``` + +### Skill Integration Points + +Where notifications fire in each skill: + +**`/iterate-ci-flaky`:** +- Step 3: `ci_started` — after `/test` comment or push +- Step 5: `ci_complete` — after CI analysis +- Step 6: `fix_applied` — after committing fix, before push (with optional review window) +- Step 7: `flaky_found` — when flakiness detected in confirmation runs +- Step 8: `iteration_done` — final summary +- Any step: `blocked` — on REAL_REGRESSION, INFRA_ISSUE, auth failure + +**`/iterate-incident-tests`:** +- Step 10: `fix_applied` — after committing batch (less critical since local runs are fast) +- Step 12: `flaky_found` — during flakiness probe +- Step 13: `iteration_done` — final summary +- Any step: `blocked` — on REAL_REGRESSION + +--- + +## Cloud Execution: Long-Running Autonomous Agent + +**Problem**: The current setup requires a local machine with an active Claude Code CLI session. Long CI polling (~2h per run) causes session timeouts, and the user must keep a terminal open. 
+ +### Option 1: Claude Code Headless Mode (simplest) + +Run Claude Code non-interactively without a TTY: + +```bash +claude --print --dangerously-skip-permissions \ + -p "/iterate-ci-flaky pr=860 confirm-runs=5" +``` + +- `--print` / `-p`: non-interactive, outputs result and exits +- `--dangerously-skip-permissions`: skips all approval prompts (use only in sandboxed environments) +- Can run in `tmux`, `nohup`, GitHub Actions, or any CI runner +- Uses the same tools, skills, and CLAUDE.md as interactive mode +- Limitation: single-shot execution — runs the prompt and exits + +**Deployment**: `nohup claude --print ... > output.log 2>&1 &` on any machine, or in a GitHub Actions runner. + +### Option 2: Claude Agent SDK (most flexible) + +The Agent SDK (`@anthropic-ai/claude-code`) is a Node.js/TypeScript library that embeds Claude Code as a programmable agent: + +```typescript +import { Claude } from "@anthropic-ai/claude-code"; + +const claude = new Claude({ + dangerouslySkipPermissions: true, +}); + +const result = await claude.message({ + prompt: "/iterate-ci-flaky pr=860 confirm-runs=5", + workingDirectory: "/path/to/monitoring-plugin", +}); + +// Post result as PR comment +await octokit.issues.createComment({ + owner: "openshift", repo: "monitoring-plugin", + issue_number: 860, body: result.text, +}); +``` + +#### SDK vs CLI comparison + +| Aspect | CLI (`claude`) | Agent SDK | +|--------|---------------|-----------| +| Runtime | Terminal process | Node.js library | +| Lifecycle | Single session, exits | Embed in any long-lived process | +| Event-driven | No | Yes — webhooks, timers, PR events | +| Permissions | Interactive prompts or skip-all | Programmatic control | +| Tools | Built-in (Read, Write, Bash, etc.) | Same built-in + custom tools | +| State | Session-scoped | Persistent (DB, files, etc.) 
| +| Deployment | Local terminal | Anywhere Node.js runs | + +#### Requirements to port current skills + +- Node.js runtime with `@anthropic-ai/claude-code` +- `ANTHROPIC_API_KEY` environment variable +- `gh` CLI authenticated (or GitHub App token for comment access) +- Git + SSH for pushing to fork +- The repo cloned in the agent's working directory +- All skill files (`.claude/commands/`) present in the clone + +#### What stays the same + +- Skills (`.md` files) — the SDK reads them from `.claude/commands/` +- Polling script (`poll-ci-status.py`) — SDK runs Bash the same way +- `/diagnose-test-failure`, `/analyze-ci-results` — all work as-is +- File editing, git operations, Cypress execution — identical + +#### What changes + +- No permission prompts — `dangerouslySkipPermissions` in a sandboxed container +- State between runs — persist to file or DB instead of ephemeral session +- Triggering — webhook handler calls the SDK instead of user typing a command +- Error recovery — the wrapping process can catch failures and retry + +### Option 3: GitHub Actions Workflow (cloud, event-driven) + +A GitHub Actions workflow that runs the agent on PR events: + +```yaml +name: Flaky Test Iteration +on: + issue_comment: + types: [created] + +jobs: + iterate: + if: contains(github.event.comment.body, '/run-flaky-iteration') + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-node@v4 + - name: Install Claude Code + run: npm install -g @anthropic-ai/claude-code + - name: Run iteration + env: + ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} + GH_TOKEN: ${{ secrets.GH_TOKEN }} + run: | + claude --print --dangerously-skip-permissions \ + -p "/iterate-ci-flaky pr=${{ github.event.issue.number }} confirm-runs=3" + - name: Post results + run: gh pr comment ${{ github.event.issue.number }} --body-file output.md +``` + +**Flow**: +1. User comments `/run-flaky-iteration` on a PR +2. GitHub Actions triggers the workflow +3. 
Claude Code runs in headless mode on the Actions runner +4. Agent executes the full iteration loop (trigger CI, wait, analyze, fix, push) +5. Results posted back as a PR comment + +**Considerations**: +- GitHub Actions runners have a 6h timeout — enough for 2-3 CI runs +- Needs `ANTHROPIC_API_KEY` and `GH_TOKEN` as repository secrets +- Runner needs SSH key for git push (or use `GH_TOKEN` with HTTPS) +- Cost: API tokens consumed + GitHub Actions minutes + +### Recommendation + +1. **Start with headless mode** (`tmux` + `--print`) to validate the flow works without interactive prompts +2. **Move to GitHub Actions** for true cloud execution — event-driven, no local machine needed +3. **Agent SDK** when you want a custom orchestrator with richer state management, error recovery, or Slack integration beyond what the skills provide diff --git a/docs/agentic-test-iteration.md b/docs/agentic-test-iteration.md new file mode 100644 index 000000000..9887d8403 --- /dev/null +++ b/docs/agentic-test-iteration.md @@ -0,0 +1,258 @@ +# Agentic Test Iteration Architecture + +Autonomous multi-agent system for iterating on Cypress test robustness, with visual feedback (screenshots + videos), CI result ingestion, and flakiness detection. 

## Goals

| Phase | Objective |
|-------|-----------|
| **Phase 1** (current) | Make incident detection tests robust — fix selectors, timing, fixtures, page object gaps |
| **Phase 2** (future) | Refactor frontend code using tests as a behavioral contract / safety net |

## Architecture Overview

```
User: /iterate-incident-tests target=regression max-iterations=3

Coordinator (main Claude Code session)
  |
  |-- [CI Analysis] /analyze-ci-results (optional first step)
  |     Fetches CI artifacts, classifies infra vs test/code failures
  |     Correlates failures with git commits for context
  |     If all INFRA -> report to user and STOP
  |
  |-- Create branch: test/incident-robustness-{YYYY-MM-DD}
  |
  |-- [Runner] Cypress headless via Bash (inline, not separate terminal)
  |     Sources export-env.sh, produces mochawesome JSON + screenshots + videos
  |
  |-- [Parser] Extract failures from mochawesome JSON reports
  |     Per failure: test name, error message, stack trace, screenshot path, video path
  |
  |-- For each failure (parallelizable):
  |     |
  |     |-- [Diagnosis Agent] (Explore-type sub-agent)
  |     |     Reads: screenshot (multimodal) + error + test code + fixture + page object
  |     |     Returns: root cause classification + recommended fix
  |     |
  |     |-- [Fix Agent] (general-purpose sub-agent)
  |     |     Makes targeted edits based on diagnosis
  |     |     Returns: diff summary
  |     |
  |     |-- [Validation] Re-run the specific failing test
  |           Pass -> stage fix
  |           Fail -> re-diagnose (max 2 retries per test)
  |
  |-- Commit batch of related fixes
  |-- Repeat if failures remain (up to max-iterations)
  |-- [Flakiness Probe] Run full suite 3x even if green
  |-- Report final state to user
```

## Agent Roles

### 1. Coordinator (main session)

Owns the iteration loop, branch management, and commit strategy.
+ +Responsibilities: +- Create and manage the working branch +- Run Cypress tests inline via Bash +- Parse mochawesome JSON reports +- Dispatch Diagnosis and Fix agents +- Track cumulative pass/fail state across iterations +- Commit fixes in batches (threshold: **5 commits** before notifying user) +- Run flakiness probes (multiple runs even when green) +- Decide when to stop: all green + flakiness probe passed, max iterations, or needs human input + +### 2. Diagnosis Agent (Explore-type sub-agent) + +Input per failure: +- Error message and stack trace from mochawesome JSON +- Screenshot path (read with multimodal Read tool) +- Video path (reference for user, not directly parseable by agent) +- Test file path + relevant line numbers +- Associated fixture YAML +- Page object methods used + +Output — one of these classifications: + +| Classification | Description | Action | +|---------------|-------------|--------| +| `TEST_BUG` | Wrong selector, incorrect assertion, timing/race condition | Auto-fix | +| `FIXTURE_ISSUE` | Missing data, wrong structure, edge case not covered | Auto-fix | +| `PAGE_OBJECT_GAP` | Missing method, stale selector, outdated DOM reference | Auto-fix | +| `MOCK_ISSUE` | Intercept not matching, response shape wrong | Auto-fix | +| `REAL_REGRESSION` | Actual UI/code bug — not a test problem | **STOP and report to user** | +| `INFRA_ISSUE` | Cluster down, cert expired, operator not installed | **STOP and report to user** | + +The agent should **read the screenshot first** before looking at code — visual state often reveals the root cause faster than stack traces. + +### 3. 
Fix Agent (general-purpose sub-agent)

Input:
- Diagnosis classification and details
- Specific file references and what to change

Scope — may only edit:
- `cypress/e2e/incidents/**/*.cy.ts` (test files)
- `cypress/fixtures/incident-scenarios/*.yaml` (fixtures)
- `cypress/views/incidents-page.ts` (page object)
- `cypress/support/incidents_prometheus_query_mocks/**` (mock layer)

Must NOT edit:
- Source code (`src/`) — that's Phase 2
- Non-incident test files
- Cypress config or support infrastructure

### 4. Validation Agent

Re-runs the specific failing test after a fix is applied:
```bash
source cypress/export-env.sh && npx cypress run --env grep="{TEST_NAME}" --spec "{SPEC}"
```

Reports pass/fail. If still failing, feeds back to Diagnosis Agent (max 2 retries per test).

## Flakiness Detection

Even if the first run is all green, the coordinator runs a **flakiness probe**:

1. Run the full incident test suite 3 times consecutively
2. Track per-test results across runs
3. Flag any test that fails in any run as `FLAKY`
4. For flaky tests: attempt to diagnose the timing/race condition and fix
5. Report flakiness statistics: `test_name: 2/3 passed` etc.

This catches intermittent failures that a single run would miss.

## CI Result Ingestion

CI analysis is handled by the dedicated `/analyze-ci-results` skill (`.claude/commands/analyze-ci-results.md`).

The skill fetches artifacts from OpenShift CI (Prow) runs on GCS, classifies failures as infrastructure vs test/code issues, reads failure screenshots with multimodal vision, and correlates failures with the git commits that triggered them.
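The commit-correlation step can be sketched as follows. `changed_files` shells out to git; `correlate` uses a naive substring match on module names, which is an assumption about how failure stack traces reference source files.

```python
import subprocess

def changed_files(base_sha: str, pr_sha: str) -> list[str]:
    """Files changed between the base commit and the PR head."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base_sha}..{pr_sha}"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def correlate(failure_stack: str, files: list[str]) -> list[str]:
    """Changed files whose module name appears in the failure's stack trace."""
    hits = []
    for path in files:
        module = path.rsplit("/", 1)[-1].split(".")[0]  # e.g. "IncidentChart"
        if module and module in failure_stack:
            hits.append(path)
    return hits
```

Any hit is a signal to classify the failure as a possible `CODE_REGRESSION` rather than a test bug.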
+ +### Key Capabilities + +- **URL normalization**: Accepts gcsweb or Prow UI URLs at any level of the artifact tree +- **Structured metadata**: Extracts PR number, author, branch, commit SHAs from `started.json` / `finished.json` / `prowjob.json` +- **Build log parsing**: Parses Cypress console output from `build-log.txt` for per-spec pass/fail/skip counts and error details +- **Visual diagnosis**: Fetches and reads failure screenshots (multimodal) to understand UI state at failure time +- **Failure classification**: Categorizes each failure as `INFRA_*` (cluster, operator, plugin, auth, CI) or test/code (`TEST_BUG`, `FIXTURE_ISSUE`, `PAGE_OBJECT_GAP`, `MOCK_ISSUE`, `CODE_REGRESSION`) +- **Commit correlation**: Maps failures to specific file changes in the PR using `git diff {base}..{pr_head}` + +### Integration with Orchestrator + +The orchestrator uses `/analyze-ci-results` as an optional first step: + +1. If all failures are `INFRA_*` -> report to user and STOP (no test changes will help) +2. If mixed -> report infra issues, proceed with test/code fixes only +3. If all test/code -> proceed with full iteration loop +4. Commit correlation tells the orchestrator whether to fix tests or investigate source changes +5. 
CI screenshots give the Diagnosis Agent a head start before local reproduction + +### Usage + +``` +/analyze-ci-results ci-url=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/.../{RUN_ID}/ +/analyze-ci-results ci-url=https://prow.ci.openshift.org/view/gs/.../{RUN_ID} focus=regression +``` + +## Commit Strategy + +- **Branch naming**: `test/incident-robustness-YYYY-MM-DD` +- **Commit granularity**: Group related fixes (e.g., "fix 3 selector issues in filtering tests") +- **Review threshold**: Notify user after **5 commits** for review +- **Never force-push**; always additive commits +- User merges when ready, or continues iteration + +## Test Execution (Inline) + +Tests run inline via Bash, not in a separate terminal: + +```bash +cd web && source cypress/export-env.sh && \ + npx cypress run \ + --spec "cypress/e2e/incidents/regression/**/*.cy.ts" \ + --env grepTags="@incidents --@e2e-real --@flaky" \ + --reporter ./node_modules/cypress-multi-reporters \ + --reporter-options configFile=reporter-config.json +``` + +Results are collected from: +- **Exit code**: 0 = all passed, non-zero = failures +- **Mochawesome JSON**: `screenshots/cypress_report_*.json` — per-test details +- **Screenshots**: `cypress/screenshots/{spec}/` — failure screenshots +- **Videos**: `cypress/videos/{spec}.mp4` — kept on failure + +### Report Parsing + +Mochawesome JSON structure (per report file): +```json +{ + "stats": { "passes": N, "failures": N, "skipped": N }, + "results": [{ + "suites": [{ + "title": "Suite Name", + "tests": [{ + "title": "test description", + "fullTitle": "Suite -- test description", + "state": "failed|passed|skipped", + "err": { + "message": "error description", + "estack": "full stack trace" + } + }] + }] + }] +} +``` + +Use `npx mochawesome-merge screenshots/cypress_report_*.json > merged-report.json` to combine per-spec reports. 
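The parser step over the merged report can be sketched as follows, assuming the simplified shape shown above (`extract_failures` is an illustrative name; real mochawesome reports can nest suites, which the recursive walk handles).

```python
def extract_failures(report: dict) -> list[dict]:
    """Collect failed tests from a merged mochawesome report."""
    failures = []

    def walk(suite: dict) -> None:
        for test in suite.get("tests", []):
            if test.get("state") == "failed":
                err = test.get("err", {})
                failures.append({
                    "fullTitle": test.get("fullTitle", ""),
                    "message": err.get("message", ""),
                    "stack": err.get("estack", ""),
                })
        for child in suite.get("suites", []):  # suites can nest
            walk(child)

    for result in report.get("results", []):
        for suite in result.get("suites", []):
            walk(suite)
    return failures
```

Each returned entry is what the coordinator hands to a Diagnosis Agent, alongside the matching screenshot and video paths.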
+ +## Skills + +| Skill | Purpose | Invoked by | +|-------|---------|------------| +| `/iterate-incident-tests` | Main orchestrator — local iteration loop, dispatches agents, manages commits | User | +| `/iterate-ci-flaky` | CI-based iteration — push fixes, trigger Prow jobs, wait, analyze, repeat | User | +| `/diagnose-test-failure` | Classifies a single test failure using screenshots + code analysis | Orchestrator (as sub-agent prompt) | +| `/analyze-ci-results` | Fetches and analyzes OpenShift CI artifacts, classifies infra vs test/code | User or orchestrator | + +Skills are defined in `.claude/commands/` and can be invoked as slash commands. + +## Existing Infrastructure Leveraged + +| Asset | How the agent uses it | +|-------|----------------------| +| mochawesome JSON reporter | Structured test results parsing | +| `@cypress/grep` | Run individual tests by name or tag | +| `export-env.sh` | Source env vars for inline execution | +| YAML fixture system | Edit fixtures to fix data issues | +| Page object (`incidents-page.ts`) | Fix selectors and add missing methods | +| Mock layer (`incidents_prometheus_query_mocks/`) | Fix intercept patterns | +| `/generate-incident-fixture` skill | Generate new fixtures when needed | +| `/validate-incident-fixtures` skill | Validate fixture edits | + +## Phase 2: Frontend Refactoring (Future) + +### Concept + +Tests become the behavioral contract. The agent refactors frontend code while using the test suite as a safety net. 
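The safety-net idea reduces to a small decision rule, assuming the validator can tell whether observable behavior was preserved. The `classifyAfterRefactor` helper and its `behaviorPreserved` input below are purely hypothetical, sketched to make the contract concrete:

```typescript
// Hypothetical sketch of the Contract Validator's core decision after a refactor.
// "behaviorPreserved" stands in for a real check that the same user-visible
// behavior still holds against the refactored code.
type Verdict = 'PASS' | 'CODE_REGRESSION' | 'TEST_TOO_COUPLED';

function classifyAfterRefactor(testPassed: boolean, behaviorPreserved: boolean): Verdict {
  if (testPassed) return 'PASS';
  // Test failed but behavior is intact: the test asserted implementation
  // details and should be adapted, rather than the refactor reverted.
  return behaviorPreserved ? 'TEST_TOO_COUPLED' : 'CODE_REGRESSION';
}
```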
+ +### Additional Agent Roles + +| Agent | Role | +|-------|------| +| **Refactor Planner** | Analyzes frontend code, proposes refactoring steps | +| **Refactor Agent** | Makes code changes (replaces Fix Agent) | +| **Contract Validator** | Runs tests, classifies failures as regression vs test-coupling | +| **Test Adapter** | Updates tests that assert implementation details instead of behavior | + +### Key Principle + +If a test breaks due to refactoring but behavior is preserved, the test needs updating — it was too coupled to implementation. Phase 1 (robustness) reduces this coupling, making Phase 2 more effective. + +### Additional Classification + +The Diagnosis Agent gains `TEST_TOO_COUPLED` — the test asserts implementation details (specific DOM structure, internal state) rather than observable behavior. The Test Adapter agent rewrites it to be implementation-agnostic. diff --git a/web/cypress/e2e/incidents/regression/02.reg_ui_tooltip_boundary_times.cy.ts b/web/cypress/e2e/incidents/regression/02.reg_ui_tooltip_boundary_times.cy.ts index 8ad39e5ad..1bdce1008 100644 --- a/web/cypress/e2e/incidents/regression/02.reg_ui_tooltip_boundary_times.cy.ts +++ b/web/cypress/e2e/incidents/regression/02.reg_ui_tooltip_boundary_times.cy.ts @@ -103,7 +103,7 @@ describe('Regression: Mixed Severity Interval Boundary Times', { tags: ['@incide incidentsPage.setDays('1 day'); incidentsPage.elements.incidentsChartContainer().should('be.visible'); incidentsPage.elements.incidentsChartBarsGroups().should('have.length', 1); - cy.pause(); + cy.log('2.2 Consecutive interval boundaries: End of segment 1 should equal Start of segment 2'); incidentsPage.hoverOverIncidentBarSegment(0, 0); @@ -122,7 +122,7 @@ describe('Regression: Mixed Severity Interval Boundary Times', { tags: ['@incide ).to.equal(firstEnd); }); }); - cy.pause(); + cy.log('2.3 Incident tooltip Start vs alert tooltip Start vs alerts table Start'); incidentsPage.hoverOverIncidentBarSegment(0, 0); @@ -158,7 +158,7 @@ 
describe('Regression: Mixed Severity Interval Boundary Times', { tags: ['@incide }); }); }); - cy.pause(); + cy.log('Expected failure: Incident tooltip Start times are 5 minutes off (OU-1221)'); }); diff --git a/web/cypress/reports/test-stability.md b/web/cypress/reports/test-stability.md new file mode 100644 index 000000000..a3cd4f485 --- /dev/null +++ b/web/cypress/reports/test-stability.md @@ -0,0 +1,34 @@ +# Test Stability Ledger + +Tracks incident detection test stability across local and CI iteration runs. Updated automatically by `/iterate-incident-tests` and `/iterate-ci-flaky`. + +## How to Read + +- **Pass rate**: percentage across all recorded runs (local + CI combined) +- **Trend**: direction over last 3 runs +- **Last failure**: most recent failure reason and which run it occurred in +- **Fixed by**: commit that resolved the issue (if applicable) + +## Current Status + +| Test | Pass Rate | Trend | Runs | Last Failure | Fixed By | +|------|-----------|-------|------|-------------|----------| +| _No data yet — run `/iterate-incident-tests` or `/iterate-ci-flaky` to populate_ | | | | | | + +## Run History + +### Run Log + +| # | Date | Type | Branch | Tests | Passed | Failed | Flaky | Commit | +|---|------|------|--------|-------|--------|--------|-------|--------| +| _No runs recorded yet_ | | | | | | | | | + + diff --git a/web/package.json b/web/package.json index c66a43344..55264ebe6 100644 --- a/web/package.json +++ b/web/package.json @@ -38,8 +38,8 @@ "test-cypress-coo": "node --max-old-space-size=4096 ./node_modules/.bin/cypress run --browser chrome --headless --env grepTags='@coo --@flaky --@demo'", "test-cypress-coo-bvt": "node --max-old-space-size=4096 ./node_modules/.bin/cypress run --browser chrome --headless --env grepTags='@coo+@smoke --@demo'", "test-cypress-virtualization": "node --max-old-space-size=4096 ./node_modules/.bin/cypress run --browser chrome --headless --env grepTags='@virtualization --@flaky --@demo'", - 
"test-cypress-incidents": "node --max-old-space-size=4096 ./node_modules/.bin/cypress run --browser chrome --headless --env grepTags='@incidents --@flaky --@demo'", - "test-cypress-incidents-e2e": "node --max-old-space-size=4096 ./node_modules/.bin/cypress run --browser chrome --headless --env grepTags='@incidents+@e2e-real --@flaky --@demo'", + "test-cypress-incidents": "node --max-old-space-size=4096 ./node_modules/.bin/cypress run --browser chrome --headless --env grepTags='@incidents --@flaky --@demo --@xfail'", + "test-cypress-incidents-e2e": "node --max-old-space-size=4096 ./node_modules/.bin/cypress run --browser chrome --headless --env grepTags='@incidents+@e2e-real --@flaky --@demo --@xfail'", "test-cypress-smoke": "node --max-old-space-size=4096 ./node_modules/.bin/cypress run --browser chrome --headless --env grepTags='@smoke --@flaky --@demo'", "test-cypress-fast": "node --max-old-space-size=4096 ./node_modules/.bin/cypress run --browser chrome --headless --env grepTags='@smoke --@slow --@demo --@flaky'", "test-cypress-perses-dev": "node --max-old-space-size=4096 ./node_modules/.bin/cypress run --browser chrome --headless --env grepTags='@perses-dev --@demo'",