diff --git a/.github/workflows/sweep-coordinator-status-accuracy.yml b/.github/workflows/sweep-coordinator-status-accuracy.yml new file mode 100644 index 00000000000..4ff3912c954 --- /dev/null +++ b/.github/workflows/sweep-coordinator-status-accuracy.yml @@ -0,0 +1,133 @@ +name: "Sweeper: Coordinator and Health Status Accuracy" +on: + schedule: + - cron: "0 9 * * 5" + workflow_dispatch: + +permissions: + actions: read + contents: read + issues: write + pull-requests: read + +jobs: + run: + uses: elastic/ai-github-actions/.github/workflows/gh-aw-code-quality-audit.lock.yml@v0 + with: + title-prefix: "[coordinator-status-accuracy]" + severity-threshold: "high" + setup-commands: | + sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv + export GOTOOLCHAIN=auto + make mage + additional-instructions: | + You are a **coordinator and health status accuracy sweeper** for Elastic Agent. Your + goal is to find every code path where the agent reports an incorrect health state — + showing healthy when degraded, degraded when healthy, or getting stuck in a transient + state permanently — and write a failing test for each confirmed issue. + + ## The component + + The coordinator is the central orchestration hub. 
It lives under + `internal/pkg/agent/application/coordinator/`: + + - `coordinator.go` — Main coordinator: receives policy/config, manages components, + integrates Fleet, upgrade, and OTel translation + - `coordinator_state.go` — Aggregates overall state from: coordinator state, Fleet + gateway state, component runtime states, OTel aggregate status, and upgrade details + - `handler.go` — Request/action handling entry points + - `config_operations.go` / `config_patcher.go` — Config transform and patching + + Component runtime management lives under `pkg/component/runtime/`: + + - `manager.go` — Runtime manager: starts/stops component processes, gRPC communication + with elastic-agent-client, streams state updates + - `state.go` — Component/unit state structures + - `subscription.go` — Subscriptions to state change events + + Status is exposed through: + - `pkg/control/v2/server/server.go` — gRPC control server (used by `elastic-agent status`) + - `internal/pkg/agent/application/monitoring/liveness.go` — HTTP liveness endpoint + - `internal/pkg/agent/application/monitoring/readiness.go` — Readiness handling + - Fleet check-in (status is sent as part of each check-in payload) + + ## The bug class + + Health status misreporting is the second-most common bug category (~40+ issues). + The recurring failure modes are: + + **Transient unhealthy states that persist**: The agent goes unhealthy briefly during + normal operations (adding/removing integrations, changing log level, restarting + components) and never recovers back to healthy. Historical bugs include: agent staying + degraded after a log level change to "warning" (the component restarts but status is + not updated), agent staying unhealthy after removing and re-adding Elastic Defend + (status from the old component instance is not cleared), and agent staying in + "Updating" state indefinitely after a policy change. 
+ + **Status from removed components not being cleaned up**: When a component is removed + from the agent's policy, its last-known status should be cleared from the aggregate + state. Bugs have occurred where the removed component's "unhealthy" status persists + in the aggregate, keeping the overall agent in a degraded state permanently. + + **Out-of-order status updates**: Component state updates can arrive asynchronously + and out of order. If the coordinator processes an older "unhealthy" update after a + newer "healthy" update, the agent's reported state will be incorrect. The coordinator + needs to handle timestamp-based ordering or sequence numbers. + + **Monitoring endpoint status divergence**: The gRPC control server (used by + `elastic-agent status`), the Fleet check-in payload, and the HTTP liveness endpoint + can all report different states for the same agent at the same time. This happens + when one reporting path caches state while another reads it live, or when the + liveness endpoint only considers component state but not coordinator/fleet state. + + **Agent healthy but components silently failing**: The agent reports healthy overall + while individual components or units are in an error state. This occurs when the + aggregation logic treats missing status as healthy, or when error states from service + runtime components (Elastic Defend, Fleet Server) are not propagated to the aggregate. + + ## How to investigate + + Start with `coordinator_state.go`. Read how the aggregate state is computed from the + individual sources (coordinator, fleet, components, OTel). Ask: if one source reports + unhealthy and then is removed, does the aggregate state recompute? Is there a race + between receiving a component removal and receiving a late status update from that + component? + + Read `pkg/component/runtime/manager.go` and `subscription.go`. Ask: when a component + is stopped, is its subscription cleaned up? 
Can a late status update from a stopped + component be delivered to the coordinator after the component is removed from the + expected set? + + Read `monitoring/liveness.go`. Ask: does it read from the same state source as the + gRPC server and Fleet check-in? If not, can they diverge? + + For the common "agent goes unhealthy on log level change" pattern: trace how a log + level change in Fleet policy flows through the coordinator to the component runtime. + Does changing the log level cause a component restart? If so, is the transient + "starting" state handled without marking the agent as degraded? + + ## For each risk you confirm + + Write a Go test. Create mock components that send status updates in specific sequences + (healthy → unhealthy → removed, or out-of-order timestamps). Assert the coordinator's + aggregate state is correct at each point. Run: + `go test ./internal/pkg/agent/application/coordinator/... ./pkg/component/runtime/...` + + ## The bar for filing + + Only report findings that a real Fleet-managed or standalone agent could encounter + through normal operations: changing policy, adding/removing integrations, changing + log levels, restarting components, network interruptions to Fleet. The status + inaccuracy must be observable by a user (through the `elastic-agent status` command, + Fleet UI, or Kubernetes liveness probe). Do not file findings about internal state + that is never exposed to users. 
+ + ## Output + + File a single issue containing: + - Confirmed issues with test code, the exact state transition sequence, and fix direction + - A priority ranking: permanent incorrect states first, then transient inaccuracies, + then cosmetic issues + - State transitions and aggregation paths you audited and found correct + secrets: + COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }} diff --git a/.github/workflows/sweep-fleet-enrollment-resilience.yml b/.github/workflows/sweep-fleet-enrollment-resilience.yml new file mode 100644 index 00000000000..f509c7b8d60 --- /dev/null +++ b/.github/workflows/sweep-fleet-enrollment-resilience.yml @@ -0,0 +1,126 @@ +name: "Sweeper: Fleet Enrollment and Communication Resilience" +on: + schedule: + - cron: "0 9 * * 4" + workflow_dispatch: + +permissions: + actions: read + contents: read + issues: write + pull-requests: read + +jobs: + run: + uses: elastic/ai-github-actions/.github/workflows/gh-aw-code-quality-audit.lock.yml@v0 + with: + title-prefix: "[fleet-enrollment-resilience]" + severity-threshold: "high" + setup-commands: | + sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv + export GOTOOLCHAIN=auto + make mage + additional-instructions: | + You are a **Fleet enrollment and communication resilience sweeper** for Elastic Agent. + Your goal is to find every code path where enrollment can fail and leave the agent in + an unrecoverable state, where check-in communication failures cause incorrect behavior, + or where re-enrollment logic triggers unnecessarily — and write a failing test for each + confirmed issue. 
+ + ## The component + + Fleet communication spans several packages: + + - `internal/pkg/fleetapi/` — HTTP API layer: + - `enroll_cmd.go` — Enrollment request/response + - `checkin_cmd.go` — Agent check-in payload + - `ack_cmd.go` — Action acknowledgment + - `client/client.go` — HTTP sender, API version headers, status path + - `client/round_trippers.go` — Transport middleware + - `acker/` — Ack implementations (fleet, lazy, noop, retrier) + - `internal/pkg/agent/application/enroll/enroll.go` — High-level enrollment: + token validation, backoff, config persistence + - `internal/pkg/agent/application/gateway/fleet/fleet_gateway.go` — Long-running + gateway: periodic check-in, action dispatch, backoff on auth errors, integrates + coordinator/runtime/OTel status into Fleet state + - `internal/pkg/agent/application/managed_mode.go` — Wires managed mode together + - `internal/pkg/agent/application/fleet_server_bootstrap.go` — Fleet Server bootstrap + when the agent itself runs Fleet Server + - `internal/pkg/agent/application/monitoring/liveness.go` — HTTP liveness endpoint + for orchestrators (Kubernetes, systemd) + - `internal/pkg/agent/application/monitoring/server.go` — Monitoring HTTP server + + ## The bug class + + Fleet communication bugs recur in four categories: + + **Enrollment failures that leave stale state**: When enrollment fails partway through + (network error after the server accepts but before the agent persists credentials), + the agent can end up with partial state: an API key that the server considers valid + but that the agent cannot use, or a fleet.enc file that references a Fleet URL that + no longer matches. 
Historical bugs include: `shouldFleetEnroll` re-triggering enrollment + because the stored host URL differs in scheme (http vs https) from the Fleet URL, + enrollment failing silently when the policy is not on the first page of API results, + and delayed enrollment (`--delay-enroll`) failing to start the service on reboot + across multiple platforms. + + **Check-in timing and backoff issues**: The agent checks in with Fleet periodically. + Bugs have included: check-in calculations using wall clock time instead of monotonic + time (causing storms after NTP sync or manual time changes), retry backoff being too + aggressive (thousands of retries per second), and the agent not recovering from + transient auth failures (staying offline permanently after a brief Fleet outage). + + **Liveness endpoint inaccuracy**: The `/liveness` endpoint is used by Kubernetes and + other orchestrators to determine if the agent is healthy. Bugs have included: the + endpoint not being available during enrollment, not considering overall agent state + (only component state), and `?failon=degraded` not returning HTTP 500 when the + agent is actually degraded. + + **Re-enrollment loops**: Several bugs have caused agents to repeatedly re-enroll + when they should not: URL normalization differences, stale enrollment tokens in + Kubernetes deployments, and migration between Fleet clusters leaving conflicting state. + + ## How to investigate + + Start with the enrollment flow in `enroll/enroll.go`. Trace from the CLI command or + delayed-enroll trigger through credential exchange, config persistence, and first + check-in. At each step ask: what happens if this step fails? Is the partial state + cleaned up? Can the agent retry from a clean state? + + Then read the Fleet gateway (`gateway/fleet/fleet_gateway.go`). Follow the check-in + loop. Ask: how is the check-in interval computed? Is it monotonic-time safe? What + happens when the server returns 401, 429, or 500? 
Does the backoff reset correctly + after a successful check-in? What happens to queued actions if check-in fails for + an extended period? + + For liveness, read `monitoring/liveness.go`. Ask: what state does it inspect? Does + it reflect enrollment state, or only post-enrollment component state? Is there a + race between the enrollment completing and the liveness handler being registered? + + For re-enrollment, read the `shouldFleetEnroll` logic and the config comparison code. + Ask: under what conditions does the agent decide it needs to re-enroll? Are URL + comparisons normalized? What about trailing slashes, port numbers, scheme differences? + + ## For each risk you confirm + + Write a Go test. For enrollment, mock the Fleet API and simulate failure at each step. + For check-in, test with simulated clock skew and error responses. For liveness, test + the HTTP endpoint at different agent lifecycle stages. Run: + `go test ./internal/pkg/fleetapi/... ./internal/pkg/agent/application/gateway/... ./internal/pkg/agent/application/enroll/... ./internal/pkg/agent/application/monitoring/...` + + ## The bar for filing + + Only report findings that a real Fleet-managed agent could encounter: network + interruptions, Fleet server restarts, Kubernetes pod restarts, clock drift, migration + between clusters. Do not file findings that require manipulating internal state files + in ways that cannot happen through normal agent operations. 
+ + ## Output + + File a single issue containing: + - Confirmed issues with test code, the exact failure scenario, and fix direction + - A priority ranking: unrecoverable states first, then data plane impact, then + cosmetic status issues + - Communication paths you audited and found resilient + secrets: + COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }} diff --git a/.github/workflows/sweep-otel-collector-integration.yml b/.github/workflows/sweep-otel-collector-integration.yml new file mode 100644 index 00000000000..5be76688395 --- /dev/null +++ b/.github/workflows/sweep-otel-collector-integration.yml @@ -0,0 +1,123 @@ +name: "Sweeper: OTel Collector and Beats Receiver Integration" +on: + schedule: + - cron: "0 9 * * 3" + workflow_dispatch: + +permissions: + actions: read + contents: read + issues: write + pull-requests: read + +jobs: + run: + uses: elastic/ai-github-actions/.github/workflows/gh-aw-code-quality-audit.lock.yml@v0 + with: + title-prefix: "[otel-collector-integration]" + severity-threshold: "high" + setup-commands: | + sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv + export GOTOOLCHAIN=auto + make mage + additional-instructions: | + You are an **OTel collector integration sweeper** for Elastic Agent. Your goal is to find + every code path where the OTel collector subprocess can crash, lose data, misreport + status, or receive incorrect configuration from the agent's translation layer — and + write a failing test for each confirmed issue. + + ## The component + + Elastic Agent manages an OpenTelemetry Collector as a supervised subprocess. 
The + integration code lives under `internal/pkg/otel/`: + + - `manager/manager.go` — OTel manager: receives config updates from the coordinator, + starts/restarts the collector subprocess + - `manager/execution.go` / `execution_subprocess.go` — Subprocess lifecycle, flags, + health and metrics port assignment + - `manager/healthcheck.go` — Probes the collector's health endpoint + - `manager/recovery_backoff.go` — Restart backoff when the collector crashes + - `translate/otelconfig.go` — Maps Elastic Agent component model → OTel Collector + `confmap` configuration (receivers, exporters, processors, extensions, pipelines) + - `translate/output_elasticsearch.go` — Maps Elasticsearch output settings (TLS, + auth, retry, bulk) into OTel exporter config + - `translate/status.go` — Maps OTel pipeline/receiver status back to agent runtime + state (component names, beats receiver semantics) + - `status/serializable.go` — Serializable aggregate status for agent reporting + - `receivers/elasticmonitoring/` — Custom OTel receiver for internal monitoring telemetry + - `extension/elasticdiagnostics/` — OTel extension for diagnostics + + ## The bug class + + This is the fastest-growing bug category in the repository (~40+ issues and climbing). + The recurring failure modes are: + + **Config translation errors**: The translation from Elastic Agent's component model to + OTel Collector config has many edge cases. Historical bugs include: TLS settings not + mapped correctly (proxy_url type mismatch, missing CA paths), retry behavior not matching + standalone beats (retry intervals, max_retries <= 0 handling), `service.telemetry` config + not persisted or propagated to the collector subprocess, and output-specific settings + silently dropped during translation. Each of these caused data loss or collector crashes + in production. 
+ + **Status reporting mismatch**: The agent reports component health based on the collector's + status, but the mapping between OTel pipeline status and agent component status has + produced bugs: extension status always appearing in output, beatsreceiver status persisting + after unenroll, switching between process and otel runtime removing components from + status output, and timestamps not being considered when comparing statuses (causing stale + status to override current status). + + **Collector subprocess lifecycle**: The collector runs as a subprocess. Bugs have included: + the collector not shutting down gracefully on Unix signals, defunct processes left behind + when the agent re-execs itself, health check connections not being reused (opening a new + connection per request), and the collector not being reloaded (vs restarted) when only + config changes. + + **Data loss on error responses**: The Elasticsearch exporter in the OTel collector has + had bugs where HTTP 413 errors caused data loss (entire batch dropped instead of split + and retried), non-retriable errors were not retried, and the retry interval did not + back off on repeated failures. These are subtle because the collector reports healthy + while silently losing data. + + ## How to investigate + + Start with the translation layer (`translate/otelconfig.go`). Read how each Elastic + Agent config field maps to the OTel config. For each field, ask: what happens when + this field is missing? What happens when it has an unexpected type (string vs URL vs + map)? What happens when the output type changes? Are there fields that the agent + supports but the translation silently ignores? + + Then read the status mapping (`translate/status.go`). Follow how an OTel pipeline + status event flows back to the coordinator. Ask: if two status updates arrive out of + order, does the agent show the correct current state? If a component is removed from + the config, does its status get cleaned up? 
If the collector reports healthy but an + exporter is dropping data, does the agent know? + + For the subprocess lifecycle, read `manager/execution_subprocess.go`. Ask: what happens + if the collector crashes immediately after starting? What if it crashes during a config + reload? What if the agent is shut down while the collector is starting up? Are there + race conditions between the health check and the subprocess exit? + + ## For each risk you confirm + + Write a Go test. For translation issues, create an Elastic Agent config and assert the + OTel config output matches expectations. For status issues, simulate status events and + assert the agent's aggregate state is correct. Run: + `go test ./internal/pkg/otel/...` + + ## The bar for filing + + Only report findings that affect a real Elastic Agent deployment running with OTel + components (managed or hybrid mode). The issue must be triggerable through normal + operations: changing agent policy, restarting the agent, having the Elasticsearch + cluster return errors, or having network interruptions. Do not file issues that require + manipulating the OTel collector's internal state directly. 
+ + ## Output + + File a single issue containing: + - Confirmed issues with test code, the exact config or status scenario, and fix direction + - A priority ranking: data-loss issues first, then status misreporting, then lifecycle + - Translation fields you audited and found correctly mapped + secrets: + COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }} diff --git a/.github/workflows/sweep-upgrade-lifecycle.yml b/.github/workflows/sweep-upgrade-lifecycle.yml new file mode 100644 index 00000000000..fbc694c18da --- /dev/null +++ b/.github/workflows/sweep-upgrade-lifecycle.yml @@ -0,0 +1,126 @@ +name: "Sweeper: Upgrade and Rollback Lifecycle" +on: + schedule: + - cron: "0 9 * * 2" + workflow_dispatch: + +permissions: + actions: read + contents: read + issues: write + pull-requests: read + +jobs: + run: + uses: elastic/ai-github-actions/.github/workflows/gh-aw-code-quality-audit.lock.yml@v0 + with: + title-prefix: "[upgrade-lifecycle]" + severity-threshold: "high" + setup-commands: | + sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv + export GOTOOLCHAIN=auto + make mage + additional-instructions: | + You are an **upgrade and rollback lifecycle sweeper** for Elastic Agent. Your goal is to + find every code path where an upgrade can leave the agent in a broken state, a rollback + can fail silently, or platform-specific behavior causes the upgrade to diverge from the + expected sequence — and write a failing test for each confirmed issue. + + ## The component + + The upgrade system lives under `internal/pkg/agent/application/upgrade/`. 
Key files: + + - `upgrade.go` — Core upgrade orchestration: download, unpack, marker handling, re-exec + - `rollback.go` (+ platform variants `rollback_linux.go`, `rollback_darwin.go`, + `rollback_windows.go`) — Platform-specific rollback execution + - `watcher.go` (+ `watcher_windows.go`, `watcher_other.go`) — Post-upgrade watcher + process that monitors health and triggers rollback + - `step_mark.go` — Writes/reads the `.update-marker` file + - `marker_watcher.go` — Watches marker state on disk + - `step_download.go`, `step_unpack.go`, `step_relink.go` — Discrete upgrade pipeline steps + - `cleanup.go` — Post-upgrade/rollback cleanup + - `details/` — Structured upgrade progress reporting to Fleet + - `ttl/` — TTL markers for rollback window expiry + - `artifact/download/` — Artifact resolution, HTTP/FS downloaders, verifiers + + Related code: + - `internal/pkg/agent/cmd/watch.go` / `watch_impl.go` — Watcher CLI entry point + - `internal/pkg/agent/application/reexec/manager.go` — In-process re-exec after upgrade + - `internal/pkg/agent/application/actions/handlers/handler_action_upgrade.go` — Fleet + action handler for managed upgrades + + ## The bug class + + Upgrade/rollback is the #1 source of bugs in this repository (~80+ historical issues). + The recurring failure modes are: + + **Marker corruption or stale markers**: The `.update-marker` file is the source of truth + for whether an upgrade is in progress, succeeded, or needs rollback. If the marker is not + written atomically, is left behind after a completed upgrade, or is unreadable after a + crash, the agent can enter an undefined state on next startup. On Windows, file locking + and permission inheritance add additional failure modes. + + **Watcher/agent PID race**: The upgrade watcher monitors the new agent process. If the + watcher reads the wrong PID (stale PID file, systemd reporting the old process), it may + incorrectly conclude the upgrade failed and trigger a rollback of a healthy upgrade. 
This + has happened specifically with systemd and Windows service manager. + + **Platform-specific rollback failures**: Rollback involves restoring the previous binary, + symlinks, and service configuration. On DEB/RPM, the package manager's symlinks may + conflict with the agent's own symlink management. On Windows, the watcher may fail to + kill processes that hold file locks. On macOS, the service plist may not be restored + correctly. + + **Upgrade retry storms**: When an artifact download fails, the retry logic should use + exponential backoff. Historical bugs show the agent retrying thousands of times per + second when the backoff duration computes to zero or negative (e.g. after clock skew + or NTP adjustment). + + **Version migration gaps**: Upgrading across major versions (7.17 → 8.x, 8.x → 9.x) + requires migrating persisted state (fleet.enc, action store, run directories). Missing + migration steps cause the upgraded agent to fail on startup and roll back. + + ## How to investigate + + Trace the upgrade flow end-to-end: from the Fleet action or CLI command, through + download, verification, unpack, marker write, re-exec, watcher startup, health check, + and marker cleanup. At each step ask: + + 1. What happens if the process is killed (SIGKILL, power loss) right now? + 2. What happens if the filesystem operation fails (disk full, permission denied)? + 3. What happens on each platform (Linux systemd, Windows service, macOS launchd)? + 4. What state is left on disk, and can the next startup recover from it? + + For the watcher, ask: how does it determine the new agent is healthy? What PID does it + monitor, and how does it get that PID? What happens if the new agent starts, reports + healthy, then crashes 30 seconds later — does the watcher catch that? + + For rollback, ask: after rollback completes, is the agent in exactly the same state as + before the upgrade started? Are there any files, symlinks, or service registrations + that are not restored? 
+ + ## For each risk you confirm + + Write a Go test in the appropriate `*_test.go` file. For marker issues, create a marker + in a specific state and verify the upgrade/rollback logic handles it correctly. For + watcher issues, simulate the health check sequence. Run: + `go test ./internal/pkg/agent/application/upgrade/...` + + ## The bar for filing + + Only report findings that a real user could encounter during a normal upgrade operation: + upgrading via Fleet, upgrading via CLI, upgrading across versions, upgrading on each + platform. The failure must be triggerable through user-observable events (power loss + during upgrade, disk full, service restart, clock skew). Do not file findings that + require manually corrupting internal files in ways that cannot happen during normal + operation. + + ## Output + + File a single issue containing: + - Confirmed issues with test code, the exact failure scenario, and the fix direction + - A priority ranking: which issues affect the most common upgrade paths vs edge cases + - Platform-specific findings clearly labeled (Windows, macOS, Linux) + - A summary of upgrade paths you audited and found safe + secrets: + COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }} diff --git a/.github/workflows/trigger-breaking-change-detect.yml b/.github/workflows/trigger-breaking-change-detect.yml new file mode 100644 index 00000000000..24fbafdffc7 --- /dev/null +++ b/.github/workflows/trigger-breaking-change-detect.yml @@ -0,0 +1,22 @@ +name: Breaking Change Detect +on: + schedule: + - cron: "0 13 * * 1" + workflow_dispatch: + +permissions: + actions: read + contents: read + issues: write + pull-requests: read + +jobs: + run: + uses: elastic/ai-github-actions/.github/workflows/gh-aw-breaking-change-detect.lock.yml@v0 + with: + setup-commands: | + sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv + export GOTOOLCHAIN=auto + make mage + secrets: + COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN 
}} diff --git a/.github/workflows/trigger-docs-new-contributor-review.yml b/.github/workflows/trigger-docs-new-contributor-review.yml new file mode 100644 index 00000000000..0a9d6bbce9e --- /dev/null +++ b/.github/workflows/trigger-docs-new-contributor-review.yml @@ -0,0 +1,22 @@ +name: Docs New Contributor Review +on: + schedule: + - cron: "0 14 * * 2" + workflow_dispatch: + +permissions: + actions: read + contents: read + issues: write + pull-requests: read + +jobs: + run: + uses: elastic/ai-github-actions/.github/workflows/gh-aw-newbie-contributor-patrol.lock.yml@v0 + with: + setup-commands: | + sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv + export GOTOOLCHAIN=auto + make mage + secrets: + COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }} diff --git a/.github/workflows/trigger-docs-patrol.yml b/.github/workflows/trigger-docs-patrol.yml new file mode 100644 index 00000000000..ba873c04f72 --- /dev/null +++ b/.github/workflows/trigger-docs-patrol.yml @@ -0,0 +1,22 @@ +name: Docs Patrol +on: + schedule: + - cron: "0 14 * * 1-5" + workflow_dispatch: + +permissions: + actions: read + contents: write + issues: write + pull-requests: write + +jobs: + run: + uses: elastic/ai-github-actions/.github/workflows/gh-aw-docs-patrol.lock.yml@v0 + with: + setup-commands: | + sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv + export GOTOOLCHAIN=auto + make mage + secrets: + COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }} diff --git a/.github/workflows/trigger-duplicate-issue-detector.yml b/.github/workflows/trigger-duplicate-issue-detector.yml new file mode 100644 index 00000000000..312c43c6039 --- /dev/null +++ b/.github/workflows/trigger-duplicate-issue-detector.yml @@ -0,0 +1,21 @@ +name: Duplicate Issue Detector +on: + issues: + types: [opened] + +permissions: + actions: read + contents: read + discussions: write + issues: write + pull-requests: read + +jobs: + run: + if: >- + 
contains(fromJSON('["OWNER","MEMBER","COLLABORATOR"]'), github.event.issue.author_association)
+    uses: elastic/ai-github-actions/.github/workflows/gh-aw-duplicate-issue-detector.lock.yml@v0
+    with:
+      detect-related-issues: false
+    secrets:
+      COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
diff --git a/.github/workflows/trigger-issue-triage.yml b/.github/workflows/trigger-issue-triage.yml
new file mode 100644
index 00000000000..cb658c59091
--- /dev/null
+++ b/.github/workflows/trigger-issue-triage.yml
@@ -0,0 +1,22 @@
+name: Issue Triage
+on:
+  issues:
+    types: [opened]
+
+permissions:
+  actions: read
+  contents: read
+  discussions: write
+  issues: write
+  pull-requests: write
+
+jobs:
+  run:
+    uses: elastic/ai-github-actions/.github/workflows/gh-aw-issue-triage.lock.yml@v0
+    with:
+      setup-commands: |
+        sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv
+        export GOTOOLCHAIN=auto
+        make mage
+    secrets:
+      COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
diff --git a/.github/workflows/trigger-mention-in-issue.yml b/.github/workflows/trigger-mention-in-issue.yml
new file mode 100644
index 00000000000..b179e1cb71f
--- /dev/null
+++ b/.github/workflows/trigger-mention-in-issue.yml
@@ -0,0 +1,27 @@
+name: Mention in Issue
+on:
+  issue_comment:
+    types: [created]
+
+permissions:
+  actions: read
+  contents: write
+  discussions: write
+  issues: write
+  pull-requests: write
+
+jobs:
+  run:
+    if: >-
+      github.event.issue.pull_request == null &&
+      (startsWith(github.event.comment.body, '/ai') || contains(github.event.comment.body, '@claude'))
+    uses: elastic/ai-github-actions/.github/workflows/gh-aw-mention-in-issue.lock.yml@v0
+    with:
+      setup-commands: |
+        sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv
+        export GOTOOLCHAIN=auto
+        make mage
+      additional-instructions: |
+        If you do make a pull request, please add the label "ai" to it.
+    secrets:
+      COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
diff --git a/.github/workflows/trigger-mention-in-pr.yml b/.github/workflows/trigger-mention-in-pr.yml
new file mode 100644
index 00000000000..acc4c959659
--- /dev/null
+++ b/.github/workflows/trigger-mention-in-pr.yml
@@ -0,0 +1,27 @@
+name: Mention in PR
+on:
+  issue_comment:
+    types: [created]
+  pull_request_review_comment:
+    types: [created]
+
+permissions:
+  actions: read
+  contents: write
+  discussions: write
+  issues: write
+  pull-requests: write
+
+jobs:
+  run:
+    if: >-
+      (startsWith(github.event.comment.body, '/ai') || contains(github.event.comment.body, '@claude')) &&
+      (github.event.issue.pull_request != null || github.event_name == 'pull_request_review_comment')
+    uses: elastic/ai-github-actions/.github/workflows/gh-aw-mention-in-pr.lock.yml@v0
+    with:
+      setup-commands: |
+        sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv
+        export GOTOOLCHAIN=auto
+        make mage
+    secrets:
+      COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
diff --git a/.github/workflows/trigger-pr-actions-detective.yml b/.github/workflows/trigger-pr-actions-detective.yml
new file mode 100644
index 00000000000..1ffa09af19f
--- /dev/null
+++ b/.github/workflows/trigger-pr-actions-detective.yml
@@ -0,0 +1,25 @@
+name: PR Actions Detective
+on:
+  workflow_run:
+    workflows: ["pre-commit", "golangci-lint", "Validate docs generation structure", "docs-build", "fragment-in-pr"]
+    types: [completed]
+
+permissions:
+  actions: read
+  contents: read
+  issues: write
+  pull-requests: write
+
+jobs:
+  run:
+    if: >-
+      github.event.workflow_run.conclusion == 'failure' &&
+      toJSON(github.event.workflow_run.pull_requests) != '[]'
+    uses: elastic/ai-github-actions/.github/workflows/gh-aw-pr-actions-detective.lock.yml@v0
+    with:
+      setup-commands: |
+        sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv
+        export GOTOOLCHAIN=auto
+        make mage
+    secrets:
+      COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
diff --git a/.github/workflows/trigger-pr-review.yml b/.github/workflows/trigger-pr-review.yml
new file mode 100644
index 00000000000..834a2d54ccc
--- /dev/null
+++ b/.github/workflows/trigger-pr-review.yml
@@ -0,0 +1,25 @@
+name: PR Review
+on:
+  pull_request:
+    types: [opened, synchronize, reopened, ready_for_review, labeled, unlabeled]
+
+permissions:
+  actions: read
+  contents: read
+  issues: write
+  pull-requests: write
+
+jobs:
+  run:
+    if: >-
+      github.event.pull_request.draft == false &&
+      !contains(github.event.pull_request.labels.*.name, 'skip-auto-pr-review') &&
+      contains(fromJSON('["OWNER","MEMBER","COLLABORATOR"]'), github.event.pull_request.author_association)
+    uses: elastic/ai-github-actions/.github/workflows/gh-aw-pr-review.lock.yml@v0
+    with:
+      setup-commands: |
+        sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv
+        export GOTOOLCHAIN=auto
+        make mage
+    secrets:
+      COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
diff --git a/.github/workflows/trigger-stale-issues.yml b/.github/workflows/trigger-stale-issues.yml
new file mode 100644
index 00000000000..517f4e84dac
--- /dev/null
+++ b/.github/workflows/trigger-stale-issues.yml
@@ -0,0 +1,25 @@
+name: Stale Issues
+on:
+  schedule:
+    - cron: "0 15 * * 1"
+  workflow_dispatch:
+
+permissions:
+  actions: read
+  contents: read
+  issues: write
+  pull-requests: read
+
+jobs:
+  run:
+    uses: elastic/ai-github-actions/.github/workflows/gh-aw-stale-issues.lock.yml@v0
+    with:
+      setup-commands: |
+        sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv
+        export GOTOOLCHAIN=auto
+        make mage
+      additional-instructions: |
+        Add a note to the issue indicating that you do not have permission to actually
+        close the issues and that a person with permission will need to do so.
+    secrets:
+      COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
diff --git a/.github/workflows/trigger-text-auditor.yml b/.github/workflows/trigger-text-auditor.yml
new file mode 100644
index 00000000000..ab55747ade2
--- /dev/null
+++ b/.github/workflows/trigger-text-auditor.yml
@@ -0,0 +1,22 @@
+name: Text Auditor
+on:
+  schedule:
+    - cron: "0 13 * * 3"
+  workflow_dispatch:
+
+permissions:
+  actions: read
+  contents: read
+  issues: write
+  pull-requests: read
+
+jobs:
+  run:
+    uses: elastic/ai-github-actions/.github/workflows/gh-aw-text-auditor.lock.yml@v0
+    with:
+      setup-commands: |
+        sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv
+        export GOTOOLCHAIN=auto
+        make mage
+    secrets:
+      COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}