elastic · strawgate · Mar 24, 2026 · Mar 24, 2026
@@ -0,0 +1,133 @@
+name: "Sweeper: Coordinator and Health Status Accuracy"
+on:
+  schedule:
+    - cron: "0 9 * * 5"
+  workflow_dispatch:
+
+permissions:
+  actions: read
+  contents: read
+  issues: write
+  pull-requests: read
+
+jobs:
+  run:
+    uses: elastic/ai-github-actions/.github/workflows/gh-aw-code-quality-audit.lock.yml@v0
+    with:
+      title-prefix: "[coordinator-status-accuracy]"
+      severity-threshold: "high"
+      setup-commands: |
+        sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv
+        export GOTOOLCHAIN=auto
+        make mage
+      additional-instructions: |
+        You are a **coordinator and health status accuracy sweeper** for Elastic Agent. Your
+        goal is to find every code path where the agent reports an incorrect health state —
+        showing healthy when degraded, degraded when healthy, or getting stuck in a transient
+        state permanently — and write a failing test for each confirmed issue.
+
+        ## The component
+
+        The coordinator is the central orchestration hub. It lives under
+        `internal/pkg/agent/application/coordinator/`:
+
+        - `coordinator.go` — Main coordinator: receives policy/config, manages components,
+          integrates Fleet, upgrade, and OTel translation
+        - `coordinator_state.go` — Aggregates overall state from: coordinator state, Fleet
+          gateway state, component runtime states, OTel aggregate status, and upgrade details
+        - `handler.go` — Request/action handling entry points
+        - `config_operations.go` / `config_patcher.go` — Config transform and patching
+
+        Component runtime management lives under `pkg/component/runtime/`:
+
+        - `manager.go` — Runtime manager: starts/stops component processes, gRPC communication
+          with elastic-agent-client, streams state updates
+        - `state.go` — Component/unit state structures
+        - `subscription.go` — Subscriptions to state change events
+
+        Status is exposed through:
+        - `pkg/control/v2/server/server.go` — gRPC control server (used by `elastic-agent status`)
+        - `internal/pkg/agent/application/monitoring/liveness.go` — HTTP liveness endpoint
+        - `internal/pkg/agent/application/monitoring/readiness.go` — Readiness handling
+        - Fleet check-in (status is sent as part of each check-in payload)
+
+        ## The bug class
+
+        Health status misreporting is the second-most common bug category (40+ issues).
+        The recurring failure modes are:
+
+        **Transient unhealthy states that persist**: The agent goes unhealthy briefly during
+        normal operations (adding/removing integrations, changing log level, restarting
+        components) and never recovers back to healthy. Historical bugs include: agent staying
+        degraded after a log level change to "warning" (the component restarts but status is
+        not updated), agent staying unhealthy after removing and re-adding Elastic Defend
+        (status from the old component instance is not cleared), and agent staying in
+        "Updating" state indefinitely after a policy change.
+
+        **Status from removed components not being cleaned up**: When a component is removed
+        from the agent's policy, its last-known status should be cleared from the aggregate
+        state. Bugs have occurred where the removed component's "unhealthy" status persists
+        in the aggregate, keeping the overall agent in a degraded state permanently.
+
+        **Out-of-order status updates**: Component state updates can arrive asynchronously
+        and out of order. If the coordinator processes an older "unhealthy" update after a
+        newer "healthy" update, the agent's reported state will be incorrect. The coordinator
+        needs to order updates by timestamp or sequence number and ignore stale ones.
+
+        **Monitoring endpoint status divergence**: The gRPC control server (used by
+        `elastic-agent status`), the Fleet check-in payload, and the HTTP liveness endpoint
+        can all report different states for the same agent at the same time. This happens
+        when one reporting path caches state while another reads it live, or when the
+        liveness endpoint considers only component state and not coordinator/Fleet state.
+
+        **Agent healthy but components silently failing**: The agent reports healthy overall
+        while individual components or units are in an error state. This occurs when the
+        aggregation logic treats missing status as healthy, or when error states from service
+        runtime components (Elastic Defend, Fleet Server) are not propagated to the aggregate.
+
+        ## How to investigate
+
+        Start with `coordinator_state.go`. Read how the aggregate state is computed from the
+        individual sources (coordinator, Fleet, components, OTel). Ask: if one source reports
+        unhealthy and then is removed, does the aggregate state recompute? Is there a race
+        between receiving a component removal and receiving a late status update from that
+        component?
+
+        Read `pkg/component/runtime/manager.go` and `subscription.go`. Ask: when a component
+        is stopped, is its subscription cleaned up? Can a late status update from a stopped
+        component be delivered to the coordinator after the component is removed from the
+        expected set?
+
+        Read `monitoring/liveness.go`. Ask: does it read from the same state source as the
+        gRPC server and Fleet check-in? If not, can they diverge?
+
+        For the common "agent goes unhealthy on log level change" pattern: trace how a log
+        level change in Fleet policy flows through the coordinator to the component runtime.
+        Does changing the log level cause a component restart? If so, is the transient
+        "starting" state handled without marking the agent as degraded?
+
+        ## For each risk you confirm
+
+        Write a Go test. Create mock components that send status updates in specific sequences
+        (healthy → unhealthy → removed, or out-of-order timestamps). Assert the coordinator's
+        aggregate state is correct at each point. Run:
+        `go test ./internal/pkg/agent/application/coordinator/... ./pkg/component/runtime/...`
+
+        ## The bar for filing
+
+        Only report findings that a real Fleet-managed or standalone agent could encounter
+        through normal operations: changing policy, adding/removing integrations, changing
+        log levels, restarting components, network interruptions to Fleet. The status
+        inaccuracy must be observable by a user (through the `elastic-agent status` command,
+        Fleet UI, or Kubernetes liveness probe). Do not file findings about internal state
+        that is never exposed to users.
+
+        ## Output
+
+        File a single issue containing:
+        - Confirmed issues with test code, the exact state transition sequence, and fix direction
+        - A priority ranking: permanent incorrect states first, then transient inaccuracies,
+          then cosmetic issues
+        - State transitions and aggregation paths you audited and found correct
+    secrets:
+      COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
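
The out-of-order and removed-component failure modes described in the coordinator prompt above can be illustrated with a small Go sketch. Everything in it (`State`, `Update`, `Aggregator`, the component names) is hypothetical and chosen for illustration; it is not the coordinator's actual API, and it assumes each component attaches a monotonically increasing sequence number to its updates.

```go
package main

import "fmt"

// State is a hypothetical, simplified health value (not the agent's real type).
// Larger values are worse.
type State int

const (
	Healthy State = iota
	Degraded
	Failed
)

// Update is a hypothetical status update with a per-component sequence number.
type Update struct {
	Component string
	Seq       uint64
	State     State
}

// Aggregator keeps the newest update per component and derives the worst state.
type Aggregator struct {
	states map[string]Update
}

func NewAggregator() *Aggregator {
	return &Aggregator{states: make(map[string]Update)}
}

// Apply records u unless an update with the same or a higher sequence number
// was already seen for that component. It reports whether u was accepted.
func (a *Aggregator) Apply(u Update) bool {
	if cur, ok := a.states[u.Component]; ok && u.Seq <= cur.Seq {
		return false // stale or duplicate update: ignore it
	}
	a.states[u.Component] = u
	return true
}

// Remove drops a component so its last state no longer affects the aggregate.
func (a *Aggregator) Remove(component string) {
	delete(a.states, component)
}

// Overall returns the worst state among the tracked components.
func (a *Aggregator) Overall() State {
	worst := Healthy
	for _, u := range a.states {
		if u.State > worst {
			worst = u.State
		}
	}
	return worst
}

func main() {
	agg := NewAggregator()
	agg.Apply(Update{Component: "filestream", Seq: 2, State: Healthy})
	// A late, older update must not override the newer healthy state.
	accepted := agg.Apply(Update{Component: "filestream", Seq: 1, State: Degraded})
	fmt.Println("stale update accepted:", accepted)

	agg.Apply(Update{Component: "endpoint", Seq: 1, State: Failed})
	agg.Remove("endpoint") // removal must clear the failed state
	fmt.Println("overall healthy:", agg.Overall() == Healthy)
}
```

This sketch does not cover the race the prompt asks about, where a late update arrives after `Remove`: here that update would be accepted and re-add the component. A test for that case would need the aggregate to also track which components are currently expected.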
@@ -0,0 +1,126 @@
+name: "Sweeper: Fleet Enrollment and Communication Resilience"
+on:
+  schedule:
+    - cron: "0 9 * * 4"
+  workflow_dispatch:
+
+permissions:
+  actions: read
+  contents: read
+  issues: write
+  pull-requests: read
+
+jobs:
+  run:
+    uses: elastic/ai-github-actions/.github/workflows/gh-aw-code-quality-audit.lock.yml@v0
+    with:
+      title-prefix: "[fleet-enrollment-resilience]"
+      severity-threshold: "high"
+      setup-commands: |
+        sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv
+        export GOTOOLCHAIN=auto
+        make mage
+      additional-instructions: |
+        You are a **Fleet enrollment and communication resilience sweeper** for Elastic Agent.
+        Your goal is to find every code path where enrollment can fail and leave the agent in
+        an unrecoverable state, where check-in communication failures cause incorrect behavior,
+        or where re-enrollment logic triggers unnecessarily — and write a failing test for each
+        confirmed issue.
+
+        ## The component
+
+        Fleet communication spans several packages:
+
+        - `internal/pkg/fleetapi/` — HTTP API layer:
+          - `enroll_cmd.go` — Enrollment request/response
+          - `checkin_cmd.go` — Agent check-in payload
+          - `ack_cmd.go` — Action acknowledgment
+          - `client/client.go` — HTTP sender, API version headers, status path
+          - `client/round_trippers.go` — Transport middleware
+          - `acker/` — Ack implementations (fleet, lazy, noop, retrier)
+        - `internal/pkg/agent/application/enroll/enroll.go` — High-level enrollment:
+          token validation, backoff, config persistence
+        - `internal/pkg/agent/application/gateway/fleet/fleet_gateway.go` — Long-running
+          gateway: periodic check-in, action dispatch, backoff on auth errors, integrates
+          coordinator/runtime/OTel status into Fleet state
+        - `internal/pkg/agent/application/managed_mode.go` — Wires managed mode together
+        - `internal/pkg/agent/application/fleet_server_bootstrap.go` — Fleet Server bootstrap
+          when the agent itself runs Fleet Server
+        - `internal/pkg/agent/application/monitoring/liveness.go` — HTTP liveness endpoint
+          for orchestrators (Kubernetes, systemd)
+        - `internal/pkg/agent/application/monitoring/server.go` — Monitoring HTTP server
+
+        ## The bug class
+
+        Fleet communication bugs recur in four categories:
+
+        **Enrollment failures that leave stale state**: When enrollment fails partway through
+        (network error after the server accepts but before the agent persists credentials),
+        the agent can end up with partial state: an API key that the server considers valid
+        but that the agent cannot use, or a `fleet.enc` file that references a Fleet URL that
+        no longer matches. Historical bugs include: `shouldFleetEnroll` re-triggering enrollment
+        because the stored host URL differs in scheme (http vs https) from the Fleet URL,
+        enrollment failing silently when the policy is not on the first page of API results,
+        and delayed enrollment (`--delay-enroll`) failing to start the service on reboot
+        across multiple platforms.
+
+        **Check-in timing and backoff issues**: The agent checks in with Fleet periodically.
+        Bugs have included: check-in calculations using wall clock time instead of monotonic
+        time (causing storms after NTP sync or manual time changes), retry backoff being too
+        aggressive (thousands of retries per second), and the agent not recovering from
+        transient auth failures (staying offline permanently after a brief Fleet outage).
+
+        **Liveness endpoint inaccuracy**: The `/liveness` endpoint is used by Kubernetes and
+        other orchestrators to determine if the agent is healthy. Bugs have included: the
+        endpoint not being available during enrollment, not considering overall agent state
+        (only component state), and `?failon=degraded` not returning HTTP 500 when the
+        agent is actually degraded.
+
+        **Re-enrollment loops**: Several bugs have caused agents to repeatedly re-enroll
+        when they should not: URL normalization differences, stale enrollment tokens in
+        Kubernetes deployments, and migration between Fleet clusters leaving conflicting state.
+
+        ## How to investigate
+
+        Start with the enrollment flow in `enroll/enroll.go`. Trace from the CLI command or
+        delayed-enroll trigger through credential exchange, config persistence, and first
+        check-in. At each step ask: what happens if this step fails? Is the partial state
+        cleaned up? Can the agent retry from a clean state?
+
+        Then read the Fleet gateway (`gateway/fleet/fleet_gateway.go`). Follow the check-in
+        loop. Ask: how is the check-in interval computed? Is it monotonic-time safe? What
+        happens when the server returns 401, 429, or 500? Does the backoff reset correctly
+        after a successful check-in? What happens to queued actions if check-in fails for
+        an extended period?
+
+        For liveness, read `monitoring/liveness.go`. Ask: what state does it inspect? Does
+        it reflect enrollment state, or only post-enrollment component state? Is there a
+        race between the enrollment completing and the liveness handler being registered?
+
+        For re-enrollment, read the `shouldFleetEnroll` logic and the config comparison code.
+        Ask: under what conditions does the agent decide it needs to re-enroll? Are URL
+        comparisons normalized? What about trailing slashes, port numbers, scheme differences?
+
+        ## For each risk you confirm
+
+        Write a Go test. For enrollment, mock the Fleet API and simulate failure at each step.
+        For check-in, test with simulated clock skew and error responses. For liveness, test
+        the HTTP endpoint at different agent lifecycle stages. Run:
+        `go test ./internal/pkg/fleetapi/... ./internal/pkg/agent/application/gateway/... ./internal/pkg/agent/application/enroll/... ./internal/pkg/agent/application/monitoring/...`
+
+        ## The bar for filing
+
+        Only report findings that a real Fleet-managed agent could encounter: network
+        interruptions, Fleet Server restarts, Kubernetes pod restarts, clock drift, migration
+        between clusters. Do not file findings that require manipulating internal state files
+        in ways that cannot happen through normal agent operations.
+
+        ## Output
+
+        File a single issue containing:
+        - Confirmed issues with test code, the exact failure scenario, and fix direction
+        - A priority ranking: unrecoverable states first, then data plane impact, then
+          cosmetic status issues
+        - Communication paths you audited and found resilient
+    secrets:
+      COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
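
The URL-comparison questions in the re-enrollment section above (trailing slashes, port numbers, scheme differences) can be made concrete with a minimal Go sketch. `normalizeFleetURL` and `sameFleetURL` are hypothetical helpers written for illustration, not the agent's `shouldFleetEnroll` logic. The sketch assumes that host case, default ports, and trailing slashes are insignificant, and it deliberately keeps a scheme difference (http vs https) as a real difference; whether the agent should treat it that way is a design decision the audit would need to confirm.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// normalizeFleetURL is a hypothetical helper that reduces a URL to a
// canonical scheme://host[:port]/path form for comparison.
func normalizeFleetURL(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("url %q needs a scheme and a host", raw)
	}
	scheme := strings.ToLower(u.Scheme)
	host := strings.ToLower(u.Hostname())
	port := u.Port()
	// Drop the port when it is the default for the scheme.
	if (scheme == "https" && port == "443") || (scheme == "http" && port == "80") {
		port = ""
	}
	switch {
	case port != "":
		host = net.JoinHostPort(host, port) // adds brackets for IPv6 literals
	case strings.Contains(host, ":"):
		host = "[" + host + "]" // IPv6 literal without a port
	}
	path := strings.TrimRight(u.Path, "/")
	return scheme + "://" + host + path, nil
}

// sameFleetURL reports whether two URLs are equal after normalization.
func sameFleetURL(a, b string) (bool, error) {
	na, err := normalizeFleetURL(a)
	if err != nil {
		return false, err
	}
	nb, err := normalizeFleetURL(b)
	if err != nil {
		return false, err
	}
	return na == nb, nil
}

func main() {
	same, err := sameFleetURL("https://Fleet.Example.com:443/", "https://fleet.example.com")
	fmt.Println(same, err)
	same, err = sameFleetURL("http://fleet.example.com", "https://fleet.example.com")
	fmt.Println(same, err)
}
```

A table-driven test over pairs like these is one way to pin down which differences trigger re-enrollment and which do not.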