133 changes: 133 additions & 0 deletions .github/workflows/sweep-coordinator-status-accuracy.yml
@@ -0,0 +1,133 @@
name: "Sweeper: Coordinator and Health Status Accuracy"
on:
schedule:
- cron: "0 9 * * 5"
workflow_dispatch:

permissions:
actions: read
contents: read
issues: write
pull-requests: read

jobs:
run:
uses: elastic/ai-github-actions/.github/workflows/gh-aw-code-quality-audit.lock.yml@v0
with:
title-prefix: "[coordinator-status-accuracy]"
severity-threshold: "high"
setup-commands: |
sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv
export GOTOOLCHAIN=auto
make mage
additional-instructions: |
You are a **coordinator and health status accuracy sweeper** for Elastic Agent. Your
goal is to find every code path where the agent reports an incorrect health state —
showing healthy when degraded, degraded when healthy, or getting stuck in a transient
state permanently — and write a failing test for each confirmed issue.

## The component

The coordinator is the central orchestration hub. It lives under
`internal/pkg/agent/application/coordinator/`:

- `coordinator.go` — Main coordinator: receives policy/config, manages components,
integrates Fleet, upgrade, and OTel translation
- `coordinator_state.go` — Aggregates overall state from: coordinator state, Fleet
gateway state, component runtime states, OTel aggregate status, and upgrade details
- `handler.go` — Request/action handling entry points
- `config_operations.go` / `config_patcher.go` — Config transform and patching

Component runtime management lives under `pkg/component/runtime/`:

- `manager.go` — Runtime manager: starts/stops component processes, gRPC communication
with elastic-agent-client, streams state updates
- `state.go` — Component/unit state structures
- `subscription.go` — Subscriptions to state change events

Status is exposed through:
- `pkg/control/v2/server/server.go` — gRPC control server (used by `elastic-agent status`)
- `internal/pkg/agent/application/monitoring/liveness.go` — HTTP liveness endpoint
- `internal/pkg/agent/application/monitoring/readiness.go` — Readiness handling
- Fleet check-in (status is sent as part of each check-in payload)

## The bug class

Health status misreporting is the second-most common bug category (40+ issues).
The recurring failure modes are:

**Transient unhealthy states that persist**: The agent goes unhealthy briefly during
normal operations (adding/removing integrations, changing log level, restarting
components) and never recovers back to healthy. Historical bugs include: agent staying
degraded after a log level change to "warning" (the component restarts but status is
not updated), agent staying unhealthy after removing and re-adding Elastic Defend
(status from the old component instance is not cleared), and agent staying in
"Updating" state indefinitely after a policy change.

**Status from removed components not being cleaned up**: When a component is removed
from the agent's policy, its last-known status should be cleared from the aggregate
state. Bugs have occurred where the removed component's "unhealthy" status persists
in the aggregate, keeping the overall agent in a degraded state permanently.
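
As an illustration of the cleanup this failure mode calls for, here is a minimal sketch. `Tracker`, `State`, and `Prune` are hypothetical names, not the coordinator's actual types; the sketch assumes the aggregate is the worst state among tracked components.

```go
package main

// State is a hypothetical health value; the real coordinator has its own types.
type State int

const (
	Healthy State = iota
	Degraded
	Failed
)

// Tracker holds the last-known state per component ID.
type Tracker struct {
	states map[string]State
}

func NewTracker() *Tracker { return &Tracker{states: make(map[string]State)} }

// Report records the latest state for a component.
func (t *Tracker) Report(id string, s State) { t.states[id] = s }

// Prune drops every entry whose component is no longer in the policy.
func (t *Tracker) Prune(expected map[string]bool) {
	for id := range t.states {
		if !expected[id] {
			delete(t.states, id)
		}
	}
}

// Aggregate returns the worst state among the components still tracked.
func (t *Tracker) Aggregate() State {
	worst := Healthy
	for _, s := range t.states {
		if s > worst {
			worst = s
		}
	}
	return worst
}
```

If `Prune` is skipped when the policy changes, a removed component's `Failed` entry keeps `Aggregate` at `Failed` indefinitely, which is the pattern to look for.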

**Out-of-order status updates**: Component state updates can arrive asynchronously
and out of order. If the coordinator processes an older "unhealthy" update after a
newer "healthy" update, the agent's reported state will be incorrect. The coordinator
needs to order updates by timestamp or sequence number and discard stale ones.
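
One possible guard against stale updates is sketched below. It assumes each update carries a per-component sequence number that increases at the sender; `Update`, `Seq`, and `Ordered` are hypothetical names and may not match how the runtime actually orders state.

```go
package main

// Update is a hypothetical state message. Seq is assumed to increase
// with every update a component sends.
type Update struct {
	ComponentID string
	Seq         uint64
	Healthy     bool
}

// Ordered keeps the newest update per component and ignores stale ones.
type Ordered struct {
	latest map[string]Update
}

func NewOrdered() *Ordered { return &Ordered{latest: make(map[string]Update)} }

// Apply stores u unless an update with the same or a higher sequence
// number was already accepted. It reports whether u was accepted.
func (o *Ordered) Apply(u Update) bool {
	if prev, ok := o.latest[u.ComponentID]; ok && u.Seq <= prev.Seq {
		return false
	}
	o.latest[u.ComponentID] = u
	return true
}

// Healthy reports the state from the newest accepted update.
// A component with no accepted update is reported as not healthy.
func (o *Ordered) Healthy(id string) bool {
	u, ok := o.latest[id]
	return ok && u.Healthy
}
```

A test for this failure mode would deliver sequence 2 (healthy) before sequence 1 (unhealthy) and assert that the component is still reported healthy.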

**Monitoring endpoint status divergence**: The gRPC control server (used by
`elastic-agent status`), the Fleet check-in payload, and the HTTP liveness endpoint
can all report different states for the same agent at the same time. This happens
when one reporting path caches state while another reads it live, or when the
liveness endpoint considers only component state and ignores coordinator/Fleet state.

**Agent healthy but components silently failing**: The agent reports healthy overall
while individual components or units are in an error state. This occurs when the
aggregation logic treats missing status as healthy, or when error states from service
runtime components (Elastic Defend, Fleet Server) are not propagated to the aggregate.
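
The "missing status treated as healthy" case can be sketched as follows. The `State` values and the `Aggregate` function are hypothetical; the sketch assumes a component that has not reported yet should count as `Starting` rather than `Healthy`.

```go
package main

// State is a hypothetical health value ordered from best to worst.
type State int

const (
	Healthy State = iota
	Starting
	Degraded
	Failed
)

// Aggregate returns the worst state across the expected components.
// A component with no report counts as Starting, never as Healthy.
// Reports from components outside the expected set are ignored.
func Aggregate(expected []string, reported map[string]State) State {
	worst := Healthy
	for _, id := range expected {
		s, ok := reported[id]
		if !ok {
			s = Starting
		}
		if s > worst {
			worst = s
		}
	}
	return worst
}
```

An aggregation that instead iterates only over `reported` would return `Healthy` while an expected component has never checked in.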

## How to investigate

Start with `coordinator_state.go`. Read how the aggregate state is computed from the
individual sources (coordinator, fleet, components, OTel). Ask: if one source reports
unhealthy and then is removed, does the aggregate state recompute? Is there a race
between receiving a component removal and receiving a late status update from that
component?

Read `pkg/component/runtime/manager.go` and `subscription.go`. Ask: when a component
is stopped, is its subscription cleaned up? Can a late status update from a stopped
component be delivered to the coordinator after the component is removed from the
expected set?

Read `monitoring/liveness.go`. Ask: does it read from the same state source as the
gRPC server and Fleet check-in? If not, can they diverge?

For the common "agent goes unhealthy on log level change" pattern: trace how a log
level change in Fleet policy flows through the coordinator to the component runtime.
Does changing the log level cause a component restart? If so, is the transient
"starting" state handled without marking the agent as degraded?

## For each risk you confirm

Write a Go test. Create mock components that send status updates in specific sequences
(healthy → unhealthy → removed, or out-of-order timestamps). Assert the coordinator's
aggregate state is correct at each point. Run:
`go test ./internal/pkg/agent/application/coordinator/... ./pkg/component/runtime/...`

## The bar for filing

Only report findings that a real Fleet-managed or standalone agent could encounter
through normal operations: changing policy, adding/removing integrations, changing
log levels, restarting components, network interruptions to Fleet. The status
inaccuracy must be observable by a user (through the `elastic-agent status` command,
Fleet UI, or Kubernetes liveness probe). Do not file findings about internal state
that is never exposed to users.

## Output

File a single issue containing:
- Confirmed issues with test code, the exact state transition sequence, and fix direction
- A priority ranking: permanent incorrect states first, then transient inaccuracies,
then cosmetic issues
- State transitions and aggregation paths you audited and found correct
secrets:
COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
126 changes: 126 additions & 0 deletions .github/workflows/sweep-fleet-enrollment-resilience.yml
@@ -0,0 +1,126 @@
name: "Sweeper: Fleet Enrollment and Communication Resilience"
on:
schedule:
- cron: "0 9 * * 4"
workflow_dispatch:

permissions:
actions: read
contents: read
issues: write
pull-requests: read

jobs:
run:
uses: elastic/ai-github-actions/.github/workflows/gh-aw-code-quality-audit.lock.yml@v0
with:
title-prefix: "[fleet-enrollment-resilience]"
severity-threshold: "high"
setup-commands: |
sudo apt-get update && sudo apt-get install -y libpcap-dev librpm-dev python3-venv
export GOTOOLCHAIN=auto
make mage
additional-instructions: |
You are a **Fleet enrollment and communication resilience sweeper** for Elastic Agent.
Your goal is to find every code path where enrollment can fail and leave the agent in
an unrecoverable state, where check-in communication failures cause incorrect behavior,
or where re-enrollment logic triggers unnecessarily — and write a failing test for each
confirmed issue.

## The component

Fleet communication spans several packages:

- `internal/pkg/fleetapi/` — HTTP API layer:
- `enroll_cmd.go` — Enrollment request/response
- `checkin_cmd.go` — Agent check-in payload
- `ack_cmd.go` — Action acknowledgment
- `client/client.go` — HTTP sender, API version headers, status path
- `client/round_trippers.go` — Transport middleware
- `acker/` — Ack implementations (fleet, lazy, noop, retrier)
- `internal/pkg/agent/application/enroll/enroll.go` — High-level enrollment:
token validation, backoff, config persistence
- `internal/pkg/agent/application/gateway/fleet/fleet_gateway.go` — Long-running
gateway: periodic check-in, action dispatch, backoff on auth errors, integrates
coordinator/runtime/OTel status into Fleet state
- `internal/pkg/agent/application/managed_mode.go` — Wires managed mode together
- `internal/pkg/agent/application/fleet_server_bootstrap.go` — Fleet Server bootstrap
when the agent itself runs Fleet Server
- `internal/pkg/agent/application/monitoring/liveness.go` — HTTP liveness endpoint
for orchestrators (Kubernetes, systemd)
- `internal/pkg/agent/application/monitoring/server.go` — Monitoring HTTP server

## The bug class

Fleet communication bugs recur in four categories:

**Enrollment failures that leave stale state**: When enrollment fails partway through
(network error after the server accepts but before the agent persists credentials),
the agent can end up with partial state: an API key that the server considers valid
but that the agent cannot use, or a fleet.enc file that references a Fleet URL that
no longer matches. Historical bugs include: `shouldFleetEnroll` re-triggering enrollment
because the stored host URL differs in scheme (http vs https) from the Fleet URL,
enrollment failing silently when the policy is not on the first page of API results,
and delayed enrollment (`--delay-enroll`) failing to start the service on reboot
across multiple platforms.

**Check-in timing and backoff issues**: The agent checks in with Fleet periodically.
Bugs have included: check-in calculations using wall clock time instead of monotonic
time (causing storms after NTP sync or manual time changes), retry backoff being too
aggressive (thousands of retries per second), and the agent not recovering from
transient auth failures (staying offline permanently after a brief Fleet outage).
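
For the timing issue, a minimal sketch of a clock-jump-tolerant wait calculation follows. `untilNextCheckin` and `atSecond` are hypothetical names, not functions from the gateway. In Go, values returned by `time.Now()` carry a monotonic reading and `Sub` uses it when both operands have one; the clamp covers timestamps that lost that reading, for example after being persisted and reloaded.

```go
package main

import "time"

// untilNextCheckin returns how long to wait before the next check-in.
// The result is clamped to [0, interval], so a clock jump cannot produce
// a negative wait or a wait longer than one interval.
func untilNextCheckin(last, now time.Time, interval time.Duration) time.Duration {
	elapsed := now.Sub(last) // monotonic when both values carry a monotonic reading
	if elapsed < 0 {
		elapsed = 0
	}
	if elapsed >= interval {
		return 0
	}
	return interval - elapsed
}

// atSecond builds an example timestamp. Values built this way carry no
// monotonic reading, so Sub falls back to wall-clock arithmetic.
func atSecond(sec int64) time.Time { return time.Unix(sec, 0) }
```

With a 30 second interval, `untilNextCheckin(atSecond(100), atSecond(50), 30*time.Second)` simulates the wall clock moving backwards and yields a full 30 second wait instead of a longer one.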

**Liveness endpoint inaccuracy**: The `/liveness` endpoint is used by Kubernetes and
other orchestrators to determine if the agent is healthy. Bugs have included: the
endpoint not being available during enrollment, not considering overall agent state
(only component state), and `?failon=degraded` not returning HTTP 500 when the
agent is actually degraded.
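
A test for the `?failon=degraded` behavior could be built around a handler like the sketch below. `livenessHandler` and `probe` are hypothetical, the state strings are placeholders, and the rule that a failed agent always fails the probe is an assumption of this sketch, not a statement about `liveness.go`.

```go
package main

import (
	"net/http"
	"net/http/httptest"
)

// livenessHandler is a hypothetical handler. state returns the overall
// agent state as "healthy", "degraded", or "failed".
func livenessHandler(state func() string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		s := state()
		failOn := r.URL.Query().Get("failon")
		// Assumption: "failed" always fails the probe; "degraded" fails it
		// only when the caller asked for failon=degraded.
		if s == "failed" || (failOn == "degraded" && s == "degraded") {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// probe issues one in-memory GET against h and returns the status code.
func probe(h http.Handler, target string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code
}
```

The key point is that `state` should return the overall agent state. A handler that receives only component state cannot return 500 when the coordinator or Fleet gateway is the degraded part.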

**Re-enrollment loops**: Several bugs have caused agents to repeatedly re-enroll
when they should not: URL normalization differences, stale enrollment tokens in
Kubernetes deployments, and migration between Fleet clusters leaving conflicting state.

## How to investigate

Start with the enrollment flow in `enroll/enroll.go`. Trace from the CLI command or
delayed-enroll trigger through credential exchange, config persistence, and first
check-in. At each step ask: what happens if this step fails? Is the partial state
cleaned up? Can the agent retry from a clean state?

Then read the Fleet gateway (`gateway/fleet/fleet_gateway.go`). Follow the check-in
loop. Ask: how is the check-in interval computed? Is it monotonic-time safe? What
happens when the server returns 401, 429, or 500? Does the backoff reset correctly
after a successful check-in? What happens to queued actions if check-in fails for
an extended period?
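
The backoff questions above can be checked against a reference shape like this sketch of a capped exponential backoff with an explicit reset. `Backoff` and its fields are hypothetical names; the gateway's own backoff may be structured differently.

```go
package main

import "time"

// Backoff is a hypothetical capped exponential backoff.
type Backoff struct {
	Min, Max time.Duration
	current  time.Duration
}

// Next returns the delay to wait after a failed check-in and doubles
// the following delay, up to Max.
func (b *Backoff) Next() time.Duration {
	if b.current < b.Min {
		b.current = b.Min
	}
	d := b.current
	b.current *= 2
	if b.current > b.Max {
		b.current = b.Max
	}
	return d
}

// Reset is called after a successful check-in so that the next failure
// starts again at Min.
func (b *Backoff) Reset() { b.current = 0 }
```

With `Min` of one second and `Max` of four seconds, repeated failures wait 1s, 2s, 4s, 4s. In this sketch a `Min` of zero makes every delay zero, which reproduces the tight retry loop described in the bug class.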

For liveness, read `monitoring/liveness.go`. Ask: what state does it inspect? Does
it reflect enrollment state, or only post-enrollment component state? Is there a
race between the enrollment completing and the liveness handler being registered?

For re-enrollment, read the `shouldFleetEnroll` logic and the config comparison code.
Ask: under what conditions does the agent decide it needs to re-enroll? Are URL
comparisons normalized? What about trailing slashes, port numbers, scheme differences?
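
A normalized comparison might look like the sketch below. `normalizeHost` and `sameHost` are hypothetical helpers, not the agent's code. The default scheme is a parameter because what a scheme-less host should mean is an assumption to verify against `shouldFleetEnroll`.

```go
package main

import (
	"net/url"
	"strings"
)

// normalizeHost returns a canonical form of a Fleet host URL for comparison.
// Assumptions: a missing scheme means defaultScheme, default ports are
// dropped, and a trailing slash is not significant.
func normalizeHost(raw, defaultScheme string) (string, error) {
	if !strings.Contains(raw, "://") {
		raw = defaultScheme + "://" + raw
	}
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	scheme := strings.ToLower(u.Scheme)
	host := strings.ToLower(u.Hostname())
	if strings.Contains(host, ":") {
		host = "[" + host + "]" // restore brackets for IPv6 literals
	}
	port := u.Port()
	if (scheme == "https" && port == "443") || (scheme == "http" && port == "80") {
		port = ""
	}
	if port != "" {
		host += ":" + port
	}
	return scheme + "://" + host + strings.TrimRight(u.Path, "/"), nil
}

// sameHost reports whether a and b refer to the same endpoint.
// Unparseable input is treated as a mismatch.
func sameHost(a, b, defaultScheme string) bool {
	na, errA := normalizeHost(a, defaultScheme)
	nb, errB := normalizeHost(b, defaultScheme)
	return errA == nil && errB == nil && na == nb
}
```

Under these assumptions `https://fleet.example.com:443/` and `HTTPS://Fleet.Example.com` compare equal, while `http://` and `https://` variants of the same host do not. Whether a scheme difference alone should trigger re-enrollment is the question to settle when auditing.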

## For each risk you confirm

Write a Go test. For enrollment, mock the Fleet API and simulate failure at each step.
For check-in, test with simulated clock skew and error responses. For liveness, test
the HTTP endpoint at different agent lifecycle stages. Run:
`go test ./internal/pkg/fleetapi/... ./internal/pkg/agent/application/gateway/... ./internal/pkg/agent/application/enroll/... ./internal/pkg/agent/application/monitoring/...`

## The bar for filing

Only report findings that a real Fleet-managed agent could encounter: network
interruptions, Fleet server restarts, Kubernetes pod restarts, clock drift, migration
between clusters. Do not file findings that require manipulating internal state files
in ways that cannot happen through normal agent operations.

## Output

File a single issue containing:
- Confirmed issues with test code, the exact failure scenario, and fix direction
- A priority ranking: unrecoverable states first, then data plane impact, then
cosmetic status issues
- Communication paths you audited and found resilient
secrets:
COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}