fix(arc): move dind docker root to /home with overlay2 (disk-pressure hazard) #167

Open: brandonrc wants to merge 3 commits into main from fix/dind-overlay2-on-home

@brandonrc

Contributor

Problem

The dind sidecars in all three runner scale sets run dockerd --storage-driver=vfs with no volume mounted at /var/lib/docker. Under vfs every image layer is a full directory copy, and without a mount those copies land in the container's writable layer under the crio graph root on the 70G root disk (~25G free). With maxRunners: 20 on ak-ci-runners doing concurrent docker builds, this can exhaust the root disk and trigger kubelet disk-pressure eviction of prod pods on the rocky node.

Fix

For ak-ci-runners, ak-e2e-runners, and ak-beefy-runners:

  • Mount hostPath /home/runner-dind at /var/lib/docker with subPathExpr: $(POD_NAME) — per-pod isolation, since concurrent dockerds cannot share a docker root. /home is 1.8T xfs, ftype=1 (d_type OK for overlay2).
  • Switch dockerd to --storage-driver=overlay2. vfs was only required because /var/lib/docker previously sat on the crio overlay layer and the RHEL 8.10 4.18 kernel forbids overlay-on-overlay. On a plain xfs directory overlay2 works — verified with a manual overlay mount + copy-up test on /home on the node.
  • New e2e/runner-dind-cleanup.yaml: CronJob (every 15 min, concurrencyPolicy: Forbid) reaping /home/runner-dind subdirs that have no matching live pod in arc-runners AND are older than 10 min (grace window). Ephemeral runner pods churn constantly and preStop hooks are not guaranteed to run. Fail-safe: set -euo pipefail aborts before any deletion if the API call fails. Uses ak-runner-rust (already on the node, ships curl+jq) with a dedicated SA limited to pods get/list.
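
The cleanup job in the last bullet could be sketched roughly as follows. This is an illustration built only from the description above, not the contents of e2e/runner-dind-cleanup.yaml: the object names, the CronJob's namespace, the /dind mount path, running as UID 0, and the use of directory mtime as the age check are all assumptions, and the image reference is left as named in the PR (registry and tag are not given there). Node pinning is omitted.

```yaml
# Hypothetical sketch; every name below except /home/runner-dind,
# arc-runners, and ak-runner-rust is an assumption.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: runner-dind-cleanup
  namespace: arc-runners
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: runner-dind-cleanup
  namespace: arc-runners
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]          # read-only: the job never mutates pods
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: runner-dind-cleanup
  namespace: arc-runners
subjects:
  - kind: ServiceAccount
    name: runner-dind-cleanup
    namespace: arc-runners
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: runner-dind-cleanup
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: runner-dind-cleanup
  namespace: arc-runners
spec:
  schedule: "*/15 * * * *"          # every 15 minutes
  concurrencyPolicy: Forbid         # never run two reapers at once
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: runner-dind-cleanup
          restartPolicy: Never
          containers:
            - name: reap
              image: ak-runner-rust   # as named in the PR; registry/tag not given
              securityContext:
                runAsUser: 0          # assumed: dockerd's files are root-owned
              command: ["/bin/bash", "-c"]
              args:
                - |
                  set -euo pipefail
                  sa=/var/run/secrets/kubernetes.io/serviceaccount
                  # If the API call fails, set -e aborts here, before any deletion.
                  pods_json=$(curl -fsS --cacert "$sa/ca.crt" \
                    -H "Authorization: Bearer $(cat "$sa/token")" \
                    https://kubernetes.default.svc/api/v1/namespaces/arc-runners/pods)
                  live=$(jq -r '.items[].metadata.name' <<<"$pods_json")
                  # Grace window: only directories untouched for more than 10 minutes.
                  find /dind -mindepth 1 -maxdepth 1 -type d -mmin +10 -print0 |
                    while IFS= read -r -d '' dir; do
                      # Reap only when no live pod has exactly this name.
                      if ! grep -qxF -- "$(basename "$dir")" <<<"$live"; then
                        echo "reaping $dir"
                        rm -rf -- "$dir"
                      fi
                    done
              volumeMounts:
                - name: dind-root
                  mountPath: /dind
          volumes:
            - name: dind-root
              hostPath:
                path: /home/runner-dind
                type: DirectoryOrCreate
```

Listing pods before touching the filesystem is what makes the fail-safe work: a failed or rejected API call ends the script with nothing deleted.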

Permission note (lesson from the SCCACHE_ERROR_LOG outage): the DirectoryOrCreate hostPath is root:755, but dockerd runs as root in a privileged container, so it's writable.
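
As a rough sketch, the mount and driver change amounts to a values fragment like the one below. Field placement follows the usual pod-template layout of a runner scale set values file; the volume name, container image, and --host argument are assumptions and not taken from this branch.

```yaml
# Hypothetical values fragment; dind-root, the image, and --host are assumed.
template:
  spec:
    containers:
      - name: dind
        image: docker:dind
        args:
          - dockerd
          - --host=unix:///var/run/docker.sock
          - --storage-driver=overlay2     # replaces vfs
        env:
          - name: POD_NAME                # downward API: this pod's own name
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
        securityContext:
          privileged: true                # dockerd runs as root
        volumeMounts:
          - name: dind-root
            mountPath: /var/lib/docker
            subPathExpr: $(POD_NAME)      # one subdirectory per pod
    volumes:
      - name: dind-root
        hostPath:
          path: /home/runner-dind
          type: DirectoryOrCreate
```

Because subPathExpr is expanded from the container's environment, POD_NAME has to be declared on the same container that carries the mount.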

Why not emptyDir + overlay2

emptyDir lives under /var/lib/kubelet/pods — still the root disk — and a sizeLimit breach evicts the pod mid-job. The hostPath approach moves all heavy build I/O onto the 1.4T-free /home disk.
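
For contrast, the rejected alternative would look roughly like this; the volume name and the sizeLimit value are hypothetical.

```yaml
# Rejected alternative (illustrative only).
volumes:
  - name: dind-root
    emptyDir:
      sizeLimit: 20Gi   # hypothetical cap; exceeding it evicts the pod mid-job
```

The data would still sit under /var/lib/kubelet/pods on the root disk, so the cap only changes which pod gets evicted.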

Depends on / includes PR #163

⚠️ This branch is based on fix/sccache-cache-size (#163), which is live in prod as helm revision 26 of ak-ci-runners. It includes those commits (SCCACHE_CACHE_SIZE=50G, SCCACHE_ERROR_LOG moved into the 777 subdir). Merge #163 first, or merge this PR, which carries both. A values file based on main would revert the sccache fix and re-break every compile job.

Validation

  • helm upgrade --dry-run=server with chart 0.13.1 renders cleanly (overlay2 args, POD_NAME downward API env, subPathExpr mount, hostPath volume, sccache env intact).
  • overlayfs mount + copy-up tested directly on /home (xfs ftype=1) on the rocky node.
  • End-to-end verification (runner pod, docker info, in-pod docker build, layer location on /home, sccache --show-stats) performed after deploy — evidence in PR comments / ops log.

🤖 Generated with Claude Code

brandonrc and others added 3 commits June 11, 2026 23:32
The shared sccache hostPath was capped at sccache's 10 GiB default
(SCCACHE_CACHE_SIZE unset) and verified at exactly 10G on 2026-06-12 —
continuously evicting and degrading hit rates. Baseline measurement:
~30 cold rust builds/week at ~12m penalty each, ~22 runner-hours/week
lost to cache churn across GHA quota + local sccache eviction.

Also adds SCCACHE_ERROR_LOG so compile-cache failures are observable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

/cache (hostPath root) is root:755; only the subdirs get chmod 777 from
init-cache. sccache could not create /cache/sccache-error.log, the
server died at startup, and every compile job on ak-ci-runners failed
with 'Timed out waiting for server startup' (~04:00Z 2026-06-12).
Coverage stayed green only because it sets RUSTC_WRAPPER=''.

…disk

The dind sidecars ran dockerd --storage-driver=vfs with /var/lib/docker
unmounted, so every image layer was a full directory copy landing in the
container writable layer under the crio graph root on the 70G root disk
(~25G free). With up to 20 concurrent ak-ci-runners doing docker builds
this risks root-disk exhaustion and kubelet disk-pressure eviction of
prod pods on the rocky node.

Fix, applied to all three scale sets (ci, e2e, beefy):

- Mount hostPath /home/runner-dind at /var/lib/docker with
  subPathExpr $(POD_NAME) for per-pod isolation (concurrent dockerds
  cannot share a docker root). /home is 1.8T xfs with ftype=1.
- Switch dockerd to --storage-driver=overlay2. vfs was only needed
  because /var/lib/docker previously sat on the crio overlay layer and
  the RHEL 8 4.18 kernel forbids overlay-on-overlay; on a plain xfs
  directory overlay2 works (verified with a manual overlay mount on
  /home on the rocky node).
- Add e2e/runner-dind-cleanup.yaml: a CronJob (every 15 min) that
  reaps /home/runner-dind subdirs with no matching live pod in
  arc-runners and older than 10 min, since ephemeral runner pods churn
  constantly and preStop hooks are not guaranteed to run. Fail-safe:
  an API error aborts the run before any deletion.

dockerd runs as root in a privileged container, so the root:755
DirectoryOrCreate hostPath is writable (cf. the SCCACHE_ERROR_LOG
permissions outage - sccache runs unprivileged, dockerd does not).

Based on fix/sccache-cache-size (PR #163), which is live as helm
revision 26; basing on main would revert the sccache fix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@brandonrc brandonrc requested a review from a team as a code owner June 12, 2026 04:22
@github-actions

Missing linked issue

This PR does not reference a tracking issue in its body. Every PR must link to an issue in this repository so we can trace work back to a planned change.

How to fix

  1. Edit the PR description and add a line like Closes #123, Fixes #123, or Resolves #123 referring to an open issue in artifact-keeper/artifact-keeper-iac.
  2. Save the description. This check will re-run automatically.

Accepted keywords (case-insensitive, any tense): close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved.

Policy reference: see the PR template.

Maintainer bypass: apply the no-issue-required label to this PR to skip the check (use sparingly, e.g. for trivial typo fixes or release-tag chores).

@brandonrc

Contributor Author

Deployed and verified on the rocky node (2026-06-12 ~00:25 EDT)

Helm revisions (all chart 0.13.1): ak-ci-runners 27, ak-e2e-runners 11, ak-beefy-runners 6. Rollback point for ci: revision 26.

Evidence

  • All three AutoscalingRunnerSet CRs accepted by the 0.13.x controller (listeners 1/1 Running, new runner pods scheduling) and carry --storage-driver=overlay2, subPathExpr: $(POD_NAME), hostPath /home/runner-dind.
  • New-spec runner pod docker info:
    Storage Driver: overlay2
     Backing Filesystem: xfs
     Supports d_type: true
     Native Overlay Diff: true
    
  • In-pod docker build (alpine + 32M layer) + docker run succeeded; layers confirmed on the host at /home/runner-dind/<pod-name>/overlay2/ on /dev/mapper/rl00-home (1.3T free), not the root disk.
  • sccache --show-stats exits 0 in the new runner container. The fix from PR #163 (fix(arc): raise sccache cache size to 50G on ak-ci-runners) is intact; the render shows SCCACHE_CACHE_SIZE=50G + SCCACHE_ERROR_LOG=/cache/sccache/sccache-error.log.
  • Cleanup CronJob manually triggered: correctly identified a dead pod's orphan dir and held it in the 10-min grace window (skipping ...-h26lm: only 70s old), reaped 0, exited clean.
  • Old vfs-generation pods are draining naturally as their jobs complete; all new jobs land on overlay2 pods.
  • Node conditions: no DiskPressure/MemoryPressure; root disk steady at 46G/70G.

Unrelated pre-existing issue observed during verification

The artifact-keeper-prod backend is crashlooping with Error: Migration(VersionMissing(114)). ArgoCD selfHeal keeps reverting the backend to the git-declared 1.2.0 image, while the prod DB has already been migrated forward by dev digests (sync count ~9178, deployment revision 9028). The ak-mesh-* namespaces have shown the same class of failure, VersionMismatch(81), for 9 days. This predates and is independent of this change: this PR only touches the arc-runners scale sets, which ArgoCD does not manage. It needs its own issue: either pin prod to a compatible image or stop the selfHeal/image-updater tug-of-war.

🤖 Generated with Claude Code
