fix(arc): move dind docker root to /home with overlay2 (disk-pressure hazard) by brandonrc · Pull Request #167 · artifact-keeper/artifact-keeper-iac

brandonrc · 2026-06-12T04:22:56Z

Problem

The dind sidecars in all three runner scale sets run dockerd --storage-driver=vfs with no volume mounted at /var/lib/docker. Under vfs every image layer is a full directory copy, and it lands in the container's writable layer under the crio graph root on the 70G root disk (~25G free). With maxRunners: 20 on ak-ci-runners doing concurrent docker builds, this can exhaust the root disk and trigger kubelet disk-pressure eviction of prod pods on the rocky node.
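To make the hazard concrete, here is a rough sketch of how vfs storage grows compared with overlay2. The layer sizes are hypothetical, and the model ignores deletions and metadata overhead; the only figures taken from this PR are the ~25G free on the root disk and maxRunners: 20.

```python
# Rough storage model, assuming each layer only adds data (no deletions).
MIB = 1024 ** 2
GIB = 1024 ** 3


def vfs_bytes(layer_deltas):
    """vfs keeps a full copy of the whole filesystem for every layer."""
    total = 0
    cumulative = 0
    for delta in layer_deltas:
        cumulative += delta   # size of the filesystem at this layer
        total += cumulative   # vfs stores that entire tree again
    return total


def overlay2_bytes(layer_deltas):
    """overlay2 stores only what each layer changed."""
    return sum(layer_deltas)


# Hypothetical image: four layers of 100, 200, 50 and 32 MiB of changes.
layers = [100 * MIB, 200 * MIB, 50 * MIB, 32 * MIB]

vfs_total = vfs_bytes(layers)            # 1132 MiB
overlay_total = overlay2_bytes(layers)   # 382 MiB

# Figures from the PR description: ~25G free, up to 20 concurrent runners.
per_runner_budget_gib = (25 * GIB / 20) / GIB   # 1.25 GiB per runner

if __name__ == "__main__":
    print(f"vfs: {vfs_total / MIB:.0f} MiB, overlay2: {overlay_total / MIB:.0f} MiB")
    print(f"root-disk budget per runner: {per_runner_budget_gib:.2f} GiB")
```

Under this model a single modest image already uses most of the 1.25 GiB that each of 20 runners could claim on the root disk with vfs, while overlay2 stays at the sum of the layer deltas.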

Fix

For ak-ci-runners, ak-e2e-runners, and ak-beefy-runners:

Mount hostPath /home/runner-dind at /var/lib/docker with subPathExpr: $(POD_NAME) — per-pod isolation, since concurrent dockerds cannot share a docker root. /home is 1.8T xfs, ftype=1 (d_type OK for overlay2).
Switch dockerd to --storage-driver=overlay2. vfs was only required because /var/lib/docker previously sat on the crio overlay layer and the RHEL 8.10 4.18 kernel forbids overlay-on-overlay. On a plain xfs directory overlay2 works — verified with a manual overlay mount + copy-up test on /home on the node.
New e2e/runner-dind-cleanup.yaml: CronJob (every 15 min, concurrencyPolicy: Forbid) reaping /home/runner-dind subdirs that have no matching live pod in arc-runners AND are older than 10 min (grace window). Ephemeral runner pods churn constantly and preStop hooks are not guaranteed to run. Fail-safe: set -euo pipefail aborts before any deletion if the API call fails. Uses ak-runner-rust (already on the node, ships curl+jq) with a dedicated SA limited to pods get/list.
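The reaping rule above can be sketched as a pure function. This is an illustration only, not the script shipped in e2e/runner-dind-cleanup.yaml (which the description says uses curl+jq under set -euo pipefail). The names `DindDir` and `select_orphans` are hypothetical, and measuring age from the directory mtime is an assumption; the PR only says "older than 10 min".

```python
from dataclasses import dataclass

GRACE_SECONDS = 10 * 60  # grace window from the PR description


@dataclass(frozen=True)
class DindDir:
    name: str     # subdir under /home/runner-dind, equal to the pod name
    mtime: float  # epoch seconds (assumption: age is measured from mtime)


def select_orphans(dirs, live_pods, now, grace=GRACE_SECONDS):
    """Return names of subdirs that are safe to reap.

    A dir is reaped only if no live pod has that name AND it is older
    than the grace window. If the pod list could not be fetched
    (live_pods is None), raise before selecting anything, mirroring the
    fail-safe behaviour described for the CronJob.
    """
    if live_pods is None:
        raise RuntimeError("pod list unavailable; refusing to delete anything")
    live = set(live_pods)
    return [
        d.name
        for d in dirs
        if d.name not in live and (now - d.mtime) > grace
    ]


if __name__ == "__main__":
    now = 10_000.0
    dirs = [
        DindDir("pod-live", now - 3600),  # still running: keep
        DindDir("pod-young", now - 70),   # dead but only 70s old: keep
        DindDir("pod-old", now - 3600),   # dead and past the grace window
    ]
    print(select_orphans(dirs, ["pod-live"], now))
```

Keeping the decision separate from the deletion makes the two safety properties (live-pod match and grace window) easy to check in isolation.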

Permission note (lesson from the SCCACHE_ERROR_LOG outage): the DirectoryOrCreate hostPath is root:755, but dockerd runs as root in a privileged container, so it's writable.
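A simplified model of that permission check, for readers comparing this with the sccache outage. It ignores the group class, ACLs, and SELinux, and the UID 1001 below is a hypothetical unprivileged runner UID, not a value taken from this repo.

```python
import stat


def can_create_in(dir_mode, dir_uid, uid):
    """Can `uid` create a file in a directory with this mode and owner?

    Simplified: root is assumed to bypass permission bits (true for a
    privileged container that keeps CAP_DAC_OVERRIDE); everyone else
    needs write and execute bits on the directory. Group bits are
    ignored in this sketch.
    """
    if uid == 0:
        return True
    if uid == dir_uid:
        needed = stat.S_IWUSR | stat.S_IXUSR
    else:
        needed = stat.S_IWOTH | stat.S_IXOTH
    return (dir_mode & needed) == needed


if __name__ == "__main__":
    print(can_create_in(0o755, 0, 0))     # dockerd as root on a root:755 dir
    print(can_create_in(0o755, 0, 1001))  # unprivileged process, same dir
    print(can_create_in(0o777, 0, 1001))  # unprivileged process, 777 subdir
```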

Why not emptyDir + overlay2

emptyDir lives under /var/lib/kubelet/pods — still the root disk — and a sizeLimit breach evicts the pod mid-job. The hostPath approach moves all heavy build I/O onto the 1.4T-free /home disk.

Depends on / includes PR #163

⚠️ This branch is based on fix/sccache-cache-size (#163), which is live in prod as helm revision 26 of ak-ci-runners. It includes those commits (SCCACHE_CACHE_SIZE=50G, SCCACHE_ERROR_LOG into the 777 subdir). Merge #163 first, or merge this PR, which carries both. A values file based on main would revert the sccache fix and re-break every compile job.

Validation

helm upgrade --dry-run=server with chart 0.13.1 renders cleanly (overlay2 args, POD_NAME downward API env, subPathExpr mount, hostPath volume, sccache env intact).
overlayfs mount + copy-up tested directly on /home (xfs ftype=1) on the rocky node.
End-to-end verification (runner pod, docker info, in-pod docker build, layer location on /home, sccache --show-stats) performed after deploy — evidence in PR comments / ops log.

🤖 Generated with Claude Code

The shared sccache hostPath was capped at sccache's 10 GiB default (SCCACHE_CACHE_SIZE unset) and verified at exactly 10G on 2026-06-12, so it was continuously evicting entries and degrading hit rates. Baseline measurement: ~30 cold Rust builds/week at ~12m penalty each, ~22 runner-hours/week lost to cache churn across GHA quota + local sccache eviction. Also adds SCCACHE_ERROR_LOG so compile-cache failures are observable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

/cache (hostPath root) is root:755; only the subdirs get chmod 777 from init-cache. sccache could not create /cache/sccache-error.log, the server died at startup, and every compile job on ak-ci-runners failed with 'Timed out waiting for server startup' (~04:00Z 2026-06-12). Coverage stayed green only because it sets RUSTC_WRAPPER=''.

…disk

The dind sidecars ran dockerd --storage-driver=vfs with /var/lib/docker unmounted, so every image layer was a full directory copy landing in the container writable layer under the crio graph root on the 70G root disk (~25G free). With up to 20 concurrent ak-ci-runners doing docker builds this risks root-disk exhaustion and kubelet disk-pressure eviction of prod pods on the rocky node.

Fix, applied to all three scale sets (ci, e2e, beefy):

- Mount hostPath /home/runner-dind at /var/lib/docker with subPathExpr $(POD_NAME) for per-pod isolation (concurrent dockerds cannot share a docker root). /home is 1.8T xfs with ftype=1.
- Switch dockerd to --storage-driver=overlay2. vfs was only needed because /var/lib/docker previously sat on the crio overlay layer and the RHEL 8 4.18 kernel forbids overlay-on-overlay; on a plain xfs directory overlay2 works (verified with a manual overlay mount on /home on the rocky node).
- Add e2e/runner-dind-cleanup.yaml: a CronJob (every 15 min) that reaps /home/runner-dind subdirs with no matching live pod in arc-runners and older than 10 min, since ephemeral runner pods churn constantly and preStop hooks are not guaranteed to run. Fail-safe: an API error aborts the run before any deletion.

dockerd runs as root in a privileged container, so the root:755 DirectoryOrCreate hostPath is writable (cf. the SCCACHE_ERROR_LOG permissions outage - sccache runs unprivileged, dockerd does not).

Based on fix/sccache-cache-size (PR #163), which is live as helm revision 26; basing on main would revert the sccache fix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-06-12T04:23:05Z

Missing linked issue

This PR does not reference a tracking issue in its body. Every PR must link to an issue in this repository so we can trace work back to a planned change.

How to fix

Edit the PR description and add a line like Closes #123, Fixes #123, or Resolves #123 referring to an open issue in artifact-keeper/artifact-keeper-iac.
Save the description. This check will re-run automatically.

Accepted keywords (case-insensitive, any tense): close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved.
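For anyone unsure whether a description will pass, the keyword rule can be approximated with a regular expression. This is a guess at the matching behaviour based only on the keyword list above; the actual check may be stricter or looser (for example about punctuation between the keyword and the issue number), and `linked_issues` is a hypothetical helper name.

```python
import re

# Keywords from the bot comment: close/closes/closed, fix/fixes/fixed,
# resolve/resolves/resolved, followed by whitespace and "#<number>".
LINK_RE = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)\b",
    re.IGNORECASE,
)


def linked_issues(body):
    """Return the issue numbers referenced with a closing keyword."""
    return [int(n) for n in LINK_RE.findall(body)]


if __name__ == "__main__":
    print(linked_issues("Closes #123"))
    print(linked_issues("Depends on / includes PR #163"))
```

Under this approximation a bare mention such as "includes PR #163" does not count as a linked issue, which matches the bot flagging this PR.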

Policy reference: see the PR template.

Maintainer bypass: apply the no-issue-required label to this PR to skip the check (use sparingly, e.g. for trivial typo fixes or release-tag chores).

brandonrc · 2026-06-12T04:30:01Z

Deployed and verified on the rocky node (2026-06-12 ~00:25 EDT)

Helm revisions (all chart 0.13.1): ak-ci-runners 27, ak-e2e-runners 11, ak-beefy-runners 6. Rollback point for ci: revision 26.

Evidence

All three AutoscalingRunnerSet CRs accepted by the 0.13.x controller (listeners 1/1 Running, new runner pods scheduling) and carry --storage-driver=overlay2, subPathExpr: $(POD_NAME), hostPath /home/runner-dind.

New-spec runner pod docker info:

Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true

In-pod docker build (alpine + 32M layer) + docker run succeeded; layers confirmed on the host at /home/runner-dind/<pod-name>/overlay2/ on /dev/mapper/rl00-home (1.3T free), not the root disk.
sccache --show-stats exits 0 in the new runner container (the fix from PR #163, "fix(arc): raise sccache cache size to 50G on ak-ci-runners", is intact; render shows SCCACHE_CACHE_SIZE=50G + SCCACHE_ERROR_LOG=/cache/sccache/sccache-error.log).
Cleanup CronJob manually triggered: correctly identified a dead pod's orphan dir and held it in the 10-min grace window (skipping ...-h26lm: only 70s old), reaped 0, exited clean.
Old vfs-generation pods are draining naturally as their jobs complete; all new jobs land on overlay2 pods.
Node conditions: no DiskPressure/MemoryPressure; root disk steady at 46G/70G.
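The docker info check in the evidence above could be automated along these lines. This is a sketch, not part of the PR: it parses the plain-text `docker info` lines quoted above, and `parse_docker_info` and `storage_ok` are hypothetical helper names.

```python
def parse_docker_info(text):
    """Parse 'Key: value' lines from plain-text `docker info` output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def storage_ok(fields):
    """True when dockerd uses overlay2 on xfs with d_type support."""
    return (
        fields.get("Storage Driver") == "overlay2"
        and fields.get("Backing Filesystem") == "xfs"
        and fields.get("Supports d_type") == "true"
    )


# The four lines reported from the new-spec runner pod.
SAMPLE = """Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
"""

if __name__ == "__main__":
    print(storage_ok(parse_docker_info(SAMPLE)))
```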

Unrelated pre-existing issue observed during verification

artifact-keeper-prod backend is crashlooping with Error: Migration(VersionMissing(114)). ArgoCD selfHeal keeps reverting the backend to the git-declared 1.2.0 while the prod DB has been migrated forward by dev digests (sync count ~9178, deployment revision 9028). The ak-mesh-* namespaces have shown the same class of failure, VersionMismatch(81), for 9 days. This predates and is independent of this change (this PR only touches arc-runners scale sets, which ArgoCD does not manage). Needs its own issue: either pin prod to a compatible image or stop the selfHeal/image-updater tug-of-war.

🤖 Generated with Claude Code

brandonrc and others added 3 commits June 11, 2026 23:32

brandonrc requested a review from a team as a code owner June 12, 2026 04:22

brandonrc wants to merge 3 commits into main from fix/dind-overlay2-on-home
