fix(chart): raise backend.resources.limits.ephemeral-storage default to 16Gi #116

Open
dragonpaw wants to merge 2 commits into artifact-keeper:main from dragonpaw:fix/backend-ephemeral-storage-limit

dragonpaw commented May 21, 2026

Contributor

Summary

Bumps the backend pod's ephemeral-storage limit from 1Gi to 16Gi so streaming format handlers (Incus, large OCI layer pushes) don't get kubelet-evicted on >1 GiB uploads.

The bug

Kubernetes ephemeral-storage accounting includes the container writable layer, container logs, and all emptyDir volumes. The chart's /data/storage mount is an emptyDir with sizeLimit: 10Gi (when persistence.enabled: false), and any single artifact streamed to disk by the backend goes into that mount as a temp file before being finalized. Streaming-format handlers like the Incus monolithic + chunked upload paths build the artifact entirely on disk via tokio::fs calls (they don't route through StorageBackend::put_streaming), so the pod's writable usage tracks the full artifact size in flight.
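As a rough illustration of the accounting described above, the sketch below sums the three usage components and compares them with the limit. It is a simplified model, not kubelet code: `parse_quantity` and `would_evict` are hypothetical helpers, only binary suffixes are handled, and the writable-layer and log sizes are made-up placeholders.

```python
# Hypothetical model of the pod-level ephemeral-storage check.
# Function and parameter names are illustrative, not Kubernetes or chart APIs.

_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a binary-suffixed quantity such as '16Gi' to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count, no suffix


def would_evict(limit: str, writable_layer: str, logs: str, empty_dirs: list[str]) -> bool:
    """Return True when the summed local usage exceeds the ephemeral-storage limit."""
    used = parse_quantity(writable_layer) + parse_quantity(logs)
    used += sum(parse_quantity(size) for size in empty_dirs)
    return used > parse_quantity(limit)


# A 3.4 GiB upload staged in the emptyDir, plus small placeholder overheads.
print(would_evict("1Gi", "50Mi", "10Mi", ["3.4Gi"]))   # True
print(would_evict("16Gi", "50Mi", "10Mi", ["3.4Gi"]))  # False
```

Under this model the in-flight temp file alone is enough to cross a 1Gi limit, regardless of how small the writable layer and logs are.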

With the previous ephemeral-storage: 1Gi limit, any artifact >1 GiB triggers kubelet eviction:

Pod ephemeral local storage usage exceeds the total limit of containers 1Gi

The client sees a TLS EOF mid-stream; the backend logs nothing because the pod was killed before the write completed. The deployment recreates the pod and the next retry hits the same wall.

The previous default also disagreed with the chart's own persistence.size: 10Gi in the same block — anyone enabling persistence and assuming the limits matched would still hit the eviction.
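To make the mismatch concrete, here is a small sketch (not chart code) that computes how much of the limit is left for the writable layer and logs once the 10Gi emptyDir quoted above is full. The helper names `gib` and `headroom_gib` are hypothetical, and the sketch assumes both values are written with a `Gi` suffix.

```python
# Hypothetical headroom check. The quantities are the defaults quoted in this
# PR; the helper names are illustrative and not part of the chart.


def gib(quantity: str) -> float:
    """Parse a 'Gi'-suffixed quantity such as '10Gi' into a number of GiB."""
    if not quantity.endswith("Gi"):
        raise ValueError(f"expected a Gi quantity, got {quantity!r}")
    return float(quantity[:-2])


def headroom_gib(ephemeral_limit: str, empty_dir_size_limit: str) -> float:
    """GiB left for the writable layer and logs once the emptyDir is full."""
    return gib(ephemeral_limit) - gib(empty_dir_size_limit)


print(headroom_gib("1Gi", "10Gi"))   # -9.0
print(headroom_gib("16Gi", "10Gi"))  # 6.0
```

A negative result means the pod limit is hit long before the emptyDir's own `sizeLimit` could come into play.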

Reproduction

Stock-chart install with STORAGE_BACKEND set to any value (filesystem, gcs, s3). Create an Incus repo and PUT a >1 GiB file:

HTTP TLS EOF mid-stream
$ kubectl describe pod ak-backend-xxx | grep -A2 Reason:
  Reason: Evicted
  Message: Pod ephemeral local storage usage exceeds the total limit of containers 1Gi.

Post-fix: HTTP 201, artifact lands in storage backend, no eviction.

Validation

Self-hosted dev-shared-gke deploy. 3.4 GiB .tar.zst Incus upload:

  • Pre-fix: kubelet eviction within ~30s, TLS EOF on client.
  • Post-fix: HTTP 201 in 205s @ 17.8 MiB/s, artifact persisted, kubectl describe pod shows no eviction events.

Related

  • artifact-keeper/artifact-keeper#1296: accept .tar.zst Incus filenames (also needed to upload our Incus images, but a separate fix; this chart change is independently useful for OCI or any other large-artifact format).
  • artifact-keeper/artifact-keeper#1297: Incus handler uses a writable staging dir on GCS/S3 backends. The same scenario surfaces both of those bugs and the one fixed here; the three fixes can land in any order.

Note for reviewers

Longer-term, the chart could make this configurable per-format (Incus deployments need more headroom than a deploy serving only npm-sized artifacts), but defaulting closer to persistence.size is at minimum required for the chart's own opinionated storage layout to work end-to-end.


Closes #153

dragonpaw requested a review from a team as a code owner May 21, 2026 20:27
@brandonrc

Contributor

Thanks for the PR. The one required check failing here is Verify helm-docs output: the chart README needs regenerating after the values change. Could you run this from the repo root and commit the result?

cd charts/artifact-keeper && helm-docs

(helm-docs v1.14.2, matching .github/workflows/helm-docs.yml.) That should clear the check.

Heads up: the red SonarCloud Scan is a known non-blocking issue on fork PRs. GitHub withholds SONAR_TOKEN from forks and the job is already continue-on-error, so it is safe to ignore.

Ash Arnold added 2 commits June 3, 2026 16:37
…to 16Gi

Kubernetes `ephemeral-storage` accounting includes the container writable
layer, container logs, AND all emptyDir volumes (including
`/data/storage`, which `values.yaml` sets to `sizeLimit: 10Gi`). Streaming
format handlers (Incus, large OCI layer pushes) stage uploads on local
disk before finalizing the storage backend write. Any artifact larger
than the pod's `ephemeral-storage` limit triggers kubelet eviction:

    Pod ephemeral local storage usage exceeds the total limit of
    containers 1Gi

The client sees a TLS EOF mid-stream; AK never logs anything because the
pod was killed before the write completed. The deployment recreates the
pod and the cycle repeats on the next retry.

The previous default of `1Gi` was inconsistent with `persistence.size`
(also defaulted to `10Gi` in this same block) and with the chart's own
`scanWorkspace.size: 2Gi`. Bump to `16Gi` — matches the storage volume
sizing plus headroom for the writable layer and logs.

Reproduction: stock-chart install, GCS-backed storage, PUT a >1 GiB file
to any Incus repo. Pre-fix: TLS EOF, kubelet `Evicted`. Post-fix: HTTP
201, artifact persisted.

Validated on a self-hosted dev-shared-gke deploy: 3.4 GiB `.tar.zst`
upload completes in 205s @ 17.8 MiB/s.
dragonpaw force-pushed the fix/backend-ephemeral-storage-limit branch from 23d9ffd to 024ba5d June 3, 2026 23:38
@dragonpaw

Contributor Author

Regenerated the chart README via helm-docs (v1.14.2) as requested — the Verify helm-docs output check is green now. Also rebased onto the latest main and filed/linked a tracking issue (#153) so require-linked-issue passes. Ready for another look.

Development

Successfully merging this pull request may close these issues.

Backend ephemeral-storage limit of 1Gi evicts the pod on >1 GiB streaming uploads
