
SCHED-957: isolate Docker containers from slurmd cgroup to prevent pod OOM kills#2268

Draft
Uburro wants to merge 2 commits into main from SCHED-957/0

Conversation


@Uburro Uburro commented Mar 6, 2026

Problem

When a user runs Docker inside a Slurm job, Docker containers are spawned as children of dockerd and inherit dockerd's cgroup, which is the pod-level cgroup, not the Slurm job cgroup. As a result:

  • Container memory usage bypasses Slurm's per-job memory.max constraint
  • All Docker memory accumulates in the pod cgroup alongside slurmd, sshd, and dockerd
  • When total pod memory exceeds the Kubernetes limit, kubelet OOM-kills the entire pod, taking slurmd down with it

Solution

A single dedicated Docker cgroup is created once at pod startup (in supervisord_entrypoint.sh) as a sibling of the main container cgroup:

Pod cgroup (K8s limit, e.g. 1600G)
├── /        ← slurmd, dockerd, sshd (unaffected by Docker OOMs)
└── docker/  ← ALL Docker containers from ALL jobs
             memory.max = REAL_MEMORY

memory.max on the Docker cgroup is set to REAL_MEMORY (in MiB) — the node's allocatable memory as configured in Slurm. This ensures Docker containers are OOM-killed at the job allocation boundary before the pod-level limit is reached, leaving slurmd with its intended headroom.
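The startup step can be sketched as a small shell function (a sketch only; the function name and parameterized paths are hypothetical, while the real logic lives in supervisord_entrypoint.sh and persists the parent path in /run/docker-cgroup-parent, assuming cgroup v2):

```shell
# setup_docker_cgroup ROOT MEM_MIB STATE_FILE
#   ROOT       - cgroup filesystem root (normally /sys/fs/cgroup)
#   MEM_MIB    - REAL_MEMORY, the node's allocatable memory in MiB
#   STATE_FILE - where to persist the cgroup parent for the docker wrapper
setup_docker_cgroup() {
    root="$1"; mem_mib="$2"; state_file="$3"

    # Create the docker/ cgroup as a sibling of the main container cgroup.
    mkdir -p "${root}/docker"

    # memory.max takes a byte count; REAL_MEMORY is in MiB.
    echo "$(( mem_mib * 1024 * 1024 ))" > "${root}/docker/memory.max"

    # Persist the parent path so the docker wrapper can inject it later.
    echo "/docker" > "${state_file}"
}
```

At pod startup this would be invoked once, along the lines of `setup_docker_cgroup /sys/fs/cgroup "${REAL_MEMORY}" /run/docker-cgroup-parent`.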

The existing docker.sh wrapper (installed as /usr/bin/docker) is extended to inject --cgroup-parent into docker run/create calls, so new containers are created under the dedicated Docker cgroup.
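The wrapper's argument rewriting can be sketched as follows (the helper name is hypothetical; the real wrapper is ansible/roles/docker-cli/files/docker.sh, which reads the parent path from /run/docker-cgroup-parent):

```shell
# docker_args_with_cgroup STATE_FILE CMD [ARGS...]
# Echo the docker CLI arguments, prepending --cgroup-parent for
# `run`/`create` when the state file written at pod startup exists.
docker_args_with_cgroup() {
    state_file="$1"; shift
    cmd="$1"
    if [ -r "$state_file" ] && { [ "$cmd" = "run" ] || [ "$cmd" = "create" ]; }; then
        parent="$(cat "$state_file")"
        shift
        echo "$cmd --cgroup-parent=$parent $*"
    else
        # Other subcommands (ps, logs, ...) pass through unchanged.
        echo "$*"
    fi
}
```

In the real wrapper the rewritten arguments would then be handed to the underlying docker binary.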

Testing

  • Verified manually

Release Notes

Fix: Docker containers launched inside Slurm jobs are now placed in a dedicated isolated cgroup (docker/) with memory.max set to the node's REAL_MEMORY value. This prevents Docker workload memory from causing pod-level OOM kills and slurmd restarts.


Copilot AI left a comment


Pull request overview

Creates an isolated cgroup for Docker containers spawned inside Slurm jobs to prevent Docker memory usage from accumulating in the pod-level cgroup and triggering kubelet OOM kills that take down slurmd.

Changes:

  • Inject REAL_MEMORY (MiB) into the slurmd container environment for cgroup memory sizing.
  • At worker startup, create a dedicated docker/ cgroup sibling and set its memory.max based on REAL_MEMORY.
  • Extend the /usr/bin/docker wrapper to inject --cgroup-parent so new containers are created under the dedicated Docker cgroup.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Changed files:

  • internal/render/worker/container.go: Exposes the REAL_MEMORY env var to the slurmd container based on rendered resource requests.
  • internal/consts/cgroup.go: Adds the EnvRealMemory constant used for the new env var name.
  • images/worker/supervisord_entrypoint.sh: Creates /sys/fs/cgroup/.../docker, configures memory.max, and persists the cgroup parent path for the docker wrapper.
  • ansible/roles/docker-cli/files/docker.sh: Injects --cgroup-parent into docker run/create calls based on /run/docker-cgroup-parent.


