sysbox-fs hangs in fuse_flush (D-state hung_task) — wedges host with ~50 stuck runc tasks [Debian 13, kernel 6.12] #1018

@sscoxx

Summary

On a host running Coder workspaces as unprivileged Docker-in-Docker via Sysbox, sysbox-fs got stuck in fuse_flush (kernel hung_task), wedging the whole host: load average ~50 while the CPU was ~100% idle and RAM was free. About 50 processes were stuck in uninterruptible sleep (D state) and could not be killed (not even with kill -9). systemctl restart docker also hung in D. Only a full host reboot recovered it, and even the orderly shutdown spent ~6 minutes waiting on the D-state tasks before it forced the reboot.

Environment

  • sysbox-ce / sysbox-runc 0.7.0 (commit a4dd414f7b9b7455c0fbf0d5e5db7bcfe30645bc, built 2026-03-03)
  • OS: Debian GNU/Linux 13 (trixie)
  • Kernel: 6.12.90+deb13.1-amd64 (idmapped mounts, no shiftfs)
  • Docker 29.4.2, runc 1.3.5; sysbox-runc registered as a runtime in daemon.json
  • VM, 8 vCPU / ~23 GiB RAM, dedicated to Coder workspaces (DinD, unprivileged)

I'm aware Debian 13 + kernel 6.12 is outside the officially supported matrix — flagging in case it's relevant to the FUSE path.

Symptom — kernel hung_task in fuse_flush

sysbox-fs (tgid 886, the daemon) and then every runc:[INIT|PARENT|CHILD] trying to start/operate workspaces blocked for >120s, all parked on fuse_flush:

INFO: task sysbox-fs:<pid> blocked for more than 120 seconds.
task:sysbox-fs       state:D stack:0     pid:<pid> tgid:886   ppid:1   flags:0x00000002
Call Trace:
 __schedule+0x505/0xc00
 schedule+0x27/0xf0
 fuse_flush+0xe8/0x1e0
 ? __lruvec_stat_mod_folio+0x83/0xd0
 ? __folio_mod_stat+0x26/0x80

INFO: task runc:[0:PARENT]:<pid> blocked for more than 120 seconds.
Call Trace:
 __schedule+0x505/0xc00
 schedule+0x27/0xf0
 fuse_flush+0xe8/0x1e0
 ...

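The same set of tasks was re-reported at 120 / 241 / 362 s, until the kernel printed "Future hung task reports are suppressed". Note that the frames directly below fuse_flush are folio memory-accounting symbols (__lruvec_stat_mod_folio / __folio_mod_stat); both carry the "?" marker, so the unwinder treats them as unreliable and they may be stale stack contents rather than part of the blocking call chain.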
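To make the repeated reports easier to summarise, here is a minimal sketch of how the hung_task header lines could be tallied per task name from a saved dmesg dump. The function name `tally_hung_tasks` is hypothetical, and the pids in the usage example are made up (the real ones are elided above as <pid>).

```python
import re

# Matches a hung_task header such as
#   "INFO: task runc:[0:PARENT]:1234 blocked for more than 120 seconds."
# The task name can contain colons itself, so the pid is taken as the last
# ":<digits>" immediately before " blocked".
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def tally_hung_tasks(log_text):
    """Return {task name: set of pids} for every hung_task report in log_text."""
    tasks = {}
    for line in log_text.splitlines():
        match = HUNG_RE.search(line)  # search() tolerates a "[timestamp]" prefix
        if match:
            tasks.setdefault(match.group("comm"), set()).add(int(match.group("pid")))
    return tasks
```

Usage would look like `tally_hung_tasks(open("dmesg.txt").read())`, where dmesg.txt is an assumed file name for the saved kernel log. A task reported again at 241 s and 362 s is counted once, since pids are stored in a set.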

Impact

  • Host effectively unusable: ~50 tasks in D, mostly runc:[0:PARENT] / runc:[1:CHILD] (spawned by containerd to start/exec workspaces) plus coder stat disk.
  • CPU ~100% idle, RAM free, iowait 0 — the load (~50) was entirely D-state tasks, so it was invisible to CPU/memory monitoring.
  • systemd restarted the sysbox-fs daemon, but that did not free the already-stuck tasks (they stayed attached to the wedged FUSE mount).
  • systemctl restart docker / service docker restart also hung in D.
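Since CPU and memory graphs showed nothing, a check that counts D-state tasks directly would have caught this earlier. Below is a rough sketch, assuming a Linux /proc layout; the function names `parse_stat` and `d_state_tasks` are hypothetical, and this is not something sysbox or Coder ships.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of a /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx])
    comm = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, comm, state


def d_state_tasks(proc_root="/proc"):
    """List (tid, comm) for every thread currently in uninterruptible sleep."""
    stuck = []
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    task_id, comm, state = parse_stat(f.read())
            except (OSError, ValueError):
                continue  # thread vanished or the file was unreadable
            if state == "D":
                stuck.append((task_id, comm))
    return stuck
```

Alerting when `len(d_state_tasks())` stays above some threshold for a few minutes would flag this kind of wedge; the threshold is a judgment call, since short-lived D states are normal during I/O.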

Recovery

Only a full host reboot cleared the D-state processes.

Reproducibility

No deterministic repro — it happened under normal workspace usage (starting/operating a DinD workspace) after ~14 days of uptime. It has not recurred since the reboot.

Questions

  1. Is this a known sysbox-fs deadlock in fuse_flush on kernel 6.12 / Debian 13 (or generally outside the supported matrix)?
  2. Any recommended mitigation (specific kernel version, sysbox-fs mount/config option) short of changing the host distro/kernel?
  3. Anything specific worth capturing if it recurs (it's intermittent)?

Happy to provide more detail — full dmesg, ps -eo pid,stat,wchan,args of the stuck tasks, docker info, etc.
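One extra thing I could capture on recurrence is the per-connection `waiting` counter exposed by the FUSE control filesystem, which the kernel documentation describes as the number of requests waiting to be sent to userspace or still being processed by the daemon. A minimal sketch follows, assuming fusectl is mounted at /sys/fs/fuse/connections and readable by the user running it (I have not verified either on this host); `fuse_waiting_counts` is a hypothetical helper name.

```python
import os


def fuse_waiting_counts(conn_root="/sys/fs/fuse/connections"):
    """Return {connection id: waiting request count} from the fusectl filesystem."""
    counts = {}
    try:
        conn_ids = os.listdir(conn_root)
    except OSError:
        return counts  # fusectl not mounted or not readable
    for conn_id in conn_ids:
        try:
            with open(os.path.join(conn_root, conn_id, "waiting")) as f:
                counts[conn_id] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # connection vanished or belongs to another user
    return counts
```

A connection whose count stays non-zero with no filesystem activity would point at the wedged sysbox-fs mount, which could be attached alongside the dmesg and ps output.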
