Summary
When amazon-efs-mount-watchdog detects a dead efs-proxy/stunnel tunnel, it cleans up state files but never calls umount on the kernel NFS mount. This leaves orphaned hard NFS mounts with no server behind them. When the port is reused and a new pod mounts the same EFS filesystem, a second hard NFS mount is created on the dead port. The kernel NFS client then blocks indefinitely in rpc_wait_bit_killable, which causes kubelet goroutines to freeze when they call stat() during orphan cleanup — resulting in a full kubelet freeze lasting 58 minutes in our case.
Environment
| Field | Value |
| --- | --- |
| efs-utils version | 3.1.1 (efs-csi-driver v2.3.0) |
| OS | Bottlerocket 1.56.0 (aws-k8s-1.33-fips) |
| Kernel | 6.1.161 |
| Kubernetes | v1.33.7-eks-ac2d5a0 |
| Region | eu-central-1 |
Root Cause Chain
Bug 1 (Primary): watchdog never calls umount
check_efs_mounts() in src/watchdog/__init__.py has two cleanup paths:
- Grace period expired → clean_up_mount_state() → sends SIGTERM to efs-proxy → deletes state files. No umount.
- TLS tunnel not running → send_signal_to_running_stunnel_process_group() → logs warning. No umount.
The watchdog assumes the NFS mount is already gone because it disappeared from /proc/mounts. But under kubelet/systemd cgroup teardown stress, there is a race: the NFS mount can transiently disappear from /proc/mounts (or the watchdog polls at the wrong moment during a partial unmount), causing the watchdog to declare it gone and clean up state — while the kernel mount is still live.
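To make the race concrete, below is a minimal sketch of the kind of single-snapshot check the watchdog effectively relies on (the helper name mount_is_present is ours, not from efs-utils). A False result from one read of /proc/mounts cannot distinguish "really unmounted" from "transiently missing during cgroup teardown", which is exactly the window in which state gets cleaned up while the kernel mount survives.

```python
def mount_is_present(mountpoint: str) -> bool:
    """Return True if mountpoint appears in the current /proc/mounts snapshot.

    Note: a single snapshot is not authoritative. During kubelet/systemd
    teardown the entry can be transiently absent even though the kernel
    mount is still live, so False here does not prove the mount is gone.
    """
    with open("/proc/mounts") as mounts:
        for line in mounts:
            fields = line.split()
            # fields: device, mountpoint, fstype, options, dump, pass
            if len(fields) >= 2 and fields[1] == mountpoint:
                return True
    return False
```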
Evidence from mount-watchdog.log (1431 lines, zero umount entries):
```
00:11:27 WARNING - TLS tunnel is not running for fs-0672e82db7ffd86e9...mount.20370
00:11:27 INFO - TLS tunnel: 279 is no longer running, cleaning up state
# [no umount — 230 "cleaning up state" events, 0 umount calls in entire log]
00:20:50 INFO - No mount found for "...6cf20ec0...mount.20361"
00:21:20 INFO - Unmount grace period expired for ...6cf20ec0...mount.20361
00:21:22 WARNING - TLS tunnel is not running for ...6cf20ec0...mount.20361
# [port 20361 kernel NFS mount remains active — no umount called]
```
Bug 2 (Amplifier): Port reuse races against orphaned kernel mount
After the watchdog removes the state file for port 20361, the port allocator considers the port free and reassigns it to a new pod. The new efs-proxy cannot bind because the orphaned kernel NFS mount still holds the port (errno=101 EADDRNOTAVAIL).
Evidence from mount.log:
```
00:20:13 efs-proxy started on port 20361 for pod 6cf20ec0 → mount succeeds
[watchdog cleans up state at 00:20:50 — no umount — port 20361 kernel mount still live]
00:25:05 port 20361 reallocated for new pod b7f4edea
00:25:06 ERROR - Failed to start tunnel: "Failed to bind 127.0.0.1:20361" (errno=101)
```
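One way to close this window is for the port allocator to cross-check /proc/mounts before handing a port out again. A minimal sketch under that assumption; the helper name is hypothetical and not existing efs-utils code:

```python
import re

def port_held_by_kernel_nfs_mount(port: int) -> bool:
    """Return True if some NFS mount to 127.0.0.1 still carries port=<port>
    in its options, i.e. the port is not safe to reassign to a new efs-proxy."""
    with open("/proc/mounts") as mounts:
        for line in mounts:
            fields = line.split()
            if len(fields) < 4 or not fields[2].startswith("nfs"):
                continue
            options = fields[3]
            if "addr=127.0.0.1" in options and re.search(rf"\bport={port}\b", options):
                return True
    return False
```

An allocator that skips any port for which this returns True would have left 20361 alone until the orphaned mount was actually removed.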
Bug 3 (Amplifier): mount.nfs4 called even after efs-proxy bind failure
After efs-proxy panics with errno=101, the EFS mount helper still proceeds to call mount.nfs4 with the dead port. This creates a new hard NFS mount pointing at a port with no server behind it.
Evidence from mount.log:
```
ERROR - Failed to start tunnel (errno=101), stderr="Failed to bind 127.0.0.1:20361"
INFO - Executing: "/sbin/mount.nfs4 127.0.0.1:data/35bb13525aa542d595369f5f15c94439
       /var/lib/kubelet/pods/b7f4edea-.../mount
       -o rw,nfsvers=4.1,...,hard,...,port=20361" with 15 sec time limit.
```
Bug 4 (Consequence): Kubelet freezes on stat() of hard NFS mount with dead server
Kubelet's orphan cleanup goroutine calls stat() on the orphaned mount path. This blocks indefinitely in the kernel (rpc_wait_bit_killable). Two kubelet worker threads become permanently blocked, starving the pod worker loop.
Evidence from kernel journal:
```
00:29:46 nfs: server 127.0.0.1 not responding, timed out
[129 total NFS timeout messages over 54 minutes: 00:29:46 → 01:24:29 UTC]
```
D-state kernel thread:
```
PID 200716: [127.0.0.1-manager] state=D wchan=rpc_wait_bit_killable
```
Kubelet freeze window:
- Last kubelet log before freeze: 00:26:57 UTC
- Kubelet recovered after: ~01:25 UTC (after umount -l + systemctl restart kubelet)
- Freeze duration: ~58 minutes
- Pods stuck Terminating: ~65
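For diagnosis, hung tasks can be located without touching the mount path itself (a stat() from a shell would block exactly like kubelet did). A small sketch that reads only /proc/<pid>/stat and /proc/<pid>/wchan, assuming the host kernel populates wchan:

```python
import os

def find_tasks_stuck_in_nfs_rpc():
    """Return (pid, comm) for tasks in uninterruptible sleep (state D) whose
    wait channel is rpc_wait_bit_killable. Reads only /proc, never the mount."""
    stuck = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                stat = f.read()
            # comm can contain spaces/parens; the state field follows the last ')'
            state = stat.rsplit(")", 1)[1].split()[0]
            if state != "D":
                continue
            with open(f"/proc/{pid}/wchan") as f:
                wchan = f.read().strip()
            if wchan == "rpc_wait_bit_killable":
                comm = stat.split("(", 1)[1].rsplit(")", 1)[0]
                stuck.append((int(pid), comm))
        except OSError:
            continue  # task exited while we were scanning
    return stuck
```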
Orphaned NFS mount (captured before fix)
```
127.0.0.1:/data/35bb13525aa542d595369f5f15c94439
  /var/lib/kubelet/pods/b7f4edea-.../volumes/kubernetes.io~csi/b96d29e6-.../mount
  nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,
  noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,
  clientaddr=10.208.27.139,local_lock=none,addr=127.0.0.1,port=20361
```
Fix Applied
```
umount -l /var/lib/kubelet/pods/b7f4edea-.../volumes/kubernetes.io~csi/b96d29e6-.../mount
systemctl restart kubelet
# All 65 stuck pods self-cleared within minutes.
```
Recommended Fixes
Fix 1 (Critical): watchdog must call umount before cleaning up state
In src/watchdog/__init__.py, clean_up_mount_state() and the "TLS tunnel not running" path must call umount -l <mountpoint> before removing state files:
```python
# In clean_up_mount_state() and cleanup_mount_state_if_stunnel_not_running():
# Before removing state files, unmount the kernel NFS mount if it still exists.
mountpoint = state.get("mountpoint")
if mountpoint and os.path.ismount(mountpoint):
    try:
        subprocess.run(["umount", "-l", mountpoint], check=True, timeout=30)
        logging.info("Unmounted orphaned NFS mount at %s", mountpoint)
    except Exception as e:
        logging.warning("Failed to unmount %s: %s", mountpoint, e)
```
Fix 2 (Critical): Abort mount if efs-proxy fails to bind
In the mount helper, if efs-proxy exits with errno=101 (EADDRNOTAVAIL), abort the mount and return an error to the CSI NodePublishVolume call. Do not proceed to call mount.nfs4 with a dead port.
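A minimal sketch of the guard; MountError and ensure_proxy_running_or_abort are hypothetical names, and a real change would fold this into the mount helper's existing bootstrap path:

```python
class MountError(Exception):
    """Raised so the failure propagates to the CSI NodePublishVolume call."""

def ensure_proxy_running_or_abort(proxy_proc, port: int) -> None:
    """Call immediately before invoking mount.nfs4. If efs-proxy has already
    exited (e.g. the bind failure above), fail the mount instead of creating
    a hard NFS mount on a port nothing listens on."""
    rc = proxy_proc.poll()  # None means the proxy is still running
    if rc is not None:
        raise MountError(
            f"efs-proxy for 127.0.0.1:{port} exited with code {rc} before mount; "
            "refusing to run mount.nfs4 against a dead port"
        )
```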
Fix 3 (Defense-in-depth): Scan for orphaned 127.0.0.1 NFS mounts on startup
On efs-csi-node pod startup, scan /proc/mounts for NFS mounts with addr=127.0.0.1 where no live efs-proxy is listening on the corresponding port. Unmount any found.
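A sketch of such a startup sweep, assuming "no live efs-proxy" can be approximated by "nothing accepts a TCP connection on the mount's port"; the function name is hypothetical:

```python
import socket
import subprocess

def clean_orphaned_localhost_nfs_mounts() -> None:
    """Lazily unmount any 127.0.0.1 NFS mount whose proxy port has no listener."""
    with open("/proc/mounts") as mounts:
        entries = [line.split() for line in mounts]
    for fields in entries:
        if len(fields) < 4 or not fields[2].startswith("nfs"):
            continue
        mountpoint, options = fields[1], fields[3]
        if "addr=127.0.0.1" not in options:
            continue
        port = next((int(o[len("port="):]) for o in options.split(",")
                     if o.startswith("port=")), None)
        if port is None:
            continue
        try:
            with socket.create_connection(("127.0.0.1", port), timeout=1):
                continue  # something is listening; leave the mount alone
        except OSError:
            pass  # no listener behind the mount: treat it as orphaned
        subprocess.run(["umount", "-l", mountpoint], check=False, timeout=30)
```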
Fix 4 (Defense-in-depth): NodeUnpublishVolume must verify mount is gone before returning success
After calling umount, verify the path is no longer in /proc/mounts before returning success to kubelet.
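A minimal sketch of the verification step (hypothetical helper, not the driver's actual code):

```python
import subprocess

def unpublish_and_verify(mountpoint: str) -> None:
    """Unmount and only report success once the path is gone from /proc/mounts;
    raising here makes kubelet retry NodeUnpublishVolume instead of forgetting
    a still-live mount."""
    subprocess.run(["umount", mountpoint], check=True, timeout=30)
    with open("/proc/mounts") as mounts:
        for line in mounts:
            fields = line.split()
            if len(fields) >= 2 and fields[1] == mountpoint:
                raise RuntimeError(f"{mountpoint} still in /proc/mounts after umount")
```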
Intermittency
This is triggered by the combination of:
- High pod churn on a node (many EFS mounts/unmounts in a short window)
- Kubelet/systemd cgroup teardown stress (slow slice removal)
- Watchdog polling at the wrong moment during a partial unmount
Under normal conditions the race window is narrow. Under load it opens up reliably.
Related Issues