mount-watchdog cleans up state without calling umount, leaving orphaned hard NFS mounts that freeze kubelet #346

@nloke

Description

Summary

When amazon-efs-mount-watchdog detects a dead efs-proxy/stunnel tunnel, it cleans up state files but never calls umount on the kernel NFS mount. This leaves orphaned hard NFS mounts with no server behind them. When the port is reused and a new pod mounts the same EFS filesystem, a second hard NFS mount is created on the dead port. The kernel NFS client then blocks indefinitely in rpc_wait_bit_killable, which causes kubelet goroutines to freeze when they call stat() during orphan cleanup — resulting in a full kubelet freeze lasting 58 minutes in our case.

Environment

| Field | Value |
| --- | --- |
| efs-utils version | 3.1.1 (efs-csi-driver v2.3.0) |
| OS | Bottlerocket 1.56.0 (aws-k8s-1.33-fips) |
| Kernel | 6.1.161 |
| Kubernetes | v1.33.7-eks-ac2d5a0 |
| Region | eu-central-1 |

Root Cause Chain

Bug 1 (Primary): watchdog never calls umount

check_efs_mounts() in src/watchdog/__init__.py has two cleanup paths:

  1. Grace period expired → clean_up_mount_state() → sends SIGTERM to efs-proxy → deletes state files. No umount.
  2. TLS tunnel not running → send_signal_to_running_stunnel_process_group() → logs warning. No umount.

The watchdog assumes the NFS mount is already gone because it disappeared from /proc/mounts. But under kubelet/systemd cgroup teardown stress, there is a race: the NFS mount can transiently disappear from /proc/mounts (or the watchdog polls at the wrong moment during a partial unmount), causing the watchdog to declare it gone and clean up state — while the kernel mount is still live.
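One way to close this race is to require the mount to be absent across several consecutive reads of /proc/mounts before declaring it gone. A minimal sketch (the function names, `read_mounts` callable, and retry parameters are illustrative, not existing efs-utils code):

```python
import time

def mount_present(proc_mounts_text: str, mountpoint: str) -> bool:
    """True if the mountpoint appears in a /proc/mounts snapshot."""
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == mountpoint:
            return True
    return False

def confirmed_gone(read_mounts, mountpoint: str,
                   retries: int = 3, delay: float = 0.5) -> bool:
    """Declare a mount gone only if it stays absent across several reads,
    tolerating the transient /proc/mounts gap seen under cgroup teardown."""
    for _ in range(retries):
        if mount_present(read_mounts(), mountpoint):
            return False
        time.sleep(delay)
    return True
```

A single missed poll then no longer triggers state cleanup; the mount has to be absent for the full retry window.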

Evidence from mount-watchdog.log (1431 lines, zero umount entries):

00:11:27 WARNING - TLS tunnel is not running for fs-0672e82db7ffd86e9...mount.20370
00:11:27 INFO    - TLS tunnel: 279 is no longer running, cleaning up state
# [no umount — 230 "cleaning up state" events, 0 umount calls in entire log]

00:20:50 INFO    - No mount found for "...6cf20ec0...mount.20361"
00:21:20 INFO    - Unmount grace period expired for ...6cf20ec0...mount.20361
00:21:22 WARNING - TLS tunnel is not running for ...6cf20ec0...mount.20361
# [port 20361 kernel NFS mount remains active — no umount called]

Bug 2 (Amplifier): Port reuse races against orphaned kernel mount

After watchdog removes the state file for port 20361, the port allocator considers it free and reassigns it to a new pod. The new efs-proxy cannot bind because the kernel NFS mount still holds the port (errno=101 EADDRNOTAVAIL).
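A port allocator that consulted /proc/mounts as well as state files would avoid reissuing a port the kernel mount still pins. A sketch under that assumption (the function name and option parsing are illustrative, not the actual allocator code):

```python
def port_held_by_kernel_mount(proc_mounts_text: str, port: int) -> bool:
    """True if a live loopback NFS mount still pins this port via its
    mount options (addr=127.0.0.1,...,port=NNN), even though the
    watchdog has already deleted the corresponding state file."""
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2].startswith("nfs"):
            opts = fields[3].split(",")
            if "addr=127.0.0.1" in opts and f"port={port}" in opts:
                return True
    return False
```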

Evidence from mount.log:

00:20:13  efs-proxy started on port 20361 for pod 6cf20ec0 → mount succeeds
[watchdog cleans up state at 00:20:50 — no umount — port 20361 kernel mount still live]
00:25:05  port 20361 reallocated for new pod b7f4edea
00:25:06  ERROR - Failed to start tunnel: "Failed to bind 127.0.0.1:20361" (errno=101)

Bug 3 (Amplifier): mount.nfs4 called even after efs-proxy bind failure

After efs-proxy panics with errno=101, the EFS mount helper still proceeds to call mount.nfs4 with the dead port. This creates a new hard NFS mount pointing at a port with no server behind it.

Evidence from mount.log:

ERROR - Failed to start tunnel (errno=101), stderr="Failed to bind 127.0.0.1:20361"
INFO  - Executing: "/sbin/mount.nfs4 127.0.0.1:data/35bb13525aa542d595369f5f15c94439
         /var/lib/kubelet/pods/b7f4edea-.../mount
         -o rw,nfsvers=4.1,...,hard,...,port=20361" with 15 sec time limit.

Bug 4 (Consequence): Kubelet freezes on stat() of hard NFS mount with dead server

Kubelet's orphan cleanup goroutine calls stat() on the orphaned mount path. This blocks indefinitely in the kernel (rpc_wait_bit_killable). Two kubelet worker threads become permanently blocked, starving the pod worker loop.
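Kubelet is Go, but the hazard pattern is easy to illustrate in Python: an unguarded stat() on a dead hard mount blocks the calling thread forever. A timeout-guarded variant (a sketch, not kubelet code) keeps the control loop moving, though the stuck worker thread still leaks, since a D-state task cannot be killed:

```python
import concurrent.futures
import os

def stat_with_timeout(path: str, timeout: float = 5.0) -> os.stat_result:
    """Run os.stat in a worker thread so a D-state hang on a dead hard
    NFS mount cannot block the caller. On timeout the worker stays stuck
    in the kernel -- a mitigation for the control loop, not a cure for
    the orphaned mount itself."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(os.stat, path).result(timeout=timeout)
    finally:
        pool.shutdown(wait=False)  # do not wait for a possibly-hung worker
```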

Evidence from kernel journal:

00:29:46  nfs: server 127.0.0.1 not responding, timed out
[129 total NFS timeout messages over 54 minutes: 00:29:46 → 01:24:29 UTC]

D-state kernel thread:

PID 200716: [127.0.0.1-manager]  state=D  wchan=rpc_wait_bit_killable

Kubelet freeze window:

Last kubelet log before freeze:  00:26:57 UTC
Kubelet recovered after:         ~01:25 UTC (after umount -l + systemctl restart kubelet)
Freeze duration: ~58 minutes
Pods stuck Terminating: ~65

Orphaned NFS mount (captured before fix)

127.0.0.1:/data/35bb13525aa542d595369f5f15c94439
  /var/lib/kubelet/pods/b7f4edea-.../volumes/kubernetes.io~csi/b96d29e6-.../mount
  nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,
  noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,
  clientaddr=10.208.27.139,local_lock=none,addr=127.0.0.1,port=20361

Fix Applied

umount -l /var/lib/kubelet/pods/b7f4edea-.../volumes/kubernetes.io~csi/b96d29e6-.../mount
systemctl restart kubelet
# All 65 stuck pods self-cleared within minutes.

Recommended Fixes

Fix 1 (Critical): watchdog must call umount before cleaning up state

In src/watchdog/__init__.py, clean_up_mount_state() and the "TLS tunnel not running" path must call umount -l <mountpoint> before removing state files:

# In clean_up_mount_state() and cleanup_mount_state_if_stunnel_not_running():
# Before removing state files, unmount the kernel NFS mount if it still exists.
# (logging, os, and subprocess are already imported by the watchdog module.)
import logging
import os
import subprocess

mountpoint = state.get("mountpoint")
if mountpoint and os.path.ismount(mountpoint):
    try:
        subprocess.run(["umount", "-l", mountpoint], check=True, timeout=30)
        logging.info("Unmounted orphaned NFS mount at %s", mountpoint)
    except (subprocess.SubprocessError, OSError) as e:
        logging.warning("Failed to unmount %s: %s", mountpoint, e)

Fix 2 (Critical): Abort mount if efs-proxy fails to bind

In the mount helper, if efs-proxy exits with errno=101 (EADDRNOTAVAIL), abort the mount and return an error to the CSI NodePublishVolume call. Do not proceed to call mount.nfs4 with a dead port.
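A simple form of this guard is to probe the proxy port before handing it to mount.nfs4, since a hard mount against a dead port is what wedges the node. `proxy_listening` below is a hypothetical helper, not existing mount-helper code:

```python
import socket

def proxy_listening(port: int, timeout: float = 1.0) -> bool:
    """Probe whether anything accepts TCP connections on 127.0.0.1:port.
    If nothing does, calling mount.nfs4 against this port would create
    a hard NFS mount with no server behind it."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout):
            return True
    except OSError:
        return False
```

The mount helper would call this after starting efs-proxy and fail the NodePublishVolume call (letting kubelet retry) when it returns False.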

Fix 3 (Defense-in-depth): Scan for orphaned 127.0.0.1 NFS mounts on startup

On efs-csi-node pod startup, scan /proc/mounts for NFS mounts with addr=127.0.0.1 where no live efs-proxy is listening on the corresponding port. Unmount any found.
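A sketch of such a startup sweep (the function name and parsing are illustrative, and a real driver would also need to avoid racing an in-flight mount whose proxy has not started listening yet):

```python
import socket

def find_orphaned_loopback_nfs(proc_mounts_text: str):
    """Yield (mountpoint, port) for loopback NFS mounts whose proxy
    port has no live listener -- candidates for `umount -l`."""
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[2].startswith("nfs"):
            continue
        opts = dict(o.split("=", 1) for o in fields[3].split(",") if "=" in o)
        if opts.get("addr") != "127.0.0.1" or "port" not in opts:
            continue
        port = int(opts["port"])
        try:
            with socket.create_connection(("127.0.0.1", port), timeout=1):
                pass  # a proxy is listening; mount is healthy
        except OSError:
            yield fields[1], port
```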

Fix 4 (Defense-in-depth): NodeUnpublishVolume must verify mount is gone before returning success

After calling umount, verify the path is no longer in /proc/mounts before returning success to kubelet.

Intermittency

This is triggered by the combination of:

  1. High pod churn on a node (many EFS mounts/unmounts in a short window)
  2. Kubelet/systemd cgroup teardown stress (slow slice removal)
  3. Watchdog polling at the wrong moment during a partial unmount

Under normal conditions the race window is narrow. Under load it opens up reliably.
