mount-watchdog cleans up state without calling umount, leaving orphaned hard NFS mounts that freeze kubelet #346

@nloke

Description

Summary

When amazon-efs-mount-watchdog detects a dead efs-proxy/stunnel tunnel, it cleans up state files but never calls umount on the kernel NFS mount. This leaves orphaned hard NFS mounts with no server behind them. When the port is reused and a new pod mounts the same EFS filesystem, a second hard NFS mount is created on the dead port. The kernel NFS client then blocks indefinitely in rpc_wait_bit_killable, which causes kubelet goroutines to freeze when they call stat() during orphan cleanup — resulting in a full kubelet freeze lasting 58 minutes in our case.

Environment

| Field | Value |
| --- | --- |
| efs-utils version | 3.1.1 (efs-csi-driver v2.3.0) |
| OS | Bottlerocket 1.56.0 (aws-k8s-1.33-fips) |
| Kernel | 6.1.161 |
| Kubernetes | v1.33.7-eks-ac2d5a0 |
| Region | eu-central-1 |

Root Cause Chain

Bug 1 (Primary): watchdog never calls umount

check_efs_mounts() in src/watchdog/__init__.py has two cleanup paths:

  1. Grace period expired → clean_up_mount_state() → sends SIGTERM to efs-proxy → deletes state files. No umount.
  2. TLS tunnel not running → send_signal_to_running_stunnel_process_group() → logs warning. No umount.

The watchdog assumes the NFS mount is already gone because it disappeared from /proc/mounts. But under kubelet/systemd cgroup teardown stress, there is a race: the NFS mount can transiently disappear from /proc/mounts (or the watchdog polls at the wrong moment during a partial unmount), causing the watchdog to declare it gone and clean up state — while the kernel mount is still live.
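One way to close this race is to require the mount to be absent across several consecutive reads of /proc/mounts before declaring it gone. A minimal sketch (the function names, `read_mounts` callable, and retry parameters are illustrative, not existing efs-utils code):

```python
import time

def mount_present(proc_mounts_text: str, mountpoint: str) -> bool:
    """True if the mountpoint appears in a /proc/mounts snapshot."""
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == mountpoint:
            return True
    return False

def confirmed_gone(read_mounts, mountpoint: str,
                   retries: int = 3, delay: float = 0.5) -> bool:
    """Declare a mount gone only if it stays absent across several reads,
    tolerating the transient /proc/mounts gap seen under cgroup teardown."""
    for _ in range(retries):
        if mount_present(read_mounts(), mountpoint):
            return False
        time.sleep(delay)
    return True
```

A single missed poll then no longer triggers state cleanup; the mount has to be absent for the full retry window.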

Evidence from mount-watchdog.log (1431 lines, zero umount entries):

00:11:27 WARNING - TLS tunnel is not running for fs-0672e82db7ffd86e9...mount.20370
00:11:27 INFO    - TLS tunnel: 279 is no longer running, cleaning up state
# [no umount — 230 "cleaning up state" events, 0 umount calls in entire log]

00:20:50 INFO    - No mount found for "...6cf20ec0...mount.20361"
00:21:20 INFO    - Unmount grace period expired for ...6cf20ec0...mount.20361
00:21:22 WARNING - TLS tunnel is not running for ...6cf20ec0...mount.20361
# [port 20361 kernel NFS mount remains active — no umount called]

Bug 2 (Amplifier): Port reuse races against orphaned kernel mount

After watchdog removes the state file for port 20361, the port allocator considers it free and reassigns it to a new pod. The new efs-proxy cannot bind because the kernel NFS mount still holds the port (errno=101 EADDRNOTAVAIL).
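A port allocator that consulted /proc/mounts as well as state files would avoid reissuing a port the kernel mount still pins. A sketch under that assumption (the function name and option parsing are illustrative, not the actual allocator code):

```python
def port_held_by_kernel_mount(proc_mounts_text: str, port: int) -> bool:
    """True if a live loopback NFS mount still pins this port via its
    mount options (addr=127.0.0.1,...,port=NNN), even though the
    watchdog has already deleted the corresponding state file."""
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2].startswith("nfs"):
            opts = fields[3].split(",")
            if "addr=127.0.0.1" in opts and f"port={port}" in opts:
                return True
    return False
```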

Evidence from mount.log:

00:20:13  efs-proxy started on port 20361 for pod 6cf20ec0 → mount succeeds
[watchdog cleans up state at 00:20:50 — no umount — port 20361 kernel mount still live]
00:25:05  port 20361 reallocated for new pod b7f4edea
00:25:06  ERROR - Failed to start tunnel: "Failed to bind 127.0.0.1:20361" (errno=101)

Bug 3 (Amplifier): mount.nfs4 called even after efs-proxy bind failure

After efs-proxy panics with errno=101, the EFS mount helper still proceeds to call mount.nfs4 with the dead port. This creates a new hard NFS mount pointing at a port with no server behind it.

Evidence from mount.log:

ERROR - Failed to start tunnel (errno=101), stderr="Failed to bind 127.0.0.1:20361"
INFO  - Executing: "/sbin/mount.nfs4 127.0.0.1:data/35bb13525aa542d595369f5f15c94439
         /var/lib/kubelet/pods/b7f4edea-.../mount
         -o rw,nfsvers=4.1,...,hard,...,port=20361" with 15 sec time limit.

Bug 4 (Consequence): Kubelet freezes on stat() of hard NFS mount with dead server

Kubelet's orphan cleanup goroutine calls stat() on the orphaned mount path. This blocks indefinitely in the kernel (rpc_wait_bit_killable). Two kubelet worker threads become permanently blocked, starving the pod worker loop.
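Kubelet is Go, but the hazard pattern is easy to illustrate in Python: an unguarded stat() on a dead hard mount blocks the calling thread forever. A timeout-guarded variant (a sketch, not kubelet code) keeps the control loop moving, though the stuck worker thread still leaks, since a D-state task cannot be killed:

```python
import concurrent.futures
import os

def stat_with_timeout(path: str, timeout: float = 5.0) -> os.stat_result:
    """Run os.stat in a worker thread so a D-state hang on a dead hard
    NFS mount cannot block the caller. On timeout the worker stays stuck
    in the kernel -- a mitigation for the control loop, not a cure for
    the orphaned mount itself."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(os.stat, path).result(timeout=timeout)
    finally:
        pool.shutdown(wait=False)  # do not wait for a possibly-hung worker
```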

Evidence from kernel journal:

00:29:46  nfs: server 127.0.0.1 not responding, timed out
[129 total NFS timeout messages over 54 minutes: 00:29:46 → 01:24:29 UTC]

D-state kernel thread:

PID 200716: [127.0.0.1-manager]  state=D  wchan=rpc_wait_bit_killable

Kubelet freeze window:

Last kubelet log before freeze:  00:26:57 UTC
Kubelet recovered after:         ~01:25 UTC (after umount -l + systemctl restart kubelet)
Freeze duration: ~58 minutes
Pods stuck Terminating: ~65

Orphaned NFS mount (captured before fix)

127.0.0.1:/data/35bb13525aa542d595369f5f15c94439
  /var/lib/kubelet/pods/b7f4edea-.../volumes/kubernetes.io~csi/b96d29e6-.../mount
  nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,
  noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,
  clientaddr=10.208.27.139,local_lock=none,addr=127.0.0.1,port=20361

Fix Applied

umount -l /var/lib/kubelet/pods/b7f4edea-.../volumes/kubernetes.io~csi/b96d29e6-.../mount
systemctl restart kubelet
# All 65 stuck pods self-cleared within minutes.

Recommended Fixes

Fix 1 (Critical): watchdog must call umount before cleaning up state

In src/watchdog/__init__.py, clean_up_mount_state() and the "TLS tunnel not running" path must call umount -l <mountpoint> before removing state files:

# In clean_up_mount_state() and cleanup_mount_state_if_stunnel_not_running():
# Before removing state files, unmount the kernel NFS mount if it still exists.
# (logging, os, and subprocess are already imported by the watchdog module.)
import logging
import os
import subprocess

mountpoint = state.get("mountpoint")
if mountpoint and os.path.ismount(mountpoint):
    try:
        subprocess.run(["umount", "-l", mountpoint], check=True, timeout=30)
        logging.info("Unmounted orphaned NFS mount at %s", mountpoint)
    except (subprocess.SubprocessError, OSError) as e:
        logging.warning("Failed to unmount %s: %s", mountpoint, e)

Fix 2 (Critical): Abort mount if efs-proxy fails to bind

In the mount helper, if efs-proxy exits with errno=101 (EADDRNOTAVAIL), abort the mount and return an error to the CSI NodePublishVolume call. Do not proceed to call mount.nfs4 with a dead port.
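A simple form of this guard is to probe the proxy port before handing it to mount.nfs4, since a hard mount against a dead port is what wedges the node. `proxy_listening` below is a hypothetical helper, not existing mount-helper code:

```python
import socket

def proxy_listening(port: int, timeout: float = 1.0) -> bool:
    """Probe whether anything accepts TCP connections on 127.0.0.1:port.
    If nothing does, calling mount.nfs4 against this port would create
    a hard NFS mount with no server behind it."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout):
            return True
    except OSError:
        return False
```

The mount helper would call this after starting efs-proxy and fail the NodePublishVolume call (letting kubelet retry) when it returns False.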

Fix 3 (Defense-in-depth): Scan for orphaned 127.0.0.1 NFS mounts on startup

On efs-csi-node pod startup, scan /proc/mounts for NFS mounts with addr=127.0.0.1 where no live efs-proxy is listening on the corresponding port. Unmount any found.
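A sketch of such a startup sweep (the function name and parsing are illustrative, and a real driver would also need to avoid racing an in-flight mount whose proxy has not started listening yet):

```python
import socket

def find_orphaned_loopback_nfs(proc_mounts_text: str):
    """Yield (mountpoint, port) for loopback NFS mounts whose proxy
    port has no live listener -- candidates for `umount -l`."""
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[2].startswith("nfs"):
            continue
        opts = dict(o.split("=", 1) for o in fields[3].split(",") if "=" in o)
        if opts.get("addr") != "127.0.0.1" or "port" not in opts:
            continue
        port = int(opts["port"])
        try:
            with socket.create_connection(("127.0.0.1", port), timeout=1):
                pass  # a proxy is listening; mount is healthy
        except OSError:
            yield fields[1], port
```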

Fix 4 (Defense-in-depth): NodeUnpublishVolume must verify mount is gone before returning success

After calling umount, verify the path is no longer in /proc/mounts before returning success to kubelet.

Intermittency

This is triggered by the combination of:

  1. High pod churn on a node (many EFS mounts/unmounts in a short window)
  2. Kubelet/systemd cgroup teardown stress (slow slice removal)
  3. Watchdog polling at the wrong moment during a partial unmount

Under normal conditions the race window is narrow. Under load it opens up reliably.
