tlshd fails due to faulty capability detection #137

**Description**
With ktls-utils/main, tlshd may malfunction. Specifically, the probe for the second optional Netlink attribute (HANDSHAKE_A_DONE_REMOTE_AUTH) fails, mainly because family resolution returns an error (family_id=-12), so the feature is incorrectly reported as "not supported" (logged as `remote_peerids=not available`). This can break NVMe over TCP TLS connections that rely on this feature.

> Kernel capabilities: session_tags=supported remote_peerids=not available

**Reproduction Steps**
I created a script to connect an NVMe device using TLS:


```
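### load the kernel modules
### (-a loads every listed module; without it, the extra names are parsed as module parameters)
modprobe -a nvmet nvmet-tcp nvme nvme_tcp

### set up the NVMe over TCP target with TLS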
dd if=/dev/zero of=test.raw bs=1M count=0 seek=512
losetup /dev/loop100 test.raw
cd /sys/kernel/config/nvmet/subsystems
mkdir nqn.2014-08.org.nvmexpress.mytest
cd nqn.2014-08.org.nvmexpress.mytest
echo 1 > attr_allow_any_host
cd namespaces
mkdir 1
cd 1
echo /dev/loop100 > device_path
echo 1 > enable
cd /sys/kernel/config/nvmet/ports
mkdir 1234
cd 1234
echo tcp > addr_trtype
echo ipv4 > addr_adrfam
echo 0.0.0.0 > addr_traddr
echo 4420 > addr_trsvcid
echo tls1.3 > addr_tsas

cd subsystems
ln -s ../../../subsystems/nqn.2014-08.org.nvmexpress.mytest

### generate the keys for discovery and connect
discovery_key=$(nvme gen-tls-key --subsysnqn=nqn.2014-08.org.nvmexpress.discovery)

echo discovery_key=$discovery_key

conn_key=$(nvme gen-tls-key --subsysnqn=nqn.2014-08.org.nvmexpress.mytest)

echo conn_key=$conn_key

### insert the key
nvme check-tls-key  --subsysnqn=nqn.2014-08.org.nvmexpress.mytest \
                    --hostnqn=$(cat /etc/nvme/hostnqn) -i -d ${conn_key}

nvme check-tls-key  --subsysnqn=nqn.2014-08.org.nvmexpress.discovery \
                    --hostnqn=$(cat /etc/nvme/hostnqn) -i -d ${discovery_key}

echo hostnqn=$(cat /etc/nvme/hostnqn)

systemctl stop tlshd
tlshd -c /etc/tlshd/config -s &
sleep 3

nvme discover -t tcp -a 127.0.0.1 -s 4420 --tls

echo "Target init complete"
```
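
To confirm the symptom from a captured tlshd log, a small helper along these lines can pull one capability value out of the `Kernel capabilities:` line quoted in the description. This is only a sketch: `kernel_cap` is a hypothetical name, and how the log text is captured (redirected stderr, the system journal, and so on) is left to the reader.

```shell
# Hypothetical helper (not part of ktls-utils): reads log text on stdin and
# prints the value of capability $1 from the last "Kernel capabilities:" line.
# Values may contain spaces (for example "not available"), so the value is
# taken to end at the next " name=" pair or at the end of the line.
kernel_cap() {
    grep 'Kernel capabilities:' |
        tail -n 1 |
        sed -n "s/.*[[:space:]]$1=//p" |
        sed 's/[[:space:]][A-Za-z_][A-Za-z0-9_]*=.*$//'
}
```

For example, `printf '%s\n' "$logline" | kernel_cap remote_peerids` prints the reported state of that capability, and prints nothing if the line is absent.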

Cleanup script:
```
rm -rf /sys/kernel/config/nvmet/ports/1234/subsystems/nqn.2014-08.org.nvmexpress.mytest
rmdir /sys/kernel/config/nvmet/ports/1234/
echo 0 > /sys/kernel/config/nvmet/subsystems/nqn.2014-08.org.nvmexpress.mytest/namespaces/1/enable
echo -n 0 > /sys/kernel/config/nvmet/subsystems/nqn.2014-08.org.nvmexpress.mytest/namespaces/1/device_path
rmdir /sys/kernel/config/nvmet/subsystems/nqn.2014-08.org.nvmexpress.mytest/namespaces/1/
rmdir /sys/kernel/config/nvmet/subsystems/nqn.2014-08.org.nvmexpress.mytest/
losetup -d /dev/loop100
rm -rf ./test.raw
killall tlshd
```

The scripts work correctly once the commit "tlshd: Add extensible kernel capability detection" is reverted.

**Debug and my fix**

I think the problem is caused by state pollution in the probing logic. Each probe sends a deliberately invalid request to trigger a kernel error. The original `tlshd_probe_attr` function does not read and handle the kernel's error response after sending the request, so the unread response remains in the Netlink socket's receive buffer. When the second probe calls `genl_ctrl_resolve`, that stale response appears to be consumed in place of the reply to the resolve request, causing the resolution to fail with -12 and corrupting the capability detection. The fix below drains the pending response with `nl_recvmsgs_default` before the probe returns:
```

diff --git a/src/tlshd/netlink.c b/src/tlshd/netlink.c
index 8d47799..a79ff54 100644
--- a/src/tlshd/netlink.c
+++ b/src/tlshd/netlink.c
@@ -265,6 +265,7 @@ static bool tlshd_probe_attr(struct nl_sock *nls, int cmd, int attr_type)
         * the kernel accepted the message containing this attribute.
         */
        supported = (err >= 0);
+       nl_recvmsgs_default(nls);
 
        return supported;
 }
```
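
The suspected failure mode can be illustrated with a toy model in shell. This is not libnl or Netlink code: a temporary file stands in for the socket's receive buffer, and every name here (`kernel_reply`, `recv_one`, `probe`, `resolve`, `run_scenario`) and the value `family_id=42` are made up for illustration. It only shows why an unread reply left in a FIFO buffer is what the next reader sees first.

```shell
# Toy model only: a plain file plays the role of the socket receive buffer.
rxbuf=$(mktemp)
trap 'rm -f "$rxbuf" "$rxbuf.tmp"' EXIT

# The "kernel" queues one reply line at the end of the buffer.
kernel_reply() { printf '%s\n' "$1" >> "$rxbuf"; }

# The client reads (and removes) the oldest queued reply.
recv_one() {
    head -n 1 "$rxbuf"
    tail -n +2 "$rxbuf" > "$rxbuf.tmp" && mv "$rxbuf.tmp" "$rxbuf"
}

# A probe always provokes an error reply; it is only consumed when $1 is "drain".
probe() {
    kernel_reply "error: probe rejected"
    if [ "$1" = drain ]; then
        recv_one > /dev/null
    fi
}

# A resolve request queues its answer, then reads whatever is oldest.
resolve() {
    kernel_reply "family_id=42"   # hypothetical family ID
    recv_one
}

# Run one probe followed by one resolve, starting from an empty buffer.
run_scenario() {
    : > "$rxbuf"
    probe "$1"
    resolve
}

run_scenario nodrain   # prints "error: probe rejected" (stale reply read first)
run_scenario drain     # prints "family_id=42"
```

In this model the one-line change in the diff corresponds to the `drain` branch: consuming the probe's error reply before returning keeps it from being read by the next request.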

I will create a PR later.

