tlshd fails due to faulty capability detection #137

@Dwyane-Yan

Description

With ktls-utils/main, tlshd may malfunction. Specifically, the probe for the second optional Netlink attribute (HANDSHAKE_A_DONE_REMOTE_AUTH) fails,
mainly because family resolution returns an invalid family_id of -12, so the feature is incorrectly reported as “not supported.” This can break NVMe over TCP TLS connections that rely on this feature.

Kernel capabilities: session_tags=supported remote_peerids=not available

Reproduction Steps
I use the following script to set up an NVMe over TCP target with TLS and connect to it:

### load the kernel modules (-a is needed to load several modules in one call)
modprobe -a nvmet nvmet-tcp nvme nvme_tcp

### initialize the NVMe over TCP target with TLS
dd if=/dev/zero of=test.raw bs=1M count=0 seek=512
losetup /dev/loop100 test.raw
cd /sys/kernel/config/nvmet/subsystems
mkdir nqn.2014-08.org.nvmexpress.mytest
cd nqn.2014-08.org.nvmexpress.mytest
echo 1 > attr_allow_any_host
cd namespaces
mkdir 1
cd 1
echo /dev/loop100 > device_path
echo 1 > enable
cd /sys/kernel/config/nvmet/ports
mkdir 1234
cd 1234
echo tcp > addr_trtype
echo ipv4 > addr_adrfam
echo 0.0.0.0 > addr_traddr
echo 4420 > addr_trsvcid
echo tls1.3 > addr_tsas

cd subsystems
ln -s ../../../subsystems/nqn.2014-08.org.nvmexpress.mytest

### generate the keys for discovery and connect
discovery_key=$(nvme gen-tls-key --subsysnqn=nqn.2014-08.org.nvmexpress.discovery)

echo discovery_key=$discovery_key

conn_key=$(nvme gen-tls-key --subsysnqn=nqn.2014-08.org.nvmexpress.mytest)

echo conn_key=$conn_key

### insert the key
nvme check-tls-key  --subsysnqn=nqn.2014-08.org.nvmexpress.mytest \
                    --hostnqn=$(cat /etc/nvme/hostnqn) -i -d ${conn_key}

nvme check-tls-key  --subsysnqn=nqn.2014-08.org.nvmexpress.discovery \
                    --hostnqn=$(cat /etc/nvme/hostnqn) -i -d ${discovery_key}

echo hostnqn=$(cat /etc/nvme/hostnqn)

systemctl stop tlshd
tlshd -c /etc/tlshd/config -s &
sleep 3

nvme discover -t tcp -a 127.0.0.1 -s 4420 --tls

echo "Target init complete"

Cleanup script:

        rm -rf /sys/kernel/config/nvmet/ports/1234/subsystems/nqn.2014-08.org.nvmexpress.mytest
        rmdir /sys/kernel/config/nvmet/ports/1234/
        echo 0 > /sys/kernel/config/nvmet/subsystems/nqn.2014-08.org.nvmexpress.mytest/namespaces/1/enable
        echo -n 0 > /sys/kernel/config/nvmet/subsystems/nqn.2014-08.org.nvmexpress.mytest/namespaces/1/device_path
        rmdir /sys/kernel/config/nvmet/subsystems/nqn.2014-08.org.nvmexpress.mytest/namespaces/1/
        rmdir /sys/kernel/config/nvmet/subsystems/nqn.2014-08.org.nvmexpress.mytest/
        losetup -d /dev/loop100
        rm -rf ./test.raw
        killall tlshd

The script works correctly once the commit "tlshd: Add extensible kernel capability detection" is reverted.

Debugging and Proposed Fix

I think the problem is caused by state pollution in the probing logic. Each probe sends a deliberately invalid request to trigger a kernel error. The original 'tlshd_probe_attr' function does not read and handle the kernel’s error response after sending the request, so the unread response remains in the Netlink socket’s receive buffer. When the second probe then calls genl_ctrl_resolve, the leftover response appears to be consumed in place of the expected reply, so the resolution fails with -12 and the capability detection is corrupted.


diff --git a/src/tlshd/netlink.c b/src/tlshd/netlink.c
index 8d47799..a79ff54 100644
--- a/src/tlshd/netlink.c
+++ b/src/tlshd/netlink.c
@@ -265,6 +265,7 @@ static bool tlshd_probe_attr(struct nl_sock *nls, int cmd, int attr_type)
         * the kernel accepted the message containing this attribute.
         */
        supported = (err >= 0);
+       nl_recvmsgs_default(nls);
 
        return supported;
 }
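
The stale-response effect described above can be illustrated with a simplified model. This is only a sketch under stated assumptions: it does not use libnl or Netlink at all. A UNIX datagram socket pair stands in for the Netlink socket, and every function name here (fake_kernel_reply, read_next, probe, resolve, probe_then_resolve) is hypothetical and not part of tlshd.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the kernel: queue one reply for the user side. */
static void fake_kernel_reply(int kernel_fd, const char *reply)
{
	ssize_t n = send(kernel_fd, reply, strlen(reply), 0);

	assert(n == (ssize_t)strlen(reply));
}

/* Read the next queued datagram without blocking; returns -1 if none. */
static ssize_t read_next(int fd, char *buf, size_t len)
{
	ssize_t n = recv(fd, buf, len - 1, MSG_DONTWAIT);

	if (n >= 0)
		buf[n] = '\0';
	return n;
}

/*
 * Hypothetical probe: the "kernel" answers with an error message.
 * With drain == 0 the answer is left in the receive queue, which is
 * the behaviour the issue attributes to the original probe function.
 */
static void probe(int user_fd, int kernel_fd, int drain)
{
	char buf[64];

	fake_kernel_reply(kernel_fd, "probe-error");
	if (drain)
		(void)read_next(user_fd, buf, sizeof(buf));
}

/*
 * Hypothetical resolve: expects the next message to be its own reply.
 * Returns 0 on success, -1 if it reads anything else.
 */
static int resolve(int user_fd, int kernel_fd)
{
	char buf[64];

	fake_kernel_reply(kernel_fd, "family-id");
	if (read_next(user_fd, buf, sizeof(buf)) < 0)
		return -1;
	return strcmp(buf, "family-id") == 0 ? 0 : -1;
}

/* Run one probe followed by one resolve on a fresh socket pair. */
static int probe_then_resolve(int drain)
{
	int sv[2];
	int rc;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -2;
	probe(sv[0], sv[1], drain);
	rc = resolve(sv[0], sv[1]);
	close(sv[0]);
	close(sv[1]);
	return rc;
}
```

In this model, probe_then_resolve(0) returns -1 because resolve reads the leftover "probe-error" message first, while probe_then_resolve(1) returns 0 because the probe consumed its own reply. Draining in the probe plays the role that the added nl_recvmsgs_default(nls) call plays in the diff above.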

I will create a PR later.
