Description
With key-utils/main, tlshd may malfunction. Specifically, the probe for the second optional Netlink attribute (HANDSHAKE_A_DONE_REMOTE_AUTH) fails, mainly because the resolved family ID is wrong (family_id=-12), so the feature is incorrectly reported as “not supported.” This can break NVMe over TCP TLS connections that rely on this feature.
Kernel capabilities: session_tags=supported remote_peerids=not available
Reproduction Steps
I created a script to connect an NVMe device using TLS:
### load the kernel module
modprobe -a nvmet nvmet-tcp nvme nvme_tcp
### init the nvme over tcp tls
dd if=/dev/zero of=test.raw bs=1M count=0 seek=512
losetup /dev/loop100 test.raw
cd /sys/kernel/config/nvmet/subsystems
mkdir nqn.2014-08.org.nvmexpress.mytest
cd nqn.2014-08.org.nvmexpress.mytest
echo 1 > attr_allow_any_host
cd namespaces
mkdir 1
cd 1
echo /dev/loop100 > device_path
echo 1 > enable
cd /sys/kernel/config/nvmet/ports
mkdir 1234
cd 1234
echo tcp > addr_trtype
echo ipv4 > addr_adrfam
echo 0.0.0.0 > addr_traddr
echo 4420 > addr_trsvcid
echo tls1.3 > addr_tsas
cd subsystems
ln -s ../../../subsystems/nqn.2014-08.org.nvmexpress.mytest
### generate the keys for discovery and connect
discovery_key=$(nvme gen-tls-key --subsysnqn=nqn.2014-08.org.nvmexpress.discovery)
echo discovery_key=$discovery_key
conn_key=$(nvme gen-tls-key --subsysnqn=nqn.2014-08.org.nvmexpress.mytest)
echo conn_key=$conn_key
### insert the key
nvme check-tls-key --subsysnqn=nqn.2014-08.org.nvmexpress.mytest \
--hostnqn=$(cat /etc/nvme/hostnqn) -i -d ${conn_key}
nvme check-tls-key --subsysnqn=nqn.2014-08.org.nvmexpress.discovery \
--hostnqn=$(cat /etc/nvme/hostnqn) -i -d ${discovery_key}
echo hostnqn=$(cat /etc/nvme/hostnqn)
systemctl stop tlshd
tlshd -c /etc/tlshd/config -s &
sleep 3
nvme discover -t tcp -a 127.0.0.1 -s 4420 --tls
echo "Target init complete"
Cleanup script:
rm -rf /sys/kernel/config/nvmet/ports/1234/subsystems/nqn.2014-08.org.nvmexpress.mytest
rmdir /sys/kernel/config/nvmet/ports/1234/
echo 0 > /sys/kernel/config/nvmet/subsystems/nqn.2014-08.org.nvmexpress.mytest/namespaces/1/enable
echo -n 0 > /sys/kernel/config/nvmet/subsystems/nqn.2014-08.org.nvmexpress.mytest/namespaces/1/device_path
rmdir /sys/kernel/config/nvmet/subsystems/nqn.2014-08.org.nvmexpress.mytest/namespaces/1/
rmdir /sys/kernel/config/nvmet/subsystems/nqn.2014-08.org.nvmexpress.mytest/
losetup -d /dev/loop100
rm -rf ./test.raw
killall tlshd
The scripts work well when the commit "tlshd: Add extensible kernel capability detection" is reverted.
Debug and my fix
I think the problem is caused by state pollution in the probing logic. Each probe sends a deliberately invalid request to trigger a kernel error. The original 'tlshd_probe_attr' function does not read and handle the kernel’s error response after sending the request, so the unread response remains in the Netlink socket’s receive buffer. When the second probe calls genl_ctrl_resolve, it appears to pick up this stale response instead of its own reply, causing the resolution to fail with -12 and corrupting the capability detection.
diff --git a/src/tlshd/netlink.c b/src/tlshd/netlink.c
index 8d47799..a79ff54 100644
--- a/src/tlshd/netlink.c
+++ b/src/tlshd/netlink.c
@@ -265,6 +265,7 @@ static bool tlshd_probe_attr(struct nl_sock *nls, int cmd, int attr_type)
* the kernel accepted the message containing this attribute.
*/
supported = (err >= 0);
+ nl_recvmsgs_default(nls);
return supported;
}
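The stale-reply effect described above can be illustrated without Netlink at all. The following is only a minimal simulation under stated assumptions: it uses an AF_UNIX datagram socketpair in place of the Netlink socket, and `fake_kernel_reply`, `send_probe`, and `second_request_sees_own_reply` are hypothetical names that do not exist in tlshd or libnl. It shows that if the reply to the first probe is left unread, the next request on the same socket reads that old reply first, and that draining the reply after the probe avoids this.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for the kernel: read one request and
 * answer it with a reply that names the request. */
static void fake_kernel_reply(int kernel_fd)
{
	char req[64], reply[80];
	ssize_t n = recv(kernel_fd, req, sizeof(req) - 1, 0);

	if (n < 0)
		return;
	req[n] = '\0';
	snprintf(reply, sizeof(reply), "reply-to:%s", req);
	send(kernel_fd, reply, strlen(reply), 0);
}

/* Send a probe request. When 'drain' is false, the reply is left
 * queued on the socket, mimicking a probe that never reads the
 * error response. */
static void send_probe(int user_fd, int kernel_fd, const char *name, bool drain)
{
	char buf[80];

	send(user_fd, name, strlen(name), 0);
	fake_kernel_reply(kernel_fd);
	if (drain)
		recv(user_fd, buf, sizeof(buf), 0);
}

/* Run a probe followed by a second request on the same socket.
 * Returns true only if the second request reads its own reply. */
bool second_request_sees_own_reply(bool drain_after_probe)
{
	const char *second = "resolve-2";
	char buf[80];
	int fds[2];
	ssize_t n;
	bool ok;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) < 0)
		return false;

	send_probe(fds[0], fds[1], "probe-1", drain_after_probe);

	/* Second request: send it, then read exactly one reply. */
	send(fds[0], second, strlen(second), 0);
	fake_kernel_reply(fds[1]);
	n = recv(fds[0], buf, sizeof(buf) - 1, 0);
	if (n < 0)
		n = 0;
	buf[n] = '\0';

	ok = strcmp(buf, "reply-to:resolve-2") == 0;
	close(fds[0]);
	close(fds[1]);
	return ok;
}
```

Calling `second_request_sees_own_reply(false)` models the behavior before the patch, and `second_request_sees_own_reply(true)` models the effect of reading the pending response right after the probe, which is the role `nl_recvmsgs_default(nls)` plays in the diff. Real Netlink sockets also carry sequence numbers and error messages, which this sketch ignores.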
I will create a PR later.