Open
Description
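When several pods that each request an NPU are created at the same time, they can end up being assigned the same NPU, and using the NPU then fails with: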
DrvMngGetConsoleLogLevel failed. (ret=4)
dcmi model initialized failed, because the device is used. ret is -8020
- scheduler: volcano
- device-plugin: latest
- npu: 910B (eight cards)
- Create a Deployment (2 replicas)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: npu
spec:
  replicas: 2
  selector:
    matchLabels:
      run: npu
  template:
    metadata:
      labels:
        run: npu
      name: npu
    spec:
      containers:
      - command:
        - sh
        - -c
        args:
        - sleep 1d
        image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04
        imagePullPolicy: IfNotPresent
        name: npu
        resources:
          limits:
            cpu: "8"
            huawei.com/Ascend910B4: "1"
            memory: 16Gi
          requests:
            cpu: "8"
            huawei.com/Ascend910B4: "1"
            memory: 16Gi
        securityContext:
          privileged: false
      schedulerName: volcano
- device-plugin logs
I0130 07:56:32.030596 1 server.go:349] Allocate: &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[C281A66C-8120A9DA-54B03472-B9D00485-104301E3-3],},},}
I0130 07:56:32.053587 1 server.go:387] allocate response: {map[ASCEND_VISIBLE_DEVICES:7] [] [] map[] [] {} 0}
I0130 07:56:35.077178 1 server.go:349] Allocate: &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[C281A66C-81808FDA-09B87472-B9D00485-104301E3-0],},},}
I0130 07:56:35.099104 1 server.go:387] allocate response: {map[ASCEND_VISIBLE_DEVICES:7] [] [] map[] [] {} 0}
- Check the containers and the cards allocated to them
# nerdctl ps |grep default/npu
67b466d3d67d swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04 "sh -c sleep 1d" About a minute ago Up k8s://default/npu-dcfc4bdc4-ckmt9/npu
7ad9f89dfb0f sealos.hub:5000/pause:3.6 "/pause" About a minute ago Up k8s://default/npu-dcfc4bdc4-ckmt9
b75f0592887b swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04 "sh -c sleep 1d" About a minute ago Up k8s://default/npu-dcfc4bdc4-9cpcv/npu
99d73958ca35 sealos.hub:5000/pause:3.6 "/pause" About a minute ago Up k8s://default/npu-dcfc4bdc4-9cpcv
# nerdctl inspect 67b466d3d67d |grep VIS
"ASCEND_VISIBLE_DEVICES=7",
# nerdctl inspect b75f0592887b |grep VIS
"ASCEND_VISIBLE_DEVICES=7",
- Use the NPU inside the pods
# kubectl exec -it npu-dcfc4bdc4-9cpcv -- bash
root@npu-dcfc4bdc4-9cpcv:/# npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 25.3.rc1 Version: 25.3.rc1 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 7 910B4 | Warning | 88.4 40 0 / 0 |
| 0 | 0000:42:00.0 | 0 0 / 0 2887 / 32768 |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| No running processes found in NPU 7 |
+===========================+===============+====================================================+
# kubectl exec -it npu-dcfc4bdc4-ckmt9 -- bash
root@npu-dcfc4bdc4-ckmt9:/# npu-smi info
DrvMngGetConsoleLogLevel failed. (ret=4)
dcmi model initialized failed, because the device is used. ret is -8020
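To make the duplicate allocation easier to spot in a longer device-plugin log, here is a minimal Python sketch that pairs each `Allocate` request with the `allocate response` line that follows it and reports any `ASCEND_VISIBLE_DEVICES` value that was handed out more than once. The function name `find_duplicate_allocations` and both regular expressions are my own assumptions, derived only from the two log line shapes quoted above. It is not part of the device plugin, and it assumes that requests and responses are not interleaved and that each request carries a single container.

```python
import re
from collections import defaultdict

# Both patterns are guesses based only on the log lines quoted in this issue.
REQUEST_RE = re.compile(r"Allocate: .*?DevicesIDs:\[([^\]]*)\]")
RESPONSE_RE = re.compile(r"allocate response: .*?ASCEND_VISIBLE_DEVICES:([0-9,]+)")


def find_duplicate_allocations(log_text):
    """Map each ASCEND_VISIBLE_DEVICES value returned more than once
    to the list of device IDs that were requested for it."""
    allocations = defaultdict(list)
    pending = None  # device IDs from the most recent Allocate request
    for line in log_text.splitlines():
        request = REQUEST_RE.search(line)
        if request:
            pending = request.group(1)
            continue
        response = RESPONSE_RE.search(line)
        if response and pending is not None:
            allocations[response.group(1)].append(pending)
            pending = None
    return {dev: ids for dev, ids in allocations.items() if len(ids) > 1}


# The four log lines from this issue, copied verbatim.
SAMPLE_LOG = """\
I0130 07:56:32.030596 1 server.go:349] Allocate: &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[C281A66C-8120A9DA-54B03472-B9D00485-104301E3-3],},},}
I0130 07:56:32.053587 1 server.go:387] allocate response: {map[ASCEND_VISIBLE_DEVICES:7] [] [] map[] [] {} 0}
I0130 07:56:35.077178 1 server.go:349] Allocate: &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[C281A66C-81808FDA-09B87472-B9D00485-104301E3-0],},},}
I0130 07:56:35.099104 1 server.go:387] allocate response: {map[ASCEND_VISIBLE_DEVICES:7] [] [] map[] [] {} 0}
"""

if __name__ == "__main__":
    for dev, ids in find_duplicate_allocations(SAMPLE_LOG).items():
        print(f"ASCEND_VISIBLE_DEVICES={dev} was returned for {len(ids)} requests: {ids}")
```

On the log above this flags value `7`: the two requests name different device IDs (ending in `-3` and `-0`), yet both responses return `ASCEND_VISIBLE_DEVICES:7`.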