Creating multiple pods at the same time causes them to be allocated the same NPU #52

@lomtom

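When several pods that each request an NPU are created at the same time, they are allocated the same NPU, and using the NPU then fails with: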

DrvMngGetConsoleLogLevel failed. (ret=4)
dcmi model initialized failed, because the device is used. ret is -8020
  • scheduler: volcano
  • device-plugin: latest
  • npu: 910B (8 cards)
  1. Create a Deployment (2 replicas)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: npu
spec:
  replicas: 2
  selector:
    matchLabels:
      run: npu
  template:
    metadata:
      labels:
        run: npu
      name: npu
    spec:
      containers:
        - command:
            - sh
            - -c
          args:
            - sleep 1d
          image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04
          imagePullPolicy: IfNotPresent
          name: npu
          resources:
            limits:
              cpu: "8"
              huawei.com/Ascend910B4: "1"
              memory: 16Gi
            requests:
              cpu: "8"
              huawei.com/Ascend910B4: "1"
              memory: 16Gi
          securityContext:
            privileged: false
      schedulerName: volcano
  2. device-plugin logs
I0130 07:56:32.030596       1 server.go:349] Allocate: &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[C281A66C-8120A9DA-54B03472-B9D00485-104301E3-3],},},}
I0130 07:56:32.053587       1 server.go:387] allocate response: {map[ASCEND_VISIBLE_DEVICES:7] [] [] map[] [] {} 0}
I0130 07:56:35.077178       1 server.go:349] Allocate: &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[C281A66C-81808FDA-09B87472-B9D00485-104301E3-0],},},}
I0130 07:56:35.099104       1 server.go:387] allocate response: {map[ASCEND_VISIBLE_DEVICES:7] [] [] map[] [] {} 0}
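
In the log above, two Allocate requests carry different device IDs (ending in `-3` and `-0`), yet both responses return `ASCEND_VISIBLE_DEVICES:7`. The following is a minimal Python sketch for spotting the same pattern in a longer log. It assumes that every `Allocate:` line is followed by its own `allocate response:` line, as in this excerpt (interleaved concurrent requests would break that pairing), and that the line formats match the ones shown here. `find_conflicts` is a hypothetical helper written for illustration, not part of the device plugin.

```python
import re
from collections import defaultdict

# Patterns match the two log line shapes shown in this issue; other
# device-plugin versions may format these lines differently (assumption).
REQUEST_RE = re.compile(r"DevicesIDs:\[([^\]]*)\]")
RESPONSE_RE = re.compile(r"ASCEND_VISIBLE_DEVICES:([\d,]+)")


def find_conflicts(log_text):
    """Return {visible_device: [device-ID tuples]} for devices handed out more than once."""
    owners = defaultdict(set)
    pending = None  # device IDs from the most recent Allocate request
    for line in log_text.splitlines():
        req = REQUEST_RE.search(line)
        if req:
            pending = tuple(x.strip() for x in req.group(1).split(",") if x.strip())
            continue
        resp = RESPONSE_RE.search(line)
        if resp and pending is not None:
            # Assumption: this response belongs to the request just before it.
            for dev in resp.group(1).split(","):
                owners[dev].add(pending)
            pending = None
    return {dev: sorted(ids) for dev, ids in owners.items() if len(ids) > 1}


# The four log lines from this issue, copied verbatim.
SAMPLE = """\
I0130 07:56:32.030596       1 server.go:349] Allocate: &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[C281A66C-8120A9DA-54B03472-B9D00485-104301E3-3],},},}
I0130 07:56:32.053587       1 server.go:387] allocate response: {map[ASCEND_VISIBLE_DEVICES:7] [] [] map[] [] {} 0}
I0130 07:56:35.077178       1 server.go:349] Allocate: &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[C281A66C-81808FDA-09B87472-B9D00485-104301E3-0],},},}
I0130 07:56:35.099104       1 server.go:387] allocate response: {map[ASCEND_VISIBLE_DEVICES:7] [] [] map[] [] {} 0}
"""

if __name__ == "__main__":
    for device, requests in find_conflicts(SAMPLE).items():
        print(f"visible device {device} was returned for {len(requests)} different requests:")
        for ids in requests:
            print("  ", ", ".join(ids))
```

Run against the excerpt, the sketch reports visible device 7 as returned for two distinct requests, which matches the duplicate `ASCEND_VISIBLE_DEVICES=7` seen in the containers.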
  3. Check the containers and the cards allocated to them
# nerdctl ps |grep default/npu
67b466d3d67d    swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04                                                                                                                   "sh -c sleep 1d"          About a minute ago    Up                 k8s://default/npu-dcfc4bdc4-ckmt9/npu
7ad9f89dfb0f    sealos.hub:5000/pause:3.6                                                                                                                               "/pause"                  About a minute ago    Up                 k8s://default/npu-dcfc4bdc4-ckmt9
b75f0592887b    swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04                                                                                                                   "sh -c sleep 1d"          About a minute ago    Up                 k8s://default/npu-dcfc4bdc4-9cpcv/npu
99d73958ca35    sealos.hub:5000/pause:3.6                                                                                                                               "/pause"                  About a minute ago    Up                 k8s://default/npu-dcfc4bdc4-9cpcv

# nerdctl inspect 67b466d3d67d |grep VIS
                "ASCEND_VISIBLE_DEVICES=7",
# nerdctl inspect b75f0592887b |grep VIS
                "ASCEND_VISIBLE_DEVICES=7",
  4. Use the NPU inside the pods
# kubectl exec -it npu-dcfc4bdc4-9cpcv -- bash
root@npu-dcfc4bdc4-9cpcv:/# npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 25.3.rc1                 Version: 25.3.rc1                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 7     910B4               | Warning       | 88.4        40                0    / 0             |
| 0                         | 0000:42:00.0  | 0           0    / 0          2887 / 32768         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+


# kubectl exec -it npu-dcfc4bdc4-ckmt9 -- bash      
root@npu-dcfc4bdc4-ckmt9:/# npu-smi info
DrvMngGetConsoleLogLevel failed. (ret=4)
dcmi model initialized failed, because the device is used. ret is -8020
