Can only attach any vgpu once after host boot and not again #131

@cpmpal

I still haven't found the root cause of this issue, but I'm hoping someone can spot something I'm missing. I suspect that downgrading the drivers might help.

I am currently running Proxmox VE 8.4.1 on the 6.8.12 kernel:
Linux elysium 6.8.12-11-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-11 (2025-05-22T09:39Z) x86_64 GNU/Linux

I applied the patch to NVIDIA driver version 535.230.02:

Mon Jul  7 20:51:44 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: N/A      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro P4000                   On  | 00000000:2D:00.0 Off |                  N/A |
| 46%   38C    P8               9W / 105W |     30MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

As far as I recall, I downloaded the NVIDIA vGPU drivers from somewhere in the 16.x branch, though I'm not sure how to verify this.

At minimum, the vGPU does show up correctly:

Mon Jul  7 20:54:04 2025       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02                |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  Quadro P4000               | 00000000:2D:00.0             |   0%       |
+---------------------------------+------------------------------+------------+

and it is a proper mdev:

root@elysium:/home/vgpu-proxmox# mdevctl list
5c3928a2-b61f-46a8-883a-36887adec34d 0000:2d:00.0 nvidia-47 manual

I followed the manual steps: going into the PCI device in sysfs, then mdev_supported_types, and creating the UUID under the nvidia-47 type. Then, in the VM conf, I added the device parameter by hand (I had seen a bug report saying this didn't work well in the Proxmox GUI, and I was having trouble with it myself):
args: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/5c3928a2-b61f-46a8-883a-36887adec34d -uuid 00000000-0000-0000-0000-000000000101
where the UUID is just my VM ID, 101, in UUID format.
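For anyone reproducing this setup, here is a minimal sketch of how that args line can be assembled. The helper names `vmid_to_uuid` and `qemu_args` are hypothetical and not part of Proxmox or the vgpu tooling; the only thing taken from this report is the convention of writing the VM ID's decimal digits into the tail of an all-zero UUID.

```python
import uuid

# Sysfs root for mdev devices, as used in the args line in this report.
MDEV_ROOT = "/sys/bus/mdev/devices"


def vmid_to_uuid(vmid: int) -> str:
    """Place the decimal digits of the VM ID at the end of an all-zero UUID."""
    if vmid < 0:
        raise ValueError("VM ID must be non-negative")
    digits = str(vmid)
    if len(digits) > 12:
        raise ValueError("VM ID does not fit in the last UUID field")
    return f"00000000-0000-0000-0000-{digits.zfill(12)}"


def qemu_args(mdev_uuid: str, vmid: int) -> str:
    """Build the 'args:' line for the VM conf (hypothetical helper)."""
    uuid.UUID(mdev_uuid)  # raises ValueError if the mdev UUID is malformed
    return (
        f"args: -device vfio-pci,sysfsdev={MDEV_ROOT}/{mdev_uuid} "
        f"-uuid {vmid_to_uuid(vmid)}"
    )
```

Calling `qemu_args("5c3928a2-b61f-46a8-883a-36887adec34d", 101)` reproduces the line I added to the VM conf. The sketch only builds strings; it does not touch sysfs or create the mdev.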

For whatever reason, this works flawlessly after a cold boot of the underlying host. If I shut down the VM and then try to start it again, I get this message:

vm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/5c3928a2-b61f-46a8-883a-36887adec34d: vfio 5c3928a2-b61f-46a8-883a-36887adec34d: error getting device from group 172: Input/output error
Verify all devices in group 172 are bound to vfio-<bus> or pci-stub and not already in use
TASK ERROR: start failed: QEMU exited with code 1
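Since the error asks to verify the devices in group 172, a rough sketch like the following could pull the group number out of the QEMU message and list what sits in that group. It assumes the standard `/sys/kernel/iommu_groups/<n>/devices` sysfs layout; the function names are hypothetical.

```python
import re
from pathlib import Path
from typing import List, Optional

# Matches the wording of the QEMU error quoted above.
GROUP_RE = re.compile(r"error getting device from group (\d+)")


def failing_group(message: str) -> Optional[int]:
    """Return the group number named in the QEMU error, or None if absent."""
    match = GROUP_RE.search(message)
    return int(match.group(1)) if match else None


def group_devices(group: int,
                  root: Path = Path("/sys/kernel/iommu_groups")) -> List[str]:
    """List device names in the group; empty if the directory is missing."""
    devices = root / str(group) / "devices"
    if not devices.is_dir():
        return []
    return sorted(entry.name for entry in devices.iterdir())
```

On the host, `group_devices(failing_group(error_text))` would show whether anything other than the mdev is in the group. I have not confirmed that this narrows down the failure here; it only automates the check the error message suggests.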

This error seems fairly generic and also shows up in other issues. On that first boot I see normal output from nvidia-vgpu-mgr:

Jul 07 18:30:02 elysium nvidia-vgpu-mgr[2523]: Nv0000CtrlVgpuCreateDeviceParams {
                                                   vgpu_name: {5c3928a2-b61f-46a8-883a-36887adec34d},
                                                   gpu_pci_id: 0x2d00,
                                                   gpu_pci_bdf: 11520,
                                                   vgpu_type_id: 47,
                                                   vgpu_id: 1,
                                               }
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 5c3928a2-b61f-46a8-883a-36887adec34d GPU PCI id 00:2d:00.0 config par>
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=47
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_env_log: Successfully updated env symbols!
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: cmd: 0x20801322 failed.
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: cmd: 0x2080014b failed.
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: NvA081CtrlVgpuConfigGetVgpuTypeInfoParams {
                                                    vgpu_type: 47,
                                                    vgpu_type_info: NvA081CtrlVgpuInfo {
                                                        vgpu_type: 47,
                                                        vgpu_name: "GRID P40-2Q",
                                                        vgpu_class: "Quadro",
                                                        vgpu_signature: [],
                                                        license: "Quadro-Virtual-DWS,5.0;GRID-Virtual-WS,2.0;GRID-Virtual-WS-Ext,2.0",
                                                        max_instance: 12,
                                                        num_heads: 4,
                                                        max_resolution_x: 7680,
                                                        max_resolution_y: 4320,
                                                        max_pixels: 36864000,
                                                        frl_config: 60,
                                                        cuda_enabled: 1,
                                                        ecc_supported: 1,
                                                        gpu_instance_size: 0,
                                                        multi_vgpu_supported: 0,
                                                        vdev_id: 0x1b3811e9,
                                                        pdev_id: 0x1b38,
                                                        profile_size: 0x80000000,
                                                        fb_length: 0x74000000,
                                                        gsp_heap_size: 0x0,
                                                        fb_reservation: 0xc000000,
                                                        mappable_video_size: 0x400000,
                                                        encoder_capacity: 0x64,
                                                        bar1_length: 0x100,
                                                        frl_enable: 1,
                                                        adapter_name: "GRID P40-2Q",
                                                        adapter_name_unicode: "GRID P40-2Q",
                                                        short_gpu_name_string: "GP104GL-A",
                                                        licensed_product_name: "NVIDIA RTX Virtual Workstation",
                                                        vgpu_extra_params: [],
                                                        ftrace_enable: 0,
                                                        gpu_direct_supported: 0,
                                                        nvlink_p2p_supported: 0,
                                                        multi_vgpu_exclusive: 0,
                                                        exclusive_type: 0,
                                                        exclusive_size: 1,
                                                        gpu_instance_profile_id: 4294967295,
                                                    },
                                                }
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: Applying profile nvidia-47 overrides
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: Patching nvidia-47/num_heads: 4 -> 0
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: Patching nvidia-47/cuda_enabled: 1 -> 1
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: Patching nvidia-47/ecc_supported: 1 -> 0
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: Patching nvidia-47/fb_length: 1946157056 -> 1946157056
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: Patching nvidia-47/fb_reservation: 201326592 -> 201326592
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: Patching nvidia-47/frl_enable: 1 -> 0
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: cmd: 0xa0810115 failed.
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): gpu-pci-id : 0x2d00
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): vgpu_type : Quadro
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): Framebuffer: 0x74000000
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1b38:0x11e9
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: ######## vGPU Manager Information: ########
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: Driver Version: 535.230.02
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: cmd: 0x2080012f failed.
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0x120001)
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): vGPU migration enabled
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): vGPU manager is running in non-SRIOV mode.
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: display_init inst: 0 successful
Jul 07 19:11:08 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
Jul 07 19:11:08 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: Driver Version: 535.230.02
Jul 07 19:11:08 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: vGPU version: 0x120001
Jul 07 19:11:08 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
Jul 07 19:31:09 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Restricted)
Jul 07 20:25:55 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
Jul 07 20:26:21 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): Guest driver unloaded!
Jul 07 20:26:23 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_env_log: (0x0): Plugin migration stage change none -> stop_and_copy. QEMU migration state: STOPNCOPY_ACTIVE
Jul 07 20:26:24 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: Stopping all vGPU migration threads
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 5c3928a2-b61f-46a8-883a-36887adec34d GPU PCI id 00:2d:00.0 config par>
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=47
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_env_log: Successfully updated env symbols!
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: cmd: 0x20801322 failed.
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: cmd: 0x2080014b failed.

That is where the guest shut down. Then, on the next guest boot:

Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: Applying profile nvidia-47 overrides
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: Patching nvidia-47/num_heads: 4 -> 0
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: Patching nvidia-47/cuda_enabled: 1 -> 1
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: Patching nvidia-47/ecc_supported: 1 -> 0
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: Patching nvidia-47/fb_length: 1946157056 -> 1946157056
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: Patching nvidia-47/fb_reservation: 201326592 -> 201326592
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: Patching nvidia-47/frl_enable: 1 -> 0
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: cmd: 0xa0810115 failed.
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_log: (0x0): gpu-pci-id : 0x2d00
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_log: (0x0): vgpu_type : Quadro
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_log: (0x0): Framebuffer: 0x74000000
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1b38:0x11e9
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_log: ######## vGPU Manager Information: ########
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_log: Driver Version: 535.230.02
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: cmd: 0x2080012f failed.
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0x120001)
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
Jul 07 20:26:55 elysium nvidia-vgpu-mgr[58553]: error: vmiop_log: (0x0): Timed out (6001 ms) trying to sync
Jul 07 20:26:55 elysium nvidia-vgpu-mgr[58553]: error: vmiop_log: (0x0): failed to sync engine
Jul 07 20:26:56 elysium nvidia-vgpu-mgr[58553]: error: vmiop_log: (0x0): init_device_instance failed for inst 0 with error 7 (init frame copy engine)
Jul 07 20:26:56 elysium nvidia-vgpu-mgr[58553]: error: vmiop_log: (0x0): Initialization: init_device_instance failed error 7
Jul 07 20:26:56 elysium nvidia-vgpu-mgr[58553]: error: vmiop_log: display_init failed for inst: 0
Jul 07 20:26:56 elysium nvidia-vgpu-mgr[58553]: error: vmiop_env_log: (0x0): vmiope_process_configuration: plugin registration error
Jul 07 20:26:56 elysium nvidia-vgpu-mgr[58553]: error: vmiop_env_log: (0x0): vmiope_process_configuration failed with 0x65
Jul 07 20:27:15 elysium nvidia-vgpu-mgr[58992]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Jul 07 20:27:15 elysium nvidia-vgpu-mgr[58992]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 5c3928a2-b61f-46a8-883a-36887adec34d GPU PCI id 00:2d:00.0 config par>
Jul 07 20:27:15 elysium nvidia-vgpu-mgr[58992]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=47
Jul 07 20:27:15 elysium nvidia-vgpu-mgr[58992]: notice: vmiop_env_log: Successfully updated env symbols!

It looks like the vGPU driver simply stops responding. nvidia-smi still sees the device, and the PCI device is still attached to the host. The only related messages I see in dmesg are these two:
[nvidia-vgpu-vfio] 5c3928a2-b61f-46a8-883a-36887adec34d: start failed. status: 0x1
[nvidia-vgpu-vfio] vGPU type info already present for GPU 0x2d00
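To compare the working and failing starts, a small sketch like this groups the nvidia-vgpu-mgr lines by PID (one PID per start attempt in the logs above) and reports the first `error:` line for each. The function name is hypothetical, and the sample lines are copied from the journal output in this report.

```python
import re
from typing import Dict, Iterable, Optional

# Captures the PID and the message part of a journal line.
LINE_RE = re.compile(r"nvidia-vgpu-mgr\[(\d+)\]: (.*)$")

SAMPLE = [
    "Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: cmd: 0x20801322 failed.",
    "Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: display_init inst: 0 successful",
    "Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: cmd: 0x20801322 failed.",
    "Jul 07 20:26:55 elysium nvidia-vgpu-mgr[58553]: error: vmiop_log: (0x0): Timed out (6001 ms) trying to sync",
    "Jul 07 20:26:55 elysium nvidia-vgpu-mgr[58553]: error: vmiop_log: (0x0): failed to sync engine",
]


def first_error_per_pid(lines: Iterable[str]) -> Dict[int, Optional[str]]:
    """Map each nvidia-vgpu-mgr PID to its first 'error:' message, or None."""
    result: Dict[int, Optional[str]] = {}
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not an nvidia-vgpu-mgr line (e.g. a continuation line)
        pid, message = int(match.group(1)), match.group(2)
        result.setdefault(pid, None)
        if result[pid] is None and message.startswith("error:"):
            result[pid] = message
    return result
```

On the sample, PID 20904 (the cold-boot start) has no error, while PID 58553 first fails at the frame copy engine sync timeout. The `cmd: ... failed.` lines appear in both sessions, so they are probably not what distinguishes the failing start.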

I am not sure what else to try here. If anyone has any leads, I'm happy to test things, since this may well be operator error on my end. Thank you for working on this vGPU project, too!
