Description
I still haven't found the root cause of this issue, but I'm hoping someone can spot something I'm missing. I suspect downgrading the drivers might help.
I am currently running Proxmox VE 8.4.1 on the 6.8.12 kernel:
Linux elysium 6.8.12-11-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-11 (2025-05-22T09:39Z) x86_64 GNU/Linux
I applied the patch against NVIDIA driver version 535.230.02:
Mon Jul 7 20:51:44 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02 Driver Version: 535.230.02 CUDA Version: N/A |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Quadro P4000 On | 00000000:2D:00.0 Off | N/A |
| 46% 38C P8 9W / 105W | 30MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
As far as I recall, I downloaded the NVIDIA vGPU drivers from somewhere in the 16.x branch, though I am not sure how to verify that.
At a minimum, the vGPU shows up correctly:
Mon Jul 7 20:54:04 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02 Driver Version: 535.230.02 |
|---------------------------------+------------------------------+------------+
| GPU Name | Bus-Id | GPU-Util |
| vGPU ID Name | VM ID VM Name | vGPU-Util |
|=================================+==============================+============|
| 0 Quadro P4000 | 00000000:2D:00.0 | 0% |
+---------------------------------+------------------------------+------------+
and it is a proper mdev:
root@elysium:/home/vgpu-proxmox# mdevctl list
5c3928a2-b61f-46a8-883a-36887adec34d 0000:2d:00.0 nvidia-47 manual
I created it manually: in the PCI device's sysfs directory I went into mdev_supported_types and, for the nvidia-47 type, created the device with that UUID. Then, in the VM config, I added the device parameter by hand (I had seen a bug report saying this didn't work well in the Proxmox GUI, and I was having trouble with it myself):
args: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/5c3928a2-b61f-46a8-883a-36887adec34d -uuid 00000000-0000-0000-0000-000000000101
where the UUID is just my VM ID (101) in UUID format.
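As a rough sketch of the manual steps described above (not necessarily the exact commands that were run), the setup could look like the following. The function names `create_mdev`, `vmid_to_uuid`, and `vm_args_line` are hypothetical and made up for illustration. The sketch assumes the standard kernel mdev sysfs layout, where writing a UUID to the `create` file under `mdev_supported_types/<type>/` creates the device.

```shell
# Hypothetical helpers illustrating the manual mdev setup; the names are
# illustrative and do not come from any existing tool.

# Create a mediated device by writing a UUID to the type's "create" file.
#   $1: PCI device directory, e.g. /sys/bus/pci/devices/0000:2d:00.0
#   $2: mdev type, e.g. nvidia-47
#   $3: UUID for the new mdev
create_mdev() {
    pci_dev_dir=$1
    mdev_type=$2
    uuid=$3
    create_file="$pci_dev_dir/mdev_supported_types/$mdev_type/create"
    if [ ! -e "$create_file" ]; then
        echo "no such mdev type: $mdev_type" >&2
        return 1
    fi
    printf '%s\n' "$uuid" > "$create_file"
}

# Format a numeric VM ID as a UUID-shaped string by zero-padding it
# into the last group (101 becomes ...-000000000101).
vmid_to_uuid() {
    printf '00000000-0000-0000-0000-%012d\n' "$1"
}

# Print the "args:" line for the VM config from an mdev UUID and a VM ID.
vm_args_line() {
    printf 'args: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/%s -uuid %s\n' \
        "$1" "$(vmid_to_uuid "$2")"
}
```

With the values from this report, that would be `create_mdev /sys/bus/pci/devices/0000:2d:00.0 nvidia-47 5c3928a2-b61f-46a8-883a-36887adec34d` followed by `vm_args_line 5c3928a2-b61f-46a8-883a-36887adec34d 101`, which prints the args line shown above.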
For whatever reason, this works flawlessly after a cold boot of the underlying host. But if I shut down the VM and try to start it again, I get this error:
vm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/5c3928a2-b61f-46a8-883a-36887adec34d: vfio 5c3928a2-b61f-46a8-883a-36887adec34d: error getting device from group 172: Input/output error
Verify all devices in group 172 are bound to vfio-<bus> or pci-stub and not already in use
TASK ERROR: start failed: QEMU exited with code 1
This error seems fairly generic and also shows up in other, unrelated issues. On the first boot I see normal output from nvidia-vgpu-mgr:
Jul 07 18:30:02 elysium nvidia-vgpu-mgr[2523]: Nv0000CtrlVgpuCreateDeviceParams {
vgpu_name: {5c3928a2-b61f-46a8-883a-36887adec34d},
gpu_pci_id: 0x2d00,
gpu_pci_bdf: 11520,
vgpu_type_id: 47,
vgpu_id: 1,
}
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 5c3928a2-b61f-46a8-883a-36887adec34d GPU PCI id 00:2d:00.0 config par>
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=47
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_env_log: Successfully updated env symbols!
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: cmd: 0x20801322 failed.
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: cmd: 0x2080014b failed.
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: NvA081CtrlVgpuConfigGetVgpuTypeInfoParams {
vgpu_type: 47,
vgpu_type_info: NvA081CtrlVgpuInfo {
vgpu_type: 47,
vgpu_name: "GRID P40-2Q",
vgpu_class: "Quadro",
vgpu_signature: [],
license: "Quadro-Virtual-DWS,5.0;GRID-Virtual-WS,2.0;GRID-Virtual-WS-Ext,2.0",
max_instance: 12,
num_heads: 4,
max_resolution_x: 7680,
max_resolution_y: 4320,
max_pixels: 36864000,
frl_config: 60,
cuda_enabled: 1,
ecc_supported: 1,
gpu_instance_size: 0,
multi_vgpu_supported: 0,
vdev_id: 0x1b3811e9,
pdev_id: 0x1b38,
profile_size: 0x80000000,
fb_length: 0x74000000,
gsp_heap_size: 0x0,
fb_reservation: 0xc000000,
mappable_video_size: 0x400000,
encoder_capacity: 0x64,
bar1_length: 0x100,
frl_enable: 1,
adapter_name: "GRID P40-2Q",
adapter_name_unicode: "GRID P40-2Q",
short_gpu_name_string: "GP104GL-A",
licensed_product_name: "NVIDIA RTX Virtual Workstation",
vgpu_extra_params: [],
ftrace_enable: 0,
gpu_direct_supported: 0,
nvlink_p2p_supported: 0,
multi_vgpu_exclusive: 0,
exclusive_type: 0,
exclusive_size: 1,
gpu_instance_profile_id: 4294967295,
},
}
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: Applying profile nvidia-47 overrides
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: Patching nvidia-47/num_heads: 4 -> 0
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: Patching nvidia-47/cuda_enabled: 1 -> 1
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: Patching nvidia-47/ecc_supported: 1 -> 0
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: Patching nvidia-47/fb_length: 1946157056 -> 1946157056
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: Patching nvidia-47/fb_reservation: 201326592 -> 201326592
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: Patching nvidia-47/frl_enable: 1 -> 0
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: cmd: 0xa0810115 failed.
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): gpu-pci-id : 0x2d00
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): vgpu_type : Quadro
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): Framebuffer: 0x74000000
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1b38:0x11e9
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: ######## vGPU Manager Information: ########
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: Driver Version: 535.230.02
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: cmd: 0x2080012f failed.
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0x120001)
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): vGPU migration enabled
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): vGPU manager is running in non-SRIOV mode.
Jul 07 19:10:39 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: display_init inst: 0 successful
Jul 07 19:11:08 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
Jul 07 19:11:08 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: Driver Version: 535.230.02
Jul 07 19:11:08 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: vGPU version: 0x120001
Jul 07 19:11:08 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
Jul 07 19:31:09 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Restricted)
Jul 07 20:25:55 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
Jul 07 20:26:21 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: (0x0): Guest driver unloaded!
Jul 07 20:26:23 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_env_log: (0x0): Plugin migration stage change none -> stop_and_copy. QEMU migration state: STOPNCOPY_ACTIVE
Jul 07 20:26:24 elysium nvidia-vgpu-mgr[20904]: notice: vmiop_log: Stopping all vGPU migration threads
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 5c3928a2-b61f-46a8-883a-36887adec34d GPU PCI id 00:2d:00.0 config par>
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=47
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_env_log: Successfully updated env symbols!
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: cmd: 0x20801322 failed.
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: cmd: 0x2080014b failed.
That is where the guest was shut down. Then, on the next guest boot:
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: Applying profile nvidia-47 overrides
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: Patching nvidia-47/num_heads: 4 -> 0
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: Patching nvidia-47/cuda_enabled: 1 -> 1
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: Patching nvidia-47/ecc_supported: 1 -> 0
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: Patching nvidia-47/fb_length: 1946157056 -> 1946157056
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: Patching nvidia-47/fb_reservation: 201326592 -> 201326592
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: Patching nvidia-47/frl_enable: 1 -> 0
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: cmd: 0xa0810115 failed.
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_log: (0x0): gpu-pci-id : 0x2d00
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_log: (0x0): vgpu_type : Quadro
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_log: (0x0): Framebuffer: 0x74000000
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1b38:0x11e9
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_log: ######## vGPU Manager Information: ########
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_log: Driver Version: 535.230.02
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: cmd: 0x2080012f failed.
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0x120001)
Jul 07 20:26:49 elysium nvidia-vgpu-mgr[58553]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
Jul 07 20:26:55 elysium nvidia-vgpu-mgr[58553]: error: vmiop_log: (0x0): Timed out (6001 ms) trying to sync
Jul 07 20:26:55 elysium nvidia-vgpu-mgr[58553]: error: vmiop_log: (0x0): failed to sync engine
Jul 07 20:26:56 elysium nvidia-vgpu-mgr[58553]: error: vmiop_log: (0x0): init_device_instance failed for inst 0 with error 7 (init frame copy engine)
Jul 07 20:26:56 elysium nvidia-vgpu-mgr[58553]: error: vmiop_log: (0x0): Initialization: init_device_instance failed error 7
Jul 07 20:26:56 elysium nvidia-vgpu-mgr[58553]: error: vmiop_log: display_init failed for inst: 0
Jul 07 20:26:56 elysium nvidia-vgpu-mgr[58553]: error: vmiop_env_log: (0x0): vmiope_process_configuration: plugin registration error
Jul 07 20:26:56 elysium nvidia-vgpu-mgr[58553]: error: vmiop_env_log: (0x0): vmiope_process_configuration failed with 0x65
Jul 07 20:27:15 elysium nvidia-vgpu-mgr[58992]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Jul 07 20:27:15 elysium nvidia-vgpu-mgr[58992]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 5c3928a2-b61f-46a8-883a-36887adec34d GPU PCI id 00:2d:00.0 config par>
Jul 07 20:27:15 elysium nvidia-vgpu-mgr[58992]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=47
Jul 07 20:27:15 elysium nvidia-vgpu-mgr[58992]: notice: vmiop_env_log: Successfully updated env symbols!
At that point it seems like the vGPU driver just goes out to lunch. nvidia-smi still sees the device, and the PCI device is still attached to the host. The only related messages I see in dmesg are these two:
[nvidia-vgpu-vfio] 5c3928a2-b61f-46a8-883a-36887adec34d: start failed. status: 0x1
[nvidia-vgpu-vfio] vGPU type info already present for GPU 0x2d00
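To separate the failing start attempt from the surrounding noise in logs like the ones above, a small filter along these lines might help. It is only a sketch: the function name `vgpu_mgr_errors` is hypothetical, and it matches only the `error:` level lines, because the `cmd: 0x... failed.` lines also appear during the boot that works.

```shell
# Hypothetical helper: read journal text on stdin and keep only the
# error-level lines logged by nvidia-vgpu-mgr.
vgpu_mgr_errors() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E 'nvidia-vgpu-mgr\[[0-9]+\]: error:' || true
}
```

Assuming the service unit is named nvidia-vgpu-mgr, as the log prefix suggests, it could be fed with `journalctl -u nvidia-vgpu-mgr -b --no-pager | vgpu_mgr_errors`. Using `--no-pager` should also avoid the truncated lines ending in `>` seen in the output above.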
I am not sure what else to try here. There may well be operator error on my end, so if anyone has any leads I am happy to try things. Also, thank you for working on this vGPU project!