Document DRA driver installation on OpenShift #82
|
/cc @cdesiniotis since he has more OpenShift experience than me. |
6bf7bac to c588614
d5feb81 to 2293f96
|
@klueska @elezar @cdesiniotis How can we go forward with this PR? |
|
Things are changing drastically for 1.31 (just as they did for the example driver). Once we have |
|
@klueska Hello, I noticed that this pull request aims to add documentation for deploying DRA on OpenShift. I am currently attempting to deploy DRA on an RHOCP cluster but have run into some challenges due to the lack of available documentation, and this PR seems to be on hold. |
|
@zhouhao3 I believe the answer is in this comment
Although the procedure in this PR may still work on older versions of OpenShift with the old "classic DRA" version of the NVIDIA DRA driver, it needs revisiting after the new DRA API makes it into OpenShift. Also, we have discovered some corner cases that must be addressed in this document, notably the fact that applying |
|
@empovit We are currently setting up an environment with RHOCP + DRA structured parameters, and your insights have been incredibly helpful. Is there a timeline or any plan for when this PR might be merged? |
|
@zhouhao3 Second, an updated version of the PR will not specifically target DRA structured parameters, but the beta version of DRA. Third, for this to happen, we depend on the following factors:
Unfortunately, I don't have an ETA for that. As far as I can tell from my other work related to NVIDIA MIG and DRA, in addition to the documented steps I would recommend setting:

    migManager:
      ...
      env:
      - name: WITH_REBOOT
        value: 'true'
      - name: MIG_PARTED_MODE_CHANGE_ONLY
        value: 'true'
      ...

Assuming you are using DRA with MIG (an older version of the driver), this will make sure that the MIG manager only takes care of the MIG mode and never tries to delete partitions allocated by the NVIDIA DRA driver. Keep in mind though that it makes cleaning up any existing MIG partitions your responsibility (prior to configuring the DRA driver). E.g. you can apply |
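
For anyone who wants to sanity-check their settings before rolling them out, here is a minimal sketch in Python. It assumes the ClusterPolicy spec has already been parsed into a plain dict (for example from YAML); the helper name `missing_mig_manager_env` is hypothetical and not part of the GPU Operator or the DRA driver.

```python
# Hypothetical helper for illustration only; not an API of the GPU Operator.
# The two variables and their values are the ones recommended in the comment above.
REQUIRED_MIG_MANAGER_ENV = {
    "WITH_REBOOT": "true",
    "MIG_PARTED_MODE_CHANGE_ONLY": "true",
}


def missing_mig_manager_env(cluster_policy_spec):
    """Return the recommended env vars that are absent or set to another value."""
    mig_manager = cluster_policy_spec.get("migManager") or {}
    env_list = mig_manager.get("env") or []
    actual = {item.get("name"): str(item.get("value")) for item in env_list}
    return {
        name: value
        for name, value in REQUIRED_MIG_MANAGER_ENV.items()
        if actual.get(name) != value
    }


# Example spec, shaped like the parsed YAML fragment above.
example_spec = {
    "migManager": {
        "env": [
            {"name": "WITH_REBOOT", "value": "true"},
            {"name": "MIG_PARTED_MODE_CHANGE_ONLY", "value": "true"},
        ]
    }
}
print(missing_mig_manager_env(example_spec))
```

An empty result means both variables are set as recommended; anything else lists what still needs to be added or changed.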
|
@guptaNswati I have already explained the versions in #82 (comment). Hope it makes sense. |
|
@empovit yes understood. What is the ETA of OpenShift release with DRA v1beta1? The |
|
@guptaNswati I don't think I can disclose the ETA for 4.19 (the version likely to have K8s 1.32). The pre-release version of 4.18 has v1alpha3 (K8s 1.31). At the time I was testing the DRA driver on OpenShift, I had to build an image myself, so I can't point to an image tag. |
|
4.19 has been released with Kubernetes 1.32. |
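
To keep the versions mentioned in this thread straight, here is a small lookup sketch in Python. The pairing of OpenShift and Kubernetes versions comes from the comments above; the `resource.k8s.io` API group names are the upstream Kubernetes ones, and the table and the function name `dra_api_for` are illustrative assumptions, not something the driver provides.

```python
# Illustrative table only. Versions are taken from this thread:
# 4.18 (pre-release) carries K8s 1.31 with the v1alpha3 DRA API,
# 4.19 was released with K8s 1.32, which serves the v1beta1 DRA API.
DRA_API_BY_OPENSHIFT = {
    "4.18": {"kubernetes": "1.31", "dra_api": "resource.k8s.io/v1alpha3"},
    "4.19": {"kubernetes": "1.32", "dra_api": "resource.k8s.io/v1beta1"},
}


def dra_api_for(openshift_version):
    """Return the DRA API group/version for a listed OpenShift minor release, else None."""
    entry = DRA_API_BY_OPENSHIFT.get(openshift_version)
    return entry["dra_api"] if entry else None


print(dra_api_for("4.19"))
```

Releases that are not in the table return `None` rather than a guess.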
|
In order for the DRA driver to work properly on OpenShift, the following changes are required: |
|
I'll give this a try on 4.19. With those two changes, would you expect it to be smooth sailing? |
|
@zoltan yes, definitely. It's been tested :) |
6be93e4 to 713bfb6
|
This depends on #569 |
1aeda5c to a3fa9cd
```yaml
cdi:
  enabled: true
```
That's the default now with GPU Operator 25.10.0. We can probably still leave this step in, right? (I suppose that's your point here... making damn sure :-)).
Yes, people may change it to 'false' to keep the old behavior, i.e. before CDI support was introduced on OpenShift. Obviously, the driver won't work correctly with cdi: false.
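
The case described here (someone explicitly switching CDI off) can be caught with a quick check. This is a minimal sketch in Python, assuming the ClusterPolicy spec is available as a parsed dict; `cdi_explicitly_disabled` is a hypothetical name, and a missing `cdi` section is treated as "not disabled" because the default is already `true` with GPU Operator 25.10.0.

```python
# Hypothetical helper for illustration; not part of the GPU Operator or the DRA driver.
def cdi_explicitly_disabled(cluster_policy_spec):
    """Return True only when cdi.enabled is explicitly set to false."""
    cdi = cluster_policy_spec.get("cdi") or {}
    if "enabled" not in cdi:
        # Nothing set explicitly, so the operator default applies.
        return False
    value = cdi["enabled"]
    # Accept both a YAML boolean and a quoted 'false' string.
    return value is False or str(value).strip().lower() == "false"


print(cdi_explicitly_disabled({"cdi": {"enabled": False}}))
```

If this returns `True`, the setting needs to be flipped back before the DRA driver can work correctly.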
jgehrcke
left a comment
Thank you for working on this, that's great. We appreciate that a lot!
I've made a few comments. On the one hand nothing all too critical, but it would be nice to make some of the suggested changes before merging.
Signed-off-by: Vitaliy Emporopulo <vemporop@redhat.com>
a3fa9cd to 0a62419
jgehrcke
left a comment
Thank you for the patience and for iterating.
Docs like this will get stale. As discussed elsewhere, in the future we will hopefully replace this with (or turn this into) a CI-backed source of truth, where the installation instructions are tested as part of regular CI.
Let's land this now, after `(datetime(2025, 12, 15) - datetime(2024, 3, 2)).days` = 653 days since opening the PR ⌛
Document OpenShift-specific steps for installing and running the DRA driver.