Document DRA driver installation on OpenShift #82

Merged
jgehrcke merged 1 commit into kubernetes-sigs:main from empovit:openshift-doc
Dec 15, 2025

Conversation

@empovit
Contributor

@empovit empovit commented Mar 7, 2024

Document OpenShift-specific steps for installing and running the DRA driver.

@empovit empovit marked this pull request as ready for review March 7, 2024 13:13
@elezar elezar self-requested a review March 7, 2024 13:20
@elezar elezar self-assigned this Mar 7, 2024
Contributor

@elezar elezar left a comment

Thanks @empovit. I have some minor comments / questions here.

Comment thread demo/clusters/openshift/README.md Outdated
Comment thread demo/clusters/openshift/README.md Outdated
Comment thread demo/clusters/openshift/README.md Outdated
Comment thread demo/clusters/openshift/README.md Outdated
Comment thread demo/clusters/openshift/add-certified-catalog-source.sh Outdated
@elezar elezar requested a review from cdesiniotis March 7, 2024 13:28
@elezar
Contributor

elezar commented Mar 7, 2024

/cc @cdesiniotis since he has more OpenShift experience than me.

@empovit empovit force-pushed the openshift-doc branch 2 times, most recently from 6bf7bac to c588614 Compare March 7, 2024 14:30
Comment thread demo/clusters/openshift/README.md Outdated
Comment thread demo/clusters/openshift/README.md Outdated
Comment thread demo/clusters/openshift/README.md Outdated
Comment thread demo/clusters/openshift/README.md Outdated
Comment thread demo/clusters/openshift/README.md Outdated
Comment thread demo/clusters/openshift/README.md Outdated
Comment thread demo/clusters/openshift/README.md Outdated
@empovit empovit marked this pull request as draft June 20, 2024 14:59
@empovit empovit force-pushed the openshift-doc branch 2 times, most recently from d5feb81 to 2293f96 Compare June 20, 2024 15:44
@empovit empovit marked this pull request as ready for review July 3, 2024 14:23
@empovit empovit marked this pull request as draft July 11, 2024 10:02
@kannon92
Contributor

kannon92 commented Sep 6, 2024

@klueska @elezar @cdesiniotis How can we go forward with this PR?

@klueska
Contributor

klueska commented Sep 6, 2024

Things are changing drastically for 1.31 (just as they did for the example driver). Once we have main updated with all of the changes for 1.31, we can revisit this.

@zhouhao3

zhouhao3 commented Nov 7, 2024

@klueska Hello, I noticed that this pull request aims to add documentation for deploying DRA on OpenShift. I am currently attempting to deploy DRA on an RHOCP cluster but have run into some challenges due to the lack of available documentation, and this PR seems to be on hold.
Are there any known issues or limitations with deploying DRA on RHOCP at the moment? Should I switch to a certain version or wait for something to be fixed? Any guidance or updates on this documentation would be greatly appreciated. Thanks.

@zhouhao3

zhouhao3 commented Dec 3, 2024

@elezar @klueska Hi, I wanted to kindly follow up on my previous question regarding this PR. If you have any updates or need further clarification from my side, please let me know. Your feedback would be greatly appreciated.

@empovit
Contributor Author

empovit commented Dec 4, 2024

@zhouhao3 I believe the answer is in this comment

Things are changing drastically for 1.31 (just as they did for the example driver). Once we have main updated with all of the changes for 1.31, we can revisit this.

Although the procedure in this PR may still work on older versions of OpenShift with the old "classic DRA" version of the NVIDIA DRA driver, it needs revisiting after the new DRA API makes it into OpenShift.

Also, we have discovered some corner cases that must be addressed in this document, notably that applying nvidia.com/mig.config=all-enabled may wipe out MIG partitions created by the DRA plugin, for example if the MIG manager pod crashes and is then re-created.

@zhouhao3

zhouhao3 commented Dec 5, 2024

@empovit
Thank you so much for your detailed response. I truly appreciate the time and effort you've put into addressing my queries.

We are currently in the process of setting up an environment with RHOCP + DRA structured parameters, and your insights have been incredibly helpful. I wanted to ask if there is a timeline or any plans for when this PR might be merged.

@empovit
Contributor Author

empovit commented Dec 5, 2024

@zhouhao3
First, I don't think you need this PR merged to follow the steps. I mean if they work for you, just use them.

Second, an updated version of the PR will not specifically target DRA structured parameters, but the beta version of DRA.

Third, for this to happen, we depend on the following factors:

  • The upstream DRA 1.32 API is merged into OpenShift
  • The NVIDIA DRA driver fully supports DRA 1.32 API
  • The updated NVIDIA DRA driver is tested on an OpenShift version with DRA 1.32

Unfortunately I don't have an ETA for that.

As far as I can tell from my other work related to NVIDIA MIG and DRA, in addition to the documented steps I would recommend setting migManager.env.MIG_PARTED_MODE_CHANGE_ONLY to true:

  migManager:
    ...
    env:
      - name: WITH_REBOOT
        value: 'true'
      - name: MIG_PARTED_MODE_CHANGE_ONLY
        value: 'true'
    ...

Assuming you are using DRA with MIG (an older version of the driver), this will make sure that the MIG manager only takes care of the MIG mode and never tries to delete partitions allocated by the NVIDIA DRA driver. Keep in mind though that it makes cleaning up any existing MIG partitions your responsibility (prior to configuring the DRA driver). E.g. you can apply nvidia.com/mig.config=all-enabled, wait for it to succeed, and then add the MIG_PARTED_MODE_CHANGE_ONLY env variable to the cluster policy.
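
As a rough illustration of the change described above, the following minimal Python sketch adds the MIG_PARTED_MODE_CHANGE_ONLY variable to a `migManager.env` list without duplicating it. The helper name `with_mode_change_only` and the dictionary shape of the cluster policy spec are assumptions made for this example; only the variable names and values come from the snippet above.

```python
import json


def with_mode_change_only(spec: dict) -> dict:
    """Return a copy of a cluster-policy-like spec (hypothetical shape) with
    MIG_PARTED_MODE_CHANGE_ONLY set to 'true' under migManager.env."""
    updated = json.loads(json.dumps(spec))  # cheap deep copy of JSON-like data
    env = updated.setdefault("migManager", {}).setdefault("env", [])
    for entry in env:
        if entry.get("name") == "MIG_PARTED_MODE_CHANGE_ONLY":
            entry["value"] = "true"  # already present: just force the value
            break
    else:
        # Not present yet: append it, leaving existing entries untouched.
        env.append({"name": "MIG_PARTED_MODE_CHANGE_ONLY", "value": "true"})
    return updated


# Example input mirroring the snippet above (WITH_REBOOT already set).
spec = {"migManager": {"env": [{"name": "WITH_REBOOT", "value": "true"}]}}
patched = with_mode_change_only(spec)
print(json.dumps({"spec": patched}, indent=2))
```

The printed JSON has the same structure as the YAML above and could serve as a starting point for editing the cluster policy by hand. Existing MIG partitions still have to be cleaned up first, as noted.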

@empovit
Contributor Author

empovit commented Jan 18, 2025

@guptaNswati I have already explained the versions in #82 (comment). Hope it makes sense.

@guptaNswati
Contributor

guptaNswati commented Jan 21, 2025

@empovit Yes, understood. What is the ETA of an OpenShift release with DRA v1beta1? v1alpha2 is old, and it would be better if OpenShift were at least updated to v1alpha3 for any useful testing. Meanwhile, can you point to the DRA image tag (ghcr.io/nvidia/k8s-dra-driver:?) you used to test your PR?

@empovit
Contributor Author

empovit commented Jan 22, 2025

@guptaNswati I don't think I can disclose the ETA for 4.19 (the version likely to have K8s 1.32). The pre-release version of 4.18 has v1alpha3 (K8s 1.31). At the time I was testing the DRA driver on OpenShift, I had to build an image myself, so I can't point to an image tag.
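
To keep the versions mentioned in this thread straight, here is a small Python lookup sketch. It records only what is stated in this conversation (4.18 with K8s 1.31 and v1alpha3, 4.19 expected with K8s 1.32 and DRA v1beta1). The table and the helper `dra_api_for` are hypothetical names for illustration, not part of any tool.

```python
# Version pairs as stated in this thread; not an exhaustive or official matrix.
DRA_BY_OPENSHIFT = {
    "4.18": {"kubernetes": "1.31", "dra_api": "v1alpha3"},
    "4.19": {"kubernetes": "1.32", "dra_api": "v1beta1"},
}


def dra_api_for(openshift_version: str) -> str:
    """Look up the DRA API version for an OpenShift version like '4.19.3'."""
    major_minor = ".".join(openshift_version.split(".")[:2])
    info = DRA_BY_OPENSHIFT.get(major_minor)
    if info is None:
        raise KeyError(f"no DRA API version recorded for OpenShift {major_minor}")
    return info["dra_api"]


print(dra_api_for("4.19.3"))
```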

@copy-pr-bot

copy-pr-bot Bot commented Jun 30, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@klueska klueska added this to the unscheduled milestone Aug 13, 2025
@klueska klueska added the documentation Issue/PR focused on fixing/editing/adding documentation bits label Aug 13, 2025
@zoltan

zoltan commented Sep 25, 2025

4.19 has been released with Kubernetes 1.32.

@empovit
Contributor Author

empovit commented Sep 28, 2025

In order for the DRA driver to work properly on OpenShift, the following changes are required:

@zoltan

zoltan commented Sep 28, 2025

I'll give this a try on 4.19. With those two changes, you'd expect it to be smooth sailing?

@empovit
Contributor Author

empovit commented Sep 28, 2025

@zoltan yes, definitely. It's been tested :)

@empovit empovit marked this pull request as ready for review December 10, 2025 06:44
@empovit
Contributor Author

empovit commented Dec 10, 2025

This depends on #569

@empovit empovit changed the title from "Document deploying DRA to OpenShift" to "Document DRA driver installation on OpenShift" Dec 10, 2025
@empovit empovit force-pushed the openshift-doc branch 2 times, most recently from 1aeda5c to a3fa9cd Compare December 10, 2025 06:54
Comment thread demo/clusters/openshift/README.md Outdated
Comment thread demo/clusters/openshift/README.md Outdated
Comment thread demo/clusters/openshift/README.md Outdated
```yaml
cdi:
  enabled: true
```
Contributor

That's the default now with GPU Operator 25.10.0. We can probably still leave this step in, right? (I suppose that's your point here... making damn sure :-)).

Contributor Author

Yes, people may change it to 'false' to keep the old behavior, i.e. before CDI support was introduced on OpenShift. Obviously, the driver won't work correctly with cdi: false.
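
A minimal Python sketch of the kind of sanity check this step guards against, assuming the cluster policy spec has already been loaded into a plain dictionary. The function name `cdi_explicitly_enabled` and the dictionary layout are illustrative assumptions. A missing key is treated as "not confirmed" rather than relying on any default.

```python
def cdi_explicitly_enabled(spec: dict) -> bool:
    """Return True only if the spec (hypothetical dict form) sets cdi.enabled to true."""
    cdi = spec.get("cdi")
    return isinstance(cdi, dict) and cdi.get("enabled") is True


# Someone kept the pre-CDI behaviour: the check flags it.
print(cdi_explicitly_enabled({"cdi": {"enabled": False}}))
# The setting from the documented step: the check passes.
print(cdi_explicitly_enabled({"cdi": {"enabled": True}}))
```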

Comment thread demo/clusters/openshift/README.md
Comment thread README.md Outdated
Contributor

@jgehrcke jgehrcke left a comment

Thank you for working on this, that's great. We appreciate that a lot!

I've made a few comments. On the one hand nothing all too critical, but it would be nice to make some of the suggested changes before merging.

Signed-off-by: Vitaliy Emporopulo <vemporop@redhat.com>
Contributor

@jgehrcke jgehrcke left a comment

Thank you for the patience and for iterating.

Docs like this will get stale -- as discussed elsewhere, in the future we will hopefully replace this with (or turn this into) a CI-backed source of truth, where the installation instructions are tested as part of regular CI.

Let's land this now, after

(datetime(2025, 12, 15) - datetime(2024, 3, 2)).days
653

days since opening the PR ⌛

@jgehrcke jgehrcke merged commit 4cb4dfc into kubernetes-sigs:main Dec 15, 2025
1 check passed
@empovit empovit deleted the openshift-doc branch January 12, 2026 18:06

Labels

documentation Issue/PR focused on fixing/editing/adding documentation bits

8 participants