Define FG graduation #954

Open
guptaNswati wants to merge 1 commit into kubernetes-sigs:main from guptaNswati:FG-policy

@guptaNswati
Contributor

Address #931

Signed-off-by: Swati Gupta <swatig@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Mar 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@guptaNswati guptaNswati requested a review from Copilot March 18, 2026 21:14
@guptaNswati guptaNswati self-assigned this Mar 18, 2026
@guptaNswati guptaNswati added the documentation (Issue/PR focused on fixing/editing/adding documentation bits) and maintenance/chores (issue/pr for maintenance, release work, code cleanup, chores) labels Mar 18, 2026
@guptaNswati guptaNswati added this to the v26.4.0 milestone Mar 18, 2026

Copilot AI left a comment

Pull request overview

Adds a formal policy document describing how feature gates in the NVIDIA DRA Driver for GPUs should progress from Alpha to Beta to Stable, including evidence expectations and a snapshot of the current gate inventory.

Changes:

  • Introduces graduation criteria (entry/graduation requirements) for Alpha, Beta, and Stable feature gates.
  • Defines deprecation/removal expectations and upstream Kubernetes dependency coupling rules.
  • Documents current feature-gate inventory and highlights gaps to reach/maintain desired stages.


### 2.3 Stable (GA) — Production Grade

**Default:** `true`, **Locked:** feature gate cannot be disabled
Comment on lines +155 to +156
| `DynamicMIG` | Alpha | `false` | v25.12 | [KEP-4815] (Alpha 1.35, Beta target 1.36) | Mutually exclusive with PassthroughSupport, NVMLDeviceHealthCheck, MPSSupport |
| `NVMLDeviceHealthCheck` | Alpha | `false` | v25.12 | [KEP-5055] (Alpha 1.33, Beta target 1.36) | Mutually exclusive with DynamicMIG |
@@ -0,0 +1,184 @@
# Policy on Feature Gate Graduation
Comment on lines +24 to +25
**Default:** `false` (opt-in)
**Signal:** "Try it out and give us feedback."
Comment on lines +44 to +46
**Default:** `true` (opt-out)
**Signal:** "We're confident in the design. Early production use is
encouraged."
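
The stage defaults quoted in these hunks (Alpha `false` and opt-in, Beta `true` and opt-out, Stable `true` and locked) could be sketched roughly as follows. This is only an illustration: `Stage`, `GateSpec`, `SpecFor`, and `Resolve` are hypothetical names and are not taken from the driver's `pkg/featuregates` package.

```go
package main

import "fmt"

// Stage is a hypothetical enum for the maturity levels named in the policy.
type Stage int

const (
	Alpha Stage = iota
	Beta
	Stable
)

// GateSpec is a hypothetical record, not the driver's actual type.
type GateSpec struct {
	Default bool // value used when the operator sets nothing
	Locked  bool // when true, the gate cannot be changed from its default
}

// SpecFor returns the default and lock behaviour the policy text assigns to a stage.
func SpecFor(s Stage) GateSpec {
	switch s {
	case Beta:
		return GateSpec{Default: true, Locked: false} // opt-out
	case Stable:
		return GateSpec{Default: true, Locked: true} // cannot be disabled
	default:
		return GateSpec{Default: false, Locked: false} // Alpha: opt-in
	}
}

// Resolve applies an optional operator override to a gate at the given stage.
// A nil override means "use the stage default".
func Resolve(s Stage, override *bool) (bool, error) {
	spec := SpecFor(s)
	if override == nil {
		return spec.Default, nil
	}
	if spec.Locked && *override != spec.Default {
		return spec.Default, fmt.Errorf("gate is locked to %t at this stage", spec.Default)
	}
	return *override, nil
}

func main() {
	off := false
	value, err := Resolve(Stable, &off)
	fmt.Println(value, err)
}
```

Returning an error for a locked gate, rather than silently ignoring the override, is one possible reading of "cannot be disabled"; the policy text does not say which behaviour is intended.
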
When the upstream dependency is not at the required level, the feature must
detect and degrade gracefully, require and fail loudly, or defer promotion.

## 3. Current Feature Gate Inventory
Contributor

I think it's better to keep this doc just about policy and put the details of the current feature gates in a different doc.
Even better would be a dedicated page for each feature gate with details about it.

Contributor Author

That's a good idea. Right now, we don't have any doc on the FGs. A dedicated page would let us keep the design, discussions, and the different stages in a single place.

Contributor

Yes. We should at least split the static policy section and the dynamic feature gate section.

Contributor

@rajatchopra rajatchopra left a comment

How do we reconcile information here with roadmap/issues?

@k8s-triage-robot

Unknown CLA label state. Rechecking for CLA labels.

Send feedback to sig-contributor-experience at kubernetes/community.

/check-cla
/easycla

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 2, 2026

> **Note:** Some existing gates (`IMEXDaemonsWithDNSNames`, `ComputeDomainCliques`,
> `CrashOnNVLinkFabricErrors`) were introduced directly at Beta to fix some critical bugs.
> The criteria below apply to **new feature gates going forward** and to **promotions of existing gates** to the next stage.
Contributor

It may be worth calling out an official release where this takes effect. "Going forward" is a non-specific measure of time. Maybe something like "new feature gates introduced after the 25.12.0 release".
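
If the policy does pin a cutoff release, checking a gate against it needs a numeric comparison, because a plain string comparison would sort `v25.8` after `v25.12`. A minimal sketch, assuming versions shaped like the ones in this PR (`v25.8`, `v25.12`, `v26.4.0`); `parse` and `IntroducedAfter` are hypothetical helpers, and `25.12.0` is only the example cutoff from this comment:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parse turns "v25.12" or "25.12.0" into [major, minor, patch].
// Components that are not given are treated as 0.
func parse(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) > 3 {
		return out, fmt.Errorf("too many components in %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("bad component %q in %q", p, v)
		}
		out[i] = n
	}
	return out, nil
}

// IntroducedAfter reports whether release is strictly newer than cutoff.
func IntroducedAfter(release, cutoff string) (bool, error) {
	r, err := parse(release)
	if err != nil {
		return false, err
	}
	c, err := parse(cutoff)
	if err != nil {
		return false, err
	}
	for i := range r {
		if r[i] != c[i] {
			return r[i] > c[i], nil
		}
	}
	return false, nil // same release, so not "after"
}

func main() {
	for _, release := range []string{"v25.8", "v25.12", "v26.4.0"} {
		after, err := IntroducedAfter(release, "25.12.0")
		fmt.Println(release, after, err)
	}
}
```
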

Contributor

@ArangoGutierrez ArangoGutierrez left a comment

Lightweight graduation criteria are well-scaled to the project's size, and the upstream KEP coupling in §2.5 is the right framing. Three small calibration nits inline — none block the policy landing.

| B1 | All critical bugs from Alpha are fixed | Linked issues closed |
| B2 | BATS tests covering primary user workflows passing in CI | Test + CI links |
| B3 | Negative / error-path tests | Test file links |
| B4 | Prometheus metrics for key operational signals (where applicable) | Metric names in proposal |
Contributor

"Where applicable" is doing a lot of work here, and §4 "Current Gaps" admits there is no metrics endpoint for gpu-kubelet-plugin today. As written, B4 is likely to be waived case-by-case and won't bind any gate at Beta. Consider either (a) making B4 mandatory for any control-loop or health-monitoring gate, or (b) moving it to an explicit "Beta prerequisites pending infrastructure" subsection so the gap is visible rather than absorbed into each promotion's prose.

| Stable | Stable (strongly preferred) |

When the upstream dependency is not at the required level, the feature must
detect and degrade gracefully, require and fail loudly, or defer promotion.
Contributor

Three options with no decision process. For a driver feature at Beta whose upstream KEP is Alpha, all three are technically available — without a tiebreaker this gets resolved ad hoc per PR. Suggest naming who decides (maintainers? single maintainer + recorded rationale?) and the default choice when no explicit decision is recorded (probably "defer promotion" to keep the bar honest).
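
One way to picture the suggested tiebreaker is a recorded decision that falls back to "defer promotion" whenever it is missing or incomplete. This is a sketch of the reviewer's proposal only, not something the PR defines; `Action`, `Decision`, and `Choose` are hypothetical names.

```go
package main

import "fmt"

// Action names the three options listed in the policy text.
type Action string

const (
	DegradeGracefully Action = "detect and degrade gracefully"
	FailLoudly        Action = "require and fail loudly"
	DeferPromotion    Action = "defer promotion"
)

// Decision is a hypothetical record of who chose an action and why.
type Decision struct {
	Action    Action
	DecidedBy string
	Rationale string
}

// Choose returns the recorded action only when the decision names a decider,
// carries a rationale, and uses one of the three known actions.
// In every other case it falls back to DeferPromotion.
func Choose(d *Decision) Action {
	if d == nil || d.DecidedBy == "" || d.Rationale == "" {
		return DeferPromotion
	}
	switch d.Action {
	case DegradeGracefully, FailLoudly, DeferPromotion:
		return d.Action
	}
	return DeferPromotion
}

func main() {
	fmt.Println(Choose(nil))
	fmt.Println(Choose(&Decision{
		Action:    FailLoudly,
		DecidedBy: "maintainers",
		Rationale: "feature cannot work without the upstream API",
	}))
}
```
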

| `IMEXDaemonsWithDNSNames` | Beta | `true` | v25.8 | — | DNS names for IMEX daemons |
| `PassthroughSupport` | Alpha | `false` | v25.12 | — | VFIO-PCI passthrough |
| `DynamicMIG` | Alpha | `false` | v25.12 | [KEP-4815] (Alpha 1.35, Beta target 1.36) | Mutually exclusive with PassthroughSupport, NVMLDeviceHealthCheck, MPSSupport |
| `NVMLDeviceHealthCheck` | Alpha | `false` | v25.12 | [KEP-5055] (Alpha 1.33, Beta target 1.36) | Mutually exclusive with DynamicMIG |
Contributor

Worth a footnote here indicating that the current NVMLDeviceHealthCheck implementation (binary healthy/unhealthy via removal from ResourceSlice) is under active rework and may change shape before any Beta promotion. Readers cross-checking the inventory against pkg/featuregates/featuregates.go + cmd/gpu-kubelet-plugin/device_health.go today will see the current model; the footnote keeps the table honest when the rework lands.
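
Separately, the "mutually exclusive" notes in the quoted inventory rows imply a startup check along these lines. This is a sketch under assumptions: the `conflicts` table copies only the pairs stated in those rows, and `Validate` is a hypothetical helper, not the driver's actual validation code.

```go
package main

import (
	"fmt"
	"sort"
)

// conflicts mirrors the exclusions stated in the quoted inventory rows.
// Each pair is listed once, keyed by DynamicMIG.
var conflicts = map[string][]string{
	"DynamicMIG": {"PassthroughSupport", "NVMLDeviceHealthCheck", "MPSSupport"},
}

// Validate returns one message per pair of conflicting gates that are both enabled.
// The result is sorted so the output is stable across runs.
func Validate(enabled map[string]bool) []string {
	var problems []string
	for gate, others := range conflicts {
		if !enabled[gate] {
			continue
		}
		for _, other := range others {
			if enabled[other] {
				problems = append(problems,
					fmt.Sprintf("%s and %s cannot both be enabled", gate, other))
			}
		}
	}
	sort.Strings(problems)
	return problems
}

func main() {
	fmt.Println(Validate(map[string]bool{
		"DynamicMIG":            true,
		"NVMLDeviceHealthCheck": true,
	}))
}
```
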

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. documentation Issue/PR focused on fixing/editing/adding documentation bits maintenance/chores issue/pr for maintenance, release work, code cleanup, chores

Projects

Status: Backlog

9 participants