Define FG graduation #954
Signed-off-by: Swati Gupta <swatig@nvidia.com>
Pull request overview
Adds a formal policy document describing how feature gates in the NVIDIA DRA Driver for GPUs should progress from Alpha to Beta to Stable, including evidence expectations and a snapshot of the current gate inventory.
Changes:
- Introduces graduation criteria (entry/graduation requirements) for Alpha, Beta, and Stable feature gates.
- Defines deprecation/removal expectations and upstream Kubernetes dependency coupling rules.
- Documents current feature-gate inventory and highlights gaps to reach/maintain desired stages.
| ### 2.3 Stable (GA) — Production Grade | ||
| **Default:** `true`, **Locked:** feature gate cannot be disabled |
| | `DynamicMIG` | Alpha | `false` | v25.12 | [KEP-4815] (Alpha 1.35, Beta target 1.36) | Mutually exclusive with PassthroughSupport, NVMLDeviceHealthCheck, MPSSupport | | ||
| | `NVMLDeviceHealthCheck` | Alpha | `false` | v25.12 | [KEP-5055] (Alpha 1.33, Beta target 1.36) | Mutually exclusive with DynamicMIG | |
| @@ -0,0 +1,184 @@ | |||
| # Policy on Feature Gate Graduation | |||
| **Default:** `false` (opt-in) | ||
| **Signal:** "Try it out and give us feedback." |
| **Default:** `true` (opt-out) | ||
| **Signal:** "We're confident in the design. Early production use is | ||
| encouraged." |
| When the upstream dependency is not at the required level, the feature must | ||
| detect and degrade gracefully, require and fail loudly, or defer promotion. | ||
| ## 3. Current Feature Gate Inventory |
I think it's better to keep this doc purely about policy and move the details of the current feature gates to a different doc.
Even better would be for each feature gate to have a dedicated page with its details.
That's a good idea. Right now, we don't have any docs on the FGs. A dedicated page would let us keep the design, discussions, and different stages in a single place.
Yes. We should at least split the static policy section and the dynamic feature gate section.
rajatchopra left a comment
How do we reconcile information here with roadmap/issues?
| > **Note:** Some existing gates (`IMEXDaemonsWithDNSNames`, `ComputeDomainCliques`, | ||
| > `CrashOnNVLinkFabricErrors`) were introduced directly at Beta to fix some critical bugs. | ||
| > The criteria below apply to **new feature gates going forward** and to **promotions of existing gates** to the next stage. |
Maybe worth calling out an official release where this takes effect. "Going forward" is a non-specific measure of time. Something like "new feature gates introduced after the 25.12.0 release".
ArangoGutierrez left a comment
Lightweight graduation criteria are well-scaled to the project's size, and the upstream KEP coupling in §2.5 is the right framing. Three small calibration nits inline — none block the policy landing.
| | B1 | All critical bugs from Alpha are fixed | Linked issues closed | | ||
| | B2 | BATS tests covering primary user workflows passing in CI | Test + CI links | | ||
| | B3 | Negative / error-path tests | Test file links | | ||
| | B4 | Prometheus metrics for key operational signals (where applicable) | Metric names in proposal | |
"Where applicable" is doing a lot of work here, and §4 "Current Gaps" admits there is no metrics endpoint for gpu-kubelet-plugin today. As written, B4 is likely to be waived case-by-case and won't bind any gate at Beta. Consider either (a) making B4 mandatory for any control-loop or health-monitoring gate, or (b) moving it to an explicit "Beta prerequisites pending infrastructure" subsection so the gap is visible rather than absorbed into each promotion's prose.
| | Stable | Stable (strongly preferred) | | ||
| When the upstream dependency is not at the required level, the feature must | ||
| detect and degrade gracefully, require and fail loudly, or defer promotion. |
Three options with no decision process. For a driver feature at Beta whose upstream KEP is Alpha, all three are technically available — without a tiebreaker this gets resolved ad hoc per PR. Suggest naming who decides (maintainers? single maintainer + recorded rationale?) and the default choice when no explicit decision is recorded (probably "defer promotion" to keep the bar honest).
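To make the suggested tiebreaker concrete, here is a minimal Go sketch. Every name in it (`UpstreamAction`, `Decision`, `Resolve`) is hypothetical and not taken from the driver's code. It assumes only what the comment above proposes: an explicit decision counts when it records both a decider and a rationale, and "defer promotion" applies otherwise.

```go
package main

import "fmt"

// UpstreamAction enumerates the three options named in the policy text.
// The type and constant names are illustrative assumptions.
type UpstreamAction int

const (
	DeferPromotion    UpstreamAction = iota // assumed default when nothing is recorded
	DegradeGracefully                       // detect the missing dependency and degrade
	FailLoudly                              // require the dependency and fail loudly
)

// String returns a readable label for the action.
func (a UpstreamAction) String() string {
	switch a {
	case DegradeGracefully:
		return "degrade gracefully"
	case FailLoudly:
		return "fail loudly"
	default:
		return "defer promotion"
	}
}

// Decision is a hypothetical record of an explicit maintainer choice.
type Decision struct {
	Action    UpstreamAction
	DecidedBy string // who made the call
	Rationale string // why this option was chosen
}

// Resolve returns the recorded action only when the decision names both a
// decider and a rationale; otherwise it falls back to DeferPromotion.
func Resolve(d *Decision) UpstreamAction {
	if d == nil || d.DecidedBy == "" || d.Rationale == "" {
		return DeferPromotion
	}
	return d.Action
}

func main() {
	// No recorded decision: the default applies.
	fmt.Println(Resolve(nil))

	// A complete record overrides the default.
	fmt.Println(Resolve(&Decision{
		Action:    DegradeGracefully,
		DecidedBy: "maintainers",
		Rationale: "feature is usable without the upstream API",
	}))
}
```

Treating an incomplete record the same as a missing one keeps the default meaningful: an option chosen without a stated rationale does not lower the bar.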
| | `IMEXDaemonsWithDNSNames` | Beta | `true` | v25.8 | — | DNS names for IMEX daemons | | ||
| | `PassthroughSupport` | Alpha | `false` | v25.12 | — | VFIO-PCI passthrough | | ||
| | `DynamicMIG` | Alpha | `false` | v25.12 | [KEP-4815] (Alpha 1.35, Beta target 1.36) | Mutually exclusive with PassthroughSupport, NVMLDeviceHealthCheck, MPSSupport | | ||
| | `NVMLDeviceHealthCheck` | Alpha | `false` | v25.12 | [KEP-5055] (Alpha 1.33, Beta target 1.36) | Mutually exclusive with DynamicMIG | |
Worth a footnote here indicating that the current NVMLDeviceHealthCheck implementation (binary healthy/unhealthy via removal from ResourceSlice) is under active rework and may change shape before any Beta promotion. Readers cross-checking the inventory against pkg/featuregates/featuregates.go + cmd/gpu-kubelet-plugin/device_health.go today will see the current model; the footnote keeps the table honest when the rework lands.
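Separately from the footnote, the mutual-exclusion notes in the quoted inventory rows could be checked when gates are parsed. The Go sketch below is illustrative only: the `exclusions` map and the `Conflicts` function are hypothetical names, not the contents of `pkg/featuregates/featuregates.go`, and the map encodes only the pairs stated in the quoted rows.

```go
package main

import (
	"fmt"
	"sort"
)

// exclusions mirrors the "mutually exclusive" notes in the quoted rows.
// The NVMLDeviceHealthCheck row repeats the DynamicMIG pair, so listing
// the pairs once under DynamicMIG is enough. Hypothetical structure.
var exclusions = map[string][]string{
	"DynamicMIG": {"PassthroughSupport", "NVMLDeviceHealthCheck", "MPSSupport"},
}

// Conflicts returns a sorted list of messages, one per pair of enabled
// gates that the table marks as mutually exclusive.
func Conflicts(enabled map[string]bool) []string {
	var out []string
	for gate, others := range exclusions {
		if !enabled[gate] {
			continue
		}
		for _, other := range others {
			if enabled[other] {
				out = append(out, gate+" conflicts with "+other)
			}
		}
	}
	sort.Strings(out) // map iteration order is random; sort for stable output
	return out
}

func main() {
	enabled := map[string]bool{"DynamicMIG": true, "NVMLDeviceHealthCheck": true}
	for _, msg := range Conflicts(enabled) {
		fmt.Println(msg)
	}
}
```

Keeping the pairs in one table means a change to the inventory, such as the health-check rework mentioned above, touches a single place.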
Address #931