Conversation
There was a problem hiding this comment.
Pull request overview
This PR appears to scope NodeSets, rendered Kubernetes resource names, and config/configmap handling more explicitly to a parent SlurmCluster (to better support multiple clusters per namespace), moving the NodeSet→SlurmCluster linkage from an annotation to a required spec field and updating Helm/CRDs accordingly.
Changes:
- Introduces
spec.slurmClusterRefNamefor NodeSets (CRD + API type), and updates controllers/utilities to use it instead of an annotation. - Updates naming helpers and call sites so generated resource names include
clusterName. - Scopes topology/JailedConfig configmaps and the sconfigcontroller’s reconciliation to a cluster, and updates Helm templates/tests to match new names.
Reviewed changes
Copilot reviewed 36 out of 36 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/webhook/v1alpha1/nodeset_webhook_test.go | Removes tests tied to the deleted NodeSet parental-cluster annotation defaulting behavior. |
| internal/webhook/v1alpha1/nodeset_webhook.go | Removes NodeSet parental-cluster annotation defaulting logic (defaulter becomes a no-op). |
| internal/values/slurm_nodeset.go | Updates NodeSet StatefulSet naming to include clusterName. |
| internal/values/slurm_login.go | Updates login StatefulSet naming to include clusterName. |
| internal/values/slurm_controller.go | Updates controller StatefulSet/DaemonSet naming to include clusterName. |
| internal/values/slurm_accounting_test.go | Updates expected accounting Deployment name to include clusterName. |
| internal/values/slurm_accounting.go | Updates accounting Deployment naming to include clusterName. |
| internal/utils/resourcegetter/nodeset.go | Switches NodeSet lookup-by-cluster from annotation-based to spec.slurmClusterRefName. |
| internal/utils/resourcegetter/cluster.go | Removes GetClusterInNamespace helper (no longer used). |
| internal/render/sconfigcontroller/deployment.go | Updates sconfigcontroller Deployment naming to include clusterName. |
| internal/render/rest/rest.go | Updates REST Deployment naming to include clusterName. |
| internal/render/common/configmap.go | Derives SlurmctldHost pod hostname from computed StatefulSet name (cluster-scoped). |
| internal/render/accounting/deployment_test.go | Updates expected accounting Deployment name to include clusterName. |
| internal/render/accounting/deployment.go | Updates accounting Deployment naming to include clusterName. |
| internal/naming/naming.go | Changes multiple naming helpers to incorporate clusterName into generated names. |
| internal/controller/topologyconfcontroller/workertopology_controller_update_test.go | Updates topology controller tests for cluster-scoped topology configmap name and new function signature. |
| internal/controller/topologyconfcontroller/workertopology_controller.go | Introduces cluster-scoped topology configmap naming + regex-based delete filtering; updates update/ensure paths to take clusterName. |
| internal/controller/sconfigcontroller/jailedconfig_controller_test.go | Updates constructor call for new clusterName parameter. |
| internal/controller/sconfigcontroller/jailedconfig_controller.go | Adds cluster scoping via instance-label filtering and controller naming; constructor now requires clusterName. |
| internal/controller/nodesetcontroller/reconcile.go | Switches parent cluster lookup to use spec.slurmClusterRefName (instead of annotation). |
| internal/controller/clustercontroller/sconfigcontroller.go | Updates deployment name lookup for sconfigcontroller to include clusterName. |
| internal/controller/clustercontroller/accounting.go | Updates accounting deployment cleanup name to include clusterName. |
| internal/consts/annotation.go | Removes parental cluster ref annotation constant. |
| helm/soperator/crds/slurmcluster-crd.yaml | Adds slurmClusterRefName to NodeSet schema and marks it required in CRD bundle. |
| helm/soperator-crds/templates/slurmcluster-crd.yaml | Same CRD schema update in templated CRDs chart. |
| helm/slurm-cluster/tests/volume-sources-filtering_test.yaml | Updates expected slurm-scripts volumeSource name to be cluster-prefixed. |
| helm/slurm-cluster/tests/exporter-rbac_test.yaml | Updates expected exporter ServiceAccount name to be cluster-prefixed. |
| helm/slurm-cluster/templates/slurm-scripts-cm.yaml | Renames slurm-scripts ConfigMap via a helper to include cluster/release prefix. |
| helm/slurm-cluster/templates/slurm-cluster-cr.yaml | Updates slurm-scripts volumeSource name and mount references to use the new helper. |
| helm/slurm-cluster/templates/_helpers.tpl | Updates exporter SA default naming to include cluster/release prefix; adds helper for slurm-scripts ConfigMap name. |
| helm/nodesets/values.yaml | Adds top-level slurmClusterRefName value; adjusts example/default jail PVC claim name. |
| helm/nodesets/tests/basic_test.yaml | Adds a test ensuring spec.slurmClusterRefName is set from values. |
| helm/nodesets/templates/nodeset.yaml | Sets NodeSet spec.slurmClusterRefName from chart values. |
| config/crd/bases/slurm.nebius.ai_nodesets.yaml | Updates NodeSet CRD schema to include required slurmClusterRefName. |
| cmd/sconfigcontroller/main.go | Makes leader election ID cluster-specific; passes clusterName into JailedConfig reconciler. |
| api/v1alpha1/nodeset_types.go | Adds required SlurmClusterRefName field to NodeSetSpec. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func BuildStatefulSetName(componentType consts.ComponentType, clusterName string) string { | ||
| return namedEntity{ | ||
| componentType: &componentType, | ||
| clusterName: "", | ||
| clusterName: clusterName, | ||
| entity: "", | ||
| }.String() | ||
| } | ||
|
|
||
| func BuildNodeSetStatefulSetName(nodeSetName string) string { | ||
| func BuildNodeSetStatefulSetName(clusterName, nodeSetName string) string { | ||
| return namedEntity{ | ||
| componentType: nil, | ||
| clusterName: "", | ||
| clusterName: clusterName, | ||
| entity: nodeSetName, | ||
| }.String() | ||
| } | ||
|
|
||
| func BuildDaemonSetName(componentType consts.ComponentType) string { | ||
| func BuildDaemonSetName(componentType consts.ComponentType, clusterName string) string { | ||
| return namedEntity{ | ||
| componentType: &componentType, | ||
| clusterName: "", | ||
| clusterName: clusterName, | ||
| entity: "", | ||
| }.String() | ||
| } | ||
|
|
||
| func BuildDeploymentName(componentType consts.ComponentType) string { | ||
| func BuildDeploymentName(componentType consts.ComponentType, clusterName string) string { | ||
| return namedEntity{ | ||
| componentType: &componentType, | ||
| clusterName: clusterName, | ||
| entity: "", | ||
| }.String() | ||
| } |
There was a problem hiding this comment.
These naming helpers now include clusterName where they previously produced cluster-agnostic names. This will change the names of Deployments/StatefulSets/DaemonSets created by the operator, so an in-place upgrade will likely create new resources while leaving the old ones orphaned (and cleanup/lookup code may no longer find existing objects). If this renaming is intentional, consider adding upgrade/migration handling (e.g., reconcile/cleanup both old and new names for a period) or clearly documenting the breaking change.
| clusterPredicate := predicate.NewPredicateFuncs(func(obj client.Object) bool { | ||
| return obj.GetLabels()[consts.LabelInstanceKey] == r.clusterName | ||
| }) | ||
|
|
||
| return ctrl.NewControllerManagedBy(mgr). | ||
| For(&slurmv1alpha1.JailedConfig{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})). | ||
| For(&slurmv1alpha1.JailedConfig{}, builder.WithPredicates(predicate.GenerationChangedPredicate{}, clusterPredicate)). | ||
| Watches( | ||
| &corev1.ConfigMap{}, | ||
| handler.EnqueueRequestsFromMapFunc(r.findObjectsForConfigMap), | ||
| builder.WithPredicates(predicate.ResourceVersionChangedPredicate{}), | ||
| builder.WithPredicates(predicate.ResourceVersionChangedPredicate{}, clusterPredicate), | ||
| ). | ||
| Named("jailedconfig"). | ||
| Named(r.clusterName + "-jailedconfig"). |
There was a problem hiding this comment.
The new clusterPredicate requires app.kubernetes.io/instance == r.clusterName for both JailedConfig and ConfigMap events. There are existing producers of JailedConfig/ConfigMap without this label (e.g. internal/controller/soperatorchecks/activecheck_prolog_controller.go creates both without labels), which will now never be reconciled by sconfigcontroller. Consider allowing missing instance labels (backward compat) or ensuring all created JailedConfig/ConfigMap objects are labeled consistently.
| // SlurmClusterRefName is the name of the parent SlurmCluster this NodeSet belongs to. | ||
| // Must be in the same namespace as the NodeSet. | ||
| // | ||
| // +kubebuilder:validation:Required |
There was a problem hiding this comment.
SlurmClusterRefName is marked as Required, but the CRD will still accept an empty string. Other required string fields in this repo typically also set +kubebuilder:validation:MinLength=1 (e.g. api/v1/slurmcluster_types.go:372). Consider adding MinLength=1 here to enforce non-empty at admission time instead of failing later in the controller.
| // +kubebuilder:validation:Required | |
| // +kubebuilder:validation:Required | |
| // +kubebuilder:validation:MinLength=1 |
| cm := &corev1.ConfigMap{ | ||
| ObjectMeta: metav1.ObjectMeta{ | ||
| Name: consts.ConfigMapNameTopologyConfig, | ||
| Name: cmName, | ||
| Namespace: namespace, | ||
| }, | ||
| Data: map[string]string{ | ||
| consts.ConfigMapKeyTopologyConfig: config, | ||
| }, | ||
| } |
There was a problem hiding this comment.
When creating the topology ConfigMap on NotFound, the object is missing the instance label (app.kubernetes.io/instance) that renderTopologyConfigMap would add. Since the JailedConfig controller now filters ConfigMap events by this label, ConfigMap changes may not trigger reconciliation. Add the instance label (and consider reusing renderTopologyConfigMap) on create, and also ensure the label is present/updated on the update path.
| - name: slurm-scripts | ||
| - name: {{ include "slurm-cluster.slurmScripts.configMapName" . }} | ||
| configMap: | ||
| name: slurm-scripts |
There was a problem hiding this comment.
The SlurmCluster CR now names the volumeSource using the new helper, but the referenced ConfigMap name under configMap.name is still hardcoded to slurm-scripts. This will point to a non-existent ConfigMap now that the slurm-scripts ConfigMap is renamed. Update configMap.name to use the same helper so the reference stays consistent.
| name: slurm-scripts | |
| name: {{ include "slurm-cluster.slurmScripts.configMapName" . }} |
| WebhookServer: webhookServer, | ||
| HealthProbeBindAddress: probeAddr, | ||
| LeaderElection: enableLeaderElection, | ||
| LeaderElectionID: "vqeyz6ae.nebius.ai", | ||
| LeaderElectionID: clusterName + ".vqeyz6ae.nebius.ai", | ||
| LeaderElectionReleaseOnCancel: true, |
There was a problem hiding this comment.
LeaderElectionID is now derived from clusterName. If clusterName contains characters not allowed in Kubernetes object names (e.g., underscore/uppercase), the manager will fail to start because the Lease name becomes invalid. Consider validating/sanitizing clusterName (DNS-1123) before constructing LeaderElectionID, or fall back to a safe default when it is invalid.
No description provided.