Skip to content

notask: check pr 2182#2228

Open
Uburro wants to merge 4 commits intomainfrom
notask/check-pr-2182/0
Open

notask: check pr 2182#2228
Uburro wants to merge 4 commits intomainfrom
notask/check-pr-2182/0

Conversation

@Uburro
Copy link
Collaborator

@Uburro Uburro commented Feb 26, 2026

No description provided.

@Uburro Uburro added the ignore-for-release Don't include this PR into changelog label Feb 26, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR appears to scope NodeSets, rendered Kubernetes resource names, and config/configmap handling more explicitly to a parent SlurmCluster (to better support multiple clusters per namespace), moving the NodeSet→SlurmCluster linkage from an annotation to a required spec field and updating Helm/CRDs accordingly.

Changes:

  • Introduces spec.slurmClusterRefName for NodeSets (CRD + API type), and updates controllers/utilities to use it instead of an annotation.
  • Updates naming helpers and call sites so generated resource names include clusterName.
  • Scopes topology/JailedConfig configmaps and the sconfigcontroller’s reconciliation to a cluster, and updates Helm templates/tests to match new names.

Reviewed changes

Copilot reviewed 36 out of 36 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
internal/webhook/v1alpha1/nodeset_webhook_test.go Removes tests tied to the deleted NodeSet parental-cluster annotation defaulting behavior.
internal/webhook/v1alpha1/nodeset_webhook.go Removes NodeSet parental-cluster annotation defaulting logic (defaulter becomes a no-op).
internal/values/slurm_nodeset.go Updates NodeSet StatefulSet naming to include clusterName.
internal/values/slurm_login.go Updates login StatefulSet naming to include clusterName.
internal/values/slurm_controller.go Updates controller StatefulSet/DaemonSet naming to include clusterName.
internal/values/slurm_accounting_test.go Updates expected accounting Deployment name to include clusterName.
internal/values/slurm_accounting.go Updates accounting Deployment naming to include clusterName.
internal/utils/resourcegetter/nodeset.go Switches NodeSet lookup-by-cluster from annotation-based to spec.slurmClusterRefName.
internal/utils/resourcegetter/cluster.go Removes GetClusterInNamespace helper (no longer used).
internal/render/sconfigcontroller/deployment.go Updates sconfigcontroller Deployment naming to include clusterName.
internal/render/rest/rest.go Updates REST Deployment naming to include clusterName.
internal/render/common/configmap.go Derives SlurmctldHost pod hostname from computed StatefulSet name (cluster-scoped).
internal/render/accounting/deployment_test.go Updates expected accounting Deployment name to include clusterName.
internal/render/accounting/deployment.go Updates accounting Deployment naming to include clusterName.
internal/naming/naming.go Changes multiple naming helpers to incorporate clusterName into generated names.
internal/controller/topologyconfcontroller/workertopology_controller_update_test.go Updates topology controller tests for cluster-scoped topology configmap name and new function signature.
internal/controller/topologyconfcontroller/workertopology_controller.go Introduces cluster-scoped topology configmap naming + regex-based delete filtering; updates update/ensure paths to take clusterName.
internal/controller/sconfigcontroller/jailedconfig_controller_test.go Updates constructor call for new clusterName parameter.
internal/controller/sconfigcontroller/jailedconfig_controller.go Adds cluster scoping via instance-label filtering and controller naming; constructor now requires clusterName.
internal/controller/nodesetcontroller/reconcile.go Switches parent cluster lookup to use spec.slurmClusterRefName (instead of annotation).
internal/controller/clustercontroller/sconfigcontroller.go Updates deployment name lookup for sconfigcontroller to include clusterName.
internal/controller/clustercontroller/accounting.go Updates accounting deployment cleanup name to include clusterName.
internal/consts/annotation.go Removes parental cluster ref annotation constant.
helm/soperator/crds/slurmcluster-crd.yaml Adds slurmClusterRefName to NodeSet schema and marks it required in CRD bundle.
helm/soperator-crds/templates/slurmcluster-crd.yaml Same CRD schema update in templated CRDs chart.
helm/slurm-cluster/tests/volume-sources-filtering_test.yaml Updates expected slurm-scripts volumeSource name to be cluster-prefixed.
helm/slurm-cluster/tests/exporter-rbac_test.yaml Updates expected exporter ServiceAccount name to be cluster-prefixed.
helm/slurm-cluster/templates/slurm-scripts-cm.yaml Renames slurm-scripts ConfigMap via a helper to include cluster/release prefix.
helm/slurm-cluster/templates/slurm-cluster-cr.yaml Updates slurm-scripts volumeSource name and mount references to use the new helper.
helm/slurm-cluster/templates/_helpers.tpl Updates exporter SA default naming to include cluster/release prefix; adds helper for slurm-scripts ConfigMap name.
helm/nodesets/values.yaml Adds top-level slurmClusterRefName value; adjusts example/default jail PVC claim name.
helm/nodesets/tests/basic_test.yaml Adds a test ensuring spec.slurmClusterRefName is set from values.
helm/nodesets/templates/nodeset.yaml Sets NodeSet spec.slurmClusterRefName from chart values.
config/crd/bases/slurm.nebius.ai_nodesets.yaml Updates NodeSet CRD schema to include required slurmClusterRefName.
cmd/sconfigcontroller/main.go Makes leader election ID cluster-specific; passes clusterName into JailedConfig reconciler.
api/v1alpha1/nodeset_types.go Adds required SlurmClusterRefName field to NodeSetSpec.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +86 to 116
func BuildStatefulSetName(componentType consts.ComponentType, clusterName string) string {
return namedEntity{
componentType: &componentType,
clusterName: "",
clusterName: clusterName,
entity: "",
}.String()
}

func BuildNodeSetStatefulSetName(nodeSetName string) string {
func BuildNodeSetStatefulSetName(clusterName, nodeSetName string) string {
return namedEntity{
componentType: nil,
clusterName: "",
clusterName: clusterName,
entity: nodeSetName,
}.String()
}

func BuildDaemonSetName(componentType consts.ComponentType) string {
func BuildDaemonSetName(componentType consts.ComponentType, clusterName string) string {
return namedEntity{
componentType: &componentType,
clusterName: "",
clusterName: clusterName,
entity: "",
}.String()
}

func BuildDeploymentName(componentType consts.ComponentType) string {
func BuildDeploymentName(componentType consts.ComponentType, clusterName string) string {
return namedEntity{
componentType: &componentType,
clusterName: clusterName,
entity: "",
}.String()
}
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These naming helpers now include clusterName where they previously produced cluster-agnostic names. This will change the names of Deployments/StatefulSets/DaemonSets created by the operator, so an in-place upgrade will likely create new resources while leaving the old ones orphaned (and cleanup/lookup code may no longer find existing objects). If this renaming is intentional, consider adding upgrade/migration handling (e.g., reconcile/cleanup both old and new names for a period) or clearly documenting the breaking change.

Copilot uses AI. Check for mistakes.
Comment on lines +601 to +612
clusterPredicate := predicate.NewPredicateFuncs(func(obj client.Object) bool {
return obj.GetLabels()[consts.LabelInstanceKey] == r.clusterName
})

return ctrl.NewControllerManagedBy(mgr).
For(&slurmv1alpha1.JailedConfig{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
For(&slurmv1alpha1.JailedConfig{}, builder.WithPredicates(predicate.GenerationChangedPredicate{}, clusterPredicate)).
Watches(
&corev1.ConfigMap{},
handler.EnqueueRequestsFromMapFunc(r.findObjectsForConfigMap),
builder.WithPredicates(predicate.ResourceVersionChangedPredicate{}),
builder.WithPredicates(predicate.ResourceVersionChangedPredicate{}, clusterPredicate),
).
Named("jailedconfig").
Named(r.clusterName + "-jailedconfig").
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new clusterPredicate requires app.kubernetes.io/instance == r.clusterName for both JailedConfig and ConfigMap events. There are existing producers of JailedConfig/ConfigMap without this label (e.g. internal/controller/soperatorchecks/activecheck_prolog_controller.go creates both without labels), which will now never be reconciled by sconfigcontroller. Consider allowing missing instance labels (backward compat) or ensuring all created JailedConfig/ConfigMap objects are labeled consistently.

Copilot uses AI. Check for mistakes.
// SlurmClusterRefName is the name of the parent SlurmCluster this NodeSet belongs to.
// Must be in the same namespace as the NodeSet.
//
// +kubebuilder:validation:Required
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SlurmClusterRefName is marked as Required, but the CRD will still accept an empty string. Other required string fields in this repo typically also set +kubebuilder:validation:MinLength=1 (e.g. api/v1/slurmcluster_types.go:372). Consider adding MinLength=1 here to enforce non-empty at admission time instead of failing later in the controller.

Suggested change
// +kubebuilder:validation:Required
// +kubebuilder:validation:Required
// +kubebuilder:validation:MinLength=1

Copilot uses AI. Check for mistakes.
Comment on lines 289 to 297
cm := &corev1.ConfigMap{
ObjectMeta: metav1.ObjectMeta{
Name: consts.ConfigMapNameTopologyConfig,
Name: cmName,
Namespace: namespace,
},
Data: map[string]string{
consts.ConfigMapKeyTopologyConfig: config,
},
}
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When creating the topology ConfigMap on NotFound, the object is missing the instance label (app.kubernetes.io/instance) that renderTopologyConfigMap would add. Since the JailedConfig controller now filters ConfigMap events by this label, ConfigMap changes may not trigger reconciliation. Add the instance label (and consider reusing renderTopologyConfigMap) on create, and also ensure the label is present/updated on the update path.

Copilot uses AI. Check for mistakes.
- name: slurm-scripts
- name: {{ include "slurm-cluster.slurmScripts.configMapName" . }}
configMap:
name: slurm-scripts
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SlurmCluster CR now names the volumeSource using the new helper, but the referenced ConfigMap name under configMap.name is still hardcoded to slurm-scripts. This will point to a non-existent ConfigMap now that the slurm-scripts ConfigMap is renamed. Update configMap.name to use the same helper so the reference stays consistent.

Suggested change
name: slurm-scripts
name: {{ include "slurm-cluster.slurmScripts.configMapName" . }}

Copilot uses AI. Check for mistakes.
Comment on lines 176 to 180
WebhookServer: webhookServer,
HealthProbeBindAddress: probeAddr,
LeaderElection: enableLeaderElection,
LeaderElectionID: "vqeyz6ae.nebius.ai",
LeaderElectionID: clusterName + ".vqeyz6ae.nebius.ai",
LeaderElectionReleaseOnCancel: true,
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LeaderElectionID is now derived from clusterName. If clusterName contains characters not allowed in Kubernetes object names (e.g., underscore/uppercase), the manager will fail to start because the Lease name becomes invalid. Consider validating/sanitizing clusterName (DNS-1123) before constructing LeaderElectionID, or fall back to a safe default when it is invalid.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ignore-for-release Don't include this PR into changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants