This guide provides detailed instructions for configuring Slonk for your environment.
Slonk requires several configuration changes before deployment. This guide walks through each section and what needs to be updated.
Replace all container image references with your registry.

In `charts/slurm/values.yaml`:

```yaml
image: gcr.io/your-org/slonk:latest
```

In `cluster-addons/slurm/*.yaml`:

```yaml
image: gcr.io/your-org/slonk-h100:latest
```
Build the container images:

- Main Slonk Image:

  ```bash
  cd containers/slonk
  ./build.sh
  ```

- H100-specific Image:

  ```bash
  cd containers/slonk-h100
  ./build.sh
  ```
Configure your external secret store in `charts/slurm/values.yaml`:

```yaml
secrets:
  externalSecretStore: your-external-secret-store
  crowdstrikeCredsExternalSecretName: your-crowdstrike-creds-secret
  idRsaClusterExternalSecretName: your-ssh-keys-secret
  ldapClientCredsExternalSecretName: your-ldap-client-creds-secret
  mungeKeyExternalSecretName: your-munge-key-secret
  yubikeyPamExternalSecretName: your-yubikey-pam-secret
```

You need to create the following secrets in your external secret store:
- Crowdstrike Credentials: For endpoint protection
- SSH Keys: For cluster node communication
- LDAP Client Credentials: For user authentication
- Munge Key: For SLURM authentication
- YubiKey PAM: For 2FA (optional)
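The Munge key can be generated locally before being loaded into your secret store. A minimal sketch following the MUNGE installation docs (the key is 1024 bytes of random data; `munge.key` here is a scratch filename, and `your-munge-key-secret` is the placeholder name from the values file):

```shell
# Generate a 1024-byte random Munge key (GNU dd), restrict permissions,
# then load the file into your external secret store afterwards.
dd if=/dev/urandom of=munge.key bs=1 count=1024 status=none
chmod 400 munge.key
wc -c munge.key   # expect 1024 bytes
```

The same file must be distributed (via the secret) to every node in the SLURM cluster, since Munge authentication requires an identical key everywhere.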
In `charts/slurm/values.yaml`:

```yaml
ldap:
  uri: "ldaps://your-ldap-server.com"
  userBase: "ou=Users,dc=yourcompany,dc=com"
  userFilter: "memberOf=cn=engineering,ou=Groups,dc=yourcompany,dc=com"
  groupBase: "dc=yourcompany,dc=com"
  groupFilter: "|(cn=engineering)(cn=security)(cn=developers)(memberOf=cn=engineering,ou=Groups,dc=yourcompany,dc=com)"
  commonGid: "1000"
```

Update the LDAP bind credentials in `charts/slurm/templates/configmaps.yaml`:

```
# Replace these placeholders with your actual values
ldap_bind_dn = "REPLACEME-LDAP-BIND-DN"              # TODO: Replace with your LDAP bind DN
ldap_bind_password = "REPLACEME-LDAP-BIND-PASSWORD"  # TODO: Replace with your LDAP bind password
```

Update IP addresses in `cluster-addons/slurm/*.yaml`.
Example for cluster configurations:

```yaml
loginClusterIP: 10.96.4.16  # TODO: Replace with your cluster IP
```

Another example:

```yaml
loginClusterIP: 10.43.4.14  # TODO: Replace with your cluster IP
```

Update filestore configurations:

```yaml
filestore:
  ip: 172.23.35.66  # TODO: Replace with your filestore IP
  volumeHandle: "modeInstance/us-central1/your-k8s-cluster-filestore/home"
```

Update data mount IPs in `charts/slurm/templates/configmaps.yaml`:

```bash
mount -o rw,intr,nolock 10.121.6.2:/data /data  # TODO: Replace with your data mount IP
```
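Because the hardcoded IPs are scattered across several files, a quick grep for the `TODO: Replace` markers helps confirm nothing was missed. A sketch, demonstrated on a scratch file (against the real repo, point the grep at `charts/` and `cluster-addons/` instead):

```shell
# Create a scratch file containing one leftover placeholder, then sweep
# for the TODO markers used throughout these configs.
mkdir -p scratch
printf 'loginClusterIP: 10.96.4.16  # TODO: Replace with your cluster IP\n' > scratch/cluster.yaml
grep -rn "TODO: Replace" scratch/
```

An empty result from the real sweep means every marked placeholder has been replaced.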
If using Cloudflare for SSH access:

```yaml
cloudflare:
  enabled: true
  externalSecretStore: your-external-secret-store
  tunnelExternalSecretName: your-cloudflare-tunnel-secret
  shortLivedCertExternalSecretName: your-cloudflare-cert-secret
```

Create these secrets in your external secret store:

- Tunnel Token: For Cloudflare tunnel setup
- Short-lived Certificates: For SSH certificate authentication
Update storage class names in your cluster configurations:

```yaml
storageClassName: "premium-rwo"     # Replace with your storage class
storageClassName: "enterprise-rwx"  # Replace with your storage class
```

Configure persistent volume settings:

```yaml
volumeClaimTemplates:
  - metadata:
      name: slurm-controller-pvc
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "your-storage-class"
      resources:
        requests:
          storage: 100Gi
```

In `charts/slurm/values.yaml`:
```yaml
gitSyncRepos:
  - name: k8s
    externalSecretName: your-git-creds-secret
    repo: git@github.com:your-org/your-k8s-repo.git
    branch: main
    destination: /home/common/git-sync/k8s
    wait: 30
    timeout: 600
```

Create a secret with your Git credentials (use `$HOME` rather than `~`, since the shell does not tilde-expand after `--from-file=`):

```bash
kubectl create secret generic your-git-creds-secret \
  --from-file=ssh-privatekey=$HOME/.ssh/id_rsa \
  --from-file=ssh-knownhosts=$HOME/.ssh/known_hosts
```

In `user/slonk/slonk/health/weka.py`:
```python
# Replace with your actual IP addresses
bash("timeout 1.0 ping -c 1 172.20.7.141")  # TODO: Replace with your ceph cluster IPs
bash("timeout 1.0 ping -c 1 172.20.7.142")  # TODO: Replace with your ceph cluster IPs
bash("timeout 1.0 ping -c 1 172.20.7.159")  # TODO: Replace with your weka cluster IP
```

Add your own health checks in `user/slonk/slonk/health/`:

```python
from slonk.health.base import HealthCheck


class YourCustomHealthCheck(HealthCheck):
    def check(self):
        # Your health check logic here
        pass
```

The operator uses custom resources. Update the API group in Go files:
```go
// In all Go files, replace the API group:
"slonk.example.com" -> "slonk.your-org.com"
```

Update all import statements in Go files:

```go
// Replace import paths:
"github.com/example/slonklet" -> "github.com/your-org/slonklet"
```
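Both replacements can be scripted with `sed` rather than edited by hand. A sketch on a scratch file, assuming GNU sed (for the real change, run the same expressions over every `.go` file, e.g. `grep -rl 'slonk.example.com' --include='*.go' . | xargs sed -i ...`):

```shell
# Scratch Go fragment carrying both placeholders.
mkdir -p scratch
cat > scratch/main.go <<'EOF'
import "github.com/example/slonklet"

const apiGroup = "slonk.example.com"
EOF

# Rewrite the import path and the API group in place (GNU sed -i).
sed -i -e 's#github.com/example/slonklet#github.com/your-org/slonklet#g' \
       -e 's#slonk\.example\.com#slonk.your-org.com#g' scratch/main.go

grep -c 'your-org' scratch/main.go   # both lines updated -> 2
```

Remember to also update the module path in `go.mod` if you fork the repository under your own organization.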
Update node pool specifications in `cluster-addons/slurm/*.yaml`:

```yaml
nodepools:
  controller:
    replicas: 1
    resources:
      requests:
        cpu: 8
        memory: 32Gi
  login:
    replicas: 1
    resources:
      requests:
        cpu: 62
        memory: 220Gi
  cpu:
    replicas: 0  # Adjust based on your needs
    slurmConfig:
      CPUs: 16
      RealMemory: 65536
      Features: ["cpu"]
```

For GPU nodes:

```yaml
gscConfigs:
  h100:
    nodepoolPrefix: h100
    numSlices: 5
    replicasPerSlice: 80
gscCommonNodeTemplate:
  slurmConfig:
    CPUs: 96
    RealMemory: "1048576"
    Features: ["h100"]
    Gres: "gpu:8"
  resources:
    limits:
      cpu: 96
      memory: 1024Gi
      nvidia.com/gpu: 8
```
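`RealMemory` is what SLURM schedules against and is expressed in megabytes, so it should line up with the pod memory limit. For the H100 pool, 1024Gi converts as 1024 × 1024:

```shell
# 1024Gi expressed in MiB, matching RealMemory: "1048576" above.
echo $(( 1024 * 1024 ))
```

Likewise, the `cpu` pool's `RealMemory: 65536` corresponds to a 64Gi node. In practice `RealMemory` is often set slightly below the physical total to leave headroom for system daemons.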
Create network policies for your cluster:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: slurm-network-policy
  namespace: slurm
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: slurm
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
```

Update RBAC rules in `charts/slurm/templates/serviceaccount.yaml`:
```yaml
- apiGroups:
    - slonk.your-org.com
  resources:
    - physicalnodes
    - slurmjobs
  verbs:
    - create
    - delete
    - get
    - list
    - patch
    - update
    - watch
```

Configure Prometheus monitoring:
```yaml
# In your values file
monitoring:
  enabled: true
  prometheus:
    enabled: true
    port: 8071
```

Configure logging for different components:
```yaml
# SLURM logging
slurm:
  SlurmctldLogFile: /var/log/slurm/slurmctld.log
  SlurmdLogFile: /var/log/slurm/slurmd.log

# Application logging
logging:
  level: info
  format: json
```

Set up backup for critical data:
```yaml
backup:
  enabled: true
  schedule: "0 2 * * *"  # Daily at 2 AM
  retention: 30          # days
  storage:
    type: gcs
    bucket: your-backup-bucket
```
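As a sanity check on these settings: `schedule` is a standard five-field cron expression (minute, hour, day-of-month, month, day-of-week), so `"0 2 * * *"` fires daily at 02:00, and `retention: 30` means backups older than 30 days are presumably pruned by the backup job. The pruning cutoff can be computed with GNU `date`:

```shell
# Backups timestamped before this date would be eligible for pruning
# under retention: 30 (GNU date syntax).
date -u -d "30 days ago" +%Y-%m-%d
```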
Before deploying, ensure you have:

- Updated all container image references (`gcr.io/your-org/` → your registry)
- Configured external secrets and secret stores
- Updated LDAP configuration (server, user base, group filters)
- Replaced all hardcoded IP addresses with your cluster IPs
- Updated storage configurations and mount points
- Configured Cloudflare tunnel (if using for SSH access)
- Updated Git repository URLs to your organization's repos
- Replaced API group references (`slonk.your-org.com` → your domain)
- Set up SSH key management for cluster communication
- Configured LDAP bind credentials and client certificates
- Set up Munge key for SLURM authentication
- Configured YubiKey PAM (if using 2FA)
- Set up Crowdstrike endpoint protection (if using)
- Created appropriate network policies for your cluster
- Configured persistent volume storage classes
- Updated node pool configurations for your infrastructure
- Updated health check IP addresses and endpoints
- Set up Prometheus monitoring and alerting
- Tested health checks with your infrastructure
- Verified network connectivity between components
- Validated SLURM configuration and job submission
- Tested authentication and user access
- Verified backup and disaster recovery procedures
To validate the deployment:

- Validate Helm Charts:

  ```bash
  helm lint charts/slurm/
  ```

- Test Configuration:

  ```bash
  helm template slurm charts/slurm/ -f your-values.yaml
  ```

- Validate Secrets:

  ```bash
  kubectl get secrets -n slurm
  ```

- Check SLURM Status:

  ```bash
  scontrol ping
  scontrol show nodes
  ```

- Verify Health Checks:

  ```bash
  kubectl get physicalnodes -n slurm
  ```

- Test Authentication:

  ```bash
  ssh user@login-node
  ```
- Image Pull Errors: Verify image references and registry access
- Secret Not Found: Check external secret store configuration
- LDAP Connection Failures: Verify LDAP server connectivity and credentials
- Network Connectivity: Check IP addresses and firewall rules
- Storage Issues: Verify storage class names and permissions
Debugging commands:

```bash
# Check pod status
kubectl get pods -n slurm

# View logs
kubectl logs -n slurm deployment/slurm-controller

# Check events
kubectl get events -n slurm

# Verify SLURM configuration
scontrol show config
```

For configuration issues:
- Check the troubleshooting section
- Review logs for error messages
- Verify all required secrets are present
- Test network connectivity
- Validate Kubernetes resources