This guide provides detailed instructions for configuring Slonk for your environment.
Slonk requires several configuration changes before deployment. This guide walks through each section and what needs to be updated.
Replace all container image references with your registry.

In `charts/slurm/values.yaml`:

```yaml
image: gcr.io/your-org/slonk:latest
```

In `cluster-addons/slurm/*.yaml`:

```yaml
image: gcr.io/your-org/slonk-h100:latest
```
Build the container images:

- Main Slonk Image:

  ```bash
  cd containers/slonk
  ./build.sh
  ```

- H100-specific Image:

  ```bash
  cd containers/slonk-h100
  ./build.sh
  ```
Configure your external secret store in `charts/slurm/values.yaml`:

```yaml
secrets:
  externalSecretStore: your-external-secret-store
  crowdstrikeCredsExternalSecretName: your-crowdstrike-creds-secret
  idRsaClusterExternalSecretName: your-ssh-keys-secret
  ldapClientCredsExternalSecretName: your-ldap-client-creds-secret
  mungeKeyExternalSecretName: your-munge-key-secret
  yubikeyPamExternalSecretName: your-yubikey-pam-secret
```

You need to create the following secrets in your external secret store:
- Crowdstrike Credentials: For endpoint protection
- SSH Keys: For cluster node communication
- LDAP Client Credentials: For user authentication
- Munge Key: For SLURM authentication
- YubiKey PAM: For 2FA (optional)
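The Munge key can be generated locally before being loaded into your secret store. A minimal sketch following the MUNGE installation docs (the key is 1024 bytes of random data; `munge.key` here is a scratch filename, and `your-munge-key-secret` is the placeholder name from the values file):

```shell
# Generate a 1024-byte random Munge key (GNU dd), restrict permissions,
# then load the file into your external secret store afterwards.
dd if=/dev/urandom of=munge.key bs=1 count=1024 status=none
chmod 400 munge.key
wc -c munge.key   # expect 1024 bytes
```

The same file must be distributed (via the secret) to every node in the SLURM cluster, since Munge authentication requires an identical key everywhere.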
In `charts/slurm/values.yaml`:

```yaml
ldap:
  uri: "ldaps://your-ldap-server.com"
  userBase: "ou=Users,dc=yourcompany,dc=com"
  userFilter: "memberOf=cn=engineering,ou=Groups,dc=yourcompany,dc=com"
  groupBase: "dc=yourcompany,dc=com"
  groupFilter: "|(cn=engineering)(cn=security)(cn=developers)(memberOf=cn=engineering,ou=Groups,dc=yourcompany,dc=com)"
  commonGid: "1000"
```

Update the LDAP bind credentials in `charts/slurm/templates/configmaps.yaml`:

```
# Replace these placeholders with your actual values
ldap_bind_dn = "REPLACEME-LDAP-BIND-DN"              # TODO: Replace with your LDAP bind DN
ldap_bind_password = "REPLACEME-LDAP-BIND-PASSWORD"  # TODO: Replace with your LDAP bind password
```

Update IP addresses in `cluster-addons/slurm/*.yaml`.
Example for cluster configurations:

```yaml
loginClusterIP: 10.96.4.16  # TODO: Replace with your cluster IP
```

Another example:

```yaml
loginClusterIP: 10.43.4.14  # TODO: Replace with your cluster IP
```

Update filestore configurations:

```yaml
filestore:
  ip: 172.23.35.66  # TODO: Replace with your filestore IP
  volumeHandle: "modeInstance/us-central1/your-k8s-cluster-filestore/home"
```

Update data mount IPs in `charts/slurm/templates/configmaps.yaml`:

```bash
mount -o rw,intr,nolock 10.121.6.2:/data /data  # TODO: Replace with your data mount IP
```
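Because the hardcoded IPs are scattered across several files, a quick grep for the `TODO: Replace` markers helps confirm nothing was missed. A sketch, demonstrated on a scratch file (against the real repo, point the grep at `charts/` and `cluster-addons/` instead):

```shell
# Create a scratch file containing one leftover placeholder, then sweep
# for the TODO markers used throughout these configs.
mkdir -p scratch
printf 'loginClusterIP: 10.96.4.16  # TODO: Replace with your cluster IP\n' > scratch/cluster.yaml
grep -rn "TODO: Replace" scratch/
```

An empty result from the real sweep means every marked placeholder has been replaced.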
If using Cloudflare for SSH access:

```yaml
cloudflare:
  enabled: true
  externalSecretStore: your-external-secret-store
  tunnelExternalSecretName: your-cloudflare-tunnel-secret
  shortLivedCertExternalSecretName: your-cloudflare-cert-secret
```

Create these secrets in your external secret store:

- Tunnel Token: For Cloudflare tunnel setup
- Short-lived Certificates: For SSH certificate authentication
Update storage class names in your cluster configurations:

```yaml
storageClassName: "premium-rwo"     # Replace with your storage class
storageClassName: "enterprise-rwx"  # Replace with your storage class
```

Configure persistent volume settings:

```yaml
volumeClaimTemplates:
  - metadata:
      name: slurm-controller-pvc
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "your-storage-class"
      resources:
        requests:
          storage: 100Gi
```

In `charts/slurm/values.yaml`:
```yaml
gitSyncRepos:
  - name: k8s
    externalSecretName: your-git-creds-secret
    repo: git@github.com:your-org/your-k8s-repo.git
    branch: main
    destination: /home/common/git-sync/k8s
    wait: 30
    timeout: 600
```

Create a secret with your Git credentials (use `$HOME` rather than `~`, since the shell does not tilde-expand after `--from-file=`):

```bash
kubectl create secret generic your-git-creds-secret \
  --from-file=ssh-privatekey=$HOME/.ssh/id_rsa \
  --from-file=ssh-knownhosts=$HOME/.ssh/known_hosts
```

In `user/slonk/slonk/health/weka.py`:
```python
# Replace with your actual IP addresses
bash("timeout 1.0 ping -c 1 172.20.7.141")  # TODO: Replace with your ceph cluster IPs
bash("timeout 1.0 ping -c 1 172.20.7.142")  # TODO: Replace with your ceph cluster IPs
bash("timeout 1.0 ping -c 1 172.20.7.159")  # TODO: Replace with your weka cluster IP
```

Add your own health checks in `user/slonk/slonk/health/`:

```python
from slonk.health.base import HealthCheck


class YourCustomHealthCheck(HealthCheck):
    def check(self):
        # Your health check logic here
        pass
```

The operator uses custom resources. Update the API group in Go files:
```go
// In all Go files, replace the API group:
"slonk.example.com" -> "slonk.your-org.com"
```

Update all import statements in Go files:

```go
// Replace import paths:
"github.com/example/slonklet" -> "github.com/your-org/slonklet"
```
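Both replacements can be scripted with `sed` rather than edited by hand. A sketch on a scratch file, assuming GNU sed (for the real change, run the same expressions over every `.go` file, e.g. `grep -rl 'slonk.example.com' --include='*.go' . | xargs sed -i ...`):

```shell
# Scratch Go fragment carrying both placeholders.
mkdir -p scratch
cat > scratch/main.go <<'EOF'
import "github.com/example/slonklet"

const apiGroup = "slonk.example.com"
EOF

# Rewrite the import path and the API group in place (GNU sed -i).
sed -i -e 's#github.com/example/slonklet#github.com/your-org/slonklet#g' \
       -e 's#slonk\.example\.com#slonk.your-org.com#g' scratch/main.go

grep -c 'your-org' scratch/main.go   # both lines updated -> 2
```

Remember to also update the module path in `go.mod` if you fork the repository under your own organization.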
Update node pool specifications in `cluster-addons/slurm/*.yaml`:

```yaml
nodepools:
  controller:
    replicas: 1
    resources:
      requests:
        cpu: 8
        memory: 32Gi
  login:
    replicas: 1
    resources:
      requests:
        cpu: 62
        memory: 220Gi
  cpu:
    replicas: 0  # Adjust based on your needs
    slurmConfig:
      CPUs: 16
      RealMemory: 65536
      Features: ["cpu"]
```

For GPU nodes:

```yaml
gscConfigs:
  h100:
    nodepoolPrefix: h100
    numSlices: 5
    replicasPerSlice: 80
gscCommonNodeTemplate:
  slurmConfig:
    CPUs: 96
    RealMemory: "1048576"
    Features: ["h100"]
    Gres: "gpu:8"
  resources:
    limits:
      cpu: 96
      memory: 1024Gi
      nvidia.com/gpu: 8
```
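`RealMemory` is what SLURM schedules against and is expressed in megabytes, so it should line up with the pod memory limit. For the H100 pool, 1024Gi converts as 1024 × 1024:

```shell
# 1024Gi expressed in MiB, matching RealMemory: "1048576" above.
echo $(( 1024 * 1024 ))
```

Likewise, the `cpu` pool's `RealMemory: 65536` corresponds to a 64Gi node. In practice `RealMemory` is often set slightly below the physical total to leave headroom for system daemons.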
Create network policies for your cluster:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: slurm-network-policy
  namespace: slurm
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: slurm
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
```

Update RBAC rules in `charts/slurm/templates/serviceaccount.yaml`:
```yaml
- apiGroups:
    - slonk.your-org.com
  resources:
    - physicalnodes
    - slurmjobs
  verbs:
    - create
    - delete
    - get
    - list
    - patch
    - update
    - watch
```

Configure Prometheus monitoring:
```yaml
# In your values file
monitoring:
  enabled: true
  prometheus:
    enabled: true
    port: 8071
```

Configure logging for different components:
```yaml
# SLURM logging
slurm:
  SlurmctldLogFile: /var/log/slurm/slurmctld.log
  SlurmdLogFile: /var/log/slurm/slurmd.log

# Application logging
logging:
  level: info
  format: json
```

Set up backup for critical data:
```yaml
backup:
  enabled: true
  schedule: "0 2 * * *"  # Daily at 2 AM
  retention: 30          # days
  storage:
    type: gcs
    bucket: your-backup-bucket
```
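As a sanity check on these settings: `schedule` is a standard five-field cron expression (minute, hour, day-of-month, month, day-of-week), so `"0 2 * * *"` fires daily at 02:00, and `retention: 30` means backups older than 30 days are presumably pruned by the backup job. The pruning cutoff can be computed with GNU `date`:

```shell
# Backups timestamped before this date would be eligible for pruning
# under retention: 30 (GNU date syntax).
date -u -d "30 days ago" +%Y-%m-%d
```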
Before deploying, ensure you have:

- Updated all container image references (`gcr.io/your-org/` → your registry)
- Configured external secrets and secret stores
- Updated LDAP configuration (server, user base, group filters)
- Replaced all hardcoded IP addresses with your cluster IPs
- Updated storage configurations and mount points
- Configured Cloudflare tunnel (if using for SSH access)
- Updated Git repository URLs to your organization's repos
- Replaced API group references (`slonk.your-org.com` → your domain)
- Set up SSH key management for cluster communication
- Configured LDAP bind credentials and client certificates
- Set up Munge key for SLURM authentication
- Configured YubiKey PAM (if using 2FA)
- Set up Crowdstrike endpoint protection (if using)
- Created appropriate network policies for your cluster
- Configured persistent volume storage classes
- Updated node pool configurations for your infrastructure
- Updated health check IP addresses and endpoints
- Set up Prometheus monitoring and alerting
- Tested health checks with your infrastructure
- Verified network connectivity between components
- Validated SLURM configuration and job submission
- Tested authentication and user access
- Verified backup and disaster recovery procedures
To validate the deployment:

- Validate Helm Charts:

  ```bash
  helm lint charts/slurm/
  ```

- Test Configuration:

  ```bash
  helm template slurm charts/slurm/ -f your-values.yaml
  ```

- Validate Secrets:

  ```bash
  kubectl get secrets -n slurm
  ```

- Check SLURM Status:

  ```bash
  scontrol ping
  scontrol show nodes
  ```

- Verify Health Checks:

  ```bash
  kubectl get physicalnodes -n slurm
  ```

- Test Authentication:

  ```bash
  ssh user@login-node
  ```
- Image Pull Errors: Verify image references and registry access
- Secret Not Found: Check external secret store configuration
- LDAP Connection Failures: Verify LDAP server connectivity and credentials
- Network Connectivity: Check IP addresses and firewall rules
- Storage Issues: Verify storage class names and permissions
Debugging commands:

```bash
# Check pod status
kubectl get pods -n slurm

# View logs
kubectl logs -n slurm deployment/slurm-controller

# Check events
kubectl get events -n slurm

# Verify SLURM configuration
scontrol show config
```

For configuration issues:
- Check the troubleshooting section
- Review logs for error messages
- Verify all required secrets are present
- Test network connectivity
- Validate Kubernetes resources