A production-ready Kubernetes-native SLURM cluster solution with advanced GPU management, automated health monitoring, and custom operator support.
- Full SLURM on Kubernetes: Complete SLURM deployment with controller, login, and compute nodes
- GPU/TPU Support: First-class support for NVIDIA GPUs (including H100) and Google Cloud TPUs
- Health Monitoring: Automated node health checks with remediation capabilities
- Custom Operator: Kubernetes operator (`slonklet`) for managing physical nodes and SLURM jobs via CRDs
- Multi-cloud Ready: Designed for cloud-agnostic deployment with GCP-specific optimizations
- LDAP Integration: User authentication and synchronization with 2FA support
- Security Hardened: Cloudflare tunnels, YubiKey 2FA, network policies, and endpoint protection
┌─────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ SLURM │ │ Login │ │ Compute Nodes │ │
│ │ Controller │ │ Node │ │ (CPU/GPU/TPU) │ │
│ └──────────────┘ └──────────────┘ └──────────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Slonklet │ │ Health │ │ Monitoring │ │
│ │ Operator │ │ Checks │ │ (Prometheus) │ │
│ └──────────────┘ └──────────────┘ └──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
- SLURM Controller: Cluster state management and job scheduling
- Login Node: User access point with SSH (via Cloudflare tunnel) and dev tools
- Compute Nodes: CPU, GPU (H100), and TPU nodes for job execution
- Slonklet Operator: Custom resource definitions for `PhysicalNode` and `SlurmJob`
- Health System: Automated monitoring (GPU burn-in, NCCL tests, disk checks, network validation)
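For illustration, a `PhysicalNode` custom resource might look like the sketch below. The field names under `spec` are assumptions for this example, and the `slonk.your-org.com` API group is the placeholder you replace during configuration; consult the generated CRDs in `config/crd/bases/` for the actual schema.

```yaml
apiVersion: slonk.your-org.com/v1
kind: PhysicalNode
metadata:
  name: gpu-node-01
  namespace: slurm
spec:
  # Illustrative fields only; see the generated CRDs for the real schema
  nodeType: gpu-h100
  slurmPartition: gpu
```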
- Kubernetes 1.20+
- Helm 3.x
- External secret store (e.g., Google Secret Manager, AWS Secrets Manager)
- LDAP server (optional but recommended)
- Container registry access
- Configure your environment:

  ```bash
  git clone https://github.com/your-org/slonk.git
  cd slonk
  cp values-example.yaml values-custom.yaml
  # Edit values-custom.yaml with your settings (see Configuration section)
  ```

- Build container images:

  ```bash
  # Base image
  cd containers/slonk && ./build.sh -t your-registry/slonk:latest

  # H100-specific image (optional)
  cd ../slonk-h100 && ./build.sh -t your-registry/slonk-h100:latest
  ```

- Deploy the SLURM cluster:

  ```bash
  helm install slurm charts/slurm/ -f values-custom.yaml -n slurm --create-namespace
  ```

- Verify the deployment:

  ```bash
  kubectl get pods -n slurm
  scontrol show nodes  # From login node
  ```
Critical settings to customize (see CONFIGURATION.md for details):
| Setting | Location | Description |
|---|---|---|
| Container images | `values.yaml` | Update all `gcr.io/your-org/*` references |
| External secrets | `values.yaml` | Configure secret store and secret names |
| LDAP settings | `values.yaml` | LDAP server, user/group filters |
| Network IPs | `cluster-addons/slurm/*.yaml` | Cluster IPs, filestore IPs |
| API group | `config/crd/bases/*.yaml` | Replace `slonk.your-org.com` |
| Git repos | `values.yaml` | Update `gitSyncRepos` URLs |
Example minimal configuration:
```yaml
image: your-registry/slonk:latest
secrets:
  externalSecretStore: your-secret-store
  mungeKeyExternalSecretName: slurm-munge-key
ldap:
  uri: "ldaps://ldap.yourcompany.com"
  userBase: "ou=Users,dc=yourcompany,dc=com"
```

See CONFIGURATION.md for a comprehensive setup guide.
slonk/
├── charts/slurm/ # Helm chart for SLURM cluster
│ ├── templates/ # K8s manifests (StatefulSets, ConfigMaps, CRDs)
│ ├── scripts/ # Shell scripts for prolog/epilog, health checks
│ └── values.yaml # Default configuration
├── cluster-addons/ # Cluster-specific configs (ArgoCD ApplicationSets)
├── containers/ # Container build definitions
│ ├── slonk/ # Base SLURM container
│ └── slonk-h100/ # H100-optimized container
├── user/slonk/ # Python package and operator
│ ├── slonk/ # Python modules (health checks, lifecycle management)
│ └── operators/ # Kubernetes operator (Go)
└── CONFIGURATION.md # Detailed configuration guide
```bash
cd user/slonk/operators/slonklet
make manifests   # Generate CRDs
make install     # Install CRDs to cluster
make run         # Run operator locally
```

Implement the `slonk.health.base.HealthCheck` interface:
```python
from slonk.health.base import HealthCheck

class CustomHealthCheck(HealthCheck):
    def check(self):
        # Your health check logic here;
        # raise an exception on failure
        pass
```

Register the check in `slonk/health/__init__.py`.
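As an illustrative sketch, a custom check might verify free disk space on the node. The stub base class below stands in for `slonk.health.base.HealthCheck`, whose only assumed contract is a `check()` method that raises on failure; the path and threshold are hypothetical defaults, not project settings.

```python
import shutil

# Minimal stand-in for slonk.health.base.HealthCheck (assumed contract:
# check() raises an exception when the node is unhealthy).
class HealthCheck:
    def check(self):
        raise NotImplementedError

class DiskSpaceHealthCheck(HealthCheck):
    """Fails when free space on `path` drops below `min_free_fraction`."""

    def __init__(self, path="/", min_free_fraction=0.10):
        self.path = path
        self.min_free_fraction = min_free_fraction

    def check(self):
        usage = shutil.disk_usage(self.path)
        free_fraction = usage.free / usage.total
        if free_fraction < self.min_free_fraction:
            raise RuntimeError(
                f"Low disk space on {self.path}: {free_fraction:.1%} free"
            )
```

A check like this would then be registered in `slonk/health/__init__.py` as described above.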
- Metrics: Prometheus endpoint on port 8071
- Logs:
  - SLURM: `/var/log/slurm/*.log`
  - Operator: `kubectl logs -n slurm deployment/slonklet-controller`
- Custom Resources: `kubectl get physicalnodes,slurmjobs -n slurm`
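Because the metrics endpoint on port 8071 serves the standard Prometheus text exposition format, any scraper can consume it. Below is a minimal, self-contained sketch of fetching and parsing those metrics; the `/metrics` path and the metric names in the docstring are assumptions, not the project's actual metric set.

```python
from urllib.request import urlopen

def parse_metrics(text):
    """Parse Prometheus text-format output into {metric_name: value},
    skipping comments and dropping labels (a deliberately minimal parser).
    Example input line: slurm_node_up{node="gpu-0"} 1
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # strip any {label="..."} suffix
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics

def scrape(host, port=8071):
    # Assumes the exporter serves the conventional /metrics path.
    with urlopen(f"http://{host}:{port}/metrics") as resp:
        return parse_metrics(resp.read().decode())
```

In practice the endpoint would be scraped by the cluster's Prometheus server; a helper like this is only useful for ad-hoc debugging from the login node.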
| Issue | Check | Solution |
|---|---|---|
| Pods not starting | `kubectl describe pod -n slurm` | Verify image pull secrets, node resources |
| SLURM nodes down | `scontrol show nodes` | Check slurmd logs, network connectivity |
| GPU allocation fails | `nvidia-smi` on compute node | Verify device plugin, driver versions |
| Auth failures | LDAP sync logs | Verify LDAP credentials, user filters |
Debug commands:
```bash
# Check cluster state
kubectl get all -n slurm
sinfo && squeue

# View logs
kubectl logs -n slurm -l app=slurm-controller --tail=100
kubectl logs -n slurm -l app=slurm-compute --tail=100

# Operator status
kubectl get physicalnodes -n slurm -o wide
kubectl describe physicalnode <node-name> -n slurm
```

- Authentication: LDAP + optional YubiKey 2FA
- Network: Cloudflare tunnel for SSH, Kubernetes network policies
- Secrets: External secret store integration (no secrets in Git)
- Access Control: RBAC for operator, SLURM accounts via LDAP
- Fork the repository
- Create a feature branch
- Add tests for new health checks or operator features
- Submit a pull request
See LICENSE file.
- Issues: GitHub Issues
- Documentation: See CONFIGURATION.md for setup details
- Example configs: `cluster-addons/slurm/` and `values-example.yaml`