Multi-user containerised ML workload management for data science research labs
DS01 brings container-based compute allocation, per-user resource limits, and automated lifecycle management to small-to-medium research organisations running shared GPU servers.
| Challenge | DS01 Solution |
|---|---|
| GPU contention | MIG-aware allocation with priority scheduling |
| Resource hogging | Per-user/group limits via YAML + systemd cgroups |
| Stale containers | Automated idle detection and cleanup |
| Complex onboarding | Educational wizards guide new users |
| Container sprawl | Ephemeral model - GPUs freed on retire |
| Observability stack | Prometheus & Grafana dashboard configs |
| Green computing | Energy use and carbon-emission estimates |
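As a rough sketch of the green-computing row above, energy and carbon estimates can be derived from average GPU power draw and a grid carbon intensity. The function name and the default intensity below are illustrative, not DS01's actual implementation.

```python
def estimate_emissions(avg_power_watts: float, hours: float,
                       grid_intensity_g_per_kwh: float = 400.0) -> dict:
    """Convert average GPU power draw over a runtime into kWh and gCO2e.

    grid_intensity_g_per_kwh is an assumed value; a real deployment
    would plug in a region-specific carbon intensity.
    """
    energy_kwh = avg_power_watts * hours / 1000.0
    carbon_g = energy_kwh * grid_intensity_g_per_kwh
    return {"energy_kwh": round(energy_kwh, 3), "carbon_g": round(carbon_g, 1)}

# e.g. a 300 W GPU busy for 10 hours:
print(estimate_emissions(300, 10))  # {'energy_kwh': 3.0, 'carbon_g': 1200.0}
```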
Built on proven foundations:
- AIME ML Containers for container management
- Docker + NVIDIA Container Toolkit for GPU passthrough
- VS Code Dev Containers for IDE integration
- systemd cgroups for resource isolation
- Prometheus + Grafana for monitoring and metrics
TODO: this is out of date
```bash
# Clone to standard location
sudo git clone https://github.com/hertie-data-science-lab/ds01-infra /opt/ds01-infra
cd /opt/ds01-infra

# Deploy commands and configure slices
sudo scripts/system/deploy-commands.sh
sudo scripts/system/setup-resource-slices.sh

# Add a user
sudo scripts/system/add-user-to-docker.sh alice
```

See the End User Quickstart Guide for getting started with DS01.
```bash
user-setup              # Guided onboarding
project init my-thesis  # Create project with Dockerfile
container deploy        # Launch container with GPU
```

- MIG-aware allocation - Track and assign MIG instances or full GPUs
- Priority scheduling - Admins/researchers get priority over students
- Access control - Students get MIG only, researchers get full GPUs
- Automatic release - GPUs freed when containers stop
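The policy the bullets above describe could look roughly like this. This is a hypothetical sketch; the real logic lives in `scripts/docker/gpu_allocator.py` and will differ in its details.

```python
PRIORITY = {"admin": 0, "researcher": 1, "student": 2}  # lower = served first

def allocate(requests, free_migs, free_gpus):
    """Assign free MIG instances / full GPUs to queued requests.

    requests: list of (user, role, wants_full) tuples. Students may only
    receive MIG slices; researchers/admins may take full GPUs on request.
    Returns {user: device} for the requests that could be satisfied.
    """
    grants = {}
    # Priority scheduling: admins and researchers go before students.
    for user, role, wants_full in sorted(requests, key=lambda r: PRIORITY[r[1]]):
        if wants_full and role != "student" and free_gpus:
            grants[user] = free_gpus.pop(0)      # full GPU
        elif free_migs:
            grants[user] = free_migs.pop(0)      # MIG slice
    return grants

reqs = [("stu1", "student", True), ("alice", "researcher", True)]
print(allocate(reqs, ["MIG-0"], ["GPU-0"]))
# {'alice': 'GPU-0', 'stu1': 'MIG-0'}
```

Note the access-control rule: the student asked for a full GPU but only a MIG slice is granted.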
```yaml
# config/resource-limits.yaml
defaults:
  max_mig_instances: 1
  max_cpus: 8
  memory: "32g"
  idle_timeout: "48h"
groups:
  researchers:
    max_mig_instances: 2
    memory: "64g"
    allow_full_gpu: true
```

- Idle detection - Auto-stop containers below CPU threshold
- Runtime limits - Enforce maximum container runtime
- GPU cleanup - Release allocations from stopped containers
- Container removal - Auto-remove stale stopped containers
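The idle-detection rule above might be expressed as follows. The function name and the 5% threshold are illustrative assumptions; the real cleanup automation lives under `scripts/maintenance/`.

```python
def is_idle(cpu_samples, threshold_pct=5.0):
    """A container is idle if every recent CPU sample is below threshold.

    cpu_samples: recent CPU-usage percentages (e.g. from `docker stats`).
    An empty sample window is not treated as idle.
    """
    return bool(cpu_samples) and all(s < threshold_pct for s in cpu_samples)

print(is_idle([1.2, 0.4, 3.0]))   # True  -> candidate for auto-stop
print(is_idle([1.2, 80.0, 3.0]))  # False -> actively used
```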
- 4-tier help system - `--help`, `--info`, `--concepts`, `--guided`
- Interactive wizards - Guide users through complex operations
- Project-centric model - Dockerfiles persist, containers are ephemeral
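One way the four help tiers could be wired into a command is as plain flags; this is a hypothetical `argparse` sketch, not DS01's actual CLI code.

```python
import argparse

def build_parser():
    """Parser exposing the four help tiers as flags (illustrative only)."""
    p = argparse.ArgumentParser(prog="container-deploy",
                                description="Launch a container (sketch).")
    p.add_argument("--info", action="store_true",
                   help="one-paragraph summary of what this command does")
    p.add_argument("--concepts", action="store_true",
                   help="background on the ideas behind the command")
    p.add_argument("--guided", action="store_true",
                   help="step through the operation interactively")
    return p

args = build_parser().parse_args(["--concepts"])
print(args.concepts)  # True
```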
DS01 uses a 5-layer modular architecture:
```
┌─────────────────────────────────────────────────────────────────┐
│ L4: Wizards        user-setup, project-init, project-launch     │
├─────────────────────────────────────────────────────────────────┤
│ L3: Orchestrators  container deploy, container retire           │
├─────────────────────────────────────────────────────────────────┤
│ L2: Atomic         container-*, image-*                         │
├─────────────────────────────────────────────────────────────────┤
│ L1: MLC            mlc-patched.py (AIME + custom images)        │
├─────────────────────────────────────────────────────────────────┤
│ L0: Docker         Foundation runtime                           │
└─────────────────────────────────────────────────────────────────┘
```
Design principles:
- Wrap AIME, don't replace it (2.5% patch for custom image support)
- Single-purpose commands compose into workflows
- Universal enforcement via Docker wrapper + systemd cgroups
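"Universal enforcement via Docker wrapper" could be sketched as intercepting `docker run` and injecting the caller's systemd slice and limits before delegating to the real binary. The function and the slice-naming scheme below are assumptions for illustration.

```python
def wrap_docker_args(argv, user, limits):
    """Insert --cgroup-parent / --cpus / --memory after `docker run`.

    argv: the command line the user typed; limits: the resolved
    per-user limits (e.g. from config/resource-limits.yaml).
    """
    if len(argv) >= 2 and argv[0] == "docker" and argv[1] == "run":
        extra = [f"--cgroup-parent=ds01-{user}.slice",
                 f"--cpus={limits['max_cpus']}",
                 f"--memory={limits['memory']}"]
        return argv[:2] + extra + argv[2:]
    return argv  # non-run commands pass through untouched

print(wrap_docker_args(["docker", "run", "-it", "pytorch"],
                       "alice", {"max_cpus": 8, "memory": "32g"}))
```

Because the injected `--cgroup-parent` places the container in the user's slice, systemd enforces the limits even if the user bypasses DS01's higher-level commands.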
- OS: Ubuntu 20.04+ / Debian 11+
- GPU: NVIDIA GPU with MIG support (A100, H100) or any CUDA GPU
- Docker: 20.10+ with NVIDIA Container Toolkit
- Python: 3.8+ with PyYAML
- AIME: aime-ml-containers v2
```bash
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Install AIME ML Containers
sudo git clone https://github.com/aime-team/aime-ml-containers /opt/aime-ml-containers
```

```bash
# Clone repository
sudo git clone https://github.com/hertie-data-science-lab/ds01-infra /opt/ds01-infra
cd /opt/ds01-infra

# Make scripts executable
find scripts -type f \( -name "*.sh" -o -name "*.py" \) -exec chmod +x {} \;

# Deploy commands to PATH
sudo scripts/system/deploy-commands.sh

# Configure systemd slices
sudo scripts/system/setup-resource-slices.sh
sudo systemctl daemon-reload
```

```bash
sudo vim config/resource-limits.yaml
```

Define your groups and limits. See config/README.md for the full reference.
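Group limits in `config/resource-limits.yaml` resolve against the defaults; the snippet below sketches the assumed override semantics with the data hard-coded (a real implementation would load the YAML with PyYAML).

```python
# Mirrors the example config: group values override the defaults.
DEFAULTS = {"max_mig_instances": 1, "max_cpus": 8,
            "memory": "32g", "idle_timeout": "48h"}
GROUPS = {"researchers": {"max_mig_instances": 2, "memory": "64g",
                          "allow_full_gpu": True}}

def effective_limits(group):
    """Defaults, with any group-specific keys layered on top."""
    return {**DEFAULTS, **GROUPS.get(group, {})}

print(effective_limits("researchers")["memory"])  # 64g
print(effective_limits("students")["memory"])     # 32g
```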
```bash
# Add user to docker group with DS01 slice
sudo scripts/system/add-user-to-docker.sh username
# User must log out and back in
```

```bash
# First-time setup
user-setup                        # Guided onboarding wizard

# Project workflow
project init my-thesis --type=cv  # Create project structure
project launch my-thesis          # Build image + deploy container

# Container management
container deploy my-project       # Launch container
container retire my-project       # Stop + remove + free GPU

# Check your limits
check-limits                      # View resource usage and limits
```

TODO: this needs updating
```bash
# System status
dashboard        # GPU, containers, system overview
dashboard users  # Per-user breakdown

# User management
sudo scripts/system/add-user-to-docker.sh newuser

# GPU status
python3 scripts/docker/gpu_allocator.py status

# Logs
tail -f /var/log/ds01/gpu-allocations.log
```

TODO: this needs updating
```
ds01-infra/
├── config/
│   ├── resource-limits.yaml  # Central resource configuration
│   └── groups/               # Group membership files
├── scripts/
│   ├── docker/               # GPU allocation, container creation
│   ├── user/                 # User-facing commands (L2-L4)
│   ├── admin/                # Admin tools and dashboards
│   ├── lib/                  # Shared libraries
│   ├── system/               # System administration
│   ├── monitoring/           # Metrics and health checks
│   └── maintenance/          # Cleanup automation
├── testing/                  # Test suites
└── docs-user/                # User documentation
```
TODO: this needs updating
| Document | Purpose |
|---|---|
| ds01-UI_UX_GUIDE.md | CLI design standards |
| config/README.md | Resource configuration reference |
| scripts/user/README.md | User command reference |
| scripts/admin/README.md | Admin tools reference |
| scripts/docker/README.md | GPU allocation internals |
| scripts/monitoring/README.md | Monitoring setup |
| scripts/maintenance/README.md | Cleanup automation |
We welcome contributions! See CONTRIBUTING.md for guidelines.
Commit format: Conventional Commits
```
feat: add feature  # MINOR bump
fix: resolve bug   # PATCH bump
feat!: breaking    # MAJOR bump
```

See TODO.md for current priorities:
- HIGH: Dev Container GPU integration, monitoring fixes
- MEDIUM: OPA authorization, bare metal restriction
- LOW: SLURM integration, dynamic MIG partitioning
MIT License - see LICENSE for details.
Developed by Henry Baker for the Hertie School Data Science Lab