Multi-user containerised ML workload management for data science research labs
DS01 brings container-based compute allocation, per-user resource limits, and automated lifecycle management to small-to-medium research organisations running shared GPU servers.
| Challenge | DS01 Solution |
|---|---|
| GPU contention | MIG-aware allocation with priority scheduling |
| Resource hogging | Per-user/group limits via YAML + systemd cgroups |
| Stale containers | Automated idle detection and cleanup |
| Complex onboarding | Educational wizards guide new users |
| Container sprawl | Ephemeral model - GPUs freed on retire |
| Observability stack | Prometheus & Grafana dashboard configs |
| Green computing | Energy use and carbon-emission estimates |
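As a rough sketch of the green-computing row above, energy and carbon estimates can be derived from average GPU power draw and a grid carbon intensity. The function name and the default intensity below are illustrative, not DS01's actual implementation.

```python
def estimate_emissions(avg_power_watts: float, hours: float,
                       grid_intensity_g_per_kwh: float = 400.0) -> dict:
    """Convert average GPU power draw over a runtime into kWh and gCO2e.

    grid_intensity_g_per_kwh is an assumed value; a real deployment
    would plug in a region-specific carbon intensity.
    """
    energy_kwh = avg_power_watts * hours / 1000.0
    carbon_g = energy_kwh * grid_intensity_g_per_kwh
    return {"energy_kwh": round(energy_kwh, 3), "carbon_g": round(carbon_g, 1)}

# e.g. a 300 W GPU busy for 10 hours:
print(estimate_emissions(300, 10))  # {'energy_kwh': 3.0, 'carbon_g': 1200.0}
```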
Built on proven foundations:
- AIME ML Containers for container management
- Docker + NVIDIA Container Toolkit for GPU passthrough
- VS Code Dev Containers for IDE integration
- systemd cgroups for resource isolation
- Prometheus + Grafana for monitoring and metrics
TODO: this is out of date
```bash
# Clone to standard location
sudo git clone https://github.com/hertie-data-science-lab/ds01-infra /opt/ds01-infra
cd /opt/ds01-infra

# Deploy commands and configure slices
sudo scripts/system/deploy-commands.sh
sudo scripts/system/setup-resource-slices.sh

# Add a user
sudo scripts/system/add-user-to-docker.sh alice
```

See the End User Quickstart Guide for getting started with DS01.
```bash
user-setup              # Guided onboarding
project init my-thesis  # Create project with Dockerfile
container deploy        # Launch container with GPU
```

- MIG-aware allocation - Track and assign MIG instances or full GPUs
- Priority scheduling - Admins/researchers get priority over students
- Access control - Students get MIG only, researchers get full GPUs
- Automatic release - GPUs freed when containers stop
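The policy the bullets above describe could look roughly like this. This is a hypothetical sketch; the real logic lives in `scripts/docker/gpu_allocator.py` and will differ in its details.

```python
PRIORITY = {"admin": 0, "researcher": 1, "student": 2}  # lower = served first

def allocate(requests, free_migs, free_gpus):
    """Assign free MIG instances / full GPUs to queued requests.

    requests: list of (user, role, wants_full) tuples. Students may only
    receive MIG slices; researchers/admins may take full GPUs on request.
    Returns {user: device} for the requests that could be satisfied.
    """
    grants = {}
    # Priority scheduling: admins and researchers go before students.
    for user, role, wants_full in sorted(requests, key=lambda r: PRIORITY[r[1]]):
        if wants_full and role != "student" and free_gpus:
            grants[user] = free_gpus.pop(0)      # full GPU
        elif free_migs:
            grants[user] = free_migs.pop(0)      # MIG slice
    return grants

reqs = [("stu1", "student", True), ("alice", "researcher", True)]
print(allocate(reqs, ["MIG-0"], ["GPU-0"]))
# {'alice': 'GPU-0', 'stu1': 'MIG-0'}
```

Note the access-control rule: the student asked for a full GPU but only a MIG slice is granted.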
```yaml
# config/resource-limits.yaml
defaults:
  max_mig_instances: 1
  max_cpus: 8
  memory: "32g"
  idle_timeout: "48h"
groups:
  researchers:
    max_mig_instances: 2
    memory: "64g"
    allow_full_gpu: true
```

- Idle detection - Auto-stop containers below CPU threshold
- Runtime limits - Enforce maximum container runtime
- GPU cleanup - Release allocations from stopped containers
- Container removal - Auto-remove stale stopped containers
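The idle-detection rule above might be expressed as follows. The function name and the 5% threshold are illustrative assumptions; the real cleanup automation lives under `scripts/maintenance/`.

```python
def is_idle(cpu_samples, threshold_pct=5.0):
    """A container is idle if every recent CPU sample is below threshold.

    cpu_samples: recent CPU-usage percentages (e.g. from `docker stats`).
    An empty sample window is not treated as idle.
    """
    return bool(cpu_samples) and all(s < threshold_pct for s in cpu_samples)

print(is_idle([1.2, 0.4, 3.0]))   # True  -> candidate for auto-stop
print(is_idle([1.2, 80.0, 3.0]))  # False -> actively used
```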
- 4-tier help system - `--help`, `--info`, `--concepts`, `--guided`
- Interactive wizards - Guide users through complex operations
- Project-centric model - Dockerfiles persist, containers are ephemeral
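One way the four help tiers could be wired into a command is as plain flags; this is a hypothetical `argparse` sketch, not DS01's actual CLI code.

```python
import argparse

def build_parser():
    """Parser exposing the four help tiers as flags (illustrative only)."""
    p = argparse.ArgumentParser(prog="container-deploy",
                                description="Launch a container (sketch).")
    p.add_argument("--info", action="store_true",
                   help="one-paragraph summary of what this command does")
    p.add_argument("--concepts", action="store_true",
                   help="background on the ideas behind the command")
    p.add_argument("--guided", action="store_true",
                   help="step through the operation interactively")
    return p

args = build_parser().parse_args(["--concepts"])
print(args.concepts)  # True
```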
DS01 uses a 5-layer modular architecture:
```
┌─────────────────────────────────────────────────────────────────┐
│ L4: Wizards        user-setup, project-init, project-launch     │
├─────────────────────────────────────────────────────────────────┤
│ L3: Orchestrators  container deploy, container retire           │
├─────────────────────────────────────────────────────────────────┤
│ L2: Atomic         container-*, image-*                         │
├─────────────────────────────────────────────────────────────────┤
│ L1: MLC            mlc-patched.py (AIME + custom images)        │
├─────────────────────────────────────────────────────────────────┤
│ L0: Docker         Foundation runtime                           │
└─────────────────────────────────────────────────────────────────┘
```
Design principles:
- Wrap AIME, don't replace it (2.5% patch for custom image support)
- Single-purpose commands compose into workflows
- Universal enforcement via Docker wrapper + systemd cgroups
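"Universal enforcement via Docker wrapper" could be sketched as intercepting `docker run` and injecting the caller's systemd slice and limits before delegating to the real binary. The function and the slice-naming scheme below are assumptions for illustration.

```python
def wrap_docker_args(argv, user, limits):
    """Insert --cgroup-parent / --cpus / --memory after `docker run`.

    argv: the command line the user typed; limits: the resolved
    per-user limits (e.g. from config/resource-limits.yaml).
    """
    if len(argv) >= 2 and argv[0] == "docker" and argv[1] == "run":
        extra = [f"--cgroup-parent=ds01-{user}.slice",
                 f"--cpus={limits['max_cpus']}",
                 f"--memory={limits['memory']}"]
        return argv[:2] + extra + argv[2:]
    return argv  # non-run commands pass through untouched

print(wrap_docker_args(["docker", "run", "-it", "pytorch"],
                       "alice", {"max_cpus": 8, "memory": "32g"}))
```

Because the injected `--cgroup-parent` places the container in the user's slice, systemd enforces the limits even if the user bypasses DS01's higher-level commands.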
- OS: Ubuntu 20.04+ / Debian 11+
- GPU: NVIDIA GPU with MIG support (A100, H100) or any CUDA GPU
- Docker: 20.10+ with NVIDIA Container Toolkit
- Python: 3.8+ with PyYAML
- AIME: aime-ml-containers v2
```bash
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Install AIME ML Containers
sudo git clone https://github.com/aime-team/aime-ml-containers /opt/aime-ml-containers
```

```bash
# Clone repository
sudo git clone https://github.com/hertie-data-science-lab/ds01-infra /opt/ds01-infra
cd /opt/ds01-infra

# Make scripts executable
find scripts -type f \( -name "*.sh" -o -name "*.py" \) -exec chmod +x {} \;

# Deploy commands to PATH
sudo scripts/system/deploy-commands.sh

# Configure systemd slices
sudo scripts/system/setup-resource-slices.sh
sudo systemctl daemon-reload
```

```bash
sudo vim config/resource-limits.yaml
```

Define your groups and limits. See config/README.md for the full reference.
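Group limits in `config/resource-limits.yaml` resolve against the defaults; the snippet below sketches the assumed override semantics with the data hard-coded (a real implementation would load the YAML with PyYAML).

```python
# Mirrors the example config: group values override the defaults.
DEFAULTS = {"max_mig_instances": 1, "max_cpus": 8,
            "memory": "32g", "idle_timeout": "48h"}
GROUPS = {"researchers": {"max_mig_instances": 2, "memory": "64g",
                          "allow_full_gpu": True}}

def effective_limits(group):
    """Defaults, with any group-specific keys layered on top."""
    return {**DEFAULTS, **GROUPS.get(group, {})}

print(effective_limits("researchers")["memory"])  # 64g
print(effective_limits("students")["memory"])     # 32g
```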
```bash
# Add user to docker group with DS01 slice
sudo scripts/system/add-user-to-docker.sh username
# User must log out and back in
```

```bash
# First-time setup
user-setup                        # Guided onboarding wizard

# Project workflow
project init my-thesis --type=cv  # Create project structure
project launch my-thesis          # Build image + deploy container

# Container management
container deploy my-project       # Launch container
container retire my-project       # Stop + remove + free GPU

# Check your limits
check-limits                      # View resource usage and limits
```

TODO: this needs updating
```bash
# System status
dashboard        # GPU, containers, system overview
dashboard users  # Per-user breakdown

# User management
sudo scripts/system/add-user-to-docker.sh newuser

# GPU status
python3 scripts/docker/gpu_allocator.py status

# Logs
tail -f /var/log/ds01/gpu-allocations.log
```

TODO: this needs updating
```
ds01-infra/
├── config/
│   ├── resource-limits.yaml  # Central resource configuration
│   └── groups/               # Group membership files
├── scripts/
│   ├── docker/               # GPU allocation, container creation
│   ├── user/                 # User-facing commands (L2-L4)
│   ├── admin/                # Admin tools and dashboards
│   ├── lib/                  # Shared libraries
│   ├── system/               # System administration
│   ├── monitoring/           # Metrics and health checks
│   └── maintenance/          # Cleanup automation
├── testing/                  # Test suites
└── docs-user/                # User documentation
```
TODO: this needs updating
| Document | Purpose |
|---|---|
| ds01-UI_UX_GUIDE.md | CLI design standards |
| config/README.md | Resource configuration reference |
| scripts/user/README.md | User command reference |
| scripts/admin/README.md | Admin tools reference |
| scripts/docker/README.md | GPU allocation internals |
| scripts/monitoring/README.md | Monitoring setup |
| scripts/maintenance/README.md | Cleanup automation |
We welcome contributions! See CONTRIBUTING.md for guidelines.
Commit format: Conventional Commits
```
feat: add feature  # MINOR bump
fix: resolve bug   # PATCH bump
feat!: breaking    # MAJOR bump
```

See TODO.md for current priorities:
- HIGH: Dev Container GPU integration, monitoring fixes
- MEDIUM: OPA authorization, bare metal restriction
- LOW: SLURM integration, dynamic MIG partitioning
MIT License - see LICENSE for details.
Developed by Henry Baker for the Hertie School Data Science Lab