Model Sandbox

An LLM benchmarking framework for code generation models, built on DeepSpeed inference for SLURM-managed HPC clusters.

Features

  • DeepSpeed-accelerated inference for code generation models
  • Multi-GPU support with automatic sharding
  • SLURM job templates for HPC environments
  • Evaluation metrics (ROUGE-L, fuzzy matching, Levenshtein distance); see the sketch after this list
  • Prompt templates for various code generation tasks:
    • Code completion
    • Next line prediction
    • Bug localization
    • Bug fixing
    • JUnit test generation
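
The metrics above can be approximated with standard string-similarity routines. The snippet below is a minimal sketch using only the Python standard library, for illustration; it is not the evaluation code shipped in this repository.

from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    # Edit distance via the classic dynamic-programming recurrence.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    # ROUGE-L: F1 over the longest common subsequence of whitespace tokens.
    ref, cand = reference.split(), candidate.split()
    lcs = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            lcs[i][j] = lcs[i - 1][j - 1] + 1 if r == c else max(lcs[i - 1][j], lcs[i][j - 1])
    match = lcs[-1][-1]
    if match == 0:
        return 0.0
    precision, recall = match / len(cand), match / len(ref)
    return 2 * precision * recall / (precision + recall)


def fuzzy_ratio(a: str, b: str) -> float:
    # Character-level similarity in [0, 1], comparable to fuzzy-matching ratios.
    return SequenceMatcher(None, a, b).ratio()


print(levenshtein("return x + 1", "return x - 1"))  # 1
print(rouge_l_f1("return x + 1", "return x - 1"))   # 0.75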

Setup

Local Development

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -e ".[dev]"

SLURM Environment

# Clone repository
git clone <your-repo-url>
cd model-sandbox

# Run setup script
bash scripts/setup-remote.sh

Usage

Single GPU

python src/model_sandbox/ds_inference.py --model_identifier codellama/CodeLlama-7b-hf

Multi-GPU with DeepSpeed

deepspeed --num_gpus=4 src/model_sandbox/ds_inference.py \
    --model_identifier codellama/CodeLlama-13b-hf \
    --deepspeed \
    --deepspeed_config ds_config_inference.json
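
ds_inference.py itself is not reproduced here; the sketch below shows how DeepSpeed tensor-parallel inference is typically wired up for a Hugging Face causal LM under the deepspeed launcher. The model id, prompt, and generation settings are placeholders, not the script's actual defaults.

import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"  # placeholder; the real script takes --model_identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Shard the model across the GPUs assigned by the deepspeed launcher
# (WORLD_SIZE and LOCAL_RANK are set by the launcher).
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": int(os.getenv("WORLD_SIZE", "1"))},
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

device = f"cuda:{os.getenv('LOCAL_RANK', '0')}"
inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(device)
outputs = engine.module.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))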

SLURM Job Submission

# Submit inference job
sbatch scripts/slurm-inference.sh codellama/CodeLlama-7b-hf

# Check job status
squeue -u $USER

# View logs
tail -f logs/inference-<job-id>.out

Interactive Session

SLURM interactive sessions are well suited to development, debugging, and long-running inference jobs that can span multiple days.

# Standard interactive session (4 GPUs, 8 hours)
salloc --gres=gpu:4 --cpus-per-task=8 --mem=256G --time=8:00:00

# Long-running inference (multi-day job with maximum hardware)
salloc --gres=gpu:8 --cpus-per-task=16 --mem=512G --time=72:00:00

# Large-scale benchmarking (full node allocation)
salloc --nodes=2 --gres=gpu:8 --cpus-per-task=32 --mem=1024G --time=96:00:00 --exclusive

Tips for long-running jobs:

  • Use tmux or screen to keep sessions alive after disconnection
  • Monitor GPU utilization with watch -n 1 nvidia-smi
  • Set up checkpointing in your code so interrupted jobs can resume (see the sketch after these tips)
  • Request resources conservatively to reduce queue time
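
For the checkpointing tip, it is often enough to persist which prompts have already completed so an interrupted run resumes where it stopped. A minimal sketch; the file names and layout are illustrative, not part of this repository:

import json
from pathlib import Path

RESULTS = Path("results.jsonl")   # completed generations, one JSON object per line
STATE = Path("checkpoint.json")   # indices that have already finished


def load_done() -> set:
    # Resume point: indices of prompts that already produced output.
    if STATE.exists():
        return set(json.loads(STATE.read_text())["done"])
    return set()


def run(prompts, generate):
    done = load_done()
    for i, prompt in enumerate(prompts):
        if i in done:
            continue  # finished before the last interruption
        completion = generate(prompt)
        with RESULTS.open("a") as f:
            f.write(json.dumps({"index": i, "output": completion}) + "\n")
        done.add(i)
        STATE.write_text(json.dumps({"done": sorted(done)}))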

Configuration

Edit ds_config_inference.json to adjust DeepSpeed settings:

  • Tensor parallelism degree
  • Memory optimization levels
  • Batch sizes
  • Inference precision (fp16/fp32)
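
A hypothetical example of what the file might contain; the exact schema is whatever ds_inference.py expects, and the keys below simply mirror DeepSpeed's documented inference options (tp_size for the tensor parallelism degree, dtype for precision, kernel injection for fused inference kernels):

{
  "dtype": "fp16",
  "tensor_parallel": { "tp_size": 4 },
  "replace_with_kernel_inject": true,
  "enable_cuda_graph": false
}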

Dev Container

For local development with GPU support:

# Open in VS Code with Dev Containers extension
code .

The container includes:

  • CUDA 12.1 + cuDNN
  • PyTorch with GPU support
  • DeepSpeed optimized for H100/H200
  • Python 3.11 environment
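
A quick way to verify that stack from inside the container (a throwaway check, not a script in the repository):

import deepspeed
import torch

# Print the GPU/DeepSpeed stack the container actually exposes.
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("GPUs:", [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])
print("DeepSpeed:", deepspeed.__version__)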
