LLM benchmarking framework using DeepSpeed inference on HPC SLURM clusters.
- DeepSpeed-accelerated inference for code generation models
- Multi-GPU support with automatic sharding
- SLURM job templates for HPC environments
- Evaluation metrics (ROUGE-L, fuzzy matching, Levenshtein distance); a metric sketch follows this list
- Prompt templates for various code generation tasks:
  - Code completion
  - Next line prediction
  - Bug localization
  - Bug fixing
  - JUnit test generation
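For reference, the fuzzy-matching and Levenshtein metrics boil down to an edit-distance computation. The minimal, self-contained sketch below illustrates the idea; it is not the framework's actual implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def fuzzy_ratio(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```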
```bash
# Clone repository
git clone <your-repo-url>
cd model-sandbox

# Run setup script
bash scripts/setup-remote.sh
```

```bash
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -e ".[dev]"
```

Single-GPU inference:

```bash
python src/model_sandbox/ds_inference.py --model_identifier codellama/CodeLlama-7b-hf
```

Multi-GPU inference with the DeepSpeed launcher:

```bash
deepspeed --num_gpus=4 src/model_sandbox/ds_inference.py \
--model_identifier codellama/CodeLlama-13b-hf \
--deepspeed \
--deepspeed_config ds_config_inference.json
```
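Under the hood, `ds_inference.py` follows the standard DeepSpeed inference pattern: load a Hugging Face checkpoint, then shard it with `deepspeed.init_inference`. The sketch below shows that general pattern, not the script's exact code; the model ID and generation settings are placeholders:

```python
# Illustrative sketch only; the real ds_inference.py may differ.
import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Shard across however many GPUs the deepspeed launcher assigned.
engine = deepspeed.init_inference(
    model,
    mp_size=int(os.getenv("WORLD_SIZE", "1")),  # tensor-parallel degree
    dtype=torch.float16,
    replace_with_kernel_inject=True,            # use DeepSpeed's fused kernels
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(engine.module.device)
outputs = engine.module.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```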
```bash
# Submit inference job
sbatch scripts/slurm-inference.sh codellama/CodeLlama-7b-hf

# Check job status
squeue -u $USER

# View logs
tail -f logs/inference-<job-id>.out
```
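`scripts/slurm-inference.sh` is project-specific, but a batch script along these lines captures the idea; the resource sizes and paths here are assumptions, not the shipped script:

```bash
#!/bin/bash
#SBATCH --job-name=ds-inference
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=8
#SBATCH --mem=256G
#SBATCH --time=8:00:00
#SBATCH --output=logs/inference-%j.out

# Model identifier arrives as the first sbatch argument.
MODEL_ID="${1:?usage: sbatch slurm-inference.sh <model_identifier>}"

source .venv/bin/activate

deepspeed --num_gpus=4 src/model_sandbox/ds_inference.py \
  --model_identifier "$MODEL_ID" \
  --deepspeed \
  --deepspeed_config ds_config_inference.json
```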
SLURM interactive sessions are ideal for development, debugging, and long-running inference jobs that can span multiple days.

```bash
# Standard interactive session (4 GPUs, 8 hours)
salloc --gres=gpu:4 --cpus-per-task=8 --mem=256G --time=8:00:00

# Long-running inference (multi-day job with maximum hardware)
salloc --gres=gpu:8 --cpus-per-task=16 --mem=512G --time=72:00:00

# Large-scale benchmarking (full node allocation)
salloc --nodes=2 --gres=gpu:8 --cpus-per-task=32 --mem=1024G --time=96:00:00 --exclusive
```

Tips for long-running jobs:
- Use `tmux` or `screen` to keep sessions alive after disconnection
- Monitor GPU utilization with `watch -n 1 nvidia-smi`
- Set up checkpointing in your code to resume from failures (see the sketch after this list)
- Request resources conservatively to reduce queue time
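For the checkpointing tip, appending each finished generation to a JSONL file and skipping completed IDs on restart is usually enough for a benchmarking run. A minimal sketch, with an illustrative file name and fields and a hypothetical `generate` callable:

```python
import json
from pathlib import Path

RESULTS = Path("results.jsonl")  # illustrative path, not the project's layout

def completed_ids() -> set[str]:
    """IDs of prompts already processed in a previous (possibly killed) run."""
    if not RESULTS.exists():
        return set()
    return {json.loads(line)["id"] for line in RESULTS.open()}

def run(prompts: dict[str, str], generate) -> None:
    done = completed_ids()
    with RESULTS.open("a") as f:
        for pid, prompt in prompts.items():
            if pid in done:
                continue  # resume: skip work finished before the job died
            completion = generate(prompt)
            f.write(json.dumps({"id": pid, "completion": completion}) + "\n")
            f.flush()  # make each result durable immediately
```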
Edit `ds_config_inference.json` to adjust DeepSpeed settings:
- Tensor parallelism degree
- Memory optimization levels
- Batch sizes
- Inference precision (fp16/fp32)
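A starting-point config might look like the following; exact keys depend on your DeepSpeed version, so verify against the DeepSpeed inference docs rather than treating this as the project's shipped file:

```json
{
  "tensor_parallel": { "tp_size": 4 },
  "dtype": "fp16",
  "replace_with_kernel_inject": true,
  "max_out_tokens": 1024
}
```

Here `tp_size` sets the tensor parallelism degree, `dtype` the inference precision, `replace_with_kernel_inject` enables DeepSpeed's optimized kernels, and `max_out_tokens` bounds per-request output size.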
For local development with GPU support:
```bash
# Open in VS Code with Dev Containers extension
code .
```

The container includes:
- CUDA 12.1 + cuDNN
- PyTorch with GPU support
- DeepSpeed optimized for H100/H200
- Python 3.11 environment
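To verify the container actually sees the GPUs, a quick check with standard PyTorch calls works:

```python
import torch

# Sanity check that the container's CUDA stack is wired up.
assert torch.cuda.is_available(), "CUDA not visible inside the container"
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```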