A comprehensive benchmarking suite for Large Language Models with special optimization for Apple Silicon and Metal acceleration.
LLM-Bench provides a sophisticated suite of tools for benchmarking, analyzing, and visualizing the performance of Large Language Models running on Ollama, with specific optimizations for Apple Silicon hardware. This standalone benchmarking tool can work independently or alongside the llm-stack project.
- Performance Benchmarking: Measure token generation speed, memory usage, and CPU utilization
- Metal Acceleration: Detect and leverage Apple Silicon's Metal GPU acceleration
- Visualization Tools: Generate charts and graphs to compare model performance
- Analysis Reports: Create detailed memory and performance analysis reports
- Model Comparison: Compare multiple models across different metrics
- Standalone Operation: Works independently or integrates with llm-stack
- macOS (optimized for Apple Silicon)
- Ollama installed (brew install ollama)
- Python 3.6+ with matplotlib and pandas
- Required CLI tools: curl, jq, bc, awk, column (a quick check is sketched below)
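A quick sanity check for these prerequisites (a minimal sketch using only the tools listed above):

# Check that every required CLI tool is on PATH
for tool in curl jq bc awk column; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done

# Check the Python plotting dependencies
python3 -c "import matplotlib, pandas" 2>/dev/null || echo "missing: matplotlib and/or pandas"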
# Clone the repository
git clone https://github.com/rjamestaylor/llm-bench.git
cd llm-bench
# Make scripts executable
chmod +x *.sh
chmod +x utils/*.sh

# Clone both repositories side by side
git clone https://github.com/rjamestaylor/llm-stack.git
git clone https://github.com/rjamestaylor/llm-bench.git
# llm-bench will automatically detect llm-stack if in a sibling directory

# Start Ollama (if not already running)
ollama serve
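Before kicking off a benchmark, you can confirm the server is reachable and see which models it will test (assumes Ollama's default port, 11434):

# Check that Ollama is responding and list its installed models
curl -s http://localhost:11434/api/tags | jq -r '.models[].name'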
# Basic benchmark of all available models
./benchmark-models.sh
# Run with the more efficient sequential mode
./benchmark-models.sh --sequential
# Include GPU metrics (requires sudo)
./benchmark-models.sh --gpu-metrics

# Visualize the latest benchmark session
python visualize_benchmarks.py --latest
# Interactive visualization with options
./example_run.sh
# Create specific chart types
python visualize_benchmarks.py --performance --efficiency --summary-path 'benchmark-reports/SESSION_TIMESTAMP/summary.csv'

Available visualization options:

- --overview: Generate a 2x2 overview of key metrics
- --performance: Create performance comparison charts
- --efficiency: Display efficiency score charts
- --memory: Show memory usage with efficiency annotations
- --all: Generate all visualization types
- --include-gpu: Include GPU metrics in visualizations
- --format {png,pdf,svg}: Select output format
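These flags can be combined; for example, to render every chart type with GPU metrics as PDFs for a specific session (the summary path is illustrative):

python visualize_benchmarks.py --all --include-gpu --format pdf --summary-path 'benchmark-reports/SESSION_TIMESTAMP/summary.csv'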
# Generate memory utilization analysis
./model-memory-report.sh SESSION_TIMESTAMP
# Generate performance metrics analysis
./model-performance-report.sh SESSION_TIMESTAMP

For detailed API responses and troubleshooting:
# Run benchmark with debugging output
./run-benchmark-debug.sh

LLM-Bench is designed to work seamlessly with the llm-stack project, but can also operate completely independently.
LLM-Bench will automatically detect if llm-stack is installed in a sibling directory and leverage its scripts and configuration when available (sketched after the list below). This provides the best of both worlds:
- Standalone Mode: All functionality works without requiring llm-stack
- Integrated Mode: Enhanced functionality when llm-stack is available
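Conceptually, the sibling-directory detection amounts to something like this sketch (illustrative only; the real logic lives in config.sh):

# Fall back to a sibling llm-stack checkout unless LLM_STACK_DIR is already set
LLM_STACK_DIR="${LLM_STACK_DIR:-$(dirname "$(pwd)")/llm-stack}"
if [ -d "$LLM_STACK_DIR" ]; then
  echo "Integrated mode: using llm-stack at $LLM_STACK_DIR"
else
  echo "Standalone mode: llm-stack not found"
fi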
You can explicitly set the path to llm-stack:
export LLM_STACK_DIR="/path/to/llm-stack"
./benchmark-models.sh

LLM-Bench collects the following metrics:

- Token generation speed (tokens per second)
- Execution time for standardized prompts
- Token counts (prompt tokens and generated tokens)
- Throughput score (tokens/sec per CPU%; see the worked example below)
- Tokens per MB (memory efficiency)
- Metal acceleration efficiency
- Memory usage (baseline, peak, and used)
- CPU usage (baseline, peak, and average)
- System memory utilization percentage
- Metal acceleration status and performance
- GPU power usage (when --gpu-metrics is enabled)
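To make the two efficiency scores concrete, here is a worked example with made-up numbers: 42.5 tokens/sec at 310% average CPU, and 1850 generated tokens using 512 MB of memory.

# throughput score = tokens/sec / average CPU%; tokens per MB = generated tokens / memory used (MB)
echo "42.5 310 1850 512" | awk '{printf "throughput score: %.3f tokens/sec per CPU%%\ntokens per MB: %.2f\n", $1/$2, $3/$4}'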
llm-bench/
├── benchmark-models.sh # Main benchmarking script
├── example_run.sh # Interactive visualization script
├── model-memory-report.sh # Memory analysis tool
├── model-performance-report.sh # Performance analysis tool
├── run-benchmark-debug.sh # Debugging tool
├── visualize_benchmarks.py # Visualization script
├── config.sh # Configuration and integration settings
├── utils/ # Utility scripts
│ └── list-models.sh # Model listing utility
└── benchmark-reports/ # Reports and data
└── sample/ # Sample benchmark data
# Start Ollama with Metal acceleration
METAL_DEVICE_WRAPPER_ENABLED=1 ollama serve
# Run benchmarks with GPU metrics
./benchmark-models.sh --gpu-metrics
# Generate visualizations
python visualize_benchmarks.py --latest --all
# Analyze memory efficiency
./model-memory-report.sh $(python visualize_benchmarks.py --list-sessions | head -2 | tail -1 | awk '{print $2}')

To modify the benchmark prompts or add new test scenarios, edit the benchmark-models.sh script:
- SHORT_PROMPT: Simple, quick responses
- MEDIUM_PROMPT: Moderate complexity responses
- LONG_PROMPT: Complex, detailed responses
- CODE_PROMPT: Programming and technical responses
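For example, the prompt variables could be redefined along these lines (the wording is illustrative, not the script's actual defaults):

# Illustrative prompt overrides; tailor these to your own test scenarios
SHORT_PROMPT="Summarize what a hash table is in one sentence."
MEDIUM_PROMPT="Explain the trade-offs between model quantization and output quality."
LONG_PROMPT="Write a detailed essay comparing CPU and GPU inference for large language models."
CODE_PROMPT="Write a Python function that parses a CSV file and returns the average of each numeric column."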
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.