A comprehensive benchmarking suite for Large Language Models with special optimization for Apple Silicon and Metal acceleration.
LLM-Bench provides a sophisticated suite of tools for benchmarking, analyzing, and visualizing the performance of Large Language Models running on Ollama, with specific optimizations for Apple Silicon hardware. This standalone benchmarking tool can work independently or alongside the llm-stack project.
- Performance Benchmarking: Measure token generation speed, memory usage, and CPU utilization
- Metal Acceleration: Detect and leverage Apple Silicon's Metal GPU acceleration
- Visualization Tools: Generate charts and graphs to compare model performance
- Analysis Reports: Create detailed memory and performance analysis reports
- Model Comparison: Compare multiple models across different metrics
- Standalone Operation: Works independently or integrates with llm-stack
- macOS (optimized for Apple Silicon)
- Ollama installed (brew install ollama)
- Python 3.6+ with matplotlib and pandas
- Required CLI tools: curl, jq, bc, awk, column (a quick check is sketched below)
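A quick sanity check for these prerequisites (a minimal sketch using only the tools listed above):

# Check that every required CLI tool is on PATH
for tool in curl jq bc awk column; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done

# Check the Python plotting dependencies
python3 -c "import matplotlib, pandas" 2>/dev/null || echo "missing: matplotlib and/or pandas"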
# Clone the repository
git clone https://github.com/rjamestaylor/llm-bench.git
cd llm-bench
# Make scripts executable
chmod +x *.sh
chmod +x utils/*.sh

# Clone both repositories side by side
git clone https://github.com/rjamestaylor/llm-stack.git
git clone https://github.com/rjamestaylor/llm-bench.git
# llm-bench will automatically detect llm-stack if in a sibling directory

# Start Ollama (if not already running)
ollama serve
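Before kicking off a benchmark, you can confirm the server is reachable and see which models it will test (assumes Ollama's default port, 11434):

# Check that Ollama is responding and list its installed models
curl -s http://localhost:11434/api/tags | jq -r '.models[].name'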
# Basic benchmark of all available models
./benchmark-models.sh
# Run with the more efficient sequential mode
./benchmark-models.sh --sequential
# Include GPU metrics (requires sudo)
./benchmark-models.sh --gpu-metrics

# Visualize the latest benchmark session
python visualize_benchmarks.py --latest
# Interactive visualization with options
./example_run.sh
# Create specific chart types
python visualize_benchmarks.py --performance --efficiency --summary-path 'benchmark-reports/SESSION_TIMESTAMP/summary.csv'

Available visualization options:

- --overview: Generate a 2x2 overview of key metrics
- --performance: Create performance comparison charts
- --efficiency: Display efficiency score charts
- --memory: Show memory usage with efficiency annotations
- --all: Generate all visualization types
- --include-gpu: Include GPU metrics in visualizations
- --format {png,pdf,svg}: Select output format
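These flags can be combined; for example, to render every chart type with GPU metrics as PDFs for a specific session (the summary path is illustrative):

python visualize_benchmarks.py --all --include-gpu --format pdf --summary-path 'benchmark-reports/SESSION_TIMESTAMP/summary.csv'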
# Generate memory utilization analysis
./model-memory-report.sh SESSION_TIMESTAMP
# Generate performance metrics analysis
./model-performance-report.sh SESSION_TIMESTAMP

For detailed API responses and troubleshooting:
# Run benchmark with debugging output
./run-benchmark-debug.sh

LLM-Bench is designed to work seamlessly with the llm-stack project, but can also operate completely independently.
LLM-Bench will automatically detect if llm-stack is installed in a sibling directory and leverage its scripts and configuration when available (sketched after the list below). This provides the best of both worlds:
- Standalone Mode: All functionality works without requiring llm-stack
- Integrated Mode: Enhanced functionality when llm-stack is available
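Conceptually, the sibling-directory detection amounts to something like this sketch (illustrative only; the real logic lives in config.sh):

# Fall back to a sibling llm-stack checkout unless LLM_STACK_DIR is already set
LLM_STACK_DIR="${LLM_STACK_DIR:-$(dirname "$(pwd)")/llm-stack}"
if [ -d "$LLM_STACK_DIR" ]; then
  echo "Integrated mode: using llm-stack at $LLM_STACK_DIR"
else
  echo "Standalone mode: llm-stack not found"
fi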
You can explicitly set the path to llm-stack:
export LLM_STACK_DIR="/path/to/llm-stack"
./benchmark-models.sh

LLM-Bench collects the following metrics:

- Token generation speed (tokens per second)
- Execution time for standardized prompts
- Token counts (prompt tokens and generated tokens)
- Throughput score (tokens/sec per CPU%; see the worked example below)
- Tokens per MB (memory efficiency)
- Metal acceleration efficiency
- Memory usage (baseline, peak, and used)
- CPU usage (baseline, peak, and average)
- System memory utilization percentage
- Metal acceleration status and performance
- GPU power usage (when --gpu-metrics is enabled)
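To make the two efficiency scores concrete, here is a worked example with made-up numbers: 42.5 tokens/sec at 310% average CPU, and 1850 generated tokens using 512 MB of memory.

# throughput score = tokens/sec / average CPU%; tokens per MB = generated tokens / memory used (MB)
echo "42.5 310 1850 512" | awk '{printf "throughput score: %.3f tokens/sec per CPU%%\ntokens per MB: %.2f\n", $1/$2, $3/$4}'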
llm-bench/
├── benchmark-models.sh # Main benchmarking script
├── example_run.sh # Interactive visualization script
├── model-memory-report.sh # Memory analysis tool
├── model-performance-report.sh # Performance analysis tool
├── run-benchmark-debug.sh # Debugging tool
├── visualize_benchmarks.py # Visualization script
├── config.sh # Configuration and integration settings
├── utils/ # Utility scripts
│ └── list-models.sh # Model listing utility
└── benchmark-reports/ # Reports and data
└── sample/ # Sample benchmark data
# Start Ollama with Metal acceleration
METAL_DEVICE_WRAPPER_ENABLED=1 ollama serve
# Run benchmarks with GPU metrics
./benchmark-models.sh --gpu-metrics
# Generate visualizations
python visualize_benchmarks.py --latest --all
# Analyze memory efficiency
./model-memory-report.sh $(python visualize_benchmarks.py --list-sessions | head -2 | tail -1 | awk '{print $2}')

To modify the benchmark prompts or add new test scenarios, edit the benchmark-models.sh script:
- SHORT_PROMPT: Simple, quick responses
- MEDIUM_PROMPT: Moderate complexity responses
- LONG_PROMPT: Complex, detailed responses
- CODE_PROMPT: Programming and technical responses
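For example, the prompt variables could be redefined along these lines (the wording is illustrative, not the script's actual defaults):

# Illustrative prompt overrides; tailor these to your own test scenarios
SHORT_PROMPT="Summarize what a hash table is in one sentence."
MEDIUM_PROMPT="Explain the trade-offs between model quantization and output quality."
LONG_PROMPT="Write a detailed essay comparing CPU and GPU inference for large language models."
CODE_PROMPT="Write a Python function that parses a CSV file and returns the average of each numeric column."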
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.