A complete local LLM setup combining Ollama with Metal acceleration (for Apple Silicon) and Docker-based Open WebUI. This stack enables running powerful large language models locally with an intuitive web interface, optimized performance, and comprehensive benchmarking tools.
This repository provides:
- Local LLM Inference: Run state-of-the-art language models on your local machine
- Metal Acceleration: Optimized for Apple Silicon GPUs with Metal acceleration
- User-Friendly Interface: Web-based UI for interacting with models
- Performance Benchmarking: Tools to measure and compare model performance
- Model Management: Scripts for downloading and organizing models
- Install Ollama (if not already installed): `brew install ollama`
- Launch the stack: `./scripts/start-stack.sh`
- Pull models: `./scripts/pull-models-fp16.sh`
- Access the WebUI: http://localhost:3000
- `start-stack.sh` - Start both Ollama and Open WebUI
- `stop-stack.sh` - Stop both Ollama and WebUI services
- `restart-stack.sh` - Restart all services
- `update-stack.sh` - Update Ollama and WebUI to the latest versions
- `status.sh` - Check system status
All components can be managed independently:
- `start-ollama.sh` - Start only Ollama with Metal acceleration
- `stop-ollama.sh` - Stop only Ollama
- `start-webui.sh` - Start only the Open WebUI container
- `stop-webui.sh` - Stop only the Open WebUI container
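As a rough sketch of what `start-webui.sh` does under the hood, the Open WebUI container can be started with a `docker run` invocation along these lines. The flags follow Open WebUI's documented defaults (WebUI on port 3000, persistent `open-webui` volume, native Ollama reached via `host.docker.internal`); the actual script may differ.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of start-webui.sh; the real script may use
# different ports, volume names, or environment variables.

# Compose the docker invocation as an array so it is easy to inspect.
webui_cmd=(
  docker run -d
  -p 3000:8080                                          # WebUI at http://localhost:3000
  --add-host=host.docker.internal:host-gateway          # let the container reach the host
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434  # native Ollama server on the host
  -v open-webui:/app/backend/data                       # persist chats and settings
  --name open-webui
  --restart always
  ghcr.io/open-webui/open-webui:main
)

if command -v docker >/dev/null 2>&1; then
  "${webui_cmd[@]}" || echo "could not start open-webui (is the Docker daemon running?)" >&2
else
  echo "docker not found; install Docker Desktop first" >&2
fi
```

Binding the container's internal port 8080 to host port 3000 is what makes the WebUI reachable at http://localhost:3000.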
- `pull-models.sh` - Download recommended quantized models optimized for Metal acceleration
- `pull-models-fp16.sh` - Download recommended FP16 models for optimal quality (larger but more accurate)
- `list-models.sh` - List available models
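For illustration, a minimal version of `pull-models.sh` might look like the following. The model list here is a hypothetical example; the shipped script's list may differ.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of pull-models.sh; the actual model list may differ.
set -u

# Example quantized models that fit comfortably in 16-32 GB of unified memory.
MODELS="llama3.1:8b mistral:7b phi3:mini"

if ! command -v ollama >/dev/null 2>&1; then
  echo "ollama not found; run 'brew install ollama' first" >&2
else
  for model in $MODELS; do
    echo "Pulling $model ..."
    ollama pull "$model" || echo "failed to pull $model" >&2
  done
fi
```

Each `ollama pull` fetches the model into Ollama's local store (`~/.ollama` by default), after which it appears in the WebUI's model picker.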
For model benchmarking, we recommend the dedicated llm-bench repository, which has been extracted from this project. It provides:
- Comprehensive Performance Testing: Measure token generation speed, memory usage, CPU utilization
- Model Comparison: Compare different models on the same hardware
- Visualization Tools: Generate charts and reports for easy analysis
- Hardware Optimization: Identify the best models for your specific system
The llm-bench repository provides a complete set of tools for benchmarking LLM performance with Ollama, including detailed documentation, visualization tools, and analysis capabilities.
- Supported Hardware:
- Apple Silicon Macs (with Metal acceleration)
- x86 systems (with reduced performance)
- Recommended RAM: 16GB+ (32GB+ for larger models)
- Storage: 20GB+ for Ollama and the WebUI, plus additional space for models
- Software:
- Native Ollama installation
- Docker Desktop
- Python 3.6+ (for visualization tools)
This stack automatically utilizes Metal acceleration on Apple Silicon Macs for significantly improved performance:
- Automatic Detection: The stack detects Apple Silicon and enables Metal acceleration
- Environment Variables: Key optimizations are pre-configured in scripts
- Performance Metrics: Benchmarking tools measure Metal acceleration benefits
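The automatic detection can be sketched roughly as follows. This is a simplified illustration of the kind of check the start scripts perform, not their exact code.

```shell
#!/usr/bin/env bash
# Simplified sketch of Apple Silicon detection; the actual start scripts
# may perform additional checks or set further environment variables.

is_apple_silicon() {
  # Apple Silicon Macs report Darwin/arm64.
  [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]
}

if is_apple_silicon; then
  echo "Apple Silicon detected: Ollama will use Metal acceleration"
else
  echo "No Apple Silicon GPU detected: falling back to CPU inference"
fi
```

On Apple Silicon, Ollama's Metal backend is used automatically; the check above only determines which status message and tuning the scripts apply.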
For the best balance of performance and quality:
- General Purpose: Llama 3.1 (8B or 70B quantized versions)
- Coding: Codestral 22B or Mixtral 8x7B
- Reasoning: Qwen2.5 72B or similar
- Smaller Models: Phi-3, Mistral 7B, or Llama3 8B
Model recommendations may change as new models are released.
- Run `./scripts/status.sh` to check system status
- Check Ollama logs: `cat ~/.ollama/ollama.log`
- Check WebUI logs: `docker logs open-webui`
- Restart the stack: `./scripts/restart-stack.sh`
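If those steps are inconclusive, both services can be probed directly. Ports 11434 (Ollama's native API) and 3000 (the WebUI mapping used by this stack) are the defaults; adjust if you changed them.

```shell
#!/usr/bin/env bash
# Quick health probes for the stack's default endpoints.

probe() {
  # Succeeds only if the endpoint answers with a non-error HTTP response.
  curl -sf --max-time 5 "$1" >/dev/null
}

# Ollama's API lists installed models at /api/tags.
probe http://localhost:11434/api/tags && echo "Ollama is up" \
  || echo "Ollama is not responding on :11434"

# Open WebUI is mapped to port 3000 by the stack's scripts.
probe http://localhost:3000 && echo "Open WebUI is up" \
  || echo "Open WebUI is not responding on :3000"
```

If Ollama responds but the WebUI does not, the problem is usually on the Docker side (`docker logs open-webui`); if neither responds, start with `./scripts/status.sh`.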
For more specific issues, see the Ollama and Open WebUI documentation.