# llama.cpp Server Helm Chart

A Helm chart for deploying llama.cpp server with NVIDIA RTX 5090 GPU support on Kubernetes.

## Overview

This chart deploys llama.cpp server optimized for NVIDIA RTX 5090 GPUs (Blackwell architecture). It's based on the [phymbert/llama.cpp Kubernetes example](https://github.com/phymbert/llama.cpp/tree/example/kubernetes/examples/kubernetes/llama-cpp) with adaptations for:

- **Longhorn persistent storage**
- **MetalLB LoadBalancer with sealed secrets**
- **RTX 5090 GPU optimizations** (CUDA 12.8+, flash attention, 99 GPU layers)
- **OpenAI gpt-oss-20b model** (12.1GB MXFP4 quantization)

## Features

- ✅ **Automatic model download** via Kubernetes Job (PreSync hook)
- ✅ **GPU acceleration** with NVIDIA runtime and device plugin
- ✅ **Flash attention** enabled for better performance
- ✅ **Persistent storage** using Longhorn
- ✅ **LoadBalancer service** with sealed secret IP management
- ✅ **Prometheus metrics** endpoint
- ✅ **Health probes** for liveness and readiness
- ✅ **RBAC** and service account setup
- ✅ **Horizontal pod autoscaling** (optional)

## Prerequisites

- Kubernetes cluster with GPU nodes
- NVIDIA GPU operator or device plugin installed
- Node labeled with `gpu: "true"`
- NVIDIA runtime class configured (`runtimeClassName: nvidia`)
- Longhorn storage class deployed
- Sealed secrets controller (for LoadBalancer IP management)
- MetalLB or another LoadBalancer provider (optional)
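
To confirm the GPU node label, runtime class, and storage class prerequisites above, a few quick checks can be run (the node name is a placeholder):

```bash
# Label the GPU node (if it isn't labeled already)
kubectl label node <gpu-node-name> gpu=true

# Verify the NVIDIA runtime class and Longhorn storage class exist
kubectl get runtimeclass nvidia
kubectl get storageclass longhorn

# Confirm the node advertises GPU resources
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
```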

## Installation

### Via Argo CD (Recommended)

The chart is managed by Argo CD. Apply the application manifest:

```bash
kubectl apply -f argocd/apps/applications/llamacpp/llamacpp-app.yaml
```

The deployment will:
1. Create PVC with Longhorn storage (50Gi)
2. Run model download job (PreSync hook) to fetch gpt-oss-20b from HuggingFace
3. Deploy llama.cpp server with GPU support
4. Patch LoadBalancer IP from sealed secret (PostSync hook)

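To follow the rollout after the Argo CD sync, watch the resources it creates in the `applications` namespace:

```bash
# Watch the download job, server pod, PVC, and service come up
kubectl get jobs,pods,pvc,svc -n applications -w
```
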
### Manual Installation

```bash
helm install llamacpp-app helm/llamacpp-app --namespace applications --create-namespace
```

## Architecture

### Model Download Flow

1. **PreSync Job** (`jobs.yaml`):
   - Downloads `gpt-oss-20b-mxfp4.gguf` (12.1GB) from HuggingFace (see the sketch after this list)
   - Verifies SHA256 checksum (if provided)
   - Stores model in Longhorn PVC
   - Uses curl with resume capability (`-C -` flag)
   - Idempotent (skips download if model exists and is valid)

2. **Deployment**:
   - Mounts PVC as read-only volume
   - Loads model from `/models/gpt-oss-20b-mxfp4.gguf`
   - Runs llama-server with GPU acceleration

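The download step is roughly equivalent to the following shell sketch; the exact script lives in the chart's `jobs.yaml`, and the URL, paths, and variable names here are illustrative:

```bash
#!/bin/sh
# Illustrative sketch of the model download job, not the chart's actual script.
MODEL_DIR=/models
MODEL_FILE=gpt-oss-20b-mxfp4.gguf
# Assumed HuggingFace download URL for the ggml-org/gpt-oss-20b-GGUF repo
MODEL_URL="https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/resolve/main/${MODEL_FILE}"
MODEL_SHA256=""   # optional, from model.sha256

cd "${MODEL_DIR}"

# Resume a partial download if one exists (-C -), follow redirects, fail on HTTP errors
curl -fL -C - -o "${MODEL_FILE}" "${MODEL_URL}"

# Verify the checksum only when one is configured
if [ -n "${MODEL_SHA256}" ]; then
  echo "${MODEL_SHA256}  ${MODEL_FILE}" | sha256sum -c -
fi
```
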
### GPU Configuration

The chart configures RTX 5090 optimizations:

- **CUDA Version**: 12.8+ (via environment variable)
- **Flash Attention**: Enabled (`-fa` flag)
- **GPU Layers**: 99 layers offloaded (`-ngl 99`)
- **Runtime Class**: `nvidia`
- **GPU Tolerations**: Automatic for `nvidia.com/gpu` taints
- **Node Selector**: `gpu: "true"`

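Taken together, these settings produce a server invocation roughly like the one below; the exact command is assembled by the chart templates, and flag spellings may differ between llama.cpp releases:

```bash
# -ngl 99: offload (nearly) all layers to the GPU   -fa: flash attention
# -c 8192: context window (server.kvCache.size)     --parallel 4: slots (server.slots)
llama-server \
  -m /models/gpt-oss-20b-mxfp4.gguf \
  -ngl 99 -fa \
  -c 8192 --parallel 4 \
  --metrics \
  --host 0.0.0.0 --port 8080
```
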
## Configuration

### Key Values

| Parameter | Description | Default |
|-----------|-------------|---------|
| `replicaCount` | Number of replicas | `1` |
| `images.server.repository` | Server image repository | `ghcr.io/ggerganov/llama.cpp` |
| `images.server.name` | Server image name | `server-cuda` |
| `images.server.tag` | Server image tag | `latest` |
| `model.path` | Model storage path | `/models` |
| `model.file` | Model file to download | `gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf` |
| `model.size` | PVC size | `50Gi` |
| `model.sha256` | SHA256 checksum (optional) | `""` |
| `server.port` | Server port | `8080` |
| `server.kvCache.size` | Context window size | `8192` |
| `server.slots` | Parallel processing slots | `4` |
| `server.metrics` | Enable Prometheus metrics | `true` |
| `server.completions` | Enable completions endpoint | `true` |
| `server.embeddings` | Enable embeddings endpoint | `false` |
| `server.extraArgs` | Additional server arguments | `["-fa", "-ngl", "99"]` |
| `gpu.enabled` | Enable GPU support | `true` |
| `gpu.nvidiaResource` | GPU resource name | `nvidia.com/gpu` |
| `gpu.number` | Number of GPUs | `1` |
| `resources.requests.memory` | Memory request | `32Gi` |
| `resources.limits.memory` | Memory limit | `64Gi` |
| `persistence.storageClass` | Storage class | `longhorn` |
| `service.type` | Service type | `LoadBalancer` |
| `runtimeClassName` | Runtime class | `nvidia` |
| `nodeSelector.gpu` | GPU node selector | `"true"` |

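Any of these can be overridden at install time. For example, a manual install with a larger context window and fewer parallel slots might look like:

```bash
helm upgrade --install llamacpp-app helm/llamacpp-app \
  --namespace applications --create-namespace \
  --set server.kvCache.size=16384 \
  --set server.slots=2
```
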
### Sealed Secret Configuration

To configure the LoadBalancer IP:

```bash
# Create sealed secret for your MetalLB IP
echo -n "YOUR_IP_ADDRESS" | kubeseal --raw --from-file=/dev/stdin \
  --namespace applications --name llamacpp-app-lb-ip

# Add the encrypted value to values.yaml
# sealedSecret:
#   encryptedData:
#     loadBalancerIP: "<encrypted-value-here>"
```

### Custom Model

To use a different model:

```yaml
model:
  path: /models
  alias: my-model
  repo: username
  file: repo-name/model-file.gguf
  size: 100Gi # Adjust based on model size
  sha256: "" # Optional checksum
```

## Usage

### API Endpoints

Once deployed, the server exposes:

- **Health Check**: `GET /health`
- **Completions**: `POST /v1/completions`
- **Chat Completions**: `POST /v1/chat/completions`
- **Embeddings**: `POST /v1/embeddings` (if enabled)
- **Metrics**: `GET /metrics` (Prometheus format)

### Example Requests

#### Text Completion

```bash
SERVICE_IP=$(kubectl get svc -n applications llamacpp-app -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

curl -X POST http://$SERVICE_IP:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing in simple terms:",
    "max_tokens": 150,
    "temperature": 0.7,
    "top_p": 0.9
  }'
```

#### Chat Completion

```bash
curl -X POST http://$SERVICE_IP:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 50
  }'
```

#### Health Check

```bash
curl http://$SERVICE_IP:8080/health
```

## Monitoring

### Prometheus Integration

The chart includes Prometheus pod annotations for automatic metric scraping:

```yaml
annotations:
  prometheus.io/scrape: 'true'
  prometheus.io/port: '8080'
```

Metrics are available at `/metrics` when `server.metrics: true`.

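A quick way to confirm the endpoint is serving data is to port-forward to the service and fetch it directly:

```bash
kubectl port-forward -n applications svc/llamacpp-app 8080:8080 &
curl -s http://localhost:8080/metrics | head
```
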
## Troubleshooting

### Model Download Issues

Check the download job logs:

```bash
kubectl logs -n applications job/llamacpp-app-download-model
```

If the download fails:
- Verify network connectivity to HuggingFace
- Check that the PVC has sufficient space (50Gi for gpt-oss-20b)
- Delete the download job to force a retry (see below)

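To retry, delete the job and let Argo CD recreate it on the next sync; the job is idempotent, so an existing valid model is left in place:

```bash
kubectl delete job -n applications llamacpp-app-download-model
```
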
### GPU Not Detected

Verify GPU resources:

```bash
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
kubectl get pods -n kube-system | grep nvidia-device-plugin
```

Check pod GPU allocation:

```bash
kubectl describe pod -n applications <pod-name> | grep -A5 "Limits:"
```

### Pod Not Scheduling

Check node labels and taints:

```bash
kubectl get nodes --show-labels | grep gpu
kubectl describe node <gpu-node-name> | grep Taints
```

Verify tolerations match your node taints.

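If your GPU node is tainted with `nvidia.com/gpu`, the pod needs a toleration along these lines; the exact key, value, and effect depend on how your node is tainted, so treat this as a template rather than the chart's defaults:

```yaml
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```
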
### Out of Memory

The RTX 5090 has 32GB VRAM. For gpt-oss-20b (21B parameters, 12.1GB model file):

- Model file: ~12GB
- KV cache (8192 ctx): ~4-6GB
- Activations: ~2-4GB
- Total: ~18-22GB (fits comfortably)

To reduce memory:
- Decrease `server.kvCache.size` (context window)
- Reduce `server.slots` (parallel requests)
- Use a smaller quantization (Q4 instead of MXFP4)

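The first two adjustments map to a values override along these lines (key names from the Key Values table; the YAML nesting is assumed from the dotted paths):

```yaml
server:
  kvCache:
    size: 4096   # smaller context window
  slots: 2       # fewer parallel requests
```
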
### LoadBalancer IP Not Applied

Check the patch job:

```bash
kubectl logs -n applications job/llamacpp-app-lb-ip-patch
```

Verify sealed secret decrypts correctly:

```bash
kubectl get secret -n applications llamacpp-app-lb-ip -o jsonpath='{.data.loadBalancerIP}' | base64 -d
```

## Performance Tips

1. **Context Size**: Balance between capability and memory. 8192 is a good default.
2. **Parallel Slots**: Increase for higher concurrency, decrease for longer contexts.
3. **Flash Attention**: Keep enabled (`-fa`) for best performance with long contexts.
4. **GPU Layers**: 99 offloads nearly all layers to GPU for maximum acceleration.
5. **Batch Size**: Continuous batching (`--cont-batching`) improves throughput.

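For the last tip, continuous batching can be passed through `server.extraArgs` alongside the existing flags; this assumes the chart forwards these arguments verbatim to `llama-server`:

```yaml
server:
  extraArgs:
    - "-fa"
    - "-ngl"
    - "99"
    - "--cont-batching"
```
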
## Model Information

### gpt-oss-20b

- **Parameters**: 21B total, 3.6B active (Mixture of Experts)
- **Quantization**: MXFP4 (4-bit microscaling floating-point)
- **File Size**: 12.1GB
- **Context**: Supports extended context (tested up to 128K)
- **License**: Apache 2.0
- **Source**: [HuggingFace - ggml-org/gpt-oss-20b-GGUF](https://huggingface.co/ggml-org/gpt-oss-20b-GGUF)

## Differences from Reference Chart

This chart extends the [phymbert/llama.cpp example](https://github.com/phymbert/llama.cpp/tree/example/kubernetes/examples/kubernetes/llama-cpp) with:

- **Longhorn storage** instead of generic storage class
- **Sealed secrets + patch job** for LoadBalancer IP management
- **RTX 5090 optimizations** (CUDA 12.8, runtime class, tolerations)
- **RBAC and ServiceAccount** for Argo CD integration
- **GPU resource management** in deployment spec
- **gpt-oss-20b model** configuration (vs. tinyllamas example)
- **Enhanced job with SHA256 validation** and resume capability

## References

- [llama.cpp GitHub](https://github.com/ggml-org/llama.cpp)
- [llama.cpp Docker Images](https://github.com/ggml-org/llama.cpp/blob/master/docs/docker.md)
- [phymbert/llama.cpp Kubernetes Example](https://github.com/phymbert/llama.cpp/tree/example/kubernetes/examples/kubernetes/llama-cpp)
- [OpenAI gpt-oss](https://github.com/openai/gpt-oss)
- [gpt-oss-20b GGUF Models](https://huggingface.co/ggml-org/gpt-oss-20b-GGUF)
- [RTX 5090 Specifications](https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/)