
Commit 1b44d2e

Add initial llamacpp-app files

1 parent df89d17 · commit 1b44d2e

15 files changed · 1162 additions & 0 deletions
argocd/apps/applications/llamacpp/llamacpp-app.yaml

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llamacpp-app
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://github.com/nwthomas/gitops.git
    targetRevision: main
    path: helm/llamacpp-app
  destination:
    server: https://kubernetes.default.svc
    namespace: applications
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
      - Prune=true

helm/llamacpp-app/Chart.yaml

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
apiVersion: v2
name: llamacpp-app
description: llama.cpp Helm chart for Kubernetes with RTX 5090 GPU support
type: application
version: 0.1.0
appVersion: "latest"

helm/llamacpp-app/README.md

Lines changed: 315 additions & 0 deletions
@@ -0,0 +1,315 @@
# llama.cpp Server Helm Chart

A Helm chart for deploying llama.cpp server with NVIDIA RTX 5090 GPU support on Kubernetes.

## Overview

This chart deploys llama.cpp server optimized for NVIDIA RTX 5090 GPUs (Blackwell architecture). It's based on the [phymbert/llama.cpp Kubernetes example](https://github.com/phymbert/llama.cpp/tree/example/kubernetes/examples/kubernetes/llama-cpp) with adaptations for:

- **Longhorn persistent storage**
- **MetalLB LoadBalancer with sealed secrets**
- **RTX 5090 GPU optimizations** (CUDA 12.8+, flash attention, 99 GPU layers)
- **OpenAI gpt-oss-20b model** (12.1GB MXFP4 quantization)

## Features

- **Automatic model download** via Kubernetes Job (PreSync hook)
- **GPU acceleration** with NVIDIA runtime and device plugin
- **Flash attention** enabled for better performance
- **Persistent storage** using Longhorn
- **LoadBalancer service** with sealed secret IP management
- **Prometheus metrics** endpoint
- **Health probes** for liveness and readiness
- **RBAC** and service account setup
- **Horizontal pod autoscaling** (optional)

## Prerequisites

- Kubernetes cluster with GPU nodes
- NVIDIA GPU operator or device plugin installed
- Node labeled with `gpu: "true"`
- NVIDIA runtime class configured (`runtimeClassName: nvidia`)
- Longhorn storage class deployed
- Sealed secrets controller (for LoadBalancer IP management)
- MetalLB or another LoadBalancer provider (optional)

## Installation

### Via Argo CD (Recommended)

The chart is managed by Argo CD. Apply the application manifest:

```bash
kubectl apply -f argocd/apps/applications/llamacpp/llamacpp-app.yaml
```

The deployment will:

1. Create PVC with Longhorn storage (50Gi)
2. Run model download job (PreSync hook) to fetch gpt-oss-20b from HuggingFace
3. Deploy llama.cpp server with GPU support
4. Patch LoadBalancer IP from sealed secret (PostSync hook)
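To follow those four phases as they happen, something along these lines works (a sketch: the deployment name `llamacpp-app` is assumed from the release name, while the job name matches the one used in Troubleshooting below):

```bash
# Watch the Argo CD Application converge to Synced/Healthy
kubectl get application llamacpp-app -n argocd -w

# Follow the PreSync model download
kubectl logs -n applications job/llamacpp-app-download-model -f

# Wait for the server rollout to finish
kubectl rollout status deployment/llamacpp-app -n applications
```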
### Manual Installation

```bash
helm install llamacpp-app helm/llamacpp-app --namespace applications --create-namespace
```

## Architecture

### Model Download Flow

1. **PreSync Job** (`jobs.yaml`):
   - Downloads `gpt-oss-20b-mxfp4.gguf` (12.1GB) from HuggingFace
   - Verifies SHA256 checksum (if provided)
   - Stores model in Longhorn PVC
   - Uses curl with resume capability (`-C -` flag)
   - Idempotent (skips download if model exists and is valid)

2. **Deployment**:
   - Mounts PVC as read-only volume
   - Loads model from `/models/gpt-oss-20b-mxfp4.gguf`
   - Runs llama-server with GPU acceleration
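In outline, the PreSync job's logic looks roughly like this (a sketch of the behavior described above, not the literal script in `jobs.yaml`; the URL pattern and variable names are assumptions):

```bash
#!/bin/sh
set -eu

MODEL_DIR=/models
MODEL_FILE=gpt-oss-20b-mxfp4.gguf
# Assumed HuggingFace layout for the repo cited under Model Information below
MODEL_URL="https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/resolve/main/${MODEL_FILE}"
MODEL_SHA256="${MODEL_SHA256:-}"   # optional, injected from model.sha256

cd "$MODEL_DIR"

# Idempotency: skip when the model is already present and, if a checksum
# was provided, verifies cleanly
if [ -f "$MODEL_FILE" ]; then
  if [ -z "$MODEL_SHA256" ] || echo "$MODEL_SHA256  $MODEL_FILE" | sha256sum -c -; then
    echo "Model already present, skipping download"
    exit 0
  fi
fi

# -C - resumes a partial download; -L follows HuggingFace redirects
curl -fSL -C - -o "${MODEL_FILE}.part" "$MODEL_URL"

# Verify if a checksum was provided, then publish atomically
if [ -n "$MODEL_SHA256" ]; then
  echo "$MODEL_SHA256  ${MODEL_FILE}.part" | sha256sum -c -
fi
mv "${MODEL_FILE}.part" "$MODEL_FILE"
```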
### GPU Configuration

The chart configures RTX 5090 optimizations:

- **CUDA Version**: 12.8+ (via environment variable)
- **Flash Attention**: Enabled (`-fa` flag)
- **GPU Layers**: 99 layers offloaded (`-ngl 99`)
- **Runtime Class**: `nvidia`
- **GPU Tolerations**: Automatic for `nvidia.com/gpu` taints
- **Node Selector**: `gpu: "true"`
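Rendered into the deployment, those settings surface roughly as the following pod spec fragment (illustrative; the values come from the list above, while the container name and exact toleration shape are assumptions):

```yaml
spec:
  runtimeClassName: nvidia
  nodeSelector:
    gpu: "true"
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: llamacpp-app
      resources:
        limits:
          nvidia.com/gpu: 1
```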
## Configuration

### Key Values

| Parameter | Description | Default |
|-----------|-------------|---------|
| `replicaCount` | Number of replicas | `1` |
| `images.server.repository` | Server image repository | `ghcr.io/ggerganov/llama.cpp` |
| `images.server.name` | Server image name | `server-cuda` |
| `images.server.tag` | Server image tag | `latest` |
| `model.path` | Model storage path | `/models` |
| `model.file` | Model file to download | `gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf` |
| `model.size` | PVC size | `50Gi` |
| `model.sha256` | SHA256 checksum (optional) | `""` |
| `server.port` | Server port | `8080` |
| `server.kvCache.size` | Context window size | `8192` |
| `server.slots` | Parallel processing slots | `4` |
| `server.metrics` | Enable Prometheus metrics | `true` |
| `server.completions` | Enable completions endpoint | `true` |
| `server.embeddings` | Enable embeddings endpoint | `false` |
| `server.extraArgs` | Additional server arguments | `["-fa", "-ngl", "99"]` |
| `gpu.enabled` | Enable GPU support | `true` |
| `gpu.nvidiaResource` | GPU resource name | `nvidia.com/gpu` |
| `gpu.number` | Number of GPUs | `1` |
| `resources.requests.memory` | Memory request | `32Gi` |
| `resources.limits.memory` | Memory limit | `64Gi` |
| `persistence.storageClass` | Storage class | `longhorn` |
| `service.type` | Service type | `LoadBalancer` |
| `runtimeClassName` | Runtime class | `nvidia` |
| `nodeSelector.gpu` | GPU node selector | `"true"` |
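To override any of these without editing the chart, pass a values file at upgrade time (a minimal example; keep only the keys you want to change):

```yaml
# my-values.yaml
server:
  kvCache:
    size: 16384
  slots: 2
resources:
  limits:
    memory: 48Gi
```

```bash
helm upgrade --install llamacpp-app helm/llamacpp-app \
  --namespace applications -f my-values.yaml
```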
### Sealed Secret Configuration

To configure the LoadBalancer IP:

```bash
# Create sealed secret for your MetalLB IP
echo -n "YOUR_IP_ADDRESS" | kubeseal --raw --from-file=/dev/stdin \
  --namespace applications --name llamacpp-app-lb-ip

# Add the encrypted value to values.yaml
# sealedSecret:
#   encryptedData:
#     loadBalancerIP: "<encrypted-value-here>"
```
### Custom Model

To use a different model:

```yaml
model:
  path: /models
  alias: my-model
  repo: username
  file: repo-name/model-file.gguf
  size: 100Gi # Adjust based on model size
  sha256: "" # Optional checksum
```
## Usage

### API Endpoints

Once deployed, the server exposes:

- **Health Check**: `GET /health`
- **Completions**: `POST /v1/completions`
- **Chat Completions**: `POST /v1/chat/completions`
- **Embeddings**: `POST /v1/embeddings` (if enabled)
- **Metrics**: `GET /metrics` (Prometheus format)

### Example Requests

#### Text Completion

```bash
SERVICE_IP=$(kubectl get svc -n applications llamacpp-app -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

curl -X POST http://$SERVICE_IP:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing in simple terms:",
    "max_tokens": 150,
    "temperature": 0.7,
    "top_p": 0.9
  }'
```

#### Chat Completion

```bash
curl -X POST http://$SERVICE_IP:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 50
  }'
```

#### Health Check

```bash
curl http://$SERVICE_IP:8080/health
```
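#### Embeddings

Assuming `server.embeddings: true` (the default is `false`), the OpenAI-style endpoint accepts the usual request shape; a sketch:

```bash
curl -X POST http://$SERVICE_IP:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "The quick brown fox jumps over the lazy dog"}'
```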
## Monitoring

### Prometheus Integration

The chart includes Prometheus pod annotations for automatic metric scraping:

```yaml
annotations:
  prometheus.io/scrape: 'true'
  prometheus.io/port: '8080'
```

Metrics are available at `/metrics` when `server.metrics: true`.
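To spot-check that the endpoint is emitting data (llama.cpp exports its counters with a `llamacpp` prefix; treat the exact metric names as an assumption and inspect the raw output if the grep comes back empty):

```bash
SERVICE_IP=$(kubectl get svc -n applications llamacpp-app -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -s http://$SERVICE_IP:8080/metrics | grep '^llamacpp'
```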
## Troubleshooting

### Model Download Issues

Check the download job logs:

```bash
kubectl logs -n applications job/llamacpp-app-download-model
```

If the download fails:

- Verify network connectivity to HuggingFace
- Check the PVC has sufficient space (50Gi for gpt-oss-20b)
- Manually delete the download job to retry (see below)
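For the last point, deleting the job lets the PreSync hook recreate it on the next sync (job name as above; the `argocd` CLI step assumes you have it installed):

```bash
kubectl delete job -n applications llamacpp-app-download-model
argocd app sync llamacpp-app
```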
### GPU Not Detected

Verify GPU resources:

```bash
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
kubectl get pods -n kube-system | grep nvidia-device-plugin
```

Check pod GPU allocation:

```bash
kubectl describe pod -n applications <pod-name> | grep -A5 "Limits:"
```

### Pod Not Scheduling

Check node labels and taints:

```bash
kubectl get nodes --show-labels | grep gpu
kubectl describe node <gpu-node-name> | grep Taints
```

Verify tolerations match your node taints.

### Out of Memory

The RTX 5090 has 32GB VRAM. For gpt-oss-20b (21B parameters, 12.1GB model file):

- Model file: ~12GB
- KV cache (8192 ctx): ~4-6GB
- Activations: ~2-4GB
- Total: ~18-22GB (fits comfortably)

To reduce memory:

- Decrease `server.kvCache.size` (context window)
- Reduce `server.slots` (parallel requests)
- Use a smaller quantization (Q4 instead of MXFP4)
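The first two knobs are plain values overrides, e.g.:

```yaml
server:
  kvCache:
    size: 4096   # smaller context window -> smaller KV cache
  slots: 2       # fewer parallel slots -> fewer concurrent KV caches
```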
### LoadBalancer IP Not Applied

Check the patch job:

```bash
kubectl logs -n applications job/llamacpp-app-lb-ip-patch
```

Verify the sealed secret decrypts correctly:

```bash
kubectl get secret -n applications llamacpp-app-lb-ip -o jsonpath='{.data.loadBalancerIP}' | base64 -d
```

## Performance Tips

1. **Context Size**: Balance between capability and memory. 8192 is a good default.
2. **Parallel Slots**: Increase for higher concurrency, decrease for longer contexts.
3. **Flash Attention**: Keep enabled (`-fa`) for best performance with long contexts.
4. **GPU Layers**: 99 offloads nearly all layers to GPU for maximum acceleration.
5. **Batch Size**: Continuous batching (`--cont-batching`) improves throughput.
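For tip 5, the flag can be appended through `server.extraArgs` (a sketch; confirm the flag spelling against your llama.cpp image version, where continuous batching may also be on by default):

```yaml
server:
  extraArgs:
    - "-fa"
    - "-ngl"
    - "99"
    - "--cont-batching"
```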
## Model Information

### gpt-oss-20b

- **Parameters**: 21B total, 3.6B active (Mixture of Experts)
- **Quantization**: MXFP4 (mixed-precision floating-point)
- **File Size**: 12.1GB
- **Context**: Supports extended context (tested up to 128K)
- **License**: Apache 2.0
- **Source**: [HuggingFace - ggml-org/gpt-oss-20b-GGUF](https://huggingface.co/ggml-org/gpt-oss-20b-GGUF)

## Differences from Reference Chart

This chart extends the [phymbert/llama.cpp example](https://github.com/phymbert/llama.cpp/tree/example/kubernetes/examples/kubernetes/llama-cpp) with:

- **Longhorn storage** instead of generic storage class
- **Sealed secrets + patch job** for LoadBalancer IP management
- **RTX 5090 optimizations** (CUDA 12.8, runtime class, tolerations)
- **RBAC and ServiceAccount** for Argo CD integration
- **GPU resource management** in deployment spec
- **gpt-oss-20b model** configuration (vs. tinyllamas example)
- **Enhanced job with SHA256 validation** and resume capability

## References

- [llama.cpp GitHub](https://github.com/ggml-org/llama.cpp)
- [llama.cpp Docker Images](https://github.com/ggml-org/llama.cpp/blob/master/docs/docker.md)
- [phymbert/llama.cpp Kubernetes Example](https://github.com/phymbert/llama.cpp/tree/example/kubernetes/examples/kubernetes/llama-cpp)
- [OpenAI gpt-oss](https://github.com/openai/gpt-oss)
- [gpt-oss-20b GGUF Models](https://huggingface.co/ggml-org/gpt-oss-20b-GGUF)
- [RTX 5090 Specifications](https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/)
helm/llamacpp-app/templates/NOTES.txt

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
1. Get the application URL by running these commands:
{{- if and .Values.server.completions .Values.ingresses.completions.enabled }}
{{- range $host := .Values.ingresses.completions.hosts }}
  {{- range .paths }}
  Completions API: http{{ if $.Values.ingresses.completions.tls }}s{{ end }}://{{ $host.host }}{{ .path }}
  {{- end }}
{{- end }}
{{- end }}
{{- if and .Values.server.embeddings .Values.ingresses.embeddings.enabled }}
{{- range $host := .Values.ingresses.embeddings.hosts }}
  {{- range .paths }}
  Embeddings API: http{{ if $.Values.ingresses.embeddings.tls }}s{{ end }}://{{ $host.host }}{{ .path }}
  {{- end }}
{{- end }}
{{- end }}
{{- if contains "LoadBalancer" .Values.service.type }}
  NOTE: It may take a few minutes for the LoadBalancer IP to be available.
        You can watch the status by running 'kubectl get --namespace {{ .Release.Namespace }} svc -w {{ include "server.llama.cpp.fullname" . }}'
  export SERVICE_IP=$(kubectl get svc --namespace {{ .Release.Namespace }} {{ include "server.llama.cpp.fullname" . }} --template "{{"{{ range (index .status.loadBalancer.ingress 0) }}{{.}}{{ end }}"}}")
  echo "llama.cpp server: http://$SERVICE_IP:{{ .Values.service.port }}"
{{- else if contains "ClusterIP" .Values.service.type }}
  export POD_NAME=$(kubectl get pods --namespace {{ .Release.Namespace }} -l "app.kubernetes.io/name={{ include "server.llama.cpp.name" . }},app.kubernetes.io/instance={{ .Release.Name }}" -o jsonpath="{.items[0].metadata.name}")
  export CONTAINER_PORT=$(kubectl get pod --namespace {{ .Release.Namespace }} $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  echo "Visit http://127.0.0.1:8080 to use your application"
  kubectl --namespace {{ .Release.Namespace }} port-forward $POD_NAME 8080:$CONTAINER_PORT
{{- end }}

2. llama.cpp server is configured with:
   - Model: {{ .Values.model.alias }} ({{ .Values.model.file }})
   - GPU support: {{ .Values.gpu.enabled }}
   - Context size: {{ .Values.server.kvCache.size }}
   - Parallel slots: {{ .Values.server.slots }}
   - Flash attention: enabled (via -fa flag)
   - GPU layers offloaded: 99

3. API endpoints:
   - Health: http://<SERVICE_IP>:{{ .Values.service.port }}/health
   - Completions: http://<SERVICE_IP>:{{ .Values.service.port }}/v1/completions
{{- if .Values.server.embeddings }}
   - Embeddings: http://<SERVICE_IP>:{{ .Values.service.port }}/v1/embeddings
{{- end }}
{{- if .Values.server.metrics }}
   - Metrics (Prometheus): http://<SERVICE_IP>:{{ .Values.service.port }}/metrics
{{- end }}

4. Example curl command:
   curl -X POST http://<SERVICE_IP>:{{ .Values.service.port }}/v1/completions \
     -H "Content-Type: application/json" \
     -d '{"prompt": "Once upon a time", "max_tokens": 100}'
