Commit 4e2e7ff

workflows: Add vLLM workflow for LLM inference and production deployment
Add support for deploying and testing the vLLM inference engine and the
vLLM Production Stack. The workflow enables automated testing of vLLM both
as a single-node inference server and through the production stack's
cluster-wide orchestration capabilities, including routing, scaling, and
distributed caching. We start off with CPU support for both.

For the production stack two replicas are requested, so two engines, each
requiring 16 GiB of memory. Given other requirements we ask for at least
64 GiB of RAM for the production stack vLLM CPU test. To get the
production stack up and running you just use:

  make defconfig-vllm-production-stack-cpu KDEVOPS_HOSTS_PREFIX="demo"
  make
  make bringup
  make vllm AV=2

At this point you end up with two replicas serving through the vLLM
production stack router.

vLLM is a high-performance inference engine for large language models,
optimized for throughput and memory efficiency through PagedAttention and
continuous batching. The vLLM Production Stack builds on top of this
engine to provide cluster-wide serving with intelligent request routing,
distributed KV cache sharing via LMCache, unified observability, and
autoscaling across multiple model replicas.

The implementation supports three deployment methods: simple Docker
containers for development, Kubernetes with the official Production Stack
Helm chart (https://github.com/vllm-project/production-stack) for cluster
deployments, and bare metal with systemd for direct hardware access. Each
method shares common configuration through Kconfig while maintaining
deployment-specific optimizations.

Testing can be performed with either CPU-only or GPU-accelerated
inference. CPU testing uses openeuler/vllm-cpu images to validate the
vLLM API and the production stack's orchestration layer without requiring
GPU hardware, making it suitable for CI/CD pipelines and development
workflows.
This enables testing of the router's routing algorithms (round-robin,
session affinity, prefix-aware), service discovery, load balancing, and
API compatibility. GPU testing validates full production scenarios
including LMCache distributed cache sharing, tensor parallelism, and
autoscaling behavior.

The workflow integrates Docker registry mirror support with automatic
detection via 9P mounts. When /mirror/docker is available, the system
automatically configures the Docker daemon's registry-mirrors for
transparent pull-through caching, reducing deployment time without
requiring manual configuration. The detection uses the libvirt gateway IP
to ensure proper routing from containers and minikube pods.

Image configuration follows Docker's native registry-mirrors pattern
rather than rewriting image names. This preserves the original repository
paths like 'openeuler/vllm-cpu:latest' and
'ghcr.io/vllm-project/production-stack/router:latest' while still
benefiting from mirror caching when available.

Status monitoring is provided through:

  make vllm-status
  make vllm-status-simplified

which parse deployment state and present it with context-aware guidance
about next steps. The vllm-quick-test target provides rapid smoke testing
across all configured nodes with timing measurements and proper exit codes
for CI integration. To test an LLM query:

  make vllm-quick-test

We provide basic documentation to help clarify the distinction between
vLLM (the inference engine) and the Production Stack (the orchestration
layer). For more details refer to the official release announcement at:

  https://blog.lmcache.ai/2025-01-21-stack-release/

The long term plan is to scale with mocked engines, and then also real
GPU support, both bare metal and in the cloud, leveraging kdevops's
cloud-agnostic power for any workflow.
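The registry-mirrors pattern described above amounts to a small Docker
daemon config entry on the guest. A minimal sketch of the idea (the
gateway IP and port below are illustrative assumptions, not values taken
from this commit; the actual address is detected at deploy time from the
libvirt gateway):

```json
{
  "registry-mirrors": ["http://192.168.122.1:5000"]
}
```

With an entry like this in /etc/docker/daemon.json, Docker Hub pulls such
as 'openeuler/vllm-cpu:latest' keep their original repository path; the
daemon simply tries the mirror first and falls back to the upstream
registry when the mirror is unavailable.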
Here's an example quick test:

mcgrof@beefy-server /xfs1/mcgrof/vllm/kdevops (git::vllm-v2)$ make vllm-quick-test
========================================
vLLM Quick Test
========================================
Prompt: "kdevops is"
Max tokens: 30
Nodes to test: 1

Testing Baseline node: lpc-vllm
----------------------------------------
Node IP: 192.168.122.170
Starting kubectl port-forward...
Sending request: "kdevops is"
✓ Success! Duration: 15.747292458s

Full response: "kdevops is easily a higher level doctor than your list.
really it depends on as on what doc is what 15 less ifmay its just
personal preferences."

Full JSON response:
{
  "id": "cmpl-2f031a35c5364d3aaf2b9f0007d46ae5",
  "object": "text_completion",
  "created": 1759424719,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "index": 0,
      "text": " easily a higher level doctor than your list.\nreally it depends on as on what doc is what 15 less ifmay its just personal preferences.\n",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 35,
    "completion_tokens": 30,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
========================================
All tests passed!
========================================

Then for a synthetic benchmark:

  make vllm-benchmark

You should end up with results in workflows/vllm/results/html/

I have put demo results of a synthetic run and also a real workload on a
virtual machine with 64 vCPUs and 64 GiB of DRAM here:

  https://github.com/mcgrof/demo-vllm-benchmark

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
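The quick-test JSON above follows the OpenAI-compatible completions schema
that vLLM serves. As a minimal sketch of consuming such a response (the
payload here is an abridged copy of the capture above, trimmed for
brevity), extracting the generated text and token usage looks like:

```python
import json

# Abridged copy of the /v1/completions response captured by
# `make vllm-quick-test` above (some fields trimmed for brevity).
raw = """
{
  "id": "cmpl-2f031a35c5364d3aaf2b9f0007d46ae5",
  "object": "text_completion",
  "model": "facebook/opt-125m",
  "choices": [
    {
      "index": 0,
      "text": " easily a higher level doctor than your list.",
      "finish_reason": "length"
    }
  ],
  "usage": {"prompt_tokens": 5, "total_tokens": 35, "completion_tokens": 30}
}
"""

resp = json.loads(raw)
completion = resp["choices"][0]["text"]       # the generated continuation
finish = resp["choices"][0]["finish_reason"]  # "length" means max_tokens was hit
tokens = resp["usage"]["completion_tokens"]   # tokens actually generated
print(f"{resp['model']}: {tokens} tokens, finish_reason={finish}")
```

A finish_reason of "length" confirms the engine stopped because the
requested 30-token budget was exhausted, not because the model emitted a
stop token.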
1 parent 343fbdf commit 4e2e7ff

40 files changed: +5076 / -2 lines

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -91,6 +91,7 @@ playbooks/roles/linux-mirror/linux-mirror-systemd/mirrors.yaml
 workflows/selftests/results/
 workflows/minio/results/
+workflows/vllm/results/
 workflows/linux/refs/default/Kconfig.linus
 workflows/linux/refs/default/Kconfig.next

PROMPTS.md

Lines changed: 31 additions & 0 deletions
@@ -5,6 +5,37 @@ and example commits and their outcomes, and notes by users of the AI agent
 grading. It is also instructive for humans to learn how to use generative
 AI to easily extend kdevops for their own needs.
 
+## Adding new AI/ML workflows
+
+### Adding vLLM Production Stack workflow
+
+**Prompt:**
+I have placed in ../production-stack/ the https://github.com/vllm-project/production-stack.git
+project. Familiarize yourself with it and then add support for as a new
+I workflow, other than Milvus AI on kdevops.
+
+**AI:** Claude Code
+**Commit:** TBD
+**Result:** Tough
+**Grading:** 50%
+
+**Notes:**
+
+Adding just vllm was fairly trivial. However the production stack project
+lacked any clear documentation about what docker container image could be
+used for CPU support, and all docker container images had one or another
+obscure issue.
+
+So while getting the vllm and the production stack generally supported was
+fairly trivial, the lack of proper docs made it hard to figure out exactly
+what to do.
+
+Fortunately the implementation correctly identified the need for Kubernetes
+orchestration, included support for various deployment options (Minikube vs
+existing clusters), and integrated monitoring with Prometheus/Grafana. The
+workflow supports A/B testing, multiple routing algorithms, and performance
+benchmarking capabilities.
+
 ## Extending existing Linux kernel selftests
 
 Below are a set of example prompts / result commits of extending existing

README.md

Lines changed: 24 additions & 2 deletions
@@ -285,10 +285,30 @@ For detailed documentation and demo results, see the
 
 ### AI workflow
 
-kdevops now supports AI/ML system benchmarking, starting with vector databases
-like Milvus. Similar to fstests, you can quickly set up and benchmark AI
+kdevops now supports AI/ML system benchmarking, including vector databases
+and LLM serving infrastructure. Similar to fstests, you can quickly set up and benchmark AI
 infrastructure with just a few commands:
 
+#### vLLM Production Stack
+Deploy and benchmark large language models using the vLLM Production Stack:
+
+```bash
+make defconfig-vllm
+make bringup
+make vllm
+make vllm-benchmark
+```
+
+The vLLM workflow provides:
+- **Production LLM Deployment**: Kubernetes-based vLLM serving with Helm
+- **Request Routing**: Multiple algorithms (round-robin, session affinity, prefix-aware)
+- **Observability**: Integrated Prometheus and Grafana monitoring
+- **Performance Features**: Prefix caching, chunked prefill, KV cache offloading
+- **A/B Testing**: Compare different model configurations
+
+#### Milvus Vector Database
+Benchmark vector database performance for AI applications:
+
 ```bash
 make defconfig-ai-milvus-docker
 make bringup
@@ -303,6 +323,7 @@ The AI workflow supports:
 - **Demo Results**: View actual benchmark HTML reports and performance visualizations
 
 For details and demo results, see:
+- [kdevops vLLM workflow documentation](workflows/vllm/)
 - [kdevops AI workflow documentation](docs/ai/README.md)
 - [Milvus performance demo results](docs/ai/vector-databases/milvus.md#demo-results)
 
@@ -358,6 +379,7 @@ want to just use the kernel that comes with your Linux distribution.
 * [kdevops selftests docs](docs/selftests.md)
 * [kdevops reboot-limit docs](docs/reboot-limit.md)
 * [kdevops AI workflow docs](docs/ai/README.md)
+* [kdevops vLLM workflow docs](workflows/vllm/)
 
 # kdevops general documentation

defconfigs/vllm

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+# vLLM configuration with Latest Docker deployment
+CONFIG_KDEVOPS_FIRST_RUN=n
+CONFIG_LIBVIRT=y
+CONFIG_LIBVIRT_VCPUS=8
+CONFIG_LIBVIRT_MEM_32G=y
+
+# Workflow configuration
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y
+CONFIG_KDEVOPS_WORKFLOW_DEDICATE_VLLM=y
+
+# vLLM specific configuration
+CONFIG_VLLM_LATEST_DOCKER=y
+CONFIG_VLLM_K8S_MINIKUBE=y
+CONFIG_VLLM_HELM_RELEASE_NAME="vllm"
+CONFIG_VLLM_HELM_NAMESPACE="vllm-system"
+CONFIG_VLLM_MODEL_URL="facebook/opt-125m"
+CONFIG_VLLM_MODEL_NAME="opt-125m"
+CONFIG_VLLM_REPLICA_COUNT=1
+CONFIG_VLLM_USE_CPU_INFERENCE=y
+CONFIG_VLLM_REQUEST_CPU=8
+CONFIG_VLLM_REQUEST_MEMORY="32Gi"
+CONFIG_VLLM_REQUEST_GPU=0
+CONFIG_VLLM_MAX_MODEL_LEN=2048
+CONFIG_VLLM_DTYPE="float32"
+CONFIG_VLLM_TENSOR_PARALLEL_SIZE=1
+CONFIG_VLLM_ROUTER_ENABLED=y
+CONFIG_VLLM_ROUTER_ROUND_ROBIN=y
+CONFIG_VLLM_OBSERVABILITY_ENABLED=y
+CONFIG_VLLM_GRAFANA_PORT=3000
+CONFIG_VLLM_PROMETHEUS_PORT=9090
+CONFIG_VLLM_API_PORT=8000
+CONFIG_VLLM_API_KEY=""
+CONFIG_VLLM_HF_TOKEN=""
+CONFIG_VLLM_BENCHMARK_ENABLED=y
+CONFIG_VLLM_BENCHMARK_DURATION=60
+CONFIG_VLLM_BENCHMARK_CONCURRENT_USERS=10
+CONFIG_VLLM_BENCHMARK_RESULTS_DIR="/data/vllm-benchmark"
Lines changed: 45 additions & 0 deletions

@@ -0,0 +1,45 @@
+# vLLM Production Stack configuration with official Helm chart
+CONFIG_KDEVOPS_FIRST_RUN=n
+CONFIG_LIBVIRT=y
+CONFIG_LIBVIRT_VCPUS=64
+CONFIG_LIBVIRT_MEM_64G=y
+
+# Workflow configuration
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y
+CONFIG_KDEVOPS_WORKFLOW_DEDICATE_VLLM=y
+
+# vLLM Production Stack specific configuration
+CONFIG_VLLM_PRODUCTION_STACK=y
+CONFIG_VLLM_K8S_MINIKUBE=y
+CONFIG_VLLM_VERSION_LATEST=y
+CONFIG_VLLM_HELM_RELEASE_NAME="vllm-prod"
+CONFIG_VLLM_HELM_NAMESPACE="vllm-system"
+CONFIG_VLLM_PROD_STACK_REPO="https://vllm-project.github.io/production-stack"
+CONFIG_VLLM_PROD_STACK_CHART_VERSION="latest"
+CONFIG_VLLM_PROD_STACK_ROUTER_IMAGE="ghcr.io/vllm-project/production-stack/router"
+CONFIG_VLLM_PROD_STACK_ROUTER_TAG="latest"
+CONFIG_VLLM_PROD_STACK_ENABLE_MONITORING=y
+CONFIG_VLLM_PROD_STACK_ENABLE_AUTOSCALING=n
+CONFIG_VLLM_MODEL_URL="facebook/opt-125m"
+CONFIG_VLLM_MODEL_NAME="opt-125m"
+CONFIG_VLLM_REPLICA_COUNT=2
+CONFIG_VLLM_USE_CPU_INFERENCE=y
+CONFIG_VLLM_REQUEST_CPU=8
+CONFIG_VLLM_REQUEST_MEMORY="20Gi"
+CONFIG_VLLM_REQUEST_GPU=0
+CONFIG_VLLM_MAX_MODEL_LEN=2048
+CONFIG_VLLM_DTYPE="float32"
+CONFIG_VLLM_TENSOR_PARALLEL_SIZE=1
+CONFIG_VLLM_ROUTER_ENABLED=y
+CONFIG_VLLM_ROUTER_ROUND_ROBIN=y
+CONFIG_VLLM_OBSERVABILITY_ENABLED=y
+CONFIG_VLLM_GRAFANA_PORT=3000
+CONFIG_VLLM_PROMETHEUS_PORT=9090
+CONFIG_VLLM_API_PORT=8000
+CONFIG_VLLM_BENCHMARK_ENABLED=y
+CONFIG_VLLM_BENCHMARK_DURATION=60
+CONFIG_VLLM_BENCHMARK_CONCURRENT_USERS=10
+CONFIG_VLLM_BENCHMARK_RESULTS_DIR="/data/vllm-benchmark"

defconfigs/vllm-quick-test

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+# vLLM Production Stack quick test configuration (CI/demo)
+CONFIG_KDEVOPS_FIRST_RUN=n
+CONFIG_LIBVIRT=y
+CONFIG_LIBVIRT_VCPUS=4
+CONFIG_LIBVIRT_MEM_16G=y
+
+# Workflow configuration
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y
+CONFIG_KDEVOPS_WORKFLOW_DEDICATE_VLLM=y
+
+# vLLM specific configuration - Quick test mode
+CONFIG_VLLM_PRODUCTION_STACK=y
+CONFIG_VLLM_K8S_MINIKUBE=y
+CONFIG_VLLM_HELM_RELEASE_NAME="vllm"
+CONFIG_VLLM_HELM_NAMESPACE="vllm-system"
+CONFIG_VLLM_MODEL_URL="facebook/opt-125m"
+CONFIG_VLLM_MODEL_NAME="opt-125m"
+CONFIG_VLLM_REPLICA_COUNT=1
+CONFIG_VLLM_REQUEST_CPU=2
+CONFIG_VLLM_REQUEST_MEMORY="8Gi"
+CONFIG_VLLM_REQUEST_GPU=0
+CONFIG_VLLM_GPU_TYPE=""
+CONFIG_VLLM_MAX_MODEL_LEN=512
+CONFIG_VLLM_DTYPE="auto"
+CONFIG_VLLM_GPU_MEMORY_UTILIZATION="0.9"
+CONFIG_VLLM_TENSOR_PARALLEL_SIZE=1
+CONFIG_VLLM_ROUTER_ENABLED=y
+CONFIG_VLLM_ROUTER_ROUND_ROBIN=y
+CONFIG_VLLM_OBSERVABILITY_ENABLED=y
+CONFIG_VLLM_GRAFANA_PORT=3000
+CONFIG_VLLM_PROMETHEUS_PORT=9090
+CONFIG_VLLM_API_PORT=8000
+CONFIG_VLLM_API_KEY=""
+CONFIG_VLLM_HF_TOKEN=""
+CONFIG_VLLM_QUICK_TEST=y
+CONFIG_VLLM_BENCHMARK_ENABLED=y
+CONFIG_VLLM_BENCHMARK_DURATION=30
+CONFIG_VLLM_BENCHMARK_CONCURRENT_USERS=5
+CONFIG_VLLM_BENCHMARK_RESULTS_DIR="/data/vllm-benchmark"

kconfigs/Kconfig.libvirt

Lines changed: 3 additions & 0 deletions
@@ -335,6 +335,7 @@ config LIBVIRT_LARGE_CPU
 
 choice
 	prompt "Guest vCPUs"
+	default LIBVIRT_VCPUS_64 if KDEVOPS_WORKFLOW_DEDICATE_VLLM
 	default LIBVIRT_VCPUS_8
 
 config LIBVIRT_VCPUS_2
@@ -408,6 +409,7 @@ config LIBVIRT_VCPUS_COUNT
 
 choice
 	prompt "How much GiB memory to use per guest"
+	default LIBVIRT_MEM_64G if KDEVOPS_WORKFLOW_DEDICATE_VLLM
 	default LIBVIRT_MEM_4G
 
 config LIBVIRT_MEM_2G
@@ -478,6 +480,7 @@ config LIBVIRT_MEM_MB
 config LIBVIRT_IMAGE_SIZE
 	string "VM image size"
 	output yaml
+	default "100G" if KDEVOPS_WORKFLOW_DEDICATE_VLLM
 	default "20G"
 	depends on GUESTFS
 	help

kconfigs/workflows/Kconfig

Lines changed: 28 additions & 0 deletions
@@ -233,6 +233,14 @@ config KDEVOPS_WORKFLOW_DEDICATE_AI
 	  This will dedicate your configuration to running only the
 	  AI workflow for vector database performance testing.
 
+config KDEVOPS_WORKFLOW_DEDICATE_VLLM
+	bool "vllm"
+	select KDEVOPS_WORKFLOW_ENABLE_VLLM
+	help
+	  This will dedicate your configuration to running only the
+	  vLLM Production Stack workflow for deploying and benchmarking
+	  large language models with Kubernetes.
+
 config KDEVOPS_WORKFLOW_DEDICATE_MINIO
 	bool "minio"
 	select KDEVOPS_WORKFLOW_ENABLE_MINIO
@@ -265,6 +273,7 @@ config KDEVOPS_WORKFLOW_NAME
 	default "mmtests" if KDEVOPS_WORKFLOW_DEDICATE_MMTESTS
 	default "fio-tests" if KDEVOPS_WORKFLOW_DEDICATE_FIO_TESTS
 	default "ai" if KDEVOPS_WORKFLOW_DEDICATE_AI
+	default "vllm" if KDEVOPS_WORKFLOW_DEDICATE_VLLM
 	default "minio" if KDEVOPS_WORKFLOW_DEDICATE_MINIO
 	default "build-linux" if KDEVOPS_WORKFLOW_DEDICATE_BUILD_LINUX
 
@@ -395,6 +404,14 @@ config KDEVOPS_WORKFLOW_NOT_DEDICATED_ENABLE_AI
 	  Select this option if you want to provision AI benchmarks on a
 	  single target node for by-hand testing.
 
+config KDEVOPS_WORKFLOW_NOT_DEDICATED_ENABLE_VLLM
+	bool "vllm"
+	select KDEVOPS_WORKFLOW_ENABLE_VLLM
+	depends on LIBVIRT || TERRAFORM_PRIVATE_NET
+	help
+	  Select this option if you want to provision vLLM Production Stack
+	  on a single target node for by-hand testing and development.
+
 endif # !WORKFLOWS_DEDICATED_WORKFLOW
 
 config KDEVOPS_WORKFLOW_ENABLE_FSTESTS
@@ -530,6 +547,17 @@ source "workflows/ai/Kconfig"
 endmenu
 endif # KDEVOPS_WORKFLOW_ENABLE_AI
 
+config KDEVOPS_WORKFLOW_ENABLE_VLLM
+	bool
+	output yaml
+	default y if KDEVOPS_WORKFLOW_NOT_DEDICATED_ENABLE_VLLM || KDEVOPS_WORKFLOW_DEDICATE_VLLM
+
+if KDEVOPS_WORKFLOW_ENABLE_VLLM
+menu "Configure and run vLLM Production Stack"
+source "workflows/vllm/Kconfig"
+endmenu
+endif # KDEVOPS_WORKFLOW_ENABLE_VLLM
+
 config KDEVOPS_WORKFLOW_ENABLE_MINIO
 	bool
 	output yaml

playbooks/roles/gen_hosts/defaults/main.yml

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@ kdevops_workflow_enable_sysbench: false
 kdevops_workflow_enable_fio_tests: false
 kdevops_workflow_enable_mmtests: false
 kdevops_workflow_enable_ai: false
+kdevops_workflow_enable_vllm: false
 workflows_reboot_limit: false
 kdevops_use_declared_hosts: false

playbooks/roles/gen_hosts/tasks/main.yml

Lines changed: 15 additions & 0 deletions
@@ -270,6 +270,21 @@
     - ansible_hosts_template.stat.exists
     - not kdevops_use_declared_hosts|default(false)|bool
 
+- name: Generate the Ansible hosts file for a dedicated vLLM setup
+  tags: ['hosts']
+  ansible.builtin.template:
+    src: "{{ kdevops_hosts_template }}"
+    dest: "{{ ansible_cfg_inventory }}"
+    force: true
+    trim_blocks: True
+    lstrip_blocks: True
+    mode: '0644'
+  when:
+    - kdevops_workflows_dedicated_workflow
+    - kdevops_workflow_enable_vllm|default(false)|bool
+    - ansible_hosts_template.stat.exists
+    - not kdevops_use_declared_hosts|default(false)|bool
+
 - name: Verify if final host file exists
   ansible.builtin.stat:
     path: "{{ ansible_cfg_inventory }}"
