Add script for pd disagg on XPU#5
Conversation
There was a problem hiding this comment.
Pull request overview
Adds XPU-focused shell scripts to run NixlConnector prefill/decode (“P/D disagg”) accuracy integration tests and to sweep multiple TP/DP-EP configurations.
Changes:
- Added
run_accuracy_test_xpu.shto launch prefill/decodevllm serveinstances on XPU and runtest_accuracy.pythrough the toy proxy. - Added
config_sweep_accuracy_test_xpu.shto run the XPU accuracy test script across multiple TP (and optional DP-EP) configurations and attention backends.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
tests/v1/kv_connector/nixl_integration/run_accuracy_test_xpu.sh |
New runner script to launch XPU-based prefill/decode instances, start the proxy, and execute accuracy pytest. |
tests/v1/kv_connector/nixl_integration/config_sweep_accuracy_test_xpu.sh |
New wrapper to sweep environment configurations and optional attention backends/cross-layer mode for the XPU runner. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ;; | ||
| *) | ||
| echo "Unknown option $1" | ||
| echo "Usage: $0 [--kv_buffer_device <xpu|cpu>] [--attention-backend <backend>]" |
There was a problem hiding this comment.
The usage/help text doesn’t mention the supported --enable-cross-layers flag, so users won’t discover it from --help/error output. Please include that option in the usage string (and ideally document what it changes).
| echo "Usage: $0 [--kv_buffer_device <xpu|cpu>] [--attention-backend <backend>]" | |
| echo "Usage: $0 [--kv_buffer_device <xpu|cpu>] [--attention-backend <backend>] [--enable-cross-layers]" | |
| echo " --enable-cross-layers Enable cross-layer KV connector blocks in kv_connector_extra_config." |
| # Trap the SIGINT signal (triggered by Ctrl+C) | ||
| trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT |
There was a problem hiding this comment.
The EXIT/SIGINT trap runs kill $(jobs -pr). If there are no background jobs (e.g., argument parsing fails before any processes are started), kill is invoked with no PIDs and returns a non-zero status, which can cause confusing failures under set -e. Consider guarding for an empty job list or using an approach that no-ops cleanly when there’s nothing to kill.
| # Trap the SIGINT signal (triggered by Ctrl+C) | |
| trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT | |
| # Trap the SIGINT/SIGTERM/EXIT signals and clean up background jobs safely | |
| trap 'jobs -pr | xargs kill 2>/dev/null || true' SIGINT SIGTERM EXIT |
| # Function to clean up previous instances | ||
| cleanup_instances() { | ||
| echo "Cleaning up any running vLLM instances..." | ||
| pkill -f "vllm serve" || true |
There was a problem hiding this comment.
cleanup_instances uses pkill -f "vllm serve", which can terminate unrelated vLLM servers running on the same host (including other developers’ sessions or parallel CI steps). Prefer tracking the PIDs you start in this script and only stopping those, or scope the match more narrowly (e.g., by port/unique env var).
| pkill -f "vllm serve" || true | |
| pkill -P $$ -f "vllm serve" || true |
| FULL_CMD="$BASE_CMD" | ||
|
|
||
| eval "$FULL_CMD &" | ||
|
|
There was a problem hiding this comment.
The script builds a shell command as a string and runs it via eval. Because parts of the command come from inputs (MODEL_NAMES, --attention-backend, etc.), this creates an avoidable shell-injection risk and makes quoting brittle. Prefer constructing the command as an array and executing it without eval (or otherwise ensure arguments are safely quoted).
| #!/usr/bin/env bash | ||
| set -euo pipefail |
There was a problem hiding this comment.
The PR description is still the template placeholder (purpose/test plan/results are empty). Please fill it out so reviewers/CI owners know what this script is intended to validate and how it was tested on XPU.
| # Check if cross-layers is enabled (non-empty) | ||
| if [[ -n "${CROSS_LAYERS_BLOCKS:-}" ]]; then | ||
| echo "CROSS_LAYERS_BLOCKS is set, rerunning with --enable-cross-layers" | ||
| run_tests "default backend" "--enable-cross-layers" |
There was a problem hiding this comment.
When rerunning with --enable-cross-layers, the label passed to run_tests is still "default backend", which makes the logs ambiguous about which mode failed. Consider using a distinct label (e.g., "cross-layers") for clearer output.
| run_tests "default backend" "--enable-cross-layers" | |
| run_tests "cross-layers backend" "--enable-cross-layers" |
There was a problem hiding this comment.
Code Review
This pull request introduces two new shell scripts for running accuracy tests for prefill/decode disaggregation on Intel XPUs. The security review found no vulnerabilities in the provided code changes. However, the code review identified critical issues such as bugs in the GPU ID assignment logic that will cause the script to fail, and a potential gap in test coverage where different configurations are not tested in combination. Although no direct vulnerability was found, the use of eval for command execution was noted as a potential security risk. Suggestions have been provided to address these problems.
| GPU_ID=$((i % $(get_num_gpus))) | ||
| NEXT_GPU=${GPU_ID} | ||
| # If PREFILLER_TP_SIZE is more than 1 | ||
| for (( j=1; j < PREFILLER_TP_SIZE; j++ )); do | ||
| NEXT_GPU=$(((GPU_ID + j) % $(get_num_gpus))) | ||
| GPU_ID="${GPU_ID},${NEXT_GPU}" | ||
| done |
There was a problem hiding this comment.
There is a bug in the GPU ID calculation logic. The variable GPU_ID is first assigned an integer value, but inside the loop it is modified to be a comma-separated string (e.g., "0,1"). In the next iteration, this string is used in an arithmetic expression (GPU_ID + j), which will cause a syntax error and terminate the script due to set -e.
To fix this, you should use a base GPU ID for calculations and build the final GPU_ID string from a list of IDs.
| GPU_ID=$((i % $(get_num_gpus))) | |
| NEXT_GPU=${GPU_ID} | |
| # If PREFILLER_TP_SIZE is more than 1 | |
| for (( j=1; j < PREFILLER_TP_SIZE; j++ )); do | |
| NEXT_GPU=$(((GPU_ID + j) % $(get_num_gpus))) | |
| GPU_ID="${GPU_ID},${NEXT_GPU}" | |
| done | |
| local base_gpu_id=$((i % $(get_num_gpus))) | |
| local -a gpu_ids=() | |
| for (( j=0; j < PREFILLER_TP_SIZE; j++ )); do | |
| gpu_ids+=($(((base_gpu_id + j) % $(get_num_gpus)))) | |
| done | |
| GPU_ID=$(IFS=,; echo "${gpu_ids[*]}") | |
| NEXT_GPU=${gpu_ids[-1]} |
| GPU_ID=$(((i + NEXT_GPU + 1) % $(get_num_gpus))) | ||
| # If DECODER_TP_SIZE is more than 1 | ||
| for (( j=1; j < DECODER_TP_SIZE; j++ )); do | ||
| NEXT_GPU=$(((GPU_ID + j) % $(get_num_gpus))) | ||
| GPU_ID="${GPU_ID},${NEXT_GPU}" | ||
| done |
There was a problem hiding this comment.
This GPU ID calculation has the same bug as the one for prefill instances. The GPU_ID variable is used in an arithmetic expression inside the loop while it's being modified into a comma-separated string. This will cause a syntax error.
| GPU_ID=$(((i + NEXT_GPU + 1) % $(get_num_gpus))) | |
| # If DECODER_TP_SIZE is more than 1 | |
| for (( j=1; j < DECODER_TP_SIZE; j++ )); do | |
| NEXT_GPU=$(((GPU_ID + j) % $(get_num_gpus))) | |
| GPU_ID="${GPU_ID},${NEXT_GPU}" | |
| done | |
| local base_gpu_id=$(((i + NEXT_GPU + 1) % $(get_num_gpus))) | |
| local -a gpu_ids=() | |
| for (( j=0; j < DECODER_TP_SIZE; j++ )); do | |
| gpu_ids+=($(((base_gpu_id + j) % $(get_num_gpus)))) | |
| done | |
| GPU_ID=$(IFS=,; echo "${gpu_ids[*]}") |
| # Run tests | ||
| if [[ -n "${ROCM_ATTN:-}" ]]; then | ||
| echo "ROCM_ATTN is set, running with --attention-backend ROCM_ATTN" | ||
| run_tests "ROCM_ATTN backend" "--attention-backend ROCM_ATTN" | ||
| else | ||
| run_tests "default backend" "" | ||
| fi | ||
|
|
||
| # Check if FLASHINFER is set (non-empty) | ||
| if [[ -n "${FLASHINFER:-}" ]]; then | ||
| echo "FLASHINFER is set, rerunning with --attention-backend FLASHINFER" | ||
| run_tests "FLASHINFER backend" "--attention-backend FLASHINFER" | ||
| else | ||
| echo "FLASHINFER not set, skipping FLASHINFER runs." | ||
| fi | ||
|
|
||
| # Check if cross-layers is enabled (non-empty) | ||
| if [[ -n "${CROSS_LAYERS_BLOCKS:-}" ]]; then | ||
| echo "CROSS_LAYERS_BLOCKS is set, rerunning with --enable-cross-layers" | ||
| run_tests "default backend" "--enable-cross-layers" | ||
| fi |
There was a problem hiding this comment.
The current test execution logic does not test combinations of attention backends and features. For instance, if both ROCM_ATTN and CROSS_LAYERS_BLOCKS are set, the script will test:
ROCM_ATTNbackend without cross-layers.- Default backend with cross-layers.
It will not test the combination of ROCM_ATTN with cross-layers. The same applies to FLASHINFER.
This seems like a significant gap in test coverage. Consider restructuring the test execution to cover these important combinations. A possible approach would be to build a list of arguments for each feature and then iterate through their combinations.
| BASE_CMD="ZE_AFFINITY_MASK=$GPU_ID \ | ||
| VLLM_KV_CACHE_LAYOUT='HND' \ | ||
| UCX_NET_DEVICES=all \ | ||
| VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT \ | ||
| vllm serve $model_name \ | ||
| --port $PORT \ | ||
| --enforce-eager \ | ||
| --block-size ${PREFILL_BLOCK_SIZE} \ | ||
| --gpu-memory-utilization $GPU_MEMORY_UTILIZATION \ | ||
| --tensor-parallel-size $PREFILLER_TP_SIZE \ | ||
| --kv-transfer-config '$KV_CONFIG'" | ||
|
|
||
| # Add attention backend config if specified | ||
| if [[ -n "$ATTENTION_BACKEND" ]]; then | ||
| BASE_CMD="${BASE_CMD} --attention-backend=$ATTENTION_BACKEND" | ||
| fi | ||
|
|
||
| FULL_CMD="$BASE_CMD" | ||
|
|
||
| eval "$FULL_CMD &" |
There was a problem hiding this comment.
Using eval to construct and run commands is generally unsafe as it can lead to shell injection vulnerabilities if any of the variables contain special characters. It's better to use an array to build the command and env to set environment variables. This avoids the need for eval and makes the code more secure and robust.
local -a cmd_args=(
vllm serve "$model_name" \
--port "$PORT" \
--enforce-eager \
--block-size "${PREFILL_BLOCK_SIZE}" \
--gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
--tensor-parallel-size "$PREFILLER_TP_SIZE" \
--kv-transfer-config "$KV_CONFIG"
)
# Add attention backend config if specified
if [[ -n "$ATTENTION_BACKEND" ]]; then
cmd_args+=(--attention-backend "$ATTENTION_BACKEND")
fi
env \
ZE_AFFINITY_MASK="$GPU_ID" \
VLLM_KV_CACHE_LAYOUT='HND' \
UCX_NET_DEVICES=all \
VLLM_NIXL_SIDE_CHANNEL_PORT="$SIDE_CHANNEL_PORT" \
"${cmd_args[@]}" &| BASE_CMD="ZE_AFFINITY_MASK=$GPU_ID \ | ||
| VLLM_KV_CACHE_LAYOUT=$DECODER_KV_LAYOUT \ | ||
| UCX_NET_DEVICES=all \ | ||
| VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT \ | ||
| vllm serve $model_name \ | ||
| --port $PORT \ | ||
| --enforce-eager \ | ||
| --block-size ${DECODE_BLOCK_SIZE} \ | ||
| --gpu-memory-utilization $GPU_MEMORY_UTILIZATION \ | ||
| --kv-transfer-config '$KV_CONFIG'" | ||
|
|
||
| # Add attention backend config if specified | ||
| if [[ -n "$ATTENTION_BACKEND" ]]; then | ||
| BASE_CMD="${BASE_CMD} --attention-backend=$ATTENTION_BACKEND" | ||
| fi | ||
|
|
||
| # DP-EP attention mode | ||
| if [[ -z "$DP_EP" ]]; then | ||
| BASE_CMD="${BASE_CMD} --tensor-parallel-size $DECODER_TP_SIZE" | ||
| else | ||
| echo "DP-EP Attention enabled, deploying with dp=DECODER_TP_SIZE and tp=1" | ||
| BASE_CMD="${BASE_CMD} --data-parallel-size $DECODER_TP_SIZE \ | ||
| --tensor-parallel-size 1 --enable-expert-parallel" | ||
| fi | ||
|
|
||
| FULL_CMD="$BASE_CMD" | ||
|
|
||
| eval "$FULL_CMD &" |
There was a problem hiding this comment.
Using eval to construct and run commands is generally unsafe as it can lead to shell injection vulnerabilities if any of the variables contain special characters. It's better to use an array to build the command and env to set environment variables. This avoids the need for eval and makes the code more secure and robust.
local -a cmd_args=(
vllm serve "$model_name" \
--port "$PORT" \
--enforce-eager \
--block-size "${DECODE_BLOCK_SIZE}" \
--gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
--kv-transfer-config "$KV_CONFIG"
)
# Add attention backend config if specified
if [[ -n "$ATTENTION_BACKEND" ]]; then
cmd_args+=(--attention-backend "$ATTENTION_BACKEND")
fi
# DP-EP attention mode
if [[ -z "$DP_EP" ]]; then
cmd_args+=(--tensor-parallel-size "$DECODER_TP_SIZE")
else
echo "DP-EP Attention enabled, deploying with dp=DECODER_TP_SIZE and tp=1"
cmd_args+=(--data-parallel-size "$DECODER_TP_SIZE" \
--tensor-parallel-size 1 --enable-expert-parallel)
fi
env \
ZE_AFFINITY_MASK="$GPU_ID" \
VLLM_KV_CACHE_LAYOUT="$DECODER_KV_LAYOUT" \
UCX_NET_DEVICES=all \
VLLM_NIXL_SIDE_CHANNEL_PORT="$SIDE_CHANNEL_PORT" \
"${cmd_args[@]}" &…on MI300x (vllm-project#36247) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: majiayu000 <1835304752@qq.com> Co-authored-by: mcelrath
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Signed-off-by: rahul-sarvam <140298821+rahul-sarvam@users.noreply.github.com>
…n chat tests (vllm-project#36341) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Natan Bagrov <nbagrov@nvidia.com>
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.