Add script for pd disagg on XPU by zhenwei-intel · Pull Request #5 · zhenwei-intel/vllm

zhenwei-intel · 2026-03-07T06:26:12Z

Purpose

Test Plan

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Copilot

Pull request overview

Adds XPU-focused shell scripts to run NixlConnector prefill/decode (“P/D disagg”) accuracy integration tests and to sweep multiple TP/DP-EP configurations.

Changes:

Added run_accuracy_test_xpu.sh to launch prefill/decode vllm serve instances on XPU and run test_accuracy.py through the toy proxy.
Added config_sweep_accuracy_test_xpu.sh to run the XPU accuracy test script across multiple TP (and optional DP-EP) configurations and attention backends.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File	Description
`tests/v1/kv_connector/nixl_integration/run_accuracy_test_xpu.sh`	New runner script to launch XPU-based prefill/decode instances, start the proxy, and execute accuracy pytest.
`tests/v1/kv_connector/nixl_integration/config_sweep_accuracy_test_xpu.sh`	New wrapper to sweep environment configurations and optional attention backends/cross-layer mode for the XPU runner.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-07T06:29:36Z

+      ;;
+    *)
+      echo "Unknown option $1"
+      echo "Usage: $0 [--kv_buffer_device <xpu|cpu>] [--attention-backend <backend>]"


The usage/help text doesn’t mention the supported --enable-cross-layers flag, so users won’t discover it from --help/error output. Please include that option in the usage string (and ideally document what it changes).

Suggested change

echo "Usage: $0 [--kv_buffer_device <xpu|cpu>] [--attention-backend <backend>]"

echo "Usage: $0 [--kv_buffer_device <xpu|cpu>] [--attention-backend <backend>] [--enable-cross-layers]"

echo " --enable-cross-layers Enable cross-layer KV connector blocks in kv_connector_extra_config."

Copilot · 2026-03-07T06:29:37Z

+# Trap the SIGINT signal (triggered by Ctrl+C)
+trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT


The EXIT/SIGINT trap runs kill $(jobs -pr). If there are no background jobs (e.g., argument parsing fails before any processes are started), kill is invoked with no PIDs and returns a non-zero status, which can cause confusing failures under set -e. Consider guarding for an empty job list or using an approach that no-ops cleanly when there’s nothing to kill.

Suggested change

# Trap the SIGINT signal (triggered by Ctrl+C)

trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT

# Trap the SIGINT/SIGTERM/EXIT signals and clean up background jobs safely

trap 'jobs -pr | xargs kill 2>/dev/null || true' SIGINT SIGTERM EXIT

Copilot · 2026-03-07T06:29:37Z

+# Function to clean up previous instances
+cleanup_instances() {
+  echo "Cleaning up any running vLLM instances..."
+  pkill -f "vllm serve" || true


cleanup_instances uses pkill -f "vllm serve", which can terminate unrelated vLLM servers running on the same host (including other developers’ sessions or parallel CI steps). Prefer tracking the PIDs you start in this script and only stopping those, or scope the match more narrowly (e.g., by port/unique env var).

Suggested change

pkill -f "vllm serve" || true

pkill -P $$ -f "vllm serve" || true

Copilot · 2026-03-07T06:29:37Z

+    FULL_CMD="$BASE_CMD"
+
+    eval "$FULL_CMD &"
+


The script builds a shell command as a string and runs it via eval. Because parts of the command come from inputs (MODEL_NAMES, --attention-backend, etc.), this creates an avoidable shell-injection risk and makes quoting brittle. Prefer constructing the command as an array and executing it without eval (or otherwise ensure arguments are safely quoted).

Copilot · 2026-03-07T06:29:37Z

+#!/usr/bin/env bash
+set -euo pipefail


The PR description is still the template placeholder (purpose/test plan/results are empty). Please fill it out so reviewers/CI owners know what this script is intended to validate and how it was tested on XPU.

Copilot · 2026-03-07T06:29:37Z

+# Check if cross-layers is enabled (non-empty)
+if [[ -n "${CROSS_LAYERS_BLOCKS:-}" ]]; then
+  echo "CROSS_LAYERS_BLOCKS is set, rerunning with --enable-cross-layers"
+  run_tests "default backend" "--enable-cross-layers"


When rerunning with --enable-cross-layers, the label passed to run_tests is still "default backend", which makes the logs ambiguous about which mode failed. Consider using a distinct label (e.g., "cross-layers") for clearer output.

Suggested change

run_tests "default backend" "--enable-cross-layers"

run_tests "cross-layers backend" "--enable-cross-layers"

gemini-code-assist

Code Review

This pull request introduces two new shell scripts for running accuracy tests for prefill/decode disaggregation on Intel XPUs. The security review found no vulnerabilities in the provided code changes. However, the code review identified critical issues such as bugs in the GPU ID assignment logic that will cause the script to fail, and a potential gap in test coverage where different configurations are not tested in combination. Although no direct vulnerability was found, the use of eval for command execution was noted as a potential security risk. Suggestions have been provided to address these problems.

gemini-code-assist · 2026-03-07T06:29:44Z

+    GPU_ID=$((i % $(get_num_gpus)))
+    NEXT_GPU=${GPU_ID}
+    # If PREFILLER_TP_SIZE is more than 1
+    for (( j=1; j < PREFILLER_TP_SIZE; j++ )); do
+      NEXT_GPU=$(((GPU_ID + j) % $(get_num_gpus)))
+      GPU_ID="${GPU_ID},${NEXT_GPU}"
+    done


There is a bug in the GPU ID calculation logic. The variable GPU_ID is first assigned an integer value, but inside the loop it is modified to be a comma-separated string (e.g., "0,1"). In the next iteration, this string is used in an arithmetic expression (GPU_ID + j), which will cause a syntax error and terminate the script due to set -e.

To fix this, you should use a base GPU ID for calculations and build the final GPU_ID string from a list of IDs.

Suggested change

GPU_ID=$((i % $(get_num_gpus)))

NEXT_GPU=${GPU_ID}

# If PREFILLER_TP_SIZE is more than 1

for (( j=1; j < PREFILLER_TP_SIZE; j++ )); do

NEXT_GPU=$(((GPU_ID + j) % $(get_num_gpus)))

GPU_ID="${GPU_ID},${NEXT_GPU}"

done

local base_gpu_id=$((i % $(get_num_gpus)))

local -a gpu_ids=()

for (( j=0; j < PREFILLER_TP_SIZE; j++ )); do

gpu_ids+=($(((base_gpu_id + j) % $(get_num_gpus))))

done

GPU_ID=$(IFS=,; echo "${gpu_ids[*]}")

NEXT_GPU=${gpu_ids[-1]}

gemini-code-assist · 2026-03-07T06:29:44Z

+    GPU_ID=$(((i + NEXT_GPU + 1) % $(get_num_gpus)))
+    # If DECODER_TP_SIZE is more than 1
+    for (( j=1; j < DECODER_TP_SIZE; j++ )); do
+      NEXT_GPU=$(((GPU_ID + j) % $(get_num_gpus)))
+      GPU_ID="${GPU_ID},${NEXT_GPU}"
+    done


This GPU ID calculation has the same bug as the one for prefill instances. The GPU_ID variable is used in an arithmetic expression inside the loop while it's being modified into a comma-separated string. This will cause a syntax error.

Suggested change

GPU_ID=$(((i + NEXT_GPU + 1) % $(get_num_gpus)))

# If DECODER_TP_SIZE is more than 1

for (( j=1; j < DECODER_TP_SIZE; j++ )); do

NEXT_GPU=$(((GPU_ID + j) % $(get_num_gpus)))

GPU_ID="${GPU_ID},${NEXT_GPU}"

done

local base_gpu_id=$(((i + NEXT_GPU + 1) % $(get_num_gpus)))

local -a gpu_ids=()

for (( j=0; j < DECODER_TP_SIZE; j++ )); do

gpu_ids+=($(((base_gpu_id + j) % $(get_num_gpus))))

done

GPU_ID=$(IFS=,; echo "${gpu_ids[*]}")

gemini-code-assist · 2026-03-07T06:29:45Z

+# Run tests
+if [[ -n "${ROCM_ATTN:-}" ]]; then
+  echo "ROCM_ATTN is set, running with --attention-backend ROCM_ATTN"
+  run_tests "ROCM_ATTN backend" "--attention-backend ROCM_ATTN"
+else
+  run_tests "default backend" ""
+fi
+
+# Check if FLASHINFER is set (non-empty)
+if [[ -n "${FLASHINFER:-}" ]]; then
+  echo "FLASHINFER is set, rerunning with --attention-backend FLASHINFER"
+  run_tests "FLASHINFER backend" "--attention-backend FLASHINFER"
+else
+  echo "FLASHINFER not set, skipping FLASHINFER runs."
+fi
+
+# Check if cross-layers is enabled (non-empty)
+if [[ -n "${CROSS_LAYERS_BLOCKS:-}" ]]; then
+  echo "CROSS_LAYERS_BLOCKS is set, rerunning with --enable-cross-layers"
+  run_tests "default backend" "--enable-cross-layers"
+fi


The current test execution logic does not test combinations of attention backends and features. For instance, if both ROCM_ATTN and CROSS_LAYERS_BLOCKS are set, the script will test:

ROCM_ATTN backend without cross-layers.

Default backend with cross-layers.

It will not test the combination of ROCM_ATTN with cross-layers. The same applies to FLASHINFER.

This seems like a significant gap in test coverage. Consider restructuring the test execution to cover these important combinations. A possible approach would be to build a list of arguments for each feature and then iterate through their combinations.

gemini-code-assist · 2026-03-07T06:29:45Z

+    BASE_CMD="ZE_AFFINITY_MASK=$GPU_ID \
+    VLLM_KV_CACHE_LAYOUT='HND' \
+    UCX_NET_DEVICES=all \
+    VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT \
+    vllm serve $model_name \
+    --port $PORT \
+    --enforce-eager \
+    --block-size ${PREFILL_BLOCK_SIZE} \
+    --gpu-memory-utilization $GPU_MEMORY_UTILIZATION \
+    --tensor-parallel-size $PREFILLER_TP_SIZE \
+    --kv-transfer-config '$KV_CONFIG'"
+
+    # Add attention backend config if specified
+    if [[ -n "$ATTENTION_BACKEND" ]]; then
+      BASE_CMD="${BASE_CMD} --attention-backend=$ATTENTION_BACKEND"
+    fi
+
+    FULL_CMD="$BASE_CMD"
+
+    eval "$FULL_CMD &"


Using eval to construct and run commands is generally unsafe as it can lead to shell injection vulnerabilities if any of the variables contain special characters. It's better to use an array to build the command and env to set environment variables. This avoids the need for eval and makes the code more secure and robust.

local -a cmd_args=( vllm serve "$model_name" \ --port "$PORT" \ --enforce-eager \ --block-size "${PREFILL_BLOCK_SIZE}" \ --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \ --tensor-parallel-size "$PREFILLER_TP_SIZE" \ --kv-transfer-config "$KV_CONFIG" ) # Add attention backend config if specified if [[ -n "$ATTENTION_BACKEND" ]]; then cmd_args+=(--attention-backend "$ATTENTION_BACKEND") fi env \ ZE_AFFINITY_MASK="$GPU_ID" \ VLLM_KV_CACHE_LAYOUT='HND' \ UCX_NET_DEVICES=all \ VLLM_NIXL_SIDE_CHANNEL_PORT="$SIDE_CHANNEL_PORT" \ "${cmd_args[@]}" &

gemini-code-assist · 2026-03-07T06:29:45Z

+    BASE_CMD="ZE_AFFINITY_MASK=$GPU_ID \
+    VLLM_KV_CACHE_LAYOUT=$DECODER_KV_LAYOUT \
+    UCX_NET_DEVICES=all \
+    VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT \
+    vllm serve $model_name \
+    --port $PORT \
+    --enforce-eager \
+    --block-size ${DECODE_BLOCK_SIZE} \
+    --gpu-memory-utilization $GPU_MEMORY_UTILIZATION \
+    --kv-transfer-config '$KV_CONFIG'"
+
+    # Add attention backend config if specified
+    if [[ -n "$ATTENTION_BACKEND" ]]; then
+      BASE_CMD="${BASE_CMD} --attention-backend=$ATTENTION_BACKEND"
+    fi
+
+  # DP-EP attention mode
+  if [[ -z "$DP_EP" ]]; then
+    BASE_CMD="${BASE_CMD} --tensor-parallel-size $DECODER_TP_SIZE"
+  else
+    echo "DP-EP Attention enabled, deploying with dp=DECODER_TP_SIZE and tp=1"
+    BASE_CMD="${BASE_CMD} --data-parallel-size $DECODER_TP_SIZE \
+    --tensor-parallel-size 1 --enable-expert-parallel"
+  fi
+
+    FULL_CMD="$BASE_CMD"
+
+    eval "$FULL_CMD &"


Using eval to construct and run commands is generally unsafe as it can lead to shell injection vulnerabilities if any of the variables contain special characters. It's better to use an array to build the command and env to set environment variables. This avoids the need for eval and makes the code more secure and robust.

local -a cmd_args=( vllm serve "$model_name" \ --port "$PORT" \ --enforce-eager \ --block-size "${DECODE_BLOCK_SIZE}" \ --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \ --kv-transfer-config "$KV_CONFIG" ) # Add attention backend config if specified if [[ -n "$ATTENTION_BACKEND" ]]; then cmd_args+=(--attention-backend "$ATTENTION_BACKEND") fi # DP-EP attention mode if [[ -z "$DP_EP" ]]; then cmd_args+=(--tensor-parallel-size "$DECODER_TP_SIZE") else echo "DP-EP Attention enabled, deploying with dp=DECODER_TP_SIZE and tp=1" cmd_args+=(--data-parallel-size "$DECODER_TP_SIZE" \ --tensor-parallel-size 1 --enable-expert-parallel) fi env \ ZE_AFFINITY_MASK="$GPU_ID" \ VLLM_KV_CACHE_LAYOUT="$DECODER_KV_LAYOUT" \ UCX_NET_DEVICES=all \ VLLM_NIXL_SIDE_CHANNEL_PORT="$SIDE_CHANNEL_PORT" \ "${cmd_args[@]}" &

…on MI300x (vllm-project#36247) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

Signed-off-by: majiayu000 <1835304752@qq.com> Co-authored-by: mcelrath

Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>

Signed-off-by: rahul-sarvam <140298821+rahul-sarvam@users.noreply.github.com>

…ultiprocess mode (vllm-project#35931)

…ct#35831)

…n chat tests (vllm-project#36341) Signed-off-by: Andreas Karatzas <akaratza@amd.com>

Signed-off-by: Natan Bagrov <nbagrov@nvidia.com>

…llm-project#36166)

Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

zhenwei-intel requested a review from Copilot March 7, 2026 06:26

Copilot started reviewing on behalf of zhenwei-intel March 7, 2026 06:26 View session

Copilot AI reviewed Mar 7, 2026

View reviewed changes

gemini-code-assist Bot reviewed Mar 7, 2026

View reviewed changes

[Bugfix] Fix compressed-tensors quantization failure for DeepSeek-R1 …

ee8a295

…on MI300x (vllm-project#36247) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

zhenwei-intel force-pushed the pd_script branch from c83f221 to 8893c4a Compare March 7, 2026 13:14

majiayu000 and others added 18 commits March 7, 2026 22:09

[V0 Deprecation] Remove unused swap_space parameter (vllm-project#36216)

00b814b

Signed-off-by: majiayu000 <1835304752@qq.com> Co-authored-by: mcelrath

[Misc] Remove duplicate parser registration (vllm-project#36303)

5261223

Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>

Adding support to Sarvam's MoE models (vllm-project#33942)

85f50eb

Signed-off-by: rahul-sarvam <140298821+rahul-sarvam@users.noreply.github.com>

[UX][Startup] Account for CUDA graphs during memory profiling (vllm-p…

ebb9cc5

…roject#30515)

[CI] Enable Crosslayer KV layout tests for ROCm platforms (vllm-proje…

eebd146

…ct#35416)

[ROCm][CI] Enable AITER for failing test_gpt_oss test case on MI355 (…

fc46577

…vllm-project#36174)

[ROCm][CI] Accept Different But Valid Output for test_olmoe_tp (vll…

ee54f9c

…m-project#35224)

[Core] NGram GPU Implementation compatible with Async Scheduler (vllm…

a6be75d

…-project#29184)

[Perf] Support FP8 KV cache for Flashinfer MLA Sparse (vllm-project#3…

379689d

…5891)

[compile] Split compile/warmup monitoring (vllm-project#36098)

2dde535

[Bugfix][LMCache][KVConnector] fix potential memory leak in LMCache m…

63298ee

…ultiprocess mode (vllm-project#35931)

[LMCache MP Patch]: Race Condition + Duplicated Block Ids (vllm-proje…

5d6aae4

…ct#35831)

[CI] fix flaky empty responses and add diagnostic assertions in visio…

40077ea

…n chat tests (vllm-project#36341) Signed-off-by: Andreas Karatzas <akaratza@amd.com>

[Model] Nano Nemotron VL - fast media preprocessing (vllm-project#35657)

b7332b0

Signed-off-by: Natan Bagrov <nbagrov@nvidia.com>

[Frontend] Add GPU-less render serving path (vllm launch render) (v…

4497431

…llm-project#36166)

Add support for ModelOpt MXFP8 MoE models (vllm-project#35986)

0a6a3a1

Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>

[cudagraph] fix cudagraph warning in deepseekv32 (vllm-project#28044)

e5ff140

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

Add script for pd disagg on XPU

f999e2c

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

zhenwei-intel force-pushed the pd_script branch from 8893c4a to f999e2c Compare March 9, 2026 02:50

	echo "Usage: $0 [--kv_buffer_device <xpu\|cpu>] [--attention-backend <backend>]"
	echo "Usage: $0 [--kv_buffer_device <xpu\|cpu>] [--attention-backend <backend>] [--enable-cross-layers]"
	echo " --enable-cross-layers Enable cross-layer KV connector blocks in kv_connector_extra_config."

		# Trap the SIGINT signal (triggered by Ctrl+C)
		trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT

	pkill -f "vllm serve" \|\| true
	pkill -P $$ -f "vllm serve" \|\| true

	run_tests "default backend" "--enable-cross-layers"
	run_tests "cross-layers backend" "--enable-cross-layers"

Conversation

zhenwei-intel commented Mar 7, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

19 participants

zhenwei-intel commented Mar 7, 2026 •

edited by github-actions Bot

Loading