Skip to content

Add script for pd disagg on XPU#5

Open
zhenwei-intel wants to merge 19 commits into
mainfrom
pd_script
Open

Add script for pd disagg on XPU#5
zhenwei-intel wants to merge 19 commits into
mainfrom
pd_script

Conversation

@zhenwei-intel
Copy link
Copy Markdown
Owner

@zhenwei-intel zhenwei-intel commented Mar 7, 2026

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds XPU-focused shell scripts to run NixlConnector prefill/decode (“P/D disagg”) accuracy integration tests and to sweep multiple TP/DP-EP configurations.

Changes:

  • Added run_accuracy_test_xpu.sh to launch prefill/decode vllm serve instances on XPU and run test_accuracy.py through the toy proxy.
  • Added config_sweep_accuracy_test_xpu.sh to run the XPU accuracy test script across multiple TP (and optional DP-EP) configurations and attention backends.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
tests/v1/kv_connector/nixl_integration/run_accuracy_test_xpu.sh New runner script to launch XPU-based prefill/decode instances, start the proxy, and execute accuracy pytest.
tests/v1/kv_connector/nixl_integration/config_sweep_accuracy_test_xpu.sh New wrapper to sweep environment configurations and optional attention backends/cross-layer mode for the XPU runner.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

;;
*)
echo "Unknown option $1"
echo "Usage: $0 [--kv_buffer_device <xpu|cpu>] [--attention-backend <backend>]"
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage/help text doesn’t mention the supported --enable-cross-layers flag, so users won’t discover it from --help/error output. Please include that option in the usage string (and ideally document what it changes).

Suggested change
echo "Usage: $0 [--kv_buffer_device <xpu|cpu>] [--attention-backend <backend>]"
echo "Usage: $0 [--kv_buffer_device <xpu|cpu>] [--attention-backend <backend>] [--enable-cross-layers]"
echo " --enable-cross-layers Enable cross-layer KV connector blocks in kv_connector_extra_config."

Copilot uses AI. Check for mistakes.
Comment on lines +73 to +74
# Trap the SIGINT signal (triggered by Ctrl+C)
trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The EXIT/SIGINT trap runs kill $(jobs -pr). If there are no background jobs (e.g., argument parsing fails before any processes are started), kill is invoked with no PIDs and returns a non-zero status, which can cause confusing failures under set -e. Consider guarding for an empty job list or using an approach that no-ops cleanly when there’s nothing to kill.

Suggested change
# Trap the SIGINT signal (triggered by Ctrl+C)
trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT
# Trap the SIGINT/SIGTERM/EXIT signals and clean up background jobs safely
trap 'jobs -pr | xargs kill 2>/dev/null || true' SIGINT SIGTERM EXIT

Copilot uses AI. Check for mistakes.
# Function to clean up previous instances
cleanup_instances() {
echo "Cleaning up any running vLLM instances..."
pkill -f "vllm serve" || true
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanup_instances uses pkill -f "vllm serve", which can terminate unrelated vLLM servers running on the same host (including other developers’ sessions or parallel CI steps). Prefer tracking the PIDs you start in this script and only stopping those, or scope the match more narrowly (e.g., by port/unique env var).

Suggested change
pkill -f "vllm serve" || true
pkill -P $$ -f "vllm serve" || true

Copilot uses AI. Check for mistakes.
Comment on lines +154 to +157
FULL_CMD="$BASE_CMD"

eval "$FULL_CMD &"

Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script builds a shell command as a string and runs it via eval. Because parts of the command come from inputs (MODEL_NAMES, --attention-backend, etc.), this creates an avoidable shell-injection risk and makes quoting brittle. Prefer constructing the command as an array and executing it without eval (or otherwise ensure arguments are safely quoted).

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +2
#!/usr/bin/env bash
set -euo pipefail
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description is still the template placeholder (purpose/test plan/results are empty). Please fill it out so reviewers/CI owners know what this script is intended to validate and how it was tested on XPU.

Copilot uses AI. Check for mistakes.
# Check if cross-layers is enabled (non-empty)
if [[ -n "${CROSS_LAYERS_BLOCKS:-}" ]]; then
echo "CROSS_LAYERS_BLOCKS is set, rerunning with --enable-cross-layers"
run_tests "default backend" "--enable-cross-layers"
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When rerunning with --enable-cross-layers, the label passed to run_tests is still "default backend", which makes the logs ambiguous about which mode failed. Consider using a distinct label (e.g., "cross-layers") for clearer output.

Suggested change
run_tests "default backend" "--enable-cross-layers"
run_tests "cross-layers backend" "--enable-cross-layers"

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces two new shell scripts for running accuracy tests for prefill/decode disaggregation on Intel XPUs. The security review found no vulnerabilities in the provided code changes. However, the code review identified critical issues such as bugs in the GPU ID assignment logic that will cause the script to fail, and a potential gap in test coverage where different configurations are not tested in combination. Although no direct vulnerability was found, the use of eval for command execution was noted as a potential security risk. Suggestions have been provided to address these problems.

Comment on lines +121 to +127
GPU_ID=$((i % $(get_num_gpus)))
NEXT_GPU=${GPU_ID}
# If PREFILLER_TP_SIZE is more than 1
for (( j=1; j < PREFILLER_TP_SIZE; j++ )); do
NEXT_GPU=$(((GPU_ID + j) % $(get_num_gpus)))
GPU_ID="${GPU_ID},${NEXT_GPU}"
done
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There is a bug in the GPU ID calculation logic. The variable GPU_ID is first assigned an integer value, but inside the loop it is modified to be a comma-separated string (e.g., "0,1"). In the next iteration, this string is used in an arithmetic expression (GPU_ID + j), which will cause a syntax error and terminate the script due to set -e.

To fix this, you should use a base GPU ID for calculations and build the final GPU_ID string from a list of IDs.

Suggested change
GPU_ID=$((i % $(get_num_gpus)))
NEXT_GPU=${GPU_ID}
# If PREFILLER_TP_SIZE is more than 1
for (( j=1; j < PREFILLER_TP_SIZE; j++ )); do
NEXT_GPU=$(((GPU_ID + j) % $(get_num_gpus)))
GPU_ID="${GPU_ID},${NEXT_GPU}"
done
local base_gpu_id=$((i % $(get_num_gpus)))
local -a gpu_ids=()
for (( j=0; j < PREFILLER_TP_SIZE; j++ )); do
gpu_ids+=($(((base_gpu_id + j) % $(get_num_gpus))))
done
GPU_ID=$(IFS=,; echo "${gpu_ids[*]}")
NEXT_GPU=${gpu_ids[-1]}

Comment on lines +166 to +171
GPU_ID=$(((i + NEXT_GPU + 1) % $(get_num_gpus)))
# If DECODER_TP_SIZE is more than 1
for (( j=1; j < DECODER_TP_SIZE; j++ )); do
NEXT_GPU=$(((GPU_ID + j) % $(get_num_gpus)))
GPU_ID="${GPU_ID},${NEXT_GPU}"
done
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This GPU ID calculation has the same bug as the one for prefill instances. The GPU_ID variable is used in an arithmetic expression inside the loop while it's being modified into a comma-separated string. This will cause a syntax error.

Suggested change
GPU_ID=$(((i + NEXT_GPU + 1) % $(get_num_gpus)))
# If DECODER_TP_SIZE is more than 1
for (( j=1; j < DECODER_TP_SIZE; j++ )); do
NEXT_GPU=$(((GPU_ID + j) % $(get_num_gpus)))
GPU_ID="${GPU_ID},${NEXT_GPU}"
done
local base_gpu_id=$(((i + NEXT_GPU + 1) % $(get_num_gpus)))
local -a gpu_ids=()
for (( j=0; j < DECODER_TP_SIZE; j++ )); do
gpu_ids+=($(((base_gpu_id + j) % $(get_num_gpus))))
done
GPU_ID=$(IFS=,; echo "${gpu_ids[*]}")

Comment on lines +51 to +71
# Run tests
if [[ -n "${ROCM_ATTN:-}" ]]; then
echo "ROCM_ATTN is set, running with --attention-backend ROCM_ATTN"
run_tests "ROCM_ATTN backend" "--attention-backend ROCM_ATTN"
else
run_tests "default backend" ""
fi

# Check if FLASHINFER is set (non-empty)
if [[ -n "${FLASHINFER:-}" ]]; then
echo "FLASHINFER is set, rerunning with --attention-backend FLASHINFER"
run_tests "FLASHINFER backend" "--attention-backend FLASHINFER"
else
echo "FLASHINFER not set, skipping FLASHINFER runs."
fi

# Check if cross-layers is enabled (non-empty)
if [[ -n "${CROSS_LAYERS_BLOCKS:-}" ]]; then
echo "CROSS_LAYERS_BLOCKS is set, rerunning with --enable-cross-layers"
run_tests "default backend" "--enable-cross-layers"
fi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current test execution logic does not test combinations of attention backends and features. For instance, if both ROCM_ATTN and CROSS_LAYERS_BLOCKS are set, the script will test:

  1. ROCM_ATTN backend without cross-layers.
  2. Default backend with cross-layers.

It will not test the combination of ROCM_ATTN with cross-layers. The same applies to FLASHINFER.

This seems like a significant gap in test coverage. Consider restructuring the test execution to cover these important combinations. A possible approach would be to build a list of arguments for each feature and then iterate through their combinations.

Comment on lines +137 to +156
BASE_CMD="ZE_AFFINITY_MASK=$GPU_ID \
VLLM_KV_CACHE_LAYOUT='HND' \
UCX_NET_DEVICES=all \
VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT \
vllm serve $model_name \
--port $PORT \
--enforce-eager \
--block-size ${PREFILL_BLOCK_SIZE} \
--gpu-memory-utilization $GPU_MEMORY_UTILIZATION \
--tensor-parallel-size $PREFILLER_TP_SIZE \
--kv-transfer-config '$KV_CONFIG'"

# Add attention backend config if specified
if [[ -n "$ATTENTION_BACKEND" ]]; then
BASE_CMD="${BASE_CMD} --attention-backend=$ATTENTION_BACKEND"
fi

FULL_CMD="$BASE_CMD"

eval "$FULL_CMD &"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using eval to construct and run commands is generally unsafe as it can lead to shell injection vulnerabilities if any of the variables contain special characters. It's better to use an array to build the command and env to set environment variables. This avoids the need for eval and makes the code more secure and robust.

    local -a cmd_args=(
        vllm serve "$model_name" \
        --port "$PORT" \
        --enforce-eager \
        --block-size "${PREFILL_BLOCK_SIZE}" \
        --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
        --tensor-parallel-size "$PREFILLER_TP_SIZE" \
        --kv-transfer-config "$KV_CONFIG"
    )

    # Add attention backend config if specified
    if [[ -n "$ATTENTION_BACKEND" ]]; then
      cmd_args+=(--attention-backend "$ATTENTION_BACKEND")
    fi

    env \
        ZE_AFFINITY_MASK="$GPU_ID" \
        VLLM_KV_CACHE_LAYOUT='HND' \
        UCX_NET_DEVICES=all \
        VLLM_NIXL_SIDE_CHANNEL_PORT="$SIDE_CHANNEL_PORT" \
        "${cmd_args[@]}" &

Comment on lines +180 to +207
BASE_CMD="ZE_AFFINITY_MASK=$GPU_ID \
VLLM_KV_CACHE_LAYOUT=$DECODER_KV_LAYOUT \
UCX_NET_DEVICES=all \
VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT \
vllm serve $model_name \
--port $PORT \
--enforce-eager \
--block-size ${DECODE_BLOCK_SIZE} \
--gpu-memory-utilization $GPU_MEMORY_UTILIZATION \
--kv-transfer-config '$KV_CONFIG'"

# Add attention backend config if specified
if [[ -n "$ATTENTION_BACKEND" ]]; then
BASE_CMD="${BASE_CMD} --attention-backend=$ATTENTION_BACKEND"
fi

# DP-EP attention mode
if [[ -z "$DP_EP" ]]; then
BASE_CMD="${BASE_CMD} --tensor-parallel-size $DECODER_TP_SIZE"
else
echo "DP-EP Attention enabled, deploying with dp=DECODER_TP_SIZE and tp=1"
BASE_CMD="${BASE_CMD} --data-parallel-size $DECODER_TP_SIZE \
--tensor-parallel-size 1 --enable-expert-parallel"
fi

FULL_CMD="$BASE_CMD"

eval "$FULL_CMD &"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using eval to construct and run commands is generally unsafe as it can lead to shell injection vulnerabilities if any of the variables contain special characters. It's better to use an array to build the command and env to set environment variables. This avoids the need for eval and makes the code more secure and robust.

    local -a cmd_args=(
        vllm serve "$model_name" \
        --port "$PORT" \
        --enforce-eager \
        --block-size "${DECODE_BLOCK_SIZE}" \
        --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
        --kv-transfer-config "$KV_CONFIG"
    )

    # Add attention backend config if specified
    if [[ -n "$ATTENTION_BACKEND" ]]; then
      cmd_args+=(--attention-backend "$ATTENTION_BACKEND")
    fi

  # DP-EP attention mode
  if [[ -z "$DP_EP" ]]; then
    cmd_args+=(--tensor-parallel-size "$DECODER_TP_SIZE")
  else
    echo "DP-EP Attention enabled, deploying with dp=DECODER_TP_SIZE and tp=1"
    cmd_args+=(--data-parallel-size "$DECODER_TP_SIZE" \
    --tensor-parallel-size 1 --enable-expert-parallel)
  fi

    env \
        ZE_AFFINITY_MASK="$GPU_ID" \
        VLLM_KV_CACHE_LAYOUT="$DECODER_KV_LAYOUT" \
        UCX_NET_DEVICES=all \
        VLLM_NIXL_SIDE_CHANNEL_PORT="$SIDE_CHANNEL_PORT" \
        "${cmd_args[@]}" &

…on MI300x (vllm-project#36247)

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
majiayu000 and others added 18 commits March 7, 2026 22:09
Signed-off-by: majiayu000 <1835304752@qq.com>
Co-authored-by: mcelrath
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Signed-off-by: rahul-sarvam <140298821+rahul-sarvam@users.noreply.github.com>
…n chat tests (vllm-project#36341)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Natan Bagrov <nbagrov@nvidia.com>
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.