Closed
56 commits
2eef210
feat(mp): SHM-based data transfer path for GPGPUs/CPU
hlin99 May 21, 2026
3f8a799
address gemini's comments
hlin99 May 21, 2026
2c71e8a
Use multiprocessing.shared_memory for cross-platform SHM transport
hlin99 May 21, 2026
2f74d25
Merge branch 'dev' into ww21_PR_shm
maobaolong May 22, 2026
eeeb301
Merge branch 'dev' into ww21_PR_shm
maobaolong May 22, 2026
7e87a7a
docs: MPCacheEngine prepare/commit docstrings
hlin99 May 26, 2026
09e81e7
Move SHM logic to MPCacheEngine and add lazy/SHM guard
hlin99 May 26, 2026
7809a3e
add ShmSlotDescriptor schema
hlin99 May 26, 2026
6268f2a
Refactor: move transport files into lmcache/v1/multiprocess/transport/
hlin99 May 26, 2026
d626ad0
Remove redundant shm_name vs use_lazy check
hlin99 May 26, 2026
350876e
Merge branch 'dev' into ww21_PR_shm
hlin99 May 26, 2026
8492a75
abstract server transfer strategy
hlin99 May 26, 2026
c61c215
to a more friendly naming
hlin99 May 26, 2026
3dd3668
fix: support HND formats in MP KV transfer (#3282)
he-yufeng May 26, 2026
1966340
[Fix][Observability] PrometheusLogger instance already created with d…
cr7258 May 26, 2026
ec4dbe1
docs: daily drift check — multi-process mode (2026-05-21) (#3361)
ApostaC May 26, 2026
4bdcb5c
[Fix] Change pinned pointer allocations in non-CUDA equivalents to be…
zhengfeihe May 26, 2026
956b7bd
[Operator]: Force external LMCache MP connector path (#3393)
sammshen May 26, 2026
62bb37d
[Fix]: Skip unpin for non-pinned objects in cleanup_memory_objs (#3385)
zhengfeihe May 26, 2026
370cf94
[CI] smoke-test container images before pushing (#3358)
deng451e May 26, 2026
9a34979
[hipFile]: Add cufile-python compatible shim layer to use AMD's hipFi…
riley-dixon May 26, 2026
1b8785a
[Bugfix] Fix 0-hit async lookup when use_layerwise=true (#3252)
luceinaltis May 26, 2026
bd713b5
fix: prevent TypeError crash when streaming response has zero visible…
weizhoublue May 26, 2026
aac9aa1
Merge remote-tracking branch 'origin/dev' into ww21_PR_shm
Copilot May 27, 2026
9cd6edd
Restore SHM test cases dropped during merge conflict resolution
Copilot May 27, 2026
18b8cba
Merge branch 'dev' into ww21_PR_shm
hlin99 May 27, 2026
dacfe02
fix rebase errors
hlin99 May 27, 2026
dca2548
[CI/CD] Add CI-safe raw-block temp-file tests (#3203)
DongDongJu May 27, 2026
61aa202
[CI/CD] Tighten the threshold requirement for k3 multiprocess test (#…
ApostaC May 27, 2026
4ded4ab
docs: daily drift check — multi-process mode (2026-05-26) (#3399)
ApostaC May 27, 2026
f41931c
[Docs] Combined PR from recent doc drift scannings (#3401)
ApostaC May 27, 2026
b3a7275
Add Qwen3-30B-A3B-Instruct-2507 and Qwen3-235B-A22B-Thinking-2507 to …
pengxin99 May 27, 2026
262759a
[Doc][ROCm] Document gfx950 (MI350X/MI355X) in install example (#3395)
hyukjlee May 27, 2026
acf3c7f
[Hotfix][CI] fix raw-block L2 store result assertions (#3415)
DongDongJu May 27, 2026
713b1a5
[Bugfix] Avoid vLLM import during blake3 token hasher startup (#3416)
DongDongJu May 27, 2026
39e3beb
[FIX] use nixl meta-package on CUDA 13 so L2 adapters load (#3370)
deng451e May 27, 2026
67a72cc
fix: move pytest.ini to project root so CI picks it up (#3250)
abinggo May 27, 2026
29bbd55
[Build] sync torch version with vLLM (2.11.0) (#3348)
github-actions[bot] May 27, 2026
d6b1a08
[refactor]Refactor bench kvcache cli (#3411)
chunxiaozheng May 27, 2026
af4910e
restore shm logic during rebase
hlin99 May 28, 2026
8dae6dd
move files to worker_transfer for clear code layout
hlin99 May 28, 2026
3e0e3ff
[LMCache MP Connector] Report cache hit stats in KVTransferParams (#3…
aeon-x May 28, 2026
0423cfe
refactor: cache _shm_pool_info in __init__ instead of recomputing on …
hlin99 May 28, 2026
3c58046
Split UNREGISTER_KV_CACHE into GPU and non-GPU variants to have
hlin99 May 28, 2026
e479f6f
chore: handle bool/int strict decoding in msgspec_decode
hlin99 May 28, 2026
bb192f3
add auto as default transfer-mode on server side
hlin99 May 28, 2026
d4efd2b
Merge branch 'dev' into ww21_PR_shm
hlin99 May 28, 2026
eda3847
move transfer_context.py into worker_transfer/ package
hlin99 May 28, 2026
b11e368
move server transfer into modules
hlin99 May 28, 2026
ddb3e92
refactor: move SHM pool info to MPCacheEngineContext
hlin99 May 28, 2026
0d74e42
refactor test cases
hlin99 May 28, 2026
4f6032d
update docs according to latest code
hlin99 May 28, 2026
49823dc
fix: correct spelling 'mignt' -> 'might' in cache policy comments (#3…
Jah-yee May 28, 2026
a783898
Refactored name for better semantic clarity.
hlin99 May 28, 2026
4ef643b
Merge branch 'dev' into ww21_PR_shm
hlin99 May 28, 2026
07b49b2
merging files
hlin99 May 28, 2026
7 changes: 7 additions & 0 deletions .buildkite/k3_tests/multiprocess/pipeline.yml
@@ -64,3 +64,10 @@ steps:
     agents: { queue: "k8s" }
     plugins: [{ kubernetes: { podSpec: *pod-2gpu } }]
     artifact_paths: ["*.log"]
+
+  - label: ":compression: cache_stats"
+    command: .buildkite/k3_tests/multiprocess/run.sh cache_stats
+    timeout_in_minutes: 30
+    agents: { queue: "k8s" }
+    plugins: [{ kubernetes: { podSpec: *pod-2gpu } }]
+    artifact_paths: ["*.log"]
214 changes: 214 additions & 0 deletions .buildkite/k3_tests/multiprocess/scripts/run-cache-stats.sh
@@ -0,0 +1,214 @@
#!/usr/bin/env bash
# Test that kv_transfer_params / cached_token_stats flows end-to-end
# through the OpenAI-compatible API when LMCache MP mode is active.
#
# Flow:
# 1. Send a long prompt (cold — populates LMCache, no cache hit)
# 2. Send the same prompt again (warm — should hit LMCache)
# 3. Verify the response contains cached_token_stats with expected values
set -e
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "${SCRIPT_DIR}/../../../.." && pwd)"

source "${REPO_ROOT}/.buildkite/k3_tests/common_scripts/helpers.sh"

# Configuration (inherited from run-single-test.sh)
VLLM_PORT="${VLLM_PORT:-8000}"
MODEL="${MODEL:-Qwen/Qwen3-14B}"
BUILD_ID="${BUILD_ID:-local_$$}"
RESULTS_DIR="${RESULTS_DIR:-/tmp/lmcache_ci_results_${BUILD_ID}}"

STATS_DIR="$RESULTS_DIR/cache_stats"
mkdir -p "$STATS_DIR"

echo "=== Cache Stats Reporting Test ==="
echo "Model: $MODEL"
echo "vLLM Port: $VLLM_PORT"
echo "Results dir: $STATS_DIR"
echo ""

# Build a prompt long enough to span multiple LMCache chunks (default
# chunk_size=256 tokens). Repeating a sentence gives us ~600+ tokens.
LONG_CONTENT="Explain the history of computer science in great detail. $(printf 'The Turing machine is a fundamental concept in theoretical computer science that defines an abstract machine capable of manipulating symbols on a strip of tape according to a table of rules. %.0s' {1..20})"

send_request() {
    local label="$1"
    local output_file="$2"

    echo "--- Sending request: $label ---"
    local http_code
    http_code=$(curl -s -o "$output_file" -w "%{http_code}" \
        -X POST "http://localhost:${VLLM_PORT}/v1/chat/completions" \
        -H "Content-Type: application/json" \
        -d "{
            \"model\": \"${MODEL}\",
            \"messages\": [{\"role\": \"user\", \"content\": $(python3 -c "import json; print(json.dumps('$LONG_CONTENT'))")}],
            \"max_tokens\": 1,
            \"kv_transfer_params\": {\"cached_token_stats\": true}
        }")

    if [ "$http_code" -ne 200 ]; then
        echo "FAIL: $label returned HTTP $http_code"
        cat "$output_file"
        return 1
    fi
    echo "$label: HTTP 200 OK"
}

validate_stats_present() {
    local label="$1"
    local response_file="$2"

    python3 -c "
import json, sys

with open('$response_file') as f:
    data = json.load(f)

kv_params = data.get('kv_transfer_params')
if kv_params is None:
    print('FAIL: $label — kv_transfer_params is missing from response')
    sys.exit(1)

stats = kv_params.get('cached_token_stats')
if stats is None:
    print('FAIL: $label — cached_token_stats is missing from kv_transfer_params')
    print(f' kv_transfer_params = {kv_params}')
    sys.exit(1)

required_keys = [
    'num_vllm_cached_tokens',
    'num_lmcache_cached_tokens',
    'num_lmcache_extra_cached_tokens',
]
missing = [k for k in required_keys if k not in stats]
if missing:
    print(f'FAIL: $label — missing keys in cached_token_stats: {missing}')
    print(f' cached_token_stats = {stats}')
    sys.exit(1)

for k in required_keys:
    v = stats[k]
    if not isinstance(v, int) or v < 0:
        print(f'FAIL: $label — {k} should be a non-negative integer, got {v!r}')
        sys.exit(1)

print(f'PASS: $label — cached_token_stats present with all required keys')
print(f' num_vllm_cached_tokens: {stats[\"num_vllm_cached_tokens\"]}')
print(f' num_lmcache_cached_tokens: {stats[\"num_lmcache_cached_tokens\"]}')
print(f' num_lmcache_extra_cached_tokens: {stats[\"num_lmcache_extra_cached_tokens\"]}')
"
}

validate_warm_hit() {
    local cold_file="$1"
    local warm_file="$2"

    python3 -c "
import json, sys

with open('$cold_file') as f:
    cold = json.load(f)
with open('$warm_file') as f:
    warm = json.load(f)

cold_stats = cold['kv_transfer_params']['cached_token_stats']
warm_stats = warm['kv_transfer_params']['cached_token_stats']

cold_lmcache = cold_stats['num_lmcache_cached_tokens']
warm_lmcache = warm_stats['num_lmcache_cached_tokens']

print(f'Cold request — num_lmcache_cached_tokens: {cold_lmcache}')
print(f'Warm request — num_lmcache_cached_tokens: {warm_lmcache}')

if warm_lmcache <= cold_lmcache:
    print(f'FAIL: warm request should have more LMCache hits than cold request')
    print(f' cold={cold_lmcache}, warm={warm_lmcache}')
    sys.exit(1)

if warm_lmcache == 0:
    print('FAIL: warm request has 0 LMCache cached tokens (cache not populated?)')
    sys.exit(1)

print(f'PASS: warm request has more LMCache hits ({warm_lmcache} > {cold_lmcache})')
"
}

# ── Step 1: Cold request (populates LMCache) ──────────────────
echo "============================================"
echo "=== Step 1: Cold request ==="
echo "============================================"
if ! send_request "Cold" "$STATS_DIR/cold_response.json"; then
    exit 1
fi
if ! validate_stats_present "Cold" "$STATS_DIR/cold_response.json"; then
    exit 1
fi
echo ""

# Small delay to let the store operation complete in LMCache
sleep 2

# ── Step 2: Warm request (same prompt, should hit cache) ──────
echo "============================================"
echo "=== Step 2: Warm request ==="
echo "============================================"
if ! send_request "Warm" "$STATS_DIR/warm_response.json"; then
    exit 1
fi
if ! validate_stats_present "Warm" "$STATS_DIR/warm_response.json"; then
    exit 1
fi
echo ""

# ── Step 3: Validate cache hit improvement ────────────────────
echo "============================================"
echo "=== Step 3: Validate cache hit ==="
echo "============================================"
if ! validate_warm_hit "$STATS_DIR/cold_response.json" "$STATS_DIR/warm_response.json"; then
    exit 1
fi
echo ""

# ── Step 4: Verify opt-in behavior ────────────────────────────
# Request WITHOUT kv_transfer_params should NOT have stats in response.
echo "============================================"
echo "=== Step 4: Verify opt-in (no stats without opt-in) ==="
echo "============================================"

echo "--- Sending request without kv_transfer_params ---"
http_code=$(curl -s -o "$STATS_DIR/no_opt_in_response.json" -w "%{http_code}" \
    -X POST "http://localhost:${VLLM_PORT}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{
        \"model\": \"${MODEL}\",
        \"messages\": [{\"role\": \"user\", \"content\": \"Hello, how are you?\"}],
        \"max_tokens\": 1
    }")

if [ "$http_code" -ne 200 ]; then
    echo "FAIL: no-opt-in request returned HTTP $http_code"
    exit 1
fi

python3 -c "
import json, sys

with open('$STATS_DIR/no_opt_in_response.json') as f:
    data = json.load(f)

kv_params = data.get('kv_transfer_params')
if kv_params is not None:
    print(f'FAIL: kv_transfer_params should be absent without opt-in, got {kv_params}')
    sys.exit(1)

print('PASS: kv_transfer_params correctly absent when not opted in')
"
echo ""

# ── Summary ───────────────────────────────────────────────────
echo "============================================"
echo "=== Cache Stats Reporting Test PASSED ==="
echo "============================================"
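The two inline `python3 -c` checks in this script can also be read as a pair of pure functions, which makes them easier to unit-test outside CI. The following is a minimal sketch, not part of the PR: the function names `validate_stats` and `warm_hit_improved` are hypothetical, and the explicit `bool` rejection is a deliberate deviation from the script (in Python `bool` is a subclass of `int`, so `isinstance(True, int)` is true and the script's check would accept `true` as a token count).

```python
REQUIRED_KEYS = (
    "num_vllm_cached_tokens",
    "num_lmcache_cached_tokens",
    "num_lmcache_extra_cached_tokens",
)


def validate_stats(response: dict) -> dict:
    """Return cached_token_stats, or raise ValueError naming the problem."""
    kv_params = response.get("kv_transfer_params")
    if kv_params is None:
        raise ValueError("kv_transfer_params is missing from response")
    stats = kv_params.get("cached_token_stats")
    if stats is None:
        raise ValueError("cached_token_stats is missing from kv_transfer_params")
    missing = [k for k in REQUIRED_KEYS if k not in stats]
    if missing:
        raise ValueError(f"missing keys in cached_token_stats: {missing}")
    for key in REQUIRED_KEYS:
        value = stats[key]
        # Deviation from the script: reject bool explicitly, since
        # isinstance(True, int) is True in Python.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{key} should be a non-negative integer, got {value!r}")
    return stats


def warm_hit_improved(cold: dict, warm: dict) -> bool:
    """True when the warm response reports strictly more LMCache hits."""
    cold_hits = validate_stats(cold)["num_lmcache_cached_tokens"]
    warm_hits = validate_stats(warm)["num_lmcache_cached_tokens"]
    return warm_hits > cold_hits


def make_response(lmcache_tokens: int) -> dict:
    """Build an illustrative response body (token counts are made up)."""
    return {
        "kv_transfer_params": {
            "cached_token_stats": {
                "num_vllm_cached_tokens": 0,
                "num_lmcache_cached_tokens": lmcache_tokens,
                "num_lmcache_extra_cached_tokens": lmcache_tokens,
            }
        }
    }


cold_response = make_response(0)
warm_response = make_response(512)
print(warm_hit_improved(cold_response, warm_response))
```

Because the counts are non-negative, `warm > cold` already implies `warm > 0`, so a strict comparison is enough to cover the script's separate zero-hit check.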
@@ -43,9 +43,12 @@ L2_MAX_SIZE_GB="${L2_MAX_SIZE_GB:-80}"
 L2_BANDWIDTH_GB="${L2_BANDWIDTH_GB:-4}"

 # L2 performance thresholds
-MIN_L2_SPEEDUP="${MIN_L2_SPEEDUP:-1.0}"
-MIN_L2_TTFT_SPEEDUP="${MIN_L2_TTFT_SPEEDUP:-1.0}"
-MAX_WARMUP_OVERHEAD="${MAX_WARMUP_OVERHEAD:-2.0}"
+# Recent CI runs show ~1.51-1.67x query speedup, ~1.77-2.02x TTFT speedup,
+# and ~0.87-0.99x warmup overhead. Tighten from the previous pass-anything
+# thresholds (1.0x/1.0x/2.0x) while leaving headroom for variance.
+MIN_L2_SPEEDUP="${MIN_L2_SPEEDUP:-1.3}"
+MIN_L2_TTFT_SPEEDUP="${MIN_L2_TTFT_SPEEDUP:-1.5}"
+MAX_WARMUP_OVERHEAD="${MAX_WARMUP_OVERHEAD:-1.2}"

 L2_RESULTS_DIR="$RESULTS_DIR/long_doc_qa_l2"
 PID_FILE="/tmp/lmcache_mp_pids_${BUILD_ID}"
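To see how much headroom the tightened defaults leave, the ranges quoted in the comment can be compared against the new thresholds. This is a rough sketch only: the comparison directions (speedups must be at least the minimum, overhead at most the maximum) are assumed from the variable names, since the script's own check is outside this hunk, and `passes` and `margins` are hypothetical helpers.

```python
# Ranges quoted in the script comment for recent CI runs: (worst, best).
OBSERVED = {
    "query_speedup": (1.51, 1.67),
    "ttft_speedup": (1.77, 2.02),
    "warmup_overhead": (0.99, 0.87),  # lower overhead is better
}

# New defaults from the diff.
MIN_L2_SPEEDUP = 1.3
MIN_L2_TTFT_SPEEDUP = 1.5
MAX_WARMUP_OVERHEAD = 1.2


def passes(query_speedup: float, ttft_speedup: float, warmup_overhead: float) -> bool:
    """Assumed pass rule: speedups >= minimum, overhead <= maximum."""
    return (
        query_speedup >= MIN_L2_SPEEDUP
        and ttft_speedup >= MIN_L2_TTFT_SPEEDUP
        and warmup_overhead <= MAX_WARMUP_OVERHEAD
    )


def margins() -> dict:
    """Distance between the worst quoted value and each threshold."""
    return {
        "query_speedup": round(OBSERVED["query_speedup"][0] - MIN_L2_SPEEDUP, 2),
        "ttft_speedup": round(OBSERVED["ttft_speedup"][0] - MIN_L2_TTFT_SPEEDUP, 2),
        "warmup_overhead": round(MAX_WARMUP_OVERHEAD - OBSERVED["warmup_overhead"][0], 2),
    }


print(margins())
print(passes(1.0, 1.0, 2.0))  # a run that only met the old defaults
```

Under these assumptions the worst quoted run still clears every threshold, while a run that merely matched the old 1.0x/1.0x/2.0x defaults would now fail.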
24 changes: 15 additions & 9 deletions .buildkite/k3_tests/multiprocess/scripts/run-long-doc-qa.sh
@@ -27,9 +27,11 @@ SHUFFLE_SEED="${SHUFFLE_SEED:-0}"
 MAX_INFLIGHT_REQUESTS="${MAX_INFLIGHT_REQUESTS:-5}"

 # Relative performance thresholds (compared against baseline run in same job)
-# Allow at most 10% slower than baseline for both metrics
-MAX_TTFT_SLOWDOWN_PCT="${MAX_TTFT_SLOWDOWN_PCT:-10}"
-MAX_ROUND_TIME_SLOWDOWN_PCT="${MAX_ROUND_TIME_SLOWDOWN_PCT:-10}"
+# Negative values mean LMCache must be *faster* than baseline by at least that %.
+# Recent CI runs show ~77-84% TTFT improvement and ~27-40% round-time improvement,
+# so requiring 60% and 15% respectively leaves comfortable headroom.
+MAX_TTFT_SLOWDOWN_PCT="${MAX_TTFT_SLOWDOWN_PCT:--60}"
+MAX_ROUND_TIME_SLOWDOWN_PCT="${MAX_ROUND_TIME_SLOWDOWN_PCT:--15}"

 # Output directory
 LONG_DOC_QA_DIR="$RESULTS_DIR/long_doc_qa"
@@ -43,9 +45,9 @@ echo "Number of documents: $NUM_DOCUMENTS"
 echo "Output length: $OUTPUT_LEN"
 echo "Results dir: $LONG_DOC_QA_DIR"
 echo ""
-echo "Performance thresholds (relative to baseline):"
-echo " Max TTFT slowdown: ${MAX_TTFT_SLOWDOWN_PCT}%"
-echo " Max query round time slowdown: ${MAX_ROUND_TIME_SLOWDOWN_PCT}%"
+echo "Performance thresholds (relative to baseline, negative = must be faster):"
+echo " Max TTFT slowdown: ${MAX_TTFT_SLOWDOWN_PCT}% (LMCache must be >= $(echo "$MAX_TTFT_SLOWDOWN_PCT" | tr -d '-')% faster)"
+echo " Max round time slowdown: ${MAX_ROUND_TIME_SLOWDOWN_PCT}% (LMCache must be >= $(echo "$MAX_ROUND_TIME_SLOWDOWN_PCT" | tr -d '-')% faster)"
 echo ""

 mkdir -p "$LONG_DOC_QA_DIR"
@@ -196,12 +198,16 @@ def check_metric(name, lmcache_val, baseline_val, max_slowdown_pct):
         print(f"{name}: unable to compare (lmcache={lmcache_val}, baseline={baseline_val}) -- FAIL")
         return False
     pct = ((lmc - base) / base) * 100
+    label = f"{abs(pct):.1f}% faster" if pct < 0 else f"{pct:.1f}% slower"
+    if max_slowdown_pct < 0:
+        threshold_label = f"need >= {abs(max_slowdown_pct):.0f}% faster"
+    else:
+        threshold_label = f"max {max_slowdown_pct}% slower"
     if pct <= max_slowdown_pct:
-        label = f"{abs(pct):.1f}% faster" if pct < 0 else f"{pct:.1f}% slower"
-        print(f"{name}: {lmc:.4f}s vs baseline {base:.4f}s ({label}, max {max_slowdown_pct}% slower) -- PASS")
+        print(f"{name}: {lmc:.4f}s vs baseline {base:.4f}s ({label}, {threshold_label}) -- PASS")
         return True
     else:
-        print(f"{name}: {lmc:.4f}s vs baseline {base:.4f}s ({pct:.1f}% slower, max {max_slowdown_pct}% slower) -- FAIL")
+        print(f"{name}: {lmc:.4f}s vs baseline {base:.4f}s ({label}, {threshold_label}) -- FAIL")
         return False

 failed = False
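A worked example may help with the sign convention, because a negative `max_slowdown_pct` turns "at most N% slower" into "at least N% faster". The sketch below reuses the `pct` formula and the two labels from the hunk above; the function name `evaluate` and the timing values are made up for illustration, and the real `check_metric` also handles unparseable inputs, which is omitted here.

```python
def evaluate(lmc: float, base: float, max_slowdown_pct: float):
    """Return (passed, label, threshold_label) using the hunk's formula."""
    # Negative pct means LMCache was faster than the baseline.
    pct = ((lmc - base) / base) * 100
    label = f"{abs(pct):.1f}% faster" if pct < 0 else f"{pct:.1f}% slower"
    if max_slowdown_pct < 0:
        threshold_label = f"need >= {abs(max_slowdown_pct):.0f}% faster"
    else:
        threshold_label = f"max {max_slowdown_pct}% slower"
    return pct <= max_slowdown_pct, label, threshold_label


# TTFT: 1.0s with LMCache vs 4.0s baseline gives pct = -75.0, which is <= -60.
print(evaluate(1.0, 4.0, -60))

# Round time: 7.0s vs 8.0s gives pct = -12.5, which is NOT <= -15.
print(evaluate(7.0, 8.0, -15))
```

The second case shows the practical effect of the new defaults: a run that is 12.5% faster than baseline would have passed the old `10` threshold but fails the new `-15` one.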
5 changes: 4 additions & 1 deletion .buildkite/k3_tests/multiprocess/scripts/run-single-test.sh
@@ -94,9 +94,12 @@ case "$TEST_NAME" in
     restart_recovery)
         exec_script="${SCRIPT_DIR}/run-restart-recovery.sh"
         ;;
+    cache_stats)
+        exec_script="${SCRIPT_DIR}/run-cache-stats.sh"
+        ;;
     *)
         echo "Unknown test: $TEST_NAME"
-        echo "Valid tests: lm_eval, vllm_bench, long_doc_qa, long_doc_qa_l2, fault_tolerance, deadlock, restart_recovery"
+        echo "Valid tests: lm_eval, vllm_bench, long_doc_qa, long_doc_qa_l2, fault_tolerance, deadlock, restart_recovery, cache_stats"
         exit 1
         ;;
 esac
1 change: 1 addition & 0 deletions .buildkite/pipeline.yml
@@ -25,6 +25,7 @@ steps:
         export CXX=hipcc
         export BUILD_WITH_HIP=1
         uv pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm7.0
+        uv pip install -r requirements/rocm_core.txt
       fi

       uv pip install -r requirements/common.txt
24 changes: 24 additions & 0 deletions .github/workflows/publish.yml
@@ -338,6 +338,12 @@ jobs:
             --tag lmcache/vllm-openai:latest --tag lmcache/vllm-openai:${{ env.LATEST_TAG }} \
             --file docker/Dockerfile .

+      # `lmcache --help` exercises eager CLI subcommand discovery,
+      # so a missing runtime dep fails the build before the image ships.
+      - name: Smoke test lmcache/vllm-openai (cu13)
+        run: |
+          docker run --rm --entrypoint lmcache lmcache/vllm-openai:${{ env.LATEST_TAG }} --help
+
       - name: Push lmcache/vllm-openai container image to DockerHub
         run: |
           docker push lmcache/vllm-openai:latest
@@ -354,6 +360,10 @@ jobs:
             --tag lmcache/vllm-openai:lightweight --tag lmcache/vllm-openai:${{ env.LATEST_TAG }}-lightweight \
             --file docker/Dockerfile.lightweight .

+      - name: Smoke test lmcache/vllm-openai:lightweight
+        run: |
+          docker run --rm --entrypoint lmcache lmcache/vllm-openai:${{ env.LATEST_TAG }}-lightweight --help
+
       - name: Push lmcache/vllm-openai:lightweight image to DockerHub
         run: |
           docker push lmcache/vllm-openai:lightweight
@@ -375,6 +385,10 @@ jobs:
             --tag lmcache/standalone:latest-cu130 --tag lmcache/standalone:${{ env.LATEST_TAG }}-cu130 \
             --file docker/Dockerfile.standalone .

+      - name: Smoke test lmcache/standalone (cu13)
+        run: |
+          docker run --rm --entrypoint lmcache lmcache/standalone:${{ env.LATEST_TAG }} --help
+
       - name: Push lmcache/standalone container image to DockerHub
         run: |
           docker push lmcache/standalone:latest
@@ -398,6 +412,11 @@ jobs:
             --tag lmcache/vllm-openai:latest-cu129 --tag lmcache/vllm-openai:${{ env.LATEST_TAG }}-cu129 \
             --file docker/Dockerfile .

+      - name: Smoke test lmcache/vllm-openai (cu12.9)
+        if: needs.publish-cu129-github-release.result == 'success'
+        run: |
+          docker run --rm --entrypoint lmcache lmcache/vllm-openai:${{ env.LATEST_TAG }}-cu129 --help
+
       - name: Push lmcache/vllm-openai cu129 container image to DockerHub
         if: needs.publish-cu129-github-release.result == 'success'
         run: |
@@ -419,6 +438,11 @@ jobs:
             --tag lmcache/standalone:latest-cu129 --tag lmcache/standalone:${{ env.LATEST_TAG }}-cu129 \
             --file docker/Dockerfile.standalone .

+      - name: Smoke test lmcache/standalone (cu12.9)
+        if: needs.publish-cu129-github-release.result == 'success'
+        run: |
+          docker run --rm --entrypoint lmcache lmcache/standalone:${{ env.LATEST_TAG }}-cu129 --help
+
       - name: Push lmcache/standalone cu129 container image to DockerHub
         if: needs.publish-cu129-github-release.result == 'success'
         run: |
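The smoke-test steps rely on ordering: a workflow step that exits non-zero stops the job, so the push step after it never runs. The same gating can be sketched in Python for local experimentation. This is an illustration under assumptions, not the workflow's implementation: `smoke_then_push` and the injectable `run` callable are hypothetical, and only the `docker run --rm --entrypoint lmcache <image> --help` and `docker push <image>` invocations are taken from the diff.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], None]


def default_runner(cmd: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(list(cmd), check=True)


def smoke_then_push(image: str, run: Runner = default_runner) -> List[List[str]]:
    """Smoke-test an image, then push it; stop at the first failing step."""
    steps = [
        # Same invocation as the workflow's smoke-test step.
        ["docker", "run", "--rm", "--entrypoint", "lmcache", image, "--help"],
        ["docker", "push", image],
    ]
    executed: List[List[str]] = []
    for cmd in steps:
        run(cmd)  # an exception here means later steps never start
        executed.append(cmd)
    return executed


# Dry run with a recording runner, so no Docker daemon is needed.
recorded: List[List[str]] = []
smoke_then_push("lmcache/standalone:example-tag", run=lambda c: recorded.append(list(c)))
print(recorded)
```

Passing a recording runner, as in the last lines, lets the ordering be checked without building or pulling any image; the tag `example-tag` is a placeholder.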