Closed
1601 commits
0867497
[CI/Build] Bump flashinfer to v0.6.11.post2 (#41711)
arpera May 16, 2026
a941892
Fix Weight loading for Qwen3.5-MTP and Qwen3-VL using runai_streamer…
weizhoublue May 17, 2026
504a26c
Support bf16 for mamba ssm cache (#41680)
qizzzh May 17, 2026
ff712f6
[MRV2][XPU] add Model Runner V2 log (#42710)
zhenwei-intel May 17, 2026
0fa8884
[XPU] fix weight scale shape (#42725)
zufangzhu May 17, 2026
1c8e9c0
Refactor: Pass num_labels explicitly to PoolerClassify instead of rea…
taneem-ibrahim May 17, 2026
599e75f
[ROCm] [Bugfix] Fix DeepSeek V4 Functionality and Accuracy (#42810)
tjtanaa May 17, 2026
966903e
[torch.compile] Add patch for fullgraph compilation (#42686)
ProExpertProg May 17, 2026
03ddc1c
[Perf] Wire silu_and_mul_per_block_quant into TritonFP8MoE (MiniMax-M…
qianlihuang May 18, 2026
1072104
[CI] Add NIXL EP import canary (#42567)
alec-flowers May 18, 2026
990f49b
[MM][CG] Enable encoder Cudagraph for Step3VL (#42224)
JisoLya May 18, 2026
b50646e
[ROCm][CI] Stabilize ROCm pooling and multimodal CI (#42909)
AndreasKaratzas May 18, 2026
23c15ac
[BugFix] Kimi-K2.5: skip vision tower dtype conversion when using qua…
gaozihao-shy May 18, 2026
c1f7854
Improve logging when docs build is skipped (#42929)
hmellor May 18, 2026
e3aeee5
[Bugfix] moe lora align kernel grid (#40131)
TheDuyIT May 18, 2026
7d5b033
[LoRA] Support 2D and 3D MoE LoRA adapter at the same time (#42242)
jeejeelee May 18, 2026
5ab6d1b
[Model] [Perf] Use flatten for Qwen3.5's GDN output projection (#42311)
rishaps May 18, 2026
9537542
Revert checkpoint specific workaround in Transformers modelling backe…
hmellor May 18, 2026
998714b
[Perf] Add do_not_specialize in fused FP8 RoPE kernel (#42849)
xyang16 May 18, 2026
c38bed4
delete xpu ci (#42582)
wendyliu235 May 18, 2026
965d076
[CPU] Specify required KV cache layout for CPU attention backend (#42…
hlin99 May 18, 2026
2267f70
[Kernel] Pack topk id/weights triton kernel (#42527)
jeejeelee May 18, 2026
b4601ad
[CPU] Add fused GDN support for AMX CPU platform (#42707)
bigPYJ1151 May 18, 2026
cac81b6
[CPU Backend] Improve cpu thread utilization (#42666)
tianmu-li May 18, 2026
88a860d
[CPU] Add MXFP4 W4A16 MoE support (#41922)
yuwenzho May 18, 2026
df852ed
fix: remove unused norm for dpskv4 (#41710)
inisis May 18, 2026
e414e1f
[Bugfix][KV Offload] count appended GPU blocks in store group_sizes (…
kfirtoledo May 18, 2026
737bfa3
[Bugfix][Hybrid][NemotronH] Fix mamba_cache_mode=all + speculative de…
roikoren755 May 18, 2026
69c91d0
[MRv2] Default to MRv1 when a connector is present (#42955)
NickLucche May 18, 2026
2e40faf
[XPU][CI] Temporarily skip test_moe_lora_align_block_size_mixed_base_…
zxd1997066 May 18, 2026
e541765
[KV Connector][Offloading] Flush all pending jobs on last step (#42611)
liranschour May 18, 2026
1ac10f1
Revert "[torch.compile] Add patch for fullgraph compilation" (#42686)…
vllm-agent May 18, 2026
f5d3dc7
[Model Runner v2] Support update_config (#42783)
mgoin May 18, 2026
78e7a7b
Refactor AWQ Marlin MoE onto modular WNA16 oracle (#42483)
bedeks May 18, 2026
4a39b4f
[Model] Add Apertus Tool Parser (#41154)
blancsw May 18, 2026
47829b1
[Bugfix] mamba: run single-token extends as decodes (#42430)
netanel-haber May 18, 2026
e267369
[Model Runner V2] Fix prompt logprobs calculation `Sizes of tensors m…
yewentao256 May 18, 2026
b12745e
Fix `--convert` passed without `--runner` on causal models (#42935)
hmellor May 18, 2026
8c296de
[Perf] Re-enable flashinfer autotune by default and cleanup (#42857)
wzhao18 May 18, 2026
67f58ce
[Bugfix] Fix DSV4 MTP after ROCm mHC integration (#42930)
mmangkad May 18, 2026
6859ca7
[Bugfix] fix swiglu limit issue for humming backend + deepseek v4 (#4…
jinzhen-lin May 18, 2026
a2c8fc6
[ROCm][Quantization][3/N] Refactor quark_moe w4a4 w/ oracle (#41436)
BowenBao May 18, 2026
9758a6e
[BugFix] support PP for Cohere vision model (#42819)
czhu-cohere May 18, 2026
00e20e7
[Refactor] Remove dead cuda kernels (#42767)
yewentao256 May 18, 2026
ce88f01
[Docs] update attribution to reflect EDEN foundation (#41666)
amitport May 18, 2026
8fc1c28
[ROCm] Guard AITER GDN decode fast path by layout (#42880)
tuukkjs May 18, 2026
8474748
Tier offload followup (#42529)
ronensc May 18, 2026
cd49a05
[Refactor] Remove dead code (#42889)
yewentao256 May 18, 2026
0191354
[Perf][MLA] Enable FULL cudagraph capture for TRITON_MLA decode (#42885)
haosdent May 18, 2026
57fef4e
[Refactor] Extract shared coerce_to_schema_type utility from Minimax …
sfeng33 May 18, 2026
37ece59
[Perf] Padded nvfp4 quant kernel to remove additional copy, 2.4%~5.7%…
yewentao256 May 18, 2026
a171e6b
Add parallel drafting to v2 model runner unsupported features (#43010)
shanjiaz May 18, 2026
f85c76d
[CI/Build] Bump nvidia-cutlass-dsl to 4.5.1 (#42991)
arpera May 18, 2026
239b5ff
[Frontend] Add --spec-method/--spec-model/--spec-tokens CLI aliases (…
mgoin May 19, 2026
287471b
[Model Refactoring] Migrate DeepSeek V4 to vllm/models/ [1/N] (#43004)
WoosukKwon May 19, 2026
afd7b1d
[Bugfix] Use platform-agnostic device in example_connector load (#42926)
revit13 May 19, 2026
8f16c4a
[BugFix][CPU][Spec Decode] Fix Eagle implementation on CPU backend (#…
ofirzaf May 19, 2026
36dcaf2
[XPU] add gptq(int4) support (#37844)
jikunshang May 19, 2026
da03e54
[UX] Add a persistent cache for FlashInfer autotuning (#42537)
mmangkad May 19, 2026
fba010d
[Bugfix][MRV2] Fix KVCache tensor explicit `kernel_block_size` dim (#…
NickLucche May 19, 2026
87b08c5
[Model Refactoring] Move DeepSeek V4 layers to `models/deepseek_v4/` …
WoosukKwon May 19, 2026
3ca8db2
add cutedsl dsv4 indexer fp8 kernel (#42899)
gnovack May 19, 2026
fab07e4
[Bugfix][KV Connector] Fix SimpleCPUOffloadScheduler TOCTOU between P…
qyYue1389 May 19, 2026
6e889b5
[ci] Route 28 gpu_1_queue tests to h200_35gb queue (#43030)
khluu May 19, 2026
27f4ba9
fix: use keyword arguments for shard_id and expert_id in weight_loade…
junyanxu May 19, 2026
9fd8487
[Docs] Add SVG images for pooling models. (#42626)
gracie-guo May 19, 2026
f1e3f0e
[XPU] Use custom op collective behavior (#41354)
chaojun-zhang May 19, 2026
4a4fdab
[Misc] Aligning tokwise pooler heads for consistency (#43041)
taneem-ibrahim May 19, 2026
257af77
[Docs] Reorganize online serving docs. (#41907)
noooop May 19, 2026
301d986
[Frontend] Consolidate beam search by BeamSearchMixin. (#42946)
noooop May 19, 2026
b14be81
[Model Refactoring] Move deepseek_v4_ops to models/deepseek_v4 [3/N] …
WoosukKwon May 19, 2026
f34623b
[bug] AsyncScheduler drops first post-resume token after pause_genera…
hao-aaron May 19, 2026
056bc2e
[KVConnector][DSV4] HMA support for Mooncake store connector (#42828)
ivanium May 19, 2026
07beaed
[Model Refactoring] Rename deepseek_v4.py to model.py [4/N] (#43077)
WoosukKwon May 19, 2026
ef54a4d
[Misc][MM] Remove redundant code in CLIPAttention (#43046)
shen-shanshan May 19, 2026
129019f
[CI] Add MTP + PD disagg test for Qwen3.5 (#42677)
ZhanqiuHu May 19, 2026
a78b842
[Bugfix] Fix top logprobs token placeholders in `/inference/v1/genera…
sagearc May 19, 2026
b82e908
[Perf][4/n] Eliminate various GPU<->CPU syncs (#42347)
njhill May 19, 2026
d740e2c
[XPU] update xpu graph usage (#43043)
xinyu-intel May 19, 2026
1c61580
[Model] Openvla support (#42654)
yiwen101 May 19, 2026
42b4f1f
[Refactor] Extract extract_types_from_schema utility from Minimax M2 …
sfeng33 May 19, 2026
8200fbe
[Misc] add humming to dependencies (#42540)
jinzhen-lin May 19, 2026
d247a93
[feat] Add FP8 per-tensor Q scale support to Triton attention backend…
DomBrown May 19, 2026
aed2eb3
[Docs] Fix MooncakeStoreConnector role in disaggregated example (#42994)
Dao007forever May 19, 2026
f54721b
[Bugfix][MoE] FlashInfer one-sided: workspace union across heterogene…
tomeras91 May 19, 2026
9aaf83e
[CI failure] Temporarily disable using persistent cache for flashinfe…
wzhao18 May 19, 2026
a65093c
[ci] Move language models tests (hybrid) back to L4 (#43129)
khluu May 19, 2026
1242196
[Model] Support post-norm architecture for EAGLE-3 supeculators (#42764)
Dogacel May 19, 2026
117afee
Fix error in Dynamic NTK scaling (#41277)
maxdebayser May 19, 2026
be16785
[CPU][DOC] Fix installation commands for Arm CPUs (#43115)
fadara01 May 19, 2026
73dd2f3
[bug] fix WeightTransferConfig.backend to allow for all strings (#43121)
hao-aaron May 20, 2026
39bba71
[MRV2][BugFix] Fix default-stream CG capture in P/W LoRA case (#43160)
njhill May 20, 2026
5774aae
[Cohere] Enable Cohere MoE (#43143)
Terrencezzj May 20, 2026
c628a93
[Perf][Bugfix] Update dflash aux layer indexing (#40727)
benchislett May 20, 2026
fadf5d3
add enqueue all option to throughput benchmark (#42975)
pmaybank May 20, 2026
2ae910e
[Perf] Avoid forward scan for async output placeholders (#42938)
izikgo May 20, 2026
cd0ff26
[CI] Add DSV4-Flash to gsm8k moe-refactor/config-b200.txt (#42111)
mgoin May 20, 2026
4f94089
[KV Offload] Pass `OffloadingSpec` instead of `VllmConfig` to seconda…
ronensc May 20, 2026
8595956
[ci] Revert model executor test back to L4 (#43188)
khluu May 20, 2026
7e4bc2c
[Docs][PD][NIXL] Lease extension mechanism for blocks on P (#43099)
NickLucche May 20, 2026
40651c0
[Docs][PD][NIXL] Bidirectional kv-cache transfer (#43097)
NickLucche May 20, 2026
07aeaf9
[6/n] Migrate activation kernels, gptq, gguf, non cutlass w8a8 to lib…
cleonard530 May 20, 2026
9b343dd
Enable mermaid diagrams in the docs (#43192)
hmellor May 20, 2026
1cb2244
[GDN] Enable FI Blackwell GDN prefill kernel (#40717)
arpera May 20, 2026
6f21558
[XPU][CI] Add 2 server model test files in Intel GPU CI (#42499)
zxd1997066 May 20, 2026
cb600d1
[Frontend] Forward X-data-parallel-rank header on /inference/v1/gener…
hallerite May 20, 2026
87e3145
[Doc] Sync CLI guide with actual help modes and launch subcommand (#4…
wangrui6 May 20, 2026
19cf334
[Feature] Support manually enabling the cumem allocator (#33648)
kebe7jun May 20, 2026
0a50874
[Spec Decode] Support non-MTP speculation for NemotronH (#43130)
benchislett May 20, 2026
df84fb0
Remove additional dead code as a follow-up to #42889 (#43144)
dsikka May 20, 2026
ded8712
[Bug][Structured Outputs] Fix bug that leads to unconstrained generat…
rishitdholakia13 May 20, 2026
644b2a2
[Bugfix] Use enable_sm120_family for per-tensor FP8 CUTLASS kernels o…
j9smith May 20, 2026
a10d691
[Bugfix] Use shared coerce_to_schema_type in DeepSeekV32 tool parser …
sfeng33 May 20, 2026
9c78c99
[MISC] Fix symm_mem cap-equal gate; log AR backend selection (#42993)
vadiklyutiy May 20, 2026
2d6b348
[R3] Add routed experts to openai entrypoint (#38939)
hao-aaron May 20, 2026
f2d5e3d
[CI] Lower granite-4.0-h-tiny gsm8k threshold for Hybrid SSM NixlConn…
haosdent May 20, 2026
363fc84
Integrate flashinfer b12x MoE and FP4 GEMM kernels for SM120/121 (#40…
meena-at-work May 20, 2026
53ff50f
[Perf] Optimize `CutlassFP8ScaledMMLinearKernel` when padding needed …
yewentao256 May 20, 2026
2a43b40
[Bugfix][CI] Add missing import of pad_nvfp4_activation_for_cutlass i…
sfeng33 May 20, 2026
452baa8
Add dllehr-amd to CODEOWNERS and committers list (#42772)
dllehr-amd May 20, 2026
5774aad
[Perf][gpt-oss] Downgrade triton_kernels to v3.5.1 (#43135)
mgoin May 20, 2026
6dc0a71
[Misc] downgrade nvidia-cutlass-dsl to 4.5.0 (#43230)
ZJY0516 May 20, 2026
bde560e
[ROCm] Add QuickReduce min-size override and codec threshold (#41675)
akii96 May 20, 2026
63ea117
[CI] Add composed-schema regression tests for DeepSeek V3.2/V4 parser…
alexeldeib May 21, 2026
9640970
[Model Runner V2] Fix lora `Triton Error [CUDA]: device-side assert t…
yewentao256 May 21, 2026
5d041cc
update GPU json file based on h200 recipes (#43262)
louie-tsai May 21, 2026
ee05e81
[Minor] Bigger overlap for FI AR (#43103)
jeejeelee May 21, 2026
e45df8c
[Bugfix] Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2…
sonusflow May 21, 2026
2b75a73
[Perf][Gemma4] Batch vision encoder calls for image and video process…
lucianommartins May 21, 2026
7e50709
[CI] Fix "test_vit_cudagraph_[image|video][step3_vl]" failure (#43082)
haosdent May 21, 2026
346cf16
[Frontend] Normalize reasoning_content to reasoning for client compat…
bbrowning May 21, 2026
6441cf4
[Refactor] Use shared coerce_to_schema_type in Seed-OSS tool parser (…
sfeng33 May 21, 2026
d97ba29
[ToolParser][Bugfix] Re-land: Fix anyOf/oneOf/$ref type resolution in…
AAISSJ May 21, 2026
f2ace1d
[Frontend][RFC] Rust front-end integration (#40848)
njhill May 21, 2026
a6682d1
[Bugfix] Warn when renderer_num_workers has no effect on offline LLM …
DaoyuanLi2816 May 21, 2026
905b97a
[Benchmark] Add num-warmup to vllm bench throughput (#43245)
yzong-rh May 21, 2026
050611a
[Bugfix] Fix glm4_moe_tool_parser._is_string_type for /v1/responses F…
ianliuy May 21, 2026
a950e94
[CI] De-flake test_models for bigscience/bloom-560m (#43197)
haosdent May 21, 2026
0a54df2
[XPU] add setuptools-rust for xpu dependency (#43287)
jikunshang May 21, 2026
b719b16
Update KDA chunk prefill decay to use exp2 semantics (#43195)
zexplorerhj May 21, 2026
edafea3
Fix FlashInfer TRTLLM NvFP4 monolithic MoE routing (#43223)
zhangxin81 May 21, 2026
ebbfb34
[Test] Replace zephyr-7b-beta (7B) with SmolLM2-135M in tokenization …
khluu May 21, 2026
68e07d5
[Bug] Fix ci issue `assert output_size is not None` AssertionError (#…
yewentao256 May 21, 2026
caf6982
[CI] Pin protoc binary in rust-build stages (#43292)
haosdent May 21, 2026
5ecd8e9
[XPU][CI]Fix Docker image pull-to-run race in Intel GPU CI (#43266)
zxd1997066 May 21, 2026
c68c55d
[CPU][RISC-V] Add VLEN=256 support to RVV attention kernels (#42943)
velonica0 May 21, 2026
b730c46
[Perf] [Hybrid] Fused Triton kernel for GPU-side Mamba state postproc…
fuscof-ibm May 21, 2026
9b9d5db
[CI] Fix CPU tests failing on `tl.exp2` import (#43311)
haosdent May 21, 2026
1c78f76
[Bugfix] Add early validation to reject incompatible runner types for…
anishesg May 21, 2026
9b54e50
[Deprecation] Mark env vars covered by --moe-backend / --linear-backe…
mgoin May 21, 2026
b29cbf0
[Perf] `zeros` -> `empty` to remove additional fill (#42988)
yewentao256 May 21, 2026
17b6982
[Core] Add native ModelExpress load format (#43105)
zhengluo-nv May 21, 2026
0b59fc4
Disable build isolation to bypass CUDA related deps for vllm-tpu (#43…
ylangtsou May 21, 2026
0f66623
[Frontend] Rework fastokens integration (#43168)
njhill May 21, 2026
e26e1f0
[Feature] Add `--cpu-distributed-timeout-seconds` CLI Option for CPU …
fangyuchu May 21, 2026
565b745
[BugFix] Use correct logprobs for `logprob_token_ids` (#43125)
njhill May 21, 2026
39d5fa9
[Bugfix] Zero stale is_prefilling in padded CUDA graph rows for Mamba…
liulanze May 21, 2026
39910f2
[Rust Frontend] Move code from `vllm-frontend-rs` (#43283)
BugenZhao May 22, 2026
ba369b7
[CI] Fix dockerfile dependency graph failure for pre-commit (#43378)
Isotr0py May 22, 2026
2998a04
[Bugfix] Fix DSV4 Base model swiglu limit issue in FP8 path (#42855)
zx3xyy May 22, 2026
86ccef7
[ROCm] Add XGMI backend for MoRI Connector (#41753)
simondanielsson May 22, 2026
35d0141
[ROCm][CI] add warmup to mem_util test before measurement (#43236)
divakar-amd May 22, 2026
60af5c1
[Frontend] Add truncation side to OpenAI endpoints (#43260)
ruizhang99 May 22, 2026
0ddd7dd
[Frontend] DP Supervisor (#40841)
yewentao256 May 22, 2026
18a27cc
[Bugfix] Make CuMemAllocator free callback stream-aware (#43020)
zixi-qi May 22, 2026
8c8b182
[XPU] Enable multiple key kernels for sparse attention (#37888)
xwu-intel May 22, 2026
1fe3303
[CI] De-flake renderers/test_hf.py::test_resolve_content_format_fallb…
haosdent May 22, 2026
e746a2e
[Model] Use `AutoWeightsLoader` for Voyage (#42972)
yufufi May 22, 2026
fa1ff88
[Model] Fix MiniCPM-V 4.6 vit_merger qkv weight loading (#43213)
tc-mb May 22, 2026
5ea76fa
[CI] Fix test_lora_with_spec_decode on V2 model runner (#43314)
haosdent May 22, 2026
025d4f5
[CI] Fix "test_awq_load[gemma4-moe-*]" failure (#43296)
haosdent May 22, 2026
6bb8753
Correcting the mock classes for MM GC tests (#43321)
wdhongtw May 22, 2026
694d9a8
[BugFix] Fix setuptools-rust dep in requirements files (#43377)
njhill May 22, 2026
a761697
Fix the docker build failure in tpu-inference (#43360)
mrjunwan-lang May 22, 2026
2380bfc
[Docs] Note image preprocessing difference between qwen_vl_utils and …
noooop May 22, 2026
65b7a81
[CPU] Experimentally enable Triton and MRV2 (#43225)
bigPYJ1151 May 22, 2026
7e1b45a
[Attention] Mamba attention module refactor (#41126)
wangxiyuan May 22, 2026
d3d1cf6
[XPU]feat: add XPU fallback for MoE topk routing and MXFP4 backend (#…
majian4work May 22, 2026
b3c7ffc
[Misc] Replace assert with proper exceptions for security and validat…
taneem-ibrahim May 22, 2026
4658bf8
[Bugfix] Clear P0 mm sender cache on sleep/pause to fix mm_hash desyn…
wasnertobias May 22, 2026
79ff0ff
[BugFix] wire make_empty_intermediate_tensors on AyaVision and Voxtra…
JasonKeyiL May 22, 2026
15f7cd3
[LoRA] Reduce memory of 2D weights when EP is set (#42737)
jeejeelee May 22, 2026
d3a5635
[EPLB] Change default EPLB communicator (#43110)
ilmarkov May 22, 2026
a377631
[CI] Fix AMD docker build tests (#43329)
haosdent May 22, 2026
fb21d8b
Add NVFP4 MOE support for Deepseek V4. (#42209)
sychen52 May 22, 2026
f0feb15
[Multimodal] Simplify ViT CUDA graph interfaces (#41234)
Isotr0py May 22, 2026
91f5b92
[Rust Frontend] [Refactor] Extract a newtype for utility call ID (#43…
BugenZhao May 22, 2026
c7624be
[Bugfix] Source num_qo_heads from Attention layers in Flashinfer/Trit…
zhandaz May 22, 2026
b21f3d5
[KV Connector] MooncakeStore: don't co-queue save with load to avoid …
Dao007forever May 22, 2026
8437157
[Refactor] Extract DeepSeek V4 sparse MLA impl into model folder (#43…
zyongye May 22, 2026
2b94d1c
[Frontend] Simplify AuthenticationMiddleware path extraction (#43426)
russellb May 22, 2026
977703a
[RFC][EPLB][#32028] Remove dead torch.accelerator.synchronize() from …
SandishKumarHN May 22, 2026
23f7b11
[Bugfix] Detect wrong libcute_dsl_runtime.so variant in FlashInfer GD…
arpera May 22, 2026
4e597b7
[Bugfix] Clear error message for FP8 torchao quantization on unsuppor…
haosdent May 22, 2026
08cb467
mhc_post - remove sts & add vectorized copies (#43437)
gnovack May 22, 2026
e203006
[Quantization][ModelOpt] W4A16 NVFP4 fused MoE + mixed-precision disp…
juhi10071998 May 22, 2026
47d4407
[Model Runner V2] Support sharing kv cache layers (#35045)
njhill May 22, 2026
f743254
DSv4 fused Q-norm kernel grid refactor (#42353)
gnovack May 22, 2026
4e2eba2
[Perf] Optimize hidden state extraction logic (#37374)
benchislett May 22, 2026
8de5cab
[XPU]fix: add XPU platform guards to DeepSeek-V4 ops (#42950)
majian4work May 22, 2026
6d30655
elastic_ep: stage/commit MoE quant method on reconfigure (#40881)
itayalroy May 22, 2026
552bbe6
[Attention] Add head_dim=512 support for FlashInfer trtllm attention …
djmmoss May 23, 2026
3cb83c9
Add `model` to `WeightTransferEngine.__init__` (#42922)
SumanthRH May 23, 2026
367cb81
[DSV4] More multi-stream enablement for c4a (#42925)
zyongye May 23, 2026
6a4723a
[ROCm][CI] Stabilize runner teardown between sampler tests (#43023)
AndreasKaratzas May 23, 2026
76ea1d5
[ROCm][CI] Stabilize Granite tool-use and test URL construction (#43017)
AndreasKaratzas May 23, 2026
84e3515
[Bugfix] Auto-raise max_num_batched_tokens for prefix-LM multimodal m…
ashwing May 23, 2026
d28bdf9
[ROCm][CI] Fix ROCm LoRA Transformers fallback with full CUDA graphs …
AndreasKaratzas May 23, 2026
a5bbd81
[XPU]feat: enable FP8 block-scaled quantization on XPU (#42952)
majian4work May 23, 2026
54d1536
[XPU] reudce host overhead of XPU MOE (#42915)
mayuyuace May 23, 2026
a7be0f3
[7/n] Migrate pos_encoding and norm kernels to libtorch stable ABI (c…
cleonard530 May 23, 2026
3a1c062
[Misc] Added missing return type annotations to improve mypy and IDE …
taneem-ibrahim May 23, 2026
d19db10
[Bugfix] Fix native Triton top-k/top-p kernel assumes contiguous logi…
zhougit86 May 23, 2026
09a219c
[ModelOpt] Support Qwen3.5/3.6 VLM quantized prefix mapping (#42546)
meenchen May 23, 2026
82536ac
Keep scheduler alive for delayed KV connector frees (#43433)
lucifer1004 May 23, 2026
3f3e862
fix(eagle3): read norm_before_fc from eagle_config for NVIDIA checkpo…
May 23, 2026
5bb8d27
[Kernel] Batch invariant NVFP4 linear using cutlass (#39912)
jzakrzew May 23, 2026
2a7d5b7
[ROCm][CI] Remove benchmarks test group and shard long test groups (#…
AndreasKaratzas May 23, 2026
d8b385b
[Bugfix][Frontend] Fix input_audio parsing when uuid is present (#43…
ffggs May 23, 2026
a0be71e
[MM] Enable FlashInfer metadata support for Qwen2.5-VL vision attenti…
huanghua1994 May 23, 2026
7c2ff1f
[Docs] Fix stale version number in token_embed.md (#43488)
fuergaosi233 May 23, 2026
8737e4a
[Docs] Fix stale version number in token_classify.md (#43489)
fuergaosi233 May 23, 2026
4438b6e
[MoE] Migrate W4A8 CT to oracle kernel setup (#42680)
bedeks May 23, 2026
819c610
[Mooncake] Add metrics for MooncakeStoreConnector operations (#43392)
Dao007forever May 23, 2026
46f95b2
[ROCm][Critical] Fix the GDN import bug (#43486)
tjtanaa May 23, 2026
10d264a
Revert "[Misc] add humming to dependencies" (#43492)
mgoin May 23, 2026
b32fe41
[Bugfix] Fix reasoning dropped on streaming boundary deltas (#42691)
sfeng33 May 23, 2026
33d7cbe
[Model Runner v2] Force v1 runner for tests (#43233)
yewentao256 May 23, 2026
0902d8e
[KV Connector] Keep MooncakeStore full hits block-aligned (#43494)
Dao007forever May 24, 2026
357fddf
[kv_offload]: Add DSv4 support (#43142)
orozery May 24, 2026
5940590
[ROCm][CI] Stabilize 400 error return code for invalid schema inputs …
AndreasKaratzas May 24, 2026
1806d1a
[ROCm] [DSv4] [Perf] Support DeepSeek v4 MTP (#43385)
tjtanaa May 24, 2026
d56285c
Tuning script and configs for Triton Mamba SSU kernel (#43083)
danisereb May 24, 2026
d0a100c
File system secondary tier implemented in python (#41735)
rshavitt May 24, 2026
b06813e
[Kernel] Add mhc_pre_big_fuse_with_norm_tilelang (#43474)
jeejeelee May 25, 2026
6cbe448
fix: MoE model using shared routed experts crashes on AMD GPUs (#42373)
weizhoublue May 25, 2026
1b26fa3
[Docs] Reorganize offline inference docs. (#43552)
noooop May 25, 2026
3df1c7c
[Docker] Non-root support for vllm-openai; add opt-in vllm-openai-non…
TheDuyIT May 25, 2026
81252d4
[Feat][KVConnector] Support DSV4 in SimpleCPUOffloadBackend (#42296)
ivanium May 25, 2026
0c942c6
[Doc] Add section on escalating stalled contributions (#43568)
esmeetu May 25, 2026
5c1aec3
Reduce memory usage for granite_speech. (#42933)
Yihuki May 25, 2026
026c7f0
Init
jeejeelee Mar 28, 2026
8ed3ef1
Move
jeejeelee Mar 28, 2026
115946e
Move
jeejeelee Mar 28, 2026
03854b2
Move
jeejeelee Mar 28, 2026
7f0a790
Move
jeejeelee Mar 29, 2026
b420431
Move
jeejeelee Mar 29, 2026
0614b52
Fix comments
jeejeelee Mar 30, 2026
3c42f56
Enable PDL
jeejeelee Mar 31, 2026
5c78853
Port code from 43276 (#14)
qianlihuang May 25, 2026
5 changes: 3 additions & 2 deletions .buildkite/ci_config.yaml
@@ -8,8 +8,9 @@ run_all_patterns:
- "CMakeLists.txt"
- "requirements/common.txt"
- "requirements/cuda.txt"
- "requirements/build.txt"
- "requirements/test.txt"
- "requirements/kv_connectors.txt"
- "requirements/build/cuda.txt"
- "requirements/test/cuda.txt"
- "setup.py"
- "csrc/"
- "cmake/"
4 changes: 2 additions & 2 deletions .buildkite/ci_config_intel.yaml
@@ -6,8 +6,8 @@ run_all_patterns:
- "CMakeLists.txt"
- "requirements/common.txt"
- "requirements/xpu.txt"
- "requirements/build.txt"
- "requirements/test.txt"
- "requirements/build/cuda.txt"
- "requirements/test/cuda.txt"
- "setup.py"
- "csrc/"
- "cmake/"
8 changes: 0 additions & 8 deletions .buildkite/hardware_tests/amd.yaml
@@ -20,11 +20,3 @@ steps:
- docker push "rocm/vllm-ci:${BUILDKITE_COMMIT}"
env:
DOCKER_BUILDKIT: "1"
retry:
automatic:
- exit_status: -1 # Agent was lost
limit: 1
- exit_status: -10 # Agent was lost
limit: 1
- exit_status: 1 # Machine occasionally fail
limit: 1
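This hunk drops the step's `retry.automatic` rules, which had Buildkite re-run the image build once when it exited with -1 or -10 (agent lost) or 1 (an occasional machine failure). As a rough local analogue only, the bash sketch below re-runs a command while it exits with one of the listed statuses, using a single shared retry limit for simplicity. `retry_on_status` is a hypothetical helper, not part of Buildkite or this repository, and statuses such as -1 are reported by the agent rather than by the command, so a shell loop can only react to ordinary 0 to 255 exit codes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry_on_status LIMIT "STATUS..." CMD [ARGS...]
# Re-runs CMD while it exits with one of the listed statuses, at most LIMIT extra times.
retry_on_status() {
  local limit=$1 statuses=$2
  shift 2
  local attempt=0 rc
  while true; do
    "$@" && return 0
    rc=$?
    # Give up if the status is not retryable or the retry budget is used up.
    if [[ " $statuses " != *" $rc "* ]] || (( attempt >= limit )); then
      return "$rc"
    fi
    attempt=$((attempt + 1))
    echo "retry ${attempt}/${limit} after exit status ${rc}" >&2
  done
}

# Usage: a command that fails with status 1 on its first call and succeeds on the second.
counter=$(mktemp)
echo 0 > "$counter"
flaky() {
  local n
  n=$(<"$counter")
  echo $((n + 1)) > "$counter"
  (( n >= 1 ))
}
retry_on_status 1 "1" flaky && echo "succeeded after $(<"$counter") attempts"
rm -f "$counter"
```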
50 changes: 41 additions & 9 deletions .buildkite/hardware_tests/cpu.yaml
@@ -12,13 +12,19 @@ steps:
- vllm/_custom_ops.py
- tests/kernels/attention/test_cpu_attn.py
- tests/kernels/moe/test_cpu_fused_moe.py
- tests/kernels/moe/test_cpu_quant_fused_moe.py
- tests/kernels/test_onednn.py
- tests/kernels/test_awq_int4_to_int8.py
- tests/kernels/quantization/test_cpu_fp8_scaled_mm.py
commands:
- |
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 20m "
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 30m "
pytest -x -v -s tests/kernels/attention/test_cpu_attn.py
pytest -x -v -s tests/kernels/moe/test_cpu_fused_moe.py
pytest -x -v -s tests/kernels/test_onednn.py"
pytest -x -v -s tests/kernels/moe/test_cpu_quant_fused_moe.py
pytest -x -v -s tests/kernels/test_onednn.py
pytest -x -v -s tests/kernels/test_awq_int4_to_int8.py
pytest -x -v -s tests/kernels/quantization/test_cpu_fp8_scaled_mm.py"

- label: CPU-Compatibility Tests
depends_on: []
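This step raises the first argument of `run-cpu-test.sh` from `20m` to `30m` while adding three more pytest files. That argument reads like a wall-clock budget in GNU `timeout` duration syntax, but the script itself is not part of this diff, so treat that reading as an assumption. A minimal sketch of such a budgeted runner, built on GNU coreutils `timeout` (which exits with status 124 when the limit is reached); `run_with_budget` is a hypothetical name:

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for a "run these commands within a time budget" wrapper.
# Usage: run_with_budget DURATION "COMMANDS"
#   DURATION uses GNU timeout syntax, e.g. 30s, 20m, 1h.
run_with_budget() {
  local budget=$1 cmds=$2 rc=0
  timeout "$budget" bash -c "$cmds" || rc=$?
  if (( rc == 124 )); then
    echo "commands exceeded the ${budget} budget" >&2
  fi
  return "$rc"
}

# Usage: the first call finishes in time, the second is stopped after one second.
run_with_budget 30m "echo quick job"
run_with_budget 1s "sleep 5" || echo "exit status: $?"
```

Returning the inner status unchanged lets a CI step distinguish a test failure (any other non-zero code) from running out of time (124).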
@@ -44,34 +50,49 @@ steps:
- tests/models/language/pooling/
commands:
- |
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 30m "
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 40m "
pytest -x -v -s tests/models/language/generation -m cpu_model
pytest -x -v -s tests/models/language/pooling -m cpu_model"

- label: CPU-ModelRunnerV2 Tests
depends_on: []
device: intel_cpu
no_plugin: true
soft_fail: true
source_file_dependencies:
- vllm/v1/worker/cpu/
- vllm/v1/worker/gpu/
commands:
- |
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 30m "
uv pip install git+https://github.com/triton-lang/triton-cpu.git@270e696d
VLLM_USE_V2_MODEL_RUNNER=1 pytest -x -v -s tests/models/language/generation/test_granite.py -m cpu_model"

- label: CPU-Quantization Model Tests
depends_on: []
device: intel_cpu
no_plugin: true
source_file_dependencies:
- csrc/cpu/
- vllm/model_executor/layers/quantization/cpu_wna16.py
- vllm/model_executor/layers/quantization/gptq_marlin.py
- vllm/model_executor/layers/quantization/auto_gptq.py
- vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_int8.py
- vllm/model_executor/layers/quantization/kernels/scaled_mm/cpu.py
- vllm/model_executor/layers/quantization/kernels/mixed_precision/cpu.py
- vllm/model_executor/layers/fused_moe/experts/cpu_moe.py
- tests/quantization/test_compressed_tensors.py
- tests/quantization/test_cpu_wna16.py
commands:
- |
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 20m "
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 30m "
pytest -x -v -s tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs
pytest -x -v -s tests/quantization/test_cpu_wna16.py"

- label: CPU-Distributed Tests
- label: CPU-Distributed Tests (PP+TP)
depends_on: []
device: intel_cpu
no_plugin: true
source_file_dependencies:
source_file_dependencies: &cpu_distributed_deps
- csrc/cpu/shm.cpp
- vllm/v1/worker/cpu_worker.py
- vllm/v1/worker/gpu_worker.py
@@ -80,10 +101,21 @@ steps:
- vllm/platforms/cpu.py
- vllm/distributed/parallel_state.py
- vllm/distributed/device_communicators/cpu_communicator.py
- .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh
commands:
- |
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 10m "
bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh tp_pp"

- label: CPU-Distributed Tests (DP+TP)
depends_on: []
device: intel_cpu
no_plugin: true
source_file_dependencies: *cpu_distributed_deps
commands:
- |
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 10m "
bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh"
bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh dp_tp"

- label: CPU-Multi-Modal Model Tests %N
depends_on: []
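After this change the distributed smoke test is split into two steps that share one dependency list through the `&cpu_distributed_deps` anchor, and each step passes a mode to the script: `tp_pp` for PP+TP and `dp_tp` for DP+TP. The script body is not included in this diff; purely to illustrate the pattern, a mode-dispatching entry point might look like the sketch below, where the function name and the printed descriptions are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: pick a smoke-test variant from the first argument.
select_smoke_mode() {
  case "${1:-}" in
    tp_pp) echo "tensor parallel + pipeline parallel" ;;
    dp_tp) echo "data parallel + tensor parallel" ;;
    *)
      # Reject a missing or unknown mode instead of silently picking one.
      echo "usage: $0 {tp_pp|dp_tp}" >&2
      return 2
      ;;
  esac
}

select_smoke_mode tp_pp
select_smoke_mode dp_tp
```

Failing loudly on an unknown mode means a step that forgets to pass an argument shows up as a red build rather than a run of the wrong configuration.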
@@ -97,7 +129,7 @@ steps:
- |
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 45m "
pytest -x -v -s tests/models/multimodal/generation --ignore=tests/models/multimodal/generation/test_pixtral.py -m cpu_model --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --shard-id=$$BUILDKITE_PARALLEL_JOB"
parallelism: 2
parallelism: 3

- label: "Arm CPU Test"
depends_on: []
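Raising `parallelism` from 2 to 3 makes Buildkite start three copies of the multi-modal step, and each copy hands pytest its zero-based index and the total count through `BUILDKITE_PARALLEL_JOB` and `BUILDKITE_PARALLEL_JOB_COUNT` (the doubled `$$` defers interpolation from pipeline upload to job runtime). How `--num-shards`/`--shard-id` actually assigns tests depends on the pytest sharding plugin in use; the sketch below uses plain round-robin over a list only to show that every test lands in exactly one of the three jobs. `shard_items` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Hypothetical round-robin sharding: print the items owned by SHARD_ID out of NUM_SHARDS.
# Usage: shard_items SHARD_ID NUM_SHARDS ITEM...
shard_items() {
  local shard_id=$1 num_shards=$2
  shift 2
  local i=0 item
  for item in "$@"; do
    if (( i % num_shards == shard_id )); then
      echo "$item"
    fi
    i=$((i + 1))
  done
}

tests=(test_a test_b test_c test_d test_e)
for job in 0 1 2; do   # mirrors BUILDKITE_PARALLEL_JOB = 0..2 with parallelism: 3
  echo "job $job: $(shard_items "$job" 3 "${tests[@]}" | tr '\n' ' ')"
done
```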
7 changes: 0 additions & 7 deletions .buildkite/hardware_tests/intel.yaml
@@ -8,10 +8,3 @@ steps:
commands:
- bash .buildkite/scripts/hardware_ci/run-hpu-test.sh

- label: "Intel GPU Test"
depends_on: []
soft_fail: true
device: intel_gpu
no_plugin: true
commands:
- bash .buildkite/scripts/hardware_ci/run-xpu-test.sh
5 changes: 3 additions & 2 deletions .buildkite/image_build/image_build.sh
@@ -92,8 +92,8 @@ check_and_skip_if_image_exists() {
}

ecr_login() {
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY"
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 936637512419.dkr.ecr.us-east-1.amazonaws.com
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY" || true
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 936637512419.dkr.ecr.us-east-1.amazonaws.com || true
}

prepare_cache_tags() {
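Both ECR login pipelines in `ecr_login` now end in `|| true`, and the CPU, Arm64 CPU, and HPU build scripts in this diff get the same change. A failed login therefore no longer fails the login step itself; a later push or pull that really needs the credentials would still fail there. Whether these scripts run under `set -e` and `pipefail` is not visible in the hunks, so the sketch below assumes both to show the effect, with `false` standing in for a failing `docker login --password-stdin`. `try_snippet` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Runs a snippet under errexit + pipefail (an assumption about the CI scripts'
# shell options) and reports whether execution got past it.
try_snippet() {
  bash -c "set -eo pipefail; $1; echo reached-end" 2>/dev/null || echo aborted
}

try_snippet 'echo token | false'           # the failing pipeline stops the shell
try_snippet 'echo token | false || true'   # the failure is swallowed, execution continues
```

The trade-off is visibility: a genuine credential problem now surfaces at the first registry operation rather than at login time.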
@@ -192,6 +192,7 @@ export BUILDKITE_COMMIT
export PARENT_COMMIT
export IMAGE_TAG
export IMAGE_TAG_LATEST
export COMMIT="${COMMIT:-${BUILDKITE_COMMIT}}"
export CACHE_FROM
export CACHE_FROM_BASE_BRANCH
export CACHE_FROM_MAIN
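The new `export COMMIT="${COMMIT:-${BUILDKITE_COMMIT}}"` keeps a `COMMIT` value already set by the caller and otherwise falls back to `BUILDKITE_COMMIT`. Because it uses `:-`, an empty `COMMIT` also triggers the fallback; the plain `-` form would only cover the unset case. A small demonstration, where `resolve_commit` is a hypothetical wrapper and the SHAs are placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the same expansion used in image_build.sh.
resolve_commit() {
  echo "${COMMIT:-${BUILDKITE_COMMIT}}"
}

BUILDKITE_COMMIT=abc1234
unset COMMIT;   resolve_commit   # prints abc1234 (unset, so the fallback is used)
COMMIT="";      resolve_commit   # prints abc1234 (empty also falls back with :-)
COMMIT=def5678; resolve_commit   # prints def5678 (an explicit value wins)
```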
42 changes: 42 additions & 0 deletions .buildkite/image_build/image_build.yaml
@@ -6,6 +6,48 @@ steps:
timeout_in_minutes: 600
commands:
- if [[ "$BUILDKITE_BRANCH" == "main" ]]; then .buildkite/image_build/image_build.sh $REGISTRY $REPO $BUILDKITE_COMMIT $BRANCH $IMAGE_TAG $IMAGE_TAG_LATEST; else .buildkite/image_build/image_build.sh $REGISTRY $REPO $BUILDKITE_COMMIT $BRANCH $IMAGE_TAG; fi
# Non-root smoke 1: the default (root) image must still be importable
# under a non-root UID via `--user 2000:0`. Validates the `vllm` passwd
# entry + group-0-writable /home/vllm + uv path cleanup from #31959.
# Uses `import vllm` rather than `vllm serve --help` because the latter
# instantiates `VllmConfig` which requires a GPU attached to the
# container.
- docker run --rm --user 2000:0 --entrypoint python3 "$IMAGE_TAG" -c "import vllm; print(vllm.__version__)"
# Non-root smoke 2: assert the non-root enabling invariants are baked
# into the image. Runs as UID 2000:0 via a shell so we can verify
# filesystem perms + passwd/group file state + wrapper presence without
# triggering vLLM's GPU-requiring config-init path. The opt-in
# `vllm-openai-nonroot` target adds only `USER vllm`, `WORKDIR
# /home/vllm`, and an `ENTRYPOINT` override on top of these invariants;
# its build correctness is reviewed at the Dockerfile level. Wrapper
# logic is covered separately by the pre-commit hook
# `test-nonroot-entrypoint` (see .pre-commit-config.yaml).
- |
docker run --rm --user 2000:0 --entrypoint /bin/sh "$IMAGE_TAG" -ec '
if ! getent passwd 2000 | grep -q ^vllm:; then
echo FAIL: UID 2000 != vllm
exit 1
fi
if ! id -gn 2>/dev/null | grep -qx root; then
echo FAIL: GID 0 not root group
exit 1
fi
touch /home/vllm/.smoke && rm /home/vllm/.smoke
touch /opt/uv/cache/.smoke && rm /opt/uv/cache/.smoke
if ! test -x /usr/local/bin/vllm-nonroot-entrypoint.sh; then
echo FAIL: wrapper missing
exit 1
fi
if ! test -w /etc/passwd; then
echo FAIL: /etc/passwd not group-writable
exit 1
fi
if ! test -w /etc/group; then
echo FAIL: /etc/group not group-writable
exit 1
fi
echo non-root invariants OK
'
retry:
automatic:
- exit_status: -1 # Agent was lost
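The second smoke step checks two invariants: `getent passwd 2000` must resolve to the `vllm` account, and group 0 must be able to write to `/home/vllm`. A minimal sketch of equivalent static checks, for exercising the logic without Docker. `uid_owner` and `group_writable_by` are hypothetical helper names, not part of this PR, and `stat -c` assumes GNU coreutils (as on the Ubuntu base image).

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper (not in the PR): print the login name that owns a UID
# in a passwd-format file (name:password:uid:gid:gecos:home:shell).
uid_owner() {
  local uid=$1 file=${2:-/etc/passwd}
  awk -F: -v uid="$uid" '$3 == uid { print $1; exit }' "$file"
}

# Hypothetical helper (not in the PR): succeed only if PATH belongs to group
# GID and its group-write permission bit is set. Requires GNU `stat -c`.
group_writable_by() {
  local path=$1 gid=$2 mode
  [ "$(stat -c %g "$path")" = "$gid" ] || return 1
  mode=$(stat -c %A "$path")   # e.g. drwxrwxr-x; index 5 is the group-write bit
  [ "${mode:5:1}" = w ]
}
```

Inside the image, `[ "$(uid_owner 2000)" = vllm ] && group_writable_by /home/vllm 0` would approximate the same checks. The smoke step's `touch`/`rm` approach is stricter: `getent` also consults NSS sources beyond `/etc/passwd`, and actually writing a file accounts for ACLs and ownership that a mode-bit check ignores.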
2 changes: 1 addition & 1 deletion .buildkite/image_build/image_build_cpu.sh
@@ -11,7 +11,7 @@ REPO=$2
BUILDKITE_COMMIT=$3

# authenticate with AWS ECR
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY"
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY" || true

# skip build if image already exists
if [[ -z $(docker manifest inspect "$REGISTRY"/"$REPO":"$BUILDKITE_COMMIT"-cpu) ]]; then
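Appending `|| true` to the ECR login makes it best-effort: a failed login pipeline no longer yields a non-zero status on that line, so a script running under `set -e` continues to the `docker manifest inspect` check instead of exiting. The shell options of these build scripts are not visible in this diff. By contrast, `image_build_torch_nightly.sh` runs under `set -euo pipefail` and keeps its logins without `|| true`. A minimal sketch of that interaction, where `failing_login` is a hypothetical stand-in for the real `aws ... | docker login ...` pipeline:

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical stand-in for `aws ecr-public get-login-password | docker login ...`
# whose first stage fails. With `pipefail`, the pipeline's status is non-zero.
failing_login() { false | cat; }

# Without `|| true`, `set -e` would terminate the script on this line.
# With it, the failure is swallowed and execution continues.
failing_login || true
echo "continued after failed login"
```

Without `pipefail`, a pipeline's status is that of its last command (`docker login` here), so a failure in the `aws` stage alone would go unnoticed even without `|| true`.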
2 changes: 1 addition & 1 deletion .buildkite/image_build/image_build_cpu_arm64.sh
@@ -11,7 +11,7 @@ REPO=$2
BUILDKITE_COMMIT=$3

# authenticate with AWS ECR
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY"
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY" || true

# skip build if image already exists
if [[ -z $(docker manifest inspect "$REGISTRY"/"$REPO":"$BUILDKITE_COMMIT"-arm64-cpu) ]]; then
2 changes: 1 addition & 1 deletion .buildkite/image_build/image_build_hpu.sh
@@ -11,7 +11,7 @@ REPO=$2
BUILDKITE_COMMIT=$3

# authenticate with AWS ECR
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY"
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY" || true

# skip build if image already exists
if [[ -z $(docker manifest inspect "$REGISTRY"/"$REPO":"$BUILDKITE_COMMIT"-hpu) ]]; then
68 changes: 68 additions & 0 deletions .buildkite/image_build/image_build_torch_nightly.sh
@@ -0,0 +1,68 @@
#!/bin/bash
set -euo pipefail

# Build a vLLM test image with PyTorch nightly installed.
# Called by the pipeline generator's "vLLM Against PyTorch Nightly" group.

if [[ $# -lt 5 ]]; then
echo "Usage: $0 <registry> <repo> <commit> <branch> <image_tag>"
exit 1
fi

REGISTRY=$1
REPO=$2
BUILDKITE_COMMIT=$3
BRANCH=$4
IMAGE_TAG=$5

# --- Arguments ---
echo "--- :mag: Arguments"
echo "REGISTRY: ${REGISTRY}"
echo "REPO: ${REPO}"
echo "BUILDKITE_COMMIT: ${BUILDKITE_COMMIT}"
echo "BRANCH: ${BRANCH}"
echo "IMAGE_TAG: ${IMAGE_TAG}"

# --- ECR login ---
echo "--- :key: ECR login"
aws ecr-public get-login-password --region us-east-1 \
| docker login --username AWS --password-stdin "$REGISTRY"
aws ecr get-login-password --region us-east-1 \
| docker login --username AWS --password-stdin 936637512419.dkr.ecr.us-east-1.amazonaws.com

# --- Set up buildx ---
echo "--- :docker: Setting up buildx"
docker buildx create --name vllm-builder --driver docker-container --use || true
docker buildx inspect --bootstrap
docker buildx ls

# --- Skip if image already exists ---
echo "--- :mag: Checking if image already exists"
if docker manifest inspect "$IMAGE_TAG" >/dev/null 2>&1; then
echo "Image found: $IMAGE_TAG — skipping build"
exit 0
fi
echo "Image not found, proceeding with build..."

# --- CUDA 13.0 for nightly builds ---
# Nightly CI uses CUDA 13.0 while regular CI stays on CUDA 12.9
NIGHTLY_CUDA_VERSION="13.0.2"
NIGHTLY_BUILD_BASE_IMAGE="nvidia/cuda:${NIGHTLY_CUDA_VERSION}-devel-ubuntu22.04"
NIGHTLY_FINAL_BASE_IMAGE="nvidia/cuda:${NIGHTLY_CUDA_VERSION}-base-ubuntu22.04"

echo "--- :docker: Building torch nightly image (CUDA ${NIGHTLY_CUDA_VERSION})"
docker buildx build --file docker/Dockerfile \
--build-arg max_jobs=16 \
--build-arg buildkite_commit="$BUILDKITE_COMMIT" \
--build-arg USE_SCCACHE=1 \
--build-arg PYTORCH_NIGHTLY=1 \
--build-arg CUDA_VERSION="${NIGHTLY_CUDA_VERSION}" \
--build-arg BUILD_BASE_IMAGE="${NIGHTLY_BUILD_BASE_IMAGE}" \
--build-arg FINAL_BASE_IMAGE="${NIGHTLY_FINAL_BASE_IMAGE}" \
--build-arg torch_cuda_arch_list="8.0 8.9 9.0 10.0 12.0" \
--tag "$IMAGE_TAG" \
--push \
--target test \
--progress plain .

echo "--- :white_check_mark: Torch nightly image build complete: $IMAGE_TAG"
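The nightly script derives both CUDA base images from the single `NIGHTLY_CUDA_VERSION` value. Below is a minimal sketch, assuming you want to inspect the resulting build arguments before running a build. It collects a subset of the flags above into a bash array, which keeps a value containing spaces, such as `torch_cuda_arch_list`, as a single argument. `build_args` is a hypothetical name, and the array is a technique chosen for this sketch, not what the PR's script does.

```shell
#!/bin/bash
set -euo pipefail

# Same derivation as the script: one CUDA version feeds both base images.
NIGHTLY_CUDA_VERSION="13.0.2"

# Hypothetical: a subset of the script's buildx flags collected in an array.
# Each quoted element stays one argument, even when it contains spaces.
build_args=(
  --file docker/Dockerfile
  --build-arg PYTORCH_NIGHTLY=1
  --build-arg "CUDA_VERSION=${NIGHTLY_CUDA_VERSION}"
  --build-arg "BUILD_BASE_IMAGE=nvidia/cuda:${NIGHTLY_CUDA_VERSION}-devel-ubuntu22.04"
  --build-arg "FINAL_BASE_IMAGE=nvidia/cuda:${NIGHTLY_CUDA_VERSION}-base-ubuntu22.04"
  --build-arg "torch_cuda_arch_list=8.0 8.9 9.0 10.0 12.0"
  --target test
)

# Dry run: print one argument per line instead of invoking Docker.
printf '%s\n' "${build_args[@]}"
```

Replacing the `printf` line with `docker buildx build "${build_args[@]}" .` would pass the same arguments to an actual build.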
4 changes: 2 additions & 2 deletions .buildkite/image_build/image_build_xpu.sh
@@ -11,8 +11,8 @@ REPO=$2
BUILDKITE_COMMIT=$3

# authenticate with AWS ECR
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY"
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 936637512419.dkr.ecr.us-east-1.amazonaws.com
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY" || true
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 936637512419.dkr.ecr.us-east-1.amazonaws.com || true

# skip build if image already exists
if ! docker manifest inspect "$REGISTRY"/"$REPO":"$BUILDKITE_COMMIT"-xpu &> /dev/null; then
21 changes: 21 additions & 0 deletions .buildkite/intel_jobs/engine_intel.yaml
@@ -0,0 +1,21 @@
group: Engine Intel
depends_on:
- image-build-xpu
steps:
- label: Engine (1 GPU)
timeout_in_minutes: 30
device: intel_gpu
no_plugin: true
working_dir: "."
env:
REGISTRY: "public.ecr.aws/q9t5s3a7"
REPO: "vllm-ci-test-repo"
VLLM_TEST_DEVICE: "xpu"
source_file_dependencies:
- vllm/v1/engine/
- tests/v1/engine/
commands:
- >-
bash .buildkite/scripts/hardware_ci/run-intel-test.sh
'cd tests &&
pytest -v -s v1/engine --ignore v1/engine/test_preprocess_error_handling.py'
21 changes: 21 additions & 0 deletions .buildkite/intel_jobs/kernels_intel.yaml
@@ -0,0 +1,21 @@
group: Kernels Intel
depends_on:
- image-build-xpu
steps:
- label: vLLM IR Tests
timeout_in_minutes: 30
device: intel_gpu
no_plugin: true
working_dir: "."
env:
REGISTRY: "public.ecr.aws/q9t5s3a7"
REPO: "vllm-ci-test-repo"
VLLM_TEST_DEVICE: "xpu"
source_file_dependencies:
- vllm/ir
- vllm/kernels
commands:
- >-
bash .buildkite/scripts/hardware_ci/run-intel-test.sh
'cd tests &&
pytest -v -s kernels/ir'