Closed
Commits
1453 commits
d077622
[Build] Build bundled DeepGEMM `_C` per-Python so the wheel imports o…
mgoin May 12, 2026
7a9cc5e
[Model] Support MiniCPM-V 4.6 (#41254)
tc-mb May 12, 2026
6ccb10d
Added peagle speculators support (#41826)
shanjiaz May 12, 2026
289cee0
[vLLM IR] Minor improvements (#39362) (#39558)
GOavi101 May 12, 2026
5a6a9fc
[docs] Added one new contact to the Vulnerability Management team (#4…
jperezdealgaba May 12, 2026
418ba8e
[kv_offload][BugFix] Fix store deferral (#41945)
hickeyma May 12, 2026
a1b2d87
[Refactor] Clean up pooling models `build_tok_params` logic (#42341)
yewentao256 May 12, 2026
c8a6e27
[CPU] Fix rotary embedding for CPU without flash-attn ops (#42225)
jmamou May 12, 2026
bcb9c13
feat(kv-events): emit KV cache metadata (#40984)
PeaBrane May 12, 2026
6ff7405
[Bugfix] [Frontend] Responses API, fix merging of messages (#42189)
yzong-rh May 12, 2026
4d591db
[MoE Refactor] Introduce RoutedExperts alias for FusedMoE and don't s…
bnellnm May 12, 2026
d9b4990
[MoE Refactor] EPLB refactoring for FusedMoE (#41055)
bnellnm May 12, 2026
67c89fe
[Model][Bugfix] Fix Step3-VL image_embeds input path (#42333)
KaivalyaMDabhadkar May 12, 2026
379f0ec
[CI] Migrate 6 verified jobs from gpu_1_queue to h200_18gb MIG (#42446)
khluu May 12, 2026
0ce6613
platforms: add uses_cpu_device() hook to Platform for DeviceConfig (#…
viktorpusTT May 12, 2026
fe5b4e0
[Model Runner V2] Apply synthetic mode to probabilistic rejection sam…
TheEpicDolphin May 12, 2026
fe8b42e
[CI] Fix `test_async_scheduling.py` flakiness (#42455)
njhill May 12, 2026
8c4fc42
[CI] Inline build artifact annotations in release pipeline (#42357)
khluu May 12, 2026
184577a
[Build] DeepGEMM: trim comments, add integration notes + TODOs (#42429)
mgoin May 12, 2026
ebeb09d
[KV Transfer] Add MooncakeStoreConnector for KV cache offloading via …
LCAIZJ May 12, 2026
3d635c5
[Perf] Optimize MLA `compute_prefill_context` memory allocation (#42460)
yewentao256 May 12, 2026
07534b8
[PD] Bump NIXL connector dependency to 1.x (#42364)
alec-flowers May 13, 2026
18f6bf5
[MoE Refactor] Add sequence parallel tests to test_moe_layer.py (#41299)
bnellnm May 13, 2026
dcacdf9
[Attention] Sync FA with upstream (#41052)
MatthewBonanni May 13, 2026
71bcd02
[Bugfix][PD] Fix multi-node TP (TP>8) (#39907)
NickLucche May 13, 2026
503697c
[chore] Refactor pooling metadata token ID accessors (#42368)
taneem-ibrahim May 13, 2026
85b2fec
[5/n] Migrate CUTLASS MLA, hadamard, awq, allspark and DSV3 fused a g…
cleonard530 May 13, 2026
92def12
[MM][Perf][CG] Support ViT full CUDA graph for Qwen3.5 (#42151)
shen-shanshan May 13, 2026
a8c13d2
Patch SlidingWindowSpec.real_page_size_bytes for nvfp4 kv (#42464)
sychen52 May 13, 2026
9ce7404
[Bugfix][SimpleCPUOffloadBackend] Dedup in-flight CPU offload stores …
ivanium May 13, 2026
140dc2e
[Bugfix] Install nvidia-cutlass-dsl[cu13] extra on CUDA 13 platforms …
ZJY0516 May 13, 2026
13bf242
[Feat][KVConnector] Add `bind_gpu_block_pool()` to KVConnectorBase_V1…
ivanium May 13, 2026
f6e868f
[CI] Use uv with Python 3.12 for PyPI wheel upload (#42470)
khluu May 13, 2026
97c4317
[Bugfix][Frontend] Default max_tokens server-side on /inference/v1/ge…
hallerite May 13, 2026
74dffae
[ROCm] Run AITER RMSNorm pad fusion before AR RMS fusion (#42411)
akii96 May 13, 2026
d628a3c
[ROCm][CI] Skip ROCm batch invalid-input test pending torch fix (#41572)
AndreasKaratzas May 13, 2026
1686307
[Bugfix] Fix scipy audio resampling ratio (#42233)
BWAAEEEK May 13, 2026
cee6751
[Bugfix][Qwen3-VL] Fix pipeline-parallel deepstack initialization (#4…
MrZ20 May 13, 2026
79fd1bc
[kv_offload] Add req_id to ReqContext for per-request tracking (#42507)
ronensc May 13, 2026
3c413a5
Triton attention: add USE_TD constexpr for tensor descriptor Q/K/V lo…
afierka-intel May 13, 2026
3b1ef03
[Bugfix][Quark] Fix W8A8 INT8 garbage outputs on Step-3.5-Flash (and …
JoursBleu May 13, 2026
0a62f5e
[AMD] skip machete tests for rocm (#42326)
hissu-hyvarinen May 13, 2026
6767169
[CI] Re-enable Nemotron Parse parity test and switch testing to nemot…
mwawrzos May 13, 2026
0ddaf6d
[XPU] [CT] Enable CT W4A4MxFp4 path and add xpu kernel (#38896)
zufangzhu May 13, 2026
a8887c2
[Bugfix] [ROCm] [DSV4] [Perf] Add aiter mhc support (#41946)
tjtanaa May 13, 2026
11f6b54
[kv_offload] Add multi-tier KV cache offloading framework (#40020)
ronensc May 13, 2026
e35c0d4
[Feature] Support compile mode for batch invariance on SM80 (#42456)
yewentao256 May 13, 2026
256dbca
[Feature] Support custom callable proposer backend for speculative de…
CynicDora May 13, 2026
5794c65
[Bugfix][Model] Gemma4 MoE routing closure captures per_expert_scale,…
NoeliaBentancor May 13, 2026
2f821fa
[Spec Decode] Support hybrid attention models in extract_hidden_state…
mgoin May 13, 2026
b3c6959
[MM][CG] Support ViT CG for Qwen2-VL (#41736)
johncalesp May 13, 2026
0f69128
[Bugfix] Handle real-world gpt-oss tool call output in Harmony parsin…
bbrowning May 13, 2026
ab1ad0d
Remove verifier model type check in speculative config (#42536)
fynnsu May 13, 2026
4033096
[Quark] Support loading Quark NVFP4 checkpoints in vLLM (#35859)
fxmarty-amd May 13, 2026
a505cf8
[ModelRunner V2] Share identical MTP weights (#42538)
njhill May 13, 2026
3f611f6
[CI] Fix pre-commit issue (#42563)
yewentao256 May 13, 2026
873910d
[Frontend] add support for thinking_token_budget in completions (#42116)
walterbm May 13, 2026
cca32d5
[PD] Fix broken NIXL EP installation (#42542)
ovidiusm May 13, 2026
8efd508
[Quantization] Rework quantization_config to use QuantKey and allow f…
mgoin May 13, 2026
6b5c389
expose flex block size for batch invariant mode (#41252)
liangel-02 May 13, 2026
597ed13
[Core][MM] Do not use urllib3 to parse data URLs (#42535)
lgeiger May 13, 2026
f1cc7aa
[Bugfix] Fix DeepSeek V4 MTP HC state handling (#42320)
mmangkad May 13, 2026
b219867
[Bugfix] V1: support tuple model outputs in ubatch wrapper (dbo + spe…
he-yufeng May 13, 2026
ca7e454
[CI] set max transformers version for skywork model (#42104)
divakar-amd May 13, 2026
63cc8a5
fix(tool-parser): preserve "none"/"nil" strings as valid enum values …
ianliuy May 14, 2026
1087676
[Refactor] Use shared utils in hermes tool parser (#42570)
sfeng33 May 14, 2026
665f9c4
[Bugfix] Fix Gemma4ToolParser streaming float corruption (#42128)
abinggo May 14, 2026
f51f684
[Bugfix][Spec Decode] Wire draft_probs into probabilistic draft_model…
bedeks May 14, 2026
70c0016
[Feature] Add instruction support for score/rerank chat templates (#4…
KrxGu May 14, 2026
751b9f1
[XPU][CT] Support mxfp8 moe model (#41918)
jikunshang May 14, 2026
77e1421
[Bugfix] Fix EPLB initialization for VLM wrapper models (#39805)
esmeetu May 14, 2026
ca60a4e
[Fix] Weight loading for qwen3_5 using runai_streamer (#42521)
hks-9697-v2 May 14, 2026
bf0d2dc
[Misc] Fix mypy error in parser_manager type narrowing (#42441)
Sarah-Salah May 14, 2026
b26558d
[CI][XPU] skip ut of offload connector (#42598)
zhenwei-intel May 14, 2026
fd7d858
Use hidden_pad and intermediate_pad from vLLM #34301 (#42098)
rebklee May 14, 2026
0d2732d
[MLA Attention Backend] Add TOKENSPEED_MLA backend for DSR1/Kimi K25 …
zyongye May 14, 2026
8c79ad6
Revert "[Core] Replace routing replay with device cache and async D2H…
aoshen02 May 14, 2026
ce29c26
Update Dockerfile.rocm for AINIC & Thor NIC (#40453)
haic0 May 14, 2026
addef32
[CI][AMD] Skip tests where models have problems or fails on both HW t…
rasmith May 14, 2026
768f4a6
[CI][AMD][BugFix] Prevent triton compiler error when running test_moe…
rasmith May 14, 2026
23c8534
[Bug] Fix DeepSeek V4 `AttributeError: module 'cutlass.cute.nvgpu' ha…
yewentao256 May 14, 2026
9946c38
[XPU] Fix double-transpose in XPUFP8ScaledMMLinearKernel for W8A8 qua…
libinta May 14, 2026
1ea9401
[Quantization][Autoround][Toolkit] Add W4A16 Support (#39778)
Zhenzhong1 May 14, 2026
0a65d46
[DSV4] Fuse norm and router for low latency scenario (#41263)
jeejeelee May 14, 2026
6548560
[Compile] Fix compile warning with topk softplus sqrt (#41261)
yewentao256 May 14, 2026
5bd8c71
[kv_offload] Implement `reset_cache()` for the offloading connector (…
hickeyma May 14, 2026
2317682
[Bugfix] Fix TRTLLM ragged MLA prefill workspace warmup (#42112)
mmangkad May 14, 2026
c7560af
[RFC] Replace shared-memory routed experts with ModelRunnerOutput tra…
xhx1022 May 14, 2026
24337fb
PD disagg with NIXL Connector: GDN support (Qwen3.5) (#41869)
ZhanqiuHu May 14, 2026
f60c6b3
[V1][DP][LB] Publish request counts at the start of each engine step …
vadiklyutiy May 14, 2026
f07b1da
[ROCm] Enable gluon paged MQA logits on gfx950 (MI355X) (#42062)
frida-andersson May 14, 2026
b8a25d0
[Bugfix] Fix LM detection for Nemotron Parse (#42641)
DarkLight1337 May 14, 2026
a7737cb
[Fix] Misc Fixes in ViT CUDA Graph (#38040)
b-mu May 14, 2026
f3d5360
[Bugfix][Multimodal] PyAV video backend returns keyframes labeled as …
WindChimeRan May 14, 2026
ae4f59f
[Model Runner v2] Oracle for model runner v2 - qwen3 dense model by d…
yewentao256 May 14, 2026
9898f94
[Attention] Remove deprecated MLA prefill arguments (#42555)
MatthewBonanni May 14, 2026
f887aa1
[Aiter][ROCm] RMSNormGated+GroupedQuantFP8 fusion (#40710)
tpopp May 14, 2026
4cfcc08
[CI][ROCm] Remove unsupported cases in test_fusion.py (#38680)
charlifu May 14, 2026
f8848b2
[Bugfix] Add swiglu limits to deepgemm fp8 methods (#41986)
zyongye May 14, 2026
3b6a204
[Model Runner V2][Bug Fix][DSV4] Ensure lazy attention state initiali…
TheEpicDolphin May 14, 2026
fa2a33b
[Quant] Consolidate GPTQ: rename gptq_marlin.py to auto_gptq.py (#38288)
chengyinie May 15, 2026
0d4d334
Bump llguidance to 1.7 (#42150)
ricky-chaoju May 15, 2026
56434e8
[Bugfix] Fix incorrect chat template format for Qwen3.5 (#42660)
DarkLight1337 May 15, 2026
f351455
[CPU][RISC-V] Add RVV-optimized attention kernels for RISC-V Vector …
lyd1992 May 15, 2026
faa4b76
[Model] Support InternS2 Preview (#42705)
Isotr0py May 15, 2026
bf610c2
[Bugfix] Fix inverted condition causing thinking_token_budget to be s…
JasonKeyiL May 15, 2026
e30f39c
Update Intel Xeon model list and vLLM Benchmark Suite BKMs (#42607)
louie-tsai May 15, 2026
27b85d2
[Bugfix] Clarify CPU backend memory error messages reference shared f…
daniel-devlab May 15, 2026
2676ab1
[Deprecation] Remove old locations of `get_tokenizer` and `resolve_hf…
DarkLight1337 May 15, 2026
31fa757
[Misc] Make it simpler to replace out-of-tree layer classes with rela…
paulyu12 May 15, 2026
4b364f8
[Core][DSV4] Skip caching SWA blocks that can never serve a prefix-ca…
ivanium May 15, 2026
75fd68c
[Entrypoints] Split the pooling offline API into PoolingOfflineMixin.…
noooop May 15, 2026
ccde954
DeepSeekV4-Pro enable cuda graph full and piecewise mode (#42604)
bobofang11235 May 15, 2026
d735968
[ROCm][CI] Stage B gating (#42025)
AndreasKaratzas May 15, 2026
d26a28a
fix: propagate revision/code_revision pins to all artifact boundaries…
jperezdealgaba May 15, 2026
1dc3fe0
gemma3 multi-gpu bug-fix (#42630)
pmaybank May 15, 2026
95cfe10
[Bugfix] Ensure embeding model compilation on CPU (#42709)
bigPYJ1151 May 15, 2026
0fe7550
[Bugfix] DFlash FP8 KV-Cache (#42692)
benchislett May 15, 2026
e0a45f1
[Feat][RL] IPC weight sync optimizations: multigpu support and chunke…
hao-aaron May 15, 2026
d792d99
[ROCm] Widen OAI Triton MoE capability range to include gfx12 (RDNA4)…
laudney May 15, 2026
af9616d
[Model Runner V2] Fix kv_connector `pre_forward` order (#42676)
yewentao256 May 15, 2026
491e8d8
[Perf] Optimize MLA attention `_v_up_proj` bmm by removing additional…
yewentao256 May 15, 2026
ee58665
[Bugfix] Fix DeepGEMM context lens contiguity in MLA indexer (#42135)
mmangkad May 15, 2026
fb5bd03
[Perf] Set IR Op Priority Once at Worker Init (#42631)
BadrBasowid May 15, 2026
46a9581
[ROCm][MLA] FP8 ASM prefill for AITER dense MLA backend on gfx950 (#4…
maeehart May 15, 2026
0162596
[Model Runner V2] FP32 gumbel sampling. (#41775)
PatchouliTIS May 15, 2026
6147c70
[Model Runner v2] Support reload weights (sleep mode) (#42673)
yewentao256 May 15, 2026
be7a03e
[ROCm] Widen AITER fused AR RMSNorm 1-stage gate (#42409)
akii96 May 15, 2026
f45c210
[LMCacheMPConnector] Prioritize importing the lmcache_mp_connector fr…
chunxiaozheng May 15, 2026
06d020b
[Bugfix] Fix SM121 (DGX Spark) exclusion from Marlin/CUTLASS FP8 path…
blake-snc May 15, 2026
4d67d3b
[ROCm] Restore fast top_k_per_row kernels for sparse MLA when topk_to…
frida-andersson May 15, 2026
b2c58ee
[FlashAttn] Fix supports_kv_cache_dtype() accepting unhandled fp8 kv-…
liulanze May 15, 2026
9a7a273
Add HumanEval and GSM8K benchmarks to datasets (#42648)
southfreebird May 15, 2026
de2d76f
[Build] Switch CUDA 12.9 wheel builds to PyTorch manylinux_2_28 base …
mgoin May 15, 2026
bd9dbe6
[ROCm][Bugfix] Fix fused_mla_dual_rms_norm for AITER API rename _fuse…
rbrugaro-amd May 15, 2026
1ccdf87
[Bugfix] Fix layerwise reload alias-buffer corruption (#42481)
rasdani May 15, 2026
d0921ba
[Bugfix] Unwrap VLM wrappers for EPLB on Model Runner V2 (#42706)
JasonKeyiL May 15, 2026
b2a27b8
[Kernel][UX] Add `--linear-backend` arg for linear kernel selection (…
mgoin May 16, 2026
852f567
[Bugfix] Respect explicit --kv-cache-dtype over checkpoint kv_cache_s…
mgoin May 16, 2026
87a2adc
[Misc] Add common random prefix option to structured-output serving b…
viktorpusTT May 16, 2026
39c67d7
fix: add API key authorization to /v2 endpoints (#42594)
dusthunter May 16, 2026
32b7177
[LoRA][Bugfix] Dedup LoRA wrapping for modules referenced from multip…
jeejeelee May 16, 2026
657b42b
[Docker][KVConnector] Build mooncake-transfer-engine from source (#42…
zhewenl May 16, 2026
4db300e
[ROCm][CI] Removed problematic command override mechanism (#42807)
AndreasKaratzas May 16, 2026
8a56da3
[Experimental] Breakable CUDA graph (#42304)
ZJY0516 May 16, 2026
d1586e1
Fix: Propagate pinned model revisions into Ultravox secondary weight …
weizhoublue May 16, 2026
787bc0d
Add unit tests for pooler activation functions (#42824)
taneem-ibrahim May 16, 2026
36e74c9
[KV Connector] Support disk offloading in MooncakeStoreConnector (#42…
zhewenl May 16, 2026
0867497
[CI/Build] Bump flashinfer to v0.6.11.post2 (#41711)
arpera May 16, 2026
a941892
Fix Weight loading for Qwen3.5-MTP and Qwen3-VL using runai_streamer…
weizhoublue May 17, 2026
504a26c
Support bf16 for mamba ssm cache (#41680)
qizzzh May 17, 2026
ff712f6
[MRV2][XPU] add Model Runner V2 log (#42710)
zhenwei-intel May 17, 2026
0fa8884
[XPU] fix weight scale shape (#42725)
zufangzhu May 17, 2026
1c8e9c0
Refactor: Pass num_labels explicitly to PoolerClassify instead of rea…
taneem-ibrahim May 17, 2026
599e75f
[ROCm] [Bugfix] Fix DeepSeek V4 Functionality and Accuracy (#42810)
tjtanaa May 17, 2026
966903e
[torch.compile] Add patch for fullgraph compilation (#42686)
ProExpertProg May 17, 2026
03ddc1c
[Perf] Wire silu_and_mul_per_block_quant into TritonFP8MoE (MiniMax-M…
qianlihuang May 18, 2026
1072104
[CI] Add NIXL EP import canary (#42567)
alec-flowers May 18, 2026
990f49b
[MM][CG] Enable encoder Cudagraph for Step3VL (#42224)
JisoLya May 18, 2026
b50646e
[ROCm][CI] Stabilize ROCm pooling and multimodal CI (#42909)
AndreasKaratzas May 18, 2026
23c15ac
[BugFix] Kimi-K2.5: skip vision tower dtype conversion when using qua…
gaozihao-shy May 18, 2026
c1f7854
Improve logging when docs build is skipped (#42929)
hmellor May 18, 2026
e3aeee5
[Bugfix] moe lora align kernel grid (#40131)
TheDuyIT May 18, 2026
7d5b033
[LoRA] Support 2D and 3D MoE LoRA adapter at the same time (#42242)
jeejeelee May 18, 2026
5ab6d1b
[Model] [Perf] Use flatten for Qwen3.5's GDN output projection (#42311)
rishaps May 18, 2026
9537542
Revert checkpoint specific workaround in Transformers modelling backe…
hmellor May 18, 2026
998714b
[Perf] Add do_not_specialize in fused FP8 RoPE kernel (#42849)
xyang16 May 18, 2026
c38bed4
delete xpu ci (#42582)
wendyliu235 May 18, 2026
965d076
[CPU] Specify required KV cache layout for CPU attention backend (#42…
hlin99 May 18, 2026
2267f70
[Kernel] Pack topk id/weights triton kernel (#42527)
jeejeelee May 18, 2026
b4601ad
[CPU] Add fused GDN support for AMX CPU platform (#42707)
bigPYJ1151 May 18, 2026
cac81b6
[CPU Backend] Improve cpu thread utilization (#42666)
tianmu-li May 18, 2026
88a860d
[CPU] Add MXFP4 W4A16 MoE support (#41922)
yuwenzho May 18, 2026
df852ed
fix: remove unused norm for dpskv4 (#41710)
inisis May 18, 2026
e414e1f
[Bugfix][KV Offload] count appended GPU blocks in store group_sizes (…
kfirtoledo May 18, 2026
737bfa3
[Bugfix][Hybrid][NemotronH] Fix mamba_cache_mode=all + speculative de…
roikoren755 May 18, 2026
69c91d0
[MRv2] Default to MRv1 when a connector is present (#42955)
NickLucche May 18, 2026
2e40faf
[XPU][CI] Temporarily skip test_moe_lora_align_block_size_mixed_base_…
zxd1997066 May 18, 2026
e541765
[KV Connector][Offloading] Flush all pending jobs on last step (#42611)
liranschour May 18, 2026
1ac10f1
Revert "[torch.compile] Add patch for fullgraph compilation" (#42686)…
vllm-agent May 18, 2026
f5d3dc7
[Model Runner v2] Support update_config (#42783)
mgoin May 18, 2026
78e7a7b
Refactor AWQ Marlin MoE onto modular WNA16 oracle (#42483)
bedeks May 18, 2026
4a39b4f
[Model] Add Apertus Tool Parser (#41154)
blancsw May 18, 2026
47829b1
[Bugfix] mamba: run single-token extends as decodes (#42430)
netanel-haber May 18, 2026
e267369
[Model Runner V2] Fix prompt logprobs calculation `Sizes of tensors m…
yewentao256 May 18, 2026
b12745e
Fix `--convert` passed without `--runner` on causal models (#42935)
hmellor May 18, 2026
8c296de
[Perf] Re-enable flashinfer autotune by default and cleanup (#42857)
wzhao18 May 18, 2026
67f58ce
[Bugfix] Fix DSV4 MTP after ROCm mHC integration (#42930)
mmangkad May 18, 2026
6859ca7
[Bugfix] fix swiglu limit issue for humming backend + deepseek v4 (#4…
jinzhen-lin May 18, 2026
a2c8fc6
[ROCm][Quantization][3/N] Refactor quark_moe w4a4 w/ oracle (#41436)
BowenBao May 18, 2026
9758a6e
[BugFix] support PP for Cohere vision model (#42819)
czhu-cohere May 18, 2026
00e20e7
[Refactor] Remove dead cuda kernels (#42767)
yewentao256 May 18, 2026
ce88f01
[Docs] update attribution to reflect EDEN foundation (#41666)
amitport May 18, 2026
8fc1c28
[ROCm] Guard AITER GDN decode fast path by layout (#42880)
tuukkjs May 18, 2026
8474748
Tier offload followup (#42529)
ronensc May 18, 2026
cd49a05
[Refactor] Remove dead code (#42889)
yewentao256 May 18, 2026
0191354
[Perf][MLA] Enable FULL cudagraph capture for TRITON_MLA decode (#42885)
haosdent May 18, 2026
57fef4e
[Refactor] Extract shared coerce_to_schema_type utility from Minimax …
sfeng33 May 18, 2026
37ece59
[Perf] Padded nvfp4 quant kernel to remove additional copy, 2.4%~5.7%…
yewentao256 May 18, 2026
a171e6b
Add parallel drafting to v2 model runner unsupported features (#43010)
shanjiaz May 18, 2026
f85c76d
[CI/Build] Bump nvidia-cutlass-dsl to 4.5.1 (#42991)
arpera May 18, 2026
239b5ff
[Frontend] Add --spec-method/--spec-model/--spec-tokens CLI aliases (…
mgoin May 19, 2026
287471b
[Model Refactoring] Migrate DeepSeek V4 to vllm/models/ [1/N] (#43004)
WoosukKwon May 19, 2026
afd7b1d
[Bugfix] Use platform-agnostic device in example_connector load (#42926)
revit13 May 19, 2026
8f16c4a
[BugFix][CPU][Spec Decode] Fix Eagle implementation on CPU backend (#…
ofirzaf May 19, 2026
36dcaf2
[XPU] add gptq(int4) support (#37844)
jikunshang May 19, 2026
da03e54
[UX] Add a persistent cache for FlashInfer autotuning (#42537)
mmangkad May 19, 2026
fba010d
[Bugfix][MRV2] Fix KVCache tensor explicit `kernel_block_size` dim (#…
NickLucche May 19, 2026
87b08c5
[Model Refactoring] Move DeepSeek V4 layers to `models/deepseek_v4/` …
WoosukKwon May 19, 2026
3ca8db2
add cutedsl dsv4 indexer fp8 kernel (#42899)
gnovack May 19, 2026
fab07e4
[Bugfix][KV Connector] Fix SimpleCPUOffloadScheduler TOCTOU between P…
qyYue1389 May 19, 2026
6e889b5
[ci] Route 28 gpu_1_queue tests to h200_35gb queue (#43030)
khluu May 19, 2026
27f4ba9
fix: use keyword arguments for shard_id and expert_id in weight_loade…
junyanxu May 19, 2026
9fd8487
[Docs] Add SVG images for pooling models. (#42626)
gracie-guo May 19, 2026
f1e3f0e
[XPU] Use custom op collective behavior (#41354)
chaojun-zhang May 19, 2026
4a4fdab
[Misc] Aligning tokwise pooler heads for consistency (#43041)
taneem-ibrahim May 19, 2026
257af77
[Docs] Reorganize online serving docs. (#41907)
noooop May 19, 2026
301d986
[Frontend] Consolidate beam search by BeamSearchMixin. (#42946)
noooop May 19, 2026
b14be81
[Model Refactoring] Move deepseek_v4_ops to models/deepseek_v4 [3/N] …
WoosukKwon May 19, 2026
f34623b
[bug] AsyncScheduler drops first post-resume token after pause_genera…
hao-aaron May 19, 2026
056bc2e
[KVConnector][DSV4] HMA support for Mooncake store connector (#42828)
ivanium May 19, 2026
07beaed
[Model Refactoring] Rename deepseek_v4.py to model.py [4/N] (#43077)
WoosukKwon May 19, 2026
ef54a4d
[Misc][MM] Remove redundant code in CLIPAttention (#43046)
shen-shanshan May 19, 2026
129019f
[CI] Add MTP + PD disagg test for Qwen3.5 (#42677)
ZhanqiuHu May 19, 2026
a78b842
[Bugfix] Fix top logprobs token placeholders in `/inference/v1/genera…
sagearc May 19, 2026
b82e908
[Perf][4/n] Eliminate various GPU<->CPU syncs (#42347)
njhill May 19, 2026
d740e2c
[XPU] update xpu graph usage (#43043)
xinyu-intel May 19, 2026
1c61580
[Model] Openvla support (#42654)
yiwen101 May 19, 2026
42b4f1f
[Refactor] Extract extract_types_from_schema utility from Minimax M2 …
sfeng33 May 19, 2026
8200fbe
[Misc] add humming to dependencies (#42540)
jinzhen-lin May 19, 2026
d247a93
[feat] Add FP8 per-tensor Q scale support to Triton attention backend…
DomBrown May 19, 2026
aed2eb3
[Docs] Fix MooncakeStoreConnector role in disaggregated example (#42994)
Dao007forever May 19, 2026
f54721b
[Bugfix][MoE] FlashInfer one-sided: workspace union across heterogene…
tomeras91 May 19, 2026
9aaf83e
[CI failure] Temporarily disable using persistent cache for flashinfe…
wzhao18 May 19, 2026
a65093c
[ci] Move language models tests (hybrid) back to L4 (#43129)
khluu May 19, 2026
1242196
[Model] Support post-norm architecture for EAGLE-3 supeculators (#42764)
Dogacel May 19, 2026
117afee
Fix error in Dynamic NTK scaling (#41277)
maxdebayser May 19, 2026
be16785
[CPU][DOC] Fix installation commands for Arm CPUs (#43115)
fadara01 May 19, 2026
73dd2f3
[bug] fix WeightTransferConfig.backend to allow for all strings (#43121)
hao-aaron May 20, 2026
39bba71
[MRV2][BugFix] Fix default-stream CG capture in P/W LoRA case (#43160)
njhill May 20, 2026
5774aae
[Cohere] Enable Cohere MoE (#43143)
Terrencezzj May 20, 2026
c628a93
[Perf][Bugfix] Update dflash aux layer indexing (#40727)
benchislett May 20, 2026
fadf5d3
add enqueue all option to throughput benchmark (#42975)
pmaybank May 20, 2026
2ae910e
[Perf] Avoid forward scan for async output placeholders (#42938)
izikgo May 20, 2026
cd0ff26
[CI] Add DSV4-Flash to gsm8k moe-refactor/config-b200.txt (#42111)
mgoin May 20, 2026
4f94089
[KV Offload] Pass `OffloadingSpec` instead of `VllmConfig` to seconda…
ronensc May 20, 2026
b2b2ef5
keep minimal diff for hopper
qianlihuang May 20, 2026
411e2fb
take gpt s advice
qianlihuang May 21, 2026
bab6268
take gemini s advice
qianlihuang May 21, 2026
81f04e2
fix comment
qianlihuang May 21, 2026
5 changes: 3 additions & 2 deletions .buildkite/ci_config.yaml
@@ -8,8 +8,9 @@ run_all_patterns:
  - "CMakeLists.txt"
  - "requirements/common.txt"
  - "requirements/cuda.txt"
- - "requirements/build.txt"
- - "requirements/test.txt"
+ - "requirements/kv_connectors.txt"
+ - "requirements/build/cuda.txt"
+ - "requirements/test/cuda.txt"
  - "setup.py"
  - "csrc/"
  - "cmake/"
4 changes: 2 additions & 2 deletions .buildkite/ci_config_intel.yaml
@@ -6,8 +6,8 @@ run_all_patterns:
  - "CMakeLists.txt"
  - "requirements/common.txt"
  - "requirements/xpu.txt"
- - "requirements/build.txt"
- - "requirements/test.txt"
+ - "requirements/build/cuda.txt"
+ - "requirements/test/cuda.txt"
  - "setup.py"
  - "csrc/"
  - "cmake/"
8 changes: 0 additions & 8 deletions .buildkite/hardware_tests/amd.yaml
@@ -20,11 +20,3 @@ steps:
  - docker push "rocm/vllm-ci:${BUILDKITE_COMMIT}"
  env:
  DOCKER_BUILDKIT: "1"
- retry:
- automatic:
- - exit_status: -1 # Agent was lost
- limit: 1
- - exit_status: -10 # Agent was lost
- limit: 1
- - exit_status: 1 # Machine occasionally fail
- limit: 1
36 changes: 27 additions & 9 deletions .buildkite/hardware_tests/cpu.yaml
@@ -12,13 +12,19 @@ steps:
  - vllm/_custom_ops.py
  - tests/kernels/attention/test_cpu_attn.py
  - tests/kernels/moe/test_cpu_fused_moe.py
+ - tests/kernels/moe/test_cpu_quant_fused_moe.py
  - tests/kernels/test_onednn.py
+ - tests/kernels/test_awq_int4_to_int8.py
+ - tests/kernels/quantization/test_cpu_fp8_scaled_mm.py
  commands:
  - |
- bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 20m "
+ bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 30m "
  pytest -x -v -s tests/kernels/attention/test_cpu_attn.py
  pytest -x -v -s tests/kernels/moe/test_cpu_fused_moe.py
- pytest -x -v -s tests/kernels/test_onednn.py"
+ pytest -x -v -s tests/kernels/moe/test_cpu_quant_fused_moe.py
+ pytest -x -v -s tests/kernels/test_onednn.py
+ pytest -x -v -s tests/kernels/test_awq_int4_to_int8.py
+ pytest -x -v -s tests/kernels/quantization/test_cpu_fp8_scaled_mm.py"

  - label: CPU-Compatibility Tests
  depends_on: []
@@ -44,7 +50,7 @@ steps:
  - tests/models/language/pooling/
  commands:
  - |
- bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 30m "
+ bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 40m "
  pytest -x -v -s tests/models/language/generation -m cpu_model
  pytest -x -v -s tests/models/language/pooling -m cpu_model"

@@ -55,23 +61,24 @@ steps:
source_file_dependencies:
- csrc/cpu/
- vllm/model_executor/layers/quantization/cpu_wna16.py
- vllm/model_executor/layers/quantization/gptq_marlin.py
- vllm/model_executor/layers/quantization/auto_gptq.py
- vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_int8.py
- vllm/model_executor/layers/quantization/kernels/scaled_mm/cpu.py
- vllm/model_executor/layers/quantization/kernels/mixed_precision/cpu.py
- vllm/model_executor/layers/fused_moe/experts/cpu_moe.py
- tests/quantization/test_compressed_tensors.py
- tests/quantization/test_cpu_wna16.py
commands:
- |
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 20m "
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 30m "
pytest -x -v -s tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs
pytest -x -v -s tests/quantization/test_cpu_wna16.py"

- label: CPU-Distributed Tests
- label: CPU-Distributed Tests (PP+TP)
depends_on: []
device: intel_cpu
no_plugin: true
source_file_dependencies:
source_file_dependencies: &cpu_distributed_deps
- csrc/cpu/shm.cpp
- vllm/v1/worker/cpu_worker.py
- vllm/v1/worker/gpu_worker.py
@@ -80,10 +87,21 @@ steps:
  - vllm/platforms/cpu.py
  - vllm/distributed/parallel_state.py
  - vllm/distributed/device_communicators/cpu_communicator.py
+ - .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh
+ commands:
+ - |
+ bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 10m "
+ bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh tp_pp"
+
+ - label: CPU-Distributed Tests (DP+TP)
+ depends_on: []
+ device: intel_cpu
+ no_plugin: true
+ source_file_dependencies: *cpu_distributed_deps
  commands:
  - |
  bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 10m "
- bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh"
+ bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh dp_tp"

  - label: CPU-Multi-Modal Model Tests %N
  depends_on: []
@@ -97,7 +115,7 @@ steps:
  - |
  bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 45m "
  pytest -x -v -s tests/models/multimodal/generation --ignore=tests/models/multimodal/generation/test_pixtral.py -m cpu_model --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --shard-id=$$BUILDKITE_PARALLEL_JOB"
- parallelism: 2
+ parallelism: 3

  - label: "Arm CPU Test"
  depends_on: []
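
Review note: the `--num-shards` / `--shard-id` flags above split the multimodal tests across the step's `parallelism` jobs, so raising `parallelism` from 2 to 3 changes which tests each job runs. How the pytest plugin actually assigns tests to shards is not shown in this diff. The sketch below assumes a plain round-robin split by index, purely for illustration; `select_shard` is a hypothetical helper, not something in the repository.

```bash
#!/bin/bash
# Hypothetical round-robin sharding: item i goes to shard (i % num_shards).
# The real plugin may use a different assignment (for example, hashing test IDs).
select_shard() {
    local num_shards="$1" shard_id="$2"
    shift 2
    local i=0 item
    for item in "$@"; do
        if (( i % num_shards == shard_id )); then
            echo "$item"
        fi
        i=$((i + 1))
    done
}

# With 3 shards, shard 1 receives the items at index 1 and 4.
select_shard 3 1 test_a test_b test_c test_d test_e   # prints test_b, then test_e
```

Under this assumed scheme every test lands in exactly one shard, so adding a third job shortens each job without dropping coverage.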
7 changes: 0 additions & 7 deletions .buildkite/hardware_tests/intel.yaml
@@ -8,10 +8,3 @@ steps:
  commands:
  - bash .buildkite/scripts/hardware_ci/run-hpu-test.sh

- - label: "Intel GPU Test"
- depends_on: []
- soft_fail: true
- device: intel_gpu
- no_plugin: true
- commands:
- - bash .buildkite/scripts/hardware_ci/run-xpu-test.sh
5 changes: 3 additions & 2 deletions .buildkite/image_build/image_build.sh
@@ -92,8 +92,8 @@ check_and_skip_if_image_exists() {
  }

  ecr_login() {
- aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY"
- aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 936637512419.dkr.ecr.us-east-1.amazonaws.com
+ aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY" || true
+ aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 936637512419.dkr.ecr.us-east-1.amazonaws.com || true
  }

  prepare_cache_tags() {
@@ -192,6 +192,7 @@ export BUILDKITE_COMMIT
  export PARENT_COMMIT
  export IMAGE_TAG
  export IMAGE_TAG_LATEST
+ export COMMIT="${COMMIT:-${BUILDKITE_COMMIT}}"
  export CACHE_FROM
  export CACHE_FROM_BASE_BRANCH
  export CACHE_FROM_MAIN
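
Review note: the added `export COMMIT=...` line relies on the `${VAR:-default}` expansion, which keeps an existing non-empty `COMMIT` and otherwise falls back to `BUILDKITE_COMMIT`. A minimal sketch of that behavior follows; the `resolve_commit` function and the commit values are made up for illustration and are not part of the PR.

```bash
#!/bin/bash
# Illustrative only: mirrors the fallback in
#   export COMMIT="${COMMIT:-${BUILDKITE_COMMIT}}"
resolve_commit() {
    # $1 = caller-provided COMMIT (may be empty), $2 = BUILDKITE_COMMIT
    local COMMIT="$1"
    local BUILDKITE_COMMIT="$2"
    echo "${COMMIT:-${BUILDKITE_COMMIT}}"
}

resolve_commit "" "abc1234"          # prints abc1234 (falls back)
resolve_commit "def5678" "abc1234"   # prints def5678 (explicit value wins)
```

Because `:-` treats an empty string the same as an unset variable, exporting `COMMIT=""` upstream still yields the Buildkite commit.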
2 changes: 1 addition & 1 deletion .buildkite/image_build/image_build_cpu.sh
@@ -11,7 +11,7 @@ REPO=$2
  BUILDKITE_COMMIT=$3

  # authenticate with AWS ECR
- aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY"
+ aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY" || true

  # skip build if image already exists
  if [[ -z $(docker manifest inspect "$REGISTRY"/"$REPO":"$BUILDKITE_COMMIT"-cpu) ]]; then
2 changes: 1 addition & 1 deletion .buildkite/image_build/image_build_cpu_arm64.sh
@@ -11,7 +11,7 @@ REPO=$2
  BUILDKITE_COMMIT=$3

  # authenticate with AWS ECR
- aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY"
+ aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY" || true

  # skip build if image already exists
  if [[ -z $(docker manifest inspect "$REGISTRY"/"$REPO":"$BUILDKITE_COMMIT"-arm64-cpu) ]]; then
68 changes: 68 additions & 0 deletions .buildkite/image_build/image_build_torch_nightly.sh
@@ -0,0 +1,68 @@
#!/bin/bash
set -euo pipefail

# Build a vLLM test image with PyTorch nightly installed.
# Called by the pipeline generator's "vLLM Against PyTorch Nightly" group.

if [[ $# -lt 5 ]]; then
echo "Usage: $0 <registry> <repo> <commit> <branch> <image_tag>"
exit 1
fi

REGISTRY=$1
REPO=$2
BUILDKITE_COMMIT=$3
BRANCH=$4
IMAGE_TAG=$5

# --- Arguments ---
echo "--- :mag: Arguments"
echo "REGISTRY: ${REGISTRY}"
echo "REPO: ${REPO}"
echo "BUILDKITE_COMMIT: ${BUILDKITE_COMMIT}"
echo "BRANCH: ${BRANCH}"
echo "IMAGE_TAG: ${IMAGE_TAG}"

# --- ECR login ---
echo "--- :key: ECR login"
aws ecr-public get-login-password --region us-east-1 \
| docker login --username AWS --password-stdin "$REGISTRY"
aws ecr get-login-password --region us-east-1 \
| docker login --username AWS --password-stdin 936637512419.dkr.ecr.us-east-1.amazonaws.com

# --- Set up buildx ---
echo "--- :docker: Setting up buildx"
docker buildx create --name vllm-builder --driver docker-container --use || true
docker buildx inspect --bootstrap
docker buildx ls

# --- Skip if image already exists ---
echo "--- :mag: Checking if image already exists"
if docker manifest inspect "$IMAGE_TAG" >/dev/null 2>&1; then
echo "Image found: $IMAGE_TAG — skipping build"
exit 0
fi
echo "Image not found, proceeding with build..."

# --- CUDA 13.0 for nightly builds ---
# Nightly CI uses CUDA 13.0 while regular CI stays on CUDA 12.9
NIGHTLY_CUDA_VERSION="13.0.2"
NIGHTLY_BUILD_BASE_IMAGE="nvidia/cuda:${NIGHTLY_CUDA_VERSION}-devel-ubuntu22.04"
NIGHTLY_FINAL_BASE_IMAGE="nvidia/cuda:${NIGHTLY_CUDA_VERSION}-base-ubuntu22.04"

echo "--- :docker: Building torch nightly image (CUDA ${NIGHTLY_CUDA_VERSION})"
docker buildx build --file docker/Dockerfile \
--build-arg max_jobs=16 \
--build-arg buildkite_commit="$BUILDKITE_COMMIT" \
--build-arg USE_SCCACHE=1 \
--build-arg PYTORCH_NIGHTLY=1 \
--build-arg CUDA_VERSION="${NIGHTLY_CUDA_VERSION}" \
--build-arg BUILD_BASE_IMAGE="${NIGHTLY_BUILD_BASE_IMAGE}" \
--build-arg FINAL_BASE_IMAGE="${NIGHTLY_FINAL_BASE_IMAGE}" \
--build-arg torch_cuda_arch_list="8.0 8.9 9.0 10.0 12.0" \
--tag "$IMAGE_TAG" \
--push \
--target test \
--progress plain .

echo "--- :white_check_mark: Torch nightly image build complete: $IMAGE_TAG"
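
Review note: unlike the other image-build scripts touched in this PR, which now append `|| true` to their `docker login` pipelines, this new script runs its logins under `set -euo pipefail` without that guard, so a failed login would abort the build. If the tolerant behavior were wanted here as well, one option is a small wrapper that logs a warning rather than silently swallowing the failure. The `try_login` name is hypothetical, and `true` / `false` stand in for the real `aws ... | docker login ...` pipeline.

```bash
#!/bin/bash
set -e  # the part of `set -euo pipefail` that matters for this example

# Hypothetical helper: run a command, warn on failure, never abort the script.
try_login() {
    if "$@"; then
        echo "login ok: $*"
    else
        echo "warning: login failed (continuing): $*" >&2
    fi
}

try_login true    # stand-in for a successful login
try_login false   # stand-in for a failed login; the script keeps running
echo "build continues"
```

A command tested by `if` does not trigger `set -e`, which is why the failing call does not stop the script.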
21 changes: 21 additions & 0 deletions .buildkite/intel_jobs/engine_intel.yaml
@@ -0,0 +1,21 @@
group: Engine Intel
depends_on:
- image-build-xpu
steps:
- label: Engine (1 GPU)
timeout_in_minutes: 30
device: intel_gpu
no_plugin: true
working_dir: "."
env:
REGISTRY: "public.ecr.aws/q9t5s3a7"
REPO: "vllm-ci-test-repo"
VLLM_TEST_DEVICE: "xpu"
source_file_dependencies:
- vllm/v1/engine/
- tests/v1/engine/
commands:
- >-
bash .buildkite/scripts/hardware_ci/run-intel-test.sh
'cd tests &&
pytest -v -s v1/engine --ignore v1/engine/test_preprocess_error_handling.py'
21 changes: 21 additions & 0 deletions .buildkite/intel_jobs/kernels_intel.yaml
@@ -0,0 +1,21 @@
group: Kernels Intel
depends_on:
- image-build-xpu
steps:
- label: vLLM IR Tests
timeout_in_minutes: 30
device: intel_gpu
no_plugin: true
working_dir: "."
env:
REGISTRY: "public.ecr.aws/q9t5s3a7"
REPO: "vllm-ci-test-repo"
VLLM_TEST_DEVICE: "xpu"
source_file_dependencies:
- vllm/ir
- vllm/kernels
commands:
- >-
bash .buildkite/scripts/hardware_ci/run-intel-test.sh
'cd tests &&
pytest -v -s kernels/ir'