Skip to content

Update vllm/vllm-openai Docker tag to v0.23.0#28

Open
renovate[bot] wants to merge 1 commit into
mainfrom
renovate/vllm-vllm-openai-0.x
Open

Update vllm/vllm-openai Docker tag to v0.23.0#28
renovate[bot] wants to merge 1 commit into
mainfrom
renovate/vllm-vllm-openai-0.x

Conversation

@renovate

@renovate renovate Bot commented Apr 7, 2026

Copy link
Copy Markdown

ℹ️ Note

This PR body was truncated due to platform limits.

This PR contains the following updates:

Package Update Change
vllm/vllm-openai (source) minor v0.17.1v0.23.0

Warning

Some dependencies could not be looked up. Check the Dependency Dashboard for more information.


Release Notes

vllm-project/vllm (vllm/vllm-openai)

v0.23.0

Compare Source

vLLM v0.23.0 Release Notes

Please note that Minimax M3 is not yet supported in this version. Please follow vLLM recipe for usage guides for M3.

Highlights

This release features 408 commits from 200 contributors (63 new)!

  • DeepSeek-V4 matures across backends: Following its introduction in v0.22.0, DeepSeek-V4 received another large hardening and optimization pass. Its sparse MLA metadata is now decoupled from DeepSeek-V3.2 (#​44699), it gained a TRTLLM-gen attention kernel (#​43827), EPLB support for the Mega-MoE (#​43339), selective prefix-cache retention for sliding-window KV cache (#​43447), and an index-share feature for DSA MTP (#​44420). The model was also detached from torch.compile (#​43746, #​43891), its attention and RoPE paths were refactored (#​44569, #​44262, #​43926), and an XPU attention decode path was added (#​42953).
  • Model Runner V2 expands to more dense models: MRv2 is now selected by default for Llama and Mistral dense models (#​43458) in addition to Qwen3. It gained a FlashInfer sampler (#​42472), breakable CUDA graphs (#​44050), pipeline-parallel bubble elimination (#​42187), kernel block-size support for hybrid models (#​38831), and Gemma 4 MTP (#​43241).
  • Rust frontend grows up: The experimental Rust frontend added a streaming generate endpoint (#​43779), dynamic LoRA endpoints (#​43778), /version (#​43854) and /server_info (#​43942) endpoints, a server-router extension hook (#​43774), request-ID headers (#​43883), and many new tool parsers (InternLM2 #​43481, hy_v3 #​43872, Phi-4-mini #​44213, Gemma4 #​43850).
  • Gemma 4: Added encoder-free Gemma 4 Unified support (#​44429) and Gemma 4 MTP (#​43241), plus numerous accuracy and startup fixes.
  • Transformers v5 compatibility: vLLM now targets Transformers v5, with vendored MiniCPM-V/O processors (#​44282) and compatibility fixes for Sarvam (#​38804) and Voxtral (#​44559).
  • Multi-tier KV cache offloading: The offloading framework gained an object-store secondary tier (#​41968), HMA enabled by default for capable connectors (#​41847), tiering support for HMA models (#​44287), and a per-request offloading policy via the on_new_request lifecycle hook (#​43205).
  • Unified parser: Reasoning and tool-call parsing are now unified behind a single Parser.parse() interface (#​44267), with the Responses parser migrated to it (#​42977).
Model Support
Engine Core
  • Model Runner V2: Default for Llama and Mistral dense models (#​43458), FlashInfer sampler (#​42472), breakable CUDA graphs (#​44050), removed Eagle's dedicated CUDA graph pool (#​44078), pipeline-parallel bubble elimination (#​42187), kernel block size for hybrid models (#​38831), zeroing of freshly allocated KV blocks for hybrid + FP8 KV cache (#​43990), actual batch max_seq_len for attention metadata (#​43991), rejection-sampling acceptance-rate fix (#​40651), KVConnector + PP cleanup (#​43732), speculator-prefill warmup/capture (#​44253).
  • Speculative decoding (DFlash): Causal DFlash (#​43445), proper lookahead-slot allocation (#​43733), prefix-cache corruption fix (#​42971); independent drafter attention-backend selection (#​39930), attention-group split by num_heads_q for drafts (#​43543), EAGLE/MTP lookahead caching in the SWA prefix-cache mask (#​44082).
  • Attention & hybrid/Mamba: FlexAttention/FlashAttention num-blocks-first layouts (#​42095), OOT MLA prefill backend registration (#​43325), FlashAttention upstream sync (#​44065), Mamba LINEAR attention-module refactor (#​43556), corrupted MLA + linear attention fix (#​43961), KDA conv-state unification (#​44539) and gate/cumsum fusion (#​43667), Mamba SSD do_not_specialize (#​43803), Qwen3.5 mixed prefill+decode split routing (#​44700), MiniMax-M2 gate kernel (#​38445).
  • KV cache & scheduler: Pluggable KVCacheSpec (#​37505), scheduler_block_size threaded into KVCacheManager/Coordinator (#​44165), max_concurrent_batches moved to VllmConfig (#​44274), config validation rejecting 0/negative knobs (#​43794, #​44057, #​44207), KV-cache scale boilerplate removed from weight loading (#​43167).
  • Core: Freeze the garbage collector in workers after model init (#​44363), sparse NCCL weight transfer for in-place updates (#​40096), graceful spinloop ext-load failure handling (#​43659), scheduled-function deprecations (#​43358).
Large Scale Serving & Distributed
  • KV cache offloading: Object-store secondary tier (#​41968), HMA on by default for capable connectors (#​41847) and tiering (#​44287), per-request offloading policy (on_new_request) (#​43205) and on_schedule_end() hook (#​44206), token-offset selective offload (#​39983), skip decode-phase blocks in CPU offload (#​43797), page-size block alignment (#​43689), Triton fast-path for small CPU→GPU swap_blocks_batch (#​42212), stale sliding-window block fix (#​42959).
  • KV connectors / disaggregated serving: PP-aware handshake aggregation and intermediate-PP output plumbing (#​43720), multiple-async-KV-load deadlock fix (#​44560), Nixl Mamba prefix-caching mode (#​42554), NixlConnector kv_both role deprecation cycle (#​43874), Mooncake fixes (#​43742, #​44103, #​42694), LMCache LMCacheMPConnector (#​42865), EC connector shutdown API (#​42423) and non-blocking lookup (#​41627), KV-transfer tokens excluded from iteration_tokens_total (#​43346).
  • EPLB: Async EPLB by default (#​43219), EPLB for DeepSeek-V4 Mega-MoE (#​43339), Nixl zero-copy EPLB transfers (#​41633).
  • Data parallel: DP Ray placement groups on specific nodes (#​44669) and grouped-node allocation fix (#​43998), SSL for the DP supervisor (#​43688), DP-coordinator startup timeout raised to 120s (#​42343), per-GPU-worker RDMA NIC selection (#​42083).
Hardware & Performance
  • NVIDIA / kernels: FP8 FlashInfer attention for ViT (#​38065), Triton MoE backend on Hopper by default (#​44220), CUTLASS FP8 scaled-mm padding bypass (+20%) (#​43706), MoE-permute buffer pre-allocation (+9–14%) (#​43014), Fp8BlockScaledMM new_empty() optimization (#​43677), TurboQuant shared dequant buffers (#​40941), tuned selective_state_update for H200/RTX PRO (#​44251), Inductor fast-path fallback for vLLM/AITER custom ops (#​42129), Gemma RMS all-reduce fusion (#​42646), NUMA auto-binding on DGX B300 (#​43270).
  • AMD ROCm: ROCm 7.2.3 (#​43136), AITER v0.1.13.post1 (#​44265), native W4A16 (#​41394) and fused-MoE W4A16 HIP (#​44075) kernels for RDNA3 (gfx1100), AITER top-k/top-p sampler by default (#​43331), attention-sink support in AITER FA (#​43817), AITER hipBLASLt GEMM online tuning (#​40426), permute_cols for ROCm (#​44674), blocks-first KV layout for AMD (#​43660), N=5 wvSplitK for spec decode (#​40687), MoRI connector improvements (#​43303, #​41751, #​40344).
  • Intel XPU: vllm-xpu-kernel v0.1.7 (#​41019), block_fp8_moe (#​42139), block-scaled W8A8 FP8 path (#​39968), WNA16 oracle for GPTQ sym-int4 (#​41426), rms_norm/act quant fusions (#​43963), GDN-attention MTP (#​43565), Triton selective-scan op (#​43421), transparent sleep mode (#​37149), CPU/tiering offloading on XPU (#​36423), DeepSeek-V4 attention decode path (#​42953).
  • CPU & other architectures: zentorch-accelerated W8A8/W4A16 on AMD Zen CPUs (#​41813), CPU top-k/top-p Triton sampling (#​43633), non-divisible GQA decode in mixed batches (#​43032), cpu_awq folded into awq_marlin (#​43841), RISC-V RVV WNA16 helpers (#​42730), fused GDN gated-delta-rule kernels (#​43534), PowerPC SHM communicator (#​43754), arm64 CI image (#​41303).
  • TPU: tpu-inference upgraded to v0.20.0 (#​43394) then v0.21.0 (#​44621).
  • torch stable ABI: Continued migration of kernels to the libtorch stable ABI — merge_attn_states/mamba/sampler [8/n] (#​43361), attention/cache kernels [9/n] (#​43717), header files (#​44013), cuda_view/silu_and_mul [10/n] (#​44334), custom all-reduce/DeepSeek-V4 fused MLA/MXFP8 MoE [10b/n] (#​44365); ROCm fallback to regular ABI (#​44648), _has_module trial-import verification (#​44035).
Quantization
  • ModelOpt: LM-head quantization (#​42124), MXFP8 non-gated MoE (#​42958).
  • compressed-tensors: WNA8O8Int linears and WNInt embeddings (#​44340), asymmetric MoE WNA16 Marlin (#​44025), single-class NVFP4 linear refactor (#​42443).
  • Kernels & backends: Triton W4A16 as CUDA fallback for non-Marlin-aligned shapes (#​43731), Marlin MoE on SM 12.x (#​40923), Machete W4A16 tests (#​35450), fail-fast for unsupported NVFP4 KV-cache-dtype arch (#​43669), CuteDSL compressor 128-split kernel optimization (#​44230).
  • MoE refactor (oracle): Migrated ModelOpt MXFP8 (#​42768), W4A8-int8 (#​42789), and WNA16 backend selection (#​42553) into the modular-kernel oracle; removed supports_expert_map (#​43108) and the inplace fused-experts mechanism (#​43727).
API & Frontend
  • Anthropic Messages API: Structured output and effort support (#​42396), system-role messages inside the messages array (#​44283).
  • OpenAI / Responses API: system_fingerprint field (#​40537), streaming tool/function calling with required (#​40700), chat_template_kwargs in Responses (#​43761), developer-to-system conversion in the HF renderer (#​43590), unstreamed tool-call-args streaming fix (#​44348).
  • Parsers: Unified reasoning + tool-call parsing behind Parser.parse() (#​44267), Responses parser migrated to the unified interface (#​42977), unstreamed tool-arg flush moved into the parser (#​44017); new/fixed tool parsers — MiniCPM5 XML (#​43175), Qwen3 XML JSON-args-first (#​43243), DeepSeek DSML incremental streaming (#​42879), first-args-chunk serializer fix (#​42683), tool_choice="none" honored in streaming (#​42752), null-tool-args crash fix (#​43862).
  • Frontend: thinking_token_budget validation (#​43402), GPT-OSS instruction rendering (#​44330), Harmony stop_token_ids cleanup (#​44009), consistent VLLMValidationError in chat/completion validators (#​36254), consolidation of dev entrypoints (#​44170) and online-serving utils (#​44479).
  • Rust frontend: Streaming generate endpoint (#​43779), dynamic LoRA endpoints (#​43778), /version (#​43854) and /server_info (#​43942), server-router extension hook (#​43774), --enable-request-id-headers (#​43883), recursive tool-parameter conversion (#​44299), include_reasoning=false (#​44391), --language-model-only skips the multimodal processor (#​44500), per-engine batch auto-abort (#​44591), UTF-8 char-boundary detokenizer fix (#​44620), HF chat-template fixes (#​44311), cross-DP aggregation of is_sleeping/reset_prefix_cache (#​43429); new tool parsers — InternLM2 (#​43481), hy_v3 (#​43872), Phi-4-mini JSON (#​44213), Gemma4 (#​43850).
  • Benchmarks: Timed trace replay for Moonshot/Alibaba workloads in vllm bench serve (#​39795), reasoning-model (thinking) benchmarking via --chat-template-kwargs (#​44244).
Security
  • Transport encryption: SSL/TLS support for the data-parallel supervisor (#​43688).
  • Untrusted-input hardening: Reject out-of-vocabulary token IDs before they reach the GPU logprob path (#​44042) and fix a UTF-8 char-boundary panic in the Rust incremental detokenizer on malformed input (#​44620), both of which prevent request-triggered crashes.
  • Parameter validation: Reject invalid thinking_token_budget values (#​43402), non-positive ParallelConfig integer knobs (#​44057), zero-valued config fields (#​43794), and out-of-range max_num_scheduled_tokens (#​44207).
Dependencies
Deprecations
  • Deprecate JAISLMHeadModel (#​43784).
  • Begin the deprecation cycle for the NixlConnector kv_both role (#​43874).
  • Remove functions previously scheduled for deprecation in v0.21.0 (#​43358).
New Contributors
Contributors

Thank you to everyone who made this release possible!

@​AndreasKaratzas, @​WoosukKwon, @​BugenZhao, @​yewentao256, @​hmellor, @​khluu, @​njhill, @​sfeng33, @​bnellnm, @​vadiklyutiy, @​NickLucche, @​JartX, @​lucianommartins, @​cleonard530, @​wzhao18, @​yma11, @​simondanielsson, @​jeejeelee, @​zyongye, @​chaunceyjiang, @​bigPYJ1151, @​ronensc, @​taneem-ibrahim, @​LucasWilkinson, @​MatthewBonanni, @​mmangkad, @​chunyang-wen, @​yzong-rh, @​JaredforReal, @​zixi-qi, @​Isotr0py, @​noooop, @​chaojun-zhang, @​Xunzhuo, @​ivanium, @​zufangzhu, @​DaoyuanLi2816, @​CienetStingLin, @​aoshen02, @​akii96, @​benchislett, @​MengqingCao, @​rshavitt, @​kliuae, @​omerpaz95, @​willamhou, @​Majid-Taheri, @​micah-wil, @​ricky-chaoju, @​mikekg, @​mgoin, @​mayuyuace, @​Etelis, @​ilmarkov, @​tlrmchlsmth, @​UranusSeven, @​bedeks, @​izhuhaoran, @​ZJY0516, @​fadara01, @​pschlan-amd, @​wangxiyuan, @​Oxygen56, @​charlifu, @​varun-sundar-rabindranath, @​shen-shanshan, @​TheEpicDolphin, @​adobrzyn, @​XuZhou26, @​tjtanaa, @​Terrencezzj, @​zhejiangxiaomai, @​ILikeIneine, @​yubofredwang, @​chfeng-cs, @​ThibaultCastells, @​linzm1007, @​javierdejesusda, @​meenchen, @​zhewenl, @​xyang16, @​angelayi, @​nholmber, @​zhangtao2-1, @​adityasingh2400, @​sts07142, @​jatseng-ai, @​fallintoplace, @​andakai, @​he-yufeng, @​ignaciosica, @​JINO-ROHIT, @​tonyliu312, @​QwertyJack, @​animeshtrivedi, @​jzakrzew, @​juliendenize, @​zexplorerhj, @​ruocco, @​mgehre-amd, @​jasonboukheir, @​MaciejBalaNV, @​JohnQinAMD, @​huanghua1994, @​rajkiranjoshi, @​rasmith, @​harshaljanjani, @​ltd0924, @​wdhongtw, @​yintong-lu, @​tianmu-li, @​jikunshang, @​JMonde, @​MHYangAMD, @​frida-andersson, @​gau-nernst, @​Wauplin, @​czhu-cohere, @​gagandhakrey, @​nemanjaudovic, @​Liangliang-Ma, @​liulanze, @​sphinx07, @​aadwived, @​nightcityblade, @​umut-polat, @​jeffreywang88, @​wcynb1023, @​zzt93, @​shadeMe, @​Dao007forever, @​alec-flowers, @​Krishnachaitanyakc, @​orozery, @​BWAAEEEK, @​cinnamonica02, @​albertoperdomo2, @​Rukhaiya2004, @​mfylcek, @​shreyas269, @​Gruner-atero, @​TomerBN-Nvidia, @​wjinxu, @​IdoAtadTD, @​xiaozcy, @​brian-dellabetta, @​zhenwei-intel, @​adotdad, @​Kartavyasonar, @​lesj0610, @​ECMGit, @​cakeng, @​william-rom, @​qiching, @​NolanHo, @​andylolu2, @​xwu-intel, @​linitra24, @​hoobnn, @​Dymasik, @​wanghenshui, @​maobaolong, @​oguzhankir, @​Jie-Fang, @​okorzh-amd, @​Kevin-XiongC, @​jiahanc, @​garrygale, @​dsikka, @​QiliangCui2023, @​wjabbour, @​zvik, @​tc-mb, @​jwzheng96, @​divakar-amd, @​tushar00jain, @​galletas1712, @​hanlin12-AMD, @​tuukkjs, @​viiccwen, @​Sunt-ing, @​HueCodes, @​tianyu-z, @​adhithyamulticoreware, @​rishitdholakia13, @​effi-ofer, @​Vikrantpalle, @​walterbm, @​devin-lai, @​Yadan-Wei, @​amd-fuweiy, @​maeehart, @​qyYue1389, @​BramVanroy, @​SunskyXH, @​Holworth, @​majian4work, @​xaguilar-amd, @​Rohan138

v0.22.1

Compare Source

Highlights

This release features 8 commits from 6 contributors (1 new)!

v0.22.1 is a patch release on top of v0.22.0 with targeted bug fixes plus a couple of additions: new model support for JetBrains' Mellum v2, zentorch-accelerated quantized linear inference on AMD Zen CPUs, and fixes for multi-node Ray data-parallel serving, DeepSeek-V4 initialization, and a few model-loading regressions.

Model Support
  • New model: JetBrains' Mellum v2, an open-weights Mixture-of-Experts code-generation mod

Note

PR body was truncated to here.


Configuration

📅 Schedule: (in timezone Europe/Zurich)

  • Branch creation
    • At any time (no schedule defined)
  • Automerge
    • At any time (no schedule defined)

🚦 Automerge: Enabled.

Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

@renovate renovate Bot added the renovate label Apr 7, 2026
@renovate renovate Bot force-pushed the renovate/vllm-vllm-openai-0.x branch from 868b291 to 826ad73 Compare April 8, 2026 18:39
@renovate renovate Bot changed the title Update vllm/vllm-openai Docker tag to v0.19.0 Update vllm/vllm-openai Docker tag to v0.19.1 Apr 18, 2026
@renovate renovate Bot force-pushed the renovate/vllm-vllm-openai-0.x branch from 826ad73 to 3d12f5e Compare April 18, 2026 09:23
@renovate renovate Bot force-pushed the renovate/vllm-vllm-openai-0.x branch from 3d12f5e to 9822382 Compare April 27, 2026 13:44
@renovate renovate Bot changed the title Update vllm/vllm-openai Docker tag to v0.19.1 Update vllm/vllm-openai Docker tag to v0.20.0 Apr 27, 2026
@renovate renovate Bot force-pushed the renovate/vllm-vllm-openai-0.x branch from 9822382 to e6b5a67 Compare May 3, 2026 13:37
@renovate renovate Bot changed the title Update vllm/vllm-openai Docker tag to v0.20.0 Update vllm/vllm-openai Docker tag to v0.20.1 May 3, 2026
@renovate renovate Bot force-pushed the renovate/vllm-vllm-openai-0.x branch from e6b5a67 to d49e0f8 Compare May 9, 2026 13:15
@renovate renovate Bot changed the title Update vllm/vllm-openai Docker tag to v0.20.1 Update vllm/vllm-openai Docker tag to v0.20.2 May 9, 2026
@renovate renovate Bot force-pushed the renovate/vllm-vllm-openai-0.x branch from d49e0f8 to c973e91 Compare May 15, 2026 01:32
@renovate renovate Bot changed the title Update vllm/vllm-openai Docker tag to v0.20.2 Update vllm/vllm-openai Docker tag to v0.21.0 May 15, 2026
@renovate renovate Bot force-pushed the renovate/vllm-vllm-openai-0.x branch from c973e91 to b1a6fe1 Compare May 29, 2026 11:44
@renovate renovate Bot changed the title Update vllm/vllm-openai Docker tag to v0.21.0 Update vllm/vllm-openai Docker tag to v0.22.0 May 29, 2026
@renovate renovate Bot force-pushed the renovate/vllm-vllm-openai-0.x branch from b1a6fe1 to 2fe2467 Compare June 5, 2026 10:02
@renovate renovate Bot changed the title Update vllm/vllm-openai Docker tag to v0.22.0 Update vllm/vllm-openai Docker tag to v0.22.1 Jun 5, 2026
@renovate renovate Bot force-pushed the renovate/vllm-vllm-openai-0.x branch from 2fe2467 to f34b4ca Compare June 13, 2026 02:07
@renovate renovate Bot changed the title Update vllm/vllm-openai Docker tag to v0.22.1 Update vllm/vllm-openai Docker tag to v0.23.0 Jun 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants