Update vllm/vllm-openai Docker tag to v0.23.0 #28
Open
renovate[bot] wants to merge 1 commit into
This PR contains the following updates:
v0.17.1 → v0.23.0
Warning: Some dependencies could not be looked up. Check the Dependency Dashboard for more information.
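Because this update jumps from v0.17.1 to v0.23.0 rather than moving to the next release, it can help to quantify the gap before relying on automerge. The following is a minimal sketch, not part of Renovate or vLLM: the helper names `parse_tag`, `classify_bump`, and `minor_gap` are hypothetical, and the tag format `vMAJOR.MINOR.PATCH` is assumed from the two tags shown above.

```python
import re


def parse_tag(tag: str) -> tuple[int, int, int]:
    """Parse a tag such as 'v0.23.0' into (major, minor, patch)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"unrecognized tag: {tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def classify_bump(old: str, new: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for an old -> new update."""
    old_v, new_v = parse_tag(old), parse_tag(new)
    if new_v <= old_v:
        return "none"
    if new_v[0] != old_v[0]:
        return "major"
    if new_v[1] != old_v[1]:
        return "minor"
    return "patch"


def minor_gap(old: str, new: str) -> int:
    """Difference between minor numbers when the major version is unchanged."""
    old_v, new_v = parse_tag(old), parse_tag(new)
    if old_v[0] != new_v[0]:
        raise ValueError("major versions differ; minor gap is not meaningful")
    return new_v[1] - old_v[1]


print(classify_bump("v0.17.1", "v0.23.0"))  # minor
print(minor_gap("v0.17.1", "v0.23.0"))      # 6
```

A gap of six minor numbers suggests reading the intermediate release notes as well, since the excerpt below only covers v0.23.0 and v0.22.1.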
Release Notes
vllm-project/vllm (vllm/vllm-openai)
v0.23.0
vLLM v0.23.0 Release Notes
Note that MiniMax M3 is not yet supported in this version. Follow the vLLM recipe for M3 usage guides.
Highlights
This release features 408 commits from 200 contributors (63 new)!
torch.compile (#43746, #43891), its attention and RoPE paths were refactored (#44569, #44262, #43926), and an XPU attention decode path was added (#42953).
generate endpoint (#43779), dynamic LoRA endpoints (#43778), /version (#43854) and /server_info (#43942) endpoints, a server-router extension hook (#43774), request-ID headers (#43883), and many new tool parsers (InternLM2 #43481, hy_v3 #43872, Phi-4-mini #44213, Gemma4 #43850).
on_new_request lifecycle hook (#43205).
Parser.parse() interface (#44267), with the Responses parser migrated to it (#42977).
Model Support
fetch_audio for transformers ≥ 5.10 (#44559).
torch.compile (#43617), EVS for Qwen3-VL (#44205), GLM-5.1 PP loading (#42944), GLM-4.1V processor logits (#43575), GLM-4.6V video loader (#44417), OlmoHybrid init (#43846), HyperCLOVAX remote-code removal (#43860), Bailing-MoE rotary factor (#43770), Step3 PP residual KeyError (#37622), MiniCPM-V-4.6 video (#44509), MiniCPM-O audio unpadding (#38053), MiniCPM-V batched preprocessing (#44609), FunASR-Nano init (#44215), Cohere routing method (#44021), Kimi-K2.5 FlashInfer ViT metadata (#44493).
extra_repr() for pooler classes (#44805), LoRA-adapter-name pooling fix (#44410), resettled generative scoring entrypoint (#44153), expanded pooler unit tests (#43818, #44471).
Engine Core
max_seq_len for attention metadata (#43991), rejection-sampling acceptance-rate fix (#40651), KVConnector + PP cleanup (#43732), speculator-prefill warmup/capture (#44253).
num_heads_q for drafts (#43543), EAGLE/MTP lookahead caching in the SWA prefix-cache mask (#44082).
do_not_specialize (#43803), Qwen3.5 mixed prefill+decode split routing (#44700), MiniMax-M2 gate kernel (#38445).
KVCacheSpec (#37505), scheduler_block_size threaded into KVCacheManager/Coordinator (#44165), max_concurrent_batches moved to VllmConfig (#44274), config validation rejecting 0/negative knobs (#43794, #44057, #44207), KV-cache scale boilerplate removed from weight loading (#43167).
Large Scale Serving & Distributed
on_new_request) (#43205) and on_schedule_end() hook (#44206), token-offset selective offload (#39983), skip decode-phase blocks in CPU offload (#43797), page-size block alignment (#43689), Triton fast-path for small CPU→GPU swap_blocks_batch (#42212), stale sliding-window block fix (#42959).
kv_both role deprecation cycle (#43874), Mooncake fixes (#43742, #44103, #42694), LMCache LMCacheMPConnector (#42865), EC connector shutdown API (#42423) and non-blocking lookup (#41627), KV-transfer tokens excluded from iteration_tokens_total (#43346).
Hardware & Performance
Fp8BlockScaledMM new_empty() optimization (#43677), TurboQuant shared dequant buffers (#40941), tuned selective_state_update for H200/RTX PRO (#44251), Inductor fast-path fallback for vLLM/AITER custom ops (#42129), Gemma RMS all-reduce fusion (#42646), NUMA auto-binding on DGX B300 (#43270).
permute_cols for ROCm (#44674), blocks-first KV layout for AMD (#43660), N=5 wvSplitK for spec decode (#40687), MoRI connector improvements (#43303, #41751, #40344).
block_fp8_moe (#42139), block-scaled W8A8 FP8 path (#39968), WNA16 oracle for GPTQ sym-int4 (#41426), rms_norm/act quant fusions (#43963), GDN-attention MTP (#43565), Triton selective-scan op (#43421), transparent sleep mode (#37149), CPU/tiering offloading on XPU (#36423), DeepSeek-V4 attention decode path (#42953).
cpu_awq folded into awq_marlin (#43841), RISC-V RVV WNA16 helpers (#42730), fused GDN gated-delta-rule kernels (#43534), PowerPC SHM communicator (#43754), arm64 CI image (#41303).
_has_module trial-import verification (#44035).
Quantization
supports_expert_map (#43108) and the inplace fused-experts mechanism (#43727).
API & Frontend
system_fingerprint field (#40537), streaming tool/function calling with required (#40700), chat_template_kwargs in Responses (#43761), developer-to-system conversion in the HF renderer (#43590), unstreamed tool-call-args streaming fix (#44348).
Parser.parse() (#44267), Responses parser migrated to the unified interface (#42977), unstreamed tool-arg flush moved into the parser (#44017); new/fixed tool parsers: MiniCPM5 XML (#43175), Qwen3 XML JSON-args-first (#43243), DeepSeek DSML incremental streaming (#42879), first-args-chunk serializer fix (#42683), tool_choice="none" honored in streaming (#42752), null-tool-args crash fix (#43862).
thinking_token_budget validation (#43402), GPT-OSS instruction rendering (#44330), Harmony stop_token_ids cleanup (#44009), consistent VLLMValidationError in chat/completion validators (#36254), consolidation of dev entrypoints (#44170) and online-serving utils (#44479).
generate endpoint (#43779), dynamic LoRA endpoints (#43778), /version (#43854) and /server_info (#43942), server-router extension hook (#43774), --enable-request-id-headers (#43883), recursive tool-parameter conversion (#44299), include_reasoning=false (#44391), --language-model-only skips the multimodal processor (#44500), per-engine batch auto-abort (#44591), UTF-8 char-boundary detokenizer fix (#44620), HF chat-template fixes (#44311), cross-DP aggregation of is_sleeping/reset_prefix_cache (#43429); new tool parsers: InternLM2 (#43481), hy_v3 (#43872), Phi-4-mini JSON (#44213), Gemma4 (#43850).
vllm bench serve (#39795), reasoning-model (thinking) benchmarking via --chat-template-kwargs (#44244).
Security
thinking_token_budget values (#43402), non-positive ParallelConfig integer knobs (#44057), zero-valued config fields (#43794), and out-of-range max_num_scheduled_tokens (#44207).
Dependencies
libcublas-dev (#39855), and CUTLASS DSL cu13 install order (#45204).
Deprecations
JAISLMHeadModel (#43784).
kv_both role (#43874).
New Contributors
Contributors
Thank you to everyone who made this release possible!
@AndreasKaratzas, @WoosukKwon, @BugenZhao, @yewentao256, @hmellor, @khluu, @njhill, @sfeng33, @bnellnm, @vadiklyutiy, @NickLucche, @JartX, @lucianommartins, @cleonard530, @wzhao18, @yma11, @simondanielsson, @jeejeelee, @zyongye, @chaunceyjiang, @bigPYJ1151, @ronensc, @taneem-ibrahim, @LucasWilkinson, @MatthewBonanni, @mmangkad, @chunyang-wen, @yzong-rh, @JaredforReal, @zixi-qi, @Isotr0py, @noooop, @chaojun-zhang, @Xunzhuo, @ivanium, @zufangzhu, @DaoyuanLi2816, @CienetStingLin, @aoshen02, @akii96, @benchislett, @MengqingCao, @rshavitt, @kliuae, @omerpaz95, @willamhou, @Majid-Taheri, @micah-wil, @ricky-chaoju, @mikekg, @mgoin, @mayuyuace, @Etelis, @ilmarkov, @tlrmchlsmth, @UranusSeven, @bedeks, @izhuhaoran, @ZJY0516, @fadara01, @pschlan-amd, @wangxiyuan, @Oxygen56, @charlifu, @varun-sundar-rabindranath, @shen-shanshan, @TheEpicDolphin, @adobrzyn, @XuZhou26, @tjtanaa, @Terrencezzj, @zhejiangxiaomai, @ILikeIneine, @yubofredwang, @chfeng-cs, @ThibaultCastells, @linzm1007, @javierdejesusda, @meenchen, @zhewenl, @xyang16, @angelayi, @nholmber, @zhangtao2-1, @adityasingh2400, @sts07142, @jatseng-ai, @fallintoplace, @andakai, @he-yufeng, @ignaciosica, @JINO-ROHIT, @tonyliu312, @QwertyJack, @animeshtrivedi, @jzakrzew, @juliendenize, @zexplorerhj, @ruocco, @mgehre-amd, @jasonboukheir, @MaciejBalaNV, @JohnQinAMD, @huanghua1994, @rajkiranjoshi, @rasmith, @harshaljanjani, @ltd0924, @wdhongtw, @yintong-lu, @tianmu-li, @jikunshang, @JMonde, @MHYangAMD, @frida-andersson, @gau-nernst, @Wauplin, @czhu-cohere, @gagandhakrey, @nemanjaudovic, @Liangliang-Ma, @liulanze, @sphinx07, @aadwived, @nightcityblade, @umut-polat, @jeffreywang88, @wcynb1023, @zzt93, @shadeMe, @Dao007forever, @alec-flowers, @Krishnachaitanyakc, @orozery, @BWAAEEEK, @cinnamonica02, @albertoperdomo2, @Rukhaiya2004, @mfylcek, @shreyas269, @Gruner-atero, @TomerBN-Nvidia, @wjinxu, @IdoAtadTD, @xiaozcy, @brian-dellabetta, @zhenwei-intel, @adotdad, @Kartavyasonar, @lesj0610, @ECMGit, @cakeng, @william-rom, @qiching, 
@NolanHo, @andylolu2, @xwu-intel, @linitra24, @hoobnn, @Dymasik, @wanghenshui, @maobaolong, @oguzhankir, @Jie-Fang, @okorzh-amd, @Kevin-XiongC, @jiahanc, @garrygale, @dsikka, @QiliangCui2023, @wjabbour, @zvik, @tc-mb, @jwzheng96, @divakar-amd, @tushar00jain, @galletas1712, @hanlin12-AMD, @tuukkjs, @viiccwen, @Sunt-ing, @HueCodes, @tianyu-z, @adhithyamulticoreware, @rishitdholakia13, @effi-ofer, @Vikrantpalle, @walterbm, @devin-lai, @Yadan-Wei, @amd-fuweiy, @maeehart, @qyYue1389, @BramVanroy, @SunskyXH, @Holworth, @majian4work, @xaguilar-amd, @Rohan138
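The v0.23.0 notes above list a /version endpoint (#43854). After the image tag is bumped, one way to confirm that a running container actually serves the new release is to query that endpoint and compare the result with the tag in this PR. The sketch below is illustrative only: it assumes the endpoint returns a JSON object with a "version" key, and the helper names and the base URL in the usage comment are hypothetical.

```python
import json
from urllib.request import urlopen


def reported_version(payload: bytes) -> str:
    """Extract the version string from a /version response body.

    Assumption: the body is JSON shaped like {"version": "0.23.0"}.
    """
    data = json.loads(payload)
    return str(data["version"])


def matches_tag(reported: str, tag: str) -> bool:
    """Compare a reported version with a Docker tag, ignoring a leading 'v'."""
    return reported.removeprefix("v") == tag.removeprefix("v")


def fetch_version(base_url: str, timeout: float = 5.0) -> str:
    """Query <base_url>/version on a running server and return the version."""
    with urlopen(f"{base_url.rstrip('/')}/version", timeout=timeout) as resp:
        return reported_version(resp.read())


# Usage against a running server (hypothetical address):
#   version = fetch_version("http://localhost:8000")
#   assert matches_tag(version, "v0.23.0"), f"unexpected version {version}"
```

The parsing and comparison are kept separate from the network call so they can be checked without a server; development builds may report a longer version string, in which case the exact-match comparison would need loosening.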
v0.22.1
Highlights
This release features 8 commits from 6 contributors (1 new)!
v0.22.1 is a patch release on top of v0.22.0 with targeted bug fixes plus a couple of additions: new model support for JetBrains' Mellum v2, zentorch-accelerated quantized linear inference on AMD Zen CPUs, and fixes for multi-node Ray data-parallel serving, DeepSeek-V4 initialization, and a few model-loading regressions.
Model Support
Configuration
📅 Schedule: (in timezone Europe/Zurich)
🚦 Automerge: Enabled.
♻ Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.