Update dependency vllm to v0.22.0 [SECURITY] - autoclosed#45
Closed
renovate[bot] wants to merge 1 commit into
Closed
Update dependency vllm to v0.22.0 [SECURITY] - autoclosed#45renovate[bot] wants to merge 1 commit into
renovate[bot] wants to merge 1 commit into
Conversation
Contributor
Author
|
4781d93 to
a70fcd1
Compare
a70fcd1 to
27c0fb1
Compare
27c0fb1 to
bcb2a4f
Compare
bcb2a4f to
fb53a3d
Compare
fb53a3d to
fa25622
Compare
fa25622 to
5b769e6
Compare
5b769e6 to
ac59da5
Compare
ac59da5 to
4081aeb
Compare
4081aeb to
ebfedd4
Compare
ebfedd4 to
f2453df
Compare
f2453df to
72daf4b
Compare
72daf4b to
1bd2543
Compare
1bd2543 to
08ea896
Compare
Signed-off-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
08ea896 to
61076f5
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR contains the following updates:
==v0.8.4→==0.22.0==v0.6.6→==0.22.0==v0.6.4→==0.22.0==0.11.0→==0.22.0==0.8.4→==0.22.0==0.6.6→==0.22.0==0.6.4→==0.22.0vLLM denial of service vulnerability
CVE-2024-8768 / GHSA-w2r7-9579-27hf
More information
Details
A flaw was found in the vLLM library. A completions API request with an empty prompt will crash the vLLM API server, resulting in a denial of service.
Severity
CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:N/VI:N/VA:H/SC:N/SI:N/SA:NReferences
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
vllm: Malicious model to RCE by torch.load in hf_model_weights_iterator
CVE-2025-24357 / GHSA-rh4j-5rhw-hr54
More information
Details
Description
The vllm/model_executor/weight_utils.py implements hf_model_weights_iterator to load the model checkpoint, which is downloaded from huggingface. It use torch.load function and weights_only parameter is default value False. There is a security warning on https://pytorch.org/docs/stable/generated/torch.load.html, when torch.load load a malicious pickle data it will execute arbitrary code during unpickling.
Impact
This vulnerability can be exploited to execute arbitrary codes and OS commands in the victim machine who fetch the pretrained repo remotely.
Note that most models now use the safetensors format, which is not vulnerable to this issue.
References
Severity
CVSS:3.1/AV:N/AC:H/PR:N/UI:R/S:U/C:H/I:H/A:HReferences
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
vLLM uses Python 3.12 built-in hash() which leads to predictable hash collisions in prefix cache
CVE-2025-25183 / GHSA-rm76-4mrf-v9r8
More information
Details
Summary
Maliciously constructed prompts can lead to hash collisions, resulting in prefix cache reuse, which can interfere with subsequent responses and cause unintended behavior.
Details
vLLM's prefix caching makes use of Python's built-in hash() function. As of Python 3.12, the behavior of hash(None) has changed to be a predictable constant value. This makes it more feasible that someone could try exploit hash collisions.
Impact
The impact of a collision would be using cache that was generated using different content. Given knowledge of prompts in use and predictable hashing behavior, someone could intentionally populate the cache using a prompt known to collide with another prompt in use.
Solution
We address this problem by initializing hashes in vllm with a value that is no longer constant and predictable. It will be different each time vllm runs. This restores behavior we got in Python versions prior to 3.12.
Using a hashing algorithm that is less prone to collision (like sha256, for example) would be the best way to avoid the possibility of a collision. However, it would have an impact to both performance and memory footprint. Hash collisions may still occur, though they are no longer straight forward to predict.
To give an idea of the likelihood of a collision, for randomly generated hash values (assuming the hash generation built into Python is uniformly distributed), with a cache capacity of 50,000 messages and an average prompt length of 300, a collision will occur on average once every 1 trillion requests.
References
Severity
CVSS:3.1/AV:N/AC:H/PR:L/UI:R/S:U/C:N/I:L/A:NReferences
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
vLLM denial of service via outlines unbounded cache on disk
CVE-2025-29770 / GHSA-mgrm-fgjv-mhv8
More information
Details
Impact
The outlines library is one of the backends used by vLLM to support structured output (a.k.a. guided decoding). Outlines provides an optional cache for its compiled grammars on the local filesystem. This cache has been on by default in vLLM. Outlines is also available by default through the OpenAI compatible API server.
The affected code in vLLM is vllm/model_executor/guided_decoding/outlines_logits_processors.py, which unconditionally uses the cache from outlines. vLLM should have this off by default and allow administrators to opt-in due to the potential for abuse.
A malicious user can send a stream of very short decoding requests with unique schemas, resulting in an addition to the cache for each request. This can result in a Denial of Service if the filesystem runs out of space.
Note that even if vLLM was configured to use a different backend by default, it is still possible to choose outlines on a per-request basis using the
guided_decoding_backendkey of theextra_bodyfield of the request.This issue applies to the V0 engine only. The V1 engine is not affected.
Patches
The fix is to disable this cache by default since it does not provide an option to limit its size. If you want to use this cache anyway, you may set the
VLLM_V0_USE_OUTLINES_CACHEenvironment variable to1.Workarounds
There is no way to workaround this issue in existing versions of vLLM other than preventing untrusted access to the OpenAI compatible API server.
References
Severity
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:HReferences
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
vLLM Allows Remote Code Execution via Mooncake Integration
CVE-2025-29783 / GHSA-x3m8-f7g5-qhm7
More information
Details
Summary
When vLLM is configured to use Mooncake, unsafe deserialization exposed directly over ZMQ/TCP will allow attackers to execute remote code on distributed hosts.
Details
Further, the mooncake integration opens these sockets listening on all interfaces on the host, meaning it can not be configured to only use a private, trusted network.Only
sender_socketandreceiver_ackare allowed to be accessed publicly, while the data actually decompressed bypickle.loads()comes from recv_bytes. Its interface is defined asself.receiver_socket.connect(f\"tcp://{d_host}:{d_rank_offset + 1}\"), whered_hostisdecode_host, a locally defined address 192.168.0.139,from mooncake.json (https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/vllm-integration-v0.2.md?plain=1#L36).recv_tensor()calls_recv_implwhich passes the raw network bytes topickle.loads(). Additionally, it does not appear that there are any controls (network, authentication, etc) to prevent arbitrary users from sending this payload to the affected service.Impact
This is a remote code execution vulnerability impacting any deployments using Mooncake to distribute KV across distributed hosts.
Remediation
This issue is resolved by https://github.com/vllm-project/vllm/pull/14228
Severity
CVSS:3.1/AV:A/AC:L/PR:L/UI:N/S:C/C:H/I:H/A:HReferences
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
vLLM vulnerable to Denial of Service by abusing xgrammar cache
GHSA-hf3c-wxg2-49q9
More information
Details
Impact
This report is to highlight a vulnerability in XGrammar, a library used by the structured output feature in vLLM. The XGrammar advisory is here: GHSA-389x-67px-mjg3
The xgrammar library is the default backend used by vLLM to support structured output (a.k.a. guided decoding). Xgrammar provides a required, built-in cache for its compiled grammars stored in RAM. xgrammar is available by default through the OpenAI compatible API server with both the V0 and V1 engines.
A malicious user can send a stream of very short decoding requests with unique schemas, resulting in an addition to the cache for each request. This can result in a Denial of Service by consuming all of the system's RAM.
Note that even if vLLM was configured to use a different backend by default, it is still possible to choose xgrammar on a per-request basis using the
guided_decoding_backendkey of theextra_bodyfield of the request with the V0 engine. This per-request choice is not available when using the V1 engine.Patches
Workarounds
There is no way to workaround this issue in existing versions of vLLM other than preventing untrusted access to the OpenAI compatible API server.
References
Severity
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:HReferences
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
CVE-2025-24357 Malicious model remote code execution fix bypass with PyTorch < 2.6.0
GHSA-ggpf-24jw-3fcw
More information
Details
Description
GHSA-rh4j-5rhw-hr54 reported a vulnerability where loading a malicious model could result in code execution on the vllm host. The fix applied to specify
weights_only=Trueto calls totorch.load()did not solve the problem prior to PyTorch 2.6.0.PyTorch has issued a new CVE about this problem: GHSA-53q9-r3pm-6pq6
This means that versions of vLLM using PyTorch before 2.6.0 are vulnerable to this problem.
Background Knowledge
When users install VLLM according to the official manual

But the version of PyTorch is specified in the requirements. txt file

So by default when the user install VLLM, it will install the PyTorch with version 2.5.1

In CVE-2025-24357, weights_only=True was used for patching, but we know this is not secure.
Because we found that using Weights_only=True in pyTorch before 2.5.1 was unsafe
Here, we use this interface to prove that it is not safe.

Fix
update PyTorch version to 2.6.0
Credit
This vulnerability was found By Ji'an Zhou and Li'shuo Song
Severity
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:HReferences
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
Data exposure via ZeroMQ on multi-node vLLM deployment
CVE-2025-30202 / GHSA-9f8f-2vmf-885j
More information
Details
Impact
In a multi-node vLLM deployment, vLLM uses ZeroMQ for some multi-node communication purposes. The primary vLLM host opens an
XPUBZeroMQ socket and binds it to ALL interfaces. While the socket is always opened for a multi-node deployment, it is only used when doing tensor parallelism across multiple hosts.Any client with network access to this host can connect to this
XPUBsocket unless its port is blocked by a firewall. Once connected, these arbitrary clients will receive all of the same data broadcasted to all of the secondary vLLM hosts. This data is internal vLLM state information that is not useful to an attacker.By potentially connecting to this socket many times and not reading data published to them, an attacker can also cause a denial of service by slowing down or potentially blocking the publisher.
Detailed Analysis
The
XPUBsocket in question is created here:https://github.com/vllm-project/vllm/blob/c21b99b91241409c2fdf9f3f8c542e8748b317be/vllm/distributed/device_communicators/shm_broadcast.py#L236-L237
Data is published over this socket via
MessageQueue.enqueue()which is called byMessageQueue.broadcast_object():https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/device_communicators/shm_broadcast.py#L452-L453
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/device_communicators/shm_broadcast.py#L475-L478
The
MessageQueue.broadcast_object()method is called by theGroupCoordinator.broadcast_object()method inparallel_state.py:https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L364-L366
The broadcast over ZeroMQ is only done if the
GroupCoordinatorwas created withuse_message_queue_broadcasterset toTrue:https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L216-L219
The only case where
GroupCoordinatoris created withuse_message_queue_broadcasteris the coordinator for the tensor parallelism group:https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L931-L936
To determine what data is broadcasted to the tensor parallism group, we must continue tracing.
GroupCoordinator.broadcast_object()is called byGroupCoordinator.broadcoast_tensor_dict():https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L489
which is called by
broadcast_tensor_dict()incommunication_op.py:https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/communication_op.py#L29-L34
If we look at
_get_driver_input_and_broadcast()in the V0worker_base.py, we'll see how this tensor dict is formed:https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/worker/worker_base.py#L332-L352
but the data actually sent over ZeroMQ is the
metadata_listportion that is split from thistensor_dict. The tensor parts are sent viatorch.distributedand only metadata about those tensors is sent via ZeroMQ.https://github.com/vllm-project/vllm/blob/54a66e5fee4a1ea62f1e4c79a078b20668e408c6/vllm/distributed/parallel_state.py#L61-L83
Patches
Workarounds
Prior to the fix, your options include:
XPUBsocket. Note that port used is random.References
Severity
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:HReferences
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
vLLM Vulnerable to Remote Code Execution via Mooncake Integration
CVE-2025-32444 / GHSA-hj4w-hm2g-p6w5
More information
Details
Impacted Deployments
Note that vLLM instances that do NOT make use of the mooncake integration are NOT vulnerable.
Description
vLLM integration with mooncake is vaulnerable to remote code execution due to using
picklebased serialization over unsecured ZeroMQ sockets. The vulnerable sockets were set to listen on all network interfaces, increasing the likelihood that an attacker is able to reach the vulnerable ZeroMQ sockets to carry out an attack.This is a similar to GHSA - x3m8 - f7g5 - qhm7, the problem is in
https://github.com/vllm-project/vllm/blob/32b14baf8a1f7195ca09484de3008063569b43c5/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py#L179
Here recv_pyobj() Contains implicit
pickle.loads(), which leads to potential RCE.Severity
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:HReferences
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
Data exposure via ZeroMQ on multi-node vLLM deployment
CVE-2025-30202 / GHSA-9f8f-2vmf-885j
More information
Details
Impact
In a multi-node vLLM deployment, vLLM uses ZeroMQ for some multi-node communication purposes. The primary vLLM host opens an
XPUBZeroMQ socket and binds it to ALL interfaces. While the socket is always opened for a multi-node deployment, it is only used when doing tensor parallelism across multiple hosts.Any client with network access to this host can connect to this
XPUBsocket unless its port is blocked by a firewall. Once connected, these arbitrary clients will receive all of the same data broadcasted to all of the secondary vLLM hosts. This data is internal vLLM state information that is not useful to an attacker.By potentially connecting to this socket many times and not reading data published to them, an attacker can also cause a denial of service by slowing down or potentially blocking the publisher.
Detailed Analysis
The
XPUBsocket in question is created here:https://github.com/vllm-project/vllm/blob/c21b99b91241409c2fdf9f3f8c542e8748b317be/vllm/distributed/device_communicators/shm_broadcast.py#L236-L237
Data is published over this socket via
MessageQueue.enqueue()which is called byMessageQueue.broadcast_object():https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/device_communicators/shm_broadcast.py#L452-L453
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/device_communicators/shm_broadcast.py#L475-L478
The
MessageQueue.broadcast_object()method is called by theGroupCoordinator.broadcast_object()method inparallel_state.py:https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L364-L366
The broadcast over ZeroMQ is only done if the
GroupCoordinatorwas created withuse_message_queue_broadcasterset toTrue:https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L216-L219
The only case where
GroupCoordinatoris created withuse_message_queue_broadcasteris the coordinator for the tensor parallelism group:https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L931-L936
To determine what data is broadcasted to the tensor parallism group, we must continue tracing.
GroupCoordinator.broadcast_object()is called byGroupCoordinator.broadcoast_tensor_dict():https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L489
which is called by
broadcast_tensor_dict()incommunication_op.py:https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/communication_op.py#L29-L34
If we look at
_get_driver_input_and_broadcast()in the V0worker_base.py, we'll see how this tensor dict is formed:https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/worker/worker_base.py#L332-L352
but the data actually sent over ZeroMQ is the
metadata_listportion that is split from thistensor_dict. The tensor parts are sent viatorch.distributedand only metadata about those tensors is sent via ZeroMQ.https://github.com/vllm-project/vllm/blob/54a66e5fee4a1ea62f1e4c79a078b20668e408c6/vllm/distributed/parallel_state.py#L61-L83
Patches
Workarounds
Prior to the fix, your options include:
XPUBsocket. Note that port used is random.References
Severity
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:HReferences
This data is provided by OSV and the GitHub Advisory Database (CC-BY 4.0).
vLLM Vulnerable to Remote Code Execution via Mooncake Integration
CVE-2025-32444 / GHSA-hj4w-hm2g-p6w5 / PYSEC-2025-42
More information
Details
Impacted Deployments
Note that vLLM instances that do NOT make use of the mooncake integration are NOT vulnerable.
Description
vLLM integration with mooncake is vaulnerable to remote code execution due to using
picklebased serialization over unsecured ZeroMQ sockets. The vulnerable sockets were set to listen on all network interfaces, increasing the likelihood that an attacker is able to reach the vulnerable ZeroMQ sockets to carry out an attack.This is a similar to GHSA - x3m8 - f7g5 - qhm7, the problem is in
https://github.com/vllm-project/vllm/blob/32b14baf8a1f7195ca09484de3008063569b43c5/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py#L179
Here recv_pyobj() Contains implicit
pickle.loads(), which leads to potential RCE.Severity
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:HReferences
This data is provided by OSV and the GitHub Advisory Database (CC-BY 4.0).
vLLM Allows Remote Code Execution via PyNcclPipe Communication Service
CVE-2025-47277 / GHSA-hjq4-87xh-g4fv
More information
Details
Impacted Environments
This issue ONLY impacts environments using the
PyNcclPipeKV cache transfer integration with the V0 engine. No other configurations are affected.Summary
vLLM supports the use of the
PyNcclPipeclass to establish a peer-to-peer communication domain for data transmission between distributed nodes. The GPU-side KV-Cache transmission is implemented through thePyNcclCommunicatorclass, while CPU-side control message passing is handled via thesend_objandrecv_objmethods on the CPU side.A remote code execution vulnerability exists in the
PyNcclPipeservice. Attackers can exploit this by sending malicious serialized data to gain server control privileges.The intention was that this interface should only be exposed to a private network using the IP address specified by the
--kv-ipCLI parameter. The vLLM documentation covers how this must be limited to a secured network: https://docs.vllm.ai/en/latest/deployment/security.htmlUnfortunately, the default behavior from PyTorch is that the
TCPStoreinterface will listen on ALL interfaces, regardless of what IP address is provided. The IP address given was only used as a client-side address to use. vLLM was fixed to use a workaround to force theTCPStoreinstance to bind its socket to a specified private interface.This issue was reported privately to PyTorch and they determined that this behavior was intentional.
Details
The
PyNcclPipeimplementation contains a critical security flaw where it directly processes client-provided data usingpickle.loads, creating an unsafe deserialization vulnerability that can lead to Remote Code Execution.PyNcclPipeservice configured to listen on port18888when launched:PyNcclPipeservice:The call stack triggering RCE is as follows:
Getshell as follows:
Reporters
This issue was reported independently by three different parties:
Fix
TCPStoresocket to the private interface as configured.Severity
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:HReferences
This data is provided by OSV and the GitHub Advisory Database (CC-BY 4.0).
phi4mm: Quadratic Time Complexity in Input Token Processing leads to denial of service
CVE-2025-46560 / GHSA-vc6m-hm49-g9qg
More information
Details
Summary
A critical performance vulnerability has been identified in the input preprocessing logic of the multimodal tokenizer. The code dynamically replaces placeholder tokens (e.g., <|audio_|>, <|image_|>) with repeated tokens based on precomputed lengths. Due to inefficient list concatenation operations, the algorithm exhibits quadratic time complexity (O(n²)), allowing malicious actors to trigger resource exhaustion via specially crafted inputs.
Details
Affected Component: input_processor_for_phi4mm function.
https://github.com/vllm-project/vllm/blob/8cac35ba435906fb7eb07e44fe1a8c26e8744f4e/vllm/model_executor/models/phi4mm.py#L1182-L1197
The code modifies the input_ids list in-place using input_ids = input_ids[:i] + tokens + input_ids[i+1:]. Each concatenation operation copies the entire list, leading to O(n) operations per replacement. For k placeholders expanding to m tokens, total time becomes O(kmn), approximating O(n²) in worst-case scenarios.
PoC
Test data demonstrates exponential time growth:
Doubling input size increases runtime by ~4x (consistent with O(n²)).
Impact
Denial-of-Service (DoS): An attacker could submit inputs with many placeholders (e.g., 10,000 <|audio_1|> tokens), causing CPU/memory exhaustion.
Example: 10,000 placeholders → ~100 million operations.
Remediation Recommendations
Precompute all placeholder positions and expansion lengths upfront.
Replace dynamic list concatenation with a single preallocated array.
Severity
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:HReferences
This data is provided by OSV and the GitHub Advisory Database (CC-BY 4.0).
CVE-2025-32444 / GHSA-hj4w-hm2g-p6w5 / PYSEC-2025-42
More information
Details
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. Versions starting from 0.6.5 and prior to 0.8.5, having vLLM integration with mooncake, are vulnerable to remote code execution due to using pickle based serialization over unsecured ZeroMQ sockets. The vulnerable sockets were set to listen on all network interfaces, increasing the likelihood that an attacker is able to reach the vulnerable ZeroMQ sockets to carry out an attack. vLLM instances that do not make use of the mooncake integration are not vulnerable. This issue has been patched in version 0.8.5.
Severity
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:HReferences
This data is provided by OSV and the PyPI Advisory Database (CC-BY 4.0).
Potential Timing Side-Channel Vulnerability in vLLM’s Chunk-Based Prefix Caching
CVE-2025-46570 / GHSA-4qjh-9fv9-r85r / PYSEC-2025-53
More information
Details
This issue arises from the prefix caching mechanism, which may expose the system to a timing side-channel attack.
Description
When a new prompt is processed, if the PageAttention mechanism finds a matching prefix chunk, the prefill process speeds up, which is reflected in the TTFT (Time to First Token). Our tests revealed that the timing differences caused by matching chunks are significant enough to be recognized and exploited.
For instance, if the victim has submitted a sensitive prompt or if a valuable system prompt has been cached, an attacker sharing the same backend could attempt to guess the victim's input. By measuring the TTFT based on prefix matches, the attacker could verify if their guess is correct, leading to potential leakage of private information.
Unlike token-by-token sharing mechanisms, vLLM’s chunk-based approach (PageAttention) processes tokens in larger units (chunks). In our tests, with chunk_size=2, the timing differences became noticeable enough to allow attackers to infer whether portions of their input match the victim's prompt at the chunk level.
Environment
Configuration: We launched vLLM using the default settings and adjusted chunk_size=2 to evaluate the TTFT.
Leakage
We conducted our tests using LLaMA2-70B-GPTQ on a single device. We analyzed the timing differences when prompts shared prefixes of 2 chunks, and plotted the corresponding ROC curves. Our results suggest that timing differences can be reliably used to distinguish prefix matches, demonstrating a potential side-channel vulnerability.

Results
In our experiment, we analyzed the response time differences between cache hits and misses in vLLM's PageAttention mechanism. Using ROC curve analysis to assess the distinguishability of these timing differences, we observed the following results:
Fixes
Severity
CVSS:3.1/AV:N/AC:H/PR:L/UI:R/S:U/C:L/I:N/A:NReferences
This data is provided by OSV and the GitHub Advisory Database (CC-BY 4.0).
vLLM DOS: Remotely kill vllm over http with invalid JSON schema
CVE-2025-48942 / GHSA-6qc9-v4r8-22xg / PYSEC-2025-54
More information
Details
Summary
Hitting the /v1/completions API with a invalid json_schema as a Guided Param will kill the vllm server
Details
The following API call
(venv) [derekh@ip-172-31-15-108 ]$ curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-3.2-3B-Instruct","prompt": "Name two great reasons to visit Sligo ", "max_tokens": 10, "temperature": 0.5, "guided_json":"{\"properties\":{\"reason\":{\"type\": \"stsring\"}}}"}'will provoke a Uncaught exceptions from xgrammer in
./lib64/python3.11/site-packages/xgrammar/compiler.pyIssue with more information: https://github.com/vllm-project/vllm/issues/17248
PoC
Make a call to vllm with invalid json_scema e.g.
{\"properties\":{\"reason\":{\"type\": \"stsring\"}}}curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-3.2-3B-Instruct","prompt": "Name two great reasons to visit Sligo ", "max_tokens": 10, "temperature": 0.5, "guided_json":"{\"properties\":{\"reason\":{\"type\": \"stsring\"}}}"}'Impact
vllm crashes
example traceback