Fix PDBackendAsync shared-key races in allocation, transfer, and removal paths #323
Merged
hlin99 merged 5 commits on May 29, 2026
Copilot (AI) changed the title from "[WIP] Fix race conditions in PDBackendAsync to prevent use-after-free" to "Fix PDBackendAsync shared-key races in allocation, transfer, and removal paths" on May 29, 2026
…ount release (#324)
* Initial plan
* fix(pd): make dedup pin atomic and decrement remove refcount
* test(pd): align async race tests with refcount decrement semantics
* test(pd): restore requested UT typing and mock specs

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

…n, dedup test coverage (#325)
* Initial plan
* Fix remove() log, add bounds check, change put() log to INFO, add sender dedup tests
* Merge TestSenderSideDedup into test_pd_backend_async_race.py, remove separate file

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

…nsfer (#326)
* Initial plan
* fix(pd_backend_async): include deduped keys in abort cleanup tracking

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
`PDBackendAsync` mishandled shared-prefix keys under concurrency: duplicate allocations could overwrite or free in-flight buffers, and `remove()` could drop entries still needed by other requests. This ports the sync backend's shared-key semantics to async so that dedup, ownership, and lifecycle are consistent.

Receiver dedup in allocation path
* `_async_allocate_and_put` now checks `contains(key, pin=False)` per requested key.
* Keys that already exist are not allocated again; they are reported in `already_sent_indexes`.
* `AllocResponse` now carries both:
  * `remote_indexes` for newly allocated chunks
  * `already_sent_indexes` for deduped chunks

Sender-side skip for deduped chunks
* `_async_transfer_task` consumes `already_sent_indexes` and filters outgoing RDMA writes accordingly.

`put()` no longer overwrites existing key
* `put` for an existing key now drops the new object (refcount down) instead of replacing or freeing the existing in-flight target.

`remove()` made refcount-aware
* `remove()` now inspects the current object refcount and deletes/frees only when `ref_count == 1`.
* The `_alloc_freed_condition` notification is emitted only on actual removal/free.

Race regression test alignment
* Updated `FakePDBackendDataPath` in `tests/v1/test_pd_backend_async_race.py` to mirror the new `put`/`remove` semantics.

Original prompt
Problem

`PDBackendAsync` has two critical race conditions when multiple requests share the same `CacheEngineKey` (shared prefix scenario). The sync version (`PDBackend` in `pd_backend.py`) already handles this correctly, but the async version does not.

Bug 1: `put()` overwrites and frees in-flight buffer (use-after-free)

The relevant code is in `pd_backend_async.py`, lines 1430-1440.

When Req A and Req B share the same prefix, and therefore the same key K:
1. The receiver allocates `obj_A` for Req A → `put(K, obj_A)` → returns `obj_A.address` to Sender A.
2. Sender A starts writing to `obj_A.address`.
3. The receiver allocates `obj_B` for Req B → `put(K, obj_B)` → frees `obj_A`!

Bug 2: `remove()` unconditionally pops (breaks other request's retrieve)

The relevant code is in `pd_backend_async.py`, lines 1470-1482.

When Req A does `remove_after_retrieve` on key K, Req B's subsequent `get_blocking(K)` hits `AssertionError: Key ... not found in local data`.

Fix (port sync version's approach)
The sync `PDBackend` (`pd_backend.py`) already handles this correctly:

1. Dedup in `_allocate_and_put` (sync lines 917-920)
2. `put()` does not overwrite (sync lines 982-983)
3. `remove()` checks refcount (sync lines 1058-1063)

Required Changes in `lmcache/v1/storage_backend/pd_backend_async.py`

Change 1: Add dedup check in `_async_allocate_and_put` (around line 1320)

Before allocating each key, check whether it already exists.
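A minimal sketch of such a dedup check, not the code from `pd_backend_async.py`. The `store` dict, the `allocate` callable, and the function name `allocate_with_dedup` are hypothetical stand-ins, and the `AllocResponse` here models only the two fields named in this PR. It also assumes `already_sent_indexes` holds positions in the request's key list, which the source does not spell out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, List, Sequence


@dataclass
class AllocResponse:
    # Hypothetical stand-in: only the two fields named in the PR description.
    remote_indexes: List[int] = field(default_factory=list)
    already_sent_indexes: List[int] = field(default_factory=list)


def allocate_with_dedup(
    keys: Sequence[Hashable],
    store: Dict[Hashable, int],
    allocate: Callable[[], int],
) -> AllocResponse:
    """Allocate a buffer only for keys that are not already in `store`."""
    resp = AllocResponse()
    for position, key in enumerate(keys):
        if key in store:
            # Key already present (possibly still in flight): skip allocation
            # and tell the sender this chunk does not need to be written.
            resp.already_sent_indexes.append(position)
            continue
        buffer_index = allocate()
        # Register the key right away so a later duplicate is deduped too.
        store[key] = buffer_index
        resp.remote_indexes.append(buffer_index)
    return resp
```

In an asyncio setting, the check and the insert should not be separated by an `await`; otherwise two requests for the same key could both pass the check before either registers it.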
Also include `already_sent_indexes` in the response.

Change 2: Fix `put()` so it does not overwrite existing keys (lines 1417-1440)

Change to match sync behavior. Since dedup prevents duplicates from reaching `put()`, the method can be simplified.
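The non-overwriting behavior can be illustrated with a small sketch. `RefCountedObj` and `Store` are hypothetical classes written for this example; the real memory object and backend have more state. The only behavior taken from the PR is that a `put` on an existing key releases the new object and leaves the stored one alone.

```python
class RefCountedObj:
    """Hypothetical memory object: freed when its refcount reaches zero."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.ref_count = 1
        self.freed = False

    def ref_count_down(self) -> None:
        self.ref_count -= 1
        if self.ref_count == 0:
            self.freed = True


class Store:
    """Hypothetical key-to-object map with non-overwriting put()."""

    def __init__(self) -> None:
        self.data = {}

    def put(self, key, obj: RefCountedObj) -> None:
        if key in self.data:
            # Keep the existing buffer, which a sender may still be writing
            # to, and release the newcomer instead.
            obj.ref_count_down()
            return
        self.data[key] = obj
```

With this shape, a second `put("K", obj_B)` leaves `obj_A` stored and unfreed, so an in-flight write to `obj_A` still targets live memory.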
Change 3: Fix `remove()` to use a refcount check (lines 1460-1502)

Change to match the sync version's refcount-based removal.
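A sketch of a refcount-aware removal, under stated assumptions: `Obj`, the free function `remove`, and the `notify_freed` callback are hypothetical, with the callback standing in for the `_alloc_freed_condition` notification. It follows the PR description (delete only when `ref_count == 1`). One of the follow-up commits mentions decrementing the refcount on remove, so the merged code may treat the non-final case differently.

```python
from typing import Callable, Dict, Hashable


class Obj:
    """Hypothetical memory object with an explicit refcount."""

    def __init__(self, ref_count: int = 1) -> None:
        self.ref_count = ref_count
        self.freed = False

    def free(self) -> None:
        self.freed = True


def remove(
    data: Dict[Hashable, Obj],
    key: Hashable,
    notify_freed: Callable[[], None],
) -> bool:
    """Remove and free `key` only if no other request still holds it."""
    obj = data.get(key)
    if obj is None:
        return False
    if obj.ref_count != 1:
        # Another request still needs this entry: leave it in place and
        # do not signal that memory was freed.
        return False
    del data[key]
    obj.free()
    # Signal waiters only when memory was actually released.
    notify_freed()
    return True
```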
Note: The `_alloc_freed_condition` notification should only happen when the entry is actually removed (refcount == 1).

Change 4: Sender side, handle `already_sent_indexes` in `_async_transfer_task`

After receiving `AllocResponse`, filter out already-sent chunks (skip their RDMA).

Tests
Unit tests are already in `tests/v1/test_pd_backend_async_race.py`. They currently FAIL (demonstrating the bugs). After the fix, they should PA...

This pull request was created from Copilot chat.
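For the sender-side change (Change 4), the filtering step might look like the sketch below. The function name `plan_writes` is hypothetical. It assumes that `already_sent_indexes` refers to positions in the sender's chunk list and that `remote_indexes` lists destinations for the remaining chunks in request order; neither detail is confirmed by the text above.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def plan_writes(
    chunks: Sequence[T],
    remote_indexes: Sequence[int],
    already_sent_indexes: Sequence[int],
) -> List[Tuple[T, int]]:
    """Pair each chunk that still needs a write with its remote index."""
    skip = set(already_sent_indexes)
    # Drop chunks the receiver already holds; they need no RDMA write.
    pending = [c for i, c in enumerate(chunks) if i not in skip]
    if len(pending) != len(remote_indexes):
        raise ValueError("remote_indexes does not match the chunks to send")
    return list(zip(pending, remote_indexes))
```

For example, with four chunks and `already_sent_indexes=[1, 3]`, only the first and third chunks are paired with remote indexes and written.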