PDBackendAsync: close dedup TOCTOU window and correct shared-key refcount release#324
Conversation
hlin99
left a comment
There was a problem hiding this comment.
Change Request: Minimize UT modifications
The test file changes include unnecessary cleanups that are not required for the fix. Please revert the following non-essential changes while keeping the necessary refcount-tracking logic:
Revert these changes:
- Keep
pytestimport — do not remove it - Keep
MemoryFormat, MemoryObj, MemoryObjMetadataimports — do not remove them - Keep
spec=MemoryObjMetadataandspec=MemoryObjin_make_memory_obj— only add the dynamic refcount logic (ref_count_up,ref_count_down,get_ref_countside effects and_ref_counttracking) without removing existing mock specs - Keep
meta.fmt,meta.shape,meta.dtype,obj.get_sizefields in_make_memory_obj— do not remove them - Keep type annotation
dict[CacheEngineKey, MemoryObj]inFakePDBackendDataPath— do not change toMagicMock - Keep
put()parameter annotation asmem_obj: MemoryObj— do not change toMagicMock
Keep these changes (they are necessary):
_make_memory_objadding_ref_countfield and dynamicref_count_up/ref_count_down/get_ref_countside effectsFakePDBackendDataPath.remove()semantics change (always decrement, delete at 0)obj_a._ref_count = 2replacingobj_a.get_ref_count.return_value = 2- New assertions like
assert obj_a.get_ref_count() == 1and the secondremove()intest_remove_does_not_affect_other_request
|
@copilot chang request |
Addressed in commit I reverted the non-essential UT cleanups and kept only the required behavior updates:
|
This was already addressed in commit |
78d67f8
into
copilot/fix-race-conditions-in-pdbackendasync
…val paths (#323) * Initial plan * Fix PDBackendAsync shared-key race handling * PDBackendAsync: close dedup TOCTOU window and correct shared-key refcount release (#324) * Initial plan * fix(pd): make dedup pin atomic and decrement remove refcount * test(pd): align async race tests with refcount decrement semantics * test(pd): restore requested UT typing and mock specs --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> * Follow-up fixes for PDBackendAsync: log correctness, bounds validation, dedup test coverage (#325) * Initial plan * Fix remove() log, add bounds check, change put() log to INFO, add sender dedup tests * Merge TestSenderSideDedup into test_pd_backend_async_race.py, remove separate file --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> * Fix abort cleanup to release receiver-side dedup refs in PD async transfer (#326) * Initial plan * fix(pd_backend_async): include deduped keys in abort cleanup tracking --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
PR #323’s async dedup path still had a race:
_async_allocate_and_putcould decide “already present” outsidedata_lock, while a concurrentremove()deleted the same key, causing sender-skip + missing data. It also leaked shared entries becauseremove()returned success forref_count > 1without decrementing.Atomic dedup check + pin in
_async_allocate_and_putcontains(..., pin=False)with inlinedata_lock-guardedkey in self.datacheck.ref_count_up()) at dedup decision time so concurrentremove()cannot drop the object before the deduped consumer retrieves it.already_sent_indexesso sender skips RDMA for deduped chunks.remove()semantics corrected to release every caller referenceremove()now always performsref_count_down()when key exists.self.dataonly when refcount reaches0.Race test updates to match new lifetime contract
FakePDBackendDataPath.remove()now mirrors backend behavior: decrement always, delete at zero.2 -> 1after first remove,1 -> 0+ delete after second remove), includingtest_remove_does_not_affect_other_request.Original prompt
Problem
PR #323 introduced dedup logic in
_async_allocate_and_putbut has two remaining bugs:Bug 1: TOCTOU between
contains()check andremove()In
lmcache/v1/storage_backend/pd_backend_async.py, the_async_allocate_and_putmethod callsself.contains(key, pin=False)without holdingself.data_lock. Between this check returning True and the sender being told to skip RDMA, another thread canremove()the key. Result: sender skips, but the data is gone → data loss.Bug 2:
remove()doesn't decrement refcount when ref_count > 1Current code:
When
ref_count > 1, it returns True but never decrements. This means if two requests share a key (A puts, B dedup-skips with pin), after A removes, ref stays at 2 forever → memory leak.Required Fix
Change 1: Atomic check + pin in
_async_allocate_and_putReplace the current
self.contains(key, pin=False)call with an inline check underself.data_lockthat also pins:This eliminates the TOCTOU window because
remove()also requiresself.data_lock.Change 2: Fix
remove()to always decrement refcountEvery
remove()call decrements. Only when refcount hits 0 is the entry actually deleted and memory freed.Change 3: Update tests in
tests/v1/test_pd_backend_async_race.pyFakePDBackendDataPath.remove()should match the new semantics (always decrement, delete at 0).obj_a.get_ref_count.return_value = 2should verify that after tworemove()calls the object is finally freed.test_remove_does_not_affect_other_requesttest works correctly: after first remove ref goes from 2→1 (not deleted), after second remove ref goes from 1→0 (deleted).Files to modify
lmcache/v1/storage_backend/pd_backend_async.pytests/v1/test_pd_backend_async_race.pyThe following is the prior conversation context from the user's chat exploration (may be truncated):
User: review一下@hlin99/LMCache/pull/323
Assistant: ## PR Review: Fix PDBackendAsync shared-key races
Summary
ww22_PD_race_conditionThis PR ports the sync
PDBackend's shared-key semantics toPDBackendAsync, fixing two race conditions: (1)put()freeing an in-flight RDMA buffer, and (2)remove()unconditionally deleting entries still needed by other requests.Core Changes
1. Receiver dedup in
_async_allocate_and_putMirrors the sync backend's approach — skips allocation for keys already in the store.
2.
put()— no longer overwrites existing keyEliminates use-after-free: the new (duplicate) object is released instead of evicting the in-flight one.
3.
remove()— refcount-aware