PDBackendAsync: close dedup TOCTOU window and correct shared-key refcount release #324

Merged
hlin99 merged 4 commits into copilot/fix-race-conditions-in-pdbackendasync from copilot/fix-dedup-logic-bugs on May 29, 2026

Copilot AI commented May 29, 2026

PR #323's async dedup path still had a race: _async_allocate_and_put could decide "already present" outside data_lock while a concurrent remove() deleted the same key, so the sender skipped the transfer for data that no longer existed. It also leaked shared entries, because remove() returned success for ref_count > 1 without decrementing.

  • Atomic dedup check + pin in _async_allocate_and_put

    • Replaced contains(..., pin=False) with inline data_lock-guarded key in self.data check.
    • Pins existing objects (ref_count_up()) at dedup decision time so concurrent remove() cannot drop the object before the deduped consumer retrieves it.
    • Continues to mark index in already_sent_indexes so sender skips RDMA for deduped chunks.
  • remove() semantics corrected to release every caller reference

    • remove() now always performs ref_count_down() when key exists.
    • Deletes from self.data only when refcount reaches 0.
    • Allocation-freed notification remains tied to actual delete/free only.
  • Race test updates to match new lifetime contract

    • FakePDBackendDataPath.remove() now mirrors backend behavior: decrement always, delete at zero.
    • Shared-key tests now assert two-step release (2 -> 1 after first remove, 1 -> 0 + delete after second remove), including test_remove_does_not_affect_other_request.
# _async_allocate_and_put dedup fast-path
with self.data_lock:
    if key in self.data:
        self.data[key].ref_count_up()  # pin shared object
        already_sent_indexes.append(idx)
        continue

# remove()
with self.data_lock:
    mem_obj = self.data.get(key)
    if mem_obj is not None:
        mem_obj.ref_count_down()
        if mem_obj.get_ref_count() == 0:
            del self.data[key]
        return True
    return False
Original prompt

Problem

PR #323 introduced dedup logic in _async_allocate_and_put but has two remaining bugs:

Bug 1: TOCTOU between contains() check and remove()

In lmcache/v1/storage_backend/pd_backend_async.py, the _async_allocate_and_put method calls self.contains(key, pin=False) without holding self.data_lock. Between this check returning True and the sender being told to skip RDMA, another thread can remove() the key. Result: sender skips, but the data is gone → data loss.
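
The interleaving described above can be replayed deterministically. The following is a minimal sketch, not the LMCache code: `ToyStore` and `replay_unpinned_dedup` are hypothetical names, and the store maps keys to plain integers. It shows that even a membership check that takes the lock internally is stale as soon as the lock is released.

```python
import threading


class ToyStore:
    """Hypothetical stand-in for the backend's key -> refcount map."""

    def __init__(self):
        self.data_lock = threading.Lock()
        self.data = {}  # key -> reference count

    def contains(self, key):
        # Check only: nothing keeps the entry alive after this returns.
        with self.data_lock:
            return key in self.data

    def remove(self, key):
        with self.data_lock:
            return self.data.pop(key, None) is not None


def replay_unpinned_dedup():
    store = ToyStore()
    store.data["chunk-0"] = 1  # request A already stored the chunk

    already_sent_indexes = []
    present = store.contains("chunk-0")  # time of check
    store.remove("chunk-0")              # request A removes in between
    if present:                          # time of use: result is stale
        already_sent_indexes.append(0)   # sender is told to skip chunk 0

    data_available = "chunk-0" in store.data
    return already_sent_indexes, data_available


skipped, available = replay_unpinned_dedup()
print(skipped, available)
```

In this replay the sender is told to skip index 0 even though the entry has already been removed, which is the data-loss outcome the bug report describes.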

Bug 2: remove() doesn't decrement refcount when ref_count > 1

Current code:

def remove(self, key, force=True):
    with self.data_lock:
        mem_obj = self.data.get(key, None)
        if mem_obj is not None:
            if mem_obj.get_ref_count() == 1:
                del self.data[key]
                mem_obj.ref_count_down()
            return True
        return False

When ref_count > 1, it returns True but never decrements. This means if two requests share a key (A puts, B dedup-skips with pin), after A removes, ref stays at 2 forever → memory leak.
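
The leak can be reproduced in a few lines. This is a single-threaded sketch under stated assumptions: `CountedObj` is a hypothetical stand-in for the memory object, and `old_remove` copies the pre-fix logic above with the lock omitted because no concurrency is involved.

```python
class CountedObj:
    """Hypothetical stand-in for a reference-counted memory object."""

    def __init__(self, ref_count=1):
        self._ref_count = ref_count

    def ref_count_down(self):
        self._ref_count -= 1

    def get_ref_count(self):
        return self._ref_count


def old_remove(data, key):
    """Pre-fix behavior: only the ref_count == 1 case decrements."""
    mem_obj = data.get(key, None)
    if mem_obj is not None:
        if mem_obj.get_ref_count() == 1:
            del data[key]
            mem_obj.ref_count_down()
        return True
    return False


# Two requests share the key, so the count starts at 2.
data = {"k": CountedObj(ref_count=2)}
results = [old_remove(data, "k") for _ in range(3)]
leaked_ref_count = data["k"].get_ref_count()
print(results, leaked_ref_count)
```

Every call reports success, yet the count never moves off 2 and the entry is never deleted, no matter how many times remove() is called.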

Required Fix

Change 1: Atomic check + pin in _async_allocate_and_put

Replace the current self.contains(key, pin=False) call with an inline check under self.data_lock that also pins:

with self.data_lock:
    if key in self.data:
        self.data[key].ref_count_up()  # pin to prevent remove from deleting
        already_sent_indexes.append(idx)
        continue

This eliminates the TOCTOU window because remove() also requires self.data_lock.
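
A small threaded sketch can illustrate why doing the check and the pin under one lock is enough. Everything here is hypothetical (`PinningStore`, `pin_if_present`, `consumer`), and reference counts are plain integers rather than memory objects. Each consumer that wins the dedup check holds a pin, verifies the entry still exists, and then releases its reference, while the original owner releases concurrently.

```python
import threading


class PinningStore:
    """Hypothetical store: key -> reference count, guarded by one lock."""

    def __init__(self):
        self.data_lock = threading.Lock()
        self.data = {}
        self.freed = 0  # number of actual deletions

    def pin_if_present(self, key):
        # Check and pin in one critical section, mirroring the fix.
        with self.data_lock:
            if key in self.data:
                self.data[key] += 1
                return True
            return False

    def remove(self, key):
        with self.data_lock:
            if key not in self.data:
                return False
            self.data[key] -= 1
            if self.data[key] == 0:
                del self.data[key]
                self.freed += 1
            return True


def consumer(store, key, violations):
    if store.pin_if_present(key):
        # While this thread holds a pin, the entry must still exist.
        with store.data_lock:
            if key not in store.data:
                violations.append(key)
        store.remove(key)  # release this consumer's reference


store = PinningStore()
store.data["chunk-0"] = 1  # the owner's reference
violations = []
threads = [
    threading.Thread(target=consumer, args=(store, "chunk-0", violations))
    for _ in range(8)
]
threads.append(threading.Thread(target=store.remove, args=("chunk-0",)))
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store.freed, store.data, violations)
```

Consumers that arrive after the final release simply miss the dedup fast path, so under these assumptions the entry is freed exactly once regardless of scheduling.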

Change 2: Fix remove() to always decrement refcount

def remove(self, key, force=True):
    with self.data_lock:
        mem_obj = self.data.get(key, None)
        if mem_obj is not None:
            mem_obj.ref_count_down()
            if mem_obj.get_ref_count() == 0:
                del self.data[key]
                # notify _alloc_freed_condition only here
            return True
        return False

Every remove() call decrements. Only when refcount hits 0 is the entry actually deleted and memory freed.
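
The two-step release and the "notify only on real free" rule can be sketched as follows. `ToyBackend`, `CountedObj`, and the `free_notifications` counter are hypothetical; `_alloc_freed_condition` is assumed to be a `threading.Condition`, and notifying after `data_lock` is released is a choice made for this sketch, since the source does not show where the real backend notifies.

```python
import threading


class CountedObj:
    """Hypothetical stand-in for a reference-counted memory object."""

    def __init__(self, ref_count=1):
        self._ref_count = ref_count

    def ref_count_down(self):
        self._ref_count -= 1

    def get_ref_count(self):
        return self._ref_count


class ToyBackend:
    def __init__(self):
        self.data_lock = threading.Lock()
        self.data = {}
        # Assumption: the real condition is a threading.Condition.
        self._alloc_freed_condition = threading.Condition()
        self.free_notifications = 0  # test hook, not in the source

    def remove(self, key, force=True):
        freed = False
        with self.data_lock:
            mem_obj = self.data.get(key)
            if mem_obj is None:
                return False
            mem_obj.ref_count_down()  # every caller releases one reference
            if mem_obj.get_ref_count() == 0:
                del self.data[key]
                freed = True
        if freed:
            # Wake allocators only when memory was actually freed.
            with self._alloc_freed_condition:
                self.free_notifications += 1
                self._alloc_freed_condition.notify_all()
        return True


backend = ToyBackend()
backend.data["k"] = CountedObj(ref_count=2)

first = backend.remove("k")
after_first = (backend.data["k"].get_ref_count(), backend.free_notifications)
second = backend.remove("k")
after_second = ("k" in backend.data, backend.free_notifications)
third = backend.remove("k")
print(first, after_first, second, after_second, third)
```

The first call drops the count from 2 to 1 without notifying, the second deletes the entry and notifies once, and a third call on the now-missing key returns False.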

Change 3: Update tests in tests/v1/test_pd_backend_async_race.py

  • FakePDBackendDataPath.remove() should match the new semantics (always decrement, delete at 0).
  • Tests that set obj_a.get_ref_count.return_value = 2 should verify that after two remove() calls the object is finally freed.
  • Verify the test_remove_does_not_affect_other_request test works correctly: after first remove ref goes from 2→1 (not deleted), after second remove ref goes from 1→0 (deleted).

Files to modify

  1. lmcache/v1/storage_backend/pd_backend_async.py
  2. tests/v1/test_pd_backend_async_race.py

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Please review @hlin99/LMCache/pull/323

Assistant: ## PR Review: Fix PDBackendAsync shared-key races

Summary

| Field | Value |
|---|---|
| PR | #323 (Draft) |
| Author | Copilot |
| Base | ww22_PD_race_condition |
| Files changed | 2 (+85, −80) |
| Mergeable | Yes |
| Risk | Medium-High: touches concurrent memory lifecycle in a PD (prefill-decode) data path |

This PR ports the sync PDBackend's shared-key semantics to PDBackendAsync, fixing two race conditions: (1) put() freeing an in-flight RDMA buffer, and (2) remove() unconditionally deleting entries still needed by other requests.


Core Changes

1. Receiver dedup in _async_allocate_and_put

+        already_sent_indexes: list[int] = []
         for idx, key_str in enumerate(alloc_request.keys):
             key = CacheEngineKey.from_string(key_str)
+            if self.contains(key, pin=False):
+                already_sent_indexes.append(idx)
+                continue

Mirrors the sync backend's approach — skips allocation for keys already in the store.

2. put() — no longer overwrites existing key

         with self.data_lock:
-            old = self.data.pop(key, None)
-            if old is not None:
-                old.ref_count_down()
-            self.data[key] = mem_obj
+            if key in self.data:
+                mem_obj.ref_count_down()
+                return
+            self.data[key] = mem_obj

Eliminates use-after-free: the new (duplicate) object is released instead of evicting the in-flight one.
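
As an illustration of the no-overwrite behavior in the diff above, here is a minimal sketch. `ToyBackend` and `CountedObj` are hypothetical stand-ins, and the initial reference count of 1 is an assumption about how a freshly allocated object starts out.

```python
import threading


class CountedObj:
    """Hypothetical stand-in for a reference-counted memory object."""

    def __init__(self, name):
        self.name = name
        self._ref_count = 1  # assumption: allocation holds one reference

    def ref_count_down(self):
        self._ref_count -= 1

    def get_ref_count(self):
        return self._ref_count


class ToyBackend:
    def __init__(self):
        self.data_lock = threading.Lock()
        self.data = {}

    def put(self, key, mem_obj):
        with self.data_lock:
            if key in self.data:
                # Duplicate: release the newcomer, leave the stored
                # (possibly in-flight) object untouched.
                mem_obj.ref_count_down()
                return
            self.data[key] = mem_obj


backend = ToyBackend()
first = CountedObj("first")
duplicate = CountedObj("duplicate")
backend.put("k", first)
backend.put("k", duplicate)
print(backend.data["k"].name, first.get_ref_count(), duplicate.get_ref_count())
```

The stored object keeps its reference, and only the duplicate is released, so a buffer that a transfer may still be writing into is never evicted by a later put().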

3. remove() — refcount-aware

-            mem_obj = self.data.pop(key, None)
+            mem_obj = self.data.get(key, None)
             if mem_obj is not None:
-                mem_obj.ref_count_down()
+                if mem_obj.get_ref_count() == 1:
+                    del sel...

</details>




*This pull request was created from Copilot chat.*

Copilot AI changed the title from "[WIP] Fix TOCTOU issue and refcount decrement in dedup logic" to "PDBackendAsync: close dedup TOCTOU window and correct shared-key refcount release" May 29, 2026
Copilot finished work on behalf of hlin99 May 29, 2026 02:59
Copilot AI requested a review from hlin99 May 29, 2026 02:59
Owner

@hlin99 hlin99 left a comment


Change Request: Minimize UT modifications

The test file changes include unnecessary cleanups that are not required for the fix. Please revert the following non-essential changes while keeping the necessary refcount-tracking logic:

Revert these changes:

  1. Keep pytest import — do not remove it
  2. Keep MemoryFormat, MemoryObj, MemoryObjMetadata imports — do not remove them
  3. Keep spec=MemoryObjMetadata and spec=MemoryObj in _make_memory_obj — only add the dynamic refcount logic (ref_count_up, ref_count_down, get_ref_count side effects and _ref_count tracking) without removing existing mock specs
  4. Keep meta.fmt, meta.shape, meta.dtype, obj.get_size fields in _make_memory_obj — do not remove them
  5. Keep type annotation dict[CacheEngineKey, MemoryObj] in FakePDBackendDataPath — do not change to MagicMock
  6. Keep put() parameter annotation as mem_obj: MemoryObj — do not change to MagicMock

Keep these changes (they are necessary):

  • _make_memory_obj adding _ref_count field and dynamic ref_count_up/ref_count_down/get_ref_count side effects
  • FakePDBackendDataPath.remove() semantics change (always decrement, delete at 0)
  • obj_a._ref_count = 2 replacing obj_a.get_ref_count.return_value = 2
  • New assertions like assert obj_a.get_ref_count() == 1 and the second remove() in test_remove_does_not_affect_other_request
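
One way to keep the mock specs while adding dynamic refcount tracking is sketched below. The `MemoryObj` class here is a hypothetical stand-in defined only so the example runs on its own; the real test would pass the imported `MemoryObj` as `spec`, and the `initial_ref_count` parameter is an assumption of this sketch.

```python
from unittest.mock import MagicMock


class MemoryObj:
    """Hypothetical stand-in for the real MemoryObj interface."""

    def ref_count_up(self): ...
    def ref_count_down(self): ...
    def get_ref_count(self): ...
    def get_size(self): ...


def _make_memory_obj(initial_ref_count=1):
    # spec= keeps the mock limited to the MemoryObj interface.
    obj = MagicMock(spec=MemoryObj)
    obj._ref_count = initial_ref_count

    def _up():
        obj._ref_count += 1

    def _down():
        obj._ref_count -= 1

    # side_effect makes the mocked methods update real state.
    obj.ref_count_up.side_effect = _up
    obj.ref_count_down.side_effect = _down
    obj.get_ref_count.side_effect = lambda: obj._ref_count
    return obj


obj_a = _make_memory_obj()
obj_a._ref_count = 2  # two requests share the key
obj_a.ref_count_down()
count_after_first = obj_a.get_ref_count()
obj_a.ref_count_down()
count_after_second = obj_a.get_ref_count()
print(count_after_first, count_after_second)
```

Because `spec` (rather than `spec_set`) is used, the extra `_ref_count` attribute can be set on the mock, while access to any attribute outside the interface still raises `AttributeError`.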

Owner

hlin99 commented May 29, 2026

@copilot change request

Author

Copilot AI commented May 29, 2026

Change Request: Minimize UT modifications

The test file changes include unnecessary cleanups that are not required for the fix. Please revert the following non-essential changes while keeping the ...

Addressed in commit f0f610d.

I reverted the non-essential UT cleanups and kept only the required behavior updates:

  • restored pytest import
  • restored MemoryFormat, MemoryObj, MemoryObjMetadata imports
  • restored spec=MemoryObjMetadata / spec=MemoryObj
  • restored meta.fmt, meta.shape, meta.dtype, and obj.get_size
  • restored type annotations to dict[CacheEngineKey, MemoryObj] and mem_obj: MemoryObj
  • preserved dynamic refcount side effects and the two-step remove assertions

Author

Copilot AI commented May 29, 2026

@copilot change request

This was already addressed in commit f0f610d. All six requested reversions are in place — the test file retains pytest import, MemoryFormat/MemoryObj/MemoryObjMetadata imports, spec= mock arguments, metadata fields, and typed annotations, while keeping the necessary refcount side-effects and remove-semantics changes.

Copilot finished work on behalf of hlin99 May 29, 2026 03:07
@hlin99 hlin99 marked this pull request as ready for review May 29, 2026 03:08
@hlin99 hlin99 merged commit 78d67f8 into copilot/fix-race-conditions-in-pdbackendasync May 29, 2026
1 check passed
@hlin99 hlin99 deleted the copilot/fix-dedup-logic-bugs branch May 29, 2026 03:08
hlin99 pushed a commit that referenced this pull request May 29, 2026
…val paths (#323)

* Initial plan

* Fix PDBackendAsync shared-key race handling

* PDBackendAsync: close dedup TOCTOU window and correct shared-key refcount release (#324)

* Initial plan

* fix(pd): make dedup pin atomic and decrement remove refcount

* test(pd): align async race tests with refcount decrement semantics

* test(pd): restore requested UT typing and mock specs

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

* Follow-up fixes for PDBackendAsync: log correctness, bounds validation, dedup test coverage (#325)

* Initial plan

* Fix remove() log, add bounds check, change put() log to INFO, add sender dedup tests

* Merge TestSenderSideDedup into test_pd_backend_async_race.py, remove separate file

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

* Fix abort cleanup to release receiver-side dedup refs in PD async transfer (#326)

* Initial plan

* fix(pd_backend_async): include deduped keys in abort cleanup tracking

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>