PDBackendAsync: close dedup TOCTOU window and correct shared-key refcount release by Copilot · Pull Request #324 · hlin99/LMCache

Copilot · 2026-05-29T02:46:46Z

PR #323’s async dedup path still had a race: _async_allocate_and_put could decide “already present” outside data_lock while a concurrent remove() deleted the same key, so the sender skipped the transfer and the data was missing. It also leaked shared entries because remove() returned success for ref_count > 1 without decrementing.

Atomic dedup check + pin in _async_allocate_and_put
- Replaced contains(..., pin=False) with an inline `key in self.data` check guarded by data_lock.
- Pins existing objects (ref_count_up()) at dedup decision time so a concurrent remove() cannot drop the object before the deduped consumer retrieves it.
- Continues to record the index in already_sent_indexes so the sender skips RDMA for deduped chunks.

remove() semantics corrected to release every caller reference
- remove() now always performs ref_count_down() when the key exists.
- Deletes from self.data only when the refcount reaches 0.
- Allocation-freed notification remains tied to the actual delete/free only.

Race test updates to match new lifetime contract
- FakePDBackendDataPath.remove() now mirrors backend behavior: decrement always, delete at zero.
- Shared-key tests now assert two-step release (2 -> 1 after first remove, 1 -> 0 + delete after second remove), including test_remove_does_not_affect_other_request.

# _async_allocate_and_put dedup fast-path
with self.data_lock:
    if key in self.data:
        self.data[key].ref_count_up()  # pin shared object
        already_sent_indexes.append(idx)
        continue

# remove()
with self.data_lock:
    mem_obj = self.data.get(key)
    if mem_obj is not None:
        mem_obj.ref_count_down()
        if mem_obj.get_ref_count() == 0:
            del self.data[key]
        return True
    return False

Original prompt

Problem

PR #323 introduced dedup logic in _async_allocate_and_put but has two remaining bugs:

Bug 1: TOCTOU between `contains()` check and `remove()`

In lmcache/v1/storage_backend/pd_backend_async.py, the _async_allocate_and_put method calls self.contains(key, pin=False) without holding self.data_lock. Between this check returning True and the sender being told to skip RDMA, another thread can remove() the key. Result: sender skips, but the data is gone → data loss.
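
The interleaving can be replayed deterministically in a few lines. This is only a sketch: `ToyStore` and its methods are hypothetical stand-ins, not the LMCache classes. Whether or not the real contains() locks internally, the point is that the "present" decision is acted on after any lock has been released.

```python
import threading


class ToyStore:
    """Hypothetical stand-in for the backend's key -> object map."""

    def __init__(self):
        self.data = {}
        self.data_lock = threading.Lock()

    def contains(self, key):
        # The lock (if any) covers only the lookup, not the later use.
        with self.data_lock:
            return key in self.data

    def remove(self, key):
        with self.data_lock:
            return self.data.pop(key, None) is not None

    def get(self, key):
        with self.data_lock:
            return self.data.get(key)


store = ToyStore()
store.data["chunk-0"] = b"kv bytes"

# Step 1: the receiver decides "already present" and tells the sender to skip.
skip_transfer = store.contains("chunk-0")

# Step 2: another request removes the key before the receiver reads it.
removed = store.remove("chunk-0")

# Step 3: the deduped consumer finds nothing, and nothing was sent either.
payload = store.get("chunk-0")
```

Here `skip_transfer` and `removed` are both true while `payload` is `None`, which is the data-loss outcome described above.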

Bug 2: `remove()` doesn't decrement refcount when ref_count > 1

Current code:

def remove(self, key, force=True):
    with self.data_lock:
        mem_obj = self.data.get(key, None)
        if mem_obj is not None:
            if mem_obj.get_ref_count() == 1:
                del self.data[key]
                mem_obj.ref_count_down()
            return True
        return False

When ref_count > 1, it returns True but never decrements. This means if two requests share a key (A puts, B dedup-skips with pin), after A removes, ref stays at 2 forever → memory leak.
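
A small sketch of the leak, assuming a hypothetical `RefCounted` class in place of the real memory object and omitting the lock for brevity. `old_remove` copies the pre-fix branch structure quoted above.

```python
class RefCounted:
    """Hypothetical refcounted buffer; not LMCache's MemoryObj."""

    def __init__(self, ref_count=1):
        self._ref_count = ref_count

    def ref_count_down(self):
        self._ref_count -= 1

    def get_ref_count(self):
        return self._ref_count


def old_remove(data, key):
    """Pre-fix logic: only the ref_count == 1 case releases anything."""
    mem_obj = data.get(key)
    if mem_obj is not None:
        if mem_obj.get_ref_count() == 1:
            del data[key]
            mem_obj.ref_count_down()
        return True
    return False


# Two holders: request A's put plus request B's dedup pin.
data = {"k": RefCounted(ref_count=2)}

first = old_remove(data, "k")   # request A releases
second = old_remove(data, "k")  # request B releases

leaked = "k" in data
final_refs = data["k"].get_ref_count()
```

Both calls report success, yet the entry is still in `data` with a refcount of 2, so no later remove() can ever free it.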

Required Fix

Change 1: Atomic check + pin in `_async_allocate_and_put`

Replace the current self.contains(key, pin=False) call with an inline check under self.data_lock that also pins:

with self.data_lock:
    if key in self.data:
        self.data[key].ref_count_up()  # pin to prevent remove from deleting
        already_sent_indexes.append(idx)
        continue

This eliminates the TOCTOU window because remove() also requires self.data_lock.
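
As an illustration of how the check and the pin fit inside the per-key loop, here is a sketch with hypothetical names (`dedup_pass`, `RefCounted`, plain string keys). The real method also allocates buffers for the remaining keys, which is reduced here to collecting their indexes.

```python
import threading


class RefCounted:
    """Hypothetical refcounted buffer; not LMCache's MemoryObj."""

    def __init__(self):
        self._ref_count = 1

    def ref_count_up(self):
        self._ref_count += 1

    def get_ref_count(self):
        return self._ref_count


def dedup_pass(data, data_lock, keys):
    """Split keys into deduped (pinned) indexes and indexes still to allocate."""
    already_sent_indexes = []
    to_allocate = []
    for idx, key in enumerate(keys):
        with data_lock:
            if key in data:
                # Check and pin happen in one critical section, so a
                # concurrent remove() cannot run between them.
                data[key].ref_count_up()
                already_sent_indexes.append(idx)
                continue
        to_allocate.append(idx)
    return already_sent_indexes, to_allocate


lock = threading.Lock()
data = {"a": RefCounted()}

already_sent, to_allocate = dedup_pass(data, lock, ["a", "b", "a"])
```

With these inputs, indexes 0 and 2 are deduped and each adds one pin to "a", while index 1 is left for allocation.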

Change 2: Fix `remove()` to always decrement refcount

def remove(self, key, force=True):
    with self.data_lock:
        mem_obj = self.data.get(key, None)
        if mem_obj is not None:
            mem_obj.ref_count_down()
            if mem_obj.get_ref_count() == 0:
                del self.data[key]
                # notify _alloc_freed_condition only here
            return True
        return False

Every remove() call decrements. Only when refcount hits 0 is the entry actually deleted and memory freed.
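
The sketch below shows one way the "notify only on actual free" rule could look. `ToyBackend`, `alloc_freed`, and `free_events` are hypothetical; in particular, notifying after releasing data_lock is an assumption of this sketch, since the source does not say which lock `_alloc_freed_condition` uses.

```python
import threading


class RefCounted:
    """Hypothetical refcounted buffer; not LMCache's MemoryObj."""

    def __init__(self, ref_count=1):
        self._ref_count = ref_count

    def ref_count_down(self):
        self._ref_count -= 1

    def get_ref_count(self):
        return self._ref_count


class ToyBackend:
    def __init__(self):
        self.data = {}
        self.data_lock = threading.Lock()
        # Stand-in for _alloc_freed_condition.
        self.alloc_freed = threading.Condition()
        self.free_events = 0

    def remove(self, key):
        with self.data_lock:
            mem_obj = self.data.get(key)
            if mem_obj is None:
                return False
            mem_obj.ref_count_down()  # every caller releases one reference
            freed = mem_obj.get_ref_count() == 0
            if freed:
                del self.data[key]
        if freed:
            # Wake allocation waiters only when memory was actually released.
            with self.alloc_freed:
                self.free_events += 1
                self.alloc_freed.notify_all()
        return True


backend = ToyBackend()
backend.data["k"] = RefCounted(ref_count=2)

r1 = backend.remove("k")
present_after_first = "k" in backend.data
events_after_first = backend.free_events

r2 = backend.remove("k")
r3 = backend.remove("k")  # key is gone by now
```

The first call only drops the count to 1 and signals nothing; the second deletes the entry and signals once; the third finds no key and returns False.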

Change 3: Update tests in `tests/v1/test_pd_backend_async_race.py`

FakePDBackendDataPath.remove() should match the new semantics (always decrement, delete at 0).
Tests that set obj_a.get_ref_count.return_value = 2 should verify that after two remove() calls the object is finally freed.
Verify the test_remove_does_not_affect_other_request test works correctly: after first remove ref goes from 2→1 (not deleted), after second remove ref goes from 1→0 (deleted).
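
For the test side, a mock whose refcount actually moves can be built with `side_effect` callables. This is a sketch under assumptions: `MemoryObjLike`, `make_memory_obj`, and `fake_remove` are hypothetical names standing in for the real MemoryObj spec, the `_make_memory_obj` helper, and FakePDBackendDataPath.remove().

```python
from unittest.mock import MagicMock


class MemoryObjLike:
    """Hypothetical stand-in for the interface used as the mock spec."""

    def ref_count_up(self): ...

    def ref_count_down(self): ...

    def get_ref_count(self): ...


def make_memory_obj(initial_refs=1):
    obj = MagicMock(spec=MemoryObjLike)
    obj._ref_count = initial_refs  # plain int tracked on the mock

    def up():
        obj._ref_count += 1

    def down():
        obj._ref_count -= 1

    obj.ref_count_up.side_effect = up
    obj.ref_count_down.side_effect = down
    obj.get_ref_count.side_effect = lambda: obj._ref_count
    return obj


def fake_remove(data, key):
    """Always decrement; delete only at zero."""
    mem_obj = data.get(key)
    if mem_obj is None:
        return False
    mem_obj.ref_count_down()
    if mem_obj.get_ref_count() == 0:
        del data[key]
    return True


obj_a = make_memory_obj()
obj_a._ref_count = 2  # shared by two requests
data = {"k": obj_a}

fake_remove(data, "k")
refs_after_first = obj_a.get_ref_count()
present_after_first = "k" in data

fake_remove(data, "k")
refs_after_second = obj_a.get_ref_count()
```

Setting `obj_a._ref_count = 2` instead of `get_ref_count.return_value = 2` matters because a fixed return value would never reach 0, so the delete branch could not be exercised.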

Files to modify

lmcache/v1/storage_backend/pd_backend_async.py
tests/v1/test_pd_backend_async_race.py

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Please review @hlin99/LMCache/pull/323

Assistant: ## PR Review: Fix PDBackendAsync shared-key races

Summary


PR	#323 (Draft)
Author	Copilot
Base	`ww22_PD_race_condition`
Files changed	2 (+85, −80)
Mergeable	Yes
Risk	Medium-High — touches concurrent memory lifecycle in a PD (prefill-decode) data path

This PR ports the sync PDBackend's shared-key semantics to PDBackendAsync, fixing two race conditions: (1) put() freeing an in-flight RDMA buffer, and (2) remove() unconditionally deleting entries still needed by other requests.

Core Changes

1. Receiver dedup in `_async_allocate_and_put`

+        already_sent_indexes: list[int] = []
         for idx, key_str in enumerate(alloc_request.keys):
             key = CacheEngineKey.from_string(key_str)
+            if self.contains(key, pin=False):
+                already_sent_indexes.append(idx)
+                continue

Mirrors the sync backend's approach — skips allocation for keys already in the store.

2. `put()` — no longer overwrites existing key

         with self.data_lock:
-            old = self.data.pop(key, None)
-            if old is not None:
-                old.ref_count_down()
-            self.data[key] = mem_obj
+            if key in self.data:
+                mem_obj.ref_count_down()
+                return
+            self.data[key] = mem_obj

Eliminates use-after-free: the new (duplicate) object is released instead of evicting the in-flight one.
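
A sketch of the keep-existing behavior, with hypothetical names (`put_if_absent`, `RefCounted`). The boolean return value is added here only to make the outcome visible; the diff above returns nothing.

```python
import threading


class RefCounted:
    """Hypothetical refcounted buffer; not LMCache's MemoryObj."""

    def __init__(self):
        self._ref_count = 1

    def ref_count_down(self):
        self._ref_count -= 1

    def get_ref_count(self):
        return self._ref_count


def put_if_absent(data, data_lock, key, mem_obj):
    """Store mem_obj unless the key exists; release the duplicate otherwise."""
    with data_lock:
        if key in data:
            # The stored object may be the target of an in-flight transfer,
            # so it stays; the newcomer gives up its reference instead.
            mem_obj.ref_count_down()
            return False
        data[key] = mem_obj
        return True


lock = threading.Lock()
data = {}
first, duplicate = RefCounted(), RefCounted()

stored_first = put_if_absent(data, lock, "k", first)
stored_duplicate = put_if_absent(data, lock, "k", duplicate)
```

After both calls the map still holds `first` with its reference intact, and `duplicate` has been released.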

3. `remove()` — refcount-aware

-            mem_obj = self.data.pop(key, None)
+            mem_obj = self.data.get(key, None)
             if mem_obj is not None:
-                mem_obj.ref_count_down()
+                if mem_obj.get_ref_count() == 1:
+                    del sel...

*This pull request was created from Copilot chat.*

hlin99

Change Request: Minimize UT modifications

The test file changes include unnecessary cleanups that are not required for the fix. Please revert the following non-essential changes while keeping the necessary refcount-tracking logic:

Revert these changes:

Keep pytest import — do not remove it
Keep MemoryFormat, MemoryObj, MemoryObjMetadata imports — do not remove them
Keep spec=MemoryObjMetadata and spec=MemoryObj in _make_memory_obj — only add the dynamic refcount logic (ref_count_up, ref_count_down, get_ref_count side effects and _ref_count tracking) without removing existing mock specs
Keep meta.fmt, meta.shape, meta.dtype, obj.get_size fields in _make_memory_obj — do not remove them
Keep type annotation dict[CacheEngineKey, MemoryObj] in FakePDBackendDataPath — do not change to MagicMock
Keep put() parameter annotation as mem_obj: MemoryObj — do not change to MagicMock

Keep these changes (they are necessary):

_make_memory_obj adding _ref_count field and dynamic ref_count_up/ref_count_down/get_ref_count side effects
FakePDBackendDataPath.remove() semantics change (always decrement, delete at 0)
obj_a._ref_count = 2 replacing obj_a.get_ref_count.return_value = 2
New assertions like assert obj_a.get_ref_count() == 1 and the second remove() in test_remove_does_not_affect_other_request

hlin99 · 2026-05-29T03:03:09Z

@copilot change request

Copilot · 2026-05-29T03:06:05Z

Change Request: Minimize UT modifications

The test file changes include unnecessary cleanups that are not required for the fix. Please revert the following non-essential changes while keeping the ...

Addressed in commit f0f610d.

I reverted the non-essential UT cleanups and kept only the required behavior updates:

restored pytest import
restored MemoryFormat, MemoryObj, MemoryObjMetadata imports
restored spec=MemoryObjMetadata / spec=MemoryObj
restored meta.fmt, meta.shape, meta.dtype, and obj.get_size
restored type annotations to dict[CacheEngineKey, MemoryObj] and mem_obj: MemoryObj
preserved dynamic refcount side effects and the two-step remove assertions

Copilot · 2026-05-29T03:07:32Z

@copilot change request

This was already addressed in commit f0f610d. All six requested reversions are in place — the test file retains pytest import, MemoryFormat/MemoryObj/MemoryObjMetadata imports, spec= mock arguments, metadata fields, and typed annotations, while keeping the necessary refcount side-effects and remove-semantics changes.

…val paths (#323)
* Initial plan
* Fix PDBackendAsync shared-key race handling
* PDBackendAsync: close dedup TOCTOU window and correct shared-key refcount release (#324)
* Initial plan
* fix(pd): make dedup pin atomic and decrement remove refcount
* test(pd): align async race tests with refcount decrement semantics
* test(pd): restore requested UT typing and mock specs
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
* Follow-up fixes for PDBackendAsync: log correctness, bounds validation, dedup test coverage (#325)
* Initial plan
* Fix remove() log, add bounds check, change put() log to INFO, add sender dedup tests
* Merge TestSenderSideDedup into test_pd_backend_async_race.py, remove separate file
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
* Fix abort cleanup to release receiver-side dedup refs in PD async transfer (#326)
* Initial plan
* fix(pd_backend_async): include deduped keys in abort cleanup tracking
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

Initial plan

2f36ec1

Copilot AI assigned Copilot and hlin99 May 29, 2026

Copilot started work on behalf of hlin99 May 29, 2026 02:46

Copilot AI added 2 commits May 29, 2026 02:54

fix(pd): make dedup pin atomic and decrement remove refcount

49583c7

test(pd): align async race tests with refcount decrement semantics

3d84b65

Copilot AI changed the title ~~[WIP] Fix TOCTOU issue and refcount decrement in dedup logic~~ PDBackendAsync: close dedup TOCTOU window and correct shared-key refcount release May 29, 2026

Copilot finished work on behalf of hlin99 May 29, 2026 02:59

Copilot AI requested a review from hlin99 May 29, 2026 02:59

hlin99 requested changes May 29, 2026

Copilot started work on behalf of hlin99 May 29, 2026 03:02

test(pd): restore requested UT typing and mock specs

f0f610d

Copilot finished work on behalf of hlin99 May 29, 2026 03:06

Copilot AI requested a review from hlin99 May 29, 2026 03:06

Copilot started work on behalf of hlin99 May 29, 2026 03:06

Copilot finished work on behalf of hlin99 May 29, 2026 03:07

hlin99 marked this pull request as ready for review May 29, 2026 03:08

hlin99 merged commit 78d67f8 into copilot/fix-race-conditions-in-pdbackendasync May 29, 2026
1 check passed

hlin99 deleted the copilot/fix-dedup-logic-bugs branch May 29, 2026 03:08

hlin99 merged 4 commits into copilot/fix-race-conditions-in-pdbackendasync from copilot/fix-dedup-logic-bugs
