refactor(sdk): rewrite NonceCache with LRU eviction, drift detection, and structured errors #3111

Open
lklimek wants to merge 19 commits into v3.1-dev from fix/sdk-nonce-cache-regression

@lklimek lklimek commented Feb 19, 2026

Issue being fixed or feature implemented

Addresses dashpay/dash-evo-tool#588

The SDK's nonce cache had several design issues that could cause retry loops to fail after a broadcast error:

  • Nonce regression: refresh_identity_nonce() deleted cache entries entirely, causing re-fetch of stale nonces from Platform and recreation of identical (failing) state transitions.
  • Raw upper bits in cache: Platform returns nonces with "missing revisions" bits set in the upper portion. These were stored raw, making comparisons and increments unreliable.
  • Unbounded drift: Repeated bump_first calls without successful broadcasts could push the cached nonce arbitrarily ahead of Platform with no re-fetch trigger.
  • Lock held across I/O: The nonce cache mutex was held during async Platform fetches, blocking all other callers during network round-trips.
  • Duplicated logic: get_identity_nonce and get_identity_contract_nonce contained identical cache logic.

What was done?

Rewrite: Dedicated NonceCache struct

Extracted all nonce caching logic from Sdk into a self-contained NonceCache struct backed by an LRU cache (bounded to 1 000 entries). The struct owns its mutex and exposes a clean async API.

Three-phase locking

get_or_fetch_nonce no longer holds the mutex across the async fetch_from_platform call. The function is structured as: lock → check → unlock → fetch (no lock) → lock → update → unlock. This prevents blocking during network round-trips.
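
The lock → check → unlock → fetch → lock → update → unlock sequence can be sketched as follows. This is a simplified, synchronous illustration using `std::sync::Mutex` rather than the SDK's async code; `SketchCache`, its `u32` key, and the `fetch` closure (standing in for the Platform round-trip) are hypothetical names, not the actual `NonceCache` API.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical, simplified cache mapping a key to a cached nonce.
pub struct SketchCache {
    entries: Mutex<HashMap<u32, u64>>,
}

impl SketchCache {
    pub fn new() -> Self {
        Self { entries: Mutex::new(HashMap::new()) }
    }

    /// `fetch` stands in for the network call; it runs with no lock held.
    pub fn get_or_fetch<F>(&self, key: u32, fetch: F) -> Result<u64, String>
    where
        F: FnOnce() -> Result<u64, String>,
    {
        // Phase 1: lock, check, unlock. The guard is dropped at the end of
        // this block. `unwrap` panics only if the mutex was poisoned.
        {
            let guard = self.entries.lock().unwrap();
            if let Some(&nonce) = guard.get(&key) {
                return Ok(nonce);
            }
        }

        // Phase 2: fetch without holding the lock, so other callers proceed.
        let fetched = fetch()?;

        // Phase 3: lock, update, unlock. Another caller may have written an
        // entry in the meantime, so keep the higher of the two values.
        let mut guard = self.entries.lock().unwrap();
        let slot = guard.entry(key).or_insert(fetched);
        *slot = (*slot).max(fetched);
        Ok(*slot)
    }
}
```

Because the lock is released between phases 1 and 3, two callers can both miss and both fetch; the sketch resolves that race by keeping the larger value on write-back.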

Drift detection

Each cache entry tracks last_platform_nonce. When the cached nonce drifts ≥ MAX_MISSING_IDENTITY_REVISIONS ahead, a re-fetch is forced and Platform's response becomes the source of truth — while still preserving the higher of cached vs Platform nonce to avoid regression.
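
A minimal sketch of the drift check, assuming an entry that records both the current nonce and the last value Platform reported. `DriftEntry` and `has_drifted` are hypothetical names, and the constant's value of 24 is an assumption made for illustration; the real constant is defined in the Platform codebase.

```rust
/// Assumed value for illustration only.
pub const MAX_MISSING_IDENTITY_REVISIONS: u64 = 24;

/// Hypothetical cache entry holding only the fields the drift check needs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DriftEntry {
    /// Nonce the SDK would hand out next (may be bumped ahead of Platform).
    pub current_nonce: u64,
    /// Plain nonce Platform returned on the last fetch.
    pub last_platform_nonce: u64,
}

/// Returns true when local bumps have moved the cached nonce at least
/// MAX_MISSING_IDENTITY_REVISIONS ahead of the last known Platform value.
pub fn has_drifted(entry: &DriftEntry) -> bool {
    // saturating_sub avoids underflow if Platform is ahead of the cache.
    entry.current_nonce.saturating_sub(entry.last_platform_nonce)
        >= MAX_MISSING_IDENTITY_REVISIONS
}
```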

Upper-bit masking at fetch

The IDENTITY_NONCE_VALUE_FILTER mask is applied once, at fetch time. The cache only holds plain nonce values, eliminating redundant masking on every return path.
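
As an illustration of masking at fetch time, the sketch below assumes the nonce value occupies the low 40 bits, which matches the `(1 << 40) - 1` figure quoted later in this review thread. The constant is redefined locally here and `strip_upper_bits` is a hypothetical helper name.

```rust
/// Assumed layout: the low 40 bits carry the nonce value, the upper bits
/// carry the "missing revisions" flags.
pub const IDENTITY_NONCE_VALUE_FILTER: u64 = 0xFF_FFFF_FFFF;

/// Hypothetical helper applied once, right after the Platform fetch, so the
/// cache only ever stores plain nonce values.
pub fn strip_upper_bits(raw_platform_nonce: u64) -> u64 {
    raw_platform_nonce & IDENTITY_NONCE_VALUE_FILTER
}
```

With the upper bits removed before storage, later comparisons and `+ 1` bumps operate on plain integers.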

Stale marking instead of removal

refresh_identity_nonce() sets cache entry timestamps to 0 (stale) instead of removing entries. This forces a re-fetch on next use while preserving the cached nonce value.
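
The difference between stale marking and removal can be sketched like this. `StaleEntry`, `mark_stale`, and `is_stale` are hypothetical names, and the `u32` key is a stand-in for the identity identifier.

```rust
use std::collections::BTreeMap;

/// Hypothetical entry: the cached nonce plus the time of the last fetch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StaleEntry {
    pub current_nonce: u64,
    /// Unix timestamp in seconds; 0 means "stale, re-fetch on next use".
    pub last_fetch_timestamp: u64,
}

/// Marks an entry stale without discarding its nonce value.
pub fn mark_stale(cache: &mut BTreeMap<u32, StaleEntry>, key: u32) {
    if let Some(entry) = cache.get_mut(&key) {
        entry.last_fetch_timestamp = 0;
    }
}

/// An entry is stale when its last fetch is older than `stale_time_s`.
pub fn is_stale(entry: &StaleEntry, now_s: u64, stale_time_s: u64) -> bool {
    entry.last_fetch_timestamp < now_s.saturating_sub(stale_time_s)
}
```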

Overflow guard

Added NonceOverflow error variant — returned when a nonce reaches u64::MAX instead of silently wrapping to zero.
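
A minimal sketch of such a guard, using `checked_add` from the standard library. `SketchError` and `bump` are hypothetical names; only the `NonceOverflow` variant name comes from the description above.

```rust
/// Hypothetical error type mirroring the variant named in the description.
#[derive(Debug, PartialEq, Eq)]
pub enum SketchError {
    NonceOverflow,
}

/// Increments a nonce, returning an error instead of wrapping at u64::MAX.
pub fn bump(nonce: u64) -> Result<u64, SketchError> {
    nonce.checked_add(1).ok_or(SketchError::NonceOverflow)
}
```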

IdentityNonceNotFound error

Platform returning no nonce for a queried identity now returns IdentityNonceNotFound instead of silently defaulting to nonce 0. This is a retriable error — the DAPI node may not have indexed the identity yet.
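
The change from a silent default to an explicit error can be illustrated as below. `FetchError`, `require_nonce`, and `is_retriable` are hypothetical names, and treating every other error as non-retriable is an assumption made only to keep the sketch short.

```rust
/// Hypothetical error type; only `IdentityNonceNotFound` is named in the PR.
#[derive(Debug, PartialEq, Eq)]
pub enum FetchError {
    IdentityNonceNotFound,
    Other(String),
}

/// Converts a missing Platform response into an error rather than nonce 0.
pub fn require_nonce(fetched: Option<u64>) -> Result<u64, FetchError> {
    fetched.ok_or(FetchError::IdentityNonceNotFound)
}

/// Assumption for this sketch: only the not-found case is retried.
pub fn is_retriable(err: &FetchError) -> bool {
    matches!(err, FetchError::IdentityNonceNotFound)
}
```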

wasm-sdk error handling

Added NonceOverflow and IdentityNonceNotFound variants to WasmSdkErrorKind and the From<SdkError> conversion in the wasm-sdk, fixing a non-exhaustive match compilation failure.

Cleanup

  • Removed unused Sdk::parse_proof and Sdk::parse_proof_with_metadata (dead code).
  • Unified duplicated nonce fetch logic into a single get_or_fetch_nonce helper.

How Has This Been Tested?

  • Extensive unit tests covering: cache hits/misses, stale refresh, drift detection and re-fetch, overflow guard, upper-bit masking, LRU eviction, concurrent access under contention
  • cargo check -p dash-sdk -p wasm-sdk passes cleanly
  • cargo clippy -p dash-sdk passes cleanly
  • cargo test -p dash-sdk passes cleanly
  • All CI workflows pass (Tests, rs-sdk-ffi build, gRPC coverage, PR linter)

Breaking Changes

None — the NonceCache is an internal implementation detail. Public Sdk API is unchanged.

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have added "!" to the title and described breaking changes in the corresponding section if my code contains any
  • I have made corresponding changes to the documentation if needed

When broadcast() fails, refresh_identity_nonce() previously removed
nonce cache entries entirely. If the TX had partially reached some
Tenderdash nodes (entered their mempool) before the error, the next
retry would re-fetch the old nonce from Platform (TX not yet processed),
derive the same nonce, create identical TX bytes, and get rejected with
"tx already exists in cache".

Fix by marking cache entries as stale (timestamp=0) instead of removing
them. This forces a re-fetch from Platform on the next use while
preserving the cached nonce value. The existing MAX(platform, cached)
comparison then ensures the nonce always advances, preventing the SDK
from recreating an identical state transition.
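
A worked example of why the cached value matters, under one possible reading of the MAX(platform, cached) rule. Suppose nonce 5 was used in the failed broadcast and Platform still reports 4. If the entry had been deleted, the retry would derive 4 + 1 = 5 again; with the entry kept, it derives max(4, 5) + 1 = 6. `nonce_after_refetch` is a hypothetical helper, not the SDK's function.

```rust
/// Hypothetical helper: picks the nonce to use after a stale re-fetch.
/// Returns None if bumping would overflow u64.
pub fn nonce_after_refetch(cached: u64, platform: u64, bump_first: bool) -> Option<u64> {
    // Never go below what the SDK has already handed out.
    let base = cached.max(platform);
    if bump_first {
        base.checked_add(1)
    } else {
        Some(base)
    }
}
```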

Closes dashpay/dash-evo-tool#588

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai bot commented Feb 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Refactors nonce caching to store triples (insert_nonce, timestamp, platform_nonce); centralizes nonce retrieval into a shared get_or_fetch_nonce flow with drift-aware re-fetch and masked platform nonces; removes two pub(crate) proof-parsing helpers; refresh now marks entries stale by zeroing timestamps.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| SDK core changes: packages/rs-sdk/src/sdk.rs | Removed pub(crate) helpers parse_proof and parse_proof_with_metadata. Introduced/use shared get_or_fetch_nonce flow; adapted identity and identity-contract nonce logic to the 3-tuple cache state and masking via IDENTITY_NONCE_VALUE_FILTER; refresh semantics updated to zero timestamps for staleness. |
| Internal cache shape: packages/rs-sdk/src/internal_cache/mod.rs | Added pub(crate) type NonceCacheEntry = (IdentityNonce, LastQueryTimestamp, IdentityNonce); switched identity_nonce_counter and identity_contract_nonce_counter to store NonceCacheEntry; updated Default init and related mutex/BTreeMap types. |
| Tests & validation: packages/rs-sdk/src/... | Added/updated tests covering nonce cache behavior: drift, bumps, error propagation, concurrent updates, refresh semantics, masking and isolation. |
| Docs/comments & manifests: packages/rs-sdk/src/..., Cargo.toml | Updated comments/documentation to describe new 3-tuple cache, masking, and drift logic; manifest lines changed (additions/removals) reflected in Cargo.toml diffs. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant SDK
    participant InternalCache
    participant Platform

    Client->>SDK: request nonce for identity / identity+contract
    SDK->>InternalCache: read (insert_nonce, ts, platform_nonce)
    alt cache considered fresh (drift/revisions ok)
        InternalCache-->>SDK: return cached insert_nonce (and metadata)
        SDK-->>Client: return cached nonce
    else cache stale or missing
        SDK->>Platform: fetch platform_nonce (async, without cache lock)
        Platform-->>SDK: return platform_nonce
        SDK->>SDK: platform_masked = platform_nonce & IDENTITY_NONCE_VALUE_FILTER
        SDK->>SDK: combine/compute insert_nonce (apply bump_first if required)
        SDK->>InternalCache: write (insert_nonce, now_timestamp, platform_masked)
        SDK-->>Client: return updated nonce
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I found three little numbers, tucked warm in a row,

A timestamp to nap, a platform bit's glow.
I fetch when the drift makes old crumbs untrue,
I mask what I borrow and stitch something new.
Hop—cache refreshed, and the network says whoo!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The PR title describes LRU eviction and structured errors, but the actual changes implement stale marking, drift detection, and nonce filtering, not LRU eviction or structured errors. | Update the title to accurately reflect the main changes: 'fix(sdk): prevent nonce cache regression with stale marking, drift detection, and nonce filtering' or similar. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | The PR directly addresses issue #588 by implementing all core objectives: preventing nonce reuse after failed broadcasts, reconciling cached nonce with Platform values, stripping malformed bits from Platform nonces, detecting drift, and improving concurrency during Platform fetches. |
| Out of Scope Changes check | ✅ Passed | All changes are in-scope: nonce cache refactoring (get_or_fetch_nonce helper, cache state changes), removal of unused parse_proof methods, and new tests directly support the fix. No unrelated modifications detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@lklimek lklimek changed the base branch from v2.1-dev to v3.1-dev February 19, 2026 12:22
@lklimek lklimek requested a review from shumkov as a code owner February 19, 2026 12:22
@github-actions github-actions bot added this to the v3.1.0 milestone Feb 19, 2026
lklimek and others added 2 commits February 19, 2026 13:37
…turn path

The nonce cache previously stored raw values with the upper "missing
revisions" bits intact, requiring IDENTITY_NONCE_VALUE_FILTER masking on
every return path. This made comparisons between platform_nonce and
cached values unreliable since the upper bits could differ.

Strip the upper bits immediately after fetching from Platform so the
cache only ever holds plain nonce values. This simplifies all downstream
comparisons, prevents +1 increments from overflowing into the upper
bits, and removes redundant masking from 7 return paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot commented Feb 19, 2026

✅ gRPC Query Coverage Report

================================================================================
gRPC Query Coverage Report - NEW QUERIES ONLY
================================================================================

Total queries in proto: 53
Previously known queries: 47
New queries found: 6

================================================================================

New Query Implementation Status:
--------------------------------------------------------------------------------
✓ getAddressInfo                                /home/runner/work/platform/platform/packages/rs-sdk/src/platform/query.rs
✓ getAddressesBranchState                       /home/runner/work/platform/platform/packages/rs-sdk/src/platform/address_sync/mod.rs
✓ getAddressesInfos                             /home/runner/work/platform/platform/packages/rs-sdk/src/platform/fetch_many.rs
✓ getAddressesTrunkState                        /home/runner/work/platform/platform/packages/rs-sdk/src/platform/query.rs
✓ getRecentAddressBalanceChanges                /home/runner/work/platform/platform/packages/rs-sdk/src/platform/query.rs
✓ getRecentCompactedAddressBalanceChanges       /home/runner/work/platform/platform/packages/rs-sdk/src/platform/query.rs

================================================================================
Summary:
--------------------------------------------------------------------------------
New queries implemented: 6 (100.0%)
New queries missing: 0 (0.0%)

Total known queries: 53
  - Implemented: 50
  - Not implemented: 2
  - Excluded: 1

Not implemented queries:
  - getConsensusParams
  - getTokenPreProgrammedDistributions


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
packages/rs-sdk/src/sdk.rs (1)

504-550: Consider extracting the shared nonce fetch-and-update logic.

The masking, drift detection, and effective_current computation in get_identity_contract_nonce (lines 505–550) are identical to get_identity_nonce (lines 391–438). A shared helper (e.g., resolve_nonce(platform_nonce, cached_entry, bump_first, current_time_s)) would eliminate duplication and ensure both paths stay in sync when the logic evolves. The existing TODO: Move to a separate struct at line 469 aligns with this.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/rs-sdk/src/sdk.rs` around lines 504 - 550, Extract the duplicated
masking, drift-detection, and cache-update logic into a single helper (e.g.,
resolve_nonce or a method on a new NonceResolver struct) and call it from both
get_identity_contract_nonce and get_identity_nonce; the helper should accept the
platform_nonce (already masked with IDENTITY_NONCE_VALUE_FILTER as done via
IdentityContractNonceFetcher), a mutable reference to the cache Entry
(Vacant/Occupied), bump_first, and current_time_s, compute effective_current
using MAX_MISSING_IDENTITY_REVISIONS, decide insert_nonce (handling
platform_nonce > effective_current and bump_first cases), update the cache entry
with (insert_nonce, current_time_s), and return the insert_nonce or Result as
needed so both call sites reuse identical behavior.

lklimek and others added 2 commits February 19, 2026 13:59
…tform

The nonce cache could drift arbitrarily far ahead of Platform through
repeated bump_first calls without successful broadcasts. Each bump
refreshed the cache timestamp, preventing time-based staleness from
triggering a re-fetch.

Add last_platform_nonce to the cache entry and check drift inline in
should_query_platform. When the cached nonce exceeds the last known
platform value by more than MAX_MISSING_IDENTITY_REVISIONS, force a
re-fetch and use platform's response as the source of truth.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ce helper

get_identity_nonce and get_identity_contract_nonce had identical cache
logic differing only in key type and fetcher. Extract the shared
staleness check, drift detection, fetch, bump, and cache update into a
generic async helper that both methods delegate to.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lklimek lklimek force-pushed the fix/sdk-nonce-cache-regression branch from c508de1 to 8aae8c2 Compare February 19, 2026 13:08

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/rs-sdk/src/internal_cache/mod.rs (1)

7-15: Consider a named struct to prevent positional confusion between the two IdentityNonce fields.

Both .0 (current_nonce) and .2 (last_platform_nonce) share the same IdentityNonce type. A swap at any destructuring site (e.g., let (last_platform_nonce, _, current_nonce) = e.get()) compiles silently but produces wrong semantics.

♻️ Replace tuple alias with a named struct
-/// Cached nonce state: `(current_nonce, last_fetch_timestamp, last_platform_nonce)`.
-///
-/// - `current_nonce` — the nonce value the SDK will use next (may be bumped
-///   ahead of what Platform reported).
-/// - `last_fetch_timestamp` — when we last fetched from Platform.
-/// - `last_platform_nonce` — the value Platform returned on the last fetch
-///   (after stripping upper bits). Used to detect when local bumps have
-///   drifted too far ahead, triggering a re-fetch.
-pub(crate) type NonceCacheEntry = (IdentityNonce, LastQueryTimestamp, IdentityNonce);
+/// Cached nonce state for identity/contract nonce tracking.
+#[derive(Debug, Clone, Copy)]
+pub(crate) struct NonceCacheEntry {
+    /// The nonce value the SDK will use next (may be bumped ahead of Platform).
+    pub current_nonce: IdentityNonce,
+    /// Unix timestamp (seconds) of the last Platform fetch; 0 means stale.
+    pub last_fetch_timestamp: LastQueryTimestamp,
+    /// Plain nonce Platform returned on the last fetch (upper bits stripped).
+    pub last_platform_nonce: IdentityNonce,
+}

All destructuring sites in sdk.rs would then use named fields instead of positional indexing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/rs-sdk/src/internal_cache/mod.rs` around lines 7 - 15, The tuple
alias NonceCacheEntry uses two IdentityNonce fields which is error-prone;
replace the alias with a small named struct (e.g., struct NonceCacheEntry {
current_nonce: IdentityNonce, last_fetch_timestamp: LastQueryTimestamp,
last_platform_nonce: IdentityNonce }) and update all uses (including
destructuring sites in sdk.rs that reference NonceCacheEntry::get()/e.get()) to
use the named fields instead of positional .0/.1/.2 to prevent silent swaps
between current_nonce and last_platform_nonce. Ensure the struct derives or
implements the same traits previously relied on (Clone/Copy/Debug/etc.) so
callers keep compiling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/rs-sdk/src/sdk.rs`:
- Around line 407-420: In get_or_fetch_nonce, inside the should_query_platform =
true branch where you compute insert_nonce from platform_nonce, update the logic
to account for any existing cached nonce: when handling Entry::Occupied (and
generally before inserting), read the existing cached nonce (the first element
of the stored tuple) and set base = max(existing_cached_nonce, platform_nonce),
then compute insert_nonce = base + 1 and store (insert_nonce, current_time_s,
platform_nonce); do this instead of unconditionally using platform_nonce so
retries use max(existing, platform) as the base.

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
packages/rs-sdk/src/sdk.rs (2)

496-518: refresh_identity_nonce — stale-marking logic is correct; consider a range scan for contract nonces.

Setting timestamp = 0 reliably triggers the staleness check on next access (since 0 < current_time_s − stale_time_s holds for any real UNIX timestamp). The logic is sound.

Minor: the contract-nonce path (lines 510–516) does a full linear scan over all entries to find those matching identity_id. Since the BTreeMap key is (identity_id, contract_id), matching entries are contiguous and a range query (range((identity_id, Identifier::default())..)) would avoid iterating unrelated entries. Not a correctness issue, just a minor efficiency opportunity if the cache grows large.
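
The range-scan idea can be sketched as follows, assuming identifiers are 32-byte arrays and the cache value is a `(nonce, timestamp)` pair; both are stand-ins for the SDK's real types, and `mark_identity_stale` is a hypothetical name. The sketch bounds the range on both ends, since an open-ended range would continue into entries for later identities.

```rust
use std::collections::BTreeMap;

/// Stand-in for the SDK's identifier type (assumed to be 32 bytes).
pub type Id = [u8; 32];

/// Zeroes the timestamp of every (identity, contract) entry for `identity`,
/// visiting only the contiguous key range that belongs to it.
pub fn mark_identity_stale(cache: &mut BTreeMap<(Id, Id), (u64, u64)>, identity: Id) {
    // Tuples order lexicographically, so all keys for one identity sit
    // between (identity, all-zero) and (identity, all-0xFF) inclusive.
    let low = (identity, [0u8; 32]);
    let high = (identity, [0xFFu8; 32]);
    for (_key, entry) in cache.range_mut(low..=high) {
        entry.1 = 0; // timestamp = 0 marks the entry stale; nonce is kept
    }
}
```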

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/rs-sdk/src/sdk.rs` around lines 496 - 518, The current
refresh_identity_nonce implementation correctly marks timestamps as 0 but scans
the entire internal_cache.identity_contract_nonce_counter BTreeMap to find
entries for an identity; optimize by performing a contiguous range scan instead
of iterating all entries — use a range_mut over keys starting at
(identity_id.clone(), <min contract id>) and ending at (identity_id.clone(),
<max contract id>) (or appropriate Bound bounds) to visit only entries whose
first tuple element equals identity_id and set their timestamp to 0; keep the
existing identity_nonce_counter logic unchanged.

371-383: Tokio mutex held across async platform fetch serialises all nonce requests.

Both get_identity_nonce and get_identity_contract_nonce lock the shared nonce counter, then await a potential network call inside get_or_fetch_nonce. Because the Entry borrows the map, the lock cannot be dropped until the fetch completes. Under concurrent callers this means every identity's nonce request queues behind any in-flight platform fetch, regardless of identity.

This is an inherent limitation of the current Entry-based API and is pre-existing — but the refactor makes it structurally clearer. A future optimisation would be a per-identity (or per-key) lock rather than a single map-level mutex.

Also applies to: 407-427

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/rs-sdk/src/sdk.rs` around lines 371 - 383, The code currently holds
internal_cache.identity_nonce_counter.lock() across an await inside
Self::get_or_fetch_nonce, which serialises all concurrent nonce requests; change
the flow so the map lock is not held during the async fetch by: 1) access the
map under internal_cache.identity_nonce_counter only to check/clone the existing
Entry value or to insert a lightweight “pending” marker, then drop the mutex
before performing the async platform call invoked via
IdentityNonceFetcher::fetch_with_settings; 2) after the fetch completes,
re-acquire the identity_nonce_counter mutex and update/replace the entry with
the fetched nonce (or remove the pending marker and insert the real value);
alternatively replace the single map mutex with a per-identity lock (e.g., a
separate Mutex per identity or a keyed lock structure) so get_identity_nonce and
get_identity_contract_nonce only serialize for the same identity rather than
globally. Ensure this change touches the codepaths using get_or_fetch_nonce,
internal_cache.identity_nonce_counter, and the async fetch closure so no mutex
guard is held across await.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@packages/rs-sdk/src/sdk.rs`:
- Around line 461-473: The stale-by-time re-fetch path in refresh_identity_nonce
currently overwrites the cache via entry.and_modify(|e| *e = (insert_nonce,
...)) using platform_nonce (plus bump_first) and ignores the existing cached
nonce, causing stale fetches to return the same nonce; change the logic so when
handling the stale-by-time branch you read the existing cached nonce from the
entry (if present) and compute insert_nonce = max(existing_cached_nonce,
platform_nonce + (if bump_first {1} else {0})), while preserving the original
"drift" branch behavior that intentionally resets to platform_nonce; update the
code around fetch_from_platform(), insert_nonce, and the
entry.and_modify/or_insert calls to use that max for the stale path.

… higher cached nonce

- Do not hold the Mutex across the async fetch_from_platform call;
  split get_or_fetch_nonce into lock-check-unlock / fetch / lock-update
  so other callers are not blocked during the network round-trip.
- After re-fetching, keep the higher of the cached nonce vs the Platform
  nonce to prevent regression when Platform has not yet indexed a recent
  successful broadcast.
- Replace panic!("vacant entry not possible") with a structure that
  cannot reach the impossible state (early-return from cache hit).
- Tighten drift threshold from > to >= MAX_MISSING_IDENTITY_REVISIONS
  as a precaution to stay within Platform's own limit.
- Simplify cache write from and_modify().or_insert() to a single insert.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
packages/rs-sdk/src/sdk.rs (1)

412-428: Optional: add tracing to get_identity_contract_nonce for consistency with get_identity_nonce.

get_identity_nonce (lines 390–395) emits a trace! after the fetch. get_identity_contract_nonce returns directly without logging, making it harder to observe nonce progression in traces.

♻️ Proposed addition
     ) -> Result<IdentityNonce, Error> {
         let settings = settings.unwrap_or_default();
-        Self::get_or_fetch_nonce(
+        let nonce = Self::get_or_fetch_nonce(
             &self.internal_cache.identity_contract_nonce_counter,
             (identity_id, contract_id),
             bump_first,
             &settings,
             || async {
                 Ok(IdentityContractNonceFetcher::fetch_with_settings(
                     self,
                     (identity_id, contract_id),
                     settings.request_settings,
                 )
                 .await?
                 .unwrap_or(IdentityContractNonceFetcher(0))
                 .0)
             },
         )
-        .await
+        .await?;
+
+        tracing::trace!(
+            identity_id = %identity_id,
+            contract_id = %contract_id,
+            bump_first,
+            nonce,
+            "Fetched identity contract nonce"
+        );
+
+        Ok(nonce)
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/rs-sdk/src/sdk.rs` around lines 412 - 428, Add a trace call to
get_identity_contract_nonce to mirror get_identity_nonce: after awaiting the
result of Self::get_or_fetch_nonce in get_identity_contract_nonce, emit a trace!
that logs the resolved nonce along with identity_id and contract_id for
observability; update the function surrounding the
Self::get_or_fetch_nonce(...).await call to capture the returned nonce in a
local variable and then call trace! before returning it.

lklimek and others added 2 commits February 19, 2026 15:08
Cover get_or_fetch_nonce with 25 parameterized test cases:
- empty cache fetch (basic, zero, filter-max, upper-bits stripping)
- fresh cache hit (no-bump, bump, drift below threshold)
- stale/drifted cache re-fetch (no-regress, platform-higher, drift-at-max)
- error propagation, refresh staleness, key isolation, concurrency races

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
packages/rs-sdk/src/sdk.rs (2)

1367-1389: Test coverage is comprehensive; minor nit on test case name.

The "filter max with bump overflows" label (line 1372) implies a u64 integer overflow. Since IDENTITY_NONCE_VALUE_FILTER ≈ (1 << 40) - 1, the bumped result IDENTITY_NONCE_VALUE_FILTER + 1 = 1 << 40 is well within u64's range. A clearer name would be "nonce exceeds filter range after bump" to signal that the stored value has bits beyond what Platform would return (not an arithmetic overflow).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/rs-sdk/src/sdk.rs` around lines 1367 - 1389, Test case label is
misleading: in the test function empty_cache_fetch update the test_case string
"filter max with bump overflows" to a clearer label such as "nonce exceeds
filter range after bump" so it no longer implies an arithmetic u64 overflow;
modify the test_case for IDENTITIY_NONCE_VALUE_FILTER + 1 accordingly in the
list of #[test_case(... ; "...")] entries.

400-429: get_identity_contract_nonce is missing the observability trace log present in get_identity_nonce.

get_identity_nonce (line 390) logs identity_id, bump_first, and the resolved nonce. get_identity_contract_nonce has no equivalent trace, making it harder to debug nonce issues for identity–contract pairs.

🪵 Suggested trace log
     ) -> Result<IdentityNonce, Error> {
         let settings = settings.unwrap_or_default();
-        Self::get_or_fetch_nonce(
+        let nonce = Self::get_or_fetch_nonce(
             &self.internal_cache.identity_contract_nonce_counter,
             (identity_id, contract_id),
             bump_first,
             &settings,
             || async {
                 Ok(IdentityContractNonceFetcher::fetch_with_settings(
                     self,
                     (identity_id, contract_id),
                     settings.request_settings,
                 )
                 .await?
                 .unwrap_or(IdentityContractNonceFetcher(0))
                 .0)
             },
         )
-        .await
+        .await?;
+
+        tracing::trace!(
+            identity_id = %identity_id,
+            contract_id = %contract_id,
+            bump_first,
+            nonce,
+            "Fetched identity contract nonce"
+        );
+
+        Ok(nonce)
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/rs-sdk/src/sdk.rs` around lines 400 - 429, Add the same
observability trace in get_identity_contract_nonce that get_identity_nonce uses:
after resolving the nonce (i.e., once the Value returned by
Self::get_or_fetch_nonce is available), emit the same trace-level log including
identity_id, contract_id, bump_first and the resolved nonce using the same trace
logging call used in get_identity_nonce; locate this in
get_identity_contract_nonce (and/or inside the fetch closure if needed) and
format the fields identically to the existing get_identity_nonce trace so
debugging for identity–contract nonces matches the single-identity case.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@packages/rs-sdk/src/sdk.rs`:
- Around line 437-503: The code uses IDENTITY_NONCE_VALUE_FILTER in a bitwise
AND with a u64 and compares MAX_MISSING_IDENTITY_REVISIONS with a u64 result, so
ensure both DPP constants are u64 (or explicitly cast them); locate the
constants IDENTITY_NONCE_VALUE_FILTER and MAX_MISSING_IDENTITY_REVISIONS in DPP
and make them pub const ...: u64 = ... (or, if you cannot change their
definitions, cast where used in get_or_fetch_nonce: e.g.
fetch_from_platform().await? & (IDENTITY_NONCE_VALUE_FILTER as u64) and compare
with (MAX_MISSING_IDENTITY_REVISIONS as u64) in the drifted calculation to
guarantee types align).


/// - `last_platform_nonce` — the value Platform returned on the last fetch
/// (after stripping upper bits). Used to detect when local bumps have
/// drifted too far ahead, triggering a re-fetch.
pub(crate) type NonceCacheEntry = (IdentityNonce, LastQueryTimestamp, IdentityNonce);
Collaborator

Let's do struct

BTreeMap<(Identifier, Identifier), (prelude::IdentityNonce, LastQueryTimestamp)>,
>,
pub(crate) identity_contract_nonce_counter:
tokio::sync::Mutex<BTreeMap<(Identifier, Identifier), NonceCacheEntry>>,
Collaborator

(Identifier, Identifier) is terrible too

Comment on lines 372 to 388
let nonce = Self::get_or_fetch_nonce(
&self.internal_cache.identity_nonce_counter,
identity_id,
bump_first,
&settings,
|| async {
Ok(IdentityNonceFetcher::fetch_with_settings(
self,
identity_id,
settings.request_settings,
)
.await?
.unwrap_or(IdentityNonceFetcher(0))
.0)
},
)
.await?;
Collaborator

Let's extract this into a method

/// returns an error or is updated.
pub(crate) identity_nonce_counter:
tokio::sync::Mutex<BTreeMap<Identifier, (prelude::IdentityNonce, LastQueryTimestamp)>>,
pub(crate) identity_nonce_counter: tokio::sync::Mutex<BTreeMap<Identifier, NonceCacheEntry>>,
Collaborator

A better design would be a separate struct that encapsulates all the related logic (methods), with an instance held in the SDK. That way the logic isn't spread around.

Collaborator

@shumkov shumkov left a comment

It would be nice to have a functional test that reproduces the previous problem and shows that the new code fixes it.

lklimek and others added 8 commits February 23, 2026 08:47
Address PR #3111 review comments from shumkov:
- Replace NonceCacheEntry tuple with named struct (current_nonce,
  last_fetch_timestamp, last_fetched_platform_nonce)
- Replace (Identifier, Identifier) key with IdentityContractPair struct
- Create NonceCache struct encapsulating all nonce logic with per-map
  Mutex locking and configurable stale time
- Remove InternalSdkCache (was nonce-only, pointless wrapper)
- Make Sdk nonce methods thin wrappers delegating to NonceCache
- Move get_or_fetch_nonce, get_current_time_seconds,
  DEFAULT_IDENTITY_NONCE_STALE_TIME_S, and tests to internal_cache/mod.rs
- Remove LastQueryTimestamp and StalenessInSeconds type aliases from sdk.rs

Public API unchanged — all callers continue using sdk.get_identity_nonce(),
sdk.get_identity_contract_nonce(), and sdk.refresh_identity_nonce().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
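
As a rough illustration of the tuple-to-struct change in this commit, the entry and key types might look like the sketch below. The entry field names come from the commit message. The `Identifier` and `IdentityNonce` aliases are stand-ins for the DPP types, and the `IdentityContractPair` field names and the `drift` helper are assumptions.

```rust
// Stand-ins for the DPP types, used here only to keep the sketch self-contained.
type Identifier = [u8; 32];
type IdentityNonce = u64;

/// Named key replacing the bare `(Identifier, Identifier)` tuple.
/// Field names are assumed for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct IdentityContractPair {
    identity_id: Identifier,
    contract_id: Identifier,
}

/// Named entry replacing the `(IdentityNonce, LastQueryTimestamp, IdentityNonce)` tuple.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NonceCacheEntry {
    current_nonce: IdentityNonce,
    last_fetch_timestamp: u64,
    last_fetched_platform_nonce: IdentityNonce,
}

impl NonceCacheEntry {
    /// Hypothetical helper: how far local bumps have moved ahead of the
    /// last value Platform reported.
    fn drift(&self) -> u64 {
        self.current_nonce
            .saturating_sub(self.last_fetched_platform_nonce)
    }
}
```

Named fields make call sites such as `entry.current_nonce` self-describing, where the tuple form required remembering which of the two `IdentityNonce` positions was which.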
…race and bound memory

- Switch from BTreeMap to lru::LruCache (bounded to 100 entries) for
  both identity and contract nonce caches, fixing unbounded growth
- Hold cache lock end-to-end across async Platform fetch, eliminating
  the TOCTOU race condition (SEC-001) that could cause nonce regression
- Apply IDENTITY_NONCE_VALUE_FILTER mask on every nonce bump, not just
  after Platform fetch (SEC-002)
- Change refresh() to pop() entries entirely instead of zeroing
  timestamps, matching pre-refactor behavior (CODE-001)
- Return Result from get_current_time_seconds() instead of panicking
  on clock error (RUST-001)
- Do not update last_fetch_timestamp on cache-only bumps (CODE-002)
- Make lru a non-optional dependency (remove feature-gating behind mocks)
- Update tests: add LRU eviction, nonce filter masking, and timestamp
  preservation tests; remove concurrent_update tests (race window gone)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… and tracing

Address five security findings (SEC-001 through SEC-005) in the NonceCache:

- SEC-001: Return `Error::NonceOverflow` instead of silently wrapping the
  nonce to zero when it reaches `IDENTITY_NONCE_VALUE_FILTER`. Wrapping
  would permanently lock out the identity.
- SEC-002: Bump default LRU cache capacity from 100 to 1000 to reduce
  silent eviction of actively-bumped nonces.
- SEC-003: Restore three-phase lock pattern — check cache under lock,
  release lock during platform fetch, re-acquire and merge with
  `max(cached, platform)`. Eliminates serialization of all nonce fetches
  across the async boundary.
- SEC-004: Add `tracing::trace!` for cache misses, stale/drift refetches,
  and anti-regression events. No logging on the hot cache-hit path.
- SEC-005: Capture a fresh timestamp after the platform fetch so
  `last_fetch_timestamp` reflects actual receipt time.

Also removes all runtime `expect()` calls — the remaining `.expect()` on
`NonZeroUsize::new()` is in a `const` context (compile-time evaluation).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
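
The three-phase lock pattern restored under SEC-003 can be sketched as follows. This is a simplified model, not the SDK code: it uses `std::sync::Mutex`, a `HashMap`, and a blocking fetch closure to stay dependency-free, whereas the PR uses an async fetch and an LRU cache. `Entry`, `get_or_fetch`, and the `String` error type are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct Entry {
    current_nonce: u64,
    last_fetch_timestamp: u64,
}

/// Sketch of: lock, check, unlock, fetch (no lock), lock, merge, unlock.
fn get_or_fetch<F>(
    cache: &Mutex<HashMap<u32, Entry>>,
    key: u32,
    now: u64,
    stale_after: u64,
    fetch_from_platform: F,
) -> Result<u64, String>
where
    F: FnOnce() -> Result<u64, String>,
{
    // Phase 1: check the cache under the lock.
    let guard = cache.lock().map_err(|e| e.to_string())?;
    let fresh = guard
        .get(&key)
        .filter(|e| now.saturating_sub(e.last_fetch_timestamp) < stale_after)
        .map(|e| e.current_nonce);
    drop(guard); // release the lock before any I/O
    if let Some(nonce) = fresh {
        return Ok(nonce);
    }

    // Phase 2: fetch with no lock held, so other callers are not blocked.
    let platform_nonce = fetch_from_platform()?;

    // Phase 3: re-acquire the lock and merge. Another caller may have
    // updated the entry in the meantime, so keep the higher value.
    let mut guard = cache.lock().map_err(|e| e.to_string())?;
    let entry = guard.entry(key).or_insert(Entry {
        current_nonce: 0,
        last_fetch_timestamp: now,
    });
    entry.current_nonce = entry.current_nonce.max(platform_nonce);
    entry.last_fetch_timestamp = now;
    Ok(entry.current_nonce)
}
```

The `max` in Phase 3 is what keeps a lower Platform response from regressing a nonce that was bumped locally while the fetch was in flight.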
Replace `.unwrap_or(0)` fallback with an explicit error when Platform
returns no nonce for an identity or identity-contract pair.  A stale
DAPI node that has not yet indexed the identity would previously cause
the SDK to silently use nonce 0, leading to rejected state transitions.

Now the closure surfaces `Error::IdentityNonceNotFound` and logs a
`warn!` so the caller can retry against a different node.

Also documents the LRU eviction, drift threshold, and DAPI staleness
edge cases on `NonceCache`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
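
The change from a silent zero fallback to an explicit error might look roughly like this. `NonceError` and `resolve_platform_nonce` are hypothetical; only the variant name `IdentityNonceNotFound` comes from the commit message, and its payload here is an assumption.

```rust
#[derive(Debug, PartialEq)]
enum NonceError {
    /// Plays the role of `Error::IdentityNonceNotFound`; the String payload
    /// is assumed for illustration.
    IdentityNonceNotFound(String),
}

/// Converts a missing Platform response into an error instead of nonce 0.
fn resolve_platform_nonce(fetched: Option<u64>, identity: &str) -> Result<u64, NonceError> {
    // Before: `fetched.unwrap_or(0)` silently treated "not indexed yet" as nonce 0.
    fetched.ok_or_else(|| NonceError::IdentityNonceNotFound(identity.to_string()))
}
```

With the error surfaced, a caller can tell a stale node apart from an identity that has genuinely never sent a transition, and retry elsewhere.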
- Remove dead `From<(Identifier, Identifier)>` impl (RUST-005)
- Replace double hash lookup (peek + get_mut) with single get_mut (RUST-002)
- Add `Debug` bound to generic key, include key in trace logs (RUST-003)
- Add manual `Debug` impl for `NonceCache` (RUST-004)
- Remove redundant test imports after `use super::*` (RUST-006)
- Add test for bump_first=false after bump_first=true sequence (CODE-004)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Preserve max(old, new) last_fetched_platform_nonce when cached nonce
  wins the Phase 3 merge, preventing drift inflation from a temporarily
  lower platform value (CODE-001)
- Simplify bump_nonce to explicit boundary check instead of indirect
  masking arithmetic (RUST-001)
- Add concurrent bump test (20 tasks via Barrier on fresh cache)
- Add concurrent stale-fetch test (50 tasks through Phase 2 merge)

Both concurrency tests assert all returned nonces are unique, validating
the three-phase locking under contention.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
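
One way to read the CODE-001 fix in this commit is the merge sketched below. `Entry` and `merge_platform_nonce` are hypothetical, and the branch where Platform wins is an assumption; the commit message only describes the branch where the cached nonce wins.

```rust
struct Entry {
    current_nonce: u64,
    last_fetched_platform_nonce: u64,
}

/// Merges a freshly fetched Platform nonce into a cache entry.
fn merge_platform_nonce(entry: &mut Entry, platform_nonce: u64) {
    if platform_nonce >= entry.current_nonce {
        // Platform is at or ahead of the cache: adopt its value (assumed behaviour).
        entry.current_nonce = platform_nonce;
        entry.last_fetched_platform_nonce = platform_nonce;
    } else {
        // The cached nonce wins. Keep the higher baseline so that a
        // temporarily lower Platform value does not inflate the measured drift.
        entry.last_fetched_platform_nonce =
            entry.last_fetched_platform_nonce.max(platform_nonce);
    }
}
```

For an entry at nonce 10 with baseline 8, a lagging node that reports 3 leaves the baseline at 8, so the drift stays 2 instead of jumping to 7.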
…ariants

The rs-sdk added two new error variants (NonceOverflow, IdentityNonceNotFound)
but the wasm-sdk error conversion was not updated, causing a non-exhaustive
match compilation failure in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lklimek lklimek changed the title fix(sdk): prevent nonce cache regression on broadcast failure refactor(sdk): rewrite NonceCache with LRU eviction, drift detection, and structured errors Feb 23, 2026
@lklimek lklimek requested a review from shumkov February 23, 2026 12:12
lklimek and others added 3 commits February 23, 2026 13:28
… retry

The LruCache refactoring (d1225cc) inadvertently changed refresh() from
"mark stale" to "pop entry", reintroducing the exact bug from
dashpay/dash-evo-tool#588: after a broadcast failure, the bumped nonce
was lost, causing the retry to create a byte-identical state transition
("tx already exists in cache").

Changed refresh() to set last_fetch_timestamp = 0 instead of popping,
preserving the cached nonce so Phase 3's max(cached, platform) merge
advances past the already-used nonce.

Added tests covering both scenarios: Platform returning a stale nonce
(cached value preserved) and Platform returning a higher nonce (Platform
value used).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Extract STALE_TIMESTAMP constant to replace magic value 0
- Document intentional get_mut() LRU promotion in refresh() —
  refreshed entries are actively in use and should be preserved
  under eviction pressure
- Fix stale section comment that still said "removes entries"
- Fix primary test to use PutSettings::default() so the stale
  timestamp actually triggers a Platform re-fetch; add AtomicBool
  assertion verifying the fetch closure is invoked

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
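
The difference between marking an entry stale and popping it, as described in the two commits above, can be modelled in a few lines. This is a simplified sketch over a plain `HashMap`; `Entry`, `refresh`, and `nonce_for_retry` are hypothetical, and `nonce_for_retry` assumes the entry is already stale so it always merges with the Platform value.

```rust
use std::collections::HashMap;

/// Sentinel meaning "re-fetch on next use"; the constant name comes from the commit message.
const STALE_TIMESTAMP: u64 = 0;

struct Entry {
    current_nonce: u64,
    last_fetch_timestamp: u64,
}

/// Marks the entry stale but keeps the cached nonce.
fn refresh(cache: &mut HashMap<u32, Entry>, key: u32) {
    if let Some(entry) = cache.get_mut(&key) {
        entry.last_fetch_timestamp = STALE_TIMESTAMP;
    }
}

/// Simplified retry path: merge with Platform via max, then bump by one.
fn nonce_for_retry(
    cache: &mut HashMap<u32, Entry>,
    key: u32,
    platform_nonce: u64,
    now: u64,
) -> u64 {
    let entry = cache.entry(key).or_insert(Entry {
        current_nonce: 0,
        last_fetch_timestamp: STALE_TIMESTAMP,
    });
    entry.current_nonce = entry.current_nonce.max(platform_nonce) + 1;
    entry.last_fetch_timestamp = now;
    entry.current_nonce
}
```

If nonce 6 was used in a failed broadcast and Platform still reports 5, the stale-marked entry yields 7 on retry, while a popped entry would yield 6 again and recreate the identical state transition.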
…instead

Move Platform fetch logic from Sdk call sites into NonceCache's
pub(crate) methods (get_identity_nonce, get_identity_contract_nonce).
The private get_or_fetch_nonce helper retains its generic closure
signature unchanged.

This eliminates closure boilerplate from sdk.rs and co-locates the
fetch + cache logic in one place. Refresh tests now use Sdk::new_mock()
with expect_fetch expectations instead of hand-rolled closures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>