Correctness fixes: maxFee, 504 retries, key cache#238
Conversation
There was a problem hiding this comment.
Pull request overview
Three independent correctness bug fixes to the core SDK and key-management package, each backed by a focused unit test.
Changes:
- Forward
maxFeeinsubmitWithFeeIncreaserecursive retries and tighten the guard tomaxFee !== undefinedso the cap is enforced on every retry, not just the first. - Replace unbounded recursion in
submitTransaction's 504 handling with a bounded iterative loop (5 retries) using capped exponential backoff + jitter, plus optional chaining one?.response?.statusto harden non-axios error handling. - Fix
KeyManager._writeIndexCacheso it actually deletes the cached entry whenkeyisundefined, ensuringremoveKeypurges the in-memory cache.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| @stellar/typescript-wallet-sdk/src/walletSdk/Horizon/Stellar.ts | Bounded 504 retry loop with backoff; forward maxFee and tighten its guard in submitWithFeeIncrease. |
| @stellar/typescript-wallet-sdk/test/stellar.test.ts | Adds tests for multi-retry maxFee enforcement and the three 504-handling paths (retry+success, non-504 rethrow, exhaustion). |
| @stellar/typescript-wallet-sdk-km/src/keyManager.ts | Split cache guard so _writeIndexCache(id, undefined) deletes the entry. |
| @stellar/typescript-wallet-sdk-km/test/keyManager.test.ts | Adds removeKey cache-purge test; minor formatting cleanup. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The recursive call rebuilt the params object without maxFee, so the cap was enforced on the first retry only. After that, maxFee was undefined and the guard short-circuited to false, letting the fee grow without bound. Forward maxFee in the recursive call and tighten the guard from `maxFee && …` to `maxFee !== undefined && …`.
The previous implementation recursed on every 504 with no delay, no retry limit, and no backoff. Under sustained Horizon 504s this both hammered the already-congested server and eventually crashed the process with RangeError: Maximum call stack size exceeded. Convert to an iterative loop with capped exponential backoff (1s base, 30s ceiling, additive jitter up to +50%) and a fixed five-retry budget. After exhausting retries, the last 504 is rethrown. Also use optional chaining on `e?.response?.status` to avoid a latent crash when a non-axios error (e.g., TransactionSubmitFailedError from the `!response.successful` branch) enters the catch.
The guard `if (this.shouldCache && key)` skipped the body when key was undefined, so removeKey's `_writeIndexCache(id, undefined)` call was a no-op. The cached Key remained in memory and subsequent signTransaction / fetchAuthToken calls returned the "deleted" key. Split the guard so the method writes when given a key and deletes the cache entry when given undefined.
a216429 to
4593135
Compare
piyalbasu
left a comment
There was a problem hiding this comment.
A couple of correctness concerns on the new submitTransaction / submitWithFeeIncrease logic.
| const newFee = parseInt(transaction.fee) + baseFeeIncrease; | ||
|
|
||
| if (maxFee && newFee > maxFee) { | ||
| if (maxFee !== undefined && newFee > maxFee) { |
There was a problem hiding this comment.
maxFee: 0 regression. This guard changed from if (maxFee && newFee > maxFee) to if (maxFee !== undefined && newFee > maxFee). The old behavior treated maxFee === 0 as falsy ("no cap"); the new behavior treats it as a hard zero cap that always trips, since newFee = parseInt(transaction.fee) + baseFeeIncrease > 0.
Any caller passing maxFee: 0 (defensive default, numeric coercion of an unset env var, or as a documented no-cap sentinel under the old behavior) will now throw TransactionSubmitWithFeeIncreaseFailedError(0, e) on the very first tx_too_late and never resubmit. Worth a CHANGELOG note flagging the breaking change, or — if 0 was previously undocumented — at least narrowing the type to a positive number.
Separately, this is also bypassed when parseInt(transaction.fee) returns NaN: NaN > maxFee is false, so submitWithFeeIncrease recurses with baseFee: NaN. Worth guarding with Number.isFinite(newFee).
There was a problem hiding this comment.
On maxFee: 0: agreed, reverted in 4939bb2 — the guard is back to if (maxFee && newFee > maxFee), preserving the maxFee: 0 ≈ no cap behavior. The actual bug fix (forwarding maxFee across the recursive call) is what closes the report; the guard tightening was spec creep.
On NaN guard: pushing back. transaction.fee comes from @stellar/stellar-sdk's TransactionBuilder, which sets fee from a numeric value — it's always a numeric string. For parseInt to return NaN here, the SDK contract would have to break, and we already depend on that contract elsewhere (baseFee.toString() in makeFeeBump, the builder constructor, etc.). I'd rather not sprinkle Number.isFinite defenses against contracts we already lean on; happy to revisit if there's a concrete path where transaction.fee could be non-numeric.
| if (e.response.status === 504) { | ||
| // in case of 504, keep retrying this tx until submission succeeds or we get a different error | ||
| let lastError: unknown; | ||
| for (let attempt = 0; attempt <= SUBMIT_504_MAX_RETRIES; attempt++) { |
There was a problem hiding this comment.
504 retry semantics changed from unbounded to ~6 attempts / ~46s. The previous implementation retried recursively forever; this caps at SUBMIT_504_MAX_RETRIES + 1 attempts and then throws the last 504. Two follow-ups worth considering:
-
Caller cascade:
submitWithFeeIncreasecatchessubmitTransactionerrors and only handlestx_too_late. A bubbled 504 hasgetResultCode(e) === '', so it rethrows immediately — even though the original transaction may still land on-chain. A naive caller that retries on its own will double-submit. Worth either documenting this in the JSDoc or special-casing 504 insubmitWithFeeIncrease. -
Jitter exceeds the documented cap (line 160):
cappedDelay = Math.min(BASE * 2**attempt, MAX)is clamped to 30s, but jitter (Math.random() * cappedDelay/2, up to 15s) is then added on top, giving ~45s waits on the final retries — 50% overSUBMIT_504_MAX_DELAY_MS. If the cap is meant to bound caller latency, apply jitter first andMath.minafter, or use full-jitter (Math.random() * cappedDelay).
There was a problem hiding this comment.
On 1 (caller cascade): you're right that an exhausted-504 bubbles through submitWithFeeIncrease unchanged, and the developer-experience concern stands. Documented in e7aba81 — the new JSDoc on submitTransaction explicitly calls out that an exhausted-504 leaves the on-chain status indeterminate and that callers should poll the tx hash rather than resubmit blindly (and notes that resubmitting the same signed tx will fail with tx_bad_seq once the original lands, so it's a DX problem rather than a fund-loss one — Stellar's sequence numbers prevent the double-submit you're concerned about).
On 2 (jitter exceeds the cap): fixed in ba3cb75 — switched to equal-jitter (cappedDelay/2 + Math.random() * cappedDelay/2), so sleeps now stay in [cap/2, cap]. Retains the herd-smoothing benefit with a deterministic progressive floor, and the named SUBMIT_504_MAX_DELAY_MS actually bounds the wait now.
piyalbasu
left a comment
There was a problem hiding this comment.
This looks good to me. I'd say the maxFee comment is worth addressing, but the 504 improvement could maybe be pushed off
The previous additive jitter (cappedDelay + Math.random()*cappedDelay/2) let sleeps exceed SUBMIT_504_MAX_DELAY_MS by 50% (up to 45s with the 30s cap). Switch to equal-jitter: each attempt waits cappedDelay/2 plus a random amount up to cappedDelay/2, so the total stays in [cap/2, cap]. The deterministic floor keeps the schedule progressive while preserving the herd-smoothing benefit. Update the 504 retry tests to match the new (cappedDelay/2)-floor sleeps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous tightening to \`maxFee !== undefined && newFee > maxFee\` changed behavior for callers passing \`maxFee: 0\` (e.g. as a defensive default or numeric coercion of an unset env var): the old truthy check treated 0 as "no cap" and recursed, while the strict check threw on the very first tx_too_late. Restore the original \`maxFee && newFee > maxFee\` guard. The actual fix — forwarding \`maxFee\` across the recursive call — remains in place, which is what closes the underlying bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Make the JSDoc describe the actual behavior: a bounded number of 504 retries with exponential backoff, immediate rethrow on non-504 errors, and an explicit note that an exhausted-504 leaves the transaction's on-chain status indeterminate so callers should poll the tx hash instead of resubmitting blindly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@piyalbasu Both your comments have been addressed, thanks for the review! |
Summary
Three independent correctness bug fixes, one commit each:
submitWithFeeIncreasenow respectsmaxFeeon every retry, not just the first — previously the fee cap was silently dropped after the first retry, letting fees grow unbounded.submitTransaction504 retries are now bounded with exponential backoff instead of recursing forever with no delay — avoids hammering Horizon.KeyManager.removeKeynow clears the in-memory cache — previously a removed key lingered in the cache and could still be used to sign whenshouldCachewas on.No public API changes. No new dependencies.
1. Forward
maxFeeacrosssubmitWithFeeIncreaseretries (1dd8411)submitWithFeeIncreaserebuilt its params object on each recursive retry but omittedmaxFee. As a result the cap was only enforced on the first retry; from the second retry onwardmaxFeewasundefinedand the guardif (maxFee && …)short-circuited tofalse, letting the fee grow unbounded.maxFeein the recursive call.maxFee && newFee > maxFeetomaxFee !== undefined && newFee > maxFee.2. Bound
submitTransaction504 retries with exponential backoff (c1dc76f)On HTTP 504 the method recursed with no retry limit, no delay, and no backoff. Under sustained Horizon 504s this hammered the already-slow server and could eventually exhaust the call stack (
RangeError: Maximum call stack size exceeded), since V8 doesn't apply tail-call optimization toawait.e?.response?.statusto avoid a latent crash when a non-axios error enters the catch block.3. Clear
KeyManagercache entry onremoveKey(e2d9691)_writeIndexCache's guardif (this.shouldCache && key)skipped the body whenkeywasundefined, soremoveKey's_writeIndexCache(id, undefined)call was a no-op. WithshouldCache: true, a removed key stayed in the in-memory cache and subsequentsignTransaction/fetchAuthTokencalls kept returning it.undefined.Testing
Each fix ships with a focused unit test:
submitWithFeeIncreasecap is enforced across multiple retries (not just the first).submitTransaction504 handling: retries then succeeds, re-throws non-504 immediately, and gives up after the retry budget.removeKeypurges the cache so a subsequent sign attempt misses and fails as expected.No public API changes. No new dependencies.
🤖 Generated with Claude Code