Skip to content

fix(meshcore): retry initial auto-connect attempt with backoff#3919

Open
Yeraze wants to merge 4 commits into
mainfrom
claude/vibrant-volta-10neq7
Open

fix(meshcore): retry initial auto-connect attempt with backoff#3919
Yeraze wants to merge 4 commits into
mainfrom
claude/vibrant-volta-10neq7

Conversation

@Yeraze

@Yeraze Yeraze commented Jul 4, 2026

Copy link
Copy Markdown
Owner

Summary

Fixes #3918.

MeshCoreManager.connect() previously made exactly one attempt and gave up permanently on failure — leaving the source stuck disconnected until a manual "Connect" click, even with "Automatically connect on startup" enabled. This differs from Meshtastic TCP sources, whose transport retries forever with exponential backoff by default. MeshCore's existing reconnect machinery (scheduleNextReconnect/attemptReconnect) only ever engaged for a drop after an already-successful connect, and only when the opt-in heartbeat feature (heartbeatIntervalSeconds) was configured — never for the initial connect attempt itself.

  • connect() now arms shouldReconnect and calls scheduleNextReconnect() when the very first attempt fails, so a transient startup race (device/native backend not yet ready) recovers on its own via the existing exponential-backoff timer instead of requiring manual intervention.
  • On a successful connect, shouldReconnect is explicitly reset to false so the heartbeat feature's own opt-in gating for post-connect drops is unaffected — this change does not silently enable heartbeat-style auto-reconnect for sources that never configured it.

Changes

  • src/server/meshcoreManager.ts: retry-on-initial-failure in connect()'s catch block; explicit shouldReconnect reset on success.
  • src/server/meshcoreManager.connect-error.test.ts: stub scheduleNextReconnect in the existing catch-handler tests so they don't leak a real retry timer.
  • src/server/meshcoreManager.initialConnectRetry.test.ts (new): regression tests covering the retry-on-failure path, the shouldReconnect reset on success, and an end-to-end retry-then-succeed scenario via attemptReconnect().

Test plan

  • Targeted tests: meshcoreManager.connect-error.test.ts, meshcoreManager.initialConnectRetry.test.ts, meshcoreManager.dropDetection.test.ts — all pass
  • Full Vitest suite — 547 passed / 9 failed suites, all 9 failures traced to a missing protobufs git submodule in the sandbox (unrelated to this change); re-ran after git submodule update --init and all 9 pass (48/48 tests)
  • eslint on changed files — no new warnings/errors (21 pre-existing issues at unrelated lines, confirmed present before this change via git stash)

🤖 Generated with Claude Code

https://claude.ai/code/session_01Gr4SAmd6b69UzqEnamyc4c


Generated by Claude Code

MeshCoreManager.connect() previously made exactly one attempt and gave
up on failure, leaving the source stuck disconnected until a manual
Connect click — even with "Automatically connect on startup" enabled.
Unlike Meshtastic TCP sources, whose transport retries forever with
backoff by default, MeshCore's only retry machinery only covered a
drop after an already-successful connect, and only when the opt-in
heartbeat feature was configured.

connect() now arms shouldReconnect and schedules a backoff retry
(scheduleNextReconnect/attemptReconnect) when the first attempt fails,
and resets shouldReconnect back to false on a successful connect so
the heartbeat feature's own gating for post-connect drops is
unaffected.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Gr4SAmd6b69UzqEnamyc4c
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown

Claude finished @Yeraze's task —— View job


PR Review

  • Read changed files
  • Analyze implementation
  • Post review feedback

Summary

This is a well-targeted, minimal fix. The implementation is clean, the reasoning is sound, and the test coverage is thorough. A few observations follow.


Code Quality

src/server/meshcoreManager.ts (lines 905–931)

The catch block change is straightforward and correctly ordered:

  1. disconnect() is awaited (resets shouldReconnect = false at line 988)
  2. shouldReconnect is re-armed to true
  3. connectionState is set to 'reconnecting'
  4. scheduleNextReconnect() is called

The order matters — if steps 1 and 2 were swapped, disconnect() would undo the arm. The current ordering is correct.

shouldReconnect reset on success (line 857)

Resetting shouldReconnect = false on a successful connect is the right call. Without it, sources without heartbeatIntervalSeconds configured would gain implicit post-connect auto-reconnect as a side-effect of this fix. The inline comment explains the reasoning well.

One subtle behavioral note worth considering:

scheduleNextReconnect() (line 4984) honors reconnectMaxAttempts:

const maxAttempts = this.config?.reconnectMaxAttempts ?? 0;
if (maxAttempts > 0 && this.reconnectAttempts >= maxAttempts) {
  this.connectionState = 'failed';
  this.emit('reconnect_giveup', ...);
  return;
}

The default is 0 (unlimited), so initial-failure retries will run forever unless an operator sets reconnectMaxAttempts. This is consistent with how Meshtastic TCP already behaves, but it's worth confirming this is the intended UX — a device that's simply offline at startup will retry indefinitely with no log-level escalation after the cap (which defaults to infinite).


Test Coverage

The three test cases in meshcoreManager.initialConnectRetry.test.ts cover the important paths:

  1. arms shouldReconnect and schedules a retry — verifies the flag, state, and scheduleNextReconnect call count on first failure. ✓
  2. resets shouldReconnect to false on success — verifies the heartbeat-gate reset. ✓
  3. End-to-end retry-then-succeed via attemptReconnect() — drives the full retry cycle manually after stubbing the scheduler. ✓

The update to meshcoreManager.connect-error.test.ts correctly stubs scheduleNextReconnect to prevent timer leaks in existing tests. ✓

Minor test note: In the third test (retries the eventual reconnect attempt), void secondAttempt on line 99 is a no-op since secondAttempt is the resolved Promise<void> value (it's already undefined). This is harmless but slightly misleading — the await on line 98 already consumed the promise.


Potential Concerns

No concern with infinite retry for initially-failing sources: Already the behavior for post-connect drops with heartbeat configured, and for Meshtastic TCP. Consistent.

reconnectAttempts is not reset before the initial-failure retry loop begins. reconnectAttempts is only reset on a successful connect (line 849). If a source fails once, retries, and eventually succeeds, the attempt counter is properly reset. This is correct.

disconnect() in the catch block: Calling disconnect() here resets many state fields. The subsequent lines re-arm shouldReconnect and set connectionState = 'reconnecting' after the disconnect call, which is correct since disconnect() sets connectionState = 'disconnected' and shouldReconnect = false.


Verdict

Approve. The fix is minimal, correctly targeted, well-commented, and regression-tested. No blocking issues found.


· Branch: claude/vibrant-volta-10neq7

Nit from PR review on #3919 — attemptReconnect() resolves to undefined,
so voiding the awaited result was a no-op that read as if it meant
something.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Gr4SAmd6b69UzqEnamyc4c
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown

Claude finished @Yeraze's task —— View job


PR Review

  • Read changed files
  • Analyze implementation
  • Post review feedback

Summary

This is a well-targeted fix for a real UX problem. The core change is minimal and the reasoning is clearly explained in the PR description and inline comments. However, there is one behavioral bug introduced by the fix that needs attention before merge.


Bug: Double-scheduleNextReconnect() call on every retry after the first

The problem:

When attemptReconnect() calls connect() and connect() fails (which is the normal retry loop), scheduleNextReconnect() ends up being called twice per failure:

  1. Inside connect()'s catch block (line 929) — increments reconnectAttempts, sets reconnectTimer = T1
  2. Back in attemptReconnect() (line 5976) — !ok && shouldReconnect is true, so it calls scheduleNextReconnect() again — increments reconnectAttempts a second time, overwrites reconnectTimer = T2 (T1 is now an orphaned leak)

scheduleNextReconnect() has no guard against being entered while reconnectTimer is already set; it just overwrites the field.

Effect:

  • reconnectAttempts is incremented twice per failed retry (so maxAttempts cap triggers at half the expected count)
  • The backoff delay doubles faster than intended (because delay = initial * 2^reconnectAttempts uses the inflated counter)
  • An orphaned timer fires after T1's delay, redundantly calling attemptReconnect() a second time
  • Double reconnecting events are emitted to the UI per failure

The fix for this is straightforward: the connect() catch block should not call scheduleNextReconnect() itself. It should only arm shouldReconnect = true and set connectionState = 'reconnecting'. The callers (attemptReconnect and the initial call path) already handle scheduling. Specifically:

  • For the first connect (direct call, not via attemptReconnect): the caller is whatever triggered the initial connect. But connect() isn't called by scheduleNextReconnect directly — the retry loop goes through attemptReconnectconnect. So who calls scheduleNextReconnect for the very first failure? That's the intent of the new code.

Looking more carefully at the call graph:

Initial startup → connect()  [first attempt]
  → fails → catch: arm shouldReconnect, call scheduleNextReconnect()  ← correct, no caller will re-schedule
  → returns false
  
scheduleNextReconnect timer fires → attemptReconnect() → connect()  [second attempt]
  → fails → catch: arm shouldReconnect, call scheduleNextReconnect()  ← T1 scheduled
  → returns false
  attemptReconnect: !ok && shouldReconnect → scheduleNextReconnect()  ← T2 scheduled (T1 leaked, counter doubled)

The fix needs to distinguish the first attempt from subsequent attempts. One approach: check whether reconnectAttempts > 0 before calling scheduleNextReconnect() in the catch — if we're already in the retry loop, attemptReconnect will handle re-scheduling. Another approach: remove the scheduleNextReconnect() call from attemptReconnect's !ok branch (since connect() already schedules it), but that changes pre-existing behavior for the heartbeat-driven path.

The cleanest fix is: in the connect() catch block, call scheduleNextReconnect() only when reconnectAttempts === 0 (i.e., this is the first failure, and no one else will re-schedule), or equivalently add a guard to scheduleNextReconnect to no-op if reconnectTimer is already pending.


Test Coverage Gap

The existing tests stub scheduleNextReconnect so they don't catch the double-call issue. A test that calls attemptReconnect() with a real (unstubbed) scheduleNextReconnect() would expose this — the reconnectAttempts counter would be 2 after a single failed retry instead of 1.


Other Observations (non-blocking)

void secondAttempt on line 99 of the new test — the previous automated review noted this. await (m as any).attemptReconnect() already resolves the promise; the void on line 99 is a no-op on undefined. Harmless but slightly misleading.

Infinite retry UX — with reconnectMaxAttempts defaulting to 0 (unlimited), a MeshCore source that can never connect (e.g. wrong serial port) will retry forever with no escalation. This is consistent with Meshtastic TCP behavior, but worth a UI indicator or at least a warning log after N consecutive failures. This is a pre-existing design choice, not introduced by this PR.

shouldReconnect = false reset on successful connect (line 857) — correct and well-commented.


Verdict

The intent and approach are sound. The double-scheduleNextReconnect call is a real behavioral regression that will cause accelerated backoff exhaustion (when reconnectMaxAttempts is set) and orphaned timers in production. The fix is small. I'd suggest addressing this before merge.


· Branch: claude/vibrant-volta-10neq7

Flagged by automated PR review: connect()'s catch block schedules a
retry, and attemptReconnect()'s own `!ok` branch then scheduled a
second one on top of it every time the retry loop drove a failing
connect(). That doubled reconnectAttempts (and the backoff growth it
drives), left the first timer orphaned, and emitted 'reconnecting'
twice per failure.

scheduleNextReconnect() now no-ops if a reconnect is already pending,
making it safe to call from multiple points in the same failure path.
Added a regression test using the real (unstubbed) scheduler to catch
this — verified it fails without the guard (reconnectAttempts == 2)
and passes with it.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Gr4SAmd6b69UzqEnamyc4c

Yeraze commented Jul 4, 2026

Copy link
Copy Markdown
Owner Author

Good catch — that was a real bug. Fixed in 77adfb5: scheduleNextReconnect() now no-ops if a reconnect is already pending, so it's safe to call from both connect()'s catch block and attemptReconnect()'s !ok branch without double-incrementing reconnectAttempts or leaking the first timer.

Added a regression test using the real (unstubbed) scheduler — confirmed it fails without the guard (reconnectAttempts comes out as 2 instead of 1) and passes with it.

-- Authored by Roger 🤓


Generated by Claude Code

@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown

Claude finished @Yeraze's task —— View job


PR Review

  • Read changed files
  • Analyze implementation
  • Post review feedback

Summary

This is a clean, well-targeted fix. The previous automated reviews already caught and resolved the main behavioral regression (double-scheduleNextReconnect in the retry loop). After reviewing the final state of all three commits (including the guard fix in 77adfb5), the implementation is sound.


src/server/meshcoreManager.ts

Catch block (lines 905–931)

The ordering is correct:

  1. disconnect() is awaited — this resets shouldReconnect = false (line 988) and clears any pending reconnectTimer
  2. shouldReconnect is re-armed to true
  3. connectionState is set to 'reconnecting'
  4. scheduleNextReconnect() is called

scheduleNextReconnect() guard (line 4985)

if (this.reconnectTimer) return;

This is the right place for the idempotency guard. It prevents the double-scheduling that the second automated review flagged — where connect()'s catch block and attemptReconnect()'s !ok branch would both call scheduleNextReconnect(), doubling reconnectAttempts and leaking the first timer. The guard is early enough to no-op before the counter increment at line 5001. ✓

shouldReconnect = false on success (line 857)

Correct. Without this reset, sources without heartbeatIntervalSeconds configured would silently gain post-connect auto-reconnect as a side effect. The inline comment explains the reasoning. ✓

One minor observation — distanceDeleteScheduler.start() in the success path (line 861)

this.distanceDeleteScheduler.start().catch(...)

The second test (resets shouldReconnect to false once a connect attempt succeeds) doesn't stub distanceDeleteScheduler.start(). The real DistanceDeleteScheduler.start() hits databaseService.settings.getSettingForSource(...), which isn't mocked in that test. In practice this appears to work (the test passes per the PR description), but it's relying on graceful failure behavior from the real scheduler rather than a deliberate mock. This is a pre-existing test hygiene issue, not introduced by this PR — but worth noting.


src/server/meshcoreManager.initialConnectRetry.test.ts

Test 1: arms shouldReconnect and schedules a retry when the first attempt fails

Stubs disconnect and scheduleNextReconnect correctly. Verifies the flag state, connectionState, and call count.

Test 2: resets shouldReconnect to false once a connect attempt succeeds without heartbeat configured

Pre-arms shouldReconnect = true to simulate the post-failure state before a successful retry, then confirms it's cleared on success. This is the regression guard for the heartbeat gating concern.

Test 3: retries the eventual reconnect attempt via attemptReconnect once scheduled

End-to-end: first call fails, schedules a retry (stubbed), then manually drives attemptReconnect() and confirms connected = true and shouldReconnect = false. The previous review's note about void secondAttempt on line 99 is no longer applicable — the current code at line 98–99 is await (m as any).attemptReconnect() with no stray void statement.

Test 4: does not double-schedule a reconnect when attemptReconnect() drives a failing connect() (regression)

Uses real (unstubbed) scheduleNextReconnect() with vi.useFakeTimers(). Confirms reconnectAttempts === 1 and vi.getTimerCount() === 1 after a single failed retry. This is the most important test — it would have caught the bug fixed in 77adfb5 and will catch any future regression to that guard. Cleanup via finally { vi.useRealTimers() } is correct.


src/server/meshcoreManager.connect-error.test.ts

The +4 lines stub scheduleNextReconnect in makeConfiguredManager() so the existing catch-handler log-message tests don't leak a real setTimeout. This is the correct fix — without it, the existing tests would start retrying after the assertions complete, potentially causing interference. ✓


Design observation (not a bug)

With reconnectMaxAttempts defaulting to 0 (unlimited), a MeshCore source that can never connect (wrong serial port, device permanently offline) will retry forever. This is consistent with Meshtastic TCP behavior and is a pre-existing design choice. However, now that initial-failure retry is enabled by default for all MeshCore sources (not just heartbeat-configured ones), operators hitting a misconfiguration will see perpetual reconnect noise in logs. A log-level escalation (e.g., logger.warn after N consecutive failures) could improve observability, but that's an enhancement for a follow-up issue rather than a blocker here.


Verdict

Approve. The fix is minimal, correctly implemented, and the regression tests are thorough — including the real-scheduler test (test 4) that directly validates the 77adfb5 guard. No blocking issues.


· Branch: claude/vibrant-volta-10neq7

Nit from PR review on #3919 — the two success-path tests reached
connect()'s real distanceDeleteScheduler.start(), which hits
databaseService.settings.getSettingForSource() unmocked. Stub the
scheduler explicitly instead of relying on it failing gracefully.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Gr4SAmd6b69UzqEnamyc4c
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown

Claude finished @Yeraze's task —— View job


PR Review

  • Read changed files
  • Analyze implementation
  • Post review feedback

Summary

This is a clean, well-scoped fix. The previous automated reviews caught and resolved the most significant behavioral issue (double-scheduling via the reconnectTimer guard in 77adfb5). After reviewing the final committed code, the implementation is correct and the test coverage is thorough. A few observations follow.


Code Quality

src/server/meshcoreManager.ts — catch block (lines 905–931)

The ordering of operations is correct:

  1. disconnect() is awaited — this resets shouldReconnect = false and clears reconnectTimer
  2. shouldReconnect is re-armed to true
  3. connectionState is set to 'reconnecting'
  4. scheduleNextReconnect() is called

The inline comment on the catch block (lines 919–926) is unusually thorough and directly useful for future maintainers. The reasoning for why disconnect() must precede the re-arm is clearly stated.

scheduleNextReconnect() guard (line 4985)

if (this.reconnectTimer) return;

This is the right fix for the double-scheduling bug. It correctly no-ops before the counter increment at line 5001, so reconnectAttempts stays accurate regardless of which caller gets there first. The comment explaining the two callers (lines 4980–4984) is well-placed.

shouldReconnect = false on success (line 857)

Correct and well-commented. Without this, sources without heartbeatIntervalSeconds configured would gain implicit post-connect auto-reconnect as a side effect. The comment at lines 851–856 explains the exact reasoning for the reset.

attemptReconnect() — potential double-call to scheduleNextReconnect

In attemptReconnect() (lines 5980–5988), both the !ok branch and the catch branch call scheduleNextReconnect(). Since connect()'s own catch block also calls scheduleNextReconnect(), the flow on failure is:

  • connect() catch → scheduleNextReconnect() → sets reconnectTimer, counter incremented
  • attemptReconnect() returns false!ok && shouldReconnectscheduleNextReconnect() → hits the if (this.reconnectTimer) return guard → no-ops ✓

The guard handles this cleanly. However, the catch branch of attemptReconnect() (line 5984) handles the case where connect() throws rather than returning false. If connect() throws, its own catch block won't have run (since the throw propagates past it), so scheduleNextReconnect() hasn't been called yet — meaning attemptReconnect()'s catch branch calling scheduleNextReconnect() directly is the only scheduler in that path. This is correct.


Test Coverage

meshcoreManager.initialConnectRetry.test.ts

Four tests covering the important paths:

  1. Arms shouldReconnect and schedules a retry on first failure — verifies flag, state, and call count. ✓
  2. Resets shouldReconnect to false on success — pre-arms the flag to simulate a post-failure retry state, then confirms it's cleared. ✓
  3. End-to-end retry-then-succeed via attemptReconnect() — manually drives the reconnect after a stubbed scheduler, confirms connected = true and shouldReconnect = false. ✓
  4. No double-schedule regression — uses real (unstubbed) scheduleNextReconnect() with vi.useFakeTimers(). Confirms reconnectAttempts === 1 and vi.getTimerCount() === 1. This is the most important test — it directly validates the 77adfb5 guard. ✓

meshcoreManager.connect-error.test.ts

The +4 lines stub scheduleNextReconnect in makeConfiguredManager() to prevent timer leaks from the existing log-message tests. Correct. ✓

Minor: test 2 doesn't stub distanceDeleteScheduler.start()'s downstream DB calls

Line 68 stubs distanceDeleteScheduler entirely ({ start: vi.fn().mockResolvedValue(undefined), stop: vi.fn() }), so this is actually fine — the real scheduler is replaced before connect() runs. ✓


Design Observations (non-blocking)

Infinite retry with no log escalation. With reconnectMaxAttempts defaulting to 0 (unlimited), a MeshCore source that permanently fails (wrong serial port, device removed) retries forever. The log at line 5009 emits at info level on every attempt. After enough failures, this will produce continuous log noise with no escalation. Consider a logger.warn after some threshold (e.g., 5 consecutive failures). This is a pre-existing design choice and consistent with Meshtastic TCP behavior, so it's not a blocker.

reconnectTimer is cleared by disconnect() before the re-arm. If disconnect() clears reconnectTimer to null, the guard in scheduleNextReconnect() correctly proceeds. Worth confirming disconnect() does clear the timer — this is the assumed behavior based on the comment at lines 987–988 mentioned in earlier reviews.


Verdict

Approve. The implementation is minimal, correctly ordered, and addresses a real UX regression (#3918). The double-scheduling fix (77adfb5) and its regression test (test 4) are the key quality additions. No blocking issues.


· Branch: claude/vibrant-volta-10neq7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] MeshCore: Meshmonitor does not automatically reconnect MeshCore source

2 participants