Fix flaky heartbeat and view change tests #4

Open
chgeuer wants to merge 1 commit into ityonemo:main from chgeuer:fix-flaky-tests
chgeuer commented Mar 9, 2026

Fixes #3

Changes

ViewChangeTest (line 130)

Replaced the live 3-node cluster with an isolated single-replica setup that uses dummy peer processes. The original test asserted on the intermediate view_change_votes state, which is cleared as soon as the view change completes, so the assertion raced against the replica (a TOCTOU race). The test now asserts on the [:view_change, :do_view_change, :sent] telemetry event instead, which proves a majority was reached without inspecting transient state.
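
To illustrate the pattern of asserting on an event rather than on transient state, here is a minimal standalone sketch. It is not the test from this PR: the `TelemetryProbe` module, the handler id, and the metadata key are hypothetical, and `:telemetry.execute/3` stands in for the replica emitting the event. It assumes the `:telemetry` package can be fetched with `Mix.install/1`.

```elixir
# Standalone script; fetches :telemetry because it is not part of the standard library.
Mix.install([{:telemetry, "~> 1.0"}])

defmodule TelemetryProbe do
  # Hypothetical helper: forwards every matching telemetry event to the pid
  # that was passed as handler config at attach time.
  def handle(event, measurements, metadata, %{pid: pid}) do
    send(pid, {:telemetry_event, event, measurements, metadata})
  end

  # Attaches the forwarding handler for `event` on behalf of the calling process.
  def attach(handler_id, event) do
    :telemetry.attach(handler_id, event, &__MODULE__.handle/4, %{pid: self()})
  end
end

event = [:view_change, :do_view_change, :sent]

# Attach first, so the event cannot fire while nobody is listening.
:ok = TelemetryProbe.attach("do-view-change-probe", event)

# Stand-in for the replica under test; the metadata key is made up for this sketch.
:telemetry.execute(event, %{}, %{view_number: 1})

# The event message stays in the mailbox, so there is no state to race against.
receive do
  {:telemetry_event, ^event, _measurements, metadata} ->
    IO.inspect(metadata, label: "DoViewChange sent")
after
  1_000 -> raise "expected #{inspect(event)} within 1s"
end

:telemetry.detach("do-view-change-probe")
```

In an ExUnit test the `receive` block would typically be an `assert_receive` with the same pattern.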

HeartbeatTest (line 109)

Three fixes:

  1. Deterministic node names: Used hbt_a_/hbt_b_/hbt_c_ prefixes that sort correctly under :erlang.term_to_binary/1, so the test identifies and stops the actual primary for view 0.

  2. Dummy process for dead primary: After stopping the primary, registers a dummy process under its name so the backups' start_manual_view_change broadcast via send/2 does not crash on an unregistered atom.

  3. Telemetry listener ordering: Attaches the [:timer, :primary_timeout] listener before stopping the primary, eliminating the window where the event fires with no listener.
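
The second fix relies on a property of `send/2`: sending to an atom that is not registered raises `ArgumentError`, while sending to a registered name succeeds even if the receiver ignores the message. A minimal sketch, assuming nothing about the library itself (the `DummyPeer` module and the `:hypothetical_dead_primary` name are made up for illustration):

```elixir
defmodule DummyPeer do
  # Spawns a process that silently discards every message and registers it
  # under `name`, so that send/2 to that name no longer raises.
  def start(name) when is_atom(name) do
    pid = spawn(fn -> loop() end)
    true = Process.register(pid, name)
    pid
  end

  # Returns :sent if send/2 succeeds and :crashed if the name is unregistered.
  def try_send(name, msg) do
    send(name, msg)
    :sent
  rescue
    ArgumentError -> :crashed
  end

  defp loop do
    receive do
      _msg -> loop()
    end
  end
end

name = :hypothetical_dead_primary

# Before the dummy exists, the name is unregistered and send/2 raises.
IO.inspect(DummyPeer.try_send(name, :ping), label: "without dummy")

# After registering the dummy, the same send succeeds.
DummyPeer.start(name)
IO.inspect(DummyPeer.try_send(name, :ping), label: "with dummy")
```

The dummy only needs to exist and hold the name; it does not have to answer.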

Verification

Both tests now pass in 5 of 5 consecutive runs; previously they failed roughly 50-80% of the time. Full suite: 95 tests, 0 failures.

ViewChangeTest: Replaced the live 3-node cluster with an isolated
single-replica setup using dummy peer processes. The original test
asserted on intermediate vote state (the view_change_votes map),
which is cleared when the view change completes (a TOCTOU race). It
now asserts on the DoViewChange telemetry event instead.

HeartbeatTest: Three fixes:
- Used deterministically-sorted node names (hbt_a/b/c) so the correct
  primary is identified. The original names (primary_xxx/backup1_xxx)
  sorted incorrectly under :erlang.term_to_binary/1, causing the test
  to stop the wrong node.
- Register a dummy process under the dead primary's name after
  stopping it, so the backups' broadcast in start_manual_view_change
  doesn't crash on send/2 to an unregistered atom.
- Attach telemetry listener before stopping the primary to avoid a
  window where the timeout event fires before anyone is listening.
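
The naming problem from the first bullet can be reproduced in isolation. The sketch below assumes that the primary for view 0 is the first replica when names are ordered by their `:erlang.term_to_binary/1` encoding; that selection rule is inferred from the description above, not taken from the library's source. The `NameOrder` module and the `_1` suffixes are hypothetical.

```elixir
defmodule NameOrder do
  # Orders node names by their external term format encoding.
  def sorted(names), do: Enum.sort_by(names, &:erlang.term_to_binary/1)

  # Assumption for this sketch: the first name in that order is the primary for view 0.
  def primary_for_view_0(names), do: names |> sorted() |> hd()
end

# Role-based names: "backup1" sorts before "primary", so the node the test
# calls the primary is not the one selected.
NameOrder.primary_for_view_0([:primary_1, :backup1_1, :backup2_1])
|> IO.inspect(label: "role-based names")

# Prefixed names of equal length sort in the order the test expects.
NameOrder.primary_for_view_0([:hbt_c_1, :hbt_a_1, :hbt_b_1])
|> IO.inspect(label: "hbt-prefixed names")
```

Because the encoding stores the atom's length before its characters, names of different lengths are ordered by length first, which is another reason to keep the prefixes and suffixes the same length.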
Development

Successfully merging this pull request may close these issues.

Flaky tests: HeartbeatTest and ViewChangeTest fail intermittently