Fix publish replication reliability #4093

Open
Bojan131 wants to merge 1 commit into v8/develop from fix/publish-replication

@Bojan131
Collaborator

What

Fixes the "Not replicated to enough nodes!" errors that happen when publishing large knowledge assets or during parallel publishes.

Changes

  • Added a semaphore (max 3 concurrent) to avoid flooding all shard nodes with messages at once
  • Replication now batches nodes in groups of minAcks+2 and exits early once minimum replication is met
  • Each node message is wrapped in try/catch, so one failing peer no longer kills the entire operation
  • Added a single retry on NACK before giving up on a peer
  • Bumped publish message timeout from 15s to 60s for larger payloads
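
As a rough illustration of the batching, early exit, per-peer error isolation, and single NACK retry listed above, here is a minimal sketch. It is not the code from this PR: `Reply`, `SendFn`, `sendWithRetry`, and `replicate` are hypothetical names, the real message shape is not shown here, and treating a thrown error as "no ACK, no retry" is an assumption. The concurrency limit is left out to keep the sketch short.

```typescript
// Hypothetical reply and transport types, introduced only for this sketch.
type Reply = "ACK" | "NACK";
type SendFn = (nodeId: string) => Promise<Reply>;

// Send to one peer. A NACK is retried once; a thrown error is treated as
// "no ACK" (assumption) and never propagates to the caller.
async function sendWithRetry(nodeId: string, send: SendFn): Promise<boolean> {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      if ((await send(nodeId)) === "ACK") return true;
    } catch {
      // One failing peer must not abort the whole publish.
      return false;
    }
  }
  return false;
}

// Contact nodes in batches of minAcks + 2 and stop as soon as enough
// peers have acknowledged.
async function replicate(
  nodes: string[],
  minAcks: number,
  send: SendFn,
): Promise<{ acks: number; contacted: number }> {
  const batchSize = minAcks + 2;
  let acks = 0;
  let contacted = 0;
  for (let i = 0; i < nodes.length && acks < minAcks; i += batchSize) {
    const batch = nodes.slice(i, i + batchSize);
    const results = await Promise.all(
      batch.map((nodeId) => sendWithRetry(nodeId, send)),
    );
    contacted += batch.length;
    acks += results.filter((ok) => ok).length;
  }
  return { acks, contacted };
}
```

Because `sendWithRetry` never rejects, `Promise.all` cannot fail fast on a single bad peer. The caller would compare `acks` against `minAcks` to decide whether the publish met minimum replication.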

Why

Under load (parallel publishes with large assets), the node sent all replication messages simultaneously, and a single failed or slow peer caused the whole publish to fail. This change makes replication more resilient without changing the minimum replication requirements.
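
To illustrate how a semaphore keeps the node from sending everything at once, here is a minimal counting-semaphore sketch. The `Semaphore` class and its `run` helper are hypothetical and written only for this example; the PR may use a different implementation or a library. Only the limit of 3 comes from the description above.

```typescript
// Minimal counting semaphore (hypothetical helper, not the PR's code).
class Semaphore {
  private available: number;
  private waiters: Array<() => void> = [];

  constructor(permits: number) {
    this.available = permits;
  }

  // Take a permit, or wait in FIFO order until one is released.
  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    await new Promise<void>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  // Hand the permit directly to the next waiter, or return it to the pool.
  release(): void {
    const next = this.waiters.shift();
    if (next) {
      next();
    } else {
      this.available++;
    }
  }

  // Run a task while holding a permit; the permit is released even if the
  // task throws.
  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}

// At most 3 replication messages in flight at a time, as described above.
const replicationSemaphore = new Semaphore(3);
```

A caller would wrap each outgoing message, for example `replicationSemaphore.run(() => sendMessage(node))`, where `sendMessage` stands in for whatever function actually sends the replication message.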

Made with Cursor

- Add semaphore (3 concurrent) to limit parallel replication messages
- Batch replication to groups of minAcks+2 with early exit when minimum reached
- Wrap individual node messages in try/catch so one failing peer doesn't kill the whole operation
- Add single retry on NACK before giving up on a peer
- Increase publish message timeout from 15s to 60s for large knowledge assets

Made-with: Cursor