fix: repair flood, dead nodes in responses, gRPC keepalive, and placement free-space filtering#28

Merged
Satyam709 merged 1 commit into main from opencode/swift-sailor on May 5, 2026

Summary

Fixes two critical bugs discovered during cluster testing and adds free-space filtering to placement:

1. gRPC "too_many_pings" / ENHANCE_YOUR_CALM

The PeerDialer sends keepalive pings every 5s, but gRPC's default KeepaliveEnforcementPolicy.MinTime is 5 minutes, so the server treats the pings as too frequent and closes the connection with GoAway. Fixed by configuring KeepaliveEnforcementPolicy{MinTime: 10s, PermitWithoutStream: true} on both the storage and metadata gRPC servers.
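
The server-side setting described above can be sketched as follows. This is a minimal example using the public grpc-go `keepalive` package; the `newServer` helper is a hypothetical name, and the actual wiring in the storage and metadata servers may differ.

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer (hypothetical helper) returns a gRPC server that accepts client
// keepalive pings as often as once every 10 seconds, including on
// connections that currently have no active RPCs.
func newServer() *grpc.Server {
	policy := keepalive.EnforcementPolicy{
		MinTime:             10 * time.Second, // value taken from the PR description
		PermitWithoutStream: true,             // allow pings on idle connections
	}
	return grpc.NewServer(grpc.KeepaliveEnforcementPolicy(policy))
}

func main() {
	srv := newServer()
	defer srv.Stop()
	// Register services and call srv.Serve(listener) here.
}
```

The grpc-go documentation for `keepalive.ClientParameters` notes that a client `Time` below 10s is raised to 10s, which may be why a 10s `MinTime` is enough even though the dialer asks for 5s. That is an inference, not something this PR states.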

2. Replication/Repair Flood (7 root causes)

| Cause | Fix |
| --- | --- |
| EnqueueRepair had no dedup: the same (chunkID, target) was delivered on every heartbeat | Added in-flight tracking map |
| scheduleRepairJobs created duplicate FSM repair jobs | Added HasActiveRepairForChunkTarget check |
| Failed repair was re-scheduled immediately with zero backoff | Exponential backoff retryDelay() via time.AfterFunc |
| ReportRepairResult had no NodeId, so OnJobComplete couldn't clear pendingJobs, causing infinite redelivery | Worker now passes m.nodeID in the request |
| TriggerRepair counted the dead node as a live replica (Raft async race) | Explicitly exclude dead node from liveReplicas |
| Dead nodes in ChunkRecord.Replicas were never evicted | TriggerRepair now calls proposeEvictChunk |
| GetFile handler returned dead/draining node addresses (and raw node IDs as fallback addresses) | Filter dead/draining nodes, skip unknown entries |

3. Placement didn't consider free space

ScheduleRepairJobs selected any live node regardless of free space. Added a minSpace int parameter to the SelectNodes/SelectNodeReverse interface: the caller passes the chunk size, and nodes with FreeSpace < minSpace are excluded. Applied in ScheduleRepairJobs (passes chunk.Size) and CreateFile (passes req.ChunkSize).
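
The free-space filter can be sketched as below. The `Node` type and the `selectNodes` signature are hypothetical stand-ins. The real SelectNodes/SelectNodeReverse strategies presumably also rank candidates, which this sketch does not attempt.

```go
package main

import "fmt"

// Node is a hypothetical view of a storage node as seen by placement code.
type Node struct {
	ID        string
	FreeSpace int // bytes available
}

// selectNodes returns up to count nodes whose FreeSpace is at least
// minSpace, preserving the input order.
func selectNodes(nodes []Node, count, minSpace int) []Node {
	if count <= 0 {
		return nil
	}
	picked := make([]Node, 0, count)
	for _, n := range nodes {
		if len(picked) == count {
			break
		}
		if n.FreeSpace < minSpace {
			continue // not enough room for the chunk
		}
		picked = append(picked, n)
	}
	return picked
}

func main() {
	nodes := []Node{
		{ID: "a", FreeSpace: 0},
		{ID: "b", FreeSpace: 4096},
		{ID: "c", FreeSpace: 512},
		{ID: "d", FreeSpace: 2048},
	}
	chunkSize := 1024
	for _, n := range selectNodes(nodes, 2, chunkSize) {
		fmt.Println(n.ID, n.FreeSpace)
	}
}
```

With a 1024-byte chunk, nodes `a` and `c` are skipped and the two selected nodes are `b` and `d`.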

Tests

  • TestRepairAfterNodeDeathConverges — 5-node cluster, upload chunk, kill replica holder, verify 1 repair job completes, no flood, dead node evicted from FSM, not in GetFile response
  • TestNoTooManyPingsSent — 3-node cluster, 3 chunks, 20s keepalive cycle, all chunks fully replicated
  • TestPlacementExcludesFullNodes — verifies full nodes (FreeSpace=0) excluded from placement selection
  • 8 new placement strategy unit tests for minSpace filtering
  • All 42 existing integration tests pass, all 29 placement unit tests pass

…ment free-space filtering

- gRPC: add KeepaliveEnforcementPolicy to storage and metadata servers
  to prevent 'too_many_pings'/ENHANCE_YOUR_CALM GoAway errors

- Repair flood: add in-flight dedup in EnqueueRepair to prevent
  duplicate (chunkID,target) repair jobs from heartbeat redelivery

- Repair flood: add HasActiveRepairForChunkTarget check in
  scheduleRepairJobs to prevent duplicate FSM repair jobs

- Repair flood: add exponential backoff on failed repair re-scheduling
  instead of immediate retry

- Repair flood: pass NodeId in ReportRepairResult so OnJobComplete can
  properly clear pendingJobs (was empty, causing infinite redelivery)

- Repair: exclude dead node from liveReplicas in TriggerRepair to
  handle async Raft race where node isn't marked dead yet

- Dead node leaks: filter Dead/Draining nodes in GetFile handler,
  stop using raw node IDs as fallback addresses

- Dead node leaks: evict dead node from ChunkRecord.Replicas in
  TriggerRepair via proposeEvictChunk

- Placement: add minSpace parameter to SelectNodes/SelectNodeReverse,
  only select nodes with FreeSpace >= chunk size during repair and
  initial placement (scheduler passes chunk.Size, file_handler passes
  req.ChunkSize)

- Tests: integration test for repair convergence after node death,
  integration test for no-too-many-pings connections,
  integration test for full-node exclusion from placement,
  unit tests for minSpace filtering in both strategies
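
The GetFile filtering described in the "Dead node leaks" item can be sketched like this. `NodeState`, `NodeInfo`, and `replicaAddrs` are illustrative names and not the repository's actual types.

```go
package main

import "fmt"

// NodeState is a hypothetical liveness state for a storage node.
type NodeState int

const (
	Alive NodeState = iota
	Draining
	Dead
)

// NodeInfo is a hypothetical registry entry for a node.
type NodeInfo struct {
	Addr  string
	State NodeState
}

// replicaAddrs maps replica node IDs to dialable addresses. Nodes that are
// dead, draining, or missing from the registry are dropped; an unknown ID is
// never returned as if it were an address.
func replicaAddrs(replicas []string, registry map[string]NodeInfo) []string {
	addrs := make([]string, 0, len(replicas))
	for _, id := range replicas {
		info, ok := registry[id]
		if !ok || info.State != Alive {
			continue
		}
		addrs = append(addrs, info.Addr)
	}
	return addrs
}

func main() {
	registry := map[string]NodeInfo{
		"n1": {Addr: "10.0.0.1:9000", State: Alive},
		"n2": {Addr: "10.0.0.2:9000", State: Dead},
		"n3": {Addr: "10.0.0.3:9000", State: Draining},
	}
	// "n4" is not in the registry and is skipped.
	fmt.Println(replicaAddrs([]string{"n1", "n2", "n3", "n4"}, registry))
}
```

Skipping unknown entries means a client may receive fewer addresses than the chunk has recorded replicas, but never an address that cannot be dialed.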
Satyam709 merged commit 20c4e1c into main on May 5, 2026
4 checks passed
Satyam709 deleted the opencode/swift-sailor branch on May 5, 2026 at 19:13