fix: repair flood, dead nodes in responses, gRPC keepalive, and placement free-space filtering#28
Merged
Merged
Conversation
…ment free-space filtering - gRPC: add KeepaliveEnforcementPolicy to storage and metadata servers to prevent 'too_many_pings'/ENHANCE_YOUR_CALM GoAway errors - Repair flood: add in-flight dedup in EnqueueRepair to prevent duplicate (chunkID,target) repair jobs from heartbeat redelivery - Repair flood: add HasActiveRepairForChunkTarget check in scheduleRepairJobs to prevent duplicate FSM repair jobs - Repair flood: add exponential backoff on failed repair re-scheduling instead of immediate retry - Repair flood: pass NodeId in ReportRepairResult so OnJobComplete can properly clear pendingJobs (was empty, causing infinite redelivery) - Repair: exclude dead node from liveReplicas in TriggerRepair to handle async Raft race where node isn't marked dead yet - Dead node leaks: filter Dead/Draining nodes in GetFile handler, stop using raw node IDs as fallback addresses - Dead node leaks: evict dead node from ChunkRecord.Replicas in TriggerRepair via proposeEvictChunk - Placement: add minSpace parameter to SelectNodes/SelectNodeReverse, only select nodes with FreeSpace >= chunk size during repair and initial placement (scheduler passes chunk.Size, file_handler passes req.ChunkSize) - Tests: integration test for repair convergence after node death, integration test for no-too-many-pings connections, integration test for full-node exclusion from placement, unit tests for minSpace filtering in both strategies
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes two critical bugs discovered during cluster testing:
1. gRPC "too_many_pings" / ENHANCE_YOUR_CALM
The PeerDialer sends keepalive pings every 5s, but gRPC's default
KeepaliveEnforcementPolicy.MinTimeis 5 minutes, causing the server to send GoAway. Fixed by configuringKeepaliveEnforcementPolicy{MinTime: 10s, PermitWithoutStream: true}on both storage and metadata gRPC servers.2. Replication/Repair Flood (7 root causes)
EnqueueRepairhad no dedup — same(chunkID, target)delivered on every heartbeatscheduleRepairJobscreated duplicate FSM repair jobsHasActiveRepairForChunkTargetcheckretryDelay()viatime.AfterFuncReportRepairResulthad noNodeId—OnJobCompletecouldn't clearpendingJobs, causing infinite redeliverym.nodeIDin the requestTriggerRepaircounted dead node as live replica (Raft async race)liveReplicasChunkRecord.Replicasnever evictedTriggerRepairnow callsproposeEvictChunkGetFilehandler returned dead/draining node addresses (and raw node IDs as fallback addresses)3. Placement didn't consider free space
ScheduleRepairJobsselected any live node regardless of free space. AddedminSpace intparameter toSelectNodes/SelectNodeReverseinterface — caller passes chunk size, nodes withFreeSpace < minSpaceare excluded. Applied inScheduleRepairJobs(passeschunk.Size) andCreateFile(passesreq.ChunkSize).Tests
TestRepairAfterNodeDeathConverges— 5-node cluster, upload chunk, kill replica holder, verify 1 repair job completes, no flood, dead node evicted from FSM, not in GetFile responseTestNoTooManyPingsSent— 3-node cluster, 3 chunks, 20s keepalive cycle, all chunks fully replicatedTestPlacementExcludesFullNodes— verifies full nodes (FreeSpace=0) excluded from placement selectionminSpacefiltering