Skip to content

fix: resolve three bugs — file commit, replica ID/address format, spurious repair cascade#27

Merged
Satyam709 merged 4 commits into
mainfrom
bug-fixes
May 4, 2026
Merged

fix: resolve three bugs — file commit, replica ID/address format, spurious repair cascade#27
Satyam709 merged 4 commits into
mainfrom
bug-fixes

Conversation

@Satyam709

Copy link
Copy Markdown
Owner

Summary

Fixes three bugs discovered during cluster operation testing.

Bug 1: File status stays "creating" forever

Root cause: client/internal/service/upload.go never called CommitFile after all chunks were uploaded.

Fixes:

  • Added CommitFile to the metadata client interface, gRPC client, and mock
  • Upload service calls CommitFile after all chunks uploaded successfully
  • GetFile handler rejects files with status != "complete"
  • ListFiles filters to only return complete files
  • CLI removes the STATUS column (all listed files are complete)

Bug 2: Replicas stored as mixed node IDs and raw gRPC addresses

Root cause: storage_handler.go sent confirmedNodes as [self_nodeID, replica_address:port, ...] mixing node IDs with raw addresses. GetFile returned these raw values to clients, causing connection failures.

Fixes:

  • Added normalizeReplicaNodeIDs() in the FSM that converts raw addresses to node IDs before storing in ChunkRecord.Replicas
  • GetFile handler now resolves node IDs to full gRPC addresses in ChunkInfo.Replicas

Bug 3: Unconditional repair scheduling on every chunk commit

Root cause: chunk_handlers.go called ScheduleRepairForChunk after every CommitChunk regardless of replication state.

Fix: Only schedules repair when len(liveReplicas) < ReplicationFactor.

Tests

Added 6 new integration tests and updated existing ones. All 52 tests pass.

Satyam709 added 2 commits May 4, 2026 18:16
Bug 1: Upload now calls CommitFile after all chunks uploaded,
GetFile rejects non-complete files, ListFiles filters to complete only,
CLI removes status column.

Bug 2: FSM normalizeReplicaNodeIDs converts raw addresses to node IDs
at insert time. GetFile resolves node IDs to gRPC addresses.

Bug 3: CommitChunk only schedules repair when replicas < ReplicationFactor
instead of unconditionally.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses three operational bugs in the distributed FS: files stuck in creating, replica identifiers stored/returned in inconsistent formats, and repair jobs being scheduled unnecessarily on every chunk commit. The changes span metadata FSM/server behavior, client upload flow, CLI output, and integration/unit tests.

Changes:

  • Finalize uploads by adding CommitFile to the client interface/implementations and calling it from the upload service; metadata now rejects GetFile for non-complete files and ListFiles only returns complete files.
  • Normalize chunk replica entries to node IDs in the FSM and resolve node IDs back to gRPC addresses when returning GetFile chunk replicas.
  • Only schedule repair after CommitChunk when live replica count is below the configured replication factor; add/adjust integration tests for these scenarios.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
metadata/server/file_handler.go Enforces complete-only GetFile/ListFiles and resolves chunk replica node IDs to addresses in GetFile.
metadata/server/chunk_handlers.go Gates repair scheduling after CommitChunk on under-replication.
metadata/fsm/fsm.go Normalizes replica entries to node IDs during chunk commit.
metadata/app.go Exposes FSM and Scheduler on MetadataApp for integration testing access and updates internal wiring.
integration/meta_storage/upload_download_test.go Adds integration coverage for missing vs present CommitFile, plus a full upload→commit→download round trip.
integration/meta_storage/replication_test.go Adds integration tests around spurious repairs, replica formats, and repair convergence behavior.
integration/meta_storage/main_test.go Increases integration cluster to 3 storage nodes to exercise RF=3 behavior.
integration/meta_storage/file_ops_test.go Updates file ops tests to commit files and align expectations with complete-only listing and GetFile rejection of deleted files.
integration/meta_storage/chunksize_test.go Updates chunk size tests to commit files before GetFile and align with new lifecycle rules.
client/internal/service/upload.go Calls CommitFile after chunk upload completes.
client/internal/service/service_test.go Updates unit tests to assert CommitFile is invoked and removes manual status mutation.
client/internal/metadataclient/mock.go Adds CommitFile to mock client with call tracking and state update.
client/internal/metadataclient/grpc.go Implements CommitFile RPC wrapper.
client/internal/metadataclient/client.go Extends metadata client interface with CommitFile.
client/cmd/dfs-cli/main.go Removes STATUS column from dfs-cli list output.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

return nil, h.leaderRedirect(ctx)
}

// TODO: Make the flow simpler - client will have no responsiblity of file id, server will generate it
Comment thread metadata/server/file_handler.go Outdated
return nil, status.Errorf(codes.NotFound, "file not found: %s", req.FileId)
}
if file.Status != fsm.FileStatusComplete {
return nil, status.Errorf(codes.FailedPrecondition, "file %s is not ready (status: %s)", req.FileId, file.Status)
Comment on lines +124 to +131
resolvedAddrs := make([]string, 0, len(ck.Replicas))
for _, rep := range ck.Replicas {
node, err := h.deps.FSM.GetNode(rep)
if err == nil && node.Address != "" {
resolvedAddrs = append(resolvedAddrs, node.Address)
} else {
resolvedAddrs = append(resolvedAddrs, rep)
}
Comment on lines +19 to +25
func countPendingRepairJobs(f *fsm.MetadataFSM) int {
jobs, err := f.GetJobsByStatus(fsm.RepairStatusPending)
if err != nil {
return -1
}
return len(jobs)
}
Comment on lines +153 to +163
// Upload to PRIMARY ONLY — explicitly skip replicas.
primaryClient := testutil.DialStorage(t, primary.Address)
testutil.PutChunkData(t, ctx, primaryClient, chunkID, fileID, payload, nil)
// The storage handler will still auto-replicate to the assigned replicas.
// We can't prevent that from here, so we skip this test if auto-replication
// is unavoidable.

t.Skip("TODO: need a way to suppress auto-replication to test deficit detection. " +
"The storage handler currently always replicates. This test is a placeholder " +
"for when single-node write capability is verified separately.")
}
@Satyam709 Satyam709 merged commit bc8ac69 into main May 4, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants