ClusterKit

A lightweight, production-ready distributed cluster coordination library for Go


ClusterKit handles cluster coordination so you can focus on building your application.

Features • Quick Start • Service Discovery • Event Hooks • Documentation • Examples


🤔 Why ClusterKit Exists

The Problem

Building a distributed system is hard. You need to solve:

  • "Where does this data go?" - Partition assignment across nodes
  • "Who's in charge?" - Leader election and consensus
  • "Is everyone alive?" - Health checking and failure detection
  • "What happens when nodes join/leave?" - Rebalancing and data migration
  • "How do I know when to move data?" - Event notifications

Most developers end up doing one of the following:

  1. ❌ Reinventing the wheel - Writing complex coordination logic from scratch
  2. ❌ Over-engineering - Using heavy frameworks that dictate your entire architecture
  3. ❌ Coupling tightly - Mixing coordination logic with business logic

The Solution

ClusterKit provides just the coordination layer - nothing more, nothing less.

┌───────────────────────────────────────────────────────────┐
│  Your Application (Storage, Replication, Business Logic)  │
├───────────────────────────────────────────────────────────┤
│  ClusterKit (Coordination, Partitioning, Consensus)       │
└───────────────────────────────────────────────────────────┘

You get:

  • ✅ Production-ready coordination (Raft consensus, health checking)
  • ✅ Simple API (7 methods + hooks)
  • ✅ Complete flexibility (bring your own storage/replication)
  • ✅ Zero lock-in (just a library, not a framework)

🎯 What is ClusterKit?

ClusterKit is a coordination library that manages the distributed aspects of your cluster while giving you complete control over data storage and replication. It handles:

  • ✅ Partition Management - Consistent hashing to determine which partition owns a key
  • ✅ Node Discovery - Automatic cluster membership and health monitoring
  • ✅ Service Discovery - Register and discover application services across nodes
  • ✅ Leader Election - Raft-based consensus for cluster decisions
  • ✅ Rebalancing - Automatic partition redistribution when nodes join/leave
  • ✅ Event Hooks - Rich notifications for partition changes and node lifecycle events
  • ✅ Failure Detection - Automatic health checking and node removal
  • ✅ Rejoin Handling - Smart detection and data sync for returning nodes

You control:

  • 🔧 Data storage (PostgreSQL, Redis, files, memory, etc.)
  • 🔧 Replication protocol (HTTP, gRPC, TCP, etc.)
  • 🔧 Consistency model (strong, eventual, causal, etc.)
  • 🔧 Business logic

✨ Key Features

Core Capabilities

  • Simple API - 7 core methods + rich event hooks
  • Minimal Configuration - Only 2 required fields (NodeID, HTTPAddr)
  • Service Discovery - Register multiple services per node (HTTP, gRPC, WebSocket, etc.)
  • Production-Ready - WAL, snapshots, crash recovery, metrics
  • Health Checking - Automatic failure detection and node removal
  • Smart Rejoin - Detects returning nodes and triggers data sync

Event System

  • Rich Context - Events include timestamps, reasons, offline duration, partition ownership
  • 7 Lifecycle Hooks - OnPartitionChange, OnNodeJoin, OnNodeRejoin, OnNodeLeave, OnRebalanceStart, OnRebalanceComplete, OnClusterHealthChange
  • Async Execution - Hooks run in background goroutines (max 50 concurrent; see the sketch below)
  • Panic Recovery - Hooks are isolated and won't crash your application
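
Conceptually, the dispatch model looks like this. This is an illustrative sketch only, not ClusterKit's actual internals: a buffered channel caps concurrency at 50, and recover() keeps a panicking hook from taking down the process.

package main

import (
    "log"
    "time"
)

type hookRunner struct {
    sem chan struct{} // buffered to 50: at most 50 hooks in flight
}

func newHookRunner() *hookRunner {
    return &hookRunner{sem: make(chan struct{}, 50)}
}

func (r *hookRunner) dispatch(hook func()) {
    r.sem <- struct{}{} // blocks if 50 hooks are already running
    go func() {
        defer func() {
            <-r.sem // release the slot
            if p := recover(); p != nil {
                log.Printf("hook panicked (isolated): %v", p) // app keeps running
            }
        }()
        hook()
    }()
}

func main() {
    r := newHookRunner()
    r.dispatch(func() { panic("boom") })           // isolated: logged, not fatal
    r.dispatch(func() { log.Println("hook ran") }) // unaffected by the panic
    time.Sleep(100 * time.Millisecond)             // let the hooks finish
}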

Distributed Coordination

  • Raft Consensus - Built on HashiCorp Raft for strong consistency
  • Consistent Hashing - MD5-based partition assignment
  • Configurable Replication - Set replication factor (default: 3)
  • HTTP API - RESTful endpoints for cluster information

πŸ—οΈ Architecture

ClusterKit uses a layered architecture combining Raft consensus with consistent hashing:

┌────────────────────────────────────────────────────────────┐
│                   Your Application Layer                   │
│          (KV Store, Cache, Queue, Custom Logic)            │
└────────────────────┬───────────────────────────────────────┘
                     │ API Calls
┌────────────────────▼───────────────────────────────────────┐
│                   ClusterKit Public API                    │
│  GetPartition() • IsPrimary() • GetReplicas() • Hooks      │
└────────────────────┬───────────────────────────────────────┘
                     │
┌────────────────────▼───────────────────────────────────────┐
│                     Coordination Layer                     │
│  ┌──────────────┬──────────────┬──────────────────────┐    │
│  │ Partition    │ Health       │ Hook                 │    │
│  │ Manager      │ Checker      │ Manager              │    │
│  │              │              │                      │    │
│  │ • Consistent │ • Heartbeats │ • Event dispatch     │    │
│  │   Hashing    │ • Failure    │ • Async execution    │    │
│  │ • Rebalance  │   detection  │ • 7 lifecycle hooks  │    │
│  └──────┬───────┴──────┬───────┴──────────┬───────────┘    │
│         │              │                  │                │
│  ┌──────▼──────────────▼──────────────────▼───────────┐    │
│  │           Raft Consensus Layer                     │    │
│  │  • Leader election                                 │    │
│  │  • Log replication                                 │    │
│  │  • State machine (cluster state)                   │    │
│  └────────────────────┬───────────────────────────────┘    │
└───────────────────────┼────────────────────────────────────┘
                        │
┌───────────────────────▼────────────────────────────────────┐
│                   Persistence Layer                        │
│  • WAL (Write-Ahead Log)                                   │
│  • Snapshots (cluster state)                               │
│  • JSON state files                                        │
└────────────────────────────────────────────────────────────┘

How It Works

  1. Partition Assignment - MD5 hash of key → partition ID (0-63); see the sketch below
  2. Node Selection - Consistent hashing assigns partitions to nodes
  3. Consensus - Raft ensures all nodes agree on cluster state
  4. Health Monitoring - Periodic checks detect failures
  5. Rebalancing - Automatic when topology changes
  6. Event Notification - Hooks fire for lifecycle events

See docs/architecture.md for detailed design
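
Step 1 is easy to picture in code. A minimal, illustrative sketch of MD5-based key-to-partition mapping, assuming 64 partitions (ClusterKit's internal hashing may differ in detail):

package main

import (
    "crypto/md5"
    "encoding/binary"
    "fmt"
)

// partitionFor maps a key to a stable partition ID in [0, partitionCount).
func partitionFor(key string, partitionCount uint64) uint64 {
    sum := md5.Sum([]byte(key))           // 16-byte digest
    n := binary.BigEndian.Uint64(sum[:8]) // first 8 bytes as an unsigned integer
    return n % partitionCount             // same key always yields the same partition
}

func main() {
    fmt.Println(partitionFor("user:123", 64)) // deterministic value in [0, 63]
}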


🎬 Quick Demo

See ClusterKit in action with a 3-node cluster:

# Clone the repository
git clone https://github.com/skshohagmiah/clusterkit
cd clusterkit/example/sync

# Start 3-node cluster
./run.sh

# Output shows:
# ✅ Node formation
# ✅ Leader election
# ✅ Partition distribution
# ✅ Data replication
# ✅ Automatic rebalancing

Example Output:

🚀 Starting node-1 (bootstrap) on ports 8080/9080
   [RAFT] Becoming leader
   [CLUSTER] Leader elected: node-1

🔗 Starting node-2 (joining) on ports 8081/9081
   [JOIN] node-2 joining via node-1
   [RAFT] Adding voter: node-2
   [REBALANCE] Starting rebalance (trigger: node_join)
   [PARTITION] partition-0: node-1 → node-2
   [PARTITION] partition-15: node-1 → node-2
   [REBALANCE] Complete (moved 21 partitions in 2.3s)

🔗 Starting node-3 (joining) on ports 8082/9082
   [JOIN] node-3 joining via node-1
   [REBALANCE] Starting rebalance (trigger: node_join)
   [PARTITION] partition-5: node-1 → node-3
   [REBALANCE] Complete (moved 14 partitions in 1.8s)

✅ Cluster ready: 3 nodes, 64 partitions, RF=3

📦 Installation

go get github.com/skshohagmiah/clusterkit

🚀 Quick Start

Bootstrap Node (First Node)

package main

import (
    "log"
    "time"
    "github.com/skshohagmiah/clusterkit"
)

func main() {
    // Create first node - only 2 fields required!
    ck, err := clusterkit.New(clusterkit.Options{
        NodeID:   "node-1",
        HTTPAddr: ":8080",
        // Optional: Register application services
        Services: map[string]string{
            "kv":   ":9080",     // Your KV store API
            "api":  ":3000",     // Your REST API
            "grpc": ":50051",    // Your gRPC service
        },
        // Optional: Enable health checking
        HealthCheck: clusterkit.HealthCheckConfig{
            Enabled:          true,
            Interval:         5 * time.Second,
            Timeout:          2 * time.Second,
            FailureThreshold: 3,
        },
    })
    if err != nil {
        log.Fatal(err)
    }
    
    if err := ck.Start(); err != nil {
        log.Fatal(err)
    }
    defer ck.Stop()
    
    log.Println("βœ… Bootstrap node started on :8080")
    
    select {} // Keep running
}

Additional Nodes (Join Cluster)

ck, err := clusterkit.New(clusterkit.Options{
    NodeID:   "node-2",
    HTTPAddr: ":8081",
    JoinAddr: "localhost:8080", // Bootstrap node address
    HealthCheck: clusterkit.HealthCheckConfig{
        Enabled:          true,
        Interval:         5 * time.Second,
        Timeout:          2 * time.Second,
        FailureThreshold: 3,
    },
})
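
Then start it exactly as on the bootstrap node:

if err := ck.Start(); err != nil {
    log.Fatal(err)
}
defer ck.Stop()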

📚 API Reference

Core Methods

// 1. Get the partition for a key
partition, err := ck.GetPartition(key) // returns (*Partition, error)

// 2. Get the primary node for a partition
primary := ck.GetPrimary(partition) // returns *Node

// 3. Get the replica nodes for a partition
replicas := ck.GetReplicas(partition) // returns []Node

// 4. Get all nodes (primary + replicas)
nodes := ck.GetNodes(partition) // returns []Node

// 5. Check if this node is the primary
isPrimary := ck.IsPrimary(partition) // returns bool

// 6. Check if this node is a replica
isReplica := ck.IsReplica(partition) // returns bool

// 7. Get this node's ID
myID := ck.GetMyNodeID() // returns string
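
Put together, a typical write path uses three of these methods. A sketch, where storeLocally and forwardTo are placeholders for your own storage and transport:

func handleSet(ck *clusterkit.ClusterKit, key, value string) error {
    partition, err := ck.GetPartition(key)
    if err != nil {
        return err
    }
    if ck.IsPrimary(partition) {
        return storeLocally(key, value) // this node owns the key
    }
    // Otherwise hand the write to the partition's primary node.
    return forwardTo(ck.GetPrimary(partition), key, value)
}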

Cluster Operations

// Get cluster information
cluster := ck.GetCluster() // returns *Cluster

// Trigger manual rebalancing
err := ck.RebalancePartitions()

// Get metrics
metrics := ck.GetMetrics() // returns *Metrics

// Health check
health := ck.HealthCheck() // returns *HealthStatus
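
These pair naturally with a periodic status loop, for example (a sketch using only the methods above; %+v simply dumps the structs):

go func() {
    for range time.Tick(30 * time.Second) {
        log.Printf("metrics: %+v", ck.GetMetrics())
        log.Printf("health:  %+v", ck.HealthCheck())
    }
}()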

πŸ” Service Discovery

ClusterKit includes built-in service discovery to help smart clients find your application services across the cluster.

Server Registration

// Register multiple services per node
ck, err := clusterkit.New(clusterkit.Options{
    NodeID:   "node-1",
    HTTPAddr: ":8080",  // ClusterKit coordination API
    Services: map[string]string{
        "kv":        ":9080",     // Key-Value store
        "api":       ":3000",     // REST API
        "grpc":      ":50051",    // gRPC service
        "websocket": ":8081",     // WebSocket server
        "metrics":   ":9090",     // Prometheus metrics
    },
})

Smart Client Discovery

// Get cluster topology with services
resp, err := http.Get("http://localhost:8080/cluster")
if err != nil {
    log.Fatal(err)
}
defer resp.Body.Close()

var cluster ClusterResponse // mirrors the /cluster JSON shown below
if err := json.NewDecoder(resp.Body).Decode(&cluster); err != nil {
    log.Fatal(err)
}

// Route requests to appropriate services
for _, node := range cluster.Cluster.Nodes {
    kvAddr := node.Services["kv"]      // ":9080"
    apiAddr := node.Services["api"]    // ":3000"
    grpcAddr := node.Services["grpc"]  // ":50051"
    
    // Route different request types to different services
    routeKVRequest(node.ID, "localhost"+kvAddr)
    routeAPIRequest(node.ID, "localhost"+apiAddr)
    routeGRPCRequest(node.ID, "localhost"+grpcAddr)
}

API Response Format

The /cluster endpoint returns service information for each node:

{
  "cluster": {
    "nodes": [
      {
        "id": "node-1",
        "ip": ":8080",
        "name": "Server-1",
        "status": "active",
        "services": {
          "kv": ":9080",
          "api": ":3000",
          "grpc": ":50051"
        }
      }
    ]
  }
}

Benefits

  • 🎯 No hardcoded ports - Services are explicitly registered and discoverable
  • 🔧 Multi-service nodes - Support HTTP, gRPC, WebSocket, etc. on the same node
  • 📡 Dynamic discovery - Clients automatically find services as nodes join/leave (sketched below)
  • ⚖️ Load balancing - Route different request types to different services
  • 🚀 Zero configuration - The Services field is optional and backward compatible
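
Service discovery composes with partitioning. A sketch that resolves the "kv" service address of whichever node owns a key (this assumes the Node type exposes the Services map shown in the /cluster response, and that nodes are reachable on localhost):

func kvAddrForKey(ck *clusterkit.ClusterKit, key string) (string, error) {
    partition, err := ck.GetPartition(key)
    if err != nil {
        return "", err
    }
    primary := ck.GetPrimary(partition)
    // Services holds listen addresses like ":9080"; prefix a host to dial.
    return "http://localhost" + primary.Services["kv"], nil
}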

🎣 Event Hooks System

ClusterKit provides a comprehensive event system with rich context for all cluster lifecycle events.

1. OnPartitionChange - Partition Assignment Changes

Triggered when: Partitions are reassigned due to rebalancing

ck.OnPartitionChange(func(event *clusterkit.PartitionChangeEvent) {
    // Only act if I'm the destination node
    if event.CopyToNode.ID != myNodeID {
        return
    }
    
    log.Printf("πŸ“¦ Partition %s moving (reason: %s)", 
        event.PartitionID, event.ChangeReason)
    log.Printf("   From: %d nodes", len(event.CopyFromNodes))
    log.Printf("   Primary changed: %s β†’ %s", event.OldPrimary, event.NewPrimary)
    
    // Fetch and merge data from all source nodes
    for _, source := range event.CopyFromNodes {
        data := fetchPartitionData(source, event.PartitionID)
        mergeData(data)
    }
})

Event Structure:

type PartitionChangeEvent struct {
    PartitionID   string    // e.g., "partition-5"
    CopyFromNodes []*Node   // Nodes that have the data
    CopyToNode    *Node     // Node that needs the data
    ChangeReason  string    // "node_join", "node_leave", "rebalance"
    OldPrimary    string    // Previous primary node ID
    NewPrimary    string    // New primary node ID
    Timestamp     time.Time // When the change occurred
}

Use Cases:

  • Migrate data when partitions move
  • Update local indexes
  • Trigger background sync jobs

2. OnNodeJoin - New Node Joins Cluster

Triggered when: A brand new node joins the cluster

ck.OnNodeJoin(func(event *clusterkit.NodeJoinEvent) {
    log.Printf("πŸŽ‰ Node %s joined (cluster size: %d)", 
        event.Node.ID, event.ClusterSize)
    
    if event.IsBootstrap {
        log.Println("   This is the bootstrap node - initializing cluster")
        initializeSchema()
    }
    
    // Update monitoring dashboards
    updateNodeCount(event.ClusterSize)
})

Event Structure:

type NodeJoinEvent struct {
    Node        *Node     // The joining node
    ClusterSize int       // Total nodes after join
    IsBootstrap bool      // Is this the first node?
    Timestamp   time.Time // When the node joined
}

Use Cases:

  • Initialize cluster-wide resources on bootstrap
  • Update monitoring/alerting systems
  • Trigger capacity planning checks

3. OnNodeRejoin - Node Returns After Being Offline

Triggered when: A node that was previously in the cluster rejoins

ck.OnNodeRejoin(func(event *clusterkit.NodeRejoinEvent) {
    if event.Node.ID == myNodeID {
        log.Printf("πŸ”„ I'm rejoining after %v offline", event.OfflineDuration)
        log.Printf("   Last seen: %v", event.LastSeenAt)
        log.Printf("   Had %d partitions before leaving", 
            len(event.PartitionsBeforeLeave))
        
        // Clear stale data
        clearAllLocalData()
        
        // Wait for OnPartitionChange to sync fresh data
        log.Println("   Ready for partition reassignment")
    } else {
        log.Printf("πŸ“‘ Node %s rejoined after %v", 
            event.Node.ID, event.OfflineDuration)
    }
})

Event Structure:

type NodeRejoinEvent struct {
    Node                  *Node         // The rejoining node
    OfflineDuration       time.Duration // How long it was offline
    LastSeenAt            time.Time     // When it was last seen
    PartitionsBeforeLeave []string      // Partitions it had before
    Timestamp             time.Time     // When it rejoined
}

Use Cases:

  • Clear stale local data before rebalancing
  • Decide sync strategy based on offline duration
  • Log rejoin events for debugging
  • Alert if offline duration was too long

Important: This hook fires BEFORE rebalancing. Use it to prepare (clear data), then let OnPartitionChange handle the actual data sync with correct partition assignments.


4. OnNodeLeave - Node Leaves or Fails

Triggered when: A node is removed from the cluster (failure or graceful shutdown)

ck.OnNodeLeave(func(event *clusterkit.NodeLeaveEvent) {
    log.Printf("❌ Node %s left (reason: %s)", 
        event.Node.ID, event.Reason)
    log.Printf("   Owned %d partitions (primary)", len(event.PartitionsOwned))
    log.Printf("   Replicated %d partitions", len(event.PartitionsReplica))
    
    // Clean up connections
    closeConnectionTo(event.Node.IP)
    
    // Alert if critical
    if event.Reason == "health_check_failure" && len(event.PartitionsOwned) > 10 {
        alertOps("High partition loss - node failed!")
    }
})

Event Structure:

type NodeLeaveEvent struct {
    Node              *Node     // Full node info
    Reason            string    // "health_check_failure", "graceful_shutdown", "removed_by_admin"
    PartitionsOwned   []string  // Partitions this node was primary for
    PartitionsReplica []string  // Partitions this node was replica for
    Timestamp         time.Time // When it left
}

Use Cases:

  • Clean up network connections
  • Alert operations team
  • Update capacity planning
  • Log failure events

5. OnRebalanceStart - Rebalancing Begins

Triggered when: Partition rebalancing operation starts

ck.OnRebalanceStart(func(event *clusterkit.RebalanceEvent) {
    log.Printf("βš–οΈ  Rebalance starting (trigger: %s)", event.Trigger)
    log.Printf("   Triggered by: %s", event.TriggerNodeID)
    log.Printf("   Partitions to move: %d", event.PartitionsToMove)
    log.Printf("   Nodes affected: %v", event.NodesAffected)
    
    // Pause background jobs during rebalance
    pauseBackgroundJobs()
    
    // Increase operation timeouts
    increaseTimeouts()
})

Event Structure:

type RebalanceEvent struct {
    Trigger          string    // "node_join", "node_leave", "manual"
    TriggerNodeID    string    // Which node caused it
    PartitionsToMove int       // How many partitions will move
    NodesAffected    []string  // Which nodes are affected
    Timestamp        time.Time // When rebalance started
}

6. OnRebalanceComplete - Rebalancing Finishes

Triggered when: Partition rebalancing operation completes

ck.OnRebalanceComplete(func(event *clusterkit.RebalanceEvent, duration time.Duration) {
    log.Printf("βœ… Rebalance completed in %v", duration)
    log.Printf("   Moved %d partitions", event.PartitionsToMove)
    
    // Resume background jobs
    resumeBackgroundJobs()
    
    // Reset timeouts
    resetTimeouts()
    
    // Update metrics
    recordRebalanceDuration(duration)
})

7. OnClusterHealthChange - Cluster Health Status Changes

Triggered when: Overall cluster health status changes

ck.OnClusterHealthChange(func(event *clusterkit.ClusterHealthEvent) {
    log.Printf("πŸ₯ Cluster health: %s", event.Status)
    log.Printf("   Healthy: %d/%d nodes", event.HealthyNodes, event.TotalNodes)
    
    if event.Status == "critical" {
        log.Printf("   Unhealthy nodes: %v", event.UnhealthyNodeIDs)
        alertOps("Cluster in critical state!")
        enableReadOnlyMode()
    } else if event.Status == "healthy" {
        log.Println("   All systems operational")
        disableReadOnlyMode()
    }
})

Event Structure:

type ClusterHealthEvent struct {
    HealthyNodes     int       // Number of healthy nodes
    UnhealthyNodes   int       // Number of unhealthy nodes
    TotalNodes       int       // Total nodes in cluster
    Status           string    // "healthy", "degraded", "critical"
    UnhealthyNodeIDs []string  // IDs of unhealthy nodes
    Timestamp        time.Time // When health changed
}

🔄 Understanding Cluster Lifecycle

Scenario 1: Node Join

1. New node starts and sends join request
   ↓
2. OnNodeJoin fires
   - Event includes: node info, cluster size, bootstrap flag
   ↓
3. OnRebalanceStart fires
   - Event includes: trigger reason, partitions to move
   ↓
4. Partitions are reassigned
   ↓
5. OnPartitionChange fires (multiple times, once per partition)
   - Event includes: partition ID, source nodes, destination node, reason
   ↓
6. OnRebalanceComplete fires
   - Event includes: duration, partitions moved

Your Application:

  • In OnNodeJoin: Log event, update monitoring
  • In OnPartitionChange: Migrate data for assigned partitions
  • In OnRebalanceComplete: Resume normal operations
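
In code, that flow maps onto four hook registrations (a sketch; pauseBackgroundJobs, migratePartition, and resumeBackgroundJobs are placeholders for your own logic):

ck.OnNodeJoin(func(e *clusterkit.NodeJoinEvent) {
    log.Printf("node %s joined (cluster size: %d)", e.Node.ID, e.ClusterSize)
})
ck.OnRebalanceStart(func(e *clusterkit.RebalanceEvent) {
    pauseBackgroundJobs()
})
ck.OnPartitionChange(func(e *clusterkit.PartitionChangeEvent) {
    migratePartition(e) // act only if e.CopyToNode is this node
})
ck.OnRebalanceComplete(func(e *clusterkit.RebalanceEvent, d time.Duration) {
    resumeBackgroundJobs()
})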

Scenario 2: Node Failure & Removal

1. Health checker detects node failure (3 consecutive failures)
   ↓
2. OnNodeLeave fires
   - Event includes: node info, reason="health_check_failure", partitions owned
   ↓
3. OnRebalanceStart fires
   ↓
4. Partitions are reassigned to remaining nodes
   ↓
5. OnPartitionChange fires (for each reassigned partition)
   ↓
6. OnRebalanceComplete fires

Your Application:

  • In OnNodeLeave: Clean up connections, alert ops team
  • In OnPartitionChange: Take ownership of reassigned partitions
  • Data already exists on replicas, so migration is fast!

Scenario 3: Node Rejoin (After Failure)

1. Failed node restarts and rejoins
   ↓
2. OnNodeRejoin fires
   - Event includes: node info, offline duration, partitions before leave
   ↓
3. Your app clears stale local data
   ↓
4. OnRebalanceStart fires
   ↓
5. Partitions are reassigned (may be different than before!)
   ↓
6. OnPartitionChange fires (for each assigned partition)
   - Your app fetches fresh data from replicas
   ↓
7. OnRebalanceComplete fires

Your Application:

  • In OnNodeRejoin: Clear ALL stale data (important!)
  • In OnPartitionChange: Fetch fresh data for NEW partition assignments
  • Don't assume you'll get the same partitions you had before!

Why clear data?

  • Node was offline - data is stale
  • Partition assignments may have changed
  • Other nodes have the latest data
  • Clean slate ensures consistency

💡 Complete Example: Building a Distributed KV Store

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "sync"
    "time"
    
    "github.com/skshohagmiah/clusterkit"
)

type KVStore struct {
    ck          *clusterkit.ClusterKit
    data        map[string]string
    mu          sync.RWMutex
    nodeID      string
    isRejoining bool
    rejoinMu    sync.Mutex
}

func NewKVStore(ck *clusterkit.ClusterKit, nodeID string) *KVStore {
    kv := &KVStore{
        ck:     ck,
        data:   make(map[string]string),
        nodeID: nodeID,
    }
    
    // Register all hooks
    ck.OnPartitionChange(func(event *clusterkit.PartitionChangeEvent) {
        kv.handlePartitionChange(event)
    })
    
    ck.OnNodeRejoin(func(event *clusterkit.NodeRejoinEvent) {
        if event.Node.ID == kv.nodeID {
            kv.handleRejoin(event)
        }
    })
    
    ck.OnNodeLeave(func(event *clusterkit.NodeLeaveEvent) {
        log.Printf("[KV] Node %s left (reason: %s, partitions: %d)", 
            event.Node.ID, event.Reason, 
            len(event.PartitionsOwned)+len(event.PartitionsReplica))
    })
    
    return kv
}

func (kv *KVStore) handleRejoin(event *clusterkit.NodeRejoinEvent) {
    kv.rejoinMu.Lock()
    defer kv.rejoinMu.Unlock()
    
    if kv.isRejoining {
        return // Already rejoining
    }
    
    kv.isRejoining = true
    
    log.Printf("[KV] πŸ”„ Rejoining after %v offline", event.OfflineDuration)
    log.Printf("[KV] πŸ—‘οΈ  Clearing stale data")
    
    // Clear ALL stale data
    kv.mu.Lock()
    kv.data = make(map[string]string)
    kv.mu.Unlock()
    
    log.Printf("[KV] βœ… Ready for partition reassignment")
}

func (kv *KVStore) handlePartitionChange(event *clusterkit.PartitionChangeEvent) {
    if event.CopyToNode.ID != kv.nodeID {
        return
    }
    
    // Check if rejoining
    kv.rejoinMu.Lock()
    isRejoining := kv.isRejoining
    kv.rejoinMu.Unlock()
    
    if len(event.CopyFromNodes) == 0 {
        log.Printf("[KV] New partition %s assigned", event.PartitionID)
        return
    }
    
    log.Printf("[KV] πŸ”„ Migrating partition %s (reason: %s)", 
        event.PartitionID, event.ChangeReason)
    
    // Fetch data from source nodes
    for _, source := range event.CopyFromNodes {
        data := kv.fetchPartitionData(source, event.PartitionID)
        
        kv.mu.Lock()
        for key, value := range data {
            kv.data[key] = value
        }
        kv.mu.Unlock()
        
        log.Printf("[KV] βœ… Migrated %d keys from %s", len(data), source.ID)
        break // Successfully migrated
    }
    
    // Clear rejoin flag after first partition
    if isRejoining {
        kv.rejoinMu.Lock()
        kv.isRejoining = false
        kv.rejoinMu.Unlock()
        log.Printf("[KV] βœ… Rejoin complete")
    }
}

func (kv *KVStore) fetchPartitionData(node *clusterkit.Node, partitionID string) map[string]string {
    url := fmt.Sprintf("http://%s/migrate?partition=%s", node.IP, partitionID)
    resp, err := http.Get(url)
    if err != nil {
        return nil
    }
    defer resp.Body.Close()
    
    var result map[string]string
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        return nil
    }
    return result
}

func (kv *KVStore) Set(key, value string) error {
    partition, err := kv.ck.GetPartition(key)
    if err != nil {
        return err
    }
    
    if kv.ck.IsPrimary(partition) || kv.ck.IsReplica(partition) {
        kv.mu.Lock()
        kv.data[key] = value
        kv.mu.Unlock()
        return nil
    }
    
    // Forward to the primary (forwardToPrimary is your own transport helper, not shown)
    primary := kv.ck.GetPrimary(partition)
    return kv.forwardToPrimary(primary, key, value)
}

func (kv *KVStore) Get(key string) (string, error) {
    partition, err := kv.ck.GetPartition(key)
    if err != nil {
        return "", err
    }
    
    if kv.ck.IsPrimary(partition) || kv.ck.IsReplica(partition) {
        kv.mu.RLock()
        defer kv.mu.RUnlock()
        
        value, exists := kv.data[key]
        if !exists {
            return "", fmt.Errorf("key not found")
        }
        return value, nil
    }
    
    // Forward to the primary (readFromPrimary is your own transport helper, not shown)
    primary := kv.ck.GetPrimary(partition)
    return kv.readFromPrimary(primary, key)
}

func main() {
    ck, err := clusterkit.New(clusterkit.Options{
        NodeID:   "node-1",
        HTTPAddr: ":8080",
        HealthCheck: clusterkit.HealthCheckConfig{
            Enabled:          true,
            Interval:         5 * time.Second,
            Timeout:          2 * time.Second,
            FailureThreshold: 3,
        },
    })
    if err != nil {
        log.Fatal(err)
    }
    if err := ck.Start(); err != nil {
        log.Fatal(err)
    }
    defer ck.Stop()
    
    kv := NewKVStore(ck, "node-1")
    
    // Use the KV store
    kv.Set("user:123", "John Doe")
    value, _ := kv.Get("user:123")
    fmt.Println("Value:", value)
    
    select {}
}

πŸ—οΈ Configuration Options

Minimal (2 required fields)

ck, _ := clusterkit.New(clusterkit.Options{
    NodeID:   "node-1",  // Required
    HTTPAddr: ":8080",   // Required
})

Production Configuration

ck, _ := clusterkit.New(clusterkit.Options{
    // Required
    NodeID:   "node-1",
    HTTPAddr: ":8080",
    
    // Cluster Formation
    JoinAddr:  "node-1:8080", // For non-bootstrap nodes
    Bootstrap: false,          // Auto-detected
    
    // Partitioning
    PartitionCount:    64,  // More partitions = better distribution
    ReplicationFactor: 3,   // Survive 2 node failures
    
    // Storage
    DataDir: "/var/lib/clusterkit",
    
    // Health Checking
    HealthCheck: clusterkit.HealthCheckConfig{
        Enabled:          true,
        Interval:         5 * time.Second,  // Check every 5s
        Timeout:          2 * time.Second,  // Request timeout
        FailureThreshold: 3,                // Remove after 3 failures
    },
})

📊 HTTP API

ClusterKit exposes RESTful endpoints:

# Get cluster state (includes service discovery)
curl http://localhost:8080/cluster

# Get metrics
curl http://localhost:8080/metrics

# Get detailed health
curl http://localhost:8080/health/detailed

# Check if ready
curl http://localhost:8080/ready

The /cluster endpoint returns comprehensive cluster information including:

  • Node membership and status
  • Service discovery - All registered services per node
  • Partition assignments and replica locations
  • Cluster configuration and hash settings
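
For deployment automation, the /ready endpoint shown above can gate traffic before a node joins the rotation, for example (sketch):

resp, err := http.Get("http://localhost:8080/ready")
if err != nil {
    log.Fatal("node unreachable: ", err)
}
defer resp.Body.Close()
if resp.StatusCode == http.StatusOK {
    log.Println("node is ready to serve traffic")
}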

🧪 Running Examples

ClusterKit includes 3 complete examples:

# SYNC - Strong consistency (quorum-based)
cd example/sync && ./run.sh

# ASYNC - Maximum throughput (eventual consistency)
cd example/async && ./run.sh

# Server-Side - Simple HTTP clients
cd example/server-side && ./run.sh

Each example demonstrates:

  • ✅ Cluster formation (10 nodes)
  • ✅ Data distribution (1000 keys)
  • ✅ Automatic rebalancing
  • ✅ Health checking and failure recovery
  • ✅ Node rejoin handling

🐳 Docker Deployment

docker-compose.yml

version: '3.8'

services:
  node-1:
    image: your-registry/clusterkit:latest
    environment:
      - NODE_ID=node-1
      - HTTP_PORT=8080
      - DATA_DIR=/data
    ports:
      - "8080:8080"
    volumes:
      - node1-data:/data

  node-2:
    image: your-registry/clusterkit:latest
    environment:
      - NODE_ID=node-2
      - HTTP_PORT=8080
      - JOIN_ADDR=node-1:8080
      - DATA_DIR=/data
    ports:
      - "8081:8080"
    volumes:
      - node2-data:/data
    depends_on:
      - node-1

volumes:
  node1-data:
  node2-data:

📚 Documentation

Comprehensive guides in the docs/ directory:

Core Concepts

  • Architecture - Detailed system design, Raft + consistent hashing
  • Partitioning - How data is distributed across nodes
  • Replication - Replication strategies and consistency models
  • Rebalancing - How partitions move when topology changes

Advanced Topics

Examples


🤝 Contributing

Contributions welcome! Please read CONTRIBUTING.md first.


📄 License

MIT License - see LICENSE for details.


🌟 Why ClusterKit?

Simple - 7 methods + hooks, minimal config
Flexible - Bring your own storage and replication
Production-Ready - Raft consensus, health checking, metrics
Well-Documented - Comprehensive guides and examples
Battle-Tested - Used in production distributed systems

Start building your distributed system today! 🚀
