A lightweight, production-ready distributed cluster coordination library for Go
ClusterKit handles cluster coordination so you can focus on building your application.
Features • Quick Start • Service Discovery • Event Hooks • Documentation • Examples
Building a distributed system is hard. You need to solve:
- "Where does this data go?" - Partition assignment across nodes
- "Who's in charge?" - Leader election and consensus
- "Is everyone alive?" - Health checking and failure detection
- "What happens when nodes join/leave?" - Rebalancing and data migration
- "How do I know when to move data?" - Event notifications
Most developers end up either:
- ❌ Reinventing the wheel - Writing complex coordination logic from scratch
- ❌ Over-engineering - Using heavy frameworks that dictate your entire architecture
- ❌ Coupling tightly - Mixing coordination logic with business logic
ClusterKit provides just the coordination layer - nothing more, nothing less.
┌──────────────────────────────────────────────────────────┐
│ Your Application (Storage, Replication, Business Logic) │
├──────────────────────────────────────────────────────────┤
│ ClusterKit (Coordination, Partitioning, Consensus)      │
└──────────────────────────────────────────────────────────┘
You get:
- ✅ Production-ready coordination (Raft consensus, health checking)
- ✅ Simple API (7 methods + hooks)
- ✅ Complete flexibility (bring your own storage/replication)
- ✅ Zero lock-in (just a library, not a framework)
ClusterKit is a coordination library that manages the distributed aspects of your cluster while giving you complete control over data storage and replication. It handles:
- ✅ Partition Management - Consistent hashing to determine which partition owns a key
- ✅ Node Discovery - Automatic cluster membership and health monitoring
- ✅ Service Discovery - Register and discover application services across nodes
- ✅ Leader Election - Raft-based consensus for cluster decisions
- ✅ Rebalancing - Automatic partition redistribution when nodes join/leave
- ✅ Event Hooks - Rich notifications for partition changes, node lifecycle events
- ✅ Failure Detection - Automatic health checking and node removal
- ✅ Rejoin Handling - Smart detection and data sync for returning nodes
You control:
- 🔧 Data storage (PostgreSQL, Redis, files, memory, etc.)
- 🔧 Replication protocol (HTTP, gRPC, TCP, etc.)
- 🔧 Consistency model (strong, eventual, causal, etc.)
- 🔧 Business logic
- Simple API - 7 core methods + rich event hooks
- Minimal Configuration - Only 2 required fields (NodeID, HTTPAddr)
- Service Discovery - Register multiple services per node (HTTP, gRPC, WebSocket, etc.)
- Production-Ready - WAL, snapshots, crash recovery, metrics
- Health Checking - Automatic failure detection and node removal
- Smart Rejoin - Detects returning nodes and triggers data sync
- Rich Context - Events include timestamps, reasons, offline duration, partition ownership
- 7 Lifecycle Hooks - OnPartitionChange, OnNodeJoin, OnNodeRejoin, OnNodeLeave, OnRebalanceStart, OnRebalanceComplete, OnClusterHealthChange
- Async Execution - Hooks run in background goroutines (max 50 concurrent)
- Panic Recovery - Hooks are isolated and won't crash your application
- Raft Consensus - Built on HashiCorp Raft for strong consistency
- Consistent Hashing - MD5-based partition assignment
- Configurable Replication - Set replication factor (default: 3)
- HTTP API - RESTful endpoints for cluster information
ClusterKit uses a layered architecture combining Raft consensus with consistent hashing:
┌──────────────────────────────────────────────────────────────┐
│                    Your Application Layer                    │
│            (KV Store, Cache, Queue, Custom Logic)            │
└─────────────────────┬────────────────────────────────────────┘
                      │ API Calls
┌─────────────────────▼────────────────────────────────────────┐
│                   ClusterKit Public API                      │
│    GetPartition() • IsPrimary() • GetReplicas() • Hooks      │
└─────────────────────┬────────────────────────────────────────┘
                      │
┌─────────────────────▼────────────────────────────────────────┐
│                     Coordination Layer                       │
│   ┌──────────────┬──────────────┬───────────────────────┐    │
│   │  Partition   │   Health     │        Hook           │    │
│   │  Manager     │   Checker    │       Manager         │    │
│   │              │              │                       │    │
│   │ • Consistent │ • Heartbeats │ • Event dispatch      │    │
│   │   Hashing    │ • Failure    │ • Async execution     │    │
│   │ • Rebalance  │   detection  │ • 7 lifecycle hooks   │    │
│   └──────┬───────┴──────┬───────┴──────────┬────────────┘    │
│          │              │                  │                 │
│   ┌──────▼──────────────▼──────────────────▼────────────┐    │
│   │              Raft Consensus Layer                    │    │
│   │  • Leader election                                   │    │
│   │  • Log replication                                   │    │
│   │  • State machine (cluster state)                     │    │
│   └──────────────────────┬────────────────────────────────┘  │
└──────────────────────────┼────────────────────────────────────┘
                           │
┌──────────────────────────▼────────────────────────────────────┐
│                     Persistence Layer                         │
│  • WAL (Write-Ahead Log)                                      │
│  • Snapshots (cluster state)                                  │
│  • JSON state files                                           │
└────────────────────────────────────────────────────────────────┘
- Partition Assignment - MD5 hash of key → partition ID (0-63); see the sketch after this list
- Node Selection - Consistent hashing assigns partitions to nodes
- Consensus - Raft ensures all nodes agree on cluster state
- Health Monitoring - Periodic checks detect failures
- Rebalancing - Automatic when topology changes
- Event Notification - Hooks fire for lifecycle events
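To make the first two steps concrete, here is a minimal, illustrative sketch of mapping a key to one of 64 partitions with an MD5 hash. The helper name and the exact byte-to-ID mapping are assumptions for illustration, not ClusterKit's internal implementation:

package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

// partitionFor is an illustrative helper (not a ClusterKit API): it hashes the
// key with MD5 and folds the first 8 bytes onto one of partitionCount buckets.
func partitionFor(key string, partitionCount int) int {
	sum := md5.Sum([]byte(key))
	return int(binary.BigEndian.Uint64(sum[:8]) % uint64(partitionCount))
}

func main() {
	// With the default 64 partitions, every node computes the same partition
	// ID for the same key, so routing decisions agree cluster-wide.
	fmt.Printf("user:123 -> partition-%d\n", partitionFor("user:123", 64))
}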
See docs/architecture.md for detailed design
See ClusterKit in action with a 3-node cluster:
# Clone the repository
git clone https://github.com/skshohagmiah/clusterkit
cd clusterkit/example/sync
# Start 3-node cluster
./run.sh
# Output shows:
# ✅ Node formation
# ✅ Leader election
# ✅ Partition distribution
# ✅ Data replication
# ✅ Automatic rebalancing

Example Output:
🚀 Starting node-1 (bootstrap) on ports 8080/9080
[RAFT] Becoming leader
[CLUSTER] Leader elected: node-1
🚀 Starting node-2 (joining) on ports 8081/9081
[JOIN] node-2 joining via node-1
[RAFT] Adding voter: node-2
[REBALANCE] Starting rebalance (trigger: node_join)
[PARTITION] partition-0: node-1 → node-2
[PARTITION] partition-15: node-1 → node-2
[REBALANCE] Complete (moved 21 partitions in 2.3s)
🚀 Starting node-3 (joining) on ports 8082/9082
[JOIN] node-3 joining via node-1
[REBALANCE] Starting rebalance (trigger: node_join)
[PARTITION] partition-5: node-1 → node-3
[REBALANCE] Complete (moved 14 partitions in 1.8s)
✅ Cluster ready: 3 nodes, 64 partitions, RF=3
go get github.com/skshohagmiah/clusterkit

package main
import (
"log"
"time"
"github.com/skshohagmiah/clusterkit"
)
func main() {
// Create first node - only 2 fields required!
ck, err := clusterkit.New(clusterkit.Options{
NodeID: "node-1",
HTTPAddr: ":8080",
// Optional: Register application services
Services: map[string]string{
"kv": ":9080", // Your KV store API
"api": ":3000", // Your REST API
"grpc": ":50051", // Your gRPC service
},
// Optional: Enable health checking
HealthCheck: clusterkit.HealthCheckConfig{
Enabled: true,
Interval: 5 * time.Second,
Timeout: 2 * time.Second,
FailureThreshold: 3,
},
})
if err != nil {
log.Fatal(err)
}
if err := ck.Start(); err != nil {
log.Fatal(err)
}
defer ck.Stop()
log.Println("β
Bootstrap node started on :8080")
select {} // Keep running
}

ck, err := clusterkit.New(clusterkit.Options{
NodeID: "node-2",
HTTPAddr: ":8081",
JoinAddr: "localhost:8080", // Bootstrap node address
HealthCheck: clusterkit.HealthCheckConfig{
Enabled: true,
Interval: 5 * time.Second,
Timeout: 2 * time.Second,
FailureThreshold: 3,
},
})

// 1. Get partition for a key
partition, err := ck.GetPartition(key string) (*Partition, error)
// 2. Get primary node for partition
primary := ck.GetPrimary(partition *Partition) *Node
// 3. Get replica nodes for partition
replicas := ck.GetReplicas(partition *Partition) []Node
// 4. Get all nodes (primary + replicas)
nodes := ck.GetNodes(partition *Partition) []Node
// 5. Check if I'm the primary
isPrimary := ck.IsPrimary(partition *Partition) bool
// 6. Check if I'm a replica
isReplica := ck.IsReplica(partition *Partition) bool
// 7. Get my node ID
myID := ck.GetMyNodeID() string
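Putting the core methods together, here is a hedged sketch of how an application might route a write. handleLocalWrite and forwardWrite are hypothetical application-side functions, not part of ClusterKit:

// Illustrative write routing built on the core API above. handleLocalWrite
// and forwardWrite are hypothetical application functions.
func routeWrite(ck *clusterkit.ClusterKit, key, value string) error {
	partition, err := ck.GetPartition(key)
	if err != nil {
		return err
	}
	if ck.IsPrimary(partition) {
		// This node owns the key: write locally, then replicate to
		// ck.GetReplicas(partition) using your own protocol.
		return handleLocalWrite(key, value)
	}
	// Otherwise forward the write to the partition's primary node.
	return forwardWrite(ck.GetPrimary(partition), key, value)
}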
// Get cluster information
cluster := ck.GetCluster() *Cluster
// Trigger manual rebalancing
err := ck.RebalancePartitions() error
// Get metrics
metrics := ck.GetMetrics() *Metrics
// Health check
health := ck.HealthCheck() *HealthStatus

ClusterKit includes built-in service discovery to help smart clients find your application services across the cluster.
// Register multiple services per node
ck, err := clusterkit.New(clusterkit.Options{
NodeID: "node-1",
HTTPAddr: ":8080", // ClusterKit coordination API
Services: map[string]string{
"kv": ":9080", // Key-Value store
"api": ":3000", // REST API
"grpc": ":50051", // gRPC service
"websocket": ":8081", // WebSocket server
"metrics": ":9090", // Prometheus metrics
},
})

// Get cluster topology with services
resp, err := http.Get("http://localhost:8080/cluster")
var cluster ClusterResponse
json.NewDecoder(resp.Body).Decode(&cluster)
// Route requests to appropriate services
for _, node := range cluster.Cluster.Nodes {
kvAddr := node.Services["kv"] // ":9080"
apiAddr := node.Services["api"] // ":3000"
grpcAddr := node.Services["grpc"] // ":50051"
// Route different request types to different services
routeKVRequest(node.ID, "localhost"+kvAddr)
routeAPIRequest(node.ID, "localhost"+apiAddr)
routeGRPCRequest(node.ID, "localhost"+grpcAddr)
}

The /cluster endpoint returns service information for each node:
{
"cluster": {
"nodes": [
{
"id": "node-1",
"ip": ":8080",
"name": "Server-1",
"status": "active",
"services": {
"kv": ":9080",
"api": ":3000",
"grpc": ":50051"
}
}
]
}
}

- 🎯 No hardcoded ports - Services are explicitly registered and discoverable
- 🔧 Multi-service nodes - Support HTTP, gRPC, WebSocket, etc. on the same node
- 📡 Dynamic discovery - Clients automatically find services as nodes join/leave
- ⚖️ Load balancing - Route different request types to different services
- 🔌 Zero configuration - Services field is optional and backward compatible
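The smart-client snippet above decodes the /cluster response into a ClusterResponse value. A minimal sketch of what those client-side types could look like, with field names assumed from the JSON shape rather than taken from the ClusterKit source:

// Assumed client-side types mirroring the /cluster JSON above.
type NodeInfo struct {
	ID       string            `json:"id"`
	IP       string            `json:"ip"`
	Name     string            `json:"name"`
	Status   string            `json:"status"`
	Services map[string]string `json:"services"`
}

type ClusterResponse struct {
	Cluster struct {
		Nodes []NodeInfo `json:"nodes"`
	} `json:"cluster"`
}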
ClusterKit provides a comprehensive event system with rich context for all cluster lifecycle events.
Triggered when: Partitions are reassigned due to rebalancing
ck.OnPartitionChange(func(event *clusterkit.PartitionChangeEvent) {
// Only act if I'm the destination node
if event.CopyToNode.ID != myNodeID {
return
}
log.Printf("π¦ Partition %s moving (reason: %s)",
event.PartitionID, event.ChangeReason)
log.Printf(" From: %d nodes", len(event.CopyFromNodes))
log.Printf(" Primary changed: %s β %s", event.OldPrimary, event.NewPrimary)
// Fetch and merge data from all source nodes
for _, source := range event.CopyFromNodes {
data := fetchPartitionData(source, event.PartitionID)
mergeData(data)
}
})

Event Structure:
type PartitionChangeEvent struct {
PartitionID string // e.g., "partition-5"
CopyFromNodes []*Node // Nodes that have the data
CopyToNode *Node // Node that needs the data
ChangeReason string // "node_join", "node_leave", "rebalance"
OldPrimary string // Previous primary node ID
NewPrimary string // New primary node ID
Timestamp time.Time // When the change occurred
}

Use Cases:
- Migrate data when partitions move
- Update local indexes
- Trigger background sync jobs
Triggered when: A brand new node joins the cluster
ck.OnNodeJoin(func(event *clusterkit.NodeJoinEvent) {
log.Printf("π Node %s joined (cluster size: %d)",
event.Node.ID, event.ClusterSize)
if event.IsBootstrap {
log.Println(" This is the bootstrap node - initializing cluster")
initializeSchema()
}
// Update monitoring dashboards
updateNodeCount(event.ClusterSize)
})

Event Structure:
type NodeJoinEvent struct {
Node *Node // The joining node
ClusterSize int // Total nodes after join
IsBootstrap bool // Is this the first node?
Timestamp time.Time // When the node joined
}

Use Cases:
- Initialize cluster-wide resources on bootstrap
- Update monitoring/alerting systems
- Trigger capacity planning checks
Triggered when: A node that was previously in the cluster rejoins
ck.OnNodeRejoin(func(event *clusterkit.NodeRejoinEvent) {
if event.Node.ID == myNodeID {
log.Printf("π I'm rejoining after %v offline", event.OfflineDuration)
log.Printf(" Last seen: %v", event.LastSeenAt)
log.Printf(" Had %d partitions before leaving",
len(event.PartitionsBeforeLeave))
// Clear stale data
clearAllLocalData()
// Wait for OnPartitionChange to sync fresh data
log.Println(" Ready for partition reassignment")
} else {
log.Printf("π‘ Node %s rejoined after %v",
event.Node.ID, event.OfflineDuration)
}
})

Event Structure:
type NodeRejoinEvent struct {
Node *Node // The rejoining node
OfflineDuration time.Duration // How long it was offline
LastSeenAt time.Time // When it was last seen
PartitionsBeforeLeave []string // Partitions it had before
Timestamp time.Time // When it rejoined
}

Use Cases:
- Clear stale local data before rebalancing
- Decide sync strategy based on offline duration
- Log rejoin events for debugging
- Alert if offline duration was too long
Important: This hook fires BEFORE rebalancing. Use it to prepare (clear data), then let OnPartitionChange handle the actual data sync with correct partition assignments.
Triggered when: A node is removed from the cluster (failure or graceful shutdown)
ck.OnNodeLeave(func(event *clusterkit.NodeLeaveEvent) {
log.Printf("β Node %s left (reason: %s)",
event.Node.ID, event.Reason)
log.Printf(" Owned %d partitions (primary)", len(event.PartitionsOwned))
log.Printf(" Replicated %d partitions", len(event.PartitionsReplica))
// Clean up connections
closeConnectionTo(event.Node.IP)
// Alert if critical
if event.Reason == "health_check_failure" && len(event.PartitionsOwned) > 10 {
alertOps("High partition loss - node failed!")
}
})

Event Structure:
type NodeLeaveEvent struct {
Node *Node // Full node info
Reason string // "health_check_failure", "graceful_shutdown", "removed_by_admin"
PartitionsOwned []string // Partitions this node was primary for
PartitionsReplica []string // Partitions this node was replica for
Timestamp time.Time // When it left
}

Use Cases:
- Clean up network connections
- Alert operations team
- Update capacity planning
- Log failure events
Triggered when: Partition rebalancing operation starts
ck.OnRebalanceStart(func(event *clusterkit.RebalanceEvent) {
log.Printf("βοΈ Rebalance starting (trigger: %s)", event.Trigger)
log.Printf(" Triggered by: %s", event.TriggerNodeID)
log.Printf(" Partitions to move: %d", event.PartitionsToMove)
log.Printf(" Nodes affected: %v", event.NodesAffected)
// Pause background jobs during rebalance
pauseBackgroundJobs()
// Increase operation timeouts
increaseTimeouts()
})

Event Structure:
type RebalanceEvent struct {
Trigger string // "node_join", "node_leave", "manual"
TriggerNodeID string // Which node caused it
PartitionsToMove int // How many partitions will move
NodesAffected []string // Which nodes are affected
Timestamp time.Time // When rebalance started
}

Triggered when: Partition rebalancing operation completes
ck.OnRebalanceComplete(func(event *clusterkit.RebalanceEvent, duration time.Duration) {
log.Printf("β
Rebalance completed in %v", duration)
log.Printf(" Moved %d partitions", event.PartitionsToMove)
// Resume background jobs
resumeBackgroundJobs()
// Reset timeouts
resetTimeouts()
// Update metrics
recordRebalanceDuration(duration)
})

Triggered when: Overall cluster health status changes
ck.OnClusterHealthChange(func(event *clusterkit.ClusterHealthEvent) {
log.Printf("π₯ Cluster health: %s", event.Status)
log.Printf(" Healthy: %d/%d nodes", event.HealthyNodes, event.TotalNodes)
if event.Status == "critical" {
log.Printf(" Unhealthy nodes: %v", event.UnhealthyNodeIDs)
alertOps("Cluster in critical state!")
enableReadOnlyMode()
} else if event.Status == "healthy" {
log.Println(" All systems operational")
disableReadOnlyMode()
}
})

Event Structure:
type ClusterHealthEvent struct {
HealthyNodes int // Number of healthy nodes
UnhealthyNodes int // Number of unhealthy nodes
TotalNodes int // Total nodes in cluster
Status string // "healthy", "degraded", "critical"
UnhealthyNodeIDs []string // IDs of unhealthy nodes
Timestamp time.Time // When health changed
}

1. New node starts and sends join request
   ↓
2. OnNodeJoin fires
   - Event includes: node info, cluster size, bootstrap flag
   ↓
3. OnRebalanceStart fires
   - Event includes: trigger reason, partitions to move
   ↓
4. Partitions are reassigned
   ↓
5. OnPartitionChange fires (multiple times, once per partition)
   - Event includes: partition ID, source nodes, destination node, reason
   ↓
6. OnRebalanceComplete fires
   - Event includes: duration, partitions moved
Your Application:
- In OnNodeJoin: Log event, update monitoring
- In OnPartitionChange: Migrate data for assigned partitions
- In OnRebalanceComplete: Resume normal operations (see the wiring sketch below)
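A hedged sketch of wiring these three hooks together for the join flow, using the hook signatures documented above; migratePartition is a hypothetical application function:

// Illustrative hook wiring for the node-join flow. migratePartition is a
// hypothetical application-side function, not a ClusterKit API.
func registerJoinFlowHooks(ck *clusterkit.ClusterKit, myNodeID string) {
	ck.OnNodeJoin(func(event *clusterkit.NodeJoinEvent) {
		log.Printf("node %s joined, cluster size now %d", event.Node.ID, event.ClusterSize)
	})
	ck.OnPartitionChange(func(event *clusterkit.PartitionChangeEvent) {
		if event.CopyToNode.ID == myNodeID {
			// Pull data for partitions newly assigned to this node.
			migratePartition(event.PartitionID, event.CopyFromNodes)
		}
	})
	ck.OnRebalanceComplete(func(event *clusterkit.RebalanceEvent, duration time.Duration) {
		log.Printf("rebalance done in %v, resuming normal operations", duration)
	})
}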
1. Health checker detects node failure (3 consecutive failures)
   ↓
2. OnNodeLeave fires
   - Event includes: node info, reason="health_check_failure", partitions owned
   ↓
3. OnRebalanceStart fires
   ↓
4. Partitions are reassigned to remaining nodes
   ↓
5. OnPartitionChange fires (for each reassigned partition)
   ↓
6. OnRebalanceComplete fires
Your Application:
- In OnNodeLeave: Clean up connections, alert ops team
- In OnPartitionChange: Take ownership of reassigned partitions
- Data already exists on replicas, so migration is fast!
1. Failed node restarts and rejoins
   ↓
2. OnNodeRejoin fires
   - Event includes: node info, offline duration, partitions before leave
   ↓
3. Your app clears stale local data
   ↓
4. OnRebalanceStart fires
   ↓
5. Partitions are reassigned (may be different than before!)
   ↓
6. OnPartitionChange fires (for each assigned partition)
   - Your app fetches fresh data from replicas
   ↓
7. OnRebalanceComplete fires
Your Application:
- In OnNodeRejoin: Clear ALL stale data (important!)
- In OnPartitionChange: Fetch fresh data for NEW partition assignments
- Don't assume you'll get the same partitions you had before!
Why clear data?
- Node was offline - data is stale
- Partition assignments may have changed
- Other nodes have the latest data
- Clean slate ensures consistency
package main
import (
"encoding/json"
"fmt"
"log"
"net/http"
"sync"
"time"
"github.com/skshohagmiah/clusterkit"
)
type KVStore struct {
ck *clusterkit.ClusterKit
data map[string]string
mu sync.RWMutex
nodeID string
isRejoining bool
rejoinMu sync.Mutex
}
func NewKVStore(ck *clusterkit.ClusterKit, nodeID string) *KVStore {
kv := &KVStore{
ck: ck,
data: make(map[string]string),
nodeID: nodeID,
}
// Register all hooks
ck.OnPartitionChange(func(event *clusterkit.PartitionChangeEvent) {
kv.handlePartitionChange(event)
})
ck.OnNodeRejoin(func(event *clusterkit.NodeRejoinEvent) {
if event.Node.ID == kv.nodeID {
kv.handleRejoin(event)
}
})
ck.OnNodeLeave(func(event *clusterkit.NodeLeaveEvent) {
log.Printf("[KV] Node %s left (reason: %s, partitions: %d)",
event.Node.ID, event.Reason,
len(event.PartitionsOwned)+len(event.PartitionsReplica))
})
return kv
}
func (kv *KVStore) handleRejoin(event *clusterkit.NodeRejoinEvent) {
kv.rejoinMu.Lock()
defer kv.rejoinMu.Unlock()
if kv.isRejoining {
return // Already rejoining
}
kv.isRejoining = true
log.Printf("[KV] π Rejoining after %v offline", event.OfflineDuration)
log.Printf("[KV] ποΈ Clearing stale data")
// Clear ALL stale data
kv.mu.Lock()
kv.data = make(map[string]string)
kv.mu.Unlock()
log.Printf("[KV] β
Ready for partition reassignment")
}
func (kv *KVStore) handlePartitionChange(event *clusterkit.PartitionChangeEvent) {
if event.CopyToNode.ID != kv.nodeID {
return
}
// Check if rejoining
kv.rejoinMu.Lock()
isRejoining := kv.isRejoining
kv.rejoinMu.Unlock()
if len(event.CopyFromNodes) == 0 {
log.Printf("[KV] New partition %s assigned", event.PartitionID)
return
}
log.Printf("[KV] π Migrating partition %s (reason: %s)",
event.PartitionID, event.ChangeReason)
// Fetch data from source nodes
for _, source := range event.CopyFromNodes {
data := kv.fetchPartitionData(source, event.PartitionID)
kv.mu.Lock()
for key, value := range data {
kv.data[key] = value
}
kv.mu.Unlock()
log.Printf("[KV] β
Migrated %d keys from %s", len(data), source.ID)
break // Successfully migrated
}
// Clear rejoin flag after first partition
if isRejoining {
kv.rejoinMu.Lock()
kv.isRejoining = false
kv.rejoinMu.Unlock()
log.Printf("[KV] β
Rejoin complete")
}
}
func (kv *KVStore) fetchPartitionData(node *clusterkit.Node, partitionID string) map[string]string {
url := fmt.Sprintf("http://%s/migrate?partition=%s", node.IP, partitionID)
resp, err := http.Get(url)
if err != nil {
return nil
}
defer resp.Body.Close()
var result map[string]string
json.NewDecoder(resp.Body).Decode(&result)
return result
}
func (kv *KVStore) Set(key, value string) error {
partition, err := kv.ck.GetPartition(key)
if err != nil {
return err
}
if kv.ck.IsPrimary(partition) || kv.ck.IsReplica(partition) {
kv.mu.Lock()
kv.data[key] = value
kv.mu.Unlock()
return nil
}
// Forward to primary
primary := kv.ck.GetPrimary(partition)
return kv.forwardToPrimary(primary, key, value)
}
func (kv *KVStore) Get(key string) (string, error) {
partition, err := kv.ck.GetPartition(key)
if err != nil {
return "", err
}
if kv.ck.IsPrimary(partition) || kv.ck.IsReplica(partition) {
kv.mu.RLock()
defer kv.mu.RUnlock()
value, exists := kv.data[key]
if !exists {
return "", fmt.Errorf("key not found")
}
return value, nil
}
// Forward to primary
primary := kv.ck.GetPrimary(partition)
return kv.readFromPrimary(primary, key)
}
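// forwardToPrimary and readFromPrimary are referenced above but left as
// illustrative stubs here; the /set and /get endpoints are hypothetical and
// would be part of your own KV HTTP API, not ClusterKit.
func (kv *KVStore) forwardToPrimary(primary *clusterkit.Node, key, value string) error {
	url := fmt.Sprintf("http://%s/set?key=%s&value=%s", primary.IP, key, value)
	resp, err := http.Post(url, "text/plain", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("primary returned status %d", resp.StatusCode)
	}
	return nil
}

func (kv *KVStore) readFromPrimary(primary *clusterkit.Node, key string) (string, error) {
	resp, err := http.Get(fmt.Sprintf("http://%s/get?key=%s", primary.IP, key))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var value string
	if err := json.NewDecoder(resp.Body).Decode(&value); err != nil {
		return "", err
	}
	return value, nil
}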
func main() {
ck, _ := clusterkit.New(clusterkit.Options{
NodeID: "node-1",
HTTPAddr: ":8080",
HealthCheck: clusterkit.HealthCheckConfig{
Enabled: true,
Interval: 5 * time.Second,
Timeout: 2 * time.Second,
FailureThreshold: 3,
},
})
ck.Start()
defer ck.Stop()
kv := NewKVStore(ck, "node-1")
// Use the KV store
kv.Set("user:123", "John Doe")
value, _ := kv.Get("user:123")
fmt.Println("Value:", value)
select {}
}

ck, _ := clusterkit.New(clusterkit.Options{
NodeID: "node-1", // Required
HTTPAddr: ":8080", // Required
})

ck, _ := clusterkit.New(clusterkit.Options{
// Required
NodeID: "node-1",
HTTPAddr: ":8080",
// Cluster Formation
JoinAddr: "node-1:8080", // For non-bootstrap nodes
Bootstrap: false, // Auto-detected
// Partitioning
PartitionCount: 64, // More partitions = better distribution
ReplicationFactor: 3, // Survive 2 node failures
// Storage
DataDir: "/var/lib/clusterkit",
// Health Checking
HealthCheck: clusterkit.HealthCheckConfig{
Enabled: true,
Interval: 5 * time.Second, // Check every 5s
Timeout: 2 * time.Second, // Request timeout
FailureThreshold: 3, // Remove after 3 failures
},
})

ClusterKit exposes RESTful endpoints:
# Get cluster state (includes service discovery)
curl http://localhost:8080/cluster
# Get metrics
curl http://localhost:8080/metrics
# Get detailed health
curl http://localhost:8080/health/detailed
# Check if ready
curl http://localhost:8080/ready

The /cluster endpoint returns comprehensive cluster information including:
- Node membership and status
- Service discovery - All registered services per node
- Partition assignments and replica locations
- Cluster configuration and hash settings
ClusterKit includes 3 complete examples:
# SYNC - Strong consistency (quorum-based)
cd example/sync && ./run.sh
# ASYNC - Maximum throughput (eventual consistency)
cd example/async && ./run.sh
# Server-Side - Simple HTTP clients
cd example/server-side && ./run.sh

Each example demonstrates:
- ✅ Cluster formation (10 nodes)
- ✅ Data distribution (1000 keys)
- ✅ Automatic rebalancing
- ✅ Health checking and failure recovery
- ✅ Node rejoin handling
version: '3.8'
services:
node-1:
image: your-registry/clusterkit:latest
environment:
- NODE_ID=node-1
- HTTP_PORT=8080
- DATA_DIR=/data
ports:
- "8080:8080"
volumes:
- node1-data:/data
node-2:
image: your-registry/clusterkit:latest
environment:
- NODE_ID=node-2
- HTTP_PORT=8080
- JOIN_ADDR=node-1:8080
- DATA_DIR=/data
ports:
- "8081:8080"
volumes:
- node2-data:/data
depends_on:
- node-1
volumes:
node1-data:
node2-data:
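The compose file assumes your image's entrypoint reads these environment variables and passes them to ClusterKit. A minimal sketch of such an entrypoint, using the Options fields shown earlier (the variable names match the compose file; everything else is illustrative):

package main

import (
	"log"
	"os"

	"github.com/skshohagmiah/clusterkit"
)

func main() {
	// Map the container environment onto ClusterKit options.
	ck, err := clusterkit.New(clusterkit.Options{
		NodeID:   os.Getenv("NODE_ID"),
		HTTPAddr: ":" + os.Getenv("HTTP_PORT"),
		JoinAddr: os.Getenv("JOIN_ADDR"), // empty on the bootstrap node
		DataDir:  os.Getenv("DATA_DIR"),
	})
	if err != nil {
		log.Fatal(err)
	}
	if err := ck.Start(); err != nil {
		log.Fatal(err)
	}
	defer ck.Stop()
	select {} // Keep running
}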
Comprehensive guides in the docs/ directory:
- Architecture - Detailed system design, Raft + consistent hashing
- Partitioning - How data is distributed across nodes
- Replication - Replication strategies and consistency models
- Rebalancing - How partitions move when topology changes
- Node Rejoin - Handling stale data when nodes return
- Health Checking - Failure detection and recovery
- Hooks Guide - Complete event system reference
- Production Deployment - Best practices for production
- SYNC Mode - Strong consistency (quorum-based)
- ASYNC Mode - Maximum throughput (eventual consistency)
- Server-Side - Simple HTTP clients
Contributions welcome! Please read CONTRIBUTING.md first.
MIT License - see LICENSE for details.
Simple - 7 methods + hooks, minimal config
Flexible - Bring your own storage and replication
Production-Ready - Raft consensus, health checking, metrics
Well-Documented - Comprehensive guides and examples
Battle-Tested - Used in production distributed systems
Start building your distributed system today! 🚀