feat(controller): Add leader election for high availability#851
Conversation
f7ac650 to
0df5f10
Compare
| }, | ||
| OnNewLeader: func(identity string) { | ||
| if identity == lockID { | ||
| return |
There was a problem hiding this comment.
-
Can you add a code comment here, explaining that scenario? Does that mean the
OnNewLeader()callback can get invoked for us with us being the leader before and after the callback gets invoked? -
Can you emit a log message here on level 6? Something like
klog.V(6).Infof("OnNewLeader() callback, new identity is still my lock ID")
There was a problem hiding this comment.
Great catch.
Source code comments for OnNewLeader:
OnNewLeader is called when the client observes a leader that is not the previously observed leader. This includes the first observed leader when the client starts.
I will add the requested klog.V(6) log to track this. The if identity == lockID { return } check is a standard practice in K8s controllers to avoid redundant processing when the 'new' leader is actually ourselves.
There was a problem hiding this comment.
done
I've just pushed an update that adds the requested V(6) logs and includes a detailed architecture note in the code to explain the lifecycle and shutdown strategy. The implementation has been verified with the network partition test as discussed.
There was a problem hiding this comment.
Done. Added comment and klog.V(6) log as suggested.
| } | ||
| }, | ||
| OnStoppedLeading: func() { | ||
| klog.Warningf("Lost leader election (id: %s), waiting to re-compete", lockID) |
There was a problem hiding this comment.
This can happen when we previously were the leader, but now we're transitioning into not being the leader anymore, correct? At least, a typical failover scenario has to be for the old leader to not be the leader anymore. Are there any more actions that we need to perform here in that case? Do we need to perform a controller shutdown? If not: why?
If we later become the leader again, it looks like we would call controller.Run() again, potentially for the Nth time during our lifetime. That of course needs to be safe. Is it safe as of now?
There was a problem hiding this comment.
This is a critical point. In new implementation, we rely on Context-based cancellation.
- Shutdown: When we lose leadership, the
leaderelectionlibrary cancels theleaderCtxpassed tocontroller.Run(leaderCtx). Ourcontrolleris designed to watch this context and shut down all workers gracefully when it's canceled. - Process Exit: By calling
cancelElector()in thedeferblock insideOnStartedLeading, we ensure that as soon as the controller stops, the entireelector.Runloop terminates. This allows the pod to exit and be restarted by Kubernetes (the 'crash-everything-and-start-again' strategy), which is the safest way to clear any in-memory state. - Safety: Re-running
controller.Run()within the same process lifetime is generally avoided here because we prefer the pod restart to ensure a clean slate. Should I clarify this in the comments?
There was a problem hiding this comment.
Done. The controller now shuts down via context cancellation, and the pod restarts for clean state.
| klog.Infof("Became leader, starting controller (id: %s)", lockID) | ||
| if err := controller.Run(ctx); err != nil { | ||
| klog.Errorf("Error running controller as leader: %v", err) | ||
| return |
There was a problem hiding this comment.
Do we need to return an error here?
There was a problem hiding this comment.
Yes, but we cannot return it directly from the callback because the callback signature doesn't support it.
In the updated logic, we capture it in the controllerErr variable and then call cancelElector(). This signals the main goroutine (blocked at elector.Run) to wake up, see the error, and return it to the CLI framework. This ensures the fail-fast behavior.
There was a problem hiding this comment.
Done. Error is propagated correctly via error channel now.
| klog.InfoS("Context canceled, stopping leader elector", "lockID", lockID) | ||
| }() | ||
|
|
||
| elector.Run(ctx) |
There was a problem hiding this comment.
Does elector.Run(ctx) have an interesting return value?
There was a problem hiding this comment.
leaderelection.Run itself doesn't return anything (it's a blocking call that returns when the context is canceled).
| }() | ||
|
|
||
| elector.Run(ctx) | ||
| return nil |
There was a problem hiding this comment.
Do we always want to return nil here?
There was a problem hiding this comment.
In our updated version, we do not always return nil. We check controllerErr after elector.Run returns. If the controller failed while it was the leader, we propagate that error so the process exits with code 1. If it returns nil, it means the process received a standard SIGTERM and is exiting gracefully.
| case err := <-errChan: | ||
| cancel() | ||
| if err != nil { | ||
| return fmt.Errorf("run controller: %w", err) |
There was a problem hiding this comment.
Here, previously, we would return an error to the CLI framework -- which would after all terminate this process with a non-zero exit code.
We still need a way for the program to crash upon well-defined situations, and return with a non-zero exit code. Is this still ensured?
There was a problem hiding this comment.
Yes, this is still strictly ensured.
In the updated logic, if controller.Run(leaderCtx) returns an error, it is captured in the controllerErr variable. Immediately after, we call cancelElector(), which causes the blocking elector.Run(electorCtx) to return.
Once elector.Run returns, the function checks controllerErr. If it's non-nil, we return a wrapped error back to the caller (the CLI framework). Since the CLI framework receives a non-nil error, it will handle the process termination with a non-zero exit code as before.
This approach allows us to achieve two things:
Graceful Lease Release: It ensures ReleaseOnCancel is triggered so the leader identity is cleared from the Lease object immediately.
Non-zero Exit: It maintains the existing behavior of crashing the process with an error state when the controller fails.
jgehrcke
left a comment
There was a problem hiding this comment.
Hey! Thanks for the contribution! Great!
As always, what's critical, is to design the error handling story well, and to then implement it according to that design.
In the spirit of that, I have left a few comments and questions.
Generally, with distributed system fail-overs, I consider everything 'broken' that is not explicitly tested with fault injection tests. I know that testing such things can be a lot of effort. Can you describe to which extent you could test certain scenarios so far?
After all, with leader election in place, the state space becomes much more complicated than with a single instance of the controller. When the single instance crashes, it gets started again from scratch, and has a great chance to heal that way.
With leader election in place, there is a chance for the entire ensemble to transition into a long-term dysfunctional state because the "crash-everything-and-start-again" story isn't that simple anymore. From that point of view, we need to make extra sure that this type of HA deployment with leader election only ever improves things, and never worsens things (compared to the single-replica situation). To that end, we need quite a bit of discussion and testing. I'm confident that this will lead to something good!
Thanks for your work on this!
@jgehrcke
TODO: Network Partition: Simulating API Server unreachability for the leader. Observed that To ensure we don't worsen things, we've implemented UUID-based lock IDs. Even if a Pod is restarted with the same name, the new instance gets a fresh ID, preventing it from accidentally assuming it still holds a stale lease. |
|
Oops, sorry for the noise! I accidentally hit the 'Close with comment' button. Reopening now—please ignore the notification. |
Hi @jgehrcke , I've conducted fault injection tests (simulating API server unreachability) and the logs confirm the robustness of the implementation.
|
dff5bd2 to
1e75766
Compare
|
Hi @jgehrcke, I have updated the PR and pushed the latest changes via commit amend:
The implementation now fully reflects our discussion in code comments and has been validated with fault injection. Looking forward to your final review! |
| elector.Run(electorCtx) | ||
|
|
||
| // If exiting due to a controller failure, propagate the error to main | ||
| if controllerErr != nil { |
There was a problem hiding this comment.
controllerErr is modified inside of OnStartedLeading and read in run after elector.Run returns.
OnStartLeading is in a separate goroutine, wondering if that could cause a potential subtle data race.
Consider the following scenario
- Pod becomes leader
leaderelection.Runruns theOnStartedLeadingin a seperate go routine, that runscontrolller.Run(leaderCtx)
go le.config.Callbacks.OnStartedLeading(ctx)
le.renew(ctx)- Network failure, so leadership is lost
- Since leadership is not acquired, context is cancel
- Two things happen concurrently
a. Goroutine ofruntries to readcontrollerError.if controllerError != nil
b. Goroutine ofonStartedLeadingis still in the middle ofcontroller.Runand modifiescontrollerError
Timing is nondeterministic
run()may seecontrollerErrbeing nil before the write happens, which returns incorrect nil error- read might be happening while
controllerErris been written
There was a problem hiding this comment.
It might be safer to pass the controllerErr to a error channel that's read inside of run
There was a problem hiding this comment.
Fixed. Replaced shared variable with thread-safe error channel, no data race now.
|
Thanks for your reviews! I will fix all the issues you mentioned one by one (data race, flag abstraction, env naming, RBAC, code refactor, affinity, etc.) and push the updated code shortly. Will keep you posted! |
| select { | ||
| case controllerErrCh <- err: | ||
| default: | ||
| } |
There was a problem hiding this comment.
Is this select necessary? I would hope that erroring here would trigger elector.Run() to fail, meaning we can / should just do a direct write here, i.e.:
controllerErrCh <- err
Which will then block until the follow up call to:
err := <-controllerErrCh
happens later (which we want to block to ensure the controller has shutdown completely before returning the error).
| }, | ||
| } | ||
|
|
||
| controllerErrCh := make(chan error, 1) |
There was a problem hiding this comment.
Do we want this to be a buffered channel? I would think we want it to be unbuffered.
| return fmt.Errorf("controller execution failed: %w", err) | ||
| } | ||
| default: | ||
| } |
There was a problem hiding this comment.
As above, I don't think we want a select here, just:
if err := <-controllerErrCh; err != nil {
klog.ErrorS(err, "Process exiting due to controller failure")
return fmt.Errorf("controller execution failed: %w", err)
}
We can / should block here until the controller has pushed an error into this channel.
Is there ever a case we could get here without that happening?
There was a problem hiding this comment.
Hi @klueska , thanks for the review!
Actually, making the channel unbuffered and removing the select blocks would lead to deadlocks in both the error path and the graceful shutdown path.
-
Why
make(chan error, 1)and the non-blocking write are required:
elector.Run()does not automatically return when the callback (OnStartedLeading) errors out. We must explicitly triggercancelElector()to stop the leader election loop. If we use an unbuffered channel and a blocking write (controllerErrCh <- err), the callback goroutine will block forever waiting for a reader. Since it's blocked,defer cancelElector()is never reached,elector.Run()never returns, and the reader at the bottom is never reached. Deadlock. -
Why the non-blocking read (the bottom
select) is required:
You asked: "Is there ever a case we could get here without that happening?" Yes, absolutely! During a normal pod termination (e.g., receivingSIGTERM), the globalctxis cancelled, causingelector.Run()to return gracefully. In this case,controller.Run()exits without error, and nothing is pushed tocontrollerErrCh. If we block on<-controllerErrChat the bottom, the process will hang forever during graceful shutdown until Kubernetes SIGKILLs it. Theselect+defaultallows us to safely check for errors without hanging during a normal exit.
The current buffered channel + non-blocking select pattern acts as a safe 'error mailbox' across goroutine boundaries, ensuring we never block the crucial cancelElector() call or the graceful shutdown flow.
There was a problem hiding this comment.
OK. That makes sense. I'm still not 100% happy with the way it reads with these selects, but I'll defer to @shivamerla to decide if something should be changed here.
klueska
left a comment
There was a problem hiding this comment.
Looking good, jsut a few small comments about error propagation / proper shutdown semantics.
|
/ok to test f3b4a4e |
|
Can you please squash to a single commit? |
f3b4a4e to
d9414d1
Compare
Hi @klueska , I've squashed all the changes into a single commit as requested. The commit history is clean now. I think a new /ok-to-test might be needed since the commit SHA has changed after the force push. Thanks! |
|
/ok-to-test d9414d1 |
| for { | ||
| select { | ||
| case <-sigs: | ||
| klog.Info("Received signal, shutting down") |
There was a problem hiding this comment.
We don't need to do this in this PR. But we should make sure to log the exact signal received.
There was a problem hiding this comment.
Good catch! It's a quick fix, so I've updated the log to include the exact signal (klog.InfoS(..., "signal", sig.String())) in the latest rebase.
| enabled: false | ||
| leaseDuration: "15s" | ||
| renewDeadline: "10s" | ||
| retryPeriod: "2s" |
There was a problem hiding this comment.
Where did we take inspiration for these values? :)
There was a problem hiding this comment.
https://github.com/kubernetes/client-go/blob/v0.34.0/tools/leaderelection/leaderelection.go#L116
type LeaderElectionConfig struct {
// Lock is the resource that will be used for locking
Lock rl.Interface
// LeaseDuration is the duration that non-leader candidates will
// wait to force acquire leadership. This is measured against time of
// last observed ack.
//
// A client needs to wait a full LeaseDuration without observing a change to
// the record before it can attempt to take over. When all clients are
// shutdown and a new set of clients are started with different names against
// the same leader record, they must wait the full LeaseDuration before
// attempting to acquire the lease. Thus LeaseDuration should be as short as
// possible (within your tolerance for clock skew rate) to avoid a possible
// long waits in the scenario.
//
// Core clients default this value to 15 seconds.
LeaseDuration time.Duration
// RenewDeadline is the duration that the acting master will retry
// refreshing leadership before giving up.
//
// Core clients default this value to 10 seconds.
RenewDeadline time.Duration
// RetryPeriod is the duration the LeaderElector clients should wait
// between tries of actions.
//
// Core clients default this value to 2 seconds.
RetryPeriod time.Durationthe recommended defaults in the client-go leaderelection package.
There was a problem hiding this comment.
Thanks for all the great work and discussion here.
Can this be tested pragmatically in CI?
Note that we currently have
- two k8s nodes in total in the CI envs
- only one of these nodes being labeled with
control-plane
Maybe this would need a bit of monkey-patching (removing a label selector?). Anyway, I can imagine that with a bit of trickery we can add a test to the current test suite that at least covers basic leader election code paths. Of course it would be nice to still do this in this PR -- but I also don't want to slow this down.
Summary: if this is testable with reasonable effort in current CI we must do it, either in this PR or in a follow-up patch.
I can also help building the test if needed.
|
@herb-duan looking at e.g. this makes me think that you didn't rebase on the current HEAD of main in this repo -- please do that, only then CI has a chance of succeeding :). |
- Implement client-go leaderelection with safe error propagation. - Group leader election flags in `pkg/flags/leaderelection.go`. - Update Helm charts for leader election. Change-Id: I2296b433f6d8dcdda6b95fdf487c91f4f195f35e Signed-off-by: Herb Duan <herbertduan@qq.com>
d9414d1 to
3ec74e0
Compare
Good catch! I've just rebased the branch onto the latest Could you please trigger |
Hi @jgehrcke, since However, considering the complexity of adding an end-to-end chaos test (e.g., killing the active pod and verifying the lease handover), I completely agree with your suggestion to handle this in a follow-up patch to keep this PR focused. |
|
/ok-to-test 3ec74e0 |
jgehrcke
left a comment
There was a problem hiding this comment.
CI passed. Approved! Thank you @herb-duan.
(largely relying on Kevin's last review)
|
@shivamerla do you want to merge this? :) |
Fixes #815
The compute-domain-controller currently operates as a singleton, which presents a single point of failure (SPOF) in a production environment. To enhance the reliability and availability of the controller, this change introduces a leader election mechanism to support a multi-replica, high-availability (HA) deployment model.
When HA mode is enabled, only one of the controller replicas becomes the leader and executes the core business logic as well as all change operations (e.g., configuration updates, resource modifications). All other replicas remain in a hot-standby state and will not perform any business or change-related work. This leader election dependency is critical for change operations in particular — concurrent execution of change logic by multiple controller replicas must be strictly prohibited to avoid data inconsistency, conflicting operations, or unintended side effects.
If the current leader fails, one of the standby replicas will automatically take over as the new leader, ensuring uninterrupted service continuity and consistent execution of both core business and change operations.