fix(manager): resolve race conditions in CRD waiting and manager initialization #4988
Open
kalakotima wants to merge 1 commit into
ExceptionalHandler
requested changes
May 12, 2026
@@ -41,6 +41,7 @@ import (
var (
	initOnce, startOnce sync.Once
	manager *ControllerManager
	managerErr error // Capture init error for safe retrieval
Contributor
Does this even compile?
I do not see a corresponding main.go change!
PR Summary
This PR fixes five production-grade bugs discovered in manager.go through static analysis and race detector review. The bugs span three functions: Get(), WaitCRDs(), and WaitCRDsWithResync().
All bugs are reproducible under go test -race and represent real failure modes in multi-goroutine Kubernetes controller environments where informer callbacks and the main goroutine operate concurrently.
Please find the fixes described below.
Get() — Silent panic on initialization failure
// BEFORE: error swallowed, process panics with no recovery path
initOnce.Do(func() {
	manager, err = newControllerManager()
	if err != nil {
		panic(err) // ❌ callers have no chance to handle this
	}
})
return manager
err was scoped inside initOnce.Do(...) and never surfaced to the caller
Any failure in newControllerManager() caused an unrecoverable panic
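One possible shape for the fix is sketched below, assuming Get() is allowed to change its signature to return an error. ControllerManager and newControllerManager are stand-ins here (the constructor is made to fail on purpose), so this illustrates the pattern only and is not the code in manager.go.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ControllerManager is a stand-in for the real type in manager.go.
type ControllerManager struct{}

var (
	initOnce   sync.Once
	manager    *ControllerManager
	managerErr error // written once inside initOnce.Do, read-only afterwards
)

// newControllerManager is a hypothetical constructor. It always fails here
// so that the error path is exercised.
func newControllerManager() (*ControllerManager, error) {
	return nil, errors.New("no kubeconfig available")
}

// Get initializes the manager at most once and returns the stored result.
// A failed initialization is remembered, not retried.
func Get() (*ControllerManager, error) {
	initOnce.Do(func() {
		manager, managerErr = newControllerManager()
	})
	return manager, managerErr
}

func main() {
	if _, err := Get(); err != nil {
		fmt.Println("manager init failed:", err)
	}
}
```

A signature change like this means every existing caller of Get() has to be updated in the same change, otherwise the package will not build.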
WaitCRDs() — Unsynchronized map mutation from informer goroutine
// BEFORE: crds map written from informer goroutine, read from caller goroutine
AddFunc: func(obj any) {
	delete(crds, crdObject.Name) // ❌ no lock: DATA RACE
	if len(crds) == 0 {
		wg.Done() // ❌ wg.Done() can underflow if called twice
	}
}
// ...
wg.Wait() // ❌ blocks forever if ctx is cancelled
The caller's crds map was directly mutated inside the informer callback (different goroutine) with no synchronization
wg.Wait() had no context cancellation path — goroutine leak on shutdown
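A minimal sketch of how the map mutation and the double-Done problem could be addressed together: work on a private copy of the caller's map, guard it with a mutex, and signal completion by closing a channel through sync.Once. The pendingSet type, its methods, and the map[string]struct{} element type are assumptions made for illustration and do not come from manager.go.

```go
package main

import (
	"fmt"
	"sync"
)

// pendingSet tracks the names still awaited. All names here are hypothetical.
type pendingSet struct {
	mu      sync.Mutex
	pending map[string]struct{}
	once    sync.Once
	done    chan struct{}
}

// newPendingSet copies the caller's map so the caller's data is never mutated.
func newPendingSet(names map[string]struct{}) *pendingSet {
	p := &pendingSet{
		pending: make(map[string]struct{}, len(names)),
		done:    make(chan struct{}),
	}
	for n := range names {
		p.pending[n] = struct{}{}
	}
	if len(p.pending) == 0 {
		p.once.Do(func() { close(p.done) })
	}
	return p
}

// observe is what an AddFunc callback would call for each object it sees.
func (p *pendingSet) observe(name string) {
	p.mu.Lock()
	delete(p.pending, name)
	empty := len(p.pending) == 0
	p.mu.Unlock()
	if empty {
		// sync.Once makes repeated completion signals harmless.
		p.once.Do(func() { close(p.done) })
	}
}

func main() {
	want := map[string]struct{}{"a": {}, "b": {}}
	p := newPendingSet(want)

	var wg sync.WaitGroup
	for _, n := range []string{"a", "b", "a", "b"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			p.observe(name)
		}(n)
	}
	wg.Wait()
	<-p.done
	fmt.Println("all observed; caller's map still has", len(want), "entries")
}
```

Closing a channel instead of calling wg.Done() also gives the caller something it can select on alongside ctx.Done().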
WaitCRDsWithResync() — Race on remainingCRDs map and completed bool
// BEFORE: both accessed from informer goroutine without a lock
completed := false // ❌ plain bool, written from informer goroutine
AddFunc: func(obj any) {
	if completed { return } // ❌ read without lock
	delete(remainingCRDs, crdObject.Name) // ❌ write without lock
	if len(remainingCRDs) == 0 {
		completed = true // ❌ write without lock
		finish()
	}
}
The remainingCRDs map and the completed bool were both read and written from the informer goroutine with no mutex
Detectable by go test -race; under the Go memory model a data race is a program bug, not behaviour that can be relied on
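The sketch below shows one way to keep the remaining set and the completed flag consistent by putting both behind a single mutex. The crdTracker type, the markSeen method, and the CRD names are hypothetical and chosen only for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// crdTracker keeps the remaining set and the completed flag under one mutex.
type crdTracker struct {
	mu        sync.Mutex
	remaining map[string]struct{}
	completed bool
}

// markSeen removes name and reports true exactly once: on the call that
// empties the set. Every later call returns false.
func (t *crdTracker) markSeen(name string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.completed {
		return false
	}
	delete(t.remaining, name)
	if len(t.remaining) > 0 {
		return false
	}
	t.completed = true
	return true
}

func main() {
	tracker := &crdTracker{
		remaining: map[string]struct{}{"crd-a": {}, "crd-b": {}},
	}
	finish := func() { fmt.Println("all CRDs observed") }

	// Simulated informer events, including a duplicate after completion.
	for _, name := range []string{"crd-b", "crd-a", "crd-b"} {
		if tracker.markSeen(name) {
			finish() // runs once, outside the lock
		}
	}
}
```

Returning a bool and calling finish() outside the lock keeps the critical section short and avoids holding the mutex while arbitrary completion code runs.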
// BEFORE: wg + separate goroutine has a race window
go func() {
	select {
	case <-ctx.Done():
		finish() // goroutine may not run before wg.Wait() unblocks
	case <-done:
	}
}()
wg.Wait()
if ctx.Err() != nil { // ❌ could be nil even after cancellation
	return ctx.Err()
}
A separate goroutine was used to handle ctx.Done() but wg.Wait() could unblock before that goroutine was scheduled
ctx.Err() check after wg.Wait() had a TOCTOU window — cancellation could be missed
Please also find the one-line summaries of the fixes.
Thank you.