Skip to content

fix(manager): resolve race conditions in CRD waiting and manager initialization#4988

Open
kalakotima wants to merge 1 commit into
cilium:mainfrom
kalakotima:race_condition
Open

fix(manager): resolve race conditions in CRD waiting and manager initialization#4988
kalakotima wants to merge 1 commit into
cilium:mainfrom
kalakotima:race_condition

Conversation

@kalakotima
Copy link
Copy Markdown

PR Summary

This PR fixes five production-grade bugs discovered in manager.go through static analysis and race detector review. The bugs span across three functions- Get(), WaitCRDs(), and WaitCRDsWithResync() .

All bugs are reproducible under go test race and represent real failure modes in multi-go routine Kubernetes controller environments where informer callbacks and the main goroutine operate concurrently.

requesting to find fixes mentioned below.

  1. Get() — Silent panic on initialization failure
    // BEFORE: error swallowed, process panics with no recovery path
    initOnce.Do(func() {
    manager, err = newControllerManager()
    if err != nil {
    panic(err) // ❌ callers have no chance to handle this
    }
    })
    return manager
    err was scoped inside initOnce.Do(...) and never surfaced to the caller
    Any failure in newControllerManager() caused an unrecoverable panic

  2. WaitCRDs() — Unsynchronized map mutation from informer goroutine
    // BEFORE: crds map written from informer goroutine, read from caller goroutine
    AddFunc: func(obj any) {
    delete(crds, crdObject.Name) // ❌ no lock — DATA RACE
    if len(crds) == 0 {
    wg.Done() // ❌ wg.Done() can underflow if called twice
    }
    }
    // ...
    wg.Wait() // ❌ blocks forever if ctx is cancelled
    The caller's crds map was directly mutated inside the informer callback (different goroutine) with no synchronization
    wg.Wait() had no context cancellation path — goroutine leak on shutdown

  3. WaitCRDsWithResync() — Race on remainingCRDs map and completed bool
    // BEFORE: both accessed from informer goroutine without a lock
    completed := false // ❌ plain bool, written from informer goroutine

AddFunc: func(obj any) {
if completed { return } // ❌ read without lock
delete(remainingCRDs, crdObject.Name) // ❌ write without lock
if len(remainingCRDs) == 0 {
completed = true // ❌ write without lock
finish()
}
}
remainingCRDs map and completed bool both had concurrent read/write from the informer goroutine with no mutex
Detectable by go test -race — classified as undefined behaviour in Go

  1. WaitCRDsWithResync() — TOCTOU window in completion signalling
    // BEFORE: wg + separate goroutine has a race window
    go func() {
    select {
    case <-ctx.Done():
    finish() // goroutine may not run before wg.Wait() unblocks
    case <-done:
    }
    }()
    wg.Wait()

if ctx.Err() != nil { // ❌ could be nil even after cancellation
return ctx.Err()
}
A separate goroutine was used to handle ctx.Done() but wg.Wait() could unblock before that goroutine was scheduled
ctx.Err() check after wg.Wait() had a TOCTOU window — cancellation could be missed

also please find the 1 liner summary fixes

Thankyou.

@kalakotima kalakotima requested a review from a team as a code owner May 11, 2026 21:38
@netlify
Copy link
Copy Markdown

netlify Bot commented May 11, 2026

Deploy Preview for tetragon ready!

Name Link
🔨 Latest commit 7efcc3e
🔍 Latest deploy log https://app.netlify.com/projects/tetragon/deploys/6a024c6c9866b70008d887bd
😎 Deploy Preview https://deploy-preview-4988--tetragon.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Comment thread pkg/manager/manager.go
@@ -41,6 +41,7 @@ import (
var (
initOnce, startOnce sync.Once
manager *ControllerManager
managerErr error // : Capture init error for safe retrieval
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this even compile?
I do not see a corresponding main.go change!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants