Skip to content

feat: database secret#447

Merged
kichasov merged 11 commits into
feat/operator-devfrom
feat/database-secret
May 22, 2026
Merged

feat: database secret#447
kichasov merged 11 commits into
feat/operator-devfrom
feat/database-secret

Conversation

@Logaka
Copy link
Copy Markdown
Collaborator

@Logaka Logaka commented May 15, 2026

No description provided.

Logaka added 2 commits May 15, 2026 16:48
…abase-secret

# Conflicts:
#	dbaas-operator/config/rbac/role.yaml
#	dbaas-operator/helm-templates/dbaas-operator/templates/ClusterRole.yaml
#	dbaas/dbaas-integration-tests/src/test/java/com/netcracker/it/dbaas/test/declarative/OperatorIT.java
@Logaka Logaka requested a review from ArkuNC as a code owner May 15, 2026 12:25
@Logaka Logaka self-assigned this May 15, 2026
@kichasov
Copy link
Copy Markdown
Collaborator

Outdated comment on connectionPropertiesToSecretData

databasesecret_controller.go:250:

// ... on the dbaas-aggregator side the type is List<Map<String,Object>>.

The endpoint used by this controller returns single-CP shape (map[string]any), not a list. Please update the comment to match. For example:

// The aggregator returns a single map (DatabaseResponseV3SingleCP.connectionProperties)
// for the requested userRole; the map shape is dynamic and dictated by the adapter.

@kichasov
Copy link
Copy Markdown
Collaborator

spec.classifier, spec.type, spec.secretName should be immutable

databasesecret_types.go:24–42 — none of the identity fields carry an XValidation immutability rule. Compare with ExternalDatabase, which marks classifier, type, and dbName with XValidation:rule="self == oldSelf".

Without immutability:

  • Changing secretName produces a new Secret next to the old one; consumers mounting the old name keep reading stale credentials.
  • Changing classifier or type makes the next reconcile query a different database in dbaas-aggregator; the same Secret then suddenly points at credentials for a different DB — consumers connect to the wrong system without any indication.

Both are silent-failure scenarios for the consuming microservices.

Suggested fix: add XValidation:rule="self == oldSelf" to all three fields, mirroring the EDB types:

// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="spec.classifier is immutable after creation"
Classifier Classifier `json:"classifier"`

// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="spec.type is immutable after creation"
Type string `json:"type"`

// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="spec.secretName is immutable after creation"
SecretName string `json:"secretName"`

@kichasov
Copy link
Copy Markdown
Collaborator

spec.userRole should be immutable too

databasesecret_types.go:34–36:

// userRole is the role/permission level for the generated credentials.
// +optional
UserRole string `json:"userRole,omitempty"`

Currently mutable. Changing it on an existing CR makes the next reconcile request credentials for a different DB user; the same Secret then suddenly carries different-role credentials, and the previous user is not invalidated by the operator. Same silent-failure shape as the other identity fields.

One CR per role is a cleaner contract — if a microservice needs a different role, create a new DatabaseSecret with a new secretName.

Suggested fix: add the immutability rule, consistent with the other spec fields:

// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="spec.userRole is immutable after creation"
// +optional
UserRole string `json:"userRole,omitempty"`

@kichasov
Copy link
Copy Markdown
Collaborator

Inaccurate godoc on DatabaseSecret — label enforcement is controller-side, not aggregator-side

databasesecret_types.go:56–63 says:

"CEL validation of metadata.labels at root schema level is not supported by controller-gen; enforcement is done through the aggregator error flow."

But the controller does a real pre-flight check at databasesecret_controller.go:91–96:

if s.Labels["app.kubernetes.io/name"] == "" {
    return invalidSpec(ctx, &s.Status.Phase, &s.Status.Conditions, s.Generation,
        r.Recorder, s,
        "label app.kubernetes.io/name is required — ...")
}

The aggregator is never reached when the label is missing — the controller short-circuits with InvalidSpec/InvalidConfiguration. The aggregator-error path is at most a defence-in-depth fallback.

Suggested fix: rewrite the sentence to reflect the actual enforcement:

// The label `app.kubernetes.io/name` is required — its value is sent as originService
// in the get-by-classifier request to dbaas-aggregator. CRD-level CEL validation of
// metadata.labels is not supported by controller-gen, so enforcement is done at the
// controller level via a pre-flight check that sets InvalidConfiguration/InvalidSpec.

@kichasov
Copy link
Copy Markdown
Collaborator

DatabaseNotReady (404) should use a constant requeue interval, not exponential backoff

databasesecret_controller.go:234–239:

case http.StatusNotFound:
    markTransientFailure(...)
    r.Recorder.Eventf(...)
    return ctrl.Result{}, err   // exponential backoff via rate limiter

Returning err engages controller-runtime's exponential rate limiter (5ms → 10ms → 20ms → … → 1000s). For a "wait for upstream provisioning" signal that's the wrong shape:

  • Early retries hammer the aggregator at sub-second intervals while the DB is still being provisioned.
  • Later retries get spaced out into minutes, so once the DB is actually ready, the consumer waits much longer than necessary.

The DatabaseDeclaration controller has the same pattern — waiting for an async provisioning to complete — and uses a constant interval pollRequeueAfter (~5s, declared in conditions.go).

Suggested fix: mirror the DD pattern for the 404 branch:

case http.StatusNotFound:
    markTransientFailure(...)
    r.Recorder.Eventf(...)
    return ctrl.Result{RequeueAfter: pollRequeueAfter}, nil

This gives predictable reaction time once the DB is ready and bounds the rate of 404 calls to the aggregator.

@kichasov
Copy link
Copy Markdown
Collaborator

Materialized Secret has no managed-by label

databasesecret_controller.go:163–179 — the Secret is created with only Name, Namespace, Data, and an OwnerReference. No labels.

Adding the standard app.kubernetes.io/managed-by: dbaas-operator label (plus optionally app.kubernetes.io/name carried over from the CR's own label, which is already required) would:

  • Make operator-managed Secrets kubectl get secret -l app.kubernetes.io/managed-by=dbaas-operator-filterable.
  • Conform to Kubernetes recommended labels convention.
  • Help platform tools (CD pipelines, GitOps drift detection) treat operator-managed Secrets correctly.

Suggested fix inside the CreateOrUpdate mutator:

_, err = controllerutil.CreateOrUpdate(ctx, r.Client, coreSecret, func() error {
    coreSecret.Data = secretData
    if coreSecret.Labels == nil {
        coreSecret.Labels = map[string]string{}
    }
    coreSecret.Labels["app.kubernetes.io/managed-by"] = "dbaas-operator"
    coreSecret.Labels["app.kubernetes.io/name"] = s.Labels["app.kubernetes.io/name"]
    return ctrl.SetControllerReference(s, coreSecret, r.Scheme)
})

@kichasov
Copy link
Copy Markdown
Collaborator

Sibling-name check is O(N) per reconcile — consider a field index

databasesecret_controller.go:115–126 lists every DatabaseSecret in the namespace on each reconcile and linearly scans for a matching spec.secretName:

var siblings dbaasv1.DatabaseSecretList
if err := r.List(ctx, &siblings, client.InNamespace(s.Namespace)); err != nil {
    return ctrl.Result{}, err
}
for i := range siblings.Items {
    if siblings.Items[i].UID != s.UID && siblings.Items[i].Spec.SecretName == s.Spec.SecretName {
        return r.markSecretConflict(...)
    }
}

On namespaces with hundreds of DatabaseSecrets this scales linearly per reconcile. The EDB controller has the same shape of lookup (Secret → EDBs) and uses a field index for O(1) resolution — same pattern would fit here.

Suggested fix: register a field index on spec.secretName at startup:

const secretNameIndex = "spec.secretName"

if err := mgr.GetFieldIndexer().IndexField(ctx, &dbaasv1.DatabaseSecret{}, secretNameIndex,
    func(obj client.Object) []string {
        return []string{obj.(*dbaasv1.DatabaseSecret).Spec.SecretName}
    }); err != nil { return err }

Then in reconcile:

r.List(ctx, &siblings,
    client.InNamespace(s.Namespace),
    client.MatchingFields{secretNameIndex: s.Spec.SecretName})

Cheap optimization; safe to defer until scale becomes a concern.

Logaka added 3 commits May 21, 2026 17:04
# Conflicts:
#	dbaas/dbaas-integration-tests/src/test/java/com/netcracker/it/dbaas/test/declarative/OperatorIT.java
…abase-secret

# Conflicts:
#	dbaas/dbaas-integration-tests/src/test/java/com/netcracker/it/dbaas/test/declarative/OperatorIT.java
@kichasov kichasov deleted the branch feat/operator-dev May 22, 2026 10:56
@kichasov kichasov closed this May 22, 2026
@kichasov kichasov reopened this May 22, 2026
@kichasov
Copy link
Copy Markdown
Collaborator

Empty connectionProperties should be treated as transient, not permanent

databasesecret_controller.go:153–158:

if len(dbResp.ConnectionProperties) == 0 {
    return invalidSpec(ctx, &s.Status.Phase, &s.Status.Conditions, s.Generation,
        r.Recorder, s,
        fmt.Sprintf("aggregator returned empty connectionProperties for type=%s", s.Spec.Type))
}

invalidSpecmarkPermanentFailureInvalidConfiguration / Stalled=True, so the CR stops retrying until the user touches the spec.

In practice this branch is unreachable for a valid 200 response: the aggregator constructs DatabaseResponseV3SingleCP via ConnectionPropertiesUtils.getConnectionProperties(...) which throws NotExistingConnectionPropertiesException if the requested role has no matching entry — that becomes a non-200 HTTP response, not a 200 with an empty map. Any path that returns 200 here goes through the same constructor.

Defensive check is still worth keeping — if the aggregator ever regresses, we want the controller to notice — but the diagnosis is then "aggregator returned an unexpected shape", not "the user's spec is wrong". The user has no spec change they can make to fix it; only a controller restart or an aggregator fix will help. That's the textbook transient signal.

Suggestion: swap invalidSpec for a transient failure with backoff. Reuse the existing aggregator-error machinery so dashboards/alerts pick it up the same way as any other 5xx-class symptom:

if len(dbResp.ConnectionProperties) == 0 {
    msg := fmt.Sprintf("aggregator returned empty connectionProperties for type=%s — unexpected response shape, retrying", s.Spec.Type)
    markTransientFailure( & s.Status.Phase, & s.Status.Conditions, s.Generation,
        EventReasonAggregatorError, msg)
    r.Recorder.Eventf(s, corev1.EventTypeWarning, EventReasonAggregatorError,
        "%s (requestId=%s)", msg, requestID)
    return ctrl.Result{RequeueAfter: pollRequeueAfter}, nil
}

@kichasov
Copy link
Copy Markdown
Collaborator

Symmetric sibling-name conflict deadlocks both DatabaseSecret CRs

When two CRs are created with the same spec.secretName close in time, both see each other in the sibling check at databasesecret_controller.go:134-140 and both call markSecretConflict → both end up in InvalidConfiguration / Stalled=True.

User deletes one of them to resolve the conflict. The other one stays stuck in InvalidConfiguration forever:

  • Deletion of the sibling doesn't trigger a reconcile on the survivor (controller only watches the CR itself and NamespaceBinding).
  • The survivor's own spec hasn't changed, so GenerationChangedPredicate won't fire.
  • User has to manually bump an annotation on the survivor to make the controller notice the conflict is gone.

Suggested fix: watch DatabaseSecret deletes and re-enqueue siblings sharing the same secretName via the existing field index:

Watches(&dbaasv1.DatabaseSecret{},
    handler.EnqueueRequestsFromMapFunc(r.enqueueSiblingsBySecretName),
    builder.WithPredicates(predicate.Funcs{
        CreateFunc:  func(_ event.CreateEvent) bool { return false },
        UpdateFunc:  func(_ event.UpdateEvent) bool { return false },
        DeleteFunc:  func(_ event.DeleteEvent) bool { return true },
        GenericFunc: func(_ event.GenericEvent) bool { return false },
    }))

with the mapping reusing secretNameIndex to find peers in the same namespace.

@kichasov
Copy link
Copy Markdown
Collaborator

Step 9.4 IsNotFound success-path skips markSucceeded

databasesecret_controller.go:221–231:

case apierrors.IsNotFound(err):
    ...
    if createErr := r.Create(ctx, recreated); createErr != nil && !apierrors.IsAlreadyExists(createErr) {
        return ctrl.Result{}, createErr
    }
    return ctrl.Result{Requeue: true}, nil

When r.Create actually succeeds (createErr == nil), the code still returns Requeue: true without calling markSucceeded. The CR stays in Processing until the next reconcile. The "let next reconcile confirm" comment only applies to the AlreadyExists race; on plain success the status should be updated now.

Suggested fix: split the two cases:

case apierrors.IsNotFound(err):
    recreated, buildErr := r.buildOwnedSecret(s, secretData)
    if buildErr != nil {
        return ctrl.Result{}, buildErr
    }
    createErr := r.Create(ctx, recreated)
    if createErr == nil {
        markSucceeded(&s.Status.Phase, &s.Status.Conditions, s.Generation, EventReasonSecretCreated)
        r.Recorder.Eventf(s, corev1.EventTypeNormal, EventReasonSecretCreated,
            "Secret %q re-created after race (requestId=%s)", s.Spec.SecretName, requestID)
        return ctrl.Result{}, nil
    }
    if !apierrors.IsAlreadyExists(createErr) {
        return ctrl.Result{}, createErr
    }
    return ctrl.Result{Requeue: true}, nil   // AlreadyExists race — let next reconcile confirm

@kichasov
Copy link
Copy Markdown
Collaborator

Step 9.2 race recovery discards already-fetched dbResp

databasesecret_controller.go:186–194:

if apierrors.IsNotFound(err) {
    // race: deleted after AlreadyExists → retry create
    return ctrl.Result{Requeue: true}, nil
}

When the Secret is deleted between our Create→AlreadyExists and the Get, we requeue. But newSecret (built from dbResp) is still valid in scope — we can retry Create immediately on the same data and avoid another GetDatabaseByClassifier round-trip on the next reconcile.

Suggested fix:

if apierrors.IsNotFound(err) {
    // Race: someone deleted the Secret after our AlreadyExists. Retry Create
    // with the data we already have — no need for another aggregator call.
    if createErr := r.Create(ctx, newSecret); createErr == nil {
        markSucceeded(&s.Status.Phase, &s.Status.Conditions, s.Generation, EventReasonSecretCreated)
        r.Recorder.Eventf(s, corev1.EventTypeNormal, EventReasonSecretCreated,
            "Secret %q created after race (requestId=%s)", s.Spec.SecretName, requestID)
        return ctrl.Result{}, nil
    } else if !apierrors.IsAlreadyExists(createErr) {
        return ctrl.Result{}, createErr
    }
    // Still AlreadyExists — let next reconcile sort it out.
    return ctrl.Result{Requeue: true}, nil
}

Cold-path optimization; not a bug.

@kichasov
Copy link
Copy Markdown
Collaborator

DatabaseNotFound (404) polls forever with no deadline signal

databasesecret_controller.go:282–288:

if errors.As(err, &aggErr) && aggErr.IsDatabaseNotFound() {
    markTransientFailure(&s.Status.Phase, &s.Status.Conditions, s.Generation,
        EventReasonDatabaseNotFound, aggErr.UserMessage())
    r.Recorder.Eventf(s, corev1.EventTypeWarning, EventReasonDatabaseNotFound,
        "database not found in dbaas-aggregator, waiting for provisioning (requestId=%s)", requestID)
    return ctrl.Result{RequeueAfter: pollRequeueAfter}, nil
}

The wait-for-provisioning intent is correct — a DatabaseSecret may be applied before its DatabaseDeclaration finishes. But there's no upper bound: a CR with a typo in classifier (wrong microserviceName, wrong scope, declaration in another namespace) will requeue every pollRequeueAfter forever, emitting a Warning event each cycle, with no Stalled=True transition and nothing alertable on a dashboard.

Suggested fix (status-only, minimally invasive): record firstNotFoundAt timestamp in Status on the first 404, and after a threshold (e.g. 10 min) flip Stalled=True with reason DatabaseNotFoundTimeout. Keep requeueing (the aggregator still owns the operation), but stop emitting per-cycle Warning events once stalled. Operators get a detectable stuck-state signal without changing the happy-path semantics.

Severity: Low — not a correctness bug, but typos in classifier are operationally invisible today.

kichasov added 5 commits May 22, 2026 16:40
…ansient

DatabaseSecret previously failed permanently (InvalidConfiguration / InvalidSpec)
when dbaas-aggregator returned HTTP 200 with an empty connectionProperties map.
This response is not produced by a healthy aggregator+adapter pair — the aggregator
throws NotExistingConnectionPropertiesException on missing role rather than
returning an empty payload — so when it does occur it indicates a transient
inconsistency in the adapter or role registry, not a user spec error.

Switch to BackingOff with a new EmptyConnectionProperties event reason and
requeue with pollRequeueAfter, mirroring how we already treat 5xx and
DatabaseNotFound responses. The user's spec remains valid; the operator
retries until the adapter populates the payload.

Refs PR #447 review comment F2.
…onflicts

When two DatabaseSecret CRs claimed the same spec.secretName before either
reconciled, both moved to InvalidConfiguration/SecretConflict symmetrically.
Deleting one did not wake the other (its own spec hadn't changed), so the
survivor stayed deadlocked.

Two changes:

1. Tiebreak by creationTimestamp with UID lexical fallback. Only the younger
   claimant moves to SecretConflict; the older claimant proceeds normally.
   The UID fallback keeps the result stable when peers race within the same
   second of creationTimestamp resolution.

2. Add a Watches on DatabaseSecret in SetupWithManager that re-enqueues
   every sibling sharing spec.secretName whenever any DatabaseSecret in
   the namespace is created, deleted, or has a spec change. A CR sitting
   in SecretConflict now recovers automatically once the older claimant is
   removed or rebound to a different secret name. GenerationChangedPredicate
   filters out status-only updates to keep the fan-out quiet.

Unit tests cover older-wins, recovery-after-winner-deletion, and the
fan-out map func. The existing IT testDatabaseSecretSharedSecretName is
updated: the older CR must remain unaffected by the younger sibling.

Refs PR #447 review comment F3.
… post-GC

Step 9.4 of the DatabaseSecret reconciler handles the race where the target
Secret is deleted between the Update's preceding Get and the Update itself
(IsNotFound). The recovery path recreates the Secret but always returned
Requeue=true without marking the CR succeeded, even when the Create call
returned nil — meaning we just wrote the correct content and could confirm
immediately. The next reconcile would refetch from the aggregator just to
verify a write we already knew was good, flipping the CR's phase through
Processing again.

Split the post-Create branches: on nil, mark succeeded and emit the
SecretCreated event; on AlreadyExists, keep the existing requeue (we don't
know whose content won the race); on any other error, bubble up.

Refs PR #447 review comment F4.
The DatabaseSecret reconciler's Step 9.2 handles the race where the target
Secret disappears between r.Create returning AlreadyExists and the
subsequent r.Get to verify ownership. Previously this returned Requeue=true,
forcing the next reconcile to repeat the whole pipeline — namespace
ownership, sibling check, aggregator round-trip, Secret build — just to
end up back at the same Create call.

newSecret is still valid in the current scope, so retry Create inline:
on nil, mark succeeded; on AlreadyExists (double race), keep the Requeue;
on any other error, bubble up. Saves one aggregator call on a rare race
recovery path.

Refs PR #447 review comment F5.
A DatabaseSecret whose classifier had a typo or referenced a database in
another namespace would sit in BackingOff/DatabaseNotFound forever,
producing a Warning event on every reconcile cycle and giving SRE no
alertable signal that the wait had become hopeless.

Add a databaseNotFoundTimeout (10 min) and a Status.firstNotFoundAt
timestamp that records the start of the current NotFound streak. On the
first reconcile that finds the streak past the threshold:

- emit a one-shot DatabaseNotFoundTimeout Warning event, then
- switch the Ready condition reason to DatabaseNotFoundTimeout so the
  condition itself acts as the "already-notified" marker and no further
  per-cycle Warnings are emitted.

Polling continues — Phase stays BackingOff, Stalled stays False — so a
late-arriving database still unsticks the CR automatically. The marker
is cleared on any successful aggregator response or on a non-NotFound
error so a fresh streak starts a fresh timer.

Unit tests cover the threshold crossing, suppressed re-emit on subsequent
reconciles past the threshold, and clean recovery after the database
finally appears. CRD manifest and Helm CRD template are synced with the
new status field.

Refs PR #447 review comment F6.
@kichasov kichasov merged commit 664e674 into feat/operator-dev May 22, 2026
12 checks passed
@kichasov kichasov deleted the feat/database-secret branch May 22, 2026 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants