dis-pgsql-operator: surface server-name conflicts on status and order DatabaseServer teardown #3733

Description

@sduranc

Problem

Two DatabaseServer reliability issues in dis-pgsql-operator:

  1. Misleading status on name collisions. Azure PostgreSQL Flexible Server names are globally unique. When two DatabaseServers resolve to the same name, the second fails with ServerNameAlreadyExists, but the status surfaced an unrelated "Server Parameter Errors: Owner … cannot be found". The operator had created the config children even though the server itself was what failed, so the children's missing-owner error masked the real cause.
  2. Stuck teardown. Deleting a dedicated DatabaseServer wedged its PrivateDnsZone (CannotDeleteResource … nested resources … virtualNetworkLinks). The zone and its vnet links are sibling children, so Kubernetes garbage collection and ASO (Azure Service Operator) deleted them concurrently, and Azure rejected the zone delete while links still existed.

While validating on Kind, we also found that the e2e Makefile targets ran kubectl against the caller's current context instead of the Kind cluster they create. CRDs were applied to the wrong cluster, which failed with ServiceUnavailable.
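
The actual fix lives in the Makefiles, but the idea of pinning every kubectl call to one kubeconfig can be sketched in Go for an e2e helper. This is an illustration only: `pinnedKubectl` and the kubeconfig path are hypothetical names, not taken from the repository. The `--kubeconfig` flag and the `KUBECONFIG` environment variable are standard kubectl behavior.

```go
package main

import (
	"os"
	"os/exec"
)

// pinnedKubectl builds a kubectl command bound to one explicit kubeconfig
// file, so it never falls back to the caller's current context.
// Hypothetical helper for illustration; not the operator's real tooling.
func pinnedKubectl(kubeconfig string, args ...string) *exec.Cmd {
	// Put --kubeconfig first so it applies to whatever subcommand follows.
	full := append([]string{"--kubeconfig", kubeconfig}, args...)
	cmd := exec.Command("kubectl", full...)
	// Also export KUBECONFIG so plugins and child processes see the same cluster.
	cmd.Env = append(os.Environ(), "KUBECONFIG="+kubeconfig)
	return cmd
}
```

A caller would build `pinnedKubectl("/tmp/kind-e2e.kubeconfig", "apply", "-f", "config/crd")` and then call `Run()` on the result. The command is only constructed here, not executed.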

What we did

  1. Gate child reconciliation on the owned FlexibleServer's Ready condition; surface a ServerNameConflict status carrying the real Azure error, and stop creating the misleading config children. The server name is not mangled to dodge the collision: it is a contract (the private DNS zone is derived from it).
  2. Add a finalizer that tears down the FlexibleServer + vnet links first, then the private DNS zone, then drops the finalizer.
  3. Pin the whole e2e flow to a dedicated Kind kubeconfig (in the standalone Makefile and in the shared Makefile.common).
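
The gating in item 1 can be sketched as a pure decision function. This is a minimal sketch, not the operator's code: the `Condition` and `Decision` types are simplified stand-ins, the `ServerNotReady` reason is hypothetical, and finding the Azure error code by substring in the condition message is an assumption about where ASO reports it. Only `ServerNameConflict` and `ServerNameAlreadyExists` come from the issue text.

```go
package main

import "strings"

// Condition is a simplified stand-in for a Kubernetes status condition.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Message string
}

// Decision is what the reconciler would do next (hypothetical type).
type Decision struct {
	CreateChildren bool
	StatusReason   string
	StatusMessage  string
}

// decide gates child reconciliation on the FlexibleServer's Ready condition.
// Children are created only when the server is Ready; otherwise the server's
// own error is surfaced instead of a downstream missing-owner error.
func decide(serverConditions []Condition) Decision {
	for _, c := range serverConditions {
		if c.Type != "Ready" {
			continue
		}
		if c.Status == "True" {
			return Decision{CreateChildren: true, StatusReason: "Ready"}
		}
		// Assumption: the Azure error code appears in the condition message.
		if strings.Contains(c.Message, "ServerNameAlreadyExists") {
			return Decision{StatusReason: "ServerNameConflict", StatusMessage: c.Message}
		}
		// "ServerNotReady" is a placeholder reason for this sketch.
		return Decision{StatusReason: "ServerNotReady", StatusMessage: c.Message}
	}
	// No Ready condition reported yet: wait, and do not create children.
	return Decision{StatusReason: "ServerNotReady", StatusMessage: "no Ready condition yet"}
}
```

Keeping the decision separate from the API calls makes it easy to cover in a unit test alongside the envtest spec.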

Operator/reconciler + build-tooling only; each operator PR ships an envtest spec.
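
The teardown order enforced by the finalizer can be sketched as a small state function that a reconcile pass would consult while the DatabaseServer is being deleted. The types and phase names below are hypothetical and only illustrate the ordering described above: server and vnet links first, then the private DNS zone, then the finalizer.

```go
package main

// Phase names one step of the ordered teardown (hypothetical type).
type Phase string

const (
	DeleteServerAndLinks Phase = "DeleteServerAndVnetLinks"
	DeleteDNSZone        Phase = "DeletePrivateDnsZone"
	RemoveFinalizer      Phase = "RemoveFinalizer"
)

// Remaining reports which owned resources still exist.
type Remaining struct {
	FlexibleServer bool
	VnetLinks      int
	PrivateDNSZone bool
}

// nextPhase returns the step to perform on this reconcile pass.
// The zone delete is never issued while any vnet link remains, which is
// the condition Azure rejected with CannotDeleteResource.
func nextPhase(r Remaining) Phase {
	if r.FlexibleServer || r.VnetLinks > 0 {
		return DeleteServerAndLinks
	}
	if r.PrivateDNSZone {
		return DeleteDNSZone
	}
	return RemoveFinalizer
}
```

Because each pass recomputes the phase from what still exists, the sequence is safe to retry if a reconcile is interrupted partway through.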

PRs (all merged)

| Change | PR |
| --- | --- |
| Surface FlexibleServer errors on status (ServerNameConflict) | #3724 |
| Order DatabaseServer teardown via finalizer (unstick DNS zone) | #3725 |
| Pin e2e kubectl to a dedicated Kind kubeconfig (dis-pgsql Makefile) | #3720 |
| Same fix in shared Makefile.common | #3721 |

Notes / follow-ups

  • The other standalone operators (dis-identity, dis-vault, dis-apim; lakmus still to be verified) still carry the same e2e kubectl-context bug. This is captured in a follow-up plan and not yet done.
  • An admission webhook for early name-collision guards was considered and declined — validation stays in the reconciler.

🤖 Generated with Claude Code

Labels

area/sre: Issues related to Site Reliability Engineering (Sebastian, Renato, Espen)
kind/bug: Something isn't working
