Problem
Two DatabaseServer reliability issues in dis-pgsql-operator:
- Misleading status on name collisions. Azure PostgreSQL Flexible Server names are globally unique. When two DatabaseServers resolve to the same name, the second fails with ServerNameAlreadyExists, but the status surfaced an unrelated Server Parameter Errors: Owner … cannot be found, because the operator created the config children even though the server itself was what had failed.
- Stuck teardown. Deleting a dedicated DatabaseServer wedged its PrivateDnsZone (CannotDeleteResource … nested resources … virtualNetworkLinks): the zone and its vnet links are sibling children, so GC + ASO deleted them concurrently and Azure rejected the zone delete while links still existed.
While validating on Kind, we also found that the e2e Makefile targets ran kubectl against the caller's current context instead of the Kind cluster they create, so CRDs were applied to the wrong cluster and the run failed with ServiceUnavailable.
What we did
- Gate child reconciliation on the owned FlexibleServer's Ready condition; surface a ServerNameConflict status carrying the real Azure error, and stop creating the misleading config-children. No name mangling: the server name is a contract (the private DNS zone is derived from it).
- Add a finalizer that tears down the FlexibleServer + vnet links first, then the private DNS zone, then drops the finalizer.
- Pin the whole e2e flow to a dedicated Kind kubeconfig (in the standalone Makefile and in the shared Makefile.common).
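
The gating described in the first bullet can be sketched roughly as below. This is a dependency-free illustration, not the operator's code: the `Condition` and `Decision` types, the `decideChildren` helper, the `ServerNotReady` reason, and the assumption that the Azure error code shows up in the condition message are all hypothetical stand-ins for whatever the reconciler actually reads from the owned FlexibleServer.

```go
package main

import (
	"fmt"
	"strings"
)

// Condition is a simplified, hypothetical stand-in for a Kubernetes-style
// status condition reported on the owned FlexibleServer.
type Condition struct {
	Type    string
	Status  string // "True", "False" or "Unknown"
	Reason  string
	Message string
}

// Decision is what the parent reconciler does with its children in this pass
// and what it writes to the DatabaseServer status.
type Decision struct {
	ReconcileChildren bool
	StatusReason      string
	StatusMessage     string
}

// decideChildren gates child reconciliation on the server's Ready condition.
// Children are only reconciled once the server reports Ready=True.
func decideChildren(conds []Condition) Decision {
	for _, c := range conds {
		if c.Type != "Ready" {
			continue
		}
		if c.Status == "True" {
			return Decision{ReconcileChildren: true, StatusReason: "Ready"}
		}
		// Assumption: the Azure error code appears in the condition message.
		if strings.Contains(c.Message, "ServerNameAlreadyExists") {
			return Decision{StatusReason: "ServerNameConflict", StatusMessage: c.Message}
		}
		return Decision{StatusReason: "ServerNotReady", StatusMessage: c.Message}
	}
	// No Ready condition yet: keep waiting and create nothing.
	return Decision{StatusReason: "ServerNotReady", StatusMessage: "no Ready condition reported yet"}
}

func main() {
	d := decideChildren([]Condition{{
		Type:    "Ready",
		Status:  "False",
		Reason:  "Failed",
		Message: "ServerNameAlreadyExists: the server name is already in use",
	}})
	fmt.Println(d.ReconcileChildren, d.StatusReason, d.StatusMessage)
}
```

The point of returning the condition message unchanged is that the status then carries the real Azure error, and no config children exist to produce a secondary "owner cannot be found" error.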
Operator/reconciler + build-tooling only; each operator PR ships an envtest spec.
PRs (all merged)
| Change | PR |
| --- | --- |
| Surface FlexibleServer errors on status (ServerNameConflict) | #3724 |
| Order DatabaseServer teardown via finalizer (unstick DNS zone) | #3725 |
| Pin e2e kubectl to a dedicated Kind kubeconfig (dis-pgsql Makefile) | #3720 |
| Same fix in shared Makefile.common | #3721 |
Notes / follow-ups
- The other standalone operators (dis-identity, dis-vault, dis-apim; verify lakmus) still carry the same e2e kubectl-context bug; this is captured in a follow-up plan, not yet done.
- An admission webhook for early name-collision guards was considered and declined — validation stays in the reconciler.
🤖 Generated with Claude Code