feat: dbaas operator metrics#449
Conversation
|
Cosmetic regression: em-dashes replaced with hyphens across files unrelated to metrics The diff systematically replaces
- // Conditions are NOT cleared here — setCondition upserts each type in place,
+ // Conditions are NOT cleared here - setCondition upserts each type in place,
- // A mismatch is a permanent misconfiguration — no retry.
+ // A mismatch is a permanent misconfiguration - no retry.
- // build the reverse map ("namespace/secretName" → EDBs), then keeps it up to date
+ // build the reverse map ("namespace/secretName" -> EDBs), then keeps it up to date
- // WatchesMetadata avoids caching Secret data (credentials) in operator memory —
+ // WatchesMetadata avoids caching Secret data (credentials) in operator memory -
- // Uses the field index for an O(1) lookup — no full list scan.
+ // Uses the field index for an O(1) lookup - no full list scan.
- setupLog.Errorf("CLOUD_NAMESPACE env var is not set — ownership checks will not work correctly")
+ setupLog.Errorf("CLOUD_NAMESPACE env var is not set - ownership checks will not work correctly")
- // ── Operator namespace ────────────────────────────────────────────────────
+ // Operator namespace.
- // ── dbaas-aggregator client ───────────────────────────────────────────────
+ // dbaas-aggregator client.
- // ── Ownership resolver ────────────────────────────────────────────────────
+ // Ownership resolver.
- // ── NamespaceBinding controller (always enabled) ───────────────────────────
+ // NamespaceBinding controller (always enabled).These changes are unrelated to the metrics feature and inconsistent with the rest of the codebase, where em-dashes and Unicode section dividers are an established convention (see Could you please revert these stylistic edits so the diff stays focused on metrics? It will make this PR much easier to review and prevent the same noise from leaking into future PRs. If your editor is doing the replacement automatically, disabling smart-quotes / smart-dashes for |
|
Use constants for
// handlePollResponse:
r.observeAsyncCompletion(dd, resultSuccess) // constant ✓
r.observeAsyncCompletion(dd, "failed") // literal ✗
r.observeAsyncCompletion(dd, "terminated") // literal ✗
Keeping two of the three values as magic strings means a typo ( Suggestion: define the missing values next to the existing constants in const (
// ... existing result* constants ...
asyncResultFailed = "failed"
asyncResultTerminated = "terminated"
)Then in the controller: r.observeAsyncCompletion(dd, resultSuccess)
r.observeAsyncCompletion(dd, asyncResultFailed)
r.observeAsyncCompletion(dd, asyncResultTerminated)All three label values are then defined in one place, grep-able and refactor-safe. |
|
func (r *ExternalDatabaseReconciler) stampSecretTrigger(key string, startedAt time.Time) {
...
if _, exists := r.secretPropagationStamps[key]; !exists {
r.secretPropagationStamps[key] = startedAt
}
}But the stamp is only consumed on the // Reconcile, success branch only:
if secretStart, ok := r.consumeSecretPropagation(edbKey); ok {
dbaasSecretRotationPropagationSeconds.Observe(time.Since(secretStart).Seconds())
}The only other cleanups (
Scenario:
The memory cost is negligible. The real bug is stale stamps poisoning the next observation: if the same Secret is later rotated successfully an hour later, the histogram records a one-hour propagation, even though the actual rotation took only seconds. Suggestion: drop the propagation stamp on every reconcile exit, not only on success. The cleanest place is the existing defer func() {
if secretStart, ok := r.consumeSecretPropagation(edbKey); ok {
if edb.Status.Phase == dbaasv1.PhaseSucceeded {
dbaasSecretRotationPropagationSeconds.
Observe(time.Since(secretStart).Seconds())
}
// else: failure path — drop the stamp without polluting the histogram
}
}()Invariant becomes: a propagation stamp is either turned into a histogram sample or discarded — never left behind. |
|
The trigger-stamp pattern uses two steps:
If multiple triggers arrive for the same key, the consume/produce order is not guaranteed to match the actual cause of each reconcile. For example: The reconciles still happen correctly — only the Tightening the classification properly (e.g. carrying the cause through the workqueue) is more complexity than the metric is worth. Suggestion: add a short comment on // stampSecretTrigger records that the next reconcile for `key` was
// most likely caused by a Secret change. Classification is best-effort:
// overlapping triggers for the same key may swap labels between
// concurrent reconciles. The metric is informational only.Same shape on |
|
No unit tests for the trigger-stamp lifecycle
This is exactly the kind of code where concurrency / lifecycle bugs are easy to introduce and hard to spot in review — the The stamp maps live directly on Suggested cases (one per stamp helper set):
Same shape for These tests would also serve as living documentation of the intended invariants (e.g. "a propagation stamp is either consumed and observed, or cleared on terminal exit"), which is hard to reconstruct from the code alone. |
|
Follow-up: move aggregator-call instrumentation from controllers into the client Not blocking for this PR, but worth flagging for the next cleanup pass. Right now every aggregator call in every controller carries the same three-line ritual: // externaldatabase_controller.go
aggStart := time.Now()
aggErr := r.Aggregator.RegisterExternalDatabase(ctx, namespace, aggReq)
recordAggregatorCall(controllerEDB, operationRegisterEDB, aggStart, aggErr)
// dbpolicy_controller.go
aggStart := time.Now()
_, aggErr := r.Aggregator.ApplyConfig(ctx, payload)
recordAggregatorCall(controllerDP, operationApplyConfig, aggStart, aggErr)
// databasedeclaration_controller.go (two sites: ApplyConfig + GetOperationStatus)
aggStart := time.Now()
resp, err := r.Aggregator.ApplyConfig(ctx, payload)
recordAggregatorCall(controllerDD, operationApplyConfig, aggStart, err)This spreads an observability concern across business code. Every new controller (DatabaseSecret, designators…) and every new client method has to remember to wrap. The controller doesn't really care what The aggregator client is built on // In AggregatorClient constructor:
client.OnBeforeRequest(func(_ *resty.Client, r *resty.Request) error {
r.SetContext(context.WithValue(r.Context(), startTimeKey{}, time.Now()))
return nil
})
client.OnAfterResponse(func(_ *resty.Client, r *resty.Response) error {
start, _ := r.Request.Context().Value(startTimeKey{}).(time.Time)
op, _ := r.Request.Context().Value(operationKey{}).(string)
ctrl, _ := r.Request.Context().Value(controllerKey{}).(string)
recordAggregatorCall(ctrl, op, start, errFromResp(r))
return nil
})Each client method can attach its own func (c *AggregatorClient) RegisterExternalDatabase(ctx context.Context, ns string, req *ExternalDatabaseRequest) error {
ctx = withOperation(ctx, operationRegisterEDB)
...
}Controllers then pass only the Suggested action: track as a follow-up issue, not a change in this PR. Once a fourth controller appears (DatabaseSecret) the boilerplate becomes noticeable; pulling it down into the client at that point is cheap. |
|
Nit: add an operational note about restart-induced sample loss in The doc already describes the restart behavior:
This explains what happens, but leaves the dashboard / alert author to figure out what to do with that fact. Consider adding a one-line operational note right after that paragraph:
Rationale: a future on-call engineer building an alert like "completion rate dropped below X" will get false positives on every rollout/restart. Calling this out next to the metric — rather than only in the implementation discussion — prevents that. Persisting the submit time on |
|
Dashboard refresh interval is 10s — consider 30s default The dashboard JSON sets: "refresh": "10s"At 19 PromQL expressions and a 10s refresh, every viewer of the dashboard issues ~114 queries/minute against Prometheus. In a production cluster with multiple SREs / app teams keeping the dashboard open, this compounds quickly — and some panels run Most operator dashboards in the broader Kubernetes ecosystem (kube-state-metrics, controller-runtime, cert-manager) default to Suggestion: change the default to - "refresh": "10s",
+ "refresh": "30s",If a tunable refresh ever becomes a real requirement, it can be threaded through Helm later (similar to how |
|
Dashboard does not scope to the operator instance —
json: {{ .Files.Get "..." | toJson | replace "#namespace#" .Values.NAMESPACE }}But the dashboard JSON contains zero occurrences of Suggested fix: add Other metrics (no internal Severity: medium. Single-operator clusters won't notice; multi-operator or blue/green deployments will see cross-talk. Worth fixing before this dashboard ships. |
|
Follow-up: extract the trigger-tracker pattern into a shared type Three controllers ( bindingTriggerMu sync.Mutex
bindingTriggerStamps map[string]struct{}
func (r *Reconciler) stampBindingTrigger(key string) { ... }
func (r *Reconciler) consumeBindingTrigger(key string) bool { ... }
func (r *Reconciler) clearBindingTrigger(key string) { ... }That's ~30 lines × 3 = ~90 lines of duplicate code that differ only in the receiver type. The EDB controller has another near-identical set for the Secret trigger ( Suggestion: extract into a reusable type, e.g. // triggerTracker is a thread-safe set of pending reconcile triggers.
// Stamp before enqueue; consume once at the start of the resulting reconcile.
type triggerTracker struct {
mu sync.Mutex
stamps map[string]struct{}
}
func (t *triggerTracker) stamp(key string) { ... }
func (t *triggerTracker) consume(key string) bool { ... }
func (t *triggerTracker) clear(key string) { ... }Then each controller embeds it: type ExternalDatabaseReconciler struct {
...
bindingTrigger triggerTracker
secretTrigger triggerTracker
// secretPropagation keeps its own type — different semantics (stores time.Time)
}
// call sites become:
r.bindingTrigger.stamp(key)
r.bindingTrigger.consume(key)For Not blocking this PR. Combines nicely with the unit-test ask in another comment: writing tests for one shared type is much cheaper than writing them three times. And the next controller picks up the tested, mutex-correct implementation for free. |
|
Addressed the changes, will be taking up the boilerplate removal on a later change |
|
Duplicate godoc on After the // classifierToFlatMap converts the typed Classifier struct into the flat
// string-to-string map expected by dbaas-aggregator on the wire.
// The aggregator treats the classifier as SortedMap<String, Object> with
// top-level keys: microserviceName, scope, namespace, tenantId, plus any
// additional adapter-specific entries from customKeys merged into the top level.
// classifierToFlatMap converts a typed Classifier into the map expected by
// dbaas-aggregator's ExternalDatabaseRequestV3.classifier (declared on the
// Java side as SortedMap<String, Object>). Scalar fields are emitted as
// top-level keys. customKeys entries are flattened into the same top-level
// map and preserve their native JSON types — strings stay as Go strings,
// numbers as float64, booleans as bool, nested objects/arrays as
// map[string]any / []any. The aggregator stores the classifier as JSONB and
// supports deep value comparison, so nested values are first-class.
//
// Explicit scalar fields take precedence over customKeys entries with the
// same name — this prevents user-supplied customKeys from silently
// overwriting the structured identity fields.
func classifierToFlatMap(c dbaasv1.Classifier) map[string]any {Both blocks describe the same function. Suggestion: keep the second block (it covers more: Java DTO reference, JSONB semantics, native JSON-type preservation, collision precedence) and delete the first. Not strictly part of the metrics work, but the file is touched by this PR so it's the cheapest place to clean it up. |
|
if err := r.Get(ctx, req.NamespacedName, dd); err != nil {
if apierrors.IsNotFound(err) {
r.clearBindingTrigger(req.Namespace + "/" + req.Name) // only binding stamps
}
return ctrl.Result{}, client.IgnoreNotFound(err)
}The NotFound branch clears
Map size is bounded by the number of mid-flight deletions per process lifetime, so this is not a runaway leak, but it's a symmetric mirror of the Suggested fix: add an EDB-style helper if apierrors.IsNotFound(err) {
r.clearBindingTrigger(req.Namespace + "/" + req.Name)
r.clearAsyncStart(req.Namespace + "/" + req.Name)
}And cover it in |
|
The propagation fromSecret := r.consumeSecretTrigger(edbKey)
defer func() {
secretStart, ok := r.consumeSecretPropagation(edbKey)
if ok && edb.Status.Phase == dbaasv1.PhaseSucceeded {
dbaasSecretRotationPropagationSeconds.Observe(time.Since(secretStart).Seconds())
}
}()
// ...
owned, result, err := checkOwnership(...)
if err != nil {
return ctrl.Result{}, err // defer fires here
}The False-positive scenario:
After retry the stamp is gone, so the eventual real success records no sample. Net effect: one wrong short sample + one missing real sample in the histogram. Suggested fix: decouple the "did we actually succeed in this reconcile?" signal from var observePropagation bool
defer func() {
if secretStart, ok := r.consumeSecretPropagation(edbKey); ok && observePropagation {
dbaasSecretRotationPropagationSeconds.
Observe(time.Since(secretStart).Seconds())
}
}()
// ...
// at the end of the success path, right after markSucceeded:
observePropagation = trueInvariant becomes: a stamp is either turned into a sample (and only when this reconcile reached |
|
In all three controllers, fromSecret := r.consumeSecretTrigger(edbKey) // consume
// ...
trigger := triggerSpecChange
if fromSecret {
trigger = triggerSecretChange
} else if r.consumeBindingTrigger(edbKey) { // consume
trigger = triggerNamespaceBindingChange
}
owned, result, err := checkOwnership(...)
if err != nil {
return ctrl.Result{}, err
}
if !owned {
r.clearSecretTrigger(edbKey)
r.clearBindingTrigger(edbKey)
return result, nil
}
recordReconcileTrigger(controllerEDB, trigger)Scenario that produces a misattribution:
This is in the same category as the "overlapping triggers" caveat the stamp helpers already mention — just a different root cause (ownership skip rather than concurrent triggers). The metric is informational, so a logic change isn't needed; the docs just shouldn't omit this case. Suggestion: extend the existing
No code change needed — just keeping the contract honest for future readers. |
|
var dbaasSecretResolutionErrorsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "dbaas_secret_resolution_errors_total",
Help: "Total failures reading credential Secrets referenced by ExternalDatabase.",
},
[]string{"namespace", "reason"},
)What the help text says vs. what the counter actually counts:
The Suggestion: expand the help to be self-describing: Help: "Failures reading credential Secrets referenced by ExternalDatabase, " +
"scoped to namespaces owned by this operator instance. " +
"Labelled by namespace and failure category " +
"(secret_not_found, key_missing, key_empty, forbidden, secret_read_failed).",Same shape would also benefit a couple of the other metrics (for example |
|
{name: "success", err: nil, want: resultSuccess},
{name: "auth", err: &aggregatorclient.AggregatorError{StatusCode: 401}, want: resultAuthError},
{name: "spec rejection", err: &aggregatorclient.AggregatorError{StatusCode: 409}, want: resultSpecRejection},
{name: "server", err: &aggregatorclient.AggregatorError{StatusCode: 500}, want: resultServerError},
{name: "wrapped server", err: fmt.Errorf("wrapped: %w", &aggregatorclient.AggregatorError{StatusCode: 503}), want: resultServerError},
{name: "network", err: errors.New("dial tcp: connection refused"), want: resultNetworkError},The matrix has bare and wrapped The current code handles it correctly via var aggErr *aggregatorclient.AggregatorError
if errors.As(err, &aggErr) { ... }
return resultNetworkErrorBut the test matrix doesn't pin this behaviour. Someone could easily refactor Suggestion: add one more case to the table: {name: "wrapped network", err: fmt.Errorf("connect: %w", errors.New("dial timeout")), want: resultNetworkError}, |
No description provided.