feat: dbaas operator metrics #449

Merged
kichasov merged 8 commits into feat/operator-dev from feat/metrics
May 20, 2026

Conversation

@TheHollowMonarch
Contributor

No description provided.

@kichasov
Collaborator

Cosmetic regression: em-dashes replaced with hyphens across files unrelated to metrics

The diff systematically replaces — (em-dash) with - (hyphen) and strips the // ── Section ────────── block dividers in files that are touched for metrics work. Examples:

internal/controller/externaldatabase_controller.go:

- // Conditions are NOT cleared here — setCondition upserts each type in place,
+ // Conditions are NOT cleared here - setCondition upserts each type in place,

- // A mismatch is a permanent misconfiguration — no retry.
+ // A mismatch is a permanent misconfiguration - no retry.

- // build the reverse map ("namespace/secretName" → EDBs), then keeps it up to date
+ // build the reverse map ("namespace/secretName" -> EDBs), then keeps it up to date

- // WatchesMetadata avoids caching Secret data (credentials) in operator memory —
+ // WatchesMetadata avoids caching Secret data (credentials) in operator memory -

- // Uses the field index for an O(1) lookup — no full list scan.
+ // Uses the field index for an O(1) lookup - no full list scan.

cmd/main.go:

- setupLog.Errorf("CLOUD_NAMESPACE env var is not set — ownership checks will not work correctly")
+ setupLog.Errorf("CLOUD_NAMESPACE env var is not set - ownership checks will not work correctly")

- // ── Operator namespace ────────────────────────────────────────────────────
+ // Operator namespace.

- // ── dbaas-aggregator client ───────────────────────────────────────────────
+ // dbaas-aggregator client.

- // ── Ownership resolver ────────────────────────────────────────────────────
+ // Ownership resolver.

- // ── NamespaceBinding controller (always enabled) ───────────────────────────
+ // NamespaceBinding controller (always enabled).

These changes are unrelated to the metrics feature and inconsistent with the rest of the codebase, where em-dashes and Unicode section dividers are an established convention (see namespacebinding_controller.go, ExternalDatabase tests, design docs). They look like editor smart-dash auto-replace.

Could you please revert these stylistic edits so the diff stays focused on metrics? It will make this PR much easier to review and prevent the same noise from leaking into future PRs. If your editor is doing the replacement automatically, disabling smart-quotes / smart-dashes for .go and .md files would help.

@kichasov
Collaborator

Use constants for result label values, not string literals

metrics.go already defines a clean set of result* constants (resultSuccess, resultAuthError, resultSpecRejection, resultServerError, resultNetworkError) for the result label on dbaas_aggregator_requests_total. But the new async-operation observation in databasedeclaration_controller.go mixes constants and literals:

// handlePollResponse:
r.observeAsyncCompletion(dd, resultSuccess)   // constant ✓
r.observeAsyncCompletion(dd, "failed")        // literal  ✗
r.observeAsyncCompletion(dd, "terminated")    // literal  ✗

operator-metrics.md declares these as a stable contract for dbaas_async_operation_duration_seconds:

Labels: result (success | failed | terminated)

Keeping two of the three values as magic strings means a typo ("failed" vs "falied") will silently emit a wrong label and break the dashboard / any future alert that filters on result=. The compiler won't catch it because any string literal is a valid argument, whereas a misspelled constant name fails to compile.

Suggestion: define the missing values next to the existing constants in metrics.go and use them consistently. For example:

const (
    // ... existing result* constants ...
    asyncResultFailed     = "failed"
    asyncResultTerminated = "terminated"
)

Then in the controller:

r.observeAsyncCompletion(dd, resultSuccess)
r.observeAsyncCompletion(dd, asyncResultFailed)
r.observeAsyncCompletion(dd, asyncResultTerminated)

All three label values are then defined in one place, grep-able and refactor-safe.

@kichasov
Collaborator

secretPropagationStamps leaks for ExternalDatabases that never reach Succeeded

stampSecretTrigger always inserts a start time into secretPropagationStamps on a Secret event:

func (r *ExternalDatabaseReconciler) stampSecretTrigger(key string, startedAt time.Time) {
    ...
    if _, exists := r.secretPropagationStamps[key]; !exists {
        r.secretPropagationStamps[key] = startedAt
    }
}

But the stamp is only consumed on the Succeeded path:

// Reconcile, success branch only:
if secretStart, ok := r.consumeSecretPropagation(edbKey); ok {
    dbaasSecretRotationPropagationSeconds.Observe(time.Since(secretStart).Seconds())
}

The only other cleanups (clearSecretTrigger) fire on NotFound and !owned. Permanent / sustained failure paths are not covered:

  • InvalidConfiguration (e.g. classifier-namespace mismatch, AggregatorRejected 400/403/409)
  • BackingOff indefinitely (sustained 401, SecretError on a broken Secret, etc.)

Scenario:

  1. User updates a Secret → enqueueForSecret stamps the start time.
  2. Reconcile runs but buildRequest fails (e.g. key missing in the Secret) → markTransientFailure → no consumeSecretPropagation.
  3. No further Secret change occurs (the user doesn't realise something is broken). Reconciles quiesce.
  4. The stamp stays in secretPropagationStamps until the CR is deleted.

The memory cost is negligible. The real bug is stale stamps poisoning the next observation: if the same Secret is rotated successfully an hour later, the histogram records a one-hour propagation, even though the actual rotation took only seconds.

Suggestion: drop the propagation stamp on every reconcile exit, not only on success. The cleanest place is the existing defer — observe only when the resource is Succeeded, but always remove the entry from the map.

defer func() {
    if secretStart, ok := r.consumeSecretPropagation(edbKey); ok {
        if edb.Status.Phase == dbaasv1.PhaseSucceeded {
            dbaasSecretRotationPropagationSeconds.
                Observe(time.Since(secretStart).Seconds())
        }
        // else: failure path — drop the stamp without polluting the histogram
    }
}()

Invariant becomes: a propagation stamp is either turned into a histogram sample or discarded — never left behind.

@kichasov
Collaborator

reconcile_trigger_total label may be misattributed under concurrent triggers — document it

The trigger-stamp pattern uses two steps:

  1. enqueueForSecret / enqueueForBinding writes a stamp keyed by namespace/name.
  2. The next Reconcile for that key calls consumeSecretTrigger / consumeBindingTrigger and labels itself accordingly.

If multiple triggers arrive for the same key, the consume/produce order is not guaranteed to match the actual cause of each reconcile. For example:

t0: spec change            → enqueue R1
t1: Secret change          → enqueueForSecret stamps key
t2: R1 starts              → consumeSecretTrigger returns true
                             → R1 labels itself "secret_change"  (wrong: R1 came from spec change)
t3: R2 (actually secret-driven) starts
                             → no stamp → labels itself "spec_change" (wrong)

The reconciles still happen correctly — only the trigger label is swapped. dbaas_reconcile_trigger_total is informational (dashboard panel, not an SLO or alert source), so the impact is limited, but it skews the trigger distribution during periods of overlapping changes.

Tightening the classification properly (e.g. carrying the cause through the workqueue) is more complexity than the metric is worth. Suggestion: add a short comment on stampSecretTrigger and on the corresponding stampBindingTrigger calls explaining the limitation, so future readers don't try to alert on the exact trigger value:

// stampSecretTrigger records that the next reconcile for `key` was
// most likely caused by a Secret change. Classification is best-effort:
// overlapping triggers for the same key may swap labels between
// concurrent reconciles. The metric is informational only.

Same shape on stampBindingTrigger and consumeBindingTrigger. A line in operator-metrics.md under dbaas_reconcile_trigger_total saying "label is best-effort under concurrent triggers" would also help.

@kichasov
Collaborator

No unit tests for the trigger-stamp lifecycle

metrics_test.go covers only the pure classify-functions (aggregatorResult, secretResolutionReason). The genuinely new code in this PR is the set of mutex-protected stamp maps on each reconciler — secretTriggerStamps, secretPropagationStamps, bindingTriggerStamps (the binding variant is replicated across all three controllers).

This is exactly the kind of code where concurrency / lifecycle bugs are easy to introduce and hard to spot in review — the secretPropagationStamps leak under non-Succeeded paths called out in the other comment is a concrete example. Right now the lifecycle is "trust the reviewer's eyes".

The stamp maps live directly on ExternalDatabaseReconciler / DatabaseDeclarationReconciler / DbPolicyReconciler, so they can be tested as plain unit tests — no envtest required.

Suggested cases (one per stamp helper set):

  1. After stampSecretTrigger, consumeSecretTrigger returns true exactly once; subsequent consumes return false.
  2. Repeated stampSecretTrigger for the same key before consume is idempotent (set semantics, not a counter).
  3. secretPropagationStamps preserves the earliest start time across multiple stamps for the same key.
  4. clearSecretTrigger removes both the trigger stamp and the propagation stamp.
  5. (Optional) concurrent stamp / consume exercised under go test -race for sanity.

Same shape for stampBindingTrigger / consumeBindingTrigger / clearBindingTrigger.

These tests would also serve as living documentation of the intended invariants (e.g. "a propagation stamp is either consumed and observed, or cleared on terminal exit"), which is hard to reconstruct from the code alone.

@kichasov
Collaborator

Follow-up: move aggregator-call instrumentation from controllers into the client

Not blocking for this PR, but worth flagging for the next cleanup pass.

Right now every aggregator call in every controller carries the same three-line ritual:

// externaldatabase_controller.go
aggStart := time.Now()
aggErr := r.Aggregator.RegisterExternalDatabase(ctx, namespace, aggReq)
recordAggregatorCall(controllerEDB, operationRegisterEDB, aggStart, aggErr)

// dbpolicy_controller.go
aggStart := time.Now()
_, aggErr := r.Aggregator.ApplyConfig(ctx, payload)
recordAggregatorCall(controllerDP, operationApplyConfig, aggStart, aggErr)

// databasedeclaration_controller.go (two sites: ApplyConfig + GetOperationStatus)
aggStart := time.Now()
resp, err := r.Aggregator.ApplyConfig(ctx, payload)
recordAggregatorCall(controllerDD, operationApplyConfig, aggStart, err)

This spreads an observability concern across business code. Every new controller (DatabaseSecret, designators…) and every new client method has to remember to wrap. The controller doesn't really care what operation or controller label values are; it only knows "I'm calling the aggregator."

The aggregator client is built on resty.Client, which has built-in OnBeforeRequest / OnAfterResponse hooks. The same metric can be emitted from there, once, with no call-site overhead:

// In AggregatorClient constructor:
client.OnBeforeRequest(func(_ *resty.Client, r *resty.Request) error {
    r.SetContext(context.WithValue(r.Context(), startTimeKey{}, time.Now()))
    return nil
})
client.OnAfterResponse(func(_ *resty.Client, r *resty.Response) error {
    start, _ := r.Request.Context().Value(startTimeKey{}).(time.Time)
    op,    _ := r.Request.Context().Value(operationKey{}).(string)
    ctrl,  _ := r.Request.Context().Value(controllerKey{}).(string)
    recordAggregatorCall(ctrl, op, start, errFromResp(r))
    return nil
})

Each client method can attach its own operation label internally:

func (c *AggregatorClient) RegisterExternalDatabase(ctx context.Context, ns string, req *ExternalDatabaseRequest) error {
    ctx = withOperation(ctx, operationRegisterEDB)
    ...
}

Controllers then pass only the controller label via context (or via a small client.For(controllerEDB) wrapper) and become unaware of the metric.

Suggested action: track as a follow-up issue, not a change in this PR. Once a fourth controller appears (DatabaseSecret) the boilerplate becomes noticeable; pulling it down into the client at that point is cheap.

@kichasov
Collaborator

Nit: add an operational note about restart-induced sample loss in operator-metrics.md

The doc already describes the restart behavior:

The start timestamp is kept in memory after the operator submits an async operation. If the operator restarts before the terminal poll result, that operation is still reconciled correctly, but its duration sample is not recorded.

This explains what happens, but leaves the dashboard / alert author to figure out what to do with that fact. Consider adding a one-line operational note right after that paragraph:

Operational note: alerts on this histogram should rely on absolute latency percentiles, not on sample-rate dips. A drop in rate(dbaas_async_operation_duration_seconds_count[…]) after an operator restart usually reflects lost samples, not a slowdown in actual provisioning.

Rationale: a future on-call engineer building an alert like "completion rate dropped below X" will get false positives on every rollout/restart. Calling this out next to the metric — rather than only in the implementation discussion — prevents that.

Persisting the submit time on Status would solve the issue properly, but the trade-off (extra status write per submit, schema field, migration) isn't worth it at the current scale. Re-evaluate if the 2-hour bucket becomes routinely populated.

@kichasov
Collaborator

Dashboard refresh interval is 10s — consider 30s default

The dashboard JSON sets:

"refresh":  "10s"

At 19 PromQL expressions and a 10s refresh, every viewer of the dashboard issues ~114 queries/minute against Prometheus. In a production cluster with multiple SREs / app teams keeping the dashboard open, this compounds quickly — and some panels run increase(...[$__range]) over the full dashboard range, which is not cheap on long retention.

Most operator dashboards in the broader Kubernetes ecosystem (kube-state-metrics, controller-runtime, cert-manager) default to 30s for the same reason. 10s is appropriate for a development setup; for a production audience it means noticeable load and a flickering UI.

Suggestion: change the default to "30s" in the JSON. Operators who want tighter resolution can override per-session in Grafana.

- "refresh":  "10s",
+ "refresh":  "30s",

If a tunable refresh ever becomes a real requirement, it can be threaded through Helm later (similar to how #namespace# is already a placeholder in Dashboard.yaml).
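
The load estimate quoted above is simple arithmetic, sketched here under the assumption that every expression is re-evaluated on each refresh (queriesPerMinute is a hypothetical helper, not part of the operator):

```go
package main

import "fmt"

// queriesPerMinute estimates how many Prometheus queries one open dashboard
// issues per minute, assuming every expression is re-run on each refresh.
func queriesPerMinute(expressions, refreshSeconds int) int {
	return expressions * 60 / refreshSeconds
}

func main() {
	fmt.Println("10s refresh:", queriesPerMinute(19, 10)) // 19 * 6 = 114
	fmt.Println("30s refresh:", queriesPerMinute(19, 30)) // 19 * 2 = 38
}
```

Moving to 30s cuts the per-viewer load to a third, and the saving scales with the number of open browser tabs.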

@kichasov
Collaborator

Dashboard does not scope to the operator instance — #namespace# placeholder is the missing piece

Dashboard.yaml clearly intends per-instance scoping:

json: {{ .Files.Get "..." | toJson | replace "#namespace#" .Values.NAMESPACE }}

But the dashboard JSON contains zero occurrences of #namespace#, so the replace is a no-op and the panels aggregate metrics from every dbaas-operator instance scraped by the same Prometheus. In a multi-tenant cluster, or during a blue/green rollout with two operators in different namespaces, the dashboard mixes their metrics.

Suggested fix: add {namespace="#namespace#"} to every PromQL selector in the JSON. One catch: dbaas_secret_resolution_errors_total already has its own namespace label (the EDB CR's namespace). With the default honor_labels: false, Prometheus renames the colliding metric label to exported_namespace. So queries on that metric should look like:

# before:
sum by (namespace, reason) (rate(dbaas_secret_resolution_errors_total[1m]))

# after:
sum by (exported_namespace, reason) (
  rate(dbaas_secret_resolution_errors_total{namespace="#namespace#"}[1m])
)

Other metrics (no internal namespace label) just need the selector added.

Severity: medium. Single-operator clusters won't notice; multi-operator or blue/green deployments will see cross-talk. Worth fixing before this dashboard ships.

@kichasov
Collaborator

Follow-up: extract the trigger-tracker pattern into a shared type

Three controllers (ExternalDatabaseReconciler, DatabaseDeclarationReconciler, DbPolicyReconciler) now carry the same shape:

bindingTriggerMu     sync.Mutex
bindingTriggerStamps map[string]struct{}

func (r *Reconciler) stampBindingTrigger(key string)   { ... }
func (r *Reconciler) consumeBindingTrigger(key string) bool { ... }
func (r *Reconciler) clearBindingTrigger(key string)   { ... }

That's ~30 lines × 3 = ~90 lines of duplicate code that differ only in the receiver type. The EDB controller has another near-identical set for the Secret trigger (stamp/consume/clearSecretTrigger), adding ~50 more lines of the same shape. The fourth controller (DatabaseSecret) will copy this once again.

Suggestion: extract into a reusable type, e.g. internal/controller/trigger_tracker.go:

// triggerTracker is a thread-safe set of pending reconcile triggers.
// Stamp before enqueue; consume once at the start of the resulting reconcile.
type triggerTracker struct {
    mu     sync.Mutex
    stamps map[string]struct{}
}

func (t *triggerTracker) stamp(key string)   { ... }
func (t *triggerTracker) consume(key string) bool { ... }
func (t *triggerTracker) clear(key string)   { ... }

Then each controller embeds it:

type ExternalDatabaseReconciler struct {
    ...
    bindingTrigger triggerTracker
    secretTrigger  triggerTracker
    // secretPropagation keeps its own type — different semantics (stores time.Time)
}

// call sites become:
r.bindingTrigger.stamp(key)
r.bindingTrigger.consume(key)

For secretPropagationStamps (value is time.Time, not struct{}), either a sibling triggerStampsWithValue[V any] or a small non-generic propagationTracker would do — but only one is needed, so non-generic is probably simpler.

Not blocking this PR. Combines nicely with the unit-test ask in another comment: writing tests for one shared type is much cheaper than writing them three times. And the next controller picks up the tested, mutex-correct implementation for free.
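
A possible shape for the extracted type, with the elided method bodies filled in. This is a sketch rather than the final implementation: exampleReconciler and classify are hypothetical, and the precedence (Secret first, then binding, else spec change) is taken from the consume order used in the EDB controller.

```go
package main

import (
	"fmt"
	"sync"
)

// triggerTracker is a thread-safe set of pending reconcile triggers.
// The zero value is ready to use, so reconcilers can hold it by value.
type triggerTracker struct {
	mu     sync.Mutex
	stamps map[string]struct{}
}

// stamp marks key as pending; repeated stamps are idempotent.
func (t *triggerTracker) stamp(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.stamps == nil {
		t.stamps = make(map[string]struct{})
	}
	t.stamps[key] = struct{}{}
}

// consume reports whether key was pending and removes it.
func (t *triggerTracker) consume(key string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.stamps[key]
	delete(t.stamps, key)
	return ok
}

// clear drops key without reporting whether it was pending.
func (t *triggerTracker) clear(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.stamps, key)
}

// exampleReconciler is a hypothetical reconciler holding two trackers.
type exampleReconciler struct {
	bindingTrigger triggerTracker
	secretTrigger  triggerTracker
}

// classify picks the trigger label: Secret first, then binding, else spec.
func (r *exampleReconciler) classify(key string) string {
	if r.secretTrigger.consume(key) {
		return "secret_change"
	}
	if r.bindingTrigger.consume(key) {
		return "namespace_binding_change"
	}
	return "spec_change"
}

func main() {
	r := &exampleReconciler{}
	r.bindingTrigger.stamp("ns/edb")
	fmt.Println(r.classify("ns/edb")) // binding stamp is consumed here
	fmt.Println(r.classify("ns/edb")) // no stamp left, falls back to spec change
}
```

Because the struct contains a sync.Mutex, the reconciler should be used through a pointer so the trackers are never copied.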

@TheHollowMonarch
Contributor Author

Addressed the changes, will be taking up the boilerplate removal on a later change

@kichasov
Collaborator

Duplicate godoc on classifierToFlatMap — merge artifact

After the feat/operator-dev → feat/metrics merge, the godoc block on classifierToFlatMap is doubled (externaldatabase_controller.go:213–229):

// classifierToFlatMap converts the typed Classifier struct into the flat
// string-to-string map expected by dbaas-aggregator on the wire.
// The aggregator treats the classifier as SortedMap<String, Object> with
// top-level keys: microserviceName, scope, namespace, tenantId, plus any
// additional adapter-specific entries from customKeys merged into the top level.
// classifierToFlatMap converts a typed Classifier into the map expected by
// dbaas-aggregator's ExternalDatabaseRequestV3.classifier (declared on the
// Java side as SortedMap<String, Object>). Scalar fields are emitted as
// top-level keys. customKeys entries are flattened into the same top-level
// map and preserve their native JSON types — strings stay as Go strings,
// numbers as float64, booleans as bool, nested objects/arrays as
// map[string]any / []any. The aggregator stores the classifier as JSONB and
// supports deep value comparison, so nested values are first-class.
//
// Explicit scalar fields take precedence over customKeys entries with the
// same name — this prevents user-supplied customKeys from silently
// overwriting the structured identity fields.
func classifierToFlatMap(c dbaasv1.Classifier) map[string]any {

Both blocks describe the same function. go doc will print both back-to-back, and any future editor will have to guess which one is authoritative.

Suggestion: keep the second block (it covers more: Java DTO reference, JSONB semantics, native JSON-type preservation, collision precedence) and delete the first. Not strictly part of the metrics work, but the file is touched by this PR so it's the cheapest place to clean it up.

@kichasov
Collaborator

asyncStartTimes leaks when a DatabaseDeclaration is deleted mid-flight

databasedeclaration_controller.go:81–87:

if err := r.Get(ctx, req.NamespacedName, dd); err != nil {
    if apierrors.IsNotFound(err) {
        r.clearBindingTrigger(req.Namespace + "/" + req.Name)   // only binding stamps
    }
    return ctrl.Result{}, client.IgnoreNotFound(err)
}

The NotFound branch clears bindingTriggerStamps but not asyncStartTimes. Scenario:

  1. User creates a DatabaseDeclaration.
  2. Reconcile submits, aggregator returns HTTP 202 + trackingID → r.asyncStartTimes[ddKey] = time.Now().
  3. User deletes the CR before the operation completes.
  4. The next reconcile fails on Get → NotFound → clears binding stamp only.
  5. The entry in asyncStartTimes stays in the map until the operator restarts.

Map size is bounded by the number of mid-flight deletions per process lifetime, so this is not a runaway leak, but it's a symmetric mirror of the secretPropagationStamps cleanup we just added for EDB, and the symmetry is worth keeping.

Suggested fix: add an EDB-style helper clearAsyncStart(key) and call it in the NotFound branch:

if apierrors.IsNotFound(err) {
    r.clearBindingTrigger(req.Namespace + "/" + req.Name)
    r.clearAsyncStart(req.Namespace + "/" + req.Name)
}

And cover it in trigger_stamps_test.go alongside the existing TestExternalDatabaseClearSecretTriggerClearsTriggerAndPropagation test.

@kichasov
Collaborator

secretPropagationStamps may record a false positive on transient ownership errors

The propagation defer in externaldatabase_controller.go:90–96 is registered before the ownership check:

fromSecret := r.consumeSecretTrigger(edbKey)
defer func() {
    secretStart, ok := r.consumeSecretPropagation(edbKey)
    if ok && edb.Status.Phase == dbaasv1.PhaseSucceeded {
        dbaasSecretRotationPropagationSeconds.Observe(time.Since(secretStart).Seconds())
    }
}()
// ...
owned, result, err := checkOwnership(...)
if err != nil {
    return ctrl.Result{}, err   // defer fires here
}

The edb.Status.Phase == PhaseSucceeded check looks at the phase as it came from the API server — the previous reconcile's outcome, not this reconcile's outcome. The line edb.Status.Phase = dbaasv1.PhaseProcessing is at line 132, well after the ownership check.

False-positive scenario:

  1. CR is currently Succeeded (previous reconcile finished cleanly).
  2. Secret rotates → stampSecretTrigger(edbKey, t0).
  3. Reconcile starts. consumeSecretTrigger returns true.
  4. checkOwnership returns a transient error (e.g. NamespaceBinding cache not yet warm, i/o timeout listing bindings).
  5. Reconcile returns err; defer runs.
  6. consumeSecretPropagation extracts t0.
  7. edb.Status.Phase == PhaseSucceeded is true (stale, from the previous reconcile).
  8. Observe(time.Since(t0)) records a sample that represents "Secret event → transient ownership error", not a real rotation propagation.

After retry the stamp is gone, so the eventual real success records no sample. Net effect: one wrong short sample + one missing real sample in the histogram.

Suggested fix: decouple the "did we actually succeed in this reconcile?" signal from edb.Status.Phase. Use a local bool that is only set in the success branch, so the defer can both drop the stamp unconditionally and observe only on real success:

var observePropagation bool
defer func() {
    if secretStart, ok := r.consumeSecretPropagation(edbKey); ok && observePropagation {
        dbaasSecretRotationPropagationSeconds.
            Observe(time.Since(secretStart).Seconds())
    }
}()
// ...
// at the end of the success path, right after markSucceeded:
observePropagation = true

Invariant becomes: a stamp is either turned into a sample (and only when this reconcile reached markSucceeded), or dropped. Transient errors on the ownership / pre-success path never feed false samples to the histogram.

@kichasov
Collaborator

consume*Trigger swallows stamps even when the reconcile is skipped on ownership — extend the best-effort disclaimer

In all three controllers, consumeSecretTrigger / consumeBindingTrigger are called before checkOwnership. Example from externaldatabase_controller.go:90–113:

fromSecret := r.consumeSecretTrigger(edbKey)               // consume
// ...
trigger := triggerSpecChange
if fromSecret {
    trigger = triggerSecretChange
} else if r.consumeBindingTrigger(edbKey) {                // consume
    trigger = triggerNamespaceBindingChange
}

owned, result, err := checkOwnership(...)
if err != nil {
    return ctrl.Result{}, err
}
if !owned {
    r.clearSecretTrigger(edbKey)
    r.clearBindingTrigger(edbKey)
    return result, nil
}
recordReconcileTrigger(controllerEDB, trigger)

Scenario that produces a misattribution:

  1. A NamespaceBinding is created for a new namespace.
  2. enqueueForBinding stamps the binding trigger for every CR in that namespace.
  3. The first reconcile fires before the binding is visible in the operator's informer cache.
  4. consumeBindingTrigger returns true and removes the stamp.
  5. checkOwnership returns false (binding not yet visible).
  6. Reconcile returns without calling recordReconcileTrigger — but the stamp is gone.
  7. When the binding lands in the cache, a fresh reconcile runs and is labelled spec_change instead of namespace_binding_change.

This is in the same category as the "overlapping triggers" caveat the stamp helpers already mention — just a different root cause (ownership skip rather than concurrent triggers). The metric is informational, so a logic change isn't needed; the docs just shouldn't omit this case.

Suggestion: extend the existing stampSecretTrigger / stampBindingTrigger godocs and the operator-metrics.md entry for dbaas_reconcile_trigger_total with one more bullet. Something like:

In addition, a stamp is consumed even when the reconcile is later skipped on namespace-ownership (e.g. the matching NamespaceBinding has not yet propagated to the informer cache). The follow-up reconcile that does run will fall back to spec_change.

No code change needed — just keeping the contract honest for future readers.

@kichasov
Collaborator

dbaas_secret_resolution_errors_total help text is imprecise

metrics.go:78–84:

var dbaasSecretResolutionErrorsTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "dbaas_secret_resolution_errors_total",
        Help: "Total failures reading credential Secrets referenced by ExternalDatabase.",
    },
    []string{"namespace", "reason"},
)

What the help text says vs. what the counter actually counts:

Help text says | Actual behaviour
"Total failures" | Only failures in namespaces owned by this operator instance: the counter increments after checkOwnership.
(no reason enumeration) | The reason label takes one of five values defined in metrics.go:51–55: secret_not_found, key_missing, key_empty, forbidden, secret_read_failed.

The Help field is the only documentation a downstream user gets when scraping /metrics, in promtool, or on Grafana panel info popups. Both omissions are small surprises that hurt observability hygiene.

Suggestion: expand the help to be self-describing:

Help: "Failures reading credential Secrets referenced by ExternalDatabase, " +
      "scoped to namespaces owned by this operator instance. " +
      "Labelled by namespace and failure category " +
      "(secret_not_found, key_missing, key_empty, forbidden, secret_read_failed).",

Same shape would also benefit a couple of the other metrics (for example dbaas_aggregator_requests_total could enumerate the result values), but this one is the most surprising because of the scope qualifier.

@kichasov
Collaborator

metrics_test.go lacks a wrapped-network-error case

metrics_test.go:11–32 covers aggregatorResult reasonably:

{name: "success",        err: nil,                                                            want: resultSuccess},
{name: "auth",           err: &aggregatorclient.AggregatorError{StatusCode: 401},             want: resultAuthError},
{name: "spec rejection", err: &aggregatorclient.AggregatorError{StatusCode: 409},             want: resultSpecRejection},
{name: "server",         err: &aggregatorclient.AggregatorError{StatusCode: 500},             want: resultServerError},
{name: "wrapped server", err: fmt.Errorf("wrapped: %w", &aggregatorclient.AggregatorError{StatusCode: 503}), want: resultServerError},
{name: "network",        err: errors.New("dial tcp: connection refused"),                     want: resultNetworkError},

The matrix has bare and wrapped AggregatorErrors, but only the bare generic error variant. The wrapped non-AggregatorError path — which is the realistic shape for transport / network failures coming out of resty (e.g. fmt.Errorf("post: %w", netErr)) — is not exercised.

The current code handles it correctly via errors.As falling through:

var aggErr *aggregatorclient.AggregatorError
if errors.As(err, &aggErr) { ... }
return resultNetworkError

But the test matrix doesn't pin this behaviour. Someone could easily refactor aggregatorResult (e.g. add an errors.As(err, &net.OpError{}) branch with the wrong priority) and silently start mislabelling. Five lines of defensive coverage make the contract explicit.

Suggestion: add one more case to the table:

{name: "wrapped network", err: fmt.Errorf("connect: %w", errors.New("dial timeout")), want: resultNetworkError},

@kichasov kichasov merged commit 9a95ad3 into feat/operator-dev May 20, 2026
6 checks passed
@kichasov kichasov deleted the feat/metrics branch May 20, 2026 10:15

2 participants