[BUG] Consul Is Degraded Alert and Consul Is Down Alert are not working correctly #165

@KryukovaPolina

Describe the bug

ConsulIsDegradedAlarm
Query

sum(kube_pod_status_ready{exported_namespace="{{ .Release.Namespace }}",exported_pod=~"{{ template "consul.fullname" . }}-server-[0-9]+",condition="false"}) / sum(kube_pod_status_ready{exported_namespace="{{ .Release.Namespace }}",exported_pod=~"{{ template "consul.fullname" . }}-server-[0-9]+"}) > 0

https://github.com/Netcracker/qubership-consul/blob/main/charts/helm/consul-service/templates/prometheus-rules.yaml#L33

The ConsulIsDegradedAlarm alert works incorrectly: it does not always trigger, and when it does, it stays active for too short a time. As a result, problems with Consul may go unnoticed.

Specific cases of incorrect behavior:

  1. When deleting one of the Consul pods:

    • If Consul quickly recovers the pod, the alert may not trigger at all.
    • If a pod is deleted, the alert triggers only while the pod is in Terminating status.
  2. When there are errors in a pod:

    • If a pod restarts due to an error, the alert behaves unreliably and triggers only briefly during the restart.
  3. When scaling the StatefulSet to 2:

    • If the number of Consul pods is reduced to 2, the alert will only trigger during the termination of the third pod.
    • After the third pod is deleted, the calculation changes: 0 unready pods / 2 total pods = 0, and the alert stops triggering, even though the Consul cluster is in a degraded state (Consul requires 3 servers).
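The arithmetic behind the cases above can be checked with a small model. This is only an illustrative Python sketch, not part of the chart: the names `unready_ratio` and `degraded_alert_fires` are hypothetical, and it assumes that each existing pod reports exactly one active ready condition ("true" or "false"), so the query reduces to unready pods / existing pods.

```python
from typing import Optional, Sequence


def unready_ratio(pod_conditions: Sequence[str]) -> Optional[float]:
    """Model of sum(condition="false") / sum(all conditions).

    Each entry is the ready condition of one EXISTING pod.
    Returns None when no pods exist (assumed to mirror an empty query result).
    """
    total = len(pod_conditions)
    if total == 0:
        return None
    unready = sum(1 for c in pod_conditions if c == "false")
    return unready / total


def degraded_alert_fires(pod_conditions: Sequence[str]) -> bool:
    """Model of the ConsulIsDegradedAlarm condition: ratio > 0."""
    ratio = unready_ratio(pod_conditions)
    return ratio is not None and ratio > 0


scenarios = {
    # Third pod is still terminating: 1 unready / 3 existing.
    "scale to 2, third pod terminating": ["true", "true", "false"],
    # Third pod is gone: 0 unready / 2 existing.
    "scale to 2, third pod deleted": ["true", "true"],
}

for name, conditions in scenarios.items():
    print(name, unready_ratio(conditions), degraded_alert_fires(conditions))
```

In this model the alert condition holds only while the third pod still exists in an unready state. Once the pod is gone, the ratio drops to 0 even though only 2 of the required 3 servers remain.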

ConsulIsDownAlarm
Query

sum(kube_pod_status_ready{exported_namespace="{{ .Release.Namespace }}",exported_pod=~"{{ template "consul.fullname" . }}-server-[0-9]+",condition="false"}) / sum(kube_pod_status_ready{exported_namespace="{{ .Release.Namespace }}",exported_pod=~"{{ template "consul.fullname" . }}-server-[0-9]+"}) == 1

Specific cases of incorrect behavior:

  1. When deleting one of the Consul pods:

    • In most cases, Consul quickly recovers the pod, so the alert does not trigger.
  2. When scaling the StatefulSet to 0:

    • The alert triggers only if each of the 3 pods is not in Running status but still exists.
    • If you scale the StatefulSet to 0, ConsulIsDownAlarm triggers only for a short time.
    • Once all 3 pods no longer exist, this alert becomes inactive and ConsulDoesNotExistAlarm is triggered.
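The same reasoning can be sketched for the `== 1` condition. Again, this is a hypothetical Python model rather than anything from the chart: `down_alert_fires` is an illustrative name, and the sketch assumes that when no matching pods exist the query returns no result, so the alert is inactive.

```python
from typing import Sequence


def down_alert_fires(pod_conditions: Sequence[str]) -> bool:
    """Model of the ConsulIsDownAlarm condition: unready / existing == 1.

    Each entry is the ready condition of one EXISTING pod.
    With no pods there is nothing to divide, so the alert is
    treated as inactive (assumption: empty query result).
    """
    total = len(pod_conditions)
    if total == 0:
        return False
    unready = sum(1 for c in pod_conditions if c == "false")
    return unready / total == 1


scenarios = {
    # One pod deleted and not yet ready again: 1 / 3.
    "one pod unready": ["true", "true", "false"],
    # Scale to 0, all three pods still terminating: 3 / 3.
    "scale to 0, pods terminating": ["false", "false", "false"],
    # Scale to 0, all pods deleted: no series left.
    "scale to 0, pods deleted": [],
}

for name, conditions in scenarios.items():
    print(name, down_alert_fires(conditions))
```

Under these assumptions the condition holds only in the short window where every pod still exists and none is ready, which matches the brief triggering described above.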

Expected behavior
Alerts trigger reliably and correctly reflect problems with Consul.

Environment:

  • Application Version: main
  • K8S Version:
