Describe the bug
ConsulIsDegradedAlarm
Query
sum(kube_pod_status_ready{exported_namespace="{{ .Release.Namespace }}",exported_pod=~"{{ template "consul.fullname" . }}-server-[0-9]+",condition="false"}) / sum(kube_pod_status_ready{exported_namespace="{{ .Release.Namespace }}",exported_pod=~"{{ template "consul.fullname" . }}-server-[0-9]+"}) > 0
https://github.com/Netcracker/qubership-consul/blob/main/charts/helm/consul-service/templates/prometheus-rules.yaml#L33
ConsulIsDegradedAlarm alert works incorrectly - it does not always trigger and triggers for too short a duration. Problems with Consul may go unnoticed.
Specific cases of incorrect behavior:
-
When deleting one of the Consul pods:
- If Consul quickly recovers the pod, the alert may not trigger at all.
- If pod is deleted, the alert only triggers for the time the pod is in Terminating status.
-
When there are errors in a pod:
- If a pod restarts due to an error, the alert may operate unstably, triggering only briefly during the restart.
-
When scaling the StatefulSet to 2:
- If the number of Consul pods is reduced to 2, the alert will only trigger during the termination of the third pod.
- After the third pod is deleted, the calculation changes: 0 unready pods / 2 total pods = 0, and the alert stops triggering, even though the Consul cluster is in a degraded state (Consul requires 3 servers).
ConsulIsDownAlarm
Query
sum(kube_pod_status_ready{exported_namespace="{{ .Release.Namespace }}",exported_pod=~"{{ template "consul.fullname" . }}-server-[0-9]+",condition="false"}) / sum(kube_pod_status_ready{exported_namespace="{{ .Release.Namespace }}",exported_pod=~"{{ template "consul.fullname" . }}-server-[0-9]+"}) == 1
-
When deleting one of the Consul pods:
- In most cases, Consul quickly recovers pods
-
When scaling the StatefulSet to 0:
- it is triggered only if EACH of the 3 pods is not in the Running status, but still exists.
- If you set the statefullset to 0, the ConsulIsDownAlarm triggers for a short time.
- After 3 pods do not exist, this alert will become inactive, and ConsulDoesNotExistAlarm will be triggered.
Expected behavior
Alerts work stably and correctly show the consul's problems
Screenshots
If applicable, add screenshots to help explain your problem.
Environment:
- Application Version: main
- K8S Version:
Additional context
Add any other context about the problem here.
Describe the bug
ConsulIsDegradedAlarm
Query
https://github.com/Netcracker/qubership-consul/blob/main/charts/helm/consul-service/templates/prometheus-rules.yaml#L33
ConsulIsDegradedAlarm alert works incorrectly - it does not always trigger and triggers for too short a duration. Problems with Consul may go unnoticed.
Specific cases of incorrect behavior:
When deleting one of the Consul pods:
When there are errors in a pod:
When scaling the StatefulSet to 2:
ConsulIsDownAlarm
Query
When deleting one of the Consul pods:
When scaling the StatefulSet to 0:
Expected behavior
Alerts work stably and correctly show the consul's problems
Screenshots
If applicable, add screenshots to help explain your problem.
Environment:
Additional context
Add any other context about the problem here.