Skip to content

docs: KEEP-573 four-dashboard observability spec for review#1318

Open
OleksandrUA wants to merge 2 commits into
stagingfrom
KEEP-573-four-dashboard-spec
Open

docs: KEEP-573 four-dashboard observability spec for review#1318
OleksandrUA wants to merge 2 commits into
stagingfrom
KEEP-573-four-dashboard-spec

Conversation

@OleksandrUA
Copy link
Copy Markdown

Draft of the four-dashboard observability spec called out in KEEP-573 Phase 1 E.

Splits the current single "KeeperHub" dashboard by audience: SLO exec view, platform on-call view, per-org customer-support view, growth/revenue view. Each section in the doc gives variables, panel list with PromQL/LogQL queries, linked alerts, suggested owner, open questions.

Closes the last unticked Phase 1 acceptance item on the parent ticket. Implementation lands in follow-up PRs in grafana/keeperhub-dashboards/git-sync/ (new path adopted 2026-05-20).

Review asks

  • Sanity-check the panel list per dashboard - did I miss anything the audience actually needs?
  • The five "open questions" at the bottom of the doc need decisions before Phase 2 starts.
  • Owner suggestions at the bottom of each dashboard section - confirm or reassign.

Test plan

  • Spec reviewed by Simon
  • Open questions resolved
  • Owners finalized per dashboard

Draft spec for the four-dashboard rebuild called out in KEEP-573 Phase 1 E.
Splits the current single "KeeperHub" dashboard by audience:

- A. Managed Client SLO (exec + Sky/Ajna account team)
- B. Platform Health (TechOps/DevOps on-call)
- C. Customer Workflows (per-org support debugging)
- D. Growth + Revenue (founders + revenue side)

Each dashboard section: variables, panel list with PromQL/LogQL,
linked alerts, owner suggestion, open questions. Implementation plan
targets the new grafana git-sync path adopted on 2026-05-20, with
files landing under grafana/keeperhub-dashboards/git-sync/.

Open questions for the review at the bottom of the doc.

Closes the Phase 1 E acceptance item on the parent ticket. Owners
finalized in this PR's review.
Bulk error_type reclassification (backfill re-run, classifier rule
change, manual SQL fix) makes the DB-sourced gauge's error_type
label-value series move at one scrape. PromQL's increase() reads the
gain as new errors while the loss is treated as a counter reset,
producing a phantom positive bump that contaminates SLI panels for
one window-length.

Documents the symptom + how to recognise it so future engineers don't
chase a phantom incident, and references the KEEP-592 analysis.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant