O-11: live Gatus status badges in repo README#47
Merged
…(O-11) Replaces the static "Active / Dev" Applications table with live status + 7d-uptime badges sourced from Gatus at dev-status.home-0ps.com (already Cloudflare-tunnel-exposed). Adds an Infrastructure block for TrueNAS + UniFi so all 6 endpoints monitored by Gatus surface on the repo landing page. Drops the Syncthing row (spun down 2026-05-15) and updates a stale `stern -n syncthing` example in the kubeop docs to a live namespace. Note: GitHub camo caches external SVGs ~5min — README badges lag the live dashboard by that window. Out of scope to fix.
alexrf45 added a commit that referenced this pull request on May 28, 2026
## Closes
**O-11 (Gatus endpoint statuses in repo README)** from the open-items punch list.
Source:
[_docs/reviews/home-0ps-review-2026-05-27.md](../blob/dev/_docs/reviews/home-0ps-review-2026-05-27.md#observability-follow-ups)
line 121.
## What's in this PR
- New `## Live status` block in `README.md` with two sub-tables
(`### Applications`, `### Infrastructure`) surfacing all 6
Gatus-monitored endpoints together
- **Hybrid badge format:** shields.io endpoint (style-consistent with
the rest of the README's shields) for the *Status* column, direct Gatus
SVG (`/uptimes/7d/badge.svg`) for the *Uptime (7d)* column; the two
carry different signals and both are useful
- Removed the stale **Syncthing** row from the `## Applications` table
(spun down 2026-05-15)
- **Bonus:** found and fixed a second stale Syncthing reference in
`.claude/rules/kube-wrapper.md` (a `k8sop dev stern -n syncthing`
example, swapped to `freshrss`)
- The ~5-minute GitHub camo cache caveat is documented inline in the
README's Live status section (not just the PR body), so visitors who
see stale badge state will find the explanation
## Gatus key normalization confirmed live
```
curl -sf https://dev-status.home-0ps.com/api/v1/endpoints/statuses | jq -r '.[].key'
applications_authentik
applications_freshrss
applications_grafana
applications_homer
infrastructure_truenas
infrastructure_unifi
```
Rule: `<group>_<lowercase(name)>`, e.g. TrueNAS → truenas, UniFi → unifi.
## Acceptance test
CI runs `.claude/sprints/o-11/accept.sh` via
`.github/workflows/sprint-accept.yml`. Local pass:
```text
[accept:O-11] README.md present
[accept:O-11] all 6 Gatus endpoint badge URLs present
[accept:O-11] dev-status.home-0ps.com references: 7
[accept:O-11] Syncthing reference removed
[accept:O-11] all 6 endpoint tokens surfaced in README copy
[accept:O-11] markdownlint not installed — skipping
[accept:O-11] PASS
```
## Public-surface notes
- Public hostname `dev-status.home-0ps.com` is Cloudflare-tunnel-exposed
and rate-limited (60 req/min) per `terraform/cloudflare-tunnel/main.tf`
G2, so no infra changes are needed for this PR
- GitHub camo proxies and caches the badge SVGs for ~5min, so badge
state on the README lags the live dashboard by that window. Acceptable
for repo-visitor signal
## Worktree
`/tmp/sprints/o-11` (clean up after merge: `git worktree remove
/tmp/sprints/o-11 && git branch -D sprint/o-11`)
## Generated by
`/sprint-orchestrate H-3 O-9 O-11`, wave 2 (parallel with O-9, which is
still running in a separate worktree).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
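The key rule quoted in the O-11 notes above, `<group>_<lowercase(name)>`, can be sketched as a small shell helper. This is an illustration only: the function names `gatus_key` and `uptime_badge_url` are hypothetical, the `/api/v1/endpoints/<key>` prefix for the badge path is assumed from the statuses URL in the same notes, and Gatus may apply further sanitisation to names that this sketch does not cover.

```shell
#!/bin/sh
# Sketch only: implements the rule exactly as stated in the PR text,
# <group>_<lowercase(name)>. Any extra character handling Gatus performs
# is NOT reproduced here.

GATUS_BASE="https://dev-status.home-0ps.com"

# gatus_key GROUP NAME -> prints "<group>_<lowercase(name)>"
gatus_key() {
  printf '%s_%s\n' "$1" "$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
}

# uptime_badge_url KEY -> prints the 7-day uptime badge URL for KEY.
# The /api/v1/endpoints/<key> prefix is an assumption, not taken from the PR.
uptime_badge_url() {
  printf '%s/api/v1/endpoints/%s/uptimes/7d/badge.svg\n' "$GATUS_BASE" "$1"
}

gatus_key infrastructure TrueNAS
uptime_badge_url "$(gatus_key infrastructure UniFi)"
```

Comparing the printed keys against the `jq -r '.[].key'` output is a quick way to catch a mismatch before a badge URL is committed to the README.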
alexrf45 added a commit that referenced this pull request on May 28, 2026
## Closes
**O-9 (App-level dashboards/alerts)** from the open-items punch list.
Source:
[_docs/reviews/home-0ps-review-2026-05-27.md](../blob/dev/_docs/reviews/home-0ps-review-2026-05-27.md#observability-follow-ups)
line 119.
## What's in this PR
- **Authentik HR:** enable `metrics.enabled: true` on both `server` and
`worker` (chart provisions the two metrics Services). **Chart-side
ServiceMonitor stays disabled** — ownership of ServiceMonitors is
centralised in `_lib/observability/kube-prometheus-stack/` (same pattern
as falco), so cardinality/relabel changes happen in one place.
- **`servicemonitor-authentik.yaml`** (new): cross-namespace selects
both `authentik-server-metrics` + `authentik-worker-metrics` Services
via one `matchExpressions` block. 60s scrape interval.
- **4 dashboard ConfigMaps** under
`_lib/observability/kube-prometheus-stack/dashboards/` —
sidecar-discovered via `grafana_dashboard: "1"` label:
- **authentik** — community grafana.com/14837 r2 (beryju), normalised
(stripped `__inputs`/`__requires`, replaced `${DS_PROMETHEUS}` with
Prometheus datasource, cleared `id`)
- **cloudflared** — community dashboard, same normalisation
- **gatus** — hand-authored — uptime & latency panels
- **freshrss** — hand-authored — service-health panels
- **`.yamllint.yaml`** updated: ignore the 2 community-sourced dashboard
YAMLs (embedded markdown in panel `content:` blocks exceeds the 300-char
line cap; pinned to specific revisions per file header, not hand-edited)
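The normalisation applied to the community dashboards (strip `__inputs`/`__requires`, replace `${DS_PROMETHEUS}`, clear `id`) might look roughly like the filter below. It is a sketch under assumptions: the function name `normalise_dashboard` is hypothetical, the replacement value defaults to the literal string `Prometheus` (the PR does not say whether a datasource name or UID was used), and it assumes `jq` and `sed` are available and that the dashboard is exported as JSON.

```shell
#!/bin/sh
# Sketch only: not the script used in the PR.
# normalise_dashboard [DATASOURCE] reads dashboard JSON on stdin and writes
# the normalised JSON to stdout.
#   1. drops the grafana.com import metadata (__inputs, __requires)
#   2. clears the numeric dashboard id
#   3. replaces the ${DS_PROMETHEUS} placeholder with DATASOURCE
#      (default "Prometheus" is an assumed value)
normalise_dashboard() {
  ds="${1:-Prometheus}"
  jq 'del(.__inputs, .__requires) | .id = null' \
    | sed "s/\\\${DS_PROMETHEUS}/${ds}/g"
}
```

A typical call would redirect a downloaded export into the function and capture stdout as the JSON that is then embedded in the dashboard ConfigMap.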
## Scope discipline
**O-10 (postgres-exporter + DB-content panels) is a separate sprint.**
Dashboards in this PR show *operational* metrics only — request rates,
latency, pod health, scrape targets. Custom DB-content queries (freshrss
unread/favorites, authentik login stats) wait for the postgres-exporter
rollout in O-10.
## Acceptance test
CI runs `.claude/sprints/o-9/accept.sh` via
`.github/workflows/sprint-accept.yml`. Local pass:
```text
[accept:O-9] yamllint 8 files
[accept:O-9] kubectl kustomize _lib/observability/kube-prometheus-stack
[accept:O-9] assert: ServiceMonitor 'authentik' in obs render
[accept:O-9] assert: dashboard ConfigMaps present
[accept:O-9] found: dashboard-authentik
[accept:O-9] found: dashboard-cloudflared
[accept:O-9] found: dashboard-gatus
[accept:O-9] found: dashboard-freshrss
[accept:O-9] assert: every dashboard ConfigMap has label grafana_dashboard=1
[accept:O-9] assert: Authentik HR has server.metrics.enabled == true
[accept:O-9] assert: Authentik HR has worker.metrics.enabled == true
[accept:O-9] assert: Authentik HR server.metrics.serviceMonitor.enabled == false
[accept:O-9] assert: ServiceMonitor 'authentik' targets authentik ns
[accept:O-9] PASS
```
## Post-merge manual checks (not in accept.sh)
- `kube dev -n monitoring exec deploy/prometheus-kps-prometheus-0 --
wget -qO- localhost:9090/api/v1/targets | grep authentik` — authentik
metrics targets up (server + worker)
- Grafana sidebar: 4 new dashboards visible (Authentik, Cloudflared,
Gatus, FreshRSS)
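For the first manual check, `grep authentik` confirms the targets exist but not that they are healthy. A hedged alternative is to filter the targets JSON with `jq`. The function name `authentik_target_health` is hypothetical, and the sketch assumes the scrape job label contains the string `authentik`, which the PR does not confirm.

```shell
#!/bin/sh
# Sketch only. Reads a Prometheus /api/v1/targets response on stdin and
# prints one "<job> <health>" line per active target whose job label
# contains "authentik" (assumed naming; adjust the pattern if jobs differ).
authentik_target_health() {
  jq -r '.data.activeTargets[]
         | select(.labels.job | test("authentik"))
         | "\(.labels.job) \(.health)"'
}
```

Piping the output of the `wget` call from the check above into this function should list the server and worker targets, each with `up` when scraping works.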
## Background — subagent session-limit + recovery
This PR was driven by the `/sprint-orchestrate` parallel executor as
wave 2 alongside O-11. The O-9 subagent **completed the implementation**
but hit the global session limit before committing — the work was left
uncommitted in `/tmp/sprints/o-9`. The orchestrator (me) verified the
agent's edits (architecture, ServiceMonitor pattern, dashboard
provenance, yamllint workaround), committed them as a single logical
commit, rebased, ran the acceptance test, and opened this PR.
**Also during recovery:** discovered that PRs #46 (H-3) + #47 (O-11) had
been merged on GitHub but were missing from `origin/dev` — looks like a
force-push to dev clobbered the merge commits. Recovered both via `git
cherry-pick -m 1` of the GitHub merge commits (`9b6881f`, `290677a`)
onto current dev, then pushed `eb197cf` + `6f51280`. O-9 was rebased
onto the now-correct dev tip before this PR was opened.
## Worktree
`/tmp/sprints/o-9` (clean up after merge: `git worktree remove
/tmp/sprints/o-9 && git branch -D sprint/o-9`)
🤖 Generated with [Claude Code](https://claude.com/claude-code)