Derived from a competitor sweep of Uptime Kuma, Gatus, Healthchecks.io, OneUptime, Statping-ng, and Better Stack. Every phase below is a self-contained slice that can be merged independently.
Ordering is by (impact × audience expectation) ÷ implementation effort, not personal preference. Phase 1 is the single most-requested feature across all competitors; phases 1-4 together close ~80% of the perception gap with Uptime Kuma.
- Work proceeds phase by phase, top to bottom.
- After every phase the app is restarted on port 3001 and smoke-tested with
curlbefore moving on. - Every phase adds/modifies the DB schema in a backward-compatible way (
CREATE TABLE IF NOT EXISTS,ALTER TABLE ADD COLUMNguarded by a column-exists check). No destructive migrations. - Every phase keeps the project's core promise: simple, no Docker required, one process, MIT. Features that violate that promise are explicitly out of scope (last section).
Why first: every competitor ships it; users assume it exists.
- New
/statusroute (no auth) and/status.rssAtom feed. - New
status_page_groupsandstatus_page_settingstables (groups = ordered named buckets of monitors). - Per-group / per-monitor display name override (
display_nameonsites). - 90-day daily bar (
uptime%per day) + current state + last 24h MTTR. - Optional
STATUS_PAGE_PUBLIC=trueenv; if false, page requires a?token=query param matched againstSTATUS_PAGE_TOKEN. - Branded with existing whitelabel vars; clean / fast / SSR (no SPA).
- Per-monitor "exclude from status page" flag for internal-only monitors.
Why high: doubles your channel count with mostly thin adapters.
Each new channel is a new type in CHANNEL_TYPES with its own sanitizeConfig and dispatchToChannel branch:
- Slack (incoming webhook URL; reuses Discord embed-ish structure)
- Telegram (bot token + chat ID)
- Ntfy.sh (topic URL, optional auth header, priority + tags)
- Gotify (server URL + app token + priority)
- Pushover (user key + app token + priority)
- Mattermost (incoming webhook; Slack-compatible body)
- Microsoft Teams (incoming webhook; adaptive-card JSON)
All seven use the existing per-event {{placeholder}} template engine.
Why high: every paid competitor surfaces this. Roughly 100 LOC because undici exposes the TLS socket.
- On every successful HTTPS probe, capture
cert_issuer,cert_subject,cert_valid_to,cert_days_remaining,tls_version(stored on the latestchecksrow + denormalized tosites.last_cert_days_remaining). - New monitor type
certthat runs only the TLS handshake (no HTTP body), useful for ports other than 443 (SMTPS, IMAPS, custom). - Two new failure modes: cert expires in < N days (default 14) → channels fire a new event
cert_expiring. Cert already expired →down. - Per-monitor override for the "warn at N days" threshold.
- Display the cert expiry pill on the dashboard card and full info on the detail page.
Why high: closes the "is this actually a monitor?" sniff test against Uptime Kuma.
- TCP —
net.Socketconnect to host:port within timeout. Optional "expected banner contains" substring assertion (saved inexpected_string).src/lib/tcp.js. - Ping (ICMP) — shells out to
/bin/ping -c<n> -W<sec>(no extra deps, no CAP_NET_RAW). Returns avg RTT as response time; reports partial loss.src/lib/ping.js. - DNS record —
dns.promises.Resolver-based lookup for A / AAAA / CNAME / MX / TXT / NS / SRV / CAA / SOA / PTR. Optional custom resolver host[:port].dns_expectedaccepts a substring (case-insensitive) or/pattern/flagsregex against the joined RRset.src/lib/dnscheck.js. - Keyword (regex) — new
check_type = 'regex'for active HTTP monitors. Pattern can be bare (defaults to flagi) or/pattern/flags. Sits alongsidestatus,string,json.
Schema additions on sites (all nullable, idempotent migration):
tcp_host, tcp_port, ping_host, ping_count, dns_query, dns_record_type, dns_resolver, dns_expected.
Dashboard: filter pills for TCP / Ping / DNS, type-aware meta line on every card.
Why high: every operator hits this on day-one of production use.
- New
maintenance_windowstable (id, name, enabled, scope, scope_value, kind, starts_at, ends_at, cron, duration_minutes, timezone, suppress_notifications, pause_probes). - Two scopes: global (all monitors) or monitor (one specific monitor).
- Two kinds: one-off (explicit start/end timestamps) or recurring (cron + duration in minutes, evaluated in a timezone via
cron-parser). src/lib/maintenance.js:windowIsActive,isActive(siteId),isActiveGlobal(),isAlertSuppressed,isProbeSuppressed,nextOccurrence,currentEnd,summarize. 15-second in-memory cache, invalidated on writes.src/notifier.jswraps every channel dispatch inisAlertSuppressed— suppressed events lognotifier.suppressed_by_maintenancefor audit and skip the channel call.src/monitor.jschecksisProbeSuppressedbefore each probe tick. Heartbeats are unaffected.- New incidents get
during_maintenance=1if opened inside a window (denormalized column onincidents). - Settings page
/settings/maintenance(CRUD + toggle + delete) with full timezone picker. - Dashboard banner whenever any window is currently active.
Why medium: lowers the friction of running 50+ monitors.
tags(id, name, color, timestamps) +site_tags(site_id, tag_id) join table.src/lib/tags.js:listTags,createTag,updateTag,deleteTag(cascades to site_tags),setSiteTags,listSiteTags,tagsForSites(bulk Map<siteId, tag[]>),attachToSites,detachFromSites./settings/tagsadmin UI: create with color picker, edit name/color inline (HTML5formattr), delete with confirmation. 12-colour Tabler palette.- Dashboard: tag chip row under the filter bar (clicking a tag toggles it as the
?tag=filter), tag chips rendered on each monitor card. - Monitor form: tag multi-select with the existing tags shown as toggleable badges.
- Bulk actions: per-card checkbox + sticky bottom action bar with Pause / Resume / Delete / Add tag / Remove tag.
POST /sites/bulkwithaction,site_ids[], optionaltag_id. URL state is preserved across the redirect.
Why medium: brings parity with Healthchecks.io, which is the dominant tool in this niche.
- New
sites.heartbeat_schedule_kind(interval|cron),heartbeat_cron,heartbeat_timezone. evaluateHeartbeatnow branches on schedule kind: interval keeps the existinginterval + gracecheck; cron usescron-parserto compute the next expected occurrence from the last received ping in the configured timezone and tolerates lateness up to grace.- New ping endpoints (all unauthenticated, same token):
GET|HEAD|POST|PUT /ping/:token— successGET|HEAD|POST|PUT /ping/:token/start— wrap-job start signal (recordslast_heartbeat_start_at, doesn't change state)GET|HEAD|POST|PUT /ping/:token/<0|success|ok>— success aliasGET|HEAD|POST|PUT /ping/:token/<rc>— non-zero exit code → failure (opens incident)GET|HEAD|POST|PUT /ping/:token/<fail|failure|down>— failure alias
- Per-ping body captured (up to 4 KB, transparently truncated). POST/PUT raw bytes preferred; falls back to the parsed urlencoded/json body.
- New
heartbeat_pingstable (kind, exit_code, duration_ms, body, source_ip, user_agent, received_at) capped at 50 rows per site via post-insert prune. - Site form: heartbeat block grew a "schedule kind" selector with cron + timezone fields. Three ping URLs shown (
success/start/fail) plus a wrap-a-job snippet. - Site detail page shows the last 25 pings with kind badge, exit code, computed duration, body snippet, source IP.
Why medium: unlocks programmatic access and Grafana integration.
api_tokenstable:id, name, token_hash (sha256), scope ('read'|'write'), last_used_at, created_at. Plaintext token (utk_<48-hex>) shown exactly once at creation.src/lib/apiTokens.js: create, list, delete, lookup-by-bearer-or-?token=, asynclast_used_attouch on use./settings/api-tokensadmin UI with create / revoke / scope picker and cURL + Prometheus scrape examples.- REST endpoints in
src/routes/api.js(mounted before session-auth, bypassing cookies):GET /api/v1/health(public)GET /api/v1/sites?state=&monitor_type=&tag=&limit=&offset=(read)GET /api/v1/sites/:id(read; includes uptime stats, tags, recent incidents, channel IDs)GET /api/v1/sites/:id/checks?limit=(read)GET /api/v1/sites/:id/incidents?limit=(read)GET /api/v1/incidents?limit=(read)GET /api/v1/tags(read)GET /api/v1/stats(read)POST /api/v1/sites/:id/pause|/resume|/check-now(write)DELETE /api/v1/sites/:id(write)
- Read tokens are denied on write endpoints with HTTP 403.
/metricsPrometheus exporter, public when no tokens exist, token-gated as soon as the first token is created. Series:uptime_monitor_up{id,name,monitor_type}(1 / 0; paused monitors omitted)uptime_monitor_response_time_msuptime_monitor_last_check_age_secondsuptime_monitor_uptime_pct_24huptime_cert_days_remaininguptime_monitors_total{state}(up / down / paused)uptime_open_incidents
Why medium: security parity for any public-internet deploy.
- TOTP 2FA — RFC 6238 implementation in
src/auth.js(HMAC-SHA1 + custom base32, no external dep). Window=±1 / step=30s / 6 digits. Otplib was dropped because v13 introduced an incompatible API for a feature small enough to inline. - Two-step login:
POST /loginvalidates credentials, setssession.pendingUser, redirects to/login/2fa.POST /login/2faaccepts a TOTP token or a one-shot recovery code; on success it rotatessession.userand clears the pending state. - Enrollment UX at
/settings/2fa: status badge, QR code (PNG data-url viaqrcode), pasteable base32 secret, 10 recovery codes shown exactly once at enable / regenerate. Disable requires a current TOTP/recovery confirmation. - Rate-limit + lockout (
src/lib/rateLimit.js): in-memory dual buckets — 5 failures / 15 min per IP, 10 failures / 30 min per username. A successful login clears both buckets. Lockout messages tell the user how many minutes remain. - Audit log — new
audit_logtable (id, at, actor, ip, action, target_type, target_id, meta).src/lib/audit.jsprovidesrecord()andfromReq(). Hooked into login.success / login.failed / 2fa.enabled / 2fa.disabled / 2fa.recovery_used / 2fa.recovery_regenerated / 2fa.failed / logout / api_token.created / api_token.deleted / site.{created,updated,deleted,bulk_*} / tag.{created,updated,deleted}. /settings/auditpage with action filter dropdown, colour-coded action badges, JSONmetashown inline, capped at 200 most-recent.- Login + 2FA forms hardened against
</script>injection via the same\\u003cJSON escaping used on the dashboard.
Why medium: rounds out the assertion surface.
- New columns on
sites:request_body,request_body_type(text/json/form),auth_type(none/basic/bearer),auth_username,auth_password,auth_token,follow_redirects,skip_tls_verify,max_response_time_ms. - Request body for POST/PUT/PATCH/DELETE with auto Content-Type based on body type (no override needed). Sent up to 64 KB.
- First-class Basic auth + Bearer token — populated directly into the
Authorizationheader bybuildHeaders; takes precedence over any header pasted into the custom-headers JSON. - Follow redirects — yes / no (5 hops vs 0). Wired into
undici'smaxRedirections. - Skip TLS verify — when enabled, the monitor uses a separate cached undici Agent with
rejectUnauthorized: false. Tested againsthttps://self-signed.badssl.com/(strict = DOWN, skip = UP). - Regex body match — already shipped in Phase 4.
- Response-time threshold — when
max_response_time_msis set, a successful response that exceeds it is flipped to DOWN withresponse too slow: Xms > Yms. Applied after the primary assertion so reason text is consistent. - Site form gained four sub-sections under "active": Request body, Authentication, Network & TLS, plus a vanilla-JS reveal for the basic vs bearer credential inputs.
- Multiple assertions — explicitly deferred. Existing single-assertion + response-time threshold covers ~95% of real cases without a UI rebuild. Worth revisiting after user feedback.
Why medium: keeps DB size bounded and adds analyst-friendly exports.
src/lib/retention.js— pluggable pruner with one entry per table and a periodic scheduler. Runs on boot + everyRETENTION_RUN_INTERVAL_HOURS(default 24h).- Configurable retention via env (all default-on):
CHECKS_RETENTION_DAYS(90),INCIDENTS_RETENTION_DAYS(365),HEARTBEAT_PINGS_RETENTION_DAYS(30),AUDIT_RETENTION_DAYS(180),RETENTION_VACUUM(1 = run SQLiteVACUUMafter prune). - New columns on
sites:notes(free-text, rendered as pre-wrap on the detail page) andmute_notifications(probes still tick, alerts are dropped).incidents.failure_snapshot(JSON) reserved for future header/body capture. src/notifier.js: extra short-circuit onsite.mute_notificationsbefore any channel dispatch — logged asnotifier.muted_by_sitefor audit.- CSV exports off the detail page and globally:
GET /sites/:id/checks.csv?limit=NGET /sites/:id/incidents.csv?limit=NGET /incidents.csv?limit=N- Identical RFC 4180 quoting helper used everywhere.
- Site form: "Notes & alert mute" section. Detail page: notes block + "Notifications muted" badge + two CSV download buttons.
Why low: project sells "no Docker required" — kept that — but the docker-compose path is now first-class for r/selfhosted users who default to compose.
Dockerfile— two-stage build onnode:20-bookworm-slim:- Stage 1 installs
python3,make,g++and runsnpm ci --omit=devsobetter-sqlite3compiles its native binding against the target glibc. - Stage 2 ships only the runtime:
iputils-ping(for the ICMP monitor),tini(PID 1, signal-forwarding),ca-certificates. Runs as the unprivilegednodeuser./datais a declared volume. Final image hovers around ~150 MB. HEALTHCHECKhits the new unauthenticatedGET /healthz(returns{ ok, ts }).
- Stage 1 installs
docker-compose.yml— two profiles in one file:- Default: just the
uptimeservice, SQLite under a named volumeuptime-data. --profile mysql: brings upmysql:8.4alongside, withMYSQL_RANDOM_ROOT_PASSWORD=yes, healthcheck, and a separatemysql-datavolume. SettingDB_DRIVER=mysqlin.envis enough to switch.cap_add: NET_RAWgranted to the app container so the bundledpingbinary can open its raw socket inside the container.- All sensitive env vars (
SESSION_SECRET,ADMIN_PASS,DB_PASSWORD) use${VAR:?...}so compose fails loudly if.envis missing.
- Default: just the
.env.docker.examplecommitted — copy to.env, change two secrets,docker compose up -d..dockerignorekeeps build context tiny by excludingnode_modules,data,logs,.env,.git, and screenshots.- New unauthenticated
GET /healthzendpoint added tosrc/server.js— JSON only, no DB hit,Cache-Control: no-store.
Why medium: shared-team self-hosters need more than a single shared password.
- New roles
admin/editor/viewerlayered on top of per-monitor grantsview/manage. The.envsuper-admin is preserved as a permanent break-glass account and is never written to theuserstable. - New tables:
users(argon2 password hash, role, per-user TOTP, recovery codes, disabled flag, must-change-password flag, last-login metadata) andsite_grants((site_id, user_id)PK withview/managepermission). New columns:sites.owner_user_id,api_tokens.user_id,audit_log.actor_user_id,channels.created_by_user_id. All migrations are idempotent on both SQLite and MySQL. src/lib/users.js— argon2 hashing, CRUD, password verification, env-admin username reservation (refuses to create a DB user that collides withADMIN_USER).src/lib/acl.js—canSeeSite,canManageSite,siteFilterClausefor splicing into list queries, plusrequireRole/requireSiteSee/requireSiteManageExpress middleware.src/lib/grants.js—list,set,revoke,setManyfor managing the join rows.- Auth refactor:
startLoginlooks up DB users first (rejectingdisabled = 1), falls back to constant-time.envcomparison. TOTP secret + recovery codes are loaded from the matchingusersrow OR the legacy singletonsettingsrow depending on who is logging in. Session shape grows{ id, isEnv, username, role, mustChangePassword }, refreshed from the DB on every request so role changes and disable take effect immediately. - Route guards: every
/sites/*write path runs throughrequireSiteManage, list / detail / CSV paths throughrequireSiteSee. Dashboard,/api/sites, and/api/v1/sites*applysiteFilterClauseso non-admins literally cannot enumerate other people's monitors. Bulk actions silently drop sites the caller cannot manage and flash a count of skipped rows. All/settings/*admin routes requireadmin, except/settings/account(any user) and/settings/audit(any user — non-admins see only their own actions). - API tokens are scoped to a user:
requireApiresolvesapiToken.user_id→ acting user (env admin synthetic when NULL), and every/api/v1response is filtered through that user's ACL./metricsis filtered the same way. - Audit log: every
audit.fromReqwrite capturesactor_user_id(NULL for env admin, still readable viaactortext). The reader filters non-admins to their own rows automatically. - UI: new admin pages
/settings/users(list, create, role / disable / reset-pw / disable-2FA / delete, plus a "claim all unowned monitors" button) and/settings/users/:id/grants(per-sitenone/view/manageselector with bulk buttons). New user-facing/settings/accountpage (own profile, password, 2FA + recovery codes, personal API tokens). The site form gains an Ownership selector for admins; the site detail page gains a Sharing card with owner + grants management for owners and admins; the navbar splits "Administration" (admin-only) from "My account" (any user). - Smoke tested end to end via
scripts/smoke-acl.shagainst both SQLite (scripts/dev-restart.sh) and MySQL (scripts/dev-restart-mysql.sh).
- Groups / teams (would need a third table and dedicated UI; revisit on real demand)
- Per-channel or per-tag ACLs (channels and tags stay admin-managed)
- OIDC / SAML SSO
- Self-service signup or email invitations (admin creates users only)
- User-driven forgot-password flow (admins reset passwords from
/settings/users) - Separate API token "viewer" scope (existing
read/write+ ACL is sufficient)
Adding any of these would change what this project is. Compete on simplicity, not feature breadth.
- Multi-region probe fleet
- On-call schedules, phone-call escalation
- Synthetic browser monitoring (Playwright, headless Chromium per probe)
- AI postmortems / agentic root-cause analysis
- APM, log management, error tracking (OneUptime's "everything in one box")
- SAML / OIDC SSO (only matters for enterprise self-hosters)
A phase is considered done when all of the following are true:
- The DB schema (sqlite + mysql) was extended without breaking existing rows.
- New routes work end-to-end via
curlagainst the local app on port 3001. - The dashboard renders without errors and existing monitors keep ticking.
- No new linter errors were introduced.
- The phase's user-visible features are documented in the README.
The session ends with a final summary of: shipped phases, deferred phases (with reason), and any follow-up tasks for the next session.