
feat: add insforge diagnose command for backend health diagnostics#32

Merged
jwfing merged 13 commits into main from feat/diagnosis
Mar 27, 2026

Member

@jwfing jwfing commented Mar 27, 2026

Summary

  • Add insforge diagnose command group for SRE-style backend health diagnostics
  • diagnose (no subcommand): comprehensive health report aggregating metrics, advisor, DB checks, and logs
  • diagnose metrics: EC2 instance metrics (CPU, memory, disk, network) with latest/avg/max stats
  • diagnose advisor: latest advisor scan results with severity/category filtering
  • diagnose db: 7 predefined PostgreSQL health checks (connections, slow queries, bloat, table sizes, index usage, locks, cache hit ratio)
  • diagnose logs: error-level log aggregation across all 4 backend log sources
  • OSS mode support: when linked via --api-key, gracefully skips Platform API calls (metrics/advisor) and only runs DB + logs checks
  • All commands support --json output for agent consumption

Test plan

  • insforge diagnose --help shows all subcommands
  • insforge diagnose produces comprehensive health report (with linked project)
  • insforge diagnose metrics --range 1h displays EC2 metrics table
  • insforge diagnose advisor --severity critical filters issues
  • insforge diagnose db --check connections,cache-hit runs specific checks
  • insforge diagnose logs --source postgres.logs filters by source
  • insforge --json diagnose outputs valid JSON
  • OSS-linked project (--api-key): metrics/advisor show N/A, db/logs work normally
  • Unlinked project: shows "No project linked" error

🤖 Generated with Claude Code

Note

Add insforge diagnose command group for backend health diagnostics

  • Adds a new diagnose command group to the insforge CLI with four subcommands: metrics, advisor, db, and logs.
  • The top-level diagnose command runs all checks concurrently via Promise.allSettled and renders a combined health report in table or JSON format.
  • diagnose metrics fetches CPU/memory/disk/network metrics from the Platform API; diagnose advisor fetches the latest advisor scan and issues; diagnose db runs predefined SQL health checks (connections, slow queries, bloat, locks, cache-hit); diagnose logs aggregates error-level log entries across sources.
  • OSS-linked projects skip metrics and advisor checks, marking them as N/A in the report.
  • Exports platformFetch from src/lib/api/platform.ts so diagnose subcommands can call the Platform API.
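
A minimal sketch of the concurrent aggregation described above, assuming a hypothetical report shape (`Report`, `SectionName`, `buildReport`, and `runAll` are illustrative names, not taken from the PR). It shows how `Promise.allSettled` lets one failing section be recorded as an error without discarding the others:

```typescript
// Hypothetical section names and report shape; the real schema lives in the PR's design spec.
type SectionName = 'metrics' | 'advisor' | 'db' | 'logs';

interface Report {
  sections: Partial<Record<SectionName, unknown>>;
  errors: { section: SectionName; message: string }[];
}

// Turn settled results into a report: fulfilled values become sections,
// rejections become entries in the errors list.
function buildReport(
  names: readonly SectionName[],
  settled: readonly PromiseSettledResult<unknown>[],
): Report {
  const report: Report = { sections: {}, errors: [] };
  settled.forEach((result, i) => {
    const section = names[i];
    if (result.status === 'fulfilled') {
      report.sections[section] = result.value;
    } else {
      const reason: unknown = result.reason;
      report.errors.push({
        section,
        message: reason instanceof Error ? reason.message : String(reason),
      });
    }
  });
  return report;
}

// Start every section at once and wait for all of them, whether they succeed or fail.
async function runAll(
  tasks: Record<SectionName, () => Promise<unknown>>,
): Promise<Report> {
  const names = Object.keys(tasks) as SectionName[];
  const settled = await Promise.allSettled(names.map((name) => tasks[name]()));
  return buildReport(names, settled);
}
```

Keeping `buildReport` separate from the async call makes the partial-failure logic testable without any network access.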

Macroscope summarized efb7abb.

Summary by CodeRabbit

  • New Features

    • Added insforge diagnose with subcommands: metrics, advisor, db, logs and a comprehensive health report.
  • User-Facing Enhancements

    • Human-readable tables and --json consolidated output with per-section partial-failure reporting.
    • Metrics: range selection, aggregated network metrics, latest/avg/max.
    • Advisor: scan summary with filtered issues.
    • DB: predefined PostgreSQL health checks with per-check results.
    • Logs: multi-source retrieval with error/fatal filtering and summaries.
  • Documentation

    • Added design spec and implementation plan for the diagnose command.
  • Chores

    • Bumped package version to 0.1.32


coderabbitai bot commented Mar 27, 2026

Walkthrough

Adds a new top-level diagnose CLI command group with four flat subcommands (metrics, advisor, db, logs) that concurrently collect backend health data and emit either human-friendly tables or a unified JSON report with per-source failure isolation and aggregated errors.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| **Specs**: `docs/specs/2026-03-27-diagnose-command-design.md`, `docs/specs/2026-03-27-diagnose-implementation-plan.md` | Added design and implementation plan detailing command surface, subcommands, JSON report schema, error handling, data shaping rules, and recommended file layout/wiring. |
| **Diagnose core & orchestration**: `src/commands/diagnose/index.ts`, `src/index.ts` | New diagnose top-level command registration and root handler that runs sub-sources concurrently (`Promise.allSettled`), aggregates results/errors, formats JSON vs console output, and records CLI usage. |
| **Subcommands (metrics & advisor)**: `src/commands/diagnose/metrics.ts`, `src/commands/diagnose/advisor.ts` | New diagnose metrics and diagnose advisor subcommands: auth/project checks, platform fetches, query option parsing, data enrichment (metrics latest/avg/max; advisor scan + issues), and JSON/table rendering. |
| **Subcommands (db & logs)**: `src/commands/diagnose/db.ts`, `src/commands/diagnose/logs.ts` | New diagnose db implementing configurable read‑only SQL checks and `runDbChecks()`; diagnose logs fetching multiple log sources, error-line extraction, summaries, and per-source error details. |
| **API helper export**: `src/lib/api/platform.ts` | Made `platformFetch` exported for use by the new diagnose modules. |
| **Package metadata**: `package.json` | Bumped package version from 0.1.31 to 0.1.32. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • 0.1.21 #26 — package version bump (related to this PR's package.json version update).

Poem

🐇 I hopped through endpoints, sniffed each log and trace,
Metrics, scans, and DBs—lined up in tidy place.
Promise.allSettled kept each piece unfrayed,
Tables or JSON — a rabbit-built brigade,
I nudge a bug away and nibble on a grace. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Title check | ✅ Passed | The pull request title clearly and concisely summarizes the main change: adding a new insforge diagnose command for backend health diagnostics, which aligns with the comprehensive changeset introducing the diagnose command group and its subcommands. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


…terfaces

Network metrics (network_in/network_out) are returned per-interface by the
API, causing duplicate rows. Now sums across interfaces into a single row
per metric.
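
For illustration, a rough sketch of the per-interface summing this commit describes, assuming a hypothetical response shape (`Series`, `Point`, `Stats`, and `summarize` are made-up names; the actual Platform API payload is not shown in this PR). Series that share a metric name are summed per timestamp, and latest/avg/max are computed with a plain loop rather than a spread-based `Math.max(...vals)`:

```typescript
// Hypothetical shapes: one Series per network interface for network_in / network_out.
interface Point { timestamp: number; value: number }
interface Series { metric: string; points: Point[] }
interface Stats { metric: string; latest: number; avg: number; max: number }

// Note: this merges ANY series that share a metric name; the commit applies
// the idea to network_in / network_out only.
function summarize(series: readonly Series[]): Stats[] {
  // metric name -> timestamp -> value summed across interfaces
  const merged: Record<string, Record<number, number>> = {};
  for (const s of series) {
    const byTime = merged[s.metric] ?? {};
    merged[s.metric] = byTime;
    for (const p of s.points) {
      byTime[p.timestamp] = (byTime[p.timestamp] ?? 0) + p.value;
    }
  }

  const out: Stats[] = [];
  for (const metric of Object.keys(merged)) {
    const byTime = merged[metric];
    const times = Object.keys(byTime).map(Number).sort((a, b) => a - b);
    if (times.length === 0) continue; // no data points: nothing to report

    // Explicit loop instead of Math.max(...values), which can overflow the
    // call stack on very large arrays.
    let sum = 0;
    let max = -Infinity;
    for (const t of times) {
      const v = byTime[t];
      sum += v;
      if (v > max) max = v;
    }
    out.push({ metric, latest: byTime[times[times.length - 1]], avg: sum / times.length, max });
  }
  return out;
}
```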

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
src/index.ts (1)

166-168: Add a description for the diagnose command group.

Other command groups include descriptions (e.g., db, functions, secrets, schedules), but diagnose is missing one. This affects --help output consistency.

Proposed fix
 // Diagnose commands
-const diagnoseCmd = program.command('diagnose');
+const diagnoseCmd = program.command('diagnose').description('Backend health diagnostics');
 registerDiagnoseCommands(diagnoseCmd);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index.ts` around lines 166 - 168, The diagnose command group created via
program.command('diagnose') has no description, causing inconsistent --help
output; update the call that defines diagnoseCmd (the
program.command('diagnose') invocation) to include a short descriptive string
(e.g., program.command('diagnose').description('...')) or otherwise set a
description on the diagnoseCmd before calling
registerDiagnoseCommands(diagnoseCmd) so the --help output matches other groups;
ensure the description is concise and mirrors the style used for
db/functions/secrets/schedules.
src/commands/diagnose/logs.ts (1)

51-61: Consider parallelizing source fetches.

The current implementation fetches each log source sequentially. With 4 sources, parallelization could reduce latency. However, this is a minor optimization and acceptable as-is.

Optional: parallel fetch implementation
 export async function fetchLogsSummary(limit = 100): Promise<SourceSummary[]> {
-  const results: SourceSummary[] = [];
-  for (const source of LOG_SOURCES) {
-    try {
-      results.push(await fetchSourceLogs(source, limit));
-    } catch {
-      results.push({ source, total: 0, errors: [] });
-    }
-  }
-  return results;
+  const settled = await Promise.allSettled(
+    LOG_SOURCES.map((source) => fetchSourceLogs(source, limit)),
+  );
+  return settled.map((result, i) =>
+    result.status === 'fulfilled'
+      ? result.value
+      : { source: LOG_SOURCES[i], total: 0, errors: [] },
+  );
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/commands/diagnose/logs.ts` around lines 51 - 61, fetchLogsSummary
currently iterates LOG_SOURCES sequentially calling fetchSourceLogs which
increases latency; change it to kick off parallel fetches (e.g., map LOG_SOURCES
to promises of fetchSourceLogs) and await them together using Promise.allSettled
(or Promise.all with per-promise catch) to preserve per-source error handling,
then convert settled results into SourceSummary objects (using fetchSourceLogs,
LOG_SOURCES, and SourceSummary to locate code) so failures produce { source,
total: 0, errors: [] } while successful results are pushed as before.
docs/specs/2026-03-27-diagnose-command-design.md (1)

32-32: Add language specifiers to fenced code blocks.

Static analysis flagged code blocks at lines 32, 66, 94, and 151 as missing language specifiers. Since these are output mockups, use text or plaintext to satisfy the linter.

Example fix for line 32
-```
+```text
 ┌─────────────────────────────────────────────────┐

Apply similar changes to code blocks at lines 66, 94, and 151.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/specs/2026-03-27-diagnose-command-design.md` at line 32, Fenced code
blocks that show the ASCII output mockups (they start with a triple backtick
followed by the box-drawing line
"┌─────────────────────────────────────────────────┐" and similar blocks later)
are missing language specifiers; update each triple-backtick fence to include a
language such as text or plaintext (e.g., change ``` to ```text) for every
output mockup block (the one beginning with the box-drawing line and the other
similar mockup blocks) so the linter stops flagging them.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/specs/2026-03-27-diagnose-implementation-plan.md`:
- Around line 699-750: The plan snippet is out of sync with the shipped
implementation: it imports isOssMode from metrics.js and uses spread-based
Math.max(...vals), both of which were intentionally removed; update the plan to
match the real code by (1) importing the correct helper (or removing the
isOssMode import) and referencing the actual OSS detection used in
registerDiagnoseCommands, and (2) replacing any spread-based Math.max usage with
the safe alternative used in the implementation (e.g., Math.max.apply or an
explicit loop/reduce) for computing max in the metrics handling so the document
mirrors the shipped functions fetchMetricsSummary, registerDiagnoseCommands, and
the metrics aggregation logic.

In `@src/commands/diagnose/index.ts`:
- Around line 33-38: The current code models OSS skips as rejected promises
which get treated as errors by the aggregator; change the ossMode branches for
metricsPromise and advisorPromise to return a fulfilled sentinel (e.g.,
Promise.resolve(null) or Promise.resolve({ skipped: true })) instead of
Promise.reject(...), keeping fetchMetricsSummary and fetchAdvisorSummary calls
unchanged, and ensure downstream code that inspects the results (the
aggregator/renderer that consumes metricsPromise and advisorPromise) explicitly
checks for that sentinel and renders a "skipped" state rather than treating it
as an error.
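
One way the fulfilled-sentinel suggestion could look, as a sketch only: `Skipped`, `SectionState`, `toSectionState`, and `renderLine` are hypothetical names, and the shipped code in `src/commands/diagnose/index.ts` may model this differently. The point is that an OSS skip resolves to a marker value, so only genuine failures end up in the error path:

```typescript
// Hypothetical sentinel for sections that are intentionally not run in OSS mode.
interface Skipped { skipped: true; reason: string }

const SKIPPED_OSS: Skipped = { skipped: true, reason: 'Platform API not available in OSS mode' };

type SectionState<T> =
  | { kind: 'ok'; data: T }
  | { kind: 'skipped'; reason: string }
  | { kind: 'error'; message: string };

function isSkipped(value: unknown): value is Skipped {
  return typeof value === 'object' && value !== null
    && (value as { skipped?: unknown }).skipped === true;
}

// Classify a settled promise: rejection -> error, sentinel -> skipped, anything else -> ok.
function toSectionState<T>(result: PromiseSettledResult<T | Skipped>): SectionState<T> {
  if (result.status === 'rejected') {
    const reason: unknown = result.reason;
    return { kind: 'error', message: reason instanceof Error ? reason.message : String(reason) };
  }
  const value = result.value;
  if (isSkipped(value)) return { kind: 'skipped', reason: value.reason };
  return { kind: 'ok', data: value };
}

// Render a skipped section as N/A rather than as a failure.
function renderLine(label: string, state: SectionState<string>): string {
  switch (state.kind) {
    case 'ok': return `${label}: ${state.data}`;
    case 'skipped': return `${label}: N/A (${state.reason})`;
    case 'error': return `${label}: error (${state.message})`;
  }
}
```

In the OSS branch the caller would pass `Promise.resolve(SKIPPED_OSS)` in place of the real fetch, leaving `fetchMetricsSummary` and `fetchAdvisorSummary` untouched.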


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 993d4c09-babb-42c3-8910-ec2e47b4258f

📥 Commits

Reviewing files that changed from the base of the PR and between b4bb6ad and d24e815.

📒 Files selected for processing (10)
  • docs/specs/2026-03-27-diagnose-command-design.md
  • docs/specs/2026-03-27-diagnose-implementation-plan.md
  • package.json
  • src/commands/diagnose/advisor.ts
  • src/commands/diagnose/db.ts
  • src/commands/diagnose/index.ts
  • src/commands/diagnose/logs.ts
  • src/commands/diagnose/metrics.ts
  • src/index.ts
  • src/lib/api/platform.ts


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/commands/diagnose/db.ts`:
- Line 149: The diagnostic queries are static SELECTs but currently call
runRawSql with unrestricted=true, which unnecessarily broadens privileges;
update both call sites (the invocations that look like
runRawSql(DB_CHECKS[key].sql, true)) to call the read-only form (either pass
false instead of true or omit the unrestricted flag per runRawSql's signature)
so the checks use the restricted/read-only SQL path; keep the SQL source
DB_CHECKS[key].sql unchanged.
- Around line 169-181: The loop over checkNames currently logs unknown check
names and continues, which leads to partial/empty results; update the behavior
in the loop that inspects DB_CHECKS[name] (inside the for-of iterating
checkNames built from opts.check/ALL_CHECKS) to fail fast instead of continuing:
when a lookup yields no check, raise an error or call process.exit(1) after
printing a clear message that includes the invalid name and ALL_CHECKS, so the
command (including --json consumers) receives a non-zero failure rather than
silent partial success.
- Around line 145-153: The runDbChecks function currently swallows SQL errors
and sets results[key] = [] which hides failures; change the catch block in
runDbChecks (which iterates ALL_CHECKS and calls runRawSql with
DB_CHECKS[key].sql) to preserve per-check error metadata instead of coercing to
an empty array — e.g., assign results[key] to an array/object that includes the
error message/stack and identifying info (error, message, maybe sql or check id)
so downstream code in diagnose/index.ts can distinguish "no findings" vs "DB
unavailable"; keep the successful path returning rows unchanged and ensure error
serialization is safe (stringify message/stack).
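
The last two findings could be addressed along these lines. This is a sketch under assumptions: `CheckResult`, `parseCheckNames`, and `runChecks` are hypothetical names, the SQL runner is injected rather than calling the project's `runRawSql`, and only the check names `connections` and `cache-hit` appear in this PR's test plan. Unknown names fail fast, and a failed query is recorded as an error instead of an empty result:

```typescript
// A per-check result that distinguishes "no findings" from "query failed".
type CheckResult =
  | { status: 'ok'; rows: unknown[] }
  | { status: 'error'; message: string };

// Parse a comma-separated --check value; throw on unknown names so callers
// (including --json consumers) see a failure rather than partial output.
function parseCheckNames(raw: string | undefined, known: readonly string[]): string[] {
  if (!raw) return known.slice();
  const names = raw.split(',').map((s) => s.trim()).filter((s) => s.length > 0);
  const unknown = names.filter((n) => known.indexOf(n) === -1);
  if (unknown.length > 0) {
    throw new Error(`Unknown check(s): ${unknown.join(', ')}. Available: ${known.join(', ')}`);
  }
  return names;
}

// Run each check through an injected SQL runner (a stand-in for the real one)
// and keep the error message when a query fails.
async function runChecks(
  names: readonly string[],
  runSql: (name: string) => Promise<unknown[]>,
): Promise<Record<string, CheckResult>> {
  const results: Record<string, CheckResult> = {};
  for (const name of names) {
    try {
      results[name] = { status: 'ok', rows: await runSql(name) };
    } catch (err) {
      results[name] = {
        status: 'error',
        message: err instanceof Error ? err.message : String(err),
      };
    }
  }
  return results;
}
```

With this shape, the top-level report can tell an empty `rows` array (healthy) apart from a `status: 'error'` entry (database unreachable).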

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6ea8c721-14ab-4103-97c9-39f5b9f7be4e

📥 Commits

Reviewing files that changed from the base of the PR and between d24e815 and b6da2cf.

📒 Files selected for processing (2)
  • src/commands/diagnose/db.ts
  • src/commands/diagnose/metrics.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/commands/diagnose/metrics.ts

@jwfing jwfing requested a review from tonychang04 March 27, 2026 19:11
Member Author

jwfing commented Mar 27, 2026

screenshot as:

$ insforge diagnose --help
Usage: insforge diagnose [options] [command]

Backend diagnostics — run with no subcommand for a full health report

Options:
  -h, --help         display help for command

Commands:
  metrics [options]  Display EC2 instance metrics (CPU, memory, disk, network)
  advisor [options]  Display latest advisor scan results and issues
  db [options]       Run database health checks (connections, bloat, index usage, etc.)
  logs [options]     Aggregate error-level logs from all backend sources

$ insforge diagnose     

  InsForge Health Report — Sudo Database Assistant

── System Metrics (last 1h) ────────────────────
  CPU: 4.9%   Memory: 75.7%
  Disk: 66.3%  Network: ↑555B/s ↓597B/s

── Advisor Scan ────────────────────────────────
  3/27/2026 (completed) — 3 critical · 0 warning · 0 info

── Database ────────────────────────────────────
  Connections: 5/100  Cache Hit: 98.7%
  Dead tuples: 26   Locks waiting: 0

── Recent Errors (last 100 logs/source) ────────
  insforge.logs: 0  postgREST.logs: 0  postgres.logs: 1  function.logs: 0

Contributor

tonychang04 commented Mar 27, 2026

@jwfing what happens if there's no data? so the cloud backend advisor only runs once per day right?

oh, it can see the database, recent errors, and system metrics

it'd be nice if you could do a knowledge share or a short blurb in Slack!!

Member Author

jwfing commented Mar 27, 2026

what happens if there's no data?

It just displays N/A.


| Check | SQL |
|-------|-----|
| `connections` | `SELECT count(*) AS active FROM pg_stat_activity WHERE state IS NOT NULL` combined with `SHOW max_connections` |
Contributor


is this very heavy? will this overload nano instance?

Member Author


no, not heavy at all; it counts from shared memory.

@tonychang04 tonychang04 self-requested a review March 27, 2026 21:26
@jwfing jwfing merged commit 83b1ed0 into main Mar 27, 2026
3 checks passed
2 participants