feat(skills): add debug-issue skill #2612

Closed
NilanshBansal wants to merge 1 commit into nanocoai:main from NilanshBansal:feat/skill-debug-issue

@NilanshBansal

What this skill does

debug-issue is a Skyler-powered triage skill that investigates application or platform-wide incidents end-to-end. Given an incoming message describing an error, it fetches the right logs, correlates them, and produces a structured root-cause analysis — without requiring the user to specify what to look at.

Three execution paths

The skill picks exactly one path based on whether an application_id is present:

  • Path A — app_id in message: Skips straight to per-application deep dive: metadata → orchestrator summary → primary debug bundle (Langfuse + Grafana cluster + E2B) → app files → Vercel deployment logs.
  • Path B — no app_id: Derives a search pattern from the error description, runs iterative search_cluster_logs (up to 5 refinement rounds), and — if a valid app UUID surfaces — continues into Path A's per-application steps.
  • Path C / Step 4 fallback: If cluster log search exhausts 5 rounds with no hits and no app_id can be derived, summarises what was searched, what was found, and what additional information would allow a definitive answer.
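The path selection above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: `find_app_id`, `choose_path`, and the `search_logs` callback are hypothetical names, and `search_logs` merely stands in for a call to the cluster log search tool. How the skill derives and refines its search pattern between rounds is not modelled here.

```python
import re
from typing import Callable, List, Optional, Tuple

# Assumption: an application id looks like a standard hyphenated UUID.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)
MAX_ROUNDS = 5  # refinement cap stated in the PR description


def find_app_id(text: str) -> Optional[str]:
    """Return the first UUID-shaped token in text, lowercased, or None."""
    match = UUID_RE.search(text)
    return match.group(0).lower() if match else None


def choose_path(
    message: str,
    search_logs: Callable[[str, int], List[str]],
) -> Tuple[str, Optional[str]]:
    """Pick path "A", "B", or "C" and the app id, if one is known."""
    app_id = find_app_id(message)
    if app_id:
        return "A", app_id  # Path A: id supplied in the message

    # Path B: search cluster logs, up to MAX_ROUNDS refinement rounds.
    for round_no in range(1, MAX_ROUNDS + 1):
        for line in search_logs(message, round_no):
            app_id = find_app_id(line)
            if app_id:
                return "B", app_id  # continue into the per-application steps

    # Path C: no id could be derived; fall back to a search summary.
    return "C", None
```

A caller would pass a function that wraps the real log search and returns matching log lines for a given round; a result of `("B", app_id)` means the per-application sequence from Path A should run next.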

MCP/tool requirements

Primary (required): mcp__skyler__* — covers metadata, orchestrator summary, fetch_all_debug_logs (Langfuse + Grafana + E2B bundle), per-source re-fetches, cluster log search, app file download, and Vercel proxy.

Optional / conditional:

  • mcp__skyler__get_slack_thread — when the user references a Slack thread_ts
  • Vercel tools (via mcp__skyler__get_vercel) — triggered automatically when deploy/domain keywords appear in the message
  • mcp__nanoclaw__send_message — for progress updates at major step transitions

No Langfuse, GitHub, or Grafana MCP servers are called directly; all observability access goes through mcp__skyler__*.

Workspace layout

| Path | Purpose |
| --- | --- |
| /workspace/agent/debug_evidence/ | Persisted MCP payloads with deterministic filenames encoding tool + time range + app_id |
| /workspace/agent/output.md | Canonical structured output (###OUTPUT_START### / ###OUTPUT_END### + JSON filter block + MCP call log) |

Evidence filenames follow the pattern: <tool_id>__<start>_<end>__<app_id>.json (e.g. grafana__20260423T120000Z_20260423T130000Z__550e8400-e29b-41d4-a716-446655440000.json).
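As an illustration of the naming pattern, a small helper could build these filenames. The function name `evidence_filename` is hypothetical, and the sketch assumes timestamps are normalised to UTC and rendered as `YYYYMMDDTHHMMSSZ`, which matches the example above. What the skill does when no app_id is known is not covered.

```python
from datetime import datetime, timezone

TS_FORMAT = "%Y%m%dT%H%M%SZ"  # assumed from the example filename


def evidence_filename(tool_id: str, start: datetime, end: datetime, app_id: str) -> str:
    """Build <tool_id>__<start>_<end>__<app_id>.json with UTC timestamps."""
    for ts in (start, end):
        if ts.tzinfo is None:
            # Naive datetimes would be read as local time; reject them.
            raise ValueError("start and end must be timezone-aware")
    start_s = start.astimezone(timezone.utc).strftime(TS_FORMAT)
    end_s = end.astimezone(timezone.utc).strftime(TS_FORMAT)
    return f"{tool_id}__{start_s}_{end_s}__{app_id}.json"
```

Because the name depends only on the tool, time range, and app_id, repeating the same fetch maps to the same file.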

Provenance

Ported from the debug-issue skill in appsmith-v2/kite-triage-bot (OpenCode/E2B environment). The core decision tree, step sequence, output format, and analysis principles transfer directly. Adapted for NanoClaw:

  • skyler_* → mcp__skyler__* tool names
  • /home/user/skyler/ sandbox paths → /workspace/agent/ workspace paths
  • OpenCode [STEP]/[STATUS]/[HEARTBEAT] runner logs → mcp__nanoclaw__send_message progress updates
  • Hosho MCP (not available in NanoClaw) removed
  • Platform codebase (/home/user/skyler/repo) and skyler_search_repo_code / skyler_read_repo_file (not available) removed; Step 4 fallback refactored accordingly
  • mcp__skyler__fetch_all_debug_logs promoted to primary fast-path in Step 3c (it wasn't explicitly called out in the original flow)

Findings discovered during the port

Finding 1 — Skill file location: Skills must live under container/skills/<name>/SKILL.md in the repo root, not under groups/<name>/skills/. This is not documented in CLAUDE.md, which caused initial confusion when writing the destination path. Recommend adding a one-liner to CLAUDE.md: "Skills live at container/skills/<skill-name>/SKILL.md."

Finding 2 — ${VAR} not expanded in MCP args: The Claude MCP launcher does not expand ${VAR} environment variable references that appear inside the args array of the per-MCP env block. HTTP MCPs wrapped via mcp-remote that rely on injected env vars in their args fail silently at connection time without a clear error. Recommend either (a) pre-expanding env vars in NanoClaw's container-runner before passing the config to Claude, or (b) documenting this limitation explicitly so skill authors know to use literal values or a different injection mechanism.
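Option (a) could look roughly like the sketch below. It is written in Python purely for illustration and is not NanoClaw's container-runner code; `expand_args` and `expand_server` are hypothetical names, and the server entry is assumed to be a dict with `command`, `args`, and `env` keys. Unset variables raise an error instead of being passed through, so a misconfiguration surfaces before connection time.

```python
import os
import re
from typing import Dict, List, Mapping

# Matches ${NAME} references only; bare $NAME is left untouched.
VAR_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_args(args: List[str], env: Mapping[str, str]) -> List[str]:
    """Replace every ${NAME} in args with env[NAME]; fail loudly if unset."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"unset variable in MCP args: {name}")
        return env[name]

    return [VAR_RE.sub(substitute, arg) for arg in args]


def expand_server(server: Dict) -> Dict:
    """Return a copy of one server entry with its args pre-expanded.

    Values from the entry's own env block take precedence over the
    process environment (an assumption of this sketch).
    """
    env = {**os.environ, **server.get("env", {})}
    return {**server, "args": expand_args(server.get("args", []), env)}
```

Running `expand_server` over each server entry before handing the config to the launcher would leave only literal values in `args`; the original entry is not mutated.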

Validated by

Used this skill in production immediately after porting it. It correctly root-caused a fork/claim race condition on production app 5309a049-c630-41c2-b745-a2d805d66eb7 ("Paisajismo Nativo"): the /api/v1/applications/{id}/fork endpoint was returning HTTP 500 because sandbox boot was triggered before the forked website's EFS directory was fully provisioned. The failing forks consistently errored within 7 seconds of fork record creation, while a later successful fork (the next morning) completed in 18 seconds with the filesystem ready. Four orphaned fork records were identified for the affected user (paisaje@paisajismonativo.com). The skill ran Path A (app_id provided), fetched metadata + orchestrator summary + full debug bundle + Vercel deployments, and identified the root cause without any manual log queries.

Skyler-style end-to-end debugging skill ported from appsmith-v2/kite-triage-bot.
Three-path decision tree (app_id given / derive via cluster logs / fallback),
per-application sequence (metadata → orchestrator → logs → app files → Vercel),
evidence persistence to /workspace/agent/debug_evidence/ with deterministic
filenames, structured ###OUTPUT_START### / ###OUTPUT_END### output format.