Add deploy-selfmanaged skill and slash command #982

Draft
EngHabu wants to merge 3 commits into main from add-deploy-selfmanaged-skill

@EngHabu EngHabu commented May 6, 2026

Contributor

Summary

  • Adds a Claude Code skill at .claude/skills/deploy-selfmanaged/ that guides users through deploying a Union self-managed (BYOC) data plane end-to-end. Captures intent (cloud, mode, control-plane URL, cluster name, region), verifies prereqs, then runs the deterministic path for the chosen cloud:
    • AWS / GCP: drives the existing scripts/selfmanaged_{aws,gcp}_e2e.py flyte e2e scripts (full 4-phase flow: infra → helm → smoke → optional teardown).
    • Azure / OCI / CoreWeave / generic K8s: guided walkthroughs that mirror each cloud's prepare-infra.md → deploy-dataplane.md, with cloud-specific gotchas surfaced.
  • Adds /deploy-selfmanaged slash command (.claude/commands/deploy-selfmanaged.md) for explicit invocation, with optional [cloud] [mode] args.
  • Whitelists .claude/skills/ and .claude/commands/ in .gitignore so team-shared skills are checked in while personal Claude state (settings, transcripts) stays ignored.
  • Drive-by fix to scripts/README.md — the GCP e2e script is fully implemented (not a stub); README is updated to reflect that and to call out that remote credential stashing is AWS-only today.

What the user experience looks like

The skill is designed to be intuitive, guiding, and deterministic. A typical AWS deploy session looks like this:

1. Trigger

The user either says something natural ("deploy a self-managed Union cluster on AWS") and the skill auto-engages from its description, or they run the slash command:

/deploy-selfmanaged aws deploy
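
As a rough illustration of how the optional [cloud] [mode] arguments could be validated before intent capture, here is a minimal Python sketch. The function name `parse_command_args` and the error messages are hypothetical; the allowed values are the ones listed in the intent-capture step, and the skill itself is a Markdown prompt rather than Python code, so this only models the intended behavior.

```python
from typing import Optional, Tuple

# Allowed values as listed in the PR description.
CLOUDS = ("aws", "gcp", "azure", "oci", "coreweave", "generic")
MODES = ("deploy", "teardown", "smoke-only")


def parse_command_args(arg_string: str) -> Tuple[Optional[str], Optional[str]]:
    """Split '[cloud] [mode]' into validated values.

    A value of None means the argument was omitted and should be asked for.
    """
    tokens = arg_string.split()
    if len(tokens) > 2:
        raise ValueError(f"expected at most 2 arguments, got {len(tokens)}")
    cloud = tokens[0].lower() if len(tokens) >= 1 else None
    mode = tokens[1].lower() if len(tokens) == 2 else None
    if cloud is not None and cloud not in CLOUDS:
        raise ValueError(f"unknown cloud {cloud!r}; expected one of {CLOUDS}")
    if mode is not None and mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    return cloud, mode
```

With this shape, `parse_command_args("aws deploy")` yields both values, while `parse_command_args("")` yields two `None`s, signalling that both questions still need to be asked.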

2. Intent capture (Step 1 in SKILL.md)

Claude asks two button-pick questions via AskUserQuestion:

  • Cloud: aws | gcp | azure | oci | coreweave | generic
  • Mode: deploy | teardown | smoke-only

Then prompts in plain text for the three free-form fields:

  • Control plane URL (e.g. https://myorg.union.ai)
  • Cluster name (registered with the control plane)
  • Region (script default shown but never silently assumed)

Questions are skipped if the user already volunteered the answer in their initial message. Claude then echoes a Plan: block back so the user can sanity-check before anything runs:

Plan:
  cloud:             aws
  mode:              deploy
  control plane:     https://myorg.union.ai
  cluster name:      my-test-cluster
  region:            us-east-2
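
The intent-capture behavior (ask only for what is missing, never guess, then echo the plan) can be modeled with a small data structure. This is a sketch under assumptions: the `Plan` class, its field names, and the `missing`/`render` methods are hypothetical and are not taken from SKILL.md.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Plan:
    """The five values collected during intent capture (hypothetical model)."""
    cloud: Optional[str] = None
    mode: Optional[str] = None
    control_plane_url: Optional[str] = None
    cluster_name: Optional[str] = None
    region: Optional[str] = None

    def missing(self) -> List[str]:
        # Fields the user has not supplied yet; these must be asked, not guessed.
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def render(self) -> str:
        unanswered = self.missing()
        if unanswered:
            raise ValueError(f"still need to ask for: {', '.join(unanswered)}")
        rows = [
            ("cloud", self.cloud),
            ("mode", self.mode),
            ("control plane", self.control_plane_url),
            ("cluster name", self.cluster_name),
            ("region", self.region),
        ]
        lines = ["Plan:"]
        for label, value in rows:
            key = label + ":"
            # Pad the key so the values line up in one column.
            lines.append(f"  {key:<19}{value}")
        return "\n".join(lines)
```

Refusing to render while any field is missing is one way to make the "ask, don't guess" rule hard to bypass.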

3. Prereq verification (Step 2 + cloud-specific)

Claude runs the prereq checks as a single batch and reports a pass/fail table:

| Check | Result |
| --- | --- |
| helm version --short | ✅ v4.1.1 |
| kubectl version --client | ✅ v1.35.1 |
| uctl version | ✅ v0.1.20 |
| uv | ✅ installed |
| aws --version | ✅ aws-cli/2.31.10 |
| eksctl version | ✅ 0.225.0 |
| aws sts get-caller-identity | ✅ account 479…192 |
| AWS_ACCESS_KEY_ID env var | not set |
| scripts/.venv | ✅ exists, flyte 2.2.2 importable |
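
A batch of checks like the table above could be produced along these lines. This is a minimal sketch, not the skill's implementation: `run_command`, `run_prereqs`, and the remediation strings passed in are hypothetical, and the runner is injectable so the logic can be exercised without the real CLIs installed.

```python
import shutil
import subprocess
from typing import Callable, List, Sequence, Tuple

# (label shown in the table, argv to execute, remediation hint on failure)
Check = Tuple[str, Sequence[str], str]


def run_command(argv: Sequence[str]) -> Tuple[bool, str]:
    """Run one read-only check and return (passed, first line of output)."""
    if shutil.which(argv[0]) is None:
        return False, "not found on PATH"
    proc = subprocess.run(list(argv), capture_output=True, text=True)
    output = (proc.stdout or proc.stderr).strip()
    first_line = output.splitlines()[0] if output else ""
    return proc.returncode == 0, first_line


def run_prereqs(
    checks: Sequence[Check],
    runner: Callable[[Sequence[str]], Tuple[bool, str]] = run_command,
) -> Tuple[List[str], List[str]]:
    """Run every check, returning table rows and a list of failures with fixes."""
    rows: List[str] = []
    failures: List[str] = []
    for label, argv, fix in checks:
        ok, detail = runner(argv)
        status = "PASS" if ok else "FAIL"
        rows.append(f"{label:<30} {status}  {detail}")
        if not ok:
            failures.append(f"{label}: {fix}")
    return rows, failures
```

All checks run before anything is reported, so the user sees every failure at once; a non-empty failure list is the signal to halt.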

Any failure halts the flow with the exact remediation command — no "let me try a workaround." For SSO-authenticated users, the skill specifically catches the gap between aws sts get-caller-identity succeeding (via SSO) and the e2e script needing AWS_ACCESS_KEY_ID exported, and tells the user to run aws configure export-credentials --profile <sso-profile> --format env.
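
The SSO gap described above reduces to a two-condition check: the identity call succeeded, but the access-key variable is absent from the environment. A minimal sketch, assuming a hypothetical helper named `sso_credentials_gap` that receives the result of the identity check and a copy of the environment:

```python
from typing import Mapping, Optional


def sso_credentials_gap(
    identity_ok: bool,
    env: Mapping[str, str],
    profile: str = "<sso-profile>",
) -> Optional[str]:
    """Return a remediation message, or None when nothing needs fixing."""
    if not identity_ok:
        # Not the SSO gap: authentication itself is failing.
        return "aws sts get-caller-identity failed; fix AWS authentication first."
    if env.get("AWS_ACCESS_KEY_ID"):
        return None
    # Identity resolves (for example via SSO) but no exported keys are present.
    return (
        "AWS_ACCESS_KEY_ID is not set. Run: "
        f"aws configure export-credentials --profile {profile} --format env"
    )
```

In practice the caller would pass `os.environ` as `env` and halt the flow whenever the function returns a message.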

4. Deploy command, shown before run

Claude composes the exact command, displays it, and asks for explicit yes before running anything cloud-mutating. For AWS/GCP, the deploy is long-running (~25–40 min) and the recommendation is to run in the user's own terminal with --tui:

cd scripts && source .venv/bin/activate && \
flyte run --local --tui selfmanaged_aws_e2e.py main \
    --control_plane_url https://myorg.union.ai \
    --cluster_name my-test-cluster \
    --aws_region us-east-2

The user sees Phase 1/4 → 2/4 → 3/4 → (4/4 teardown) in a live TUI tree with timings, logs, and links to each running cloud resource. For Azure/OCI/CoreWeave/generic, Claude walks prepare-infra.md → deploy-dataplane.md step by step, asking before each cloud-mutating command and showing the values-file diff before saving.
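
The "compose, display, then require an explicit yes" gate from this step can be sketched as follows. The flags mirror the AWS command shown in the PR description; the function names `compose_aws_deploy` and `confirm_before_run` are hypothetical, and the `ask` parameter is injectable so the gate can be tested without a terminal.

```python
import shlex
from typing import Callable, List


def compose_aws_deploy(control_plane_url: str, cluster_name: str, aws_region: str) -> List[str]:
    """Build the argv for the AWS e2e script, using the flags from the PR description."""
    return [
        "flyte", "run", "--local", "--tui", "selfmanaged_aws_e2e.py", "main",
        "--control_plane_url", control_plane_url,
        "--cluster_name", cluster_name,
        "--aws_region", aws_region,
    ]


def confirm_before_run(argv: List[str], ask: Callable[[str], str] = input) -> bool:
    """Show the exact command and return True only on a literal 'yes'."""
    prompt = f"About to run:\n  {shlex.join(argv)}\nType 'yes' to proceed: "
    return ask(prompt).strip().lower() == "yes"
```

Accepting only the full word "yes" (not "y" or an empty line) is a deliberate choice in this sketch, since the command is cloud-mutating and long-running.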

5. Verify

Once helm install completes, Claude runs:

  • kubectl get pods -n union — every pod must reach Running/Completed
  • kubectl get events -n union --sort-by=.lastTimestamp | tail -20
  • AWS/GCP: smoke suite ran in Phase 3 — Claude reports the TUI summary
  • Other clouds: optional python -m smoke_tests invocation against the new cluster
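
The pod check in the first bullet can be automated against the JSON form of the same command. This is a sketch, assuming the input is the parsed output of `kubectl get pods -n union -o json`; the helper name `unready_pods` is hypothetical. Note that kubectl displays pods in the `Succeeded` phase as `Completed`.

```python
from typing import Dict, List


def unready_pods(pod_list: Dict) -> List[str]:
    """Return 'name (phase)' for every pod that is neither Running nor Succeeded."""
    bad: List[str] = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        phase = pod.get("status", {}).get("phase", "Unknown")
        # kubectl shows the Succeeded phase as "Completed" in its STATUS column.
        if phase not in ("Running", "Succeeded"):
            bad.append(f"{name} ({phase})")
    return bad
```

An empty result means every pod passed this phase check. A `Running` phase does not by itself prove that all containers are ready, so this complements the events listing rather than replacing it.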

6. Teardown is first-class

mode=teardown is a real entry point, not an afterthought. AWS/GCP invoke the script's teardown_cluster task (resolves account/project from ambient identity, deletes everything cluster_name implied). Manual clouds get a reverse-order delete with per-resource confirmation — the skill never bulk-deletes a resource group, project, or subscription.
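
For the manual clouds, the reverse-order delete with per-resource confirmation could look like the sketch below. The function `teardown_in_reverse` and its callbacks are hypothetical. Stopping at the first declined resource is an assumption of this sketch, chosen because resources created earlier may still be needed by the one the user decided to keep.

```python
from typing import Callable, List, Sequence


def teardown_in_reverse(
    created: Sequence[str],
    confirm: Callable[[str], bool],
    delete: Callable[[str], None],
) -> List[str]:
    """Delete resources newest-first, asking before each one.

    `created` lists resources in creation order. Returns what was deleted.
    """
    deleted: List[str] = []
    for resource in reversed(created):
        if not confirm(resource):
            # Keep this resource and everything it may depend on.
            break
        delete(resource)
        deleted.append(resource)
    return deleted
```

Because each resource is named and confirmed individually, there is no code path that removes a whole resource group, project, or subscription in one call.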

Operating rules the skill enforces

These appear at the top of SKILL.md and govern every cloud path:

  1. Never run a cloud-mutating command without showing it first and getting explicit "yes." Read-only commands run without prompting.
  2. Never invent values. Missing cluster_name, region, or credentials → ask, don't guess.
  3. Failed prereq → halt with the fix command. No silent workarounds.
  4. One cluster, one run. No batching across clusters in a single skill invocation.
  5. Re-runnable end to end — scripts are idempotent, manual paths note "skip if already created."
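
Rule 5 (re-runnable, "skip if already created") is the usual check-then-create pattern. A minimal sketch, with a hypothetical helper `ensure` whose `exists` and `create` callbacks stand in for whatever cloud or helm calls a given step uses:

```python
from typing import Callable


def ensure(name: str, exists: Callable[[], bool], create: Callable[[], None]) -> str:
    """Create a resource only if it is not already present; safe to call repeatedly."""
    if exists():
        return f"{name}: skipped (already created)"
    create()
    return f"{name}: created"
```

Running the same step twice then produces "created" the first time and "skipped (already created)" the second, which is what makes a full re-run after a partial failure safe.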

Test plan

  • Invoke /deploy-selfmanaged and confirm intent capture works (cloud + mode via questions, free-form for URL/name/region).
  • Run the universal prereq checks on a clean machine and confirm they catch missing helm/kubectl/uctl/uv.
  • AWS path: confirm the SSO-vs-STS-env-var check correctly halts SSO-only sessions and unblocks once aws configure export-credentials is eval'd.
  • AWS path: end-to-end deploy on a throwaway cluster (cost ~$1.30/hr) and confirm teardown completes cleanly.
  • AWS path: teardown mode against a --skip_teardown cluster.
  • GCP path: end-to-end deploy on a throwaway project; confirm gke-gcloud-auth-plugin prereq check catches its absence.
  • Azure / OCI / CoreWeave / generic walkthroughs: spot-read each <cloud>.md against the corresponding content/deployment/selfmanaged/<cloud>/ doc to confirm gates and command order match.

🤖 Generated with Claude Code

Guides users through deploying a Union self-managed (BYOC) data plane
end-to-end — captures intent, verifies prereqs, then runs the
deterministic path for the chosen cloud (flyte e2e script for AWS/GCP,
guided helm walkthrough for Azure/OCI/CoreWeave/generic). Whitelists
.claude/skills/ and .claude/commands/ so team-shared skills are
checked in while personal Claude state stays ignored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 01:30
cloudflare-workers-and-pages Bot commented May 6, 2026

Deploying docs with Cloudflare Pages

Latest commit: 3f06fa5
Status: ✅  Deploy successful!
Preview URL: https://58a4766b.docs-dog.pages.dev
Branch Preview URL: https://add-deploy-selfmanaged-skill.docs-dog.pages.dev

Documents the user-facing UX (six phases from trigger → teardown), how
to invoke the skill, and the operating rules that govern every cloud
path. Lives next to SKILL.md so contributors landing in the skill dir
have a human-readable overview.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Contributor

Pull request overview

This PR introduces a team-shared Claude skill and slash command to guide end-to-end Union self-managed (BYOC) dataplane deployments across multiple clouds, and updates repo hygiene/docs to support it.

Changes:

  • Add the deploy-selfmanaged Claude skill with per-cloud deployment/teardown walkthroughs and universal prereq gating.
  • Add /deploy-selfmanaged slash command for explicit invocation with optional [cloud] [mode] arguments.
  • Update .gitignore to keep team-shared .claude/skills/ + .claude/commands/ tracked, and update scripts/README.md to reflect GCP E2E script support.

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| scripts/README.md | Updates E2E scripts README to reflect AWS + GCP support and credential stashing limitations. |
| .gitignore | Adjusts ignore rules to allow sharing Claude skills/commands while keeping other Claude state ignored. |
| .claude/skills/deploy-selfmanaged/SKILL.md | Defines the top-level skill flow, intent capture, universal prereqs, and verification/teardown rules. |
| .claude/skills/deploy-selfmanaged/aws.md | AWS-specific prereqs and script-driven deploy/teardown instructions. |
| .claude/skills/deploy-selfmanaged/gcp.md | GCP-specific prereqs and script-driven deploy/teardown instructions. |
| .claude/skills/deploy-selfmanaged/azure.md | Azure guided walkthrough aligned to self-managed Azure docs (with some command/flag divergences noted). |
| .claude/skills/deploy-selfmanaged/oci.md | OCI guided walkthrough aligned to self-managed OCI docs (with some command/flag divergences noted). |
| .claude/skills/deploy-selfmanaged/coreweave.md | CoreWeave guided walkthrough (currently has several critical mismatches vs source-of-truth docs). |
| .claude/skills/deploy-selfmanaged/generic.md | Generic Kubernetes guided walkthrough (currently has provider mismatch vs source-of-truth docs). |
| .claude/commands/deploy-selfmanaged.md | Adds the /deploy-selfmanaged command entrypoint wiring into the skill. |


Source of truth:
`content/deployment/selfmanaged/selfmanaged-generic/prepare-infra.md`
and `selfmanaged-generic/deploy-dataplane.md`. Use this path for
```bash
uctl config init --host=<CONTROL_PLANE_URL>
uctl selfserve provision-dataplane-resources \
--clusterName <CLUSTER_NAME> --provider generic
Comment on lines +90 to +93
-n union --create-namespace \
-f <org>-values.yaml \
--wait
```

Source of truth:
`content/deployment/selfmanaged/selfmanaged-coreweave/prepare-infra.md`
and `selfmanaged-coreweave/deploy-dataplane.md`. CoreWeave-specific
Comment on lines +60 to +63
--clusterName <CLUSTER_NAME> --provider generic
```
(CoreWeave uses the `generic` provider — same as on-prem, with
custom storage overrides.)
Comment on lines +4 to +5
and `selfmanaged-oci/deploy-dataplane.md`. Follow the doc — this file
captures gates and order.
Comment on lines +77 to +80
-n union --create-namespace \
-f <org>-values.yaml \
--wait
```
# Azure path

Source of truth: `content/deployment/selfmanaged/selfmanaged-azure/prepare-infra.md`
and `selfmanaged-azure/deploy-dataplane.md`. The doc is authoritative —
Comment on lines +102 to +104
-n union --create-namespace \
-f <org>-values.yaml \
--wait
Comment on lines +114 to +119
```bash
helm version --short # expect: v3.x or newer
kubectl version --client # expect: any modern client
uctl version | head -1 # expect: >= 0.1.20 (note: subcommand, not --version)
which uv # required only for AWS/GCP (script venv)
```
ppiegaze commented May 7, 2026

Collaborator

@EngHabu Can you resolve the copilot comments?

@ppiegaze ppiegaze left a comment

Collaborator

Please check the copilot comments. Probably false positives but worth checking

github-actions Bot commented Jun 2, 2026

GHA build & deploy preview

Built by .github/workflows/build-and-deploy.yml and deployed to the docs CF Pages project.

Branch alias: https://pr-982-add-deploy-selfmanage.docs-dog.pages.dev
This commit: https://134bab76.docs-dog.pages.dev
Commit SHA: afd1ee54813bcc8ce295e6bdbc13b1fcd5b26e06

Updated automatically on every push.

@EngHabu EngHabu marked this pull request as draft June 18, 2026 17:09