diff --git a/AGENTS.md b/AGENTS.md
index 8e5fb3f..20897c3 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,320 +1,164 @@
-# cluster-api-provider-cloudscale - AI Agent Guide
-
-## Project Structure
-
-**Single-group layout (default):**
-```
-cmd/main.go                   Manager entry (registers controllers/webhooks)
-api/<version>/*_types.go      CRD schemas (+kubebuilder markers)
-api/<version>/zz_generated.*  Auto-generated (DO NOT EDIT)
-internal/controller/*         Reconciliation logic
-internal/webhook/*            Validation/defaulting (if present)
-config/crd/bases/*            Generated CRDs (DO NOT EDIT)
-config/rbac/role.yaml         Generated RBAC (DO NOT EDIT)
-config/samples/*              Example CRs (edit these)
-Makefile                      Build/test/deploy commands
-PROJECT                       Kubebuilder metadata, auto-generated (DO NOT EDIT)
-```
-
-**Multi-group layout** (for projects with multiple API groups):
-```
-api/<group>/<version>/*_types.go      CRD schemas by group
-internal/controller/<group>/*         Controllers by group
-internal/webhook/<group>/<version>/*  Webhooks by group and version (if present)
-```
-
-Multi-group layout organizes APIs by group name (e.g., `batch`, `apps`). Check the `PROJECT` file for `multigroup: true`.
-
-**To convert to multi-group layout:**
-1. Run: `kubebuilder edit --multigroup=true`
-2. Move APIs: `mkdir -p api/<group> && mv api/<version> api/<group>/`
-3. Move controllers: `mkdir -p internal/controller/<group> && mv internal/controller/*.go internal/controller/<group>/`
-4. Move webhooks (if present): `mkdir -p internal/webhook/<group> && mv internal/webhook/<version> internal/webhook/<group>/`
-5. Update import paths in all files
-6. Fix `path` in `PROJECT` file for each resource
-7. Update test suite CRD paths (add one more `..` to relative paths)
-
-## Critical Rules
-
-### Never Edit These (Auto-Generated)
-- `config/crd/bases/*.yaml` - from `make manifests`
-- `config/rbac/role.yaml` - from `make manifests`
-- `config/webhook/manifests.yaml` - from `make manifests`
-- `**/zz_generated.*.go` - from `make generate`
-- `PROJECT` - from `kubebuilder` CLI commands
-
-### Never Remove Scaffold Markers
-Do NOT delete `// +kubebuilder:scaffold:*` comments. The CLI injects code at these markers.
-
-### Keep Project Structure
-Do not move files around. The CLI expects files in specific locations.
-
-### Always Use CLI Commands
-Always use `kubebuilder create api` and `kubebuilder create webhook` to scaffold. Do NOT create files manually.
-
-### E2E Tests Require an Isolated Kind Cluster
-The e2e tests are designed to validate the solution in an isolated environment (similar to GitHub Actions CI).
-Ensure you run them against a dedicated [Kind](https://kind.sigs.k8s.io/) cluster (not your “real” dev/prod cluster).
-
-## After Making Changes
-
-**After editing `*_types.go` or markers:**
-```
-make manifests   # Regenerate CRDs/RBAC from markers
-make generate    # Regenerate DeepCopy methods
-```
-
-**After editing `*.go` files:**
-```
-make lint-fix    # Auto-fix code style
-make test        # Run unit tests
-```
-
-## CLI Commands Cheat Sheet
-
-### Create API (your own types)
-```bash
-kubebuilder create api --group <group> --version <version> --kind <Kind>
-```
-
-### Deploy Image Plugin (scaffold to deploy/manage ANY container image)
-
-Generate a controller that deploys and manages a container image (nginx, redis, memcached, your app, etc.):
-
-```bash
-# Example: deploying memcached
-kubebuilder create api --group example.com --version v1alpha1 --kind Memcached \
-  --image=memcached:alpine \
-  --plugins=deploy-image.go.kubebuilder.io/v1-alpha
-```
-
-Scaffolds good-practice code: reconciliation logic, status conditions, finalizers, RBAC. Use as a reference implementation.
-
-### Create Webhooks
-```bash
-# Validation + defaulting
-kubebuilder create webhook --group <group> --version <version> --kind <Kind> \
-  --defaulting --programmatic-validation
-
-# Conversion webhook (for multi-version APIs)
-kubebuilder create webhook --group <group> --version v1 --kind <Kind> \
-  --conversion --spoke v2
-```
-
-### Controller for Core Kubernetes Types
-```bash
-# Watch Pods
-kubebuilder create api --group core --version v1 --kind Pod \
-  --controller=true --resource=false
-
-# Watch Deployments
-kubebuilder create api --group apps --version v1 --kind Deployment \
-  --controller=true --resource=false
-```
-
-### Controller for External Types (e.g., from other operators)
-
-Watch resources from external APIs (cert-manager, Argo CD, Istio, etc.):
-
-```bash
-# Example: watching cert-manager Certificate resources
-kubebuilder create api \
-  --group cert-manager --version v1 --kind Certificate \
-  --controller=true --resource=false \
-  --external-api-path=github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1 \
-  --external-api-domain=io \
-  --external-api-module=github.com/cert-manager/cert-manager
-```
-
-**Note:** Use `--external-api-module=<module>@<version>` only if you need a specific version. Otherwise, omit `@<version>` to use what's in go.mod.
-
-### Webhook for External Types
-
-```bash
-# Example: validating external resources
-kubebuilder create webhook \
-  --group cert-manager --version v1 --kind Issuer \
-  --defaulting \
-  --external-api-path=github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1 \
-  --external-api-domain=io \
-  --external-api-module=github.com/cert-manager/cert-manager
-```
-
-## Testing & Development
-
-```bash
-make test   # Run unit tests (uses envtest: real K8s API + etcd)
-make run    # Run locally (uses current kubeconfig context)
-```
-
-Tests use **Ginkgo + Gomega** (BDD style). Check `suite_test.go` for setup.
-
-## Deployment Workflow
-
-```bash
-# 1. Regenerate manifests
-make manifests generate
-
-# 2. Build & deploy
-export IMG=<registry>/<image>:tag
-make docker-build docker-push IMG=$IMG  # Or: kind load docker-image $IMG --name <cluster>
-make deploy IMG=$IMG
-
-# 3. Test
-kubectl apply -k config/samples/
-
-# 4. Debug
-kubectl logs -n <project>-system deployment/<project>-controller-manager -c manager -f
-```
-
-### API Design
-
-**Key markers for** `api/<version>/*_types.go`:
-
-```go
-// +kubebuilder:object:root=true
-// +kubebuilder:subresource:status
-// +kubebuilder:resource:scope=Namespaced
-// +kubebuilder:printcolumn:name="Status",type=string,JSONPath=".status.conditions[?(@.type=='Ready')].status"
-
-// On fields:
-// +kubebuilder:validation:Required
-// +kubebuilder:validation:Minimum=1
-// +kubebuilder:validation:MaxLength=100
-// +kubebuilder:validation:Pattern="^[a-z]+$"
-// +kubebuilder:default="value"
-```
-
-- **Use** `metav1.Condition` for status (not custom string fields)
-- **Use predefined types**: `metav1.Time` instead of `string` for dates
-- **Follow K8s API conventions**: Standard field names (`spec`, `status`, `metadata`)
-
-### Controller Design
-
-**RBAC markers in** `internal/controller/*_controller.go`:
-
-```go
-// +kubebuilder:rbac:groups=mygroup.example.com,resources=mykinds,verbs=get;list;watch;create;update;patch;delete
-// +kubebuilder:rbac:groups=mygroup.example.com,resources=mykinds/status,verbs=get;update;patch
-// +kubebuilder:rbac:groups=mygroup.example.com,resources=mykinds/finalizers,verbs=update
-// +kubebuilder:rbac:groups=events.k8s.io,resources=events,verbs=create;patch
-// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
-```
-
-**Implementation rules:**
-- **Idempotent reconciliation**: Safe to run multiple times
-- **Re-fetch before updates**: `r.Get(ctx, req.NamespacedName, obj)` before `r.Update` to avoid conflicts
-- **Structured logging**: `log := log.FromContext(ctx); log.Info("msg", "key", val)`
-- **Owner references**: Enable automatic garbage collection (`SetControllerReference`)
-- **Watch secondary resources**: Use `.Owns()` or `.Watches()`, not just `RequeueAfter`
-- **Finalizers**: Clean up external resources (buckets, VMs, DNS entries)
-
-### Logging
-
-**Follow Kubernetes logging message style guidelines:**
-
-- Start from a capital letter
-- Do not end the message with a period
-- Active voice: subject present (`"Deployment could not create Pod"`) or omitted (`"Could not create Pod"`)
-- Past tense: `"Could not delete Pod"` not `"Cannot delete Pod"`
-- Specify object type: `"Deleted Pod"` not `"Deleted"`
-- Balanced key-value pairs
-
-```go
-log.Info("Starting reconciliation")
-log.Info("Created Deployment", "name", deploy.Name)
-log.Error(err, "Failed to create Pod", "name", name)
-```
-
-**Reference:** https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md#message-style-guidelines
-
-### Webhooks
-- **Create all types together**: `--defaulting --programmatic-validation --conversion`
-- **When `--force` is used**: Backup custom logic first, then restore after scaffolding
-- **For multi-version APIs**: Use hub-and-spoke pattern (`--conversion --spoke v2`)
-  - Hub version: Usually oldest stable version (v1)
-  - Spoke versions: Newer versions that convert to/from hub (v2, v3)
-  - Example: `--group crew --version v1 --kind Captain --conversion --spoke v2` (v1 is hub, v2 is spoke)
-
-### Learning from Examples
-
-The **deploy-image plugin** scaffolds a complete controller following good practices. Use it as a reference implementation:
-
-```bash
-kubebuilder create api --group example --version v1alpha1 --kind MyApp \
-  --image=<image> --plugins=deploy-image.go.kubebuilder.io/v1-alpha
-```
-
-Generated code includes: status conditions (`metav1.Condition`), finalizers, owner references, events, idempotent reconciliation.
-
-## Distribution Options
-
-### Option 1: YAML Bundle (Kustomize)
-
-```bash
-# Generate dist/install.yaml from Kustomize manifests
-make build-installer IMG=<registry>/<image>:tag
-```
-
-**Key points:**
-- The `dist/install.yaml` is generated from Kustomize manifests (CRDs, RBAC, Deployment)
-- Commit this file to your repository for easy distribution
-- Users only need `kubectl` to install (no additional tools required)
-
-**Example:** Users install with a single command:
-```bash
-kubectl apply -f https://raw.githubusercontent.com/<org>/<repo>/<branch>/dist/install.yaml
-```
-
-### Option 2: Helm Chart
-
-```bash
-kubebuilder edit --plugins=helm/v2-alpha                      # Generates dist/chart/ (default)
-kubebuilder edit --plugins=helm/v2-alpha --output-dir=charts  # Generates charts/chart/
-```
-
-**For development:**
-```bash
-make helm-deploy IMG=<registry>/<image>:<tag>          # Deploy manager via Helm
-make helm-deploy IMG=$IMG HELM_EXTRA_ARGS="--set ..."  # Deploy with custom values
-make helm-status     # Show release status
-make helm-uninstall  # Remove release
-make helm-history    # View release history
-make helm-rollback   # Rollback to previous version
-```
-
-**For end users/production:**
-```bash
-helm install my-release ./<output-dir>/chart/ --namespace <namespace> --create-namespace
-```
-
-**Important:** If you add webhooks or modify manifests after initial chart generation:
-1. Backup any customizations in `<output-dir>/chart/values.yaml` and `<output-dir>/chart/manager/manager.yaml`
-2. Re-run: `kubebuilder edit --plugins=helm/v2-alpha --force` (use same `--output-dir` if customized)
-3. Manually restore your custom values from the backup
-
-### Publish Container Image
-
-```bash
-export IMG=<registry>/<image>:<tag>
-make docker-build docker-push IMG=$IMG
-```
+# AGENTS.md — CAPCS-specific guidance for AI agents
+
+This file is a supplement for AI agents working on CAPCS
+(cluster-api-provider-cloudscale). For architecture, setup, and the
+day-to-day developer flow, read [`docs/development.md`](docs/development.md)
+first. This file covers the rules and conventions that are most
+load-bearing for code changes.
+
+Related docs:
+
+- [`docs/development.md`](docs/development.md) — architecture, Tilt, test layers
+- [`CONTRIBUTING.md`](CONTRIBUTING.md) — PR flow, required local checks
+- [`docs/getting-started.md`](docs/getting-started.md) — end-user workflow
+- [`docs/troubleshooting.md`](docs/troubleshooting.md) — common failure modes
+
+## Do not edit (auto-generated)
+
+- `config/crd/bases/*.yaml`, `config/rbac/role.yaml`,
+  `config/webhook/manifests.yaml` — regenerated by `make manifests`.
+- `**/zz_generated.*.go` — regenerated by `make generate`.
+- `PROJECT` — owned by kubebuilder.
+- Leave `// +kubebuilder:scaffold:*` markers in place; the CLI injects code
+  at them.
+
+## What to run after a change
+
+| You touched                                     | Run                                                                                  |
+|-------------------------------------------------|--------------------------------------------------------------------------------------|
+| `api/v1beta2/*_types.go` or kubebuilder markers | `make manifests generate`                                                            |
+| Any `*.go`                                      | `make lint-fix && make test`                                                         |
+| A reconciler or a template under `templates/`   | `make test-e2e-lifecycle` before opening a PR ([`CONTRIBUTING.md`](CONTRIBUTING.md)) |
+
+**Webhooks own all defaulting and validation.** Never duplicate webhook
+logic in a controller — behaviour must be identical between `kubectl apply`
+and the reconcile loop.
+
+## Use TDD
+
+Write code red → green: a failing test first, then the minimal change that
+turns it green. This applies to new behaviour, bug fixes, and refactors
+that change observable behaviour. Tests live next to the code they exercise
+(`*_test.go`); the cloudscale API is mocked through the service interfaces
+in `internal/cloudscale/`.
+
+## Where things live
+
+| CRD                         | Types                                            | Controller                                                                                                                   | Webhook                                                                                     |
+|-----------------------------|--------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|
+| `CloudscaleCluster`         | `api/v1beta2/cloudscalecluster_types.go`         | `internal/controller/cloudscalecluster_controller.go` + `cloudscalecluster_{network,loadbalancer,floatingip,servergroup}.go` | `internal/webhook/v1beta2/cloudscalecluster_webhook.go`                                     |
+| `CloudscaleMachine`         | `api/v1beta2/cloudscalemachine_types.go`         | `internal/controller/cloudscalemachine_controller.go` + `cloudscalemachine_server.go`                                        | `internal/webhook/v1beta2/cloudscalemachine_webhook.go` + `cloudscalemachine_validation.go` |
+| `CloudscaleMachineTemplate` | `api/v1beta2/cloudscalemachinetemplate_types.go` | `internal/controller/cloudscalemachinetemplate_controller.go`                                                                | `internal/webhook/v1beta2/cloudscalemachinetemplate_webhook.go`                             |
+
+Shared infrastructure:
+
+- `internal/cloudscale/{client,services}.go` — SDK wrapper, service interfaces, shared HTTP transport
+- `internal/scope/{cluster,machine}.go` — reconciliation scope objects
+- `internal/credentials/` — resolves the per-cluster API token from `credentialsRef`
+- `api/v1beta2/condition_types.go` — condition type and reason constants
+- `api/v1beta2/tags.go`, `internal/controller/cloudscale_tags.go` — ownership tagging
+
+For the prose architecture sketch see [`docs/development.md`](docs/development.md#architecture-sketch).
+
+## Reconciler conventions
+
+- **Scope pattern.** Build a `scope.ClusterScope` / `scope.MachineScope` at
+  the top of `Reconcile`. It bundles the client, logger, CAPI `Cluster`, the
+  CAPCS object, a `patch.Helper`, and the cloudscale client. Defer
+  `scope.Close()` **with a fresh context** — the reconcile context may
+  already be timed out by the time status persistence runs. See
+  `internal/controller/cloudscalecluster_controller.go`.
+- **Reconcile shape:** get CAPI owner → check pause
+  (`util/annotations.IsPaused`) → load credentials → build scope → defer
+  close → branch on deletion timestamp → add finalizer → call
+  `reconcileNormal` / `reconcileDelete`.
+- **Finalizer constants** live next to the types
+  (`api/v1beta2/cloudscalecluster_types.go`, look for `ClusterFinalizer`).
+- **Conditions.** Set sub-conditions (`NetworkReadyCondition`,
+  `LoadBalancerReadyCondition`, `FloatingIPReadyCondition`,
+  `ServerReadyCondition`, `ServerGroupReadyCondition`) that roll up into
+  `ReadyCondition`. Full list with reason constants:
+  `api/v1beta2/condition_types.go`.
+- **Requeue intervals.** Return `ctrl.Result{}` on steady state. For
+  transient waits, reuse existing intervals — 5 s
+  (`ServerStatusPollInterval`, `cloudscalemachine_controller.go:51`) for
+  server status polling, 10 s for "still draining" waits
+  (`cloudscalecluster_controller.go:207-214`). Don't invent new ones.
+- **Ownership tag.** Every cloudscale resource we create is tagged with
+  `capcs-cluster-<cluster-name>: owned` via
+  `internal/controller/cloudscale_tags.go`. Reuse `clusterOwnershipTags`;
+  do not invent a parallel ownership scheme.
+
+## Webhook conventions
+
+- Use `CustomDefaulter` + `CustomValidator` and wire both with
+  `ctrl.NewWebhookManagedBy(...).WithValidator(...).WithDefaulter(...)` —
+  see `internal/webhook/v1beta2/cloudscalecluster_webhook.go:40`.
+- Defaulters and validators get a `*cloudscale.RegionInfo` injected; use it
+  for region/zone/flavor lookups instead of hitting the API live.
+- Validators enforce immutability. Existing examples: region, zone, network
+  config, LB `Enabled` flag, FIP managed/pre-existing switch. When you add
+  a new spec field, decide whether it is immutable up front and add the
+  check here, not in the controller.
+
+## Cloudscale SDK usage
+
+- Do not `import "github.com/cloudscale-ch/cloudscale-go-sdk/v8"` outside
+  `internal/cloudscale/`. Controllers and webhooks talk to the SDK through
+  the service interfaces on `cloudscale.Client`
+  (`internal/cloudscale/client.go:32`).
+- The shared `*http.Transport` (`internal/cloudscale/client.go:62`) is
+  created once per manager. Do not build new transports per reconciliation.
+- Use the error helpers — `IsNotFound`, `IsFloatingIPNoPublicInterface`,
+  `IsTimeoutError` (`internal/cloudscale/client.go:121-144`) — instead of
+  string-matching on error messages.
+- Wrap SDK calls with `context.WithTimeout` using the constants in
+  `client.go`: `ReadTimeout` (1 s), `WriteTimeout` (2 m), `DeleteTimeout`
+  (2 s). Existing controllers already do this; match the pattern.
+
+## API design rules
+
+- Standard CRD markers on infrastructure types:
+  ```go
+  // +kubebuilder:object:root=true
+  // +kubebuilder:subresource:status
+  // +kubebuilder:resource:path=<plural>,scope=Namespaced,categories=cluster-api
+  ```
+  The `categories=cluster-api` bit is what makes `kubectl get cluster-api`
+  include the resource — keep it on any new infrastructure CRD.
+- Print columns. A new infrastructure CRD should expose at minimum
+  `Cluster` (from the `cluster.x-k8s.io/cluster-name` label) and
+  `Provisioned` (from `.status.initialization.provisioned`), plus one or
+  two type-specific columns. Pattern:
+  `api/v1beta2/cloudscalecluster_types.go:317-323`.
+- Implement `conditions.Setter` from
+  `sigs.k8s.io/cluster-api/util/conditions` so CAPI tooling can read status
+  uniformly. See
+  `api/v1beta2/cloudscalecluster_types.go:344-355`.
+- Prefer kubebuilder validation markers (`Required`, `Enum`, `Pattern`,
+  `MinLength`, `MaxLength`, `Minimum`, `Maximum`, `default`) over webhook
+  checks where a marker suffices — they generate OpenAPI schema and are
+  enforced by the API server.
+
+## Multi-version / multi-group
+
+CAPCS is single-group, single-version (`infrastructure.cluster.x-k8s.io/v1beta2`).
+If we ever introduce `v1beta3`, use
+`kubebuilder create webhook --conversion --spoke v1beta3` and follow the
+[Kubebuilder Book](https://book.kubebuilder.io) — do not roll a custom
+conversion scheme.
+
+## Logging style
+
+Follow
+the [Kubernetes message style guide](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md#message-style-guidelines):
+capitalised start, no trailing period, past tense, name the object type
+(`"Created FloatingIP"`, not `"created"`), balanced key/value pairs. Use
+`log.FromContext(ctx)` (or pass the logger through the scope).
 
 ## References
 
-### Essential Reading
-- **Kubebuilder Book**: https://book.kubebuilder.io (comprehensive guide)
-- **controller-runtime FAQ**: https://github.com/kubernetes-sigs/controller-runtime/blob/main/FAQ.md (common patterns and questions)
-- **Good Practices**: https://book.kubebuilder.io/reference/good-practices.html (why reconciliation is idempotent, status conditions, etc.)
-- **Logging Conventions**: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md#message-style-guidelines (message style, verbosity levels)
-
-### API Design & Implementation
-- **API Conventions**: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md
-- **Operator Pattern**: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
-- **Markers Reference**: https://book.kubebuilder.io/reference/markers.html
-
-### Tools & Libraries
-- **controller-runtime**: https://github.com/kubernetes-sigs/controller-runtime
-- **controller-tools**: https://github.com/kubernetes-sigs/controller-tools
-- **Kubebuilder Repo**: https://github.com/kubernetes-sigs/kubebuilder
+- [Kubebuilder Book](https://book.kubebuilder.io)
+- [Cluster API Book](https://cluster-api.sigs.k8s.io)
+- [controller-runtime FAQ](https://github.com/kubernetes-sigs/controller-runtime/blob/main/FAQ.md)
+- [cloudscale-go-sdk](https://github.com/cloudscale-ch/cloudscale-go-sdk)
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 0000000..677ffdf
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,35 @@
+# Contributing
+
+Thanks for your interest in improving CAPCS.
+
+## Issues
+
+File bugs and feature requests in the
+[GitHub issue tracker](https://github.com/cloudscale-ch/cluster-api-provider-cloudscale/issues).
+Please include the CAPCS version, the Kubernetes version of your management
+cluster, and the relevant CRD YAML when reporting a bug. If you are unsure
+whether a problem is a bug, open an issue anyway — it is easier to redirect
+than to discover later.
+
+## Pull requests
+
+1. Fork the repository and create a feature branch off `main`.
+2. Make your change. Tests live next to the code; new behavior needs a test.
+3. Run `make test` and `make lint` locally.
+4. For changes that touch reconcilers or templates, run at least
+   `make test-e2e-lifecycle` against a cloudscale.ch project.
+5. Open a PR against `main`. Keep the title short and the description
+   focused on the *why*.
+
+Commit messages loosely follow
+[Conventional Commits](https://www.conventionalcommits.org/) (`feat:`, `fix:`,
+`chore:`, `docs:`). Match the style of recent commits.
+
+## Development setup
+
+See [docs/development.md](docs/development.md) for architecture, Tilt setup,
+test layers, and make targets.
+
+## Questions
+
+Open an issue — there is no separate chat channel.
diff --git a/README.md b/README.md
index afd3721..f145392 100644
--- a/README.md
+++ b/README.md
@@ -4,195 +4,60 @@
 [![Release](https://img.shields.io/github/v/release/cloudscale-ch/cluster-api-provider-cloudscale)](https://github.com/cloudscale-ch/cluster-api-provider-cloudscale/releases/latest)
 
 Kubernetes [Cluster API](https://cluster-api.sigs.k8s.io/) infrastructure provider
-for [cloudscale.ch](https://www.cloudscale.ch).
+for [cloudscale.ch](https://www.cloudscale.ch). CAPCS provisions the cloudscale-specific
+infrastructure — servers, networks, load balancers, floating IPs, server groups —
+that Cluster API uses to build and manage workload Kubernetes clusters.
+
+New to Cluster API? Read the upstream
+[concepts](https://cluster-api.sigs.k8s.io/user/concepts.html) and
+[quick start](https://cluster-api.sigs.k8s.io/user/quick-start.html) first; this
+project only documents what is cloudscale-specific.
 
 ## Features
 
-- **CloudscaleCluster**: Multi-network management (managed or pre-existing), Load Balancer (public or private VIP),
-  Floating IP
-  support
-- **CloudscaleMachine**: Server provisioning with cloud-init and configurable network interfaces
-- **CloudscaleMachineTemplate**: Immutable machine templates for KubeadmControlPlane/MachineDeployment
+- Three CRDs: `CloudscaleCluster`, `CloudscaleMachine`, `CloudscaleMachineTemplate`
+- Managed or pre-existing networks; public or private load balancer VIPs;
+  floating IPs (IPv4/IPv6); anti-affinity server groups
+- Supported regions: `lpg`, `rma`
+- HA control plane; `MachineDeployment` autoscaling including
+  [scale-from-zero](https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/autoscaling)
+  via capacity reported on `CloudscaleMachineTemplate`
+- Four cluster templates: `default`, `fip`, `pre-existing-network`,
+  `public-lb-private-nodes`
 
 ## Prerequisites
 
-- A Kubernetes cluster to use as a management cluster ([kind](https://kind.sigs.k8s.io/) works)
-- [clusterctl](https://cluster-api.sigs.k8s.io/user/quick-start#install-clusterctl)
-- A [cloudscale.ch](https://www.cloudscale.ch) account and API token
-- A custom image imported into cloudscale. Images can e.g. be generated
-  using [image-builder Openstack](https://image-builder.sigs.k8s.io/)
+- cloudscale.ch account and API token
+- A custom OS image imported into your cloudscale.ch project, e.g. built with
+  [image-builder for OpenStack](https://image-builder.sigs.k8s.io/)
+- A management Kubernetes cluster ([kind](https://kind.sigs.k8s.io/) works) and
+  [clusterctl](https://cluster-api.sigs.k8s.io/user/quick-start#install-clusterctl)
 
 ## Quickstart
 
-### Initialize the management cluster
-
 ```bash
 export CLOUDSCALE_API_TOKEN=<token>
-
 clusterctl init --infrastructure cloudscale-ch-cloudscale
-```
-
-### Generate and apply a workload cluster
-
-Set the [required environment variables](#environment-variables), then generate and apply the cluster manifest:
-
-```bash
 clusterctl generate cluster my-cluster \
-  --infrastructure cloudscale-ch-cloudscale \
-  --kubernetes-version v1.36.0 \
-  --control-plane-machine-count 1 \
-  --worker-machine-count 2 \
+  --infrastructure cloudscale-ch-cloudscale --kubernetes-version v1.36.0 \
+  --control-plane-machine-count 1 --worker-machine-count 2 \
   | kubectl apply -f -
-```
-
-This uses the default template (public nodes, managed network). See [Cluster Templates](#cluster-templates) for other
-network topologies.
-
-Watch the cluster come up:
-
-```bash
 clusterctl describe cluster my-cluster
 ```
 
-## Environment Variables
-
-| Variable                                  | Description                             | Example                           |
-|-------------------------------------------|-----------------------------------------|-----------------------------------|
-| `CLOUDSCALE_API_TOKEN`                    | cloudscale.ch API token                 | `abc123...`                       |
-| `CLOUDSCALE_SSH_PUBLIC_KEY`               | SSH public key added to nodes           | `ssh-ed25519 AAAA...`             |
-| `CLOUDSCALE_REGION`                       | cloudscale.ch region                    | `lpg` or `rma`                    |
-| `CLOUDSCALE_MACHINE_IMAGE`                | Server image for nodes                  | `custom:ubuntu-2404-kube-v1.xx.x` |
-| `CLOUDSCALE_CONTROL_PLANE_MACHINE_FLAVOR` | Flavor for control plane nodes          | `flex-4-2`                        |
-| `CLOUDSCALE_WORKER_MACHINE_FLAVOR`        | Flavor for worker nodes                 | `flex-4-2`                        |
-| `CLOUDSCALE_ROOT_VOLUME_SIZE`             | Root volume size in GB                  | `50`                              |
-| `CLOUDSCALE_NETWORK_UUID`                 | Pre-Existing cloudscale.ch network UUID | `2db69ba3-...`                    |
-
-> **Note:** `CLOUDSCALE_NETWORK_UUID` is required by the `fip`, `public-lb-private-nodes`, and `pre-existing-network`
-> template flavors. It is not needed for the default template.
-
-## Cluster Templates
-
-CAPCS ships several cluster templates for different network topologies. Use `clusterctl generate cluster` with the
-`--flavor` flag to select one:
-
-```bash
-clusterctl generate cluster my-cluster \
-  --infrastructure cloudscale-ch-cloudscale \
-  --kubernetes-version v1.36.0 \
-  --control-plane-machine-count 1 \
-  --worker-machine-count 2 \
-  --flavor <flavor> \
-  | kubectl apply -f -
-```
+The default template uses a managed network and a public load balancer.
+[Getting Started](docs/getting-started.md) lists the required environment
+variables and the other template flavors.
-| Flavor                    | Network                   | CP Endpoint           | Node Connectivity | Extra Env Vars            | Notes                |
-|---------------------------|---------------------------|-----------------------|-------------------|---------------------------|----------------------|
-| *(default)*               | Managed (`172.18.0.0/24`) | Public LB (DualStack) | Public + cluster  | —                         |                      |
-| `fip`                     | Pre-Existing              | Floating IP (IPv4)    | Public + cluster  | `CLOUDSCALE_NETWORK_UUID` |                      |
-| `public-lb-private-nodes` | Pre-Existing + NAT        | Public LB             | Private only      | `CLOUDSCALE_NETWORK_UUID` | Requires NAT gateway |
-| `pre-existing-network`    | Pre-Existing              | Public LB (DualStack) | Public + cluster  | `CLOUDSCALE_NETWORK_UUID` |                      |
+## Documentation
 
-The default `networks[].cidr` is `172.18.0.0/24` so it does not overlap with the default Cilium
-cluster-pool IPAM range `10.0.0.0/8`. If you override `networks[].cidr` to a range inside
-`10.0.0.0/8`, make sure to configure your CNI's IP range correctly. Overlapping
-ranges may break for example control-plane LB's health checks.
-
-## Development
-
-This is a kubebuilder-scaffolded project. For new APIs, Webhooks, etc. [kubebuilder](https://book.kubebuilder.io/)
-commands should be used.
-
-```bash
-# Run tests
-make test
-
-# Generate manifests
-make manifests
-
-# Generate code
-make generate
-
-# Run E2E tests (requires CLOUDSCALE_API_TOKEN)
-make test-e2e
-```
-
-### E2E Tests
-
-E2E tests are built on the [CAPI e2e test framework](https://pkg.go.dev/sigs.k8s.io/cluster-api/test/e2e)
-(Ginkgo-based) and provision real clusters on cloudscale.ch. Tests use Ginkgo labels for
-filtering and are split into suites of increasing cost, scheduled accordingly:
-
-| Suite                   | Label                     | Description                                                                              | ~Duration | Schedule | Make target                        |
-|-------------------------|---------------------------|------------------------------------------------------------------------------------------|-----------|----------|------------------------------------|
-| Lifecycle               | `lifecycle`               | 1 CP + 1 worker: create, validate cloudscale resources, delete                           | ~5 min    | Nightly  | `test-e2e-lifecycle`               |
-| HA lifecycle            | `ha`                      | 3 CP + 2 workers with anti-affinity server groups                                        | ~8 min    | Weekly   | `test-e2e-ha`                      |
-| Cluster upgrade         | `upgrade`                 | Rolling K8s version upgrade (v1.34 → v1.35)                                              | ~25 min   | Weekly   | `test-e2e-upgrade`                 |
-| Self-hosted             | `self-hosted`             | clusterctl move (pivot) to workload cluster. Requires container image in public registry | ~13 min   | Weekly   | `test-e2e-self-hosted`             |
-| MD remediation          | `md-remediation`          | MachineHealthCheck auto-replacement of unhealthy workers                                 | ~6 min    | Weekly   | `test-e2e-md-remediation`          |
-| Pre-Existing networking | `pre-existing-networking` | Pre-Existing network: public-LB + private-nodes and floating-IP variants                 | ~30 min   | Weekly   | `test-e2e-pre-existing-networking` |
-| Conformance (fast)      | `conformance`             | K8s conformance, skip Serial tests                                                       | ~55 min   | Weekly   | `test-e2e-conformance-fast`        |
-| Conformance (full)      | `conformance`             | Full K8s conformance including Serial tests                                              | ~120 min  | Biweekly | `test-e2e-conformance`             |
-
-Durations are approximate from a real CI run; conformance varies with cluster size.
-
-**Why this split?** The single-CP lifecycle test is the cheapest smoke test and runs
-nightly to catch regressions early. HA, upgrade, self-hosted, and remediation tests are more
-resource-intensive and run weekly. Private networking tests require `CLOUDSCALE_NETWORK_UUID` to be set and are
-skipped otherwise. Full K8s conformance is the most expensive and runs biweekly
-(1st + 15th of month). All suites can be triggered manually via the `test-e2e.yml` workflow
-dispatch. E2E tests share a concurrency group so only one suite runs at a time.
-
-Any run involving the self-hosted spec requires the container image to be published to our registry. The self-hosted
-spec moves the management cluster to the first workload cluster. That workload cluster doesn't have access to the
-locally
-built images and therefore needs a published container image.
-
-For PRs, no e2e test is automatically run. It is advised to run them locally before submitting, as well as for a
-reviewer
-to run them locally and/or manually triggering the workflow **after** reviewing the code is safe.
-
-### Tilt
-
-The easiest way to work on this provider is by using the
-[Tilt setup](https://cluster-api.sigs.k8s.io/developer/core/tilt.html) of Cluster-API.
-
-Refer to the linked documentation on how to set up your local tilt. This requires cloning
-[Cluster-API core](https://github.com/kubernetes-sigs/cluster-api) to your host. The necessary commands need to be
-executed in the
-Cluster-API core repository (**not** in this repository).
- -An example `tilt-settings.yaml`, which should also be placed in the Cluster-API core repository, is provided here: - -```yaml -default_registry: "" # change if you use a remote image registry -provider_repos: - # This refers to your provider directory and loads settings - # from `tilt-provider.yaml` - - path/to/local/clone/cluster-api-provider-cloudscale -enable_providers: - - cloudscale - - kubeadm-bootstrap - - kubeadm-control-plane -deploy_cert_manager: true -kustomize_substitutions: - CLOUDSCALE_API_TOKEN: "INSERT_TOKEN_HERE" - CLOUDSCALE_SSH_PUBLIC_KEY: "INSERT_SSH_PUBLIC_KEY_HERE" - CLOUDSCALE_REGION: "lpg" - CLOUDSCALE_CONTROL_PLANE_MACHINE_FLAVOR: "flex-4-2" - CLOUDSCALE_WORKER_MACHINE_FLAVOR: "flex-4-2" - CLOUDSCALE_MACHINE_IMAGE: "IMAGE_NAME" - CLOUDSCALE_ROOT_VOLUME_SIZE: "50" - # Required for pre-existing network flavors (fip, public-lb-private-nodes, pre-existing-network): - # CLOUDSCALE_NETWORK_UUID: "UUID_HERE" -extra_args: - cloudscale: - - "--zap-log-level=5" -template_dirs: - docker: - - ./test/infrastructure/docker/templates - cloudscale: - - path/to/local/clone/cluster-api-provider-cloudscale/templates -``` +| If you are… | Start here | +|-------------------------------------|----------------------------------------------------------------------------------------------------------------| +| New to Cluster API, or new to CAPCS | [Getting Started](docs/getting-started.md) | +| Looking up a CRD field | `kubectl explain cloudscalecluster.spec` (or the generated CRDs under [`config/crd/bases/`](config/crd/bases)) | +| Hitting an error | [Troubleshooting](docs/troubleshooting.md) | +| Contributing to CAPCS | [Development](docs/development.md), [CONTRIBUTING.md](CONTRIBUTING.md) | +| Cutting a release | [Releasing](docs/releasing.md), [Testing releases](docs/testing-releases.md) | ## License diff --git a/api/v1beta2/cloudscalecluster_types.go b/api/v1beta2/cloudscalecluster_types.go index 294cb68..ca0077a 100644 --- 
a/api/v1beta2/cloudscalecluster_types.go +++ b/api/v1beta2/cloudscalecluster_types.go @@ -40,13 +40,17 @@ const ( // CloudscaleClusterSpec defines the desired state of CloudscaleCluster type CloudscaleClusterSpec struct { - // Region is the cloudscale.ch region (e.g., "rma", "lpg"). + // Region is the cloudscale.ch region the cluster is provisioned in. + // Determines the default zone and the set of available flavors. + // Immutable after cluster creation. // +kubebuilder:validation:Required // +kubebuilder:validation:Enum=rma;lpg Region string `json:"region"` - // Zone is the cloudscale.ch zone (e.g., "rma1", "lpg1"). - // Defaults to region + "1" if not specified. + // Zone is the cloudscale.ch zone within Region. + // Defaults to Region + "1" (e.g., "rma1", "lpg1"). Set explicitly only when + // the region offers multiple zones and you need to pin the cluster to one. + // Immutable after cluster creation. // +optional Zone string `json:"zone,omitempty"` @@ -86,13 +90,16 @@ type CloudscaleClusterSpec struct { FloatingIP *FloatingIPSpec `json:"floatingIP,omitempty"` } -// CloudscaleCredentialsReference references a Secret containing the API token. +// CloudscaleCredentialsReference references a Secret holding the cloudscale.ch +// API token used to provision this cluster's infrastructure. The Secret must +// contain a key named "token" with the raw token string as its value. type CloudscaleCredentialsReference struct { // Name is the name of the Secret. // +kubebuilder:validation:Required Name string `json:"name"` - // Namespace is the namespace of the Secret. Defaults to the cluster namespace. + // Namespace is the namespace of the Secret. Defaults to the + // CloudscaleCluster's own namespace if unset. // +optional Namespace string `json:"namespace,omitempty"` } @@ -138,18 +145,24 @@ type LoadBalancerSpec struct { // +optional Enabled *bool `json:"enabled,omitempty"` - // Algorithm is the load balancing algorithm. 
+ // Algorithm is the cloudscale.ch load-balancing algorithm. + // - "round_robin" (default): rotate requests across healthy backends. + // - "least_connections": send each request to the backend with the fewest active connections. + // - "source_ip": hash the client IP so the same client lands on the same backend. // +kubebuilder:validation:Enum=round_robin;least_connections;source_ip // +kubebuilder:default="round_robin" // +optional Algorithm string `json:"algorithm,omitempty"` - // Flavor is the load balancer flavor (size). + // Flavor is the cloudscale.ch load balancer flavor slug. Defaults to + // "lb-standard". // +kubebuilder:default="lb-standard" // +optional Flavor string `json:"flavor,omitempty"` - // APIServerPort is the port for the Kubernetes API server. + // APIServerPort is the LB listener port exposed for the Kubernetes API + // server. Defaults to 6443. The pool always targets the API server on the + // control plane nodes' 6443. // +kubebuilder:default=6443 // +kubebuilder:validation:Minimum=1 // +kubebuilder:validation:Maximum=65535 @@ -309,7 +322,9 @@ func (s *CloudscaleClusterStatus) GetNetworkStatus(name string) *NetworkStatus { // +kubebuilder:printcolumn:name="Region",type="string",JSONPath=".spec.region",description="cloudscale.ch region" // +kubebuilder:printcolumn:name="Endpoint",type="string",JSONPath=".spec.controlPlaneEndpoint.host",description="Control plane endpoint" -// CloudscaleCluster is the Schema for the cloudscaleclusters API +// CloudscaleCluster is the cloudscale.ch infrastructure for a CAPI Cluster. +// It owns the networks, control-plane load balancer, optional floating IP, and +// server groups that back the cluster's machines. 
type CloudscaleCluster struct { metav1.TypeMeta `json:",inline"` diff --git a/api/v1beta2/cloudscalemachine_types.go b/api/v1beta2/cloudscalemachine_types.go index d4eb98c..056e36a 100644 --- a/api/v1beta2/cloudscalemachine_types.go +++ b/api/v1beta2/cloudscalemachine_types.go @@ -34,22 +34,34 @@ type CloudscaleMachineSpec struct { // +optional ProviderID *string `json:"providerID,omitempty"` - // Flavor is the cloudscale.ch server flavor (e.g., "flex-8-4"). + // Flavor is the cloudscale.ch server flavor slug, e.g. "flex-4-2" or + // "plus-8-4". List available flavors via the cloudscale API + // (`GET /v1/flavors`) or the control panel. + // Immutable after machine creation. // +kubebuilder:validation:Required // +kubebuilder:validation:MinLength=1 Flavor string `json:"flavor"` - // Image is the OS image slug (e.g., "ubuntu-24.04"), custom image slug (e.g., "custom:ubuntu-foo"), or custom image UUID. + // Image identifies the OS image used to boot the server. One of: + // - a public image slug (e.g. "ubuntu-24.04"), + // - a custom image slug (e.g. "custom:ubuntu-2404-kube-v1.36.0"), or + // - a custom image UUID. + // For Kubernetes nodes you typically want a custom image built with + // image-builder (https://image-builder.sigs.k8s.io/) that already contains + // kubelet, containerd, and the chosen Kubernetes version. // +kubebuilder:validation:Required // +kubebuilder:validation:MinLength=1 Image string `json:"image"` - // RootVolumeSize is the root volume size in GB. + // RootVolumeSize is the root volume size in GB. Minimum 10. If unset, the + // cloudscale.ch default for the chosen flavor is used. // +kubebuilder:validation:Minimum=10 // +optional RootVolumeSize int `json:"rootVolumeSize,omitempty"` - // Tags are key-value pairs to apply to the server. + // Tags are user-defined key/value pairs applied to the server as cloudscale + // tags. 
CAPCS additionally sets its own ownership tag with the key + // "capcs-cluster-"; do not set keys with the "capcs-" prefix. // +optional Tags map[string]string `json:"tags,omitempty"` @@ -95,10 +107,14 @@ type InterfaceSpec struct { } // ServerGroupSpec configures server group placement for anti-affinity. +// cloudscale.ch limits a single server group to 4 servers; to scale a pool +// beyond that, split it across multiple MachineDeployments each pointing at a +// CloudscaleMachineTemplate with a distinct ServerGroupSpec.Name. type ServerGroupSpec struct { // Name is the server group name. Machines with the same server group name - // in the same zone will be placed on different physical hosts. - // The server group is created automatically if it doesn't exist. + // in the same zone are placed on different physical hosts. The group is + // created automatically the first time CAPCS sees the name. + // Immutable after machine creation. // +kubebuilder:validation:Required // +kubebuilder:validation:MinLength=1 Name string `json:"name"` @@ -153,7 +169,9 @@ type CloudscaleMachineStatus struct { // +kubebuilder:printcolumn:name="ProviderID",type="string",JSONPath=".spec.providerID",description="cloudscale.ch server ID" // +kubebuilder:printcolumn:name="Machine",type="string",JSONPath=".metadata.ownerReferences[?(@.kind==\"Machine\")].name",description="Machine object" -// CloudscaleMachine is the Schema for the cloudscalemachines API +// CloudscaleMachine represents a single cloudscale.ch server backing a CAPI +// Machine. Most spec fields are immutable after creation — to change them, +// roll the owning MachineDeployment or KubeadmControlPlane. 
type CloudscaleMachine struct { metav1.TypeMeta `json:",inline"` diff --git a/api/v1beta2/cloudscalemachinetemplate_types.go b/api/v1beta2/cloudscalemachinetemplate_types.go index 64fd857..2b91553 100644 --- a/api/v1beta2/cloudscalemachinetemplate_types.go +++ b/api/v1beta2/cloudscalemachinetemplate_types.go @@ -70,7 +70,10 @@ type NodeInfo struct { // +kubebuilder:object:root=true // +kubebuilder:subresource:status -// CloudscaleMachineTemplate is the Schema for the cloudscalemachinetemplates API +// CloudscaleMachineTemplate is the immutable template a MachineDeployment or +// KubeadmControlPlane uses to stamp out CloudscaleMachines. Its Status.Capacity +// reports the CPU/memory of the chosen flavor (plus the root volume size) so +// the cluster autoscaler can scale a MachineDeployment up from zero replicas. type CloudscaleMachineTemplate struct { metav1.TypeMeta `json:",inline"` diff --git a/config/crd/bases/infrastructure.cluster.x-k8s.io_cloudscaleclusters.yaml b/config/crd/bases/infrastructure.cluster.x-k8s.io_cloudscaleclusters.yaml index d53dd94..b50bc8c 100644 --- a/config/crd/bases/infrastructure.cluster.x-k8s.io_cloudscaleclusters.yaml +++ b/config/crd/bases/infrastructure.cluster.x-k8s.io_cloudscaleclusters.yaml @@ -36,7 +36,10 @@ spec: name: v1beta2 schema: openAPIV3Schema: - description: CloudscaleCluster is the Schema for the cloudscaleclusters API + description: |- + CloudscaleCluster is the cloudscale.ch infrastructure for a CAPI Cluster. + It owns the networks, control-plane load balancer, optional floating IP, and + server groups that back the cluster's machines. properties: apiVersion: description: |- @@ -82,7 +85,11 @@ spec: properties: algorithm: default: round_robin - description: Algorithm is the load balancing algorithm. + description: |- + Algorithm is the cloudscale.ch load-balancing algorithm. + - "round_robin" (default): rotate requests across healthy backends. 
+ - "least_connections": send each request to the backend with the fewest active connections. + - "source_ip": hash the client IP so the same client lands on the same backend. enum: - round_robin - least_connections @@ -90,8 +97,10 @@ spec: type: string apiServerPort: default: 6443 - description: APIServerPort is the port for the Kubernetes API - server. + description: |- + APIServerPort is the LB listener port exposed for the Kubernetes API + server. Defaults to 6443. The pool always targets the API server on the + control plane nodes' 6443. format: int32 maximum: 65535 minimum: 1 @@ -105,7 +114,9 @@ spec: type: boolean flavor: default: lb-standard - description: Flavor is the load balancer flavor (size). + description: |- + Flavor is the cloudscale.ch load balancer flavor slug. Defaults to + "lb-standard". type: string healthMonitor: description: HealthMonitor configures the load balancer health @@ -155,8 +166,9 @@ spec: description: Name is the name of the Secret. type: string namespace: - description: Namespace is the namespace of the Secret. Defaults - to the cluster namespace. + description: |- + Namespace is the namespace of the Secret. Defaults to the + CloudscaleCluster's own namespace if unset. type: string required: - name @@ -243,15 +255,20 @@ spec: - name x-kubernetes-list-type: map region: - description: Region is the cloudscale.ch region (e.g., "rma", "lpg"). + description: |- + Region is the cloudscale.ch region the cluster is provisioned in. + Determines the default zone and the set of available flavors. + Immutable after cluster creation. enum: - rma - lpg type: string zone: description: |- - Zone is the cloudscale.ch zone (e.g., "rma1", "lpg1"). - Defaults to region + "1" if not specified. + Zone is the cloudscale.ch zone within Region. + Defaults to Region + "1" (e.g., "rma1", "lpg1"). Set explicitly only when + the region offers multiple zones and you need to pin the cluster to one. + Immutable after cluster creation. 
type: string required: - credentialsRef diff --git a/config/crd/bases/infrastructure.cluster.x-k8s.io_cloudscalemachines.yaml b/config/crd/bases/infrastructure.cluster.x-k8s.io_cloudscalemachines.yaml index 28f415e..51573a4 100644 --- a/config/crd/bases/infrastructure.cluster.x-k8s.io_cloudscalemachines.yaml +++ b/config/crd/bases/infrastructure.cluster.x-k8s.io_cloudscalemachines.yaml @@ -36,7 +36,10 @@ spec: name: v1beta2 schema: openAPIV3Schema: - description: CloudscaleMachine is the Schema for the cloudscalemachines API + description: |- + CloudscaleMachine represents a single cloudscale.ch server backing a CAPI + Machine. Most spec fields are immutable after creation — to change them, + roll the owning MachineDeployment or KubeadmControlPlane. properties: apiVersion: description: |- @@ -59,12 +62,22 @@ spec: description: spec defines the desired state of CloudscaleMachine properties: flavor: - description: Flavor is the cloudscale.ch server flavor (e.g., "flex-8-4"). + description: |- + Flavor is the cloudscale.ch server flavor slug, e.g. "flex-4-2" or + "plus-8-4". List available flavors via the cloudscale API + (`GET /v1/flavors`) or the control panel. + Immutable after machine creation. minLength: 1 type: string image: - description: Image is the OS image slug (e.g., "ubuntu-24.04"), custom - image slug (e.g., "custom:ubuntu-foo"), or custom image UUID. + description: |- + Image identifies the OS image used to boot the server. One of: + - a public image slug (e.g. "ubuntu-24.04"), + - a custom image slug (e.g. "custom:ubuntu-2404-kube-v1.36.0"), or + - a custom image UUID. + For Kubernetes nodes you typically want a custom image built with + image-builder (https://image-builder.sigs.k8s.io/) that already contains + kubelet, containerd, and the chosen Kubernetes version. minLength: 1 type: string interfaces: @@ -117,7 +130,9 @@ spec: Format: cloudscale:// type: string rootVolumeSize: - description: RootVolumeSize is the root volume size in GB. 
+ description: |- + RootVolumeSize is the root volume size in GB. Minimum 10. If unset, the + cloudscale.ch default for the chosen flavor is used. minimum: 10 type: integer serverGroup: @@ -129,8 +144,9 @@ spec: name: description: |- Name is the server group name. Machines with the same server group name - in the same zone will be placed on different physical hosts. - The server group is created automatically if it doesn't exist. + in the same zone are placed on different physical hosts. The group is + created automatically the first time CAPCS sees the name. + Immutable after machine creation. minLength: 1 type: string required: @@ -139,7 +155,10 @@ spec: tags: additionalProperties: type: string - description: Tags are key-value pairs to apply to the server. + description: |- + Tags are user-defined key/value pairs applied to the server as cloudscale + tags. CAPCS additionally sets its own ownership tag with the key + "capcs-cluster-"; do not set keys with the "capcs-" prefix. type: object required: - flavor diff --git a/config/crd/bases/infrastructure.cluster.x-k8s.io_cloudscalemachinetemplates.yaml b/config/crd/bases/infrastructure.cluster.x-k8s.io_cloudscalemachinetemplates.yaml index 8b05648..222fd21 100644 --- a/config/crd/bases/infrastructure.cluster.x-k8s.io_cloudscalemachinetemplates.yaml +++ b/config/crd/bases/infrastructure.cluster.x-k8s.io_cloudscalemachinetemplates.yaml @@ -17,8 +17,11 @@ spec: - name: v1beta2 schema: openAPIV3Schema: - description: CloudscaleMachineTemplate is the Schema for the cloudscalemachinetemplates - API + description: |- + CloudscaleMachineTemplate is the immutable template a MachineDeployment or + KubeadmControlPlane uses to stamp out CloudscaleMachines. Its Status.Capacity + reports the CPU/memory of the chosen flavor (plus the root volume size) so + the cluster autoscaler can scale a MachineDeployment up from zero replicas. properties: apiVersion: description: |- @@ -48,14 +51,22 @@ spec: of the machine. 
properties: flavor: - description: Flavor is the cloudscale.ch server flavor (e.g., - "flex-8-4"). + description: |- + Flavor is the cloudscale.ch server flavor slug, e.g. "flex-4-2" or + "plus-8-4". List available flavors via the cloudscale API + (`GET /v1/flavors`) or the control panel. + Immutable after machine creation. minLength: 1 type: string image: - description: Image is the OS image slug (e.g., "ubuntu-24.04"), - custom image slug (e.g., "custom:ubuntu-foo"), or custom - image UUID. + description: |- + Image identifies the OS image used to boot the server. One of: + - a public image slug (e.g. "ubuntu-24.04"), + - a custom image slug (e.g. "custom:ubuntu-2404-kube-v1.36.0"), or + - a custom image UUID. + For Kubernetes nodes you typically want a custom image built with + image-builder (https://image-builder.sigs.k8s.io/) that already contains + kubelet, containerd, and the chosen Kubernetes version. minLength: 1 type: string interfaces: @@ -108,7 +119,9 @@ spec: Format: cloudscale:// type: string rootVolumeSize: - description: RootVolumeSize is the root volume size in GB. + description: |- + RootVolumeSize is the root volume size in GB. Minimum 10. If unset, the + cloudscale.ch default for the chosen flavor is used. minimum: 10 type: integer serverGroup: @@ -120,8 +133,9 @@ spec: name: description: |- Name is the server group name. Machines with the same server group name - in the same zone will be placed on different physical hosts. - The server group is created automatically if it doesn't exist. + in the same zone are placed on different physical hosts. The group is + created automatically the first time CAPCS sees the name. + Immutable after machine creation. minLength: 1 type: string required: @@ -130,7 +144,10 @@ spec: tags: additionalProperties: type: string - description: Tags are key-value pairs to apply to the server. + description: |- + Tags are user-defined key/value pairs applied to the server as cloudscale + tags. 
CAPCS additionally sets its own ownership tag with the key + "capcs-cluster-"; do not set keys with the "capcs-" prefix. type: object required: - flavor diff --git a/docs/development.md b/docs/development.md new file mode 100644 index 0000000..a742fee --- /dev/null +++ b/docs/development.md @@ -0,0 +1,139 @@ +# Development + +For contributors working on CAPCS itself. End-user docs are in +[Getting Started](getting-started.md) and [Troubleshooting](troubleshooting.md). + +## Architecture sketch + +CAPCS is a kubebuilder-scaffolded infrastructure provider. Three CRDs, three +reconcilers, a webhook per CRD, and a thin wrapper around the cloudscale-go-sdk. + +``` +api/v1beta2/ CRD types (CloudscaleCluster, CloudscaleMachine, CloudscaleMachineTemplate) +internal/controller/ Reconcilers, one file per cloudscale resource (network, LB, FIP, server group, server) +internal/webhook/v1beta2/ Defaulting + validating webhooks (one per CRD) +internal/cloudscale/ SDK wrapper: shared HTTP transport, flavor/region helpers, per-cluster services +internal/credentials/ Resolves the per-cluster API token from `credentialsRef` +internal/scope/ Per-cluster / per-machine reconciliation scope objects +cmd/main.go Manager setup, controller wiring, leader election, webhook registration +``` + +A few conventions to know before touching code: + +- **Webhooks own all defaulting and validation.** Controllers must never repeat + validation logic — if a field needs a default or a check, it goes in the + webhook so behavior stays consistent between `kubectl apply` and the + reconcile loop. +- **Ownership tags.** Cloudscale resources are tagged with the key + `capcs-cluster-` so the reconciler can identify what it owns + and clean it up. See `api/v1beta2/tags.go` and `internal/controller/cloudscale_tags.go`. +- **Shared HTTP transport.** Per-cluster cloudscale clients share an + `http.Transport` (see `internal/cloudscale/services.go`) so connection + pooling works across reconciliations. 
+ +## Setup + +You need: + +- Go (version pinned in `go.mod`) +- [kind](https://kind.sigs.k8s.io/), [clusterctl](https://cluster-api.sigs.k8s.io/user/quick-start#install-clusterctl), + `kubectl`, `kustomize` +- [Tilt](https://tilt.dev/) for the inner-loop workflow +- A cloudscale.ch API token (export `CLOUDSCALE_API_TOKEN`) +- A cloudscale.ch custom image (see [Getting Started](getting-started.md#prerequisites)) + +## Make targets + +```bash +make test # unit tests + envtest (runs fmt, vet, generate, manifests) +make manifests # regenerate CRDs / webhook config from kubebuilder markers +make generate # regenerate deepcopy code +make lint # golangci-lint +make build # build the manager binary + +make test-e2e-lifecycle # smallest E2E suite — single CP + 1 worker +make test-e2e # full conformance-fast E2E suite (slow, real cloudscale) +``` + +E2E suites and their cadence are documented in +[Testing Releases](testing-releases.md). + +## Iterating on cluster templates locally + +When you change a file under `templates/`, you can test it before it ships in a +release by pointing `clusterctl generate` at the local file: + +```bash +clusterctl generate cluster my-cluster \ + --infrastructure cloudscale-ch-cloudscale \ + --kubernetes-version v1.36.0 \ + --from templates/cluster-template-fip.yaml \ + | kubectl apply -f - +``` + +This is a contributor flow only — end users consume published flavors via +`--flavor` (see [Getting Started](getting-started.md#3-pick-a-cluster-template-flavor)). + +## Tilt + +The fastest inner loop is Cluster API's +[Tilt setup](https://cluster-api.sigs.k8s.io/developer/core/tilt.html). It runs +out of a local clone of [cluster-api](https://github.com/kubernetes-sigs/cluster-api), +**not** out of this repository. 
+ +Drop a `tilt-settings.yaml` next to the cluster-api checkout: + +```yaml +default_registry: "" +provider_repos: + - path/to/local/clone/cluster-api-provider-cloudscale +enable_providers: + - cloudscale + - kubeadm-bootstrap + - kubeadm-control-plane +deploy_cert_manager: true +kustomize_substitutions: + CLOUDSCALE_API_TOKEN: "INSERT_TOKEN_HERE" + CLOUDSCALE_SSH_PUBLIC_KEY: "INSERT_SSH_PUBLIC_KEY_HERE" + CLOUDSCALE_REGION: "lpg" + CLOUDSCALE_CONTROL_PLANE_MACHINE_FLAVOR: "flex-4-2" + CLOUDSCALE_WORKER_MACHINE_FLAVOR: "flex-4-2" + CLOUDSCALE_MACHINE_IMAGE: "IMAGE_NAME" + CLOUDSCALE_ROOT_VOLUME_SIZE: "50" + # Required for the fip / public-lb-private-nodes / pre-existing-network flavors: + # CLOUDSCALE_NETWORK_UUID: "UUID_HERE" +extra_args: + cloudscale: + - "--zap-log-level=5" +template_dirs: + docker: + - ./test/infrastructure/docker/templates + cloudscale: + - path/to/local/clone/cluster-api-provider-cloudscale/templates +``` + +Then `tilt up` from the cluster-api checkout. + +## Tests + +| Layer | Location | What it covers | +|---------|-------------------------------------------|---------------------------------------------------------------------------------------| +| Unit | `*_test.go` next to each file | Pure logic; cloudscale API mocked | +| envtest | `internal/controller/suite_test.go` setup | Reconcilers against a real apiserver + etcd, cloudscale API mocked | +| E2E | `test/e2e/` | Real workload clusters on cloudscale.ch (see [Testing Releases](testing-releases.md)) | + +PRs do not run E2E automatically. Run the relevant suite locally before +submitting (`make test-e2e-lifecycle` at minimum); reviewers can run additional +suites or trigger the `test-e2e.yml` workflow manually after confirming that the +diff is safe to run. + +## Releases + +See [Releasing](releasing.md) for the tag-and-publish flow and +[Testing Releases](testing-releases.md) for post-release verification.
+ +## Notes for AI agent contributors + +If you are an AI agent contributing changes, read [`AGENTS.md`](../AGENTS.md) at +the repo root — it covers kubebuilder rules, auto-generated files to leave +alone, and project-specific conventions in more detail. diff --git a/docs/getting-started.md b/docs/getting-started.md new file mode 100644 index 0000000..8c003f1 --- /dev/null +++ b/docs/getting-started.md @@ -0,0 +1,138 @@ +# Getting Started + +This guide walks you through provisioning your first workload Kubernetes cluster +on [cloudscale.ch](https://www.cloudscale.ch) with CAPCS. For Cluster API +fundamentals (concepts, `clusterctl`, upgrades) see the +[upstream documentation](https://cluster-api.sigs.k8s.io/) — this guide only +covers what is cloudscale-specific. + +## Prerequisites + +1. **cloudscale.ch account and API token.** Create a token with read/write + permissions in the [cloudscale.ch control panel](https://control.cloudscale.ch/). + Keep it out of version control. +2. **Custom OS image** imported into your cloudscale.ch project. CAPCS does not + publish a pre-built image — build one with + [image-builder for OpenStack](https://image-builder.sigs.k8s.io/capi/providers/openstack) + targeting the Kubernetes version you want, then upload it via the cloudscale.ch + control panel or API. The image name you set there is what you pass as + `CLOUDSCALE_MACHINE_IMAGE` (with `custom:` as a prefix). +3. **Management cluster.** Any conformant Kubernetes cluster works; a local + [kind](https://kind.sigs.k8s.io/) cluster is the easiest starting point. +4. **`clusterctl`.** Install it per the + [upstream instructions](https://cluster-api.sigs.k8s.io/user/quick-start#install-clusterctl). +5. **(Optional) Pre-existing network with NAT gateway.** Required for the `fip`, + `pre-existing-network`, and `public-lb-private-nodes` template flavors. Create + it in the cloudscale.ch control panel, contact support to set up the NAT gateway, and note its UUID. + +## 1.
Install the provider on the management cluster + +```bash +export CLOUDSCALE_API_TOKEN= +clusterctl init --infrastructure cloudscale-ch-cloudscale +``` + +`clusterctl init` also installs the Cluster API core, kubeadm bootstrap, and +kubeadm control plane components if they aren't already present. + +## 2. Configure environment variables + +`clusterctl generate cluster` substitutes these into the chosen template: + +| Variable | Description | Example | +|-------------------------------------------|-------------------------------------------------------|-----------------------------------| +| `CLOUDSCALE_API_TOKEN` | API token used by the workload cluster's CAPCS Secret | `abc123...` | +| `CLOUDSCALE_REGION` | cloudscale.ch region | `lpg` or `rma` | +| `CLOUDSCALE_MACHINE_IMAGE` | Name of your imported custom image | `custom:ubuntu-2404-kube-v1.36.0` | +| `CLOUDSCALE_CONTROL_PLANE_MACHINE_FLAVOR` | Flavor for control plane nodes | `flex-4-2` | +| `CLOUDSCALE_WORKER_MACHINE_FLAVOR` | Flavor for worker nodes | `flex-4-2` | +| `CLOUDSCALE_ROOT_VOLUME_SIZE` | Root volume size in GB | `50` | +| `CLOUDSCALE_SSH_PUBLIC_KEY` | SSH public key added to every node | `ssh-ed25519 AAAA...` | +| `CLOUDSCALE_NETWORK_UUID` | Pre-existing network UUID (non-default flavors only) | `2db69ba3-...` | + +Set them once in your shell, or keep them in `clusterctl`'s config file at +`~/.config/cluster-api/clusterctl.yaml`. + +## 3. 
Pick a cluster template flavor + +| Flavor | Network | Control plane endpoint | Node connectivity | Requires | +|---------------------------|--------------------------|------------------------|-------------------|------------------------------------------------------| +| *(default)* | Managed, `172.18.0.0/24` | Public LB, DualStack | Public + cluster | — | +| `fip` | Pre-existing | Floating IP, IPv4 | Public + cluster | `CLOUDSCALE_NETWORK_UUID` | +| `pre-existing-network` | Pre-existing | Public LB, DualStack | Public + cluster | `CLOUDSCALE_NETWORK_UUID` | +| `public-lb-private-nodes` | Pre-existing + NAT | Public LB | Private only | `CLOUDSCALE_NETWORK_UUID`, with a NAT gateway set up | + +The default's `172.18.0.0/24` network CIDR is chosen so it does not overlap with +the default Cilium cluster-pool range (`10.0.0.0/8`). If you change +`networks[].cidr` to a value inside your CNI's pod or service range, the control +plane load balancer's health checks will break — adjust the CNI accordingly. + +## 4. Generate and apply the cluster + +```bash +clusterctl generate cluster my-cluster \ + --infrastructure cloudscale-ch-cloudscale \ + --kubernetes-version v1.36.0 \ + --control-plane-machine-count 1 \ + --worker-machine-count 2 \ + --flavor pre-existing-network \ + > my-cluster.yaml + +kubectl apply -f my-cluster.yaml +``` + +Omit `--flavor` for the default template. Inspect `my-cluster.yaml` before +applying — it includes a Secret holding `CLOUDSCALE_API_TOKEN`, which CAPCS +references via `CloudscaleCluster.spec.credentialsRef`. + +Watch progress: + +```bash +clusterctl describe cluster my-cluster +``` + +## 5. Get the kubeconfig and install a CNI + +```bash +clusterctl get kubeconfig my-cluster > my-cluster.kubeconfig +export KUBECONFIG=$(pwd)/my-cluster.kubeconfig +``` + +The cluster has no CNI installed yet — nodes will stay `NotReady` until you +install one. 
Any standard CNI works; if you choose Cilium, keep its IPAM range +clear of the network CIDR you used in step 3. + +## 6. Install the cloudscale Cloud Controller Manager + +CAPCS provisions infrastructure, but services of type `LoadBalancer` and the +`ProviderID` on nodes come from the +[cloudscale CCM](https://github.com/cloudscale-ch/cloudscale-cloud-controller-manager). +The CCM is shipped as a `ClusterResourceSet` you apply on the **management** +cluster; CAPI then deploys it into any workload cluster labelled +`ccm: cloudscale` (all CAPCS templates set this label): + +```bash +curl -L https://raw.githubusercontent.com/cloudscale-ch/cluster-api-provider-cloudscale/main/templates/addons/ccm.yaml \ + | envsubst | kubectl apply -f - +``` + +This creates a ConfigMap with the CCM manifests, a Secret with the API token, +and the `ClusterResourceSet` that wires them together. + +## 7. Clean up + +```bash +kubectl delete cluster my-cluster +``` + +Deleting the `Cluster` cascades through CAPCS, which removes the servers, load +balancer, floating IPs, server groups, and any managed networks it created. +Pre-existing networks supplied via `CLOUDSCALE_NETWORK_UUID` are left intact. + +## Next steps + +- Look up CRD fields with `kubectl explain cloudscalecluster.spec` (or browse the + CRDs in [`config/crd/bases/`](../config/crd/bases)) +- Read the [troubleshooting guide](troubleshooting.md) when something gets stuck +- Upstream Cluster API tasks (upgrades, scaling, MachineHealthChecks, etc.) are + documented at diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md new file mode 100644 index 0000000..2328500 --- /dev/null +++ b/docs/troubleshooting.md @@ -0,0 +1,137 @@ +# Troubleshooting + +Cloudscale-specific failure modes for CAPCS. For generic Cluster API issues +(bootstrap, certificates, MachineHealthCheck, etc.) see the +[upstream troubleshooting guide](https://cluster-api.sigs.k8s.io/user/troubleshooting.html). 
+ +## Where to look first + +```bash +# Cluster-level status, conditions, and child resources +clusterctl describe cluster + +# Cloudscale infrastructure conditions +kubectl describe cloudscalecluster +kubectl describe cloudscalemachine + +# Controller logs +kubectl -n capcs-system logs deploy/capcs-controller-manager -f +``` + +Most problems surface as a `Ready: False` condition with a `Reason` and `Message` +on the `CloudscaleCluster` or `CloudscaleMachine` — read those before diving +into logs. + +## Authentication: `401 Unauthorized` from the cloudscale API + +**Symptom:** controller logs show `401` from `api.cloudscale.ch`; `CloudscaleCluster` +stays `Ready: False` with an auth-related message. + +**Common causes:** + +- The credentials Secret is missing the `token` key, or the value is empty. +- `credentialsRef.namespace` points to a namespace that doesn't contain the + Secret (it defaults to the `CloudscaleCluster`'s own namespace if unset). +- The token was revoked or scoped read-only in the cloudscale.ch control panel. + +**Fix:** verify the Secret: + +```bash +kubectl get secret -o jsonpath='{.data.token}' | base64 -d +``` + +Re-create it with read/write scope if needed and let the controller requeue. + +## Image: server creation fails with "image not found" + +**Symptom:** `CloudscaleMachine` stuck `Ready: False`; cloudscale API returns +404 when creating the server. + +**Cause:** the value of `spec.image` (set via `CLOUDSCALE_MACHINE_IMAGE`) +doesn't match a custom image imported into your cloudscale.ch project. CAPCS +does not ship a public image. + +**Fix:** build and import an image with +[image-builder for OpenStack](https://image-builder.sigs.k8s.io/capi/providers/openstack) +and reference its exact name (typically `custom:`). + +## Network: cluster stuck Provisioning, CIDR overlap + +**Symptom:** workers are `Ready` but the control-plane load balancer never goes +healthy; pod-to-LB traffic from inside the cluster fails. 
+ +**Cause:** the network CIDR set on `CloudscaleCluster.spec.networks[].cidr` +overlaps with the CNI's pod or service range. The default Cilium cluster-pool +range is `10.0.0.0/8`, so any network CIDR inside that range collides. + +**Verify:** Check the route table of the servers using `ip route`. + +**Fix:** keep the network CIDR outside the CNI's IPAM range. The default +template uses `172.18.0.0/24` for this reason. If you must use a different +range, reconfigure your CNI to match. + +## Network: wrong pre-existing network UUID + +**Symptom:** `CloudscaleCluster` rejected by the webhook, or accepted but +reconciliation fails with `network not found`. + +**Cause:** `CLOUDSCALE_NETWORK_UUID` doesn't exist in the cloudscale.ch project +the API token belongs to, or it exists in a different region. + +**Fix:** look up the network in the cloudscale.ch control panel, confirm region +matches `CloudscaleCluster.spec.region`, and update the UUID. + +## Load balancer stuck in `degraded` or `error` + +**Symptom:** `clusterctl describe` shows the LB condition as `degraded` or +`error`; the control plane endpoint is unreachable. + +**Cause:** the cloudscale LB has reported a non-running status. CAPCS does not +block reconciliation on `degraded`/`error` (it does block on `changing`), so +stale pool members will still be removed — but a persistent non-running status +points at an issue on the LB itself or its backends. + +**Fix:** check the LB in the cloudscale.ch control panel; verify pool members +correspond to live control plane machines on the expected port. If a control +plane Machine was deleted and replaced, give the reconciler a minute to drop +the old member, then re-check. + +## Server group: cluster cannot scale beyond 4 nodes per pool + +**Symptom:** `MachineDeployment` scale-up stops at 4; new `CloudscaleMachine` +creation rejected by the cloudscale API. + +**Cause:** cloudscale.ch limits a server group to 4 servers. 
CAPCS places all
+machines from one `CloudscaleMachineTemplate` into the server group named in
+`spec.serverGroup.name` (if defined).
+
+**Fix:** split the workload across multiple `MachineDeployment`s, each
+referencing a `CloudscaleMachineTemplate` with a distinct
+`spec.serverGroup.name`.
+
+## Webhook rejection: `unknown flavor`
+
+**Symptom:** `kubectl apply` fails with a webhook validation error on
+`spec.flavor` (or `spec.template.spec.flavor` for `CloudscaleMachineTemplate`).
+
+**Cause:** the webhook validates `flavor` against the live list of flavors
+fetched from the cloudscale API. The value doesn't match any known flavor slug.
+
+**Fix:** list available flavors via the cloudscale API or control panel and pick
+a slug that exists there.
+
+## Webhook validation: common rejections
+
+Other validations that commonly trip people up:
+
+| Rejection | What it means |
+|-------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `exactly one of uuid or cidr must be specified` on `spec.networks[*]` | Each network entry references either a pre-existing network (uuid) or a managed one (cidr) |
+| `gateway must be within CIDR <cidr>` | `networks[*].gatewayAddress` is outside the network's own CIDR |
+| `floating IPs cannot be attached to a load balancer with a private VIP` | Combine a public LB with a floating IP, or drop one of them |
+| `exactly one of ipFamily or ip must be specified` on `floatingIP` | Set `ipFamily` to let CAPCS allocate, or `ip` to reuse a pre-existing floating IP |
+| `field is immutable after cluster creation` | Most cloudscale-side topology fields (region, zone, networks, floating IP, etc.)
cannot be changed once the cluster exists |
+| `field is immutable` on `CloudscaleMachine.spec` | Most machine spec fields (flavor, image, server group, …) cannot be changed once the machine exists — recreate via `MachineDeployment` rollout instead |
+
+When in doubt, run `kubectl explain cloudscalecluster.spec.<field>` — the
+generated CRDs carry the rules the webhook enforces.
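As an offline sanity check for the CIDR-overlap issues above, the two ranges can be compared locally. A sketch using `python3`; the values shown are the defaults mentioned earlier (the CAPCS template network and the Cilium cluster-pool range), so substitute your own:

```shell
# Does the CloudscaleCluster network CIDR collide with the CNI's pod range?
network_cidr="172.18.0.0/24"
pod_cidr="10.0.0.0/8"
python3 - "$network_cidr" "$pod_cidr" <<'EOF'
import ipaddress, sys
net, pods = (ipaddress.ip_network(a) for a in sys.argv[1:3])
print("OVERLAP: pick a different network CIDR" if net.overlaps(pods) else "OK: no overlap")
EOF
```

For the defaults shown it prints `OK: no overlap`. If it prints `OVERLAP`, move the network CIDR outside the CNI's IPAM range or reconfigure the CNI, as described in the CIDR-overlap section.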