Operator Guide

Every supported deploy path collated. Pick one based on your existing ops surface; the same Rust binary, same migrations, same backup rules apply across all of them. Estimated time-to-first-tenant: 5 min (single-node) to 30 min (AWS EKS).

Deploy path picker

Path	When
Single-node systemd + Caddy	Homelab, small VPS, ~dozen developers, one or two tenants
NixOS module	You already manage NixOS hosts declaratively
Docker / Docker Compose	You want a one-host containerized deploy without k8s overhead
Kubernetes (Helm)	You have a cluster already and want a stock chart
AWS EKS (Terraform)	You're starting from zero and want a turnkey AWS deploy

Each path is detailed below. All of them produce the same observable behavior — same routes, same metrics endpoint at /metrics, same healthcheck at /v1/healthz.

Pre-flight (every path)

You need:

Postgres 16+ with the pgvector extension. The default builds work without pgvector; --features fastembed requires it.
A directory or bucket for bundle storage. Any opendal-supported backend works: fs://, s3://, gcs://, azblob://.
A reverse proxy that can do TLS termination and route the /v1/* + /metrics paths to the server, everything else to the SvelteKit portal. Caddy, Traefik, nginx, and the AWS ALB all work.
(Optional) Redis — used as a read-through cache, rate-limit store, and job queue. The server falls back gracefully when it's absent (caches become no-ops, rate limits fail-open, jobs run inline as detached tokio tasks). See Redis & the job queue below for when in-process fallback is acceptable vs. when to provision Redis.

Path 1 — Single-node systemd + Caddy

Reference: docs/deploy/single-node.md.

1. Postgres

sudo apt install -y postgresql-16
sudo -u postgres psql <<'SQL'
  CREATE ROLE skillpool LOGIN PASSWORD 'changeme';
  CREATE DATABASE skillpool OWNER skillpool;
  \c skillpool
  CREATE EXTENSION IF NOT EXISTS vector;
SQL

sqlx migrate run --source server/migrations \
  --database-url 'postgres://skillpool:changeme@localhost/skillpool'

The server does not auto-migrate on startup — migrations are a separate step so a broken deploy can't run a migration as a side effect.

2. Binary + systemd

cargo build --release -p skill-pool-server
sudo install -o root -g root -m 0755 \
  target/release/skill-pool-server /usr/local/bin/

sudo useradd --system --home /var/lib/skill-pool --shell /usr/sbin/nologin skillpool
sudo mkdir -p /var/lib/skill-pool/bundles /etc/skill-pool
sudo chown -R skillpool:skillpool /var/lib/skill-pool

sudo cp packaging/systemd/skill-pool-server.service /etc/systemd/system/
sudo install -o skillpool -g skillpool -m 0600 \
  packaging/systemd/skill-pool-server.env.example \
  /etc/skill-pool/skill-pool-server.env

sudoedit /etc/skill-pool/skill-pool-server.env   # paste real DSN + secrets

sudo systemctl daemon-reload
sudo systemctl enable --now skill-pool-server
journalctl -u skill-pool-server -f

3. Caddy

sudo cp packaging/proxy/Caddyfile /etc/caddy/Caddyfile
sudoedit /etc/caddy/Caddyfile   # set your real domain + email
sudo caddy validate --config /etc/caddy/Caddyfile
sudo systemctl reload caddy

Wildcard certs (tenant subdomains) need a DNS provider plugin — uncomment the acme_dns line in the shipped Caddyfile.

4. First tenant

sudo -u skillpool skill-pool-server admin tenant-create \
  --slug acme --name "Acme Inc."
sudo -u skillpool skill-pool-server admin token-create \
  --tenant acme --name bootstrap

See Tenant Onboarding for the rest of the first-tenant playbook.

Path 2 — NixOS module

Reference: docs/deploy/nixos.md.

Flake input

{
  inputs.skill-pool.url = "github:olafkfreund/skill_pool";
  outputs = { self, nixpkgs, skill-pool, ... }: {
    nixosConfigurations.registry = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [
        skill-pool.nixosModules.skill-pool-server
        ./registry-config.nix
      ];
    };
  };
}

Minimal configuration

{ pkgs, skill-pool, ... }:
{
  services.skill-pool-server = {
    enable = true;
    package = skill-pool.packages.${pkgs.system}.skill-pool-server;

    bind = "127.0.0.1:8080";
    storageUri = "fs:///var/lib/skill-pool/bundles";
    defaultTenant = "acme";

    environmentFile = "/run/keys/skill-pool.env";
  };

  services.postgresql = {
    enable = true;
    package = pkgs.postgresql_17;
    ensureDatabases = [ "skillpool" ];
    ensureUsers = [{ name = "skillpool"; ensureDBOwnership = true; }];
    extraPlugins = ps: [ ps.pgvector ];
  };

  services.caddy = {
    enable = true;
    virtualHosts."skill-pool.example.com".extraConfig = ''
      reverse_proxy 127.0.0.1:3000
    '';
    virtualHosts."*.skill-pool.example.com".extraConfig = ''
      @api path /v1/* /metrics
      reverse_proxy @api 127.0.0.1:8080
      reverse_proxy 127.0.0.1:3000
    '';
  };
}

Secrets with agenix

age.secrets."skill-pool.env" = {
  file = ./secrets/skill-pool.env.age;
  owner = config.services.skill-pool-server.user;
  group = config.services.skill-pool-server.group;
  mode = "0400";
};

services.skill-pool-server.environmentFile =
  config.age.secrets."skill-pool.env".path;

Module options

Option	Type	Default
`enable`	bool	`false`
`package`	package	—
`bind`	string	`"127.0.0.1:8080"`
`databaseUrl`	nullable string	`null`
`storageUri`	string	`"fs:///var/lib/skill-pool/bundles"`
`defaultTenant`	nullable string	`null`
`logLevel`	string	`"info,skill_pool=info"`
`logFormat`	enum	`"json"`
`otlpEndpoint`	nullable string	`null`
`environmentFile`	nullable path	`null`
`user` / `group`	string	`"skillpool"`
`stateDir`	path	`/var/lib/skill-pool`
`openFirewall`	bool	`false`

Web bundle

The flake exposes packages.skill-pool-web, a buildNpmPackage derivation that produces the adapter-node SvelteKit bundle. Use:

nix build .#skill-pool-web && PORT=3000 node result/index.js

Rebuild + verify

sudo nixos-rebuild switch --flake .#registry
systemctl status skill-pool-server
journalctl -u skill-pool-server -f
curl -s http://127.0.0.1:8080/v1/healthz | jq

Path 3 — Docker / Docker Compose

The repo ships two Dockerfiles (server/Dockerfile, web/Dockerfile). A minimal Compose looks like:

version: "3.9"
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: skillpool
      POSTGRES_USER: skillpool
      POSTGRES_PASSWORD: changeme
    volumes: ["pgdata:/var/lib/postgresql/data"]

  server:
    image: ghcr.io/olafkfreund/skill-pool-server:v0.1.0
    environment:
      SKILL_POOL_DATABASE_URL: postgres://skillpool:changeme@postgres/skillpool
      SKILL_POOL_STORAGE_URI: fs:///var/lib/skill-pool/bundles
    volumes: ["bundles:/var/lib/skill-pool/bundles"]
    depends_on: [postgres]

  web:
    image: ghcr.io/olafkfreund/skill-pool-web:v0.1.0
    environment:
      PUBLIC_API_BASE_URL: http://server:8080
      ORIGIN: https://skill-pool.example.com
    depends_on: [server]

  caddy:
    image: caddy:2
    ports: ["80:80", "443:443"]
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - caddy_data:/data

volumes: { pgdata: {}, bundles: {}, caddy_data: {} }

Migrations run as a one-shot:

docker compose run --rm server skill-pool-server migrate

Path 4 — Kubernetes (Helm)

Reference: docs/deploy/kubernetes.md + deploy/helm/skill-pool/.

# 1. Ensure the namespace + Secret exist.
kubectl create namespace skill-pool
kubectl -n skill-pool create secret generic skill-pool-env \
  --from-literal=SKILL_POOL_DATABASE_URL='postgres://…' \
  --from-literal=SKILL_POOL_EMAIL_SECRET_KEY="$(openssl rand -hex 32)"

# 2. Run migrations as a one-shot Job.
kubectl -n skill-pool run sqlx-migrate \
  --rm -it --restart=Never \
  --image ghcr.io/olafkfreund/skill-pool-server:v0.1.0 \
  --env "SKILL_POOL_DATABASE_URL=…" \
  --command -- /usr/local/bin/skill-pool-server migrate

# 3. Install the chart.
helm install skill-pool ./deploy/helm/skill-pool \
  -f deploy/helm/skill-pool/values.yaml \
  -n skill-pool

values.yaml keys you care about:

image.server.tag / image.web.tag — pin specific versions.
server.env.SKILL_POOL_STORAGE_URI — S3/GCS/Azure bucket URI.
ingress.hosts[].host — the public hostname.
ingress.annotations — cert-manager issuer, ALB attributes, etc.
redis.existingSecret — if you bring Redis, name of a Secret with SKILL_POOL_REDIS_URL.

Pre-upgrade Helm hook handles migrations automatically on every helm upgrade. To roll back: helm rollback skill-pool <REV>. The old binary reads the new schema fine because all schema changes are additive (see docs/ops/rollback.md).

Path 5 — AWS EKS (Terraform)

Reference: docs/deploy/aws.md + deploy/terraform/aws/.

The Terraform starter provisions:

A VPC across 2 AZs (or 3 if you want HA).
An EKS cluster with managed node groups.
An RDS Postgres 16 instance with pgvector preloaded.
An S3 bucket for bundles with versioning enabled.
ECR repos for both images.
An IAM role for IRSA so the pod can write to S3 without keys.
A GitHub OIDC provider + an IAM role with permissions for the build/deploy workflows.
The AWS Load Balancer Controller via Helm.
cert-manager + a Let's Encrypt cluster issuer.

End-to-end:

cd deploy/terraform/aws/
${EDITOR:-vim} variables.tf   # region, azs, github_repository
terraform init && terraform apply

# ~20 min later, connect to the cluster:
aws eks update-kubeconfig --region "$(terraform output -raw region)" \
                          --name   "$(terraform output -raw cluster_name)"

# Bridge Secrets Manager → k8s Secret (or use External Secrets Operator).
# Then run migrations and helm install — same as Path 4.

helm install skill-pool ./deploy/helm/skill-pool \
  -f deploy/helm/skill-pool/values-aws.yaml \
  -n skill-pool --create-namespace

TLS — `nip.io` + Let's Encrypt (no domain required)

The default deploy uses <dashed-ip>.nip.io for DNS — no domain purchase needed. cert-manager + the LE HTTP-01 challenge issues the cert.

ALB_HOST=$(kubectl -n skill-pool get ingress skill-pool \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
ALB_IP=$(dig +short "$ALB_HOST" | head -n1)
DASHED_IP="${ALB_IP//./-}"

helm upgrade skill-pool ./deploy/helm/skill-pool \
  -f deploy/helm/skill-pool/values-aws.yaml \
  --set ingress.hosts[0].host="skill-pool.${DASHED_IP}.nip.io" \
  --reuse-values

Wait 30–120s for cert issuance:

kubectl -n skill-pool get certificate -w

Cost (lean baseline, eu-west-1, May 2026)

Component	Monthly
EKS control plane	$73
2× t3.medium	$60
RDS t4g.medium	$50
ALB	$22
NAT (single AZ)	$32
Misc (S3, ECR, R53, SM)	$11
Total	~$248

HA (Multi-AZ RDS, per-AZ NAT, third worker): +$110/mo. Dev/staging (single Spot, no NAT): ~$130/mo.

GitHub Actions CI/CD

Four workflows ship in .github/workflows/:

Workflow	Triggers	Purpose
CI	push to `main`, PRs	fmt + clippy + tests + web lint + helm lint
Build & push	push to `main`, tag `v*`, manual	Build + push both images to ECR
Deploy to EKS	tag `v*`, manual	`helm upgrade` + smoke-test + auto-rollback on failure
DB migrations	manual only	Break-glass: run `sqlx migrate run` from a one-shot pod

All AWS-touching workflows authenticate via OIDC — no long-lived AWS keys in GitHub. The only repo-level secret is AWS_ROLE_ARN.

Required repo-level variables:

Name	Example
`AWS_REGION`	`eu-west-1`
`ECR_REPO_SERVER`	`skill-pool/server`
`ECR_REPO_WEB`	`skill-pool/web`
`EKS_CLUSTER_NAME`	`skill-pool-prod`
`HELM_RELEASE_NAME`	`skill-pool`
`HELM_NAMESPACE`	`skill-pool`
`PUBLIC_HOSTNAME`	`skill-pool.example.com`

Image tagging: pushes to main tag <git-sha> + latest; pushes of v* tag also push <git-ref-name> (semver). values-aws.yaml pins specific tags — latest is for human convenience only and should never be referenced by the cluster.

Auto-rollback: deploy.yml runs helm rollback if the rollout-status or smoke-test step fails after a successful helm upgrade.

Backup & restore

Postgres

# Single node — daily cron:
pg_dump -Fc skillpool > /backups/skillpool-$(date +%F).dump

# RDS — automated snapshots; bump retention to 30 days for prod.

Bundle storage

fs:// — tar /var/lib/skill-pool/bundles weekly. Bundles are immutable once published, so the diff is small.
s3:// — turn on bucket versioning. Lifecycle rule: delete non-current versions after 90 days.

Restore drill

Recommended every quarter:

pg_restore into a scratch Postgres.
Point a scratch skill-pool-server at it with --storage-uri pointed at the bundle-storage backup.
curl /v1/healthz, list skills, download one.

Full rollback procedures: docs/ops/rollback.md.

Day-2 ops

Metrics — /metrics on the server in Prometheus format. The Grafana dashboard ships in ops/grafana/skill-pool.json; the Prometheus alert rules in ops/prometheus/skill-pool.rules.yaml.
Tracing — set SKILL_POOL_OTLP_ENDPOINT=http://collector:4317. All request/response spans plus the background-task spans go out with service.name=skill-pool-server.
Logs — JSON to stdout by default (SKILL_POOL_LOG_FORMAT=json). Pretty mode for dev: SKILL_POOL_LOG_FORMAT=pretty.
Runbook — docs/ops/runbook.md covers the SLO breach playbook per top-N alert.
Capacity planning — docs/ops/capacity.md covers the tier-by-tier sizing curve.
Rollback — docs/ops/rollback.md covers the forward-only migration discipline + DR from snapshots.

Plugin storage

Plugins (per-tenant Claude Code marketplace, see docs/plugins.md) introduce two operator-visible concerns beyond skill bundles: on-disk bare git repos for the /git/plugins/<slug>.git endpoint, and the per-tenant pre-rendered marketplace cache in Postgres.

On-disk layout

Source of truth: server/src/storage.rs:71-94 (Storage::plugin_git_path).

For each internal- or mirror-sourced plugin, skill-pool materialises a bare git repo on first publish at:

<storage-root>/<tenant-uuid>/plugins/<slug>.git/

<storage-root> is the path component of SKILL_POOL_STORAGE_URI when it starts with fs://. The git endpoint requires fs:// storage — S3/GCS/Azure backends cannot serve git-upload-pack and the endpoint returns an explicit error at publish time. A per-process checkout cache for object-store-backed plugin git is deferred.

The tenant UUID prefix is the same one bundle storage uses, so rm -rf <storage-root>/<tenant-uuid>/ cleans plugin repos and skill bundles in one shot when a tenant is decommissioned.

Backup

Bare git repos are append-mostly trees of immutable blobs (a publish only adds objects; archive flips a DB row, never deletes files). Two practical rules:

Include <storage-root>/<tenant-uuid>/plugins/ in the same backup job that snapshots bundles/. A daily tar of <storage-root> covers both. Incremental backup tools (restic, borg) deduplicate well — only newly published plugin objects transfer on each run.
A restored repo serves correctly without re-materialisation from Postgres. Trees + blobs are self-contained; the only DB row needed is plugins (for the sourcing_mode check in plugin_git::resolve_repo_path).

If the bare repo is missing after a publish (storage write failed silently — logged at warn level), the API returns 404 from /git/plugins/<slug>.git. Recovery: republish. The materialiser is idempotent — the second pass walks the same content tree and writes the same objects.

Marketplace cache

Source of truth: server/migrations/0032_plugin_marketplace_entries.sql.

The plugin_marketplace_entries table holds one row per (tenant_id, plugin_slug) — the latest published version pre-rendered into the exact JSON object that splices into /.claude-plugin/marketplace.json. The marketplace handler (server/src/routes/marketplace.rs) is a single SELECT plus a JSON wrapper, with a strong ETag and Cache-Control: public, max-age=60 on the response. Conditional GETs return 304 on match.

Storage cost: a few hundred bytes per plugin per tenant — negligible versus the bundle tarballs.

Mirror refresh

mirror-sourced plugins are listed in marketplace.json with a local bare repo. Two things drive the cache forward:

Per-plugin pull_interval_secs (set on POST /v1/plugins/import, default 86400 = 24h, minimum 300). The server's periodic sweep (spawn_mirror_sweep in server/src/main.rs) re-enqueues mirror jobs whose last_pulled_at is older than the interval.
Failures record fetch_error + fetch_error_at on the plugin row so operators can spot stuck mirrors. The next sweep retries.

fetch_error is surfaced in the admin Plugin detail page; the /marketplace public browser hides plugins whose latest pull failed.

If you need fresh mirror content immediately, hit POST /v1/plugins/import again with the same slug and url — the upsert is idempotent and the job is re-enqueued (or, without Redis, spawned in-process).

Growth notes

Plugin bare repos grow with commits × tree-size. A plugin bundling a dozen skills with one publish per week settles at single-digit MB after a year.
The 256 KiB cap on the publish-time manifest body (server/src/routes/plugins.rs:42) caps the inline-blob blast radius — operators don't need a separate per-plugin size monitor.
The total number of plugins per tenant has no hard cap, but the marketplace JSON is fetched on every Claude Code refresh; tenants with thousands of plugins will see noticeable cold-fetch latency. Tier-by-tier sizing for plugins-heavy tenants is on the capacity planning backlog (docs/ops/capacity.md).

Redis & the job queue

skill-pool uses Redis for three optional things:

Use	Without Redis	When you need Redis
Read-through cache for hot reads	Falls back to direct DB	Multi-replica deployments where you want all replicas hot
Per-tenant rate-limit counters	Fail-open	Production multi-tenant with strict per-tenant SLOs
Job queue for `PluginMirrorJob` (and future job kinds)	In-process tokio task per import (no retry, no durability across restarts)	You need durability, retry-with-backoff, multi-worker scale-out, or queue observability

Set SKILL_POOL_REDIS_URL=redis://host:6379/0 (or its NixOS module equivalent: services.skill-pool-server.environment.SKILL_POOL_REDIS_URL) to enable.

When in-process fallback is fine

Single-node deployments (e.g. one VM running both server and Postgres).
Light mirror traffic — a handful of /v1/plugins/import calls per day.
You're OK with: mirror jobs interrupted by a server restart need to be re-triggered via a second POST /v1/plugins/import.

The fallback path returns outcome:"enqueued_inline" and job_id:"inline-<plugin_id>" so callers can distinguish it from durable queueing. The actual run_mirror work is identical.

When to provision Redis

Multi-node server replicas (the in-process queue does not coordinate across nodes — two replicas would each enqueue inline tasks for the same import).
Any expectation of automatic retry on transient git fetch failures.
You want to inspect or replay jobs out of band (Redis stream gives you a real queue UI; the in-process path does not).

The provisioning itself is straightforward — a single Redis container or managed instance, no clustering required at typical skill-pool scale. Wire SKILL_POOL_REDIS_URL and restart; no schema changes.

Where to read next

Tenant Onboarding — first-tenant playbook
SSO Setup — OIDC + SAML per IdP
Custom Domain + ACME — per-tenant hostnames
API Reference — every endpoint
FAQ — real failure modes from the first install

Cross-links into the codebase

server/src/main.rs — boot sequence
server/src/state.rs — AppState construction (DB, storage, Redis)
server/migrations/ — sqlx migration set (run in order, forward only)
packaging/systemd/ — systemd unit files (server + capturer)
packaging/proxy/ — Caddyfile + Traefik dynamic config
deploy/helm/skill-pool/ — Helm chart
deploy/terraform/aws/ — AWS Terraform starter
.github/workflows/ — CI/CD pipelines
docs/ops/ — runbook, capacity, rollback

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Operator Guide

Deploy path picker

Pre-flight (every path)

Path 1 — Single-node systemd + Caddy

1. Postgres

2. Binary + systemd

3. Caddy

4. First tenant

Path 2 — NixOS module

Flake input

Minimal configuration

Secrets with agenix

Module options

Web bundle

Rebuild + verify

Path 3 — Docker / Docker Compose

Path 4 — Kubernetes (Helm)

Path 5 — AWS EKS (Terraform)

TLS — `nip.io` + Let's Encrypt (no domain required)

Cost (lean baseline, eu-west-1, May 2026)

GitHub Actions CI/CD

Backup & restore

Postgres

Bundle storage

Restore drill

Day-2 ops

Plugin storage

On-disk layout

Backup

Marketplace cache

Mirror refresh

Growth notes

Redis & the job queue

When in-process fallback is fine

When to provision Redis

Where to read next

Cross-links into the codebase

FilesExpand file tree

Operator-Guide.md

Latest commit

History

Operator-Guide.md

File metadata and controls

Operator Guide

Deploy path picker

Pre-flight (every path)

Path 1 — Single-node systemd + Caddy

1. Postgres

2. Binary + systemd

3. Caddy

4. First tenant

Path 2 — NixOS module

Flake input

Minimal configuration

Secrets with agenix

Module options

Web bundle

Rebuild + verify

Path 3 — Docker / Docker Compose

Path 4 — Kubernetes (Helm)

Path 5 — AWS EKS (Terraform)

TLS — nip.io + Let's Encrypt (no domain required)

Cost (lean baseline, eu-west-1, May 2026)

GitHub Actions CI/CD

Backup & restore

Postgres

Bundle storage

Restore drill

Day-2 ops

Plugin storage

On-disk layout

Backup

Marketplace cache

Mirror refresh

Growth notes

Redis & the job queue

When in-process fallback is fine

When to provision Redis

Where to read next

Cross-links into the codebase

TLS — `nip.io` + Let's Encrypt (no domain required)