Every supported deploy path collated. Pick one based on your existing ops surface; the same Rust binary, same migrations, same backup rules apply across all of them. Estimated time-to-first-tenant: 5 min (single-node) to 30 min (AWS EKS).
| Path | When |
|---|---|
| Single-node systemd + Caddy | Homelab, small VPS, ~dozen developers, one or two tenants |
| NixOS module | You already manage NixOS hosts declaratively |
| Docker / Docker Compose | You want a one-host containerized deploy without k8s overhead |
| Kubernetes (Helm) | You have a cluster already and want a stock chart |
| AWS EKS (Terraform) | You're starting from zero and want a turnkey AWS deploy |
Each path is detailed below. All of them produce the same observable
behavior — same routes, same metrics endpoint at /metrics, same
healthcheck at /v1/healthz.
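Because every path exposes the same healthcheck and metrics endpoints, one smoke check can be reused across all of them. The following is a minimal sketch, assuming `curl` is installed; the function name `smoke_check` and the base URL are illustrative and not part of the repo.

```shell
#!/usr/bin/env bash
# smoke_check BASE_URL: succeed only if both documented endpoints answer
# without an HTTP error. BASE_URL is whatever your deploy path exposes.
smoke_check() {
  local base="${1%/}"   # strip one trailing slash, if present
  local path
  for path in /v1/healthz /metrics; do
    # -f: treat HTTP >= 400 as failure; -sS: quiet, but still print errors
    if ! curl -fsS -o /dev/null --max-time 5 "${base}${path}"; then
      echo "FAIL ${path}" >&2
      return 1
    fi
    echo "ok ${path}"
  done
}

# Usage: smoke_check http://127.0.0.1:8080
```

Running it against the server's bind address (rather than the public hostname) keeps the check independent of the reverse proxy and TLS setup.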
You need:
- Postgres 16+ with the `pgvector` extension. The default builds work without `pgvector`; `--features fastembed` requires it.
- A directory or bucket for bundle storage. Any opendal-supported backend works: `fs://`, `s3://`, `gcs://`, `azblob://`.
- A reverse proxy that can do TLS termination and route the `/v1/*` + `/metrics` paths to the server, everything else to the SvelteKit portal. Caddy, Traefik, nginx, and the AWS ALB all work.
- (Optional) Redis, used as a read-through cache, rate-limit store, and job queue. The server falls back gracefully when it's absent (caches become no-ops, rate limits fail-open, jobs run inline as detached tokio tasks). See Redis & the job queue below for when in-process fallback is acceptable vs. when to provision Redis.
Reference: docs/deploy/single-node.md.
sudo apt install -y postgresql-16 postgresql-16-pgvector  # the pgvector package backs CREATE EXTENSION vector
sudo -u postgres psql <<'SQL'
CREATE ROLE skillpool LOGIN PASSWORD 'changeme';
CREATE DATABASE skillpool OWNER skillpool;
\c skillpool
CREATE EXTENSION IF NOT EXISTS vector;
SQL
sqlx migrate run --source server/migrations \
  --database-url 'postgres://skillpool:changeme@localhost/skillpool'

The server does not auto-migrate on startup: migrations are a separate step so a broken deploy can't run a migration as a side effect.
cargo build --release -p skill-pool-server
sudo install -o root -g root -m 0755 \
target/release/skill-pool-server /usr/local/bin/
sudo useradd --system --home /var/lib/skill-pool --shell /usr/sbin/nologin skillpool
sudo mkdir -p /var/lib/skill-pool/bundles /etc/skill-pool
sudo chown -R skillpool:skillpool /var/lib/skill-pool
sudo cp packaging/systemd/skill-pool-server.service /etc/systemd/system/
sudo install -o skillpool -g skillpool -m 0600 \
packaging/systemd/skill-pool-server.env.example \
/etc/skill-pool/skill-pool-server.env
sudoedit /etc/skill-pool/skill-pool-server.env # paste real DSN + secrets
sudo systemctl daemon-reload
sudo systemctl enable --now skill-pool-server
journalctl -u skill-pool-server -f

sudo cp packaging/proxy/Caddyfile /etc/caddy/Caddyfile
sudoedit /etc/caddy/Caddyfile # set your real domain + email
sudo caddy validate --config /etc/caddy/Caddyfile
sudo systemctl reload caddy

Wildcard certs (tenant subdomains) need a DNS provider plugin:
uncomment the acme_dns line in the shipped Caddyfile.
sudo -u skillpool skill-pool-server admin tenant-create \
--slug acme --name "Acme Inc."
sudo -u skillpool skill-pool-server admin token-create \
  --tenant acme --name bootstrap

See Tenant Onboarding for the rest of the first-tenant playbook.
Reference: docs/deploy/nixos.md.
{
inputs.skill-pool.url = "github:olafkfreund/skill_pool";
outputs = { self, nixpkgs, skill-pool, ... }: {
nixosConfigurations.registry = nixpkgs.lib.nixosSystem {
system = "x86_64-linux";
modules = [
skill-pool.nixosModules.skill-pool-server
./registry-config.nix
];
};
};
}

{ pkgs, skill-pool, ... }:
{
services.skill-pool-server = {
enable = true;
package = skill-pool.packages.${pkgs.system}.skill-pool-server;
bind = "127.0.0.1:8080";
storageUri = "fs:///var/lib/skill-pool/bundles";
defaultTenant = "acme";
environmentFile = "/run/keys/skill-pool.env";
};
services.postgresql = {
enable = true;
package = pkgs.postgresql_17;
ensureDatabases = [ "skillpool" ];
ensureUsers = [{ name = "skillpool"; ensureDBOwnership = true; }];
extraPlugins = ps: [ ps.pgvector ];
};
services.caddy = {
enable = true;
virtualHosts."skill-pool.example.com".extraConfig = ''
reverse_proxy 127.0.0.1:3000
'';
virtualHosts."*.skill-pool.example.com".extraConfig = ''
@api path /v1/* /metrics
reverse_proxy @api 127.0.0.1:8080
reverse_proxy 127.0.0.1:3000
'';
};
}

age.secrets."skill-pool.env" = {
file = ./secrets/skill-pool.env.age;
owner = config.services.skill-pool-server.user;
group = config.services.skill-pool-server.group;
mode = "0400";
};
services.skill-pool-server.environmentFile =
  config.age.secrets."skill-pool.env".path;

| Option | Type | Default |
|---|---|---|
| `enable` | bool | `false` |
| `package` | package | (none) |
| `bind` | string | `"127.0.0.1:8080"` |
| `databaseUrl` | nullable string | `null` |
| `storageUri` | string | `"fs:///var/lib/skill-pool/bundles"` |
| `defaultTenant` | nullable string | `null` |
| `logLevel` | string | `"info,skill_pool=info"` |
| `logFormat` | enum | `"json"` |
| `otlpEndpoint` | nullable string | `null` |
| `environmentFile` | nullable path | `null` |
| `user` / `group` | string | `"skillpool"` |
| `stateDir` | path | `/var/lib/skill-pool` |
| `openFirewall` | bool | `false` |
The flake exposes packages.skill-pool-web, a buildNpmPackage
derivation that produces the adapter-node SvelteKit bundle. Use:
nix build .#skill-pool-web && PORT=3000 node result/index.js

sudo nixos-rebuild switch --flake .#registry
systemctl status skill-pool-server
journalctl -u skill-pool-server -f
curl -s http://127.0.0.1:8080/v1/healthz | jq

The repo ships two Dockerfiles (server/Dockerfile, web/Dockerfile).
A minimal Compose looks like:
version: "3.9"
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: skillpool
      POSTGRES_USER: skillpool
      POSTGRES_PASSWORD: changeme
    volumes: ["pgdata:/var/lib/postgresql/data"]
  server:
    image: ghcr.io/olafkfreund/skill-pool-server:v0.1.0
    environment:
      SKILL_POOL_DATABASE_URL: postgres://skillpool:changeme@postgres/skillpool
      SKILL_POOL_STORAGE_URI: fs:///var/lib/skill-pool/bundles
    volumes: ["bundles:/var/lib/skill-pool/bundles"]
    depends_on: [postgres]
  web:
    image: ghcr.io/olafkfreund/skill-pool-web:v0.1.0
    environment:
      PUBLIC_API_BASE_URL: http://server:8080
      ORIGIN: https://skill-pool.example.com
    depends_on: [server]
  caddy:
    image: caddy:2
    ports: ["80:80", "443:443"]
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - caddy_data:/data
volumes: { pgdata: {}, bundles: {}, caddy_data: {} }

Migrations run as a one-shot:

docker compose run --rm server skill-pool-server migrate

Reference: docs/deploy/kubernetes.md + deploy/helm/skill-pool/.
# 1. Ensure the namespace + Secret exist.
kubectl create namespace skill-pool
kubectl -n skill-pool create secret generic skill-pool-env \
--from-literal=SKILL_POOL_DATABASE_URL='postgres://…' \
--from-literal=SKILL_POOL_EMAIL_SECRET_KEY="$(openssl rand -hex 32)"
# 2. Run migrations as a one-shot Pod.
kubectl -n skill-pool run sqlx-migrate \
--rm -it --restart=Never \
--image ghcr.io/olafkfreund/skill-pool-server:v0.1.0 \
--env "SKILL_POOL_DATABASE_URL=…" \
--command -- /usr/local/bin/skill-pool-server migrate
# 3. Install the chart.
helm install skill-pool ./deploy/helm/skill-pool \
-f deploy/helm/skill-pool/values.yaml \
  -n skill-pool

values.yaml keys you care about:

- `image.server.tag` / `image.web.tag`: pin specific versions.
- `server.env.SKILL_POOL_STORAGE_URI`: S3/GCS/Azure bucket URI.
- `ingress.hosts[].host`: the public hostname.
- `ingress.annotations`: cert-manager issuer, ALB attributes, etc.
- `redis.existingSecret`: if you bring Redis, name of a Secret with `SKILL_POOL_REDIS_URL`.
A pre-upgrade Helm hook runs migrations automatically on every
helm upgrade. To roll back: helm rollback skill-pool <REV>. The
old binary reads the new schema fine because all schema changes are
additive (see docs/ops/rollback.md).
Reference: docs/deploy/aws.md + deploy/terraform/aws/.
The Terraform starter provisions:
- A VPC across 2 AZs (or 3 if you want HA).
- An EKS cluster with managed node groups.
- An RDS Postgres 16 instance with `pgvector` preloaded.
- An S3 bucket for bundles with versioning enabled.
- ECR repos for both images.
- An IAM role for IRSA so the pod can write to S3 without keys.
- A GitHub OIDC provider + an IAM role with permissions for the build/deploy workflows.
- The AWS Load Balancer Controller via Helm.
- cert-manager + a Let's Encrypt cluster issuer.
End-to-end:
cd deploy/terraform/aws/
${EDITOR:-vim} variables.tf # region, azs, github_repository
terraform init && terraform apply
# ~20 min later, connect to the cluster:
aws eks update-kubeconfig --region "$(terraform output -raw region)" \
--name "$(terraform output -raw cluster_name)"
# Bridge Secrets Manager → k8s Secret (or use External Secrets Operator).
# Then run migrations and helm install — same as Path 4.
helm install skill-pool ./deploy/helm/skill-pool \
-f deploy/helm/skill-pool/values-aws.yaml \
  -n skill-pool --create-namespace

The default deploy uses <dashed-ip>.nip.io for DNS, so no domain
purchase is needed. cert-manager issues the cert through the
Let's Encrypt HTTP-01 challenge.
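The hostname derivation can be isolated into a small function that refuses anything that does not look like an IPv4 address. That matters because a DNS lookup on a freshly created load balancer can return nothing yet. This is a sketch only; `nip_host` is a hypothetical helper name, and the check is a format check, not full address validation.

```shell
# nip_host PREFIX IPV4: print PREFIX.<dashed-ip>.nip.io
# Fails (exit 1) when IPV4 is empty or not four dot-separated digit groups.
nip_host() {
  local prefix="$1" ip="$2"
  case "$ip" in
    ""|*[!0-9.]*|*..*|.*|*.) return 1 ;;   # empty, bad chars, stray dots
    *.*.*.*.*) return 1 ;;                 # more than three dots
    *.*.*.*) ;;                            # exactly three dots: accept
    *) return 1 ;;                         # fewer than three dots
  esac
  printf '%s.%s.nip.io\n' "$prefix" "$(printf '%s' "$ip" | tr . -)"
}

# Usage: host="$(nip_host skill-pool "$ALB_IP")" || echo "ALB has no IP yet" >&2
```

Checking the result before running `helm upgrade` avoids setting the ingress host to a malformed name such as `skill-pool..nip.io`.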
ALB_HOST=$(kubectl -n skill-pool get ingress skill-pool \
-o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
ALB_IP=$(dig +short "$ALB_HOST" | head -n1)
DASHED_IP="${ALB_IP//./-}"
helm upgrade skill-pool ./deploy/helm/skill-pool \
  -f deploy/helm/skill-pool/values-aws.yaml \
  --set "ingress.hosts[0].host=skill-pool.${DASHED_IP}.nip.io" \
  -n skill-pool --reuse-values

Wait 30–120s for cert issuance:

kubectl -n skill-pool get certificate -w

| Component | Monthly |
|---|---|
| EKS control plane | $73 |
| 2× t3.medium | $60 |
| RDS t4g.medium | $50 |
| ALB | $22 |
| NAT (single AZ) | $32 |
| Misc (S3, ECR, R53, SM) | $11 |
| Total | ~$248 |
HA (Multi-AZ RDS, per-AZ NAT, third worker): +$110/mo. Dev/staging (single Spot, no NAT): ~$130/mo.
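The totals can be re-derived from the line items, which is handy when you swap an instance size and want to see the effect. A short arithmetic sketch using only the figures listed above:

```shell
# Baseline monthly line items from the cost table, in whole USD.
baseline_total=0
for cost in 73 60 50 22 32 11; do
  baseline_total=$((baseline_total + cost))
done

# The HA variant adds roughly $110/mo on top of the baseline.
ha_total=$((baseline_total + 110))

echo "baseline: \$${baseline_total}/mo"   # baseline: $248/mo
echo "with HA:  \$${ha_total}/mo"         # with HA:  $358/mo
```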
Four workflows ship in .github/workflows/:
| Workflow | Triggers | Purpose |
|---|---|---|
| CI | push to main, PRs | fmt + clippy + tests + web lint + helm lint |
| Build & push | push to main, tag v*, manual | Build + push both images to ECR |
| Deploy to EKS | tag v*, manual | helm upgrade + smoke-test + auto-rollback on failure |
| DB migrations | manual only | Break-glass: run sqlx migrate run from a one-shot pod |
All AWS-touching workflows authenticate via OIDC — no long-lived
AWS keys in GitHub. The only repo-level secret is AWS_ROLE_ARN.
Required repo-level variables:
| Name | Example |
|---|---|
| `AWS_REGION` | `eu-west-1` |
| `ECR_REPO_SERVER` | `skill-pool/server` |
| `ECR_REPO_WEB` | `skill-pool/web` |
| `EKS_CLUSTER_NAME` | `skill-pool-prod` |
| `HELM_RELEASE_NAME` | `skill-pool` |
| `HELM_NAMESPACE` | `skill-pool` |
| `PUBLIC_HOSTNAME` | `skill-pool.example.com` |
Image tagging: pushes to main tag <git-sha> + latest; pushing a
v* tag also pushes <git-ref-name> (semver). values-aws.yaml pins
specific tags — latest is for human convenience only and should
never be referenced by the cluster.
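The "never reference latest" rule is easy to enforce with a text-level guard in CI or a pre-commit hook. The sketch below assumes the values file uses keys literally named `tag:` (as `image.server.tag` and `image.web.tag` suggest); `values_pin_check` is a hypothetical helper, and it is a line-based check, not a YAML parser.

```shell
# values_pin_check FILE: fail if any `tag:` key in FILE is set to latest.
values_pin_check() {
  local file="$1"
  # Matches: tag: latest | tag: "latest" | tag: 'latest' (optional trailing comment)
  local pattern="^[[:space:]]*tag:[[:space:]]*[\"']?latest[\"']?[[:space:]]*(#.*)?\$"
  if grep -Eq "$pattern" "$file"; then
    echo "error: $file references the 'latest' tag" >&2
    return 1
  fi
}

# Usage: values_pin_check deploy/helm/skill-pool/values-aws.yaml
```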
Auto-rollback: deploy.yml runs helm rollback if the rollout-status
or smoke-test step fails after a successful helm upgrade.
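The control flow described here (roll back only when verification fails after a successful upgrade) can be sketched as a small wrapper. This is not the contents of deploy.yml; `deploy_with_rollback` and the three callbacks are hypothetical names used to show the ordering.

```shell
# deploy_with_rollback UPGRADE VERIFY ROLLBACK
# Each argument names a command or shell function.
#   upgrade fails -> nothing was applied, so no rollback; exit 1
#   verify fails  -> run rollback; exit 1
#   both succeed  -> exit 0
deploy_with_rollback() {
  local upgrade="$1" verify="$2" rollback="$3"
  if ! "$upgrade"; then
    echo "upgrade failed; nothing to roll back" >&2
    return 1
  fi
  if ! "$verify"; then
    echo "verification failed; rolling back" >&2
    "$rollback"
    return 1
  fi
  echo "deploy ok"
}

# Usage (the callback bodies are assumptions, not taken from the workflow):
#   do_upgrade()  { helm upgrade skill-pool ./deploy/helm/skill-pool -n skill-pool --reuse-values; }
#   do_verify()   { curl -fsS "https://${PUBLIC_HOSTNAME}/v1/healthz" >/dev/null; }
#   do_rollback() { helm rollback skill-pool -n skill-pool; }
#   deploy_with_rollback do_upgrade do_verify do_rollback
```

Skipping the rollback when the upgrade itself fails avoids reverting to an older revision when no new revision was ever applied.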
# Single node — daily cron:
pg_dump -Fc skillpool > /backups/skillpool-$(date +%F).dump
# RDS: automated snapshots; bump retention to 30 days for prod.

- `fs://`: tar `/var/lib/skill-pool/bundles` weekly. Bundles are immutable once published, so the diff is small.
- `s3://`: turn on bucket versioning. Lifecycle rule: delete non-current versions after 90 days.
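For the `fs://` case, the weekly tar can be wrapped in a function that date-stamps the archive and verifies it is readable before reporting success. A minimal sketch; `backup_bundles` and the `/backups` destination are illustrative names.

```shell
# backup_bundles SRC_DIR DEST_DIR
# Writes DEST_DIR/bundles-YYYY-MM-DD.tar.gz and prints the archive path.
backup_bundles() {
  local src="$1" dest="$2" archive
  [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
  mkdir -p "$dest" || return 1
  archive="${dest}/bundles-$(date +%F).tar.gz"
  # -C makes archived paths relative to the parent of SRC_DIR.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
  # Listing the archive is a cheap readability check.
  tar -tzf "$archive" >/dev/null || return 1
  printf '%s\n' "$archive"
}

# Usage: backup_bundles /var/lib/skill-pool/bundles /backups
```

If you put a `$(date +%F)` expression directly into a crontab line rather than a script, remember that cron treats `%` specially, so it has to be written as `\%` there.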
Recommended every quarter:
1. `pg_restore` into a scratch Postgres.
2. Point a scratch `skill-pool-server` at it with `--storage-uri` pointed at the bundle-storage backup.
3. `curl /v1/healthz`, list skills, download one.
Full rollback procedures: docs/ops/rollback.md.
- Metrics: `/metrics` on the server in Prometheus format. The Grafana dashboard ships in `ops/grafana/skill-pool.json`; the Prometheus alert rules in `ops/prometheus/skill-pool.rules.yaml`.
- Tracing: set `SKILL_POOL_OTLP_ENDPOINT=http://collector:4317`. All request/response spans plus the background-task spans go out with `service.name=skill-pool-server`.
- Logs: JSON to stdout by default (`SKILL_POOL_LOG_FORMAT=json`). Pretty mode for dev: `SKILL_POOL_LOG_FORMAT=pretty`.
- Runbook: `docs/ops/runbook.md` covers the SLO breach playbook per top-N alert.
- Capacity planning: `docs/ops/capacity.md` covers the tier-by-tier sizing curve.
- Rollback: `docs/ops/rollback.md` covers the forward-only migration discipline + DR from snapshots.
Plugins (per-tenant Claude Code marketplace, see
docs/plugins.md) introduce two operator-visible
concerns beyond skill bundles: on-disk bare git repos for the
/git/plugins/<slug>.git endpoint, and the per-tenant pre-rendered
marketplace cache in Postgres.
Source of truth: server/src/storage.rs:71-94
(Storage::plugin_git_path).
For each internal- or mirror-sourced plugin, skill-pool
materialises a bare git repo on first publish at:
<storage-root>/<tenant-uuid>/plugins/<slug>.git/
<storage-root> is the path component of SKILL_POOL_STORAGE_URI
when it starts with fs://. The git endpoint requires fs://
storage — S3/GCS/Azure backends cannot serve git-upload-pack and
the endpoint returns an explicit error at publish time. A per-process
checkout cache for object-store-backed plugin git is deferred.
The tenant UUID prefix is the same one bundle storage uses, so
rm -rf <storage-root>/<tenant-uuid>/ cleans plugin repos and skill
bundles in one shot when a tenant is decommissioned.
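For scripting backups or tenant clean-up, the on-disk layout can be mirrored in shell. This is a sketch of the documented path scheme only, not a port of `Storage::plugin_git_path`; the function names are hypothetical.

```shell
# storage_root URI: print the filesystem root of an fs:// URI, fail otherwise.
storage_root() {
  case "$1" in
    fs://*) printf '%s\n' "${1#fs://}" ;;
    *) echo "plugin git needs fs:// storage, got: $1" >&2; return 1 ;;
  esac
}

# plugin_git_path URI TENANT_UUID SLUG: print the bare-repo directory
# <storage-root>/<tenant-uuid>/plugins/<slug>.git/
plugin_git_path() {
  local root
  root="$(storage_root "$1")" || return 1
  printf '%s/%s/plugins/%s.git/\n' "${root%/}" "$2" "$3"
}

# Usage: plugin_git_path "$SKILL_POOL_STORAGE_URI" "$TENANT_UUID" my-plugin
```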
Bare git repos are append-mostly trees of immutable blobs (a publish only adds objects; archive flips a DB row, never deletes files). Two practical rules:
- Include `<storage-root>/<tenant-uuid>/plugins/` in the same backup job that snapshots `bundles/`. A daily `tar` of `<storage-root>` covers both. Incremental backup tools (`restic`, `borg`) deduplicate well: only newly published plugin objects transfer on each run.
- A restored repo serves correctly without re-materialisation from Postgres. Trees + blobs are self-contained; the only DB row needed is `plugins` (for the `sourcing_mode` check in `plugin_git::resolve_repo_path`).
If the bare repo is missing after a publish (storage write failed
silently — logged at warn level), the API returns 404 from
/git/plugins/<slug>.git. Recovery: republish. The materialiser
is idempotent — the second pass walks the same content tree and
writes the same objects.
Source of truth: server/migrations/0032_plugin_marketplace_entries.sql.
The plugin_marketplace_entries table holds one row per
(tenant_id, plugin_slug) — the latest published version pre-rendered
into the exact JSON object that splices into
/.claude-plugin/marketplace.json. The marketplace handler
(server/src/routes/marketplace.rs) is a single SELECT plus a JSON
wrapper, with a strong ETag and Cache-Control: public, max-age=60
on the response. Conditional GETs return 304 on match.
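From a client's point of view, the ETag behaviour means a cached copy can be revalidated without downloading the body again. A sketch with `curl`, assuming you saved the `ETag` header from an earlier response; the helper names and the host in the usage line are placeholders.

```shell
# conditional_status URL ETAG: print the HTTP status of a conditional GET.
# ETAG is the header value exactly as received, surrounding quotes included.
conditional_status() {
  curl -sS -o /dev/null -w '%{http_code}' \
    -H "If-None-Match: $2" "$1"
}

# needs_refetch URL ETAG: succeed unless the server answers 304 Not Modified.
needs_refetch() {
  [ "$(conditional_status "$1" "$2")" != "304" ]
}

# Usage (hypothetical host):
#   url=https://acme.skill-pool.example.com/.claude-plugin/marketplace.json
#   needs_refetch "$url" "$saved_etag" && curl -sS -o marketplace.json "$url"
```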
Storage cost: a few hundred bytes per plugin per tenant — negligible versus the bundle tarballs.
mirror-sourced plugins are listed in marketplace.json with a local
bare repo. Two things drive the cache forward:
- Per-plugin `pull_interval_secs` (set on `POST /v1/plugins/import`, default 86400 = 24h, minimum 300). The server's periodic sweep (`spawn_mirror_sweep` in `server/src/main.rs`) re-enqueues mirror jobs whose `last_pulled_at` is older than the interval.
- Failures record `fetch_error` + `fetch_error_at` on the plugin row so operators can spot stuck mirrors. The next sweep retries.
fetch_error is surfaced in the admin Plugin detail page; the
/marketplace public browser hides plugins whose latest pull failed.
If you need fresh mirror content immediately, hit POST /v1/plugins/import
again with the same slug and url — the upsert is idempotent and the
job is re-enqueued (or, without Redis, spawned in-process).
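The interval rules above reduce to two small checks, shown here with epoch seconds. This is an illustration of the documented numbers, not the server's sweep code; in particular, rejecting (rather than clamping) an interval below the minimum is an assumption of this sketch.

```shell
DEFAULT_PULL_INTERVAL=86400   # 24h, the documented default
MIN_PULL_INTERVAL=300         # the documented minimum

# validate_interval [SECS]: print the interval to use; fail if it is not a
# whole number or is below the minimum. With no argument, use the default.
validate_interval() {
  local secs="${1:-$DEFAULT_PULL_INTERVAL}"
  case "$secs" in ''|*[!0-9]*) return 1 ;; esac
  [ "$secs" -ge "$MIN_PULL_INTERVAL" ] || return 1
  printf '%s\n' "$secs"
}

# mirror_is_due LAST_PULLED_AT NOW INTERVAL: succeed when the last pull is
# older than the interval (strictly greater elapsed time).
mirror_is_due() {
  [ $(( $2 - $1 )) -gt "$3" ]
}

# Usage: mirror_is_due "$last_pulled_at" "$(date +%s)" "$(validate_interval 3600)"
```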
- Plugin bare repos grow with `commits × tree-size`. A plugin bundling a dozen skills with one publish per week settles at single-digit MB after a year.
- The 256 KiB cap on the publish-time `manifest` body (`server/src/routes/plugins.rs:42`) caps the inline-blob blast radius, so operators don't need a separate per-plugin size monitor.
- The total number of plugins per tenant has no hard cap, but the marketplace JSON is fetched on every Claude Code refresh; tenants with thousands of plugins will see noticeable cold-fetch latency. Tier-by-tier sizing for plugins-heavy tenants is on the capacity planning backlog (`docs/ops/capacity.md`).
skill-pool uses Redis for three optional things:
| Use | Without Redis | When you need Redis |
|---|---|---|
| Read-through cache for hot reads | Falls back to direct DB | Multi-replica deployments where you want all replicas hot |
| Per-tenant rate-limit counters | Fail-open | Production multi-tenant with strict per-tenant SLOs |
| Job queue for PluginMirrorJob (and future job kinds) | In-process tokio task per import (no retry, no durability across restarts) | You need durability, retry-with-backoff, multi-worker scale-out, or queue observability |
Set SKILL_POOL_REDIS_URL=redis://host:6379/0 (or its NixOS module
equivalent: services.skill-pool-server.environment.SKILL_POOL_REDIS_URL)
to enable.
In-process fallback is acceptable in these cases:

- Single-node deployments (e.g. one VM running both server and Postgres).
- Light mirror traffic: a handful of `/v1/plugins/import` calls per day.
- You accept that mirror jobs interrupted by a server restart must be re-triggered via a second `POST /v1/plugins/import`.
The fallback path returns outcome:"enqueued_inline" and
job_id:"inline-<plugin_id>" so callers can distinguish it from durable
queueing. The actual run_mirror work is identical.
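A caller that wants to warn when an import was not durably queued only needs to look at the job id prefix. A sketch based on the `inline-<plugin_id>` format described above; the helper names are hypothetical.

```shell
# is_inline_job JOB_ID: succeed when the id carries the "inline-" prefix,
# meaning the import ran without a durable queue behind it.
is_inline_job() {
  case "$1" in
    inline-?*) return 0 ;;
    *) return 1 ;;
  esac
}

# inline_plugin_id JOB_ID: print the plugin id embedded in an inline job id.
inline_plugin_id() {
  is_inline_job "$1" || return 1
  printf '%s\n' "${1#inline-}"
}

# Usage, assuming the response is JSON with a top-level job_id field and jq is installed:
#   job_id="$(curl -sS -X POST ... | jq -r .job_id)"
#   is_inline_job "$job_id" && echo "not durable: re-trigger after a restart" >&2
```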
Provision Redis in any of these cases:

- Multi-node server replicas (the in-process queue does not coordinate across nodes, so two replicas would each enqueue inline tasks for the same import).
- Any expectation of automatic retry on transient git fetch failures.
- You want to inspect or replay jobs out of band (Redis stream gives you a real queue UI; the in-process path does not).
The provisioning itself is straightforward — a single Redis container
or managed instance, no clustering required at typical skill-pool
scale. Wire SKILL_POOL_REDIS_URL and restart; no schema changes.
- Tenant Onboarding — first-tenant playbook
- SSO Setup — OIDC + SAML per IdP
- Custom Domain + ACME — per-tenant hostnames
- API Reference — every endpoint
- FAQ — real failure modes from the first install
- `server/src/main.rs`: boot sequence
- `server/src/state.rs`: `AppState` construction (DB, storage, Redis)
- `server/migrations/`: sqlx migration set (run in order, forward only)
- `packaging/systemd/`: systemd unit files (server + capturer)
- `packaging/proxy/`: Caddyfile + Traefik dynamic config
- `deploy/helm/skill-pool/`: Helm chart
- `deploy/terraform/aws/`: AWS Terraform starter
- `.github/workflows/`: CI/CD pipelines
- `docs/ops/`: runbook, capacity, rollback