
Single-box AWS deployment: prod compose stack, Caddy edge, lifecycle scripts #32

Merged
sovITxyz merged 11 commits into main from deploy/aws-v0 on Jun 6, 2026

sovITxyz commented on Jun 6, 2026

Owner

What

Operator tooling to run v0 on one EC2 box (pre-Phase-29/30; those phases own image-build CI and k8s later):

  • Dockerfiles for api/, worker/, web/ — lockfile-frozen uv sync (CPU-only torch preserved), pandoc+libmagic+WordNet baked into the worker (zips extracted — nltk can't read them zipped), Next standalone bundle (output: "standalone" added to next.config.ts; dev/CI unaffected).
  • infra/docker-compose.prod.yml — self-contained prod stack. Only Caddy publishes ports (80/443); Postgres/Redis/Milvus/MinIO/etcd/api/web are compose-network-only. ${VAR:?} guards refuse to boot on any missing secret (stand-in for the Phase 18 startup guard). Shared volumes: upload handoff (api→worker) + HF model cache (populated once by the prewarm one-shot; runtime is HF_HUB_OFFLINE=1).
  • infra/caddy/ — Caddy 2.11.4 + rate-limit module: 10/min on /api/auth/*, 6/min on /api/search-summary + /api/upload, 600/min backstop (Phase 19 gap, edge-mitigated); body caps 1MB JSON / 200MB global (below the api's 200MiB so the Next proxy never buffers a doomed upload); default_sni so bare-IP browsers (no SNI) can handshake.
  • infra/aws/ — provision / deploy / start / stop / status / destroy. Tag-based, re-runnable, secrets generated on-box (0600, never in git). Stop = ~$12/mo resting cost.
  • docs/DEPLOY_AWS.md — runbook: costs, security posture (mitigated vs accepted v0 risks), day-2 ops, domain-later steps.
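The "secrets generated on-box (0600, never in git)" property can be illustrated with a minimal sketch. It is not the PR's deploy.sh: the helper name write_secret_from_stdin, the file name, and the sample value are all hypothetical. The idea is to read the value from stdin with the shell builtin read and write it with the builtin printf, so the secret is never passed as an argument to an external process.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from infra/aws/deploy.sh.
write_secret_from_stdin() {
  local path=$1 secret
  IFS= read -r secret            # builtin: the value arrives on stdin, not argv
  (
    umask 077                    # a newly created file gets mode 0600
    printf '%s\n' "$secret" > "$path"   # printf is also a builtin
  )
}

demo_dir=$(mktemp -d)
printf 'example-not-a-real-key\n' | write_secret_from_stdin "$demo_dir/llm_key"

secret_mode=$(ls -l "$demo_dir/llm_key" | cut -c1-10)
secret_body=$(cat "$demo_dir/llm_key")
echo "mode: $secret_mode"
```

Note that the umask only affects files created by the write. If the target already exists with looser permissions, an explicit chmod 600 would still be needed.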

Verification

  • All four images built; api import graph (12 routes), worker pipeline imports + dedup.signature(), Celery task registration, alembic + bootstrap entrypoints validated in-container.
  • Full local dress rehearsal of the prod stack (isolated project): migrations 0001→0002, library_vectors created, no-SNI TLS handshake with IP-SAN cert, signup 201 → login 200 → HttpOnly sg_session → authed /library 200, anon 307, bad-login 401, HTTP→HTTPS 308, rate limiter returning 429 after the auth-zone threshold.
  • Multi-agent adversarial review over the diff (Dockerfiles/compose/Caddy/scripts/security dimensions); all confirmed findings fixed (branch preflight in deploy.sh, SG ingress self-heal, smoke-test hardening, argv-safe key writes).

Notes

  • No app-code changes beyond next.config.ts. Tenant-isolation surface untouched (no new DB/Milvus queries).
  • Accepted v0 risks are enumerated in the runbook and map to planned Phases 18–22.
  • /search-summary needs GOOGLE_API_KEY or PPQ_API_KEY on the box; route 503s cleanly until set.

sovITxyz added 11 commits June 5, 2026 18:34
output: 'standalone' in next.config.ts so the image ships only
.next/standalone + .next/static and runs 'node server.js' (not
'next start'). API_BASE_URL stays a runtime env var (server-only,
never inlined); NODE_ENV=production baked so the session cookie
gets its Secure attribute. pnpm 9 pinned explicitly — package.json
deliberately has no packageManager field (web/AGENTS.md).

Build context is the repo root because api/ imports db/embedding/
retrieval/scripts.bootstrap_milvus from ../worker at runtime (the
one allowed cross-package import). uv sync --frozen --no-dev from
api/uv.lock keeps torch on the CPU index (no CUDA wheels). Models
are NOT baked — they live in the shared hf-cache volume populated
by the prewarm one-shot; runtime loads with HF_HUB_OFFLINE=1.
Single uvicorn worker on purpose: each process lazily loads ~3.7GB
of models. Root .dockerignore filters the api/worker contexts.

Same image serves the Celery worker and the migrate /
bootstrap-milvus one-shots (cwd must be /app/worker — the project
is non-packaged by design). pandoc + libmagic1 + libgomp1 installed;
WordNet baked at build and explicitly UNZIPPED — nltk cannot read
the corpus from the downloader's zips (verified empirically:
find('corpora/wordnet') fails against the zip, dedup.signature()
works after extraction). --concurrency=1 because each new-book
ingest loads BGE-Large up to twice (~3-4GB); --time-limit=3600 so
one poisoned upload can't pin the CPU.

Self-contained prod compose (NOT an overlay, since compose merges ports:
additively, and 'only Caddy publishes a port' is the security
property this file guarantees). Postgres/Redis/etcd/MinIO/Milvus
(which has no auth)/api/web stay on the internal network; secrets
use ${VAR:?} so a missing one refuses to boot — the no-code stand-in
for the Phase 18 startup guard. Shared volumes carry the api→worker
upload handoff and the HF model cache (prewarm one-shot is the only
thing allowed to touch HuggingFace; runtime is offline).
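Compose interpolation borrows the ${VAR:?} form from POSIX shell parameter expansion, so the "refuses to boot on a missing secret" behavior can be sketched in plain shell. This is an illustration only: EXAMPLE_DB_PASSWORD is a hypothetical name, not one of the stack's real variables.

```shell
#!/usr/bin/env bash
# EXAMPLE_DB_PASSWORD is a hypothetical variable name used for illustration.
require_secret() {
  # The subshell keeps a failed guard from terminating the calling script.
  ( : "${EXAMPLE_DB_PASSWORD:?EXAMPLE_DB_PASSWORD must be set}" ) 2>/dev/null
}

unset EXAMPLE_DB_PASSWORD
require_secret; unset_status=$?

EXAMPLE_DB_PASSWORD=''
require_secret; empty_status=$?

EXAMPLE_DB_PASSWORD='not-a-real-secret'
require_secret; set_status=$?

echo "unset=$unset_status empty=$empty_status set=$set_status"
```

The colon in :? matters: it rejects an empty value as well as an unset one, which is the stricter behavior a secret guard wants.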

Caddy 2.11.4 + mholt/caddy-ratelimit: per-IP limits on /api/auth/*
(10/min), /api/search-summary + /api/upload (6/min), 600/min
backstop — verified live (429 + Retry-After). default_sni is
load-bearing: browsers send no SNI for bare-IP URLs and the
handshake aborts without it (verified). Body caps nest, smallest
wins: 1MB JSON routes, 200MB global — deliberately below the api's
200MiB cap so the Next proxy never buffers a doomed upload.
Healthcheck traverses TLS→host-match→proxy→web, not just :80.
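The "200MB below 200MiB" point is a decimal versus binary unit difference. A small arithmetic sketch, assuming the edge reads MB as 1000*1000 bytes and the api reads MiB as 1024*1024 bytes (the variable names are illustrative):

```shell
#!/usr/bin/env bash
# Byte values of the two caps under the stated unit assumptions.
edge_cap_bytes=$((200 * 1000 * 1000))    # 200MB at the edge
api_cap_bytes=$((200 * 1024 * 1024))     # 200MiB in the api
headroom_bytes=$((api_cap_bytes - edge_cap_bytes))

echo "edge cap: $edge_cap_bytes bytes"
echo "api cap:  $api_cap_bytes bytes"
echo "headroom: $headroom_bytes bytes"
```

Under those assumptions, any upload the edge admits is already under the api's limit, so the larger cap never becomes the one that rejects a request after it has been proxied.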
Tag-based and re-runnable: provision.sh (Ubuntu 24.04 t3a.xlarge,
100GB gp3, IMDSv2 required, SG = 443/80 world + 22 admin-IP with
self-healing ingress, Elastic IP), deploy.sh (clone → on-box secret
generation with 0600 perms → build → migrate → bootstrap-milvus
with cold-gRPC retry → model prewarm → up → outside-in smoke:
fresh user signup 201 / login 200 / HttpOnly cookie / authed
library 200 / bad-login 401; @example.com because the api 422s
reserved TLDs), start/stop/status (stopped box bills ~$12/mo:
EBS + EIP only), destroy (typed confirmation). LLM keys forward
from the operator shell via ssh stdin and are written without
touching argv; branch preflight fails fast if not pushed.
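The "~$12/mo: EBS + EIP only" figure can be sanity-checked with back-of-envelope arithmetic. The unit prices below are assumptions, not values from the PR: $0.08 per GB-month for gp3, $0.005 per hour for a public IPv4 address, and a 730-hour month. Actual prices vary by region and over time, and the runbook's cost table is the authority.

```shell
#!/usr/bin/env bash
# Amounts are in mills (thousandths of a dollar) to stay in integer math.
# All three prices are assumed for illustration; check current AWS pricing.
gp3_mills_per_gb_month=80      # assumed $0.08 per GB-month
eip_mills_per_hour=5           # assumed $0.005 per hour
hours_per_month=730            # assumed average month length
volume_gb=100                  # from the PR: 100GB gp3

ebs_mills=$((volume_gb * gp3_mills_per_gb_month))
eip_mills=$((hours_per_month * eip_mills_per_hour))
total_mills=$((ebs_mills + eip_mills))

printf 'stopped box: $%d.%02d per month\n' \
  $((total_mills / 1000)) $(((total_mills % 1000) / 10))
```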
Cost table (running/stopped/8h-day), what deploy.sh does, the
security posture (mitigated vs accepted-v0 risks), day-2 ops,
domain-later steps (SITE_HOST + drop tls internal for ACME), and
deltas vs the Phase 29/30 plan.

The remote script reaches the box via 'bash -s' reading ssh stdin;
'docker compose run' attaches the container's stdin by default, so
the migrate one-shot consumed the remainder of the script off the
stream — bash hit EOF and exited 0 right after migrate, skipping
bootstrap-milvus/prewarm/up, and the smoke test then failed against
a stack that was never started (first real deploy, curl exit 7).
Fix: </dev/null on every 'docker compose run' inside the heredoc.
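The failure mode reproduces locally without ssh or Docker. In this sketch, cat stands in for 'docker compose run' (any child that reads its inherited stdin behaves the same way), and the script text is fed to 'bash -s' through a pipe in place of the ssh stream.

```shell
#!/usr/bin/env bash
# 'cat' is a stand-in for a command that attaches stdin, such as a compose run.
leaky_script='echo before
cat >/dev/null
echo after'

fixed_script='echo before
cat >/dev/null </dev/null
echo after'

# bash reads the script from the same stream its children inherit.
leaky_out=$(printf '%s\n' "$leaky_script" | bash -s)
fixed_out=$(printf '%s\n' "$fixed_script" | bash -s)

printf 'leaky run printed:\n%s\n' "$leaky_out"
printf 'fixed run printed:\n%s\n' "$fixed_out"
```

In the leaky run the child swallows the rest of the script, bash sees EOF, and the run ends with status 0 without ever reaching 'echo after'. That silent success is why the skipped steps only surfaced at the smoke test.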
The unquoted REMOTE delimiter command-substitutes backticks on the
OPERATOR machine during expansion — a backticked 'bash -s' in a
comment executed locally and hung the deploy blocking on stdin.
Comment text now uses single quotes, with a warning for future
editors.
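The expansion rule behind this bug can be shown with three small here-documents. The function names and the echoed marker are illustrative only. With an unquoted delimiter, backticks in the body run on the machine that expands the heredoc; single quotes inside the body leave the text alone; quoting the delimiter disables expansion entirely.

```shell
#!/usr/bin/env bash
render_unquoted() {
  cat <<REMOTE
# piped into `echo RAN-ON-OPERATOR`
REMOTE
}

render_single_quoted_text() {
  cat <<REMOTE
# piped into 'echo RAN-ON-OPERATOR'
REMOTE
}

render_quoted_delimiter() {
  cat <<'REMOTE'
# piped into `echo RAN-ON-OPERATOR`
REMOTE
}

unquoted_out=$(render_unquoted)
single_quoted_out=$(render_single_quoted_text)
quoted_delim_out=$(render_quoted_delimiter)

printf '%s\n' "$unquoted_out" "$single_quoted_out" "$quoted_delim_out"
```

Quoting the delimiter also stops variable expansion, so it is only an option when nothing in the remote script needs values substituted on the operator side. The commit takes the second route and changes the comment text instead.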
busybox ssl_client sends a literal-IP SNI; default_sni only covers
ABSENT SNI, so the in-container TLS healthcheck found no cert for
127.0.0.1 and unhealthy-looped on the box (masked in rehearsal where
SITE_HOST was 127.0.0.1). Healthcheck now probes a loopback-only
:8081 plain-HTTP listener that still traverses the proxy→web chain;
fallback_sni added so real clients sending unmatched SNI get the
site cert; the TLS path stays covered by the external smoke test.

…as a directive

A bare-address single-site Caddyfile swallows any later site block
as an unrecognized directive of the first site; caddy restart-looped
on 'unrecognized directive: http://127.0.0.1:8081'. Main site now
explicitly braced. (The earlier local validation masked this by
piping caddy validate through tail, eating the exit code.)
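The masked exit code is ordinary pipeline behavior: without pipefail, a pipeline reports the status of its last command. A minimal sketch, with a hypothetical fake_validate function standing in for a failing 'caddy validate':

```shell
#!/usr/bin/env bash
# fake_validate is a hypothetical stand-in for a validation command that fails.
fake_validate() {
  echo 'Error: unrecognized directive'
  return 1
}

# Default behavior: the pipeline's status is tail's status.
fake_validate | tail -n 1 >/dev/null
masked_status=$?

# With pipefail, a failing earlier stage decides the pipeline's status.
( set -o pipefail; fake_validate | tail -n 1 >/dev/null )
pipefail_status=$?

echo "without pipefail: $masked_status, with pipefail: $pipefail_status"
```

In bash, checking ${PIPESTATUS[0]} right after the pipeline is another way to recover the first stage's status when pipefail cannot be enabled globally.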
sovITxyz merged commit 878c78d into main on Jun 6, 2026
10 checks passed
sovITxyz deleted the deploy/aws-v0 branch on June 6, 2026 01:56