diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5bc9fd7..379c66e 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,10 @@ All notable changes to this project are documented here. Format follows [Keep a
 - **Shared Redis is now provisioned by default in every HA / autoscale module.** Previously `ha-hot-hot/{aws,azure}` and `unlimited-scale/{aws,azure}` shipped two-or-more application instances behind a load balancer with no shared session store, which silently broke cross-instance login and the worker-lock heartbeat in production HA deployments. The new default is an ElastiCache (AWS, Multi-AZ) / Azure Cache for Redis (Standard or Premium, zone-redundant) replication group sized at the procurement-friendly tier (`cache.t4g.small` / `Standard C1`). The Azure modules reject the single-node `Basic` SKU at validation time so an unsafe SKU choice fails fast.
 - **Pre-patch SSM / Run Command documents fail loud on a missing on-AMI script.** Previously the `if [ -x /opt/hailbytes/bin/ha-pre-patch-backup.sh ]; then ...; else WARN ...; fi` guard masked the case where the AMI was built before the Packer change that installs the script. Customers running an older AMI now see an explicit "rebuild the marketplace image from main" error instead of a silently no-op backup. Same change on Azure pre-patch. Applies to both `ha-hot-hot` and `unlimited-scale`.
 
+### Documentation
+
+- **Azure HA / autoscale: TLS termination tradeoff called out in READMEs.** In the default Standard LB mode the frontend is TCP passthrough on 443, so the browser terminates TLS against the VM's self-signed certificate, and the certificate CN (now the per-VM IMDS hostname after the corresponding `hailbytes-asm` / `hailbytes-sat` `setup.sh` change) matches neither the LB public IP nor any DNS record customers point at it. Production deployments should set `enable_application_gateway = true` with a real PFX, or front the module with their own upstream L7 LB. No code change; this describes existing behavior that was previously undocumented.
+
 ### Added
 
 - **Post-patch verifier SSM / Run Command documents** on every HA / autoscale module (AWS `aws_ssm_document.post_patch_verify`, Azure `azurerm_virtual_machine_run_command.post_patch_verify` / `azurerm_virtual_machine_scale_set_extension.post_patch_verify`). Invokes the on-AMI `/opt/hailbytes/bin/ha-post-patch-verify.sh` five-probe verifier so a rolling-replace can fail fast on a schema-version regression, encryption-key fingerprint mismatch, or worker-lock outage.
diff --git a/modules/ha-hot-hot/azure/README.md b/modules/ha-hot-hot/azure/README.md
index 96640f5..3c1e48b 100644
--- a/modules/ha-hot-hot/azure/README.md
+++ b/modules/ha-hot-hot/azure/README.md
@@ -22,6 +22,17 @@ flowchart TB
 RC -.failover.-> RCS[(Replica in second zone)]
 ```
 
+## TLS termination
+
+The default frontend is the Standard Load Balancer, which does **TCP passthrough on 443**, so the operator's browser terminates TLS directly against the VM's self-signed certificate. The marketplace image generates that certificate on first boot with the per-VM hostname as the CN, so it will **not** match the LB public IP or any DNS record (Azure Private DNS, Route 53, etc.) you point at it. Browsers will warn on every visit.
+
+For production, pick one:
+
+- **Recommended.** Set `enable_application_gateway = true` and supply a valid PFX bundle via `appgw_tls_pfx_base64` / `appgw_tls_pfx_password`. App Gateway terminates TLS with your certificate; the backend hop to the VMs is not user-visible. This also unlocks `waf_policy_id` for WAF parity with the AWS ALB story.
+- Front the module with your own upstream L7 LB / reverse proxy (Azure Front Door, NGINX, etc.) that terminates TLS with a certificate matching the URL operators actually use.
+
+The default LB mode is appropriate for dev / PoC and for compliance-led deployments where the operator URL is the per-VM hostname inside a private vnet.
+
 ## Cost estimate (East US, pay-as-you-go)
 
 For the three-shape AWS comparison and the canonical procurement-grade
diff --git a/modules/unlimited-scale/azure/README.md b/modules/unlimited-scale/azure/README.md
index 1a27602..61ff850 100644
--- a/modules/unlimited-scale/azure/README.md
+++ b/modules/unlimited-scale/azure/README.md
@@ -22,6 +22,17 @@ flowchart TB
 Mon -.alerts.-> AG[Action Group<br>email]
 ```
 
+## TLS termination
+
+The default frontend is the Standard Load Balancer, which does **TCP passthrough on 443**, so the operator's browser terminates TLS directly against the VMSS instance's self-signed certificate. Because VMSS instances rotate on autoscale and rolling refresh, the cert CN is the per-instance hostname (generated on first boot from IMDS) and never matches the LB public IP or any DNS record you point at it. Operators see a browser warning on every visit, and because each instance presents its own certificate, an exception accepted for one instance does not carry over as instances roll.
+
+For production, pick one:
+
+- **Recommended.** Set `enable_application_gateway = true` and supply a valid PFX bundle via `appgw_tls_pfx_base64` / `appgw_tls_pfx_password`. App Gateway terminates TLS with your certificate; per-instance certs are no longer user-visible and rolling-refresh stops surfacing cert churn. This also unlocks `waf_policy_id` for WAF parity with the AWS ALB story.
+- Front the module with your own upstream L7 LB / reverse proxy (Azure Front Door, NGINX, etc.) that terminates TLS with a certificate matching the URL operators actually use.
+
+The default LB mode is appropriate for dev / PoC and for compliance-led deployments where the operator URL is a per-instance hostname inside a private vnet.
+
 ## Cost estimate (East US, pay-as-you-go, default sizing)
 
 Unlimited-scale on Azure is a fundamentally different cost shape from