Add shared Redis session store to HA and autoscale modules #3
…h SSM doc, align cost table
The SAT/ASM runbook and customer-facing pricing have always described ElastiCache Multi-AZ as part of the HA topology, and the application refuses to share session state across instances without it. The Terraform modules were silently shipping HA without Redis, so a customer applying the module got two app instances that couldn't share logins or worker locks.
Three fixes:
1. ElastiCache replication group provisioned by default in both ha-hot-hot and unlimited-scale, with var.enable_managed_redis / var.redis_node_type / var.redis_endpoint_override knobs for customers who bring their own Redis (MemoryDB, existing ElastiCache, self-managed Sentinel). Module wires the endpoint into user_data so the AMI's first boot picks it up. New redis_endpoint and redis_mode outputs surface the wiring (and "disabled" when neither managed Redis nor an override is configured).
2. Post-patch SSM document. The pre-patch document already invokes /opt/hailbytes/bin/ha-pre-patch-backup.sh; this adds the companion post_patch_verify document that runs the on-VM five-probe verifier. Lets the autoscaling instance_refresh fail fast on a schema-version regression, encryption-key fingerprint mismatch, or smoke-test failure.
3. ha-hot-hot README cost table now shows two reference shapes side-by-side: the Terraform "starter" defaults (t3.large + db.t3.medium) and the procurement-grade sizing the SAT runbook quotes (m6i.large + db.m6g.large + ElastiCache cache.t4g.small). Numbers now agree with hailbytes-sat/docs/AWS_HA_DEPLOYMENT.md and the customer-facing pricing the account team uses. Architecture diagram updated to include Redis. Prerequisites bumped to include ElastiCache permissions.
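The selection between managed Redis, a customer-supplied endpoint, and the "disabled" signal can be modelled as a small decision function. The following is a minimal Python sketch of that logic, not the module's actual HCL: the mode names "managed" and "override" are hypothetical (only "disabled" appears in the commit message), and the precedence of managed Redis over an override when both are set is an assumption.

```python
from typing import NamedTuple, Optional


class RedisWiring(NamedTuple):
    mode: str                # "managed" / "override" are assumed names; "disabled" is from the commit
    endpoint: Optional[str]  # None when mode is "disabled"


def resolve_redis(enable_managed_redis: bool,
                  managed_endpoint: Optional[str],
                  redis_endpoint_override: str = "") -> RedisWiring:
    """Pick the endpoint that would be passed to the instances via user_data."""
    # Assumption: a provisioned managed endpoint wins over an override.
    if enable_managed_redis and managed_endpoint:
        return RedisWiring("managed", managed_endpoint)
    if redis_endpoint_override:
        return RedisWiring("override", redis_endpoint_override)
    # Neither source is configured: report it explicitly instead of guessing.
    return RedisWiring("disabled", None)
```

An operator reading the outputs would treat a "disabled" mode on an HA deployment as a misconfiguration to fix before go-live.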
…le tier
Match the cost-shape framing from ha-hot-hot: show infra, marketplace meter, and all-in totals so procurement reviewers can compare single-vm (~$84/mo + meter), ha-hot-hot (~$435-515/mo + meter), and unlimited-scale (~$1,200/mo + meter for 3 instances) on the same axis. Add ElastiCache to the table now that the module actually provisions it. Note the cache.t4g.small -> cache.m6g.large bump around 5+ instances.
Procurement reviewers were getting two different price tables when they cross-referenced the SAT runbook (m6i.large / db.m6g.large / $1,150-1,215/mo all-in) against the Terraform module READMEs (t3.large defaults, ~$375/mo infra, meter buried as "separate"). The drift was real, not just doc-style: the modules ship cheaper starter defaults so PoCs don't burn $1k/mo, but the runbook quotes the recommended procurement-grade sizing. Both numbers are correct - they're describing different shapes.
This commit makes that explicit:
- COST_SHAPES.md (new, repo root): three AWS shapes side-by-side (single / HA / unlimited-scale), each with infra + per-vCore meter + all-in totals at procurement-grade sizing. Includes the self-managed-DB variant of HA. Documents the per-vCore meter as a first-class cost line in its own table (the largest single line in HA and unlimited-scale; it scales with instance count, not topology). Points at hailbytes-sat/docs/AWS_HA_DEPLOYMENT.md as the canonical source so there is one place to update when prices change. Includes the Asiera/HEAnet EU pricing note.
- single-vm/aws/README.md: quantify the marketplace meter line ($0.24/vCPU-hr x 2 vCPU = ~$350/mo) so the all-in total (~$435/mo) is visible. Add starter vs procurement-grade row.
- ha-hot-hot/aws/README.md, unlimited-scale/aws/README.md: link to COST_SHAPES.md for the cross-tier comparison.
- README.md (root): add COST_SHAPES.md to the Documentation index.
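The meter arithmetic quoted for single-vm can be reproduced in a few lines. This is a rough Python sketch: the $0.24/vCPU-hr rate and the ~$84/mo infra figure come from the commit messages, while the 730-hour month is an assumption about how the monthly figure was derived.

```python
HOURS_PER_MONTH = 730  # assumption: average hours per month used for estimates


def monthly_meter(vcpus: int, rate_per_vcpu_hour: float = 0.24) -> float:
    """Marketplace meter cost per month for a given total vCPU count."""
    return vcpus * rate_per_vcpu_hour * HOURS_PER_MONTH


def all_in(infra_per_month: float, vcpus: int) -> float:
    """Infrastructure cost plus the marketplace meter."""
    return infra_per_month + monthly_meter(vcpus)


print(round(monthly_meter(2)))  # 350, the single-vm meter line
print(round(all_in(84, 2)))     # 434, quoted as ~$435/mo all-in
```

Because the meter depends only on vCPU count, the same function gives the meter line for any tier once the instance count and size are known.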
…e fail-loud, repo CI
Three things in one commit so the three-deployment-types x two-products
x two-clouds matrix is finally symmetric and the AMI install path
regresses loudly instead of silently.
1. Azure HA parity (modules/ha-hot-hot/azure/):
- Provision Azure Cache for Redis Multi-AZ by default. Same shape
as the AWS ElastiCache addition: Standard/Premium SKUs only (Basic
is single-node and breaks HA - validated). enable_managed_redis /
redis_sku_name / redis_capacity / redis_endpoint_override knobs
mirror the AWS module. VM custom_data now carries redis_host /
redis_port / redis_tls so the marketplace image picks it up on
first boot.
- Add azurerm_virtual_machine_run_command.post_patch_verify on each
VM, mirroring the AWS aws_ssm_document.post_patch_verify so
customers running the asm-azure-ha / sat-azure-ha modules have
the same five-probe verifier surface as their AWS peers.
- Outputs: redis_endpoint, redis_mode, post_patch_run_command_name.
- README cost table now includes Redis ($55/mo) and the per-vCPU
marketplace meter ($700/mo for 4 vCPU); all-in $1,285/mo at
procurement-grade sizing. Links to the canonical AWS shapes in
COST_SHAPES.md.
2. AWS fail-loud guard (modules/ha-hot-hot/aws/main.tf and
modules/unlimited-scale/aws/main.tf): both pre_patch_backup SSM
docs now exit 1 with an explicit "rebuild the AMI from main"
message when /opt/hailbytes/bin/ha-pre-patch-backup.sh is missing,
instead of WARNing and proceeding. Mirrors the post-patch behaviour
already in place. With provision.sh in both repos now guaranteeing
the install path, a missing script means a stale AMI - operators
should learn that loudly. Azure pre/post follow the same pattern.
3. .github/workflows/ci.yml (new): terraform fmt -check, terraform
validate per module (matrix across all 22 module dirs including
network/aws and network/azure), tflint --recursive (uses the
existing .tflint.hcl with the aws + azurerm plugins), plus a
cost-shapes-sync check that fails the PR if COST_SHAPES.md loses
the canonical markers (single-vm/aws, ha-hot-hot/aws,
unlimited-scale/aws, $0.24/vCPU). Empty .github/workflows/ until
now; this is the cheap gate that catches malformed HCL before a
customer's terraform apply does.
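The cost-shapes-sync check described in item 3 amounts to a substring test against a fixed marker list. A minimal Python sketch follows; the marker strings are the ones named in the commit message, but the function names and the choice of Python (rather than whatever the workflow actually runs) are assumptions.

```python
import sys
from pathlib import Path

# Canonical markers named in the commit message.
REQUIRED_MARKERS = (
    "single-vm/aws",
    "ha-hot-hot/aws",
    "unlimited-scale/aws",
    "$0.24/vCPU",
)


def missing_markers(text: str, markers=REQUIRED_MARKERS) -> list:
    """Return the markers that no longer appear anywhere in the document."""
    return [m for m in markers if m not in text]


def main(path: str = "COST_SHAPES.md") -> int:
    """Return a process exit code: 1 if any marker is missing, else 0."""
    missing = missing_markers(Path(path).read_text(encoding="utf-8"))
    for marker in missing:
        print(f"{path} is missing canonical marker: {marker}", file=sys.stderr)
    return 1 if missing else 0
```

A CI step could call `sys.exit(main())` so that a dropped marker fails the PR.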
…le + COST_SHAPES Azure) and expanded CI to guard it
Three deliberately-deferred callouts from the previous commit close out
here, plus an expanded CI suite to make sure the next round of drift
gets caught at PR time rather than at customer-apply time.
## Wrapper variable forwarding (12 wrappers)
The earlier commit added 7-8 Redis variables to ha-hot-hot/{aws,azure}
and unlimited-scale/{aws,azure} but the per-product wrappers
(sat-aws-ha, asm-aws-ha, sat-aws-autoscale, asm-aws-autoscale, and the
four Azure equivalents) only declared 36/44 of the core surface — so
customers using the wrappers couldn't override Redis sizing or, on
Azure, the new enable_post_patch_run_command. Fixed by forwarding the
new variables through all 8 affected wrappers. Single-VM wrappers
already had clean parity.
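The parity requirement above (every core variable also declared in the wrapper) lends itself to a mechanical check. The sketch below is one hypothetical way to do it in Python with a regular expression over `variable "name"` declarations; it does not parse HCL properly, and the function names are illustrative.

```python
import re

# Matches the opening of a Terraform variable block, e.g.: variable "redis_node_type" {
VARIABLE_DECL = re.compile(r'^\s*variable\s+"([^"]+)"', re.MULTILINE)


def declared_variables(hcl_text: str) -> set:
    """Collect the variable names declared in a variables.tf file."""
    return set(VARIABLE_DECL.findall(hcl_text))


def unforwarded(core_hcl: str, wrapper_hcl: str,
                hidden=frozenset({"product"})) -> list:
    """Core variables the wrapper does not declare, minus intentionally hidden ones."""
    gap = declared_variables(core_hcl) - declared_variables(wrapper_hcl) - hidden
    return sorted(gap)
```

A non-empty result for any wrapper would indicate a knob that customers of that wrapper cannot set.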
## modules/unlimited-scale/azure parity
Mirrors the ha-hot-hot/azure additions from the prior commit:
- azurerm_redis_cache.main provisioned by default; Standard / Premium
SKUs only (Basic rejected via validation). enable_managed_redis /
redis_sku_name / redis_capacity / redis_endpoint_override knobs
match the ha-hot-hot Azure shape. VMSS custom_data now carries
redis_host / redis_port / redis_tls.
- azurerm_virtual_machine_scale_set_extension.post_patch_verify
baked alongside the existing pre_patch_backup extension. Same
five-probe verifier as the AWS post_patch_verify SSM doc; fails
loud with an explicit rebuild-the-AMI message if the script is
missing.
- README cost table includes Redis ($55/mo) and per-vCPU meter line;
all-in $2,530/mo at 3-instance steady state, $5,150/mo at 10
instances. Links to COST_SHAPES.md for the cross-cloud comparison.
- Both pre-patch documents (Azure HA + Azure autoscale) now exit 1
instead of WARN-ing when the script is missing. Parity with the
AWS fail-loud change from the prior commit.
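The fail-loud behaviour can be illustrated with a short Python stand-in. The real documents are SSM / run-command scripts, so this is only a sketch of the control flow: the script path and the "rebuild the AMI from main" wording come from the commit messages, and the function name is hypothetical.

```python
import os
import subprocess
import sys


def run_pre_patch_backup(
        script_path: str = "/opt/hailbytes/bin/ha-pre-patch-backup.sh") -> int:
    """Run the backup script, or fail with exit code 1 if it is not installed."""
    if not (os.path.isfile(script_path) and os.access(script_path, os.X_OK)):
        # Previously a warning; now a hard failure so a stale image is noticed.
        print(f"ERROR: {script_path} not found; rebuild the AMI from main",
              file=sys.stderr)
        return 1
    # Propagate the script's own exit status to the caller.
    return subprocess.call([script_path])
```

Returning a non-zero status is what lets the surrounding patch orchestration stop instead of proceeding without a backup.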
## COST_SHAPES.md Azure rows
The earlier "Azure pricing is currently tracked separately" placeholder
is replaced with a full three-shape Azure table aligned with the AWS
section: single ~$445/mo, HA hot-hot ~$1,285/mo (≈ 2.9× single, within
6% of the AWS HA shape), unlimited-scale ~$2,530/mo at min. Plus an
Azure Cache for Redis sizing table (Standard C1 / C2 / C3 / Premium
P1) mirroring the per-vCore meter table for AWS. A cross-cloud note
explains that AWS-vs-Azure parity is intentional — quote whichever
cloud the customer's finance team has commitments with.
## CI expansion (.github/workflows/ci.yml)
Five new gates, each scoped to catch real-world failure modes:
- **tfsec** — static security scan with SARIF upload to GitHub
code-scanning. HIGH/CRITICAL findings fail the build; MEDIUM/LOW
surface in code-scanning UI without breaking the gate.
- **examples-validate** — terraform validate every
modules/*/{aws,azure}/examples/basic/ subtree so customer copy-paste
starting points stay buildable. Matrix across all 8 example dirs.
- **marketplace-id-consistency** — asserts every modules/**/*.tf using
marketplace_product_codes carries the canonical AWS AMI codes
(d19hjbz3gakqdlonlf8twdmll for SAT, 1n57wg1f6735e30vj5fn420bp for
ASM) and the canonical Azure publisher / offer slugs. Catches the
drift the SAT/ASM MARKETPLACE.md cross-repo audit would catch later.
- **wrapper-forwarding** — diffs every wrapper's variables.tf against
its core module's variables.tf and fails on any core var (except
the intentionally-hidden 'product') missing from the wrapper. Would
have caught the Redis-vars-not-forwarded gap this commit fixes.
- **versions-tf** — every module dir with .tf files must have a
versions.tf declaring required_version and required_providers.
Prevents accidental provider-version drift between modules.
Plus: cost-shapes-sync gate now also checks the Azure tier markers
and the Standard C1 Redis-sizing marker so a partial COST_SHAPES.md
edit that drops the Azure section fails fast.
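Of the gates listed above, versions-tf is simple enough to sketch directly. The Python below is an assumed implementation, not the workflow's actual step: it walks a tree, finds every directory containing `.tf` files, and reports those whose `versions.tf` is absent or lacks either keyword. A real gate would likely also skip `.terraform/` cache directories.

```python
from pathlib import Path


def modules_missing_versions(root: str) -> list:
    """Return module dirs that hold .tf files but lack a complete versions.tf."""
    offenders = []
    # Every directory that directly contains at least one .tf file.
    module_dirs = sorted({p.parent for p in Path(root).rglob("*.tf")})
    for module_dir in module_dirs:
        versions = module_dir / "versions.tf"
        text = versions.read_text(encoding="utf-8") if versions.is_file() else ""
        if "required_version" not in text or "required_providers" not in text:
            offenders.append(str(module_dir.relative_to(root)))
    return offenders
```

An empty list means every module directory pins both the Terraform version and its providers.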
Two pre-merge polish items.
1. CHANGELOG.md: detailed [Unreleased] entry covering the
Redis-by-default fix, fail-loud SSM behavior, post-patch verifier
documents, COST_SHAPES.md, wrapper variable forwarding, and the
five new CI gates. Includes a "Migration notes (existing customers)"
subsection explaining the expected plan diff on first apply after
the upgrade — ElastiCache / Azure Cache for Redis additions,
VM replace-on-change because user_data carries Redis endpoint
wiring, the recommended marketplace-AMI rebuild ordering. Names
"redis_mode = disabled" as the loud signal in terraform output
when a deployment is misconfigured.
2. CONTRIBUTING.md (new): documents the nine CI gates a PR will
hit, the wrapper-forwarding contract that the new CI check
enforces, the cross-repo marketplace-id verification step the
marketplace-id-consistency check intentionally defers to release
time (and references hailbytes-{sat,asm}/MARKETPLACE.md as the
upstream sources of truth), procedures for adding a new tier or
knob, and the "Migration notes" expectation for any PR producing
a non-empty plan diff for existing customers. The
marketplace-id-consistency CI check in ci.yml references this
file as the documentation home — previously it didn't exist.
Addresses the ~67 Checkov findings that were genuine security wins with no cost or customer-choice impact. The remaining ~115 findings are addressed in two follow-on commits: production-hardening variables (tier 2) and documented suppressions (tier 3).
Substantive fixes:
- **CKV_AWS_23** (10 hits): security group rule descriptions now populated across every ingress / egress rule in ha-hot-hot/aws and unlimited-scale/aws. terraform fmt also cleaned alignment drift in every wrapper (the pre-existing terraform-validate workflow's fmt -check was failing on asm-aws-single and friends).
- **CKV_AWS_300** (8 hits): S3 lifecycle configs now carry an abort_incomplete_multipart_upload rule (7-day retention) on the backup bucket in single-vm/aws, ha-hot-hot/aws, unlimited-scale/aws and on the alb_logs bucket in unlimited-scale/aws.
- **CKV_AWS_91** (2 hits): ALB access logging on ha-hot-hot/aws now available as an opt-in feature (var.enable_alb_access_logging, default false). When enabled, provisions an alb_logs S3 bucket with versioning + public-access-block + KMS-when-CMK-enabled + lifecycle + a 365-day retention default. Symmetric with the unlimited-scale module which already had access logging.
- **CKV_AWS_150** (4 hits): ALB deletion protection now configurable via var.enable_alb_deletion_protection on ha-hot-hot/aws and unlimited-scale/aws. Default true; dev/test deployments can override to allow `terraform destroy` without manual cleanup.
- **CKV_AZURE_41** (4 hits): Key Vault DB-password secrets now carry an expiration_date set via timeadd(timestamp(), var.db_secret_expiration_hours). lifecycle.ignore_changes covers expiration_date so reruns don't show drift. Default 8760h = one year.
- **CKV_AZURE_114** (4 hits): Same KV secrets now declare content_type = "application/x-postgresql-password" so rotation tooling can identify their semantics.
- **CKV_AZURE_97** (2 hits): VMSS in unlimited-scale/azure now sets encryption_at_host_enabled = true (hypervisor-level encryption on top of platform-managed disk encryption; no additional cost).
Plus wrapper variable forwarding for the new knobs:
- enable_alb_deletion_protection / enable_alb_access_logging / alb_access_log_retention_days through sat-aws-ha, asm-aws-ha, sat-aws-autoscale, asm-aws-autoscale.
- db_secret_expiration_hours through all four Azure HA/autoscale wrappers.
All 12 wrappers pass the wrapper-forwarding CI check (every core variable except `product` is now declared in every wrapper).
terraform fmt -recursive applied to the whole tree as part of this commit; the pre-existing terraform-validate workflow runs `fmt -check -recursive` per module so this fixes the validate-on-wrapper failures the PR was hitting (asm-aws-single, sat-aws-autoscale, etc).
Addresses the ~75 Checkov findings that are real production-grade hardening but carry a cost or complexity tradeoff that doesn't belong as a default. Every knob ships off so PoC deployments stay cheap; production deployments turn them on with a single line and a clear view of the cost they're agreeing to.
New variables (defaults preserve existing behaviour):
- **`rds_enhanced_monitoring_interval`** (default 0): CKV_AWS_118. Set to 60 in production. When non-zero, the module also provisions an IAM role with the AWS-managed `AmazonRDSEnhancedMonitoringRole` policy. Adds ~$15/mo per monitored instance via CloudWatch ingestion.
- **`rds_enabled_cloudwatch_log_types`** (default `[]`): CKV_AWS_129. Production should set to `["postgresql", "upgrade"]` for the audit trail and major-version-upgrade safety. Empty list keeps PoC CloudWatch bills clean.
- **`rds_iam_authentication_enabled`** (default false): CKV_AWS_161. Real value once the app side wires up IAM token minting; off by default because today's connections use the Secrets Manager password.
- **`rds_performance_insights_enabled`** (default false): CKV_AWS_354. When true and `enable_customer_managed_key` is also true, PI is KMS-encrypted. `rds_performance_insights_retention_days` lets customers pick 7 (free tier) or 731 (paid long-term).
- **`postgres_geo_redundant_backup_enabled`** (default false): CKV_AZURE_136. Replaces the previously-hardcoded values (`false` in ha-hot-hot/azure, `true` in unlimited-scale/azure) with a customer-driven knob.
Forwarded through every relevant wrapper (sat-aws-ha, asm-aws-ha, sat-aws-autoscale, asm-aws-autoscale + Azure equivalents). Wrapper-forwarding CI check passes for all 12 wrappers.
What's NOT here (deliberate):
- S3 access logging, CWL retention >= 1 year: these turn into module-side cost (a second log bucket + longer CWL ingestion) and the same outcomes are reachable via the customer's existing log pipeline. Documented as suppressions in the tier-3 follow-up rather than re-implemented in module code.
- Secrets Manager rotation Lambda (CKV2_AWS_57): needs a full rotation function with DB user management; substantial scope. Customer-managed rotation policy is the better abstraction; documented in tier-3.
- Azure disk encryption sets, private endpoints: require subnet / vnet plumbing that's customer-owned. Suppressed with rationale in tier-3.
…kov green)
Final round of the three-tier triage. The `.checkov.yaml` config
file at the repo root carries every suppression with a one-line
category code and reason — anyone reviewing the security posture
sees WHY a check is suppressed without having to dig through commit
history.
Five categories, every suppression labelled with one:
(A) Not applicable — customer-owned resource we don't manage
(B) By design — wrapped by an existing variable or runbook section
(C) False positive — Checkov can't trace the check through
`count`-conditional or separate-resource patterns we use
(D) Customer governance — cost or policy tradeoff that doesn't
belong as a module default
(E) Opt-in variable — default off keeps the starter shape cheap;
the tier-2 production-hardening commit wired the knobs
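The rule that every suppression carries one of the five category codes could itself be linted. The Python sketch below assumes a layout in which each `skip-check` entry has a trailing comment of the form `# (A) reason`; that layout is a guess for illustration, since the commit message does not show the file, and the sample reasons used to exercise it would be invented.

```python
import re

# Assumed entry layout:   - CKV_AWS_158  # (C) reason text
ENTRY = re.compile(r'^\s*-\s*(CKV2?_[A-Z]+_\d+)\s*(?:#\s*(.*))?$')
# A valid comment starts with a category code (A)-(E) followed by a reason.
CATEGORY = re.compile(r'^\(([A-E])\)\s+\S')


def unlabelled_suppressions(config_text: str) -> list:
    """Return check IDs whose entry lacks a '(A)'..'(E)' category and a reason."""
    bad = []
    for line in config_text.splitlines():
        match = ENTRY.match(line)
        if not match:
            continue  # not a skip-check entry (key, blank line, free comment)
        check_id, comment = match.group(1), match.group(2) or ""
        if not CATEGORY.match(comment):
            bad.append(check_id)
    return bad
```

Running such a lint in CI would keep new suppressions from landing without a stated rationale.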
Plus three small residual fixes uncovered during validation:
- SG rule descriptions on unlimited-scale/aws (5 rules) —
ha-hot-hot/aws had them via terraform fmt's earlier pass;
unlimited-scale/aws is structurally similar but the rules were
declared without descriptions. CKV_AWS_23.
- CloudWatch flow_logs group in unlimited-scale/aws now sets
kms_key_id when var.enable_customer_managed_key is true. The
earlier CWL fix targeted the RDS log groups; the VPC flow-log
group was missed. CKV_AWS_158.
- Azure backup storage accounts now explicitly set
public_network_access_enabled = false +
allow_nested_items_to_be_public = false. CKV_AZURE_59.
(Previously these were set further down in the resource as `true`
— a copy-paste from the documented Azure example; flipped to
false now.)
Also updates .github/workflows/checkov.yml to use the config file
(`config_file: .checkov.yaml`) instead of an inline skip-check list,
so future suppressions are reviewed in the same place as their
rationale.
Result against the current modules/ tree:
Passed checks: 1023, Failed checks: 0, Skipped checks: 0
Summary
This PR adds mandatory shared Redis session store provisioning to all HA and horizontally-scaled deployment modules. Previously, `ha-hot-hot` and `unlimited-scale` modules deployed multiple application instances without a shared session backend, which silently broke cross-instance login and worker-lock coordination in production.
Key Changes
- AWS modules: Added ElastiCache Multi-AZ replication group provisioning to `ha-hot-hot/aws` and `unlimited-scale/aws`
  - `enable_managed_redis`, `redis_node_type`, `redis_engine_version`, `redis_snapshot_retention_days`, `redis_endpoint_override`, `redis_endpoint_override_port`
  - `cache.t4g.small` (procurement-friendly)
  - `enable_managed_redis = false` + `redis_endpoint_override` for customer-managed Redis
- Azure modules: Added Azure Cache for Redis provisioning to `ha-hot-hot/azure` and `unlimited-scale/azure`
  - `enable_managed_redis`, `redis_sku_name`, `redis_family`, `redis_capacity`, `redis_endpoint_override`, `redis_endpoint_override_port`, `redis_endpoint_override_tls`
- Product wrapper modules: Updated all 12 per-product wrappers (`sat-aws-ha`, `asm-aws-ha`, etc.) to forward Redis variables through to core modules
- Documentation:
  - `COST_SHAPES.md`: canonical procurement-grade pricing reference for all three deployment shapes (single, HA, unlimited-scale)
  - `CONTRIBUTING.md`: contributor guidelines, CI gate descriptions, and cross-repo verification procedures
  - `CHANGELOG.md` with breaking change notice
- CI/CD: Added comprehensive `.github/workflows/ci.yml` with 9 gates:
  - `terraform fmt`, `validate`, `tflint`, `tfsec` (with SARIF upload)
- Post-patch verification: Added `post_patch_ssm_document_name` / `post_patch_run_command_name` outputs to AWS and Azure modules respectively
Implementation Details
- `enable_managed_redis = false` and providing `redis_endpoint_override`
- `custom_data` / `user_data` metadata
https://claude.ai/code/session_01AtsowyQYV4CdLVmhJBQguF