
Add shared Redis session store to HA and autoscale modules #3

Merged
dmchaledev merged 9 commits into main from claude/audit-marketplace-consistency-RnL5W
May 19, 2026

@dmchaledev
Contributor

Summary

This PR adds mandatory shared Redis session store provisioning to all HA and horizontally-scaled deployment modules. Previously, ha-hot-hot and unlimited-scale modules deployed multiple application instances without a shared session backend, which silently broke cross-instance login and worker-lock coordination in production.

Key Changes

  • AWS modules: Added ElastiCache Multi-AZ replication group provisioning to ha-hot-hot/aws and unlimited-scale/aws

    • New variables: enable_managed_redis, redis_node_type, redis_engine_version, redis_snapshot_retention_days, redis_endpoint_override, redis_endpoint_override_port
    • Default node type: cache.t4g.small (procurement-friendly)
    • Configurable via enable_managed_redis = false + redis_endpoint_override for customer-managed Redis
  • Azure modules: Added Azure Cache for Redis provisioning to ha-hot-hot/azure and unlimited-scale/azure

    • New variables: enable_managed_redis, redis_sku_name, redis_family, redis_capacity, redis_endpoint_override, redis_endpoint_override_port, redis_endpoint_override_tls
    • Default SKU: Standard (primary/replica pair; zone redundancy available in Premium)
    • Validation rejects the Basic SKU (single-node, breaks HA)
  • Product wrapper modules: Updated all 12 per-product wrappers (sat-aws-ha, asm-aws-ha, etc.) to forward Redis variables through to core modules

  • Documentation:

    • Added COST_SHAPES.md: canonical procurement-grade pricing reference for all three deployment shapes (single, HA, unlimited-scale)
    • Added CONTRIBUTING.md: contributor guidelines, CI gate descriptions, and cross-repo verification procedures
    • Updated module READMEs with Redis architecture diagrams and revised cost tables
    • Updated CHANGELOG.md with breaking change notice
  • CI/CD: Added comprehensive .github/workflows/ci.yml with 9 gates:

    • terraform fmt, validate, tflint, tfsec (with SARIF upload)
    • Example validation, marketplace ID consistency, wrapper variable forwarding, versions.tf checks
    • Cost shape marker sync validation
  • Post-patch verification: Added post_patch_ssm_document_name / post_patch_run_command_name outputs to AWS and Azure modules respectively

Implementation Details

  • Redis is enabled by default in all HA/autoscale modules; customers can opt out by setting enable_managed_redis = false and providing redis_endpoint_override
  • Security groups restrict Redis access to application instance security groups only
  • Azure modules use port 6380 (TLS) for managed Redis; AWS uses 6379
  • Marketplace image receives Redis endpoint via custom_data / user_data metadata
  • All wrapper modules maintain variable parity with core modules (validated by CI)
  • Cost tables distinguish between "starter defaults" (smaller, cheaper PoC) and "procurement-grade" (production-recommended) sizing
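
As an illustration of the opt-out path described in the first bullet, a wrapper call might look roughly like the sketch below. The module label, the source path, and the endpoint hostname are placeholders assumed for this example, the port variable is assumed to be a number, and all other required inputs are omitted.

```hcl
# Hypothetical wrapper call; label, source path, and hostname are
# placeholders, not values taken from the PR.
module "sat_ha" {
  source = "./modules/sat-aws-ha" # assumed path

  # Opt out of the module-managed ElastiCache replication group...
  enable_managed_redis = false

  # ...and point the application at a customer-managed Redis instead.
  redis_endpoint_override      = "redis.internal.example.com"
  redis_endpoint_override_port = 6379
}
```

Setting `enable_managed_redis = false` without an override is the misconfiguration that the `redis_mode` output is meant to expose.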

https://claude.ai/code/session_01AtsowyQYV4CdLVmhJBQguF

claude added 6 commits May 18, 2026 17:54
…h SSM doc, align cost table

The SAT/ASM runbook and customer-facing pricing have always described
ElastiCache Multi-AZ as part of the HA topology, and the application
refuses to share session state across instances without it. The Terraform
modules were silently shipping HA without Redis, so a customer applying
the module got two app instances that couldn't share logins or worker
locks. Three fixes:

1. ElastiCache replication group provisioned by default in both ha-hot-hot
   and unlimited-scale, with var.enable_managed_redis / var.redis_node_type
   / var.redis_endpoint_override knobs for customers who bring their own
   Redis (MemoryDB, existing ElastiCache, self-managed Sentinel). Module
   wires the endpoint into user_data so the AMI's first boot picks it up.
   New redis_endpoint and redis_mode outputs surface the wiring;
   redis_mode reports "disabled" when neither managed Redis nor an
   override is configured.

2. Post-patch SSM document. The pre-patch document already invokes
   /opt/hailbytes/bin/ha-pre-patch-backup.sh; this adds the companion
   post_patch_verify document that runs the on-VM five-probe verifier.
   Lets the autoscaling instance_refresh fail fast on a schema-version
   regression, encryption-key fingerprint mismatch, or smoke-test failure.

3. ha-hot-hot README cost table now shows two reference shapes side-by-
   side: the Terraform "starter" defaults (t3.large + db.t3.medium) and
   the procurement-grade sizing the SAT runbook quotes (m6i.large +
   db.m6g.large + ElastiCache cache.t4g.small). Numbers now agree with
   hailbytes-sat/docs/AWS_HA_DEPLOYMENT.md and the customer-facing
   pricing the account team uses. Architecture diagram updated to include
   Redis. Prerequisites bumped to include ElastiCache permissions.
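
The Redis wiring in item 1 could be sketched as follows. This is a minimal illustration, not the module's actual code: the resource name, the replication group ID, the subnet-group and security-group inputs, and the "managed" / "override" mode labels are all assumptions (only "disabled" is named in the commit message).

```hcl
variable "enable_managed_redis" {
  type    = bool
  default = true
}

variable "redis_node_type" {
  type    = string
  default = "cache.t4g.small"
}

variable "redis_endpoint_override" {
  type    = string
  default = ""
}

# Assumed inputs: networking is created elsewhere in the module.
variable "redis_subnet_group_name" { type = string }
variable "redis_security_group_ids" { type = list(string) }

resource "aws_elasticache_replication_group" "redis" {
  count = var.enable_managed_redis ? 1 : 0

  replication_group_id       = "hailbytes-sessions" # illustrative name
  description                = "Shared session store for HA app instances"
  engine                     = "redis"
  node_type                  = var.redis_node_type
  num_cache_clusters         = 2    # one primary plus one replica
  automatic_failover_enabled = true # required when multi_az_enabled is true
  multi_az_enabled           = true
  port                       = 6379
  subnet_group_name          = var.redis_subnet_group_name
  security_group_ids         = var.redis_security_group_ids
}

locals {
  # Labels other than "disabled" are assumptions made for this sketch.
  redis_mode = var.enable_managed_redis ? "managed" : (var.redis_endpoint_override != "" ? "override" : "disabled")

  redis_endpoint = var.enable_managed_redis ? aws_elasticache_replication_group.redis[0].primary_endpoint_address : var.redis_endpoint_override
}

output "redis_mode" { value = local.redis_mode }
output "redis_endpoint" { value = local.redis_endpoint }
```

Gating the resource on `count` keeps the bring-your-own-Redis path free of ElastiCache resources, and the two outputs give operators one place to confirm which path a deployment took.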
…le tier

Match the cost-shape framing from ha-hot-hot: show infra, marketplace
meter, and all-in totals so procurement reviewers can compare single-vm
(~$84/mo + meter), ha-hot-hot (~$435-515/mo + meter), and unlimited-scale
(~$1,200/mo + meter for 3 instances) on the same axis. Add ElastiCache
to the table now that the module actually provisions it. Note the
cache.t4g.small -> cache.m6g.large bump around 5+ instances.

Procurement reviewers were getting two different price tables when they
cross-referenced the SAT runbook (m6i.large / db.m6g.large /
$1,150-1,215/mo all-in) against the Terraform module READMEs (t3.large
defaults, ~$375/mo infra, meter buried as "separate"). The drift was
real, not just doc-style: the modules ship cheaper starter defaults so
PoCs don't burn $1k/mo, but the runbook quotes the recommended
procurement-grade sizing. Both numbers are correct - they're describing
different shapes.

This commit makes that explicit:

- COST_SHAPES.md (new, repo root): three AWS shapes side-by-side
  (single / HA / unlimited-scale), each with infra + per-vCore meter +
  all-in totals at procurement-grade sizing. Includes the
  self-managed-DB variant of HA. Documents the per-vCore meter as a first-class
  cost line in its own table (the largest single line in HA and
  unlimited-scale, scales with instance count not topology). Points at
  hailbytes-sat/docs/AWS_HA_DEPLOYMENT.md as the canonical source so
  there is one place to update when prices change. Includes the
  Asiera/HEAnet EU pricing note.

- single-vm/aws/README.md: quantify the marketplace meter line
  ($0.24/vCPU-hr x 2 vCPU = ~$350/mo) so the all-in total (~$435/mo)
  is visible. Add starter vs procurement-grade row.

- ha-hot-hot/aws/README.md, unlimited-scale/aws/README.md: link to
  COST_SHAPES.md for the cross-tier comparison.

- README.md (root): add COST_SHAPES.md to the Documentation index.
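
The single-VM meter figure can be reproduced with a short worked calculation. The 730 hours per month is an assumption (it is the value that matches the quoted ~$350/mo), and the ~$84/mo infrastructure figure is the single-vm number quoted in the earlier commit message.

```math
\begin{aligned}
\text{meter} &= 0.24\ \tfrac{\text{USD}}{\text{vCPU}\cdot\text{hr}} \times 2\ \text{vCPU} \times 730\ \tfrac{\text{hr}}{\text{mo}} = 350.40\ \text{USD/mo} \\
\text{all-in} &\approx 84 + 350 = 434 \approx 435\ \text{USD/mo}
\end{aligned}
```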
…e fail-loud, repo CI

Three things in one commit so the three-deployment-types x two-products
x two-clouds matrix is finally symmetric and the AMI install path
regresses loudly instead of silently.

1. Azure HA parity (modules/ha-hot-hot/azure/):
   - Provision Azure Cache for Redis Multi-AZ by default. Same shape
     as the AWS ElastiCache addition: Standard/Premium SKUs only (Basic
     is single-node and breaks HA - validated). enable_managed_redis /
     redis_sku_name / redis_capacity / redis_endpoint_override knobs
     mirror the AWS module. VM custom_data now carries redis_host /
     redis_port / redis_tls so the marketplace image picks it up on
     first boot.
   - Add azurerm_virtual_machine_run_command.post_patch_verify on each
     VM, mirroring the AWS aws_ssm_document.post_patch_verify so
     customers running the asm-azure-ha / sat-azure-ha modules have
     the same five-probe verifier surface as their AWS peers.
   - Outputs: redis_endpoint, redis_mode, post_patch_run_command_name.
   - README cost table now includes Redis ($55/mo) and the per-vCPU
     marketplace meter ($700/mo for 4 vCPU); all-in $1,285/mo at
     procurement-grade sizing. Links to the canonical AWS shapes in
     COST_SHAPES.md.

2. AWS fail-loud guard (modules/ha-hot-hot/aws/main.tf and
   modules/unlimited-scale/aws/main.tf): both pre_patch_backup SSM
   docs now exit 1 with an explicit "rebuild the AMI from main"
   message when /opt/hailbytes/bin/ha-pre-patch-backup.sh is missing,
   instead of WARNing and proceeding. Mirrors the post-patch behaviour
   already in place. With provision.sh in both repos now guaranteeing
   the install path, a missing script means a stale AMI - operators
   should learn that loudly. Azure pre/post follow the same pattern.

3. .github/workflows/ci.yml (new): terraform fmt -check, terraform
   validate per module (matrix across all 22 module dirs including
   network/aws and network/azure), tflint --recursive (uses the
   existing .tflint.hcl with the aws + azurerm plugins), plus a
   cost-shapes-sync check that fails the PR if COST_SHAPES.md loses
   the canonical markers (single-vm/aws, ha-hot-hot/aws,
   unlimited-scale/aws, $0.24/vCPU). Empty .github/workflows/ until
   now; this is the cheap gate that catches malformed HCL before a
   customer's terraform apply does.
…le + COST_SHAPES Azure) and expanded CI to guard it

Three deliberately-deferred callouts from the previous commit close out
here, plus an expanded CI suite to make sure the next round of drift
gets caught at PR time rather than at customer-apply time.

## Wrapper variable forwarding (12 wrappers)

The earlier commit added 7-8 Redis variables to ha-hot-hot/{aws,azure}
and unlimited-scale/{aws,azure} but the per-product wrappers
(sat-aws-ha, asm-aws-ha, sat-aws-autoscale, asm-aws-autoscale, and the
four Azure equivalents) only declared 36/44 of the core surface — so
customers using the wrappers couldn't override Redis sizing or, on
Azure, the new enable_post_patch_run_command. Fixed by forwarding the
new variables through all 8 affected wrappers. Single-VM wrappers
already had clean parity.

## modules/unlimited-scale/azure parity

Mirrors the ha-hot-hot/azure additions from the prior commit:

- azurerm_redis_cache.main provisioned by default; Standard / Premium
  SKUs only (Basic rejected via validation). enable_managed_redis /
  redis_sku_name / redis_capacity / redis_endpoint_override knobs
  match the ha-hot-hot Azure shape. VMSS custom_data now carries
  redis_host / redis_port / redis_tls.
- azurerm_virtual_machine_scale_set_extension.post_patch_verify
  baked alongside the existing pre_patch_backup extension. Same
  five-probe verifier as the AWS post_patch_verify SSM doc; fails
  loud with an explicit rebuild-the-AMI message if the script is
  missing.
- README cost table includes Redis ($55/mo) and per-vCPU meter line;
  all-in $2,530/mo at 3-instance steady state, $5,150/mo at 10
  instances. Links to COST_SHAPES.md for the cross-cloud comparison.
- Both pre-patch documents (Azure HA + Azure autoscale) now exit 1
  instead of WARN-ing when the script is missing. Parity with the
  AWS fail-loud change from the prior commit.
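
The Basic-SKU rejection mentioned above could be expressed with a variable validation block along these lines. This is a sketch under assumptions: the description and error wording are illustrative, and the real module may phrase or structure the check differently.

```hcl
variable "redis_sku_name" {
  type        = string
  default     = "Standard"
  description = "Azure Cache for Redis SKU. Basic is single-node and is rejected."

  validation {
    # Only the SKUs that provide a replica are accepted.
    condition     = contains(["Standard", "Premium"], var.redis_sku_name)
    error_message = "The redis_sku_name value must be Standard or Premium. Basic is single-node and breaks HA."
  }
}
```

Because validation runs at plan time, a Basic SKU fails before any Azure resource is created.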

## COST_SHAPES.md Azure rows

The earlier "Azure pricing is currently tracked separately" placeholder
is replaced with a full three-shape Azure table aligned with the AWS
section: single ~$445/mo, HA hot-hot ~$1,285/mo (≈ 2.9× single, within
6% of the AWS HA shape), unlimited-scale ~$2,530/mo at min. Plus an
Azure Cache for Redis sizing table (Standard C1 / C2 / C3 / Premium
P1) mirroring the per-vCore meter table for AWS. A cross-cloud note
explains that AWS-vs-Azure parity is intentional — quote whichever
cloud the customer's finance team has commitments with.

## CI expansion (.github/workflows/ci.yml)

Five new gates, each scoped to catch real-world failure modes:

- **tfsec** — static security scan with SARIF upload to GitHub
  code-scanning. HIGH/CRITICAL findings fail the build; MEDIUM/LOW
  surface in code-scanning UI without breaking the gate.
- **examples-validate** — terraform validate every
  modules/*/{aws,azure}/examples/basic/ subtree so customer copy-paste
  starting points stay buildable. Matrix across all 8 example dirs.
- **marketplace-id-consistency** — asserts every modules/**/*.tf using
  marketplace_product_codes carries the canonical AWS AMI codes
  (d19hjbz3gakqdlonlf8twdmll for SAT, 1n57wg1f6735e30vj5fn420bp for
  ASM) and the canonical Azure publisher / offer slugs. Catches the
  drift the SAT/ASM MARKETPLACE.md cross-repo audit would catch later.
- **wrapper-forwarding** — diffs every wrapper's variables.tf against
  its core module's variables.tf and fails on any core var (except
  the intentionally-hidden 'product') missing from the wrapper. Would
  have caught the Redis-vars-not-forwarded gap this commit fixes.
- **versions-tf** — every module dir with .tf files must have a
  versions.tf declaring required_version and required_providers.
  Prevents accidental provider-version drift between modules.

Plus: cost-shapes-sync gate now also checks the Azure tier markers
and the Standard C1 Redis-sizing marker so a partial COST_SHAPES.md
edit that drops the Azure section fails fast.
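
For reference, a versions.tf that would satisfy the versions-tf gate described above might look like this minimal sketch. The version constraints are assumptions chosen for illustration, not the repository's actual pins.

```hcl
terraform {
  required_version = ">= 1.5.0" # assumed floor

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0" # assumed constraint
    }
  }
}
```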

Two pre-merge polish items.

1. CHANGELOG.md: detailed [Unreleased] entry covering the
   Redis-by-default fix, fail-loud SSM behavior, post-patch verifier
   documents, COST_SHAPES.md, wrapper variable forwarding, and the
   five new CI gates. Includes a "Migration notes (existing customers)"
   subsection explaining the expected plan diff on first apply after
   the upgrade — ElastiCache / Azure Cache for Redis additions,
   VM replace-on-change because user_data carries Redis endpoint
   wiring, the recommended marketplace-AMI rebuild ordering. Names
   "redis_mode = disabled" as the loud signal in terraform output
   when a deployment is misconfigured.

2. CONTRIBUTING.md (new): documents the nine CI gates a PR will
   hit, the wrapper-forwarding contract that the new CI check
   enforces, the cross-repo marketplace-id verification step the
   marketplace-id-consistency check intentionally defers to release
   time (and references hailbytes-{sat,asm}/MARKETPLACE.md as the
   upstream sources of truth), procedures for adding a new tier or
   knob, and the "Migration notes" expectation for any PR producing
   a non-empty plan diff for existing customers. The
   marketplace-id-consistency CI check in ci.yml references this
   file as the documentation home — previously it didn't exist.

Addresses the ~67 Checkov findings that were genuine security wins
with no cost or customer-choice impact. The remaining ~115 findings
are addressed in two follow-on commits: production-hardening
variables (tier 2) and documented suppressions (tier 3).

Substantive fixes:

- **CKV_AWS_23** (10 hits): security group rule descriptions now
  populated across every ingress / egress rule in ha-hot-hot/aws and
  unlimited-scale/aws. terraform fmt also cleaned alignment drift in
  every wrapper (the pre-existing terraform-validate workflow's
  fmt -check was failing on asm-aws-single and friends).
- **CKV_AWS_300** (8 hits): S3 lifecycle configs now carry an
  abort_incomplete_multipart_upload rule (7-day retention) on the
  backup bucket in single-vm/aws, ha-hot-hot/aws, unlimited-scale/aws
  and on the alb_logs bucket in unlimited-scale/aws.
- **CKV_AWS_91** (2 hits): ALB access logging on ha-hot-hot/aws now
  available as an opt-in feature (var.enable_alb_access_logging,
  default false). When enabled, provisions an alb_logs S3 bucket with
  versioning + public-access-block + KMS-when-CMK-enabled + lifecycle
  + a 365-day retention default. Symmetric with the unlimited-scale
  module which already had access logging.
- **CKV_AWS_150** (4 hits): ALB deletion protection now configurable
  via var.enable_alb_deletion_protection on ha-hot-hot/aws and
  unlimited-scale/aws. Default true; dev/test deployments can
  override to allow `terraform destroy` without manual cleanup.
- **CKV_AZURE_41** (4 hits): Key Vault DB-password secrets now carry
  an expiration_date set via timeadd(timestamp(), var.db_secret_expiration_hours).
  lifecycle.ignore_changes covers expiration_date so reruns don't
  show drift. Default 8760h = one year.
- **CKV_AZURE_114** (4 hits): Same KV secrets now declare
  content_type = "application/x-postgresql-password" so rotation
  tooling can identify their semantics.
- **CKV_AZURE_97** (2 hits): VMSS in unlimited-scale/azure now sets
  encryption_at_host_enabled = true (hypervisor-level encryption on
  top of platform-managed disk encryption; no additional cost).
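
The Key Vault changes (CKV_AZURE_41 and CKV_AZURE_114) could look roughly like the sketch below. The resource and secret names and the `key_vault_id` / `db_password` inputs are assumptions. Note that `timeadd()` takes a duration string, so this sketch appends an "h" suffix to the hour count; the commit message's shorthand omits that detail.

```hcl
variable "db_secret_expiration_hours" {
  type    = number
  default = 8760 # one year
}

variable "key_vault_id" { type = string } # assumed input

variable "db_password" { # assumed input
  type      = string
  sensitive = true
}

resource "azurerm_key_vault_secret" "db_password" {
  name         = "db-password" # illustrative name
  value        = var.db_password
  key_vault_id = var.key_vault_id
  content_type = "application/x-postgresql-password"

  # timeadd() expects a duration such as "8760h".
  expiration_date = timeadd(timestamp(), "${var.db_secret_expiration_hours}h")

  lifecycle {
    # timestamp() changes on every run; ignoring the attribute keeps
    # later plans from reporting drift on the expiry date.
    ignore_changes = [expiration_date]
  }
}
```

With `ignore_changes` in place the expiry is fixed at creation time, so a new expiry only takes effect when the secret resource is replaced.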

Plus wrapper variable forwarding for the new knobs:
- enable_alb_deletion_protection / enable_alb_access_logging /
  alb_access_log_retention_days through sat-aws-ha, asm-aws-ha,
  sat-aws-autoscale, asm-aws-autoscale.
- db_secret_expiration_hours through all four Azure HA/autoscale
  wrappers.

All 12 wrappers pass the wrapper-forwarding CI check (every core
variable except `product` is now declared in every wrapper).

terraform fmt -recursive applied to the whole tree as part of this
commit; the pre-existing terraform-validate workflow runs `fmt -check
-recursive` per module so this fixes the validate-on-wrapper failures
the PR was hitting (asm-aws-single, sat-aws-autoscale, etc).
claude added 2 commits May 19, 2026 10:43
Addresses the ~75 Checkov findings that are real production-grade
hardening but carry a cost or complexity tradeoff that doesn't
belong as a default. Every knob ships off so PoC deployments stay
cheap; production deployments turn them on with a single line and
are clear on the cost they're agreeing to.

New variables (defaults preserve existing behaviour):

- **`rds_enhanced_monitoring_interval`** (default 0): CKV_AWS_118.
  Set to 60 in production. When non-zero, the module also provisions
  an IAM role with the AWS-managed
  `AmazonRDSEnhancedMonitoringRole` policy. Adds ~$15/mo per
  monitored instance via CloudWatch ingestion.
- **`rds_enabled_cloudwatch_log_types`** (default `[]`):
  CKV_AWS_129. Production should set to `["postgresql", "upgrade"]`
  for the audit trail and major-version-upgrade safety. Empty list
  keeps PoC CloudWatch bills clean.
- **`rds_iam_authentication_enabled`** (default false):
  CKV_AWS_161. Real value once the app side wires up IAM token
  minting; off by default because today's connections use the
  Secrets Manager password.
- **`rds_performance_insights_enabled`** (default false): CKV_AWS_354.
  When true and `enable_customer_managed_key` is also true, PI is
  KMS-encrypted. `rds_performance_insights_retention_days` lets
  customers pick 7 (free tier) or 731 (paid long-term).
- **`postgres_geo_redundant_backup_enabled`** (default false):
  CKV_AZURE_136. Replaces the previously-hardcoded values
  (`false` in ha-hot-hot/azure, `true` in unlimited-scale/azure)
  with a customer-driven knob.
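
Put together, a production AWS deployment might enable the knobs above with a tfvars fragment like the following. The file name is hypothetical; the values are the ones recommended in the list. The Azure-only `postgres_geo_redundant_backup_enabled` variable would be set the same way in an Azure deployment.

```hcl
# production.tfvars (hypothetical file name)
rds_enhanced_monitoring_interval        = 60
rds_enabled_cloudwatch_log_types        = ["postgresql", "upgrade"]
rds_performance_insights_enabled        = true
rds_performance_insights_retention_days = 7 # free tier; 731 for paid long-term
```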

Forwarded through every relevant wrapper (sat-aws-ha, asm-aws-ha,
sat-aws-autoscale, asm-aws-autoscale + Azure equivalents).
Wrapper-forwarding CI check passes for all 12 wrappers.

What's NOT here (deliberate):
- S3 access logging, CWL retention >= 1 year — these turn into
  module-side cost (a second log bucket + longer CWL ingestion) and
  the same outcomes are reachable via the customer's existing log
  pipeline. Documented as suppressions in the tier-3 follow-up
  rather than re-implemented in module code.
- Secrets Manager rotation Lambda (CKV2_AWS_57) — needs a
  full rotation function with DB user management; substantial
  scope. Customer-managed rotation policy is the better
  abstraction; documented in tier-3.
- Azure disk encryption sets, private endpoints — require subnet /
  vnet plumbing that's customer-owned. Suppressed with rationale
  in tier-3.
…kov green)

Final round of the three-tier triage. The `.checkov.yaml` config
file at the repo root carries every suppression with a one-line
category code and reason — anyone reviewing the security posture
sees WHY a check is suppressed without having to dig through commit
history.

Five categories, every suppression labelled with one:
  (A) Not applicable — customer-owned resource we don't manage
  (B) By design — wrapped by an existing variable or runbook section
  (C) False positive — Checkov can't trace the check through
      `count`-conditional or separate-resource patterns we use
  (D) Customer governance — cost or policy tradeoff that doesn't
      belong as a module default
  (E) Opt-in variable — default off keeps the starter shape cheap;
      the tier-2 production-hardening commit wired the knobs

Plus three small residual fixes uncovered during validation:

- SG rule descriptions on unlimited-scale/aws (5 rules) —
  ha-hot-hot/aws had them via terraform fmt's earlier pass;
  unlimited-scale/aws is structurally similar but the rules were
  declared without descriptions. CKV_AWS_23.
- CloudWatch flow_logs group in unlimited-scale/aws now sets
  kms_key_id when var.enable_customer_managed_key is true. The
  earlier CWL fix targeted the RDS log groups; the VPC flow-log
  group was missed. CKV_AWS_158.
- Azure backup storage accounts now explicitly set
  public_network_access_enabled = false +
  allow_nested_items_to_be_public = false. CKV_AZURE_59.
  (Previously these were set further down in the resource as `true`
  — a copy-paste from the documented Azure example; flipped to
  false now.)

Also updates .github/workflows/checkov.yml to use the config file
(`config_file: .checkov.yaml`) instead of an inline skip-check list,
so future suppressions are reviewed in the same place as their
rationale.

Result against the current modules/ tree:
  Passed checks: 1023, Failed checks: 0, Skipped checks: 0
@dmchaledev dmchaledev merged commit bd3e89b into main May 19, 2026
73 of 79 checks passed
3 participants