Add shared Redis session store to HA and autoscale modules by dmchaledev · Pull Request #3 · HailBytes/hailbytes-terraform-modules

dmchaledev · 2026-05-19T10:13:16Z

Summary

This PR adds shared Redis session store provisioning, enabled by default, to all HA and horizontally-scaled deployment modules. Previously, the ha-hot-hot and unlimited-scale modules deployed multiple application instances without a shared session backend, which silently broke cross-instance login and worker-lock coordination in production.

Key Changes

AWS modules: Added ElastiCache Multi-AZ replication group provisioning to ha-hot-hot/aws and unlimited-scale/aws
- New variables: enable_managed_redis, redis_node_type, redis_engine_version, redis_snapshot_retention_days, redis_endpoint_override, redis_endpoint_override_port
- Default node type: cache.t4g.small (procurement-friendly)
- Configurable via enable_managed_redis = false + redis_endpoint_override for customer-managed Redis
Azure modules: Added Azure Cache for Redis provisioning to ha-hot-hot/azure and unlimited-scale/azure
- New variables: enable_managed_redis, redis_sku_name, redis_family, redis_capacity, redis_endpoint_override, redis_endpoint_override_port, redis_endpoint_override_tls
- Default SKU: Standard (primary/replica pair, zone-redundant in Premium)
- Variable validation rejects the Basic SKU (single-node, breaks HA)
Product wrapper modules: Updated all 12 per-product wrappers (sat-aws-ha, asm-aws-ha, etc.) to forward Redis variables through to core modules
Documentation:
- Added COST_SHAPES.md: canonical procurement-grade pricing reference for all three deployment shapes (single, HA, unlimited-scale)
- Added CONTRIBUTING.md: contributor guidelines, CI gate descriptions, and cross-repo verification procedures
- Updated module READMEs with Redis architecture diagrams and revised cost tables
- Updated CHANGELOG.md with breaking change notice
CI/CD: Added comprehensive .github/workflows/ci.yml with 9 gates:
- terraform fmt, validate, tflint, tfsec (with SARIF upload)
- Example validation, marketplace ID consistency, wrapper variable forwarding, versions.tf checks
- Cost shape marker sync validation
Post-patch verification: Added post_patch_ssm_document_name / post_patch_run_command_name outputs to AWS and Azure modules respectively

Implementation Details

Redis is enabled by default in all HA/autoscale modules; customers can opt out by setting enable_managed_redis = false and providing redis_endpoint_override
Security groups restrict Redis access to application instance security groups only
Azure modules use port 6380 (TLS) for managed Redis; AWS uses 6379
Marketplace image receives Redis endpoint via custom_data / user_data metadata
All wrapper modules maintain variable parity with core modules (validated by CI)
Cost tables distinguish between "starter defaults" (smaller, cheaper PoC) and "procurement-grade" (production-recommended) sizing

https://claude.ai/code/session_01AtsowyQYV4CdLVmhJBQguF

…h SSM doc, align cost table

The SAT/ASM runbook and customer-facing pricing have always described ElastiCache Multi-AZ as part of the HA topology, and the application refuses to share session state across instances without it. The Terraform modules were silently shipping HA without Redis, so a customer applying the module got two app instances that couldn't share logins or worker locks.

Three fixes:

1. ElastiCache replication group provisioned by default in both ha-hot-hot and unlimited-scale, with var.enable_managed_redis / var.redis_node_type / var.redis_endpoint_override knobs for customers who bring their own Redis (MemoryDB, existing ElastiCache, self-managed Sentinel). Module wires the endpoint into user_data so the AMI's first boot picks it up. New redis_endpoint and redis_mode outputs surface the wiring (and "disabled" when neither managed Redis nor an override is configured).
2. Post-patch SSM document. The pre-patch document already invokes /opt/hailbytes/bin/ha-pre-patch-backup.sh; this adds the companion post_patch_verify document that runs the on-VM five-probe verifier. Lets the autoscaling instance_refresh fail fast on a schema-version regression, encryption-key fingerprint mismatch, or smoke-test failure.
3. ha-hot-hot README cost table now shows two reference shapes side-by-side: the Terraform "starter" defaults (t3.large + db.t3.medium) and the procurement-grade sizing the SAT runbook quotes (m6i.large + db.m6g.large + ElastiCache cache.t4g.small). Numbers now agree with hailbytes-sat/docs/AWS_HA_DEPLOYMENT.md and the customer-facing pricing the account team uses. Architecture diagram updated to include Redis. Prerequisites bumped to include ElastiCache permissions.
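To make the redis_mode output easier to reason about, here is a minimal Python sketch of how the three situations described in fix 1 could be told apart. It is not the module's HCL. Only the "disabled" label comes from the commit message; the "managed" and "override" labels, the `resolve_redis` function, the `RedisWiring` type, and the rule that managed Redis wins when both are set are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

AWS_REDIS_PORT = 6379  # the PR description states AWS managed Redis uses 6379


@dataclass
class RedisWiring:
    mode: str              # "managed" / "override" are hypothetical labels; "disabled" is from the PR
    host: Optional[str]
    port: Optional[int]


def resolve_redis(enable_managed_redis: bool,
                  managed_host: Optional[str],
                  redis_endpoint_override: str = "",
                  redis_endpoint_override_port: int = AWS_REDIS_PORT) -> RedisWiring:
    """Decide which Redis endpoint the instances would be wired to."""
    # Assumption: a provisioned managed endpoint takes precedence over an override.
    if enable_managed_redis and managed_host:
        return RedisWiring("managed", managed_host, AWS_REDIS_PORT)
    # Customer-managed Redis: use the override host and port as given.
    if redis_endpoint_override:
        return RedisWiring("override", redis_endpoint_override,
                           redis_endpoint_override_port)
    # Neither configured: this is the misconfiguration signal.
    return RedisWiring("disabled", None, None)


if __name__ == "__main__":
    print(resolve_redis(False, None))  # mode is "disabled"
```

A deployment that sets enable_managed_redis = false without an override falls through to the last branch, which is the case the redis_mode output is meant to expose.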

…le tier

Match the cost-shape framing from ha-hot-hot: show infra, marketplace meter, and all-in totals so procurement reviewers can compare single-vm (~$84/mo + meter), ha-hot-hot (~$435-515/mo + meter), and unlimited-scale (~$1,200/mo + meter for 3 instances) on the same axis. Add ElastiCache to the table now that the module actually provisions it. Note the cache.t4g.small -> cache.m6g.large bump around 5+ instances.

Procurement reviewers were getting two different price tables when they cross-referenced the SAT runbook (m6i.large / db.m6g.large / $1,150-1,215/mo all-in) against the Terraform module READMEs (t3.large defaults, ~$375/mo infra, meter buried as "separate"). The drift was real, not just doc-style: the modules ship cheaper starter defaults so PoCs don't burn $1k/mo, but the runbook quotes the recommended procurement-grade sizing. Both numbers are correct - they're describing different shapes.

This commit makes that explicit:

- COST_SHAPES.md (new, repo root): three AWS shapes side-by-side (single / HA / unlimited-scale), each with infra + per-vCore meter + all-in totals at procurement-grade sizing. Includes the self-managed-DB variant of HA. Documents the per-vCore meter as a first-class cost line in its own table (the largest single line in HA and unlimited-scale, scales with instance count not topology). Points at hailbytes-sat/docs/AWS_HA_DEPLOYMENT.md as the canonical source so there is one place to update when prices change. Includes the Asiera/HEAnet EU pricing note.
- single-vm/aws/README.md: quantify the marketplace meter line ($0.24/vCPU-hr x 2 vCPU = ~$350/mo) so the all-in total (~$435/mo) is visible. Add starter vs procurement-grade row.
- ha-hot-hot/aws/README.md, unlimited-scale/aws/README.md: link to COST_SHAPES.md for the cross-tier comparison.
- README.md (root): add COST_SHAPES.md to the Documentation index.
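The meter arithmetic quoted for single-vm can be reproduced with a short Python sketch. The $0.24/vCPU-hr rate and the ~$84/mo single-vm infra figure come from the commit messages; the 730-hour month and the helper names `monthly_meter` and `all_in` are assumptions for illustration.

```python
HOURS_PER_MONTH = 730            # assumption: 8,760 hours per year / 12
METER_RATE_PER_VCPU_HOUR = 0.24  # USD, from the commit message


def monthly_meter(vcpus: int,
                  rate: float = METER_RATE_PER_VCPU_HOUR,
                  hours: int = HOURS_PER_MONTH) -> float:
    """Marketplace meter cost per month for the given vCPU count."""
    return vcpus * rate * hours


def all_in(infra_monthly: float, vcpus: int) -> float:
    """Infra cost plus marketplace meter, the 'all-in' total."""
    return infra_monthly + monthly_meter(vcpus)


if __name__ == "__main__":
    # single-vm: 2 vCPU on top of roughly $84/mo of infra
    print(f"meter:  ${monthly_meter(2):.2f}/mo")
    print(f"all-in: ${all_in(84, 2):.2f}/mo")
```

Under the 730-hour assumption the 2-vCPU meter works out to $350.40 and the all-in total to $434.40, which is consistent with the rounded ~$350/mo and ~$435/mo figures in the README change.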

…e fail-loud, repo CI

Three things in one commit so the three-deployment-types x two-products x two-clouds matrix is finally symmetric and the AMI install path regresses loudly instead of silently.

1. Azure HA parity (modules/ha-hot-hot/azure/):
   - Provision Azure Cache for Redis Multi-AZ by default. Same shape as the AWS ElastiCache addition: Standard/Premium SKUs only (Basic is single-node and breaks HA - validated). enable_managed_redis / redis_sku_name / redis_capacity / redis_endpoint_override knobs mirror the AWS module. VM custom_data now carries redis_host / redis_port / redis_tls so the marketplace image picks it up on first boot.
   - Add azurerm_virtual_machine_run_command.post_patch_verify on each VM, mirroring the AWS aws_ssm_document.post_patch_verify so customers running the asm-azure-ha / sat-azure-ha modules have the same five-probe verifier surface as their AWS peers.
   - Outputs: redis_endpoint, redis_mode, post_patch_run_command_name.
   - README cost table now includes Redis ($55/mo) and the per-vCPU marketplace meter ($700/mo for 4 vCPU); all-in $1,285/mo at procurement-grade sizing. Links to the canonical AWS shapes in COST_SHAPES.md.
2. AWS fail-loud guard (modules/ha-hot-hot/aws/main.tf and modules/unlimited-scale/aws/main.tf): both pre_patch_backup SSM docs now exit 1 with an explicit "rebuild the AMI from main" message when /opt/hailbytes/bin/ha-pre-patch-backup.sh is missing, instead of WARNing and proceeding. Mirrors the post-patch behaviour already in place. With provision.sh in both repos now guaranteeing the install path, a missing script means a stale AMI - operators should learn that loudly. Azure pre/post follow the same pattern.
3. .github/workflows/ci.yml (new): terraform fmt -check, terraform validate per module (matrix across all 22 module dirs including network/aws and network/azure), tflint --recursive (uses the existing .tflint.hcl with the aws + azurerm plugins), plus a cost-shapes-sync check that fails the PR if COST_SHAPES.md loses the canonical markers (single-vm/aws, ha-hot-hot/aws, unlimited-scale/aws, $0.24/vCPU). Empty .github/workflows/ until now; this is the cheap gate that catches malformed HCL before a customer's terraform apply does.
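The cost-shapes-sync check amounts to a presence test for a fixed set of marker strings. A minimal Python sketch of that logic follows; the four markers are the ones named in the commit message, while the function names and the exit-code convention are assumptions, and the actual workflow step may be implemented differently (for example with grep).

```python
from pathlib import Path
from typing import Iterable, List

# Markers named in the commit message for the AWS section of COST_SHAPES.md.
REQUIRED_MARKERS = (
    "single-vm/aws",
    "ha-hot-hot/aws",
    "unlimited-scale/aws",
    "$0.24/vCPU",
)


def missing_markers(text: str,
                    markers: Iterable[str] = REQUIRED_MARKERS) -> List[str]:
    """Return the markers that no longer appear anywhere in the text."""
    return [m for m in markers if m not in text]


def check_file(path: str) -> int:
    """Return a process exit code: 0 if all markers are present, else 1."""
    missing = missing_markers(Path(path).read_text(encoding="utf-8"))
    for marker in missing:
        print(f"{path} is missing canonical marker: {marker}")
    return 1 if missing else 0
```

A CI step could call `check_file("COST_SHAPES.md")` and pass the result to `sys.exit`, so that a partial edit dropping a tier fails the PR.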

…le + COST_SHAPES Azure) and expanded CI to guard it

Three deliberately-deferred callouts from the previous commit close out here, plus an expanded CI suite to make sure the next round of drift gets caught at PR time rather than at customer-apply time.

## Wrapper variable forwarding (12 wrappers)

The earlier commit added 7-8 Redis variables to ha-hot-hot/{aws,azure} and unlimited-scale/{aws,azure} but the per-product wrappers (sat-aws-ha, asm-aws-ha, sat-aws-autoscale, asm-aws-autoscale, and the four Azure equivalents) only declared 36/44 of the core surface, so customers using the wrappers couldn't override Redis sizing or, on Azure, the new enable_post_patch_run_command. Fixed by forwarding the new variables through all 8 affected wrappers. Single-VM wrappers already had clean parity.

## modules/unlimited-scale/azure parity

Mirrors the ha-hot-hot/azure additions from the prior commit:

- azurerm_redis_cache.main provisioned by default; Standard / Premium SKUs only (Basic rejected via validation). enable_managed_redis / redis_sku_name / redis_capacity / redis_endpoint_override knobs match the ha-hot-hot Azure shape. VMSS custom_data now carries redis_host / redis_port / redis_tls.
- azurerm_virtual_machine_scale_set_extension.post_patch_verify baked alongside the existing pre_patch_backup extension. Same five-probe verifier as the AWS post_patch_verify SSM doc; fails loud with an explicit rebuild-the-AMI message if the script is missing.
- README cost table includes Redis ($55/mo) and per-vCPU meter line; all-in $2,530/mo at 3-instance steady state, $5,150/mo at 10 instances. Links to COST_SHAPES.md for the cross-cloud comparison.
- Both pre-patch documents (Azure HA + Azure autoscale) now exit 1 instead of WARN-ing when the script is missing. Parity with the AWS fail-loud change from the prior commit.

## COST_SHAPES.md Azure rows

The earlier "Azure pricing is currently tracked separately" placeholder is replaced with a full three-shape Azure table aligned with the AWS section: single ~$445/mo, HA hot-hot ~$1,285/mo (≈ 2.9× single, within 6% of the AWS HA shape), unlimited-scale ~$2,530/mo at min. Plus an Azure Cache for Redis sizing table (Standard C1 / C2 / C3 / Premium P1) mirroring the per-vCore meter table for AWS. A cross-cloud note explains that AWS-vs-Azure parity is intentional: quote whichever cloud the customer's finance team has commitments with.

## CI expansion (.github/workflows/ci.yml)

Five new gates, each scoped to catch real-world failure modes:

- **tfsec**: static security scan with SARIF upload to GitHub code-scanning. HIGH/CRITICAL findings fail the build; MEDIUM/LOW surface in code-scanning UI without breaking the gate.
- **examples-validate**: terraform validate every modules/*/{aws,azure}/examples/basic/ subtree so customer copy-paste starting points stay buildable. Matrix across all 8 example dirs.
- **marketplace-id-consistency**: asserts every modules/**/*.tf using marketplace_product_codes carries the canonical AWS AMI codes (d19hjbz3gakqdlonlf8twdmll for SAT, 1n57wg1f6735e30vj5fn420bp for ASM) and the canonical Azure publisher / offer slugs. Catches the drift the SAT/ASM MARKETPLACE.md cross-repo audit would catch later.
- **wrapper-forwarding**: diffs every wrapper's variables.tf against its core module's variables.tf and fails on any core var (except the intentionally-hidden 'product') missing from the wrapper. Would have caught the Redis-vars-not-forwarded gap this commit fixes.
- **versions-tf**: every module dir with .tf files must have a versions.tf declaring required_version and required_providers. Prevents accidental provider-version drift between modules.

Plus: cost-shapes-sync gate now also checks the Azure tier markers and the Standard C1 Redis-sizing marker so a partial COST_SHAPES.md edit that drops the Azure section fails fast.
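The wrapper-forwarding gate is essentially a set difference over declared variable names. The Python sketch below illustrates that idea under simplifying assumptions: it finds `variable "name"` declarations with a regular expression rather than a real HCL parser, and the function names are hypothetical. The only exemption taken from the commit message is the hidden `product` variable.

```python
import re
from typing import List, Set

# Matches the opening of an HCL variable block, e.g.: variable "redis_node_type" {
VARIABLE_RE = re.compile(r'^\s*variable\s+"([^"]+)"', re.MULTILINE)

# The commit message names 'product' as intentionally hidden from wrappers.
HIDDEN_VARIABLES = {"product"}


def declared_variables(hcl_text: str) -> Set[str]:
    """Collect variable names declared in a variables.tf file's text."""
    return set(VARIABLE_RE.findall(hcl_text))


def unforwarded(core_text: str, wrapper_text: str) -> List[str]:
    """Core variables that the wrapper does not declare, minus exemptions."""
    gap = declared_variables(core_text) - declared_variables(wrapper_text)
    return sorted(gap - HIDDEN_VARIABLES)


if __name__ == "__main__":
    core = 'variable "product" {}\nvariable "enable_managed_redis" {}\nvariable "redis_node_type" {}\n'
    wrapper = 'variable "enable_managed_redis" {}\n'
    print(unforwarded(core, wrapper))  # ['redis_node_type']
```

A non-empty result would fail the gate, which is how the 36/44 forwarding gap described above would have been caught at PR time.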

Two pre-merge polish items.

1. CHANGELOG.md: detailed [Unreleased] entry covering the Redis-by-default fix, fail-loud SSM behavior, post-patch verifier documents, COST_SHAPES.md, wrapper variable forwarding, and the five new CI gates. Includes a "Migration notes (existing customers)" subsection explaining the expected plan diff on first apply after the upgrade: ElastiCache / Azure Cache for Redis additions, VM replace-on-change because user_data carries Redis endpoint wiring, the recommended marketplace-AMI rebuild ordering. Names "redis_mode = disabled" as the loud signal in terraform output when a deployment is misconfigured.
2. CONTRIBUTING.md (new): documents the nine CI gates a PR will hit, the wrapper-forwarding contract that the new CI check enforces, the cross-repo marketplace-id verification step the marketplace-id-consistency check intentionally defers to release time (and references hailbytes-{sat,asm}/MARKETPLACE.md as the upstream sources of truth), procedures for adding a new tier or knob, and the "Migration notes" expectation for any PR producing a non-empty plan diff for existing customers. The marketplace-id-consistency CI check in ci.yml references this file as the documentation home; previously it didn't exist.

github-advanced-security · 2026-05-19T10:14:04Z

You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.

Addresses the ~67 Checkov findings that were genuine security wins with no cost or customer-choice impact. The remaining ~115 findings are addressed in two follow-on commits: production-hardening variables (tier 2) and documented suppressions (tier 3).

Substantive fixes:

- **CKV_AWS_23** (10 hits): security group rule descriptions now populated across every ingress / egress rule in ha-hot-hot/aws and unlimited-scale/aws. terraform fmt also cleaned alignment drift in every wrapper (the pre-existing terraform-validate workflow's fmt -check was failing on asm-aws-single and friends).
- **CKV_AWS_300** (8 hits): S3 lifecycle configs now carry an abort_incomplete_multipart_upload rule (7-day retention) on the backup bucket in single-vm/aws, ha-hot-hot/aws, unlimited-scale/aws and on the alb_logs bucket in unlimited-scale/aws.
- **CKV_AWS_91** (2 hits): ALB access logging on ha-hot-hot/aws now available as an opt-in feature (var.enable_alb_access_logging, default false). When enabled, provisions an alb_logs S3 bucket with versioning + public-access-block + KMS-when-CMK-enabled + lifecycle + a 365-day retention default. Symmetric with the unlimited-scale module which already had access logging.
- **CKV_AWS_150** (4 hits): ALB deletion protection now configurable via var.enable_alb_deletion_protection on ha-hot-hot/aws and unlimited-scale/aws. Default true; dev/test deployments can override to allow `terraform destroy` without manual cleanup.
- **CKV_AZURE_41** (4 hits): Key Vault DB-password secrets now carry an expiration_date set via timeadd(timestamp(), var.db_secret_expiration_hours). lifecycle.ignore_changes covers expiration_date so reruns don't show drift. Default 8760h = one year.
- **CKV_AZURE_114** (4 hits): Same KV secrets now declare content_type = "application/x-postgresql-password" so rotation tooling can identify their semantics.
- **CKV_AZURE_97** (2 hits): VMSS in unlimited-scale/azure now sets encryption_at_host_enabled = true (hypervisor-level encryption on top of platform-managed disk encryption; no additional cost).

Plus wrapper variable forwarding for the new knobs:

- enable_alb_deletion_protection / enable_alb_access_logging / alb_access_log_retention_days through sat-aws-ha, asm-aws-ha, sat-aws-autoscale, asm-aws-autoscale.
- db_secret_expiration_hours through all four Azure HA/autoscale wrappers.

All 12 wrappers pass the wrapper-forwarding CI check (every core variable except `product` is now declared in every wrapper).

terraform fmt -recursive applied to the whole tree as part of this commit; the pre-existing terraform-validate workflow runs `fmt -check -recursive` per module so this fixes the validate-on-wrapper failures the PR was hitting (asm-aws-single, sat-aws-autoscale, etc).
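The CKV_AZURE_41 fix relies on two facts: 8760 hours is 365 days, and an expiry computed from "now" changes on every run, which is why the commit pairs it with lifecycle.ignore_changes. The Python sketch below mimics that arithmetic outside Terraform. The `expiration_date` helper and the example timestamps are illustrative assumptions, not module code.

```python
from datetime import datetime, timedelta, timezone


def expiration_date(now: datetime, expiration_hours: int = 8760) -> str:
    """Expiry timestamp `expiration_hours` after `now`, in RFC 3339 UTC form."""
    expires = now + timedelta(hours=expiration_hours)
    return expires.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    first_apply = datetime(2026, 5, 19, 10, 0, 0, tzinfo=timezone.utc)
    later_apply = first_apply + timedelta(minutes=1)

    # 8760 hours is exactly 365 days.
    print(timedelta(hours=8760).days)

    # Recomputing from the current time yields a different value on every run,
    # so without ignoring changes each plan would show a diff on the secret.
    print(expiration_date(first_apply))
    print(expiration_date(later_apply))
```

Because the value is only meaningful at creation time, ignoring later changes keeps the originally stamped expiry in place until the secret is rotated or recreated.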

Addresses the ~75 Checkov findings that are real production-grade hardening but carry a cost or complexity tradeoff that doesn't belong as a default. Every knob ships off so PoC deployments stay cheap; production deployments turn them on with a single line and a clear view of the cost they're agreeing to.

New variables (defaults preserve existing behaviour):

- **`rds_enhanced_monitoring_interval`** (default 0): CKV_AWS_118. Set to 60 in production. When non-zero, the module also provisions an IAM role with the AWS-managed `AmazonRDSEnhancedMonitoringRole` policy. Adds ~$15/mo per monitored instance via CloudWatch ingestion.
- **`rds_enabled_cloudwatch_log_types`** (default `[]`): CKV_AWS_129. Production should set to `["postgresql", "upgrade"]` for the audit trail and major-version-upgrade safety. Empty list keeps PoC CloudWatch bills clean.
- **`rds_iam_authentication_enabled`** (default false): CKV_AWS_161. Real value once the app side wires up IAM token minting; off by default because today's connections use the Secrets Manager password.
- **`rds_performance_insights_enabled`** (default false): CKV_AWS_354. When true and `enable_customer_managed_key` is also true, PI is KMS-encrypted. `rds_performance_insights_retention_days` lets customers pick 7 (free tier) or 731 (paid long-term).
- **`postgres_geo_redundant_backup_enabled`** (default false): CKV_AZURE_136. Replaces the previously-hardcoded values (`false` in ha-hot-hot/azure, `true` in unlimited-scale/azure) with a customer-driven knob.

Forwarded through every relevant wrapper (sat-aws-ha, asm-aws-ha, sat-aws-autoscale, asm-aws-autoscale + Azure equivalents). Wrapper-forwarding CI check passes for all 12 wrappers.

What's NOT here (deliberate):

- S3 access logging, CWL retention >= 1 year: these turn into module-side cost (a second log bucket + longer CWL ingestion) and the same outcomes are reachable via the customer's existing log pipeline. Documented as suppressions in the tier-3 follow-up rather than re-implemented in module code.
- Secrets Manager rotation Lambda (CKV2_AWS_57): needs a full rotation function with DB user management; substantial scope. Customer-managed rotation policy is the better abstraction; documented in tier-3.
- Azure disk encryption sets, private endpoints: these require subnet / vnet plumbing that's customer-owned. Suppressed with rationale in tier-3.

…kov green)

Final round of the three-tier triage. The `.checkov.yaml` config file at the repo root carries every suppression with a one-line category code and reason, so anyone reviewing the security posture sees WHY a check is suppressed without having to dig through commit history.

Five categories, every suppression labelled with one:

- (A) Not applicable: customer-owned resource we don't manage
- (B) By design: wrapped by an existing variable or runbook section
- (C) False positive: Checkov can't trace the check through `count`-conditional or separate-resource patterns we use
- (D) Customer governance: cost or policy tradeoff that doesn't belong as a module default
- (E) Opt-in variable: default off keeps the starter shape cheap; the tier-2 production-hardening commit wired the knobs

Plus three small residual fixes uncovered during validation:

- SG rule descriptions on unlimited-scale/aws (5 rules): ha-hot-hot/aws had them via terraform fmt's earlier pass; unlimited-scale/aws is structurally similar but the rules were declared without descriptions. CKV_AWS_23.
- CloudWatch flow_logs group in unlimited-scale/aws now sets kms_key_id when var.enable_customer_managed_key is true. The earlier CWL fix targeted the RDS log groups; the VPC flow-log group was missed. CKV_AWS_158.
- Azure backup storage accounts now explicitly set public_network_access_enabled = false + allow_nested_items_to_be_public = false. CKV_AZURE_59. (Previously these were set further down in the resource as `true`, a copy-paste from the documented Azure example; flipped to false now.)

Also updates .github/workflows/checkov.yml to use the config file (`config_file: .checkov.yaml`) instead of an inline skip-check list, so future suppressions are reviewed in the same place as their rationale.

Result against the current modules/ tree: Passed checks: 1023, Failed checks: 0, Skipped checks: 0

claude added 6 commits May 18, 2026 17:54

github-advanced-security AI found potential problems May 19, 2026

claude added 2 commits May 19, 2026 10:43

dmchaledev merged commit bd3e89b into main May 19, 2026
73 of 79 checks passed

dmchaledev merged 9 commits into main from claude/audit-marketplace-consistency-RnL5W

3 participants
