feat(deploy): enable prod stage via the lite deploy flow #836

Draft
law-chain-hot wants to merge 10 commits into main from feat/prod-enable

@law-chain-hot law-chain-hot commented Jun 22, 2026

Contributor

What

Enables a second SST stage (prod) through the existing one-click Deploy + Runner Rollout buttons, without changing the dev flow. This branch is built on top of the dev-button work (#834) and supersedes it — it contains those commits plus the prod-enablement, so #834 does not need to be merged separately.

Changes

  • sst.config.ts: REGION reads AWS_REGION (default ap-southeast-1); the runner EC2 Name tag is stage-scoped (boxlite-<stage>-runner-default) so dev and prod runners stay distinguishable.
  • deploy.yml / runner-rollout.yml: the stage choice adds prod; a new region input replaces the hardcoded region; deploy gates on a per-stage GitHub Environment (environment: ${{ inputs.stage }}); the runner-rollout S3_BUCKET is stage-scoped.
  • Deploy role renamed from boxlite-github-deployer-<stage> to boxlite-app-<stage>-deploy (assumed via OIDC).
  • runner-update-binary.sh: stage-scoped tag lookup with a dev-only legacy fallback, plus a multi-match guard.
  • scripts/seed-prod-from-dev.mjs (new): copies one stage's SSM env to another, rewriting the domain in STACK_DOMAIN and leaving Auth0 / SSH / runner-ip blank for manual seeding. Dry-run by default.
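
As a rough illustration of the copy / domain-rewrite / blank split described for seed-prod-from-dev.mjs, the sketch below classifies keys and builds a dry-run plan. It is not the script's actual code: the key patterns (AUTH0_, SSH_, RUNNER_IP), the function names, and the example domains are assumptions made up for this example, since the PR does not list the real key names.

```javascript
// Hypothetical patterns: the PR only says Auth0 / SSH / runner-ip keys are left blank.
const BLANK_PATTERNS = [/^AUTH0_/, /^SSH_/, /RUNNER_IP/]
// Keys whose value has the dev domain swapped for the prod domain.
const DOMAIN_KEYS = new Set(['STACK_DOMAIN'])

// Decide what happens to one key when seeding the target stage.
function classifyKey(key) {
  if (BLANK_PATTERNS.some((re) => re.test(key))) return 'blank'
  if (DOMAIN_KEYS.has(key)) return 'domain-rewrite'
  return 'copy'
}

// Build a plan (nothing is written): value is null for keys to be filled in manually.
function planTarget(source, devDomain, prodDomain) {
  const plan = new Map()
  for (const [key, value] of source) {
    const action = classifyKey(key)
    let next = value
    if (action === 'blank') next = null
    else if (action === 'domain-rewrite') next = value.split(devDomain).join(prodDomain)
    plan.set(key, { action, value: next })
  }
  return plan
}

// Example with made-up values.
const source = new Map([
  ['STACK_DOMAIN', 'dev.example.com'],
  ['AUTH0_CLIENT_ID', 'abc123'],
  ['LOG_LEVEL', 'info'],
])
const plan = planTarget(source, 'dev.example.com', 'prod.example.com')
for (const [key, { action }] of plan) console.log(`${key}: ${action}`)
```

Keeping the classification separate from the write step is what makes a dry-run default cheap: the same plan can be printed or applied.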

Prerequisites before a prod deploy can run (account owner)

  1. Create boxlite-app-dev-deploy and boxlite-app-prod-deploy (admin, unbounded; OIDC trust sub repo:boxlite-ai/boxlite:environment:<stage>). The existing dev role is renamed to boxlite-app-dev-deploy.
  2. Extend boxlite-bounded-role-admin so PassRole + AddRoleToInstanceProfile cover role/boxlite-prod-* (today they list only boxlite-dev-* / boxlite-e2e-*).
  3. Create a prod GitHub Environment.
  4. Seed SSM /boxlite/prod/{cloudflare-*, env/*} (env via seed-prod-from-dev.mjs; Auth0 values filled once the prod Auth0 app exists).

Safety

The dev flow is unchanged: region defaults to ap-southeast-1, and the dev GitHub Environment already exists. Reviewed for dev-safety, region-correctness, and GitHub Actions semantics; verified locally (YAML parse, node --check, bash -n, seed dry-run).

Summary by CodeRabbit

  • New Features
    • Added manually-triggered deployment and runner rollout workflows with stage/region selection, diff/preview vs deploy, and upgrade/rollback controls.
  • Improvements
    • Centralized deploy-time configuration in AWS SSM: seed dev→prod, auto-load stage-specific environment variables during sst, and updated teardown guidance.
    • Runner binary updates are now stage-aware, support installing from dev tarballs or release builds, and target the correct runner instances by stage.
    • Deploy configuration now uses an explicit AWS region fallback and applies a default IAM permissions boundary where needed.
  • Bug Fixes
    • Prevented non-critical runner version checks from failing deployments.

@law-chain-hot law-chain-hot requested a review from a team as a code owner June 22, 2026 14:24

coderabbitai Bot commented Jun 22, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 0042105d-3a14-4e04-835f-9db737e6d049

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds two manual GitHub Actions workflows (deploy.yml for infra deploys, runner-rollout.yml for runner binary rollouts), three SSM-related scripts for seeding and loading environment variables, and updates sst.config.ts for dynamic region, IAM permission boundaries, and stage-scoped resource names. Also hardens runner-update-binary.sh for stage-aware instance lookup and dev/release tarball installs.

Changes

Deploy & Runner Rollout Pipeline

Layer / File(s) | Summary
SSM env seeding and loading scripts (apps/infra/scripts/seed-deploy-env-ssm.mjs, apps/infra/scripts/seed-prod-from-dev.mjs, apps/infra/scripts/sst-with-cloudflare.mjs) | seed-deploy-env-ssm.mjs writes .env key/value pairs to SSM as SecureString under /boxlite/<stage>/env/ with dry-run support. seed-prod-from-dev.mjs reads dev-stage SSM parameters, classifies each key as copy/domain-rewrite/blank, and writes to the production SSM path when --apply is passed. sst-with-cloudflare.mjs gains a loadEnvFromSsm(stage) function that populates process.env from SSM before spawning sst, preserving local env precedence.
sst.config.ts: dynamic region, IAM boundary, stage-scoped runner names (apps/infra/sst.config.ts) | REGION reads from process.env.AWS_REGION with ap-southeast-1 fallback. A global $transform on aws.iam.Role applies a permissionsBoundary ARN to every IAM role when not already set. Production checks now use 'prod' instead of 'production'. RunnerRole, RunnerProfile, and the default runner tag are renamed to explicit stage-scoped names.
runner-update-binary.sh: stage-aware instance and tarball handling (scripts/deploy/runner-update-binary.sh) | Instance selection now checks BOXLITE_RUNNER_INSTANCE_ID first, then the boxlite-<stage>-runner-default tag (plus legacy boxlite-runner-default for dev). Asset URL switches between BOXLITE_RUNNER_TARBALL_URL (dev builds) and the GitHub release URL. SHA_URL is derived from BOXLITE_RUNNER_TARBALL_SHA256_URL or appending .sha256; checksum fetch failure logs a warning and proceeds. The --version cosmetic call is guarded with || true.
deploy.yml: manual infra deploy workflow (.github/workflows/deploy.yml) | New workflow_dispatch workflow accepts stage, mode (diff/deploy), and region. Configures OIDC AWS auth scoped to the chosen stage's IAM role, stage-scoped concurrency without cancellation, then runs either sst diff or npm run deploy for apps/infra.
runner-rollout.yml: manual runner rollout workflow (.github/workflows/runner-rollout.yml) | New workflow_dispatch workflow accepts stage, region, source (dev-build/release), ref, mode (upgrade/rollback), and optional instance_id. For dev-build: checks out the ref, builds Go daemon/computer-use/runner binaries with embedded versioning, packages a tarball with sha256, uploads to S3, presigns URLs (30-min TTL), and invokes runner-update-binary.sh with tarball/sha URLs. For release: invokes runner-update-binary.sh directly with the version ref. Rollback mode is rejected unless source=release; prod stage is rejected unless source=release.
Documentation update: stage naming consistency (apps/infra/README.md) | Updates the Cost section to reference --stage prod instead of --stage production for consistency with config conventions.
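
The two rejection rules described for runner-rollout.yml (rollback only with source=release, prod only with source=release) can be written as one small check. This is a sketch in Node rather than the workflow's own YAML/shell; the function name and the error wording are assumptions, while the input names and values (stage, source, mode) come from the walkthrough above.

```javascript
// Returns a list of human-readable problems; an empty list means the inputs are acceptable.
function validateRolloutInputs({ stage, source, mode }) {
  const errors = []
  // Rollback needs a published release to roll back to.
  if (mode === 'rollback' && source !== 'release') {
    errors.push('mode=rollback requires source=release')
  }
  // Prod must never receive an ad-hoc dev build.
  if (stage === 'prod' && source !== 'release') {
    errors.push('stage=prod requires source=release')
  }
  return errors
}

console.log(validateRolloutInputs({ stage: 'dev', source: 'dev-build', mode: 'upgrade' }))
console.log(validateRolloutInputs({ stage: 'prod', source: 'dev-build', mode: 'rollback' }))
```

Collecting every violation instead of stopping at the first one lets a single failed run report everything the operator needs to fix.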

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • boxlite-ai/boxlite#793: Publishes *.sha256 files alongside runner builds; directly feeds into this PR's checksum verification logic in runner-update-binary.sh.
  • boxlite-ai/boxlite#819: Updates the runner EC2 default Name-tag convention and instance-selection logic in runner-update-binary.sh, which this PR further extends with stage-scoped naming.
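
The stage-scoped instance selection mentioned above (explicit instance ID first, then the boxlite-<stage>-runner-default tag, then the legacy boxlite-runner-default tag for dev only, with a guard against multiple matches) might look like the following. The real logic lives in a bash script; this Node sketch only mirrors the described order. The function names, the instancesByTag lookup table standing in for an EC2 query, and the choice to throw on multiple matches are all assumptions.

```javascript
// Tag names to try, in order. Only dev falls back to the legacy unscoped tag.
function candidateTags(stage) {
  const tags = [`boxlite-${stage}-runner-default`]
  if (stage === 'dev') tags.push('boxlite-runner-default')
  return tags
}

// instancesByTag is a stand-in for an EC2 lookup: { tagName: [instanceId, ...] }.
function pickInstance({ stage, explicitId, instancesByTag }) {
  if (explicitId) return explicitId // an explicit instance ID always wins
  for (const tag of candidateTags(stage)) {
    const matches = instancesByTag[tag] ?? []
    if (matches.length > 1) {
      // Multi-match guard: refuse to guess which runner to update.
      throw new Error(`multiple instances tagged ${tag}: ${matches.join(', ')}`)
    }
    if (matches.length === 1) return matches[0]
  }
  throw new Error(`no runner instance found for stage ${stage}`)
}

console.log(pickInstance({
  stage: 'dev',
  explicitId: '',
  instancesByTag: { 'boxlite-runner-default': ['i-legacy'] },
}))
```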

Suggested reviewers

  • G4614

Poem

🐇 Hoppity-hop through the SSM store,
Seeding secrets from dev to prod shore.
IAM boundaries, stage-scoped names in tow,
One-click deploys wherever we go!
The runner rolls out with tarball in paw —
No hardcoded tags left to gnaw! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title accurately summarizes the main change: enabling a production stage through the deployment workflow, which is the primary objective of the PR.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 08b4a28041

Comment on lines +79 to +82
  rollout:
    name: Roll out runner
    needs: config
    runs-on: ubuntu-latest

P1: Add the rollout job environment before assuming AWS

The rollout job assumes boxlite-app-<stage>-deploy, whose documented OIDC trust is repo:boxlite-ai/boxlite:environment:<stage>, but this job never declares a GitHub environment the way deploy.yml does. GitHub's OIDC sub only includes environment:<name> when the job references that environment; otherwise it uses the ref/pull_request subject, so the Assume AWS deploy role step will be rejected for both dev and production once the roles are created as described, and relaxing that trust would also remove the intended environment approval gate.
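
To make the mismatch concrete, here is a small sketch of how the default OIDC subject is shaped depending on whether a job declares an environment, and how that compares with the trust condition quoted in the PR description. The function names are made up for this example, and it covers only GitHub's default subject format (repositories can customize the claim).

```javascript
// Default `sub` claim shapes for GitHub Actions OIDC tokens.
function oidcSubject({ repo, environment, eventName, ref }) {
  if (environment) return `repo:${repo}:environment:${environment}`
  if (eventName === 'pull_request') return `repo:${repo}:pull_request`
  return `repo:${repo}:ref:${ref}` // e.g. refs/heads/main
}

// The trust condition described in the PR: sub must name the stage's environment.
function roleTrusts(subject, repo, stage) {
  return subject === `repo:${repo}:environment:${stage}`
}

const repo = 'boxlite-ai/boxlite'
// Job that declares `environment: prod`, as deploy.yml does.
const withEnv = oidcSubject({ repo, environment: 'prod', eventName: 'workflow_dispatch', ref: 'refs/heads/main' })
// Job with no environment, as in the rollout job under review.
const withoutEnv = oidcSubject({ repo, eventName: 'workflow_dispatch', ref: 'refs/heads/main' })

console.log(withEnv, roleTrusts(withEnv, repo, 'prod'))
console.log(withoutEnv, roleTrusts(withoutEnv, repo, 'prod'))
```

Under these assumptions the second subject is ref-based, so a role that trusts only the environment-scoped subject would refuse it.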

Comment on lines +101 to +102
if (!key || process.env[key]) continue // already set (e.g. from .env) wins
process.env[key] = p.Value

P2: Load local dotenv before filling from SSM

When this wrapper runs from a laptop with AWS credentials, these lines populate process.env from SSM before spawning sst; apps/infra/sst.config.ts loads .env later with dotenv's default non-overriding behavior. As a result, local .env values that are not already shell-exported, such as STACK_DOMAIN or OIDC_*, are ignored and replaced by the stage's SSM values, contrary to the comment that laptops keep using .env unchanged.
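
The ordering problem can be seen with a first-writer-wins merge: whichever layer is applied first keeps its value. The sketch below is a simplified model, not the wrapper's code; mergeEnv and the sample values are assumptions, and each layer is a plain object standing in for the shell environment, a parsed .env file, and SSM parameters.

```javascript
// First layer to define a key wins; later layers only fill gaps.
function mergeEnv(...layers) {
  const out = {}
  for (const layer of layers) {
    for (const [key, value] of Object.entries(layer)) {
      if (!Object.hasOwn(out, key)) out[key] = value
    }
  }
  return out
}

const shell = { AWS_REGION: 'ap-southeast-1' }
const dotenv = { STACK_DOMAIN: 'local.example.com' }
const ssm = { STACK_DOMAIN: 'dev.example.com', OIDC_ISSUER: 'https://issuer.example.com' }

// Order reported in the review: SSM is applied before .env, so SSM wins.
console.log(mergeEnv(shell, ssm, dotenv).STACK_DOMAIN)
// Order the review asks for: .env is applied before SSM, so the laptop value wins.
console.log(mergeEnv(shell, dotenv, ssm).STACK_DOMAIN)
```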

import { execFileSync } from 'node:child_process'

const REGION = process.env.AWS_REGION || 'ap-southeast-1'

P2: Separate source and target SSM regions

Using one REGION for both readSourceEnv and putParam breaks the advertised flow when production is deployed outside dev's region. With AWS_REGION=us-west-2, the script reads /boxlite/dev/env/* from us-west-2 even though dev lives in ap-southeast-1; with the default region, it writes /boxlite/production/env/* to ap-southeast-1 where a us-west-2 production deploy will not load it.
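
One possible shape for the split is to resolve the read region and the write region independently, each falling back to the current single-region behavior. This is only a sketch: SOURCE_AWS_REGION and TARGET_AWS_REGION are hypothetical variable names invented for this example, not inputs the script is known to accept.

```javascript
// Resolve separate regions for reading the source stage and writing the target stage.
// SOURCE_AWS_REGION / TARGET_AWS_REGION are hypothetical names for illustration.
function resolveRegions(env) {
  const fallback = env.AWS_REGION || 'ap-southeast-1'
  return {
    source: env.SOURCE_AWS_REGION || fallback,
    target: env.TARGET_AWS_REGION || fallback,
  }
}

// Dev lives in ap-southeast-1, prod is deployed to us-west-2.
console.log(resolveRegions({ SOURCE_AWS_REGION: 'ap-southeast-1', TARGET_AWS_REGION: 'us-west-2' }))
// With nothing set, both sides keep the existing default.
console.log(resolveRegions({}))
```

The read helper would then use the source region and the write helper the target region, so a cross-region seed neither reads an empty path nor writes where the deploy will not look.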

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (1)
.github/workflows/deploy.yml (1)

58-59: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider adding a job timeout.

SST deployments can take significant time, but unbounded jobs risk hanging indefinitely on infrastructure issues (e.g., stuck CloudFormation stacks). A reasonable timeout provides a safety net.

   deploy:
     runs-on: ubuntu-latest
+    timeout-minutes: 45
     # GitHub Environment per stage — gates the deploy behind that environment's
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/deploy.yml around lines 58 - 59, The deploy job in the
GitHub Actions workflow currently has no timeout protection, which means it can
hang indefinitely if infrastructure issues occur. Add a timeout-minutes field to
the deploy job configuration to set a reasonable maximum execution time for the
job, preventing it from consuming resources unnecessarily during infrastructure
failures.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: fc77cd02-1035-4d1f-96c9-abbcd8acce01

📥 Commits

Reviewing files that changed from the base of the PR and between 4a52285 and 08b4a28.

📒 Files selected for processing (7)
  • .github/workflows/deploy.yml
  • .github/workflows/runner-rollout.yml
  • apps/infra/scripts/seed-deploy-env-ssm.mjs
  • apps/infra/scripts/seed-prod-from-dev.mjs
  • apps/infra/scripts/sst-with-cloudflare.mjs
  • apps/infra/sst.config.ts
  • scripts/deploy/runner-update-binary.sh

Comment on lines +162 to +165
      - name: Rewrite go.work for minimal modules
        if: ${{ steps.in.outputs.source == 'dev-build' }}
        run: |
          printf 'go 1.25.4\n\nuse (\n\t./runner\n\t./daemon\n\t./common-go\n\t./api-client-go\n\t./libs/computer-use\n\t../sdks/go\n)\n' > apps/go.work

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check what go-version config.yml exports
cat .github/workflows/config.yml | grep -A5 -B5 'go-version'

Repository: boxlite-ai/boxlite

Length of output: 919


Hardcoded Go version conflicts with config.yml.

Line 165 hardcodes go 1.25.4 in the generated go.work, but config.yml defines go-version: '1.24' and the Go setup at line 131 uses needs.config.outputs.go-version. This version divergence could cause build failures or unexpected behavior.

Use the configured Go version instead:

Suggested fix
      - name: Rewrite go.work for minimal modules
        if: ${{ steps.in.outputs.source == 'dev-build' }}
+       env:
+         GO_VERSION: ${{ needs.config.outputs.go-version }}
        run: |
-         printf 'go 1.25.4\n\nuse (\n\t./runner\n\t./daemon\n\t./common-go\n\t./api-client-go\n\t./libs/computer-use\n\t../sdks/go\n)\n' > apps/go.work
+         printf 'go %s\n\nuse (\n\t./runner\n\t./daemon\n\t./common-go\n\t./api-client-go\n\t./libs/computer-use\n\t../sdks/go\n)\n' "$GO_VERSION" > apps/go.work
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/runner-rollout.yml around lines 162 - 165, The "Rewrite
go.work for minimal modules" step hardcodes the Go version as 1.25.4 in the
printf command, which conflicts with the Go version defined in config.yml (1.24)
and the version already retrieved via needs.config.outputs.go-version at the
workflow setup. Replace the hardcoded go 1.25.4 string in the printf command
with a reference to the needs.config.outputs.go-version output variable to
ensure consistency across the workflow and align with the configured version.

Comment on lines +69 to +89
let ok = 0
for (const [key, val] of pairs) {
  const name = `/boxlite/${stage}/env/${key}`
  if (DRY) {
    console.log(` would put ${name}`)
    continue
  }
  try {
    execFileSync(
      'aws',
      ['ssm', 'put-parameter', '--region', REGION, '--name', name, '--type', 'SecureString', '--overwrite', '--value', val],
      { stdio: ['ignore', 'ignore', 'pipe'] }, // never echo the value or the version output
    )
    console.log(` ✓ ${key}`)
    ok++
  } catch (err) {
    const msg = (err.stderr ? err.stderr.toString() : err.message).split('\n')[0]
    console.error(` ✗ ${key}: ${msg}`)
  }
}
console.log(DRY ? 'seed-deploy-env-ssm: dry-run, nothing written' : `seed-deploy-env-ssm: wrote ${ok}/${pairs.length}`)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail the script when any SSM write fails.

Line 84 catches put-parameter errors, but Line 89 still completes successfully. That can leave /boxlite/<stage>/env/* partially seeded without signaling failure.

Suggested fix
 let ok = 0
+let failures = 0
 for (const [key, val] of pairs) {
   const name = `/boxlite/${stage}/env/${key}`
@@
   } catch (err) {
     const msg = (err.stderr ? err.stderr.toString() : err.message).split('\n')[0]
     console.error(`  ✗ ${key}: ${msg}`)
+    failures++
   }
 }
 console.log(DRY ? 'seed-deploy-env-ssm: dry-run, nothing written' : `seed-deploy-env-ssm: wrote ${ok}/${pairs.length}`)
+if (!DRY && failures > 0) process.exit(1)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra/scripts/seed-deploy-env-ssm.mjs` around lines 69 - 89, The script
catches errors from the put-parameter AWS CLI call but continues executing and
completes successfully regardless of whether any writes failed. To fix this, add
a flag (such as an error tracking variable) that is set to true whenever an
error is caught in the catch block at line 84. After the final console.log
statement at line 89, check this flag and call process.exit with a non-zero code
(e.g., 1) if any errors were encountered during the loop. This ensures the
script signals failure to its caller when any SSM parameter write fails,
preventing partially seeded environments from appearing successful.

Comment on lines +103 to +109
const source = readSourceEnv(SOURCE_STAGE)
console.log(
  `seed-prod-from-dev: ${SOURCE_STAGE} -> ${TARGET_STAGE} | region=${REGION} | ` +
    `${DEV_DOMAIN} -> ${PROD_DOMAIN} | ${APPLY ? 'APPLY (writing)' : 'DRY-RUN (nothing written)'}`,
)
console.log(` source has ${source.size} key(s) under /boxlite/${SOURCE_STAGE}/env/\n`)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when the source SSM path is empty.

If Line 103 loads zero keys, the script currently continues and can report a misleading “done” result. For stage bootstrapping, empty source should be treated as an error condition.

Suggested fix
 const source = readSourceEnv(SOURCE_STAGE)
+if (source.size === 0) {
+  console.error(
+    `seed-prod-from-dev: no keys found under /boxlite/${SOURCE_STAGE}/env/ in ${REGION}; ` +
+      `check source stage/region/permissions before continuing`,
+  )
+  process.exit(1)
+}
 console.log(
   `seed-prod-from-dev: ${SOURCE_STAGE} -> ${TARGET_STAGE} | region=${REGION} | ` +
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra/scripts/seed-prod-from-dev.mjs` around lines 103 - 109, The script
should fail fast when the source SSM path contains zero keys. After the
readSourceEnv function call and the console.log statement that outputs the
source size, add a validation check to verify that source.size is greater than
0. If source.size equals 0, throw an error with a descriptive message indicating
that the source environment has no keys and the script cannot proceed, rather
than allowing the script to continue with an empty configuration.

Comment on lines +88 to +90
  } catch (err) {
    if (err.code === 'ENOENT') console.warn('sst-with-cloudflare: `aws` CLI not found; skipping SSM env load')
    return

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t swallow non-ENOENT SSM load failures silently.

Line 88-90 returns without warning for auth/region/path errors. That hides the actual failure mode and makes deploy-time config issues hard to diagnose.

Suggested fix
   } catch (err) {
-    if (err.code === 'ENOENT') console.warn('sst-with-cloudflare: `aws` CLI not found; skipping SSM env load')
+    if (err.code === 'ENOENT') {
+      console.warn('sst-with-cloudflare: `aws` CLI not found; skipping SSM env load')
+    } else {
+      const msg = (err.stderr ? err.stderr.toString() : err.message).split('\n')[0]
+      console.warn(`sst-with-cloudflare: failed loading SSM env from ${path}*: ${msg}`)
+    }
     return
   }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra/scripts/sst-with-cloudflare.mjs` around lines 88 - 90, The catch
block in the SSM environment load section only logs a warning when the error
code is ENOENT (aws CLI not found), but silently returns for all other errors
including authentication, region, and path configuration issues. Modify the
error handling to add an else clause after the ENOENT check that logs a warning
with the actual error details for non-ENOENT failures, so that configuration
problems are visible and debuggable instead of being silently swallowed.

Comment on lines +99 to +103
  for (const p of params || []) {
    const key = p.Name.slice(path.length) // segment after the prefix = env var name
    if (!key || process.env[key]) continue // already set (e.g. from .env) wins
    process.env[key] = p.Value
    loaded++

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix env precedence check for empty-string values.

Line 101 uses a truthy check, so '' gets overwritten by SSM. That breaks the “already set wins” contract when empty-string is intentionally configured.

Suggested fix
-    if (!key || process.env[key]) continue // already set (e.g. from .env) wins
+    if (!key || process.env[key] !== undefined) continue // already set (including '') wins
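
The difference between the two checks shows up only for keys that are set but empty. The sketch below models it with a plain object instead of process.env; fillFromSsm and its strict flag are hypothetical names used to put both behaviors side by side.

```javascript
// Fill `env` from `params`, skipping keys that count as "already set".
// strict=false mimics the truthy check; strict=true treats '' as set.
function fillFromSsm(env, params, { strict }) {
  let loaded = 0
  for (const [key, value] of Object.entries(params)) {
    const alreadySet = strict ? Object.hasOwn(env, key) : Boolean(env[key])
    if (alreadySet) continue
    env[key] = value
    loaded++
  }
  return loaded
}

const params = { STACK_DOMAIN: 'dev.example.com', LOG_LEVEL: 'info' }

const loose = { STACK_DOMAIN: '' }
fillFromSsm(loose, params, { strict: false })
console.log(JSON.stringify(loose.STACK_DOMAIN)) // the empty value was replaced by SSM

const strictEnv = { STACK_DOMAIN: '' }
fillFromSsm(strictEnv, params, { strict: true })
console.log(JSON.stringify(strictEnv.STACK_DOMAIN)) // the intentional empty value survived
```

For the real process.env, either `process.env[key] !== undefined` or `key in process.env` expresses the strict check.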
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra/scripts/sst-with-cloudflare.mjs` around lines 99 - 103, The truthy
check on process.env[key] in the condition on line 101 causes empty-string
environment variables to be overwritten by SSM values, breaking the intended
precedence where already-set values should win. Replace the truthy check
`process.env[key]` with an existence check using the `in` operator (e.g., `key
in process.env`) to properly detect whether a key is already set in the
environment, regardless of whether its value is an empty string or any other
falsy value.

@law-chain-hot law-chain-hot force-pushed the feat/prod-enable branch 3 times, most recently from 11d51d0 to a83420b Compare June 22, 2026 15:44

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
apps/infra/sst.config.ts (1)

122-124: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Hardcoded AWS account ID limits portability.

The permission boundary ARN contains a hardcoded account ID (064212132677). If this infrastructure is ever deployed to a different AWS account, this will fail. Consider deriving the account ID dynamically.

♻️ Suggested improvement
   $transform(aws.iam.Role, (args) => {
-    args.permissionsBoundary ??= 'arn:aws:iam::064212132677:policy/boxlite-role-boundary'
+    args.permissionsBoundary ??= $interpolate`arn:aws:iam::${aws.getCallerIdentityOutput().accountId}:policy/boxlite-role-boundary`
   })
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra/sst.config.ts` around lines 122 - 124, The permissionsBoundary ARN
in the $transform function applied to aws.iam.Role contains a hardcoded AWS
account ID (064212132677), which breaks portability across different AWS
accounts. Replace the hardcoded account ID in the ARN string with a dynamic
reference to the current AWS account ID. You can obtain the current account ID
from the aws.getCallerIdentity() function or by using the aws.getAccountId()
helper function, and then construct the ARN string dynamically by interpolating
the retrieved account ID into the
arn:aws:iam::ACCOUNT_ID:policy/boxlite-role-boundary format.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b1900312-d2c2-44eb-a383-a7bf468dd8db

📥 Commits

Reviewing files that changed from the base of the PR and between 11d51d0 and a83420b.

📒 Files selected for processing (6)
  • .github/workflows/deploy.yml
  • .github/workflows/runner-rollout.yml
  • apps/infra/README.md
  • apps/infra/scripts/seed-prod-from-dev.mjs
  • apps/infra/sst.config.ts
  • scripts/deploy/runner-update-binary.sh
✅ Files skipped from review due to trivial changes (1)
  • apps/infra/README.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • apps/infra/scripts/seed-prod-from-dev.mjs
  • scripts/deploy/runner-update-binary.sh
  • .github/workflows/deploy.yml
  • .github/workflows/runner-rollout.yml

GitHub Actions deploy buttons for dev, OIDC-only (no static keys):
- deploy.yml: Cloud deploy (workflow_dispatch diff|deploy, environment: dev) — assumes
  boxlite-github-deployer-<stage>, runs sst diff/deploy with SSM-injected env.
- runner-rollout.yml: swap the boxlite-runner binary in place via SSM (release or
  dev-build from S3) with checksum verify + auto-rollback.
- runner-update-binary.sh: accept an S3 dev-channel source (separate sha256 URL).
- seed-deploy-env-ssm.mjs / sst-with-cloudflare.mjs: inject stage env + Cloudflare
  creds from SSM so CI matches a laptop .env.
- sst.config.ts: $transform stamps boxlite-role-boundary on every IAM role the app
  creates (satisfies the bounded deploy role's conditional create/manage grant);
  RunnerRole/Profile get explicit boxlite-dev-runner names so they fall under the
  deploy role's boxlite-* IAM scope.

Validated 2026-06-22: deploy --stage dev completes end-to-end; dev healthy
(8 services running, 5 ALBs active, api/dashboard 200, box create verified).
…sist-credentials)

- runner-rollout.yml: pass workflow_dispatch inputs (ref/instance_id, free-text) via
  env vars instead of ${{ }} interpolated into run: scripts, so a crafted input is a
  shell value rather than executable script (CodeRabbit/zizmor template-injection).
- deploy.yml + runner-rollout.yml: persist-credentials: false on checkout — these
  workflows auth to AWS via OIDC and never push to the repo, so the GITHUB_TOKEN
  doesn't need to linger in .git/config.

Left for a repo-wide follow-up (not this PR, for consistency): SHA-pin actions across
all privileged workflows + tighten the deployer OIDC trust from repo:*:* to
:environment:dev.
…ew M1/M2)

- Pin all third-party actions in the privileged deploy workflows to commit SHAs
  (actions/checkout, setup-node, setup-go, aws-actions/configure-aws-credentials)
  with version comments — these workflows hold OIDC→AWS access, so an action tag
  hijack is a supply-chain path. (M1)
- runner-rollout.yml Resolve inputs: write the free-text ref/instance_id to
  GITHUB_OUTPUT via heredoc delimiters so a newline-laden value can't inject extra
  output keys (e.g. forge a stage=). (M2; M3 command-injection was fixed in fff6b76)
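
The output-key injection that M2 closes can be illustrated with a small sketch. The workflow itself does this in shell; the JavaScript below only models the `GITHUB_OUTPUT` file format, and `formatOutput`, `naiveOutput`, and the fixed delimiter are hypothetical names chosen for the example (a real step would normally use a random delimiter).

```javascript
// Formats one key for $GITHUB_OUTPUT using the multiline (heredoc-style) form:
//   name<<DELIMITER
//   value
//   DELIMITER
function formatOutput(name, value, delimiter) {
  // A value containing the delimiter on its own line could close the block early.
  if (value.split('\n').includes(delimiter)) {
    throw new Error(`value for ${name} contains the delimiter`)
  }
  return `${name}<<${delimiter}\n${value}\n${delimiter}\n`
}

// Single-line form, shown only for contrast: a newline in the value starts a new key.
const naiveOutput = (name, value) => `${name}=${value}\n`

const craftedRef = 'main\nstage=prod' // newline-laden free-text input
console.log(naiveOutput('ref', craftedRef))
console.log(formatOutput('ref', craftedRef, 'EOF_7f3a9c'))
```

In the single-line form the second line parses as a separate `stage=prod` key; in the delimited form it stays part of the `ref` value.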

Follow-up (separate, not this PR): SHA-pin the remaining ~34 @v4/@v5 uses repo-wide,
and tighten the deployer OIDC trust from repo:*:* to :environment:dev.

Extends the existing one-click deploy + runner-rollout buttons to a second
SST stage (prod) in the same account, without changing dev behavior:

- sst.config.ts: REGION reads AWS_REGION (default ap-southeast-1); runner EC2
  Name tag is now stage-scoped (boxlite-<stage>-runner-default); isProd / removal
  key off stage === 'prod' so prod resources are named boxlite-prod-* (matching the
  deploy role's PassRole grant) and get RDS deletion-protection + final snapshot.
- deploy.yml / runner-rollout.yml: stage choice adds prod; a region input replaces
  the hardcoded ap-southeast-1; deploy gates on a per-stage GitHub Environment
  (environment: ${{ inputs.stage }}); runner-rollout S3 bucket is stage-scoped.
- Deploy role boxlite-app-<stage>-deploy assumed via OIDC (boxlite-app-dev-deploy /
  boxlite-app-prod-deploy). Created by the account owner (admin, unbounded), trust
  sub repo:boxlite-ai/boxlite:environment:<stage>; boxlite-bounded-role-admin must
  allow PassRole / AddRoleToInstanceProfile on role/boxlite-prod-* (today dev/e2e only).
- runner-update-binary.sh: tag lookup is stage-scoped with a dev-only legacy
  fallback and a multi-match guard.
- scripts/seed-prod-from-dev.mjs (new): copies a stage's SSM env to another (default
  dev -> prod), domain-rewriting STACK_DOMAIN and leaving Auth0/SSH/runner-ip blank
  for manual seeding. Dry-run by default.

dev flow is unchanged: region defaults to ap-southeast-1 and the dev GitHub
Environment already exists. Reviewed for dev-safety, region-correctness, and
GitHub Actions semantics.
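
The stage-scoped tag lookup with its dev-only fallback and multi-match guard might look roughly like the sketch below. The real logic lives in runner-update-binary.sh (shell); this JavaScript model is illustrative only. `findRunner`, the `{ id, name }` instance shape, and especially the legacy tag value `boxlite-runner-default` are assumptions, not taken from the script.

```javascript
// Assumed legacy (pre-stage-scoping) Name tag; the actual value is not stated in the PR.
const LEGACY_DEV_NAME = 'boxlite-runner-default'

// `instances` stands in for an EC2 describe result reduced to { id, name },
// where `name` is the instance's Name tag.
function findRunner(instances, stage) {
  const scopedName = `boxlite-${stage}-runner-default`
  let matches = instances.filter((i) => i.name === scopedName)
  // The legacy fallback applies to dev only, so prod never picks up an unscoped runner.
  if (matches.length === 0 && stage === 'dev') {
    matches = instances.filter((i) => i.name === LEGACY_DEV_NAME)
  }
  if (matches.length === 0) throw new Error(`no runner found for stage ${stage}`)
  // Multi-match guard: refuse to guess which instance to update.
  if (matches.length > 1) throw new Error(`${matches.length} runners match stage ${stage}`)
  return matches[0].id
}

const fleet = [
  { id: 'i-legacy', name: LEGACY_DEV_NAME },
  { id: 'i-prod', name: 'boxlite-prod-runner-default' },
]
console.log(findRunner(fleet, 'dev'), findRunner(fleet, 'prod'))
```

Failing on zero or multiple matches keeps a rollout from silently targeting the wrong stage's instance.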
Adds two backward-compatible knobs so a prod stage can use scheme B
(api/ssh/proxy on boxlite.ai, dashboard host on app.boxlite.ai) and deploy to
a different region than where its config lives — both default to today's dev
behavior, so dev is byte-for-byte unchanged.

- sst.config.ts: `servicesDomain = SERVICES_DOMAIN || stackDomain` now backs the
  serviceDomain() helper and the api/proxy/ssh hostnames + their env vars
  (BOXLITE_API_URL, PROXY_DOMAIN, PROXY_TEMPLATE_URL, SSH_GATEWAY_URL,
  DASHBOARD_BASE_API_URL, proxyDomain, the ssh Cloudflare CNAME). The dashboard
  host (Router domain, DASHBOARD_URL, OIDC end-session) stays on stackDomain.
  Adds a `dashboardPath = DASHBOARD_PATH || ''` passthrough (DASHBOARD_PATH env)
  for an upcoming path-based dashboard; '' keeps the dashboard at the root today.
- sst-with-cloudflare.mjs: SSM is read from a fixed region (BOXLITE_CONFIG_REGION
  || ap-southeast-1), decoupled from AWS_REGION, so a stage deployed to us-west-2
  reads its /boxlite/<stage>/* config + cloudflare creds from ap-southeast-1
  (the deploy role reads cross-region; a region-locked SSO can still seed it).

dev: SERVICES_DOMAIN/DASHBOARD_PATH/BOXLITE_CONFIG_REGION all unset -> identical
to before. prod-B seeds SERVICES_DOMAIN=boxlite.ai (+ STACK_DOMAIN=app.boxlite.ai).
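
A rough sketch of how the two knobs resolve, under stated assumptions: `resolveDomains` is a hypothetical function, and the `<name>.<servicesDomain>` hostname shape is a guess at what serviceDomain() produces, not a copy of sst.config.ts.

```javascript
// Resolves the split between the services domain and the dashboard host.
function resolveDomains(env) {
  const stackDomain = env.STACK_DOMAIN
  const servicesDomain = env.SERVICES_DOMAIN || stackDomain // falls back to today's behavior
  const dashboardPath = env.DASHBOARD_PATH || '' // '' keeps the dashboard at the root
  // Assumed hostname shape for the serviceDomain() helper.
  const serviceDomain = (name) => `${name}.${servicesDomain}`
  return {
    api: serviceDomain('api'),
    proxy: serviceDomain('proxy'),
    ssh: serviceDomain('ssh'),
    dashboardHost: stackDomain, // dashboard stays on stackDomain
    dashboardPath,
  }
}

const devDomains = resolveDomains({ STACK_DOMAIN: 'dev.example.test' })
const prodDomains = resolveDomains({ STACK_DOMAIN: 'app.boxlite.ai', SERVICES_DOMAIN: 'boxlite.ai' })
console.log(devDomains, prodDomains)
```

With nothing but `STACK_DOMAIN` set, every hostname derives from it as before; seeding `SERVICES_DOMAIN` moves only the api/proxy/ssh hosts.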
…eploy reader)

Adversarial review of the domain/SSM-decouple change found the seed scripts
(seed-deploy-env-ssm.mjs, seed-prod-from-dev.mjs) still read/write SSM from
AWS_REGION while sst-with-cloudflare.mjs now reads from BOXLITE_CONFIG_REGION —
so a stage seeded with AWS_REGION set to its deploy region would write config to
the wrong region and the deploy (reading ap-southeast-1) would not find it.

All three now key off BOXLITE_CONFIG_REGION || ap-southeast-1: config is a
region-fixed store (ap-southeast-1) independent of the resource deploy region
(AWS_REGION, still used by sst.config.ts for resources). Also refreshes the stale
sst.config.ts comment that claimed the SSM helpers follow AWS_REGION.
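
The region split can be summarized in a few lines. This is a minimal sketch with a hypothetical `resolveRegions` helper; only the env variable names and the ap-southeast-1 default come from the commit message.

```javascript
// Config (SSM) region and resource deploy region are resolved independently.
function resolveRegions(env) {
  return {
    // Where /boxlite/<stage>/* config is read and written (seed scripts + deploy reader).
    configRegion: env.BOXLITE_CONFIG_REGION || 'ap-southeast-1',
    // Where sst.config.ts creates resources.
    deployRegion: env.AWS_REGION || 'ap-southeast-1',
  }
}

const usWestStage = resolveRegions({ AWS_REGION: 'us-west-2' })
console.log(usWestStage)
```

Setting `AWS_REGION` alone changes only where resources are deployed; the config store stays in ap-southeast-1 unless `BOXLITE_CONFIG_REGION` is set explicitly.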
…d-B)

Lets the dashboard SPA serve under a sub-path (e.g. app.boxlite.ai/dashboard for
prod scheme B) while leaving dev byte-for-byte unchanged. All knobs default to
'' / '/' so the dev root flow is identical to before.

The path lives in ONE place per layer and flows end-to-end:
- vite.config.mts: `base = VITE_DASHBOARD_PATH ? <path>/ : '/'`. Vite rewrites every
  built asset URL (JS/CSS/imported images) to be path-prefixed.
- index.html: favicon `./favicon.ico` (relative) so it resolves correctly under
  either mount point — Vite does NOT rewrite raw <link href="/..."> entries.
- main.tsx: `<BrowserRouter basename={import.meta.env.BASE_URL}>` so client-side
  routes are relative to the mount.
- ConfigProvider.tsx: `redirect_uri = origin + BASE_URL.replace(/\/$/, '')` so
  Auth0 callback matches the SPA's mount URL exactly (e.g. https://app.boxlite.ai/dashboard).
- app.module.ts: ServeStaticModule.renderPath now reads DASHBOARD_PATH env (defaults
  to '/'); must match the Vite base the bundle was built with.
- Dockerfile: `ARG DASHBOARD_PATH=` feeds VITE_DASHBOARD_PATH into the dashboard
  build step, so the bundle ships with the correct asset prefix baked in.
- sst.config.ts: passes dashboardPath as Api image `args: { DASHBOARD_PATH }`,
  closing the loop from SSM (DASHBOARD_PATH) -> build -> serve.

Verified locally: built dashboard twice with VITE_DASHBOARD_PATH=/dashboard and
unset, served via python http-server, curl 200 on /dashboard/, /dashboard/index.html,
/dashboard/assets/index-*.js, /dashboard/favicon.ico (prefix injected) and confirmed
the default build keeps /assets/* (root unchanged).

prod-B will deploy this with SSM /boxlite/prod/env/DASHBOARD_PATH=/dashboard.
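
As a quick check of how the path flows through the layers, the two expressions quoted above can be exercised in isolation. `viteBase` and `redirectUri` are hypothetical wrappers around those expressions, and the sketch assumes the configured path has a leading slash and no trailing slash.

```javascript
// Mirrors `base = VITE_DASHBOARD_PATH ? <path>/ : '/'` from vite.config.mts.
const viteBase = (dashboardPath) => (dashboardPath ? `${dashboardPath}/` : '/')

// Mirrors `origin + BASE_URL.replace(/\/$/, '')` from ConfigProvider.tsx:
// strips the trailing slash so the callback URL matches the mount URL exactly.
const redirectUri = (origin, baseUrl) => origin + baseUrl.replace(/\/$/, '')

const prodBase = viteBase('/dashboard')
const devBase = viteBase('')
console.log(prodBase, redirectUri('https://app.boxlite.ai', prodBase))
console.log(devBase, redirectUri('https://dev.example.test', devBase))
```

The same base value also has to match the server's render path, otherwise the built asset URLs and the served routes disagree.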
$transform(aws.iam.InstanceProfile, ...) defaults args.name to
${$app.name}-${$app.stage}-${resourceName} for any InstanceProfile that didn't
set an explicit name (matches the existing aws.iam.Role boundary transform's pattern).

The boxlite-app-prod-deploy inline policy AllowManageInstanceProfiles only allows
iam:CreateInstanceProfile on instance-profile/boxlite-prod-*. SST's VPC NAT helper
auto-names its profile 'VpcNatInstanceProfile-<hash>' which falls outside that scope
→ deploy fails. With this transform it becomes 'boxlite-prod-VpcNatInstanceProfile',
which matches.

Explicit names (e.g. RunnerProfile with name: `${app.name}-${app.stage}-runner`)
win because of `??=`. Dev unchanged: same naming pattern, dev's policy is broader.

The boxlite-<stage>-* InstanceProfile naming transform forced a rename of
pre-existing auto-named profiles on stages that predate it (dev's
VpcNatInstanceProfile-<hash>). The rename triggers RemoveRoleFromInstanceProfile
on the OLD name, which a stage's boxlite-<stage>-*-scoped deploy role cannot
perform -> iam:RemoveRoleFromInstanceProfile AccessDenied, failing the dev deploy.

prod was a fresh deploy so its profiles are created under the scoped name from the
start; gating the transform to prod leaves other stages on SST's default name,
a no-op against their existing state, and keeps prod's scoped-role grant satisfied.
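
The resulting behavior (default naming via `??=`, gated to prod) can be modeled as a plain function. This is a sketch only: `nameInstanceProfile` and its `(args, app, resourceName)` signature are hypothetical and do not reproduce the actual `$transform` callback in sst.config.ts.

```javascript
// Models the naming transform: only prod gets a scoped default name,
// and an explicitly set name is never overwritten.
function nameInstanceProfile(args, app, resourceName) {
  if (app.stage !== 'prod') return args // other stages keep SST's auto-generated name
  args.name ??= `${app.name}-${app.stage}-${resourceName}`
  return args
}

const prodApp = { name: 'boxlite', stage: 'prod' }
const devApp = { name: 'boxlite', stage: 'dev' }

const natProfile = nameInstanceProfile({}, prodApp, 'VpcNatInstanceProfile')
const runnerProfile = nameInstanceProfile({ name: 'boxlite-prod-runner' }, prodApp, 'RunnerProfile')
const devNatProfile = nameInstanceProfile({}, devApp, 'VpcNatInstanceProfile')
console.log(natProfile, runnerProfile, devNatProfile)
```

Leaving `name` unset on non-prod stages is what makes the transform a no-op against their existing state.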