feat(deploy): one-click Cloud + Runner deploy buttons (lite MVP)#834

Open
law-chain-hot wants to merge 25 commits into main from feat/lite-deploy-button

@law-chain-hot law-chain-hot commented Jun 22, 2026

Contributor

What

One-click GitHub Actions deploy buttons for the dev stage — OIDC-only, no static AWS keys. The "lite MVP" of the release flow: change a thing → click its button → watch the green check.

| Button | File | Does |
| --- | --- | --- |
| Cloud Deploy | deploy.yml | workflow_dispatch (stage + diff or deploy mode, environment: dev) → assume boxlite-github-deployer-<stage> → sst diff/deploy with SSM-injected env |
| Runner Rollout | runner-rollout.yml | swap the boxlite-runner binary in place via SSM (release or dev-build from S3) → checksum verify → restart → auto-rollback |

Supporting changes:

  • runner-update-binary.sh — accept an S3 dev-channel source (separate sha256 presigned URL).
  • seed-deploy-env-ssm.mjs / sst-with-cloudflare.mjs — inject the stage's env + Cloudflare creds from SSM so CI matches a laptop .env.
  • sst.config.ts:
    • $transform stamps boxlite-role-boundary on every IAM role the app creates, satisfying the bounded deploy role's conditional create/manage grant.
    • RunnerRole / RunnerProfile get explicit boxlite-dev-runner names so they fall under the deploy role's boxlite-* IAM scope.

IAM (one-time, account admin)

boxlite-github-deployer-<stage> carries boxlite-bounded-role-admin (CreateRole / PutRolePermissionsBoundary / PassRole on boxlite-*, conditioned on iam:PermissionsBoundary == boxlite-role-boundary) + IAMReadOnlyAccess. The $transform above is what satisfies that condition.
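
The default-boundary behaviour can be sketched as a plain JavaScript function. This is only an illustration of the rule "set the boundary when the role does not already have one", not the actual SST `$transform` call in sst.config.ts. The helper name `withDefaultBoundary`, the account ID, and the exact ARN shape are assumptions made for the example.

```javascript
// Hypothetical helper (not from the PR): returns role arguments with a
// default permissions boundary applied only when none was specified.
function withDefaultBoundary(roleArgs, boundaryArn) {
  if (roleArgs.permissionsBoundary) return roleArgs
  return { ...roleArgs, permissionsBoundary: boundaryArn }
}

// Placeholder account ID; only the policy name comes from the PR description.
const BOUNDARY_ARN = 'arn:aws:iam::123456789012:policy/boxlite-role-boundary'

// A role created without a boundary picks up the default.
const bounded = withDefaultBoundary({ name: 'boxlite-dev-runner' }, BOUNDARY_ARN)
console.log(bounded.permissionsBoundary)
```

Under this sketch a role that already declares its own boundary is returned unchanged, which mirrors the "every role that doesn't already specify one" wording in the review walkthrough.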

Validated (2026-06-22)

deploy --stage dev ran end-to-end via the GitHub Action and completed:

  • ✅ all 8 ECS services running, 5 ALBs active
  • api.dev/proxy.dev/ssh.dev/dev.boxlite.ai all resolve; api/health + dashboard return 200
  • boxlite create → control-plane Running + runner krun VM verified via SSM
  • ✅ follow-up sst diff is clean (no destructive churn) → repeatable

Runner Rollout was separately validated end-to-end (build → S3 → SSM → active → box).

Not in this PR

N4 (DB migrations off-boot), Release-All orchestration, prod stage — tracked separately.

Summary by CodeRabbit

  • New Features
    • Added manual workflows to deploy (or run a diff) to the dev stage and to roll out/upgrade or roll back the runner binary for selected stages/instances.
    • Introduced tooling to build and distribute versioned runner artifacts.
  • Chores
    • Added helpers to seed stage environment variables into SSM and automatically load them during SST runs.
    • Updated documentation and npm/make commands for runner rollout.
  • Bug Fixes
    • Improved runner rollout/upgrade safety with targeted selection, integrity checking, and rollback on service startup failure.
    • Standardized IAM role/profile naming and applied default permissions boundaries when missing.

@law-chain-hot law-chain-hot requested a review from a team as a code owner June 22, 2026 06:42

coderabbitai Bot commented Jun 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Walkthrough

Adds GitHub Actions workflows for SST infra deployment (Deploy) and runner binary rollout (Runner Rollout) with dev-build and release delivery paths. Introduces seed-deploy-env-ssm.mjs to seed .env values to AWS SSM and loadEnvFromSsm() helper in sst-with-cloudflare.mjs to load them back. Updates sst.config.ts with IAM permissions boundary transform and deterministic runner resource naming. Replaces legacy runner-update-binary.sh with new rollout-runner-binary.mjs that deploys via SSM. Adds make dist:runner build target and npm run runner:rollout command.

Changes

Deployment Automation Infrastructure

| Layer | File(s) | Summary |
| --- | --- | --- |
| SSM environment variable seeding and loading | apps/infra/scripts/seed-deploy-env-ssm.mjs, apps/infra/scripts/sst-with-cloudflare.mjs | seed-deploy-env-ssm.mjs parses .env files, filters excluded keys (AWS_PROFILE, Cloudflare secrets), and writes key/value pairs to SSM SecureString parameters under /boxlite/<stage>/env/ with --dry-run support and secure logging (key names only, never values). sst-with-cloudflare.mjs gains loadEnvFromSsm(stage) which fetches those parameters via aws ssm get-parameters-by-path and populates process.env only for unset keys, tolerating missing AWS CLI and parse failures. |
| SST configuration: IAM permissions boundary and deterministic resource names | apps/infra/sst.config.ts | Registers an SST $transform on aws.iam.Role to set a default permissionsBoundary ARN on every role that doesn't already specify one. RunnerRole and RunnerProfile (instance profile) now declare explicit stage-scoped names (${app.name}-${app.stage}-runner) instead of SST-generated ones. Comment updated to reference new rollout-runner-binary.mjs script. |
| Deploy GitHub Actions workflow | .github/workflows/deploy.yml | New manually triggered workflow: accepts stage (dev) and mode (diff/deploy) inputs, grants GitHub OIDC permissions, assumes boxlite-github-deployer-<stage> role in ap-southeast-1, checks out repo without persisting credentials, sets up Node.js 22, installs apps/infra dependencies, and conditionally runs sst diff or sst deploy. Per-stage concurrency with cancel-in-progress: false. |
| Runner Rollout workflow: setup, inputs, and config resolution | .github/workflows/runner-rollout.yml | Manually triggered with stage, source (dev-build or release), ref, mode (upgrade or rollback), and optional instance_id inputs. Grants GitHub OIDC permissions, configures per-stage concurrency with S3 bucket environment. Wires config reusable job, resolves dispatch inputs into outputs, validates that rollback requires source=release, and checks out target git ref (only for dev-build when ref is non-empty). |
| Runner Rollout workflow: dev-build tarball building | .github/workflows/runner-rollout.yml | For dev-build path: sets up Node.js toolchain (with corepack and yarn 4), installs system dependencies, derives runner version from Cargo.toml and git SHA. Runs make dist:runner to build runner tarball, discovers the tarball filename under dist/runner/, asserts existence, and exports filename as job output for S3 upload. |
| Runner Rollout workflow: AWS auth, S3 upload, presigning, and npm dispatch | .github/workflows/runner-rollout.yml | Assumes stage-scoped AWS role via GitHub OIDC. For dev-build: uploads tarball and .sha256 to S3, generates 30-minute presigned URLs, masks them in logs, exports as outputs. Runs npm run runner:rollout for both paths: dev-build provides presigned URLs via BOXLITE_RUNNER_TARBALL_URL and BOXLITE_RUNNER_SHA256_URL env vars; release provides optional ref CLI argument. Both support optional instance_id pinning. |
| Runner rollout script: orchestration and SSM deployment | apps/infra/scripts/rollout-runner-binary.mjs | Reads AWS/stage/instance/tarball env vars, wraps aws CLI calls, extracts Rust crate version from Cargo.toml, provides shell-quoting for URL embedding. buildRemoteScript() generates a bash script that downloads tarball, conditionally verifies checksum, extracts binary, backs up existing binary, stops/starts systemd service, and rolls back with journal logs on failure. Main orchestration resolves target EC2 instance (from env var or EC2 tag query), selects asset URL (dev or release), base64-encodes remote script, submits via aws ssm send-command, waits for completion, and reports output/error status. |
| Build infrastructure and command wiring | make/dist.mk, make/help.mk, apps/infra/package.json, apps/infra/README.md | make/dist.mk introduces dist:runner target that extracts version from Cargo.toml, builds C/Go/Node components, packages runner binary into versioned tar.gz, and generates sha256 checksum file. apps/infra/package.json adds runner:rollout npm script. make/help.mk documents dist:runner. apps/infra/README.md updated to reference npm run runner:rollout instead of legacy script. |
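
The rollout-script row mentions shell-quoting URLs and base64-encoding the generated remote script before it goes to SSM. A rough sketch of those two steps follows. The names `shellQuote` and `encodeRemoteScript`, the example URL, and the one-line remote script are all hypothetical; the real rollout-runner-binary.mjs may structure this differently.

```javascript
// Hypothetical helper: wrap a value in single quotes for a POSIX shell,
// rewriting each embedded single quote as '\'' so it cannot end the string.
function shellQuote(value) {
  return `'${String(value).replace(/'/g, `'\\''`)}'`
}

// Hypothetical helper: base64-encode a script so it can travel as a single
// opaque argument and be decoded on the target host.
function encodeRemoteScript(script) {
  return Buffer.from(script, 'utf8').toString('base64')
}

// Example URL with characters (quote, ampersand) a shell would otherwise act on.
const url = "https://example.invalid/runner.tar.gz?sig=a'b&x=1"
const script = `set -euo pipefail\ncurl -fsSL ${shellQuote(url)} -o /tmp/runner.tar.gz\n`
const payload = encodeRemoteScript(script)
console.log(payload.length > 0)
```

Quoting keeps the query string of a presigned URL from being split or interpreted by the remote shell, and the base64 step avoids a second layer of escaping inside the SSM command parameters.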

Sequence Diagram(s)

sequenceDiagram
  participant Developer
  participant GHA as GitHub Actions
  participant S3 as S3 Bucket
  participant OIDC as AWS IAM/OIDC
  participant SSM as AWS SSM
  participant EC2 as Runner EC2

  rect rgba(100, 149, 237, 0.5)
    Note over Developer,EC2: dev-build path
    Developer->>GHA: Trigger Runner Rollout (source=dev-build)
    GHA->>GHA: make dist:runner
    GHA->>GHA: Build daemon, computer-use, boxlite-runner binaries
    GHA->>GHA: Package tarball + sha256
    GHA->>OIDC: Assume boxlite-github-deployer-dev via OIDC
    OIDC-->>GHA: Temporary credentials
    GHA->>S3: Upload tarball + sha256
    GHA->>S3: Generate presigned URLs (30 min TTL)
    S3-->>GHA: BOXLITE_RUNNER_TARBALL_URL, BOXLITE_RUNNER_SHA256_URL
    GHA->>GHA: npm run runner:rollout (with presigned URLs)
    GHA->>SSM: aws ssm send-command: AWS-RunShellScript
  end

  rect rgba(144, 238, 144, 0.5)
    Note over Developer,SSM: release path
    Developer->>GHA: Trigger Runner Rollout (source=release, ref=v1.2.3)
    GHA->>OIDC: Assume boxlite-github-deployer-<stage> via OIDC
    OIDC-->>GHA: Temporary credentials
    GHA->>GHA: npm run runner:rollout --ref v1.2.3
    GHA->>SSM: aws ssm send-command: AWS-RunShellScript
  end

  SSM->>EC2: Execute remote bash: download + verify + install + rollback on fail
  EC2->>EC2: Download tarball (presigned or GitHub release)
  EC2->>EC2: Verify checksum if available
  EC2->>EC2: Extract binary, backup existing, systemd restart
  EC2-->>SSM: boxlite-runner --version (stdout) or rollback + logs (stderr)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • boxlite-ai/boxlite#793: Introduces SHA-256 integrity verification for the boxlite-runner binary; this PR builds on that by implementing the download-and-verify flow in the new rollout-runner-binary.mjs script.
  • boxlite-ai/boxlite#794: Updates runner binary deployment logic; this PR removes the previously-hardened scripts/deploy/runner-update-binary.sh and replaces it with the new SSM-driven rollout infrastructure.
  • boxlite-ai/boxlite#819: Updates EC2 instance lookup and runner tag conventions; this PR's rollout-runner-binary.mjs continues targeting the boxlite-runner-default Name tag convention.

Suggested reviewers

  • DorianZheng
  • G4614

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the main objective of the PR: introducing one-click GitHub Actions deploy buttons (Cloud Deploy and Runner Rollout workflows) as a lightweight MVP for the dev stage. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bebd8ecbfe

  rollout:
    name: Roll out runner
    needs: config
    runs-on: ubuntu-latest

[P2] Put runner rollouts behind the dev environment

For the dev rollout path, this job assumes the same boxlite-github-deployer-dev role and restarts the live runner, but unlike .github/workflows/deploy.yml:48 it never enters the dev environment. In installations where that environment enforces reviewers or the AWS OIDC trust is scoped to environment:dev, this rollout either bypasses the approval gate or fails to assume the role; add environment: dev here (or a stage-mapped environment when more stages exist) before configuring AWS credentials.

}
}

loadEnvFromSsm(stage)

[P2] Load local .env before SSM env injection

When npm run deploy is run locally with AWS access and a .env override that differs from /boxlite/<stage>/env, this call fills process.env from SSM before sst.config.ts later calls dotenv.config(), and dotenv does not override existing variables by default. That makes SSM values win over the laptop .env despite the comments saying local .env is unchanged; load dotenv in this wrapper first or only SSM-fill keys absent from both the shell and .env.
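
The precedence this comment asks for (shell first, then the local .env, then SSM only for keys still missing) can be sketched as a pure merge function. The name `mergeEnv` and the sample keys are hypothetical, and the sketch does not call dotenv or the AWS CLI; it only shows the ordering.

```javascript
// Hypothetical merge: values already in the shell win, then the local .env,
// and SSM only fills keys that neither of those defined.
function mergeEnv(shellEnv, dotenvVars, ssmVars) {
  const merged = { ...shellEnv }
  for (const source of [dotenvVars, ssmVars]) {
    for (const [key, value] of Object.entries(source)) {
      if (merged[key] === undefined) merged[key] = value
    }
  }
  return merged
}

// Illustrative keys only; none of these names come from the PR.
const merged = mergeEnv(
  { API_URL: 'from-shell' },
  { API_URL: 'from-dotenv', DB_HOST: 'from-dotenv' },
  { API_URL: 'from-ssm', DB_HOST: 'from-ssm', CF_ZONE: 'from-ssm' },
)
console.log(merged)
```

With this ordering a laptop .env override survives even when the same key exists under /boxlite/<stage>/env, while CI (which has no .env) still receives every value from SSM.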

Comment thread .github/workflows/runner-rollout.yml Outdated
        env:
          STAGE: ${{ steps.in.outputs.stage }}
          BOXLITE_RUNNER_INSTANCE_ID: ${{ steps.in.outputs.instance_id }}
        run: bash scripts/deploy/runner-update-binary.sh "${{ steps.in.outputs.ref }}"

[P2] Avoid passing a blank release version

When source=release is selected and the optional ref input is left blank, this quoted empty string is still passed as one positional argument, so runner-update-binary.sh sets VERSION='' instead of using its Cargo.toml fallback. The rollout then tries to fetch .../releases/download/v/boxlite-runner-v-linux-amd64.tar.gz and fails before updating; either require ref for release or omit the argument when it is blank.
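
The second option (omit the argument when it is blank) amounts to building the argument list conditionally. The sketch below shows that logic in JavaScript with a hypothetical `buildRolloutArgs` helper; the workflow itself invokes the script from a bash `run:` step, so this illustrates the rule rather than the actual fix.

```javascript
// Hypothetical argv builder: the ref is appended only when it is non-blank,
// so the script's own Cargo.toml fallback can take effect otherwise.
function buildRolloutArgs(ref) {
  const args = ['scripts/deploy/runner-update-binary.sh']
  const trimmed = (ref ?? '').trim()
  if (trimmed !== '') args.push(trimmed)
  return args
}

console.log(buildRolloutArgs(''))
console.log(buildRolloutArgs('v1.2.3'))
```

An empty or whitespace-only input then yields a one-element list, so the script never sees an empty positional argument.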

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (6)
.github/workflows/deploy.yml (1)

46-48: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Environment is hardcoded to dev regardless of stage input.

Line 48 hardcodes environment: dev, but the stage input allows selecting different stages. If more stages are added later (e.g., prod), the environment protection rules won't match.

🔧 Proposed fix to use dynamic environment
   deploy:
     runs-on: ubuntu-latest
-    environment: dev
+    environment: ${{ inputs.stage }}
     steps:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/deploy.yml around lines 46 - 48, The deploy job has the
environment field hardcoded to dev instead of using the dynamic stage input.
Replace the hardcoded dev value in the environment field with a reference to the
inputs.stage variable to ensure the correct environment is selected based on the
stage input parameter passed when triggering the workflow.
apps/infra/scripts/sst-with-cloudflare.mjs (1)

93-97: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Silent JSON parse failure may mask SSM response issues.

If the AWS CLI returns unexpected output (e.g., an error message instead of JSON), the function silently returns without any indication. Adding a warning would help diagnose CI failures.

🔧 Proposed fix to log parse errors
   let params
   try {
     params = JSON.parse(json)
-  } catch {
+  } catch (e) {
+    console.warn(`sst-with-cloudflare: failed to parse SSM response for ${path}*: ${e.message}`)
     return
   }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra/scripts/sst-with-cloudflare.mjs` around lines 93 - 97, The catch
block following the JSON.parse(json) statement silently returns without logging,
which masks potential AWS CLI output issues. Add a warning log statement in the
catch block that captures and displays both the parsing error and the
problematic JSON input. This will help diagnose CI failures by making it clear
when AWS returns unexpected output instead of valid JSON. Include both the error
message and the actual response data that failed to parse.
apps/infra/scripts/seed-deploy-env-ssm.mjs (1)

69-89: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Consider non-zero exit on partial write failures.

The script logs failures but exits successfully even when some keys fail to write. In CI, this could mask issues where critical secrets weren't seeded properly.

🔧 Proposed fix to fail on partial errors
 let ok = 0
+let failed = 0
 for (const [key, val] of pairs) {
   const name = `/boxlite/${stage}/env/${key}`
   if (DRY) {
     console.log(`  would put ${name}`)
     continue
   }
   try {
     execFileSync(
       'aws',
       ['ssm', 'put-parameter', '--region', REGION, '--name', name, '--type', 'SecureString', '--overwrite', '--value', val],
       { stdio: ['ignore', 'ignore', 'pipe'] }, // never echo the value or the version output
     )
     console.log(`  ✓ ${key}`)
     ok++
   } catch (err) {
     const msg = (err.stderr ? err.stderr.toString() : err.message).split('\n')[0]
     console.error(`  ✗ ${key}: ${msg}`)
+    failed++
   }
 }
 console.log(DRY ? 'seed-deploy-env-ssm: dry-run, nothing written' : `seed-deploy-env-ssm: wrote ${ok}/${pairs.length}`)
+if (failed > 0) process.exit(1)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra/scripts/seed-deploy-env-ssm.mjs` around lines 69 - 89, The script
logs failures when writing parameters to SSM but exits with status 0 regardless
of whether all parameters were successfully written. After the loop over pairs
completes, check if the ok counter equals pairs.length, and if they do not match
(indicating one or more write failures), call process.exit(1) to signal failure
to CI. This check should occur after the final console.log statement to ensure
the summary message is printed before exiting with a non-zero status code.
.github/workflows/runner-rollout.yml (3)

137-140: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Hardcoded Go version may drift from config.outputs.go-version.

Line 106 uses ${{ needs.config.outputs.go-version }} for setup-go, but line 140 hardcodes go 1.25.4 in the go.work rewrite. If the project's Go version changes, this hardcoded value could become stale.

Consider deriving from the config output:

♻️ Proposed fix
       - name: Rewrite go.work for minimal modules
         if: ${{ steps.in.outputs.source == 'dev-build' }}
+        env:
+          GO_VERSION: ${{ needs.config.outputs.go-version }}
         run: |
-          printf 'go 1.25.4\n\nuse (\n\t./runner\n\t./daemon\n\t./common-go\n\t./api-client-go\n\t./libs/computer-use\n\t../sdks/go\n)\n' > apps/go.work
+          printf 'go %s\n\nuse (\n\t./runner\n\t./daemon\n\t./common-go\n\t./api-client-go\n\t./libs/computer-use\n\t../sdks/go\n)\n' "$GO_VERSION" > apps/go.work
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/runner-rollout.yml around lines 137 - 140, The "Rewrite
go.work for minimal modules" step hardcodes the Go version as "go 1.25.4" in the
go.work file, but the workflow uses a dynamic Go version from
needs.config.outputs.go-version in the setup-go step. Replace the hardcoded
version string "1.25.4" with a reference to the same config output variable used
in the setup-go action to ensure the go.work file stays synchronized with the
configured Go version whenever it changes in the future.

91-91: 🧹 Nitpick | 🔵 Trivial | ⚖️ Poor tradeoff

Consider pinning actions to full SHA for supply chain security.

Actions at lines 91, 104, and 189 use version tags (@v4, @v5) which could be force-pushed. Pinning to commit SHA provides immutability:

- uses: actions/checkout@<full-sha>  # v4

This is a defense-in-depth measure; version tags are generally acceptable if the organization trusts the action maintainers.

Also applies to: 104-104, 189-189

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/runner-rollout.yml at line 91, The actions/checkout action
at line 91 uses a version tag (`@v4`) instead of a full commit SHA, which could be
vulnerable to force-push attacks. Replace the version tag references in the
actions/checkout usage at line 91, the action at line 104 (`@v5`), and the action
at line 189 with their corresponding full commit SHAs to ensure immutability and
improve supply chain security. You can find the full SHA for each action version
on the action's GitHub release page.

Source: Linters/SAST tools


91-94: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Set persist-credentials: false to avoid leaving Git credentials in the runner.

While this workflow doesn't upload artifacts via actions/upload-artifact, setting persist-credentials: false is a defense-in-depth practice that prevents the GITHUB_TOKEN from persisting in the .git/config.

🔒 Proposed fix
       - uses: actions/checkout@v4
         with:
           ref: ${{ (steps.in.outputs.source == 'dev-build' && steps.in.outputs.ref != '') && steps.in.outputs.ref || github.ref }}
+          persist-credentials: false
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/runner-rollout.yml around lines 91 - 94, The
actions/checkout@v4 step is missing the persist-credentials security
configuration. Add persist-credentials: false to the with section of the
actions/checkout@v4 step, placing it alongside the existing ref parameter to
prevent the GITHUB_TOKEN from persisting in the .git/config file as a
defense-in-depth security practice.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/deploy.yml:
- Around line 50-60: Pin the GitHub Actions to specific commit SHAs for supply
chain security. Update the aws-actions/configure-aws-credentials action to use
v4.3.1 with its corresponding SHA instead of the generic v4 reference.
Additionally, add persist-credentials: false to the actions/checkout action to
prevent credential leakage through artifacts when using OIDC authentication.

In @.github/workflows/runner-rollout.yml:
- Around line 76-84: The "Resolve inputs" step has a template injection
vulnerability where the freeform input values for ref and instance_id are
directly expanded into GITHUB_OUTPUT without sanitization, allowing newlines or
special characters to inject arbitrary output variables. Replace the direct echo
statements for these two inputs (the lines writing ref and instance_id to
GITHUB_OUTPUT) with environment variable assignments using heredoc delimiters to
safely handle potentially malicious or multi-line input values.
- Around line 219-224: The steps.in.outputs.ref value is being passed directly
to the bash script as a command-line argument in the run command, which creates
a command injection vulnerability if the ref contains shell metacharacters. Move
steps.in.outputs.ref into an environment variable by adding it to the env block
(similar to how STAGE and BOXLITE_RUNNER_INSTANCE_ID are defined), then modify
the run command to pass the value as an environment variable reference instead
of a direct shell argument, and ensure the runner-update-binary.sh script reads
it from the environment variable to prevent shell interpretation of any special
characters.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: fb1ddba8-dfbf-4b66-b65c-386069865c58

📥 Commits

Reviewing files that changed from the base of the PR and between 77e932c and bebd8ec.

📒 Files selected for processing (6)
  • .github/workflows/deploy.yml
  • .github/workflows/runner-rollout.yml
  • apps/infra/scripts/seed-deploy-env-ssm.mjs
  • apps/infra/scripts/sst-with-cloudflare.mjs
  • apps/infra/sst.config.ts
  • scripts/deploy/runner-update-binary.sh

Comment thread .github/workflows/deploy.yml Outdated
Comment thread .github/workflows/runner-rollout.yml
Comment thread .github/workflows/runner-rollout.yml Outdated
@law-chain-hot

Contributor Author

Addressed the SAST review findings:

  • M3 (command injection via steps.in.outputs.ref → shell arg, runner-rollout.yml): fixed in fff6b762 — passed via env: REF and invoked as "$REF".
  • M2 (template injection of free-text inputs → GITHUB_OUTPUT): fixed in 6f9aaad2 — inputs now flow through env: vars, and the free-text ref/instance_id are written with heredoc delimiters so an embedded newline can't inject extra output keys (e.g. forge a stage=).
  • M1 (pin actions to commit SHA): pinned all third-party actions in both privileged deploy workflows (actions/checkout, setup-node, setup-go, aws-actions/configure-aws-credentials) to commit SHAs with version comments, in 6f9aaad2. Rationale: these workflows hold OIDC→AWS access, so a tag hijack is a real supply-chain path.
  • Also added persist-credentials: false to checkout in both workflows.

Deferred to a separate follow-up (to keep this PR focused and avoid a half-pinned repo): SHA-pin the remaining ~34 @v4/@v5 uses repo-wide, and tighten the deployer OIDC trust from repo:<org>/<repo>:* to :environment:dev.
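
The heredoc-delimiter format used for the M2 fix can be illustrated with a small formatter. GitHub Actions accepts multiline output values written as `name<<DELIMITER`, the value, then the delimiter on its own line. The function name `formatOutput` and the fixed delimiter below are assumptions for the example; the workflow itself does this in bash, and in practice the delimiter should be hard to guess.

```javascript
// Hypothetical formatter for one GITHUB_OUTPUT entry in the
// name<<DELIMITER ... DELIMITER multiline form.
function formatOutput(name, value, delimiter) {
  const text = String(value)
  // A value line equal to the delimiter would close the block early.
  if (text.split('\n').includes(delimiter)) {
    throw new Error(`value for ${name} contains the delimiter`)
  }
  return `${name}<<${delimiter}\n${text}\n${delimiter}\n`
}

// A ref with an embedded newline that tries to forge a second output key.
const hostileRef = 'v1.2.3\nstage=prod'
const entry = formatOutput('ref', hostileRef, 'EOF_7f3a9c')
console.log(entry)
```

Because everything between the two delimiter lines is read as the value of `ref`, the `stage=prod` line stays data instead of becoming a separate output key.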

Comment thread scripts/deploy/runner-update-binary.sh Outdated
@@ -13,6 +13,8 @@
# Usage:

Member

remove and rewrite as apps/infra scripts

Contributor Author

will do

      - name: sst deploy
        if: ${{ inputs.mode == 'deploy' }}
        working-directory: apps/infra
        run: npm run deploy -- --stage ${{ inputs.stage }}

Member

How are the images built in GitHub Actions?

Contributor Author

Using Docker, which is defined in SST.

Comment thread .github/workflows/runner-rollout.yml Outdated
# ("undefined reference") — see 2026-06-20 validation. A true dev-build of
# unreleased source must build libboxlite.a from the Rust core here instead of
# downloading it. TODO: add the Rust libboxlite build for the dev-build path.
- name: Download prebuilt libboxlite.a

Member

How do we build from main?

Contributor Author

Not supported yet

Comment thread .github/workflows/runner-rollout.yml Outdated
go -C apps/api-client-go mod download
go -C apps/libs/computer-use mod download

- name: Build daemon

Member

remove code and step

Contributor Author

will do

Comment thread .github/workflows/runner-rollout.yml Outdated
-o runner/pkg/daemon/static/daemon-amd64 \
./daemon/cmd/daemon/

- name: Build computer-use

Member

remove code and step

Contributor Author

will do

Comment thread .github/workflows/runner-rollout.yml Outdated
-o runner/pkg/daemon/static/boxlite-computer-use \
./libs/computer-use/

- name: Build runner

Member

add a make target

Contributor Author

will do

* node scripts/seed-deploy-env-ssm.mjs --dry-run # list keys, write nothing
*/

import { execFileSync } from 'node:child_process'

Member

How do we modify .env and review the change before triggering the deployment action?

Contributor Author

Edit the values directly in AWS SSM Parameter Store; there is no review step for now.

> /boxlite/dev/env/KEY = VALUE
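
To illustrate the naming scheme above, here is a minimal sketch of how parameters under that path could be mapped to environment variables. `parametersToEnv` is a hypothetical helper, not the code in `seed-deploy-env-ssm.mjs` or `sst-with-cloudflare.mjs`; generalizing `dev` to a `stage` argument is an assumption, and the input is assumed to have the `Name`/`Value` shape that SSM returns for each parameter.

```javascript
// Hypothetical helper (not the PR's implementation): map SSM parameters
// named /boxlite/<stage>/env/KEY to an { KEY: VALUE } object.
function parametersToEnv(parameters, stage) {
  const prefix = `/boxlite/${stage}/env/`
  const env = {}
  for (const { Name, Value } of parameters) {
    if (!Name.startsWith(prefix)) continue // other stage or other subtree
    const key = Name.slice(prefix.length)
    // Skip nested paths and names that are not valid env identifiers.
    if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(key)) continue
    env[key] = Value
  }
  return env
}
```

Because there is no review step yet, printing only `Object.keys(env)` before a deploy (as the `--dry-run` mode of the seed script does for keys) is one way to see what will be injected without exposing values in logs.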

#
# ⚠️ Rolling the binary RESTARTS the runner's systemd unit — boxes running on that
# EC2 are interrupted. Box state under /var/lib/boxlite is preserved.
name: Runner Rollout

Member

Does upgrading the runner affect the online service?

Contributor Author

We don't have hot-reload on the runner, so the restart does interrupt boxes running on it (synced with @G4614).

@@ -0,0 +1,78 @@
# One-click deploy button (MVP "lite+" — manual, reviewable, no static AWS keys).

Member

Does this cover both first-time deployment and subsequent updates?

Contributor Author

Both supported

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
apps/infra/scripts/rollout-runner-binary.mjs (1)

163-167: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

SSM wait has no explicit timeout.

The aws ssm wait command-executed relies on AWS CLI defaults (typically ~600s with retries). If the remote script hangs, this will block for an extended period. For CI/CD visibility, consider adding an explicit --cli-read-timeout or documenting the expected max duration.
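
One way to get a wall-clock bound that does not depend on the CLI's own defaults is to poll the command status against an explicit deadline. The following is a minimal sketch under stated assumptions: `waitForCommand` is a hypothetical helper, `getStatus` stands in for whatever call returns the SSM invocation status, and treating `Pending`, `InProgress`, `Delayed`, and `Cancelling` as non-terminal is an assumption of this sketch.

```javascript
// Hypothetical helper (not in the PR): poll until the status is terminal
// or the deadline passes. `now` and `sleep` are injectable for testing.
function waitForCommand(getStatus, { timeoutMs, intervalMs, now = Date.now, sleep }) {
  const deadline = now() + timeoutMs
  const nonTerminal = new Set(['Pending', 'InProgress', 'Delayed', 'Cancelling'])
  while (true) {
    const status = getStatus()
    if (!nonTerminal.has(status)) return status
    // Give up if the next poll would land after the deadline.
    if (now() + intervalMs > deadline) {
      throw new Error(`timed out after ${timeoutMs} ms; last status: ${status}`)
    }
    sleep(intervalMs)
  }
}
```

The caller still has to check the returned status, since a terminal state such as `Failed` is returned rather than thrown.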

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra/scripts/rollout-runner-binary.mjs` around lines 163 - 167, The
execFileSync call executing the aws ssm wait command-executed does not have an
explicit timeout configured, which can cause the process to block indefinitely
if the remote script hangs. Add the --cli-read-timeout parameter to the
arguments array passed to the aws command within the execFileSync call to set an
explicit timeout value that provides better visibility and prevents extended
blocking in CI/CD environments.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/infra/scripts/rollout-runner-binary.mjs`:
- Around line 52-67: The resolveInstanceId() function does not validate that
exactly one instance matches the EC2 query filter. When multiple running
instances share the boxlite-runner-default tag, the --output text flag returns
space-separated instance IDs, which will cause issues when passed to SSM. After
the runAws() call returns the result, add validation to ensure exactly one
instance ID is returned; if zero or multiple instances are found, throw an
appropriate error with a clear message indicating the problem and requiring
manual intervention or tag corrections.

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between 6f9aaad and 5ff8105.

📒 Files selected for processing (11)
  • .github/workflows/deploy.yml
  • .github/workflows/runner-rollout.yml
  • apps/infra/README.md
  • apps/infra/package.json
  • apps/infra/scripts/rollout-runner-binary.mjs
  • apps/infra/scripts/seed-deploy-env-ssm.mjs
  • apps/infra/scripts/sst-with-cloudflare.mjs
  • apps/infra/sst.config.ts
  • make/dist.mk
  • make/help.mk
  • scripts/deploy/runner-update-binary.sh
💤 Files with no reviewable changes (1)
  • scripts/deploy/runner-update-binary.sh
✅ Files skipped from review due to trivial changes (2)
  • make/help.mk
  • apps/infra/README.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • apps/infra/sst.config.ts
  • apps/infra/scripts/seed-deploy-env-ssm.mjs
  • apps/infra/scripts/sst-with-cloudflare.mjs
  • .github/workflows/deploy.yml

Comment on lines +52 to +67
function resolveInstanceId() {
  if (INSTANCE_ID) return INSTANCE_ID
  return runAws([
    'ec2',
    'describe-instances',
    '--region',
    AWS_REGION,
    '--filters',
    'Name=tag:Name,Values=boxlite-runner-default',
    'Name=instance-state-name,Values=running',
    '--query',
    'Reservations[].Instances[].InstanceId',
    '--output',
    'text',
  ])
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Multiple matching instances could cause unexpected behavior.

If multiple running EC2 instances share the boxlite-runner-default tag, --output text returns space-separated IDs. The script then passes this combined string to SSM, which may fail or behave unexpectedly.

Consider validating that exactly one instance matches or handling the multi-instance case explicitly.

🛡️ Suggested validation
 function resolveInstanceId() {
   if (INSTANCE_ID) return INSTANCE_ID
-  return runAws([
+  const result = runAws([
     'ec2',
     'describe-instances',
     '--region',
     AWS_REGION,
     '--filters',
     'Name=tag:Name,Values=boxlite-runner-default',
     'Name=instance-state-name,Values=running',
     '--query',
     'Reservations[].Instances[].InstanceId',
     '--output',
     'text',
   ])
+  const ids = result.split(/\s+/).filter(Boolean)
+  if (ids.length > 1) {
+    throw new Error(`multiple running boxlite-runner-default instances found: ${ids.join(', ')}; use BOXLITE_RUNNER_INSTANCE_ID to target one`)
+  }
+  return ids[0] || ''
 }
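
The suggestion above rejects multiple matches but still returns an empty string when nothing matches. A stricter variant that enforces exactly one match could look like the sketch below. `pickSingleInstanceId` is a hypothetical helper that takes the text output of the `describe-instances` call; it is an illustration, not the code that landed in `rollout-runner-binary.mjs`.

```javascript
// Hypothetical helper (not the PR's code): require exactly one instance ID
// in the whitespace-separated text output of the describe-instances query.
function pickSingleInstanceId(text) {
  const ids = text.split(/\s+/).filter(Boolean)
  if (ids.length === 0) {
    throw new Error('no running boxlite-runner-default instance found')
  }
  if (ids.length > 1) {
    throw new Error(
      `multiple running boxlite-runner-default instances found: ${ids.join(', ')}; ` +
        'use BOXLITE_RUNNER_INSTANCE_ID to target one',
    )
  }
  return ids[0]
}
```

Failing early on zero matches keeps an empty instance ID from reaching the SSM call, where the error would be harder to interpret.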

GitHub Actions deploy buttons for dev, OIDC-only (no static keys):
- deploy.yml: Cloud deploy (workflow_dispatch diff|deploy, environment: dev) — assumes
  boxlite-github-deployer-<stage>, runs sst diff/deploy with SSM-injected env.
- runner-rollout.yml: swap the boxlite-runner binary in place via SSM (release or
  dev-build from S3) with checksum verify + auto-rollback.
- runner-update-binary.sh: accept an S3 dev-channel source (separate sha256 URL).
- seed-deploy-env-ssm.mjs / sst-with-cloudflare.mjs: inject stage env + Cloudflare
  creds from SSM so CI matches a laptop .env.
- sst.config.ts: $transform stamps boxlite-role-boundary on every IAM role the app
  creates (satisfies the bounded deploy role's conditional create/manage grant);
  RunnerRole/Profile get explicit boxlite-dev-runner names so they fall under the
  deploy role's boxlite-* IAM scope.

Validated 2026-06-22: deploy --stage dev completes end-to-end; dev healthy
(8 services running, 5 ALBs active, api/dashboard 200, box create verified).
…sist-credentials)

- runner-rollout.yml: pass workflow_dispatch inputs (ref/instance_id, free-text) via
  env vars instead of ${{ }} interpolated into run: scripts, so a crafted input is a
  shell value rather than executable script (CodeRabbit/zizmor template-injection).
- deploy.yml + runner-rollout.yml: persist-credentials: false on checkout — these
  workflows auth to AWS via OIDC and never push to the repo, so the GITHUB_TOKEN
  doesn't need to linger in .git/config.
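
Passing inputs through env vars makes them data rather than script, and a consuming script can additionally validate them before use. The sketch below assumes hypothetical env var names `INPUT_REF` and `INPUT_INSTANCE_ID` (the PR's actual names are not shown here) and a deliberately conservative allow-list for refs; `readRolloutInputs` is an illustrative helper, not code from this PR.

```javascript
// Hypothetical helper: read free-text workflow inputs from env vars and
// reject anything outside a conservative allow-list. Env var names are
// assumptions for this sketch.
function readRolloutInputs(env) {
  const ref = env.INPUT_REF ?? ''
  const instanceId = env.INPUT_INSTANCE_ID ?? ''
  // Letters, digits, dot, underscore, slash, hyphen; no leading hyphen so
  // the value cannot be mistaken for a command-line option.
  if (ref !== '' && !/^[A-Za-z0-9._\/][A-Za-z0-9._\/-]*$/.test(ref)) {
    throw new Error(`invalid ref: ${JSON.stringify(ref)}`)
  }
  // EC2 instance IDs are i- followed by 8 or 17 hex characters.
  if (instanceId !== '' && !/^i-(?:[0-9a-f]{8}|[0-9a-f]{17})$/.test(instanceId)) {
    throw new Error(`invalid instance id: ${JSON.stringify(instanceId)}`)
  }
  return { ref, instanceId }
}
```

A script would call `readRolloutInputs(process.env)` once at startup and pass the validated values to `execFileSync` as separate arguments, never through a shell string.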

Left for a repo-wide follow-up (not this PR, for consistency): SHA-pin actions across
all privileged workflows + tighten the deployer OIDC trust from repo:*:* to
:environment:dev.
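
To make the effect of that trust tightening concrete, the sketch below compares a wildcard subject pattern with an environment-scoped one. `subjectAllowed` is a hypothetical helper that only approximates wildcard matching (it handles `*` and nothing else), and `example-org/example-repo` is a placeholder, not this repository.

```javascript
// Hypothetical helper: approximate wildcard matching of an OIDC `sub`
// claim against a trust-policy pattern. Only `*` is supported here.
function subjectAllowed(sub, pattern) {
  const escaped = pattern
    .split('*')
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*')
  return new RegExp(`^${escaped}$`).test(sub)
}

// Placeholder subjects in GitHub's documented formats.
const fromBranch = 'repo:example-org/example-repo:ref:refs/heads/feature'
const fromDevEnv = 'repo:example-org/example-repo:environment:dev'
const wide = 'repo:example-org/example-repo:*'
const tight = 'repo:example-org/example-repo:environment:dev'
```

With the wide pattern both subjects match, so any branch's workflow run could assume the role; with the tight pattern only jobs that run in the `dev` environment match.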
…ew M1/M2)

- Pin all third-party actions in the privileged deploy workflows to commit SHAs
  (actions/checkout, setup-node, setup-go, aws-actions/configure-aws-credentials)
  with version comments — these workflows hold OIDC→AWS access, so an action tag
  hijack is a supply-chain path. (M1)
- runner-rollout.yml Resolve inputs: write the free-text ref/instance_id to
  GITHUB_OUTPUT via heredoc delimiters so a newline-laden value can't inject extra
  output keys (e.g. forge a stage=). (M2; M3 command-injection was fixed in fff6b76)
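
The heredoc form for `GITHUB_OUTPUT` can be sketched as follows. The workflow in this PR does this in shell; the JavaScript below only illustrates the format, and `formatOutput` is a hypothetical helper. The caller is assumed to supply an unpredictable delimiter.

```javascript
// Hypothetical helper: format one output entry in the multiline form
//   name<<DELIMITER
//   value
//   DELIMITER
// so that newlines in `value` cannot start a new `key=value` entry.
function formatOutput(name, value, delimiter) {
  if (!/^[A-Za-z_][A-Za-z0-9_-]*$/.test(name)) {
    throw new Error(`invalid output name: ${JSON.stringify(name)}`)
  }
  // If the value contained the delimiter, it could end the block early.
  if (value.includes(delimiter)) {
    throw new Error('value contains the delimiter; pick another delimiter')
  }
  return `${name}<<${delimiter}\n${value}\n${delimiter}\n`
}
```

With the single-line `name=value` form, an input such as `main` followed by a newline and `stage=prod` would be read as two outputs; in the heredoc form it stays one value for `ref`.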

Follow-up (separate, not this PR): SHA-pin the remaining ~34 @v4/@v5 uses repo-wide,
and tighten the deployer OIDC trust from repo:*:* to :environment:dev.
mode:
  description: "upgrade, or rollback (install an older release — requires source=release)"
  type: choice
  options: [upgrade, rollback]

Contributor Author

There is no functional difference between upgrade and rollback, except that rollback is only allowed when source is release. @DorianZheng
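
That constraint could be enforced up front with a check like the one below. This is a sketch only: `validateRollout` is a hypothetical helper, and the `source` values `release` and `dev-build` are assumed from the PR description rather than taken from the workflow file.

```javascript
// Hypothetical helper: validate the mode/source combination before any
// rollout work starts. The allowed `source` values are an assumption.
function validateRollout({ mode, source }) {
  if (!['upgrade', 'rollback'].includes(mode)) {
    throw new Error(`unknown mode: ${mode}`)
  }
  if (!['release', 'dev-build'].includes(source)) {
    throw new Error(`unknown source: ${source}`)
  }
  if (mode === 'rollback' && source !== 'release') {
    throw new Error('rollback requires source=release')
  }
  return { mode, source }
}
```

Checking this in the first step gives a clear error in the Actions log instead of a failure partway through the SSM command.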
