Add controlled update cutover guardrail #13

Open
sene1337 wants to merge 2 commits into cathrynlavery:main from sene1337:feat/update-cutover-guardrail
Conversation

@sene1337

Contributor

Title

Add controlled update cutover guardrail for production/custom-runtime upgrades

Summary

This PR adds scripts/update-cutover.sh, a pre/post update guardrail for OpenClaw upgrades that need more than the existing post-update repair flow.

The new script intentionally does not run openclaw update. Instead, it wraps the update with evidence capture and decision gates:

  • preflight baseline capture before the update
  • explicit official vs custom runtime lane selection
  • explicit macOS app scope selection (cli, app, both, or none)
  • config, cron, channel, gateway, launch target, binary, app-version, and temp-runtime-ref capture
  • generated CUTOVER.md checklist/report
  • hack/workaround audit prompt
  • post-update verification that fails on gateway/doctor/channel/cron capture failures
  • rollback rule when verification fails

It also updates SKILL.md, README.md, and tests.

Why this exists

openclaw-ops already has a good post-update recovery path:

bash scripts/post-update.sh

That orchestrates:

  1. check-update.sh --fix
  2. heal.sh
  3. optional workspace reconcile
  4. security-scan.sh
  5. openclaw health --json

That is valuable, but it starts after the update has already happened.

In our own production usage, the risky failures were not only “what broke after the update?” They were often “did we make the right cutover decision before updating?”

Concrete usage history that motivated this:

  • We ran a locally patched OpenClaw build for an ACP-related fix and almost upgraded without first checking whether the upstream release contained the local patch. That created the risk of silently switching back to an official runtime missing behavior we depended on.
  • We hit macOS split-artifact drift: the npm/CLI gateway and /Applications/OpenClaw.app are separate artifacts. Updating one does not prove the other moved with it, and mismatches can create false gateway-disconnected states, metadata upgrade loops, or pairing/app confusion.
  • We previously consolidated a messy OpenClaw install where multiple install locations/entrypoints were fighting. That incident reinforced that the active binary path, launchd service target, and apparent CLI version all need to be captured together before trusting an update.
  • We have seen temp/runtime path drift (/tmp / /private/tmp) matter operationally. A cutover should explicitly check that the live service did not end up pointed at disposable paths.
  • Cron delivery, channel routing, and topic/message delivery have been recurring fragile surfaces. A generic health check is not enough; updates need to capture cron/channel state and require host-specific smoke tests after the fact.
  • We maintain local hacks/workarounds during fast-moving OpenClaw development. Upgrades should force an operator to classify each relevant workaround as keep/retire/modify rather than leaving stale hacks enabled or removing them too early.

Those incidents led us to create a local “OpenClaw Upgrade Cutover SOP” with one core rule:

Treat OpenClaw upgrades as controlled cutovers, not routine updates.

This PR turns that operational pattern into a reusable openclaw-ops script.

Research / due diligence done while creating this

Before writing the script, we compared:

  1. the existing openclaw-ops update process (post-update.sh, check-update.sh, heal.sh, security-scan.sh)
  2. our local cutover SOP and decision log from prior incidents
  3. openclaw-ops architecture and contribution rules
  4. current OpenClaw release/update pain points observed in our own gateway

Key gap analysis:

| Gap in current process | Added by this PR |
| --- | --- |
| post-update.sh starts after update | update-cutover.sh --preflight captures before-state and forces a go/no-go gate |
| no explicit official/custom runtime decision | required --lane selection (official or custom runtime) |
| macOS app/CLI split not encoded | required --app-scope selection (cli, app, both, or none) |
| release-note compatibility review left informal | generated CUTOVER.md asks operator to classify release risks as Safe / Needs migration / Blocker |
| current config/cron/channel baseline not bundled | captures redacted config, cron list, gateway/status/doctor/channel probe outputs |
| hack/workaround review easy to skip | generated report includes a hack audit section and optional OPENCLAW_HACK_AUDIT_LOG excerpt |
| temp runtime drift not checked | captures and verifies /tmp / /private/tmp runtime references |
| post verification too easy to hand-wave | post mode fails if gateway status, doctor, channels probe, or cron list capture failed |
| rollback discipline not explicit | generated report and failure output say to roll back instead of layering fixes |
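
To make the temp-runtime row concrete, here is a minimal sketch of what such a check could look like. It is not the code in scripts/update-cutover.sh; the function names is_temp_runtime_path and check_runtime_refs are hypothetical, and the only thing taken from the PR is the pair of roots being guarded (/tmp and /private/tmp).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeeds when the path sits under a disposable temp
# root. On macOS /tmp is a symlink to /private/tmp, so both are matched.
is_temp_runtime_path() {
  case "$1" in
    /tmp|/tmp/*|/private/tmp|/private/tmp/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical guard: reports every offending reference on stderr and
# returns non-zero if at least one was found.
check_runtime_refs() {
  local failed=0 ref
  for ref in "$@"; do
    if is_temp_runtime_path "$ref"; then
      echo "temp runtime reference: $ref" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

A caller would pass the captured binary path and launch target, for example `check_runtime_refs "$active_binary" "$launch_target"`, and treat a non-zero return as a verification failure. The match is a plain prefix test, so a path like /tmpfoo is not flagged.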

We also checked the implementation against the openclaw-ops philosophy:

  • scripts first, not prose-only SOPs
  • deterministic evidence capture
  • operator-visible reports
  • no hidden update execution
  • no new restart owner
  • no second watchdog
  • no secret leakage in outputs where practical (sanitize_sensitive is used for captures)
  • tests for non-trivial behavior changes

How this fits openclaw-ops architecture

This is deliberately a guardrail, not a new updater and not a recovery daemon.

It respects the existing architecture in docs/architecture.md:

  • It does not call openclaw update.
  • It does not call openclaw gateway restart.
  • It does not create a second watchdog or restart owner.
  • It leaves update execution and restart policy to the operator / existing OpenClaw tooling.
  • It complements post-update.sh rather than replacing it.

Recommended flow after this PR:

# 1. Before update: capture baseline + decision gate
bash scripts/update-cutover.sh \
  --preflight \
  --target-version v2026.X.Y \
  --lane official \
  --app-scope cli

# 2. Review generated CUTOVER.md and only then run the approved update
openclaw update

# 3. Existing post-update repair/health sequence
bash scripts/post-update.sh

# 4. Verify the same cutover dir after update
bash scripts/update-cutover.sh \
  --post \
  --target-version v2026.X.Y \
  --lane official \
  --app-scope cli \
  --cutover-dir ~/.openclaw/update-cutovers/<stamp>-v2026.X.Y

For simple installs and routine patch updates, post-update.sh remains enough. The cutover guardrail is for production gateways, macOS app/CLI installs, custom/local runtime lanes, or updates being performed to resolve a live incident.

Planned follow-up PR

This PR intentionally keeps the cutover guardrail focused.

A second PR will improve the broader remediation workflow by integrating the existing Remediation Board more directly with upgrade cutovers and bug-fixing flows.

Planned follow-up scope:

  • add clearer remediation item types such as incident, hack, upstream-watch, cron-error, upgrade-blocker, and security-hardening
  • make local hacks/workarounds first-class remediation items instead of a separate forgotten log
  • link remediation items to richer incident notes when a case needs narrative investigation detail
  • add/extend commands for observations, hypotheses, steps tried, upstream issue links, close criteria, and markdown export
  • teach update-cutover.sh preflight to review active remediation items of type hack, upgrade-blocker, and upstream-watch
  • document the ops/remediation/ model as the operator-facing home for operational debt, incidents, hacks, and fixes awaiting verification

Keeping that as PR 2 avoids overloading this PR with a larger tracking architecture change while still making the intended integration path explicit.

Implementation notes

New script:

  • scripts/update-cutover.sh

Updated docs:

  • SKILL.md
  • README.md

Updated tests:

  • tests/run.sh

Test coverage added for:

  • preflight gate/report generation
  • before-state config and cron capture
  • post verification success path
  • post verification failing on target version mismatch
  • post verification failing on channel probe failure
  • missing argument values showing usage instead of falling through

Validation

Run locally:

bash -n scripts/update-cutover.sh tests/run.sh
bash tests/run.sh
git diff --check
bash scripts/skill-audit.sh .

Results:

  • bash -n scripts/update-cutover.sh tests/run.sh — pass
  • bash tests/run.sh — pass
  • git diff --check — pass
  • bash scripts/skill-audit.sh . — pass, risk LOW

I also ran an adversarial review against the repo contribution rules and open-source release checklist. The first review found real blockers; those were fixed. The second review passed with no blockers.

Security / privacy notes

  • The script writes cutover reports under ~/.openclaw/update-cutovers by default.
  • Captured command output is passed through sanitize_sensitive.
  • Config capture is redacted through the same sanitizer.
  • The script may include an excerpt from OPENCLAW_HACK_AUDIT_LOG if supplied/found; operators should still avoid pointing that variable at files containing private incident details they do not want in local cutover reports.
  • No generated local cutover reports are committed.

Risks / limitations

  • This PR still uses a hack-audit-log style review input for local workarounds. A follow-up PR will merge that concept more cleanly into the Remediation Board and bug-fixing workflow.
  • Host-specific channel smoke tests still need to be run by the operator; the script can run an agent sentinel with --smoke, but it cannot know every deployment’s critical Telegram/Slack/BlueBubbles/etc. scenario.
  • macOS app compatibility is checked by captured bundle version text matching the target version. This is intentionally conservative but can be tightened later if OpenClaw exposes a first-class app/gateway compatibility probe.
  • The script is a gate and evidence recorder; it does not replace release-note review or human approval for production upgrades.

Checklist

  • Small focused PR
  • Bash style matches repo (set -euo pipefail, shared lib.sh, clear logging)
  • Tests added for non-trivial behavior
  • bash tests/run.sh passes
  • git diff --check passes
  • No secrets or local generated state included
  • No new restart-capable watchdog
  • README/SKILL updated

@darbsllim

Contributor

This PR merges our pre-upgrade process, which helps me decide whether upgrading is worth it to fix ongoing bugs I'm experiencing or whether it will break my OpenClaw setup.

I also maintain a log of hacks and shims that I may want/need to remove/disable before upgrading.

I also use this pre-upgrade check as a way to decide if I should skip the upgrade and wait for a more stable release.

@cathrynlavery

cathrynlavery commented May 20, 2026

Owner

Thanks for this — the controlled cutover guardrail is a good direction, but I found a blocker.

Request changes

  • scripts/update-cutover.sh: --preflight can execute openclaw update. The preflight report heredoc is unquoted and contains Markdown backticks, so bash treats the backticked text as command substitution while generating the report. In a local fake-openclaw run, --preflight invoked openclaw update, which violates the script’s stated safety contract.
  • bash tests/run.sh fails on this PR head. The post-cutover success test trips the temp-runtime guard because the fake openclaw binary lives under a temp dir.
  • shellcheck scripts/update-cutover.sh tests/run.sh also flags the same heredoc/backtick area.

Suggested fix: make the report heredoc safe for literal Markdown backticks, add a regression test that fails if --preflight invokes openclaw update, and adjust the test harness so the “post passes when version matches” case doesn’t use a temp-backed fake runtime path unless that is what it is testing.
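
For reference, the failure mode and the fix can be reproduced in isolation. The sketch below is illustrative only: openclaw is replaced by a shell function stub that records its arguments, and render_unsafe, render_safe, and the report sentence are made-up names and text, not taken from the PR. It shows that an unquoted heredoc delimiter lets bash run backticked text as a command substitution, while a quoted delimiter keeps the body literal.

```shell
#!/usr/bin/env bash
set -euo pipefail

calls_log="$(mktemp)"

# Stub standing in for the real CLI: it only records how it was called.
openclaw() { echo "openclaw $*" >>"$calls_log"; }

# Unsafe: unquoted delimiter, so the backticked text is executed.
render_unsafe() {
  cat <<EOF
Next step: run `openclaw update` after approval.
EOF
}

# Safe: quoted delimiter, so the body is emitted literally.
render_safe() {
  cat <<'EOF'
Next step: run `openclaw update` after approval.
EOF
}

: >"$calls_log"
unsafe_report="$(render_unsafe)"
unsafe_calls="$(wc -l <"$calls_log" | tr -d ' ')"

: >"$calls_log"
safe_report="$(render_safe)"
safe_calls="$(wc -l <"$calls_log" | tr -d ' ')"

echo "unsafe heredoc invoked the stub $unsafe_calls time(s)"
echo "safe heredoc invoked the stub $safe_calls time(s)"
rm -f "$calls_log"
```

The same stub-and-count pattern could serve as the regression test: run preflight with a fake openclaw on PATH and fail if any recorded call is an update. If the report needs variable expansion and therefore an unquoted delimiter, escaping each backtick as \` inside the heredoc body is the alternative.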

@cathrynlavery

cathrynlavery commented May 20, 2026

Owner

Addressed the earlier review blocker on this PR.

Verification:

  • git diff --check
  • bash -n scripts/*.sh tests/run.sh
  • bash tests/run.sh

Please re-review when available.

@cathrynlavery

Owner

Thanks — this is a useful direction and I do not see spam/backlink concerns here. The guardrail would be valuable for safer OpenClaw update cutovers.

Before merging, can you tighten a few workflow-safety details?

  1. --post should require the existing preflight cutover directory, or otherwise verify that the selected directory already has the expected before/ baseline / CUTOVER.md.

    • Right now post mode can silently create a fresh timestamped directory if --cutover-dir is omitted, which undermines the before/after evidence guarantee.
  2. Preflight is described as read-only, but it runs openclaw channels status --probe.

    • Please either make the probe opt-in during preflight, switch to a non-probing capture, or explicitly document that this is an active probe which may touch channel integrations.
  3. Please parse --help before requiring openclaw / python3.

    • On machines without OpenClaw installed, --help should still show usage rather than failing dependency checks.
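
As a rough sketch of items 1 and 3, assuming the preflight layout named above (a before/ directory and a CUTOVER.md inside the cutover directory): the function names require_preflight_dir and wants_help are hypothetical and the error messages are placeholders, so treat this as one possible shape rather than the expected implementation.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical check for --post: refuse to proceed unless the directory
# already holds the preflight evidence (before/ baseline and CUTOVER.md).
require_preflight_dir() {
  local dir="${1:-}"
  if [ -z "$dir" ]; then
    echo "--post requires --cutover-dir from the preflight run" >&2
    return 2
  fi
  if [ ! -d "$dir/before" ] || [ ! -f "$dir/CUTOVER.md" ]; then
    echo "not a preflight cutover dir: $dir" >&2
    return 2
  fi
}

# Hypothetical pre-scan: reports whether help was requested, so the caller
# can print usage before any dependency check (openclaw, python3) runs.
wants_help() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      -h|--help) return 0 ;;
    esac
  done
  return 1
}
```

In the script's entry point this would mean calling `wants_help "$@"` first and printing usage on success, and only afterwards checking for openclaw and python3. In post mode, `require_preflight_dir "$cutover_dir"` would run before any capture so that a fresh timestamped directory is never created silently.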

Non-blocking: it would also be helpful if preflight summarized failed captures instead of only writing .exit files.
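
One possible shape for that summary, sketched under an assumption that is not confirmed by the PR: each capture writes a file named after itself with an .exit suffix, containing only its numeric exit status. The function name summarize_failed_captures and the output wording are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical summary: lists every capture whose .exit file holds a
# non-zero status, prints a total, and returns non-zero if any failed.
# Assumes each .exit file contains just the numeric exit code.
summarize_failed_captures() {
  local dir="$1" file code failed=0
  for file in "$dir"/*.exit; do
    [ -e "$file" ] || continue   # the glob matched nothing
    code="$(tr -d '[:space:]' <"$file")"
    if [ "$code" != "0" ]; then
      echo "FAILED capture: $(basename "$file" .exit) (exit ${code:-unknown})"
      failed=$((failed + 1))
    fi
  done
  echo "failed captures: $failed"
  [ "$failed" -eq 0 ]
}
```

Preflight could call this against the before/ directory at the end of the run and append the output to CUTOVER.md, so the operator sees failed captures without opening individual .exit files.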

Once those are addressed, this looks reasonable to accept.

3 participants