Add controlled update cutover guardrail by sene1337 · Pull Request #13 · cathrynlavery/openclaw-ops

sene1337 · 2026-05-15T14:09:42Z

Title

Add controlled update cutover guardrail for production/custom-runtime upgrades

Summary

This PR adds scripts/update-cutover.sh, a pre/post update guardrail for OpenClaw upgrades that need more than the existing post-update repair flow.

The new script intentionally does not run openclaw update. Instead, it wraps the update with evidence capture and decision gates:

preflight baseline capture before the update
explicit official vs custom runtime lane selection
explicit macOS app scope selection (cli, app, both, or none)
config, cron, channel, gateway, launch target, binary, app-version, and temp-runtime-ref capture
generated CUTOVER.md checklist/report
hack/workaround audit prompt
post-update verification that fails on gateway/doctor/channel/cron capture failures
rollback rule when verification fails

It also updates SKILL.md, README.md, and tests.

Why this exists

openclaw-ops already has a good post-update recovery path:

bash scripts/post-update.sh

That orchestrates:

check-update.sh --fix
heal.sh
optional workspace reconcile
security-scan.sh
openclaw health --json

That is valuable, but it starts after the update has already happened.

In our own production usage, the risky failures were not only “what broke after the update?” They were often “did we make the right cutover decision before updating?”

Concrete usage history that motivated this:

We ran a locally patched OpenClaw build for an ACP-related fix and almost upgraded without first checking whether the upstream release contained the local patch. That created the risk of silently switching back to an official runtime missing behavior we depended on.
We hit macOS split-artifact drift: the npm/CLI gateway and /Applications/OpenClaw.app are separate artifacts. Updating one does not prove the other moved with it, and mismatches can create false gateway-disconnected states, metadata upgrade loops, or pairing/app confusion.
We previously consolidated a messy OpenClaw install where multiple install locations/entrypoints were fighting. That incident reinforced that the active binary path, launchd service target, and apparent CLI version all need to be captured together before trusting an update.
We have seen temp/runtime path drift (/tmp / /private/tmp) matter operationally. A cutover should explicitly check that the live service did not end up pointed at disposable paths.
Cron delivery, channel routing, and topic/message delivery have been recurring fragile surfaces. A generic health check is not enough; updates need to capture cron/channel state and require host-specific smoke tests after the fact.
We maintain local hacks/workarounds during fast-moving OpenClaw development. Upgrades should force an operator to classify each relevant workaround as keep/retire/modify rather than leaving stale hacks enabled or removing them too early.

Those incidents led us to create a local “OpenClaw Upgrade Cutover SOP” with one core rule:

Treat OpenClaw upgrades as controlled cutovers, not routine updates.

This PR turns that operational pattern into a reusable openclaw-ops script.

Research / due diligence done while creating this

Before writing the script, we compared:

the existing openclaw-ops update process (post-update.sh, check-update.sh, heal.sh, security-scan.sh)
our local cutover SOP and decision log from prior incidents
openclaw-ops architecture and contribution rules
current OpenClaw release/update pain points observed in our own gateway

Key gap analysis:

Gap in current process	Added by this PR
`post-update.sh` starts after update	`update-cutover.sh --preflight` captures before-state and forces a go/no-go gate
no explicit official/custom runtime decision	required `--lane official
macOS app/CLI split not encoded	required `--app-scope cli
release-note compatibility review left informal	generated `CUTOVER.md` asks operator to classify release risks as Safe / Needs migration / Blocker
current config/cron/channel baseline not bundled	captures redacted config, cron list, gateway/status/doctor/channel probe outputs
hack/workaround review easy to skip	generated report includes a hack audit section and optional `OPENCLAW_HACK_AUDIT_LOG` excerpt
temp runtime drift not checked	captures and verifies `/tmp` / `/private/tmp` runtime references
post verification too easy to hand-wave	post mode fails if gateway status, doctor, channels probe, or cron list capture failed
rollback discipline not explicit	generated report and failure output say to roll back instead of layering fixes

We also checked the implementation against the openclaw-ops philosophy:

scripts first, not prose-only SOPs
deterministic evidence capture
operator-visible reports
no hidden update execution
no new restart owner
no second watchdog
no secret leakage in outputs where practical (sanitize_sensitive is used for captures)
tests for non-trivial behavior changes

How this fits openclaw-ops architecture

This is deliberately a guardrail, not a new updater and not a recovery daemon.

It respects the existing architecture in docs/architecture.md:

It does not call openclaw update.
It does not call openclaw gateway restart.
It does not create a second watchdog or restart owner.
It leaves update execution and restart policy to the operator / existing OpenClaw tooling.
It complements post-update.sh rather than replacing it.

Recommended flow after this PR:

# 1. Before update: capture baseline + decision gate
bash scripts/update-cutover.sh \
  --preflight \
  --target-version v2026.X.Y \
  --lane official \
  --app-scope cli

# 2. Review generated CUTOVER.md and only then run the approved update
openclaw update

# 3. Existing post-update repair/health sequence
bash scripts/post-update.sh

# 4. Verify the same cutover dir after update
bash scripts/update-cutover.sh \
  --post \
  --target-version v2026.X.Y \
  --lane official \
  --app-scope cli \
  --cutover-dir ~/.openclaw/update-cutovers/<stamp>-v2026.X.Y

For simple installs and routine patch updates, post-update.sh remains enough. The cutover guardrail is for production gateways, macOS app/CLI installs, custom/local runtime lanes, or updates being performed to resolve a live incident.

Planned follow-up PR

This PR intentionally keeps the cutover guardrail focused.

A second PR will improve the broader remediation workflow by integrating the existing Remediation Board more directly with upgrade cutovers and bug-fixing flows.

Planned follow-up scope:

add clearer remediation item types such as incident, hack, upstream-watch, cron-error, upgrade-blocker, and security-hardening
make local hacks/workarounds first-class remediation items instead of a separate forgotten log
link remediation items to richer incident notes when a case needs narrative investigation detail
add/extend commands for observations, hypotheses, steps tried, upstream issue links, close criteria, and markdown export
teach update-cutover.sh preflight to review active remediation items of type hack, upgrade-blocker, and upstream-watch
document the ops/remediation/ model as the operator-facing home for operational debt, incidents, hacks, and fixes awaiting verification

Keeping that as PR 2 avoids overloading this PR with a larger tracking architecture change while still making the intended integration path explicit.

Implementation notes

New script:

scripts/update-cutover.sh

Updated docs:

SKILL.md
README.md

Updated tests:

tests/run.sh

Test coverage added for:

preflight gate/report generation
before-state config and cron capture
post verification success path
post verification failing on target version mismatch
post verification failing on channel probe failure
missing argument values showing usage instead of falling through

Validation

Run locally:

bash -n scripts/update-cutover.sh tests/run.sh
bash tests/run.sh
git diff --check
bash scripts/skill-audit.sh .

Results:

bash -n scripts/update-cutover.sh tests/run.sh — pass
bash tests/run.sh — pass
git diff --check — pass
bash scripts/skill-audit.sh . — pass, risk LOW

I also ran an adversarial review against the repo contribution rules and open-source release checklist. The first review found real blockers; those were fixed. The second review passed with no blockers.

Security / privacy notes

The script writes cutover reports under ~/.openclaw/update-cutovers by default.
Captured command output is passed through sanitize_sensitive.
Config capture is redacted through the same sanitizer.
The script may include an excerpt from OPENCLAW_HACK_AUDIT_LOG if supplied/found; operators should still avoid pointing that variable at files containing private incident details they do not want in local cutover reports.
No generated local cutover reports are committed.

Risks / limitations

This PR still uses a hack-audit-log style review input for local workarounds. A follow-up PR will merge that concept more cleanly into the Remediation Board and bug-fixing workflow.
Host-specific channel smoke tests still need to be run by the operator; the script can run an agent sentinel with --smoke, but it cannot know every deployment’s critical Telegram/Slack/BlueBubbles/etc. scenario.
macOS app compatibility is checked by captured bundle version text matching the target version. This is intentionally conservative but can be tightened later if OpenClaw exposes a first-class app/gateway compatibility probe.
The script is a gate and evidence recorder; it does not replace release-note review or human approval for production upgrades.

Checklist

Small focused PR
Bash style matches repo (set -euo pipefail, shared lib.sh, clear logging)
Tests added for non-trivial behavior
bash tests/run.sh passes
git diff --check passes
No secrets or local generated state included
No new restart-capable watchdog
README/SKILL updated

darbsllim · 2026-05-15T14:15:55Z

This PR is merging our pre-upgrade process that helps me decide if it's worth upgrading to fix ongoing bugs I'm experiencing, or if upgrading will break my openclaw setup.

I also maintain a log of hacks and shims that I may want/need to remove/disable before upgrading.

I also use this pre-upgrade check as a way to decide if I should skip the upgrade and wait for a more stable release.

cathrynlavery · 2026-05-20T13:55:18Z

Thanks for this — the controlled cutover guardrail is a good direction, but I found a blocker.

Request changes

scripts/update-cutover.sh: --preflight can execute openclaw update. The preflight report heredoc is unquoted and contains Markdown backticks, so bash treats the backticked text as command substitution while generating the report. In a local fake-openclaw run, --preflight invoked openclaw update, which violates the script’s stated safety contract.
bash tests/run.sh fails on this PR head. The post-cutover success test trips the temp-runtime guard because the fake openclaw binary lives under a temp dir.
shellcheck scripts/update-cutover.sh tests/run.sh also flags the same heredoc/backtick area.

Suggested fix: make the report heredoc safe for literal Markdown backticks, add a regression test that fails if --preflight invokes openclaw update, and adjust the test harness so the “post passes when version matches” case doesn’t use a temp-backed fake runtime path unless that is what it is testing.

cathrynlavery · 2026-05-20T19:14:45Z

Addressed the earlier review blocker on this PR.

Verification:

git diff --check
bash -n scripts/*.sh tests/run.sh
bash tests/run.sh

Please re-review when available.

cathrynlavery · 2026-05-26T03:17:52Z

Thanks — this is a useful direction and I do not see spam/backlink concerns here. The guardrail would be valuable for safer OpenClaw update cutovers.

Before merging, can you tighten a few workflow-safety details?

--post should require the existing preflight cutover directory, or otherwise verify that the selected directory already has the expected before/ baseline / CUTOVER.md.
- Right now post mode can silently create a fresh timestamped directory if --cutover-dir is omitted, which undermines the before/after evidence guarantee.
Preflight is described as read-only, but it runs openclaw channels status --probe.
- Please either make the probe opt-in during preflight, switch to a non-probing capture, or explicitly document that this is an active probe which may touch channel integrations.
Please parse --help before requiring openclaw / python3.
- On machines without OpenClaw installed, --help should still show usage rather than failing dependency checks.

Non-blocking: it would also be helpful if preflight summarized failed captures instead of only writing .exit files.

Once those are addressed, this looks reasonable to accept.

Add controlled update cutover guardrail

ea0b4f0

fix: harden update cutover preflight report

f7b098c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add controlled update cutover guardrail#13

Add controlled update cutover guardrail#13
sene1337 wants to merge 2 commits into
cathrynlavery:mainfrom
sene1337:feat/update-cutover-guardrail

sene1337 commented May 15, 2026

Uh oh!

darbsllim commented May 15, 2026

Uh oh!

cathrynlavery commented May 20, 2026 •

edited

Loading

Uh oh!

cathrynlavery commented May 20, 2026 •

edited

Loading

Uh oh!

cathrynlavery commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

sene1337 commented May 15, 2026

Title

Summary

Why this exists

Research / due diligence done while creating this

How this fits openclaw-ops architecture

Planned follow-up PR

Implementation notes

Validation

Security / privacy notes

Risks / limitations

Checklist

Uh oh!

darbsllim commented May 15, 2026

Uh oh!

cathrynlavery commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Request changes

Uh oh!

cathrynlavery commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cathrynlavery commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

cathrynlavery commented May 20, 2026 •

edited

Loading

cathrynlavery commented May 20, 2026 •

edited

Loading