Add controlled update cutover guardrail #13
This PR merges our pre-upgrade process, which helps me decide whether an upgrade is worth it to fix bugs I'm currently experiencing or whether it will break my OpenClaw setup. I also maintain a log of hacks and shims that I may want or need to remove or disable before upgrading, and I use this pre-upgrade check to decide whether to skip an upgrade and wait for a more stable release.
|
Thanks for this. The controlled cutover guardrail is a good direction, but I found a blocker. Request changes.
Suggested fix: make the report heredoc safe for literal Markdown backticks, add a regression test that fails if |
|
Addressed the earlier review blocker on this PR. Verification:
Please re-review when available.
|
Thanks — this is a useful direction and I do not see spam/backlink concerns here. The guardrail would be valuable for safer OpenClaw update cutovers. Before merging, can you tighten a few workflow-safety details?
Non-blocking: it would also be helpful if preflight summarized failed captures instead of only writing [...]
Once those are addressed, this looks reasonable to accept.
Title
Add controlled update cutover guardrail for production/custom-runtime upgrades
Summary
This PR adds `scripts/update-cutover.sh`, a pre/post update guardrail for OpenClaw upgrades that need more than the existing post-update repair flow.

The new script intentionally does not run `openclaw update`. Instead, it wraps the update with evidence capture and decision gates:

- `official` vs `custom` runtime lane selection
- [...] (`cli`, `app`, `both`, or `none`)
- `CUTOVER.md` checklist/report

It also updates `SKILL.md`, `README.md`, and tests.

Why this exists
`openclaw-ops` already has a good post-update recovery path in `post-update.sh`. That orchestrates:

- `check-update.sh --fix`
- `heal.sh`
- `security-scan.sh`
- `openclaw health --json`

That is valuable, but it starts after the update has already happened.
In our own production usage, the risky failures were not only “what broke after the update?” They were often “did we make the right cutover decision before updating?”
Concrete usage history that motivated this:
- [...] `/Applications/OpenClaw.app` are separate artifacts. Updating one does not prove the other moved with it, and mismatches can create false gateway-disconnected states, metadata upgrade loops, or pairing/app confusion.
- [...] (`/tmp` / `/private/tmp`) matter operationally. A cutover should explicitly check that the live service did not end up pointed at disposable paths.

Those incidents led us to create a local “OpenClaw Upgrade Cutover SOP” with one core rule:
This PR turns that operational pattern into a reusable openclaw-ops script.
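As a rough illustration of the disposable-path check described above, the sketch below flags any runtime path that sits under `/tmp` or `/private/tmp`. The function name `is_disposable_path` and the sample paths are hypothetical and are not taken from `update-cutover.sh`; the real script may detect these references differently.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not from the PR): succeeds when the given path lives
# under a disposable temp directory. On macOS, /tmp points at /private/tmp.
is_disposable_path() {
  case "$1" in
    /tmp|/tmp/*|/private/tmp|/private/tmp/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Classify a few sample runtime paths (illustrative values only).
for p in /tmp/openclaw-build/bin/openclaw /private/tmp/oc/run /usr/local/bin/openclaw; do
  if is_disposable_path "$p"; then
    echo "BLOCKER: runtime points at disposable path: $p"
  else
    echo "ok: $p"
  fi
done
```

A pattern match on the path string keeps the check cheap and side-effect free; it does not resolve symlinks, so a real preflight might want to canonicalize paths first.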
Research / due diligence done while creating this
Before writing the script, we compared:
- [...] (`post-update.sh`, `check-update.sh`, `heal.sh`, `security-scan.sh`)
- `openclaw-ops` architecture and contribution rules

Key gap analysis:

- `post-update.sh` starts after update
- `update-cutover.sh --preflight` captures before-state and forces a go/no-go gate
- `CUTOVER.md` asks operator to classify release risks as Safe / Needs migration / Blocker
- `OPENCLAW_HACK_AUDIT_LOG` excerpt
- `/tmp` / `/private/tmp` runtime references

We also checked the implementation against the `openclaw-ops` philosophy:

- [...] (`sanitize_sensitive` is used for captures)

How this fits openclaw-ops architecture
This is deliberately a guardrail, not a new updater and not a recovery daemon.
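To make the "guardrail, not updater" idea concrete, here is a minimal sketch of a go/no-go gate built on the Safe / Needs migration / Blocker classification from the gap analysis. The function name `cutover_decision`, the lowercase labels, and the `GO-AFTER-MIGRATION` outcome are assumptions for illustration; the actual gate in `update-cutover.sh` may work differently. The sketch only prints a decision and never runs `openclaw update` itself.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical gate (not from the PR): takes one classification per reviewed
# release risk (safe | needs-migration | blocker) and prints one decision.
cutover_decision() {
  local decision="GO" risk
  for risk in "$@"; do
    case "$risk" in
      blocker) echo "NO-GO"; return 0 ;;                    # any blocker stops the cutover
      needs-migration) decision="GO-AFTER-MIGRATION" ;;     # proceed only after migrating
      safe) ;;                                              # no effect on the decision
      *) echo "unknown classification: $risk" >&2; return 2 ;;
    esac
  done
  echo "$decision"
}

# The operator, not the script, runs the update once the gate allows it.
cutover_decision safe needs-migration safe   # prints GO-AFTER-MIGRATION
```

Keeping the gate as a pure function of the operator's classifications means it can be tested without touching a live gateway.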
It respects the existing architecture in `docs/architecture.md`:

- [...] `openclaw update`.
- [...] `openclaw gateway restart`.
- [...] `post-update.sh` rather than replacing it.

Recommended flow after this PR:

For simple installs and routine patch updates, `post-update.sh` remains enough. The cutover guardrail is for production gateways, macOS app/CLI installs, custom/local runtime lanes, or updates being performed to resolve a live incident.

Planned follow-up PR
This PR intentionally keeps the cutover guardrail focused.
A second PR will improve the broader remediation workflow by integrating the existing Remediation Board more directly with upgrade cutovers and bug-fixing flows.
Planned follow-up scope:
- [...] `incident`, `hack`, `upstream-watch`, `cron-error`, `upgrade-blocker`, and `security-hardening`
- [...] `update-cutover.sh` preflight to review active remediation items of type `hack`, `upgrade-blocker`, and `upstream-watch`
- [...] `ops/remediation/` model as the operator-facing home for operational debt, incidents, hacks, and fixes awaiting verification

Keeping that as PR 2 avoids overloading this PR with a larger tracking architecture change while still making the intended integration path explicit.
Implementation notes
New script:

- `scripts/update-cutover.sh`

Updated docs:

- `SKILL.md`
- `README.md`

Updated tests:

- `tests/run.sh`

Test coverage added for:
Validation
Run locally:
    bash -n scripts/update-cutover.sh tests/run.sh
    bash tests/run.sh
    git diff --check
    bash scripts/skill-audit.sh .

Results:

- `bash -n scripts/update-cutover.sh tests/run.sh`: pass
- `bash tests/run.sh`: pass
- `git diff --check`: pass
- `bash scripts/skill-audit.sh .`: pass, risk LOW

I also ran an adversarial review against the repo contribution rules and open-source release checklist. The first review found real blockers; those were fixed. The second review passed with no blockers.
Security / privacy notes
- [...] `~/.openclaw/update-cutovers` by default.
- [...] `sanitize_sensitive`.
- [...] `OPENCLAW_HACK_AUDIT_LOG` if supplied/found; operators should still avoid pointing that variable at files containing private incident details they do not want in local cutover reports.

Risks / limitations

- [...] `--smoke`, but it cannot know every deployment’s critical Telegram/Slack/BlueBubbles/etc. scenario.

Checklist

- [...] (`set -euo pipefail`, shared `lib.sh`, clear logging)
- `bash tests/run.sh` passes
- `git diff --check` passes