OpenClaw Guardian

Three-layer self-healing system for OpenClaw Gateway

Quick Start • Architecture • Scripts • Configuration • Testing

What is this

Running multiple AI agents 24/7 with OpenClaw is straightforward until something breaks at 3 AM. macOS launchd restarts crashed processes, but it only checks whether the process is alive: it cannot detect a gateway that is running yet unresponsive, repair a config file that an agent has corrupted, or clean up orphaned child processes that are eating all your RAM.

This project fills those gaps with a battle-tested, cron-based monitoring system. Over half a month of continuous 24/7 operation with 6 agents, it automatically recovered from 40+ incidents.

Blog post (a detailed account of the pitfalls encountered): OpenClaw 7×24 生存指南 (OpenClaw 7×24 Survival Guide), coming soon

Architecture

Three layers, each with a clear responsibility:

Layer	Component	Responsibility
L1 System	macOS `launchd`	Process dies → restarted within 30 s
L2 Application	`heartbeat-guardian.sh`	Health checks → smart repair → config rollback → orphan cleanup → OOM protection
L3 Data	Maintenance scripts	Memory compaction, cron health, Chrome repair, upgrade protection

Key design decision: all monitoring runs from the system crontab, completely independent of the OpenClaw Gateway. The guardian never depends on the thing it guards.
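The gap between the L1 and L2 checks can be illustrated with a minimal sketch. It is not taken from `heartbeat-guardian.sh`: the function name and the `/health` path are assumptions, and only `GATEWAY_PORT` and `HEALTH_TIMEOUT` reuse names from this README. A process that exists but does not answer HTTP within the timeout is reported as unresponsive, which is the case a plain process supervisor cannot see.

```bash
# Sketch only: gateway_state and the /health path are hypothetical.
gateway_state() {
    local pid=$1
    # L1-style check: does the process exist at all?
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "down"
        return
    fi
    # L2-style check: does it answer HTTP within the timeout?
    # --noproxy '*' keeps a configured proxy away from the localhost probe.
    if curl --silent --fail --noproxy '*' \
            --max-time "${HEALTH_TIMEOUT:-5}" \
            "http://127.0.0.1:${GATEWAY_PORT:-18789}/health" >/dev/null 2>&1; then
        echo "healthy"
    else
        echo "unresponsive"
    fi
}
```

Because the probe uses only `kill` and `curl`, it can run from cron even when the gateway itself is wedged.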

Quick Start

Prerequisites

macOS with launchd (recommended) or any Unix with cron
OpenClaw installed and configured
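Bash 4+ and Python 3.9+ (the `/bin/bash` bundled with macOS is version 3.2, so Bash 4+ has to be installed separately, for example via Homebrew, and may need to be referenced by its full path)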

Install

git clone https://github.com/lanyasheng/openclaw-guardian.git
cd openclaw-guardian

# Copy scripts to OpenClaw directory
cp scripts/heartbeat-guardian.sh ~/.openclaw/scripts/
cp scripts/memory_maintenance.py ~/.openclaw/scripts/
cp scripts/check_cron_health.py ~/.openclaw/scripts/
cp scripts/cleanup_heartbeat_sessions.sh ~/.openclaw/scripts/
cp scripts/post-update.sh ~/.openclaw/scripts/
cp scripts/upgrade-openclaw.sh ~/.openclaw/scripts/
cp scripts/claude-process-reaper.sh ~/.openclaw/scripts/
chmod +x ~/.openclaw/scripts/*.sh
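The copy commands above assume that `~/.openclaw/scripts/` already exists, and the session cleanup cron entry appends to a file under `~/.openclaw/logs/`. On a fresh setup where either directory is missing, a small helper like the following could be run before the copy step. The function name is hypothetical and not part of the repository.

```bash
# Sketch only: prepare_guardian_dirs is a hypothetical helper.
# Creates the scripts and logs directories if they do not exist yet.
prepare_guardian_dirs() {
    local base="${1:-$HOME/.openclaw}"
    mkdir -p "$base/scripts" "$base/logs"
}
```

Calling `prepare_guardian_dirs` with no argument targets `~/.openclaw`; `mkdir -p` leaves existing directories untouched.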

Configure crontab

crontab -e

Add the following entries:

# L2 heartbeat: every 5 minutes
*/5 * * * * /bin/bash ~/.openclaw/scripts/heartbeat-guardian.sh

# L3 memory maintenance: every Sunday at 4 AM
0 4 * * 0 python3 ~/.openclaw/scripts/memory_maintenance.py --all --broadcast

# L3 heartbeat session cleanup: daily at 4 AM
0 4 * * * /bin/bash ~/.openclaw/scripts/cleanup_heartbeat_sessions.sh >> ~/.openclaw/logs/heartbeat-cleanup.log 2>&1

# L3 cron health: every 30 minutes
*/30 * * * * python3 ~/.openclaw/scripts/check_cron_health.py

# L2 Claude Code zombie reaper: every 30 minutes, kills leaked sessions older than 12h
*/30 * * * * /bin/bash ~/.openclaw/scripts/claude-process-reaper.sh
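The 12h age limit in the reaper entry implies some way of measuring how long a process has been running. One possible approach, shown here as a sketch rather than the actual logic of `claude-process-reaper.sh`, is to parse the `etime` column of `ps`, which is formatted as `[[dd-]hh:]mm:ss`. Both function names are hypothetical.

```bash
# Sketch only: both helpers are hypothetical.
# Convert a ps "etime" value ([[dd-]hh:]mm:ss) to seconds.
etime_to_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '
        NF == 2 { print $1 * 60 + $2 }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3 }
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }'
}

# Succeeds when the elapsed time exceeds the given number of hours.
older_than_hours() {
    local etime=$1 limit_hours=$2
    [ "$(etime_to_seconds "$etime")" -gt $((limit_hours * 3600)) ]
}

# Possible usage: list PIDs of processes older than 12 hours.
#   ps -eo pid=,etime= | while read -r pid etime; do
#       older_than_hours "$etime" 12 && echo "$pid"
#   done
```

Using awk for the arithmetic avoids the shell treating zero-padded fields such as `08` as invalid octal numbers.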

Verify

# Dry run (no actual repairs)
bash ~/.openclaw/scripts/heartbeat-guardian.sh --dry-run

# Run tests
bash tests/test-guardian.sh

Scripts

Script	Lines	Purpose
`heartbeat-guardian.sh`	800+	Core. Step 0 system protection (orphan/OOM) → Step 0.5 log rotation + Chrome cleanup → HTTP+RPC health check → config repair → exponential backoff restart
`test-guardian.sh`	700+	47 unit tests + 5 integration tests
`memory_maintenance.py`	500+	MEMORY.md compaction + daily memory archival + learnings cleanup
`check_cron_health.py`	120+	Critical cron task status check + Chrome CDP self-repair
`cleanup_heartbeat_sessions.sh`	~60	Heartbeat session bloat prevention: archives sessions exceeding the 50 KB threshold and rotates session IDs
`post-update.sh`	80+	Post-upgrade restart + cron timeout recovery + delivery fix
`upgrade-openclaw.sh`	~50	Upgrade entrypoint, auto-invokes post-update
`claude-process-reaper.sh`	~100	Claude Code zombie reaper: kills leaked CLI sessions older than 12h (configurable), with SIGTERM→SIGKILL escalation and memory monitoring
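The SIGTERM→SIGKILL escalation mentioned for `claude-process-reaper.sh` follows a common pattern: ask the process to exit, wait for a grace period, and force-kill it only if it is still there. The sketch below illustrates that pattern under assumed names and a default 10-second grace period; it is not the script's actual code.

```bash
# Sketch only: terminate_with_escalation and its grace period are assumptions.
terminate_with_escalation() {
    local pid=$1 grace=${2:-10} waited=0
    # Polite request first; if the signal cannot be delivered, the process is gone.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$waited" -lt "$grace" ]; do
        # kill -0 only tests for existence, it sends no signal.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        waited=$((waited + 1))
    done
    # Still present after the grace period: force termination.
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

Giving the process a chance to handle SIGTERM lets it flush state and release locks, while the SIGKILL fallback bounds how long a stuck session can linger.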

Configuration

Edit the variables at the top of `heartbeat-guardian.sh`:

GATEWAY_PORT=18789                         # Gateway HTTP port
HEALTH_TIMEOUT=5                           # HTTP health check timeout (seconds)
RPC_HEALTH_TIMEOUT=8                       # RPC health check timeout (seconds)
DEGRADED_ERROR_THRESHOLD=15                # Severe error threshold in err.log
DEGRADED_CONSECUTIVE_THRESHOLD=3           # Consecutive DEGRADED count before restart
MAX_TOTAL_RETRIES=6                        # Maximum retry count
CRASH_DECAY_HOURS=6                        # Fault counter auto-decay interval
BACKOFF_DELAYS=(60 120 300 600 900 1800)   # Exponential backoff (seconds)
ACTIVE_HOURS_START="08:00"                 # Active hours start (during active hours, DEGRADED alone won't trigger a restart)
ACTIVE_HOURS_END="23:00"                   # Active hours end
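How these settings might interact can be sketched as follows. This is one reading of the comments above, not the guardian's actual code: the function names are hypothetical, retry attempts are assumed to be counted from 0, and attempts beyond the end of the schedule are assumed to reuse the last delay.

```bash
# Sketch only: all three helpers are hypothetical.
# Usage: backoff_delay ATTEMPT DELAY...   (at least one delay is required)
# With the array from the configuration:
#   backoff_delay "$attempt" "${BACKOFF_DELAYS[@]}"
backoff_delay() {
    local attempt=$1
    shift
    # Attempts past the end of the schedule reuse the last delay.
    [ "$attempt" -ge "$#" ] && attempt=$(( $# - 1 ))
    shift "$attempt"
    echo "$1"
}

# Succeeds when NOW (HH:MM) falls inside [START, END).
in_active_hours() {
    local now="${1%:*}${1#*:}"
    local start="${2%:*}${2#*:}"
    local end="${3%:*}${3#*:}"
    [ "$now" -ge "$start" ] && [ "$now" -lt "$end" ]
}

# Restart on DEGRADED only after enough consecutive hits and outside active hours.
should_restart_on_degraded() {
    local consecutive=$1 now=$2
    [ "$consecutive" -ge "${DEGRADED_CONSECUTIVE_THRESHOLD:-3}" ] || return 1
    ! in_active_hours "$now" "${ACTIVE_HOURS_START:-08:00}" "${ACTIVE_HOURS_END:-23:00}"
}
```

Deferring DEGRADED restarts to off-hours trades slower recovery for fewer interruptions while agents are in use; a gateway that fails its health check outright would presumably still be restarted at any time.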

What it handles

This system has handled the following problems in production:

Agent corrupts config → Gateway boot loop → auto config rollback
API quota exhausted → model fallback inconsistency across 4 priority layers
Session bloat → agent context degradation
Chrome SingletonLock / Renderer pile-up → browser automation failure
Guardian itself destabilizing Gateway → overly aggressive health checks
ACP child process orphan accumulation → OOM → macOS system crash (OpenClaw Bug #35886)
SOCKS proxy interfering with localhost health checks → false negative detection
Overly strict DEGRADED thresholds → normal operational noise triggering unnecessary restarts
Chrome orphan processes consuming 10GB+ RAM → gradual memory exhaustion
Log files growing unbounded → disk space exhaustion
Heartbeat session bloat → target: "last" causes unbounded session growth (v2026.3.13 lacks isolatedSession) → agent unresponsive due to lane blocking
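The first item in the list, automatic config rollback, can be sketched as a validate-then-restore step. The sketch assumes the config is a JSON file and that a last-known-good copy is kept next to it; neither detail is confirmed by this README, and the function names are hypothetical.

```bash
# Sketch only: assumes a JSON config; both helpers are hypothetical.
# Override validate_config to plug in a different validity check.
validate_config() {
    python3 -m json.tool "$1" >/dev/null 2>&1
}

rollback_if_invalid() {
    local config=$1 backup=$2
    if validate_config "$config"; then
        # Config parses: refresh the last-known-good copy.
        cp "$config" "$backup"
        echo "ok"
    elif [ -f "$backup" ]; then
        # Config is broken: restore the last version that parsed.
        cp "$backup" "$config"
        echo "rolled-back"
    else
        echo "no-backup"
        return 1
    fi
}
```

Refreshing the backup only after a successful validation means a corrupted file never overwrites the copy that a later rollback depends on.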

Testing

# Run all 47 tests
bash tests/test-guardian.sh

# Example output:
# ✓ crash counter starts at 0
# ✓ crash counter increments
# ...
# Results: 47 passed, 0 failed, 0 skipped
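To give a sense of what checks like "crash counter starts at 0" and "crash counter increments" might look like, here is a minimal sketch. It assumes the counter is stored as a single number in a state file; the helper names and the storage format are illustrative and not taken from `test-guardian.sh`.

```bash
# Sketch only: helpers and the state-file format are assumptions.
read_crash_count() {
    local file=$1
    # A missing state file means no crashes have been recorded yet.
    if [ -f "$file" ]; then cat "$file"; else echo 0; fi
}

increment_crash_count() {
    local file=$1 n
    n=$(read_crash_count "$file")
    echo $((n + 1)) > "$file"
}

# Prints a check mark on success, a cross and the mismatch on failure.
assert_eq() {
    local actual=$1 expected=$2 label=$3
    if [ "$actual" = "$expected" ]; then
        echo "✓ $label"
    else
        echo "✗ $label (expected $expected, got $actual)"
        return 1
    fi
}

# Possible usage:
#   state=$(mktemp -u)
#   assert_eq "$(read_crash_count "$state")" 0 "crash counter starts at 0"
#   increment_crash_count "$state"
#   assert_eq "$(read_crash_count "$state")" 1 "crash counter increments"
```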

License

MIT
