fix: per-token rate limiting + smarter offline detection + token prefix lookup #350
Settings PATCH never included preferred_name, so chat source resolution (which prefers preferred_name over name) always showed the stale value.

Fix: add preferred_name: resolvedName.trim() to the PATCH payload alongside name. The backend UserUpdate schema already accepts preferred_name, and the PATCH /me handler writes all provided fields via model_dump(exclude_unset), so no backend changes are needed.

Resolution chain after fix:
- board_memory.py: preferred_name (now updated) → name → "User"
- display-name.ts: preferred_name (now updated) → name → fallback
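The resolution chain can be illustrated with a minimal Python sketch. The function name resolve_display_name and the choice to treat blank strings as missing are assumptions made for illustration; the actual logic in board_memory.py is not shown in this PR.

```python
from typing import Optional


def resolve_display_name(
    preferred_name: Optional[str],
    name: Optional[str],
    fallback: str = "User",
) -> str:
    """Return the first non-blank candidate: preferred_name, then name, then fallback.

    Hypothetical helper; skipping blank strings is an assumption.
    """
    for candidate in (preferred_name, name):
        if candidate and candidate.strip():
            return candidate.strip()
    return fallback
```

Under this chain, a PATCH that updates only name leaves the old preferred_name in first position, which is why the stale value kept winning until the payload carried both fields.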
… lookup

Root cause: agents sharing a Docker bridge IP were all counted toward the same IP-based rate limit (20 req/60s). When 13 agents burst-fired heartbeats after a gateway restart, all subsequent requests hit 429, making all agents appear unresponsive in MC even though their tokens were valid.

Changes:
- agent_auth.py: rate-limit by token prefix (first 8 chars) instead of client IP. Each agent now has its own independent 60 req/60s bucket regardless of network topology.
- rate_limit.py: raise agent_auth limit from 20 → 60 per token per minute. Enough headroom for heartbeat + board reads + task writes without false-positive blocking.
- agents.py: add last_heartbeat_at and agent_token_prefix fields (migration c7e4f2a9b1d3 already applied to production).
- db_agent_state.py: store token prefix on mint/rotation for O(1) pre-filter.
- provisioning_db.py: use last_heartbeat_at + 1.5× per-agent interval for offline threshold. Previously all agents used a fixed 10-min OFFLINE_AFTER, which wrongly marked agents with 30–45m heartbeat cycles as offline between legitimate pings.
- add migration: c7e4f2a9b1d3 adds last_heartbeat_at and agent_token_prefix columns.
- add tests for new agent health API, computed status, and token lookup paths.

Companion fixes (outside this repo):
- ai.openclaw.token-sync.plist StartInterval: 1800 → 300 (5-min sync instead of 30m)
- All agent HEARTBEAT.md files: 401 → auto-recover via sync script, then retry once

Closes: agents appear offline/unresponsive after any gateway restart or token rotation
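The per-token bucketing can be sketched as a small in-memory sliding-window limiter. This is not the project's agent_auth_limiter, which is awaited in the real code and whose internals are not shown in this PR; SlidingWindowLimiter, rate_key_for, and the injectable clock are hypothetical names introduced for illustration.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per key within any window_seconds span."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the sketch can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def is_allowed(self, key: str) -> bool:
        now = self._clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def rate_key_for(token: str) -> str:
    """Bucket key: the first 8 characters of the agent token."""
    return token[:8]
```

With an IP key and a 20-request cap, 13 agents behind one bridge IP exhaust the shared bucket almost immediately after a restart burst. Keyed by rate_key_for(token) with a 60-request cap, each agent draws only on its own bucket.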
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 868852d5f4
    rate_key = resolved[:8]
    if not await agent_auth_limiter.is_allowed(rate_key):
        raise HTTPException(status_code=status.HTTP_429_TOO_MANY_REQUESTS)
Add source-based throttling before prefix bucket check
The new limiter key is derived from resolved[:8], which is fully attacker-controlled until after token verification, so unauthenticated callers can bypass throttling by varying the first 8 characters on each request. This removes the previous per-source cap and allows unbounded auth attempts/CPU load (especially expensive while NULL-prefix fallback rows still exist and trigger PBKDF2 checks), which is a security and availability regression from the prior IP-based guard.
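One way to read this suggestion is a two-tier check: a coarse per-source cap first, then the per-prefix bucket. The sketch below assumes a simple fixed-window counter; FixedWindowLimiter, check_auth_attempt, and the 1000-request source cap are hypothetical, and a real source cap would need to sit above the legitimate aggregate traffic of all agents sharing one bridge IP.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Count hits per key in fixed, non-overlapping windows."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        # key -> (window index, hits counted in that window)
        self._buckets: Dict[str, Tuple[int, int]] = {}

    def is_allowed(self, key: str) -> bool:
        window = int(self._clock() // self.window_seconds)
        seen_window, count = self._buckets.get(key, (window, 0))
        if seen_window != window:
            count = 0  # a new window started, so the counter resets
        if count >= self.max_requests:
            return False
        self._buckets[key] = (window, count + 1)
        return True


def check_auth_attempt(
    client_ip: str,
    presented_token: str,
    source_limiter: FixedWindowLimiter,
    token_limiter: FixedWindowLimiter,
) -> bool:
    """Return True if the attempt may proceed to token verification."""
    # Source cap first: varying the token prefix cannot escape this bucket.
    if not source_limiter.is_allowed(client_ip):
        return False
    # Then the per-prefix bucket, keyed on the unverified token's first 8 chars.
    return token_limiter.is_allowed(presented_token[:8])
```

In this arrangement a caller that rotates prefixes on every request still stops at the source cap, while well-behaved agents behind the same IP keep their independent per-token buckets.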
    m = re.match(r"^(\d+)([smh])$", every or "")
    if not m:
        return _FALLBACK_OFFLINE_AFTER
Support day cadence when computing offline threshold
The new heartbeat threshold parser only accepts [smh], so agents configured with a day-based cadence (e.g. 1d) fall back to the fixed 10-minute threshold and will be marked offline far too early. This is a functional regression because day units are already exposed in the UI (frontend/src/app/board-groups/[groupId]/page.tsx, unit option value="d"), so valid configurations can now produce incorrect offline status.
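A parser that also accepts day units might look like the sketch below. The regex and the _FALLBACK_OFFLINE_AFTER name come from the reviewed snippet; the function name offline_threshold, the timedelta return type, and the unit table are assumptions, and the 1.5× factor is taken from the PR description.

```python
import re
from datetime import timedelta
from typing import Optional

# Assumed to be a timedelta; the PR only states that the fallback is 10 minutes.
_FALLBACK_OFFLINE_AFTER = timedelta(minutes=10)

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def offline_threshold(every: Optional[str]) -> timedelta:
    """Return 1.5x the configured heartbeat interval, or the fallback if unparseable."""
    m = re.match(r"^(\d+)([smhd])$", every or "")
    if not m:
        return _FALLBACK_OFFLINE_AFTER
    interval_seconds = int(m.group(1)) * _UNIT_SECONDS[m.group(2)]
    return timedelta(seconds=interval_seconds * 1.5)
```

With the d unit accepted, an agent configured as 1d gets a 36-hour threshold instead of falling back to 10 minutes.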
Agent scripts have historically queried ?tag=chat when polling board chat, but the API only recognised is_chat as a filter parameter; tag/tags were silently ignored, returning ALL board_memory entries instead of only chat messages.

This worked accidentally when the most recent board entries were chat messages. After heavy non-chat writes (bootstrap logs, config changes, heartbeat memory) non-chat entries dominated the recency list, causing agents to read garbage results: they saw their own old untagged responses, concluded questions were already answered, and stopped posting replies.

Fix: accept tag and tags as legacy aliases in both the user-facing board_memory router and the agent-scoped wrapper in agent.py. When tag=chat or tags=chat is present and is_chat is not set, resolve to is_chat=True. include_in_schema=False keeps these aliases out of the public API docs.

Root cause documented in PR abhi1693#350 companion notes.

Affects: all 13 agents using ?tag=chat in heartbeat curl scripts.
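The alias rule can be sketched as a pure function, leaving out the router wiring. The name resolve_is_chat is hypothetical, and the exact-match comparison against "chat" (no comma splitting of tags) is an assumption.

```python
from typing import Optional


def resolve_is_chat(
    is_chat: Optional[bool],
    tag: Optional[str] = None,
    tags: Optional[str] = None,
) -> Optional[bool]:
    """Map the legacy tag/tags query aliases onto the is_chat filter.

    An explicit is_chat always wins; the aliases only apply when it is unset.
    """
    if is_chat is not None:
        return is_chat
    if tag == "chat" or tags == "chat":
        return True
    return None  # no filter requested
```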
ensure_session called sessions.patch before sending each chat notification. When the target session was mid-processing (e.g. running an LLM call), sessions.patch serialised behind the active operation and blocked for up to 20 seconds. MC's 10-second WS timeout fired, the error was silently dropped, and the agent never received the push notification.

Chat delivery previously relied on heartbeat polling (up to 10-minute delay) as a fallback. Removing the unnecessary sessions.patch makes delivery synchronous and immediate via chat.send alone. The session is always present after agent provisioning; no pre-flight patch is needed for message delivery.
Root Cause
Agents sharing a Docker bridge IP were all counted toward the same IP-based rate limit (20 req/60s). When 13+ agents burst-fired heartbeats after a gateway restart, all subsequent MC API requests returned 429 — making all agents appear unresponsive/offline even though their tokens were valid.
A second compounding issue: the offline detection threshold was a fixed 10 minutes for all agents. Agents with 30–45m heartbeat intervals were being incorrectly marked offline between legitimate pings.
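The difference between the fixed and per-agent thresholds can be seen in a minimal sketch. The function computed_status, its signature, and the "online"/"offline" strings are hypothetical; only the 10-minute fallback and the 1.5× factor come from this PR.

```python
from datetime import datetime, timedelta
from typing import Optional

FALLBACK_OFFLINE_AFTER = timedelta(minutes=10)


def computed_status(
    last_heartbeat_at: Optional[datetime],
    heartbeat_interval: Optional[timedelta],
    now: datetime,
) -> str:
    """Return "offline" once the last heartbeat is older than the agent's threshold."""
    if last_heartbeat_at is None:
        return "offline"  # assumption: an agent that never reported counts as offline
    if heartbeat_interval:
        threshold = heartbeat_interval * 1.5
    else:
        threshold = FALLBACK_OFFLINE_AFTER
    return "offline" if now - last_heartbeat_at > threshold else "online"
```

An agent on a 30-minute cycle that last pinged 20 minutes ago is offline under the fixed 10-minute rule but online under its own 45-minute threshold.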
Changes
agent_auth.py: rate limit by token prefix, not client IP
All Docker-side agents share the same bridge IP (192.168.x.1). IP-based limiting creates false-positive 429s the moment more than 20 requests arrive from any one gateway. Rate key is now the first 8 chars of the agent token, so each agent gets its own independent bucket.

rate_limit.py: raise limit from 20 → 60 per token per minute
1 req/second sustained per agent. Covers heartbeat + board reads + task writes without blocking.

provisioning_db.py + agents.py: per-agent offline threshold
New last_heartbeat_at column tracks the actual heartbeat timestamp. with_computed_status now uses 1.5× the agent's configured heartbeat interval as the offline threshold.

db_agent_state.py: store token prefix on mint/rotation
O(1) DB pre-filter before PBKDF2. Eliminates O(N × PBKDF2) linear scan at auth time.
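The prefix pre-filter and its null-prefix fallback can be sketched with an in-memory list standing in for the agents table. AgentRow, mint, find_agent, the salt layout, and the iteration count are all hypothetical; only the 8-character prefix and the use of PBKDF2 come from this PR.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass
from typing import List, Optional

ITERATIONS = 10_000  # kept low for the sketch; not a recommended production value


@dataclass
class AgentRow:
    agent_id: str
    token_prefix: Optional[str]  # None for rows minted before the column existed
    salt: bytes
    token_hash: bytes


def hash_token(token: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", token.encode(), salt, ITERATIONS)


def mint(agent_id: str, token: str, store_prefix: bool = True) -> AgentRow:
    """Create a row, storing the first 8 token chars as the lookup prefix."""
    salt = os.urandom(16)
    prefix = token[:8] if store_prefix else None
    return AgentRow(agent_id, prefix, salt, hash_token(token, salt))


def find_agent(token: str, rows: List[AgentRow]) -> Optional[AgentRow]:
    prefix = token[:8]
    # Fast path: in a database this would be an indexed equality lookup.
    candidates = [r for r in rows if r.token_prefix == prefix]
    # Fallback: legacy rows without a stored prefix still need a hash check.
    candidates += [r for r in rows if r.token_prefix is None]
    for row in candidates:
        if hmac.compare_digest(hash_token(token, row.salt), row.token_hash):
            return row
    return None
```

Once every row has a prefix, a lookup runs PBKDF2 only against the rows sharing that prefix rather than against every agent. Until then, each legacy row adds one hash per attempt.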

Migration c7e4f2a9b1d3
Adds last_heartbeat_at and agent_token_prefix. Already applied to production.

Testing
- test_agent_token_lookup.py: prefix fast path + null-prefix fallback
- test_agent_computed_status.py: per-interval offline threshold edge cases
- test_agent_health_api.py: heartbeat API updates last_heartbeat_at