fix: per-token rate limiting + smarter offline detection + token prefix lookup by thomasrocas · Pull Request #350 · abhi1693/openclaw-mission-control

thomasrocas · 2026-05-13T09:16:20Z

Root Cause

Agents sharing a Docker bridge IP were all counted toward the same IP-based rate limit (20 req/60s). When 13+ agents burst-fired heartbeats after a gateway restart, all subsequent MC API requests returned 429 — making all agents appear unresponsive/offline even though their tokens were valid.

A second compounding issue: the offline detection threshold was a fixed 10 minutes for all agents. Agents with 30–45m heartbeat intervals were being incorrectly marked offline between legitimate pings.

Changes

`agent_auth.py` — rate limit by token prefix, not client IP

All Docker-side agents share the same bridge IP (192.168.x.1). IP-based limiting creates false-positive 429s the moment more than 20 requests arrive from any one gateway. Rate key is now the first 8 chars of the agent token — each agent gets its own independent bucket.

`rate_limit.py` — raise limit from 20 → 60 per token per minute

1 req/second sustained per agent. Covers heartbeat + board reads + task writes without blocking.
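A minimal sketch of what a per-token bucket could look like, assuming an in-memory sliding window. The class name `SlidingWindowLimiter` and the helper `rate_key_for` are hypothetical and not taken from the PR; the PR's own limiter is awaited (`await agent_auth_limiter.is_allowed(...)`), whereas this sketch is synchronous for brevity.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Hypothetical in-memory limiter: one independent bucket per key."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps a rate key to the timestamps of its accepted requests.
        self._hits = defaultdict(deque)

    def is_allowed(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def rate_key_for(token: str) -> str:
    # First 8 characters of the agent token, as the PR describes.
    return token[:8]
```

With a key per token prefix, two agents behind the same bridge IP no longer exhaust each other's budget: filling one bucket leaves the other untouched.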

`provisioning_db.py` + `agents.py` — per-agent offline threshold

New last_heartbeat_at column tracks the actual heartbeat timestamp. with_computed_status now uses 1.5× the agent's configured heartbeat interval as the offline threshold.

Agent on 10m heartbeat → offline after 15m (was: offline after 10m)
Agent on 45m heartbeat → offline after 67.5m (was: offline after 10m)
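The arithmetic above can be sketched as follows. This is an illustration only: the function names are hypothetical, and treating a missing `last_heartbeat_at` as offline is an assumption, since the PR does not say how `with_computed_status` handles that case.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

OFFLINE_MULTIPLIER = 1.5
# The previous fixed threshold, used here when no interval is configured.
FALLBACK_OFFLINE_AFTER = timedelta(minutes=10)


def offline_after(heartbeat_interval: Optional[timedelta]) -> timedelta:
    """Return how long an agent may stay silent before counting as offline."""
    if heartbeat_interval is None:
        return FALLBACK_OFFLINE_AFTER
    return heartbeat_interval * OFFLINE_MULTIPLIER


def is_offline(
    last_heartbeat_at: Optional[datetime],
    heartbeat_interval: Optional[timedelta],
    now: datetime,
) -> bool:
    # Assumption: an agent that has never sent a heartbeat is offline.
    if last_heartbeat_at is None:
        return True
    return now - last_heartbeat_at > offline_after(heartbeat_interval)
```

An agent on a 45m cadence that last pinged 30 minutes ago stays online under this rule, where the fixed 10-minute threshold would have flagged it.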

`db_agent_state.py` — store token prefix on mint/rotation

O(1) DB pre-filter before PBKDF2. Eliminates O(N × PBKDF2) linear scan at auth time.
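A rough sketch of the pre-filter idea, using an in-memory list in place of the database. `AgentRow`, `mint`, `find_agent`, the salt handling, and the iteration count are all assumptions for illustration; only the 8-character prefix, the PBKDF2 check, and the null-prefix fallback come from the PR.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass
from typing import Optional

PREFIX_LEN = 8
ITERATIONS = 100_000  # illustrative value, not taken from the PR


@dataclass
class AgentRow:
    name: str
    salt: bytes
    token_hash: bytes
    token_prefix: Optional[str]  # None for rows minted before the migration


def hash_token(token: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", token.encode(), salt, ITERATIONS)


def mint(name: str, token: str, store_prefix: bool = True) -> AgentRow:
    salt = os.urandom(16)
    prefix = token[:PREFIX_LEN] if store_prefix else None
    return AgentRow(name, salt, hash_token(token, salt), prefix)


def find_agent(token: str, rows: list) -> Optional[AgentRow]:
    prefix = token[:PREFIX_LEN]
    # Fast path: rows whose stored prefix matches. In SQL this would be an
    # indexed WHERE clause rather than a Python scan.
    candidates = [r for r in rows if r.token_prefix == prefix]
    # Fallback: legacy rows without a prefix still need the full check.
    candidates += [r for r in rows if r.token_prefix is None]
    for row in candidates:
        # PBKDF2 runs only for the few remaining candidates.
        if hmac.compare_digest(hash_token(token, row.salt), row.token_hash):
            return row
    return None
```

The expensive hash is computed once per candidate, so the cost depends on how many rows share the prefix or lack one, not on the total number of agents.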

Migration `c7e4f2a9b1d3`

Adds last_heartbeat_at and agent_token_prefix. Already applied to production.

Testing

test_agent_token_lookup.py — prefix fast path + null-prefix fallback
test_agent_computed_status.py — per-interval offline threshold edge cases
test_agent_health_api.py — heartbeat API updates last_heartbeat_at

Settings PATCH never included preferred_name, so chat source resolution (which prefers preferred_name over name) always showed the stale value.

Fix: add preferred_name: resolvedName.trim() to the PATCH payload alongside name. No backend changes needed: the UserUpdate schema already accepts preferred_name, and the PATCH /me handler writes all provided fields via model_dump(exclude_unset).

Resolution chain after fix:

board_memory.py: preferred_name (now updated) → name → "User"
display-name.ts: preferred_name (now updated) → name → fallback
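The backend resolution chain can be pictured with a small sketch. The function name `resolve_display_name` is hypothetical, and skipping empty or whitespace-only values is an assumption; the commit message only states the order preferred_name → name → "User".

```python
from typing import Optional


def resolve_display_name(
    preferred_name: Optional[str],
    name: Optional[str],
    fallback: str = "User",
) -> str:
    """Pick the first usable value in the order preferred_name, name, fallback."""
    for candidate in (preferred_name, name):
        # Assumption: blank strings are treated the same as missing values.
        if candidate and candidate.strip():
            return candidate.strip()
    return fallback
```

Because preferred_name wins whenever it is set, a PATCH that updates only name leaves the displayed value unchanged, which matches the stale-value symptom described here.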

… lookup

Root cause: agents sharing a Docker bridge IP were all counted toward the same IP-based rate limit (20 req/60s). When 13 agents burst-fired heartbeats after a gateway restart, all subsequent requests hit 429, making all agents appear unresponsive in MC even though their tokens were valid.

Changes:

- agent_auth.py: rate-limit by token prefix (first 8 chars) instead of client IP. Each agent now has its own independent 60 req/60s bucket regardless of network topology.
- rate_limit.py: raise agent_auth limit from 20 → 60 per token per minute. Enough headroom for heartbeat + board reads + task writes without false-positive blocking.
- agents.py: add last_heartbeat_at and agent_token_prefix fields (migration c7e4f2a9b1d3 already applied to production).
- db_agent_state.py: store token prefix on mint/rotation for O(1) pre-filter.
- provisioning_db.py: use last_heartbeat_at + 1.5× per-agent interval for offline threshold. Previously all agents used a fixed 10-min OFFLINE_AFTER, which wrongly marks agents with 30–45m heartbeat cycles as offline between legitimate pings.
- add migration: c7e4f2a9b1d3 adds last_heartbeat_at and agent_token_prefix columns.
- add tests for new agent health API, computed status, and token lookup paths.

Companion fixes (outside this repo):

- ai.openclaw.token-sync.plist StartInterval: 1800 → 300 (5-min sync instead of 30m)
- All agent HEARTBEAT.md files: 401 → auto-recover via sync script, then retry once

Closes: agents appear offline/unresponsive after any gateway restart or token rotation

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 868852d5f4

chatgpt-codex-connector · 2026-05-13T09:23:58Z

+    rate_key = resolved[:8]
+    if not await agent_auth_limiter.is_allowed(rate_key):
+        raise HTTPException(status_code=status.HTTP_429_TOO_MANY_REQUESTS)


Add source-based throttling before prefix bucket check

The new limiter key is derived from resolved[:8], which is fully attacker-controlled until after token verification, so unauthenticated callers can bypass throttling by varying the first 8 characters on each request. This removes the previous per-source cap and allows unbounded auth attempts/CPU load (especially expensive while NULL-prefix fallback rows still exist and trigger PBKDF2 checks), which is a security and availability regression from the prior IP-based guard.
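One way to read this suggestion is a two-tier check: a source-keyed cap first, then the per-prefix bucket. The sketch below is an assumption about how that could look, not code from the PR; `FixedWindowLimiter`, `allow_auth_attempt`, and the 600 req/60s source cap are hypothetical.

```python
import time
from collections import defaultdict
from typing import Optional


class FixedWindowLimiter:
    """Hypothetical counter-based limiter. Old windows are never evicted
    here, so this is a sketch rather than something to deploy."""

    def __init__(self, max_requests: int, window_seconds: int) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._counts = defaultdict(int)

    def is_allowed(self, key: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        bucket = (key, int(now // self.window_seconds))
        self._counts[bucket] += 1
        return self._counts[bucket] <= self.max_requests


# Assumed limits: a generous per-source cap and the PR's per-token cap.
source_limiter = FixedWindowLimiter(max_requests=600, window_seconds=60)
token_limiter = FixedWindowLimiter(max_requests=60, window_seconds=60)


def allow_auth_attempt(
    client_ip: str, token: str, now: Optional[float] = None
) -> bool:
    # Source check first: the caller cannot change this key by altering
    # the token it sends.
    if not source_limiter.is_allowed(client_ip, now):
        return False
    # Per-prefix bucket second, so one agent cannot starve the others.
    return token_limiter.is_allowed(token[:8], now)
```

The source cap would need to sit well above 20 req/60s to avoid reintroducing the shared-bridge-IP problem; the value used here is only a placeholder.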


chatgpt-codex-connector · 2026-05-13T09:23:58Z

+    m = re.match(r"^(\d+)([smh])$", every or "")
+    if not m:
+        return _FALLBACK_OFFLINE_AFTER


Support day cadence when computing offline threshold

The new heartbeat threshold parser only accepts [smh], so agents configured with a day-based cadence (e.g. 1d) fall back to the fixed 10-minute threshold and will be marked offline far too early. This is a functional regression because day units are already exposed in the UI (frontend/src/app/board-groups/[groupId]/page.tsx, unit option value="d"), so valid configurations can now produce incorrect offline status.
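A possible shape for the fix, extending the quoted regex with a `d` unit. Only the regex pattern and `_FALLBACK_OFFLINE_AFTER` appear in the diff; the function name `offline_threshold`, the unit table, and the guard for a zero interval are assumptions.

```python
import re
from datetime import timedelta
from typing import Optional

_FALLBACK_OFFLINE_AFTER = timedelta(minutes=10)
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
_EVERY_RE = re.compile(r"^(\d+)([smhd])$")  # "d" added for day cadences


def offline_threshold(every: Optional[str]) -> timedelta:
    """Return 1.5x the configured heartbeat interval, or the fallback."""
    m = _EVERY_RE.match(every or "")
    if not m:
        return _FALLBACK_OFFLINE_AFTER
    seconds = int(m.group(1)) * _UNIT_SECONDS[m.group(2)]
    # Assumption: a zero interval is invalid and uses the fallback too.
    if seconds == 0:
        return _FALLBACK_OFFLINE_AFTER
    return timedelta(seconds=seconds * 1.5)
```

With this change a `1d` cadence yields a 36-hour threshold instead of falling through to 10 minutes.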


Agent scripts have historically queried ?tag=chat when polling board chat, but the API only recognised is_chat as a filter parameter; tag/tags were silently ignored, returning ALL board_memory entries instead of only chat messages.

This worked accidentally when the most recent board entries were chat messages. After heavy non-chat writes (bootstrap logs, config changes, heartbeat memory) non-chat entries dominated the recency list, causing agents to read garbage results: they saw their own old untagged responses, concluded questions were already answered, and stopped posting replies.

Fix: accept tag and tags as legacy aliases in both the user-facing board_memory router and the agent-scoped wrapper in agent.py. When tag=chat or tags=chat is present and is_chat is not set, resolve to is_chat=True. include_in_schema=False keeps these aliases out of the public API docs.

Root cause documented in PR abhi1693#350 companion notes. Affects: all 13 agents using ?tag=chat in heartbeat curl scripts.
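The alias rule can be expressed as a small pure function. This is a sketch under the assumption that the aliases arrive as optional strings; the name `resolve_is_chat` is hypothetical and the router wiring is omitted.

```python
from typing import Optional


def resolve_is_chat(
    is_chat: Optional[bool],
    tag: Optional[str] = None,
    tags: Optional[str] = None,
) -> Optional[bool]:
    """Map the legacy tag/tags aliases onto the is_chat filter."""
    # An explicit is_chat always wins over the legacy aliases.
    if is_chat is not None:
        return is_chat
    if tag == "chat" or tags == "chat":
        return True
    # No filter requested: the caller gets all entries, as before.
    return None
```

Keeping the explicit parameter authoritative means a request such as ?is_chat=false&tag=chat still returns non-chat entries.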

ensure_session called sessions.patch before sending each chat notification. When the target session was mid-processing (e.g. running an LLM call), sessions.patch serialised behind the active operation and blocked for up to 20 seconds. MC's 10-second WS timeout fired, the error was silently dropped, and the agent never received the push notification. Chat delivery previously relied on heartbeat polling (up to 10-minute delay) as a fallback. Removing the unnecessary sessions.patch makes delivery synchronous and immediate via chat.send alone. The session is always present after agent provisioning; no pre-flight patch is needed for message delivery.

Thomas Rocas added 3 commits April 13, 2026 08:45

Merge remote-tracking branch 'origin/master'

9270d5b

chatgpt-codex-connector Bot reviewed May 13, 2026


Thomas Rocas added 2 commits May 13, 2026 02:55

thomasrocas wants to merge 5 commits into abhi1693:master from thomasrocas:fix/agent-auth-ratelimit-and-token-recovery
