fix: per-token rate limiting + smarter offline detection + token prefix lookup #350

Open
thomasrocas wants to merge 5 commits into abhi1693:master from thomasrocas:fix/agent-auth-ratelimit-and-token-recovery

@thomasrocas

Root Cause

Agents sharing a Docker bridge IP were all counted toward the same IP-based rate limit (20 req/60s). When 13+ agents burst-fired heartbeats after a gateway restart, all subsequent MC API requests returned 429 — making all agents appear unresponsive/offline even though their tokens were valid.

A second compounding issue: the offline detection threshold was a fixed 10 minutes for all agents. Agents with 30–45m heartbeat intervals were being incorrectly marked offline between legitimate pings.

Changes

agent_auth.py — rate limit by token prefix, not client IP

All Docker-side agents share the same bridge IP (192.168.x.1). IP-based limiting creates false-positive 429s the moment more than 20 requests arrive from any one gateway within 60 seconds. The rate key is now the first 8 chars of the agent token, so each agent gets its own independent bucket.

rate_limit.py — raise limit from 20 → 60 per token per minute

1 req/second sustained per agent. Covers heartbeat + board reads + task writes without blocking.
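A minimal sketch of a per-key sliding-window limiter, to illustrate the bucket-per-token idea. The class name `SlidingWindowLimiter`, the helper `rate_key`, and the injectable `clock` are hypothetical and only for illustration; the limiter in this PR is awaited (`await agent_auth_limiter.is_allowed(...)`), whereas this sketch is synchronous and in-memory.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds for each key."""

    def __init__(self, limit=60, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the sketch can be tested without sleeping
        self._hits = defaultdict(deque)  # key -> timestamps of accepted calls

    def is_allowed(self, key):
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def rate_key(token):
    # First 8 characters of the agent token, as described above.
    return token[:8]
```

With one bucket per key, a burst from one agent exhausts only that agent's 60 requests; other agents behind the same bridge IP are unaffected.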

provisioning_db.py + agents.py — per-agent offline threshold

New last_heartbeat_at column tracks the actual heartbeat timestamp. with_computed_status now uses 1.5× the agent's configured heartbeat interval as the offline threshold.

  • Agent on 10m heartbeat → offline after 15m (was: offline after 10m)
  • Agent on 45m heartbeat → offline after 67.5m (was: offline after 10m)
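The arithmetic behind the two examples can be sketched as follows. This is an illustration, not the code in `provisioning_db.py`: the function names `offline_after` and `is_offline` are hypothetical, and the interval format (`"10m"`, `"45m"`) is assumed from the `[smh]` regex quoted in the review.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

FALLBACK_OFFLINE_AFTER = timedelta(minutes=10)  # the previous fixed threshold
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def offline_after(every: Optional[str]) -> timedelta:
    """Return 1.5x the configured heartbeat interval, or the fallback."""
    m = re.match(r"^(\d+)([smh])$", every or "")
    if not m:
        return FALLBACK_OFFLINE_AFTER
    interval_seconds = int(m.group(1)) * _UNIT_SECONDS[m.group(2)]
    return timedelta(seconds=interval_seconds * 1.5)


def is_offline(last_heartbeat_at: datetime, every: Optional[str], now: datetime) -> bool:
    """An agent is offline once its last heartbeat is older than the threshold."""
    return now - last_heartbeat_at > offline_after(every)


now = datetime(2026, 4, 13, 12, 0, tzinfo=timezone.utc)
print(offline_after("10m"))  # 0:15:00
print(offline_after("45m"))  # 1:07:30
print(is_offline(now - timedelta(minutes=60), "45m", now))  # False
```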

db_agent_state.py — store token prefix on mint/rotation

O(1) DB pre-filter before PBKDF2. Eliminates O(N × PBKDF2) linear scan at auth time.
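A rough sketch of the pre-filter idea, assuming salted PBKDF2-HMAC-SHA256 hashes and an in-memory list standing in for the agents table. `AgentRow`, `mint`, `find_agent`, and the iteration count are hypothetical. In a real database the prefix match would be an indexed `WHERE agent_token_prefix = :prefix` lookup rather than a list scan.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass
from typing import List, Optional

PREFIX_LEN = 8
ITERATIONS = 10_000  # illustrative only, not the project's setting


@dataclass
class AgentRow:
    agent_id: str
    salt: bytes
    token_hash: bytes
    token_prefix: Optional[str]  # None for rows minted before the migration


def hash_token(token: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", token.encode(), salt, ITERATIONS)


def mint(agent_id: str, token: str, with_prefix: bool = True) -> AgentRow:
    """Store the hash and, on mint/rotation, the plaintext prefix."""
    salt = os.urandom(16)
    prefix = token[:PREFIX_LEN] if with_prefix else None
    return AgentRow(agent_id, salt, hash_token(token, salt), prefix)


def find_agent(rows: List[AgentRow], token: str) -> Optional[AgentRow]:
    prefix = token[:PREFIX_LEN]
    # Fast path: only rows whose stored prefix matches are PBKDF2-checked.
    fast = [r for r in rows if r.token_prefix == prefix]
    # Fallback: legacy rows with a NULL prefix still need the full check.
    legacy = [r for r in rows if r.token_prefix is None]
    for row in fast + legacy:
        if hmac.compare_digest(hash_token(token, row.salt), row.token_hash):
            return row
    return None
```

Once every row has a prefix, the legacy list is empty and each request costs roughly one PBKDF2 evaluation instead of one per agent.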

Migration c7e4f2a9b1d3

Adds last_heartbeat_at and agent_token_prefix. Already applied to production.

Testing

  • test_agent_token_lookup.py — prefix fast path + null-prefix fallback
  • test_agent_computed_status.py — per-interval offline threshold edge cases
  • test_agent_health_api.py — heartbeat API updates last_heartbeat_at

Thomas Rocas added 3 commits April 13, 2026 08:45
Settings PATCH never included preferred_name, so chat source resolution
(which prefers preferred_name over name) always showed the stale value.

Fix: add preferred_name: resolvedName.trim() to the PATCH payload
alongside name. Backend UserUpdate already accepts preferred_name and
the PATCH handler writes all provided fields via model_dump(exclude_unset).

No backend changes needed — UserUpdate schema and PATCH /me handler
already handle preferred_name correctly.

Resolution chain after fix:
  board_memory.py: preferred_name (now updated) → name → "User"
  display-name.ts: preferred_name (now updated) → name → fallback
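
The resolution chain can be sketched in a few lines. The function name `display_name` is hypothetical, and skipping blank strings is an assumption about how the real resolvers treat empty values.

```python
from typing import Optional


def display_name(preferred_name: Optional[str], name: Optional[str], fallback: str = "User") -> str:
    """Resolve preferred_name -> name -> fallback, skipping empty values."""
    for candidate in (preferred_name, name):
        if candidate and candidate.strip():
            return candidate.strip()
    return fallback
```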
… lookup

Root cause: agents sharing a Docker bridge IP were all counted toward the same
IP-based rate limit (20 req/60s). When 13 agents burst-fired heartbeats after a
gateway restart, all subsequent requests hit 429 — making all agents appear
unresponsive in MC even though their tokens were valid.

Changes:
- agent_auth.py: rate-limit by token prefix (first 8 chars) instead of client IP.
  Each agent now has its own independent 60 req/60s bucket regardless of network
  topology.
- rate_limit.py: raise agent_auth limit from 20 → 60 per token per minute. Enough
  headroom for heartbeat + board reads + task writes without false-positive blocking.
- agents.py: add last_heartbeat_at and agent_token_prefix fields (migration c7e4f2a9b1d3
  already applied to production).
- db_agent_state.py: store token prefix on mint/rotation for O(1) pre-filter.
- provisioning_db.py: use last_heartbeat_at + 1.5× per-agent interval for offline
  threshold. Previously all agents used fixed 10-min OFFLINE_AFTER — which wrongly
  marks agents with 30–45m heartbeat cycles as offline between legitimate pings.
- add migration: c7e4f2a9b1d3 adds last_heartbeat_at and agent_token_prefix columns.
- add tests for new agent health API, computed status, and token lookup paths.

Companion fixes (outside this repo):
- ai.openclaw.token-sync.plist StartInterval: 1800 → 300 (5-min sync instead of 30m)
- All agent HEARTBEAT.md files: 401 → auto-recover via sync script, then retry once

Closes: agents appear offline/unresponsive after any gateway restart or token rotation

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 868852d5f4

Comment on lines +157 to +159
rate_key = resolved[:8]
if not await agent_auth_limiter.is_allowed(rate_key):
    raise HTTPException(status_code=status.HTTP_429_TOO_MANY_REQUESTS)

[P1] Add source-based throttling before prefix bucket check

The new limiter key is derived from resolved[:8], which is fully attacker-controlled until after token verification, so unauthenticated callers can bypass throttling by varying the first 8 characters on each request. This removes the previous per-source cap and allows unbounded auth attempts/CPU load (especially expensive while NULL-prefix fallback rows still exist and trigger PBKDF2 checks), which is a security and availability regression from the prior IP-based guard.
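
One way to address this, sketched under assumptions: keep a coarse per-source cap in front of the per-prefix bucket, so varying the prefix cannot exceed the source cap. `WindowLimiter`, `allow_request`, and both limits are hypothetical names and values, not code from this PR.

```python
import time
from collections import defaultdict, deque


class WindowLimiter:
    """Sliding-window limiter: at most `limit` accepted calls per `window` seconds per key."""

    def __init__(self, limit, window=60.0, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self._hits = defaultdict(deque)

    def is_allowed(self, key):
        now = self.clock()
        hits = self._hits[key]
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def allow_request(client_ip, token, source_limiter, token_limiter):
    # Source cap first: bounds total attempts from one address,
    # regardless of how the caller varies the token prefix.
    if not source_limiter.is_allowed(client_ip):
        return False
    # Then the per-agent bucket keyed by the token prefix.
    return token_limiter.is_allowed(token[:8])
```

Because all Docker-side agents share one bridge IP, the source cap would need to sit well above the combined legitimate traffic of every agent behind that address; otherwise it reintroduces the original false-positive 429s.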


Comment on lines +67 to +69
m = re.match(r"^(\d+)([smh])$", every or "")
if not m:
    return _FALLBACK_OFFLINE_AFTER

[P1] Support day cadence when computing offline threshold

The new heartbeat threshold parser only accepts [smh], so agents configured with a day-based cadence (e.g. 1d) fall back to the fixed 10-minute threshold and will be marked offline far too early. This is a functional regression because day units are already exposed in the UI (frontend/src/app/board-groups/[groupId]/page.tsx, unit option value="d"), so valid configurations can now produce incorrect offline status.


Thomas Rocas added 2 commits May 13, 2026 02:55
Agent scripts have historically queried ?tag=chat when polling board chat, but
the API only recognised is_chat as a filter parameter — tag/tags were silently
ignored, returning ALL board_memory entries instead of only chat messages.

This worked accidentally when the most recent board entries were chat messages.
After heavy non-chat writes (bootstrap logs, config changes, heartbeat memory)
non-chat entries dominated the recency list, causing agents to read garbage
results: they saw their own old untagged responses, concluded questions were
already answered, and stopped posting replies.

Fix: accept tag and tags as legacy aliases in both the user-facing board_memory
router and the agent-scoped wrapper in agent.py. When tag=chat or tags=chat is
present and is_chat is not set, resolve to is_chat=True. include_in_schema=False
keeps these aliases out of the public API docs.
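
The alias rule can be sketched as a pure function, independent of the router wiring. The name `resolve_is_chat` is hypothetical, and exact string equality with "chat" is an assumption based on the wording above.

```python
from typing import Optional


def resolve_is_chat(
    is_chat: Optional[bool],
    tag: Optional[str] = None,
    tags: Optional[str] = None,
) -> Optional[bool]:
    """Map the legacy tag/tags aliases onto is_chat without overriding an explicit value."""
    if is_chat is not None:
        return is_chat  # an explicit is_chat always wins
    if tag == "chat" or tags == "chat":
        return True
    return None  # no filter requested: behave as before
```

Under this rule `?tag=chat` filters to chat messages, while `?is_chat=false&tag=chat` still honours the explicit `is_chat`.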

Root cause documented in PR abhi1693#350 companion notes.
Affects: all 13 agents using ?tag=chat in heartbeat curl scripts.

ensure_session called sessions.patch before sending each chat notification.
When the target session was mid-processing (e.g. running an LLM call),
sessions.patch serialised behind the active operation and blocked for up to
20 seconds. MC's 10-second WS timeout fired, the error was silently dropped,
and the agent never received the push notification.

Chat delivery previously relied on heartbeat polling (up to 10-minute delay)
as a fallback. Removing the unnecessary sessions.patch makes delivery
synchronous and immediate via chat.send alone.

The session is always present after agent provisioning; no pre-flight patch
is needed for message delivery.