feat(multimodal,reactions): restore v1 image/voice/PDF + chat.onReaction by johnmathews · Pull Request #2618 · nanocoai/nanoclaw

johnmathews · 2026-05-25T22:04:23Z

Summary

Restores a set of capabilities that existed in NanoClaw v1 but were not carried over into the v2 trunk:

Image attachments as multimodal content blocks (Anthropic Messages API base64 image blocks), delivered on a separate user turn after the text prompt.
Voice attachments transcribed host-side with OpenAI Whisper, rendered inline by the formatter so the agent reads the spoken words directly (Whisper does not run in the container).
PDF attachments extracted host-side with pdftotext, rendered inline (CDATA-wrapped) so the agent reads the written text directly.
Inbound reactions via chat.onReaction (the underlying Chat SDK already supports it; the bridge just wasn't subscribed). Reactions land as kind='chat-sdk' inbound rows with isMention=false, so trigger-required channels accumulate them without waking the agent. A new mcp__nanoclaw__query_reactions tool lets the agent ask which messages currently carry which reactions.
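To make the image path concrete, here is a minimal sketch of the two checks involved: confining a `localPath` to `/workspace`, and shaping an Anthropic Messages API base64 image block. The helper names (`resolveInWorkspace`, `toImageBlock`) and the `MAX_IMAGE_BYTES` value are assumptions for illustration and are not taken from the PR's `multimodal.ts`.

```typescript
// Sketch only: helper names and MAX_IMAGE_BYTES are assumptions, not the PR's code.
const WORKSPACE_ROOT = "/workspace";
const MAX_IMAGE_BYTES = 5 * 1024 * 1024; // assumed cap; the PR does not state its limit

const SUPPORTED_IMAGE_TYPES = ["image/jpeg", "image/png", "image/gif", "image/webp"] as const;
type ImageMediaType = (typeof SUPPORTED_IMAGE_TYPES)[number];

// Shape of a base64 image content block in the Anthropic Messages API.
interface ImageBlock {
  type: "image";
  source: { type: "base64"; media_type: ImageMediaType; data: string };
}

/** Resolve localPath under /workspace, or return null if it escapes the root. */
function resolveInWorkspace(localPath: string): string | null {
  // Relative paths are anchored at the root; absolute paths are checked as given.
  const joined = localPath.startsWith("/") ? localPath : `${WORKSPACE_ROOT}/${localPath}`;
  const parts: string[] = [];
  for (const segment of joined.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      parts.pop(); // popping at the filesystem root is a no-op
      continue;
    }
    parts.push(segment);
  }
  const resolved = "/" + parts.join("/");
  // Require the trailing slash so "/workspace-evil" does not pass the prefix check.
  return resolved.startsWith(WORKSPACE_ROOT + "/") ? resolved : null;
}

function isSupportedImageType(mime: string): mime is ImageMediaType {
  return SUPPORTED_IMAGE_TYPES.some((t) => t === mime);
}

/** Build an image block, or return null for unsupported or oversize images. */
function toImageBlock(mime: string, base64Data: string): ImageBlock | null {
  if (!isSupportedImageType(mime)) return null;
  // Base64 encodes 3 bytes as 4 characters, so this estimates the decoded size.
  const approxBytes = Math.floor((base64Data.length * 3) / 4);
  if (approxBytes > MAX_IMAGE_BYTES) return null;
  return { type: "image", source: { type: "base64", media_type: mime, data: base64Data } };
}
```

Returning `null` rather than throwing matches the "skip" behaviour described for unsupported and oversize images; whether the real module logs or reports these skips is not shown here.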

This is the same feature set v1 shipped; the port is intentionally limited to that scope.

What each new module does

| Path | Role |
| --- | --- |
| `src/transcription.ts` (+test) | Host-side OpenAI Whisper wrapper. `isTranscribableMime`, `transcribeAudio`, typed `TranscriptionError`. Failures are non-fatal: they are stamped on the attachment as `transcriptionError` so the agent sees why instead of getting a silent voice note. |
| `src/pdf-extract.ts` (+test) | Host-side `pdftotext` wrapper. `isPdfMime`, `extractPdfText`, typed `PdfExtractionError`. Same non-fatal error contract. |
| `container/agent-runner/src/multimodal.ts` (+test) | Loads image attachments from `/workspace/<localPath>` and emits Anthropic Messages-API base64 image blocks. Refuses path escapes (`../`, absolute paths outside `/workspace`); honors per-attachment `skipMultimodal`; skips unsupported mime types and oversize images. |
| `container/agent-runner/src/mcp-tools/core.ts` (`query_reactions`) | Reads reaction rows from the session's `inbound.db`. Optional `target_message_id` filter; newest-first; default limit 50, hard cap 500. Preserves `added=false` so the agent can see removals. |
| `src/channels/chat-sdk-bridge.ts` (`chat.onReaction` subscription) | Synthesises reaction events as inbound rows via the new `buildReactionInbound()`. Adapters that don't expose `onReaction` log info and skip (no crash). Best-effort self-reaction filter (skips rows with empty `userId`). |
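The reaction rows in the table can be illustrated with an in-memory sketch: one function that turns a reaction event into an inbound row (skipping empty `userId`), and one that applies the `target_message_id` filter, newest-first ordering, and the 50/500 limit rules. The row shape, the `timestamp` field, and the names `toReactionRow`, `clampLimit`, and `selectReactions` are assumptions; the real tool reads from `inbound.db`, which this sketch does not model.

```typescript
// Sketch only: row shape and function names are assumptions for illustration.
interface ReactionEvent {
  messageId: string;
  userId: string;
  emoji: string;
  added: boolean; // false means the reaction was removed
}

interface ReactionRow {
  kind: "chat-sdk";
  isMention: false; // reactions never count as a mention, so they do not wake the agent
  targetMessageId: string;
  userId: string;
  emoji: string;
  added: boolean;
  timestamp: number;
}

const DEFAULT_LIMIT = 50;
const MAX_LIMIT = 500;

/** Convert an event into a row; events with an empty userId are skipped. */
function toReactionRow(event: ReactionEvent, timestamp: number): ReactionRow | null {
  if (event.userId === "") return null;
  return {
    kind: "chat-sdk",
    isMention: false,
    targetMessageId: event.messageId,
    userId: event.userId,
    emoji: event.emoji,
    added: event.added,
    timestamp,
  };
}

/** Apply the default of 50 and the hard cap of 500. */
function clampLimit(limit?: number): number {
  if (limit === undefined || !Number.isFinite(limit)) return DEFAULT_LIMIT;
  return Math.min(Math.max(1, Math.floor(limit)), MAX_LIMIT);
}

/** Filter by target message (optional), newest first, limited. */
function selectReactions(
  rows: ReactionRow[],
  opts: { targetMessageId?: string; limit?: number } = {},
): ReactionRow[] {
  return rows
    .filter((r) => opts.targetMessageId === undefined || r.targetMessageId === opts.targetMessageId)
    .sort((a, b) => b.timestamp - a.timestamp) // filter() returned a copy, so the input is untouched
    .slice(0, clampLimit(opts.limit));
}
```

Keeping removal rows (`added: false`) in the result, rather than collapsing them, leaves it to the caller to decide whether a message "currently" carries a reaction.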

Capability gating (backward-compatible at runtime)

The provider interface picks up two new surfaces:

interface AgentProvider {
  readonly supportsMultimodalContent: boolean;
  // ...existing fields...
}

interface AgentQuery {
  pushBlocks(blocks: ContentBlock[]): void;
  // ...existing fields...
}

The poll-loop calls query.pushBlocks(...) only when provider.supportsMultimodalContent === true, so providers without the multimodal path no-op transparently at runtime. Trunk providers (claude, mock) opt in. Any out-of-tree provider (e.g. opencode on the providers branch) needs a one-line supportsMultimodalContent = false and a stub pushBlocks(): void {} to satisfy the interface — that's the only call-site change required.
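The gating described above can be sketched as follows. The two interfaces repeat the fields shown in the PR; the `ContentBlock` union, the class names, and `pushIfSupported` are assumptions made for illustration, not code from the poll-loop.

```typescript
// Sketch only: ContentBlock and the names below are assumptions for illustration.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "image"; source: { type: "base64"; media_type: string; data: string } };

interface AgentProvider {
  readonly supportsMultimodalContent: boolean;
}

interface AgentQuery {
  pushBlocks(blocks: ContentBlock[]): void;
}

// What an out-of-tree, text-only provider would need to add.
class TextOnlyProvider implements AgentProvider {
  readonly supportsMultimodalContent = false;
}

class TextOnlyQuery implements AgentQuery {
  pushBlocks(): void {
    // Intentionally empty: the gate below means this is never reached in practice.
  }
}

// A query that records what it receives, standing in for a multimodal provider.
class RecordingQuery implements AgentQuery {
  readonly pushed: ContentBlock[][] = [];
  pushBlocks(blocks: ContentBlock[]): void {
    this.pushed.push(blocks);
  }
}

/** Push blocks only when the provider opts in; returns whether a push happened. */
function pushIfSupported(provider: AgentProvider, query: AgentQuery, blocks: ContentBlock[]): boolean {
  if (provider.supportsMultimodalContent !== true) return false;
  query.pushBlocks(blocks);
  return true;
}
```

Because the check lives at the call site, a provider that sets the flag to `false` never has `pushBlocks` invoked, which is why an empty stub is enough to satisfy the interface.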

Tests

Host (vitest): 358 passed across 34 files. New coverage: transcription (12), pdf-extract (10), chat-sdk-bridge buildReactionInbound (4).
Container (bun:test): 122 passed / 254 expect() calls / 10 files. New coverage: multimodal image extraction (10), query_reactions tool (6), formatter rendering of transcription / pdf text / errors (6), provider push paths for claude and mock.
Container typecheck (tsc -p container/agent-runner/tsconfig.json --noEmit): clean.
Host build (tsc): clean.

Notes for maintainers

The reactions and the multimodal halves are independent at the wire level — they only share the type widening on AgentProvider. If you would prefer this split into two PRs (one for supportsMultimodalContent + image/voice/PDF, one for chat.onReaction + query_reactions), I'm happy to do that — just let me know which way you'd like to take it.

A few things I deliberately did not include in this PR, but happy to add if you'd like them on the same change:

A provider_supports_multimodal_content setter on the host-side config so operators can flip it on a per-group basis. Right now it's a provider-class constant.
Configurable Whisper / pdftotext invocation (model name, language hint, page limit, etc.). Currently hardcoded defaults; the v1 implementation was the same.
Docs page describing the attachment lifecycle end-to-end. The new modules carry inline rationale comments but there's no top-level docs entry yet.

🤖 Generated with Claude Code

…on to v2

Multimodal (W4.x-multimodal):
- Provider interface widened: AgentQuery.pushBlocks(ContentBlock[]) + AgentProvider.supportsMultimodalContent flag. Mirrors v1's pattern of one text turn followed by a separate multimodal turn.
- New container/agent-runner/src/multimodal.ts loads image attachments from /workspace/<localPath> and emits Anthropic Messages-API base64 image blocks. Refuses path escapes; honors per-attachment skipMultimodal.
- Poll-loop calls pushBlocks after the initial query and for every follow-up batch, gated by the capability flag.
- New src/transcription.ts (host-side OpenAI Whisper) and src/pdf-extract.ts (host-side pdftotext) preprocess voice and PDF attachments inside the bridge before the row reaches the container. Failures are non-fatal: the formatter surfaces the error to the agent rather than dropping the attachment.
- Formatter renders transcription, transcriptionError, extractedText (CDATA-wrapped), and pdfExtractionError inline so the agent reads the spoken/written words directly.

Reactions (W4.x-reactions-inbound):
- chat-sdk-bridge subscribes to chat.onReaction when the adapter exposes it. Reaction events are synthesized as kind='chat-sdk' inbound rows via buildReactionInbound(), with isMention=false so trigger-required channels accumulate rather than wake. Self-reaction filter skips empty userId events.
- New mcp__nanoclaw__query_reactions tool reads the session's reaction rows from inbound.db; optional target_message_id filter; newest-first; default limit 50, hard cap 500. Preserves added=false for removals.

Tests:
- Host: +28 (transcription 12, pdf-extract 10, chat-sdk-bridge 4, formatter ports 6 covered under container); 459 pass total.
- Container: +18 (multimodal 10, queryReactions 6, formatter 6 of which are part of the container suite, mock/claude push paths); 118 pass total / 251 expects.

v2 service: restarted, /health 200 healthy=true, all channels connected. Container source is bind-mounted; no ./container/build.sh required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
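The commit message names four attachment fields the formatter renders inline (`transcription`, `transcriptionError`, `extractedText`, `pdfExtractionError`). A minimal sketch of that rendering follows. The `name` field, the bracketed output format, and the function names are assumptions; only the four field names and the CDATA wrapping of extracted text come from the PR. The `]]>` splitting is the standard way to keep arbitrary text inside CDATA sections.

```typescript
// Sketch only: output format, `name`, and function names are assumptions for illustration.
interface Attachment {
  name: string; // hypothetical display name
  transcription?: string;
  transcriptionError?: string;
  extractedText?: string;
  pdfExtractionError?: string;
}

/** Wrap text in CDATA, splitting any "]]>" so it cannot close the section early. */
function wrapCdata(text: string): string {
  return `<![CDATA[${text.split("]]>").join("]]]]><![CDATA[>")}]]>`;
}

/** Render one attachment as a single inline line for the agent prompt. */
function renderAttachment(a: Attachment): string {
  if (a.transcription !== undefined) {
    return `[voice: ${a.name}] ${a.transcription}`;
  }
  if (a.transcriptionError !== undefined) {
    // Non-fatal: the agent is told why there is no transcript.
    return `[voice: ${a.name}] transcription failed: ${a.transcriptionError}`;
  }
  if (a.extractedText !== undefined) {
    return `[pdf: ${a.name}] ${wrapCdata(a.extractedText)}`;
  }
  if (a.pdfExtractionError !== undefined) {
    return `[pdf: ${a.name}] extraction failed: ${a.pdfExtractionError}`;
  }
  return `[attachment: ${a.name}]`;
}
```

Surfacing the error string in place of the content keeps the attachment visible to the agent even when preprocessing fails, which is the non-fatal contract both host-side wrappers follow.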

johnmathews requested review from gabi-simons and gavrielc as code owners May 25, 2026 22:04

This was referenced May 26, 2026

🦞 OpenClaw Ecosystem Daily 2026-05-26 JohnGao818/big_model_radar#7 (Open)

🦞 OpenClaw Ecosystem Daily 2026-05-26 ivanweng2077/big_model_radar#92 (Open)

🦞 OpenClaw Ecosystem Daily 2026-05-26 JohnGao818/big_model_radar#12 (Open)

johnmathews wants to merge 1 commit into nanocoai:main from johnmathews:feat/multimodal-reactions-port
