feat(multimodal,reactions): restore v1 image/voice/PDF + chat.onReaction #2618

Open
johnmathews wants to merge 1 commit into nanocoai:main from johnmathews:feat/multimodal-reactions-port

Summary

Restores a set of capabilities that existed in NanoClaw v1 but were not carried over into the v2 trunk:

  • Image attachments as multimodal content blocks (Anthropic Messages API base64 image blocks), delivered on a separate user turn after the text prompt.
  • Voice attachments transcribed host-side with OpenAI Whisper, rendered inline by the formatter so the agent reads the spoken words directly (no whisper in the container).
  • PDF attachments extracted host-side with pdftotext, rendered inline (CDATA-wrapped) so the agent reads the written text directly.
  • Inbound reactions via chat.onReaction (the underlying Chat SDK already supports it; the bridge just wasn't subscribed). Reactions land as kind='chat-sdk' inbound rows with isMention=false so trigger-required channels accumulate without waking. A new mcp__nanoclaw__query_reactions tool lets the agent ask which messages currently carry which reactions.

This is the same feature set v1 shipped — the port is intentionally narrow to that scope.
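To make the reactions bullet above concrete, here is a minimal sketch of how a reaction event could be turned into an inbound row. It is not the PR's buildReactionInbound(): the ReactionEvent and ReactionInboundRow shapes, their field names, and the function name sketchReactionInbound are assumptions made for illustration. Only kind='chat-sdk', isMention=false, the empty-userId skip, and the preserved added flag come from the description.

```typescript
// Hypothetical event shape; the real Chat SDK reaction event may differ.
interface ReactionEvent {
  messageId: string; // message the reaction targets
  userId: string;    // empty when the adapter cannot attribute the reaction
  emoji: string;
  added: boolean;    // false means the reaction was removed
}

// Hypothetical row shape; only `kind` and `isMention` are taken from the PR text.
interface ReactionInboundRow {
  kind: "chat-sdk";
  isMention: boolean;
  reaction: {
    targetMessageId: string;
    userId: string;
    emoji: string;
    added: boolean;
  };
}

// Returns null for events that the best-effort self-reaction filter would skip.
function sketchReactionInbound(ev: ReactionEvent): ReactionInboundRow | null {
  if (ev.userId === "") return null;
  return {
    kind: "chat-sdk",
    // Never a mention, so trigger-required channels accumulate the row without waking.
    isMention: false,
    reaction: {
      targetMessageId: ev.messageId,
      userId: ev.userId,
      emoji: ev.emoji,
      added: ev.added, // removals are kept so a later query can see them
    },
  };
}
```

Returning null rather than throwing keeps the subscription handler simple: the caller can drop skipped events without special error handling.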

What each new module does

| Path | Role |
| --- | --- |
| src/transcription.ts (+test) | Host-side OpenAI Whisper wrapper. isTranscribableMime, transcribeAudio, typed TranscriptionError. Failures are non-fatal: they are stamped on the attachment as transcriptionError so the agent sees why instead of getting a silent voice note. |
| src/pdf-extract.ts (+test) | Host-side pdftotext wrapper. isPdfMime, extractPdfText, typed PdfExtractionError. Same non-fatal error contract. |
| container/agent-runner/src/multimodal.ts (+test) | Loads image attachments from /workspace/<localPath> and emits Anthropic Messages-API base64 image blocks. Refuses path escapes (../, absolute paths outside /workspace); honors per-attachment skipMultimodal; skips unsupported mime types and oversize images. |
| container/agent-runner/src/mcp-tools/core.ts (query_reactions) | Reads reaction rows from the session's inbound.db. Optional target_message_id filter; newest-first; default limit 50, hard cap 500. Preserves added=false so the agent can see removals. |
| src/channels/chat-sdk-bridge.ts (chat.onReaction subscription) | Synthesises reaction events as inbound rows via the new buildReactionInbound(). Adapters that don't expose onReaction log info and skip, with no crash. Best-effort self-reaction filter (skips rows with empty userId). |
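The path-escape and mime checks described for multimodal.ts can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's actual code: the function names, the ImageAttachment shape, and the 5 MiB size cap are hypothetical. The block layout (type "image" with a base64 source and media_type) is the Anthropic Messages API image block format.

```typescript
import * as path from "node:path";

const WORKSPACE_ROOT = "/workspace";
// Image media types accepted by the Anthropic Messages API.
const SUPPORTED_IMAGE_MIMES = new Set(["image/jpeg", "image/png", "image/gif", "image/webp"]);

// Hypothetical attachment shape; only localPath and skipMultimodal are named in the PR.
interface ImageAttachment {
  localPath: string;
  mimeType: string;
  skipMultimodal?: boolean;
}

interface ImageBlock {
  type: "image";
  source: { type: "base64"; media_type: string; data: string };
}

// Resolves localPath against /workspace and returns null if the result escapes it.
function resolveInsideWorkspace(localPath: string): string | null {
  const resolved = path.posix.resolve(WORKSPACE_ROOT, localPath);
  const inside = resolved === WORKSPACE_ROOT || resolved.startsWith(WORKSPACE_ROOT + "/");
  return inside ? resolved : null;
}

// Builds an image block from already-loaded bytes, or null when the attachment is skipped.
// The maxBytes default is an assumed cap, not a value taken from the PR.
function toImageBlock(att: ImageAttachment, bytes: Uint8Array, maxBytes = 5 * 1024 * 1024): ImageBlock | null {
  if (att.skipMultimodal) return null;
  if (!SUPPORTED_IMAGE_MIMES.has(att.mimeType)) return null;
  if (bytes.byteLength > maxBytes) return null;
  if (resolveInsideWorkspace(att.localPath) === null) return null;
  return {
    type: "image",
    source: { type: "base64", media_type: att.mimeType, data: Buffer.from(bytes).toString("base64") },
  };
}
```

Comparing against WORKSPACE_ROOT + "/" rather than a bare prefix matters: a sibling directory such as /workspace-other would pass a plain startsWith("/workspace") check.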

Capability gating (backward-compatible at runtime)

The provider interface picks up two new surfaces:

interface AgentProvider {
  readonly supportsMultimodalContent: boolean;
  // ...existing fields...
}

interface AgentQuery {
  pushBlocks(blocks: ContentBlock[]): void;
  // ...existing fields...
}

The poll-loop calls query.pushBlocks(...) only when provider.supportsMultimodalContent === true, so providers without the multimodal path no-op transparently at runtime. Trunk providers (claude, mock) opt in. Any out-of-tree provider (e.g. opencode on the providers branch) needs a one-line supportsMultimodalContent = false and a stub pushBlocks(): void {} to satisfy the widened interface; that is the only change such a provider requires.
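The gating described above can be sketched as follows. The two interfaces repeat only the members shown in this PR; the ContentBlock alias, the TextOnlyProvider and TextOnlyQuery classes, and the pushIfSupported helper are hypothetical names for illustration, and the empty-batch check is an assumption rather than documented poll-loop behaviour.

```typescript
// Loose stand-in for the real ContentBlock type.
type ContentBlock = { type: string; [key: string]: unknown };

interface AgentProvider {
  readonly supportsMultimodalContent: boolean;
}

interface AgentQuery {
  pushBlocks(blocks: ContentBlock[]): void;
}

// What an out-of-tree provider without a multimodal path might add to compile.
class TextOnlyProvider implements AgentProvider {
  readonly supportsMultimodalContent = false;
}

class TextOnlyQuery implements AgentQuery {
  // Stub: never reached while the flag above is false.
  pushBlocks(): void {}
}

// Poll-loop side: push blocks only when the provider has opted in.
// Returns true if the blocks were handed to the query.
function pushIfSupported(provider: AgentProvider, query: AgentQuery, blocks: ContentBlock[]): boolean {
  if (!provider.supportsMultimodalContent || blocks.length === 0) return false;
  query.pushBlocks(blocks);
  return true;
}
```

Keeping the flag on the provider rather than probing for the method means the poll-loop never has to guess whether a stub pushBlocks silently discards content.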

Tests

  • Host (vitest): 358 passed across 34 files. New coverage: transcription (12), pdf-extract (10), chat-sdk-bridge buildReactionInbound (4).
  • Container (bun:test): 122 passed / 254 expect() calls / 10 files. New coverage: multimodal image extraction (10), query_reactions tool (6), formatter rendering of transcription / pdf text / errors (6), provider push paths for claude and mock.
  • Container typecheck (tsc -p container/agent-runner/tsconfig.json --noEmit): clean.
  • Host build (tsc): clean.

Notes for maintainers

The reactions and the multimodal halves are independent at the wire level — they only share the type widening on AgentProvider. If you would prefer this split into two PRs (one for supportsMultimodalContent + image/voice/PDF, one for chat.onReaction + query_reactions), I'm happy to do that — just let me know which way you'd like to take it.

A few things I deliberately did not include in this PR, but happy to add if you'd like them on the same change:

  • A provider_supports_multimodal_content setter on the host-side config so operators can flip it on a per-group basis. Right now it's a provider-class constant.
  • Configurable Whisper / pdftotext invocation (model name, language hint, page limit, etc.). Currently hardcoded defaults; the v1 implementation was the same.
  • Docs page describing the attachment lifecycle end-to-end. The new modules carry inline rationale comments but there's no top-level docs entry yet.

🤖 Generated with Claude Code

…on to v2

Multimodal (W4.x-multimodal):
- Provider interface widened: AgentQuery.pushBlocks(ContentBlock[]) +
  AgentProvider.supportsMultimodalContent flag. Mirrors v1's pattern of
  one text turn followed by a separate multimodal turn.
- New container/agent-runner/src/multimodal.ts loads image attachments
  from /workspace/<localPath> and emits Anthropic Messages-API base64
  image blocks. Refuses path escapes; honors per-attachment skipMultimodal.
- Poll-loop calls pushBlocks after the initial query and for every
  follow-up batch, gated by the capability flag.
- New src/transcription.ts (host-side OpenAI Whisper) and src/pdf-extract.ts
  (host-side pdftotext) preprocess voice and PDF attachments inside the
  bridge before the row reaches the container. Failures are non-fatal —
  the formatter surfaces the error to the agent rather than dropping the
  attachment.
- Formatter renders transcription, transcriptionError, extractedText
  (CDATA-wrapped), and pdfExtractionError inline so the agent reads the
  spoken/written words directly.

Reactions (W4.x-reactions-inbound):
- chat-sdk-bridge subscribes to chat.onReaction when the adapter exposes
  it. Reaction events are synthesized as kind='chat-sdk' inbound rows
  via buildReactionInbound() — isMention=false so trigger-required
  channels accumulate rather than wake. Self-reaction filter skips empty
  userId events.
- New mcp__nanoclaw__query_reactions tool reads the session's reaction
  rows from inbound.db; optional target_message_id filter; newest-first;
  default limit 50, hard cap 500. Preserves added=false for removals.

Tests:
- Host: +28 (transcription 12, pdf-extract 10, chat-sdk-bridge 4,
  formatter ports 6 covered under container) — 459 pass total.
- Container: +18 (multimodal 10, queryReactions 6, formatter 6 of which
  are part of the container suite, mock/claude push paths) — 118 pass
  total / 251 expects.

v2 service: restarted, /health 200 healthy=true, all channels connected.
Container source is bind-mounted; no ./container/build.sh required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>