feat(multimodal,reactions): restore v1 image/voice/PDF + chat.onReaction #2618
Open
johnmathews wants to merge 1 commit into
…on to v2

Multimodal (W4.x-multimodal):
- Provider interface widened: AgentQuery.pushBlocks(ContentBlock[]) + AgentProvider.supportsMultimodalContent flag. Mirrors v1's pattern of one text turn followed by a separate multimodal turn.
- New container/agent-runner/src/multimodal.ts loads image attachments from /workspace/<localPath> and emits Anthropic Messages-API base64 image blocks. Refuses path escapes; honors per-attachment skipMultimodal.
- Poll-loop calls pushBlocks after the initial query and for every follow-up batch, gated by the capability flag.
- New src/transcription.ts (host-side OpenAI Whisper) and src/pdf-extract.ts (host-side pdftotext) preprocess voice and PDF attachments inside the bridge before the row reaches the container. Failures are non-fatal: the formatter surfaces the error to the agent rather than dropping the attachment.
- Formatter renders transcription, transcriptionError, extractedText (CDATA-wrapped), and pdfExtractionError inline so the agent reads the spoken/written words directly.

Reactions (W4.x-reactions-inbound):
- chat-sdk-bridge subscribes to chat.onReaction when the adapter exposes it. Reaction events are synthesized as kind='chat-sdk' inbound rows via buildReactionInbound(), with isMention=false so trigger-required channels accumulate rather than wake. Self-reaction filter skips empty userId events.
- New mcp__nanoclaw__query_reactions tool reads the session's reaction rows from inbound.db; optional target_message_id filter; newest-first; default limit 50, hard cap 500. Preserves added=false for removals.

Tests:
- Host: +28 (transcription 12, pdf-extract 10, chat-sdk-bridge 4, formatter ports 6 covered under container); 459 pass total.
- Container: +18 (multimodal 10, queryReactions 6, formatter 6 of which are part of the container suite, mock/claude push paths); 118 pass total / 251 expects.

v2 service: restarted, /health 200 healthy=true, all channels connected.
Container source is bind-mounted; no ./container/build.sh required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 26, 2026
Summary
Restores a set of capabilities that existed in NanoClaw v1 but were not carried over into the v2 trunk:
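- Image attachments, passed to the agent as Anthropic Messages-API base64 image blocks.
- Voice attachments, transcribed host-side with OpenAI Whisper and rendered inline so the agent reads the spoken words directly.
- PDF attachments, extracted host-side with pdftotext and rendered inline (CDATA-wrapped) so the agent reads the written text directly.
- Inbound reactions via chat.onReaction (the underlying Chat SDK already supports it; the bridge just wasn't subscribed). Reactions land as kind='chat-sdk' inbound rows with isMention=false so trigger-required channels accumulate without waking. A new mcp__nanoclaw__query_reactions tool lets the agent ask which messages currently carry which reactions.

This is the same feature set v1 shipped; the port is intentionally narrow to that scope.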
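As a rough illustration of how newest-first reaction rows with added=false removals can answer "which messages currently carry which reactions", the sketch below folds such rows into a per-message set of emoji. The ReactionRow shape and the currentReactions helper are hypothetical and not taken from this PR; only the newest-first ordering and the added=false convention come from the description.

```typescript
// Hypothetical row shape; the real inbound.db schema is not shown in this PR.
interface ReactionRow {
  targetMessageId: string;
  userId: string;
  emoji: string;
  added: boolean; // false marks a removal
}

// Fold newest-first rows into the reactions still present on each message.
// The first row seen for a (message, user, emoji) triple is its latest state,
// so older rows for the same triple are ignored.
function currentReactions(rowsNewestFirst: ReactionRow[]): Map<string, Set<string>> {
  const seen = new Set<string>();
  const current = new Map<string, Set<string>>();
  for (const row of rowsNewestFirst) {
    const key = JSON.stringify([row.targetMessageId, row.userId, row.emoji]);
    if (seen.has(key)) continue;
    seen.add(key);
    if (!row.added) continue; // latest state is "removed"
    let emojis = current.get(row.targetMessageId);
    if (!emojis) {
      emojis = new Set<string>();
      current.set(row.targetMessageId, emojis);
    }
    emojis.add(row.emoji);
  }
  return current;
}
```

Keying on the user as well as the emoji means one user's removal does not hide the same emoji still held by another user.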
What each new module does
- src/transcription.ts (+test): host-side OpenAI Whisper wrapper. isTranscribableMime, transcribeAudio, typed TranscriptionError. Failures are non-fatal: they are stamped on the attachment as transcriptionError so the agent sees why instead of getting a silent voice note.
- src/pdf-extract.ts (+test): host-side pdftotext wrapper. isPdfMime, extractPdfText, typed PdfExtractionError. Same non-fatal error contract.
- container/agent-runner/src/multimodal.ts (+test): loads image attachments from /workspace/<localPath> and emits Anthropic Messages-API base64 image blocks. Refuses path escapes (../, absolute paths outside /workspace); honors per-attachment skipMultimodal; skips unsupported mime types and oversize images.
- container/agent-runner/src/mcp-tools/core.ts (query_reactions): reads the session's reaction rows from inbound.db. Optional target_message_id filter; newest-first; default limit 50, hard cap 500. Preserves added=false so the agent can see removals.
- src/channels/chat-sdk-bridge.ts (chat.onReaction subscription): synthesizes reaction events as inbound rows via buildReactionInbound(). Adapters that don't expose onReaction log info and skip, with no crash. Best-effort self-reaction filter (skips rows with empty userId).

Capability gating (backward-compatible at runtime)
The provider interface picks up two new surfaces:
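A minimal sketch of how the two surfaces and a capability gate might fit together, assuming shapes inferred from the names in this description. The image variant of ContentBlock follows the Anthropic Messages API base64 form; pushIfSupported, TextOnlyProvider, and NoopQuery are hypothetical names for illustration, and the real interfaces carry more members than shown.

```typescript
// Content blocks; the image variant uses the Anthropic Messages API base64 form.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "image"; source: { type: "base64"; media_type: string; data: string } };

// Only the members named in the PR description; the real interfaces are larger.
interface AgentQuery {
  pushBlocks(blocks: ContentBlock[]): void;
}

interface AgentProvider {
  readonly supportsMultimodalContent: boolean;
}

// Hypothetical gate: push only when the provider opts in, otherwise do nothing.
// Returns whether the blocks were pushed.
function pushIfSupported(
  provider: AgentProvider,
  query: AgentQuery,
  blocks: ContentBlock[],
): boolean {
  if (provider.supportsMultimodalContent !== true) return false;
  query.pushBlocks(blocks);
  return true;
}

// The kind of stub an out-of-tree provider could use to satisfy the interface.
class TextOnlyProvider implements AgentProvider {
  readonly supportsMultimodalContent = false;
}

class NoopQuery implements AgentQuery {
  pushBlocks(): void {}
}
```

Keeping the flag on the provider rather than probing for the method at runtime lets the gate stay a single boolean check at each call site.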
The poll-loop calls query.pushBlocks(...) only when provider.supportsMultimodalContent === true, so providers without the multimodal path no-op transparently at runtime. Trunk providers (claude, mock) opt in. Any out-of-tree provider (e.g. opencode on the providers branch) needs a one-line supportsMultimodalContent = false and a stub pushBlocks(): void {} to satisfy the interface; that's the only call-site change required.

Tests
- Host: 459 pass. New coverage: transcription (12), pdf-extract (10), buildReactionInbound (4).
- Container: 118 pass / 251 expect() calls / 10 files. New coverage: multimodal image extraction (10), query_reactions tool (6), formatter rendering of transcription / pdf text / errors (6), provider push paths for claude and mock.
- Container typecheck (tsc -p container/agent-runner/tsconfig.json --noEmit): clean.
- Host typecheck (tsc): clean.

Notes for maintainers
The reactions and the multimodal halves are independent at the wire level; they only share the type widening on AgentProvider. If you would prefer this split into two PRs (one for supportsMultimodalContent + image/voice/PDF, one for chat.onReaction + query_reactions), I'm happy to do that; just let me know which way you'd like to take it.

A few things I deliberately did not include in this PR, but am happy to add if you'd like them on the same change:
- A provider_supports_multimodal_content setter on the host-side config so operators can flip it on a per-group basis. Right now it's a provider-class constant.

🤖 Generated with Claude Code