Refactor email system into plugin architecture with multi-provider support #938
Open
alifeinbinary wants to merge 107 commits into jaredlockhart:main
alifeinbinary (Contributor) commented on Apr 2, 2026
- Extract email functionality into a plugin system under penny/plugins/
- Establish a plugin architecture that abstracts service-specific code
- Move Fastmail JMAP client to plugins/fastmail/ with FastmailPlugin class
- Move Zoho client to plugins/zoho/ with ZohoPlugin class
- Create an InvoiceNinja stub plugin for future invoicing integration; it can also be adapted into standard boilerplate for future plugins
- Unify the /zoho command under the /email command, with multi-provider routing support
- Single provider: /email
- Multiple providers: /email
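
A minimal sketch of how multi-provider routing behind a single /email command could look. Only `FastmailPlugin` and `ZohoPlugin` are names from this PR; the `EmailPlugin` protocol, `PluginRegistry`, the `search` method, and the convention that the first word names the provider are assumptions for illustration, not the PR's actual API.

```python
from dataclasses import dataclass, field
from typing import Protocol


class EmailPlugin(Protocol):
    """Hypothetical interface each provider plugin would implement."""

    name: str

    def search(self, query: str) -> list[str]: ...


@dataclass
class FastmailPlugin:
    name: str = "fastmail"

    def search(self, query: str) -> list[str]:
        # Placeholder: a real plugin would call the Fastmail JMAP client here.
        return [f"fastmail result for {query!r}"]


@dataclass
class ZohoPlugin:
    name: str = "zoho"

    def search(self, query: str) -> list[str]:
        # Placeholder: a real plugin would call the Zoho client here.
        return [f"zoho result for {query!r}"]


@dataclass
class PluginRegistry:
    """Hypothetical registry that routes /email to the right provider."""

    plugins: dict[str, EmailPlugin] = field(default_factory=dict)

    def register(self, plugin: EmailPlugin) -> None:
        self.plugins[plugin.name] = plugin

    def route(self, args: str) -> list[str]:
        if not self.plugins:
            raise ValueError("no email provider configured")
        if len(self.plugins) == 1:
            # Single provider: the whole argument string is the query.
            (plugin,) = self.plugins.values()
            return plugin.search(args)
        # Multiple providers: ASSUMED syntax where the first word names the provider.
        provider, _, query = args.partition(" ")
        if provider not in self.plugins:
            raise ValueError(f"unknown provider {provider!r}")
        return self.plugins[provider].search(query)
```

Keeping the routing decision in one registry means each provider plugin only has to implement the shared interface, which is the point of abstracting service-specific code.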
…sues

ReadEmailsTool was running fetched emails through Ollama summarization, adding latency and losing detail. The agent already has the full email content in context and can answer questions directly.

Changes:
- Remove OllamaClient and user_query params from ReadEmailsTool
- Return raw email content joined with separators instead of summary
- Remove ReadEmailsArgs Pydantic model (use kwargs directly)
- Remove EMAIL_SUMMARIZE
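
The change described above (return raw content joined with separators, no summarization call) could be sketched roughly as follows. The `Email` record, the `format_emails` helper, and the separator string are hypothetical; the commit does not show the real field names or the separator it uses.

```python
from dataclasses import dataclass

# Assumed separator; the commit does not say which string is used.
EMAIL_SEPARATOR = "\n\n---\n\n"


@dataclass
class Email:
    """Hypothetical minimal email record."""

    sender: str
    subject: str
    body: str


def format_emails(emails: list[Email]) -> str:
    """Return raw email content joined with separators, with no summarization step."""
    if not emails:
        return "No emails found."
    blocks = [
        f"From: {e.sender}\nSubject: {e.subject}\n\n{e.body}" for e in emails
    ]
    return EMAIL_SEPARATOR.join(blocks)
```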
… Discord event handlers for reconnecting.

Adds extensive logging to debug Discord message reception issues:
- Log intents configuration, bot user ID, gateway latency on ready
- Add on_connect, on_disconnect, on_resumed gateway event handlers
- Log raw MESSAGE_CREATE gateway events via on_socket_raw_receive
- Log ALL messages in on_message before filtering (author, channel, content)
- Log filter decisions (own message, wrong channel) with [DIAG] prefix

Add validate_connectivity()
…liest (jaredlockhart#863)

_find_unrolled_weeks used get_recent(limit=1) which returns the most recent daily entry, but treated it as the earliest. When the most recent entry is in the current week, first_monday == current_monday and the scan loop never executes, so no completed weeks are ever found.

- Add get_earliest() to HistoryStore (ASC ordering)
- Use get_earliest() in _find_unrolled_weeks
- Update test to seed current-week entries alongside past weeks

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
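
To make the bug concrete, here is a small sketch of the week scan. `monday_of` and `completed_weeks` are hypothetical helpers, not the repository's code; they only illustrate why starting from the most recent entry yields an empty scan when that entry falls in the current week.

```python
from datetime import date, timedelta


def monday_of(day: date) -> date:
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def completed_weeks(start_entry: date, today: date) -> list[date]:
    """List the Mondays of completed weeks from `start_entry` up to, but excluding, the current week."""
    first_monday = monday_of(start_entry)
    current_monday = monday_of(today)
    weeks = []
    monday = first_monday
    # If start_entry is in the current week, first_monday == current_monday
    # and this loop never runs.
    while monday < current_monday:
        weeks.append(monday)
        monday += timedelta(weeks=1)
    return weeks
```

Passing the earliest entry as `start_entry` finds the past weeks; passing the most recent entry reproduces the empty result described in the commit.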
…wns (jaredlockhart#864)

* Improve notification scoring, thinking distribution, and topic cooldowns

- Normalize novelty and sentiment scores to [0,1] via min-max scaling before applying weights, so both dimensions contribute proportionally instead of novelty dominating due to its ~4x larger raw range
- Add per-topic 24h notification cooldown: once a preference (or free thought) is notified, that topic is excluded from candidates for 24 hours
- Add MAX_UNNOTIFIED_THOUGHTS config param (default 20): thinking agent skips cycles when unnotified thoughts reach the cap
- Replace random-roll thinking mode selection with distribution-based steering: compare actual free/seeded ratio against target probabilities and pick whichever type is underrepresented
- Add ThoughtStore.count_unnotified() and count_unnotified_free() queries
- Add THOUGHT_TOPIC_COOLDOWN_SECONDS constant (86400)
- 12 new tests covering normalization, cooldown, cap, and distribution logic
- All existing tests updated to monkeypatch probability constants for determinism independent of production values

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Move thinking distribution constants to runtime config params

FREE_THINKING_PROBABILITY and NEWS_THINKING_PROBABILITY are now runtime-configurable via /config instead of hardcoded constants. The seeded probability is implicit (1 - free - news). Tests pass probabilities through make_config() instead of monkeypatching PennyConstants.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
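
A rough sketch of the min-max normalization step described in this commit. The function names are hypothetical, and the 0.5/0.5 default weights are borrowed from the follow-up commit (jaredlockhart#865) rather than from this one.

```python
def min_max(values: list[float]) -> list[float]:
    """Scale values to [0, 1]; return all zeros when every value is equal."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def best_candidate(
    novelty: list[float],
    sentiment: list[float],
    novelty_weight: float = 0.5,
    sentiment_weight: float = 0.5,
) -> int:
    """Return the index of the candidate with the highest weighted, normalized score."""
    norm_novelty = min_max(novelty)
    norm_sentiment = min_max(sentiment)
    combined = [
        novelty_weight * n + sentiment_weight * s
        for n, s in zip(norm_novelty, norm_sentiment)
    ]
    return max(range(len(combined)), key=combined.__getitem__)
```

Because both dimensions are rescaled to the same range before weighting, a dimension with a larger raw spread no longer dominates the combined score.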
…ockhart#865) * Move scoring weights to runtime config params (default 50/50) NOVELTY_WEIGHT and SENTIMENT_WEIGHT are now runtime-configurable via /config instead of hardcoded constants. Default changed from 40/60 to 50/50 for equal weighting now that normalization makes both dimensions comparable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix thinking agent flooding logs when at unnotified cap When MAX_UNNOTIFIED_THOUGHTS is reached, get_prompt returned None which made execute_for_user return False. The scheduler treated that as "no work" and retried every tick (~1s), flooding the log. Move the cap check to execute_for_user and return True when skipping, so the scheduler calls mark_complete and waits for the next interval. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rt#866) Append -site: exclusions for blocked domains (facebook, instagram, tiktok) to the Serper query so Google filters them server-side. Previously we only filtered after download, so queries dominated by these domains returned no image at all. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
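
As an illustration of the server-side filtering described above, a query builder might look like the sketch below. The helper name and the exact domain strings are assumptions; the commit only names facebook, instagram, and tiktok.

```python
# Assumed domain spellings; the commit lists the sites but not the exact strings.
BLOCKED_DOMAINS = ("facebook.com", "instagram.com", "tiktok.com")


def with_site_exclusions(query: str, blocked: tuple[str, ...] = BLOCKED_DOMAINS) -> str:
    """Append a -site: operator per blocked domain so the search engine filters them."""
    exclusions = " ".join(f"-site:{domain}" for domain in blocked)
    return f"{query} {exclusions}".strip()
```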
The THINKING_REPORT_PROMPT was producing thoughts framed as corrections
or debunking ("it turns out X is NOT Y"), which sounds wrong in
spontaneous notifications where there's nothing to correct. Updated the
prompt to frame findings as standalone new discoveries and to discard
searches that only found something doesn't exist.
Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add topic context intro to notify prompt Thought notifications were jumping straight into details without establishing what the topic is, leaving the reader confused (e.g., "Kokoroko's new RSD-2026 vinyl..." with no mention that Kokoroko is a band). Updated NOTIFY_SYSTEM_PROMPT to instruct the model to open with a brief identifying phrase before diving into details. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add full system prompt assertions for news and checkin notify modes Extends the test coverage pattern to all three notification modes. ThoughtMode already had a full prompt assertion; now NewsMode and CheckinMode do too, catching structural drift in prompt composition. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…redlockhart#871) * Make thought notifications conversational instead of report-style The thinking report prompt produces structured content (bullets, headers, tables) which the notify agent was regurgitating verbatim. Changed the instruction from "Share what's in it — the thought IS the substance" to "Retell it conversationally — no bullet lists, no headers, no tables" so notifications read like a friend explaining what they found. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix flaky schedule test by polling for expected message content Replace wait_for_message (returns last message, vulnerable to race conditions) with wait_until + _has_message pattern that polls for the specific expected content. This matches the convention used by the rest of the test suite. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lockhart#872) * Steer thinking agent away from troubleshooting/support content The thinking agent was searching for bug reports and support articles (e.g., "UAD plugin glitch") and surfacing them as interesting discoveries. Added guidance to look for releases, creative work, and discoveries while avoiding troubleshooting guides and bug reports. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add casual greeting to proactive notifications Notifications were jumping straight into content without a greeting. Added "Start with a casual greeting" to NOTIFY_SYSTEM_PROMPT, matching the pattern already used by the news notification prompt. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…aredlockhart#873) NEWS_NOTIFY_MAX_STEPS was 1, but the agent base class strips tools on the final step. With only 1 step, fetch_news could never execute — the model's tool call was discarded as "hallucinated on final step" and every news attempt produced an empty response that got disqualified. Bumped to 3 steps so the model can call the tool and format results. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ckhart#874) * Use thought title as image search fallback for notifications When a thought notification has no tool calls (model retells thought context directly), the image search fell back to using the full message text, producing bad image results. Now ThoughtMode extracts the first bold headline from the thought content as the image query (e.g., "Bad Cat Era 30 – A Hand-Wired EL84 Head"), which is a much better match for finding a relevant product/topic image. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Strip generic prefixes from thought titles for image search Thought titles like "Briefing: Tone King Royalist" or "Here is something interesting I learned about the Vox AC15HWR1" had generic prefixes that diluted image search results. Added _clean_thought_title that strips common prefixes (Briefing:, Detailed Briefing:, etc.) and filters out completely generic titles. Tested against 100 recent thoughts: 97/100 produce good image queries after cleaning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Reduce fuzzy duplicate preference extraction The preference extraction prompt was creating near-duplicate preferences like "Tubesteader Eggnog user reviews" and "Tubesteader Eggnog 12AX7 pre-amp" when "Tubesteader pedals" already existed. These slipped past both TCR and embedding dedup because short strings with slightly different wording produce low similarity scores. Added explicit guidance that asking about reviews, specs, or details of a known item is engagement with the existing preference, not a new one. Added a concrete example matching the observed failure pattern. Dry-ran against the actual prompt that produced the duplicate — 3/3 runs correctly classified it as existing instead of new. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Skip questions and tasks in preference extraction The model was extracting questions and troubleshooting requests as preferences (e.g., "Running preamp into front of amp", "preamp output confusion", "pedals powered via XLink Out"). Added explicit guidance to skip questions, tasks, and troubleshooting requests. Dry-ran against 4 prompts that produced task preferences — all 4 previously-bad extractions are now suppressed or significantly reduced. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Cache embeddings on thought and messagelog tables

Thoughts and outgoing messages were being re-embedded from scratch on every dedup check and novelty comparison. Added embedding BLOB columns to both tables so embeddings are computed once and reused.

- Migration 0014: adds embedding column to thought and messagelog
- ThinkingAgent: embeds and stores at thought creation time, uses cached embeddings in dedup (skips thoughts without embeddings)
- NotifyAgent: uses cached message embeddings for novelty scoring, backfills on first access
- Startup backfill job extended to populate thought embeddings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Embed messages at insert time and backfill at startup

Messages were being lazily backfilled in the notify agent on read. Moved embedding to send_response (insert time) so every outgoing message gets its embedding cached immediately. Added startup backfill for existing messages without embeddings, and a test assertion that thoughts get embeddings stored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
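
A minimal sketch of the compute-once idea using a BLOB column. The `thought` table and `embedding` column come from the commit; the rest of the schema, the float32 packing, and the `cached_embedding` helper are assumptions, not the migration's actual contents.

```python
import sqlite3
from array import array
from typing import Callable


def to_blob(vector: list[float]) -> bytes:
    """Pack a vector as float32 bytes for storage in a BLOB column."""
    return array("f", vector).tobytes()


def from_blob(blob: bytes) -> list[float]:
    """Unpack float32 bytes back into a list of floats."""
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


def cached_embedding(
    conn: sqlite3.Connection,
    thought_id: int,
    embed: Callable[[str], list[float]],
) -> list[float]:
    """Return the stored embedding, computing and storing it only if missing."""
    row = conn.execute(
        "SELECT content, embedding FROM thought WHERE id = ?", (thought_id,)
    ).fetchone()
    if row is None:
        raise KeyError(thought_id)
    content, blob = row
    if blob is None:
        # Cache miss: embed once and persist the result.
        blob = to_blob(embed(content))
        conn.execute(
            "UPDATE thought SET embedding = ? WHERE id = ?", (blob, thought_id)
        )
    return from_blob(blob)
```

The second commit in this squash moves the embedding call to insert time, which removes the lazy branch from the read path; the storage format would stay the same.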
The NOTIFY_NEWS prompt said "the source in parentheses" which the model interpreted as the outlet name (e.g., "(New York Times)") rather than the actual URL from the tool results. Changed to "the source URL from the tool results" so URLs are included. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Chat was instructed to "Focus on ONE topic per response" and "go deep" which produced narrow answers that missed important angles (e.g., trauma/immune question only covered physical trauma, ignored PTSD). Changed to "Go WIDE: cover as many angles as possible" with multiple search queries and follow-up searches for comprehensive answers. Thinking mode stays go-deep (autonomous exploration of one thread). Chat mode is now go-wide (user wants the full picture). Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ckhart#879) * Skip daily history entries already covered by weekly rollups The history context was including both weekly rollups AND their constituent daily entries, causing duplicate topics in the system prompt. Now _format_daily_entries checks each day against the weekly rollup date ranges and skips days that fall within a completed week. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add test for daily/weekly history overlap filtering Verifies that daily entries within a weekly rollup's date range are excluded from the history context, while daily entries outside the range are still included. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
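
The overlap filter can be sketched as a simple date-range check. `uncovered_days` is a hypothetical helper, and treating the weekly range as inclusive on both ends is an assumption.

```python
from datetime import date


def uncovered_days(
    days: list[date], weekly_ranges: list[tuple[date, date]]
) -> list[date]:
    """Keep only daily entries that fall outside every weekly rollup range (inclusive)."""
    return [
        day
        for day in days
        if not any(start <= day <= end for start, end in weekly_ranges)
    ]
```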
Changed THINKING_REPORT_PROMPT from structured report format (tables, headers, 500 words) to conversational message format (casual greeting, details, URL, 300 words). Thoughts are now stored in the shape they'll be shared, cutting context size in half. Loosened NOTIFY_SYSTEM_PROMPT to relay the thought as-is instead of re-summarizing. Old prompt: "Retell it conversationally, no bullets/ headers/tables." New prompt: "Share it with the user, don't compress or summarize, just relay in your own voice." Tested end-to-end on 3 examples: new pipeline produces notifications with equal or better detail than the original two-step process. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…) The "No greetings, no sign-offs" rule was in PENNY_IDENTITY which is shared by all agents, causing proactive notifications to skip greetings even though the notify prompt said to include one. Moved the rule to CONVERSATION_PROMPT so it only applies when responding to user messages. Also removed the greeting from THINKING_REPORT_PROMPT since the notify agent now handles greetings, so the stored thought shouldn't include one.
Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…redlockhart#882) * Score thoughts by cached embedding before generating notification Previously generated N candidates through the model, then scored them. Now scores the raw thoughts using cached embeddings (novelty + sentiment), picks the winner, then runs only the winner through the notify agent. With NOTIFY_CANDIDATES=5, this cuts model calls from 5 to 1 per notification cycle. Possible because thoughts are now stored in notification-ready shape with pre-computed embeddings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add integration test for embedding-based thought scoring Tests the full notification flow with 3 thought candidates: seeds DB with preferences, thoughts with embeddings, and an incoming message, then runs execute_for_user and asserts a notification was sent and exactly 1 of 3 thoughts was marked notified. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Assert on full call chain in embedding scoring test Verify every edge of the score-then-generate flow: - 1 Ollama chat call (winner only, not all candidates) - 1 embed call (outgoing message at send time, not during scoring) - 1 serper image search - Message delivered via Signal - 2 of 3 thoughts remain unnotified - 1 thought marked notified in DB Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Simplify image search to use thought content directly The bold-title extraction and prefix-cleaning logic was built for the old structured report format. With conversational thoughts, bold titles are rare. Now uses first 300 chars of thought content as the image query — the subject name consistently appears in the first sentence or two, and serper is smart enough to extract it. Removed dead code: _clean_thought_title, _is_generic_title, _TITLE_STRIP_PREFIXES. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
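
A sketch of the score-then-generate flow from this commit, under loud assumptions: novelty is taken here as one minus the highest cosine similarity to recent messages, and sentiment as the highest cosine similarity to a preference. The commit does not define either score, and this sketch skips the min-max normalization from jaredlockhart#864. All function names are hypothetical.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(thought: Vector, recent: Sequence[Vector], prefs: Sequence[Vector]) -> float:
    # ASSUMED definitions of novelty and sentiment, for illustration only.
    novelty = 1.0 - max((cosine(thought, m) for m in recent), default=0.0)
    sentiment = max((cosine(thought, p) for p in prefs), default=0.0)
    return 0.5 * novelty + 0.5 * sentiment


def notify_winner(
    thoughts: dict[str, Vector],
    recent: Sequence[Vector],
    prefs: Sequence[Vector],
    generate: Callable[[str], str],
) -> str:
    """Score every cached embedding, then run only the winner through the model."""
    winner = max(thoughts, key=lambda text: score(thoughts[text], recent, prefs))
    return generate(winner)
```

The saving comes from `generate` being called once per cycle regardless of how many candidates are scored.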
* Add thought title for dedup and image search

The thinking report prompt now emits a 'Topic: <title>' line that gets parsed and stored separately. Titles are short (e.g., "Tubesteader Beekeeper pedal") so they embed closely for duplicates and work well as image search queries.

Key changes:
- Migration 0015: adds title column to thought table
- THINKING_REPORT_PROMPT: emits 'Topic: ...' on last line
- ThinkingAgent: parses title, embeds title (not content), stores both
- Thought dedup: now global (all thoughts, not per-preference) using TCR_OR_EMBEDDING on titles; catches cross-preference duplicates
- Image search: uses thought.title when available
- New runtime config: THOUGHT_DEDUP_TCR_THRESHOLD (default 0.6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Separate title and content embeddings on thoughts

Title embedding for dedup (short string, high discrimination), content embedding for novelty/sentiment scoring (full message vs messages/preferences). Both computed at creation time and cached.

- Added title_embedding column to thought table (migration 0015)
- ThinkingAgent stores both embeddings at creation
- Dedup uses title_embedding, scoring uses embedding (content)
- Added THOUGHT_DEDUP_TCR_THRESHOLD runtime config param (0.6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
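
Parsing the trailing 'Topic: ...' line might look like the sketch below. `split_title` is a hypothetical helper; the case-insensitive match and the fallback when no topic line is present are assumptions.

```python
from typing import Optional


def split_title(report: str) -> tuple[Optional[str], str]:
    """Split a report into (title, content) using a trailing 'Topic: ...' line."""
    lines = report.rstrip().splitlines()
    if lines and lines[-1].strip().lower().startswith("topic:"):
        title = lines[-1].split(":", 1)[1].strip()
        content = "\n".join(lines[:-1]).rstrip()
        return (title or None), content
    # No topic line: keep the whole report as content.
    return None, report.strip()
```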
…ckhart#884) OR strategy produced false positives from common short words ("2026", "AI", "agent") matching via TCR on short titles. Switched to AND (both TCR >= 0.6 AND embedding >= 0.6 required) which eliminates all false positives while catching real duplicates. Also lowercase titles before embedding so casing doesn't affect similarity (e.g., "THE GHOST IN THE SHELL" vs "Ghost in the Shell" was 0.381, now 0.652 after lowercasing). Lowered THOUGHT_DEDUP_EMBEDDING_THRESHOLD default from 0.80 to 0.60 since title embeddings score lower than full-content embeddings. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
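
A sketch of the AND strategy with lowercasing. The commit does not define TCR, so `token_overlap` below is a stand-in (shared tokens over the smaller token set) and should not be read as the project's metric; `is_duplicate` and the `embed` callable are likewise hypothetical. Only the two 0.6 thresholds come from the commit.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def token_overlap(a: str, b: str) -> float:
    """Stand-in for TCR (ASSUMED): shared tokens divided by the smaller token set."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / min(len(tokens_a), len(tokens_b))


def is_duplicate(
    title_a: str,
    title_b: str,
    embed: Callable[[str], Sequence[float]],
    tcr_threshold: float = 0.6,
    embedding_threshold: float = 0.6,
) -> bool:
    """Require BOTH signals to pass, after lowercasing so casing cannot affect either."""
    a, b = title_a.lower(), title_b.lower()
    return (
        token_overlap(a, b) >= tcr_threshold
        and cosine(embed(a), embed(b)) >= embedding_threshold
    )
```

With AND, a short shared token such as "AI" can still clear the token threshold, but the pair is rejected unless the embeddings also agree.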
…ed (jaredlockhart#896) * Add browser extension with WebSocket server and dev tooling Browser sidebar extension connects to Penny via WebSocket (echo-only for now). Adds web-ext dev setup with auto-reload, exposes port 9090 from Docker, and wires up BROWSER_ENABLED config to start the server alongside Signal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add multi-channel architecture with device routing and shared history ChannelManager implements MessageChannel as a routing proxy — all agents, scheduler, and commands interact with it instead of a single channel. Messages from any device (Signal, browser) resolve to the same user identity, giving full conversation continuity across channels. New: Device table + DeviceStore, ChannelManager, BrowserChannel (full MessageChannel), migration 0016, ChannelType enum, browser sidebar device registration flow. 418 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add browser HTML formatting, image URLs, reconnect indicator, and single-user fix BrowserChannel.prepare_outgoing converts markdown to HTML (bold, italic, code, links, tables-to-bullets). Images use URLs via search_image_url instead of base64 download, rendered as <img> tags prepended to messages. Sidebar shows reconnecting spinner. Background agents use get_primary_sender from UserInfo instead of mining MessageLog for user identity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Set up TypeScript, typed protocol, light/dark theme, and streamlined UI Converts browser extension to TypeScript with strict mode. Shared protocol.ts defines typed constants and discriminated unions for the WebSocket protocol. CSS refactored to custom properties with prefers-color-scheme for automatic light/dark support. Header removed, status indicator is now a minimal dot at bottom-right of messages. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Persist chat history in browser local storage with smart scrolling Messages stored in browser.storage.local (capped at 200) and rehydrated on sidebar open. New messages scroll to show the top of the message; rehydration jumps to bottom instantly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Move WebSocket to background script, sidebar uses runtime messaging Background script owns the server connection and persists across sidebar open/close. Sidebar communicates via browser.runtime messaging with typed RuntimeMessage protocol. Connection state synced on sidebar open via port. Smart scroll: short messages anchor at bottom, long messages show top first. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add browse_url tool with hidden tab, content extraction, and domain permissions First browser tool: browse_url opens a hidden tab with full web engine and user session, injects a content script to extract visible text, then the server summarizes it in a sandboxed model call before the agent sees it. Domain permission flow: unknown domains prompt the user via sidebar dialog, decisions stored for future calls. Tool available dynamically to chat and thinking agents when a browser is connected. Protocol: tool_request/tool_response RPC over WebSocket with correlation IDs. BrowserChannel resolves asyncio Futures when responses arrive. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix single-user identity resolution for commands, reactions, and startup Commands, reactions, and command logs now resolve device identifiers to the primary user sender via _resolve_user_sender. Startup announcement uses get_primary_sender and skips when no message history exists. Tests added for user sender resolution and startup skip behavior. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix /draw in browser by handling raw base64 and data URI attachments _prepend_images now supports three attachment formats: HTTP URLs, data URIs, and raw base64 (wrapped as data:image/png). Previously only HTTP URLs were rendered, so /draw output was silently dropped in the browser sidebar. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add active tab context injection for browser sidebar messages Background script extracts visible text from the active tab on tab switch and page load, holds it in a buffer, and attaches it to chat messages. Server injects it into the chat agent's system prompt as a "Current Browser Page" context section. Truncated to 5,000 chars. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix scroll positioning by re-scrolling after image load scrollIntoView fires before images render, so offsetHeight is wrong for messages with images. Now re-scrolls on each img load event to account for the final dimensions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Replace content extraction with Defuddle, inject page context as synthetic tool call Content script now uses Defuddle for smart page extraction (strips nav, sidebars, boilerplate) with CSS heuristic and TreeWalker fallbacks. Bundled via esbuild since content scripts can't use imports. Page context injected as a synthetic browse_url tool call + result in the message history instead of system prompt. The model sees a pre-completed tool exchange and answers from it directly. System prompt carries a minimal hint (title + URL) to disambiguate "this page" references. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add page context toggle, og:image extraction, and flush image styling Sidebar shows current page title with checkbox to include page content. Content script extracts og:image metadata. 
Responses to page-context messages show the page image and "In response to" link inside the message bubble. All images in Penny messages now render flush to bubble edges with matching border-radius. Input disabled while waiting for response. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Update browser extension architecture doc with implementation status Reflects all completed work: multi-channel architecture, device table, browse_url tool, active tab context, Defuddle extraction, permission flow, TypeScript protocol, page context toggle, and additional features not in the original plan. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add thoughts feed page with new/archive tabs, image URLs, and modal viewer Feed page renders thoughts as a card grid with images, titles, seed topic bylines, and HTML-formatted content (via server-side prepare_outgoing). New/Archive tabs split by notified_at. Clickable cards open a modal with full content. Sidebar nav bar links to feed page. image_url stored on Thought model at creation time. Startup backfill populates existing thoughts in parallel batches. Migration 0017 adds image_url column. Seed topic resolved from preference FK for bylines. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add thought reactions, unnotified count, Font Awesome icons, and periodic polling Thumbs up/down on feed cards and modal overlay — logs reaction as incoming message with parent_id to synthetic outgoing (same pipeline as Signal reactions for preference extraction), marks thought notified, fades card. Font Awesome installed locally (no CDN). Sidebar nav shows unnotified thought count. Background polls thoughts every 5 minutes for fresh count. Reaction buttons float on card corners with hover color effects. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add Penny logo with transparent background, extension icons, and Signal avatar penny.png made transparent and resized to 48px/96px for extension icons. Added to README header. Signal profile picture set via signal-cli-rest-api PUT /v1/profiles endpoint. New `make signal-avatar` target for setting it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add Penny logo, SVG icons, thought reactions, feed polish, and image backfill Logo: penny.svg traced from PNG via potrace, auto-cropped, rendered to 16/32/48/96px PNGs from SVG for crisp icons at all sizes. Added to README, sidebar nav, feed page header, and extension manifest. Feed: thumbs up/down reactions log to preference extraction pipeline, Font Awesome icons (local), periodic thought polling, unnotified count in sidebar nav, seed topic bylines, modal viewer with reactions, server-side markdown-to-HTML for thought content. Infrastructure: thought.image_url stored at creation time, startup backfill for existing thoughts, migration 0017, make signal-avatar target. 5-minute thought poll interval. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Update architecture doc with feed page, reactions, logo, and new features Documents feed page implementation (card grid, new/archive tabs, modal, reactions pipeline, image URLs at creation time), logo/SVG workflow, Font Awesome, thought count polling, and updated directory structure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Update CLAUDE.md with browser extension, multi-channel, and new commands Documents browser extension directory structure, dev workflow, config vars (BROWSER_ENABLED/HOST/PORT), make signal-avatar target, single-user model, and design doc references. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Update README with browser extension, multi-channel, and feed page Adds Browser Extension section documenting sidebar chat, active tab context, browse_url tool, thoughts feed, and multi-device support. Updates overview to mention browser channel and shared history. Adds Firefox badge, browser config vars, and make signal-avatar. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Use PageContext Pydantic model instead of raw dicts throughout PageContext defined in channels/base.py (alongside IncomingMessage), imported by browser/models.py. All page context references use typed model attributes instead of dict.get() calls. Renamed abbreviated variable names (ctx → context). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Move inline imports to top level and batch seed topic query All inline imports of penny modules moved to top-level imports. Inline imports only remain for optional external packages (github_api) inside try/except guards. Seed topic resolution uses batch get_by_ids query instead of N individual queries. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Sanitize all web content at the BrowserChannel boundary All page content from the browser is sanitized through a sandboxed model call in BrowserChannel before reaching any downstream consumer. Both browse_url tool responses and active tab context go through the same _sanitize_page_content method — comprehensive rewrite preserving URLs, structure, and details. BrowseUrlTool no longer does its own summarization; it receives pre-sanitized content from the channel. Single enforcement point: consumers can't accidentally bypass sanitization because it happens at the channel boundary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Move sanitize prompt and constants to proper files, add favicon, fix title color PAGE_SANITIZE_PROMPT moved to Prompt class. 
TOOL_REQUEST_TIMEOUT and MAX_PAGE_CONTENT_CHARS moved to PennyConstants. Feed page gets favicon and black title instead of purple. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Start typing indicator before page content sanitization Typing indicator now fires before the sandboxed summarization step so the user sees immediate feedback while page content is being processed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Increase tool timeouts for browse_url + sanitization chain Browser tool request timeout bumped from 30s to 60s. Overall tool timeout bumped from 60s to 120s to accommodate the full chain: browser round-trip + page load + content extraction + sanitization model call. IMDB pages were timing out at 60s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add tests for page content sanitization and BrowseUrlTool passthrough Tests cover: sandboxed sanitization happy path, fallback when no model client, fallback on model failure, content truncation at max chars, BrowseUrlTool returning pre-sanitized content directly, and empty content handling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Show newest thoughts first on the feed page Added get_newest() method to ThoughtStore that returns newest-first ordering. Feed page handler uses it instead of reversing get_recent(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Only recheck page context toggle when URL actually changes Prevents background tab update events from resetting the toggle when the user unchecked it on the same page. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add TODO section to architecture doc for deferred work Browse_url page headers, sender column cleanup, domain allowlist UI, and tool rate limiting noted for future PRs. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lockhart#899) * Add Likes/Dislikes tabs to browser extension sidebar Adds two new tabs to the sidebar for managing preferences directly from the browser. Each tab lists preferences with mention counts and an × to delete, plus an input at the bottom to add new ones. The connection status indicator is now in the nav bar so it's visible on all tabs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Remove sandboxed model summarization step for web page content The sandboxed model call (40s on 20B) wasn't providing meaningful security — domain allowlist and no-code-execution already close the real attack surface. Small models (gemma3:1b, qwen2.5:1.5b) hallucinate facts making them worse than passing through Defuddle-extracted content directly. Defuddle already strips nav/boilerplate at the source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…es (jaredlockhart#900) * Store thought valence from reactions and filter thinking by preferences Thumb reactions on thoughts now store valence (1/-1) directly on the thought row instead of extracting a mention=1 preference. This cleans up the preference table (which previously had noisy thought-title entries) and provides a foundation for future thought-based scoring. The thinking agent now gates new thought storage behind a mention-weighted preference filter: if qualifying positive preferences exist (mention>1), a thought must score >= 0 against them before being stored. Inactive when no signal exists yet. Notification scoring is simplified to pure novelty (no sentiment) since the thought loop filter already gates on preference alignment. Key changes: - migration 0018: add thought.valence column - ThoughtStore: set_valence() and get_valenced() - similarity: replace compute_sentiment_score with compute_mention_weighted_sentiment - BrowserChannel: store valence on thought, remove synthetic message creation - HistoryAgent: route thought reactions to set_valence, mark processed immediately - ThinkingAgent: _passes_preference_filter gates new thought storage - NotifyAgent: pure novelty scoring (_select_most_novel @staticmethod) - config_params: remove NOVELTY_WEIGHT and SENTIMENT_WEIGHT Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Backfill thought valence from existing reactions in migration The migration now walks messagelog to find emoji reactions that point to notification messages (thought_id IS NOT NULL) and sets the corresponding thought.valence = 1 or -1. Only fills NULL valence to avoid overwriting a later reaction. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Remove reaction-based preference extraction from history agent Preference extraction now runs only on text messages. Reactions are processed solely for thought valence (set_valence on thought reactions) and then marked as processed — no LLM call, no preference created. 
Removes: ExtractedTopic, ExtractedTopics models, _extract_reaction_preferences, _build_reaction_items, _extract_reaction_topics, _store_reaction_preferences, _classify_reaction_emoji, and REACTION_TOPIC_EXTRACTION_PROMPT. Replaces with: _process_reactions (thought valence only) + _emoji_to_int_valence. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Wire PREFERENCE_MENTION_THRESHOLD into sentiment scoring compute_mention_weighted_sentiment now takes an explicit min_mentions parameter (no default) so the threshold is always sourced from config. _passes_preference_filter reads PREFERENCE_MENTION_THRESHOLD and passes it through, keeping seed-topic eligibility and sentiment filtering in sync. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix has_signal gate to check any qualifying preference, add negative-only test The preference filter gate was only checking positive preferences, meaning thoughts would slip through unfiltered if a user had only negative prefs qualifying for the mention threshold. Now checks any qualifying preference (positive or negative), and adds a test confirming the filter activates with negative-only qualifying prefs (score = 0 - 1 = -1 → filtered). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix stale docstring in _passes_preference_filter Gate activates on any qualifying preference (positive or negative), not just positive ones — updated after the has_signal fix in cae119f. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…hart#901)

- Browser sends a heartbeat to the server on every URL navigation, resetting the idle timer so proactive notifications are suppressed while the user is actively browsing.
- PeriodicSchedule gains requires_idle flag (default True). History and thinking agents set requires_idle=False so they run on their own wall-clock timers independent of user activity. Only NotifyAgent remains idle-gated.
- BackgroundScheduler.notify_activity() resets _last_message_time without touching schedule intervals, used by the heartbeat handler.
- Test fixtures suppress independent schedules via long intervals in DEFAULT_TEST_RUNTIME_OVERRIDES (previously the idle gate did this implicitly).

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The backfill fired search_image_url for every thought with a NULL image_url on startup. On the first deploy after migration 0017 it ran 565 concurrent Serper calls, exhausting the API quota and breaking Signal notification images for the rest of the day. All existing thoughts now have a non-NULL image_url (a URL or an empty string), so the backfill would be a no-op going forward. New thoughts get image_url assigned at creation time via ThinkingAgent. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…aredlockhart#905) * Add browser extension settings panel with icons, domains, and config - Restructure nav into two-tier header: logo/title/thoughts-link/gear in top bar; Chat tab below. Thoughts is now a link button, not a tab. - Add FontAwesome icons throughout sidebar and feed interaction points - Add settings panel (gear icon) that takes over the sidebar: - Likes/Dislikes tabs (moved from main nav) - Domains tab: list, toggle allow/deny, delete, and add new entries from browser.storage.local — pure frontend, no backend needed - Config tab: all runtime ConfigParams rendered from live Python registry (key, description, type, current value, default); edits write to runtime_config DB via new config_request/config_update WebSocket messages; green toast confirms save - Animated typing indicator (staggered dots) and two-tier nav CSS - Fix feed card image corners clipping reaction buttons (border-radius on image directly instead of overflow:hidden on card) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Fix ruff import ordering in _handle_config_update Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Add tests for config_request and config_update browser channel handlers Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…lockhart#906) Proactive messages (thoughts, news, check-ins) have no parent_id and were being merged into large assistant blobs in the conversation context window — up to 20K chars from a day's worth of notifications. They don't belong there: they're already represented via the thought section in the system prompt, and history rollups cover what was discussed. Only user messages and direct replies (parent_id set) are now included in get_messages_since. Conversation turns stay properly ordered since threaded replies are always logged after the messages they reply to. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix spinners getting stuck on prompt log runs The old markRunActive used a timestamp comparison that could miss clearing the class when prompts arrived in bursts. Replace with a simple debounced timer — each new prompt clears the previous timeout and starts a fresh 10s countdown. Clears all timers on re-render. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Simplify to single active run tracking Only one agent runs at a time, so track a single activeRunId instead of a map. When a new run starts, the previous one is immediately deactivated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…riming (jaredlockhart#950) The thinking agent was injecting full thought bodies (often 1000+ chars each) into its system prompt as "Recent Background Thinking." This primed the model to re-search the same topics, producing duplicates that got discarded. Two changes: - Include only thought titles (bullet list) instead of full bodies - Replace vague "check your recent thoughts" with explicit instructions to find a DIFFERENT angle and avoid anything closely related to listed topics Dry-run tested against 2 seeds x 3 runs each — model now searches for new content instead of re-exploring the same topics. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t#951) * Broadcast run outcome to browser so badge appears without refresh When a thinking run completes, set_run_outcome updates the DB but never notified the browser extension. The outcome badge (Stored/Discard) only appeared after a full page refresh. Added a run_outcome_update WebSocket message that fires from the set_run_outcome callback, relayed through background → prompts page, where it inserts the badge into the existing run row in real time. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix run outcome badge vertical padding — equal spacing above and below Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Polish prompt log UI: badge layout, spinner timeout, agent labels - Move outcome badge to its own row with reduced header padding via :has() - Increase active run spinner timeout from 10s to 30s - Add title-case agent labels (Thinking, Chat, History, Notify, Startup) in both run rows and filter dropdown Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ests (jaredlockhart#952) Browse tool retried with fixed 1s delays (2 retries = 2s), but browser reconnection takes ~3s, so 26% of tool calls failed. Changed to exponential backoff (1s, 2s, 4s, 8s) with 4 retries for 15s coverage. Tests were hitting real retry sleeps because the browse provider wasn't mocked — the running_penny fixture now injects a mock browse provider on all agents and the /test command factory so no test depends on a real browser connection. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
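
The retry schedule described in this commit (1s, 2s, 4s, 8s across 4 retries, 15s in total) can be sketched roughly as follows. This is a minimal illustration, not the actual BrowseTool code: `backoff_delays`, `retry_with_backoff`, and the injectable `sleep` parameter are hypothetical names, and retrying only on `ConnectionError` is an assumption.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


def backoff_delays(retries: int = 4, base_delay: float = 1.0) -> list[float]:
    """Delays that double on each retry: 1s, 2s, 4s, 8s for the defaults."""
    return [base_delay * 2**attempt for attempt in range(retries)]


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Try once, then retry up to `retries` times, sleeping longer each time."""
    for delay in backoff_delays(retries, base_delay):
        try:
            return await operation()
        except ConnectionError:
            # Assumed failure type: the browser is not (re)connected yet.
            await sleep(delay)
    # Final attempt after the last delay; its exception propagates to the caller.
    return await operation()
```

Passing `sleep` as a parameter is one way to keep tests from hitting real delays, in the same spirit as the mocked browse provider mentioned above.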
…y empty (jaredlockhart#953) * Filter search snippets from summary input, require image, retry on empty Three changes to reduce wasted thinking runs: 1. Filter page reads: summary step now only gets page-read content (## browse: sections), not search snippets (## search: sections). Search results are titles+links only — summarizing from them produces shallow output. Aborts if no page reads were captured. 2. Require image: aborts if the thinking loop didn't capture any images from browsed pages — thoughts without images have no feed visual. 3. Retry empty summaries: when the model returns empty content (thinking tokens but no output — 18% of runs), retry within the existing URL validation loop instead of immediately abandoning the run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Move browse section format constants to PennyConstants BROWSE_KIND_SEARCH, BROWSE_KIND_PAGE, and SECTION_SEPARATOR were duplicated as string literals across browse.py, thinking.py, base.py, and read_emails.py. Consolidated into PennyConstants and referenced everywhere. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Search last 5 lines for Topic: and retry if missing _parse_title was checking end-of-string only, but the model sometimes puts sources or emoji after the Topic: line. Changed to search the last 5 lines with re.MULTILINE. Also retries the summary when Topic: is missing entirely (alongside empty and hallucinated URL retries). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
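
The "search the last 5 lines for Topic:" change can be illustrated with a short sketch. It assumes the title sits on its own line as `Topic: <title>`; `parse_title`, `TOPIC_PATTERN`, and `tail_lines` are illustrative names and not necessarily those used in `_parse_title`.

```python
import re
from typing import Optional

# Assumed line format: "Topic: <title>" on a line of its own.
TOPIC_PATTERN = re.compile(r"^Topic:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def parse_title(report: str, tail_lines: int = 5) -> Optional[str]:
    """Return the last Topic: title found in the final `tail_lines` lines."""
    tail = "\n".join(report.strip().splitlines()[-tail_lines:])
    matches = TOPIC_PATTERN.findall(tail)
    # None signals "missing Topic:", which the caller can treat as a retry.
    return matches[-1] if matches else None
```

With `re.MULTILINE`, `^` and `$` match at every line boundary, so trailing source lists or emoji after the Topic: line no longer hide it.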
The sidebar previously had inline settings panels and links to separate feed and prompts pages. Now the sidebar is minimal (logo + chat) and clicking the Penny logo opens a single full page with tabs for: Thoughts, Prompts, Schedules, Likes, Dislikes, Domains, Config.

Changes:
- New browser/page/ directory with consolidated HTML, CSS, and TS
- Sidebar stripped to registration + chat only (~1300 lines removed)
- All panels use consistent card-row styling matching the prompts page
- Preference response now includes source field (manual/extracted)
- Config page reads tool-use state from storage on activation
- Input bars moved to top of each panel
- Metadata spread across full width: preference source + mention count, schedule cron expression inline, config default values shown

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…racking (jaredlockhart#955)

1. Filter strictly by BROWSE_PAGE_HEADER (positive match) instead of excluding BROWSE_SEARCH_HEADER. Only actual page reads reach summary.
2. Include the original seed prompt in the summary system prompt as "Original research goal" so the model knows what to extract.
3. Every exit path in after_run now records a run_outcome (no silent abandonment). Refactored the report/too-short/empty branching to ensure exactly one outcome per run:
   - no search results, no page reads, no image → early abort
   - no thought generated → summary returned empty after all retries
   - too short → report below MIN_THOUGHT_WORDS
   - duplicate, matches dislike → dedup/filter checks
   - Stored → success
4. Added tests for search-only abort and empty summary outcome.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…edlockhart#956)

UI improvements:
- Config items: two-row layout with full-width inputs, defaults on header row
- Prompt rows: show response snippet (text or tool call args) instead of prompt_type, wrench/chat icon before snippet
- Run summary wrapper for hover highlight (excludes expanded prompts)
- Compressed meta column widths, date right-aligned next to stats
- Spinner dismissed on run outcome (run complete)
- Sidebar tool-use icon right margin
- Spinner timeout increased to 60s

Backend changes:
- Thinking prompt_type now shows seed topic title-cased instead of "seeded"
- Notify prompt_type shows thought title instead of "thought"
- Malformed tool call arguments (2% of calls) now extracted via regex fallback instead of wrapped in {"raw": ...} which caused silent failures

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t#957) * Send thought notifications directly without LLM rewrite Thoughts are already user-facing quality from the thinking agent's summary step. The notify agent was running them through an agentic loop that just rewrote the same content, burning tokens for no gain. Now _send_best_candidate builds a NotifyCandidate directly from the thought's content and image, bypassing the agentic loop entirely. Checkin mode still uses the agentic loop (it generates new content). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Polish extension page: sub-tabs, agent icons, toasts, bigger logo - Prompts: replace agent filter dropdown with underlined sub-tabs (All, Thinking, Chat, Notify, History) with FontAwesome icons - Thoughts: sub-tabs (New, Archive) use same underlined style - Agent labels on run rows now include matching icons - Toast confirmations on add for likes, dislikes, schedules, domains - Penny icon bumped to 48px on full page Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lockhart#958) (jaredlockhart#958) Replace the removed static conversation history dump (7-8k tokens of topic bullets) with on-demand semantic retrieval. Incoming user messages are embedded and the top 10 most similar past messages are injected into the ChatAgent system prompt as dated quotes, sorted chronologically. This restores conversational continuity (Penny knows what you're referring to) without bloating the context or confusing the model with irrelevant history. Dry-run testing showed zero false positives at 0.5 similarity threshold across 302 real messages. Also strengthens the chat prompt to clarify that Related Past Messages are conversation context only, not a source of facts — preventing the model from hallucinating details about previously discussed topics. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
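
As a rough sketch of the retrieval described in this commit (top 10 past messages at or above a 0.5 similarity threshold, returned in chronological order), the selection step might look like the following. `PastMessage`, `related_messages`, and `format_quote` are hypothetical names, the embeddings are assumed to be precomputed, and the dated-quote layout is a guess at the format rather than the actual prompt text.

```python
import math
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PastMessage:
    timestamp: datetime
    text: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 when either vector has zero length."""
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def related_messages(
    query: list[float],
    history: list[PastMessage],
    threshold: float = 0.5,
    limit: int = 10,
) -> list[PastMessage]:
    """Keep the `limit` most similar messages, then order them by time."""
    scored = [(cosine(query, message.embedding), message) for message in history]
    matches = [pair for pair in scored if pair[0] >= threshold]
    top = sorted(matches, key=lambda pair: pair[0], reverse=True)[:limit]
    return sorted((message for _, message in top), key=lambda m: m.timestamp)


def format_quote(message: PastMessage) -> str:
    """Hypothetical dated-quote layout for the system prompt."""
    return f'[{message.timestamp:%Y-%m-%d}] "{message.text}"'
```

Ranking by similarity first and sorting by timestamp second keeps the best matches while still presenting them to the model in conversation order.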
…hart#959) * Add knowledge extraction and retrieval from browse results Extract and summarize web pages browsed by Penny into a knowledge table, then inject the most relevant entries into chat context using exponentially-decayed weighted conversation scoring (decay=0.5). The HistoryAgent incrementally scans prompt logs for browse tool results, summarizes each page into a dense prose paragraph (8-12 sentences) via LLM, embeds the summary, and upserts by URL. Revisited URLs get their summaries aggregated with new content. Chat context retrieval embeds the full conversation history (not just the last message) and computes weighted similarity scores against knowledge entries, where recent messages contribute more than older ones. This handles both "vague message needs prior context" and "topic pivot" scenarios. Related messages also now use weighted conversation scoring instead of single-message embedding. Validated via prototyping: knowledge context fixed a real production bug where the model confused a storm glass with an eggnog pedal due to lack of factual grounding. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Address code review: constants, imports, timestamps, exceptions - Extract "Title: " magic string to PennyConstants.BROWSE_TITLE_PREFIX - Move sqlmodel/RuntimeConfig imports to top of history.py - Switch knowledge watermark from prompt ID to timestamp (datetime columns for ordering, IDs for joins only) - Switch get_prompts_after to use timestamp-based filtering and ordering - Narrow except Exception to SQLAlchemyError in KnowledgeStore and LlmError in _embed_conversation - Replace hasattr cache pattern with typed class attribute (None default) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add PR review guide and gitignore QUALITY-REVIEW.md Add docs/pr-review-guide.md as the canonical checklist for code reviews. Exclude the disposable QUALITY-REVIEW.md working copy from git. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Remove conversation history rollup system Daily/weekly topic summarization is replaced by knowledge extraction (factual page summaries) and embedding-based related message retrieval (raw messages scored by weighted conversation similarity). - Delete HistoryStore and ConversationHistory model - Drop conversationhistory table (migration 0024) - Remove HistoryDuration enum, HISTORY_MAX_STEPS, MAX_WEEKLY_ROLLUPS - Remove _history_section and all formatting helpers from Agent base - Remove SUMMARIZE_TO_BULLETS prompt, dead config params - Simplify HistoryAgent to: knowledge extraction + preference extraction - Refactor _build_conversation to count-based (last N messages, no time boundary) and _build_related_messages to exclude by conversation IDs - Delete _conversation_start and _midnight_today (rollup artifacts) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Clean up stale docs and dead code from rollup removal - Remove _history_section from CLAUDE.md building blocks list - Remove HistoryDuration from constants.py description - Update HistoryAgent description (no longer summarizes) - Remove 4 dead MessageStore methods: get_messages_in_range, get_reactions_in_range, get_latest_message_time_in_range, get_first_message_time (only used by removed rollup code) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Remove stateful cache and RuntimeConfig watermark, squash migrations - Pass conversation embeddings as parameter instead of caching on instance (_cached_conversation_embeddings removed) - Replace RuntimeConfig watermark with knowledge table FK join (get_latest_prompt_timestamp derives watermark from domain data) - Squash migrations 0023+0024 into single 0023 (one migration per PR) - Add review checklist items: single migration per PR, no RuntimeConfig for application state Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Share 
knowledge summarization rules between new and update prompts Extract shared rules (_KNOWLEDGE_RULES) and format into both KNOWLEDGE_SUMMARIZE and KNOWLEDGE_AGGREGATE, so the update prompt gets the same detailed include/exclude guidance as the new prompt. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Filter browse prompts at query level, tune extraction rate - Rename get_prompts_after to get_prompts_with_browse_after: adds LIKE filter on messages column for browse header, so only prompts with browse results are returned. Eliminates the stuck-watermark problem where batches of non-browse prompts would re-scan indefinitely. - Increase KNOWLEDGE_EXTRACTION_BATCH_LIMIT from 3 to 20 (each browse result takes ~23s, 20 per cycle is ~8min of model time) - Decrease HISTORY_INTERVAL from 3600s to 900s (15min cycles) - Add test: prompts without browse results are skipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
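
The exponentially-decayed conversation scoring introduced in this change (decay=0.5) could be sketched as below. The commit text does not spell out the exact formula, so the weighting here is an assumption: the newest message gets weight 1, each older message gets half the weight of the one after it, and the result is normalized by the sum of weights. `weighted_conversation_score` is a hypothetical name.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 when either vector has zero length."""
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def weighted_conversation_score(
    candidate: list[float],
    conversation: list[list[float]],
    decay: float = 0.5,
) -> float:
    """Score a knowledge entry against message embeddings ordered oldest to newest."""
    if not conversation:
        return 0.0
    # age 0 is the newest message; weights are 1, 0.5, 0.25, ...
    weights = [decay**age for age in range(len(conversation))]
    newest_first = reversed(conversation)
    total = sum(w * cosine(candidate, m) for w, m in zip(weights, newest_first))
    return total / sum(weights)
```

Under this weighting a vague latest message can still inherit context from earlier turns, while a clear topic pivot in the latest message dominates the score.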
) Require Title: + URL: lines in browse sections before summarizing. This rejects all error shapes found in production: browser disconnects, timeouts, blocked domains, Cloudflare challenges, no browser connected, failed to read, and empty extractions. Add 9 unit tests covering every error shape from production data plus the healthy case and empty body edge case. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
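
A structural check of this kind might look like the sketch below. It assumes each browse section body carries a `Title: ` line and a `URL: ` line followed by the page text, and that the section header has already been stripped. `BrowsePage` and `parse_browse_section` are illustrative names, and rejecting an empty body is an assumption of this sketch rather than confirmed behavior of this commit.

```python
from typing import NamedTuple, Optional

TITLE_PREFIX = "Title: "
URL_PREFIX = "URL: "  # assumed prefix, mirroring the Title: line


class BrowsePage(NamedTuple):
    title: str
    url: str
    body: str


def parse_browse_section(section: str) -> Optional[BrowsePage]:
    """Return the parsed page, or None when the section is not a healthy read."""
    title: Optional[str] = None
    url: Optional[str] = None
    body_lines: list[str] = []
    for line in section.splitlines():
        if title is None and line.startswith(TITLE_PREFIX):
            title = line[len(TITLE_PREFIX):].strip()
        elif url is None and line.startswith(URL_PREFIX):
            url = line[len(URL_PREFIX):].strip()
        else:
            body_lines.append(line)
    body = "\n".join(body_lines).strip()
    # Bare error strings have no Title:/URL: lines and are rejected here.
    if not title or not url or not body:
        return None
    return BrowsePage(title, url, body)
```

Requiring a positive structural match means a new, unanticipated error string is rejected by default instead of needing its own exclusion rule.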
…rt#961)

Thinking cycles were discarding 28% of runs as "no page reads" due to two issues: (1) the model passed URLs as {"url": "..."} instead of in the queries array, and (2) search-only loops where the model kept re-searching without ever browsing a page.

Fixes:
- Simplify thinking system prompt with explicit URL-in-queries example
- Add browse nudge in search result headers
- Inject user message "now browse a URL" after search-only tool results (Python-space detection + model nudge hybrid)
- Update chat prompt with same queries-array guidance
- Add after_step conversation parameter for subclass message injection

Also:
- Remove dead browser extension pages (feed/, prompts/)
- Reorder prompt log UI: thinking appears before response
- Update CLAUDE.md directory listing

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…edlockhart#962) The chat agent was scoring past messages with exponentially-decayed weighted scoring against the entire conversation window — same as knowledge retrieval. That caused derailment when retrieved past turns matched the conversation drift more than the live question, so the model would latch onto a stale prior topic instead of answering what was just asked. Knowledge retrieval still uses weighted decay (factual context should follow topic drift). Message retrieval now scores by pure cosine to the current user message only, minus a centrality penalty: adjusted = cosine_to_current - α * centrality where centrality is the candidate's mean cosine to the rest of the corpus. The penalty (α=0.5) suppresses generic centroid-magnet boilerplate (greetings, generic "what are some recent X" framings) that was leaking into every unrelated query. Selection is adaptive: a cluster-strength gate (top5_mean/top20_mean ≥ 1.15) suppresses flat noise plateaus entirely, then `cutoff = max(top5_mean × 0.85, 0.25)` combines a relative band with an empirical absolute floor. Strong clusters return many messages, weak clusters return few, no cluster returns nothing. Candidates are deduped by content text first. Centrality is cached per-sender in memory (lazy on first retrieval, drifts as new messages arrive — acceptable trade-off for the MVP). Revisit with a DB column or background refresh if precision degrades or the corpus grows past a few thousand messages. Tuning calibrated empirically on a held-out set of recent questions covering several scenario classes (recurring/strong, recurring/mid, novel/weak-context, novel/no-context, subgenre confusion). Outcomes: mid-cluster cases gained ~30+ percentage points of precision as the centrality penalty pushed centroid-magnet noise out of top results; densely-discussed recurring topics return more matches; suppression behavior preserved on cases where the cluster gate should fire. 
Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
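
The scoring and adaptive selection in the commit above can be sketched from the stated formulas: `adjusted = cosine_to_current - α * centrality` with α=0.5, a cluster gate of `top5_mean/top20_mean ≥ 1.15`, and `cutoff = max(top5_mean × 0.85, 0.25)`. The function names, the dict-of-scores input, and the handling of non-positive means are assumptions for illustration only.

```python
from statistics import fmean


def adjusted_score(cosine_to_current: float, centrality: float, alpha: float = 0.5) -> float:
    """Penalize candidates that are similar to everything in the corpus."""
    return cosine_to_current - alpha * centrality


def select_related(
    scores: dict[str, float],
    gate: float = 1.15,
    band: float = 0.85,
    floor: float = 0.25,
) -> list[str]:
    """Pick message texts by adjusted score; keying by text dedupes content."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    top5_mean = fmean([score for _, score in ranked[:5]])
    top20_mean = fmean([score for _, score in ranked[:20]])
    # Cluster-strength gate: a flat plateau means no real cluster, return nothing.
    # In this sketch, five or fewer candidates always give a ratio of 1 and are suppressed.
    if top20_mean <= 0 or top5_mean / top20_mean < gate:
        return []
    # Relative band around the top of the cluster, with an absolute floor.
    cutoff = max(top5_mean * band, floor)
    return [text for text, score in ranked if score >= cutoff]
```

Because the cutoff scales with `top5_mean`, a strong cluster admits every message near its top scores, while a weak one is limited by the absolute floor.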
…aredlockhart#963) Browse failures used to flow back as `result.text = "Failed to extract page content"` wrapped in valid-looking `Title:`/`URL:` headers, so the python side accepted them as healthy browse results. The history agent then summarized those literal error strings into refusal-shaped knowledge entries that poisoned future aggregation calls. Browser side: extract_text.ts now exposes an `extracted: boolean` flag and drops the string fallback. browse_url.ts requires `extracted === true` in pollForContent and throws on failure instead of returning a fake-success object. The thrown error propagates through the existing `WsOutgoingToolResponse.error` channel. Python side: BrowseTool._read_page raises (ConnectionError or RuntimeError) instead of returning bare strings. BrowseTool.execute formats the exception path under a new `## browse error: ` header (PennyConstants.BROWSE_ERROR_HEADER) that's structurally distinct from the success header, both readable to the model and grep-able for later analysis. HistoryAgent._parse_browse_section gains an empty-body rejection as belt-and-suspenders. Tests cover the ConnectionError path (no browser), the RuntimeError path (structured browser failure), permission denial, mixed healthy/error sections, and structural rejection in the history parser. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
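
The Python-side split between success and error output might be sketched like this. The `## browse error: ` header is quoted from the commit and `## browse: ` from the section format mentioned in earlier commits; `format_browse_result` and the injected `read_page` callable are hypothetical stand-ins for BrowseTool.execute and BrowseTool._read_page.

```python
from collections.abc import Callable

BROWSE_PAGE_HEADER = "## browse: "
BROWSE_ERROR_HEADER = "## browse error: "


def format_browse_result(url: str, read_page: Callable[[str], str]) -> str:
    """Format one page read; failures get a structurally distinct header."""
    try:
        content = read_page(url)
    except (ConnectionError, RuntimeError) as error:
        # The reader raises instead of returning an error string, so a
        # failure can never be mistaken for page content downstream.
        return f"{BROWSE_ERROR_HEADER}{url}\n{error}"
    return f"{BROWSE_PAGE_HEADER}{url}\n{content}"
```

Since the error header does not share the success prefix, a consumer that filters on `## browse: ` skips failed reads without any string matching on error messages.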
…khart#964) * Show in-flight progress as emoji reactions on the user's message While the chat agent is running, react to the user's incoming Signal message with 💭 (thinking), then morph the reaction to 🔍/📖 as browse tool calls fire (search vs URL read), and clear it when the agent finishes. The final response is sent via the normal send path so it keeps text + image attachments + quote-replies. Why reactions instead of an editable "thinking..." text bubble: Signal mobile/desktop clients silently drop attachments added via message edit even though the wire format technically allows them, so any in-place edited bubble that ended up with an image would lose the image at the receiver. Reactions sidestep editing entirely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Drop ugly del-param-to-silence-linter idiom in browser channel The browser channel's _make_handle_kwargs override had `del progress` at the top to consume the unused argument. That's dead-code dressing for a linter, not real code — just leave the argument and document why it's unused. Add this antipattern to the PR review guide so we catch it next time. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Centralize progress emojis in a ProgressEmoji enum The progress emojis were scattered as raw \U... escapes across three files: a PROGRESS_INITIAL_EMOJI class attr on SignalChannel for 💭, the two-branch return on BrowseTool.to_progress_emoji for 🔍/📖, and a bare literal default on Tool.to_progress_emoji for ⚙️. Move them all to a ProgressEmoji StrEnum in penny/constants.py and reference the symbolic names everywhere. Broaden the constants rule in the PR review guide to catch this case and similar ones — the original wording only flagged module-level _PRIVATE_CONSTANT declarations and missed class-attribute siblings. Also rule out raw literals when an enum exists. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Dedup browse results by URL within knowledge extraction batch Each step of an agentic loop re-logs prior tool result messages, so a single browse appears in many PromptLog rows. HistoryAgent was treating each row as a fresh page and aggregating identical content N times, wasting Ollama calls and progressively distorting the stored summary through repeated KNOWLEDGE_AGGREGATE drift on the same input. Collapse browse results across the batch to one entry per URL (latest content wins) so each page is summarized at most once per cycle. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Replace assert with skip; flag asserts-in-production in review guide `assert x is not None` in production code is an anti-pattern: it gets stripped under `python -O`, silently disabling the check, and is usually just there to satisfy the typechecker. Use real control flow (skip, raise, or refactor the type) instead. - history.py: replace `assert prompt.id is not None` with `if prompt.id is None: continue` in the new dedup helper - pr-review-guide.md: add a checklist item under Error Handling so future reviews flag this pattern Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
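The dedup rule above can be sketched roughly as follows. The `BrowseResult` shape and the `dedup_by_url` name are assumptions for illustration; the sketch also assumes results arrive oldest first, so the last write for a URL is the latest content.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BrowseResult:  # hypothetical shape, not the real PromptLog row
    prompt_id: Optional[int]
    url: str
    content: str


def dedup_by_url(results):
    """Collapse a batch to one entry per URL; later entries overwrite earlier ones."""
    latest = {}
    for result in results:
        if result.prompt_id is None:
            # Real control flow instead of `assert`, which `python -O` strips.
            continue
        latest[result.url] = result
    return list(latest.values())
```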
jaredlockhart#966) Knowledge retrieval was scoring candidates with exponentially-decayed weighted similarity over the entire conversation window, with no floor. Two failure modes were showing up in production: - Topic-bearing questions after a topic shift were dragged toward the prior thread (e.g. asking about guitar pedals while the conversation had drifted to cloves would surface clove entries instead). - Greetings and off-topic chatter still got their top-N picks injected, because retrieval had no way to say "nothing here is a real match". Score each candidate as max(weighted_decay, cosine_to_current_message) and apply an absolute floor (RELATED_KNOWLEDGE_SCORE_FLOOR, default 0.34, runtime-configurable). The weighted leg preserves the vague follow-up case that motivated weighted scoring originally — asking "is it a dud?" still surfaces storm-glass entries when the thread is in the conversation window. The current-cosine leg lets a strong direct match stand on its own merit even when the conversation has drifted. The floor suppresses noise on greetings and uncovered topics. Validated against a held-out set of 13 recent chat runs: 7 cleanup wins (drop noise, keep all hits), 2 mixed wins (drop wrong topic, restore right topic), 1 greeting suppression, 1 unchanged, 2 marginal recall losses where the relevant entries score below floor and the prior baseline was already returning all-wrong entries. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
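A rough sketch of the `max(weighted_decay, cosine_to_current_message)` scoring with an absolute floor. The 0.34 default comes from the commit message; the decay factor, the normalization of the weighted leg, and the function names are assumptions, since the commit does not spell them out.

```python
import math

RELATED_KNOWLEDGE_SCORE_FLOOR = 0.34  # default quoted in the commit message
DECAY = 0.5  # hypothetical decay factor per step back in the conversation


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weighted_decay(candidate, window):
    """Decay-weighted mean similarity; the newest message has weight 1."""
    weights = [DECAY ** age for age in range(len(window) - 1, -1, -1)]
    sims = [cosine(candidate, message) for message in window]
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


def select_related(candidates, window, limit, floor=RELATED_KNOWLEDGE_SCORE_FLOOR):
    """Rank candidates by the better of the two legs and drop anything below the floor."""
    if not window:
        return []
    current = window[-1]  # window is ordered oldest to newest
    scored = []
    for key, vector in candidates.items():
        score = max(weighted_decay(vector, window), cosine(vector, current))
        if score >= floor:
            scored.append((score, key))
    scored.sort(reverse=True)
    return [key for _, key in scored[:limit]]
```

With a window that has drifted from one topic to another, the current-message leg lets the new topic win outright, while the floor removes the stale topic and unrelated entries instead of padding the result to `limit`.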
…dlockhart#967) The dedup-by-URL pass added in jaredlockhart#965 keys on the raw URL string, so `/page` and `/page#anchor` are treated as distinct entries even though the fragment is a client-side anchor that never affects page content. The browse tool follows in-page anchor links from search results, so this is common in practice — production logs show the same wiki and PMC article being summarized 3-4 times in a single batch under fragment variants, with separate knowledge rows written for each. Strip the fragment and lowercase scheme + host (case-insensitive per RFC 3986) before keying the dedup dict and storing the URL on the knowledge row. Path, query, and userinfo are preserved as-is since servers can be case-sensitive about them. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
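A small sketch of the normalization described here, using only the standard library's `urllib.parse`. The function name `normalize_url` is an assumption; the behavior follows the commit text: drop the fragment, lowercase scheme and host, and leave path, query, and userinfo untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Strip the fragment and lowercase scheme + host; keep everything else as-is."""
    parts = urlsplit(url)
    # Split off userinfo so only the host (and numeric port) is lowercased.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = f"{userinfo}{at}{hostport.lower()}"
    return urlunsplit((parts.scheme.lower(), netloc, parts.path, parts.query, ""))
```

Keying the dedup dict on `normalize_url(url)` makes `/page` and `/page#anchor` collapse to a single entry.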
…redlockhart#968) Validating only literally-empty content lets a model emit `\n\n---` (or similar separator/punctuation/emoji-only output) and have it delivered to the user as the final answer, silently overwriting a substantive prior response. Generalize the EMPTY check to count alphabetic characters, with a low threshold that catches garbage shapes without flagging terse legit replies like "done" or "yes". Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
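One way to express the generalized check is to count alphabetic characters. The threshold below is an assumption (the commit only says it is low enough to pass "done" and "yes"), and `has_enough_text` is a hypothetical name.

```python
MIN_ALPHA_CHARS = 2  # hypothetical threshold; the commit does not state the real value


def has_enough_text(content, min_alpha=MIN_ALPHA_CHARS):
    """True when the content contains at least `min_alpha` alphabetic characters."""
    return sum(1 for ch in content if ch.isalpha()) >= min_alpha
```

Separator-only, punctuation-only, and emoji-only output contains no alphabetic characters, so it fails the check, while terse one-word replies pass.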
…art#969) After scoring + cutoff selects hits, pull user messages within ±5 minutes of each hit's timestamp to capture conversational follow-ups that share no entity overlap with the current message but live in the same conversation as a real hit. Single pass — neighbors are deduped by id and content and excluded if they're already in the current conversation window; they are not themselves expanded. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
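The single-pass neighbor expansion could look roughly like this. The `Message` shape and the `expand_neighbors` name are assumptions; the ±5-minute window, the dedup by id and content, and the exclusion of messages already in the current window come from the commit text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

NEIGHBOR_WINDOW = timedelta(minutes=5)


@dataclass(frozen=True)
class Message:  # hypothetical shape for illustration
    id: int
    content: str
    timestamp: datetime


def expand_neighbors(hits, user_messages, current_window_ids):
    """Return user messages within the window of any hit; neighbors are not expanded again."""
    seen_ids = {hit.id for hit in hits} | set(current_window_ids)
    seen_content = {hit.content for hit in hits}
    neighbors = []
    for message in user_messages:
        if message.id in seen_ids or message.content in seen_content:
            continue
        # Compare against the original hits only, which keeps this a single pass.
        if any(abs(message.timestamp - hit.timestamp) <= NEIGHBOR_WINDOW for hit in hits):
            neighbors.append(message)
            seen_ids.add(message.id)
            seen_content.add(message.content)
    return neighbors
```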
…jaredlockhart#970) ScheduleExecutor calls chat_agent.run() directly, bypassing handle() which is the only place _pending_page_context was ever set. Every scheduled fire crashed in _build_messages with AttributeError, so the schedule logged "Executing schedule" but never delivered a message. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…art#972)

* Fix Penny restart loop when signal-api is slow to come up

signal-cli-rest-api takes 30-60s to start cold, but Penny was racing it on every cold boot: validate_connectivity hit a 5s timeout, raised an unhandled ConnectionError, the process exited, docker restarted it, and the loop repeated until signal-api was finally ready. The error never hit penny.log because the traceback went to stderr, making it invisible when debugging from the file logs alone.

Three fixes:

1. docker-compose: signal-api now has a curl healthcheck against /v1/about, and penny waits via depends_on/service_healthy. The race is gone for compose-managed startups. Dev tooling uses --no-deps via the Makefile so make fix/check don't block on signal-api.
2. validate_connectivity now retries up to 12 times with a 5s delay (~60s budget) and logs each failed attempt at WARNING. This handles the manual `docker compose up penny` case and any mid-run signal-api hiccup. Test path can pass max_attempts=1 to keep tests fast.
3. main() catches ConnectionError on startup and logs it via the configured file logger before exiting, so any future startup connectivity failure is debuggable from penny.log alone.

Constants live in PennyConstants (SIGNAL_VALIDATE_MAX_ATTEMPTS, SIGNAL_VALIDATE_RETRY_DELAY, SIGNAL_VALIDATE_HTTP_TIMEOUT).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Document signal startup retry + healthcheck in penny/CLAUDE.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
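A hedged sketch of the retry behavior in this commit. The attempt count and delay match the values quoted in the message; `validate_with_retry` and the `probe` callable are hypothetical stand-ins for the real `validate_connectivity` and its HTTP check.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

SIGNAL_VALIDATE_MAX_ATTEMPTS = 12   # values quoted in the commit message
SIGNAL_VALIDATE_RETRY_DELAY = 5.0   # seconds


async def validate_with_retry(
    probe,
    max_attempts=SIGNAL_VALIDATE_MAX_ATTEMPTS,
    retry_delay=SIGNAL_VALIDATE_RETRY_DELAY,
):
    """Call `probe` until it succeeds; re-raise ConnectionError after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            await probe()
            return attempt
        except ConnectionError as exc:
            logger.warning(
                "signal-api not reachable (attempt %d/%d): %s", attempt, max_attempts, exc
            )
            if attempt == max_attempts:
                raise
            await asyncio.sleep(retry_delay)
```

Tests can pass `max_attempts=1` (or a zero delay) so a failing probe surfaces immediately instead of waiting out the full budget.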
jaredlockhart#971) * Send all unnotified thoughts to browser addon for accurate badge count The browser addon's unnotified thought badge could underreport because the server returned only the newest 50 thoughts and let the addon filter for !notified — old unnotified thoughts outside that window were silently dropped. Server now returns every unnotified thought plus a paginated slice of notified thoughts (page size 12) with a has_more flag, and the addon tracks its current notified limit so background polls don't reset the user's load-more position. Also show the user message text instead of the literal "user_message" label in the prompt log run header for chat runs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Make server own thoughts page size; addon counts pages The previous revision required the server-side page size constant to be mirrored in the addon, which would inevitably drift. Now the addon only tracks how many pages it wants (`notified_pages`), and the server multiplies by `PennyConstants.BROWSER_THOUGHTS_NOTIFIED_PAGE_SIZE` to compute the actual limit. The page size lives in exactly one place. Also adds a review-guide rule against declaring the same constant in both the Python backend and the TypeScript addon. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Use Pydantic models for browser thoughts request/response Thoughts request and response now use BrowserThoughtsRequest / BrowserThoughtsResponse / ThoughtCard Pydantic models instead of raw dicts. Also extract a normalizeSnippet helper in page.ts so the prompt log run header and last-user-message extraction share the same whitespace-collapse + ellipsize transformation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
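A minimal sketch of the server-side selection described above, assuming thoughts arrive newest first. The page size of 12 and the `notified_pages` / `has_more` names come from the commit message; the `Thought` shape and `select_thoughts` are hypothetical.

```python
from dataclasses import dataclass

BROWSER_THOUGHTS_NOTIFIED_PAGE_SIZE = 12  # lives only on the server


@dataclass(frozen=True)
class Thought:  # hypothetical shape for illustration
    id: int
    notified: bool


def select_thoughts(thoughts, notified_pages):
    """Return every unnotified thought, a slice of notified ones, and a has_more flag."""
    unnotified = [t for t in thoughts if not t.notified]
    notified = [t for t in thoughts if t.notified]
    # The addon sends a page count; only the server knows the page size.
    limit = max(notified_pages, 1) * BROWSER_THOUGHTS_NOTIFIED_PAGE_SIZE
    return unnotified, notified[:limit], len(notified) > limit
```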
…sults (jaredlockhart#973) The chat agent's URL hallucination check only consulted the current run's tool results, so URLs the model legitimately echoed from the system prompt knowledge section or prior conversation history were flagged as hallucinated. The validator would discard a fully-formed response, retry, get garbage, exhaust the loop, and the user got nothing. Thread `messages` through `_check_response` -> `_get_source_text` so the full context (system prompt + history + tool results) is the source of truth for URL validation. Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
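The idea of validating URLs against the full message context can be sketched as below. This is an illustration under assumptions: messages are plain dicts with a `content` key, the regex is a simplification, and `source_text` / `unsupported_urls` are hypothetical names rather than the real `_get_source_text` / `_check_response`.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def source_text(messages):
    """Join system prompt, history, and tool results into one searchable string."""
    return "\n".join(message.get("content") or "" for message in messages)


def unsupported_urls(response, messages):
    """Return URLs in the response that appear nowhere in the supplied context."""
    haystack = source_text(messages)
    missing = []
    for raw in URL_PATTERN.findall(response):
        url = raw.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if url not in haystack:
            missing.append(url)
    return missing
```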
jaredlockhart#974) * Clean up post-LLM-migration cruft and apply PR review checklist Two passes: 1. Post-migration cleanup - Delete dead penny/tests/mocks/ollama_patches.py (177 lines, replaced by llm_patches.py / MockLlmClient). - Remove unused PennyResponse.SEARCH_ERROR. - Update penny/CLAUDE.md to reflect current reality: llm/ tree (was ollama/), MockLlmClient/mock_llm fixtures, openai dep (was ollama), Python 3.14, full migration list 0001-0023, new directories (email/, zoho/, html_utils.py), new tools (content_cleaning, draft_email, list_emails, list_folders), /zoho command, Device and DomainPermission tables, dropped source_period_* columns. 2. PR review checklist - mechanical safety fixes - Replace 8 production assert statements with explicit raises or narrowing (assertions get stripped under python -O). - Hoist inline imports to module top in channels/base.py, channels/browser/channel.py, knowledge_store.py, message_store.py. - Narrow broad except Exception: catches in channels/base.py and startup.py (SQLAlchemyError, LlmError). - Replace getattr duck typing for validate_connectivity by adding a no-op base method on MessageChannel. - Import DedupStrategy/is_embedding_duplicate directly from similarity.dedup; delete the re-export shim from llm/similarity.py. - Replace + string concatenation with f-strings in agents/base.py, history.py, notify.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Wire email tool limits through runtime config The email subsystem had three invented limits with no user-configurable control: a hardcoded EMAIL_SEARCH_LIMIT=10 module constant duplicated across jmap/client.py and zoho/client.py, a list_emails tool that declared limit=10 in its Pydantic args and silently clamped any model override to 50, and a parameter the model couldn't even meaningfully use because the schema's max value was invented. 
Replace all three with two new runtime ConfigParams (EMAIL_SEARCH_LIMIT and EMAIL_LIST_LIMIT, both default 10), wired through JmapClient and ZohoClient constructors the same way EMAIL_BODY_MAX_LENGTH already is. The list_emails tool no longer exposes limit to the model — matching search_emails — so the user controls list size via /config and the model picks the folder. No silent clamping; no duplication; default behavior unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Replace fragile asyncio.sleep timing with wait_until in tests Tests in test_scheduler.py used "let several ticks pass" sleeps to verify scheduler behavior. These race on slow CI and waste time on fast machines. Replace each with wait_until polling on the actual condition (agent execute_count, mark_complete_count, cancellation flag). The negative assertion in test_foreground_during_idle now verifies immediately after spinning the scheduler task, per the "verify negatives immediately" guidance. Tests in test_permission_manager.py used hand-rolled sleep+iterate helpers to simulate user approve/deny on a pending future. Replace with a shared _resolve_pending helper that wait_untils on a non-done future before resolving — same effect, no fixed delay. test_browser_channel.py:fake_tool_response converted similarly: poll for a pending future to appear, then resolve. signal_server.py mock helpers left alone — their sleeps are inside hand-rolled polling loops (the wait_until pattern itself), not the fragile fixed-wait pattern the rule targets. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add structural-drift tests for agent system prompts penny/CLAUDE.md and the PR review checklist both require tests that catch when an agent's system prompt building blocks are reordered, added, or removed — but no such test existed for any agent. 
Add tests/agents/test_system_prompts.py with one test per prompt variant: ChatAgent, CheckinMode, ThoughtMode, ThinkingAgent. Each constructs a deterministic baseline state (profile only — no thoughts, knowledge, preferences, or related messages) and asserts on the exact ordered list of (level, title) markdown headers the prompt produces. Drift in section order, missing blocks, or extra blocks fails the test. Asserts on header structure rather than full prompt content so the tests stay maintainable when section content evolves but still catch the structural changes the rule actually exists to detect. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Decompose long methods into named steps Four methods exceeded the 10-20 line guideline (hard max ~25) by a wide margin: _dispatch_to_agent (74), _run_agentic_loop (64), _call_model_validated (68), _process_tool_calls (68). Each was a kitchen-sink function mixing setup, branching, mutation, and cleanup, making it hard to follow what the orchestration actually does. channels/base.py: - Split _dispatch_to_agent into _handle_profile_required, _run_message_through_agent, and _deliver_agent_response. Top-level becomes a clean orchestrator: resolve identity → check profile → run+deliver under typing/progress/foreground bookkeeping. agents/base.py: - _run_agentic_loop now reads as a per-step decision tree, with _tools_for_step (final-step tool stripping), _absorb_tool_step_result (loop-state mutation), and _abort_if_all_tools_failed (early-exit check) extracted as named steps. - _call_model_validated extracts _invoke_model (the LLM call with error handling, narrowed from broad Exception to LlmError) and _append_retry_nudge (the bad-response + nudge append). - _process_tool_calls extracts _dedup_tool_calls, _notify_tool_start, and _collect_tool_results so the orchestrator just sequences the three phases (dedup → notify → execute → collect). Behavior preserved — refactor only. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Drop legacy searchlog table The searchlog table hasn't been written to since the browser-based search migration, but the table and its three indexes were still in the schema and the SearchLog model class was still defined and exported. Add migration 0024 to drop the table and indexes (verified clean against a copy of the production DB via make migrate-test), and remove the model class plus its database/__init__.py exports. Update test_migrations.py expected table set and migration counts, and note 0024 in penny/CLAUDE.md's migration list. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Eliminate remaining duck typing and broad exception in LLM/chat path Four small follow-ups from the audit punch list: 1. llm/client.py reasoning extraction. Replace ``getattr(message, "reasoning_content", None) or getattr(message, "reasoning", None)`` with a clean read from pydantic v2's ``model_extra`` dict — these fields are non-standard SDK extensions and that's exactly where pydantic stashes them. 2. agents/chat.py:caption_image. Drop the production ``assert self._vision_model_client is not None`` and replace with an explicit raise. The channel layer rejects image messages before they ever reach this method when no vision model is configured, so the raise documents the invariant without relying on assert (which gets stripped under python -O). 3. llm/similarity.py:embed_text. Narrow ``except Exception`` to ``except LlmError``. The function is best-effort by design (returns None on failure) but should still propagate non-LLM bugs instead of swallowing them. 4. channels/discord/channel.py. Drop ``getattr(message.author, "discriminator", "")`` / ``"global_name"``. Both are real attributes on every discord.py User/Member subclass — direct access is fine. 
Mock LLM client (tests/mocks/llm_patches.py) updated to expose ``model_extra`` dict so it matches the real SDK shape after change jaredlockhart#1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Constants consolidation, dead-code purge, Pydantic default tightening Three groups of audit findings, one commit since they're all "shared values shouldn't drift" cleanup: **Constants → constants.py** The four ``*PromptType`` classes scattered across agents/chat.py, agents/thinking.py, agents/notify.py, and agents/history.py are moved to penny/constants.py as ``StrEnum`` subclasses (ChatPromptType, ThinkingPromptType, NotifyPromptType, HistoryPromptType). Their values land in promptlog.prompt_type and bubble through to the browser UI for display, so they cross module boundaries via the data flow even when not via direct import — exactly the rule's "shared value, single source of truth" target. ThinkingAgent.THOUGHT_CONTEXT_LIMIT was a class-attribute alias for PennyConstants.THOUGHT_CONTEXT_LIMIT — pure duplication. Reference the constant directly. **Dead constants purged** - channels/base.py:MAX_IMAGE_PROMPT_LENGTH = 300 — never read - constants.py:MAX_PAGE_CONTENT_CHARS = 100_000 — never read - channels/__init__.py:CHANNEL_TYPE_SIGNAL/DISCORD/BROWSER — re-export aliases of ChannelType.* with zero importers - browser/src/protocol.ts:TOOL_TIMEOUT_MS = 60_000 — never imported - browser/src/protocol.ts:MAX_EXTRACTED_CHARS = 50_000 — never imported **Cross-boundary mirror comment fix** ``browser/src/protocol.ts`` previously claimed to "mirror" penny/penny/channels/browser/models.py — exactly the smell the rule forbids. Replace with a more honest header that says only wire-format identifiers must match (because both sides need to encode/decode the same bytes), and everything else should derive from server payloads. **Pydantic optional defaults (item 9 of audit)** Empty-string defaults on ``str`` fields break null-coalescing in JS/TS — ``"" ?? 
fallback`` returns ``""``. Tightened these: - channels/base.py:PageContext (title, url, text) → required, browser always sends them - channels/browser/models.py:BrowserIncoming (content, sender) → required, browser always sends them - channels/discord/models.py:DiscordUser.discriminator → required; discord.py always exposes the field - llm/image_client.py:_GenerateResponse.response → ``str | None = None`` (Ollama may omit it for image responses) - llm/models.py:LlmToolCallFunction.name, LlmToolCall.id → required; a tool call without name/id is meaningless LlmMessage.content stays as ``str = ""`` because empty content is a legitimate state for tool-only assistant messages. Item 10 of audit (``del param`` statements) verified clean — none in the tree. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add coverage for cleanup-introduced behavior + narrow LlmClient excepts Closes the test-coverage gaps from the cleanup PR's earlier commits and narrows two more broad ``except Exception`` blocks I should've caught the first time through. **New tests for changes already in this PR:** 1. ``test_similarity.py::test_non_llm_exception_propagates`` — ``embed_text`` now narrows to ``LlmError``; verify a non-LLM exception (programmer bug) propagates instead of being swallowed as ``None``. 2. ``test_agentic_loop.py::TestModelErrorHandling`` — two cases for the agent's model-call path: - ``LlmConnectionError`` from the model results in ``AGENT_MODEL_ERROR`` (not a crash). - A non-LLM exception (programmer bug) propagates instead of being swallowed. 3. ``test_signal_vision.py::test_caption_image_raises_when_vision_client_missing`` — ``caption_image`` now raises explicit ``RuntimeError`` instead of relying on ``assert``; document the contract. 4. 
``test_pydantic_models.py`` (new file) — ``ValidationError`` cases for every required-field tightening: ``PageContext``, ``BrowserIncoming`` (content + sender), ``DiscordUser.discriminator``, ``LlmToolCallFunction.name``, ``LlmToolCall.id``, ``LlmToolCall.function``. 5. ``test_zoho/test_client.py`` — two new tests verifying the constructor's ``search_limit``/``list_limit`` actually flow through to the Zoho API ``params["limit"]``. End-to-end coverage for the ``/config EMAIL_SEARCH_LIMIT`` / ``/config EMAIL_LIST_LIMIT`` runtime override that the prior PR commit only tested at the constructor-call boundary. **Bonus narrowing — found while writing jaredlockhart#2:** Test jaredlockhart#2 surfaced that ``LlmClient.chat`` and ``LlmClient.embed`` had their own broad ``except Exception`` blocks that wrapped *any* exception as ``LlmResponseError``, hiding the bug-vs-API-error distinction from every caller. Narrowed both to ``except openai.OpenAIError`` (the SDK's top-level base class). Now genuine SDK errors still get wrapped+retried, but unrelated programmer bugs propagate. Two pre-existing tests were faking LLM failures with generic Python exceptions (``RuntimeError("Ollama is down")``, ``ConnectionError``) — updated them to use real LLM error types (``LlmConnectionError``, ``openai.OpenAIError``). Same intent, accurate exception type. (Migration 0024 coverage skipped per discussion — ``make migrate-test`` already validates against a copy of prod.) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Hoist inline imports out of test bodies The /quality review caught three inline ``from … import …`` statements inside test function bodies. The "no inline imports" rule has no test exception, and while doing the fix I also found four MORE pre-existing inline ``from penny.llm.client import LlmClient`` blocks in test_embeddings.py that pre-dated this PR. 
Cleaned all seven up: - tests/test_embeddings.py — hoisted ``import openai``, ``from penny.llm.client import LlmClient``; removed five inline copies (the audit-flagged one plus four pre-existing) - tests/test_similarity.py — hoisted ``from penny.llm.models import LlmResponseError`` - tests/channels/test_startup_announcement.py — hoisted ``from penny.llm.models import LlmConnectionError`` Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Refresh README and CLAUDE.md docs for PR changes Final pass to bring user-facing docs in sync with everything that landed in this PR (and a few smaller drift items I noticed along the way): **README.md** - Python badge 3.12+ → 3.14+ (Dockerfile and pyproject already use 3.14; the badge was the last 3.12 reference). - Add /zoho to the slash commands list. - Runtime config count 23 → 30+ in the two places it appears (config_params.py now has 30 ConfigParams, including the new EMAIL_SEARCH_LIMIT and EMAIL_LIST_LIMIT added in this PR). - Add make migrate-validate to the make commands list. - Test infrastructure: "mock Ollama client" → "mock LLM client (MockLlmClient, patches openai.AsyncOpenAI)" to reflect the Ollama→OpenAI SDK migration. Drop "mock search APIs" — search is via the browser extension now, no mock search APIs exist. **CLAUDE.md (root)** - "What Is Penny": clarify that the LLM is accessed via the OpenAI SDK against an OpenAI-compatible endpoint (Ollama by default), not directly via the Ollama SDK. - docs/ directory listing: add the four files that were missing — most importantly pr-review-guide.md, the canonical PR review checklist that the /quality skill consumes. - New "PR Review Checklist" section pointing at docs/pr-review-guide.md as the source of truth for every rule the project enforces. The Code Style and Design Principles sections above are the quick reference; the guide is the full rulebook. 
**penny/CLAUDE.md** - Add /zoho to the Conditional Commands list (was already in the directory structure but missing from the prose). - Runtime Configuration "Groups" line: mention email body/search/list limits alongside the other Global params. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * README: comprehensive refresh — fix stale memory model, env vars, commands The first README pass was too shallow. A proper sweep against the current code surfaced significant drift in nearly every section: **Memory section was wrong** — described "daily summaries" and "weekly entries" that haven't existed since migration 0023 dropped the conversationhistory table. Replaced with the actual three-layer model: knowledge entries (per-URL page summaries with embedding-based retrieval), related-message retrieval (cosine similarity with centrality penalty + ±5-minute neighbor expansion), and preferences. **Penny's Mind diagram** had a stale "Daily & Weekly Summaries" node in the Memory subgraph and a History → Summaries edge that no longer fires. Replaced with a Knowledge node and corrected the edges. **Cognitive Cycle** bullet 2 ("summarizes conversations into daily and weekly entries") rewritten to describe what HistoryAgent actually does — knowledge extraction from browses, two-pass preference extraction from messages. **Conversations section** mentioned "via Ollama" without acknowledging the OpenAI SDK migration. Updated to clarify Penny uses the OpenAI Python SDK against any OpenAI-compatible endpoint (Ollama by default). **Preferences section** said "after each day's conversations" — the HistoryAgent runs continuously, not on a daily cycle. Rewrote to describe the actual two-pass identify-then-classify pipeline plus the mention-count threshold gate. **Commands list** was missing /commands, /debug, /unschedule, /test, and reordering for clarity. Added env var requirements per command. 
**Make Commands** listed `make fmt` which doesn't exist (only fix exists, which combines format + lint --fix). Removed it; added the real `make team-build` and `make browser-build` targets; corrected the `make check` description to list everything it actually runs. **.env example** was missing Zoho entirely, missing the canonical LLM_* env names (canonical post-OpenAI-SDK-migration), and missing the optional embedding/vision/image API URL/key overrides. Rewrote the block to mirror .env.example with comments explaining each. **Configuration Reference** had two completely fake env vars: OLLAMA_MAX_RETRIES and OLLAMA_RETRY_DELAY don't exist anywhere in the code anymore — llm_max_retries/llm_retry_delay are hardcoded defaults on the Config dataclass. Removed. TOOL_TIMEOUT default was documented as 60s but actual default is 120s. Fixed. Whole Ollama: subsection rewritten as LLM: section showing the new LLM_* canonical names with OLLAMA_* fallbacks called out for backwards compat. Added Browser Extension subsection. Added Zoho to API Keys. **Models table** updated to show env var per role and to mention that each model can target a different OpenAI-compatible endpoint via the corresponding _API_URL/_API_KEY overrides. **Browser Extension section** was missing six features the addon now ships: - Live in-flight tool status in chat ("Searching…", "Reading X…") - Per-addon tool-use toggle - Cross-device domain permission prompts (also answerable from Signal) - Schedule manager UI - Settings panel (domains + runtime config) - Prompt log viewer (every LLM call browseable, grouped by run id) - Signal in-flight progress as morphing emoji reactions on the user's message **Setup prerequisites** updated to acknowledge that omlx and other OpenAI-compatible endpoints work, not just Ollama. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * README: reframe from Ollama-centric to OpenAI-compatible Penny no longer has Ollama-specific runtime dependencies (other than the /draw image generation endpoint). The README still led with "Ollama" in the badge, several headings, and the configuration reference, giving the impression it's an Ollama-first project. - Replace the "Ollama" badge with "OpenAI-compatible LLM" - Conversations section: lead with "OpenAI Python SDK against any OpenAI-compatible endpoint" and list Ollama/omlx/vLLM/OpenAI as examples, not as the primary identity - Models table: explicitly state that text/vision/embedding all go through the OpenAI SDK; call out image generation as the one exception (uses Ollama's /api/generate directly) - Setup prerequisites: lead with "OpenAI-compatible LLM endpoint" and list backends as choices, not required software - Configuration Reference: drop the "OLLAMA_*" fallback noise from every line; state upfront there are no Ollama-specific dependencies; document the image generation exception clearly; move legacy OLLAMA_* names to a one-sentence footnote - .env example: "any OpenAI-compatible endpoint" framing, not "Ollama (default)"; "unauthenticated local backends" not "local Ollama" Every remaining "Ollama" reference is now either (a) listed as one example backend among several, (b) the explicitly documented /draw image generation exception, (c) the backwards-compat OLLAMA_* env name footnote, or (d) the real OLLAMA_BACKGROUND_MODEL env var that penny-team's Quality agent still reads. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Drop legacy OLLAMA_* env var fallbacks from code and docs No userbase to preserve backwards compatibility for — just one user. The OLLAMA_* fallback chain in config.py added complexity for no benefit. Code: - config.py: remove all `os.getenv("OLLAMA_*")` fallback calls. Each `LLM_*` env now reads directly with its own default. 
No more nested `os.getenv("LLM_X", os.getenv("OLLAMA_X", default))` chains. - Rename `ollama_api_url` field → `image_api_url` (it's only used by the image generation client). Reads from `LLM_IMAGE_API_URL` env. - penny.py: `config.ollama_api_url` → `config.image_api_url` - chat.py, test docstrings: OLLAMA_VISION_MODEL → LLM_VISION_MODEL Docs: - .env.example: all LLM_* names, no OLLAMA_* duplicates - CLAUDE.md (root): Ollama section rewritten as LLM section - penny/CLAUDE.md: /draw and vision refs updated to LLM_* names - README.md: drop the legacy OLLAMA_* fallback footnote; image gen now documented as `LLM_IMAGE_API_URL` not `OLLAMA_API_URL` The only remaining OLLAMA_ env var anywhere in the project is `OLLAMA_BACKGROUND_MODEL` which penny-team's Quality agent still reads — that's their code, not ours to change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
) The "Global" group was a junk drawer with 10 unrelated params (email tools, browser domain mode, chat limits, embedding backfill, context window). "Schedule" had only IDLE_SECONDS. Params that a user tunes together were scattered across different groups.

New grouping by what the user is actually tuning:

- **Chat** (8 params): foreground conversation + retrieval context (MESSAGE_MAX_STEPS, CHAT_MAX_QUERIES, MESSAGE_CONTEXT_LIMIT, SEARCH_URL, RELATED_MESSAGES_LIMIT, RELATED_KNOWLEDGE_LIMIT, RELATED_KNOWLEDGE_SCORE_FLOOR, DOMAIN_PERMISSION_MODE)
- **Thinking** (7 params): inner monologue (INNER_MONOLOGUE_*, THOUGHT_DEDUP_*, MAX_UNNOTIFIED_THOUGHTS, FREE_THINKING_PROBABILITY)
- **History** (6 params): background extraction (HISTORY_INTERVAL, PREFERENCE_DEDUP_*, PREFERENCE_MENTION_THRESHOLD, KNOWLEDGE_EXTRACTION_BATCH_LIMIT, EMBEDDING_BACKFILL_BATCH_LIMIT)
- **Notify** (5 params): notification outreach + idle timing (IDLE_SECONDS (moved from Schedule), NOTIFY_CHECK_INTERVAL, NOTIFY_COOLDOWN_MIN/MAX, NOTIFY_CANDIDATES)
- **Email** (4 params): email tool settings (EMAIL_BODY_MAX_LENGTH, EMAIL_SEARCH_LIMIT, EMAIL_LIST_LIMIT, JMAP_REQUEST_TIMEOUT)

"Global" and "Schedule" groups dissolved entirely. "Inner Monologue" renamed to "Thinking" for clarity. IDLE_SECONDS description updated from "Global idle threshold" to "Seconds of silence before background agents become eligible" since it now lives in the Notify group.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…age (jaredlockhart#976)

Each preference row now shows how many thoughts were seeded by it, and rows with thoughts are expandable to show a list with title, date, image thumbnail, and content. Thoughts are lazy-loaded via a new WebSocket message pair (preference_thoughts_request/response).

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
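The lazy-load exchange can be pictured roughly as follows. This is a sketch under assumptions: the two message type names come from the commit message, but the payload fields (`preference_id`, `thoughts`), the `Thought` shape, and the in-memory store are hypothetical, and the real handler presumably reads from the database and sends over an actual WebSocket.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class Thought:
    """Hypothetical shape mirroring the fields shown in the expanded row."""

    title: str
    date: str
    image: Optional[str]
    content: str


# In-memory stand-in for wherever thoughts are actually stored.
THOUGHTS_BY_PREFERENCE: Dict[int, List[Thought]] = {
    1: [Thought("Trail notes", "2026-04-01", None, "Look up spring hiking routes.")],
}


def handle_message(message: dict) -> Optional[dict]:
    """Answer a thoughts request; return None for message types handled elsewhere."""
    if message.get("type") != "preference_thoughts_request":
        return None
    preference_id = message["preference_id"]
    thoughts = THOUGHTS_BY_PREFERENCE.get(preference_id, [])
    return {
        "type": "preference_thoughts_response",
        "preference_id": preference_id,
        "thoughts": [asdict(thought) for thought in thoughts],
    }
```

Echoing the preference id back in the response lets the client attach the list to the row that was expanded, which matters when several rows are opened in quick succession.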
…es (jaredlockhart#977)

Image-only messages logged with empty content produced a conversation of just "[HH:MM] ", which is truthy, so the empty-guard passed, the LLM was called, returned no preferences, and did_work=False prevented marking the message processed. Same loop fired for any unprocessed message that legitimately yielded zero preferences. Observed 7 identical identification calls against one empty message across a single minute in promptlog.

Fix splits identification failure (retry) from empty results (done):
- _format_messages skips messages with empty/whitespace content
- _extract_text_preferences returns True when the attempt completes, False only when identification itself fails (exception / unparseable JSON)
- _extract_preferences_from_content returns True on any completed pass, False only when _identify_preference_topics returns None

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
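The retry-versus-done split described above can be sketched in a few lines. This is not the project's code: the function names drop the leading underscores to mark them as illustrative, `call_llm` is a hypothetical callable returning a JSON string, and treating an all-empty conversation as a completed pass is an assumption about how the empty-guard behaves after the fix.

```python
import json
from typing import Callable, Iterable, List, Optional, Tuple


def format_messages(messages: Iterable[Tuple[str, str]]) -> str:
    """Render (timestamp, content) pairs, skipping empty or whitespace-only content."""
    lines = []
    for timestamp, content in messages:
        if not content or not content.strip():
            continue  # image-only messages no longer yield a bare "[HH:MM] " line
        lines.append(f"[{timestamp}] {content}")
    return "\n".join(lines)


def identify_topics(conversation: str, call_llm: Callable[[str], str]) -> Optional[List]:
    """Return the parsed topic list, or None when identification itself fails."""
    try:
        parsed = json.loads(call_llm(conversation))
    except Exception:
        return None  # LLM error or unparseable JSON: the caller should retry later
    return parsed if isinstance(parsed, list) else None


def extract_preferences(messages, call_llm: Callable[[str], str]) -> bool:
    """True means the pass completed (mark processed); False means retry."""
    conversation = format_messages(messages)
    if not conversation:
        return True  # assumption: nothing to analyse counts as a completed pass
    topics = identify_topics(conversation, call_llm)
    if topics is None:
        return False
    # Storing the topics is omitted; an empty list is still a completed pass.
    return True
```

The key point is that an empty topic list and a failed identification take different return paths, so a message that legitimately yields zero preferences is marked processed instead of being re-sent every cycle.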