Refactor email system into plugin architecture with multi-provider support#938

Open
alifeinbinary wants to merge 107 commits into jaredlockhart:main from alifeinbinary:plugin-system

Conversation

@alifeinbinary
Contributor

  • Extract email functionality into a plugin system under penny/plugins/
  • Establish an architecture for plugins in which service-specific code is abstracted
  • Move the Fastmail JMAP client to plugins/fastmail/ with a FastmailPlugin class
  • Move the Zoho client to plugins/zoho/ with a ZohoPlugin class
  • Create an InvoiceNinja stub plugin for future invoicing integration; it can serve as standard boilerplate for future plugins
  • The /zoho command is now unified under /email with multi-provider routing support
    • Single provider: /email
    • Multiple providers: /email

alifeinbinary and others added 30 commits March 25, 2026 06:20
…sues

ReadEmailsTool was running fetched emails through Ollama summarization, adding latency and losing detail. The agent already has the full email content in context and can answer questions directly.

Changes:
- Remove OllamaClient and user_query params from ReadEmailsTool
- Return raw email content joined with separators instead of summary
- Remove ReadEmailsArgs Pydantic model (use kwargs directly)
- Remove EMAIL_SUMMARIZE
… Discord event handlers for reconnecting.

Adds extensive logging to debug Discord message reception issues:
- Log intents configuration, bot user ID, gateway latency on ready
- Add on_connect, on_disconnect, on_resumed gateway event handlers
- Log raw MESSAGE_CREATE gateway events via on_socket_raw_receive
- Log ALL messages in on_message before filtering (author, channel, content)
- Log filter decisions (own message, wrong channel) with [DIAG] prefix

Add validate_connectivity()
…liest (jaredlockhart#863)

_find_unrolled_weeks used get_recent(limit=1) which returns the most
recent daily entry, but treated it as the earliest. When the most recent
entry is in the current week, first_monday == current_monday and the
scan loop never executes — so no completed weeks are ever found.

- Add get_earliest() to HistoryStore (ASC ordering)
- Use get_earliest() in _find_unrolled_weeks
- Update test to seed current-week entries alongside past weeks

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…wns (jaredlockhart#864)

* Improve notification scoring, thinking distribution, and topic cooldowns

- Normalize novelty and sentiment scores to [0,1] via min-max scaling before
  applying weights, so both dimensions contribute proportionally instead of
  novelty dominating due to its ~4x larger raw range
- Add per-topic 24h notification cooldown: once a preference (or free thought)
  is notified, that topic is excluded from candidates for 24 hours
- Add MAX_UNNOTIFIED_THOUGHTS config param (default 20) — thinking agent skips
  cycles when unnotified thoughts reach the cap
- Replace random-roll thinking mode selection with distribution-based steering:
  compare actual free/seeded ratio against target probabilities and pick
  whichever type is underrepresented
- Add ThoughtStore.count_unnotified() and count_unnotified_free() queries
- Add THOUGHT_TOPIC_COOLDOWN_SECONDS constant (86400)
- 12 new tests covering normalization, cooldown, cap, and distribution logic
- All existing tests updated to monkeypatch probability constants for
  determinism independent of production values
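The min-max normalization described in the first bullet can be sketched as follows; function names and the batch-wise scaling are assumptions about the implementation:

```python
def min_max(values: list[float]) -> list[float]:
    """Scale raw scores to [0, 1]; constant lists map to 0.5 to avoid div-by-zero."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def combined_scores(
    novelty: list[float],
    sentiment: list[float],
    novelty_weight: float = 0.5,
    sentiment_weight: float = 0.5,
) -> list[float]:
    """Apply the weights only after both dimensions are on the same [0, 1] scale,
    so novelty's larger raw range can no longer dominate the sum."""
    n, s = min_max(novelty), min_max(sentiment)
    return [novelty_weight * a + sentiment_weight * b for a, b in zip(n, s)]
```

Without the normalization step, a novelty range roughly 4x wider than sentiment's would make the weighted sum track novelty almost exclusively, whatever the configured weights.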

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Move thinking distribution constants to runtime config params

FREE_THINKING_PROBABILITY and NEWS_THINKING_PROBABILITY are now runtime-
configurable via /config instead of hardcoded constants. The seeded
probability is implicit (1 - free - news). Tests pass probabilities
through make_config() instead of monkeypatching PennyConstants.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ockhart#865)

* Move scoring weights to runtime config params (default 50/50)

NOVELTY_WEIGHT and SENTIMENT_WEIGHT are now runtime-configurable via
/config instead of hardcoded constants. Default changed from 40/60 to
50/50 for equal weighting now that normalization makes both dimensions
comparable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix thinking agent flooding logs when at unnotified cap

When MAX_UNNOTIFIED_THOUGHTS is reached, get_prompt returned None which
made execute_for_user return False. The scheduler treated that as "no
work" and retried every tick (~1s), flooding the log.

Move the cap check to execute_for_user and return True when skipping,
so the scheduler calls mark_complete and waits for the next interval.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rt#866)

Append -site: exclusions for blocked domains (facebook, instagram, tiktok)
to the Serper query so Google filters them server-side. Previously we only
filtered after download, so queries dominated by these domains returned no
image at all.
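The query rewrite amounts to string concatenation with Google's `-site:` operator. The domain list and function name are assumptions for illustration:

```python
BLOCKED_DOMAINS = ["facebook.com", "instagram.com", "tiktok.com"]  # assumed list


def build_serper_query(query: str) -> str:
    """Append -site: operators so Google excludes blocked domains server-side,
    instead of Penny filtering results after download."""
    exclusions = " ".join(f"-site:{d}" for d in BLOCKED_DOMAINS)
    return f"{query} {exclusions}"
```

Server-side exclusion means the result page is filled with usable candidates, rather than being dominated by hits that get discarded locally.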

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The THINKING_REPORT_PROMPT was producing thoughts framed as corrections
or debunking ("it turns out X is NOT Y"), which sounds wrong in
spontaneous notifications where there's nothing to correct. Updated the
prompt to frame findings as standalone new discoveries and to discard
searches that only found that something doesn't exist.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add topic context intro to notify prompt

Thought notifications were jumping straight into details without
establishing what the topic is, leaving the reader confused (e.g.,
"Kokoroko's new RSD-2026 vinyl..." with no mention that Kokoroko is
a band). Updated NOTIFY_SYSTEM_PROMPT to instruct the model to open
with a brief identifying phrase before diving into details.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add full system prompt assertions for news and checkin notify modes

Extends the test coverage pattern to all three notification modes.
ThoughtMode already had a full prompt assertion; now NewsMode and
CheckinMode do too, catching structural drift in prompt composition.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…redlockhart#871)

* Make thought notifications conversational instead of report-style

The thinking report prompt produces structured content (bullets, headers,
tables) which the notify agent was regurgitating verbatim. Changed the
instruction from "Share what's in it — the thought IS the substance"
to "Retell it conversationally — no bullet lists, no headers, no tables"
so notifications read like a friend explaining what they found.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix flaky schedule test by polling for expected message content

Replace wait_for_message (returns last message, vulnerable to race
conditions) with wait_until + _has_message pattern that polls for
the specific expected content. This matches the convention used by
the rest of the test suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lockhart#872)

* Steer thinking agent away from troubleshooting/support content

The thinking agent was searching for bug reports and support articles
(e.g., "UAD plugin glitch") and surfacing them as interesting
discoveries. Added guidance to look for releases, creative work, and
discoveries while avoiding troubleshooting guides and bug reports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add casual greeting to proactive notifications

Notifications were jumping straight into content without a greeting.
Added "Start with a casual greeting" to NOTIFY_SYSTEM_PROMPT,
matching the pattern already used by the news notification prompt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…aredlockhart#873)

NEWS_NOTIFY_MAX_STEPS was 1, but the agent base class strips tools on
the final step. With only 1 step, fetch_news could never execute —
the model's tool call was discarded as "hallucinated on final step"
and every news attempt produced an empty response that got
disqualified. Bumped to 3 steps so the model can call the tool and
format results.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ckhart#874)

* Use thought title as image search fallback for notifications

When a thought notification has no tool calls (model retells thought
context directly), the image search fell back to using the full
message text, producing bad image results. Now ThoughtMode extracts
the first bold headline from the thought content as the image query
(e.g., "Bad Cat Era 30 – A Hand-Wired EL84 Head"), which is a much
better match for finding a relevant product/topic image.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Strip generic prefixes from thought titles for image search

Thought titles like "Briefing: Tone King Royalist" or "Here is
something interesting I learned about the Vox AC15HWR1" had generic
prefixes that diluted image search results. Added _clean_thought_title
that strips common prefixes (Briefing:, Detailed Briefing:, etc.)
and filters out completely generic titles. Tested against 100 recent
thoughts: 97/100 produce good image queries after cleaning.
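A minimal sketch of the prefix stripping, using the two examples from this commit; the real _clean_thought_title and its prefix list may differ:

```python
# Prefixes taken from the commit message; the full production list is assumed.
_TITLE_STRIP_PREFIXES = (
    "Briefing:",
    "Detailed Briefing:",
    "Here is something interesting I learned about",
)


def clean_thought_title(title: str) -> str:
    """Strip generic lead-ins so the image query is just the subject."""
    cleaned = title.strip()
    for prefix in _TITLE_STRIP_PREFIXES:
        if cleaned.lower().startswith(prefix.lower()):
            cleaned = cleaned[len(prefix):].strip(" :-")
    return cleaned
```

Titles that are nothing but prefix would additionally need the generic-title filter mentioned above, which is omitted here.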

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Reduce fuzzy duplicate preference extraction

The preference extraction prompt was creating near-duplicate preferences
like "Tubesteader Eggnog user reviews" and "Tubesteader Eggnog 12AX7
pre-amp" when "Tubesteader pedals" already existed. These slipped past
both TCR and embedding dedup because short strings with slightly
different wording produce low similarity scores.

Added explicit guidance that asking about reviews, specs, or details of
a known item is engagement with the existing preference, not a new one.
Added a concrete example matching the observed failure pattern.

Dry-ran against the actual prompt that produced the duplicate — 3/3 runs
correctly classified it as existing instead of new.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Skip questions and tasks in preference extraction

The model was extracting questions and troubleshooting requests as
preferences (e.g., "Running preamp into front of amp", "preamp output
confusion", "pedals powered via XLink Out"). Added explicit guidance
to skip questions, tasks, and troubleshooting requests.

Dry-ran against 4 prompts that produced task preferences — all 4
previously-bad extractions are now suppressed or significantly reduced.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Cache embeddings on thought and messagelog tables

Thoughts and outgoing messages were being re-embedded from scratch on
every dedup check and novelty comparison. Added embedding BLOB columns
to both tables so embeddings are computed once and reused.

- Migration 0014: adds embedding column to thought and messagelog
- ThinkingAgent: embeds and stores at thought creation time, uses
  cached embeddings in dedup (skips thoughts without embeddings)
- NotifyAgent: uses cached message embeddings for novelty scoring,
  backfills on first access
- Startup backfill job extended to populate thought embeddings
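The compute-once-and-reuse pattern behind the BLOB columns can be sketched like this. The embedding function is a stand-in for the real Ollama call, and the schema is simplified:

```python
import sqlite3
import struct


def embed(text: str) -> list[float]:
    """Stand-in for the real Ollama embedding call."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def to_blob(vec: list[float]) -> bytes:
    return struct.pack(f"{len(vec)}f", *vec)


def from_blob(blob: bytes) -> list[float]:
    return list(struct.unpack(f"{len(blob) // 4}f", blob))


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE thought (id INTEGER PRIMARY KEY, content TEXT, embedding BLOB)")
db.execute("INSERT INTO thought (content) VALUES ('old thought')")  # pre-migration row


def backfill_embeddings() -> int:
    """Startup job: embed any row whose cache column is still NULL."""
    rows = db.execute("SELECT id, content FROM thought WHERE embedding IS NULL").fetchall()
    for row_id, content in rows:
        db.execute(
            "UPDATE thought SET embedding = ? WHERE id = ?",
            (to_blob(embed(content)), row_id),
        )
    return len(rows)
```

Because the backfill only selects NULL rows, re-running it at every startup is idempotent and cheap once the table is fully populated.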

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Embed messages at insert time and backfill at startup

Messages were being lazily backfilled in the notify agent on read.
Moved embedding to send_response (insert time) so every outgoing
message gets its embedding cached immediately. Added startup backfill
for existing messages without embeddings, and a test assertion that
thoughts get embeddings stored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The NOTIFY_NEWS prompt said "the source in parentheses" which the
model interpreted as the outlet name (e.g., "(New York Times)") rather
than the actual URL from the tool results. Changed to "the source URL
from the tool results" so URLs are included.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Chat was instructed to "Focus on ONE topic per response" and "go deep"
which produced narrow answers that missed important angles (e.g.,
trauma/immune question only covered physical trauma, ignored PTSD).
Changed to "Go WIDE: cover as many angles as possible" with multiple
search queries and follow-up searches for comprehensive answers.

Thinking mode stays go-deep (autonomous exploration of one thread).
Chat mode is now go-wide (user wants the full picture).

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ckhart#879)

* Skip daily history entries already covered by weekly rollups

The history context was including both weekly rollups AND their
constituent daily entries, causing duplicate topics in the system
prompt. Now _format_daily_entries checks each day against the weekly
rollup date ranges and skips days that fall within a completed week.
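The date-range check can be sketched as follows; the signature is illustrative (the real _format_daily_entries formats text, not dates), assuming each rollup is keyed by its Monday and covers Monday through Sunday:

```python
from datetime import date, timedelta


def filter_daily_entries(days: list[date], rollup_mondays: list[date]) -> list[date]:
    """Keep only days not covered by a completed weekly rollup."""
    covered = [(monday, monday + timedelta(days=6)) for monday in rollup_mondays]
    return [d for d in days if not any(start <= d <= end for start, end in covered)]
```

Days inside a rolled-up week are dropped; days after the last completed week still flow through as individual entries.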

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add test for daily/weekly history overlap filtering

Verifies that daily entries within a weekly rollup's date range are
excluded from the history context, while daily entries outside the
range are still included.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changed THINKING_REPORT_PROMPT from structured report format (tables,
headers, 500 words) to conversational message format (casual greeting,
details, URL, 300 words). Thoughts are now stored in the shape they'll
be shared, cutting context size in half.

Loosened NOTIFY_SYSTEM_PROMPT to relay the thought as-is instead of
re-summarizing. Old prompt: "Retell it conversationally, no bullets/
headers/tables." New prompt: "Share it with the user, don't compress
or summarize, just relay in your own voice."

Tested end-to-end on 3 examples: new pipeline produces notifications
with equal or better detail than the original two-step process.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
)

The "No greetings, no sign-offs" rule was in PENNY_IDENTITY which is
shared by all agents, causing proactive notifications to skip greetings
even though the notify prompt said to include one. Moved the rule to
CONVERSATION_PROMPT so it only applies when responding to user messages.

Also removed the greeting from THINKING_REPORT_PROMPT since the notify
agent now handles greetings — the stored thought shouldn't include one.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…redlockhart#882)

* Score thoughts by cached embedding before generating notification

Previously generated N candidates through the model, then scored them.
Now scores the raw thoughts using cached embeddings (novelty +
sentiment), picks the winner, then runs only the winner through the
notify agent. With NOTIFY_CANDIDATES=5, this cuts model calls from
5 to 1 per notification cycle.

Possible because thoughts are now stored in notification-ready shape
with pre-computed embeddings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add integration test for embedding-based thought scoring

Tests the full notification flow with 3 thought candidates: seeds DB
with preferences, thoughts with embeddings, and an incoming message,
then runs execute_for_user and asserts a notification was sent and
exactly 1 of 3 thoughts was marked notified.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Assert on full call chain in embedding scoring test

Verify every edge of the score-then-generate flow:
- 1 Ollama chat call (winner only, not all candidates)
- 1 embed call (outgoing message at send time, not during scoring)
- 1 serper image search
- Message delivered via Signal
- 2 of 3 thoughts remain unnotified
- 1 thought marked notified in DB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Simplify image search to use thought content directly

The bold-title extraction and prefix-cleaning logic was built for
the old structured report format. With conversational thoughts, bold
titles are rare. Now uses first 300 chars of thought content as the
image query — the subject name consistently appears in the first
sentence or two, and serper is smart enough to extract it.

Removed dead code: _clean_thought_title, _is_generic_title,
_TITLE_STRIP_PREFIXES.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add thought title for dedup and image search

The thinking report prompt now emits a 'Topic: <title>' line that gets
parsed and stored separately. Titles are short (e.g., "Tubesteader
Beekeeper pedal") so they embed closely for duplicates and work well
as image search queries.

Key changes:
- Migration 0015: adds title column to thought table
- THINKING_REPORT_PROMPT: emits 'Topic: ...' on last line
- ThinkingAgent: parses title, embeds title (not content), stores both
- Thought dedup: now global (all thoughts, not per-preference) using
  TCR_OR_EMBEDDING on titles — catches cross-preference duplicates
- Image search: uses thought.title when available
- New runtime config: THOUGHT_DEDUP_TCR_THRESHOLD (default 0.6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Separate title and content embeddings on thoughts

Title embedding for dedup (short string, high discrimination),
content embedding for novelty/sentiment scoring (full message vs
messages/preferences). Both computed at creation time and cached.

- Added title_embedding column to thought table (migration 0015)
- ThinkingAgent stores both embeddings at creation
- Dedup uses title_embedding, scoring uses embedding (content)
- Added THOUGHT_DEDUP_TCR_THRESHOLD runtime config param (0.6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ckhart#884)

OR strategy produced false positives from common short words ("2026",
"AI", "agent") matching via TCR on short titles. Switched to AND
(both TCR >= 0.6 AND embedding >= 0.6 required) which eliminates
all false positives while catching real duplicates.

Also lowercase titles before embedding so casing doesn't affect
similarity (e.g., "THE GHOST IN THE SHELL" vs "Ghost in the Shell"
was 0.381, now 0.652 after lowercasing).

Lowered THOUGHT_DEDUP_EMBEDDING_THRESHOLD default from 0.80 to 0.60
since title embeddings score lower than full-content embeddings.
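The AND strategy reduces to requiring both signals to clear their thresholds. The TCR definition below (shorter title's token overlap) is an assumption; only the AND combination and the 0.6 defaults come from the commit:

```python
def token_containment_ratio(a: str, b: str) -> float:
    """Share of the shorter title's tokens contained in the longer one (assumed TCR)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    shorter, longer = (ta, tb) if len(ta) <= len(tb) else (tb, ta)
    return len(shorter & longer) / len(shorter) if shorter else 0.0


def is_duplicate(
    tcr: float,
    embedding_sim: float,
    tcr_threshold: float = 0.6,
    embedding_threshold: float = 0.6,
) -> bool:
    """AND strategy: both signals must agree, so a shared short token like
    '2026' or 'AI' can no longer trigger a match on its own."""
    return tcr >= tcr_threshold and embedding_sim >= embedding_threshold
```

Under OR, a high TCR from common tokens was sufficient; under AND, the embedding similarity must independently confirm the match.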

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ed (jaredlockhart#896)

* Add browser extension with WebSocket server and dev tooling

Browser sidebar extension connects to Penny via WebSocket (echo-only for now).
Adds web-ext dev setup with auto-reload, exposes port 9090 from Docker,
and wires up BROWSER_ENABLED config to start the server alongside Signal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add multi-channel architecture with device routing and shared history

ChannelManager implements MessageChannel as a routing proxy — all agents,
scheduler, and commands interact with it instead of a single channel.
Messages from any device (Signal, browser) resolve to the same user
identity, giving full conversation continuity across channels.

New: Device table + DeviceStore, ChannelManager, BrowserChannel (full
MessageChannel), migration 0016, ChannelType enum, browser sidebar
device registration flow. 418 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add browser HTML formatting, image URLs, reconnect indicator, and single-user fix

BrowserChannel.prepare_outgoing converts markdown to HTML (bold, italic,
code, links, tables-to-bullets). Images use URLs via search_image_url
instead of base64 download, rendered as <img> tags prepended to messages.
Sidebar shows reconnecting spinner. Background agents use get_primary_sender
from UserInfo instead of mining MessageLog for user identity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Set up TypeScript, typed protocol, light/dark theme, and streamlined UI

Converts browser extension to TypeScript with strict mode. Shared
protocol.ts defines typed constants and discriminated unions for the
WebSocket protocol. CSS refactored to custom properties with
prefers-color-scheme for automatic light/dark support. Header removed,
status indicator is now a minimal dot at bottom-right of messages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Persist chat history in browser local storage with smart scrolling

Messages stored in browser.storage.local (capped at 200) and rehydrated
on sidebar open. New messages scroll to show the top of the message;
rehydration jumps to bottom instantly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Move WebSocket to background script, sidebar uses runtime messaging

Background script owns the server connection and persists across sidebar
open/close. Sidebar communicates via browser.runtime messaging with typed
RuntimeMessage protocol. Connection state synced on sidebar open via port.
Smart scroll: short messages anchor at bottom, long messages show top first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add browse_url tool with hidden tab, content extraction, and domain permissions

First browser tool: browse_url opens a hidden tab with full web engine and
user session, injects a content script to extract visible text, then the
server summarizes it in a sandboxed model call before the agent sees it.

Domain permission flow: unknown domains prompt the user via sidebar dialog,
decisions stored for future calls. Tool available dynamically to chat and
thinking agents when a browser is connected.

Protocol: tool_request/tool_response RPC over WebSocket with correlation IDs.
BrowserChannel resolves asyncio Futures when responses arrive.
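The correlation-ID pattern can be sketched as below. Protocol field names and method signatures are assumptions; the send method loops the request back with a canned result so the sketch is self-contained, where the real code writes to the WebSocket:

```python
import asyncio
import uuid


class BrowserChannel:
    """Sketch of the tool_request/tool_response RPC over WebSocket."""

    def __init__(self) -> None:
        self.pending: dict[str, asyncio.Future] = {}

    async def call_tool(self, name: str, args: dict) -> dict:
        correlation_id = str(uuid.uuid4())
        future: asyncio.Future = asyncio.get_running_loop().create_future()
        self.pending[correlation_id] = future
        await self.send({"type": "tool_request", "id": correlation_id,
                         "tool": name, "args": args})
        return await future  # resolved when the matching tool_response arrives

    def on_message(self, message: dict) -> None:
        """Called for each inbound WebSocket message."""
        if message.get("type") == "tool_response":
            future = self.pending.pop(message["id"], None)
            if future is not None and not future.done():
                future.set_result(message.get("result", {}))

    async def send(self, message: dict) -> None:
        # Stub: echo a tool_response immediately instead of writing to a socket.
        self.on_message({"type": "tool_response", "id": message["id"],
                         "result": {"text": "page content"}})
```

The correlation ID is what lets multiple in-flight tool calls share one socket: each response resolves exactly the Future that issued it.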

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix single-user identity resolution for commands, reactions, and startup

Commands, reactions, and command logs now resolve device identifiers to the
primary user sender via _resolve_user_sender. Startup announcement uses
get_primary_sender and skips when no message history exists. Tests added
for user sender resolution and startup skip behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix /draw in browser by handling raw base64 and data URI attachments

_prepend_images now supports three attachment formats: HTTP URLs, data URIs,
and raw base64 (wrapped as data:image/png). Previously only HTTP URLs were
rendered, so /draw output was silently dropped in the browser sidebar.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add active tab context injection for browser sidebar messages

Background script extracts visible text from the active tab on tab switch
and page load, holds it in a buffer, and attaches it to chat messages.
Server injects it into the chat agent's system prompt as a
"Current Browser Page" context section. Truncated to 5,000 chars.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix scroll positioning by re-scrolling after image load

scrollIntoView fires before images render, so offsetHeight is wrong
for messages with images. Now re-scrolls on each img load event to
account for the final dimensions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Replace content extraction with Defuddle, inject page context as synthetic tool call

Content script now uses Defuddle for smart page extraction (strips nav,
sidebars, boilerplate) with CSS heuristic and TreeWalker fallbacks.
Bundled via esbuild since content scripts can't use imports.

Page context injected as a synthetic browse_url tool call + result in the
message history instead of system prompt. The model sees a pre-completed
tool exchange and answers from it directly. System prompt carries a minimal
hint (title + URL) to disambiguate "this page" references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add page context toggle, og:image extraction, and flush image styling

Sidebar shows current page title with checkbox to include page content.
Content script extracts og:image metadata. Responses to page-context
messages show the page image and "In response to" link inside the message
bubble. All images in Penny messages now render flush to bubble edges
with matching border-radius. Input disabled while waiting for response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update browser extension architecture doc with implementation status

Reflects all completed work: multi-channel architecture, device table,
browse_url tool, active tab context, Defuddle extraction, permission
flow, TypeScript protocol, page context toggle, and additional features
not in the original plan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add thoughts feed page with new/archive tabs, image URLs, and modal viewer

Feed page renders thoughts as a card grid with images, titles, seed topic
bylines, and HTML-formatted content (via server-side prepare_outgoing).
New/Archive tabs split by notified_at. Clickable cards open a modal with
full content. Sidebar nav bar links to feed page.

image_url stored on Thought model at creation time. Startup backfill
populates existing thoughts in parallel batches. Migration 0017 adds
image_url column. Seed topic resolved from preference FK for bylines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add thought reactions, unnotified count, Font Awesome icons, and periodic polling

Thumbs up/down on feed cards and modal overlay — logs reaction as incoming
message with parent_id to synthetic outgoing (same pipeline as Signal
reactions for preference extraction), marks thought notified, fades card.

Font Awesome installed locally (no CDN). Sidebar nav shows unnotified
thought count. Background polls thoughts every 5 minutes for fresh count.
Reaction buttons float on card corners with hover color effects.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add Penny logo with transparent background, extension icons, and Signal avatar

penny.png made transparent and resized to 48px/96px for extension icons.
Added to README header. Signal profile picture set via signal-cli-rest-api
PUT /v1/profiles endpoint. New `make signal-avatar` target for setting it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add Penny logo, SVG icons, thought reactions, feed polish, and image backfill

Logo: penny.svg traced from PNG via potrace, auto-cropped, rendered to
16/32/48/96px PNGs from SVG for crisp icons at all sizes. Added to
README, sidebar nav, feed page header, and extension manifest.

Feed: thumbs up/down reactions log to preference extraction pipeline,
Font Awesome icons (local), periodic thought polling, unnotified count
in sidebar nav, seed topic bylines, modal viewer with reactions,
server-side markdown-to-HTML for thought content.

Infrastructure: thought.image_url stored at creation time, startup
backfill for existing thoughts, migration 0017, make signal-avatar
target. 5-minute thought poll interval.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update architecture doc with feed page, reactions, logo, and new features

Documents feed page implementation (card grid, new/archive tabs, modal,
reactions pipeline, image URLs at creation time), logo/SVG workflow,
Font Awesome, thought count polling, and updated directory structure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update CLAUDE.md with browser extension, multi-channel, and new commands

Documents browser extension directory structure, dev workflow, config
vars (BROWSER_ENABLED/HOST/PORT), make signal-avatar target, single-user
model, and design doc references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update README with browser extension, multi-channel, and feed page

Adds Browser Extension section documenting sidebar chat, active tab
context, browse_url tool, thoughts feed, and multi-device support.
Updates overview to mention browser channel and shared history.
Adds Firefox badge, browser config vars, and make signal-avatar.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use PageContext Pydantic model instead of raw dicts throughout

PageContext defined in channels/base.py (alongside IncomingMessage),
imported by browser/models.py. All page context references use typed
model attributes instead of dict.get() calls. Renamed abbreviated
variable names (ctx → context).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Move inline imports to top level and batch seed topic query

All inline imports of penny modules moved to top-level imports.
Inline imports only remain for optional external packages (github_api)
inside try/except guards. Seed topic resolution uses batch get_by_ids
query instead of N individual queries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Sanitize all web content at the BrowserChannel boundary

All page content from the browser is sanitized through a sandboxed model
call in BrowserChannel before reaching any downstream consumer. Both
browse_url tool responses and active tab context go through the same
_sanitize_page_content method — comprehensive rewrite preserving URLs,
structure, and details. BrowseUrlTool no longer does its own
summarization; it receives pre-sanitized content from the channel.

Single enforcement point: consumers can't accidentally bypass
sanitization because it happens at the channel boundary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Move sanitize prompt and constants to proper files, add favicon, fix title color

PAGE_SANITIZE_PROMPT moved to Prompt class. TOOL_REQUEST_TIMEOUT and
MAX_PAGE_CONTENT_CHARS moved to PennyConstants. Feed page gets favicon
and black title instead of purple.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Start typing indicator before page content sanitization

Typing indicator now fires before the sandboxed summarization step so
the user sees immediate feedback while page content is being processed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Increase tool timeouts for browse_url + sanitization chain

Browser tool request timeout bumped from 30s to 60s. Overall tool
timeout bumped from 60s to 120s to accommodate the full chain:
browser round-trip + page load + content extraction + sanitization
model call. IMDB pages were timing out at 60s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add tests for page content sanitization and BrowseUrlTool passthrough

Tests cover: sandboxed sanitization happy path, fallback when no model
client, fallback on model failure, content truncation at max chars,
BrowseUrlTool returning pre-sanitized content directly, and empty
content handling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Show newest thoughts first on the feed page

Added get_newest() method to ThoughtStore that returns newest-first
ordering. Feed page handler uses it instead of reversing get_recent().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Only recheck page context toggle when URL actually changes

Prevents background tab update events from resetting the toggle when
the user unchecked it on the same page.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add TODO section to architecture doc for deferred work

Browse_url page headers, sender column cleanup, domain allowlist UI,
and tool rate limiting noted for future PRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lockhart#899)

* Add Likes/Dislikes tabs to browser extension sidebar

Adds two new tabs to the sidebar for managing preferences directly from
the browser. Each tab lists preferences with mention counts and an × to
delete, plus an input at the bottom to add new ones. The connection
status indicator is now in the nav bar so it's visible on all tabs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove sandboxed model summarization step for web page content

The sandboxed model call (40s on 20B) wasn't providing meaningful
security — domain allowlist and no-code-execution already close the
real attack surface. Small models (gemma3:1b, qwen2.5:1.5b) hallucinate
facts making them worse than passing through Defuddle-extracted content
directly. Defuddle already strips nav/boilerplate at the source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…es (jaredlockhart#900)

* Store thought valence from reactions and filter thinking by preferences

Thumb reactions on thoughts now store valence (1/-1) directly on the
thought row instead of extracting a mention=1 preference. This cleans up
the preference table (which previously had noisy thought-title entries)
and provides a foundation for future thought-based scoring.

The thinking agent now gates new thought storage behind a mention-weighted
preference filter: if qualifying positive preferences exist (mention>1),
a thought must score >= 0 against them before being stored. Inactive
when no signal exists yet.

Notification scoring is simplified to pure novelty (no sentiment) since
the thought loop filter already gates on preference alignment.

Key changes:
- migration 0018: add thought.valence column
- ThoughtStore: set_valence() and get_valenced()
- similarity: replace compute_sentiment_score with compute_mention_weighted_sentiment
- BrowserChannel: store valence on thought, remove synthetic message creation
- HistoryAgent: route thought reactions to set_valence, mark processed immediately
- ThinkingAgent: _passes_preference_filter gates new thought storage
- NotifyAgent: pure novelty scoring (_select_most_novel @staticmethod)
- config_params: remove NOVELTY_WEIGHT and SENTIMENT_WEIGHT

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Backfill thought valence from existing reactions in migration

The migration now walks messagelog to find emoji reactions that point
to notification messages (thought_id IS NOT NULL) and sets the
corresponding thought.valence = 1 or -1. Only fills NULL valence
to avoid overwriting a later reaction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove reaction-based preference extraction from history agent

Preference extraction now runs only on text messages. Reactions are
processed solely for thought valence (set_valence on thought reactions)
and then marked as processed — no LLM call, no preference created.

Removes: ExtractedTopic, ExtractedTopics models, _extract_reaction_preferences,
_build_reaction_items, _extract_reaction_topics, _store_reaction_preferences,
_classify_reaction_emoji, and REACTION_TOPIC_EXTRACTION_PROMPT.

Replaces with: _process_reactions (thought valence only) + _emoji_to_int_valence.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Wire PREFERENCE_MENTION_THRESHOLD into sentiment scoring

compute_mention_weighted_sentiment now takes an explicit min_mentions
parameter (no default) so the threshold is always sourced from config.
_passes_preference_filter reads PREFERENCE_MENTION_THRESHOLD and passes
it through, keeping seed-topic eligibility and sentiment filtering in sync.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix has_signal gate to check any qualifying preference, add negative-only test

The preference filter gate was only checking positive preferences, meaning
thoughts would slip through unfiltered if a user had only negative prefs
qualifying for the mention threshold. Now checks any qualifying preference
(positive or negative), and adds a test confirming the filter activates with
negative-only qualifying prefs (score = 0 - 1 = -1 → filtered).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
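
The gate logic after this fix can be sketched as follows — a minimal illustration of the has_signal check and the score threshold, assuming simple dict-shaped preferences rather than the repo's actual models:

```python
def mention_weighted_score(matched, min_mentions):
    """Sum +1/-1 valences over matched preferences that meet the
    mention threshold (illustrative, not the repo's exact helper)."""
    return sum(p["valence"] for p in matched if p["mentions"] >= min_mentions)

def passes_preference_filter(matched, all_prefs, min_mentions):
    # Gate activates when ANY preference (positive or negative) qualifies.
    has_signal = any(p["mentions"] >= min_mentions for p in all_prefs)
    if not has_signal:
        return True  # no signal yet: store thoughts unconditionally
    # Active gate: the thought must score >= 0 against qualifying prefs.
    return mention_weighted_score(matched, min_mentions) >= 0
```

With only a negative qualifying preference, the score is -1 and the thought is filtered, matching the negative-only test described above.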

* Fix stale docstring in _passes_preference_filter

Gate activates on any qualifying preference (positive or negative),
not just positive ones — updated after the has_signal fix in cae119f.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…hart#901)

- Browser sends a heartbeat to the server on every URL navigation,
  resetting the idle timer so proactive notifications are suppressed
  while the user is actively browsing.

- PeriodicSchedule gains requires_idle flag (default True). History
  and thinking agents set requires_idle=False so they run on their own
  wall-clock timers independent of user activity. Only NotifyAgent
  remains idle-gated.

- BackgroundScheduler.notify_activity() resets _last_message_time
  without touching schedule intervals, used by the heartbeat handler.

- Test fixtures suppress independent schedules via long intervals in
  DEFAULT_TEST_RUNTIME_OVERRIDES (previously the idle gate did this
  implicitly).

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The backfill fired search_image_url for every thought with a NULL
image_url on startup. On first deploy after migration 0017 it ran
565 concurrent Serper calls, exhausting the API quota and breaking
Signal notification images for the rest of the day.

All existing thoughts now have image_url populated (a real URL or an
explicit empty string), so the backfill had become a no-op. New thoughts
get image_url assigned at creation time via ThinkingAgent.


Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…aredlockhart#905)

* Add browser extension settings panel with icons, domains, and config

- Restructure nav into two-tier header: logo/title/thoughts-link/gear in
  top bar; Chat tab below. Thoughts is now a link button, not a tab.
- Add FontAwesome icons throughout sidebar and feed interaction points
- Add settings panel (gear icon) that takes over the sidebar:
  - Likes/Dislikes tabs (moved from main nav)
  - Domains tab: list, toggle allow/deny, delete, and add new entries
    from browser.storage.local — pure frontend, no backend needed
  - Config tab: all runtime ConfigParams rendered from live Python
    registry (key, description, type, current value, default); edits
    write to runtime_config DB via new config_request/config_update
    WebSocket messages; green toast confirms save
- Animated typing indicator (staggered dots) and two-tier nav CSS
- Fix feed card image corners clipping reaction buttons (border-radius
  on image directly instead of overflow:hidden on card)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Fix ruff import ordering in _handle_config_update

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Add tests for config_request and config_update browser channel handlers

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…lockhart#906)

Proactive messages (thoughts, news, check-ins) have no parent_id and
were being merged into large assistant blobs in the conversation context
window — up to 20K chars from a day's worth of notifications. They
don't belong there: they're already represented via the thought section
in the system prompt, and history rollups cover what was discussed.

Only user messages and direct replies (parent_id set) are now included
in get_messages_since. Conversation turns stay properly ordered since
threaded replies are always logged after the messages they reply to.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
penny-team bot and others added 29 commits April 15, 2026 01:18
* Fix spinners getting stuck on prompt log runs

The old markRunActive used a timestamp comparison that could miss
clearing the class when prompts arrived in bursts. Replace with a
simple debounced timer — each new prompt clears the previous timeout
and starts a fresh 10s countdown. Clears all timers on re-render.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
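
The actual fix lives in the extension's TypeScript, but the debounce pattern translates directly; a Python sketch of the same idea, with illustrative names:

```python
import threading

class DebouncedSpinner:
    """Each new prompt cancels the previous timer and starts a fresh
    countdown; the spinner clears only after `timeout` quiet seconds.
    A sketch of the frontend logic described above, not the real code.
    """
    def __init__(self, timeout, on_clear):
        self.timeout = timeout
        self.on_clear = on_clear
        self._timer = None

    def prompt_arrived(self):
        if self._timer is not None:
            self._timer.cancel()  # burst of prompts: reset the countdown
        self._timer = threading.Timer(self.timeout, self.on_clear)
        self._timer.start()
```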

* Simplify to single active run tracking

Only one agent runs at a time, so track a single activeRunId instead
of a map. When a new run starts, the previous one is immediately
deactivated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…riming (jaredlockhart#950)

The thinking agent was injecting full thought bodies (often 1000+ chars each)
into its system prompt as "Recent Background Thinking." This primed the model
to re-search the same topics, producing duplicates that got discarded.

Two changes:
- Include only thought titles (bullet list) instead of full bodies
- Replace vague "check your recent thoughts" with explicit instructions to
  find a DIFFERENT angle and avoid anything closely related to listed topics

Dry-run tested against 2 seeds x 3 runs each — model now searches for new
content instead of re-exploring the same topics.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t#951)

* Broadcast run outcome to browser so badge appears without refresh

When a thinking run completes, set_run_outcome updates the DB but never
notified the browser extension. The outcome badge (Stored/Discard) only
appeared after a full page refresh.

Added a run_outcome_update WebSocket message that fires from the
set_run_outcome callback, relayed through background → prompts page,
where it inserts the badge into the existing run row in real time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix run outcome badge vertical padding — equal spacing above and below

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Polish prompt log UI: badge layout, spinner timeout, agent labels

- Move outcome badge to its own row with reduced header padding via :has()
- Increase active run spinner timeout from 10s to 30s
- Add title-case agent labels (Thinking, Chat, History, Notify, Startup)
  in both run rows and filter dropdown

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ests (jaredlockhart#952)

Browse tool retried with fixed 1s delays (2 retries = 2s), but browser
reconnection takes ~3s, so 26% of tool calls failed. Changed to
exponential backoff (1s, 2s, 4s, 8s) with 4 retries for 15s coverage.

Tests were hitting real retry sleeps because the browse provider wasn't
mocked — the running_penny fixture now injects a mock browse provider
on all agents and the /test command factory so no test depends on a
real browser connection.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
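
The backoff schedule above (1s, 2s, 4s, 8s = 15s of coverage) can be sketched as a small retry helper; the retried exception type and function shape are assumptions, not the repo's API:

```python
import time

def retry_with_backoff(call, retries=4, base_delay=1.0):
    """Retry `call` with exponential backoff: delays of 1s, 2s, 4s, 8s
    give 15s total coverage, enough to span a ~3s browser reconnection.
    """
    for attempt in range(retries + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == retries:
                raise  # out of retries: surface the failure
            time.sleep(base_delay * (2 ** attempt))
```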
…y empty (jaredlockhart#953)

* Filter search snippets from summary input, require image, retry on empty

Three changes to reduce wasted thinking runs:

1. Filter page reads: summary step now only gets page-read content
   (## browse: sections), not search snippets (## search: sections).
   Search results are titles+links only — summarizing from them produces
   shallow output. Aborts if no page reads were captured.

2. Require image: aborts if the thinking loop didn't capture any images
   from browsed pages — thoughts without images have no feed visual.

3. Retry empty summaries: when the model returns empty content (thinking
   tokens but no output — 18% of runs), retry within the existing URL
   validation loop instead of immediately abandoning the run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Move browse section format constants to PennyConstants

BROWSE_KIND_SEARCH, BROWSE_KIND_PAGE, and SECTION_SEPARATOR were
duplicated as string literals across browse.py, thinking.py, base.py,
and read_emails.py. Consolidated into PennyConstants and referenced
everywhere.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Search last 5 lines for Topic: and retry if missing

_parse_title was checking end-of-string only, but the model sometimes
puts sources or emoji after the Topic: line. Changed to search the last
5 lines with re.MULTILINE. Also retries the summary when Topic: is
missing entirely (alongside empty and hallucinated URL retries).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The sidebar previously had inline settings panels and links to separate
feed and prompts pages. Now the sidebar is minimal (logo + chat) and
clicking the Penny logo opens a single full page with tabs for:
Thoughts, Prompts, Schedules, Likes, Dislikes, Domains, Config.

Changes:
- New browser/page/ directory with consolidated HTML, CSS, and TS
- Sidebar stripped to registration + chat only (~1300 lines removed)
- All panels use consistent card-row styling matching the prompts page
- Preference response now includes source field (manual/extracted)
- Config page reads tool-use state from storage on activation
- Input bars moved to top of each panel
- Metadata spread across full width: preference source + mention count,
  schedule cron expression inline, config default values shown

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…racking (jaredlockhart#955)

1. Filter strictly by BROWSE_PAGE_HEADER (positive match) instead of
   excluding BROWSE_SEARCH_HEADER. Only actual page reads reach summary.

2. Include the original seed prompt in the summary system prompt as
   "Original research goal" so the model knows what to extract.

3. Every exit path in after_run now records a run_outcome — no silent
   abandonment. Refactored the report/too-short/empty branching to
   ensure exactly one outcome per run:
   - no search results, no page reads, no image → early abort
   - no thought generated → summary returned empty after all retries
   - too short → report below MIN_THOUGHT_WORDS
   - duplicate, matches dislike → dedup/filter checks
   - Stored → success

4. Added tests for search-only abort and empty summary outcome.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…edlockhart#956)

UI improvements:
- Config items: two-row layout with full-width inputs, defaults on header row
- Prompt rows: show response snippet (text or tool call args) instead of
  prompt_type, wrench/chat icon before snippet
- Run summary wrapper for hover highlight (excludes expanded prompts)
- Compressed meta column widths, date right-aligned next to stats
- Spinner dismissed on run outcome (run complete)
- Sidebar tool-use icon right margin
- Spinner timeout increased to 60s

Backend changes:
- Thinking prompt_type now shows seed topic title-cased instead of "seeded"
- Notify prompt_type shows thought title instead of "thought"
- Malformed tool call arguments (2% of calls) now extracted via regex
  fallback instead of wrapped in {"raw": ...} which caused silent failures

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
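
The malformed-arguments fallback can be sketched as parse-then-scrape; the regex shape is an assumption about what the fallback extracts, not the repo's actual pattern:

```python
import json
import re

def parse_tool_args(raw: str) -> dict:
    """Parse tool-call arguments, falling back to a regex scrape.

    On malformed JSON, pull out "key": "value" pairs instead of
    wrapping the blob in {"raw": ...}, which hid failures downstream.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pairs = re.findall(r'"(\w+)"\s*:\s*"([^"]*)"', raw)
        return dict(pairs)
```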
…t#957)

* Send thought notifications directly without LLM rewrite

Thoughts are already user-facing quality from the thinking agent's
summary step. The notify agent was running them through an agentic
loop that just rewrote the same content, burning tokens for no gain.

Now _send_best_candidate builds a NotifyCandidate directly from the
thought's content and image, bypassing the agentic loop entirely.
Checkin mode still uses the agentic loop (it generates new content).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Polish extension page: sub-tabs, agent icons, toasts, bigger logo

- Prompts: replace agent filter dropdown with underlined sub-tabs
  (All, Thinking, Chat, Notify, History) with FontAwesome icons
- Thoughts: sub-tabs (New, Archive) use same underlined style
- Agent labels on run rows now include matching icons
- Toast confirmations on add for likes, dislikes, schedules, domains
- Penny icon bumped to 48px on full page

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lockhart#958) (jaredlockhart#958)

Replace the removed static conversation history dump (7-8k tokens of
topic bullets) with on-demand semantic retrieval. Incoming user messages
are embedded and the top 10 most similar past messages are injected into
the ChatAgent system prompt as dated quotes, sorted chronologically.

This restores conversational continuity (Penny knows what you're
referring to) without bloating the context or confusing the model with
irrelevant history. Dry-run testing showed zero false positives at 0.5
similarity threshold across 302 real messages.

Also strengthens the chat prompt to clarify that Related Past Messages
are conversation context only, not a source of facts — preventing the
model from hallucinating details about previously discussed topics.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
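
The retrieval shape described above — embed, threshold at 0.5, keep the top 10, re-sort chronologically — can be sketched as follows; the tuple layout and names are illustrative:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def related_past_messages(query_embedding, past, threshold=0.5, k=10):
    """past: (timestamp, embedding, text) tuples. Keep the top-k messages
    above the similarity threshold, then return them chronologically for
    injection as dated quotes (names here are assumptions)."""
    scored = [(t, cosine(query_embedding, e), txt) for t, e, txt in past]
    kept = sorted(
        (s for s in scored if s[1] >= threshold),
        key=lambda s: s[1],
        reverse=True,
    )[:k]
    return sorted(kept, key=lambda s: s[0])  # chronological order
```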
…hart#959)

* Add knowledge extraction and retrieval from browse results

Extract and summarize web pages browsed by Penny into a knowledge table,
then inject the most relevant entries into chat context using
exponentially-decayed weighted conversation scoring (decay=0.5).

The HistoryAgent incrementally scans prompt logs for browse tool results,
summarizes each page into a dense prose paragraph (8-12 sentences) via
LLM, embeds the summary, and upserts by URL. Revisited URLs get their
summaries aggregated with new content.

Chat context retrieval embeds the full conversation history (not just the
last message) and computes weighted similarity scores against knowledge
entries, where recent messages contribute more than older ones. This
handles both "vague message needs prior context" and "topic pivot"
scenarios. Related messages also now use weighted conversation scoring
instead of single-message embedding.

Validated via prototyping: knowledge context fixed a real production bug
where the model confused a storm glass with an eggnog pedal due to lack
of factual grounding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
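
The decay-weighted scoring idea can be sketched in a few lines — newest message weighted 1.0, each step back multiplied by the decay (0.5 per the commit). A sketch of the weighting, not the repo's exact formula:

```python
def weighted_conversation_score(message_sims, decay=0.5):
    """Combine per-message similarities into one score.

    message_sims is ordered oldest-to-newest; the newest message gets
    weight 1.0 and each older message decays by `decay`, so recent
    turns dominate while older context still contributes.
    """
    if not message_sims:
        return 0.0
    weights = [decay ** i for i in range(len(message_sims))][::-1]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, message_sims)) / total
```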

* Address code review: constants, imports, timestamps, exceptions

- Extract "Title: " magic string to PennyConstants.BROWSE_TITLE_PREFIX
- Move sqlmodel/RuntimeConfig imports to top of history.py
- Switch knowledge watermark from prompt ID to timestamp (datetime
  columns for ordering, IDs for joins only)
- Switch get_prompts_after to use timestamp-based filtering and ordering
- Narrow except Exception to SQLAlchemyError in KnowledgeStore and
  LlmError in _embed_conversation
- Replace hasattr cache pattern with typed class attribute (None default)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add PR review guide and gitignore QUALITY-REVIEW.md

Add docs/pr-review-guide.md as the canonical checklist for code reviews.
Exclude the disposable QUALITY-REVIEW.md working copy from git.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove conversation history rollup system

Daily/weekly topic summarization is replaced by knowledge extraction
(factual page summaries) and embedding-based related message retrieval
(raw messages scored by weighted conversation similarity).

- Delete HistoryStore and ConversationHistory model
- Drop conversationhistory table (migration 0024)
- Remove HistoryDuration enum, HISTORY_MAX_STEPS, MAX_WEEKLY_ROLLUPS
- Remove _history_section and all formatting helpers from Agent base
- Remove SUMMARIZE_TO_BULLETS prompt, dead config params
- Simplify HistoryAgent to: knowledge extraction + preference extraction
- Refactor _build_conversation to count-based (last N messages, no time
  boundary) and _build_related_messages to exclude by conversation IDs
- Delete _conversation_start and _midnight_today (rollup artifacts)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Clean up stale docs and dead code from rollup removal

- Remove _history_section from CLAUDE.md building blocks list
- Remove HistoryDuration from constants.py description
- Update HistoryAgent description (no longer summarizes)
- Remove 4 dead MessageStore methods: get_messages_in_range,
  get_reactions_in_range, get_latest_message_time_in_range,
  get_first_message_time (only used by removed rollup code)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove stateful cache and RuntimeConfig watermark, squash migrations

- Pass conversation embeddings as parameter instead of caching on
  instance (_cached_conversation_embeddings removed)
- Replace RuntimeConfig watermark with knowledge table FK join
  (get_latest_prompt_timestamp derives watermark from domain data)
- Squash migrations 0023+0024 into single 0023 (one migration per PR)
- Add review checklist items: single migration per PR, no RuntimeConfig
  for application state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Share knowledge summarization rules between new and update prompts

Extract shared rules (_KNOWLEDGE_RULES) and format into both
KNOWLEDGE_SUMMARIZE and KNOWLEDGE_AGGREGATE, so the update prompt
gets the same detailed include/exclude guidance as the new prompt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Filter browse prompts at query level, tune extraction rate

- Rename get_prompts_after to get_prompts_with_browse_after: adds LIKE
  filter on messages column for browse header, so only prompts with
  browse results are returned. Eliminates the stuck-watermark problem
  where batches of non-browse prompts would re-scan indefinitely.
- Increase KNOWLEDGE_EXTRACTION_BATCH_LIMIT from 3 to 20 (each browse
  result takes ~23s, 20 per cycle is ~8min of model time)
- Decrease HISTORY_INTERVAL from 3600s to 900s (15min cycles)
- Add test: prompts without browse results are skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
)

Require Title: + URL: lines in browse sections before summarizing.
This rejects all error shapes found in production: browser disconnects,
timeouts, blocked domains, Cloudflare challenges, no browser connected,
failed to read, and empty extractions.

Add 9 unit tests covering every error shape from production data plus
the healthy case and empty body edge case.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
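
The structural check can be sketched as below, covering the healthy case and the empty-body edge case; the exact header spelling and line positions are assumptions:

```python
def is_healthy_browse_section(body: str) -> bool:
    """Require Title: and URL: header lines plus non-empty content.

    Rejects the production error shapes (disconnects, timeouts, blocked
    domains, Cloudflare challenges, empty extractions), which all lack
    one of the headers or carry no body text.
    """
    lines = body.strip().splitlines()
    has_title = any(line.startswith("Title: ") for line in lines[:2])
    has_url = any(line.startswith("URL: ") for line in lines[:2])
    content = [
        line for line in lines
        if not line.startswith(("Title: ", "URL: "))
    ]
    return has_title and has_url and any(line.strip() for line in content)
```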
…rt#961)

Thinking cycles were discarding 28% of runs as "no page reads" due to
two issues: (1) the model passed URLs as {"url": "..."} instead of in
the queries array, and (2) search-only loops where the model kept
re-searching without ever browsing a page.

Fixes:
- Simplify thinking system prompt with explicit URL-in-queries example
- Add browse nudge in search result headers
- Inject user message "now browse a URL" after search-only tool results
  (Python-space detection + model nudge hybrid)
- Update chat prompt with same queries-array guidance
- Add after_step conversation parameter for subclass message injection

Also:
- Remove dead browser extension pages (feed/, prompts/)
- Reorder prompt log UI: thinking appears before response
- Update CLAUDE.md directory listing

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…edlockhart#962)

The chat agent was scoring past messages with exponentially-decayed
weighted scoring against the entire conversation window — same as
knowledge retrieval. That caused derailment when retrieved past
turns matched the conversation drift more than the live question,
so the model would latch onto a stale prior topic instead of
answering what was just asked.

Knowledge retrieval still uses weighted decay (factual context
should follow topic drift). Message retrieval now scores by pure
cosine to the current user message only, minus a centrality penalty:

  adjusted = cosine_to_current - α * centrality

where centrality is the candidate's mean cosine to the rest of the
corpus. The penalty (α=0.5) suppresses generic centroid-magnet
boilerplate (greetings, generic "what are some recent X" framings)
that was leaking into every unrelated query.

Selection is adaptive: a cluster-strength gate (top5_mean/top20_mean
≥ 1.15) suppresses flat noise plateaus entirely, then
`cutoff = max(top5_mean × 0.85, 0.25)` combines a relative band with
an empirical absolute floor. Strong clusters return many messages,
weak clusters return few, no cluster returns nothing. Candidates are
deduped by content text first.

Centrality is cached per-sender in memory (lazy on first retrieval,
drifts as new messages arrive — acceptable trade-off for the MVP).
Revisit with a DB column or background refresh if precision degrades
or the corpus grows past a few thousand messages.

Tuning calibrated empirically on a held-out set of recent questions
covering several scenario classes (recurring/strong, recurring/mid,
novel/weak-context, novel/no-context, subgenre confusion). Outcomes:
mid-cluster cases gained ~30+ percentage points of precision as the
centrality penalty pushed centroid-magnet noise out of top results;
densely-discussed recurring topics return more matches; suppression
behavior preserved on cases where the cluster gate should fire.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
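
The selection pipeline above — centrality penalty, cluster-strength gate, band-plus-floor cutoff — can be sketched end to end. The constants mirror the commit; the function shape is illustrative:

```python
def select_related(candidates, alpha=0.5, gate=1.15, band=0.85, floor=0.25):
    """candidates: (cosine_to_current, centrality) pairs for past messages.

    Returns the adjusted scores that survive selection. A sketch of the
    adaptive logic described above, not the repo's implementation.
    """
    if not candidates:
        return []
    # Penalize centroid-magnet candidates that are close to everything.
    adjusted = sorted((c - alpha * cent for c, cent in candidates), reverse=True)
    top5_mean = sum(adjusted[:5]) / len(adjusted[:5])
    top20_mean = sum(adjusted[:20]) / len(adjusted[:20])
    # Cluster-strength gate: a flat plateau means no real match cluster.
    if top20_mean <= 0 or top5_mean / top20_mean < gate:
        return []
    # Relative band combined with an empirical absolute floor.
    cutoff = max(top5_mean * band, floor)
    return [score for score in adjusted if score >= cutoff]
```

A strong cluster (a few high scores over flat noise) passes the gate and returns the whole cluster; a uniform plateau fails the 1.15 ratio and returns nothing.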
…aredlockhart#963)

Browse failures used to flow back as `result.text = "Failed to extract
page content"` wrapped in valid-looking `Title:`/`URL:` headers, so the
python side accepted them as healthy browse results. The history agent
then summarized those literal error strings into refusal-shaped knowledge
entries that poisoned future aggregation calls.

Browser side: extract_text.ts now exposes an `extracted: boolean` flag
and drops the string fallback. browse_url.ts requires `extracted === true`
in pollForContent and throws on failure instead of returning a fake-success
object. The thrown error propagates through the existing
`WsOutgoingToolResponse.error` channel.

Python side: BrowseTool._read_page raises (ConnectionError or RuntimeError)
instead of returning bare strings. BrowseTool.execute formats the exception
path under a new `## browse error: ` header (PennyConstants.BROWSE_ERROR_HEADER)
that's structurally distinct from the success header, both readable to the
model and grep-able for later analysis. HistoryAgent._parse_browse_section
gains an empty-body rejection as belt-and-suspenders.

Tests cover the ConnectionError path (no browser), the RuntimeError path
(structured browser failure), permission denial, mixed healthy/error
sections, and structural rejection in the history parser.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…khart#964)

* Show in-flight progress as emoji reactions on the user's message

While the chat agent is running, react to the user's incoming Signal
message with 💭 (thinking), then morph the reaction to 🔍/📖 as browse
tool calls fire (search vs URL read), and clear it when the agent
finishes. The final response is sent via the normal send path so it
keeps text + image attachments + quote-replies.

Why reactions instead of an editable "thinking..." text bubble: Signal
mobile/desktop clients silently drop attachments added via message edit
even though the wire format technically allows them, so any in-place
edited bubble that ended up with an image would lose the image at the
receiver. Reactions sidestep editing entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Drop ugly del-param-to-silence-linter idiom in browser channel

The browser channel's _make_handle_kwargs override had `del progress` at
the top to consume the unused argument. That's dead-code dressing for a
linter, not real code — just leave the argument and document why it's
unused. Add this antipattern to the PR review guide so we catch it next
time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Centralize progress emojis in a ProgressEmoji enum

The progress emojis were scattered as raw \U... escapes across three
files: a PROGRESS_INITIAL_EMOJI class attr on SignalChannel for 💭, the
two-branch return on BrowseTool.to_progress_emoji for 🔍/📖, and a bare
literal default on Tool.to_progress_emoji for ⚙️. Move them all to a
ProgressEmoji StrEnum in penny/constants.py and reference the symbolic
names everywhere.

Broaden the constants rule in the PR review guide to catch this case
and similar ones — the original wording only flagged module-level
_PRIVATE_CONSTANT declarations and missed class-attribute siblings.
Also rule out raw literals when an enum exists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Dedup browse results by URL within knowledge extraction batch

Each step of an agentic loop re-logs prior tool result messages, so a
single browse appears in many PromptLog rows. HistoryAgent was treating
each row as a fresh page and aggregating identical content N times,
wasting Ollama calls and progressively distorting the stored summary
through repeated KNOWLEDGE_AGGREGATE drift on the same input.

Collapse browse results across the batch to one entry per URL (latest
content wins) so each page is summarized at most once per cycle.
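The latest-content-wins collapse is just a dict keyed by URL, where later rows overwrite earlier ones. A minimal sketch (the function name and row shape are illustrative, not the project's actual API):

```python
def dedup_browse_results(rows: list[tuple[str, str]]) -> dict[str, str]:
    """Collapse (url, content) rows to one entry per URL; later rows win,
    so each page is summarized at most once per cycle."""
    latest: dict[str, str] = {}
    for url, content in rows:
        latest[url] = content  # later occurrence overwrites earlier one
    return latest
```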

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Replace assert with skip; flag asserts-in-production in review guide

`assert x is not None` in production code is an anti-pattern: it gets
stripped under `python -O`, silently disabling the check, and is usually
just there to satisfy the typechecker. Use real control flow (skip,
raise, or refactor the type) instead.

- history.py: replace `assert prompt.id is not None` with `if prompt.id
  is None: continue` in the new dedup helper
- pr-review-guide.md: add a checklist item under Error Handling so
  future reviews flag this pattern
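The swap can be sketched as below; `ids_of` is a hypothetical stand-in for the real dedup helper, shown only to illustrate the pattern:

```python
from types import SimpleNamespace

# Under `python -O`, assert statements are compiled out, so
# `assert prompt.id is not None` silently stops guarding anything.
# Real control flow survives -O:
def ids_of(prompts):
    for prompt in prompts:
        if prompt.id is None:
            continue  # skip rows without an id instead of asserting
        yield prompt.id
```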

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jaredlockhart#966)

Knowledge retrieval was scoring candidates with exponentially-decayed
weighted similarity over the entire conversation window, with no floor.
Two failure modes were showing up in production:

- Topic-bearing questions after a topic shift were dragged toward the
  prior thread (e.g. asking about guitar pedals while the conversation
  had drifted to cloves would surface clove entries instead).
- Greetings and off-topic chatter still got their top-N picks injected,
  because retrieval had no way to say "nothing here is a real match".

Score each candidate as max(weighted_decay, cosine_to_current_message)
and apply an absolute floor (RELATED_KNOWLEDGE_SCORE_FLOOR, default
0.34, runtime-configurable). The weighted leg preserves the vague
follow-up case that motivated weighted scoring originally — asking
"is it a dud?" still surfaces storm-glass entries when the thread is
in the conversation window. The current-cosine leg lets a strong
direct match stand on its own merit even when the conversation has
drifted. The floor suppresses noise on greetings and uncovered topics.

Validated against a held-out set of 13 recent chat runs: 7 cleanup
wins (drop noise, keep all hits), 2 mixed wins (drop wrong topic,
restore right topic), 1 greeting suppression, 1 unchanged, 2 marginal
recall losses where the relevant entries score below floor and the
prior baseline was already returning all-wrong entries.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dlockhart#967)

The dedup-by-URL pass added in jaredlockhart#965 keys on the raw URL string, so
`/page` and `/page#anchor` are treated as distinct entries even though
the fragment is a client-side anchor that never affects page content.
The browse tool follows in-page anchor links from search results, so
this is common in practice — production logs show the same wiki and
PMC article being summarized 3-4 times in a single batch under
fragment variants, with separate knowledge rows written for each.

Strip the fragment and lowercase scheme + host (case-insensitive per
RFC 3986) before keying the dedup dict and storing the URL on the
knowledge row. Path, query, and userinfo are preserved as-is since
servers can be case-sensitive about them.
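The normalization described above can be sketched with the standard library; the function name is illustrative, not the project's actual helper:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Strip the fragment and lowercase scheme + host (case-insensitive per
    RFC 3986). Path, query, and userinfo are preserved as-is."""
    parts = urlsplit(url)
    # netloc may carry userinfo (user:pass@host); lowercase only the host
    userinfo, _, host = parts.netloc.rpartition("@")
    netloc = f"{userinfo}@{host.lower()}" if userinfo else host.lower()
    return urlunsplit((parts.scheme.lower(), netloc, parts.path, parts.query, ""))
```

With this as the dedup key, `/page` and `/page#anchor` collapse to one entry.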

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…redlockhart#968)

Validating only literally-empty content lets a model emit `\n\n---`
(or similar separator/punctuation/emoji-only output) and have it
delivered to the user as the final answer, silently overwriting a
substantive prior response. Generalize the EMPTY check to count
alphabetic characters, with a low threshold that catches garbage
shapes without flagging terse legit replies like "done" or "yes".
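The generalized check amounts to counting alphabetic characters against a low threshold. A sketch under stated assumptions: the threshold value and function name here are illustrative, not the project's actual constants:

```python
MIN_ALPHA_CHARS = 2  # assumed threshold; low enough that "yes" and "done" pass

def is_garbage_response(text: str) -> bool:
    """True for empty or separator/punctuation/emoji-only output like '\\n\\n---'."""
    return sum(c.isalpha() for c in text) < MIN_ALPHA_CHARS
```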

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…art#969)

After scoring + cutoff selects hits, pull user messages within ±5
minutes of each hit's timestamp to capture conversational follow-ups
that share no entity overlap with the current message but live in the
same conversation as a real hit. Single pass — neighbors are deduped
by id and content and excluded if they're already in the current
conversation window; they are not themselves expanded.
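The single-pass neighbor expansion can be sketched as below. This is an in-memory illustration only; the real implementation presumably queries the message store, and the row shape, function name, and window constant are assumptions:

```python
from datetime import datetime, timedelta

NEIGHBOR_WINDOW = timedelta(minutes=5)

def expand_neighbors(hit_timestamps, all_messages, current_window_ids):
    """Pull user messages within ±5 minutes of any hit. Dedupe by id and
    content, skip messages already in the current conversation window,
    and do not expand the neighbors themselves (single pass)."""
    seen_ids, seen_content, neighbors = set(), set(), []
    for msg_id, ts, content in all_messages:
        if msg_id in current_window_ids or msg_id in seen_ids or content in seen_content:
            continue
        if any(abs(ts - hit_ts) <= NEIGHBOR_WINDOW for hit_ts in hit_timestamps):
            seen_ids.add(msg_id)
            seen_content.add(content)
            neighbors.append((msg_id, ts, content))
    return neighbors
```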

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…jaredlockhart#970)

ScheduleExecutor calls chat_agent.run() directly, bypassing handle()
which is the only place _pending_page_context was ever set. Every
scheduled fire crashed in _build_messages with AttributeError, so the
schedule logged "Executing schedule" but never delivered a message.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…art#972)

* Fix Penny restart loop when signal-api is slow to come up

signal-cli-rest-api takes 30-60s to start cold, but Penny was racing it
on every cold boot: validate_connectivity hit a 5s timeout, raised an
unhandled ConnectionError, the process exited, docker restarted it, and
the loop repeated until signal-api was finally ready. The error never
hit penny.log because the traceback went to stderr, making it invisible
when debugging from the file logs alone.

Three fixes:

1. docker-compose: signal-api now has a curl healthcheck against
   /v1/about, and penny waits via depends_on/service_healthy. The race
   is gone for compose-managed startups. Dev tooling uses --no-deps via
   the Makefile so make fix/check don't block on signal-api.

2. validate_connectivity now retries up to 12 times with a 5s delay
   (~60s budget) and logs each failed attempt at WARNING. This handles
   the manual `docker compose up penny` case and any mid-run signal-api
   hiccup. Test path can pass max_attempts=1 to keep tests fast.

3. main() catches ConnectionError on startup and logs it via the
   configured file logger before exiting, so any future startup
   connectivity failure is debuggable from penny.log alone.

Constants live in PennyConstants (SIGNAL_VALIDATE_MAX_ATTEMPTS,
SIGNAL_VALIDATE_RETRY_DELAY, SIGNAL_VALIDATE_HTTP_TIMEOUT).
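The retry shape in fix 2 can be sketched as follows. The constant names mirror the commit text; `probe` is a hypothetical stand-in for the real HTTP check against signal-api, and the signature is an assumption:

```python
import logging
import time

SIGNAL_VALIDATE_MAX_ATTEMPTS = 12
SIGNAL_VALIDATE_RETRY_DELAY = 5  # seconds; ~60s total budget

def validate_connectivity(probe,
                          max_attempts: int = SIGNAL_VALIDATE_MAX_ATTEMPTS,
                          delay: float = SIGNAL_VALIDATE_RETRY_DELAY) -> None:
    """Retry the connectivity probe, logging each failure at WARNING;
    re-raise only after the final attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            probe()
            return
        except ConnectionError as err:
            logging.warning("signal-api not ready (attempt %d/%d): %s",
                            attempt, max_attempts, err)
            if attempt == max_attempts:
                raise
            time.sleep(delay)
```

Tests can pass `max_attempts=1` (or `delay=0`) to stay fast, matching the commit's note.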

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Document signal startup retry + healthcheck in penny/CLAUDE.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jaredlockhart#971)

* Send all unnotified thoughts to browser addon for accurate badge count

The browser addon's unnotified thought badge could underreport because the
server returned only the newest 50 thoughts and let the addon filter for
!notified — old unnotified thoughts outside that window were silently dropped.
Server now returns every unnotified thought plus a paginated slice of notified
thoughts (page size 12) with a has_more flag, and the addon tracks its current
notified limit so background polls don't reset the user's load-more position.

Also show the user message text instead of the literal "user_message" label
in the prompt log run header for chat runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Make server own thoughts page size; addon counts pages

The previous revision required the server-side page size constant to be
mirrored in the addon, which would inevitably drift. Now the addon only
tracks how many pages it wants (`notified_pages`), and the server multiplies
by `PennyConstants.BROWSER_THOUGHTS_NOTIFIED_PAGE_SIZE` to compute the actual
limit. The page size lives in exactly one place.
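The single-source-of-truth shape is small enough to sketch; the helper name and return shape here are illustrative, only the constant name comes from the commit:

```python
BROWSER_THOUGHTS_NOTIFIED_PAGE_SIZE = 12  # lives only on the server

def page_notified(thoughts: list, notified_pages: int):
    """Addon sends a page count; server multiplies by its own page size.
    Returns (slice, has_more)."""
    limit = notified_pages * BROWSER_THOUGHTS_NOTIFIED_PAGE_SIZE
    return thoughts[:limit], len(thoughts) > limit
```

Since the addon only ever says "give me N pages", the byte-level page size can change server-side without touching TypeScript.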

Also adds a review-guide rule against declaring the same constant in both
the Python backend and the TypeScript addon.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use Pydantic models for browser thoughts request/response

Thoughts request and response now use BrowserThoughtsRequest /
BrowserThoughtsResponse / ThoughtCard Pydantic models instead of raw
dicts. Also extract a normalizeSnippet helper in page.ts so the prompt
log run header and last-user-message extraction share the same
whitespace-collapse + ellipsize transformation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sults (jaredlockhart#973)

The chat agent's URL hallucination check only consulted the current run's
tool results, so URLs the model legitimately echoed from the system prompt
knowledge section or prior conversation history were flagged as
hallucinated. The validator would discard a fully-formed response, retry,
get garbage, exhaust the loop, and the user got nothing.

Thread `messages` through `_check_response` -> `_get_source_text` so the
full context (system prompt + history + tool results) is the source of
truth for URL validation.
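The core of the check reduces to "which URLs in the response never appear in the source text". A minimal sketch, assuming a simple regex extraction; the function name and regex are illustrative, not the project's actual validator:

```python
import re

URL_RE = re.compile(r"https?://\S+")

def hallucinated_urls(response: str, source_text: str) -> set[str]:
    """URLs in the response that appear nowhere in the full context
    (system prompt + history + tool results)."""
    return {u for u in URL_RE.findall(response) if u not in source_text}
```

Widening `source_text` from tool-results-only to the full message context is exactly what stops legitimately echoed URLs from being flagged.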

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jaredlockhart#974)

* Clean up post-LLM-migration cruft and apply PR review checklist

Two passes:

1. Post-migration cleanup
   - Delete dead penny/tests/mocks/ollama_patches.py (177 lines, replaced
     by llm_patches.py / MockLlmClient).
   - Remove unused PennyResponse.SEARCH_ERROR.
   - Update penny/CLAUDE.md to reflect current reality: llm/ tree (was
     ollama/), MockLlmClient/mock_llm fixtures, openai dep (was ollama),
     Python 3.14, full migration list 0001-0023, new directories
     (email/, zoho/, html_utils.py), new tools (content_cleaning,
     draft_email, list_emails, list_folders), /zoho command, Device and
     DomainPermission tables, dropped source_period_* columns.

2. PR review checklist - mechanical safety fixes
   - Replace 8 production assert statements with explicit raises or
     narrowing (assertions get stripped under python -O).
   - Hoist inline imports to module top in channels/base.py,
     channels/browser/channel.py, knowledge_store.py, message_store.py.
   - Narrow broad except Exception: catches in channels/base.py and
     startup.py (SQLAlchemyError, LlmError).
   - Replace getattr duck typing for validate_connectivity by adding a
     no-op base method on MessageChannel.
   - Import DedupStrategy/is_embedding_duplicate directly from
     similarity.dedup; delete the re-export shim from llm/similarity.py.
   - Replace + string concatenation with f-strings in agents/base.py,
     history.py, notify.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Wire email tool limits through runtime config

The email subsystem had three invented limits with no user-configurable
control: a hardcoded EMAIL_SEARCH_LIMIT=10 module constant duplicated
across jmap/client.py and zoho/client.py, a list_emails tool that
declared limit=10 in its Pydantic args and silently clamped any model
override to 50, and a parameter the model couldn't even meaningfully
use because the schema's max value was invented.

Replace all three with two new runtime ConfigParams (EMAIL_SEARCH_LIMIT
and EMAIL_LIST_LIMIT, both default 10), wired through JmapClient and
ZohoClient constructors the same way EMAIL_BODY_MAX_LENGTH already is.
The list_emails tool no longer exposes limit to the model — matching
search_emails — so the user controls list size via /config and the
model picks the folder. No silent clamping; no duplication; default
behavior unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Replace fragile asyncio.sleep timing with wait_until in tests

Tests in test_scheduler.py used "let several ticks pass" sleeps to
verify scheduler behavior. These race on slow CI and waste time on
fast machines. Replace each with wait_until polling on the actual
condition (agent execute_count, mark_complete_count, cancellation
flag). The negative assertion in test_foreground_during_idle now
verifies immediately after spinning the scheduler task, per the
"verify negatives immediately" guidance.

Tests in test_permission_manager.py used hand-rolled sleep+iterate
helpers to simulate user approve/deny on a pending future. Replace
with a shared _resolve_pending helper that wait_untils on a non-done
future before resolving — same effect, no fixed delay.

test_browser_channel.py:fake_tool_response converted similarly: poll
for a pending future to appear, then resolve.

signal_server.py mock helpers left alone — their sleeps are inside
hand-rolled polling loops (the wait_until pattern itself), not the
fragile fixed-wait pattern the rule targets.
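The `wait_until` pattern the commit keeps reaching for can be sketched as a small async polling helper; the real test helper's name and signature are assumptions:

```python
import asyncio

async def wait_until(condition, timeout: float = 2.0, interval: float = 0.01) -> None:
    """Poll `condition` until truthy instead of sleeping a fixed duration.
    Fast machines pass immediately; slow CI gets the full timeout budget."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while not condition():
        if loop.time() >= deadline:
            raise TimeoutError("condition never became true")
        await asyncio.sleep(interval)
```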

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add structural-drift tests for agent system prompts

penny/CLAUDE.md and the PR review checklist both require tests that
catch when an agent's system prompt building blocks are reordered,
added, or removed — but no such test existed for any agent.

Add tests/agents/test_system_prompts.py with one test per prompt
variant: ChatAgent, CheckinMode, ThoughtMode, ThinkingAgent. Each
constructs a deterministic baseline state (profile only — no
thoughts, knowledge, preferences, or related messages) and asserts on
the exact ordered list of (level, title) markdown headers the prompt
produces. Drift in section order, missing blocks, or extra blocks
fails the test.

Asserts on header structure rather than full prompt content so the
tests stay maintainable when section content evolves but still catch
the structural changes the rule actually exists to detect.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Decompose long methods into named steps

Four methods exceeded the 10-20 line guideline (hard max ~25) by a wide
margin: _dispatch_to_agent (74), _run_agentic_loop (64), _call_model_validated
(68), _process_tool_calls (68). Each was a kitchen-sink function mixing
setup, branching, mutation, and cleanup, making it hard to follow what
the orchestration actually does.

channels/base.py:
- Split _dispatch_to_agent into _handle_profile_required,
  _run_message_through_agent, and _deliver_agent_response. Top-level
  becomes a clean orchestrator: resolve identity → check profile →
  run+deliver under typing/progress/foreground bookkeeping.

agents/base.py:
- _run_agentic_loop now reads as a per-step decision tree, with
  _tools_for_step (final-step tool stripping), _absorb_tool_step_result
  (loop-state mutation), and _abort_if_all_tools_failed (early-exit
  check) extracted as named steps.
- _call_model_validated extracts _invoke_model (the LLM call with
  error handling, narrowed from broad Exception to LlmError) and
  _append_retry_nudge (the bad-response + nudge append).
- _process_tool_calls extracts _dedup_tool_calls, _notify_tool_start,
  and _collect_tool_results so the orchestrator just sequences the
  three phases (dedup → notify → execute → collect).

Behavior preserved — refactor only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Drop legacy searchlog table

The searchlog table hasn't been written to since the browser-based search
migration, but the table and its three indexes were still in the schema
and the SearchLog model class was still defined and exported. Add
migration 0024 to drop the table and indexes (verified clean against a
copy of the production DB via make migrate-test), and remove the model
class plus its database/__init__.py exports.

Update test_migrations.py expected table set and migration counts, and
note 0024 in penny/CLAUDE.md's migration list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Eliminate remaining duck typing and broad exception in LLM/chat path

Four small follow-ups from the audit punch list:

1. llm/client.py reasoning extraction. Replace
   ``getattr(message, "reasoning_content", None) or getattr(message,
   "reasoning", None)`` with a clean read from pydantic v2's
   ``model_extra`` dict — these fields are non-standard SDK extensions
   and that's exactly where pydantic stashes them.

2. agents/chat.py:caption_image. Drop the production
   ``assert self._vision_model_client is not None`` and replace with an
   explicit raise. The channel layer rejects image messages before they
   ever reach this method when no vision model is configured, so the
   raise documents the invariant without relying on assert (which gets
   stripped under python -O).

3. llm/similarity.py:embed_text. Narrow ``except Exception`` to
   ``except LlmError``. The function is best-effort by design (returns
   None on failure) but should still propagate non-LLM bugs instead of
   swallowing them.

4. channels/discord/channel.py. Drop ``getattr(message.author,
   "discriminator", "")`` / ``"global_name"``. Both are real
   attributes on every discord.py User/Member subclass — direct access
   is fine.

Mock LLM client (tests/mocks/llm_patches.py) updated to expose
``model_extra`` dict so it matches the real SDK shape after change 1 above.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Constants consolidation, dead-code purge, Pydantic default tightening

Three groups of audit findings, one commit since they're all "shared
values shouldn't drift" cleanup:

**Constants → constants.py**
The four ``*PromptType`` classes scattered across agents/chat.py,
agents/thinking.py, agents/notify.py, and agents/history.py are moved
to penny/constants.py as ``StrEnum`` subclasses (ChatPromptType,
ThinkingPromptType, NotifyPromptType, HistoryPromptType). Their values
land in promptlog.prompt_type and bubble through to the browser UI for
display, so they cross module boundaries via the data flow even when
not via direct import — exactly the rule's "shared value, single
source of truth" target.

ThinkingAgent.THOUGHT_CONTEXT_LIMIT was a class-attribute alias for
PennyConstants.THOUGHT_CONTEXT_LIMIT — pure duplication. Reference the
constant directly.

**Dead constants purged**
- channels/base.py:MAX_IMAGE_PROMPT_LENGTH = 300 — never read
- constants.py:MAX_PAGE_CONTENT_CHARS = 100_000 — never read
- channels/__init__.py:CHANNEL_TYPE_SIGNAL/DISCORD/BROWSER — re-export
  aliases of ChannelType.* with zero importers
- browser/src/protocol.ts:TOOL_TIMEOUT_MS = 60_000 — never imported
- browser/src/protocol.ts:MAX_EXTRACTED_CHARS = 50_000 — never imported

**Cross-boundary mirror comment fix**
``browser/src/protocol.ts`` previously claimed to "mirror"
penny/penny/channels/browser/models.py — exactly the smell the rule
forbids. Replace with a more honest header that says only wire-format
identifiers must match (because both sides need to encode/decode the
same bytes), and everything else should derive from server payloads.

**Pydantic optional defaults (item 9 of audit)**
Empty-string defaults on ``str`` fields break null-coalescing in
JS/TS — ``"" ?? fallback`` returns ``""``. Tightened these:
- channels/base.py:PageContext (title, url, text) → required, browser
  always sends them
- channels/browser/models.py:BrowserIncoming (content, sender) →
  required, browser always sends them
- channels/discord/models.py:DiscordUser.discriminator → required;
  discord.py always exposes the field
- llm/image_client.py:_GenerateResponse.response → ``str | None = None``
  (Ollama may omit it for image responses)
- llm/models.py:LlmToolCallFunction.name, LlmToolCall.id → required;
  a tool call without name/id is meaningless

LlmMessage.content stays as ``str = ""`` because empty content is a
legitimate state for tool-only assistant messages.

Item 10 of audit (``del param`` statements) verified clean — none in
the tree.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add coverage for cleanup-introduced behavior + narrow LlmClient excepts

Closes the test-coverage gaps from the cleanup PR's earlier commits and
narrows two more broad ``except Exception`` blocks I should've caught
the first time through.

**New tests for changes already in this PR:**

1. ``test_similarity.py::test_non_llm_exception_propagates`` —
   ``embed_text`` now narrows to ``LlmError``; verify a non-LLM
   exception (programmer bug) propagates instead of being swallowed
   as ``None``.

2. ``test_agentic_loop.py::TestModelErrorHandling`` — two cases for
   the agent's model-call path:
   - ``LlmConnectionError`` from the model results in
     ``AGENT_MODEL_ERROR`` (not a crash).
   - A non-LLM exception (programmer bug) propagates instead of being
     swallowed.

3. ``test_signal_vision.py::test_caption_image_raises_when_vision_client_missing`` —
   ``caption_image`` now raises explicit ``RuntimeError`` instead of
   relying on ``assert``; document the contract.

4. ``test_pydantic_models.py`` (new file) — ``ValidationError`` cases
   for every required-field tightening: ``PageContext``,
   ``BrowserIncoming`` (content + sender), ``DiscordUser.discriminator``,
   ``LlmToolCallFunction.name``, ``LlmToolCall.id``, ``LlmToolCall.function``.

5. ``test_zoho/test_client.py`` — two new tests verifying the
   constructor's ``search_limit``/``list_limit`` actually flow through
   to the Zoho API ``params["limit"]``. End-to-end coverage for the
   ``/config EMAIL_SEARCH_LIMIT`` / ``/config EMAIL_LIST_LIMIT``
   runtime override that the prior PR commit only tested at the
   constructor-call boundary.

**Bonus narrowing — found while writing test 2:**

Test 2 surfaced that ``LlmClient.chat`` and ``LlmClient.embed`` had
their own broad ``except Exception`` blocks that wrapped *any*
exception as ``LlmResponseError``, hiding the bug-vs-API-error
distinction from every caller. Narrowed both to ``except
openai.OpenAIError`` (the SDK's top-level base class). Now genuine
SDK errors still get wrapped+retried, but unrelated programmer bugs
propagate.

Two pre-existing tests were faking LLM failures with generic Python
exceptions (``RuntimeError("Ollama is down")``, ``ConnectionError``)
— updated them to use real LLM error types (``LlmConnectionError``,
``openai.OpenAIError``). Same intent, accurate exception type.

(Migration 0024 coverage skipped per discussion — ``make migrate-test``
already validates against a copy of prod.)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Hoist inline imports out of test bodies

The /quality review caught three inline ``from … import …`` statements
inside test function bodies. The "no inline imports" rule has no test
exception, and while doing the fix I also found four MORE pre-existing
inline ``from penny.llm.client import LlmClient`` blocks in
test_embeddings.py that pre-dated this PR. Cleaned all seven up:

- tests/test_embeddings.py — hoisted ``import openai``,
  ``from penny.llm.client import LlmClient``; removed five inline
  copies (the audit-flagged one plus four pre-existing)
- tests/test_similarity.py — hoisted
  ``from penny.llm.models import LlmResponseError``
- tests/channels/test_startup_announcement.py — hoisted
  ``from penny.llm.models import LlmConnectionError``

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Refresh README and CLAUDE.md docs for PR changes

Final pass to bring user-facing docs in sync with everything that
landed in this PR (and a few smaller drift items I noticed along the
way):

**README.md**
- Python badge 3.12+ → 3.14+ (Dockerfile and pyproject already use
  3.14; the badge was the last 3.12 reference).
- Add /zoho to the slash commands list.
- Runtime config count 23 → 30+ in the two places it appears
  (config_params.py now has 30 ConfigParams, including the new
  EMAIL_SEARCH_LIMIT and EMAIL_LIST_LIMIT added in this PR).
- Add make migrate-validate to the make commands list.
- Test infrastructure: "mock Ollama client" → "mock LLM client
  (MockLlmClient, patches openai.AsyncOpenAI)" to reflect the
  Ollama→OpenAI SDK migration. Drop "mock search APIs" — search is
  via the browser extension now, no mock search APIs exist.

**CLAUDE.md (root)**
- "What Is Penny": clarify that the LLM is accessed via the OpenAI
  SDK against an OpenAI-compatible endpoint (Ollama by default), not
  directly via the Ollama SDK.
- docs/ directory listing: add the four files that were missing —
  most importantly pr-review-guide.md, the canonical PR review
  checklist that the /quality skill consumes.
- New "PR Review Checklist" section pointing at docs/pr-review-guide.md
  as the source of truth for every rule the project enforces. The
  Code Style and Design Principles sections above are the quick
  reference; the guide is the full rulebook.

**penny/CLAUDE.md**
- Add /zoho to the Conditional Commands list (was already in the
  directory structure but missing from the prose).
- Runtime Configuration "Groups" line: mention email body/search/list
  limits alongside the other Global params.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* README: comprehensive refresh — fix stale memory model, env vars, commands

The first README pass was too shallow. A proper sweep against the
current code surfaced significant drift in nearly every section:

**Memory section was wrong** — described "daily summaries" and
"weekly entries" that haven't existed since migration 0023 dropped
the conversationhistory table. Replaced with the actual three-layer
model: knowledge entries (per-URL page summaries with embedding-based
retrieval), related-message retrieval (cosine similarity with
centrality penalty + ±5-minute neighbor expansion), and preferences.

**Penny's Mind diagram** had a stale "Daily & Weekly Summaries" node
in the Memory subgraph and a History → Summaries edge that no longer
fires. Replaced with a Knowledge node and corrected the edges.

**Cognitive Cycle** bullet 2 ("summarizes conversations into daily and
weekly entries") rewritten to describe what HistoryAgent actually
does — knowledge extraction from browses, two-pass preference
extraction from messages.

**Conversations section** mentioned "via Ollama" without acknowledging
the OpenAI SDK migration. Updated to clarify Penny uses the OpenAI
Python SDK against any OpenAI-compatible endpoint (Ollama by default).

**Preferences section** said "after each day's conversations" — the
HistoryAgent runs continuously, not on a daily cycle. Rewrote to
describe the actual two-pass identify-then-classify pipeline plus the
mention-count threshold gate.

**Commands list** was missing /commands, /debug, /unschedule, /test,
and reordering for clarity. Added env var requirements per command.

**Make Commands** listed `make fmt` which doesn't exist (only fix
exists, which combines format + lint --fix). Removed it; added the
real `make team-build` and `make browser-build` targets; corrected
the `make check` description to list everything it actually runs.

**.env example** was missing Zoho entirely, missing the canonical
LLM_* env names (canonical post-OpenAI-SDK-migration), and missing
the optional embedding/vision/image API URL/key overrides. Rewrote
the block to mirror .env.example with comments explaining each.

**Configuration Reference** had two completely fake env vars:
OLLAMA_MAX_RETRIES and OLLAMA_RETRY_DELAY don't exist anywhere in
the code anymore — llm_max_retries/llm_retry_delay are hardcoded
defaults on the Config dataclass. Removed.

TOOL_TIMEOUT default was documented as 60s but actual default is
120s. Fixed.

Whole Ollama: subsection rewritten as LLM: section showing the new
LLM_* canonical names with OLLAMA_* fallbacks called out for
backwards compat. Added Browser Extension subsection. Added Zoho to
API Keys.

**Models table** updated to show env var per role and to mention
that each model can target a different OpenAI-compatible endpoint via
the corresponding _API_URL/_API_KEY overrides.

**Browser Extension section** was missing six features the addon now
ships:
- Live in-flight tool status in chat ("Searching…", "Reading X…")
- Per-addon tool-use toggle
- Cross-device domain permission prompts (also answerable from Signal)
- Schedule manager UI
- Settings panel (domains + runtime config)
- Prompt log viewer (every LLM call browseable, grouped by run id)
- Signal in-flight progress as morphing emoji reactions on the user's
  message

**Setup prerequisites** updated to acknowledge that omlx and other
OpenAI-compatible endpoints work, not just Ollama.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* README: reframe from Ollama-centric to OpenAI-compatible

Penny no longer has Ollama-specific runtime dependencies (other than
the /draw image generation endpoint). The README still led with
"Ollama" in the badge, several headings, and the configuration
reference, giving the impression it's an Ollama-first project.

- Replace the "Ollama" badge with "OpenAI-compatible LLM"
- Conversations section: lead with "OpenAI Python SDK against any
  OpenAI-compatible endpoint" and list Ollama/omlx/vLLM/OpenAI as
  examples, not as the primary identity
- Models table: explicitly state that text/vision/embedding all go
  through the OpenAI SDK; call out image generation as the one
  exception (uses Ollama's /api/generate directly)
- Setup prerequisites: lead with "OpenAI-compatible LLM endpoint" and
  list backends as choices, not required software
- Configuration Reference: drop the "OLLAMA_*" fallback noise from
  every line; state upfront there are no Ollama-specific dependencies;
  document the image generation exception clearly; move legacy OLLAMA_*
  names to a one-sentence footnote
- .env example: "any OpenAI-compatible endpoint" framing, not "Ollama
  (default)"; "unauthenticated local backends" not "local Ollama"

Every remaining "Ollama" reference is now either (a) listed as one
example backend among several, (b) the explicitly documented /draw
image generation exception, (c) the backwards-compat OLLAMA_* env name
footnote, or (d) the real OLLAMA_BACKGROUND_MODEL env var that
penny-team's Quality agent still reads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Drop legacy OLLAMA_* env var fallbacks from code and docs

No userbase to preserve backwards compatibility for — just one user.
The OLLAMA_* fallback chain in config.py added complexity for no benefit.

Code:
- config.py: remove all `os.getenv("OLLAMA_*")` fallback calls. Each
  `LLM_*` env now reads directly with its own default. No more nested
  `os.getenv("LLM_X", os.getenv("OLLAMA_X", default))` chains.
- Rename `ollama_api_url` field → `image_api_url` (it's only used by
  the image generation client). Reads from `LLM_IMAGE_API_URL` env.
- penny.py: `config.ollama_api_url` → `config.image_api_url`
- chat.py, test docstrings: OLLAMA_VISION_MODEL → LLM_VISION_MODEL
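The flattened env reads can be sketched roughly as below. Only `image_api_url` and the `LLM_*` env var names come from this commit; the other field names, defaults, and the dataclass shape are illustrative.

```python
import os
from dataclasses import dataclass, field


def _env(name: str, default: str) -> str:
    # Direct read with its own default -- no more nested
    # os.getenv("LLM_X", os.getenv("OLLAMA_X", default)) fallback chains.
    return os.getenv(name, default)


@dataclass
class Config:
    # Hypothetical defaults for illustration only.
    vision_model: str = field(
        default_factory=lambda: _env("LLM_VISION_MODEL", "qwen2.5vl")
    )
    # Renamed from ollama_api_url: only the image generation client uses it.
    image_api_url: str = field(
        default_factory=lambda: _env("LLM_IMAGE_API_URL", "http://localhost:11434")
    )
```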

Docs:
- .env.example: all LLM_* names, no OLLAMA_* duplicates
- CLAUDE.md (root): Ollama section rewritten as LLM section
- penny/CLAUDE.md: /draw and vision refs updated to LLM_* names
- README.md: drop the legacy OLLAMA_* fallback footnote; image gen
  now documented as `LLM_IMAGE_API_URL` not `OLLAMA_API_URL`

The only remaining OLLAMA_ env var anywhere in the project is
`OLLAMA_BACKGROUND_MODEL` which penny-team's Quality agent still
reads — that's their code, not ours to change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


The "Global" group was a junk drawer with 10 unrelated params (email
tools, browser domain mode, chat limits, embedding backfill, context
window). "Schedule" had only IDLE_SECONDS. Params that a user tunes
together were scattered across different groups.

New grouping by what the user is actually tuning:

- **Chat** (8 params): foreground conversation + retrieval context —
  MESSAGE_MAX_STEPS, CHAT_MAX_QUERIES, MESSAGE_CONTEXT_LIMIT,
  SEARCH_URL, RELATED_MESSAGES_LIMIT, RELATED_KNOWLEDGE_LIMIT,
  RELATED_KNOWLEDGE_SCORE_FLOOR, DOMAIN_PERMISSION_MODE

- **Thinking** (7 params): inner monologue — INNER_MONOLOGUE_*,
  THOUGHT_DEDUP_*, MAX_UNNOTIFIED_THOUGHTS, FREE_THINKING_PROBABILITY

- **History** (6 params): background extraction — HISTORY_INTERVAL,
  PREFERENCE_DEDUP_*, PREFERENCE_MENTION_THRESHOLD,
  KNOWLEDGE_EXTRACTION_BATCH_LIMIT, EMBEDDING_BACKFILL_BATCH_LIMIT

- **Notify** (5 params): notification outreach + idle timing —
  IDLE_SECONDS (moved from Schedule), NOTIFY_CHECK_INTERVAL,
  NOTIFY_COOLDOWN_MIN/MAX, NOTIFY_CANDIDATES

- **Email** (4 params): email tool settings — EMAIL_BODY_MAX_LENGTH,
  EMAIL_SEARCH_LIMIT, EMAIL_LIST_LIMIT, JMAP_REQUEST_TIMEOUT

"Global" and "Schedule" groups dissolved entirely. "Inner Monologue"
renamed to "Thinking" for clarity. IDLE_SECONDS description updated
from "Global idle threshold" to "Seconds of silence before background
agents become eligible" since it now lives in the Notify group.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…age (jaredlockhart#976)

Each preference row now shows how many thoughts were seeded by it, and
rows with thoughts are expandable to show a list with title, date, image
thumbnail, and content. Thoughts are lazy-loaded via a new WebSocket
message pair (preference_thoughts_request/response).

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…es (jaredlockhart#977)

Image-only messages logged with empty content produced a conversation of just
"[HH:MM] " — truthy, so the empty-guard passed, the LLM was called, returned
no preferences, and did_work=False prevented marking the message processed.
Same loop fired for any unprocessed message that legitimately yielded zero
preferences. Observed 7 identical identification calls against one empty
message across a single minute in promptlog.

Fix splits identification failure (retry) from empty results (done):
- _format_messages skips messages with empty/whitespace content
- _extract_text_preferences returns True when the attempt completes, False
  only when identification itself fails (exception / unparseable JSON)
- _extract_preferences_from_content returns True on any completed pass,
  False only when _identify_preference_topics returns None
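The completed-vs-failed split can be sketched like this (the `identify` callable stands in for `_identify_preference_topics`; the function shapes are illustrative, not the project's actual signatures):

```python
def _format_messages(messages: list[dict]) -> str:
    # Skip empty/whitespace-only content (e.g. image-only messages), so the
    # conversation is no longer a truthy "[HH:MM] " string.
    return "\n".join(
        f"[{m['time']}] {m['content']}"
        for m in messages
        if m.get("content", "").strip()
    )


def extract_preferences(messages: list[dict], identify) -> bool:
    """Return True when the attempt completed, even with zero preferences.

    Only an identification failure (identify returning None, standing in
    for an exception or unparseable JSON) returns False, leaving the
    message unprocessed for retry.
    """
    conversation = _format_messages(messages)
    if not conversation:
        return True   # nothing to extract: mark processed, do not loop
    topics = identify(conversation)
    if topics is None:
        return False  # identification itself failed -> retry later
    return True       # completed pass, even if topics == []
```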

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>