Penny originated as an idea 20 years ago: a personal agent that browses the web for you, finds things you'd find interesting, and builds a feed — rather than relying on external feeds. The browser extension brings this vision to life by giving Penny direct access to a browser, while keeping the existing server architecture intact.
The extension does NOT replace the current Penny architecture. It extends it.
┌─────────────────┐
│ Penny Server │
│ (Mac, 64GB) │
│ │
│ Agents │
│ Scheduler │
│ SQLite Store │
│ Ollama (20B) │
└──┬──────┬───────┘
│ │
┌────────┘ └────────┐
│ │
┌────────▼──────┐ ┌──────────▼──────────┐
│ Signal Channel│ │ Browser Channel(s) │
│ │ │ │
│ Chat (phone, │ │ Chat (sidebar) │
│ laptop, etc) │ │ Browser tools │
│ │ │ Active tab context │
└───────────────┘ │ History reading │
│ Feed page │
└──────────────────────┘
- Penny server stays where it is: model, store, thinking loop, scheduler, all on the Mac
- Signal channel stays as-is: talk to Penny from any device
- Browser channel is additive: a Firefox extension that connects via WebSocket, providing chat + browser capabilities
- Multiple browsers can connect: PC (always-on), MacBook (intermittent), etc. Each is a device in the device table with a user-chosen label
- Shared history: all messages appear in the same conversation history regardless of channel — ask on Signal, follow up on browser, full continuity
Penny uses a ChannelManager that implements the MessageChannel interface as a routing proxy. All agents, scheduler, and commands interact with the manager — they don't know which channel they're talking to.
- Device table: each connection point (Signal phone, browser instance) is a device with a unique identifier and label
- Single-user model: Penny is a personal assistant for one person. All devices map to the same user via
UserInfo. Background agents useget_primary_sender()from UserInfo, not sender strings from MessageLog - Reply routing:
message.sender(device identifier) routes replies to the correct channel._resolve_user_sender()maps any device to the canonical user identity for DB lookups - Device registration: Signal device seeded at startup. Browser devices auto-register on first message. Sidebar prompts for a device label on first open
The browser gives Penny capabilities she can't have through APIs alone:
- Direct web browsing: open pages in hidden tabs with full rendering engine + user session, extract content with Defuddle
- Active tab context: automatically extract the page the user is currently viewing and inject it into the chat context so the user can ask questions about any page
- YouTube/video discovery: see thumbnails, durations, view counts, comments, transcripts
- Price tracking: revisit URLs and detect changes (price drops, back-in-stock)
- Rich media extraction: og:image, Open Graph data, JSON-LD structured data
- Browsing history as preferences: passive preference extraction from what the user actually does, not just what they say
- Social/forum monitoring: check Reddit threads, forums, Discord channels for interesting discussions
- Feed page: a browsable web page of Penny's discoveries, richer than Signal text notifications
browser/
src/
protocol.ts — Typed WebSocket + runtime messaging protocol
background/
background.ts — Owns WebSocket, tool dispatch, tab tracking, permissions
permissions.ts — Domain allowlist management + user prompts
tools/
browse_url.ts — Hidden tab + content extraction handler
sidebar/
sidebar.ts — Chat UI, page context toggle, message rendering
feed/
feed.ts — Thought card grid, new/archive tabs, reactions, modal
content/
extract_text.ts — Defuddle-based page extraction (bundled via esbuild)
sidebar/
sidebar.html — Chat UI markup
sidebar.css — Light/dark theme via CSS custom properties
feed/
feed.html — Thought feed page markup
feed.css — Card grid, modal, tabs, reaction buttons
icons/
icon-{16,32,48,96}.png — Extension icons rendered from SVG
penny.svg — Vector logo for in-page display
manifest.json — Permissions: storage, tabs, <all_urls>
tsconfig.json — Strict TypeScript, ES2020 target
build-content.mjs — esbuild wrapper for content script bundling
package.json — TypeScript, esbuild, web-ext, concurrently, defuddle, fontawesome
Owns the single WebSocket connection to the Penny server. Persists across sidebar open/close. Handles:
- Chat message relay between sidebar and server
- Tool request dispatch (receives
tool_requestfrom server, executes, sendstool_response) - Active tab tracking (extracts page content on tab switch/load)
- Domain permission checking and user prompts
- Connection state management and reconnection
Pure UI layer — no direct WebSocket connection. Communicates with background via browser.runtime messaging. Features:
- Chat with message persistence in
browser.storage.local(200 message cap) - Device registration on first open
- Page context toggle showing current page title + favicon with checkbox
- Permission dialog for domain access requests
- HTML rendering of Penny's responses (markdown → HTML conversion server-side)
- Flush image rendering with og:image page headers on contextual responses
- Connection status indicator (green dot / orange spinner)
- Smart scrolling (short messages anchor at bottom, long messages show top first, re-scroll on image load)
Extracts visible text from web pages using a tiered approach:
- Defuddle (primary) — smart content extraction that strips nav, sidebars, boilerplate
- CSS heuristics (fallback) — targets
<article>,<main>,[role="main"] - TreeWalker (last resort) — all visible text nodes, skipping nav/aside/footer Also extracts og:image metadata. Bundled with esbuild as an IIFE (content scripts can't use ES module imports).
browse_url— opens a URL in a hidden tab with full web engine + user session. Content extracted via Defuddle content script. Server summarizes in a sandboxed model call before the agent sees it. Domain-gated with user permission prompts.
The background script continuously extracts visible text from the active tab using Defuddle. When the user sends a message with the "Include page content" toggle checked, the page content is attached to the message. The server injects it as a synthetic browse_url tool call + result in the message history, so the model sees a pre-completed tool exchange and answers from it directly. A minimal hint (title + URL) in the system prompt disambiguates "this page" references.
search_web— use the browser's default search engine, return resultssearch_youtube— search YouTube, return video metadata (title, duration, views, channel, thumbnail)get_history— read recent browsing history (filtered, domain + title only)screenshot_page— capture current page for vision modelget_current_tab— what the user is looking at right now (partially implemented via active tab context)
The thinking agent uses browser tools autonomously when a browser is connected. All web access (search, page reading) goes through the browser extension. Tools are registered dynamically via set_browser_tools_provider() — available only when a browser has an active connection.
RPC over WebSocket with correlation IDs:
- Server sends
tool_requestwithrequest_id,toolname, andarguments - Background script receives, checks domain permissions, executes tool
- Background sends
tool_responsewith matchingrequest_idandresultorerror - Server resolves the asyncio Future and returns result to the agent loop
The BrowserChannel maintains _pending_requests: dict[str, asyncio.Future] for in-flight requests. Timeout after 30 seconds. Pending futures rejected on disconnect.
Skills are domain-specific playbooks — instructions for interacting with particular websites:
- YouTube skill: how to find videos, parse metadata, read transcripts, evaluate quality
- Reddit skill: how to find relevant threads, sort by quality, extract discussions
- Amazon/Reverb skill: how to find prices, reviews, availability, related products
- Forum skill: how to navigate threads, find expert opinions
Each skill is a document (not code) — instructions and DOM selectors that Penny reads when interacting with that domain. Skills are:
- Tied to the domain allowlist: granting access to youtube.com activates the YouTube skill
- User-extensible: write a skill document for any site
- Part of the trust boundary: defines what to extract and what to ignore
Security is the foundational design constraint, not a feature added later.
- Zero permissions by default: Penny has no access to any domain until the user explicitly grants it via an allowlist
- Collect, store, and expose nothing first: only incrementally add data access as needed
- If Penny has no access to sensitive data, there's nothing to exfiltrate
The primary threat: malicious web pages embedding instructions in content that Penny reads. Attack vectors include hidden text (CSS display:none, white-on-white), HTML comments, meta tags, YouTube descriptions/comments, forum posts, product descriptions.
Three-layer defense:
-
Pre-sanitization: Defuddle extracts main content only, stripping nav/sidebar/footer/hidden elements. The TreeWalker fallback skips elements with
display:none,visibility:hidden,opacity:0, andaria-hidden="true". -
Sandboxed summarization: raw page content from
browse_urlgoes to a constrained model call — system prompt is "Summarize this web page content." No tools, no user profile, no preferences, no history, no browsing context. Even if injected instructions survive extraction, there's nothing to exfiltrate and no tools to act with. -
Summary is the only bridge: Penny's real agent context (with tools, preferences, sensitive data) only sees the clean summary output from the sandboxed step. Untrusted content never shares a context window with sensitive data or tool access.
Note: Active tab context (when user includes page content with a message) bypasses the sandboxed summarization since the user explicitly chose to share it. The content still goes through Defuddle extraction which strips hidden/malicious elements.
- No domains accessible by default
- User explicitly grants or denies access per domain via sidebar permission dialog
- Decisions stored in
browser.storage.localfor future calls - Parent domain matching (allowing
example.comalso allowswww.example.com) - Three states: allowed, blocked, unknown (unknown triggers prompt)
- Management UI deferred (currently via browser dev tools)
- Browsing history sent to server: domain + page title only, no full URLs with query params
- Filter out sensitive categories (health, finance, adult) before sending
- Communication over localhost WebSocket only
- Model never sees raw untrusted web content via
browse_url— always extracted, sanitized, summarized first - Active tab context uses Defuddle extraction (user-initiated sharing)
- Penny never relays verbatim web page text through Signal — always summarized/rephrased
- Never pass raw cookies or session tokens to the server
- Browser tools gated by domain allowlist
- Model output cannot drive navigation directly without gating
- Tool invocations rate-limited to prevent uncontrolled spidering
- Active tab context only sent when user explicitly checks the toggle
- Each browser instance connects as a device with a user-chosen label (e.g., "firefox macbook 16")
- Device table tracks all connected endpoints with channel type, identifier, and label
- Persistent store lives on the Penny server — all browsers share it via the same user identity
- Always-on PC browser can do background research while laptop is closed
- Signal provides fallback communication when no browser is connected
- Thinking agent uses
browse_urlautonomously on any connected browser
The feed page realizes the shift from interruption-based notifications to a browsable discovery feed:
- Card grid: each thought rendered as a card with image (if available), title, seed topic byline, date, and truncated content
- New / Archive tabs: unnotified thoughts in New, notified in Archive (paginated, 12 per page)
- Modal viewer: click a card to see full content with image header
- Reactions: thumbs up/down buttons on cards and modal. Clicking one:
- Sets
notified_aton the thought (moves to Archive) - Logs a synthetic outgoing message with the thought content +
thought_idFK - Logs a reaction message (👍/👎) with
parent_idpointing to that outgoing message - The preference extraction pipeline picks up the reaction identically to Signal emoji reactions
- Fades the card out of the New tab
- Sets
- Unnotified count: sidebar nav shows
Thoughts (N)with the count of new thoughts, polled every 5 minutes - Content rendering: thought content processed through
prepare_outgoing()server-side (markdown → HTML), rendered directly — no client-side duplication - Image URLs: stored on the
Thoughtmodel (legacy field, no longer auto-populated) - Both models coexist: feed for passive browsing, Signal for high-priority discoveries
- Extension: TypeScript (strict), esbuild for content script bundling, web-ext for development
- Content extraction: Defuddle (by Obsidian creator, zero-dep core bundle) with CSS heuristic and TreeWalker fallbacks
- Communication: WebSocket between background script and Penny server,
browser.runtimemessaging between sidebar and background - Protocol: Typed discriminated unions for all messages (TypeScript
type + constpattern mirrors Python Pydantic models) - Server: existing Python/Docker architecture with ChannelManager routing proxy
- Model: Ollama on server (20B), not constrained by browser ML engine limitations
- Storage: server SQLite for persistent data,
browser.storage.localfor device label, chat history, domain allowlist - Theming: CSS custom properties with
prefers-color-schememedia query for automatic light/dark
Minimal Firefox extension: sidebar chat panel + WebSocket to Penny serverDoneAddDonebrowse_urltool: extension can fetch and sanitize a page on demand- Add
search_webtool: use browser's search engine - Browsing history reader: passive preference extraction
Feed page: render thoughts as a browsable pageDone- YouTube skill: first domain-specific skill
- Additional skills: Reddit, Amazon, forums, etc.
- Multi-channel architecture (ChannelManager, device table, single-user identity)
- Active tab context injection (Defuddle extraction + synthetic tool call)
- Page context toggle with og:image headers on responses
- TypeScript with typed protocol (discriminated unions, const objects)
- Background script owns WebSocket (sidebar uses runtime messaging)
- Domain permission dialog flow
- Chat history persistence (browser.storage.local, 200 message cap)
- HTML formatting pipeline (markdown → HTML with code, links, tables-to-bullets)
- Flush image rendering with smart scrolling
- Light/dark theme support
/drawcommand working in browser (base64 data URI rendering)- Reconnection indicator with spinner
web-extdev setup with auto-reload- Thoughts feed page with new/archive tabs, card grid, modal viewer
- Thought reactions (thumbs up/down) feeding preference extraction pipeline
- Image URLs on thought model (legacy field, no longer auto-populated)
- Seed topic bylines on thought cards (from preference FK)
- Unnotified thought count in sidebar nav (5-minute polling)
- Penny logo: SVG traced from PNG, rendered to crisp PNGs at all icon sizes
- Font Awesome icons (locally bundled, no CDN)
- Signal profile avatar via signal-cli-rest-api
- Page header on model-initiated browse_url: when Penny calls browse_url on her own, the sidebar response should show the browsed page's image/title/URL header (same as user-initiated page context). Requires the server to include the browsed URL metadata alongside the response
- Deferred cleanup — drop sender column: remove
senderfromMessageLog, migrate all queries todevice_idor remove sender filtering (single-user). ~30 call sites - Domain allowlist management UI: currently managed via browser dev tools, needs a proper settings page
- Rate limiting on tool invocations: prevent uncontrolled spidering from the thinking agent