Penny Browser Extension — Architecture & Design

Vision

Penny originated as an idea 20 years ago: a personal agent that browses the web for you, finds things you'd find interesting, and builds a feed — rather than relying on external feeds. The browser extension brings this vision to life by giving Penny direct access to a browser, while keeping the existing server architecture intact.

Architecture: Browser as a New Channel

The extension does NOT replace the current Penny architecture. It extends it.

                    ┌─────────────────┐
                    │   Penny Server  │
                    │   (Mac, 64GB)   │
                    │                 │
                    │  Agents         │
                    │  Scheduler      │
                    │  SQLite Store   │
                    │  Ollama (20B)   │
                    └──┬──────┬───────┘
                       │      │
              ┌────────┘      └────────┐
              │                        │
     ┌────────▼──────┐     ┌──────────▼──────────┐
     │ Signal Channel│     │ Browser Channel(s)  │
     │               │     │                      │
     │ Chat (phone,  │     │ Chat (sidebar)       │
     │  laptop, etc) │     │ Browser tools        │
     │               │     │ Active tab context   │
     └───────────────┘     │ History reading       │
                           │ Feed page            │
                           └──────────────────────┘

Penny server stays where it is: model, store, thinking loop, scheduler, all on the Mac
Signal channel stays as-is: talk to Penny from any device
Browser channel is additive: a Firefox extension that connects via WebSocket, providing chat + browser capabilities
Multiple browsers can connect: PC (always-on), MacBook (intermittent), etc. Each is a device in the device table with a user-chosen label
Shared history: all messages appear in the same conversation history regardless of channel — ask on Signal, follow up on browser, full continuity

Multi-Channel Architecture (implemented)

Penny uses a ChannelManager that implements the MessageChannel interface as a routing proxy. All agents, scheduler, and commands interact with the manager — they don't know which channel they're talking to.

Device table: each connection point (Signal phone, browser instance) is a device with a unique identifier and label
Single-user model: Penny is a personal assistant for one person. All devices map to the same user via UserInfo. Background agents use get_primary_sender() from UserInfo, not sender strings from MessageLog
Reply routing: message.sender (device identifier) routes replies to the correct channel. _resolve_user_sender() maps any device to the canonical user identity for DB lookups
Device registration: Signal device seeded at startup. Browser devices auto-register on first message. Sidebar prompts for a device label on first open

What the Extension Adds

The browser gives Penny capabilities she can't have through APIs alone:

Direct web browsing: open pages in hidden tabs with full rendering engine + user session, extract content with Defuddle
Active tab context: automatically extract the page the user is currently viewing and inject it into the chat context so the user can ask questions about any page
YouTube/video discovery: see thumbnails, durations, view counts, comments, transcripts
Price tracking: revisit URLs and detect changes (price drops, back-in-stock)
Rich media extraction: og:image, Open Graph data, JSON-LD structured data
Browsing history as preferences: passive preference extraction from what the user actually does, not just what they say
Social/forum monitoring: check Reddit threads, forums, Discord channels for interesting discussions
Feed page: a browsable web page of Penny's discoveries, richer than Signal text notifications

Extension Components (implemented)

browser/
  src/
    protocol.ts           — Typed WebSocket + runtime messaging protocol
    background/
      background.ts       — Owns WebSocket, tool dispatch, tab tracking, permissions
      permissions.ts      — Domain allowlist management + user prompts
      tools/
        browse_url.ts     — Hidden tab + content extraction handler
    sidebar/
      sidebar.ts          — Chat UI, page context toggle, message rendering
    feed/
      feed.ts             — Thought card grid, new/archive tabs, reactions, modal
    content/
      extract_text.ts     — Defuddle-based page extraction (bundled via esbuild)
  sidebar/
    sidebar.html          — Chat UI markup
    sidebar.css           — Light/dark theme via CSS custom properties
  feed/
    feed.html             — Thought feed page markup
    feed.css              — Card grid, modal, tabs, reaction buttons
  icons/
    icon-{16,32,48,96}.png — Extension icons rendered from SVG
  penny.svg               — Vector logo for in-page display
  manifest.json           — Permissions: storage, tabs, <all_urls>
  tsconfig.json           — Strict TypeScript, ES2020 target
  build-content.mjs       — esbuild wrapper for content script bundling
  package.json            — TypeScript, esbuild, web-ext, concurrently, defuddle, fontawesome

Background script

Owns the single WebSocket connection to the Penny server. Persists across sidebar open/close. Handles:

Chat message relay between sidebar and server
Tool request dispatch (receives tool_request from server, executes, sends tool_response)
Active tab tracking (extracts page content on tab switch/load)
Domain permission checking and user prompts
Connection state management and reconnection

Sidebar

Pure UI layer — no direct WebSocket connection. Communicates with background via browser.runtime messaging. Features:

Chat with message persistence in browser.storage.local (200 message cap)
Device registration on first open
Page context toggle showing current page title + favicon with checkbox
Permission dialog for domain access requests
HTML rendering of Penny's responses (markdown → HTML conversion server-side)
Flush image rendering with og:image page headers on contextual responses
Connection status indicator (green dot / orange spinner)
Smart scrolling (short messages anchor at bottom, long messages show top first, re-scroll on image load)

Content script

Extracts visible text from web pages using a tiered approach:

Defuddle (primary) — smart content extraction that strips nav, sidebars, boilerplate
CSS heuristics (fallback) — targets <article>, <main>, [role="main"]
TreeWalker (last resort) — all visible text nodes, skipping nav/aside/footer Also extracts og:image metadata. Bundled with esbuild as an IIFE (content scripts can't use ES module imports).

Browser Tools

Implemented

browse_url — opens a URL in a hidden tab with full web engine + user session. Content extracted via Defuddle content script. Server summarizes in a sandboxed model call before the agent sees it. Domain-gated with user permission prompts.

Active tab context (implemented, not a tool)

The background script continuously extracts visible text from the active tab using Defuddle. When the user sends a message with the "Include page content" toggle checked, the page content is attached to the message. The server injects it as a synthetic browse_url tool call + result in the message history, so the model sees a pre-completed tool exchange and answers from it directly. A minimal hint (title + URL) in the system prompt disambiguates "this page" references.

Planned

search_web — use the browser's default search engine, return results
search_youtube — search YouTube, return video metadata (title, duration, views, channel, thumbnail)
get_history — read recent browsing history (filtered, domain + title only)
screenshot_page — capture current page for vision model
get_current_tab — what the user is looking at right now (partially implemented via active tab context)

The thinking agent uses browser tools autonomously when a browser is connected. All web access (search, page reading) goes through the browser extension. Tools are registered dynamically via set_browser_tools_provider() — available only when a browser has an active connection.

Tool Request Protocol (implemented)

RPC over WebSocket with correlation IDs:

Server sends tool_request with request_id, tool name, and arguments
Background script receives, checks domain permissions, executes tool
Background sends tool_response with matching request_id and result or error
Server resolves the asyncio Future and returns result to the agent loop

The BrowserChannel maintains _pending_requests: dict[str, asyncio.Future] for in-flight requests. Timeout after 30 seconds. Pending futures rejected on disconnect.

Skills

Skills are domain-specific playbooks — instructions for interacting with particular websites:

YouTube skill: how to find videos, parse metadata, read transcripts, evaluate quality
Reddit skill: how to find relevant threads, sort by quality, extract discussions
Amazon/Reverb skill: how to find prices, reviews, availability, related products
Forum skill: how to navigate threads, find expert opinions

Each skill is a document (not code) — instructions and DOM selectors that Penny reads when interacting with that domain. Skills are:

Tied to the domain allowlist: granting access to youtube.com activates the YouTube skill
User-extensible: write a skill document for any site
Part of the trust boundary: defines what to extract and what to ignore

Security Model

Core Principles

Security is the foundational design constraint, not a feature added later.

Zero permissions by default: Penny has no access to any domain until the user explicitly grants it via an allowlist
Collect, store, and expose nothing first: only incrementally add data access as needed
If Penny has no access to sensitive data, there's nothing to exfiltrate

Prompt Injection Defense (implemented)

The primary threat: malicious web pages embedding instructions in content that Penny reads. Attack vectors include hidden text (CSS display:none, white-on-white), HTML comments, meta tags, YouTube descriptions/comments, forum posts, product descriptions.

Three-layer defense:

Pre-sanitization: Defuddle extracts main content only, stripping nav/sidebar/footer/hidden elements. The TreeWalker fallback skips elements with display:none, visibility:hidden, opacity:0, and aria-hidden="true".
Sandboxed summarization: raw page content from browse_url goes to a constrained model call — system prompt is "Summarize this web page content." No tools, no user profile, no preferences, no history, no browsing context. Even if injected instructions survive extraction, there's nothing to exfiltrate and no tools to act with.
Summary is the only bridge: Penny's real agent context (with tools, preferences, sensitive data) only sees the clean summary output from the sandboxed step. Untrusted content never shares a context window with sensitive data or tool access.

Note: Active tab context (when user includes page content with a message) bypasses the sandboxed summarization since the user explicitly chose to share it. The content still goes through Defuddle extraction which strips hidden/malicious elements.

Domain Allowlist (implemented)

No domains accessible by default
User explicitly grants or denies access per domain via sidebar permission dialog
Decisions stored in browser.storage.local for future calls
Parent domain matching (allowing example.com also allows www.example.com)
Three states: allowed, blocked, unknown (unknown triggers prompt)
Management UI deferred (currently via browser dev tools)

Data Policies

Browsing history sent to server: domain + page title only, no full URLs with query params
Filter out sensitive categories (health, finance, adult) before sending
Communication over localhost WebSocket only
Model never sees raw untrusted web content via browse_url — always extracted, sanitized, summarized first
Active tab context uses Defuddle extraction (user-initiated sharing)
Penny never relays verbatim web page text through Signal — always summarized/rephrased
Never pass raw cookies or session tokens to the server

Tool Constraints

Browser tools gated by domain allowlist
Model output cannot drive navigation directly without gating
Tool invocations rate-limited to prevent uncontrolled spidering
Active tab context only sent when user explicitly checks the toggle

Multi-Browser Continuity

Each browser instance connects as a device with a user-chosen label (e.g., "firefox macbook 16")
Device table tracks all connected endpoints with channel type, identifier, and label
Persistent store lives on the Penny server — all browsers share it via the same user identity
Always-on PC browser can do background research while laptop is closed
Signal provides fallback communication when no browser is connected
Thinking agent uses browse_url autonomously on any connected browser

Feed Page (implemented)

The feed page realizes the shift from interruption-based notifications to a browsable discovery feed:

Card grid: each thought rendered as a card with image (if available), title, seed topic byline, date, and truncated content
New / Archive tabs: unnotified thoughts in New, notified in Archive (paginated, 12 per page)
Modal viewer: click a card to see full content with image header
Reactions: thumbs up/down buttons on cards and modal. Clicking one:
- Sets notified_at on the thought (moves to Archive)
- Logs a synthetic outgoing message with the thought content + thought_id FK
- Logs a reaction message (👍/👎) with parent_id pointing to that outgoing message
- The preference extraction pipeline picks up the reaction identically to Signal emoji reactions
- Fades the card out of the New tab
Unnotified count: sidebar nav shows Thoughts (N) with the count of new thoughts, polled every 5 minutes
Content rendering: thought content processed through prepare_outgoing() server-side (markdown → HTML), rendered directly — no client-side duplication
Image URLs: stored on the Thought model (legacy field, no longer auto-populated)
Both models coexist: feed for passive browsing, Signal for high-priority discoveries

Technology

Extension: TypeScript (strict), esbuild for content script bundling, web-ext for development
Content extraction: Defuddle (by Obsidian creator, zero-dep core bundle) with CSS heuristic and TreeWalker fallbacks
Communication: WebSocket between background script and Penny server, browser.runtime messaging between sidebar and background
Protocol: Typed discriminated unions for all messages (TypeScript type + const pattern mirrors Python Pydantic models)
Server: existing Python/Docker architecture with ChannelManager routing proxy
Model: Ollama on server (20B), not constrained by browser ML engine limitations
Storage: server SQLite for persistent data, browser.storage.local for device label, chat history, domain allowlist
Theming: CSS custom properties with prefers-color-scheme media query for automatic light/dark

Incremental Build Path

~~Minimal Firefox extension: sidebar chat panel + WebSocket to Penny server~~ Done
~~Add browse_url tool: extension can fetch and sanitize a page on demand~~ Done
Add search_web tool: use browser's search engine
Browsing history reader: passive preference extraction
~~Feed page: render thoughts as a browsable page~~ Done
YouTube skill: first domain-specific skill
Additional skills: Reddit, Amazon, forums, etc.

Additional implementations (not in original plan)

Multi-channel architecture (ChannelManager, device table, single-user identity)
Active tab context injection (Defuddle extraction + synthetic tool call)
Page context toggle with og:image headers on responses
TypeScript with typed protocol (discriminated unions, const objects)
Background script owns WebSocket (sidebar uses runtime messaging)
Domain permission dialog flow
Chat history persistence (browser.storage.local, 200 message cap)
HTML formatting pipeline (markdown → HTML with code, links, tables-to-bullets)
Flush image rendering with smart scrolling
Light/dark theme support
/draw command working in browser (base64 data URI rendering)
Reconnection indicator with spinner
web-ext dev setup with auto-reload
Thoughts feed page with new/archive tabs, card grid, modal viewer
Thought reactions (thumbs up/down) feeding preference extraction pipeline
Image URLs on thought model (legacy field, no longer auto-populated)
Seed topic bylines on thought cards (from preference FK)
Unnotified thought count in sidebar nav (5-minute polling)
Penny logo: SVG traced from PNG, rendered to crisp PNGs at all icon sizes
Font Awesome icons (locally bundled, no CDN)
Signal profile avatar via signal-cli-rest-api

TODO — come back to these

Page header on model-initiated browse_url: when Penny calls browse_url on her own, the sidebar response should show the browsed page's image/title/URL header (same as user-initiated page context). Requires the server to include the browsed URL metadata alongside the response
Deferred cleanup — drop sender column: remove sender from MessageLog, migrate all queries to device_id or remove sender filtering (single-user). ~30 call sites
Domain allowlist management UI: currently managed via browser dev tools, needs a proper settings page
Rate limiting on tool invocations: prevent uncontrolled spidering from the thinking agent

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Penny Browser Extension — Architecture & Design

Vision

Architecture: Browser as a New Channel

Multi-Channel Architecture (implemented)

What the Extension Adds

Extension Components (implemented)

Background script

Sidebar

Content script

Browser Tools

Implemented

Active tab context (implemented, not a tool)

Planned

Tool Request Protocol (implemented)

Skills

Security Model

Core Principles

Prompt Injection Defense (implemented)

Domain Allowlist (implemented)

Data Policies

Tool Constraints

Multi-Browser Continuity

Feed Page (implemented)

Technology

Incremental Build Path

Additional implementations (not in original plan)

TODO — come back to these

FilesExpand file tree

browser-extension-architecture.md

Latest commit

History

browser-extension-architecture.md

File metadata and controls

Penny Browser Extension — Architecture & Design

Vision

Architecture: Browser as a New Channel

Multi-Channel Architecture (implemented)

What the Extension Adds

Extension Components (implemented)

Background script

Sidebar

Content script

Browser Tools

Implemented

Active tab context (implemented, not a tool)

Planned

Tool Request Protocol (implemented)

Skills

Security Model

Core Principles

Prompt Injection Defense (implemented)

Domain Allowlist (implemented)

Data Policies

Tool Constraints

Multi-Browser Continuity

Feed Page (implemented)

Technology

Incremental Build Path

Additional implementations (not in original plan)

TODO — come back to these