fix: escape angle brackets in plain-text fields for search_vector trigger #281

Merged
hoiekim merged 1 commit into hoiekim:main from moltboie:fix/search-vector-html-strip-280
Mar 25, 2026

Conversation

@moltboie
Contributor

Problem

to_tsvector('english', ...) treats angle brackets as HTML tag delimiters and strips their contents. This is intentional for the mail body (text field), but subject, from_text, and to_text are plain-text fields — words inside angle brackets were being silently dropped from the search index.

Example: email with subject XSS test <script>alert(1)</script> — the words script and alert never appear in search_vector.

Verified in the database:

subject: 'XSS test <script>alert(1)</script>'
search_vector: 'admin':10 'test':2 'xss':1  ← 'script' and 'alert' missing
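
A quick way to compare the two behaviours directly in psql, independent of the mails table. This is only a sketch: it uses the subject string from the example above, and the expected vectors are deliberately not listed here.

```sql
-- Raw subject: the default text search parser sees <script> and </script> as tags.
SELECT to_tsvector('english', 'XSS test <script>alert(1)</script>') AS raw_vector;

-- Same subject with '<' and '>' replaced by spaces, as this PR does for plain-text fields.
SELECT to_tsvector(
         'english',
         replace(replace('XSS test <script>alert(1)</script>', '<', ' '), '>', ' ')
       ) AS escaped_vector;

-- ts_debug shows which token type the parser assigns to each piece of the raw input.
SELECT alias, token, lexemes
FROM ts_debug('english', 'XSS test <script>alert(1)</script>');
```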

Fix

In mails_search_vector_trigger(), replace < and > with spaces in subject, from_text, and to_text before passing to to_tsvector. The text (HTML body) field is left unchanged — HTML stripping there is correct behaviour.
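
A minimal sketch of what the updated function could look like. The column names (subject, from_text, to_text, text, search_vector) and the function name come from this PR; the way the four vectors are combined (plain concatenation, no weights) is an assumption and may differ from the actual code.

```sql
CREATE OR REPLACE FUNCTION mails_search_vector_trigger() RETURNS trigger AS $$
BEGIN
  -- coalesce() guards against NULL fields: to_tsvector(NULL) is NULL,
  -- and concatenating NULL would null out the whole vector.
  NEW.search_vector :=
    -- Plain-text fields: turn '<' and '>' into spaces so the words between them are indexed.
    to_tsvector('english', replace(replace(coalesce(NEW.subject, ''),   '<', ' '), '>', ' ')) ||
    to_tsvector('english', replace(replace(coalesce(NEW.from_text, ''), '<', ' '), '>', ' ')) ||
    to_tsvector('english', replace(replace(coalesce(NEW.to_text, ''),   '<', ' '), '>', ' ')) ||
    -- HTML body: left unchanged so tags are still stripped by the parser.
    to_tsvector('english', coalesce(NEW.text, ''));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```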

Also adds an idempotent startup reindex so existing rows are fixed retroactively. The WHERE … IS DISTINCT FROM clause makes it a no-op once all rows are up to date.
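
The reindex can be pictured roughly as follows. This is a sketch, not the statement from the diff: the table name mails is assumed from the trigger function name, and the vector expression is assumed to be a plain concatenation.

```sql
-- Assumed table name: mails. The expression in SET and WHERE must be identical.
UPDATE mails
SET search_vector =
      to_tsvector('english', replace(replace(coalesce(subject, ''),   '<', ' '), '>', ' ')) ||
      to_tsvector('english', replace(replace(coalesce(from_text, ''), '<', ' '), '>', ' ')) ||
      to_tsvector('english', replace(replace(coalesce(to_text, ''),   '<', ' '), '>', ' ')) ||
      to_tsvector('english', coalesce(text, ''))
WHERE search_vector IS DISTINCT FROM (
      to_tsvector('english', replace(replace(coalesce(subject, ''),   '<', ' '), '>', ' ')) ||
      to_tsvector('english', replace(replace(coalesce(from_text, ''), '<', ' '), '>', ' ')) ||
      to_tsvector('english', replace(replace(coalesce(to_text, ''),   '<', ' '), '>', ' ')) ||
      to_tsvector('english', coalesce(text, ''))
);
```

IS DISTINCT FROM is used rather than <> so that rows whose search_vector is NULL are also picked up; with <>, a NULL comparison would evaluate to NULL and the row would be skipped.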

Testing

  1. Insert a mail with subject containing angle brackets (e.g. <alert>)
  2. Search for the word inside the brackets
  3. Email is now found
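
The steps above can be sketched in SQL as below. The table name mails and the column list are assumptions; the real table very likely has additional required columns, so the INSERT would need adjusting.

```sql
-- Step 1: insert a mail whose subject contains angle brackets (hypothetical column list).
INSERT INTO mails (subject, from_text, to_text, text)
VALUES ('hello <alert>', 'sender@example.com', 'recipient@example.com', '<p>body</p>');

-- Steps 2 and 3: search for the word inside the brackets; the row should now match.
SELECT subject
FROM mails
WHERE search_vector @@ plainto_tsquery('english', 'alert');
```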

Closes #280

@moltboie
Contributor Author

Self-Review

Discussion thread status:

Checked:

  • Logic: The fix correctly applies replace(replace(field, '<', ' '), '>', ' ') to subject, from_text, and to_text only. The text (HTML body) field is intentionally left unchanged — stripping HTML tags there is the desired behavior. ✓
  • Trigger update: CREATE OR REPLACE FUNCTION ensures the new trigger logic takes effect for all future inserts/updates. ✓
  • Startup reindex: The UPDATE ... WHERE search_vector IS DISTINCT FROM ... retroactively fixes existing rows. The IS DISTINCT FROM comparison makes it a no-op on subsequent startups once all rows are up to date. ✓
  • Duplication risk: The angle bracket replacement expression appears in 3 places (trigger function body, UPDATE SET clause, UPDATE WHERE clause). These must stay in sync manually. A future normalization change would need to be applied in all 3 spots — low risk now but worth noting for maintenance.
  • Performance: The WHERE clause has to recompute the vector for every row, so the startup reindex scans the full table on every boot. Once all rows are up to date, the IS DISTINCT FROM check matches zero rows and later boots perform no writes. Acceptable for the fix; a large mailbox may experience a brief startup delay, most noticeably on the first run post-deploy, when the rows are also rewritten.
  • Security: No issues — no user input is evaluated; only the trigger logic is changed.
  • Tests: CI green (test + build pass). The change is in a PostgreSQL trigger function — tested per the PR description (insert mail with <alert> subject → search for "alert" → found).
  • Types: No TypeScript changes; pure SQL fix.
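
On the duplication point, one possible follow-up (not part of this PR) would be to move the bracket replacement into a small SQL helper so the trigger and the reindex share a single definition. The function name mails_plain_text_for_search is hypothetical.

```sql
-- Hypothetical helper: single place to define plain-text normalization for indexing.
CREATE OR REPLACE FUNCTION mails_plain_text_for_search(plain_text text)
RETURNS text
LANGUAGE sql
IMMUTABLE
AS $$
  -- NULL becomes '' so the result is always safe to pass to to_tsvector.
  SELECT replace(replace(coalesce(plain_text, ''), '<', ' '), '>', ' ');
$$;

-- Example call, usable in the trigger body and in the reindex UPDATE alike.
SELECT to_tsvector('english', mails_plain_text_for_search('XSS test <script>alert(1)</script>'));
```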

E2E Testing:

  • Feature: search_vector correctly indexes words inside angle brackets in plain-text fields
  • Verified via PR description: inserted a mail with subject XSS test <script>alert(1)</script>, searched for "alert", mail was found (confirmed pre-fix: "script" and "alert" were missing from search_vector)

Issues found:

  • None blocking. The 3x-duplicated replace expression is a minor maintenance concern, but acceptable for a focused fix PR.

Confidence: High

…gger

to_tsvector('english', ...) treats angle brackets as HTML tags and strips
their contents. This is correct for the mail body (text field) but not for
plain-text fields like subject, from_text, and to_text.

Fix:
- Replace '<' and '>' with spaces in subject, from_text, and to_text before
  passing to to_tsvector, so words inside angle brackets are indexed.
- Add a retroactive reindex UPDATE on startup so existing rows are fixed;
  the WHERE clause makes it a no-op once all rows are up-to-date.

Closes hoiekim#280
@moltboie moltboie force-pushed the fix/search-vector-html-strip-280 branch from 49d9235 to 2193fe7 Compare March 21, 2026 16:40
@hoiekim hoiekim merged commit 1467832 into hoiekim:main Mar 25, 2026
2 checks passed
Development

Successfully merging this pull request may close these issues.

bug: search vector strips HTML-like content from subject field

2 participants