fix: escape inline HTML in markdown to eliminate XSS attack surface #484

Merged
chungjac merged 6 commits into aws:main from chungjac:fix/escape-inline-html-xss
May 7, 2026

@chungjac commented May 7, 2026

Summary

  • Adds html and image custom renderers in marked that escape raw HTML tokens instead of passing them through to the DOM. This eliminates the entire class of inline-HTML XSS vectors at the parser level, regardless of which tags or attributes are involved.
  • Removes resource-loading media tags (img, audio, video, source, track) from the sanitizer allowlist as defense-in-depth, so they are stripped even if a payload somehow bypasses the escaping step.
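
The allowlist change in the second bullet can be sketched as follows. This is a minimal illustration, not the project's actual configuration: `baselineAllowedTags` is a hypothetical starting list, and only the five media tag names come from this PR.

```typescript
// Hypothetical baseline allowlist; the project's real list is not shown in this PR.
const baselineAllowedTags: string[] = [
  "p", "a", "strong", "em", "code", "pre", "ul", "ol", "li",
  "img", "audio", "video", "source", "track",
];

// The resource-loading media tags this PR removes from the allowlist.
const mediaTags: string[] = ["img", "audio", "video", "source", "track"];

// Keep every tag except the media tags, so the sanitizer strips them
// even if a payload gets past the escaping step.
const allowedTags: string[] = baselineAllowedTags.filter(
  (tag) => mediaTags.indexOf(tag) === -1,
);
```

A list like `allowedTags` would then presumably be passed to the sanitizer, for example as the `allowedTags` option of sanitize-html.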

Motivation

After PRs #462, #466, and #470 were merged, VAPT verification found that <img> tags in attacker-controlled filenames still rendered as live HTML because img was on the sanitizer's allowlist. The allowlist approach requires ongoing maintenance — every missed tag/attribute combo is a potential bypass (e.g. <input type="image" src=...>, style="background-image: url(...)", etc).

This PR follows the principle stated by the security reviewer: "any output resembling an HTML tag should be escaped before reaching the browser, regardless of its source."

Legitimate formatting uses markdown syntax (**bold**, [link](url), `code`), which marked converts structurally. Raw HTML in the input is never intentional — it's either untrusted data (like filenames) or LLM output errors.

How the fix works

All raw HTML tokens in markdown output are escaped at the marked renderer level, before they ever reach the DOM.

For example, if the LLM outputs <img src="https://attacker.com/exfil"> in its response:

| | Before fix | After fix |
|---|---|---|
| What enters the DOM | `<img src="https://attacker.com/exfil">` (live HTML element) | `&lt;img src="https://attacker.com/exfil"&gt;` (text node) |
| What the user sees | Nothing (or broken image icon) | `<img src="https://attacker.com/exfil">` as visible text |
| Network requests | Browser fetches the URL automatically | None |
| JS execution | onerror/onload handlers fire | None |

This applies to all inline HTML regardless of tag: <img>, <svg>, <script>, <input>, <div>, etc. Markdown formatting (**bold**, [link](url), `code`) continues to render normally since those go through separate renderer paths.
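
The escaping approach described in this section might look roughly like the sketch below. It is not the code from this PR: the `HtmlToken` and `ImageToken` shapes, `escapeHtml`, and `escapingRenderer` are hypothetical names defined locally so the example stands alone. It also escapes quotes, which goes slightly beyond what the table above shows but produces the same visible text.

```typescript
// Minimal local token shapes (assumptions; marked's real tokens carry more fields).
interface HtmlToken { raw: string; text: string; }
interface ImageToken { raw: string; href: string; text: string; }

// Replace HTML-significant characters with entities so the browser
// treats the string as text rather than markup. "&" must go first.
function escapeHtml(input: string): string {
  return input
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical renderer overrides: raw HTML and markdown images are both
// emitted as escaped source text instead of live elements.
const escapingRenderer = {
  html(token: HtmlToken): string {
    return escapeHtml(token.text);
  },
  image(token: ImageToken): string {
    return escapeHtml(token.raw);
  },
};
```

With marked, an object like this would typically be registered via `marked.use({ renderer: escapingRenderer })`. Depending on the marked version, renderer methods receive either a token object (as assumed here) or positional string arguments, so the signatures may need adjusting.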

Verified in browser

  • Zero outbound network requests to attacker URLs
  • document.title unchanged (no JS executed)
  • All payloads render as inert, readable text

Before / After

Before fix — raw HTML renders as live DOM elements

Malicious filenames execute JavaScript (onerror fires), render SVG content, make outbound network requests, and disappear from visible text:

[Screenshot: before-fix]

After fix — all HTML escaped to inert visible text

The same payloads display as safe, readable text with no code execution or network requests:

[Screenshot: after-fix]

Verification results

| Check | Result |
|---|---|
| `<img src=x onerror=alert(1)>` executes JS? | No (rendered as text) |
| `<img src="https://attacker.com/exfil">` makes network request? | No (zero outbound requests) |
| `<svg onload=...>` executes JS? | No (rendered as text) |
| `<input type="image" src="...">` fetches URL? | No (rendered as text) |
| `<div style="background-image: url(...)">` fetches URL? | No (rendered as text) |
| `**bold**`, `[link](url)`, `` `code` `` still render? | Yes (markdown formatting unaffected) |

Test plan

  • All existing unit tests pass (17/17)
  • New tests verify: <img>, <svg>, <script>, <div> with event handlers, ![image](url) are all escaped
  • Markdown formatting still renders correctly
  • Live browser verification: zero network requests to attacker URLs, no JS execution
  • UI snapshot tests need update (binary PNGs will change since images no longer render)
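
The kind of check the new tests perform could be sketched like this. It is an illustrative stand-in rather than the PR's actual test code: `escapeHtml`, `isInert`, and the payload list are assumptions, with the first two payloads taken from the verification table and the last two invented for illustration.

```typescript
// Stand-in for the renderer's escaping step (hypothetical helper).
function escapeHtml(input: string): string {
  return input
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Escaped output is inert if no angle brackets survive, since the browser
// then has nothing it can parse as a tag.
function isInert(escaped: string): boolean {
  return escaped.indexOf("<") === -1 && escaped.indexOf(">") === -1;
}

const payloads: string[] = [
  "<img src=x onerror=alert(1)>",             // from the verification table
  "<img src=\"https://attacker.com/exfil\">", // from the verification table
  "<script>alert(1)</script>",                // hypothetical payload
  "<div onclick=alert(1)>x</div>",            // hypothetical payload
];

// Collect any payload whose escaped form still contains markup.
const failures: string[] = payloads.filter((p) => !isInert(escapeHtml(p)));
```

An empty `failures` array indicates that every payload was reduced to plain text. The real test suite presumably asserts on the full rendered markdown output instead.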

Instead of relying solely on sanitize-html's allowlist to filter dangerous
tags (which requires ongoing maintenance as new bypass vectors are found),
escape all raw HTML tokens at the marked renderer level. This ensures any
output resembling an HTML tag is rendered as inert text regardless of source.

Also removes resource-loading media tags (img, audio, video, source, track)
from the sanitizer allowlist as defense-in-depth.
@chungjac chungjac requested a review from a team as a code owner May 7, 2026 21:12
@chungjac chungjac merged commit 571d6c2 into aws:main May 7, 2026
4 checks passed
3 participants