
fix: add input sanitization and security headers to prevent stored XSS (Fixes #739) #769

Merged
ritesh-1918 merged 6 commits into ritesh-1918:gssoc from zeroknowledge0x:fix/xss-sanitization-739
May 31, 2026

Conversation

@zeroknowledge0x

@zeroknowledge0x zeroknowledge0x commented May 30, 2026

Summary

Adds server-side input sanitization and security headers to prevent stored XSS attacks on user-generated support ticket content.

Changes

  • Add backend/sanitization.py with sanitize_text, sanitize_ticket_data, get_security_headers functions
  • Apply sanitize_text to all ticket analysis endpoints (/ai/analyze_ticket, /ai/analyze, /ai/analyze-v2)
  • Apply sanitize_ticket_data to the save_ticket endpoint
  • Add HTTP security headers middleware (X-Content-Type-Options, X-Frame-Options, X-XSS-Protection, etc.)
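
The headers middleware listed above could look roughly like the following framework-agnostic ASGI wrapper. This is a sketch, not the code from this PR: it assumes the backend is an ASGI application, and the class name `SecurityHeadersMiddleware`, the helper `response_headers`, and every header value are hypothetical choices made for illustration (the description names the headers but not their values).

```python
import asyncio

# Header values are illustrative assumptions; the PR only names the headers.
SECURITY_HEADERS = {
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "X-XSS-Protection": "1; mode=block",
    "Referrer-Policy": "strict-origin-when-cross-origin",
}


class SecurityHeadersMiddleware:
    """Wrap an ASGI app and append security headers to each HTTP response."""

    def __init__(self, app, headers=None):
        self.app = app
        chosen = headers if headers is not None else SECURITY_HEADERS
        # ASGI carries headers as (lowercase bytes name, bytes value) pairs.
        self.headers = [
            (name.lower().encode("latin-1"), value.encode("latin-1"))
            for name, value in chosen.items()
        ]

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                current = list(message.get("headers", []))
                present = {name.lower() for name, _ in current}
                # Do not override a header the endpoint already set.
                extra = [(n, v) for n, v in self.headers if n not in present]
                message = {**message, "headers": current + extra}
            await send(message)

        await self.app(scope, receive, send_with_headers)


async def demo_app(scope, receive, send):
    """Tiny stand-in application used only to exercise the middleware."""
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"ok"})


def response_headers(app):
    """Drive one fake GET request through `app` and return its headers."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/", "headers": []}
    asyncio.run(app(scope, receive, send))
    return dict(sent[0]["headers"])
```

Calling `response_headers(SecurityHeadersMiddleware(demo_app))` returns the original `content-type` header plus the four added ones. Attaching the headers at the `http.response.start` message means every endpoint gets them without per-route changes.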

Security Details

  • Sanitizes HTML tags, script tags, event handlers, javascript: URIs
  • Limits text length to 10KB to prevent abuse
  • Adds defense-in-depth headers even though React auto-escapes JSX
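
As a rough illustration of the steps listed above, a sanitizer along these lines would trim, cap the length, drop tags (and with them any event-handler attributes), neutralize `javascript:` schemes, and escape whatever is left. It is a simplified sketch and not the contents of backend/sanitization.py: the constant `MAX_LENGTH` treats the 10KB limit as a character count, and the two patterns are assumptions chosen for brevity.

```python
import html
import re

# Assumption: the "10KB" limit is applied as a character count.
MAX_LENGTH = 10 * 1024

# Negated character classes keep matching linear in the input size.
_TAG = re.compile(r"<[^<>]*>")
_JS_URI = re.compile(r"javascript\s*:", re.IGNORECASE)


def sanitize_text(text, max_length=MAX_LENGTH):
    """Return a trimmed, length-capped, tag-free, HTML-escaped string."""
    if not isinstance(text, str):
        return ""
    # Truncate first so the later passes work on bounded input.
    text = text.strip()[:max_length]
    # Removing whole tags also removes attributes such as onerror=...
    text = _TAG.sub("", text)
    # Repeat until stable: one removal can splice a new "javascript:" together.
    while _JS_URI.search(text):
        text = _JS_URI.sub("", text)
    # Escape what remains so stray "<", ">", "&" and quotes render as text.
    return html.escape(text, quote=True)
```

For example, `sanitize_text('<a href="javascript:alert(1)">click</a>')` returns `'click'`. A pattern this simple also removes text such as `< 2 and 3 >` that merely looks like a tag, which is one reason a parser-based approach can be preferable.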

Testing

  • Existing tests pass
  • Verified sanitization strips dangerous content while preserving normal text

Related Issues

Fixes #739

Summary by CodeRabbit

  • Security Improvements
    • Input sanitization applied across ticket analysis and submission endpoints to reduce XSS and malformed-input risks.
    • Ticket payloads are sanitized before being persisted to prevent storing unsafe or malformed content.
    • All HTTP responses now include automatic security headers (including a Content-Security-Policy and other browser-hardening headers) to strengthen client-side protections.

Review Change Stack

- Add backend/sanitization.py with sanitize_text, sanitize_ticket_data, get_security_headers
- Apply sanitize_text to all ticket analysis endpoints
- Apply sanitize_ticket_data to save_ticket endpoint
- Add security headers middleware (X-Content-Type-Options, X-Frame-Options, etc.)

Fixes ritesh-1918#739
@vercel

vercel Bot commented May 30, 2026

@zeroknowledge0x is attempting to deploy a commit to the ritesh Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented May 30, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a5fdd653-87b2-491d-a729-a9fe49244f61

📥 Commits

Reviewing files that changed from the base of the PR and between c28e43e and 1bf773a.

📒 Files selected for processing (1)
  • backend/sanitization.py
Walkthrough

Adds a sanitization module (text stripping, script/html removers, ticket-field sanitizer, security headers) and wires it into HTTP middleware plus ticket save and AI-analysis endpoints to sanitize inputs and attach CSP and related headers.

Changes

Security Hardening for Stored XSS

| Layer / File(s) | Summary |
| --- | --- |
| XSS Pattern Definitions and Text Sanitization (backend/sanitization.py) | Precompiled regex patterns detect event-handler attributes, javascript: and data:text/html URIs, and expression(. _remove_script_tags() and _remove_html_tags() remove script blocks and strip/escape remaining HTML. sanitize_text() trims, truncates, removes XSS vectors, and either strips or escapes remaining HTML. |
| Ticket Data and Security Header Utilities (backend/sanitization.py) | sanitize_ticket_data() shallow-copies a ticket dict and sanitizes configured string fields via sanitize_text(). get_security_headers() returns a Content-Security-Policy plus X-Content-Type-Options, X-Frame-Options, X-XSS-Protection, and Referrer-Policy. |
| Endpoint and Middleware Integration (backend/main.py) | Imports sanitize_text, sanitize_ticket_data, and get_security_headers(); registers add_security_headers HTTP middleware to attach headers; uses sanitize_ticket_data() in /tickets/save; uses sanitize_text() in /ai/analyze_ticket, /ai/analyze, and /ai/analyze-v2. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop through code and chase the sneaky tags,
I snip the scripts and mend the ragged tags,
Headers stand guard while payloads are made clean,
Tickets now whisper, not shout on the screen,
A quiet patch — the dashboard stays serene.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 61.54% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the primary change: adding input sanitization and security headers to prevent stored XSS vulnerabilities, directly addressing the linked issue #739. |
| Linked Issues check | ✅ Passed | The PR implements all core objectives from issue #739: sanitizes user-submitted ticket content on the server side, removes script tags/event handlers/dangerous URIs, and implements HTTP security headers (CSP, X-XSS-Protection, etc.) for defense-in-depth. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to preventing stored XSS in ticket submissions: new sanitization module, application to ticket endpoints, security headers middleware, and CSP configuration. No unrelated modifications detected. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment thread backend/sanitization.py Fixed
Comment thread backend/sanitization.py Fixed
Comment thread backend/sanitization.py Fixed
- Fix script tag regex to properly match end tags with whitespace (</script\t\n>)
- Use non-backtracking patterns to prevent polynomial ReDoS
- Simplify HTML tag regex to avoid uncontrolled data backtracking

Addresses CodeQL scanning alerts ritesh-1918#25, ritesh-1918#26, ritesh-1918#27
Comment thread backend/sanitization.py Fixed

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/main.py (1)

751-775: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

image_text is still echoed back unsanitized.

Only request_body.text is cleaned here. request_body.image_text is copied into gemini_analysis["ocr_text"] and returned to the client untouched, so attacker-controlled OCR text can still flow straight to the dashboard after this PR.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/main.py` around lines 751 - 775, The OCR text from
request_body.image_text is being echoed into gemini_analysis["ocr_text"] without
sanitization; update the assignment to pass request_body.image_text through the
same sanitize_text function (or equivalent sanitizer used for request_body.text)
and default to an empty string if falsy so that gemini_analysis["ocr_text"] =
sanitize_text(request_body.image_text) or "" (ensure you call the existing
sanitize_text helper used earlier in this file and keep the same behavior for
None/empty inputs).
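
The change requested in this comment amounts to passing both text fields through the same helper before they are returned. A minimal sketch follows, assuming a stand-in `sanitize_text` (the real helper lives in backend/sanitization.py and is not shown in this thread) and a hypothetical `build_analysis` function in place of the endpoint body:

```python
import html
from typing import Optional


def sanitize_text(text: Optional[str]) -> str:
    """Stand-in sanitizer for this sketch: trims and escapes, '' for non-strings."""
    return html.escape(text.strip()) if isinstance(text, str) else ""


def build_analysis(text: Optional[str], image_text: Optional[str]) -> dict:
    """Hypothetical slice of the endpoint: every echoed field is sanitized."""
    return {
        "text": sanitize_text(text),
        # OCR output is attacker-controlled too, so it takes the same path.
        "ocr_text": sanitize_text(image_text) or "",
    }
```

With this shape, a missing `image_text` yields an empty string rather than `None`, matching the fallback described in the comment.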
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/sanitization.py`:
- Around line 77-101: sanitize_ticket_data currently only sanitizes a small
top-level allowlist, letting many caller-controlled strings from
TicketSaveRequest (e.g., subcategory, assigned_team, status, ocr_text, image_url
and string values nested in metadata, entities, solution_steps) reach storage;
update sanitize_ticket_data to recursively sanitize all string values in the
provided dict (and any nested dicts/lists) by walking the structure and applying
sanitize_text to every str, or accept an explicit exhaustive fields list from
save_ticket and sanitize those keys before persisting; ensure you reuse
sanitize_text and keep the function name sanitize_ticket_data so callers (e.g.,
save_ticket) still work.
- Around line 111-119: The CSP in backend/sanitization.py currently blocks
external assets used by the root page; update the "Content-Security-Policy"
header to allow https://cdn.tailwindcss.com in script-src and style-src, allow
https://fonts.googleapis.com in style-src, and allow https://fonts.gstatic.com
(and https:) in font-src (retain data: where needed), so the root page in
backend/main.py can load Tailwind and Google Fonts without being blocked; modify
the header value string accordingly and keep existing self, unsafe-inline/eval
tokens as appropriate.
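
One way to keep the CSP in step with the assets named in the second comment is to build the header from a directive table. The sketch below is an illustration under assumptions: the hosts come from the comment above, while the `default-src` entry, the exact token set, the companion header values, and the helper `build_csp` are hypothetical and may differ from what backend/sanitization.py ships.

```python
# Directive table; hosts are those named in the review comment, the rest is assumed.
CSP_DIRECTIVES = {
    "default-src": ["'self'"],
    "script-src": ["'self'", "'unsafe-inline'", "https://cdn.tailwindcss.com"],
    "style-src": [
        "'self'",
        "'unsafe-inline'",
        "https://cdn.tailwindcss.com",
        "https://fonts.googleapis.com",
    ],
    "font-src": ["'self'", "data:", "https://fonts.gstatic.com"],
}


def build_csp(directives):
    """Join a {directive: [sources]} mapping into one CSP header value."""
    return "; ".join(
        f"{name} {' '.join(sources)}" for name, sources in directives.items()
    )


def get_security_headers():
    """Return the response headers; values other than the CSP are assumptions."""
    return {
        "Content-Security-Policy": build_csp(CSP_DIRECTIVES),
        "X-Content-Type-Options": "nosniff",
        "X-Frame-Options": "DENY",
        "Referrer-Policy": "strict-origin-when-cross-origin",
    }
```

Keeping the sources in a table makes it easier to see at review time which external origins a page is allowed to load from.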


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5fc7d5a4-077c-41f9-b521-d00a4b0d3a14

📥 Commits

Reviewing files that changed from the base of the PR and between da8faf2 and 5ecef85.

📒 Files selected for processing (2)
  • backend/main.py
  • backend/sanitization.py

Comment thread backend/sanitization.py
Comment thread backend/sanitization.py
Replace regex-based HTML/script tag removal with safe alternatives:
- Script tags: case-insensitive string operations (str.find)
- HTML tags: Python's html.parser.HTMLParser (no regex backtracking)

This addresses CodeQL findings:
- Bad HTML filtering regexp (script tag pattern)
- Polynomial regular expression on uncontrolled data

All remaining regex patterns use negated character classes (no backtracking).

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/sanitization.py`:
- Around line 49-59: The current loop breaks when no matching "</script" is
found, which causes everything after the opening "<script" to be dropped; change
the logic so that if close_start == -1 you instead try to find the end of the
opening tag (search for ">" after tag_end in text) and remove only the opening
tag if that ">" exists (set close_end accordingly and continue), and if there is
no ">" treat the remaining text as literal (append the rest unchanged and exit
the loop) so you never silently truncate the rest of the input; update the
handling around close_pattern, close_start, close_end, lower, tag_end and text
to implement this flow.
- Around line 74-88: The HTMLParser used in _TagStripper (in _remove_html_tags)
is decoding escaped entities which can reintroduce executable HTML; update
_TagStripper to instantiate HTMLParser with convert_charrefs=False and implement
handlers to preserve entities (override handle_entityref and handle_charref to
append the original entity text to self.parts) while keeping handle_data as-is,
then feed(text) and return get_text() so existing escaped entities remain
escaped.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9337f5b6-4654-4093-96b3-323cd6f290e0

📥 Commits

Reviewing files that changed from the base of the PR and between 5ecef85 and 01d6d20.

📒 Files selected for processing (1)
  • backend/sanitization.py

Comment thread backend/sanitization.py
Comment thread backend/sanitization.py
- Expand sanitize_ticket_data to include subcategory, assigned_team, ocr_text
- Align CSP with actual assets (Tailwind CDN, Google Fonts)
- Fix unterminated <script> tag truncating remaining content
@zeroknowledge0x
Author

Hi! 👋 I've addressed the CodeRabbit and CodeQL review comments:

Changes made:

  1. Expanded sanitize_ticket_data fields: added subcategory, assigned_team, and ocr_text to the default sanitization list. These are user-controlled string fields in TicketSaveRequest that were previously unsanitized.

  2. Aligned CSP with actual assets — Updated Content-Security-Policy to allow:

    • cdn.tailwindcss.com in script-src and style-src
    • fonts.googleapis.com in style-src
    • fonts.gstatic.com in font-src

    The root page already loads these resources but the CSP was blocking them.

  3. Fixed unterminated <script> truncation — Previously, if input contained <script without a matching </script>, the function would drop everything from that point onward (corrupting stored ticket text). Now it escapes the <script tag and preserves the remaining content.
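
The behaviour described in item 3 can be sketched with plain string searches. This is one possible reading, not the merged code: the function name `remove_script_blocks` is hypothetical, and escaping only the leading `<` of an unterminated opener is an assumption about what "escapes the `<script` tag" means.

```python
# Lowercase ASCII letters only, so indices in the lowered copy match the
# original string (str.lower() can change the length of some Unicode text).
_ASCII_LOWER = {code: code + 32 for code in range(ord("A"), ord("Z") + 1)}


def remove_script_blocks(text: str) -> str:
    """Drop <script>...</script> blocks; escape openers that never close."""
    lower = text.translate(_ASCII_LOWER)
    out = []
    pos = 0
    while True:
        start = lower.find("<script", pos)
        if start == -1:
            out.append(text[pos:])  # no more openers: keep the tail as-is
            break
        out.append(text[pos:start])
        close_start = lower.find("</script", start)
        close_end = lower.find(">", close_start) if close_start != -1 else -1
        if close_end == -1:
            # Unterminated block: neutralize the opener, keep what follows.
            out.append("&lt;")
            pos = start + 1
            continue
        pos = close_end + 1  # skip the whole block, including the end tag
    return "".join(out)
```

For input `"before <script>alert(1) after"` this returns `"before &lt;script>alert(1) after"`, so the ticket text after the stray opener survives. The prefix match is deliberately loose and would also treat a tag name that merely starts with `script` as an opener.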

All CI checks should pass. Ready for re-review! 🙏

CodeRabbit flagged that HTMLParser with default convert_charrefs=True
decodes &lt;script&gt; back to <script>, potentially reactivating XSS
payloads that were already escaped. Set convert_charrefs=False to
preserve escaped entities.

Addresses: ritesh-1918#739

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
backend/sanitization.py (1)

76-87: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve entity references when convert_charrefs is disabled.

On Line 80, HTMLParser(convert_charrefs=False) sends &amp; / &#...; through handle_entityref() and handle_charref(). Because this stripper only collects handle_data(), escaped user text gets dropped or mangled instead of preserved.

Suggested fix
     class _TagStripper(HTMLParser):
         def __init__(self):
             # convert_charrefs=False prevents decoding &lt;script&gt; back to
             # <script>, which would reactivate XSS payloads.
             super().__init__(convert_charrefs=False)
             self.parts = []

         def handle_data(self, data):
             self.parts.append(data)
+
+        def handle_entityref(self, name):
+            self.parts.append(f"&{name};")
+
+        def handle_charref(self, name):
+            self.parts.append(f"&#{name};")
#!/bin/bash
python - <<'PY'
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

for sample in ["Tom &amp; Jerry", "&lt;b&gt;bold&lt;/b&gt;", "&#169; 2026"]:
    p = TagStripper()
    p.feed(sample)
    print(repr(sample), "->", repr("".join(p.parts)))
PY
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/sanitization.py` around lines 76 - 87, The _TagStripper class uses
HTMLParser(convert_charrefs=False) but only implements handle_data so entity
references like &amp;, &lt;, or numeric char refs are not preserved; add
implementations for handle_entityref(self, name) and handle_charref(self, name)
in class _TagStripper to append the original entity text (e.g. f"&{name};" for
entity refs and the appropriate "&#NNN;" or "&#xHH;" form for char refs) into
self.parts, ensuring get_text()/handle_data behavior remains unchanged and
escaped user text is preserved when convert_charrefs is False.
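
For reference, a self-contained version of the stripper with both entity handlers in place might look like this. The class mirrors the suggested fix above; the wrapper `strip_tags` is a hypothetical name added for the example.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text content while leaving entity references untouched."""

    def __init__(self):
        # convert_charrefs=False routes entities to the two handlers below
        # instead of decoding them into handle_data().
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # Named reference such as &amp; or &lt;
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        # Numeric reference; `name` is "169" or "xA9" without the "&#" prefix.
        self.parts.append(f"&#{name};")


def strip_tags(text: str) -> str:
    """Remove tags from `text`, preserving already-escaped entities."""
    parser = TagStripper()
    parser.feed(text)
    parser.close()  # flush anything the parser is still buffering
    return "".join(parser.parts)
```

With the handlers present, `strip_tags("Tom &amp; Jerry")` comes back unchanged and `strip_tags("<b>bold</b> text")` returns `"bold text"`, so escaped input is neither decoded nor dropped.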

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d837dda6-ff51-467b-b732-ebbbd27c09f7

📥 Commits

Reviewing files that changed from the base of the PR and between 06ff9dc and c28e43e.

📒 Files selected for processing (1)
  • backend/sanitization.py

- Add status, image_url, metadata to default sanitized fields
- Addresses CodeRabbit review about sanitizing all caller-controlled strings
- Reduces stored-XSS risk through additional field coverage
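
Because a field like metadata can hold nested values, an alternative to extending the field list is the recursive walk suggested earlier in the review. The sketch below is not the merged implementation: `sanitize_value` is a hypothetical helper, and `sanitize_text` is a stand-in that only trims and escapes.

```python
import html
from typing import Any


def sanitize_text(text: str) -> str:
    """Stand-in sanitizer for this sketch (trim and HTML-escape only)."""
    return html.escape(text.strip())


def sanitize_value(value: Any) -> Any:
    """Return a sanitized copy of `value`, walking nested dicts and lists."""
    if isinstance(value, str):
        return sanitize_text(value)
    if isinstance(value, dict):
        return {key: sanitize_value(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize_value(item) for item in value]
    return value  # numbers, booleans, None pass through unchanged


def sanitize_ticket_data(ticket: dict) -> dict:
    """Sanitize every string in the ticket without mutating the input."""
    return sanitize_value(ticket)
```

The walk builds new containers, so the caller's dict is left untouched. Escaping every string uniformly can alter values such as URLs with `&` in the query string, so fields like image_url may need separate validation.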
@ritesh-1918 ritesh-1918 changed the base branch from main to gssoc May 31, 2026 05:50
@ritesh-1918 ritesh-1918 added the gssoc (GirlScript Summer of Code), gssoc:approved (GSSoC Approved PR), level:advanced (Advanced level difficulty), quality:exceptional (Exceptional code quality), and type:security (Security fix or improvement) labels May 31, 2026
@ritesh-1918
Owner

Hi @zeroknowledge0x! Thanks for the contribution. I have triaged your PR and set it to merge into the gssoc branch.

⚠️ MANDATORY GSSOC ONBOARDING STEPS:
Before your PR points are finalized on the leaderboard, you MUST complete these required steps:

  1. Star this repository: https://github.com/ritesh-1918/HELPDESK.AI (Mandatory)
  2. 👤 Follow the Project Admin: https://github.com/ritesh-1918 (Mandatory)
  3. 💼 Connect on LinkedIn: https://www.linkedin.com/in/ritesh1908/ (Mandatory)

Welcome to the HELPDESK.AI developer family! 🚀💻

@ritesh-1918 ritesh-1918 merged commit c9eb1a8 into ritesh-1918:gssoc May 31, 2026
9 of 10 checks passed
@ritesh-1918 ritesh-1918 added the level:intermediate (Intermediate level difficulty) label May 31, 2026
@ritesh-1918
Owner

Hi @zeroknowledge0x! 🙌

Thank you so much for your excellent contribution: "fix: add input sanitization and security headers to prevent stored XSS (Fixes #739)"! We really appreciate the high-quality code and effort you have put into the platform.

Just a quick, friendly heads-up as we prepare our manual merging and verification queues—please make sure to complete all the mandatory community steps listed below.

Once those manual steps are verified, we'll get your PR officially merged into the gssoc branch (or keep it neatly cataloged if closed as integrated) and assign it the highest possible GSSoC S-Tier labels to maximize your leaderboard points!

Let's build something amazing together! 🚀🔥


🌟 Community Support & Network Steps (Take 10 Seconds!)

As we prepare our manual verification and merging queues, please make sure you have taken a moment to complete these required steps to finalize your points:

  1. Star this repository: https://github.com/ritesh-1918/HELPDESK.AI (Mandatory)
  2. 🍴 Fork this repository: https://github.com/ritesh-1918/HELPDESK.AI/fork (Mandatory)
  3. 👤 Follow @ritesh-1918 on GitHub: https://github.com/ritesh-1918 (Mandatory - manual step)
  4. 💼 Connect on LinkedIn: https://www.linkedin.com/in/ritesh1908/ (Mandatory)

Note: Having these steps completed manually is required before your PR points are officially cleared.


Labels

gssoc:approved (GSSoC Approved PR), gssoc (GirlScript Summer of Code), level:advanced (Advanced level difficulty), level:intermediate (Intermediate level difficulty), quality:exceptional (Exceptional code quality), type:security (Security fix or improvement)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Security] User-submitted support tickets not sanitized before rendering - stored XSS risk

3 participants