fix: add input sanitization and security headers to prevent stored XSS (Fixes #739) by zeroknowledge0x · Pull Request #769 · ritesh-1918/HELPDESK.AI

zeroknowledge0x · 2026-05-30T14:25:43Z

Summary

Adds server-side input sanitization and security headers to prevent stored XSS attacks on user-generated support ticket content.

Changes

Add backend/sanitization.py with sanitize_text, sanitize_ticket_data, get_security_headers functions
Apply sanitize_text to all ticket analysis endpoints (/ai/analyze_ticket, /ai/analyze, /ai/analyze-v2)
Apply sanitize_ticket_data to the save_ticket endpoint
Add HTTP security headers middleware (X-Content-Type-Options, X-Frame-Options, X-XSS-Protection, etc.)

Security Details

Sanitizes HTML tags, script tags, event handlers, javascript: URIs
Limits text length to 10KB to prevent abuse
Adds defense-in-depth headers even though React auto-escapes JSX
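
A minimal sketch of the kind of pipeline the bullets above describe; it is not the code in backend/sanitization.py. The pattern definitions, the `MAX_LEN` name, and counting the 10KB cap in characters rather than bytes are assumptions made for illustration.

```python
import html
import re

MAX_LEN = 10 * 1024  # assumed: the 10KB cap, counted here in characters

# Hypothetical patterns, kept simple on purpose.
_PATTERNS = [
    re.compile(r"<[^<>]*>"),                       # anything shaped like a tag
    re.compile(r"\bon[a-z]+\s*=", re.IGNORECASE),  # onclick=, onerror=, ...
    re.compile(r"javascript\s*:", re.IGNORECASE),  # javascript: scheme prefix
]


def sanitize_text(text, max_len=MAX_LEN):
    """Trim, truncate, and neutralize common XSS vectors in plain text."""
    if not text:
        return ""
    text = text.strip()[:max_len]
    # Repeat the removals until nothing changes, so that deleting one match
    # cannot splice a new one together (e.g. "javajavascript:script:").
    while True:
        cleaned = text
        for pattern in _PATTERNS:
            cleaned = pattern.sub("", cleaned)
        if cleaned == text:
            break
        text = cleaned
    # Escape whatever "<", ">" or "&" characters are left over.
    return html.escape(text, quote=False)


print(sanitize_text('  <b onclick="x()">Hi</b> & bye  '))
```

Note that the event-handler pattern in this sketch also removes benign text such as `only =`, which is one reason removal-based filtering is usually paired with output escaping rather than relied on alone.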

Testing

Existing tests pass
Verified sanitization strips dangerous content while preserving normal text

Related Issues

Fixes #739

Summary by CodeRabbit

Security Improvements
- Input sanitization applied across ticket analysis and submission endpoints to reduce XSS and malformed-input risks.
- Ticket payloads are sanitized before being persisted to prevent storing unsafe or malformed content.
- All HTTP responses now include automatic security headers (including a Content-Security-Policy and other browser-hardening headers) to strengthen client-side protections.

- Add backend/sanitization.py with sanitize_text, sanitize_ticket_data, get_security_headers
- Apply sanitize_text to all ticket analysis endpoints
- Apply sanitize_ticket_data to save_ticket endpoint
- Add security headers middleware (X-Content-Type-Options, X-Frame-Options, etc.)

Fixes ritesh-1918#739

vercel · 2026-05-30T14:25:47Z

@zeroknowledge0x is attempting to deploy a commit to the ritesh Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai · 2026-05-30T14:25:51Z


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a5fdd653-87b2-491d-a729-a9fe49244f61

📥 Commits

Reviewing files that changed from the base of the PR and between c28e43e and 1bf773a.

📒 Files selected for processing (1)

backend/sanitization.py

📝 Walkthrough

Walkthrough

Adds a sanitization module (text stripping, script/html removers, ticket-field sanitizer, security headers) and wires it into HTTP middleware plus ticket save and AI-analysis endpoints to sanitize inputs and attach CSP and related headers.

Changes

Security Hardening for Stored XSS

Layer / File(s)	Summary
XSS Pattern Definitions and Text Sanitization `backend/sanitization.py`	Precompiled regex patterns detect event-handler attributes, `javascript:` and `data:text/html` URIs, and `expression(`; `_remove_script_tags()` and `_remove_html_tags()` remove script blocks and strip/escape remaining HTML. `sanitize_text()` trims, truncates, removes XSS vectors, and either strips or escapes remaining HTML.
Ticket Data and Security Header Utilities `backend/sanitization.py`	`sanitize_ticket_data()` shallow-copies a ticket dict and sanitizes configured string fields via `sanitize_text()`. `get_security_headers()` returns a Content-Security-Policy plus `X-Content-Type-Options`, `X-Frame-Options`, `X-XSS-Protection`, and `Referrer-Policy`.
Endpoint and Middleware Integration `backend/main.py`	Imports `sanitize_text`, `sanitize_ticket_data`, and `get_security_headers()`; registers `add_security_headers` HTTP middleware to attach headers; uses `sanitize_ticket_data()` in `/tickets/save`; uses `sanitize_text()` in `/ai/analyze_ticket`, `/ai/analyze`, and `/ai/analyze-v2`.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop through code and chase the sneaky tags,
I snip the scripts and mend the ragged tags,
Headers stand guard while payloads are made clean,
Tickets now whisper, not shout on the screen,
A quiet patch — the dashboard stays serene.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 61.54% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the primary change: adding input sanitization and security headers to prevent stored XSS vulnerabilities, directly addressing the linked issue `#739`.
Linked Issues check	✅ Passed	The PR implements all core objectives from issue `#739`: sanitizes user-submitted ticket content on the server side, removes script tags/event handlers/dangerous URIs, and implements HTTP security headers (CSP, X-XSS-Protection, etc.) for defense-in-depth.
Out of Scope Changes check	✅ Passed	All changes are directly related to preventing stored XSS in ticket submissions: new sanitization module, application to ticket endpoints, security headers middleware, and CSP configuration. No unrelated modifications detected.


- Fix script tag regex to properly match end tags with whitespace (</script\t\n>)
- Use non-backtracking patterns to prevent polynomial ReDoS
- Simplify HTML tag regex to avoid uncontrolled data backtracking

Addresses CodeQL scanning alerts ritesh-1918#25, ritesh-1918#26, ritesh-1918#27

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

backend/main.py (1)
751-775: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

image_text is still echoed back unsanitized.

Only request_body.text is cleaned here. request_body.image_text is copied into gemini_analysis["ocr_text"] and returned to the client untouched, so attacker-controlled OCR text can still flow straight to the dashboard after this PR.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/main.py` around lines 751 - 775, The OCR text from
request_body.image_text is being echoed into gemini_analysis["ocr_text"] without
sanitization; update the assignment to pass request_body.image_text through the
same sanitize_text function (or equivalent sanitizer used for request_body.text)
and default to an empty string if falsy so that gemini_analysis["ocr_text"] =
sanitize_text(request_body.image_text) or "" (ensure you call the existing
sanitize_text helper used earlier in this file and keep the same behavior for
None/empty inputs).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/sanitization.py`:
- Around line 77-101: sanitize_ticket_data currently only sanitizes a small
top-level allowlist, letting many caller-controlled strings from
TicketSaveRequest (e.g., subcategory, assigned_team, status, ocr_text, image_url
and string values nested in metadata, entities, solution_steps) reach storage;
update sanitize_ticket_data to recursively sanitize all string values in the
provided dict (and any nested dicts/lists) by walking the structure and applying
sanitize_text to every str, or accept an explicit exhaustive fields list from
save_ticket and sanitize those keys before persisting; ensure you reuse
sanitize_text and keep the function name sanitize_ticket_data so callers (e.g.,
save_ticket) still work.
- Around line 111-119: The CSP in backend/sanitization.py currently blocks
external assets used by the root page; update the "Content-Security-Policy"
header to allow https://cdn.tailwindcss.com in script-src and style-src, allow
https://fonts.googleapis.com in style-src, and allow https://fonts.gstatic.com
(and https:) in font-src (retain data: where needed), so the root page in
backend/main.py can load Tailwind and Google Fonts without being blocked; modify
the header value string accordingly and keep existing self, unsafe-inline/eval
tokens as appropriate.
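
The recursive option suggested in the first inline comment could look roughly like the sketch below. The `sanitize_value` helper is hypothetical, and the `sanitize_text` shown here is only a stand-in (trim plus escape) so the example runs on its own; the PR's real `sanitize_text` does more.

```python
import html


def sanitize_text(text):
    """Stand-in for the PR's sanitize_text: trims and escapes only."""
    return html.escape(text.strip(), quote=False)


def sanitize_value(value):
    """Hypothetical helper: sanitize every str found in nested dicts/lists."""
    if isinstance(value, str):
        return sanitize_text(value)
    if isinstance(value, dict):
        # Keys are left untouched in this sketch; only values are cleaned.
        return {key: sanitize_value(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        # Tuples come back as lists, which is fine for JSON-shaped payloads.
        return [sanitize_value(item) for item in value]
    return value  # numbers, booleans, None pass through unchanged


def sanitize_ticket_data(ticket):
    """Return a sanitized copy; the caller's dict is not modified."""
    return sanitize_value(ticket)


ticket = {
    "subject": " <b>hi</b> ",
    "metadata": {"note": "a<b"},
    "solution_steps": ["<i>x</i>", 3],
    "priority": 2,
}
print(sanitize_ticket_data(ticket))
```

Walking the whole structure avoids having to keep an allowlist in sync with TicketSaveRequest, at the cost of also rewriting fields (such as URLs) that may need different handling.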



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5fc7d5a4-077c-41f9-b521-d00a4b0d3a14

📥 Commits

Reviewing files that changed from the base of the PR and between da8faf2 and 5ecef85.

📒 Files selected for processing (2)

backend/main.py
backend/sanitization.py

Replace regex-based HTML/script tag removal with safe alternatives:
- Script tags: case-insensitive string operations (str.find)
- HTML tags: Python's html.parser.HTMLParser (no regex backtracking)

This addresses CodeQL findings:
- Bad HTML filtering regexp (script tag pattern)
- Polynomial regular expression on uncontrolled data

All remaining regex patterns use negated character classes (no backtracking).

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/sanitization.py`:
- Around line 49-59: The current loop breaks when no matching "</script" is
found, which causes everything after the opening "<script" to be dropped; change
the logic so that if close_start == -1 you instead try to find the end of the
opening tag (search for ">" after tag_end in text) and remove only the opening
tag if that ">" exists (set close_end accordingly and continue), and if there is
no ">" treat the remaining text as literal (append the rest unchanged and exit
the loop) so you never silently truncate the rest of the input; update the
handling around close_pattern, close_start, close_end, lower, tag_end and text
to implement this flow.
- Around line 74-88: The HTMLParser used in _TagStripper (in _remove_html_tags)
is decoding escaped entities which can reintroduce executable HTML; update
_TagStripper to instantiate HTMLParser with convert_charrefs=False and implement
handlers to preserve entities (override handle_entityref and handle_charref to
append the original entity text to self.parts) while keeping handle_data as-is,
then feed(text) and return get_text() so existing escaped entities remain
escaped.
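
The first comment above concerns the `str.find`-based script removal. A rough sketch of that approach follows, under stated assumptions: the function name is hypothetical, and the unterminated case is handled by escaping the opening `<` and keeping the rest, which is one of several reasonable choices.

```python
import string

# ASCII-only lowercasing keeps the string length unchanged, so indices found
# in the lowered copy are valid in the original (str.lower() can change the
# length for some non-ASCII characters).
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def remove_script_blocks(text):
    """Remove <script>...</script> blocks in one pass using str.find only."""
    lower = text.translate(_ASCII_LOWER)
    parts = []
    pos = 0
    while True:
        start = lower.find("<script", pos)
        if start == -1:
            parts.append(text[pos:])  # no more script tags: keep the tail
            break
        parts.append(text[pos:start])
        close_start = lower.find("</script", start)
        if close_start == -1:
            # Unterminated block: neutralize the tag, keep the content.
            parts.append("&lt;" + text[start + 1:])
            break
        close_end = lower.find(">", close_start)
        if close_end == -1:
            # "</script" never closes: treat the rest as part of the block.
            break
        pos = close_end + 1
    return "".join(parts)


print(remove_script_blocks("keep <script>alert(1) and more"))
```

A single pass like this can still splice a new tag together from surrounding fragments (for example `<scr<script></script>ipt>`), so the result should still go through tag stripping or escaping afterwards.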


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9337f5b6-4654-4093-96b3-323cd6f290e0

📥 Commits

Reviewing files that changed from the base of the PR and between 5ecef85 and 01d6d20.

📒 Files selected for processing (1)

backend/sanitization.py

- Expand sanitize_ticket_data to include subcategory, assigned_team, ocr_text
- Align CSP with actual assets (Tailwind CDN, Google Fonts)
- Fix unterminated <script> tag truncating remaining content

zeroknowledge0x · 2026-05-30T16:23:20Z

Hi! 👋 I've addressed the CodeRabbit and CodeQL review comments:

Changes made:

Expanded sanitize_ticket_data fields: added subcategory, assigned_team, and ocr_text to the default sanitization list. These are user-controlled string fields in TicketSaveRequest that were previously unsanitized.
Aligned CSP with actual assets — Updated Content-Security-Policy to allow:
- cdn.tailwindcss.com in script-src and style-src
- fonts.googleapis.com in style-src
- fonts.gstatic.com in font-src
The root page already loads these resources but the CSP was blocking them.
Fixed unterminated <script> truncation — Previously, if input contained <script without a matching </script>, the function would drop everything from that point onward (corrupting stored ticket text). Now it escapes the <script tag and preserves the remaining content.
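
As an illustration of the CSP alignment described in this list, a header helper might be shaped like the sketch below. The header names and allowed hosts come from the review discussion; the specific header values, the `default-src` directive, and the framework-agnostic `add_security_headers` signature are assumptions, not the PR's actual code.

```python
def get_security_headers():
    """Return the security headers as a plain dict (values are assumed)."""
    csp = "; ".join([
        "default-src 'self'",
        "script-src 'self' 'unsafe-inline' https://cdn.tailwindcss.com",
        "style-src 'self' 'unsafe-inline' https://cdn.tailwindcss.com "
        "https://fonts.googleapis.com",
        "font-src 'self' data: https://fonts.gstatic.com",
    ])
    return {
        "Content-Security-Policy": csp,
        "X-Content-Type-Options": "nosniff",
        "X-Frame-Options": "DENY",
        "X-XSS-Protection": "1; mode=block",
        "Referrer-Policy": "strict-origin-when-cross-origin",
    }


def add_security_headers(response_headers):
    """Merge the security headers into a copy of a response's headers."""
    merged = dict(response_headers)
    for name, value in get_security_headers().items():
        # setdefault keeps any header a route has already set explicitly.
        merged.setdefault(name, value)
    return merged


headers = add_security_headers({"Content-Type": "application/json"})
print(headers["Content-Security-Policy"])
```

In a real middleware the same dict would be copied onto the framework's response object; the merge is written against a plain dict here only to keep the example self-contained.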

All CI checks should pass. Ready for re-review! 🙏

CodeRabbit flagged that HTMLParser with default convert_charrefs=True decodes &lt;script&gt; back to <script>, potentially reactivating XSS payloads that were already escaped. Set convert_charrefs=False to preserve escaped entities.

Addresses: ritesh-1918#739

coderabbitai

♻️ Duplicate comments (1)

backend/sanitization.py (1)

76-87: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve entity references when convert_charrefs is disabled.

On Line 80, HTMLParser(convert_charrefs=False) sends &...; / &#...; references through handle_entityref() and handle_charref(). Because this stripper only collects handle_data(), escaped user text gets dropped or mangled instead of preserved.

Suggested fix

     class _TagStripper(HTMLParser):
         def __init__(self):
             # convert_charrefs=False prevents decoding &lt;script&gt; back to
             # <script>, which would reactivate XSS payloads.
             super().__init__(convert_charrefs=False)
             self.parts = []

         def handle_data(self, data):
             self.parts.append(data)
+
+        def handle_entityref(self, name):
+            self.parts.append(f"&{name};")
+
+        def handle_charref(self, name):
+            self.parts.append(f"&#{name};")

#!/bin/bash
python - <<'PY'
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

for sample in ["Tom &amp; Jerry", "&lt;b&gt;bold&lt;/b&gt;", "&#169; 2026"]:
    p = TagStripper()
    p.feed(sample)
    print(repr(sample), "->", repr("".join(p.parts)))
PY

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/sanitization.py` around lines 76 - 87, The _TagStripper class uses
HTMLParser(convert_charrefs=False) but only implements handle_data so entity
references like &amp;, &lt;, or numeric char refs are not preserved; add
implementations for handle_entityref(self, name) and handle_charref(self, name)
in class _TagStripper to append the original entity text (e.g. f"&{name};" for
entity refs and the appropriate "&#NNN;" or "&#xHH;" form for char refs) into
self.parts, ensuring get_text()/handle_data behavior remains unchanged and
escaped user text is preserved when convert_charrefs is False.
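
For comparison with the reproduction script above, here is a self-contained version of the stripper with the two suggested handlers added. The `strip_tags` wrapper and `get_text` method names are illustrative; only `HTMLParser`, `convert_charrefs`, and the three `handle_*` hooks are standard library API.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text content, dropping tags but keeping escaped entities."""

    def __init__(self):
        # convert_charrefs=False stops the parser from decoding &lt; etc.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # Named reference such as &amp; arrives here as "amp".
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        # Numeric reference such as &#169; or &#xA9; arrives as "169" / "xA9".
        self.parts.append(f"&#{name};")

    def get_text(self):
        return "".join(self.parts)


def strip_tags(text):
    parser = TagStripper()
    parser.feed(text)
    parser.close()
    return parser.get_text()


for sample in ["Tom &amp; Jerry", "&lt;b&gt;bold&lt;/b&gt;", "<b>bold</b> &#169; 2026"]:
    print(repr(sample), "->", repr(strip_tags(sample)))
```

With the handlers in place, already-escaped input should round-trip unchanged while real tags are still removed.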


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d837dda6-ff51-467b-b732-ebbbd27c09f7

📥 Commits

Reviewing files that changed from the base of the PR and between 06ff9dc and c28e43e.

📒 Files selected for processing (1)

backend/sanitization.py

- Add status, image_url, metadata to default sanitized fields
- Addresses CodeRabbit review about sanitizing all caller-controlled strings
- Reduces stored-XSS risk through additional field coverage

ritesh-1918 · 2026-05-31T05:50:28Z

Hi @zeroknowledge0x! Thanks for the contribution. I have triaged your PR and set it to merge into the gssoc branch.

⚠️ MANDATORY GSSOC ONBOARDING STEPS:
Before your PR points are finalized on the leaderboard, you MUST complete these required steps:

⭐ Star this repository: https://github.com/ritesh-1918/HELPDESK.AI (Mandatory)
👤 Follow the Project Admin: https://github.com/ritesh-1918 (Mandatory)
💼 Connect on LinkedIn: https://www.linkedin.com/in/ritesh1908/ (Mandatory)

Welcome to the HELPDESK.AI developer family! 🚀💻

ritesh-1918 · 2026-05-31T20:01:51Z

Hi @zeroknowledge0x! 🙌

Thank you so much for your excellent contribution: "fix: add input sanitization and security headers to prevent stored XSS (Fixes #739)"! We really appreciate the high-quality code and effort you have put into the platform.

Just a quick, friendly heads-up as we prepare our manual merging and verification queues—please make sure to complete all the mandatory community steps listed below.

Once those manual steps are verified, we'll get your PR officially merged into the gssoc branch (or keep it neatly cataloged if closed as integrated) and assign it the highest possible GSSoC S-Tier labels to maximize your leaderboard points!

Let's build something amazing together! 🚀🔥

🌟 Community Support & Network Steps (Take 10 Seconds!)

As we prepare our manual verification and merging queues, please make sure you have taken a moment to complete these required steps to finalize your points:

⭐ Star this repository: https://github.com/ritesh-1918/HELPDESK.AI (Mandatory)
🍴 Fork this repository: https://github.com/ritesh-1918/HELPDESK.AI/fork (Mandatory)
👤 Follow @ritesh-1918 on GitHub: https://github.com/ritesh-1918 (Mandatory - manual step)
💼 Connect on LinkedIn: https://www.linkedin.com/in/ritesh1908/ (Mandatory)

Note: Having these steps completed manually is required before your PR points are officially cleared.

github-advanced-security AI found potential problems May 30, 2026

View reviewed changes

Comment thread backend/sanitization.py Fixed

Comment thread backend/sanitization.py Fixed

Comment thread backend/sanitization.py Fixed

github-advanced-security AI found potential problems May 30, 2026

View reviewed changes

Comment thread backend/sanitization.py Fixed

coderabbitai Bot reviewed May 30, 2026

View reviewed changes

Comment thread backend/sanitization.py

Comment thread backend/sanitization.py

coderabbitai Bot reviewed May 30, 2026

View reviewed changes

Comment thread backend/sanitization.py

Comment thread backend/sanitization.py

fix: address CodeRabbit reviews on sanitization module

06ff9dc


coderabbitai Bot reviewed May 30, 2026

View reviewed changes

fix: expand sanitize_ticket_data to cover more fields

1bf773a


ritesh-1918 changed the base branch from main to gssoc May 31, 2026 05:50

ritesh-1918 added the gssoc (GirlScript Summer of Code), gssoc:approved (GSSoC Approved PR), level:advanced (Advanced level difficulty), quality:exceptional (Exceptional code quality), and type:security (Security fix or improvement) labels May 31, 2026

ritesh-1918 merged commit c9eb1a8 into ritesh-1918:gssoc May 31, 2026
9 of 10 checks passed

ritesh-1918 added the level:intermediate (Intermediate level difficulty) label May 31, 2026


3 participants
