fix: add input sanitization and security headers to prevent stored XSS (Fixes #739)#769
Conversation
- Add backend/sanitization.py with sanitize_text, sanitize_ticket_data, get_security_headers - Apply sanitize_text to all ticket analysis endpoints - Apply sanitize_ticket_data to save_ticket endpoint - Add security headers middleware (X-Content-Type-Options, X-Frame-Options, etc.) Fixes ritesh-1918#739
|
@zeroknowledge0x is attempting to deploy a commit to the ritesh Team on Vercel. A member of the Team first needs to authorize it. |
|
Warning Review limit reached
More reviews will be available in 11 minutes and 58 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. 📝 WalkthroughWalkthroughAdds a sanitization module (text stripping, script/html removers, ticket-field sanitizer, security headers) and wires it into HTTP middleware plus ticket save and AI-analysis endpoints to sanitize inputs and attach CSP and related headers. ChangesSecurity Hardening for Stored XSS
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
- Fix script tag regex to properly match end tags with whitespace (</script\t\n>) - Use non-backtracking patterns to prevent polynomial ReDoS - Simplify HTML tag regex to avoid uncontrolled data backtracking Addresses CodeQL scanning alerts ritesh-1918#25, ritesh-1918#26, ritesh-1918#27
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/main.py (1)
751-775:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
image_textis still echoed back unsanitized.Only
request_body.textis cleaned here.request_body.image_textis copied intogemini_analysis["ocr_text"]and returned to the client untouched, so attacker-controlled OCR text can still flow straight to the dashboard after this PR.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/main.py` around lines 751 - 775, The OCR text from request_body.image_text is being echoed into gemini_analysis["ocr_text"] without sanitization; update the assignment to pass request_body.image_text through the same sanitize_text function (or equivalent sanitizer used for request_body.text) and default to an empty string if falsy so that gemini_analysis["ocr_text"] = sanitize_text(request_body.image_text) or "" (ensure you call the existing sanitize_text helper used earlier in this file and keep the same behavior for None/empty inputs).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/sanitization.py`:
- Around line 77-101: sanitize_ticket_data currently only sanitizes a small
top-level allowlist, letting many caller-controlled strings from
TicketSaveRequest (e.g., subcategory, assigned_team, status, ocr_text, image_url
and string values nested in metadata, entities, solution_steps) reach storage;
update sanitize_ticket_data to recursively sanitize all string values in the
provided dict (and any nested dicts/lists) by walking the structure and applying
sanitize_text to every str, or accept an explicit exhaustive fields list from
save_ticket and sanitize those keys before persisting; ensure you reuse
sanitize_text and keep the function name sanitize_ticket_data so callers (e.g.,
save_ticket) still work.
- Around line 111-119: The CSP in backend/sanitization.py currently blocks
external assets used by the root page; update the "Content-Security-Policy"
header to allow https://cdn.tailwindcss.com in script-src and style-src, allow
https://fonts.googleapis.com in style-src, and allow https://fonts.gstatic.com
(and https:) in font-src (retain data: where needed), so the root page in
backend/main.py can load Tailwind and Google Fonts without being blocked; modify
the header value string accordingly and keep existing self, unsafe-inline/eval
tokens as appropriate.
---
Outside diff comments:
In `@backend/main.py`:
- Around line 751-775: The OCR text from request_body.image_text is being echoed
into gemini_analysis["ocr_text"] without sanitization; update the assignment to
pass request_body.image_text through the same sanitize_text function (or
equivalent sanitizer used for request_body.text) and default to an empty string
if falsy so that gemini_analysis["ocr_text"] =
sanitize_text(request_body.image_text) or "" (ensure you call the existing
sanitize_text helper used earlier in this file and keep the same behavior for
None/empty inputs).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5fc7d5a4-077c-41f9-b521-d00a4b0d3a14
📒 Files selected for processing (2)
backend/main.pybackend/sanitization.py
Replace regex-based HTML/script tag removal with safe alternatives: - Script tags: case-insensitive string operations (str.find) - HTML tags: Python's html.parser.HTMLParser (no regex backtracking) This addresses CodeQL findings: - Bad HTML filtering regexp (script tag pattern) - Polynomial regular expression on uncontrolled data All remaining regex patterns use negated character classes (no backtracking).
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/sanitization.py`:
- Around line 49-59: The current loop breaks when no matching "</script" is
found, which causes everything after the opening "<script" to be dropped; change
the logic so that if close_start == -1 you instead try to find the end of the
opening tag (search for ">" after tag_end in text) and remove only the opening
tag if that ">" exists (set close_end accordingly and continue), and if there is
no ">" treat the remaining text as literal (append the rest unchanged and exit
the loop) so you never silently truncate the rest of the input; update the
handling around close_pattern, close_start, close_end, lower, tag_end and text
to implement this flow.
- Around line 74-88: The HTMLParser used in _TagStripper (in _remove_html_tags)
is decoding escaped entities which can reintroduce executable HTML; update
_TagStripper to instantiate HTMLParser with convert_charrefs=False and implement
handlers to preserve entities (override handle_entityref and handle_charref to
append the original entity text to self.parts) while keeping handle_data as-is,
then feed(text) and return get_text() so existing escaped entities remain
escaped.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
- Expand sanitize_ticket_data to include subcategory, assigned_team, ocr_text - Align CSP with actual assets (Tailwind CDN, Google Fonts) - Fix unterminated <script> tag truncating remaining content
|
Hi! 👋 I've addressed the CodeRabbit and CodeQL review comments: Changes made:
All CI checks should pass. Ready for re-review! 🙏 |
CodeRabbit flagged that HTMLParser with default convert_charrefs=True decodes <script> back to <script>, potentially reactivating XSS payloads that were already escaped. Set convert_charrefs=False to preserve escaped entities. Addresses: ritesh-1918#739
There was a problem hiding this comment.
♻️ Duplicate comments (1)
backend/sanitization.py (1)
76-87:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve entity references when
convert_charrefsis disabled.On Line 80,
HTMLParser(convert_charrefs=False)sends&/&#...;throughhandle_entityref()andhandle_charref(). Because this stripper only collectshandle_data(), escaped user text gets dropped or mangled instead of preserved.Suggested fix
class _TagStripper(HTMLParser): def __init__(self): # convert_charrefs=False prevents decoding <script> back to # <script>, which would reactivate XSS payloads. super().__init__(convert_charrefs=False) self.parts = [] def handle_data(self, data): self.parts.append(data) + + def handle_entityref(self, name): + self.parts.append(f"&{name};") + + def handle_charref(self, name): + self.parts.append(f"&#{name};")#!/bin/bash python - <<'PY' from html.parser import HTMLParser class TagStripper(HTMLParser): def __init__(self): super().__init__(convert_charrefs=False) self.parts = [] def handle_data(self, data): self.parts.append(data) for sample in ["Tom & Jerry", "<b>bold</b>", "&`#169`; 2026"]: p = TagStripper() p.feed(sample) print(repr(sample), "->", repr("".join(p.parts))) PY🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/sanitization.py` around lines 76 - 87, The _TagStripper class uses HTMLParser(convert_charrefs=False) but only implements handle_data so entity references like &, <, or numeric char refs are not preserved; add implementations for handle_entityref(self, name) and handle_charref(self, name) in class _TagStripper to append the original entity text (e.g. f"&{name};" for entity refs and the appropriate "&`#NNN`;" or "&`#xHH`;" form for char refs) into self.parts, ensuring get_text()/handle_data behavior remains unchanged and escaped user text is preserved when convert_charrefs is False.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@backend/sanitization.py`:
- Around line 76-87: The _TagStripper class uses
HTMLParser(convert_charrefs=False) but only implements handle_data so entity
references like &, <, or numeric char refs are not preserved; add
implementations for handle_entityref(self, name) and handle_charref(self, name)
in class _TagStripper to append the original entity text (e.g. f"&{name};" for
entity refs and the appropriate "&`#NNN`;" or "&`#xHH`;" form for char refs) into
self.parts, ensuring get_text()/handle_data behavior remains unchanged and
escaped user text is preserved when convert_charrefs is False.
- Add status, image_url, metadata to default sanitized fields - Addresses CodeRabbit review about sanitizing all caller-controlled strings - Reduces stored-XSS risk through additional field coverage
|
Hi @zeroknowledge0x! Thanks for the contribution. I have triaged your PR and set it to merge into the
Welcome to the HELPDESK.AI developer family! 🚀💻 |
|
Hi @zeroknowledge0x! 🙌 Thank you so much for your excellent contribution: "fix: add input sanitization and security headers to prevent stored XSS (Fixes #739)"! We really appreciate the high-quality code and effort you have put into the platform. Just a quick, friendly heads-up as we prepare our manual merging and verification queues—please make sure to complete all the mandatory community steps listed below. Once those manual steps are verified, we'll get your PR officially merged into the Let's build something amazing together! 🚀🔥 🌟 Community Support & Network Steps (Take 10 Seconds!)As we prepare our manual verification and merging queues, please make sure you have taken a moment to complete these required steps to finalize your points:
Note: Having these steps completed manually is required before your PR points are officially cleared. |
Summary
Adds server-side input sanitization and security headers to prevent stored XSS attacks on user-generated support ticket content.
Changes
backend/sanitization.pywithsanitize_text,sanitize_ticket_data,get_security_headersfunctionssanitize_textto all ticket analysis endpoints (/ai/analyze_ticket,/ai/analyze,/ai/analyze-v2)sanitize_ticket_datato thesave_ticketendpointSecurity Details
Testing
Related Issues
Fixes #739
Summary by CodeRabbit