Skip to content

fix: make MCP JSON output UTF-8 safe#666

Open
badMade wants to merge 5 commits into
mainfrom
badmade/propose-fix-for-mcp-unicode-vulnerability
Open

fix: make MCP JSON output UTF-8 safe#666
badMade wants to merge 5 commits into
mainfrom
badmade/propose-fix-for-mcp-unicode-vulnerability

Conversation

@badMade
Copy link
Copy Markdown
Owner

@badMade badMade commented May 31, 2026

Motivation

  • Prevent MCP-controlled data serialized with ensure_ascii=False from embedding raw lone-surrogate code points (e.g. U+D800) that can cause UnicodeEncodeError and crash agent runs.
  • Restore availability when interacting with potentially hostile or buggy MCP servers by ensuring all MCP-to-agent JSON is UTF-8 encodable.

Description

  • Added UTF-8-safe serialization helpers: _utf8_safe_text, _sanitize_json_value, and _mcp_json_dumps in tools/mcp_tool.py to escape lone surrogates while preserving valid non-ASCII text.
  • Routed MCP sampling tool-call arguments and all MCP tool/resource/prompt JSON responses through _mcp_json_dumps (replacing direct json.dumps(..., ensure_ascii=False) usage) and used _utf8_safe_text for non-dict tool inputs.
  • Added regression tests tests/tools/test_mcp_surrogate_json.py that verify CJK characters are preserved, lone surrogates are escaped (not present raw), MCP tool handler output is UTF-8 encodable, and sampling arguments are safe.

Testing

  • Compiled the modified modules with python -m py_compile tools/mcp_tool.py tests/tools/test_mcp_surrogate_json.py and the compile step succeeded.
  • Ran the focused test suite pytest -q -o addopts="-m 'not integration'" tests/tools/test_mcp_surrogate_json.py and all tests passed (3 passed).
  • Ran static checks with ruff against the touched files and they passed.
  • Attempted broader stability/adjacent tests; unrelated environment issues (pytest xdist config, test live-system guard and missing yaml dependency) caused failures outside the scope of the UTF-8 fix and did not indicate regressions in the changes made.

Codex Task

Copilot AI review requested due to automatic review settings May 31, 2026 22:31
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces UTF-8-safe JSON serialization for MCP-controlled data by adding helper functions (_utf8_safe_text, _sanitize_json_value, and _mcp_json_dumps) to escape lone surrogates while preserving valid Unicode, along with corresponding regression tests. Feedback suggests optimizing performance by implementing fast-path checks: first, in _utf8_safe_text to avoid redundant encoding/decoding for already valid UTF-8 strings, and second, in _mcp_json_dumps to attempt standard serialization first and only fall back to recursive sanitization upon encountering a UnicodeEncodeError.

Comment thread tools/mcp_tool.py
Comment on lines +95 to +97
def _utf8_safe_text(text: str) -> str:
"""Return text that can always be encoded as UTF-8."""
return text.encode("utf-8", errors="backslashreplace").decode("utf-8")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To avoid unnecessary string allocations and encoding/decoding overhead for valid UTF-8 strings (which represent the vast majority of cases), we can perform a fast-path check using a try...except UnicodeEncodeError block. If the string is already valid UTF-8, we can return it directly.

Suggested change
def _utf8_safe_text(text: str) -> str:
"""Return text that can always be encoded as UTF-8."""
return text.encode("utf-8", errors="backslashreplace").decode("utf-8")
def _utf8_safe_text(text: str) -> str:
"""Return text that can always be encoded as UTF-8."""
try:
text.encode("utf-8")
return text
except UnicodeEncodeError:
return text.encode("utf-8", errors="backslashreplace").decode("utf-8")

Comment thread tools/mcp_tool.py
Comment on lines +114 to +117
def _mcp_json_dumps(value: Any, **kwargs: Any) -> str:
"""Serialize MCP-controlled data without emitting invalid UTF-8 text."""
kwargs.setdefault("ensure_ascii", False)
return json.dumps(_sanitize_json_value(value), **kwargs)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Recursively traversing and copying the entire JSON object structure in _sanitize_json_value on every serialization call introduces significant CPU and memory overhead, especially for large payloads (e.g., file contents or resource lists). Since lone surrogates are extremely rare, we can optimize this by attempting a fast-path json.dumps first, and only falling back to sanitization if encoding the resulting payload fails with a UnicodeEncodeError.

Suggested change
def _mcp_json_dumps(value: Any, **kwargs: Any) -> str:
"""Serialize MCP-controlled data without emitting invalid UTF-8 text."""
kwargs.setdefault("ensure_ascii", False)
return json.dumps(_sanitize_json_value(value), **kwargs)
def _mcp_json_dumps(value: Any, **kwargs: Any) -> str:
"""Serialize MCP-controlled data without emitting invalid UTF-8 text."""
kwargs.setdefault("ensure_ascii", False)
payload = json.dumps(value, **kwargs)
try:
payload.encode("utf-8")
return payload
except UnicodeEncodeError:
return json.dumps(_sanitize_json_value(value), **kwargs)

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 31, 2026

🔎 Lint report: badmade/propose-fix-for-mcp-unicode-vulnerability vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 8256 on HEAD, 8258 on base (✅ -2)

🆕 New issues (53):

Rule Count
invalid-argument-type 40
unresolved-attribute 9
unsupported-operator 4
First entries
run_agent.py:13036: [invalid-argument-type] invalid-argument-type: Argument to function `save_context_length` is incorrect: Expected `str`, found `str | Unknown | Divergent | ... omitted 3 union elements`
run_agent.py:8914: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `Divergent | Unknown | str | ... omitted 3 union elements`
run_agent.py:8915: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_request_timeout` is incorrect: Expected `str`, found `Divergent | Unknown | str | ... omitted 3 union elements`
run_agent.py:8834: [invalid-argument-type] invalid-argument-type: Argument to bound method `ContextCompressor.update_model` is incorrect: Expected `str`, found `Divergent | Unknown | str | ... omitted 3 union elements`
tests/agent/test_codex_cloudflare_headers.py:163: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str & ~AlwaysFalsy`, `int & ~AlwaysFalsy` in union `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 4 union elements`
gateway/platforms/qqbot/adapter.py:1139: [unresolved-attribute] unresolved-attribute: Object of type `Self@_handle_c2c_message` has no attribute `gateway_runner`
run_agent.py:2633: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 5 union elements`
run_agent.py:3397: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_stale_timeout` is incorrect: Expected `str | None`, found `str | Unknown | Divergent | ... omitted 3 union elements`
run_agent.py:13065: [invalid-argument-type] invalid-argument-type: Argument to function `estimate_usage_cost` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown, Unknown] | ... omitted 3 union elements`
run_agent.py:8915: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_request_timeout` is incorrect: Expected `str | None`, found `Divergent | Unknown | str | ... omitted 3 union elements`
run_agent.py:13568: [invalid-argument-type] invalid-argument-type: Argument to function `len` is incorrect: Expected `Sized`, found `(str & ~AlwaysFalsy) | (dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 5 union elements`
run_agent.py:11477: [invalid-argument-type] invalid-argument-type: Argument to function `_fixed_temperature_for_model` is incorrect: Expected `str | None`, found `str | Unknown | Divergent | ... omitted 3 union elements`
run_agent.py:7757: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_request_timeout` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown, Unknown] | ... omitted 3 union elements`
run_agent.py:8831: [invalid-argument-type] invalid-argument-type: Argument to bound method `ContextCompressor.update_model` is incorrect: Expected `int`, found `Divergent | Unknown | str | ... omitted 3 union elements`
run_agent.py:14022: [invalid-argument-type] invalid-argument-type: Argument to bound method `ContextCompressor.update_model` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown, Unknown] | ... omitted 3 union elements`
run_agent.py:5845: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["/"]` and `str | Unknown | Divergent | ... omitted 3 union elements`
run_agent.py:9108: [invalid-argument-type] invalid-argument-type: Argument to function `get_transport` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown, Unknown] | ... omitted 3 union elements`
run_agent.py:13799: [invalid-argument-type] invalid-argument-type: Argument to function `_pool_may_recover_from_rate_limit` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown, Unknown] | ... omitted 3 union elements`
run_agent.py:9529: [invalid-argument-type] invalid-argument-type: Argument to function `_get_anthropic_max_output` is incorrect: Expected `str`, found `str | Unknown | Divergent | ... omitted 3 union elements`
run_agent.py:13063: [invalid-argument-type] invalid-argument-type: Argument to function `estimate_usage_cost` is incorrect: Expected `str`, found `str | Unknown | Divergent | ... omitted 3 union elements`
run_agent.py:11548: [unresolved-attribute] unresolved-attribute: Attribute `strip` is not defined on `dict[Unknown, Unknown] & ~AlwaysFalsy`, `int & ~AlwaysFalsy`, `dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy` in union `(str & ~AlwaysFalsy) | (Unknown & ~AlwaysFalsy) | (dict[Unknown, Unknown] & ~AlwaysFalsy) | ... omitted 4 union elements`
run_agent.py:13018: [invalid-argument-type] invalid-argument-type: Argument to function `normalize_usage` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown, Unknown] | ... omitted 3 union elements`
tests/run_agent/test_provider_attribution_headers.py:90: [unresolved-attribute] unresolved-attribute: Attribute `startswith` is not defined on `dict[str, str]` in union `Unknown | str | Divergent | dict[str, str]`
run_agent.py:4864: [invalid-argument-type] invalid-argument-type: Argument to function `save_trajectory` is incorrect: Expected `str`, found `str | Unknown | Divergent | ... omitted 3 union elements`
run_agent.py:5417: [invalid-argument-type] invalid-argument-type: Argument to function `parse_rate_limit_headers` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown, Unknown] | ... omitted 3 union elements`
... and 28 more

✅ Fixed issues (46):

Rule Count
invalid-argument-type 35
unresolved-attribute 6
unsupported-operator 4
unresolved-reference 1
First entries
tests/run_agent/test_provider_attribution_headers.py:90: [unresolved-attribute] unresolved-attribute: Attribute `startswith` is not defined on `dict[str, str]` in union `Unknown | str | dict[str, str]`
gateway/run.py:5544: [unresolved-reference] unresolved-reference: Name `team_id` used when not defined
tests/run_agent/test_provider_attribution_headers.py:156: [unsupported-operator] unsupported-operator: Operator `not in` is not supported between objects of type `Literal["X-OpenRouter-Cache-TTL"]` and `Unknown | str | dict[str, str] | ... omitted 3 union elements`
run_agent.py:12322: [invalid-argument-type] invalid-argument-type: Argument to function `apply_anthropic_cache_control` is incorrect: Expected `bool`, found `int | str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | dict[Unknown, Unknown]`
run_agent.py:9529: [invalid-argument-type] invalid-argument-type: Argument to function `_get_anthropic_max_output` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:5845: [unresolved-attribute] unresolved-attribute: Attribute `split` is not defined on `dict[Unknown | str, Unknown | str | dict[str, str]]`, `int`, `dict[Unknown, Unknown]` in union `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:13102: [invalid-argument-type] invalid-argument-type: Argument to bound method `SessionDB.update_token_counts` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:8914: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:3397: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_stale_timeout` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:2377: [invalid-argument-type] invalid-argument-type: Argument to function `query_ollama_num_ctx` is incorrect: Expected `str`, found `(str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 4 union elements`
run_agent.py:11548: [unresolved-attribute] unresolved-attribute: Attribute `strip` is not defined on `dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy`, `int & ~AlwaysFalsy`, `dict[Unknown, Unknown] & ~AlwaysFalsy` in union `(str & ~AlwaysFalsy) | (Unknown & ~AlwaysFalsy) | (dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:9108: [invalid-argument-type] invalid-argument-type: Argument to function `get_transport` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:13799: [invalid-argument-type] invalid-argument-type: Argument to function `_pool_may_recover_from_rate_limit` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:13568: [invalid-argument-type] invalid-argument-type: Argument to function `len` is incorrect: Expected `Sized`, found `(str & ~AlwaysFalsy) | (dict[Unknown, Unknown] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:9546: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_profile` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
tests/agent/test_codex_cloudflare_headers.py:163: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str & ~AlwaysFalsy`, `int & ~AlwaysFalsy` in union `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:2690: [invalid-argument-type] invalid-argument-type: Argument to function `get_model_context_length` is incorrect: Expected `str`, found `str | dict[str, str] | Any | ... omitted 3 union elements`
run_agent.py:9721: [invalid-argument-type] invalid-argument-type: Argument to function `github_model_reasoning_efforts` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:7140: [invalid-argument-type] invalid-argument-type: Argument to function `_codex_cloudflare_headers` is incorrect: Expected `str`, found `Unknown | str | dict[str, str] | ... omitted 3 union elements`
run_agent.py:9694: [invalid-argument-type] invalid-argument-type: Argument to function `lmstudio_model_reasoning_options` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:9972: [unresolved-attribute] unresolved-attribute: Attribute `lower` is not defined on `dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy`, `int & ~AlwaysFalsy`, `dict[Unknown, Unknown] & ~AlwaysFalsy` in union `(str & ~AlwaysFalsy) | (Unknown & ~AlwaysFalsy) | (dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:13065: [invalid-argument-type] invalid-argument-type: Argument to function `estimate_usage_cost` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
tests/agent/test_codex_cloudflare_headers.py:163: [unresolved-attribute] unresolved-attribute: Attribute `startswith` is not defined on `dict[str, str]` in union `Unknown | str | dict[str, str]`
run_agent.py:14022: [invalid-argument-type] invalid-argument-type: Argument to bound method `ContextCompressor.update_model` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:2630: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 4 union elements`
... and 21 more

Unchanged: 4313 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes MCP JSON serialization safer when MCP-controlled data contains lone surrogate code points that would otherwise fail UTF-8 encoding.

Changes:

  • Adds MCP-specific UTF-8-safe JSON serialization helpers.
  • Routes MCP tool/resource/prompt JSON responses through the new serializer.
  • Adds regression tests for surrogate escaping and CJK preservation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
tools/mcp_tool.py Adds UTF-8-safe serialization and applies it to MCP JSON output paths.
tests/tools/test_mcp_surrogate_json.py Adds focused regression tests for surrogate-safe MCP serialization.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tools/mcp_tool.py
"function": {
"name": tu.name,
"arguments": json.dumps(tu.input, ensure_ascii=False) if isinstance(tu.input, dict) else str(tu.input),
"arguments": _mcp_json_dumps(tu.input) if isinstance(tu.input, dict) else _utf8_safe_text(str(tu.input)),
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 31, 2026

Auto-merge: review received — action required

Review activity detected (3 inline comment(s) by non-author(s); 2 comment(s)/review(s) from review bot(s): gemini-code-assist[bot], copilot-pull-request-reviewer[bot]).

To confirm you have reviewed and accepted the feedback, please add the reviewed label to this PR. Auto-merge will then proceed.

View workflow run

@badMade
Copy link
Copy Markdown
Owner Author

badMade commented May 31, 2026

@claude code review

@badMade
Copy link
Copy Markdown
Owner Author

badMade commented May 31, 2026

@jules code review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

Auto-merge: checks failing

The following checks did not pass:

  • test (failure)

Please fix the failing checks before this PR can be merged.

View workflow run

@badMade
Copy link
Copy Markdown
Owner Author

badMade commented Jun 1, 2026

@copilot, fix all failing checks.

Copy link
Copy Markdown

Copilot AI commented Jun 1, 2026

@copilot, fix all failing checks.

Fixed the failing checks in commits 4181e38 and 2feaded.

These changes:

  • define team_id safely in gateway auth (fixing the NameError failures),
  • block unresolved ${ENV_VAR} webhook secrets in signature validation,
  • skip QQ C2C attachment processing for unauthorized users while still forwarding text-only events.

I also reran the affected test set locally and re-ran validation.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

Auto-merge: no CI detected

No CI check runs were found for commit 2feaded.
Auto-merge will not proceed until CI is configured and running.

View workflow run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants