fix: make MCP JSON output UTF-8 safe#666
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces UTF-8-safe JSON serialization for MCP-controlled data by adding helper functions (_utf8_safe_text, _sanitize_json_value, and _mcp_json_dumps) to escape lone surrogates while preserving valid Unicode, along with corresponding regression tests. Feedback suggests optimizing performance by implementing fast-path checks: first, in _utf8_safe_text to avoid redundant encoding/decoding for already valid UTF-8 strings, and second, in _mcp_json_dumps to attempt standard serialization first and only fall back to recursive sanitization upon encountering a UnicodeEncodeError.
| def _utf8_safe_text(text: str) -> str: | ||
| """Return text that can always be encoded as UTF-8.""" | ||
| return text.encode("utf-8", errors="backslashreplace").decode("utf-8") |
There was a problem hiding this comment.
To avoid unnecessary string allocations and encoding/decoding overhead for valid UTF-8 strings (which represent the vast majority of cases), we can perform a fast-path check using a try...except UnicodeEncodeError block. If the string is already valid UTF-8, we can return it directly.
| def _utf8_safe_text(text: str) -> str: | |
| """Return text that can always be encoded as UTF-8.""" | |
| return text.encode("utf-8", errors="backslashreplace").decode("utf-8") | |
| def _utf8_safe_text(text: str) -> str: | |
| """Return text that can always be encoded as UTF-8.""" | |
| try: | |
| text.encode("utf-8") | |
| return text | |
| except UnicodeEncodeError: | |
| return text.encode("utf-8", errors="backslashreplace").decode("utf-8") |
| def _mcp_json_dumps(value: Any, **kwargs: Any) -> str: | ||
| """Serialize MCP-controlled data without emitting invalid UTF-8 text.""" | ||
| kwargs.setdefault("ensure_ascii", False) | ||
| return json.dumps(_sanitize_json_value(value), **kwargs) |
There was a problem hiding this comment.
Recursively traversing and copying the entire JSON object structure in _sanitize_json_value on every serialization call introduces significant CPU and memory overhead, especially for large payloads (e.g., file contents or resource lists). Since lone surrogates are extremely rare, we can optimize this by attempting a fast-path json.dumps first, and only falling back to sanitization if encoding the resulting payload fails with a UnicodeEncodeError.
| def _mcp_json_dumps(value: Any, **kwargs: Any) -> str: | |
| """Serialize MCP-controlled data without emitting invalid UTF-8 text.""" | |
| kwargs.setdefault("ensure_ascii", False) | |
| return json.dumps(_sanitize_json_value(value), **kwargs) | |
| def _mcp_json_dumps(value: Any, **kwargs: Any) -> str: | |
| """Serialize MCP-controlled data without emitting invalid UTF-8 text.""" | |
| kwargs.setdefault("ensure_ascii", False) | |
| payload = json.dumps(value, **kwargs) | |
| try: | |
| payload.encode("utf-8") | |
| return payload | |
| except UnicodeEncodeError: | |
| return json.dumps(_sanitize_json_value(value), **kwargs) |
🔎 Lint report:
|
| Rule | Count |
|---|---|
invalid-argument-type |
40 |
unresolved-attribute |
9 |
unsupported-operator |
4 |
First entries
run_agent.py:13036: [invalid-argument-type] invalid-argument-type: Argument to function `save_context_length` is incorrect: Expected `str`, found `str | Unknown | Divergent | ... omitted 3 union elements`
run_agent.py:8914: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `Divergent | Unknown | str | ... omitted 3 union elements`
run_agent.py:8915: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_request_timeout` is incorrect: Expected `str`, found `Divergent | Unknown | str | ... omitted 3 union elements`
run_agent.py:8834: [invalid-argument-type] invalid-argument-type: Argument to bound method `ContextCompressor.update_model` is incorrect: Expected `str`, found `Divergent | Unknown | str | ... omitted 3 union elements`
tests/agent/test_codex_cloudflare_headers.py:163: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str & ~AlwaysFalsy`, `int & ~AlwaysFalsy` in union `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 4 union elements`
gateway/platforms/qqbot/adapter.py:1139: [unresolved-attribute] unresolved-attribute: Object of type `Self@_handle_c2c_message` has no attribute `gateway_runner`
run_agent.py:2633: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 5 union elements`
run_agent.py:3397: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_stale_timeout` is incorrect: Expected `str | None`, found `str | Unknown | Divergent | ... omitted 3 union elements`
run_agent.py:13065: [invalid-argument-type] invalid-argument-type: Argument to function `estimate_usage_cost` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown, Unknown] | ... omitted 3 union elements`
run_agent.py:8915: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_request_timeout` is incorrect: Expected `str | None`, found `Divergent | Unknown | str | ... omitted 3 union elements`
run_agent.py:13568: [invalid-argument-type] invalid-argument-type: Argument to function `len` is incorrect: Expected `Sized`, found `(str & ~AlwaysFalsy) | (dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 5 union elements`
run_agent.py:11477: [invalid-argument-type] invalid-argument-type: Argument to function `_fixed_temperature_for_model` is incorrect: Expected `str | None`, found `str | Unknown | Divergent | ... omitted 3 union elements`
run_agent.py:7757: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_request_timeout` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown, Unknown] | ... omitted 3 union elements`
run_agent.py:8831: [invalid-argument-type] invalid-argument-type: Argument to bound method `ContextCompressor.update_model` is incorrect: Expected `int`, found `Divergent | Unknown | str | ... omitted 3 union elements`
run_agent.py:14022: [invalid-argument-type] invalid-argument-type: Argument to bound method `ContextCompressor.update_model` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown, Unknown] | ... omitted 3 union elements`
run_agent.py:5845: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["/"]` and `str | Unknown | Divergent | ... omitted 3 union elements`
run_agent.py:9108: [invalid-argument-type] invalid-argument-type: Argument to function `get_transport` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown, Unknown] | ... omitted 3 union elements`
run_agent.py:13799: [invalid-argument-type] invalid-argument-type: Argument to function `_pool_may_recover_from_rate_limit` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown, Unknown] | ... omitted 3 union elements`
run_agent.py:9529: [invalid-argument-type] invalid-argument-type: Argument to function `_get_anthropic_max_output` is incorrect: Expected `str`, found `str | Unknown | Divergent | ... omitted 3 union elements`
run_agent.py:13063: [invalid-argument-type] invalid-argument-type: Argument to function `estimate_usage_cost` is incorrect: Expected `str`, found `str | Unknown | Divergent | ... omitted 3 union elements`
run_agent.py:11548: [unresolved-attribute] unresolved-attribute: Attribute `strip` is not defined on `dict[Unknown, Unknown] & ~AlwaysFalsy`, `int & ~AlwaysFalsy`, `dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy` in union `(str & ~AlwaysFalsy) | (Unknown & ~AlwaysFalsy) | (dict[Unknown, Unknown] & ~AlwaysFalsy) | ... omitted 4 union elements`
run_agent.py:13018: [invalid-argument-type] invalid-argument-type: Argument to function `normalize_usage` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown, Unknown] | ... omitted 3 union elements`
tests/run_agent/test_provider_attribution_headers.py:90: [unresolved-attribute] unresolved-attribute: Attribute `startswith` is not defined on `dict[str, str]` in union `Unknown | str | Divergent | dict[str, str]`
run_agent.py:4864: [invalid-argument-type] invalid-argument-type: Argument to function `save_trajectory` is incorrect: Expected `str`, found `str | Unknown | Divergent | ... omitted 3 union elements`
run_agent.py:5417: [invalid-argument-type] invalid-argument-type: Argument to function `parse_rate_limit_headers` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown, Unknown] | ... omitted 3 union elements`
... and 28 more
✅ Fixed issues (46):
| Rule | Count |
|---|---|
invalid-argument-type |
35 |
unresolved-attribute |
6 |
unsupported-operator |
4 |
unresolved-reference |
1 |
First entries
tests/run_agent/test_provider_attribution_headers.py:90: [unresolved-attribute] unresolved-attribute: Attribute `startswith` is not defined on `dict[str, str]` in union `Unknown | str | dict[str, str]`
gateway/run.py:5544: [unresolved-reference] unresolved-reference: Name `team_id` used when not defined
tests/run_agent/test_provider_attribution_headers.py:156: [unsupported-operator] unsupported-operator: Operator `not in` is not supported between objects of type `Literal["X-OpenRouter-Cache-TTL"]` and `Unknown | str | dict[str, str] | ... omitted 3 union elements`
run_agent.py:12322: [invalid-argument-type] invalid-argument-type: Argument to function `apply_anthropic_cache_control` is incorrect: Expected `bool`, found `int | str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | dict[Unknown, Unknown]`
run_agent.py:9529: [invalid-argument-type] invalid-argument-type: Argument to function `_get_anthropic_max_output` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:5845: [unresolved-attribute] unresolved-attribute: Attribute `split` is not defined on `dict[Unknown | str, Unknown | str | dict[str, str]]`, `int`, `dict[Unknown, Unknown]` in union `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:13102: [invalid-argument-type] invalid-argument-type: Argument to bound method `SessionDB.update_token_counts` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:8914: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:3397: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_stale_timeout` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:2377: [invalid-argument-type] invalid-argument-type: Argument to function `query_ollama_num_ctx` is incorrect: Expected `str`, found `(str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 4 union elements`
run_agent.py:11548: [unresolved-attribute] unresolved-attribute: Attribute `strip` is not defined on `dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy`, `int & ~AlwaysFalsy`, `dict[Unknown, Unknown] & ~AlwaysFalsy` in union `(str & ~AlwaysFalsy) | (Unknown & ~AlwaysFalsy) | (dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:9108: [invalid-argument-type] invalid-argument-type: Argument to function `get_transport` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:13799: [invalid-argument-type] invalid-argument-type: Argument to function `_pool_may_recover_from_rate_limit` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:13568: [invalid-argument-type] invalid-argument-type: Argument to function `len` is incorrect: Expected `Sized`, found `(str & ~AlwaysFalsy) | (dict[Unknown, Unknown] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:9546: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_profile` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
tests/agent/test_codex_cloudflare_headers.py:163: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str & ~AlwaysFalsy`, `int & ~AlwaysFalsy` in union `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:2690: [invalid-argument-type] invalid-argument-type: Argument to function `get_model_context_length` is incorrect: Expected `str`, found `str | dict[str, str] | Any | ... omitted 3 union elements`
run_agent.py:9721: [invalid-argument-type] invalid-argument-type: Argument to function `github_model_reasoning_efforts` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:7140: [invalid-argument-type] invalid-argument-type: Argument to function `_codex_cloudflare_headers` is incorrect: Expected `str`, found `Unknown | str | dict[str, str] | ... omitted 3 union elements`
run_agent.py:9694: [invalid-argument-type] invalid-argument-type: Argument to function `lmstudio_model_reasoning_options` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:9972: [unresolved-attribute] unresolved-attribute: Attribute `lower` is not defined on `dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy`, `int & ~AlwaysFalsy`, `dict[Unknown, Unknown] & ~AlwaysFalsy` in union `(str & ~AlwaysFalsy) | (Unknown & ~AlwaysFalsy) | (dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:13065: [invalid-argument-type] invalid-argument-type: Argument to function `estimate_usage_cost` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
tests/agent/test_codex_cloudflare_headers.py:163: [unresolved-attribute] unresolved-attribute: Attribute `startswith` is not defined on `dict[str, str]` in union `Unknown | str | dict[str, str]`
run_agent.py:14022: [invalid-argument-type] invalid-argument-type: Argument to bound method `ContextCompressor.update_model` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:2630: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 4 union elements`
... and 21 more
Unchanged: 4313 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
There was a problem hiding this comment.
Pull request overview
This PR makes MCP JSON serialization safer when MCP-controlled data contains lone surrogate code points that would otherwise fail UTF-8 encoding.
Changes:
- Adds MCP-specific UTF-8-safe JSON serialization helpers.
- Routes MCP tool/resource/prompt JSON responses through the new serializer.
- Adds regression tests for surrogate escaping and CJK preservation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
tools/mcp_tool.py |
Adds UTF-8-safe serialization and applies it to MCP JSON output paths. |
tests/tools/test_mcp_surrogate_json.py |
Adds focused regression tests for surrogate-safe MCP serialization. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "function": { | ||
| "name": tu.name, | ||
| "arguments": json.dumps(tu.input, ensure_ascii=False) if isinstance(tu.input, dict) else str(tu.input), | ||
| "arguments": _mcp_json_dumps(tu.input) if isinstance(tu.input, dict) else _utf8_safe_text(str(tu.input)), |
Auto-merge: review received — action requiredReview activity detected (3 inline comment(s) by non-author(s); 2 comment(s)/review(s) from review bot(s): gemini-code-assist[bot], copilot-pull-request-reviewer[bot]). To confirm you have reviewed and accepted the feedback, please add the |
|
@claude code review |
|
@jules code review |
Auto-merge: checks failingThe following checks did not pass:
Please fix the failing checks before this PR can be merged. |
|
@copilot, fix all failing checks. |
Fixed the failing checks in commits 4181e38 and 2feaded. These changes:
I also reran the affected test set locally and re-ran validation. |
Auto-merge: no CI detectedNo CI check runs were found for commit |
Motivation
ensure_ascii=Falsefrom embedding raw lone-surrogate code points (e.g. U+D800) that can causeUnicodeEncodeErrorand crash agent runs.Description
_utf8_safe_text,_sanitize_json_value, and_mcp_json_dumpsintools/mcp_tool.pyto escape lone surrogates while preserving valid non-ASCII text.argumentsand all MCP tool/resource/prompt JSON responses through_mcp_json_dumps(replacing directjson.dumps(..., ensure_ascii=False)usage) and used_utf8_safe_textfor non-dict tool inputs.tests/tools/test_mcp_surrogate_json.pythat verify CJK characters are preserved, lone surrogates are escaped (not present raw), MCP tool handler output is UTF-8 encodable, and sampling arguments are safe.Testing
python -m py_compile tools/mcp_tool.py tests/tools/test_mcp_surrogate_json.pyand the compile step succeeded.pytest -q -o addopts="-m 'not integration'" tests/tools/test_mcp_surrogate_json.pyand all tests passed (3 passed).ruffagainst the touched files and they passed.yamldependency) caused failures outside the scope of the UTF-8 fix and did not indicate regressions in the changes made.Codex Task