InMemoryMemoryService search_memory fails for CJK/non-ASCII queries #5601

@foylaou

🔴 Required Information

Describe the Bug:
InMemoryMemoryService.search_memory() always returns 0 results when the
query consists entirely of CJK (Chinese/Japanese/Korean) characters.
The internal helper _extract_words_lower() uses a regex to tokenize text
into a word set for keyword matching, and that tokenization never yields CJK
tokens that a query and a stored event can share. If the pattern is ASCII-only,
CJK characters are dropped and a CJK-only query produces an empty set. If it is
a Unicode-aware \w+, Python 3 returns each unbroken CJK run as one long token,
which does not appear verbatim in the stored event text. Either way, every
CJK-only query produces zero matches.
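
The tokenization behaviour can be checked with the standard library alone. The sketch below compares a Unicode-aware `\w+` with an ASCII-only `[a-z]+`. Both patterns are illustrative stand-ins, and neither is confirmed here to be the exact pattern in the installed ADK version. The helper name `tokenize` is hypothetical.

```python
import re

QUERY = "你知道我叫什麼名字嗎"
STORED = "你好,我是 Foy"


def tokenize(pattern: str, text: str) -> set[str]:
    """Return the set of tokens that `pattern` finds in the lowercased text."""
    return set(re.findall(pattern, text.lower()))


# Unicode-aware \w+: Han characters count as word characters, so an unbroken
# run of them comes back as one long token.
unicode_query = tokenize(r"\w+", QUERY)
unicode_stored = tokenize(r"\w+", STORED)

# ASCII-only pattern: Han characters are dropped entirely.
ascii_query = tokenize(r"[a-z]+", QUERY)
ascii_stored = tokenize(r"[a-z]+", STORED)

# Neither tokenization gives the query and the stored text a shared token.
print(unicode_query & unicode_stored)
print(ascii_query & ascii_stored)
```

In both cases the intersection is empty, which is why a keyword-overlap search cannot match the query against the stored event.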

Steps to Reproduce:

  1. Install the package: pip install google-adk
  2. Create an InMemoryMemoryService and add a session whose events contain
    CJK text (e.g. "你好,我是 Foy", meaning "Hello, I am Foy")
  3. Call search_memory() with a CJK query (e.g. "你知道我叫什麼名字嗎",
    meaning "Do you know what my name is?")
  4. Observe that the result contains 0 memories

Expected Behavior:
search_memory() should return matching memories when the query shares
CJK characters with stored event content, consistent with how ASCII keyword
matching works for English text.

Observed Behavior:
search_memory() always returns an empty SearchMemoryResponse for any
query that contains only CJK characters, even when the stored session events
contain overlapping CJK characters.

Root cause — in in_memory_memory_service.py:

# _extract_words_lower tokenizes text with a regex. The outputs below imply an
# ASCII-only pattern; a Unicode-aware r'\w+' would instead return each unbroken
# CJK run as one token.
_extract_words_lower('你知道我叫什麼名字嗎')  # → set()
_extract_words_lower('你好,我是 Foy')        # → {'foy'}  (CJK stripped)

# The match condition therefore never fires:
if any(query_word in words_in_event for query_word in words_in_query):
    ...  # never reached for CJK-only queries
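
To make the failure concrete, the match condition can be evaluated on the token sets reported above. This is a standalone sketch: the variable names mirror the excerpt, and the set values are the ones given in the comments above, not output captured from ADK.

```python
# Token sets as reported above for the CJK-only query and the stored event.
words_in_query: set[str] = set()    # from '你知道我叫什麼名字嗎'
words_in_event: set[str] = {"foy"}  # from '你好,我是 Foy'

# any() over an empty iterable is False, so the event is never selected,
# no matter what the stored event contains.
matched = any(query_word in words_in_event for query_word in words_in_query)
print(matched)  # False
```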

Environment Details:

  • ADK Library Version: 1.32.0
  • Desktop OS: macOS
  • Python Version: 3.13.9

Model Information:

  • Are you using LiteLLM: Yes (for reproduction; issue is in the memory layer, model-independent)
  • Which model is being used: gemini-2.5-flash / Ollama gemma4

🟡 Optional Information

Regression:
N/A — not tested on earlier versions.

Logs:

[PreloadMemory] 🔍 Search query: 你知道我叫什麼名字嗎
[PreloadMemory] Result count: 0

After patching search_memory to handle CJK characters:

[PreloadMemory] 🔍 Search query: 你知道我叫什麼名字嗎
[PreloadMemory] Result count: 2
[PreloadMemory]   → 你好,我是 Foy

Minimal Reproduction Code:

import asyncio
from google.adk.memory.in_memory_memory_service import InMemoryMemoryService
from google.adk.sessions.in_memory_session_service import InMemorySessionService
from google.genai import types

async def main():
    session_service = InMemorySessionService()
    memory_service = InMemoryMemoryService()

    session = await session_service.create_session(
        app_name="test", user_id="user"
    )
    # Simulate a user event with CJK content
    session.events.append(
        type("Event", (), {
            "content": types.Content(parts=[types.Part(text="你好,我是 Foy")]),
            "author": "user",
            "timestamp": 0.0,
            "id": "evt1",
        })()
    )
    await memory_service.add_session_to_memory(session)

    result = await memory_service.search_memory(
        app_name="test", user_id="user", query="你知道我叫什麼名字嗎"
    )
    print(f"memories found: {len(result.memories)}")  # prints 0, expected >= 1

asyncio.run(main())

Suggested Fix:
Extend _extract_words_lower (or the matching logic) to also tokenize CJK
text at the individual character level:

import re

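def _extract_words_lower(text: str) -> set[str]:
    # ASCII letters and digits are matched as whole lowercase words.
    ascii_words = set(re.findall(r'[a-zA-Z0-9]+', text.lower()))
    # Han ideographs (CJK Unified Ideographs and Extension A) are matched one
    # character at a time; kana and Hangul fall outside these ranges.
    cjk_chars   = set(re.findall(r'[\u4e00-\u9fff\u3400-\u4dbf]', text))
    return ascii_words | cjk_chars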
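
A quick check of the suggested tokenizer against the two strings from the reproduction, as a sketch. The function is copied under the hypothetical name `extract_words_cjk` so the block runs without ADK installed; the match condition is the one quoted in the root-cause excerpt.

```python
import re


def extract_words_cjk(text: str) -> set[str]:
    """Copy of the suggested tokenizer, under a hypothetical name."""
    ascii_words = set(re.findall(r"[a-zA-Z0-9]+", text.lower()))
    cjk_chars = set(re.findall(r"[\u4e00-\u9fff\u3400-\u4dbf]", text))
    return ascii_words | cjk_chars


words_in_query = extract_words_cjk("你知道我叫什麼名字嗎")
words_in_event = extract_words_cjk("你好,我是 Foy")

# Characters that the query and the stored event have in common.
shared = words_in_query & words_in_event
print(sorted(shared))  # ['你', '我']

# Same condition as in the excerpt from in_memory_memory_service.py.
matched = any(word in words_in_event for word in words_in_query)
print(matched)  # True
```

Character-level matching is deliberately coarse. Here the match rests on the very common characters 你 and 我, so the same rule may also surface unrelated memories that happen to share a frequent character.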

How often has this issue occurred?:

  • Always (100%)

Labels

services: [Component] This issue is related to runtime services, e.g. sessions, memory, artifacts, etc.
