Performance: SentenceTransformer reload + Milvus reconnect compound to ~3s overhead per search in main servers #183

@JayDS22

Summary

Both server/app.py and server-https/app.py instantiate a new SentenceTransformer and open a new Milvus connection inside the milvus_search function on every search call. These two costs compound to roughly 2-3 seconds of overhead per query on CPU before the actual vector search even begins.

Issues #63 and #28 each track one half of this individually. This issue consolidates them as a compound performance problem since fixing only one still leaves significant per-request overhead.

Location

In server/app.py and server-https/app.py -- milvus_search() function:

def milvus_search(query, top_k=5):
    # Cost 1: New Milvus connection per request (~200-500ms)
    connections.connect(alias="default", host=MILVUS_HOST, port=MILVUS_PORT)
    collection = Collection(MILVUS_COLLECTION)
    collection.load()  # idempotent after first call
    
    # Cost 2: Model reload from disk per request (~2-3s on CPU)
    encoder = SentenceTransformer(EMBEDDING_MODEL)
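To confirm how the overhead splits between the two costs before changing anything, each step could be wrapped in a small timer. This is only a sketch: `timed` and `timings` are hypothetical names that do not exist in either server, and the `time.sleep` call stands in for the real connect or model-load step.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed seconds for the most recent run


@contextmanager
def timed(label):
    """Record how long the wrapped block takes under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Stand-in for the real work; in milvus_search() this would wrap
# the connect call and the model construction separately.
with timed("connect"):
    time.sleep(0.01)

print({label: round(seconds, 3) for label, seconds in timings.items()})
```

Wrapping the connection and the model load in separate `timed(...)` blocks would show which of the two dominates on a given machine.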

The Right Pattern Already Exists

kagent-feast-mcp/mcp-server/server.py already implements the correct pattern:

model: SentenceTransformer = None
client: MilvusClient = None

def _init():
    global model, client
    if model is None:
        model = SentenceTransformer(EMBEDDING_MODEL)
    if client is None:
        client = MilvusClient(uri=MILVUS_URI, ...)

The main servers should adopt this same lazy-init singleton pattern. Combined, this would reduce per-query overhead from ~3s to near zero for all requests after the first.
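A minimal sketch of how that could look in the main servers, assuming requests may arrive on more than one thread. `Lazy`, `build_encoder`, and `handle_search` are hypothetical names for illustration, and the factory here is a stand-in so the example runs without `sentence_transformers` or `pymilvus` installed. In the real code the factory would construct the `SentenceTransformer` (and a second `Lazy` would hold the Milvus connection).

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class Lazy(Generic[T]):
    """Build a resource on first use, then reuse it for every later call."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._value: Optional[T] = None
        self._lock = threading.Lock()

    def get(self) -> T:
        # Fast path: already built, no locking needed.
        if self._value is None:
            with self._lock:
                # Re-check inside the lock so concurrent first requests
                # do not each build their own copy.
                if self._value is None:
                    self._value = self._factory()
        return self._value


build_count = 0  # counts how often the expensive step actually runs


def build_encoder() -> object:
    # Stand-in for the expensive model load.
    global build_count
    build_count += 1
    return object()


encoder = Lazy(build_encoder)


def handle_search(query: str) -> object:
    # Only the first call pays the construction cost.
    return encoder.get()


first = handle_search("first query")
second = handle_search("second query")
print(build_count, first is second)
```

The lock only matters if the server handles requests concurrently. For a single-threaded server the plain `if model is None` check from `kagent-feast-mcp` is enough.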

Note: As Sinan pointed out in the Slack discussion, collection.load() is idempotent server-side in Milvus -- once loaded, the collection stays loaded across client disconnects. So there are two real per-request costs, the model reload and the connection setup/teardown; collection.load() is not a third.

PR freeze is on, so flagging this for when PRs reopen. Happy to pick this up.

Related: #63 (model reload), #28 (connection pooling), #181 (content truncation)
