Performance: SentenceTransformer reload + Milvus reconnect compound to ~3s overhead per search in main servers #183

## Summary

Both `server/app.py` and `server-https/app.py` instantiate a new `SentenceTransformer` and open a new Milvus connection inside the `milvus_search` function on every search call. Together these add roughly 2-3 seconds of overhead per query on CPU before the actual vector search even begins: the model load dominates, and connection setup adds another ~200-500ms.

Issues #63 and #28 each track one half of this individually. This issue consolidates them as a compound performance problem since fixing only one still leaves significant per-request overhead.

## Location

**In `server/app.py` and `server-https/app.py`** -- `milvus_search()` function:

```python
def milvus_search(query, top_k=5):
    # Cost 1: New Milvus connection per request (~200-500ms)
    connections.connect(alias="default", host=MILVUS_HOST, port=MILVUS_PORT)
    collection = Collection(MILVUS_COLLECTION)
    collection.load()  # idempotent after first call
    
    # Cost 2: Model reload from disk per request (~2-3s on CPU)
    encoder = SentenceTransformer(EMBEDDING_MODEL)
```
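To check the two estimates above on a given machine, each step can be timed separately. This is a minimal sketch, not code from the repo: `timed` and `timings` are hypothetical names, and the `time.sleep` calls are stand-ins for the real `SentenceTransformer(EMBEDDING_MODEL)` and `connections.connect(...)` calls so the snippet runs without either dependency.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[label] = time.perf_counter() - start


timings = {}

with timed("model_load", timings):
    time.sleep(0.01)   # stand-in for SentenceTransformer(EMBEDDING_MODEL)

with timed("milvus_connect", timings):
    time.sleep(0.005)  # stand-in for connections.connect(...)

for label, seconds in timings.items():
    print(f"{label}: {seconds * 1000:.1f} ms")
```

Wrapping the real calls inside `milvus_search` the same way would show how the per-request overhead splits between the two costs.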

## The Right Pattern Already Exists

`kagent-feast-mcp/mcp-server/server.py` already implements the correct pattern:

```python
model: SentenceTransformer = None
client: MilvusClient = None

def _init():
    global model, client
    if model is None:
        model = SentenceTransformer(EMBEDDING_MODEL)
    if client is None:
        client = MilvusClient(uri=MILVUS_URI, ...)
```

The main servers should adopt this same lazy-init singleton pattern. Combined, this would reduce per-query overhead from ~3s to near zero for all requests after the first.
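One way the main servers could adopt this is sketched below. It is an illustration under assumptions, not the repo's code: `LazySingleton` is a hypothetical helper, the lock assumes the servers may handle the first few requests concurrently, and `fake_load_model` stands in for the real model load so the snippet runs without `sentence_transformers` or Milvus installed.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazySingleton(Generic[T]):
    """Hypothetical helper: build an expensive resource on first use, then reuse it."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._value: Optional[T] = None
        self._lock = threading.Lock()

    def get(self) -> T:
        if self._value is None:
            with self._lock:
                # Re-check under the lock so concurrent first requests
                # run the factory only once.
                if self._value is None:
                    self._value = self._factory()
        return self._value


# Stand-in factory. In the servers this would be something like
#   lambda: SentenceTransformer(EMBEDDING_MODEL)
load_count = 0


def fake_load_model() -> str:
    global load_count
    load_count += 1
    return "encoder"


encoder = LazySingleton(fake_load_model)


def search(query: str) -> str:
    # Every call reuses the same instance; only the first pays the load cost.
    return f"{encoder.get()}:{query}"


results = [search(q) for q in ("a", "b", "c")]
print(load_count)  # the factory ran once for three searches
```

The same wrapper could hold the Milvus side, with a factory that calls `connections.connect(...)` and returns the `Collection`. Plain module-level globals as in `kagent-feast-mcp` work equally well when initialization is guaranteed to happen from a single thread.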

Note: As Sinan pointed out in the Slack discussion, `collection.load()` is idempotent server-side in Milvus -- once loaded, the collection stays loaded across client disconnects. So the real per-request costs are the model reload and the connection setup/teardown; the `collection.load()` call is not a third.

PR freeze is on, so flagging this for when PRs reopen. Happy to pick this up.

Related: #63 (model reload), #28 (connection pooling), #181 (content truncation)
