Full Text Search
Zeppelin supports BM25 full-text search alongside vector similarity search. Full-text search is configured per-field at namespace creation time and uses inverted indexes built during compaction.
Add full_text_search configuration when creating a namespace:
curl -X POST http://localhost:8080/v1/namespaces \
-H "Content-Type: application/json" \
-d '{
"name": "articles",
"dimensions": 384,
"distance_metric": "cosine",
"full_text_search": {
"title": {
"language": "english",
"stemming": true,
"k1": 1.5,
"b": 0.75
},
"content": {
"language": "english",
"stemming": true,
"remove_stopwords": true
}
}
}'
Then upsert vectors with text in the configured attribute fields:
curl -X POST http://localhost:8080/v1/namespaces/articles/vectors \
-H "Content-Type: application/json" \
-d '{
"vectors": [
{
"id": "article-1",
"values": [0.1, 0.2, ...],
"attributes": {
"title": "Introduction to Vector Search",
"content": "Vector search engines use embeddings to find similar documents..."
}
}
]
}'
Each field in full_text_search is an FtsFieldConfig object:
| Field | Type | Default | Description |
|---|---|---|---|
| language | string | "english" | Language for tokenization and stemming |
| stemming | boolean | true | Apply Snowball stemming (e.g., "running" → "run") |
| remove_stopwords | boolean | true | Remove common function words |
| case_sensitive | boolean | false | Preserve case during tokenization |
| k1 | float | 1.2 | BM25 term-frequency saturation (higher = more weight on TF) |
| b | float | 0.75 | BM25 length normalization (0.0 = no norm, 1.0 = full norm) |
| max_token_length | integer | 40 | Tokens exceeding this length are discarded |
An empty {} object uses all defaults.
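To make the defaulting rule concrete, here is a minimal client-side sketch in Python that mirrors the table above. The dataclass and the `resolve_field_config` helper are illustrative assumptions, not part of Zeppelin's API; only the field names and default values come from the table.

```python
from dataclasses import dataclass


@dataclass
class FtsFieldConfig:
    # Defaults copied from the options table; the class itself is hypothetical.
    language: str = "english"
    stemming: bool = True
    remove_stopwords: bool = True
    case_sensitive: bool = False
    k1: float = 1.2
    b: float = 0.75
    max_token_length: int = 40


def resolve_field_config(overrides: dict) -> FtsFieldConfig:
    """Fill every option that is not given with its documented default."""
    return FtsFieldConfig(**overrides)


# The "title" field from the namespace example overrides k1 only in effect.
title_config = resolve_field_config(
    {"language": "english", "stemming": True, "k1": 1.5, "b": 0.75}
)
# An empty object resolves to all defaults.
empty_config = resolve_field_config({})
```

Under this reading, the "title" field in the namespace example keeps remove_stopwords at its default of true even though the request does not mention it.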
Use rank_by instead of vector in the query endpoint:
curl -X POST http://localhost:8080/v1/namespaces/articles/query \
-H "Content-Type: application/json" \
-d '{
"rank_by": ["content", "BM25", "vector search engine"],
"top_k": 10
}'
The rank_by field is a JSON array expression that composes BM25 scores.
Score documents by BM25 relevance on one field:
["content", "BM25", "search query"]
Add scores from multiple fields:
["Sum", [
["title", "BM25", "vector search"],
["content", "BM25", "vector search"]
]]
Take the highest score across fields:
["Max", [
["title", "BM25", "vector search"],
["content", "BM25", "vector search"]
]]
Multiply a field's score by a weight:
["Product", 2.0, ["title", "BM25", "vector search"]]
Combine for weighted multi-field scoring:
["Sum", [
["Product", 2.0, ["title", "BM25", "vector search"]],
["content", "BM25", "vector search"]
]]
This scores documents as 2.0 * BM25(title, "vector search") + BM25(content, "vector search").
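The way these operators compose can be sketched as a small recursive evaluator in Python. This is an illustration of the expression semantics described above, not Zeppelin's implementation: `evaluate_rank_by` is a hypothetical helper, and the `bm25` callable stands in for the real per-field scorer.

```python
from typing import Callable


def evaluate_rank_by(expr: list, bm25: Callable[[str, str], float]) -> float:
    """Evaluate a rank_by expression for one document.

    `bm25(field, query)` is a stand-in that returns the document's
    BM25 score for `query` on `field`.
    """
    # Leaf: [field, "BM25", query]
    if len(expr) == 3 and expr[1] == "BM25":
        field, _, query = expr
        return bm25(field, query)
    op = expr[0]
    if op == "Sum":
        return sum(evaluate_rank_by(e, bm25) for e in expr[1])
    if op == "Max":
        return max(evaluate_rank_by(e, bm25) for e in expr[1])
    if op == "Product":
        weight, inner = expr[1], expr[2]
        return weight * evaluate_rank_by(inner, bm25)
    raise ValueError(f"unsupported rank_by expression: {expr!r}")


# Made-up per-field scores for a single document.
field_scores = {"title": 1.5, "content": 0.75}


def fake_bm25(field: str, query: str) -> float:
    return field_scores[field]


weighted = ["Sum", [
    ["Product", 2.0, ["title", "BM25", "vector search"]],
    ["content", "BM25", "vector search"],
]]
weighted_score = evaluate_rank_by(weighted, fake_bm25)  # 2.0 * 1.5 + 0.75
```

With the made-up scores, the weighted expression evaluates to 3.75, the same expression would give 2.25 under a plain Sum, and 1.5 under Max.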
Set last_as_prefix: true to treat the last query token as a prefix match (useful for autocomplete):
{
"rank_by": ["title", "BM25", "vec"],
"last_as_prefix": true,
"top_k": 5
}
This matches documents containing tokens that start with "vec" (e.g., "vector", "vectorize").
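One way to picture prefix matching is as an expansion of the last query token into every indexed term that starts with it. The Python sketch below assumes an in-memory vocabulary and a hypothetical `expand_query_terms` helper. How Zeppelin actually looks up prefixes in its inverted index, and how prefix matching interacts with stemming, is not specified here.

```python
def expand_query_terms(
    tokens: list[str], vocabulary: set[str], last_as_prefix: bool
) -> list[set[str]]:
    """Return, for each query token, the set of index terms it matches."""
    groups = []
    for i, token in enumerate(tokens):
        is_last = i == len(tokens) - 1
        if last_as_prefix and is_last:
            # Prefix match: every indexed term that starts with the token.
            groups.append({t for t in vocabulary if t.startswith(token)})
        else:
            # Exact match only.
            groups.append({token} if token in vocabulary else set())
    return groups


# Hypothetical index vocabulary.
vocabulary = {"vector", "vectorize", "search", "engine"}

with_prefix = expand_query_terms(["vec"], vocabulary, last_as_prefix=True)
without_prefix = expand_query_terms(["vec"], vocabulary, last_as_prefix=False)
```

Only the last token is expanded, so a query such as "search vec" would still require an exact match on "search".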
Text is processed through a 5-step pipeline:
1. Unicode word segmentation: Split on word boundaries (using unicode-segmentation)
2. Lowercase: Convert to lowercase (unless case_sensitive: true)
3. Length filter: Discard tokens longer than max_token_length
4. Stopword removal: Remove common function words (if remove_stopwords: true)
5. Stemming: Apply Snowball English stemmer (if stemming: true)
Stopwords removed when remove_stopwords is true:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
no, not, of, on, or, such, that, the, their, then, there, these,
they, this, to, was, will, with
This is a conservative list based on Lucene/Elasticsearch defaults: only function words, no content words.
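The pipeline can be approximated in a few lines of Python. This is a sketch under stated assumptions rather than Zeppelin's tokenizer: the regular expression only approximates Unicode word segmentation, token length is counted in characters, and stemming is left as a pluggable callable because no Snowball implementation is bundled here. The `analyze` function name is hypothetical.

```python
import re

# Stopword list copied from the documentation above.
STOPWORDS = frozenset(
    """a an and are as at be but by for if in into is it
    no not of on or such that the their then there these
    they this to was will with""".split()
)


def analyze(
    text,
    *,
    case_sensitive=False,
    max_token_length=40,
    remove_stopwords=True,
    stemmer=None,
):
    """Run text through the five documented steps and return the tokens."""
    # 1. Word segmentation (approximation: runs of word characters).
    tokens = re.findall(r"\w+", text)
    # 2. Lowercase unless case is preserved.
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    # 3. Length filter (assumption: length measured in characters).
    tokens = [t for t in tokens if len(t) <= max_token_length]
    # 4. Stopword removal.
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    # 5. Stemming, if a stemmer callable (e.g. a Snowball binding) is supplied.
    if stemmer is not None:
        tokens = [stemmer(t) for t in tokens]
    return tokens


tokens = analyze("The Vector search engines use embeddings")
```

Without a stemmer, the sample sentence reduces to the tokens vector, search, engines, use, embeddings; passing a Snowball stemmer as `stemmer` would additionally reduce the inflected forms.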
Zeppelin uses the standard Okapi BM25 formula:
IDF (Inverse Document Frequency):
IDF(t) = ln((N - df(t) + 0.5) / (df(t) + 0.5) + 1)
Per-term score:
score(t, D) = IDF(t) × (tf × (k1 + 1)) / (tf + k1 × (1 - b + b × |D| / avgdl))
Document score:
BM25(D, Q) = Σ score(t, D) for all query terms t
Where:
- N = total documents in the corpus
- df(t) = number of documents containing term t
- tf = frequency of term t in document D
- |D| = document length in tokens
- avgdl = average document length
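The formulas above translate directly into Python. The sketch below scores one tokenized document against a tiny in-memory corpus; it recomputes df and avgdl on every call for clarity, whereas an inverted index would precompute them. The function names and the sample corpus are illustrative assumptions.

```python
import math
from collections import Counter


def idf(n_docs: int, doc_freq: int) -> float:
    """IDF(t) = ln((N - df(t) + 0.5) / (df(t) + 0.5) + 1)"""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Sum the per-term BM25 scores of `query_terms` for one document.

    `corpus` is a list of tokenized documents (lists of strings).
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_tokens)
    # Length normalization factor: 1 - b + b * |D| / avgdl
    norm = 1 - b + b * len(doc_tokens) / avgdl
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue  # term absent from this document contributes nothing
        df = sum(1 for d in corpus if term in d)
        score += idf(n_docs, df) * (tf[term] * (k1 + 1)) / (tf[term] + k1 * norm)
    return score


corpus = [
    ["vector", "search", "engine"],
    ["vector", "database"],
    ["full", "text", "search", "search"],
]
query = ["vector", "search"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
```

In this corpus the first document contains both query terms and has exactly the average length, so each term contributes IDF(t) = ln(1.6) and it outranks the second document, which matches only "vector".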
Scoring direction: BM25 scores are higher-is-better (relevance), unlike vector distances which are lower-is-better.
BM25 queries support the same filter syntax as vector queries:
{
"rank_by": ["content", "BM25", "machine learning"],
"top_k": 10,
"filter": {
"op": "and",
"filters": [
{"op": "eq", "field": "language", "value": "english"},
{"op": "range", "field": "year", "gte": 2024}
]
}
}
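As a reading aid, the filter in the example can be interpreted with a small Python predicate over a document's attributes. This sketch covers only the operators and keys that appear in the example (and, eq, and range with gte). Treating a missing attribute as a non-match is an assumption, and the `matches` helper is hypothetical rather than part of Zeppelin.

```python
def matches(filter_expr: dict, attributes: dict) -> bool:
    """Return True if `attributes` satisfies `filter_expr`."""
    op = filter_expr["op"]
    if op == "and":
        return all(matches(f, attributes) for f in filter_expr["filters"])
    if op == "eq":
        return attributes.get(filter_expr["field"]) == filter_expr["value"]
    if op == "range":
        value = attributes.get(filter_expr["field"])
        if value is None:
            return False  # assumption: a missing attribute never matches
        return "gte" not in filter_expr or value >= filter_expr["gte"]
    raise ValueError(f"unsupported filter op: {op!r}")


example_filter = {
    "op": "and",
    "filters": [
        {"op": "eq", "field": "language", "value": "english"},
        {"op": "range", "field": "year", "gte": 2024},
    ],
}

recent_english = matches(example_filter, {"language": "english", "year": 2024})
older_english = matches(example_filter, {"language": "english", "year": 2023})
```

Under this interpretation, only documents that pass the filter are candidates for BM25 ranking, so a 2023 article is excluded regardless of its relevance score.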