Full Text Search

Zeppelin supports BM25 full-text search alongside vector similarity search. Full-text search is configured per field when a namespace is created, and it uses inverted indexes that are built during compaction.

Enabling FTS

Add a full_text_search object when creating a namespace. Each key names an attribute field to index, and each value holds that field's options:

curl -X POST http://localhost:8080/v1/namespaces \
  -H "Content-Type: application/json" \
  -d '{
    "name": "articles",
    "dimensions": 384,
    "distance_metric": "cosine",
    "full_text_search": {
      "title": {
        "language": "english",
        "stemming": true,
        "k1": 1.5,
        "b": 0.75
      },
      "content": {
        "language": "english",
        "stemming": true,
        "remove_stopwords": true
      }
    }
  }'

Then upsert vectors with text in the configured attribute fields:

curl -X POST http://localhost:8080/v1/namespaces/articles/vectors \
  -H "Content-Type: application/json" \
  -d '{
    "vectors": [
      {
        "id": "article-1",
        "values": [0.1, 0.2, ...],
        "attributes": {
          "title": "Introduction to Vector Search",
          "content": "Vector search engines use embeddings to find similar documents..."
        }
      }
    ]
  }'

Configuration

Each field in full_text_search is an FtsFieldConfig object:

Field	Type	Default	Description
`language`	string	`"english"`	Language for tokenization and stemming
`stemming`	boolean	`true`	Apply Snowball stemming (e.g., "running" → "run")
`remove_stopwords`	boolean	`true`	Remove common function words
`case_sensitive`	boolean	`false`	Preserve case during tokenization
`k1`	float	`1.2`	BM25 term-frequency saturation (higher = more weight on TF)
`b`	float	`0.75`	BM25 length normalization (0.0 = no norm, 1.0 = full norm)
`max_token_length`	integer	`40`	Tokens exceeding this length are discarded

An empty {} object uses all defaults.
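
To make the defaults concrete, here is a minimal client-side sketch in Python that mirrors the table above. The class name reuses FtsFieldConfig from this page, but resolve_field_config is a hypothetical helper, and its rejection of unknown keys is an assumption (this page does not say how the server treats them).

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FtsFieldConfig:
    """Client-side mirror of the per-field options, using the documented defaults."""
    language: str = "english"
    stemming: bool = True
    remove_stopwords: bool = True
    case_sensitive: bool = False
    k1: float = 1.2
    b: float = 0.75
    max_token_length: int = 40


def resolve_field_config(raw: dict) -> FtsFieldConfig:
    """Fill in the documented default for every option missing from `raw`.

    Hypothetical helper: raising on unknown keys is an assumption of this
    sketch, not documented server behavior.
    """
    known = {f.name for f in fields(FtsFieldConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown FTS options: {sorted(unknown)}")
    return FtsFieldConfig(**raw)


# The "title" field from the namespace example overrides k1; the rest default.
title = resolve_field_config({"language": "english", "stemming": True, "k1": 1.5, "b": 0.75})
# An empty object picks up every default.
content = resolve_field_config({})
```

Here title.k1 is 1.5 while content.k1 falls back to 1.2, and both keep remove_stopwords enabled.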

Querying

To run a full-text query, send rank_by in place of vector in the body of a query request:

curl -X POST http://localhost:8080/v1/namespaces/articles/query \
  -H "Content-Type: application/json" \
  -d '{
    "rank_by": ["content", "BM25", "vector search engine"],
    "top_k": 10
  }'
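
The same request can be issued from Python's standard library. This is a sketch under assumptions: the host, port, and path are copied from the curl example, build_query_request and run_query are hypothetical helper names, and the response is assumed to be JSON because this page does not document its shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # same host and port as the curl examples


def build_query_request(namespace, rank_by, top_k=10):
    """Build (but do not send) a POST request carrying a rank_by query."""
    body = json.dumps({"rank_by": rank_by, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/namespaces/{namespace}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(namespace, rank_by, top_k=10):
    """Send the query and decode the reply (assumed, not documented, to be JSON)."""
    request = build_query_request(namespace, rank_by, top_k)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Building the request needs no running server; run_query() does.
req = build_query_request("articles", ["content", "BM25", "vector search engine"])
```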

rank_by Expressions

The rank_by field is a JSON array expression that composes BM25 scores.

Single-field BM25

Score documents by BM25 relevance on one field:

["content", "BM25", "search query"]

Multi-field with Sum

Add scores from multiple fields:

["Sum", [
  ["title", "BM25", "vector search"],
  ["content", "BM25", "vector search"]
]]

Multi-field with Max

Take the highest score across fields:

["Max", [
  ["title", "BM25", "vector search"],
  ["content", "BM25", "vector search"]
]]

Weighted scoring with Product

Multiply a field's score by a weight:

["Product", 2.0, ["title", "BM25", "vector search"]]

Nested expressions

Combine for weighted multi-field scoring:

["Sum", [
  ["Product", 2.0, ["title", "BM25", "vector search"]],
  ["content", "BM25", "vector search"]
]]

This scores documents as 2.0 * BM25(title, "vector search") + BM25(content, "vector search").
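
The composition rules above can be sketched as a small recursive evaluator. This is an illustration of the expression semantics described on this page, not Zeppelin's implementation: evaluate_rank_by is a hypothetical name, and the bm25 callback stands in for whatever computes one document's score on one field.

```python
from typing import Callable, Sequence


def evaluate_rank_by(expr: Sequence, bm25: Callable[[str, str], float]) -> float:
    """Evaluate a rank_by expression for a single document.

    `bm25(field, query)` is a hypothetical callback returning the document's
    BM25 score for `query` on `field`.
    """
    head = expr[0]
    if head == "Sum":
        return sum(evaluate_rank_by(e, bm25) for e in expr[1])
    if head == "Max":
        return max(evaluate_rank_by(e, bm25) for e in expr[1])
    if head == "Product":
        weight, inner = expr[1], expr[2]
        return weight * evaluate_rank_by(inner, bm25)
    # Anything else is treated as a leaf: [field, "BM25", query].
    field, op, query = expr
    if op != "BM25":
        raise ValueError(f"unsupported operator: {op!r}")
    return bm25(field, query)


# Made-up per-field scores for one document, for illustration only.
fake_scores = {"title": 1.5, "content": 4.0}


def fake_bm25(field, query):
    return fake_scores[field]


nested = ["Sum", [
    ["Product", 2.0, ["title", "BM25", "vector search"]],
    ["content", "BM25", "vector search"],
]]
total = evaluate_rank_by(nested, fake_bm25)  # 2.0 * 1.5 + 4.0
```

With these made-up scores the nested expression evaluates to 7.0, whereas a plain Max over the two fields would give 4.0.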

Prefix Search

Set last_as_prefix: true to treat the last query token as a prefix match (useful for autocomplete):

{
  "rank_by": ["title", "BM25", "vec"],
  "last_as_prefix": true,
  "top_k": 5
}

This matches documents containing tokens that start with "vec" (e.g., "vector", "vectorize").
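
One way to picture prefix matching is as term expansion against the index vocabulary. The sketch below is illustrative only: expand_query_terms is a hypothetical function, the vocabulary is made up, and it ignores stemming, since this page does not say how the prefix token interacts with the stemmer.

```python
def expand_query_terms(query_tokens, vocabulary, last_as_prefix):
    """Return, for each query token, the set of index terms it matches.

    Every token matches exactly, except that the last token matches any
    term starting with it when `last_as_prefix` is true.
    """
    expanded = []
    for i, token in enumerate(query_tokens):
        is_last = i == len(query_tokens) - 1
        if last_as_prefix and is_last:
            expanded.append({term for term in vocabulary if term.startswith(token)})
        else:
            expanded.append({token} & vocabulary)
    return expanded


vocabulary = {"vector", "vectorize", "search", "engine"}  # made-up index terms

with_prefix = expand_query_terms(["vec"], vocabulary, last_as_prefix=True)
without_prefix = expand_query_terms(["vec"], vocabulary, last_as_prefix=False)
two_tokens = expand_query_terms(["search", "vec"], vocabulary, last_as_prefix=True)
```

With last_as_prefix enabled, "vec" expands to both "vector" and "vectorize"; without it, "vec" matches nothing because no indexed term equals it exactly. In the two-token query only the final token is expanded.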

Tokenization Pipeline

Text is processed through a 5-step pipeline:

1. Unicode word segmentation: split on word boundaries (using unicode-segmentation)
2. Lowercase: convert to lowercase (unless case_sensitive: true)
3. Length filter: discard tokens longer than max_token_length
4. Stopword removal: remove common function words (if remove_stopwords: true)
5. Stemming: apply the Snowball English stemmer (if stemming: true)

Stopword List (33 words)

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
no, not, of, on, or, such, that, the, their, then, there, these,
they, this, to, was, will, with

This is a conservative list based on the Lucene/Elasticsearch defaults: it contains only function words, no content words.
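
The pipeline and stopword list can be approximated in a few lines of Python. This is a sketch with stated simplifications: the regex \w+ is only a rough stand-in for Unicode word segmentation, token length is measured in characters (an assumption), and the stemmer is an injectable callable because the standard library has no Snowball implementation.

```python
import re
from typing import Callable, List, Optional

STOPWORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)

# \w+ approximates word-boundary segmentation; the real rules are more precise.
_WORD = re.compile(r"\w+")


def tokenize(
    text: str,
    *,
    case_sensitive: bool = False,
    max_token_length: int = 40,
    remove_stopwords: bool = True,
    stem: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Run the five steps in order; pass `stem` to enable step 5."""
    tokens = _WORD.findall(text)                                 # 1. segment
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]                     # 2. lowercase
    tokens = [t for t in tokens if len(t) <= max_token_length]   # 3. length filter
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]       # 4. stopwords
    if stem is not None:
        tokens = [stem(t) for t in tokens]                       # 5. stemming
    return tokens


tokens = tokenize("The Vector search engines are FAST")
```

For the sample sentence this yields vector, search, engines, fast: "The" and "are" are dropped as stopwords, and "engines" stays unstemmed because no stemmer was supplied.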

BM25 Scoring

Zeppelin uses the standard Okapi BM25 formula:

IDF (Inverse Document Frequency):

IDF(t) = ln((N - df(t) + 0.5) / (df(t) + 0.5) + 1)

Per-term score:

score(t, D) = IDF(t) × (tf × (k1 + 1)) / (tf + k1 × (1 - b + b × |D| / avgdl))

Document score:

BM25(D, Q) = Σ score(t, D) for all query terms t

Where:

N = total documents in the corpus
df(t) = number of documents containing term t
tf = term frequency in document D
|D| = document length in tokens
avgdl = average document length

Scoring direction: BM25 scores are higher-is-better (relevance), unlike vector distances, which are lower-is-better.
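
The formulas above translate directly into code. The following is a brute-force sketch over an in-memory list of tokenized documents, written only to illustrate the arithmetic; it uses no inverted index, the function names are hypothetical, and the three-document corpus is made up.

```python
import math
from collections import Counter


def idf(n_docs: int, df: int) -> float:
    """IDF(t) = ln((N - df + 0.5) / (df + 0.5) + 1)."""
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1)


def bm25(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Sum the per-term scores of `query_terms` for one document."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    # Length normalization factor: 1 - b + b * |D| / avgdl
    norm = 1 - b + b * len(doc_tokens) / avgdl
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        freq = tf[term]
        if freq == 0:
            continue  # a term absent from the document contributes nothing
        df = sum(1 for d in corpus if term in d)
        score += idf(n_docs, df) * (freq * (k1 + 1)) / (freq + k1 * norm)
    return score


# Made-up corpus of already-tokenized documents.
corpus = [
    ["vector", "search", "engine"],
    ["vector", "database"],
    ["full", "text", "search", "search"],
]

scores = [bm25(["search"], doc, corpus) for doc in corpus]
```

The second document does not contain "search" and scores 0.0, while the other two score above zero. Because of the + 1 inside the logarithm, the IDF is positive even for a term that appears in every document.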

Combining with Filters

BM25 queries support the same filter syntax as vector queries:

{
  "rank_by": ["content", "BM25", "machine learning"],
  "top_k": 10,
  "filter": {
    "op": "and",
    "filters": [
      {"op": "eq", "field": "language", "value": "english"},
      {"op": "range", "field": "year", "gte": 2024}
    ]
  }
}
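
To show how the filter in this example reads, here is a small sketch that evaluates it against a document's attributes. It covers only the operators that appear above (and, eq, and range with gte); the function name matches_filter is hypothetical, and treating a missing attribute as a non-match is an assumption.

```python
def matches_filter(flt: dict, attrs: dict) -> bool:
    """Return True if `attrs` satisfies the filter expression `flt`."""
    op = flt["op"]
    if op == "and":
        return all(matches_filter(f, attrs) for f in flt["filters"])
    if op == "eq":
        return attrs.get(flt["field"]) == flt["value"]
    if op == "range":
        value = attrs.get(flt["field"])
        if value is None:
            return False  # assumption: a missing attribute never matches
        # Only the "gte" bound shown in the example is handled here.
        return "gte" not in flt or value >= flt["gte"]
    raise ValueError(f"unsupported filter op in this sketch: {op!r}")


example_filter = {
    "op": "and",
    "filters": [
        {"op": "eq", "field": "language", "value": "english"},
        {"op": "range", "field": "year", "gte": 2024},
    ],
}

recent = matches_filter(example_filter, {"language": "english", "year": 2025})
old = matches_filter(example_filter, {"language": "english", "year": 2019})
```

A 2025 English document passes both conditions, while a 2019 one fails the range check, so only the first would be eligible for BM25 ranking.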
