
Full Text Search

Anup Ghatage edited this page Feb 12, 2026 · 1 revision

Full-Text Search

Zeppelin supports BM25 full-text search alongside vector similarity search. Full-text search is configured per-field at namespace creation time and uses inverted indexes built during compaction.

Enabling FTS

Add full_text_search configuration when creating a namespace:

curl -X POST http://localhost:8080/v1/namespaces \
  -H "Content-Type: application/json" \
  -d '{
    "name": "articles",
    "dimensions": 384,
    "distance_metric": "cosine",
    "full_text_search": {
      "title": {
        "language": "english",
        "stemming": true,
        "k1": 1.5,
        "b": 0.75
      },
      "content": {
        "language": "english",
        "stemming": true,
        "remove_stopwords": true
      }
    }
  }'

Then upsert vectors with text in the configured attribute fields:

curl -X POST http://localhost:8080/v1/namespaces/articles/vectors \
  -H "Content-Type: application/json" \
  -d '{
    "vectors": [
      {
        "id": "article-1",
        "values": [0.1, 0.2, ...],
        "attributes": {
          "title": "Introduction to Vector Search",
          "content": "Vector search engines use embeddings to find similar documents..."
        }
      }
    ]
  }'

Configuration

Each field in full_text_search is an FtsFieldConfig object:

| Field | Type | Default | Description |
|---|---|---|---|
| language | string | "english" | Language for tokenization and stemming |
| stemming | boolean | true | Apply Snowball stemming (e.g., "running" → "run") |
| remove_stopwords | boolean | true | Remove common function words |
| case_sensitive | boolean | false | Preserve case during tokenization |
| k1 | float | 1.2 | BM25 term-frequency saturation (higher = more weight on TF) |
| b | float | 0.75 | BM25 length normalization (0.0 = no norm, 1.0 = full norm) |
| max_token_length | integer | 40 | Tokens exceeding this length are discarded |

An empty {} object uses all defaults.
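
To make the default behaviour concrete, here is a minimal Python sketch of how omitted keys could be filled in. The FtsFieldConfig field names and default values come from the table above; the resolve_field_config helper is hypothetical and is not part of Zeppelin's API.

```python
from dataclasses import dataclass, asdict


@dataclass
class FtsFieldConfig:
    # Defaults copied from the configuration table above.
    language: str = "english"
    stemming: bool = True
    remove_stopwords: bool = True
    case_sensitive: bool = False
    k1: float = 1.2
    b: float = 0.75
    max_token_length: int = 40


def resolve_field_config(raw: dict) -> dict:
    """Hypothetical helper: fill in defaults for keys the caller omitted."""
    return asdict(FtsFieldConfig(**raw))


# The "title" field from the namespace example overrides k1 and keeps the rest.
title = resolve_field_config({"language": "english", "stemming": True, "k1": 1.5, "b": 0.75})
print(title["k1"], title["remove_stopwords"], title["max_token_length"])
# 1.5 True 40

# An empty object resolves to all defaults.
print(resolve_field_config({}) == asdict(FtsFieldConfig()))
# True
```

In this sketch an unknown key raises a TypeError. Whether Zeppelin rejects or ignores unknown keys is not stated here, so treat that behaviour as a property of the sketch only.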

Querying

Use rank_by instead of vector in the query endpoint:

curl -X POST http://localhost:8080/v1/namespaces/articles/query \
  -H "Content-Type: application/json" \
  -d '{
    "rank_by": ["content", "BM25", "vector search engine"],
    "top_k": 10
  }'

rank_by Expressions

The rank_by field is a JSON array expression that composes BM25 scores.

Single-field BM25

Score documents by BM25 relevance on one field:

["content", "BM25", "search query"]

Multi-field with Sum

Add scores from multiple fields:

["Sum", [
  ["title", "BM25", "vector search"],
  ["content", "BM25", "vector search"]
]]

Multi-field with Max

Take the highest score across fields:

["Max", [
  ["title", "BM25", "vector search"],
  ["content", "BM25", "vector search"]
]]

Weighted scoring with Product

Multiply a field's score by a weight:

["Product", 2.0, ["title", "BM25", "vector search"]]

Nested expressions

Combine for weighted multi-field scoring:

["Sum", [
  ["Product", 2.0, ["title", "BM25", "vector search"]],
  ["content", "BM25", "vector search"]
]]

This scores documents as 2.0 * BM25(title, "vector search") + BM25(content, "vector search").
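
The composition rules above can be sketched as a small recursive evaluator. This is an illustration of the semantics described in this section, not Zeppelin's implementation: eval_rank_by is a hypothetical name, and the bm25 argument is a caller-supplied function standing in for the per-field BM25 score of a single document.

```python
def eval_rank_by(expr, bm25):
    """Evaluate a rank_by expression for one document.

    bm25(field, query) is assumed to return this document's BM25 score
    for `query` on `field`.
    """
    head = expr[0]
    if head == "Sum":
        # ["Sum", [expr, expr, ...]]
        return sum(eval_rank_by(e, bm25) for e in expr[1])
    if head == "Max":
        # ["Max", [expr, expr, ...]]
        return max(eval_rank_by(e, bm25) for e in expr[1])
    if head == "Product":
        # ["Product", weight, expr]
        weight, inner = expr[1], expr[2]
        return weight * eval_rank_by(inner, bm25)
    # Leaf: [field, "BM25", query]
    field, op, query = expr
    if op != "BM25":
        raise ValueError(f"unsupported operator: {op!r}")
    return bm25(field, query)


# Made-up per-field scores for one document.
scores = {"title": 1.5, "content": 2.0}
lookup = lambda field, query: scores[field]

weighted = ["Sum", [
    ["Product", 2.0, ["title", "BM25", "vector search"]],
    ["content", "BM25", "vector search"],
]]
print(eval_rank_by(weighted, lookup))
# 5.0  (2.0 * 1.5 + 2.0)
```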

Prefix Search

Set last_as_prefix: true to treat the last query token as a prefix match (useful for autocomplete):

{
  "rank_by": ["title", "BM25", "vec"],
  "last_as_prefix": true,
  "top_k": 5
}

This matches documents containing tokens that start with "vec" (e.g., "vector", "vectorize").
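
As a rough illustration of the matching rule, the sketch below maps each query token to the index terms it would match. The matching_terms function and the tiny vocabulary are hypothetical. How Zeppelin scores prefix matches, and how the prefix token interacts with stemming, is not covered here.

```python
def matching_terms(query_tokens, vocabulary, last_as_prefix=False):
    """Map each query token to the index terms it matches.

    Every token must match exactly, except the last one when
    last_as_prefix is set, which matches any term starting with it.
    """
    matches = []
    for i, token in enumerate(query_tokens):
        is_last = i == len(query_tokens) - 1
        if last_as_prefix and is_last:
            terms = sorted(t for t in vocabulary if t.startswith(token))
        else:
            terms = [token] if token in vocabulary else []
        matches.append((token, terms))
    return matches


vocabulary = {"vector", "vectorize", "search", "engine"}

print(matching_terms(["vec"], vocabulary, last_as_prefix=True))
# [('vec', ['vector', 'vectorize'])]

print(matching_terms(["vec"], vocabulary))
# [('vec', [])]
```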

Tokenization Pipeline

Text is processed through a 5-step pipeline:

  1. Unicode word segmentation — Split on word boundaries (using unicode-segmentation)
  2. Lowercase — Convert to lowercase (unless case_sensitive: true)
  3. Length filter — Discard tokens longer than max_token_length
  4. Stopword removal — Remove common function words (if remove_stopwords: true)
  5. Stemming — Apply Snowball English stemmer (if stemming: true)

Stopword List (33 words)

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
no, not, of, on, or, such, that, the, their, then, there, these,
they, this, to, was, will, with

This is a conservative list based on Lucene/Elasticsearch defaults — only function words, no content words.
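
The pipeline can be approximated in a few lines of Python. This is a sketch under stated assumptions: the regular expression \w+ is only a rough stand-in for Unicode word segmentation, token length is measured in characters, and stemming is left as an optional callable because the standard library has no Snowball stemmer. The tokenize function name is hypothetical.

```python
import re

# The stopword list from this page.
STOPWORDS = frozenset("""
a an and are as at be but by for if in into is it
no not of on or such that the their then there these
they this to was will with
""".split())


def tokenize(text, *, case_sensitive=False, max_token_length=40,
             remove_stopwords=True, stem=None):
    # 1. Word segmentation (approximation: runs of word characters).
    tokens = re.findall(r"\w+", text)
    # 2. Lowercase unless case_sensitive is set.
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    # 3. Length filter (assumption: length counted in characters).
    tokens = [t for t in tokens if len(t) <= max_token_length]
    # 4. Stopword removal.
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    # 5. Stemming, if the caller supplies a stemmer function.
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return tokens


print(tokenize("The Introduction to Vector Search"))
# ['introduction', 'vector', 'search']
```

The order matters: because lowercasing runs before stopword removal, "The" is removed in the default configuration. With case_sensitive set, this sketch would keep "The", since the list contains only lowercase entries; whether Zeppelin behaves the same way is not stated on this page.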

BM25 Scoring

Zeppelin uses the Okapi BM25 formula with the non-negative IDF variant (the + 1 inside the logarithm, as in Lucene):

IDF (Inverse Document Frequency):

IDF(t) = ln((N - df(t) + 0.5) / (df(t) + 0.5) + 1)

Per-term score:

score(t, D) = IDF(t) × (tf × (k1 + 1)) / (tf + k1 × (1 - b + b × |D| / avgdl))

Document score:

BM25(D, Q) = Σ score(t, D) for all query terms t

Where:

  • N = total documents in the corpus
  • df(t) = number of documents containing term t
  • tf = term frequency in document D
  • |D| = document length in tokens
  • avgdl = average document length

Scoring direction: BM25 scores are higher-is-better (relevance), unlike vector distances, which are lower-is-better.
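
The formulas above translate directly into code. The following is a minimal sketch for checking numbers by hand, not Zeppelin's implementation; the function names and the toy corpus statistics are made up for illustration.

```python
import math


def idf(n_docs, df):
    # IDF(t) = ln((N - df(t) + 0.5) / (df(t) + 0.5) + 1)
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1)


def term_score(tf, doc_len, avgdl, n_docs, df, k1=1.2, b=0.75):
    # Length normalization factor: 1 - b + b * |D| / avgdl
    norm = 1 - b + b * doc_len / avgdl
    return idf(n_docs, df) * (tf * (k1 + 1)) / (tf + k1 * norm)


def bm25(query_terms, doc_tf, doc_len, avgdl, n_docs, doc_freq, k1=1.2, b=0.75):
    """Sum per-term scores over the query terms present in the document."""
    total = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # a term absent from the document contributes nothing
        total += term_score(tf, doc_len, avgdl, n_docs, doc_freq[term], k1, b)
    return total


# Toy corpus: 3 documents, "vector" appears in 1 of them.
# The document has average length, so the normalization factor is 1
# and, with tf = 1, the score reduces to IDF = ln(8/3).
score = bm25(["vector"], {"vector": 1}, doc_len=4, avgdl=4.0,
             n_docs=3, doc_freq={"vector": 1})
print(round(score, 4))
# 0.9808
```

Because of the + 1 inside the logarithm, the IDF stays positive even for terms that occur in most documents, so every matching term adds to the score.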

Combining with Filters

BM25 queries support the same filter syntax as vector queries:

{
  "rank_by": ["content", "BM25", "machine learning"],
  "top_k": 10,
  "filter": {
    "op": "and",
    "filters": [
      {"op": "eq", "field": "language", "value": "english"},
      {"op": "range", "field": "year", "gte": 2024}
    ]
  }
}
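
To show how a filter and a BM25 ranking interact, here is a small Python sketch. It covers only the operators that appear in the example above (and, eq, and range with gte), and it assumes that a document missing the filtered attribute does not match. The matches and top_k functions are hypothetical, and the sketch makes no claim about whether Zeppelin applies filters before or after scoring.

```python
def matches(flt, attributes):
    """Evaluate a filter against one document's attributes."""
    op = flt["op"]
    if op == "and":
        return all(matches(f, attributes) for f in flt["filters"])
    if op == "eq":
        return attributes.get(flt["field"]) == flt["value"]
    if op == "range":
        value = attributes.get(flt["field"])
        if value is None:
            return False  # assumption: a missing attribute never matches
        return value >= flt["gte"] if "gte" in flt else True
    raise ValueError(f"operator not covered by this sketch: {op!r}")


def top_k(candidates, flt, k):
    """candidates: (doc_id, attributes, bm25_score) tuples.

    Keep matching documents and sort by score, highest first,
    since BM25 scores are higher-is-better.
    """
    kept = [(doc_id, score) for doc_id, attrs, score in candidates
            if matches(flt, attrs)]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


flt = {
    "op": "and",
    "filters": [
        {"op": "eq", "field": "language", "value": "english"},
        {"op": "range", "field": "year", "gte": 2024},
    ],
}
candidates = [
    ("a", {"language": "english", "year": 2024}, 1.2),
    ("b", {"language": "english", "year": 2023}, 3.4),
    ("c", {"language": "english", "year": 2025}, 2.5),
    ("d", {"language": "german", "year": 2025}, 9.9),
]
print(top_k(candidates, flt, 10))
# [('c', 2.5), ('a', 1.2)]
```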
