P10: Analyzer Scope (Curated Languages & ICU) #32

@davidkelley

Description

You are working in the searchlite repo. Implement P10: expand analyzers safely with a small set of curated language packs and ICU/Unicode options. Do not add geo features.

High-level goals

  • Add ICU/Unicode normalization options and a limited set of language analyzers (e.g., English, Spanish, French) with stopwords/stemmers.
  • Allow custom stopword lists per analyzer; keep analyzer config deterministic and small.
  • Preserve backward compatibility: existing analyzers remain defaults; legacy tokenizer alias still works.
  • Keep performance and predictability: bound analyzer options, validate language codes, and avoid heavy dependencies; ensure pipelines stay allocation-light.

Assumed prerequisites

  • P1 analyzers framework, per-field analyzer/search_analyzer, edge_ngram, synonyms, stemmer (English), stopwords (en), unicode tokenizer.

Scope (P10)

  1. ICU/Unicode options
    • Add a normalizer/char filter option for Unicode normalization (NFKC/NFD) and optional case folding.
    • Expose a tokenizer mode that is Unicode-aware and punctuation-aware (may reuse existing unicode tokenizer with options).
  2. Language packs (curated)
    • Built-in stopword lists for a small set (e.g., en, es, fr, de). Keep lists local; no network.
    • Stemmers/lemmatizers where available (Snowball stemmers whose crates compile on Rust 1.88).
    • Allow analyzer config to reference these packs by name; still allow explicit stopword arrays.
  3. Analyzer config enhancements
    • Token filter options: { "stopwords": "es" } or { "stopwords": ["a","de",...] }.
    • Stemmer filter: { "stemmer": "spanish" } etc.; validate supported list.
    • Optional char_filters: basic HTML strip and Unicode normalize (NFKC/NFD) with casefold.
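The filter options above could be modeled roughly as follows. This is a minimal std-only sketch; the type and constant names (`StopwordsConfig`, `SUPPORTED_PACKS`, `validate_stemmer`, and the curated lists themselves) are hypothetical, not taken from the searchlite codebase.

```rust
/// Stopwords may reference a curated pack by language code
/// ({ "stopwords": "es" }) or carry an explicit word list
/// ({ "stopwords": ["a", "de", ...] }).
#[derive(Debug, PartialEq)]
enum StopwordsConfig {
    Pack(String),
    Explicit(Vec<String>),
}

// Curated sets are deliberately small and validated up front.
const SUPPORTED_PACKS: &[&str] = &["en", "es", "fr", "de"];
const SUPPORTED_STEMMERS: &[&str] = &["english", "spanish", "french", "german"];

fn validate_stopwords(cfg: &StopwordsConfig) -> Result<(), String> {
    match cfg {
        StopwordsConfig::Pack(code) => {
            if SUPPORTED_PACKS.contains(&code.as_str()) {
                Ok(())
            } else {
                Err(format!("unsupported stopword pack: {code}"))
            }
        }
        // Reject empty explicit lists so configs stay meaningful.
        StopwordsConfig::Explicit(words) => {
            if words.is_empty() {
                Err("explicit stopword list must not be empty".into())
            } else {
                Ok(())
            }
        }
    }
}

fn validate_stemmer(name: &str) -> Result<(), String> {
    if SUPPORTED_STEMMERS.contains(&name) {
        Ok(())
    } else {
        Err(format!("unsupported stemmer: {name}"))
    }
}
```

Validating at schema-load time keeps the "reject unknown/empty configs" requirement in one place, before any pipeline is built.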

User-facing API changes

  • index-schema.json: extend analyzer/filter definitions to allow language codes and char_filters.
  • search-request.schema.json: no change (analyzers are schema-time), but ensure query-time uses search_analyzer.
  • Backward compatibility: if no analyzers defined, auto-inject default; if tokenizer used, treat as analyzer alias; existing configs remain valid.
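The backward-compatibility rule can be sketched as a single resolution function. Names here are illustrative (the real searchlite config types will differ), but the precedence matches the bullet above: explicit analyzer wins, legacy `tokenizer` acts as an alias, and the default is auto-injected otherwise.

```rust
/// Resolve the effective analyzer name for a field, honoring the
/// legacy `tokenizer` key as an analyzer alias and falling back to
/// the default when nothing is configured.
fn resolve_analyzer(analyzer: Option<&str>, legacy_tokenizer: Option<&str>) -> String {
    match (analyzer, legacy_tokenizer) {
        // Explicit analyzer always wins.
        (Some(a), _) => a.to_string(),
        // Legacy `tokenizer` is treated as an analyzer alias.
        (None, Some(t)) => t.to_string(),
        // No config at all: auto-inject the default analyzer.
        (None, None) => "default".to_string(),
    }
}
```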

Implementation details

  • Add new filter/normalizer modules under searchlite-core/src/analysis.
  • Ship curated stopword lists in-repo; ensure license compatibility.
  • Stemmer: use rust-stemmers or similar; gate languages to the curated set.
  • Char filters: implement HTML strip (basic tag removal) and Unicode normalize + casefold.
  • Validation/perf: enforce supported language list; reject unknown/empty configs; keep char filters bounded (no exponential behavior); reuse buffers across filters/tokenizers.
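The two char filters could look like the std-only sketch below. Caller-provided buffers make the "reuse buffers across filters/tokenizers" point concrete. Note the hedges: real NFKC/NFD normalization and full Unicode case folding would come from a dedicated crate (e.g. unicode-normalization); `to_lowercase` stands in here, and the tag stripper is the "basic tag removal" from the bullet, not a full HTML parser.

```rust
/// Strip basic HTML tags by skipping everything between '<' and '>'.
/// Writes into a caller-provided buffer so the allocation can be
/// reused across documents (allocation-light, linear time).
fn html_strip(input: &str, out: &mut String) {
    out.clear();
    let mut in_tag = false;
    for ch in input.chars() {
        match ch {
            '<' => in_tag = true,
            '>' => in_tag = false,
            c if !in_tag => out.push(c),
            _ => {}
        }
    }
}

/// Case-folding stand-in: a real implementation would apply NFKC/NFD
/// via a normalization crate before folding; `to_lowercase`
/// approximates folding for this sketch.
fn casefold(input: &str, out: &mut String) {
    out.clear();
    for ch in input.chars() {
        out.extend(ch.to_lowercase());
    }
}
```

Both passes are single-scan with no backtracking, which keeps the "no exponential behavior" bound trivially satisfied.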

Testing

  • Unit: tokenization/normalization for ICU options; stopwords/stemmers for en/es/fr/de; char filter behavior.
  • Integration: schema with Spanish analyzer; queries respect search_analyzer; highlight uses analyzed tokens.
  • Backward-compat test: legacy schema with tokenizer: "default" yields identical output.
  • Negative: reject unsupported language codes, invalid char_filter configs, and oversized custom stopword lists (if a size bound is enforced).
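A stopword unit test for the es pack might look like this. The filter helper is a hypothetical stand-in for whatever token-filter API P1 established; only the expected behavior (curated-pack words dropped, content words kept) is from the spec above.

```rust
/// Drop tokens that appear in the stopword set; a stand-in for the
/// real token-filter trait in searchlite-core.
fn remove_stopwords<'a>(tokens: &[&'a str], stopwords: &[&str]) -> Vec<&'a str> {
    tokens
        .iter()
        .copied()
        .filter(|t| !stopwords.contains(t))
        .collect()
}
```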

Docs

  • README: document available analyzers/tokenizers/filters, language codes, and examples per language.
  • index-schema.json docstrings updated for new options; list supported language codes.

Performance/constraints

  • Keep pipelines allocation-light; reuse buffers.
  • Ensure new deps are Rust 1.88 compatible and small.

Acceptance criteria

  • New language analyzers and ICU options work and are documented.
  • Existing behavior unchanged when new options are unused.
  • Tests and docs updated.
