You are working in the searchlite repo. Implement P10: expand analyzers safely with a small set of curated language packs and ICU/Unicode options. Do not add geo features.
High-level goals
- Add ICU/Unicode normalization options and a limited set of language analyzers (e.g., English, Spanish, French) with stopwords/stemmers.
- Allow custom stopword lists per analyzer; keep analyzer config deterministic and small.
- Preserve backward compatibility: existing analyzers remain defaults; legacy tokenizer alias still works.
- Keep performance and predictability: bound analyzer options, validate language codes, and avoid heavy dependencies; ensure pipelines stay allocation-light.
Assumed prerequisites
- P1 analyzers framework, per-field analyzer/search_analyzer, edge_ngram, synonyms, stemmer (English), stopwords (en), unicode tokenizer.
Scope (P10)
- ICU/Unicode options
  - Add a normalizer/char filter option for Unicode normalization (NFKC/NFD) and optional case folding.
  - Expose a tokenizer mode that is Unicode-aware and punctuation-aware (may reuse the existing unicode tokenizer with options).
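The case-folding side of the char filter can be sketched std-only; a minimal example, assuming full NFKC/NFD would later be wired in via a small normalization crate (the `casefold` name is hypothetical):

```rust
/// Case-folding char filter sketch (std-only). Proper NFKC/NFD normalization
/// is not in std and would come from a small dependency; this shows only the
/// casefold step using char::to_lowercase as an approximation of case folding.
fn casefold(input: &str) -> String {
    // flat_map is needed because some chars lowercase to multiple chars.
    input.chars().flat_map(char::to_lowercase).collect()
}

fn main() {
    assert_eq!(casefold("Äpfel UND Öl"), "äpfel und öl");
}
```

Because the output is built char-by-char, the same routine can later write into a reused buffer to stay allocation-light.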
- Language packs (curated)
  - Built-in stopword lists for a small set (e.g., en, es, fr, de). Keep lists local; no network.
  - Stemmers/lemmatizers where available (Snowball stemmers compatible with Rust 1.88).
  - Allow analyzer config to reference these packs by name; still allow explicit stopword arrays.
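A minimal sketch of pack lookup plus the stopword filter, assuming the real packs ship full per-language lists (the tiny lists and function names here are illustrative only):

```rust
use std::collections::HashSet;

/// Curated, in-repo stopword packs keyed by language code. The lists here are
/// tiny placeholders; the shipped packs would carry the full curated lists.
fn stopword_pack(lang: &str) -> Option<HashSet<&'static str>> {
    let words: &'static [&'static str] = match lang {
        "en" => &["a", "an", "the", "of"],
        "es" => &["a", "de", "el", "la"],
        "fr" => &["le", "la", "de", "un"],
        "de" => &["der", "die", "das", "und"],
        _ => return None, // unknown codes are rejected, not silently ignored
    };
    Some(words.iter().copied().collect())
}

/// Stopword token filter: drops tokens present in the set.
fn filter_stopwords<'a>(tokens: Vec<&'a str>, stops: &HashSet<&str>) -> Vec<&'a str> {
    tokens.into_iter().filter(|t| !stops.contains(*t)).collect()
}

fn main() {
    let es = stopword_pack("es").expect("es is a curated pack");
    assert_eq!(
        filter_stopwords(vec!["la", "casa", "de", "papel"], &es),
        vec!["casa", "papel"]
    );
    assert!(stopword_pack("xx").is_none());
}
```

Returning `None` for unknown codes keeps pack lookup and config validation on the same path: an explicit stopword array bypasses the lookup entirely.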
- Analyzer config enhancements
  - Token filter options: { "stopwords": "es" } or { "stopwords": ["a","de",...] }.
  - Stemmer filter: { "stemmer": "spanish" } etc.; validate against the supported list.
  - Optional char_filters: basic HTML strip and Unicode normalize (NFKC/NFD) with casefold.
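Putting the pieces together, a full analyzer definition might look like the following; the exact field names ("analyzers", "char_filters", "filters") are assumptions to be reconciled with index-schema.json:

```json
{
  "analyzers": {
    "spanish_text": {
      "char_filters": [
        { "type": "unicode_normalize", "form": "nfkc", "casefold": true }
      ],
      "tokenizer": "unicode",
      "filters": [
        { "stopwords": "es" },
        { "stemmer": "spanish" }
      ]
    }
  }
}
```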
User-facing API changes
- index-schema.json: extend analyzer/filter definitions to allow language codes and char_filters.
- search-request.schema.json: no change (analyzers are schema-time), but ensure query-time uses search_analyzer.
- Backward compatibility: if no analyzers are defined, auto-inject the default; if tokenizer is used, treat it as an analyzer alias; existing configs remain valid.
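The alias/fallback rule above can be sketched as a small resolution function (names hypothetical; the real code would resolve into the analyzer registry rather than return a string):

```rust
/// Resolve which analyzer a field uses. Precedence: an explicit analyzer wins,
/// then the legacy `tokenizer` key (treated as an analyzer alias), then the
/// auto-injected default. This keeps pre-P10 schemas valid unchanged.
fn resolve_analyzer(analyzer: Option<&str>, legacy_tokenizer: Option<&str>) -> String {
    analyzer
        .or(legacy_tokenizer) // legacy key still honored as an alias
        .unwrap_or("default") // neither set: auto-inject the default analyzer
        .to_string()
}

fn main() {
    assert_eq!(resolve_analyzer(None, None), "default");
    assert_eq!(resolve_analyzer(None, Some("default")), "default");
    assert_eq!(resolve_analyzer(Some("spanish_text"), None), "spanish_text");
}
```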
Implementation details
- Add new filter/normalizer modules under searchlite-core/src/analysis.
- Ship curated stopword lists in-repo; ensure license compatibility.
- Stemmer: use rust-stemmers or similar; gate languages to the curated set.
- Char filters: implement HTML strip (basic tag removal) and Unicode normalize + casefold.
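A basic HTML strip can be a single linear pass with no backtracking, which satisfies the "no exponential behavior" constraint below; a std-only sketch (entity decoding deliberately out of scope):

```rust
/// Basic HTML strip char filter: removes `<...>` spans in one linear pass
/// (bounded, no backtracking). HTML entities are left as-is in this sketch.
fn html_strip(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_tag = false;
    for c in input.chars() {
        match c {
            '<' => in_tag = true,
            '>' if in_tag => in_tag = false,
            c if !in_tag => out.push(c),
            _ => {} // character inside a tag: dropped
        }
    }
    out
}

fn main() {
    assert_eq!(html_strip("<p>Hola <b>mundo</b></p>"), "Hola mundo");
    assert_eq!(html_strip("no tags"), "no tags");
}
```

A stray `>` outside a tag is passed through unchanged, and an unclosed `<` drops the rest of the input, which is an acceptable failure mode for a "basic tag removal" filter.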
- Validation/perf: enforce supported language list; reject unknown/empty configs; keep char filters bounded (no exponential behavior); reuse buffers across filters/tokenizers.
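Language-code validation can fail fast at schema time; a minimal sketch, assuming the curated set matches the packs above (the constant and error type are illustrative):

```rust
/// The curated language set; anything else is rejected at schema validation.
const SUPPORTED_LANGS: &[&str] = &["en", "es", "fr", "de"];

/// Reject unknown or empty language codes up front so bad configs fail at
/// index-creation time instead of producing surprising analysis output.
fn validate_lang(code: &str) -> Result<(), String> {
    if code.is_empty() {
        return Err("language code must not be empty".to_string());
    }
    if !SUPPORTED_LANGS.contains(&code) {
        return Err(format!("unsupported language code: {code:?}"));
    }
    Ok(())
}

fn main() {
    assert!(validate_lang("es").is_ok());
    assert!(validate_lang("").is_err());
    assert!(validate_lang("pt").is_err());
}
```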
Testing
- Unit: tokenization/normalization for ICU options; stopwords/stemmers for en/es/fr/de; char filter behavior.
- Integration: schema with Spanish analyzer; queries respect search_analyzer; highlight uses analyzed tokens.
- Backward-compat test: legacy schema with tokenizer: "default" yields identical output.
- Negative: reject unsupported language codes, invalid char_filter configs, and excessively large custom stopword lists, if size limits are enforced.
Docs
- README: document available analyzers/tokenizers/filters, language codes, and examples per language.
- index-schema.json docstrings updated for new options; list supported language codes.
Performance/constraints
- Keep pipelines allocation-light; reuse buffers.
- Ensure new deps are Rust 1.88 compatible and small.
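The buffer-reuse constraint can be made concrete with a filter that writes into a caller-owned buffer instead of allocating per call; a sketch (function name hypothetical):

```rust
/// Allocation-light filter step: the caller owns the buffer, which is cleared
/// (capacity retained) and refilled on each call, so a hot indexing loop
/// allocates only when a document exceeds all previous lengths.
fn lowercase_into(input: &str, buf: &mut String) {
    buf.clear(); // keeps the existing capacity for reuse
    buf.extend(input.chars().flat_map(char::to_lowercase));
}

fn main() {
    let mut buf = String::new();
    lowercase_into("Foo", &mut buf);
    assert_eq!(buf, "foo");
    let cap = buf.capacity();
    lowercase_into("Bar", &mut buf);
    assert_eq!(buf, "bar");
    assert!(buf.capacity() >= cap); // no shrink, no fresh allocation needed
}
```

The same shape applies to token buffers between filters: each stage borrows and refills rather than returning a new `Vec` per document.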
Acceptance criteria
- New language analyzers and ICU options work and are documented.
- Existing behavior unchanged when new options are unused.
- Tests and docs updated.