P10: Analyzer Scope (Curated Languages & ICU) #32

@davidkelley

Description

You are working in the searchlite repo. Implement P10: expand analyzers safely with a small set of curated language packs and ICU/Unicode options. Do not add geo features.

High-level goals

  • Add ICU/Unicode normalization options and a limited set of language analyzers (e.g., English, Spanish, French) with stopwords/stemmers.
  • Allow custom stopword lists per analyzer; keep analyzer config deterministic and small.
  • Preserve backward compatibility: existing analyzers remain defaults; legacy tokenizer alias still works.
  • Keep performance and predictability: bound analyzer options, validate language codes, and avoid heavy dependencies; ensure pipelines stay allocation-light.

Assumed prerequisites

  • P1 analyzers framework, per-field analyzer/search_analyzer, edge_ngram, synonyms, stemmer (English), stopwords (en), unicode tokenizer.

Scope (P10)

  1. ICU/Unicode options
    • Add a normalizer/char filter option for Unicode normalization (NFKC/NFD) and optional case folding.
    • Expose a tokenizer mode that is Unicode-aware and punctuation-aware (may reuse existing unicode tokenizer with options).
  2. Language packs (curated)
    • Built-in stopword lists for a small set (e.g., en, es, fr, de). Keep lists local; no network.
    • Stemmers/lemmatizers where available (Snowball stemmers whose crates compile on Rust 1.88).
    • Allow analyzer config to reference these packs by name; still allow explicit stopword arrays.
  3. Analyzer config enhancements
    • Token filter options: { "stopwords": "es" } or { "stopwords": ["a","de",...] }.
    • Stemmer filter: { "stemmer": "spanish" } etc.; validate supported list.
    • Optional char_filters: basic HTML strip and Unicode normalize (NFKC/NFD) with casefold.
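The filter options above could be modeled roughly as follows. This is a minimal std-only sketch; the type and constant names (`StopwordsConfig`, `SUPPORTED_PACKS`, `validate_stemmer`, and the curated lists themselves) are hypothetical, not taken from the searchlite codebase.

```rust
/// Stopwords may reference a curated pack by language code
/// ({ "stopwords": "es" }) or carry an explicit word list
/// ({ "stopwords": ["a", "de", ...] }).
#[derive(Debug, PartialEq)]
enum StopwordsConfig {
    Pack(String),
    Explicit(Vec<String>),
}

// Curated sets are deliberately small and validated up front.
const SUPPORTED_PACKS: &[&str] = &["en", "es", "fr", "de"];
const SUPPORTED_STEMMERS: &[&str] = &["english", "spanish", "french", "german"];

fn validate_stopwords(cfg: &StopwordsConfig) -> Result<(), String> {
    match cfg {
        StopwordsConfig::Pack(code) => {
            if SUPPORTED_PACKS.contains(&code.as_str()) {
                Ok(())
            } else {
                Err(format!("unsupported stopword pack: {code}"))
            }
        }
        // Reject empty explicit lists so configs stay meaningful.
        StopwordsConfig::Explicit(words) => {
            if words.is_empty() {
                Err("explicit stopword list must not be empty".into())
            } else {
                Ok(())
            }
        }
    }
}

fn validate_stemmer(name: &str) -> Result<(), String> {
    if SUPPORTED_STEMMERS.contains(&name) {
        Ok(())
    } else {
        Err(format!("unsupported stemmer: {name}"))
    }
}
```

Validating at schema-load time keeps the "reject unknown/empty configs" requirement in one place, before any pipeline is built.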

User-facing API changes

  • index-schema.json: extend analyzer/filter definitions to allow language codes and char_filters.
  • search-request.schema.json: no change (analyzers are schema-time), but ensure query-time uses search_analyzer.
  • Backward compatibility: if no analyzers defined, auto-inject default; if tokenizer used, treat as analyzer alias; existing configs remain valid.
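The backward-compatibility rule can be sketched as a single resolution function. Names here are illustrative (the real searchlite config types will differ), but the precedence matches the bullet above: explicit analyzer wins, legacy `tokenizer` acts as an alias, and the default is auto-injected otherwise.

```rust
/// Resolve the effective analyzer name for a field, honoring the
/// legacy `tokenizer` key as an analyzer alias and falling back to
/// the default when nothing is configured.
fn resolve_analyzer(analyzer: Option<&str>, legacy_tokenizer: Option<&str>) -> String {
    match (analyzer, legacy_tokenizer) {
        // Explicit analyzer always wins.
        (Some(a), _) => a.to_string(),
        // Legacy `tokenizer` is treated as an analyzer alias.
        (None, Some(t)) => t.to_string(),
        // No config at all: auto-inject the default analyzer.
        (None, None) => "default".to_string(),
    }
}
```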

Implementation details

  • Add new filter/normalizer modules under searchlite-core/src/analysis.
  • Ship curated stopword lists in-repo; ensure license compatibility.
  • Stemmer: use rust-stemmers or similar; gate languages to the curated set.
  • Char filters: implement HTML strip (basic tag removal) and Unicode normalize + casefold.
  • Validation/perf: enforce supported language list; reject unknown/empty configs; keep char filters bounded (no exponential behavior); reuse buffers across filters/tokenizers.
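The two char filters could look like the std-only sketch below. Caller-provided buffers make the "reuse buffers across filters/tokenizers" point concrete. Note the hedges: real NFKC/NFD normalization and full Unicode case folding would come from a dedicated crate (e.g. unicode-normalization); `to_lowercase` stands in here, and the tag stripper is the "basic tag removal" from the bullet, not a full HTML parser.

```rust
/// Strip basic HTML tags by skipping everything between '<' and '>'.
/// Writes into a caller-provided buffer so the allocation can be
/// reused across documents (allocation-light, linear time).
fn html_strip(input: &str, out: &mut String) {
    out.clear();
    let mut in_tag = false;
    for ch in input.chars() {
        match ch {
            '<' => in_tag = true,
            '>' => in_tag = false,
            c if !in_tag => out.push(c),
            _ => {}
        }
    }
}

/// Case-folding stand-in: a real implementation would apply NFKC/NFD
/// via a normalization crate before folding; `to_lowercase`
/// approximates folding for this sketch.
fn casefold(input: &str, out: &mut String) {
    out.clear();
    for ch in input.chars() {
        out.extend(ch.to_lowercase());
    }
}
```

Both passes are single-scan with no backtracking, which keeps the "no exponential behavior" bound trivially satisfied.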

Testing

  • Unit: tokenization/normalization for ICU options; stopwords/stemmers for en/es/fr/de; char filter behavior.
  • Integration: schema with Spanish analyzer; queries respect search_analyzer; highlight uses analyzed tokens.
  • Backward-compat test: legacy schema with tokenizer: "default" yields identical output.
  • Negative: reject unsupported language codes, invalid char_filter configs, and oversized custom stopword lists (if a size bound is enforced).
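A stopword unit test for the es pack might look like this. The filter helper is a hypothetical stand-in for whatever token-filter API P1 established; only the expected behavior (curated-pack words dropped, content words kept) is from the spec above.

```rust
/// Drop tokens that appear in the stopword set; a stand-in for the
/// real token-filter trait in searchlite-core.
fn remove_stopwords<'a>(tokens: &[&'a str], stopwords: &[&str]) -> Vec<&'a str> {
    tokens
        .iter()
        .copied()
        .filter(|t| !stopwords.contains(t))
        .collect()
}
```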

Docs

  • README: document available analyzers/tokenizers/filters, language codes, and examples per language.
  • index-schema.json docstrings updated for new options; list supported language codes.

Performance/constraints

  • Keep pipelines allocation-light; reuse buffers.
  • Ensure new deps are Rust 1.88 compatible and small.

Acceptance criteria

  • New language analyzers and ICU options work and are documented.
  • Existing behavior unchanged when new options are unused.
  • Tests and docs updated.
