#5: Fuzzy/cross-field search + exact total hits #91

@davidkelley

Goal

Match Elasticsearch (ES) behavior for typo-tolerant multi_match queries and cross-field relevance, and let callers request exact total hit counts.

Features

  • MultiMatch fuzziness (AUTO, numeric edit distance) for string queries.
  • Combined/cross-fields scoring akin to ES combined_fields (BM25F style).
  • Exact total hits via an opt-in toggle, track_total_hits=true.

Implementation Plan

1) MultiMatch Fuzziness

  • Extend QueryNode::MultiMatch to accept fuzziness (enum: Auto, Edits(u8)).
  • Planner: when fuzziness is set, expand each token produced by the field’s analyzer using bounded Levenshtein distance; cap the number of expansions (max_expansions) and enforce a min_length, similar to FuzzyOptions.
  • Scoring: treat fuzzy expansions as additional terms with a reduced boost, 1 / (1 + edit_distance) by default, or make the boost configurable.
  • For single-term queries, Auto maps term length to maximum edit distance: length 1–2 → 0, 3–5 → 1, 6+ → 2.
  • Tests: typo cases (“pikchu” -> “pikachu”); ensure the non-fuzzy default is unaffected.
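
The Auto mapping, bounded edit distance, and reduced boost described above could be sketched as follows. This is a minimal illustration, not the planner's actual code: the names Fuzziness::max_edits, bounded_levenshtein, and fuzzy_boost are hypothetical, and counting length in characters (rather than bytes) is an assumption.

```rust
/// Hypothetical shape of the fuzziness option from the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Fuzziness {
    Auto,
    Edits(u8),
}

impl Fuzziness {
    /// Maximum edit distance allowed for `term`.
    /// Auto follows the mapping in the plan: 1-2 chars -> 0, 3-5 -> 1, 6+ -> 2.
    pub fn max_edits(self, term: &str) -> u8 {
        match self {
            Fuzziness::Edits(n) => n,
            // Assumption: length is measured in Unicode scalar values.
            Fuzziness::Auto => match term.chars().count() {
                0..=2 => 0,
                3..=5 => 1,
                _ => 2,
            },
        }
    }
}

/// Levenshtein distance between `a` and `b`, or None if it exceeds `max`.
pub fn bounded_levenshtein(a: &str, b: &str, max: u8) -> Option<u8> {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let max = max as usize;
    // The length difference is a lower bound on the distance.
    if a.len().abs_diff(b.len()) > max {
        return None;
    }
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        // Row minima never decrease, so the bound cannot be met later.
        if cur.iter().all(|&d| d > max) {
            return None;
        }
        prev = cur;
    }
    let d = prev[b.len()];
    (d <= max).then(|| d as u8)
}

/// Default boost for a fuzzy expansion: 1 / (1 + edit_distance).
pub fn fuzzy_boost(edit_distance: u8) -> f32 {
    1.0 / (1.0 + edit_distance as f32)
}
```

With these definitions, the typo case from the tests bullet resolves as expected: “pikchu” has six characters, so Auto allows two edits, and “pikachu” is one insertion away, giving a boost of 0.5.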

2) Combined Fields (Cross-Fields)

  • match_type: CrossFields already exists in MultiMatch; implement BM25F-like logic behind it:
    • Normalize term freq across listed fields; compute combined score using average field length.
    • Planner builds a term group spanning fields; scorer uses aggregated tf/len.
  • Operator: AND vs. OR matching across the term group is controlled by operator; also support minimum_should_match.
  • Tests: query spans name/set fields; verify relevance parity with ES behavior.
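
One way to aggregate tf/len across fields is sketched below. It is an assumption about how the scorer might work, loosely following the approach of summing weighted term frequencies and lengths before applying a single BM25 saturation; FieldStat, combined_term_score, and the idf form are illustrative names and choices, not existing searchlite-core APIs.

```rust
/// Per-field statistics for one term in one document (hypothetical type).
pub struct FieldStat {
    /// Field boost, e.g. 2.0 for `name` and 1.0 for `set`.
    pub weight: f32,
    /// Term frequency of the term in this field of the document.
    pub tf: f32,
    /// Length (in tokens) of this field in the document.
    pub len: f32,
    /// Average length of this field across the index.
    pub avg_len: f32,
}

/// BM25F-style score for a single term spanning several fields.
/// tf and length are aggregated first, then saturated once, so a term
/// is scored as if the listed fields were one combined field.
pub fn combined_term_score(
    fields: &[FieldStat],
    doc_freq: u64,
    doc_count: u64,
    k1: f32,
    b: f32,
) -> f32 {
    let tf: f32 = fields.iter().map(|f| f.weight * f.tf).sum();
    if tf == 0.0 {
        return 0.0;
    }
    let len: f32 = fields.iter().map(|f| f.weight * f.len).sum();
    let avg_len: f32 = fields.iter().map(|f| f.weight * f.avg_len).sum();
    // Assumption: the common "+1 inside the log" BM25 idf variant,
    // with doc_freq <= doc_count.
    let (n, df) = (doc_count as f32, doc_freq as f32);
    let idf = (1.0 + (n - df + 0.5) / (df + 0.5)).ln();
    // Length normalization against the combined average length.
    let norm = if avg_len > 0.0 { 1.0 - b + b * len / avg_len } else { 1.0 };
    idf * tf / (tf + k1 * norm)
}
```

Because aggregation happens before saturation, a term that occurs once in each of two fields scores the same as a term that occurs twice in a single field with the same total length, which is the property that distinguishes this from summing per-field BM25 scores.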

3) Exact Total Hits

  • SearchRequest gains track_total_hits: Option<bool> (default false to preserve speed).
  • Reader:
    • If the flag is true, compute the exact count of documents matching the query (post-filter) without early termination; this may reuse the aggregation pipeline to count, or run a full collect with DocCollector.
    • If false, keep existing estimate.
  • Response: reuse the total_hits_estimate field; when the count is exact, store it in that field and add a boolean total_hits_exact=true to signal accuracy (optional but recommended).
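
The reader-side branching could look roughly like this. It is a sketch under assumptions: HitCount and resolve_total_hits are hypothetical names, the matching documents are modeled as a plain iterator of doc ids, and the existing estimate is passed in as a precomputed number because the issue does not describe how it is derived.

```rust
/// Hypothetical response fragment: the existing field plus the proposed flag.
#[derive(Debug, PartialEq, Eq)]
pub struct HitCount {
    pub total_hits_estimate: u64,
    pub total_hits_exact: bool,
}

/// Resolve the reported hit count.
/// `matching_docs` yields every doc id that matches the query after filters;
/// it is only consumed when an exact count was requested.
pub fn resolve_total_hits<I>(
    matching_docs: I,
    track_total_hits: Option<bool>,
    estimate: u64,
) -> HitCount
where
    I: Iterator<Item = u32>,
{
    // None behaves like false, preserving the fast default.
    if track_total_hits.unwrap_or(false) {
        HitCount {
            // Full scan with no early termination.
            total_hits_estimate: matching_docs.count() as u64,
            total_hits_exact: true,
        }
    } else {
        HitCount {
            total_hits_estimate: estimate,
            total_hits_exact: false,
        }
    }
}
```

Keeping the count in total_hits_estimate means existing clients keep reading the same field, while total_hits_exact tells opted-in clients whether the number can be trusted as exact.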

Code Touchpoints

  • searchlite-core/src/api/types.rs: new fields/enums; serde defaults.
  • searchlite-core/src/query/planner.rs & scorer: implement cross-field tf aggregation and fuzzy expansions.
  • searchlite-core/src/api/reader.rs: track_total_hits handling; ensure cursors unaffected.
  • HTTP layer: accept new JSON fields; validation errors on invalid fuzziness string.
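
Validation of the fuzziness value at the HTTP layer might be sketched as below. The accepted spellings ("AUTO" in any case, or a non-negative integer) and the name parse_fuzziness are assumptions for illustration; the real types would live in api/types.rs and likely go through serde.

```rust
/// Hypothetical fuzziness option, mirroring the enum proposed in the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Fuzziness {
    Auto,
    Edits(u8),
}

/// Parse a fuzziness string from a request body.
/// Returns a human-readable message on failure, which the HTTP layer
/// could surface as a validation error.
pub fn parse_fuzziness(raw: &str) -> Result<Fuzziness, String> {
    let s = raw.trim();
    if s.eq_ignore_ascii_case("auto") {
        return Ok(Fuzziness::Auto);
    }
    s.parse::<u8>().map(Fuzziness::Edits).map_err(|_| {
        format!(
            "invalid fuzziness {:?}: expected \"AUTO\" or a non-negative integer edit distance",
            raw
        )
    })
}
```

For example, parse_fuzziness("AUTO") and parse_fuzziness("2") succeed, while parse_fuzziness("fast") returns an error message naming the rejected value.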

Performance/Bounds

  • Set per-request caps: at most 50 fuzzy expansions per original token; if the cap is exceeded, fail fast with an HTTP 400.
  • track_total_hits may be expensive; add a warning log when it is enabled without a limit.
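
The expansion cap could be enforced with a small guard like the one below. This is a sketch only: MAX_FUZZY_EXPANSIONS, TooManyExpansions, and check_expansions are hypothetical names, and mapping the error to a 400 response is left to the HTTP layer.

```rust
/// Cap from the plan: at most 50 expanded terms per original token.
pub const MAX_FUZZY_EXPANSIONS: usize = 50;

/// Error returned when a token expands to more terms than allowed.
#[derive(Debug, PartialEq, Eq)]
pub struct TooManyExpansions {
    pub token: String,
    pub found: usize,
    pub limit: usize,
}

/// Fail fast if `expansions` for `token` exceed `limit`; otherwise pass
/// the expansions through unchanged.
pub fn check_expansions(
    token: &str,
    expansions: Vec<String>,
    limit: usize,
) -> Result<Vec<String>, TooManyExpansions> {
    if expansions.len() > limit {
        return Err(TooManyExpansions {
            token: token.to_string(),
            found: expansions.len(),
            limit,
        });
    }
    Ok(expansions)
}
```

Returning a structured error rather than silently truncating keeps the failure visible to the caller, which matches the fail-fast intent stated above.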

Tests

  • Unit: planner builds expected term groups; fuzzy expansion obeys limits.
  • Integration: search request with track_total_hits=true returns exact count on small fixture.
  • Regression: ensure profile/explain still work with new scoring paths.

Migration Notes

  • Existing clients are unaffected unless they opt in to fuzziness or track_total_hits.
  • Document default fuzzy behavior per Auto rules; provide examples mapping from ES queries used in managemco (multi_match with AUTO on single-term searches).
