♻️ Migrate to structured query model with nested filter operators #128

@StephanMeijer

Description

Note

This query model replaces the current functionality. visited is dropped (its behavior can be reproduced with id + reach filters). services is deferred to a separate MR, where it will become a queryable index field.

Summary

Migrate from the current flat query parameter model to a structured expression-based query DSL with boolean combinators (and, or, not) and field operators.

Current State

Current Query Schema (SearchQueryParametersSchema)

Location: /src/backend/core/schemas.py (lines 109-120)

class SearchQueryParametersSchema(BaseModel):
    q: str                                           # Full-text search query
    services: StringListParameter = []               # Which services to search
    visited: StringListParameter = []                # Document IDs user has visited
    reach: Optional[ReachEnum] = None                # PUBLIC/AUTHENTICATED/RESTRICTED
    tags: StringListParameter = []                   # Tag filtering (OR logic only)
    path: Optional[str] = None                       # Path prefix filter
    order_by: Optional[Literal[...]] = "relevance"   # Sort field
    order_direction: Optional[Literal["asc","desc"]] = "desc"
    nb_results: Optional[int] = 50                   # Result limit (1-100)

Current Limitations

| Field | Current Behavior | Limitation |
|---|---|---|
| tags | {"terms": {"tags": [...]}} | OR logic only: matches ANY tag |
| tags | No negation | Cannot exclude tags |
| tags | No AND logic | Cannot require ALL tags present |
| services | Inclusion only | Cannot exclude specific services |
| visited | Hardcoded in filter | Not flexible, couples access control to search |
| Logic | Flat structure | Cannot express complex boolean combinations |

Target State

Refined Query Shape

{
  "query": "budget",
  "where": {
    "and": [
      { "field": "reach", "op": "eq", "value": "restricted" },
      {
        "or": [
          { "field": "tags", "op": "all", "value": ["finance", "approved"] },
          { "field": "path", "op": "prefix", "value": "/teams/legal" }
        ]
      }
    ]
  },
  "sort": [{ "field": "relevance", "direction": "desc" }],
  "limit": 50
}

Query DSL Structure

Top-Level Schema

interface SearchQuery {
  query: string;                  // Full-text search query (renamed from q)
  where?: WhereClause;            // Filter expression (optional)
  sort?: SortClause[];            // Sort criteria (array for multi-sort)
  limit?: number;                 // Result limit (1-100, default 50)
}

Where Clause (Recursive Expression)

type WhereClause = 
  | { and: WhereClause[] }      // All conditions must match
  | { or: WhereClause[] }       // Any condition must match
  | { not: WhereClause }        // Negate condition
  | FieldCondition;             // Leaf condition

interface FieldCondition {
  field: string;                // Field name
  op: Operator;                 // Operator
  value: unknown;               // Operand value
}
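
The recursive shape above can be checked structurally before any Pydantic model is involved. The following is a minimal sketch, assuming the where clause arrives as a plain dict parsed from JSON; validate_where is a hypothetical helper name, not part of the proposed schema, and the rule that a combinator node carries exactly one key is an assumption this issue does not state.

```python
from typing import Any

COMBINATORS = ("and", "or", "not")
LEAF_KEYS = {"field", "op", "value"}


def validate_where(node: Any, path: str = "where") -> list[str]:
    """Return structural errors for a where clause; an empty list means valid."""
    if not isinstance(node, dict):
        return [f"{path}: expected an object"]

    present = [key for key in COMBINATORS if key in node]
    if present:
        # Assumption: a combinator node has exactly one key (and, or, or not).
        if len(node) != 1:
            return [f"{path}: a combinator node must have exactly one key"]
        key = present[0]
        child = node[key]
        if key == "not":
            # not wraps a single nested clause
            return validate_where(child, f"{path}.not")
        if not isinstance(child, list):
            return [f"{path}.{key}: expected an array"]
        errors: list[str] = []
        for i, item in enumerate(child):
            errors.extend(validate_where(item, f"{path}.{key}[{i}]"))
        return errors

    # Anything else must be a leaf FieldCondition with field, op and value.
    missing = LEAF_KEYS - node.keys()
    if missing:
        return [f"{path}: missing {sorted(missing)}"]
    return []
```

Returning error paths such as where.and[0] rather than raising on the first problem makes it easier to report every malformed node of a deeply nested expression at once.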

Operators

| Operator | Description | Value Type | OpenSearch Mapping |
|---|---|---|---|
| eq | Exact equality | string, number, or boolean | {"term": {field: value}} |
| in | Match ANY value (OR) | string[] | {"terms": {field: values}} |
| all | Match ALL values (AND) | string[] | One {"term": {field: value}} per value inside bool.must |
| prefix | Prefix match | string | {"prefix": {field: value}} |
| gt | Greater than | number or date | {"range": {field: {"gt": value}}} |
| gte | Greater than or equal | number or date | {"range": {field: {"gte": value}}} |
| lt | Less than | number or date | {"range": {field: {"lt": value}}} |
| lte | Less than or equal | number or date | {"range": {field: {"lte": value}}} |
| exists | Field exists | boolean | {"exists": {"field": field}}, wrapped in bool.must_not when value is false |
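
The value types in this table are not enforced by a loosely typed value field, so an all condition given a bare string would be iterated character by character when building term queries. Below is a rough sketch of a per-operator operand check; check_operand is a hypothetical helper, and accepting ISO date strings for the range operators is an assumption based on the date range example in this issue.

```python
LIST_OPS = {"in", "all"}
RANGE_OPS = {"gt", "gte", "lt", "lte"}


def check_operand(op: str, value: object) -> None:
    """Raise ValueError when value does not fit the operator's expected type."""
    if op in LIST_OPS:
        if not isinstance(value, list):
            raise ValueError(f"'{op}' expects a list of values")
    elif op == "exists":
        if not isinstance(value, bool):
            raise ValueError("'exists' expects a boolean")
    elif op == "prefix":
        if not isinstance(value, str):
            raise ValueError("'prefix' expects a string")
    elif op in RANGE_OPS:
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float, str)):
            raise ValueError(f"'{op}' expects a number or a date string")
    elif op == "eq":
        if not isinstance(value, (str, int, float, bool)):
            raise ValueError("'eq' expects a string, number, or boolean")
    else:
        raise ValueError(f"unknown operator: {op}")
```

In the Pydantic schema the same rule could live in a model-level validator on FieldCondition; the standalone function only illustrates the rule itself.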

Migration from Old Parameters

| Old Pattern | New Pattern |
|---|---|
| q=budget | query: "budget" |
| visited=doc1,doc2 | where: { field: "id", op: "in", value: ["doc1", "doc2"] } |
| services=drive,wiki | where: { field: "services", op: "in", value: ["drive", "wiki"] } (after index MR) |
| reach=public | where: { field: "reach", op: "eq", value: "public" } |
| tags=finance,legal (OR) | where: { field: "tags", op: "in", value: ["finance", "legal"] } |
| tags with AND logic | where: { field: "tags", op: "all", value: ["finance", "legal"] } |
| path=/teams/ | where: { field: "path", op: "prefix", value: "/teams/" } |
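
For clients that still hold old-style parameters, the mapping above could be applied mechanically. This is a sketch under several assumptions: migrate_params is a hypothetical helper, multiple old parameters are combined with and, services is skipped because it is deferred to the index MR, and order_by / order_direction / nb_results are assumed to map onto sort and limit (the table does not list them).

```python
def migrate_params(old: dict) -> dict:
    """Translate old flat parameters (already parsed into a dict) to the new shape."""
    conditions = []
    if old.get("visited"):
        conditions.append({"field": "id", "op": "in", "value": list(old["visited"])})
    if old.get("reach"):
        conditions.append({"field": "reach", "op": "eq", "value": old["reach"]})
    if old.get("tags"):
        # The old tags parameter had OR semantics, so it maps to "in"
        conditions.append({"field": "tags", "op": "in", "value": list(old["tags"])})
    if old.get("path"):
        conditions.append({"field": "path", "op": "prefix", "value": old["path"]})

    new = {"query": old["q"]}
    if len(conditions) == 1:
        # A single condition needs no combinator wrapper
        new["where"] = conditions[0]
    elif conditions:
        new["where"] = {"and": conditions}

    # Assumed mapping of the old sort and limit parameters
    new["sort"] = [{
        "field": old.get("order_by", "relevance"),
        "direction": old.get("order_direction", "desc"),
    }]
    new["limit"] = old.get("nb_results", 50)
    return new
```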

Example Queries

Simple equality filter

{
  "query": "report",
  "where": { "field": "reach", "op": "eq", "value": "public" }
}

Tags with AND logic (require all)

{
  "query": "*",
  "where": { "field": "tags", "op": "all", "value": ["finance", "Q4", "approved"] }
}

Exclude drafts

{
  "query": "policy",
  "where": {
    "not": { "field": "tags", "op": "in", "value": ["draft", "wip"] }
  }
}

Access control (replaces visited)

{
  "query": "*",
  "where": {
    "and": [
      { "field": "id", "op": "in", "value": ["doc-uuid-1", "doc-uuid-2"] },
      { "field": "reach", "op": "in", "value": ["public", "authenticated"] }
    ]
  }
}

Complex boolean combination

{
  "query": "budget",
  "where": {
    "and": [
      { "field": "reach", "op": "eq", "value": "restricted" },
      {
        "or": [
          { "field": "tags", "op": "in", "value": ["finance"] },
          { "field": "path", "op": "prefix", "value": "/teams/legal" }
        ]
      },
      { "not": { "field": "tags", "op": "in", "value": ["archived"] } }
    ]
  }
}
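
To make the intended semantics of the nested example concrete (and to give unit tests a reference to compare against), the expression can be evaluated in memory against a plain document dict. This is only an illustrative sketch: matches is a hypothetical function, it ignores OpenSearch analysis and mapping details, and the range operators are left out for brevity.

```python
def matches(where: dict, doc: dict) -> bool:
    """Evaluate a where clause against an in-memory document (illustration only)."""
    if "and" in where:
        return all(matches(child, doc) for child in where["and"])
    if "or" in where:
        return any(matches(child, doc) for child in where["or"])
    if "not" in where:
        return not matches(where["not"], doc)

    op, expected = where["op"], where["value"]
    actual = doc.get(where["field"])
    # Treat every field as multi-valued, the way tags behaves in the index
    if actual is None:
        values = []
    elif isinstance(actual, list):
        values = actual
    else:
        values = [actual]

    if op == "eq":
        return expected in values
    if op == "in":
        return any(v in expected for v in values)
    if op == "all":
        return all(e in values for e in expected)
    if op == "prefix":
        return any(isinstance(v, str) and v.startswith(expected) for v in values)
    if op == "exists":
        return bool(values) == expected
    raise ValueError(f"operator not covered by this sketch: {op}")
```

With the query above, a restricted document tagged finance matches, while the same document with an additional archived tag does not, because the not branch is combined with and at the top level.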

Date range filter

{
  "query": "*",
  "where": {
    "and": [
      { "field": "created_at", "op": "gte", "value": "2024-01-01" },
      { "field": "created_at", "op": "lt", "value": "2025-01-01" }
    ]
  }
}

Pydantic Schema Design

from typing import Optional, List, Literal, Union
from pydantic import BaseModel, Field
from enum import Enum


class Operator(str, Enum):
    EQ = "eq"
    IN = "in"
    ALL = "all"
    PREFIX = "prefix"
    GT = "gt"
    GTE = "gte"
    LT = "lt"
    LTE = "lte"
    EXISTS = "exists"


class FieldCondition(BaseModel):
    field: str
    op: Operator
    value: Union[str, int, float, bool, List[str], List[int]]


class AndClause(BaseModel):
    and_: List["WhereClause"] = Field(alias="and")


class OrClause(BaseModel):
    or_: List["WhereClause"] = Field(alias="or")


class NotClause(BaseModel):
    not_: "WhereClause" = Field(alias="not")


WhereClause = Union[AndClause, OrClause, NotClause, FieldCondition]

# Enable forward references
AndClause.model_rebuild()
OrClause.model_rebuild()
NotClause.model_rebuild()


class SortClause(BaseModel):
    field: Literal["relevance", "title", "created_at", "updated_at", "size"] = "relevance"
    direction: Literal["asc", "desc"] = "desc"


class SearchQuerySchema(BaseModel):
    """Schema for structured query DSL - replaces SearchQueryParametersSchema"""
    
    query: str
    where: Optional[WhereClause] = None
    sort: Optional[List[SortClause]] = None
    limit: Optional[int] = Field(default=50, ge=1, le=100)
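
The sort array still has to be translated into an OpenSearch sort specification. A possible sketch follows, working on plain dicts rather than SortClause instances; build_sort is a hypothetical helper, and mapping relevance to OpenSearch's _score is an assumption about how the service would implement it. Sorting on a text field such as title would additionally depend on the index mapping, which this sketch does not address.

```python
# Assumption: "relevance" is served by OpenSearch's built-in _score
SORT_FIELD_MAP = {"relevance": "_score"}


def build_sort(sort):
    """Translate the DSL sort array into an OpenSearch sort list."""
    if not sort:
        # Default mirrors SortClause: relevance, descending
        return [{"_score": {"order": "desc"}}]
    result = []
    for clause in sort:
        name = clause.get("field", "relevance")
        direction = clause.get("direction", "desc")
        result.append({SORT_FIELD_MAP.get(name, name): {"order": direction}})
    return result
```

Keeping the list order intact is what gives multi-field sorting its meaning: later entries only break ties left by earlier ones.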

OpenSearch Query Builder

def build_opensearch_filter(where: WhereClause) -> dict:
    """Recursively build OpenSearch bool query from WhereClause."""
    
    if isinstance(where, AndClause):
        return {
            "bool": {
                "must": [build_opensearch_filter(c) for c in where.and_]
            }
        }
    
    if isinstance(where, OrClause):
        return {
            "bool": {
                "should": [build_opensearch_filter(c) for c in where.or_],
                "minimum_should_match": 1
            }
        }
    
    if isinstance(where, NotClause):
        return {
            "bool": {
                "must_not": [build_opensearch_filter(where.not_)]
            }
        }
    
    # FieldCondition
    return build_field_condition(where)


def build_field_condition(cond: FieldCondition) -> dict:
    """Build OpenSearch clause from field condition."""
    
    # Map external field names to OpenSearch field names
    field = "_id" if cond.field == "id" else cond.field
    
    match cond.op:
        case Operator.EQ:
            return {"term": {field: cond.value}}
        
        case Operator.IN:
            return {"terms": {field: cond.value}}
        
        case Operator.ALL:
            # All values must match - multiple term queries in bool.must
            return {
                "bool": {
                    "must": [{"term": {field: v}} for v in cond.value]
                }
            }
        
        case Operator.PREFIX:
            return {"prefix": {field: cond.value}}
        
        case Operator.GT | Operator.GTE | Operator.LT | Operator.LTE:
            return {"range": {field: {cond.op.value: cond.value}}}
        
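        case Operator.EXISTS:
            clause = {"exists": {"field": field}}
            return clause if cond.value else {"bool": {"must_not": [clause]}}

        case _:
            # Fail loudly instead of implicitly returning None for an unhandled operator
            raise ValueError(f"Unsupported operator: {cond.op}")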
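
The builder above only produces the filter part. One way get_query() might combine it with the full-text query is sketched below. Everything specific here is an assumption: build_search_body is a hypothetical name, the title and content fields passed to multi_match are placeholders for whatever the index actually defines, and treating "*" as match_all is inferred from the example queries rather than stated in this issue.

```python
def build_search_body(query, where_filter=None, sort=None, limit=50):
    """Assemble an OpenSearch request body from the already-built parts."""
    if query == "*":
        # Assumed convention: "*" means "no full-text constraint"
        text_clause = {"match_all": {}}
    else:
        # Placeholder field names; the real index mapping may differ
        text_clause = {"multi_match": {"query": query, "fields": ["title", "content"]}}

    bool_query = {"must": [text_clause]}
    if where_filter is not None:
        # Filter context: restricts the result set without affecting scores
        bool_query["filter"] = [where_filter]

    body = {"query": {"bool": bool_query}, "size": limit}
    if sort:
        body["sort"] = sort
    return body
```

Placing the where expression under filter keeps relevance scoring driven by the text query alone, even though the nested clauses are built with must and should.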

Files to Modify

| File | Changes |
|---|---|
| src/backend/core/schemas.py | New recursive WhereClause schema, SearchQuerySchema |
| src/backend/core/services/search.py | New build_opensearch_filter() function, update get_query() |
| src/backend/core/views.py | Update SearchDocumentView to use new schema |
| src/backend/core/enums.py | Add Operator enum |

Acceptance Criteria

  • Recursive WhereClause schema validates nested boolean expressions
  • and / or / not combinators work at any nesting depth
  • eq operator performs exact term match
  • in operator matches ANY value (OR semantics)
  • all operator requires ALL values (AND semantics)
  • prefix operator performs prefix match
  • gt/gte/lt/lte operators work for dates and numbers
  • exists operator checks field presence
  • sort accepts array for multi-field sorting
  • id field maps to OpenSearch _id internally
  • Unit tests cover nested boolean combinations
  • Integration tests verify OpenSearch query generation

Labels

enhancement, api, breaking-change
