[FEATURE] Rule-Based Query Recommendation Engine #532

@ansjcy

Is your feature request related to a problem?

OpenSearch Query Insights currently identifies slow and resource-intensive queries but provides no actionable guidance on how to fix them. Users face several critical issues:

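  1. Silent Failures: Queries like {"term": {"status": "Active"}} on text fields return 0 results without warnings, because term queries are not analyzed while the indexed text tokens usually are (for example, lowercased)
  2. Performance Issues: Leading wildcards (*search) cause 100-1000x slowdowns with no indication of why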
  3. Safety Risks: Sorting on text fields can trigger OOM crashes that bring down entire clusters
  4. Knowledge Barrier: Fixing these issues requires deep OpenSearch expertise that most users lack
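To illustrate the silent failure in item 1, here is a minimal sketch in plain Python, not OpenSearch code. It assumes a text field whose analyzer lowercases tokens, as the default standard analyzer does; the toy index, `standard_analyze`, `term_query`, and `match_query` are hypothetical stand-ins.

```python
import re
from typing import Dict, List, Set


def standard_analyze(text: str) -> List[str]:
    """Rough stand-in for a lowercasing analyzer: split into words, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# Toy inverted index for a text field named "status".
docs = {1: "Active", 2: "Inactive", 3: "Active"}
inverted_index: Dict[str, Set[int]] = {}
for doc_id, value in docs.items():
    for token in standard_analyze(value):
        inverted_index.setdefault(token, set()).add(doc_id)


def term_query(value: str) -> Set[int]:
    """A term query looks the value up verbatim, with no analysis."""
    return inverted_index.get(value, set())


def match_query(value: str) -> Set[int]:
    """A match query analyzes the input first, then looks up each token."""
    hits: Set[int] = set()
    for token in standard_analyze(value):
        hits |= inverted_index.get(token, set())
    return hits


print(term_query("Active"))   # no hits: only "active" is in the index
print(match_query("Active"))  # documents 1 and 3
```

The term lookup fails without any error, which is exactly the kind of case a recommendation could flag.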

For example:

User sees: Query latency = 5,200ms in Top N Queries dashboard
User thinks: "Why is this slow? What should I do?"
User outcome: Spends hours searching documentation and may not find a solution

What solution would you like?

A rule-based recommendation engine integrated into Query Insights that:

  1. Automatically analyzes queries from Top N Queries and potentially profiler requests
  2. Detects anti-patterns using predefined rules (Rules TBD)
  3. Generates actionable recommendations with:
    • Clear problem description
    • Impact assessment (latency, memory, correctness)
    • Specific fix with code examples
    • Confidence scores
  4. Surfaces recommendations through:
    • Top N Queries dashboard (inline badges and panels)
    • Query profiler page (on-demand analysis)
    • REST API endpoints (the sequence diagram below shows the proposed endpoints)
sequenceDiagram
    participant User
    participant Dashboard
    participant API as REST API
    participant Service as RecommendationService
    participant Context as QueryContext
    participant Rules as Rule Engine
    participant Cache as Metadata Cache

    User->>Dashboard: View Top N Queries
    Dashboard->>API: GET /_insights/top_queries?recommendations=true
    API->>Service: analyzeTopQueries(records)

    loop For each query record
        Service->>Context: build(record)
        Context->>Cache: getFieldType(field)
        Cache-->>Context: "text"
        Context->>Cache: getFieldCardinality(field)
        Cache-->>Context: 10000000

        Service->>Rules: evaluate(context)
        Rules->>Rules: match all active rules
        Rules-->>Service: List<Recommendation>

        Service->>Service: attach recommendations to record
        Service->>Service: store.put(queryHash, recommendations)
    end

    Service-->>API: List<QueryRecord with recommendations>
    API-->>Dashboard: top queries with recommendations embedded
    Dashboard-->>User: Show recommendations inline

    User->>Dashboard: Click specific query in Top N list
    Dashboard->>API: GET /_insights/recommendations/{queryId}
    API->>Service: getRecommendations(queryId)
    Service->>Service: store.get(queryHash)
    Service-->>API: List<Recommendation>
    API-->>Dashboard: recommendations for specific query
    Dashboard-->>User: Show detailed recommendations

    User->>Dashboard: Click "Analyze Query" (Profiler)
    Dashboard->>API: POST /_insights/recommendations/analyze
    API->>Service: analyzeQuery(query, indices)
    Service->>Context: build(query)
    Service->>Rules: evaluate(context)
    Rules-->>Service: List<Recommendation>
    Service-->>API: recommendations
    API-->>Dashboard: recommendations with code examples
    Dashboard-->>User: Display recommendations + copy button
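The evaluate step in the diagram could look roughly like the sketch below. Since the rules are still TBD, everything here is an assumption made for illustration: the `Recommendation` fields, the two rules, the confidence values, and the use of a plain dict in place of the metadata cache. It only inspects top-level `term` and `wildcard` clauses, not nested bool queries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Recommendation:
    rule_id: str
    problem: str
    impact: str        # e.g. "latency", "memory", "correctness"
    fix: str
    confidence: float  # placeholder scale from 0.0 to 1.0


@dataclass
class QueryContext:
    query: dict
    field_types: Dict[str, str]  # stands in for the metadata cache


Rule = Callable[[QueryContext], Optional[Recommendation]]


def term_on_text_field(ctx: QueryContext) -> Optional[Recommendation]:
    """Flag a term query that targets a field mapped as text."""
    for field_name in ctx.query.get("term", {}):
        if ctx.field_types.get(field_name) == "text":
            return Recommendation(
                rule_id="term_on_text_field",
                problem=f"term query on text field '{field_name}'",
                impact="correctness",
                fix="Use a match query, or a keyword sub-field if the mapping has one.",
                confidence=0.9,  # made-up value
            )
    return None


def leading_wildcard(ctx: QueryContext) -> Optional[Recommendation]:
    """Flag a wildcard pattern that starts with '*' or '?'."""
    for field_name, spec in ctx.query.get("wildcard", {}).items():
        pattern = spec.get("value", "") if isinstance(spec, dict) else spec
        if isinstance(pattern, str) and pattern.startswith(("*", "?")):
            return Recommendation(
                rule_id="leading_wildcard",
                problem=f"leading wildcard '{pattern}' on field '{field_name}'",
                impact="latency",
                fix="Avoid the leading wildcard, e.g. index a reversed copy of the field.",
                confidence=0.8,  # made-up value
            )
    return None


ACTIVE_RULES: List[Rule] = [term_on_text_field, leading_wildcard]


def evaluate(ctx: QueryContext) -> List[Recommendation]:
    """Run every active rule and collect the recommendations that fire."""
    return [rec for rec in (rule(ctx) for rule in ACTIVE_RULES) if rec is not None]
```

For example, `evaluate(QueryContext({"term": {"status": "Active"}}, {"status": "text"}))` returns a single recommendation from the `term_on_text_field` rule.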

Key factors:

  1. Asynchronous Processing: Recommendation generation happens off the search path (zero query latency impact)
  2. Cached Metadata (where possible): Field types and cardinality are cached for O(1) lookups, so rule evaluation issues no cluster state queries
  3. Rule-Based (Phase 1): Deterministic, explainable recommendations.
  4. Fail-Safe: Errors in recommendation engine never propagate to query execution
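Factors 2 and 4 can be sketched together as follows. This is an illustrative Python outline under stated assumptions, not the plugin's design: `MetadataCache`, `refresh`, and `safe_evaluate` are hypothetical names, and the cache is assumed to be refreshed by some background task outside rule evaluation.

```python
import logging
from typing import Callable, Dict, List, Optional

logger = logging.getLogger("recommendations")


class MetadataCache:
    """Dict-backed cache; lookups never reach out to cluster state."""

    def __init__(self) -> None:
        self._field_types: Dict[str, str] = {}

    def refresh(self, field_types: Dict[str, str]) -> None:
        """Called off the evaluation path, e.g. when mappings change (assumption)."""
        self._field_types = dict(field_types)

    def get_field_type(self, field_name: str) -> Optional[str]:
        """O(1) lookup; returns None when the field is not cached."""
        return self._field_types.get(field_name)


def safe_evaluate(rules: List[Callable], context: object) -> List[object]:
    """Run each rule in isolation so one failing rule cannot break the rest."""
    recommendations = []
    for rule in rules:
        try:
            result = rule(context)
        except Exception:
            # Log and skip: errors stay inside the recommendation engine.
            logger.exception("rule %s failed; skipping", getattr(rule, "__name__", rule))
            continue
        if result is not None:
            recommendations.append(result)
    return recommendations
```

A rule that raises is logged and skipped, while the remaining rules still produce their recommendations.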

What alternatives have you considered?

We could build the recommendation engine as a separate service outside the OpenSearch cluster and have Query Insights (QI) provide as much metadata as possible.

Pros:

  • Language flexibility (could use Python for ML)
  • Independent scaling
  • Isolation from cluster

Cons:

  • Data export requirements: To export top queries to external sinks, we must mask/remove sensitive information (usernames, IP addresses, PII in query values), losing critical query details needed for analysis
  • Loss of query context: External service cannot access cluster metadata like field types, field cardinality, index settings, and workload group configurations that are essential for rule evaluation
  • Rule-based recommendations become nearly impossible: Almost every useful query-specific rule needs the query context, cluster metadata, or the exact query pattern (e.g., *search), any of which may be masked during export.
  • Network latency: Additional hop for recommendation generation
  • Security concerns: Exporting query data outside the cluster increases attack surface
  • Overhead of emitting metrics: The cluster cannot emit every metric an external service would need for recommendations, and whatever it does emit adds overhead to the cluster, so this approach does not have zero impact either.
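The masking problem in the cons above can be shown with a small Python sketch. The masking scheme (replacing every leaf value with a placeholder) and the helper names are assumptions for illustration; a real export pipeline may mask differently.

```python
from typing import Any


def mask_values(node: Any) -> Any:
    """Replace every leaf value with a placeholder, keeping keys and shape."""
    if isinstance(node, dict):
        return {key: mask_values(value) for key, value in node.items()}
    if isinstance(node, list):
        return [mask_values(item) for item in node]
    return "<masked>"


def has_leading_wildcard(query: dict) -> bool:
    """True if a top-level wildcard clause has a pattern starting with '*'."""
    for spec in query.get("wildcard", {}).values():
        pattern = spec.get("value", "") if isinstance(spec, dict) else spec
        if isinstance(pattern, str) and pattern.startswith("*"):
            return True
    return False


original = {"wildcard": {"title": {"value": "*search"}}}
exported = mask_values(original)

print(has_leading_wildcard(original))  # the in-cluster rule sees the pattern
print(has_leading_wildcard(exported))  # the external service no longer can
```

Once the pattern is masked, the external service can still see that a wildcard query ran, but not whether it had a leading wildcard.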

Decision: Keep the recommendation engine in-plugin. Rule-based recommendations fundamentally depend on having access to:

  1. Exact query structure (e.g., detecting * at start of wildcard pattern)
  2. Cluster metadata (field types, cardinality, index settings)
  3. Real-time context (workload groups, current cluster state)

Labels: enhancement (New feature or request), v3.6.0 (Issues targeting release v3.6.0)
