
[FEATURE] Query Event Monitoring and Notifications #529

@ansjcy

Is your feature request related to a problem?

OpenSearch users must manually monitor Query Insights dashboards to detect problematic query patterns. This reactive approach leads to:

  1. Discovery Delay: New expensive query patterns may run for hours before being noticed
  2. Manual Toil: Administrators must periodically check dashboards to catch issues
  3. No Proactive Alerts: No automatic notification when queries degrade or new patterns emerge
  4. Difficult Pattern Tracking: For clusters with dynamically-generated queries, it's nearly impossible to track what queries are impacting the cluster

Users need proactive notifications when query patterns change, rather than constantly monitoring dashboards.

What solution would you like?

Overview

Provide pre-configured event monitors for common query performance scenarios with simple setup and multi-channel notifications. The solution leverages:

  1. Existing Data: Query Insights already exports top N queries to local indices (top_queries-YYYY.MM.dd-*)
  2. Alerting Plugin Integration: Use the OpenSearch Alerting plugin for notification delivery
  3. Pre-Built Templates: Ship monitor templates that users can enable with minimal configuration

User Experience

Enable monitors via REST API:

POST /_insights/monitoring/install
{
  "monitors": [
    "new_query_in_top_n",
    "performance_regression",
    "resource_spike",
    "query_cancellation_spike",
    "wlm_threshold_breach"
  ],
  "notification_config": {
    "name": "Query Insights Alerts",
    "description": "Notification channel for Query Insights monitoring",
    "config_type": "slack",
    "slack": {
      "url": "https://hooks.slack.com/services/..."
    }
  },
  "scope": "cluster", // or user, role specific.
  "thresholds": {
    "new_query_in_top_n": {
      "duration_minutes": 10,
      "top_n_size": 10
    },
    "performance_regression": {
      "threshold_percent": 50
    }
  }
}

Response:

{
  "monitors_created": [
    {
      "monitor_id": "abc123",
      "monitor_type": "new_query_in_top_n",
      "status": "enabled",
      "alerting_monitor_id": "qi-new-query-001"
    },
    {
      "monitor_id": "def456",
      "monitor_type": "performance_regression",
      "status": "enabled",
      "alerting_monitor_id": "qi-perf-reg-001"
    }
  ],
  "notification_config_id": "notify-xyz789"
}
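
As a rough illustration of the request and response shapes above, here is a minimal Python sketch that assembles an install body and indexes the response. The helper names (`build_install_request`, `alerting_ids_by_type`) are hypothetical, and the endpoint itself is only proposed in this issue, so nothing here reflects an existing API.

```python
from __future__ import annotations

from typing import Any, Optional

# Monitor names taken from the install request example in this issue.
KNOWN_MONITORS = frozenset({
    "new_query_in_top_n",
    "performance_regression",
    "resource_spike",
    "query_cancellation_spike",
    "wlm_threshold_breach",
})


def build_install_request(
    monitors: list[str],
    slack_url: str,
    thresholds: Optional[dict[str, dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Assemble a body for the proposed POST /_insights/monitoring/install call."""
    unknown = [m for m in monitors if m not in KNOWN_MONITORS]
    if unknown:
        raise ValueError(f"unknown monitor types: {unknown}")
    thresholds = thresholds or {}
    stray = [name for name in thresholds if name not in monitors]
    if stray:
        raise ValueError(f"thresholds given for monitors not being installed: {stray}")
    return {
        "monitors": list(monitors),
        "notification_config": {
            "name": "Query Insights Alerts",
            "description": "Notification channel for Query Insights monitoring",
            "config_type": "slack",
            "slack": {"url": slack_url},
        },
        "scope": "cluster",
        "thresholds": thresholds,
    }


def alerting_ids_by_type(response: dict[str, Any]) -> dict[str, str]:
    """Map each monitor_type in the install response to its Alerting monitor ID."""
    return {
        m["monitor_type"]: m["alerting_monitor_id"]
        for m in response["monitors_created"]
    }
```

Validating monitor names and thresholds on the client side is a design choice of this sketch; the proposed API may well perform the same checks server-side.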

Manage monitors:

# List Query Insights monitors
GET /_insights/monitoring/monitors

# Get specific monitor details
GET /_insights/monitoring/monitors/{monitor_id}

# Update monitor thresholds
# This updates the query DSL in the underlying Alerting monitor
PUT /_insights/monitoring/monitors/{monitor_id}
{
  "thresholds": {
    "duration_minutes": 30
  }
}

# Enable/disable monitor
# This sets the "enabled" flag in the Alerting monitor
PUT /_insights/monitoring/monitors/{monitor_id}
{
  "enabled": false
}

# Delete monitor
# This deletes the Alerting monitor and removes QI's mapping
DELETE /_insights/monitoring/monitors/{monitor_id}
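
The management calls above could be wrapped in a thin client layer. The sketch below only builds `(method, path, body)` tuples and sends nothing; the function names are hypothetical and the paths are the ones proposed in this issue.

```python
from __future__ import annotations

from typing import Any, Optional
from urllib.parse import quote

# Base path of the proposed management API (not an existing endpoint).
BASE = "/_insights/monitoring/monitors"


def _path(monitor_id: str) -> str:
    # Percent-encode the ID so it cannot break out of its path segment.
    return f"{BASE}/{quote(monitor_id, safe='')}"


def list_monitors() -> tuple[str, str, Optional[dict[str, Any]]]:
    return ("GET", BASE, None)


def get_monitor(monitor_id: str) -> tuple[str, str, Optional[dict[str, Any]]]:
    return ("GET", _path(monitor_id), None)


def update_thresholds(monitor_id: str, **thresholds: Any) -> tuple[str, str, dict[str, Any]]:
    if not thresholds:
        raise ValueError("at least one threshold is required")
    return ("PUT", _path(monitor_id), {"thresholds": thresholds})


def set_enabled(monitor_id: str, enabled: bool) -> tuple[str, str, dict[str, Any]]:
    return ("PUT", _path(monitor_id), {"enabled": enabled})


def delete_monitor(monitor_id: str) -> tuple[str, str, Optional[dict[str, Any]]]:
    return ("DELETE", _path(monitor_id), None)
```

Returning plain tuples keeps the sketch independent of any HTTP library; a real client would pass them to whatever transport it already uses.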

Pre-Built Monitor Scenarios

Based on current Query Insights capabilities, the following monitors are proposed:

| Monitor ID | Trigger Condition | Data Source | Value |
| --- | --- | --- | --- |
| new_query_in_top_n | New query_group_hashcode appears in top N for > X minutes | top_n_query, query_group_hashcode, timestamp | Catch new expensive patterns |
| performance_regression | Query latency/CPU/memory increases > X% from rolling baseline | latency, cpu, memory measurements with time-series analysis | Detect degradation |
| resource_spike | Query CPU or memory exceeds threshold | cpu, memory measurements | Prevent resource exhaustion |
| query_cancellation_spike | Cancellation rate exceeds threshold | is_cancelled attribute | Detect client timeouts or overload |
| wlm_threshold_breach | Queries from specific WLM group exceed limits | wlm_group_id, resource measurements | Workload management enforcement |
| per_user_load_spike | Specific user's query count/resources spike | username, aggregated metrics | Identify problematic users |

Key Data Fields Used:

  • timestamp - For time-based windows
  • latency, cpu, memory - Performance metrics (from measurements map)
  • query_group_hashcode - Query pattern identity (from attributes)
  • top_n_query - Map indicating which metric type(s) this was top N for
  • is_cancelled - Cancellation flag
  • wlm_group_id - Workload management group
  • username - User who initiated the query
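
To make a few of these trigger conditions concrete, here is a small Python sketch that evaluates them over in-memory records. It assumes a simplified, flattened record shape (`cpu`, `memory`, `is_cancelled`, `username` as top-level keys); the real documents keep metrics in a measurements map, and the function names are hypothetical.

```python
from __future__ import annotations

from collections import Counter
from typing import Any


def cancellation_rate(records: list[dict[str, Any]]) -> float:
    """Fraction of records with is_cancelled set (query_cancellation_spike)."""
    if not records:
        return 0.0
    cancelled = sum(1 for r in records if r.get("is_cancelled"))
    return cancelled / len(records)


def resource_spikes(
    records: list[dict[str, Any]], cpu_limit: float, memory_limit: float
) -> list[dict[str, Any]]:
    """Records whose cpu or memory exceeds its limit (resource_spike)."""
    return [r for r in records if r["cpu"] > cpu_limit or r["memory"] > memory_limit]


def queries_per_user(records: list[dict[str, Any]]) -> dict[str, int]:
    """Query count per username, the raw input for per_user_load_spike."""
    return dict(Counter(r["username"] for r in records))
```

In the proposed design these checks would run as scheduled queries and trigger scripts inside the Alerting plugin; the Python version is only meant to pin down the arithmetic.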

Implementation Approach

The implementation leverages the existing OpenSearch Alerting plugin:

  1. Query Insights Role:

    • Already exports data to top_queries-YYYY.MM.dd-* indices
    • Provides REST API to install pre-built monitor templates
    • Translates monitor configs into Alerting plugin Monitor objects
  2. Alerting Plugin Role:

    • Executes scheduled queries against top_queries-* indices
    • Evaluates trigger conditions
    • Handles notification delivery (Slack, email, webhooks, SNS)
    • Manages throttling and deduplication
  3. User Flow:

    User enables monitor
         ↓
    Query Insights API creates Monitor in Alerting plugin
         ↓
    Alerting plugin runs scheduled query every 5 min
         ↓
    If condition matches → send notification
    

Detailed flows

Flow 1: Monitor Installation

sequenceDiagram
    actor User
    participant Dashboard as Query Insights Dashboard
    participant API as Query Insights REST API
    participant Templates as Monitor Templates
    participant NotifAPI as Notifications Plugin API
    participant AlertingAPI as Alerting Plugin API

    User->>Dashboard: Enable "New Query in Top N" monitor
    Dashboard->>API: POST /_insights/monitoring/install<br/>{monitors, notification_config}
    activate API

    API->>Templates: Load monitor template(s)
    Templates-->>API: Return template JSON(s)

    Note over API,NotifAPI: Step 1: Create notification channel
    API->>NotifAPI: POST /_plugins/_notifications/configs
    Note right of NotifAPI: Request body:<br/>{<br/>  "name": "Query Insights Alerts",<br/>  "config_type": "slack",<br/>  "slack": {"url": "..."},<br/>  ...}
    activate NotifAPI
    NotifAPI->>NotifAPI: Create notification config
    NotifAPI-->>API: {config_id: "notify-xyz789"}
    deactivate NotifAPI

    Note over API,AlertingAPI: Step 2: Create monitor(s) in Alerting plugin
    loop For each monitor type
        API->>API: Render template with:<br/>- User thresholds<br/>- notification config_id<br/>- Query against top_queries-*
        API->>AlertingAPI: POST /_plugins/_alerting/monitors
        Note right of AlertingAPI: Monitor JSON includes:<br/>- schedule (every 5 min)<br/>- inputs (query)<br/>- triggers (conditions)<br/>- actions (with config_id)
        activate AlertingAPI
        AlertingAPI->>AlertingAPI: Validate and create monitor
        AlertingAPI-->>API: {_id: "qi-new-query-001"}
        deactivate AlertingAPI
    end

    API->>API: Store mapping:<br/>QI monitor_id -> Alerting monitor_id
    API-->>Dashboard: {monitors_created, notification_config_id}
    deactivate API
    Dashboard-->>User: "Monitors enabled successfully"
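
The "render template" step in Flow 1 might look roughly like the Python sketch below. The field layout is modeled on the Alerting plugin's query-level monitor format as I understand it and should be checked against the installed plugin version; the function name, monitor name, trigger name, and severity are hypothetical, and the query is a placeholder that only applies the time window.

```python
from __future__ import annotations

from typing import Any


def render_new_query_monitor(
    config_id: str, duration_minutes: int = 10, interval_minutes: int = 5
) -> dict[str, Any]:
    """Build a monitor body for the new_query_in_top_n scenario (sketch only)."""
    return {
        "type": "monitor",
        "name": "qi-new-query-in-top-n",  # hypothetical name
        "monitor_type": "query_level_monitor",
        "enabled": True,
        # Flow 1 states that monitors run every 5 minutes.
        "schedule": {"period": {"interval": interval_minutes, "unit": "MINUTES"}},
        "inputs": [{
            "search": {
                "indices": ["top_queries-*"],
                # Placeholder: restricts to the time window only. Detecting
                # "new" hashes needs extra clauses not spelled out in the issue.
                "query": {
                    "query": {
                        "range": {"timestamp": {"gte": f"now-{duration_minutes}m"}}
                    }
                },
            }
        }],
        "triggers": [{
            "name": "new-query-detected",  # hypothetical name
            "severity": "2",  # assumed value
            "condition": {
                "script": {
                    "source": "ctx.results[0].hits.total.value > 0",
                    "lang": "painless",
                }
            },
            "actions": [{
                "name": "notify",
                # ID returned by the notification channel creation step.
                "destination_id": config_id,
                "message_template": {"source": "Monitor {{ctx.monitor.name}} fired."},
            }],
        }],
    }
```

Keeping the template as a plain function of user thresholds and the notification `config_id` mirrors the inputs listed in the diagram.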

Flow 2: Periodic Monitoring and Alert Triggering

sequenceDiagram
    participant Scheduler as Alerting Scheduler
    participant Monitor as Monitor Execution
    participant Index as top_queries-* indices
    participant Trigger as Trigger Evaluator
    participant Action as Action Executor
    participant Slack as Slack Webhook

    loop Every 5 minutes
        Scheduler->>Monitor: Execute monitor "qi-monitor-001"
        activate Monitor

        Monitor->>Index: Query: Search for new query_group_hashcode<br/>in last 10 minutes
        activate Index
        Index-->>Monitor: Return matching documents
        deactivate Index

        Monitor->>Trigger: Evaluate condition
        activate Trigger
        Trigger->>Trigger: Check: hits.total.value > 0?

        alt Condition is TRUE
            Trigger-->>Monitor: Trigger fired

            Monitor->>Action: Execute actions
            activate Action
            Action->>Action: Render message template<br/>with query hash, latency, etc.
            Action->>Slack: POST webhook
            activate Slack
            Slack-->>Action: 200 OK
            deactivate Slack
            Action-->>Monitor: Action completed
            deactivate Action
        else Condition is FALSE
            Trigger-->>Monitor: No trigger
        end

        deactivate Trigger

        Monitor-->>Scheduler: Execution complete
        deactivate Monitor
    end
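
One iteration of the loop in Flow 2 can be simulated in plain Python to show the intended logic. This is a sketch under assumptions: documents are dicts with `timestamp` in epoch milliseconds and a `query_group_hashcode` string, the set of already-known hashes is supplied by the caller, and `send` stands in for the webhook call. None of these names come from the Alerting plugin.

```python
from __future__ import annotations

from typing import Any, Callable


def new_hashes(
    docs: list[dict[str, Any]],
    known_hashes: set[str],
    now_ms: int,
    window_minutes: int = 10,
) -> list[str]:
    """Hashes seen inside the time window that are not already known."""
    cutoff = now_ms - window_minutes * 60_000
    return sorted({
        d["query_group_hashcode"]
        for d in docs
        if d["timestamp"] >= cutoff and d["query_group_hashcode"] not in known_hashes
    })


def run_monitor_once(
    docs: list[dict[str, Any]],
    known_hashes: set[str],
    now_ms: int,
    send: Callable[[str], None],
) -> bool:
    """Query, evaluate the trigger, and notify; returns True if the trigger fired."""
    hits = new_hashes(docs, known_hashes, now_ms)
    if hits:  # corresponds to the hits.total.value > 0 check
        send("New query patterns in top N: " + ", ".join(hits))
        return True
    return False
```

For example, passing `sent.append` as `send` collects the rendered messages in a list, which makes the trigger behavior easy to inspect in a unit test.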

Flow 3: Monitor Management

sequenceDiagram
    actor User
    participant API as Query Insights REST API
    participant Registry as Monitor Registry
    participant AlertingAPI as Alerting Plugin API

    User->>API: GET /_insights/monitoring/monitors
    API->>Registry: List installed monitors
    Registry->>AlertingAPI: GET /_plugins/_alerting/monitors
    AlertingAPI-->>Registry: Return all QI monitors
    Registry-->>API: Filter QI-created monitors
    API-->>User: {monitors: [...]}

    User->>API: PUT /_insights/monitoring/monitors/qi-monitor-001<br/>{enabled: false}
    API->>AlertingAPI: PUT /_plugins/_alerting/monitors/qi-monitor-001<br/>{enabled: false}
    AlertingAPI-->>API: Monitor disabled
    API-->>User: Success

    User->>API: PUT /_insights/monitoring/monitors/qi-monitor-001<br/>{thresholds: {duration_minutes: 30}}
    API->>AlertingAPI: GET monitor qi-monitor-001
    AlertingAPI-->>API: Monitor JSON
    API->>API: Update query with new threshold
    API->>AlertingAPI: PUT monitor with updated query
    AlertingAPI-->>API: Monitor updated
    API-->>User: Threshold updated
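
The threshold update at the end of Flow 3 (fetch the monitor JSON, rewrite the query, write it back) could be sketched as below. It assumes the time window is expressed as a `range` filter on `timestamp` somewhere under the monitor's `inputs`; the helper names are hypothetical.

```python
from __future__ import annotations

import copy
from typing import Any


def _set_time_window(node: Any, duration_minutes: int) -> int:
    """Rewrite every range filter on `timestamp` in place; return how many changed."""
    changed = 0
    if isinstance(node, dict):
        rng = node.get("range")
        if isinstance(rng, dict) and isinstance(rng.get("timestamp"), dict):
            rng["timestamp"]["gte"] = f"now-{duration_minutes}m"
            changed += 1
        for value in node.values():
            changed += _set_time_window(value, duration_minutes)
    elif isinstance(node, list):
        for item in node:
            changed += _set_time_window(item, duration_minutes)
    return changed


def update_duration(monitor: dict[str, Any], duration_minutes: int) -> dict[str, Any]:
    """Return a copy of the monitor JSON with the new time window applied."""
    if duration_minutes <= 0:
        raise ValueError("duration_minutes must be positive")
    updated = copy.deepcopy(monitor)  # leave the fetched document untouched
    if _set_time_window(updated.get("inputs", []), duration_minutes) == 0:
        raise ValueError("monitor has no timestamp range filter to update")
    return updated
```

Working on a deep copy means a failed PUT leaves the caller's cached monitor document unchanged.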

Dashboard Integration

Add a "Monitoring" section in Query Insights dashboards:

Features:

  • List of available monitors with enable/disable toggles
  • Threshold configuration inputs
  • Notification channel setup
  • Recent alert history table
  • Preview of alert message templates

Example UI Flow:

  1. User navigates to Query Insights > Monitoring
  2. Selects "New Query in Top N" monitor
  3. Sets threshold (e.g., "Alert if new query stays in top 10 for 15+ minutes")
  4. Configures Slack webhook
  5. Enables monitor
  6. Monitor is created in Alerting plugin and starts running

What alternatives have you considered?

Alternative 1: Manual Monitor Creation

Description: Users manually create monitors in the Alerting UI by writing queries against top_queries-* indices.

How it works:

  • User navigates to Alerting plugin
  • Writes DSL query like:
    {
      "query": {
        "bool": {
          "must": [
            { "range": { "timestamp": { "gte": "now-10m" }}},
            { "term": { "top_n_query.latency": true }}
          ],
          "must_not": [
            { "terms": { "query_group_hashcode": ["known_hash_1", "known_hash_2"] }}
          ]
        }
      }
    }
  • Configures trigger conditions
  • Sets up notification channels
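
Part of the burden in this alternative is keeping the list of known hashes in the query up to date by hand. A small Python sketch that generates the same DSL from a set of known hashes is shown here; the function name and its parameters are hypothetical.

```python
from __future__ import annotations

import json
from typing import Any, Iterable


def new_query_dsl(
    known_hashes: Iterable[str], window_minutes: int = 10, metric: str = "latency"
) -> dict[str, Any]:
    """Build the search body shown above for a given set of known hashes."""
    bool_query: dict[str, Any] = {
        "must": [
            {"range": {"timestamp": {"gte": f"now-{window_minutes}m"}}},
            {"term": {f"top_n_query.{metric}": True}},
        ]
    }
    hashes = sorted(set(known_hashes))
    if hashes:
        # Exclude patterns that have already been seen.
        bool_query["must_not"] = [{"terms": {"query_group_hashcode": hashes}}]
    return {"query": {"bool": bool_query}}


if __name__ == "__main__":
    print(json.dumps(new_query_dsl(["known_hash_1", "known_hash_2"]), indent=2))
```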

Pros:

  • No new code needed
  • Maximum flexibility for power users

Cons:

  • Access is limited by each user's permissions on the top_queries-* indices
  • High barrier to entry and error-prone
  • Poor discoverability (users don't know monitoring is possible from Query Insights dashboards)
  • Limited integration with Query Insights dashboards
  • No consistency across deployments

Alternative 2: Build Custom Alerting System

Description: Create a dedicated alerting engine within the Query Insights plugin, separate from OpenSearch Alerting.

How it works:

  • Query Insights plugin includes:
    • Monitor registry and scheduler
    • Condition evaluator
    • Notification sender (Slack, email, webhooks)
    • Alert history storage
  • Monitors defined in Query Insights settings
  • Runs on query_insights_executor thread pool

Pros:

  • Full control over features
  • Tightly integrated with Query Insights
  • Can implement real-time event-driven alerts

Cons:

  • Reinvents existing functionality (notification channels, throttling, deduplication)
  • More maintenance burden
  • Inconsistent UX (alerts in different place than other OpenSearch alerts)
  • Longer development time

Verdict: Leverage the existing Alerting plugin infrastructure for consistent UX and reduced maintenance.

Do you have any additional context?

Data Requirements

Query Insights already exports the following fields to top_queries-* indices (no new fields needed):

Existing Fields Used:

  • timestamp - When query was executed
  • query_group_hashcode - Query pattern identifier for detecting new patterns
  • latency, cpu, memory - Performance metrics (in the measurements nested object)
  • top_n_query - Map indicating which metric types this was top N for
  • is_cancelled - Boolean flag for cancellation detection
  • wlm_group_id - Workload management group identifier
  • username - User who initiated the query
  • indices - Indices queried
  • search_type - Type of search operation

Optional Enhancements (Future):

  • first_seen_timestamp - When query pattern first appeared (for "new query" detection)
  • baseline_latency_p95 - Rolling 7-day baseline (for regression detection)

These enhancements would improve detection accuracy but are not required for Phase 1.
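
To illustrate how a field like baseline_latency_p95 could feed regression detection, here is a minimal Python sketch. It assumes a nearest-rank p95 over whatever latency history is supplied and reuses the 50% threshold from the install example; the function names are hypothetical and the issue does not prescribe a particular percentile method.

```python
from __future__ import annotations


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # ceil(0.95 * n) computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def regression_percent(current: float, baseline: float) -> float:
    """Percentage increase of current over baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline * 100.0


def is_regression(
    current: float, history: list[float], threshold_percent: float = 50.0
) -> bool:
    """True when current exceeds the p95 baseline by more than the threshold."""
    return regression_percent(current, p95(history)) > threshold_percent
```

With a history of latencies 1 through 100, the baseline is 95, so a current latency of 150 counts as a regression at the 50% threshold while 120 does not.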
