Is your feature request related to a problem?
OpenSearch users must manually monitor Query Insights dashboards to detect problematic query patterns. This reactive approach leads to:
- Discovery Delay: New expensive query patterns may run for hours before being noticed
- Manual Toil: Administrators must periodically check dashboards to catch issues
- No Proactive Alerts: No automatic notification when queries degrade or new patterns emerge
- Difficult Pattern Tracking: For clusters with dynamically generated queries, it is nearly impossible to track which queries are impacting the cluster
Users need proactive notifications when query patterns change, rather than constantly monitoring dashboards.
What solution would you like?
Overview
Provide pre-configured event monitors for common query performance scenarios with simple setup and multi-channel notifications. The solution leverages:
- Existing Data: Query Insights already exports top N queries to local indices (`top_queries-YYYY.MM.dd-*`)
- Alerting Plugin Integration: Use OpenSearch Alerting plugin for notification delivery
- Pre-Built Templates: Ship monitor templates that users can enable with minimal configuration
# List Query Insights monitors
GET /_insights/monitoring/monitors

# Get specific monitor details
GET /_insights/monitoring/monitors/{monitor_id}

# Update monitor thresholds
# This updates the query DSL in the underlying Alerting monitor
PUT /_insights/monitoring/monitors/{monitor_id}
{
  "thresholds": {
    "duration_minutes": 30
  }
}

# Enable/disable monitor
# This sets the "enabled" flag in the Alerting monitor
PUT /_insights/monitoring/monitors/{monitor_id}
{
  "enabled": false
}

# Delete monitor
# This deletes the Alerting monitor and removes QI's mapping
DELETE /_insights/monitoring/monitors/{monitor_id}
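For scripting against the management calls above, a thin client-side helper may be convenient. The following is a minimal sketch, assuming the proposed `/_insights/monitoring/monitors` endpoints are implemented as written in this issue. The function names are hypothetical, and the helper only builds `(method, path, body)` tuples rather than sending anything.

```python
from typing import Any, Dict, Optional, Tuple

# Proposed endpoint from this issue; it does not exist in Query Insights today.
BASE = "/_insights/monitoring/monitors"

Request = Tuple[str, str, Optional[Dict[str, Any]]]


def list_monitors() -> Request:
    """Request that lists all Query Insights monitors."""
    return ("GET", BASE, None)


def get_monitor(monitor_id: str) -> Request:
    """Request that fetches one monitor's details."""
    return ("GET", f"{BASE}/{monitor_id}", None)


def update_thresholds(monitor_id: str, **thresholds: int) -> Request:
    """Request that updates thresholds, e.g. duration_minutes=30."""
    return ("PUT", f"{BASE}/{monitor_id}", {"thresholds": dict(thresholds)})


def set_enabled(monitor_id: str, enabled: bool) -> Request:
    """Request that flips the monitor's enabled flag."""
    return ("PUT", f"{BASE}/{monitor_id}", {"enabled": enabled})


def delete_monitor(monitor_id: str) -> Request:
    """Request that deletes the monitor and its mapping."""
    return ("DELETE", f"{BASE}/{monitor_id}", None)
```

Returning plain tuples keeps the sketch independent of any HTTP library; a caller could pass them to whichever client it already uses.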
Pre-Built Monitor Scenarios
Based on current Query Insights capabilities, the following monitors are proposed:
| Monitor ID | Trigger Condition | Data Source | Value |
| --- | --- | --- | --- |
| `new_query_in_top_n` | New `query_group_hashcode` appears in top N for > X minutes | `top_n_query`, `query_group_hashcode`, `timestamp` | Catch new expensive patterns |
| `performance_regression` | Query latency/CPU/memory increases > X% from rolling baseline | `latency`, `cpu`, `memory` measurements with time-series analysis | |
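The `new_query_in_top_n` condition can be illustrated outside the Alerting plugin. This is a rough sketch, not the proposed implementation: it assumes records are available as `(timestamp, query_group_hashcode)` pairs, that a set of already-known hashes is maintained somewhere, and that "in top N for > X minutes" can be approximated by the span between a hash's first and last sighting. The function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Set, Tuple


def find_new_persistent_hashes(
    records: Iterable[Tuple[datetime, str]],
    known_hashes: Set[str],
    min_duration: timedelta,
) -> List[str]:
    """Return unknown hashes whose first-to-last sighting exceeds min_duration."""
    first_seen: Dict[str, datetime] = {}
    last_seen: Dict[str, datetime] = {}
    for ts, query_hash in records:
        if query_hash in known_hashes:
            continue  # already-known patterns never count as "new"
        first_seen[query_hash] = min(first_seen.get(query_hash, ts), ts)
        last_seen[query_hash] = max(last_seen.get(query_hash, ts), ts)
    return sorted(
        h for h in first_seen if last_seen[h] - first_seen[h] > min_duration
    )
```

A hash seen only once has a span of zero and is never reported, which matches the intent of ignoring one-off expensive queries.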
User enables monitor
↓
Query Insights API creates Monitor in Alerting plugin
↓
Alerting plugin runs scheduled query every 5 min
↓
If condition matches → send notification
Detailed Flows
Flow 1: Monitor Installation
sequenceDiagram
actor User
participant Dashboard as Query Insights Dashboard
participant API as Query Insights REST API
participant Templates as Monitor Templates
participant NotifAPI as Notifications Plugin API
participant AlertingAPI as Alerting Plugin API
User->>Dashboard: Enable "New Query in Top N" monitor
Dashboard->>API: POST /_insights/monitoring/install<br/>{monitors, notification_config}
activate API
API->>Templates: Load monitor template(s)
Templates-->>API: Return template JSON(s)
Note over API,NotifAPI: Step 1: Create notification channel
API->>NotifAPI: POST /_plugins/_notifications/configs
Note right of NotifAPI: Request body:<br/>{<br/> "name": "Query Insights Alerts",<br/> "config_type": "slack",<br/> "slack": {"url": "..."},<br/> ...}
activate NotifAPI
NotifAPI->>NotifAPI: Create notification config
NotifAPI-->>API: {config_id: "notify-xyz789"}
deactivate NotifAPI
Note over API,AlertingAPI: Step 2: Create monitor(s) in Alerting plugin
loop For each monitor type
API->>API: Render template with:<br/>- User thresholds<br/>- notification config_id<br/>- Query against top_queries-*
API->>AlertingAPI: POST /_plugins/_alerting/monitors
Note right of AlertingAPI: Monitor JSON includes:<br/>- schedule (every 5 min)<br/>- inputs (query)<br/>- triggers (conditions)<br/>- actions (with config_id)
activate AlertingAPI
AlertingAPI->>AlertingAPI: Validate and create monitor
AlertingAPI-->>API: {_id: "qi-new-query-001"}
deactivate AlertingAPI
end
API->>API: Store mapping:<br/>QI monitor_id -> Alerting monitor_id
API-->>Dashboard: {monitors_created, notification_config_id}
deactivate API
Dashboard-->>User: "Monitors enabled successfully"
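The "render template" step in Flow 1 could look roughly like the sketch below. It is a sketch under stated assumptions: the function name and monitor name are hypothetical, the search filter reuses the fields from the manual-monitor query in Alternative 1, and the overall body shape (`schedule`, `inputs`, `triggers`, `actions`) follows the Alerting plugin's query-level monitor format as I understand it, which should be checked against the Alerting API reference before use.

```python
from typing import Any, Dict


def render_new_query_monitor(
    config_id: str,
    window_minutes: int = 10,
    interval_minutes: int = 5,
) -> Dict[str, Any]:
    """Build a monitor body for a hypothetical new_query_in_top_n template."""
    search_query = {
        "size": 10,
        "query": {
            "bool": {
                "must": [
                    # Only look at records exported within the lookback window.
                    {"range": {"timestamp": {"gte": f"now-{window_minutes}m"}}},
                    {"term": {"top_n_query.latency": True}},
                ]
            }
        },
    }
    return {
        "type": "monitor",
        "monitor_type": "query_level_monitor",
        "name": "qi-new-query-in-top-n",  # hypothetical name
        "enabled": True,
        "schedule": {"period": {"interval": interval_minutes, "unit": "MINUTES"}},
        "inputs": [{"search": {"indices": ["top_queries-*"], "query": search_query}}],
        "triggers": [
            {
                "name": "new-query-detected",
                "severity": "2",
                "condition": {
                    "script": {
                        "source": "ctx.results[0].hits.total.value > 0",
                        "lang": "painless",
                    }
                },
                "actions": [
                    {
                        "name": "notify",
                        # ID returned by the notification channel creation step.
                        "destination_id": config_id,
                        "subject_template": {"source": "Query Insights alert"},
                        "message_template": {
                            "source": "Monitor {{ctx.monitor.name}} fired."
                        },
                    }
                ],
            }
        ],
    }
```

Keeping the thresholds as function parameters means the same template can be re-rendered when a user changes them, rather than patching stored JSON by hand.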
Flow 2: Periodic Monitoring and Alert Triggering
sequenceDiagram
participant Scheduler as Alerting Scheduler
participant Monitor as Monitor Execution
participant Index as top_queries-* indices
participant Trigger as Trigger Evaluator
participant Action as Action Executor
participant Slack as Slack Webhook
loop Every 5 minutes
Scheduler->>Monitor: Execute monitor "qi-monitor-001"
activate Monitor
Monitor->>Index: Query: Search for new query_group_hashcode<br/>in last 10 minutes
activate Index
Index-->>Monitor: Return matching documents
deactivate Index
Monitor->>Trigger: Evaluate condition
activate Trigger
Trigger->>Trigger: Check: hits.total.value > 0?
alt Condition is TRUE
Trigger-->>Monitor: Trigger fired
Monitor->>Action: Execute actions
activate Action
Action->>Action: Render message template<br/>with query hash, latency, etc.
Action->>Slack: POST webhook
activate Slack
Slack-->>Action: 200 OK
deactivate Slack
Action-->>Monitor: Action completed
deactivate Action
else Condition is FALSE
Trigger-->>Monitor: No trigger
end
deactivate Trigger
Monitor-->>Scheduler: Execution complete
deactivate Monitor
end
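The trigger check and message rendering in Flow 2 can be simulated without a cluster. The sketch below is illustrative only: the function names are hypothetical, the `send` callback stands in for the Slack webhook call, and the document layout (a `measurements.latency` value next to `query_group_hashcode`) is an assumption about how the exported records look.

```python
from typing import Any, Callable, Dict


def trigger_fired(search_response: Dict[str, Any]) -> bool:
    """Mirror the trigger condition hits.total.value > 0."""
    return search_response["hits"]["total"]["value"] > 0


def render_message(search_response: Dict[str, Any], max_items: int = 5) -> str:
    """Summarize the first few matching query patterns for the alert body."""
    lines = ["New query pattern(s) in top N:"]
    for hit in search_response["hits"]["hits"][:max_items]:
        source = hit["_source"]
        # The measurements layout is assumed; missing values render as None.
        latency = source.get("measurements", {}).get("latency")
        lines.append(f"- {source['query_group_hashcode']} (latency: {latency})")
    return "\n".join(lines)


def run_monitor_once(
    search_response: Dict[str, Any],
    send: Callable[[str], None],
) -> bool:
    """Evaluate the trigger and send one notification if it fired."""
    if not trigger_fired(search_response):
        return False
    send(render_message(search_response))
    return True
```

Passing `send` as a callable makes the notification step easy to replace with a list append in tests or a real webhook client in production.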
Flow 3: Monitor Management
sequenceDiagram
actor User
participant API as Query Insights REST API
participant Registry as Monitor Registry
participant AlertingAPI as Alerting Plugin API
User->>API: GET /_insights/monitoring/status
API->>Registry: List installed monitors
Registry->>AlertingAPI: GET /_plugins/_alerting/monitors
AlertingAPI-->>Registry: Return all QI monitors
Registry-->>API: Filter QI-created monitors
API-->>User: {monitors: [...]}
User->>API: POST /_insights/monitoring/disable<br/>{monitor_id: "qi-monitor-001"}
API->>AlertingAPI: PUT /_plugins/_alerting/monitors/qi-monitor-001<br/>{enabled: false}
AlertingAPI-->>API: Monitor disabled
API-->>User: Success
User->>API: PUT /_insights/monitoring/qi-monitor-001<br/>{thresholds: {duration_minutes: 30}}
API->>AlertingAPI: GET monitor qi-monitor-001
AlertingAPI-->>API: Monitor JSON
API->>API: Update query with new threshold
API->>AlertingAPI: PUT monitor with updated query
AlertingAPI-->>API: Monitor updated
API-->>User: Threshold updated
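The registry in Flow 3 is essentially an ID mapping plus a read-modify-write against the Alerting plugin. Here is a minimal sketch, assuming the threshold is expressed as the `timestamp` range in the monitor's search input. `InMemoryAlerting` is a hypothetical stand-in for the Alerting API, and all class and method names are illustrative.

```python
import copy
from typing import Any, Dict


class InMemoryAlerting:
    """Hypothetical stand-in for the Alerting plugin's get/update monitor calls."""

    def __init__(self) -> None:
        self.monitors: Dict[str, Dict[str, Any]] = {}

    def get_monitor(self, monitor_id: str) -> Dict[str, Any]:
        return copy.deepcopy(self.monitors[monitor_id])

    def put_monitor(self, monitor_id: str, body: Dict[str, Any]) -> None:
        self.monitors[monitor_id] = copy.deepcopy(body)


class MonitorRegistry:
    """Maps Query Insights monitor IDs to Alerting monitor IDs."""

    def __init__(self, alerting: InMemoryAlerting) -> None:
        self._alerting = alerting
        self._ids: Dict[str, str] = {}

    def register(self, qi_id: str, alerting_id: str) -> None:
        self._ids[qi_id] = alerting_id

    def set_enabled(self, qi_id: str, enabled: bool) -> None:
        alerting_id = self._ids[qi_id]
        monitor = self._alerting.get_monitor(alerting_id)
        monitor["enabled"] = enabled
        self._alerting.put_monitor(alerting_id, monitor)

    def update_duration(self, qi_id: str, duration_minutes: int) -> None:
        alerting_id = self._ids[qi_id]
        monitor = self._alerting.get_monitor(alerting_id)
        must = monitor["inputs"][0]["search"]["query"]["query"]["bool"]["must"]
        for clause in must:
            # Rewrite only the timestamp window; other clauses stay untouched.
            if "timestamp" in clause.get("range", {}):
                clause["range"]["timestamp"]["gte"] = f"now-{duration_minutes}m"
        self._alerting.put_monitor(alerting_id, monitor)
```

Fetching the full monitor before writing it back avoids dropping fields the registry does not know about, at the cost of one extra call per update.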
Dashboard Integration
Add a "Monitoring" section in Query Insights dashboards:
Features:
- List of available monitors with enable/disable toggles
- Threshold configuration inputs
- Notification channel setup
- Recent alert history table
- Preview of alert message templates
Example UI Flow:
1. User navigates to Query Insights > Monitoring
2. Selects "New Query in Top N" monitor
3. Sets threshold (e.g., "Alert if new query stays in top 10 for 15+ minutes")
4. Configures Slack webhook
5. Enables monitor
6. Monitor is created in Alerting plugin and starts running
User Experience
Enable monitors via REST API with `POST /_insights/monitoring/install`, passing the monitors and the notification configuration (see Flow 1).
Response:
{
  "monitors_created": [
    {
      "monitor_id": "abc123",
      "monitor_type": "new_query_in_top_n",
      "status": "enabled",
      "alerting_monitor_id": "qi-new-query-001"
    },
    {
      "monitor_id": "def456",
      "monitor_type": "performance_regression",
      "status": "enabled",
      "alerting_monitor_id": "qi-perf-reg-001"
    }
  ],
  "notification_config_id": "notify-xyz789"
}
Manage monitors through the `/_insights/monitoring/monitors` endpoints: list, get details, update thresholds, enable or disable, and delete.
Data sources used across the proposed monitors:
- `top_n_query`, `query_group_hashcode`, `timestamp`
- `latency`, `cpu`, `memory` measurements with time-series analysis
- `cpu`, `memory` measurements
- `is_cancelled` attribute
- `wlm_group_id`, resource measurements
- `username`, aggregated metrics
Key Data Fields Used:
- `timestamp` - For time-based windows
- `latency`, `cpu`, `memory` - Performance metrics (from `measurements` map)
- `query_group_hashcode` - Query pattern identity (from `attributes`)
- `top_n_query` - Map indicating which metric type(s) this was top N for
- `is_cancelled` - Cancellation flag
- `wlm_group_id` - Workload management group
- `username` - User who initiated the query
Implementation Approach
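The implementation leverages the existing OpenSearch Alerting plugin:
Query Insights Role: export top N query records to `top_queries-YYYY.MM.dd-*` indices
Alerting Plugin Role: run scheduled monitor queries against `top_queries-*` indices
User Flow: the user enables a monitor, the Query Insights API creates the monitor in the Alerting plugin, the Alerting plugin runs the scheduled query every 5 minutes, and a notification is sent when the condition matches.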
What alternatives have you considered?
Alternative 1: Manual Monitor Creation
Description: Users manually create monitors in the Alerting UI by writing queries against `top_queries-*` indices.
How it works:
{
  "query": {
    "bool": {
      "must": [
        { "range": { "timestamp": { "gte": "now-10m" }}},
        { "term": { "top_n_query.latency": true }}
      ],
      "must_not": [
        { "terms": { "query_group_hashcode": ["known_hash_1", "known_hash_2"] }}
      ]
    }
  }
}
Pros:
Cons:
Alternative 2: Build Custom Alerting System
Description: Create a dedicated alerting engine within Query Insights plugin, separate from OpenSearch Alerting.
How it works:
Pros:
Cons:
Verdict: Leverage existing Alerting plugin infrastructure for consistent UX and reduced maintenance.
Do you have any additional context?
Data Requirements
Query Insights already exports the following fields to `top_queries-*` indices (no new fields needed):
Existing Fields Used:
- `timestamp` - When query was executed
- `query_group_hashcode` - Query pattern identifier for detecting new patterns
- `latency`, `cpu`, `memory` - Performance metrics (in `measurements` nested object)
- `top_n_query` - Map indicating which metric types this was top N for
- `is_cancelled` - Boolean flag for cancellation detection
- `wlm_group_id` - Workload management group identifier
- `username` - User who initiated the query
- `indices` - Indices queried
- `search_type` - Type of search operation
Optional Enhancements (Future):
- `first_seen_timestamp` - When query pattern first appeared (for "new query" detection)
- `baseline_latency_p95` - Rolling 7-day baseline (for regression detection)
These enhancements would improve detection accuracy but are not required for Phase 1.
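To make the regression check concrete, the sketch below shows one way a p95 baseline and the "> X% from rolling baseline" comparison could be computed. It is a sketch under assumptions: the nearest-rank percentile method, the function names, and the choice to compare a single current value against the baseline are all illustrative and not specified by this proposal.

```python
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], percent: int = 95) -> float:
    """Nearest-rank percentile over historical samples (e.g. 7 days of latency)."""
    if not values:
        raise ValueError("baseline needs at least one sample")
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100 avoids floating-point rounding.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def is_regression(current: float, baseline: float, threshold_pct: float) -> bool:
    """True when current exceeds baseline by more than threshold_pct percent."""
    if baseline <= 0:
        return False  # no meaningful baseline to compare against
    increase_pct = (current - baseline) / baseline * 100
    return increase_pct > threshold_pct
```

For example, with a baseline of 100 ms and a 25% threshold, a current latency of 150 ms counts as a regression while 120 ms does not.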