[FEATURE] Query Event Monitoring and Notifications

### Is your feature request related to a problem?

OpenSearch users must manually monitor Query Insights dashboards to detect problematic query patterns. This reactive approach leads to:

1. **Discovery Delay**: New expensive query patterns may run for hours before being noticed
2. **Manual Toil**: Administrators must periodically check dashboards to catch issues
3. **No Proactive Alerts**: No automatic notification when queries degrade or new patterns emerge
4. **Difficult Pattern Tracking**: For clusters with dynamically generated queries, it is nearly impossible to track which queries are impacting the cluster

Users need **proactive notifications** when query patterns change, rather than constantly monitoring dashboards.

### What solution would you like?
#### Overview

Provide **pre-configured event monitors** for common query performance scenarios with **simple setup** and **multi-channel notifications**. The solution leverages:

1. **Existing Data**: Query Insights already exports top N queries to local indices (`top_queries-YYYY.MM.dd-*`)
2. **Alerting Plugin Integration**: Use the OpenSearch Alerting plugin for notification delivery
3. **Pre-Built Templates**: Ship monitor templates that users can enable with minimal configuration

#### User Experience

**Enable monitors via REST API** (`scope` is shown as `cluster`; user- or role-specific scopes are also intended):
```json
POST /_insights/monitoring/install
{
  "monitors": [
    "new_query_in_top_n",
    "performance_regression",
    "resource_spike",
    "query_cancellation_spike",
    "wlm_threshold_breach"
  ],
  "notification_config": {
    "name": "Query Insights Alerts",
    "description": "Notification channel for Query Insights monitoring",
    "config_type": "slack",
    "slack": {
      "url": "https://hooks.slack.com/services/..."
    }
  },
  "scope": "cluster",
  "thresholds": {
    "new_query_in_top_n": {
      "duration_minutes": 10,
      "top_n_size": 10
    },
    "performance_regression": {
      "threshold_percent": 50
    }
  }
}
```

**Response:**
```json
{
  "monitors_created": [
    {
      "monitor_id": "abc123",
      "monitor_type": "new_query_in_top_n",
      "status": "enabled",
      "alerting_monitor_id": "qi-new-query-001"
    },
    {
      "monitor_id": "def456",
      "monitor_type": "performance_regression",
      "status": "enabled",
      "alerting_monitor_id": "qi-perf-reg-001"
    }
  ],
  "notification_config_id": "notify-xyz789"
}
```
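A handler for the proposed install endpoint would presumably validate the request body before touching the Alerting plugin. The sketch below is illustrative only: `KNOWN_MONITORS` and `validate_install_request` are hypothetical names, the monitor IDs are taken from the scenario table in this proposal, and the validation rules themselves are assumptions rather than part of the design.

```python
from typing import Any, Dict, List

# Monitor IDs proposed in this issue (hypothetical constant name).
KNOWN_MONITORS = {
    "new_query_in_top_n",
    "performance_regression",
    "resource_spike",
    "query_cancellation_spike",
    "wlm_threshold_breach",
    "per_user_load_spike",
}


def validate_install_request(body: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the body looks valid."""
    errors: List[str] = []

    monitors = body.get("monitors") or []
    if not monitors:
        errors.append("monitors must be a non-empty list")
    for name in monitors:
        if name not in KNOWN_MONITORS:
            errors.append(f"unknown monitor: {name}")

    # Assumed rule: thresholds may only be set for monitors being installed.
    for name in body.get("thresholds", {}):
        if name not in monitors:
            errors.append(f"thresholds given for a monitor that is not installed: {name}")

    if "notification_config" not in body:
        errors.append("notification_config is required")

    return errors
```

Returning a list of problems rather than raising on the first one lets the API report every issue in a single response.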

**Manage monitors:**
```bash
# List Query Insights monitors
GET /_insights/monitoring/monitors

# Get specific monitor details
GET /_insights/monitoring/monitors/{monitor_id}

# Update monitor thresholds
# This updates the query DSL in the underlying Alerting monitor
PUT /_insights/monitoring/monitors/{monitor_id}
{
  "thresholds": {
    "duration_minutes": 30
  }
}

# Enable/disable monitor
# This sets the "enabled" flag in the Alerting monitor
PUT /_insights/monitoring/monitors/{monitor_id}
{
  "enabled": false
}

# Delete monitor
# This deletes the Alerting monitor and removes QI's mapping
DELETE /_insights/monitoring/monitors/{monitor_id}
```

#### Pre-Built Monitor Scenarios

Based on **current Query Insights capabilities**, the following monitors are proposed:

| Monitor ID | Trigger Condition | Data Source | Value |
|-----------|-------------------|-------------|-------|
| **new_query_in_top_n** | New `query_group_hashcode` appears in top N for > X minutes | `top_n_query`, `query_group_hashcode`, `timestamp` | Catch new expensive patterns |
| **performance_regression** | Query latency/CPU/memory increases > X% from rolling baseline | `latency`, `cpu`, `memory` measurements with time-series analysis | Detect degradation |
| **resource_spike** | Query CPU or memory exceeds threshold | `cpu`, `memory` measurements | Prevent resource exhaustion |
| **query_cancellation_spike** | Cancellation rate exceeds threshold | `is_cancelled` attribute | Detect client timeouts or overload |
| **wlm_threshold_breach** | Queries from specific WLM group exceed limits | `wlm_group_id`, resource measurements | Workload management enforcement |
| **per_user_load_spike** | Specific user's query count/resources spike | `username`, aggregated metrics | Identify problematic users |

**Key Data Fields Used:**
- `timestamp` - For time-based windows
- `latency`, `cpu`, `memory` - Performance metrics (from `measurements` map)
- `query_group_hashcode` - Query pattern identity (from `attributes`)
- `top_n_query` - Map indicating which metric type(s) this was top N for
- `is_cancelled` - Cancellation flag
- `wlm_group_id` - Workload management group
- `username` - User who initiated the query
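To make two of the trigger conditions concrete, here is a minimal sketch of the arithmetic behind `performance_regression` (increase of more than X% over a baseline) and `query_cancellation_spike` (cancellation rate over a threshold). The function names are hypothetical, and the code assumes each record is a flat mapping with a top-level `is_cancelled` flag; the actual document layout in `top_queries-*` may differ.

```python
from typing import Iterable, Mapping


def percent_increase(current: float, baseline: float) -> float:
    """Return the percentage increase of `current` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline * 100.0


def is_regression(current: float, baseline: float, threshold_percent: float) -> bool:
    """True when `current` exceeds `baseline` by more than `threshold_percent`."""
    return percent_increase(current, baseline) > threshold_percent


def cancellation_rate(records: Iterable[Mapping[str, object]]) -> float:
    """Fraction of records whose `is_cancelled` flag is set (0.0 when there are no records)."""
    items = list(records)
    if not items:
        return 0.0
    cancelled = sum(1 for record in items if record.get("is_cancelled") is True)
    return cancelled / len(items)
```

With `threshold_percent` set to 50 as in the install example, a baseline latency of 100 and a current latency of 160 would count as a regression, while a current latency of exactly 150 would not, because the comparison is strict.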

#### Implementation Approach

The implementation leverages the **existing OpenSearch Alerting plugin**:

1. **Query Insights Role**:
   - Already exports data to `top_queries-YYYY.MM.dd-*` indices
   - Provides a REST API to install pre-built monitor templates
   - Translates monitor configs into Alerting plugin Monitor objects

2. **Alerting Plugin Role**:
   - Executes scheduled queries against `top_queries-*` indices
   - Evaluates trigger conditions
   - Handles notification delivery (Slack, email, webhooks, SNS)
   - Manages throttling and deduplication

3. **User Flow**:
   ```
   User enables monitor
     ↓
   Query Insights API creates Monitor in Alerting plugin
     ↓
   Alerting plugin runs scheduled query every 5 min
     ↓
   If condition matches → send notification
   ```
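The translation step, in which Query Insights turns a monitor config into an Alerting Monitor object, could look roughly like the sketch below for `resource_spike`. The body follows the Alerting plugin's query-level monitor format as I understand it and should be checked against the installed version. The function name, the monitor and trigger names, and the `measurements.cpu.number` field path are assumptions, and a `nested` mapping for `measurements` would require a nested query instead.

```python
import json
from typing import Any, Dict


def build_resource_spike_monitor(
    cpu_threshold: int,
    notification_config_id: str,
    interval_minutes: int = 5,
) -> Dict[str, Any]:
    """Render a query-level Alerting monitor body for the resource_spike scenario."""
    search_body = {
        "size": 0,  # only the hit count is needed by the trigger
        "query": {
            "bool": {
                "filter": [
                    {"range": {"timestamp": {"gte": f"now-{interval_minutes}m"}}},
                    # Assumed field path for the CPU measurement.
                    {"range": {"measurements.cpu.number": {"gt": cpu_threshold}}},
                ]
            }
        },
    }
    return {
        "type": "monitor",
        "name": "qi-resource-spike",
        "monitor_type": "query_level_monitor",
        "enabled": True,
        "schedule": {"period": {"interval": interval_minutes, "unit": "MINUTES"}},
        "inputs": [{"search": {"indices": ["top_queries-*"], "query": search_body}}],
        "triggers": [
            {
                "name": "cpu-above-threshold",
                "severity": "1",
                "condition": {
                    "script": {
                        "source": "ctx.results[0].hits.total.value > 0",
                        "lang": "painless",
                    }
                },
                "actions": [
                    {
                        "name": "notify",
                        # ID returned when the notification channel was created.
                        "destination_id": notification_config_id,
                        "subject_template": {"source": "Query Insights: resource spike"},
                        "message_template": {
                            "source": "{{ctx.results.0.hits.total.value}} top queries exceeded the CPU threshold."
                        },
                    }
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_resource_spike_monitor(1_000_000, "notify-xyz789"), indent=2))
```

The rendered dictionary is what would be sent as the body of `POST /_plugins/_alerting/monitors`. Keeping the renderer a pure function makes each template easy to unit test without a running cluster.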

#### Detailed Flows

Flow 1: Monitor Installation

```mermaid
sequenceDiagram
 actor User
 participant Dashboard as Query Insights Dashboard
 participant API as Query Insights REST API
 participant Templates as Monitor Templates
 participant NotifAPI as Notifications Plugin API
 participant AlertingAPI as Alerting Plugin API

 User->>Dashboard: Enable "New Query in Top N" monitor
 Dashboard->>API: POST /_insights/monitoring/install {monitors, notification_config}
 activate API

 API->>Templates: Load monitor template(s)
 Templates-->>API: Return template JSON(s)

 Note over API,NotifAPI: Step 1: Create notification channel
 API->>NotifAPI: POST /_plugins/_notifications/configs
 Note right of NotifAPI: Request body: { "name": "Query Insights Alerts", "config_type": "slack", "slack": {"url": "..."}, ...}
 activate NotifAPI
 NotifAPI->>NotifAPI: Create notification config
 NotifAPI-->>API: {config_id: "notify-xyz789"}
 deactivate NotifAPI

 Note over API,AlertingAPI: Step 2: Create monitor(s) in Alerting plugin
 loop For each monitor type
 API->>API: Render template with: - User thresholds - notification config_id - Query against top_queries-*
 API->>AlertingAPI: POST /_plugins/_alerting/monitors
 Note right of AlertingAPI: Monitor JSON includes: - schedule (every 5 min) - inputs (query) - triggers (conditions) - actions (with config_id)
 activate AlertingAPI
 AlertingAPI->>AlertingAPI: Validate and create monitor
 AlertingAPI-->>API: {_id: "qi-new-query-001"}
 deactivate AlertingAPI
 end

 API->>API: Store mapping: QI monitor_id -> Alerting monitor_id
 API-->>Dashboard: {monitors_created, notification_config_id}
 deactivate API
 Dashboard-->>User: "Monitors enabled successfully"
```

Flow 2: Periodic Monitoring and Alert Triggering

```mermaid
sequenceDiagram
 participant Scheduler as Alerting Scheduler
 participant Monitor as Monitor Execution
 participant Index as top_queries-* indices
 participant Trigger as Trigger Evaluator
 participant Action as Action Executor
 participant Slack as Slack Webhook

 loop Every 5 minutes
 Scheduler->>Monitor: Execute monitor "qi-monitor-001"
 activate Monitor

 Monitor->>Index: Query: Search for new query_group_hashcode in last 10 minutes
 activate Index
 Index-->>Monitor: Return matching documents
 deactivate Index

 Monitor->>Trigger: Evaluate condition
 activate Trigger
 Trigger->>Trigger: Check: hits.total.value > 0?

 alt Condition is TRUE
 Trigger-->>Monitor: Trigger fired

 Monitor->>Action: Execute actions
 activate Action
 Action->>Action: Render message template with query hash, latency, etc.
 Action->>Slack: POST webhook
 activate Slack
 Slack-->>Action: 200 OK
 deactivate Slack
 Action-->>Monitor: Action completed
 deactivate Action
 else Condition is FALSE
 Trigger-->>Monitor: No trigger
 end

 deactivate Trigger

 Monitor-->>Scheduler: Execution complete
 deactivate Monitor
 end
```
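The `hits.total.value > 0` check in the flow above does not by itself express the "stays in top N for more than X minutes" part of the `new_query_in_top_n` condition. The sketch below shows one way that persistence check could be computed from the documents a monitor query returns. It is an illustration under assumptions: the function name is hypothetical, `timestamp` is taken to be epoch milliseconds, and `known_hashes` stands in for whatever baseline of previously seen patterns the monitor would use.

```python
from typing import Dict, Iterable, List, Mapping, Set


def find_persistent_new_hashes(
    records: Iterable[Mapping[str, object]],
    known_hashes: Set[str],
    duration_minutes: int,
) -> List[str]:
    """Return unknown query_group_hashcode values observed over a span longer than duration_minutes."""
    first_seen: Dict[str, int] = {}
    last_seen: Dict[str, int] = {}

    for record in records:
        query_hash = str(record["query_group_hashcode"])
        timestamp_ms = int(record["timestamp"])  # assumed epoch milliseconds
        if query_hash in known_hashes:
            continue
        first_seen[query_hash] = min(first_seen.get(query_hash, timestamp_ms), timestamp_ms)
        last_seen[query_hash] = max(last_seen.get(query_hash, timestamp_ms), timestamp_ms)

    threshold_ms = duration_minutes * 60_000
    return sorted(
        query_hash
        for query_hash in first_seen
        if last_seen[query_hash] - first_seen[query_hash] > threshold_ms
    )
```

For example, with `duration_minutes=10`, a hash first seen at time 0 and seen again 11 minutes later is reported, while one seen only within a 5-minute span is not.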


Flow 3: Monitor Management

```mermaid
sequenceDiagram
 actor User
 participant API as Query Insights REST API
 participant Registry as Monitor Registry
 participant AlertingAPI as Alerting Plugin API

 User->>API: GET /_insights/monitoring/monitors
 API->>Registry: List installed monitors
 Registry->>AlertingAPI: GET /_plugins/_alerting/monitors/_search
 AlertingAPI-->>Registry: Return matching monitors
 Registry-->>API: Return QI-created monitors only
 API-->>User: {monitors: [...]}

 User->>API: PUT /_insights/monitoring/monitors/abc123 {enabled: false}
 API->>AlertingAPI: PUT /_plugins/_alerting/monitors/qi-new-query-001 {enabled: false}
 AlertingAPI-->>API: Monitor disabled
 API-->>User: Success

 User->>API: PUT /_insights/monitoring/monitors/abc123 {thresholds: {duration_minutes: 30}}
 API->>AlertingAPI: GET monitor qi-new-query-001
 AlertingAPI-->>API: Monitor JSON
 API->>API: Update query with new threshold
 API->>AlertingAPI: PUT monitor with updated query
 AlertingAPI-->>API: Monitor updated
 API-->>User: Threshold updated
```
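The management flow depends on the stored mapping from a Query Insights `monitor_id` to the underlying Alerting monitor ID. A minimal in-memory sketch of that registry follows. The class and method names are hypothetical, and the request bodies mirror the simplified ones in the diagram. As far as I know, the Alerting update API expects the full monitor document, so a real implementation would likely fetch the monitor, modify it, and write it back, as the threshold-update path in the diagram does.

```python
from typing import Any, Dict, Tuple

# (HTTP method, path, JSON body) to send to the Alerting plugin.
Request = Tuple[str, str, Dict[str, Any]]


class MonitorRegistry:
    """Maps Query Insights monitor IDs to Alerting monitor IDs (in memory, for illustration)."""

    def __init__(self) -> None:
        self._alerting_ids: Dict[str, str] = {}

    def register(self, qi_monitor_id: str, alerting_monitor_id: str) -> None:
        self._alerting_ids[qi_monitor_id] = alerting_monitor_id

    def set_enabled_request(self, qi_monitor_id: str, enabled: bool) -> Request:
        """Describe the Alerting call that enables or disables the monitor."""
        alerting_id = self._alerting_ids[qi_monitor_id]  # KeyError if unknown
        return ("PUT", f"/_plugins/_alerting/monitors/{alerting_id}", {"enabled": enabled})

    def delete_request(self, qi_monitor_id: str) -> Request:
        """Remove the mapping and describe the Alerting call that deletes the monitor."""
        alerting_id = self._alerting_ids.pop(qi_monitor_id)  # KeyError if unknown
        return ("DELETE", f"/_plugins/_alerting/monitors/{alerting_id}", {})
```

In the plugin itself the mapping would need to be persisted (for example in an index) so that it survives node restarts. The dictionary here only illustrates the lookup.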


#### Dashboard Integration

Add a **"Monitoring"** section in Query Insights dashboards:

**Features:**
- List of available monitors with enable/disable toggles
- Threshold configuration inputs
- Notification channel setup
- Recent alert history table
- Preview of alert message templates

**Example UI Flow:**
1. User navigates to Query Insights > Monitoring
2. Selects "New Query in Top N" monitor
3. Sets threshold (e.g., "Alert if new query stays in top 10 for 15+ minutes")
4. Configures Slack webhook
5. Enables monitor
6. Monitor is created in Alerting plugin and starts running

### What alternatives have you considered?
#### Alternative 1: Manual Monitor Creation

**Description:** Users manually create monitors in the Alerting UI by writing queries against `top_queries-*` indices.

**How it works:**
- User navigates to Alerting plugin
- Writes DSL query like:
  ```json
  {
    "query": {
      "bool": {
        "must": [
          { "range": { "timestamp": { "gte": "now-10m" }}},
          { "term": { "top_n_query.latency": true }}
        ],
        "must_not": [
          { "terms": { "query_group_hashcode": ["known_hash_1", "known_hash_2"] }}
        ]
      }
    }
  }
  ```
- Configures trigger conditions
- Sets up notification channels

**Pros:**
- No new code needed
- Maximum flexibility for power users

**Cons:**
- Access is limited by index permissions (users need read access to the `top_queries-*` indices)
- High barrier to entry and error-prone
- Poor discoverability (users don't know monitoring is possible from Query Insights dashboards)
- Limited integration with Query Insights dashboards
- No consistency across deployments

---

#### Alternative 2: Build Custom Alerting System

**Description:** Create a dedicated alerting engine within Query Insights plugin, separate from OpenSearch Alerting.

**How it works:**
- Query Insights plugin includes:
  - Monitor registry and scheduler
  - Condition evaluator
  - Notification sender (Slack, email, webhooks)
  - Alert history storage
- Monitors defined in Query Insights settings
- Runs on the `query_insights_executor` thread pool

**Pros:**
- Full control over features
- Tightly integrated with Query Insights
- Can implement real-time event-driven alerts

**Cons:**
- Reinvents existing functionality (notification channels, throttling, deduplication)
- More maintenance burden
- Inconsistent UX (alerts in different place than other OpenSearch alerts)
- Longer development time

**Verdict:** Leverage existing Alerting plugin infrastructure for consistent UX and reduced maintenance.

### Do you have any additional context?
#### Data Requirements

Query Insights already exports the following fields to `top_queries-*` indices (no new fields needed):

**Existing Fields Used:**
- `timestamp` - When query was executed
- `query_group_hashcode` - Query pattern identifier for detecting new patterns
- `latency`, `cpu`, `memory` - Performance metrics (in `measurements` nested object)
- `top_n_query` - Map indicating which metric types this was top N for
- `is_cancelled` - Boolean flag for cancellation detection
- `wlm_group_id` - Workload management group identifier
- `username` - User who initiated the query
- `indices` - Indices queried
- `search_type` - Type of search operation

**Optional Enhancements (Future):**
- `first_seen_timestamp` - When query pattern first appeared (for "new query" detection)
- `baseline_latency_p95` - Rolling 7-day baseline (for regression detection)

These enhancements would improve detection accuracy but are not required for Phase 1.
