4 changes: 2 additions & 2 deletions docs/recipes/_template.md
@@ -12,7 +12,7 @@ rubric.

## Prerequisites

- DING binary `>= v0.3.0` — see [install](../install.md)
- DING binary `>= v0.10.0` — see [install](../install.md)
- <Platform-specific requirements: account, runtime version, etc.>
- A notifier endpoint (Slack webhook URL, custom webhook, etc.)

@@ -32,7 +32,7 @@ rules:
match:
metric: run.exit
condition: value > 0
message: "Job failed (exit {{ .exit_code }})"
message: "Job failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)"
alert:
- notifier: slack
```
8 changes: 4 additions & 4 deletions docs/recipes/argo-workflows.md
@@ -4,7 +4,7 @@

## Prerequisites

- DING binary `>= v0.7.0` — see [install](../install.md). The recipe pulls the official container image `ghcr.io/ding-labs/ding:v0.7.0` (multi-arch, scratch base) into each step's Pod via an initContainer; no need to bake DING into your workload image.
- DING binary `>= v0.10.0` — see [install](../install.md). The recipe pulls the official container image `ghcr.io/ding-labs/ding:v0.10.0` (multi-arch, scratch base) into each step's Pod via an initContainer; no need to bake DING into your workload image.
- Argo Workflows controller installed in the cluster, `>= v3.5` (most users on v3.5/v3.6 LTS series).
- `kubectl` access to a namespace where you can create Workflows, ConfigMaps, and Secrets.
- `argo` CLI installed locally (ships with the controller; one-line install per [Argo docs](https://argo-workflows.readthedocs.io/en/latest/quick-start/)).
@@ -49,7 +49,7 @@ data:
match:
metric: run.exit
condition: value > 0
message: "Argo step {{ .pod }} (workflow {{ .workflow }}) failed with exit {{ .exit_code }}"
message: "Argo step {{ .pod }} (workflow {{ .workflow }}) failed with exit {{ .exit_code }} after {{ .duration_seconds }}s"
alert:
- notifier: slack
---
@@ -71,7 +71,7 @@ spec:
name: ding-config
initContainers:
- name: install-ding
image: ghcr.io/ding-labs/ding:v0.7.0
image: ghcr.io/ding-labs/ding:v0.10.0
# `ding install` self-copies the binary — works against the FROM-scratch
# release image (no /bin/sh available). Added in DING v0.5.1.
command: ["/ding", "install", "/shared/ding"]
@@ -173,7 +173,7 @@ spec:
- { name: ding-config, configMap: { name: ding-config } }
initContainers:
- name: install-ding
image: ghcr.io/ding-labs/ding:v0.7.0
image: ghcr.io/ding-labs/ding:v0.10.0
command: ["/ding", "install", "/shared/ding"]
mirrorVolumeMounts: true
container:
6 changes: 3 additions & 3 deletions docs/recipes/buildkite.md
@@ -4,7 +4,7 @@

## Prerequisites

- DING binary `>= v0.3.0` — see [install](../install.md)
- DING binary `>= v0.10.0` — see [install](../install.md)
- A Buildkite organization with at least one agent ([free trial available](https://buildkite.com/pricing))
- A notifier endpoint (Slack webhook URL, custom webhook, etc.)

@@ -33,7 +33,7 @@ rules:
match:
metric: run.exit
condition: value > 0
message: "{{ .repo }}@{{ .branch }} failed (exit {{ .exit_code }})"
message: "{{ .repo }}@{{ .branch }} failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)"
alert:
- notifier: slack
```
@@ -64,7 +64,7 @@ Use these in `match.labels` or `message` templates. See [Configuration](../confi
2. Trigger a build. Confirm a successful step produces no alert.
3. Force a failure (`exit 1` in `run-tests.sh`). Confirm the alert fires in Slack within ~5 seconds of step exit.

If the alert doesn't fire, check the Buildkite build log for `ding` output. Common issues: webhook URL not exposed (env hook scope, agent vs pipeline level), or `drain_timeout` shorter than the notifier retry window — see [Configuration](../configuration.md).
If the alert doesn't fire, check the Buildkite build log for `ding` output. Common issues: webhook URL not exposed (env hook scope, agent vs pipeline level), or `drain_timeout` shorter than the notifier retry window — see [Configuration → drain_timeout](../configuration.md#drain_timeout-and-retry-behaviour-in-ding-run).
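
The `drain_timeout` pitfall comes down to simple arithmetic: the process must stay alive long enough for every notifier retry to finish. The sketch below works through one case. The retry policy here (attempt count, per-request timeout, exponential backoff) and the `drain_timeout` value are hypothetical numbers chosen for illustration, not DING defaults; the linked Configuration section holds the real schedule.

```python
def retry_window_seconds(attempts: int, request_timeout: float,
                         initial_backoff: float, multiplier: float = 2.0) -> float:
    """Worst-case time needed to exhaust all notifier attempts.

    Each attempt may take up to request_timeout, and a backoff sleep
    separates consecutive attempts (none after the last one).
    All parameters are hypothetical, not DING configuration keys.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    backoffs = sum(initial_backoff * multiplier ** i for i in range(attempts - 1))
    return attempts * request_timeout + backoffs


# Hypothetical policy: 3 attempts, 5 s per request, 1 s then 2 s backoff.
window = retry_window_seconds(attempts=3, request_timeout=5.0, initial_backoff=1.0)
drain_timeout = 10.0  # hypothetical value, in seconds

print(f"retry window: {window}s, drain_timeout: {drain_timeout}s")
print("drain_timeout covers retries:", drain_timeout >= window)
```

With these assumed numbers the retry window is longer than the drain timeout, so a webhook that only succeeds on the last attempt would be cut off before the alert is delivered.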

## Native Buildkite UI surfacing

6 changes: 3 additions & 3 deletions docs/recipes/gitlab-ci.md
@@ -4,7 +4,7 @@

## Prerequisites

- DING binary `>= v0.3.0` — see [install](../install.md)
- DING binary `>= v0.10.0` — see [install](../install.md)
- A GitLab project with CI enabled (gitlab.com or self-hosted)
- A notifier endpoint (Slack webhook URL, custom webhook, etc.) accessible from the runner

@@ -35,7 +35,7 @@ rules:
match:
metric: run.exit
condition: value > 0
message: "Pipeline {{ .branch }} failed (exit {{ .exit_code }})"
message: "Pipeline {{ .branch }} failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)"
alert:
- notifier: slack
```
@@ -67,7 +67,7 @@ Use these in `match.labels` for selective rules, or in `message` templates as `{
2. Push a commit. Confirm the pipeline runs and that a successful job produces no alert.
3. Force a failure: change `run-tests.sh` to `exit 1`. Confirm the alert fires in Slack within ~5 seconds of job exit.

If the alert doesn't fire, check the GitLab CI job log for `ding` output. Common issues: webhook URL not exposed to the job (mark the variable as not "Protected" if testing on a non-protected branch), or `drain_timeout` shorter than the notifier retry window — see [Configuration → drain_timeout](../configuration.md).
If the alert doesn't fire, check the GitLab CI job log for `ding` output. Common issues: webhook URL not exposed to the job (mark the variable as not "Protected" if testing on a non-protected branch), or `drain_timeout` shorter than the notifier retry window — see [Configuration → drain_timeout](../configuration.md#drain_timeout-and-retry-behaviour-in-ding-run).

## Native GitLab UI surfacing

6 changes: 3 additions & 3 deletions docs/recipes/jenkins.md
@@ -4,7 +4,7 @@

## Prerequisites

- DING binary `>= v0.3.0` — see [install](../install.md)
- DING binary `>= v0.10.0` — see [install](../install.md)
- A Jenkins controller (any version supporting Pipeline DSL — most do)
- A notifier endpoint (Slack webhook URL, custom webhook, etc.) reachable from the Jenkins agent

@@ -47,7 +47,7 @@ rules:
match:
metric: run.exit
condition: value > 0
message: "{{ .job }} build {{ .build }} failed (exit {{ .exit_code }})"
message: "{{ .job }} build {{ .build }} failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)"
alert:
- notifier: slack
```
@@ -77,7 +77,7 @@ Note: Jenkins doesn't expose `repo`, `branch`, or `commit` as universal env vars
2. Trigger the job. Confirm a successful build produces no alert.
3. Force a failure (`exit 1` in `run-tests.sh`). Confirm the alert fires in Slack within ~5 seconds of build exit.

If the alert doesn't fire, check the Jenkins build console for `ding` output. Common issues: webhook credential not exposed to the job (`withCredentials` block missing or wrong `credentialsId`), or `drain_timeout` shorter than the notifier retry window — see [Configuration](../configuration.md).
If the alert doesn't fire, check the Jenkins build console for `ding` output. Common issues: webhook credential not exposed to the job (`withCredentials` block missing or wrong `credentialsId`), or `drain_timeout` shorter than the notifier retry window — see [Configuration → drain_timeout](../configuration.md#drain_timeout-and-retry-behaviour-in-ding-run).

## Tradeoffs / known limitations

10 changes: 5 additions & 5 deletions docs/recipes/kubernetes-jobs.md
@@ -14,7 +14,7 @@

## Prerequisites

- DING binary `>= v0.7.0` — see [install](../install.md). The recipe pulls the official container image `ghcr.io/ding-labs/ding:v0.7.0` (multi-arch, scratch base) into your Pod via an initContainer; no need to bake DING into your workload image.
- DING binary `>= v0.10.0` — see [install](../install.md). The recipe pulls the official container image `ghcr.io/ding-labs/ding:v0.10.0` (multi-arch, scratch base) into your Pod via an initContainer; no need to bake DING into your workload image.
- A Kubernetes cluster `>= 1.21` for the primary wrapper pattern below. The sidecar alternative documented in [Configuration](#sidecar-alternative-k8s-129) requires `>= 1.29` for native sidecar lifecycle.
- `kubectl` access to a namespace where you can create Jobs, ConfigMaps, and Secrets.
- A notifier endpoint (Slack webhook URL or custom webhook) you can store in a Kubernetes Secret.
@@ -58,7 +58,7 @@ data:
match:
metric: run.exit
condition: value > 0
message: "{{ .pod }} (Job {{ .job_name }}) failed with exit {{ .exit_code }}"
message: "{{ .pod }} (Job {{ .job_name }}) failed with exit {{ .exit_code }} after {{ .duration_seconds }}s"
alert:
- notifier: slack
---
@@ -82,7 +82,7 @@ spec:
name: ding-config
initContainers:
- name: install-ding
image: ghcr.io/ding-labs/ding:v0.7.0
image: ghcr.io/ding-labs/ding:v0.10.0
# `ding install` self-copies the binary — works against the FROM-scratch
# release image (no /bin/sh available). Added in DING v0.5.1.
command: ["/ding", "install", "/shared/ding"]
@@ -184,7 +184,7 @@ spec:
spec:
initContainers:
- name: ding
image: ghcr.io/ding-labs/ding:v0.7.0
image: ghcr.io/ding-labs/ding:v0.10.0
restartPolicy: Always # native sidecar — K8s 1.29+
command: ["/ding", "serve", "--config", "/etc/ding/ding.yaml"]
# ...volumeMounts for config + downward-API env block
@@ -225,7 +225,7 @@ rules:
- name: job_failed
match: { metric: run.exit }
condition: value > 0
message: "Job failed (exit {{ .exit_code }})"
message: "Job failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)"
alert:
- notifier: k8s
```
8 changes: 4 additions & 4 deletions docs/recipes/mlflow.md
@@ -4,7 +4,7 @@

## Prerequisites

- DING binary `>= v0.6.0` — see [install](../install.md)
- DING binary `>= v0.10.0` — see [install](../install.md)
- `mlflow >= 2.0` (`pip install mlflow`)
- An MLflow tracking URI: local SQLite for dev; remote tracking server like Databricks or self-hosted (`mlflow server`) for production deep-links to work
- A notifier endpoint (Slack webhook URL is the canonical example)
@@ -49,7 +49,7 @@ rules:
match: { metric: run.exit }
condition: value > 0
message: |
MLflow run failed (exit {{ .exit_code }})
MLflow run failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)
<{{ .tracking_uri }}/#/experiments/{{ .experiment_id }}/runs/{{ .run_id }}|View run in MLflow UI>
alert:
- notifier: slack
@@ -87,7 +87,7 @@ A Slack message during training when `val_loss` exceeds threshold:
…and on training-process exit:

> 🔔 `training_failed`
> MLflow run failed (exit 1)
> MLflow run failed (exit 1 after 42s)
> [View run in MLflow UI](#)

The deep-link in the second message takes you straight to the MLflow run page. All alerts are auto-tagged with `run_id`, `runner=mlflow`, `experiment_id`, `tracking_uri`.
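
To make the link between the `message` template and the rendered alert concrete, here is a minimal sketch that expands `{{ .name }}` placeholders the way the example output above suggests. It is a plain-Python approximation written for illustration: the recipe does not show DING's template engine, the `render` helper is hypothetical, and only simple field placeholders are handled.

```python
import re

# Matches simple placeholders such as {{ .exit_code }}.
PLACEHOLDER = re.compile(r"\{\{\s*\.(\w+)\s*\}\}")


def render(template: str, fields: dict) -> str:
    """Replace each {{ .name }} placeholder with str(fields[name]).

    Hypothetical helper, not DING's renderer. An unknown placeholder
    raises KeyError so a typo in the template is caught early.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in fields:
            raise KeyError(f"no value for placeholder {name!r}")
        return str(fields[name])

    return PLACEHOLDER.sub(substitute, template)


template = "MLflow run failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)"
rendered = render(template, {"exit_code": 1, "duration_seconds": 42})
print(rendered)  # prints: MLflow run failed (exit 1 after 42s)
```

The same substitution applies to the label placeholders in the deep-link line (`tracking_uri`, `experiment_id`, `run_id`), which is why those labels must be present on the event for the link to resolve.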
@@ -121,7 +121,7 @@ mlflow run . --env-manager=local
# 3. labels include run_id, experiment_id, tracking_uri
```

If the alert doesn't fire, check the `mlflow run` log for `ding` output. Common issues: `SLACK_WEBHOOK_URL` not exported in the shell that ran `mlflow run`, or `drain_timeout` shorter than the notifier retry window — see [Configuration](../configuration.md).
If the alert doesn't fire, check the `mlflow run` log for `ding` output. Common issues: `SLACK_WEBHOOK_URL` not exported in the shell that ran `mlflow run`, or `drain_timeout` shorter than the notifier retry window — see [Configuration → drain_timeout](../configuration.md#drain_timeout-and-retry-behaviour-in-ding-run).

## Tradeoffs / known limitations

8 changes: 4 additions & 4 deletions docs/recipes/modal.md
@@ -4,7 +4,7 @@

## Prerequisites

- DING binary `>= v0.8.0` — see [install](../install.md)
- DING binary `>= v0.10.0` — see [install](../install.md)
- Modal CLI (`pip install modal`) authenticated via `modal token new`
- A Modal account (free tier with $30/mo credit covers this recipe end-to-end)
- A notifier endpoint (Slack webhook URL is the canonical example)
@@ -67,7 +67,7 @@ rules:
- name: training_failed
match: { metric: run.exit }
condition: value > 0
message: "Modal function {{ .function_name }} (task {{ .modal_task_id }}) failed (exit {{ .exit_code }})"
message: "Modal function {{ .function_name }} (task {{ .modal_task_id }}) failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)"
alert:
- notifier: slack
```
@@ -91,7 +91,7 @@ A Slack message during training when `val_loss` exceeds threshold:
…and on function exit:

> 🔔 `training_failed`
> Modal function trainer (task ta-abc123def) failed (exit 1)
> Modal function trainer (task ta-abc123def) failed (exit 1 after 287s)

The `modal_task_id` matches the task ID visible in the Modal dashboard, so the Slack alert is one click away from the function's logs and metrics.

@@ -105,7 +105,7 @@ DING does **not** auto-detect Modal — Modal's runtime owns the container entry
| `MODAL_FUNCTION_NAME` | The function's Python name |
| `MODAL_APP_NAME` | The Modal `App` name (for multi-function apps) |

Emit any subset as flat top-level JSON keys (DING's ingester at `internal/ingester/json.go` extracts top-level strings as labels and numbers as floats; nested objects are skipped). Use them in `match.labels` or `message` template variables. See [Configuration](../configuration.md) for the full reference.
Emit any subset as flat top-level JSON keys. DING extracts top-level strings as labels and numbers as floats; nested objects are skipped. Use them in `match.labels` or `message` template variables. See [Configuration](../configuration.md) for the full notifier reference.
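
The extraction rule above can be sketched in a few lines of Python, which is handy for checking an event before emitting it from a Modal function. This is an illustration of the stated rule only, not DING's ingester: `split_event` is a hypothetical helper, the `hyperparams` key is made up to show a skipped nested object, and the handling of booleans, arrays, and nulls is an assumption because the recipe does not describe them.

```python
import json


def split_event(line: str):
    """Split one JSON event line into labels, numeric values, and skipped keys.

    Mirrors the rule in the recipe: top-level strings become labels,
    top-level numbers become floats, nested objects are skipped.
    """
    event = json.loads(line)
    labels, values, skipped = {}, {}, []
    for key, val in event.items():
        if isinstance(val, str):
            labels[key] = val
        elif isinstance(val, bool):
            # bool is a subclass of int in Python; the recipe does not say
            # how JSON booleans are treated, so they are set aside here.
            skipped.append(key)
        elif isinstance(val, (int, float)):
            values[key] = float(val)
        else:
            # Nested objects (and, by assumption, arrays and nulls).
            skipped.append(key)
    return labels, values, skipped


line = json.dumps({
    "val_loss": 2.7,
    "modal_task_id": "ta-abc123def",
    "function_name": "trainer",
    "hyperparams": {"lr": 0.001},  # nested, so it would not become a label
})
labels, values, skipped = split_event(line)
print("labels: ", labels)
print("values: ", values)
print("skipped:", skipped)
```

Keeping identifiers such as `modal_task_id` at the top level, rather than inside a nested object, is what makes them available to `match.labels` and the `message` template.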

## Verification

10 changes: 5 additions & 5 deletions docs/recipes/ray.md
@@ -4,7 +4,7 @@

## Prerequisites

- DING binary `>= v0.7.1` — see [install](../install.md)
- DING binary `>= v0.10.0` — see [install](../install.md)
- `ray >= 2.0` (`pip install "ray[default]"`; add `train`/`tune` extras as needed for your workload)
- A running Ray cluster: local single-node (`ray start --head`) for dev; KubeRay/Anyscale/EKS for production
- A notifier endpoint (Slack webhook URL is the canonical example)
@@ -39,7 +39,7 @@ rules:
- name: training_failed
match: { metric: run.exit }
condition: value > 0
message: "Ray job {{ .run_id }} failed (exit {{ .exit_code }})"
message: "Ray job {{ .run_id }} failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)"
alert:
- notifier: slack
```
@@ -112,7 +112,7 @@ A Slack message during training when `val_loss` exceeds threshold:
…and on training-process exit:

> 🔔 `training_failed`
> Ray job raysubmit_abcdef1234567890 failed (exit 1)
> Ray job raysubmit_abcdef1234567890 failed (exit 1 after 1843s)

All Path A alerts are auto-tagged with `run_id` + `runner=ray`. The `run_id` matches the UUID printed by `ray job list`.

@@ -125,7 +125,7 @@ All Path A alerts are auto-tagged with `run_id` + `runner=ray`. The `run_id` mat
| `run_id` | `RAY_JOB_ID` | Ray job UUID matching `ray job list` output |
| `runner` | `"ray"` (set by runctx) | |

Use these in `match.labels` or `message` template variables. See [Configuration](../configuration.md) for the full reference.
Use these in `match.labels` or `message` template variables. See [Configuration](../configuration.md) for the full notifier reference.

## Verification

@@ -146,7 +146,7 @@ ray job list
ray stop
```

If the alert doesn't fire, check the Ray driver logs (`ray job logs <id>`) for `ding` output. Common issues: `SLACK_WEBHOOK_URL` not forwarded via `--runtime-env-json`, or `drain_timeout` shorter than the notifier retry window — see [Configuration](../configuration.md).
If the alert doesn't fire, check the Ray driver logs (`ray job logs <id>`) for `ding` output. Common issues: `SLACK_WEBHOOK_URL` not forwarded via `--runtime-env-json`, or `drain_timeout` shorter than the notifier retry window — see [Configuration → drain_timeout](../configuration.md#drain_timeout-and-retry-behaviour-in-ding-run).

## Tradeoffs / known limitations
