From 07c37a618c7787b06020e6a88c328bfa7d36a3a0 Mon Sep 17 00:00:00 2001
From: zuchka
Date: Fri, 8 May 2026 17:35:12 -0700
Subject: [PATCH] docs(recipes): restore duration_seconds + version pin sweep
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Restores {{ .duration_seconds }} to every run.exit message template
across all 8 platform recipes plus _template.md. The reference was
stripped from mlflow in fcdc11e because Float-typed event fields were
unreachable from message templates; commit 5fda9c1 fixed the engine
(shipped v0.10.0) but no recipe ever re-adopted the feature.

Bumps DING binary pin to >= v0.10.0 across all 9 files (was a range of
v0.3.0–v0.8.0 reflecting authoring time, not feature requirement) and
the container image tag in argo + k8s recipes to match.

Adjacent consistency fixes:
- drain_timeout link in 5 recipes (gitlab-ci, jenkins, buildkite,
  mlflow, ray) deep-links to the configuration.md anchor.
- modal + ray phrase "full notifier reference" matching the other 6
  recipes' Configuration-section convention. Three other "full
  reference" instances in native-surfacing sections (gitlab-ci,
  buildkite, k8s) are left as-is — they reference a specific notifier
  type, a different sentence pattern.
- modal.md drops a leak of the internal/ingester/json.go path.

Three ML recipes (mlflow, modal, ray) also update their rendered
exit-alert excerpts in the "What you get" section to match the YAML
message — illustrative durations: 42s / 287s / 1843s.

mode: end-of-run sweep already shipped in 9cca220; no changes there.

Verified via:
- grep uniformity (every recipe pinned + carries duration_seconds)
- markdown anchor sweep (11 internal refs resolve)
- ding validate on extracted buildkite + mlflow YAML
- ding test-rule render check: synthetic run.exit event with
  duration_seconds=42.5 renders "after 42.5s" in the alert text

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/recipes/_template.md       |  4 ++--
 docs/recipes/argo-workflows.md  |  8 ++++----
 docs/recipes/buildkite.md       |  6 +++---
 docs/recipes/gitlab-ci.md       |  6 +++---
 docs/recipes/jenkins.md         |  6 +++---
 docs/recipes/kubernetes-jobs.md | 10 +++++-----
 docs/recipes/mlflow.md          |  8 ++++----
 docs/recipes/modal.md           |  8 ++++----
 docs/recipes/ray.md             | 10 +++++-----
 9 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/docs/recipes/_template.md b/docs/recipes/_template.md
index 841adf4..01ebcf4 100644
--- a/docs/recipes/_template.md
+++ b/docs/recipes/_template.md
@@ -12,7 +12,7 @@ rubric.
 
 ## Prerequisites
 
-- DING binary `>= v0.3.0` — see [install](../install.md)
+- DING binary `>= v0.10.0` — see [install](../install.md)
 -
 - A notifier endpoint (Slack webhook URL, custom webhook, etc.)
 
@@ -32,7 +32,7 @@ rules:
     match:
       metric: run.exit
     condition: value > 0
-    message: "Job failed (exit {{ .exit_code }})"
+    message: "Job failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)"
     alert:
       - notifier: slack
 ```
diff --git a/docs/recipes/argo-workflows.md b/docs/recipes/argo-workflows.md
index 11e1d67..22d90bb 100644
--- a/docs/recipes/argo-workflows.md
+++ b/docs/recipes/argo-workflows.md
@@ -4,7 +4,7 @@
 
 ## Prerequisites
 
-- DING binary `>= v0.7.0` — see [install](../install.md). The recipe pulls the official container image `ghcr.io/ding-labs/ding:v0.7.0` (multi-arch, scratch base) into each step's Pod via an initContainer; no need to bake DING into your workload image.
+- DING binary `>= v0.10.0` — see [install](../install.md). The recipe pulls the official container image `ghcr.io/ding-labs/ding:v0.10.0` (multi-arch, scratch base) into each step's Pod via an initContainer; no need to bake DING into your workload image.
 - Argo Workflows controller installed in the cluster, `>= v3.5` (most users on v3.5/v3.6 LTS series).
 - `kubectl` access to a namespace where you can create Workflows, ConfigMaps, and Secrets.
 - `argo` CLI installed locally (ships with the controller; one-line install per [Argo docs](https://argo-workflows.readthedocs.io/en/latest/quick-start/)).
@@ -49,7 +49,7 @@ data:
         match:
           metric: run.exit
         condition: value > 0
-        message: "Argo step {{ .pod }} (workflow {{ .workflow }}) failed with exit {{ .exit_code }}"
+        message: "Argo step {{ .pod }} (workflow {{ .workflow }}) failed with exit {{ .exit_code }} after {{ .duration_seconds }}s"
         alert:
           - notifier: slack
 ---
@@ -71,7 +71,7 @@ spec:
             name: ding-config
       initContainers:
         - name: install-ding
-          image: ghcr.io/ding-labs/ding:v0.7.0
+          image: ghcr.io/ding-labs/ding:v0.10.0
           # `ding install` self-copies the binary — works against the FROM-scratch
           # release image (no /bin/sh available). Added in DING v0.5.1.
           command: ["/ding", "install", "/shared/ding"]
@@ -173,7 +173,7 @@ spec:
         - { name: ding-config, configMap: { name: ding-config } }
       initContainers:
         - name: install-ding
-          image: ghcr.io/ding-labs/ding:v0.7.0
+          image: ghcr.io/ding-labs/ding:v0.10.0
           command: ["/ding", "install", "/shared/ding"]
           mirrorVolumeMounts: true
       container:
diff --git a/docs/recipes/buildkite.md b/docs/recipes/buildkite.md
index a8c8ca5..05902bf 100644
--- a/docs/recipes/buildkite.md
+++ b/docs/recipes/buildkite.md
@@ -4,7 +4,7 @@
 
 ## Prerequisites
 
-- DING binary `>= v0.3.0` — see [install](../install.md)
+- DING binary `>= v0.10.0` — see [install](../install.md)
 - A Buildkite organization with at least one agent ([free trial available](https://buildkite.com/pricing))
 - A notifier endpoint (Slack webhook URL, custom webhook, etc.)
 
@@ -33,7 +33,7 @@ rules:
     match:
       metric: run.exit
     condition: value > 0
-    message: "{{ .repo }}@{{ .branch }} failed (exit {{ .exit_code }})"
+    message: "{{ .repo }}@{{ .branch }} failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)"
     alert:
       - notifier: slack
 ```
@@ -64,7 +64,7 @@ Use these in `match.labels` or `message` templates. See [Configuration](../confi
 2. Trigger a build. Confirm a successful step produces no alert.
 3. Force a failure (`exit 1` in `run-tests.sh`). Confirm the alert fires in Slack within ~5 seconds of step exit.
 
-If the alert doesn't fire, check the Buildkite build log for `ding` output. Common issues: webhook URL not exposed (env hook scope, agent vs pipeline level), or `drain_timeout` shorter than the notifier retry window — see [Configuration](../configuration.md).
+If the alert doesn't fire, check the Buildkite build log for `ding` output. Common issues: webhook URL not exposed (env hook scope, agent vs pipeline level), or `drain_timeout` shorter than the notifier retry window — see [Configuration → drain_timeout](../configuration.md#drain_timeout-and-retry-behaviour-in-ding-run).
 
 ## Native Buildkite UI surfacing
 
diff --git a/docs/recipes/gitlab-ci.md b/docs/recipes/gitlab-ci.md
index 7f5becc..7a6704b 100644
--- a/docs/recipes/gitlab-ci.md
+++ b/docs/recipes/gitlab-ci.md
@@ -4,7 +4,7 @@
 
 ## Prerequisites
 
-- DING binary `>= v0.3.0` — see [install](../install.md)
+- DING binary `>= v0.10.0` — see [install](../install.md)
 - A GitLab project with CI enabled (gitlab.com or self-hosted)
 - A notifier endpoint (Slack webhook URL, custom webhook, etc.) accessible from the runner
 
@@ -35,7 +35,7 @@ rules:
     match:
       metric: run.exit
     condition: value > 0
-    message: "Pipeline {{ .branch }} failed (exit {{ .exit_code }})"
+    message: "Pipeline {{ .branch }} failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)"
     alert:
       - notifier: slack
 ```
@@ -67,7 +67,7 @@ Use these in `match.labels` for selective rules, or in `message` templates as `{
 2. Push a commit. Confirm the pipeline runs and that a successful job produces no alert.
 3. Force a failure: change `run-tests.sh` to `exit 1`. Confirm the alert fires in Slack within ~5 seconds of job exit.
 
-If the alert doesn't fire, check the GitLab CI job log for `ding` output. Common issues: webhook URL not exposed to the job (mark the variable as not "Protected" if testing on a non-protected branch), or `drain_timeout` shorter than the notifier retry window — see [Configuration → drain_timeout](../configuration.md).
+If the alert doesn't fire, check the GitLab CI job log for `ding` output. Common issues: webhook URL not exposed to the job (mark the variable as not "Protected" if testing on a non-protected branch), or `drain_timeout` shorter than the notifier retry window — see [Configuration → drain_timeout](../configuration.md#drain_timeout-and-retry-behaviour-in-ding-run).
 
 ## Native GitLab UI surfacing
 
diff --git a/docs/recipes/jenkins.md b/docs/recipes/jenkins.md
index 95124c8..f1d906a 100644
--- a/docs/recipes/jenkins.md
+++ b/docs/recipes/jenkins.md
@@ -4,7 +4,7 @@
 
 ## Prerequisites
 
-- DING binary `>= v0.3.0` — see [install](../install.md)
+- DING binary `>= v0.10.0` — see [install](../install.md)
 - A Jenkins controller (any version supporting Pipeline DSL — most do)
 - A notifier endpoint (Slack webhook URL, custom webhook, etc.) reachable from the Jenkins agent
 
@@ -47,7 +47,7 @@ rules:
     match:
       metric: run.exit
     condition: value > 0
-    message: "{{ .job }} build {{ .build }} failed (exit {{ .exit_code }})"
+    message: "{{ .job }} build {{ .build }} failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)"
     alert:
       - notifier: slack
 ```
@@ -77,7 +77,7 @@ Note: Jenkins doesn't expose `repo`, `branch`, or `commit` as universal env vars
 2. Trigger the job. Confirm a successful build produces no alert.
 3. Force a failure (`exit 1` in `run-tests.sh`). Confirm the alert fires in Slack within ~5 seconds of build exit.
 
-If the alert doesn't fire, check the Jenkins build console for `ding` output. Common issues: webhook credential not exposed to the job (`withCredentials` block missing or wrong `credentialsId`), or `drain_timeout` shorter than the notifier retry window — see [Configuration](../configuration.md).
+If the alert doesn't fire, check the Jenkins build console for `ding` output. Common issues: webhook credential not exposed to the job (`withCredentials` block missing or wrong `credentialsId`), or `drain_timeout` shorter than the notifier retry window — see [Configuration → drain_timeout](../configuration.md#drain_timeout-and-retry-behaviour-in-ding-run).
 
 ## Tradeoffs / known limitations
 
diff --git a/docs/recipes/kubernetes-jobs.md b/docs/recipes/kubernetes-jobs.md
index 68ba5c8..6898509 100644
--- a/docs/recipes/kubernetes-jobs.md
+++ b/docs/recipes/kubernetes-jobs.md
@@ -14,7 +14,7 @@
 
 ## Prerequisites
 
-- DING binary `>= v0.7.0` — see [install](../install.md). The recipe pulls the official container image `ghcr.io/ding-labs/ding:v0.7.0` (multi-arch, scratch base) into your Pod via an initContainer; no need to bake DING into your workload image.
+- DING binary `>= v0.10.0` — see [install](../install.md). The recipe pulls the official container image `ghcr.io/ding-labs/ding:v0.10.0` (multi-arch, scratch base) into your Pod via an initContainer; no need to bake DING into your workload image.
 - A Kubernetes cluster `>= 1.21` for the primary wrapper pattern below. The sidecar alternative documented in [Configuration](#sidecar-alternative-k8s-129) requires `>= 1.29` for native sidecar lifecycle.
 - `kubectl` access to a namespace where you can create Jobs, ConfigMaps, and Secrets.
 - A notifier endpoint (Slack webhook URL or custom webhook) you can store in a Kubernetes Secret.
@@ -58,7 +58,7 @@ data:
         match:
           metric: run.exit
         condition: value > 0
-        message: "{{ .pod }} (Job {{ .job_name }}) failed with exit {{ .exit_code }}"
+        message: "{{ .pod }} (Job {{ .job_name }}) failed with exit {{ .exit_code }} after {{ .duration_seconds }}s"
         alert:
           - notifier: slack
 ---
@@ -82,7 +82,7 @@ spec:
             name: ding-config
       initContainers:
         - name: install-ding
-          image: ghcr.io/ding-labs/ding:v0.7.0
+          image: ghcr.io/ding-labs/ding:v0.10.0
           # `ding install` self-copies the binary — works against the FROM-scratch
           # release image (no /bin/sh available). Added in DING v0.5.1.
           command: ["/ding", "install", "/shared/ding"]
@@ -184,7 +184,7 @@ spec:
     spec:
       initContainers:
         - name: ding
-          image: ghcr.io/ding-labs/ding:v0.7.0
+          image: ghcr.io/ding-labs/ding:v0.10.0
           restartPolicy: Always # native sidecar — K8s 1.29+
           command: ["/ding", "serve", "--config", "/etc/ding/ding.yaml"]
           # ...volumeMounts for config + downward-API env block
@@ -225,7 +225,7 @@ rules:
   - name: job_failed
     match: { metric: run.exit }
     condition: value > 0
-    message: "Job failed (exit {{ .exit_code }})"
+    message: "Job failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)"
     alert:
       - notifier: k8s
 ```
diff --git a/docs/recipes/mlflow.md b/docs/recipes/mlflow.md
index 02eb48f..98e767b 100644
--- a/docs/recipes/mlflow.md
+++ b/docs/recipes/mlflow.md
@@ -4,7 +4,7 @@
 
 ## Prerequisites
 
-- DING binary `>= v0.6.0` — see [install](../install.md)
+- DING binary `>= v0.10.0` — see [install](../install.md)
 - `mlflow >= 2.0` (`pip install mlflow`)
 - An MLflow tracking URI: local SQLite for dev; remote tracking server like Databricks or self-hosted (`mlflow server`) for production deep-links to work
 - A notifier endpoint (Slack webhook URL is the canonical example)
@@ -49,7 +49,7 @@ rules:
     match: { metric: run.exit }
     condition: value > 0
     message: |
-      MLflow run failed (exit {{ .exit_code }})
+      MLflow run failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)
       <{{ .tracking_uri }}/#/experiments/{{ .experiment_id }}/runs/{{ .run_id }}|View run in MLflow UI>
     alert:
       - notifier: slack
@@ -87,7 +87,7 @@ A Slack message during training when `val_loss` exceeds threshold:
 …and on training-process exit:
 
 > 🔔 `training_failed`
-> MLflow run failed (exit 1)
+> MLflow run failed (exit 1 after 42s)
 > [View run in MLflow UI](#)
 
 The deep-link in the second message takes you straight to the MLflow run page. All alerts are auto-tagged with `run_id`, `runner=mlflow`, `experiment_id`, `tracking_uri`.
@@ -121,7 +121,7 @@ mlflow run . --env-manager=local
 # 3. labels include run_id, experiment_id, tracking_uri
 ```
 
-If the alert doesn't fire, check the `mlflow run` log for `ding` output. Common issues: `SLACK_WEBHOOK_URL` not exported in the shell that ran `mlflow run`, or `drain_timeout` shorter than the notifier retry window — see [Configuration](../configuration.md).
+If the alert doesn't fire, check the `mlflow run` log for `ding` output. Common issues: `SLACK_WEBHOOK_URL` not exported in the shell that ran `mlflow run`, or `drain_timeout` shorter than the notifier retry window — see [Configuration → drain_timeout](../configuration.md#drain_timeout-and-retry-behaviour-in-ding-run).
 
 ## Tradeoffs / known limitations
 
diff --git a/docs/recipes/modal.md b/docs/recipes/modal.md
index 319255e..acb9781 100644
--- a/docs/recipes/modal.md
+++ b/docs/recipes/modal.md
@@ -4,7 +4,7 @@
 
 ## Prerequisites
 
-- DING binary `>= v0.8.0` — see [install](../install.md)
+- DING binary `>= v0.10.0` — see [install](../install.md)
 - Modal CLI (`pip install modal`) authenticated via `modal token new`
 - A Modal account (free tier with $30/mo credit covers this recipe end-to-end)
 - A notifier endpoint (Slack webhook URL is the canonical example)
@@ -67,7 +67,7 @@ rules:
   - name: training_failed
     match: { metric: run.exit }
     condition: value > 0
-    message: "Modal function {{ .function_name }} (task {{ .modal_task_id }}) failed (exit {{ .exit_code }})"
+    message: "Modal function {{ .function_name }} (task {{ .modal_task_id }}) failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)"
     alert:
       - notifier: slack
 ```
@@ -91,7 +91,7 @@ A Slack message during training when `val_loss` exceeds threshold:
 …and on function exit:
 
 > 🔔 `training_failed`
-> Modal function trainer (task ta-abc123def) failed (exit 1)
+> Modal function trainer (task ta-abc123def) failed (exit 1 after 287s)
 
 The `modal_task_id` matches the task ID visible in the Modal dashboard, so the Slack alert is one click away from the function's logs and metrics.
 
@@ -105,7 +105,7 @@ DING does **not** auto-detect Modal — Modal's runtime owns the container entry
 | `MODAL_FUNCTION_NAME` | The function's Python name |
 | `MODAL_APP_NAME` | The Modal `App` name (for multi-function apps) |
 
-Emit any subset as flat top-level JSON keys (DING's ingester at `internal/ingester/json.go` extracts top-level strings as labels and numbers as floats; nested objects are skipped). Use them in `match.labels` or `message` template variables. See [Configuration](../configuration.md) for the full reference.
+Emit any subset as flat top-level JSON keys — DING extracts top-level strings as labels and numbers as floats; nested objects are skipped. Use them in `match.labels` or `message` template variables. See [Configuration](../configuration.md) for the full notifier reference.
 
 ## Verification
 
diff --git a/docs/recipes/ray.md b/docs/recipes/ray.md
index 2a157f5..f1d445f 100644
--- a/docs/recipes/ray.md
+++ b/docs/recipes/ray.md
@@ -4,7 +4,7 @@
 
 ## Prerequisites
 
-- DING binary `>= v0.7.1` — see [install](../install.md)
+- DING binary `>= v0.10.0` — see [install](../install.md)
 - `ray >= 2.0` (`pip install "ray[default]"`; add `train`/`tune` extras as needed for your workload)
 - A running Ray cluster: local single-node (`ray start --head`) for dev; KubeRay/Anyscale/EKS for production
 - A notifier endpoint (Slack webhook URL is the canonical example)
@@ -39,7 +39,7 @@ rules:
   - name: training_failed
     match: { metric: run.exit }
     condition: value > 0
-    message: "Ray job {{ .run_id }} failed (exit {{ .exit_code }})"
+    message: "Ray job {{ .run_id }} failed (exit {{ .exit_code }} after {{ .duration_seconds }}s)"
     alert:
       - notifier: slack
 ```
@@ -112,7 +112,7 @@ A Slack message during training when `val_loss` exceeds threshold:
 …and on training-process exit:
 
 > 🔔 `training_failed`
-> Ray job raysubmit_abcdef1234567890 failed (exit 1)
+> Ray job raysubmit_abcdef1234567890 failed (exit 1 after 1843s)
 
 All Path A alerts are auto-tagged with `run_id` + `runner=ray`. The `run_id` matches the UUID printed by `ray job list`.
 
@@ -125,7 +125,7 @@ All Path A alerts are auto-tagged with `run_id` + `runner=ray`. The `run_id` mat
 | `run_id` | `RAY_JOB_ID` | Ray job UUID matching `ray job list` output |
 | `runner` | `"ray"` (set by runctx) | |
 
-Use these in `match.labels` or `message` template variables. See [Configuration](../configuration.md) for the full reference.
+Use these in `match.labels` or `message` template variables. See [Configuration](../configuration.md) for the full notifier reference.
 
 ## Verification
 
@@ -146,7 +146,7 @@ ray job list
 ray stop
 ```
 
-If the alert doesn't fire, check the Ray driver logs (`ray job logs `) for `ding` output. Common issues: `SLACK_WEBHOOK_URL` not forwarded via `--runtime-env-json`, or `drain_timeout` shorter than the notifier retry window — see [Configuration](../configuration.md).
+If the alert doesn't fire, check the Ray driver logs (`ray job logs `) for `ding` output. Common issues: `SLACK_WEBHOOK_URL` not forwarded via `--runtime-env-json`, or `drain_timeout` shorter than the notifier retry window — see [Configuration → drain_timeout](../configuration.md#drain_timeout-and-retry-behaviour-in-ding-run).
 
 ## Tradeoffs / known limitations