From e3fe6ff53edd37134595dc05cce3bbc6b16c18a0 Mon Sep 17 00:00:00 2001
From: Greg Svigruha
Date: Mon, 16 Mar 2026 14:02:55 -0400
Subject: [PATCH 01/10] move hallucination doc

---
 .../template_evaluations.md                   | 70 ++++++++++++++++++-
 .../evaluations/evaluation_compatibility.md   |  4 +-
 .../evaluations/managed_evaluations/_index.md |  5 +-
 .../quality_evaluations.md                    | 68 ------------------
 .../llm_observability/instrumentation/sdk.md  |  8 +--
 5 files changed, 76 insertions(+), 79 deletions(-)

diff --git a/content/en/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations.md b/content/en/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations.md
index 7d0f150ac85..b7b2f9ace63 100644
--- a/content/en/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations.md
+++ b/content/en/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations.md
@@ -8,6 +8,9 @@ further_reading:
 - link: "/llm_observability/setup"
   tag: "Documentation"
   text: "Learn how to set up LLM Observability"
+- link: "https://www.datadoghq.com/blog/llm-observability-hallucination-detection/"
+  tag: "Blog"
+  text: "Detect hallucinations in your RAG LLM applications with Datadog LLM Observability"
 aliases:
 - /llm_observability/evaluations/agent_evaluations
 - /llm_observability/evaluations/managed_evaluations/agent_evaluations
 - /llm_observability/evaluations/
 - /llm_observability/evaluations/managed_evaluations/session_level_evaluations
 ---
 
-Datadog provides LLM-as-a-judge templates for the following evaluations: [Failure to Answer][16], [Goal Completeness][22], [Prompt Injection][14], [Sentiment][12], [Tool Argument Correctness][23], [Tool Selection][24], [Topic Relevancy][15], and [Toxicity][13]. After you select a template, you can modify any aspect of the evaluation.
+Datadog provides LLM-as-a-judge templates for the following evaluations: [Failure to Answer][16], [Goal Completeness][22], [Hallucination][25], [Prompt Injection][14], [Sentiment][12], [Tool Argument Correctness][23], [Tool Selection][24], [Topic Relevancy][15], and [Toxicity][13]. After you select a template, you can modify any aspect of the evaluation.
 
 For best practices and details on how to create LLM-as-a-judge evaluations, read [Create a custom LLM-as-a-judge evaluation][17].
 
@@ -52,6 +55,70 @@ Datadog provides the following categories of Failure to Answer, listed in the fo
 | Redirection Response | Redirects the user to another source or suggests an alternative approach | If you have additional details, I'd be happy to include them|
 | Refusal Response | Explicitly declines to provide an answer or to complete the request | Sorry, I can't answer this question |
 
+### Hallucination
+
+Hallucination evaluations identify instances where the LLM makes a claim that disagrees with the provided input context. This check helps ensure your RAG applications stay grounded in retrieved data and do not fabricate information.
+
+{{< img src="llm_observability/evaluations/hallucination_5.png" alt="A Hallucination evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}
+
+| Evaluation Stage | Evaluation Definition |
+|---|---|
+| Evaluated on Output | Hallucination flags any output that disagrees with the context provided to the LLM. |
+
+<div class="alert alert-info">Hallucination detection is only available for OpenAI.</div>
+
+#### Configure a Hallucination evaluation
+
+Use [Prompt Tracking][26] annotations to track your prompts and set them up for hallucination detection. Annotate your LLM spans with the user query and context so hallucination detection can evaluate model outputs against the retrieved data.
+
+{{< code-block lang="python" >}}
+from ddtrace.llmobs import LLMObs
+from ddtrace.llmobs.decorators import llm
+from ddtrace.llmobs.types import Prompt
+
+# If your LLM call is auto-instrumented...
+with LLMObs.annotation_context(
+    prompt=Prompt(
+        id="generate_answer_prompt",
+        template="Generate an answer to this question: {user_question}. Only answer based on the information from this article: {article}",
+        variables={"user_question": user_question, "article": article},
+        rag_query_variables=["user_question"],
+        rag_context_variables=["article"]
+    ),
+    name="generate_answer"
+):
+    oai_client.chat.completions.create(...)  # auto-instrumented LLM call
+
+# If your LLM call is manually instrumented...
+@llm(name="generate_answer")
+def generate_answer():
+    ...
+    LLMObs.annotate(
+        prompt=Prompt(
+            id="generate_answer_prompt",
+            template="Generate an answer to this question: {user_question}. Only answer based on the information from this article: {article}",
+            variables={"user_question": user_question, "article": article},
+            rag_query_variables=["user_question"],
+            rag_context_variables=["article"]
+        ),
+    )
+{{< /code-block >}}
+
+The `variables` dictionary should contain the key-value pairs your app uses to construct the LLM input prompt (for example, the messages for an OpenAI chat completion request). Use `rag_query_variables` and `rag_context_variables` to specify which variables represent the user query and which represent the retrieval context. A list of variables is allowed to account for cases where multiple variables make up the context (for example, multiple articles retrieved from a knowledge base).
+
+Hallucination detection does not run if the RAG query, the RAG context, or the span output is empty.
+
+Prompt Tracking is available in the Python SDK starting with version 3.15. It also requires the prompt's ID and template to be set, so that your prompt versions can be monitored and tracked. You can find more examples of prompt tracking and instrumentation in the [SDK documentation][26].
+
+Hallucination detection makes a distinction between two types of hallucinations, which can be configured when Hallucination is enabled:
+
+| Configuration Option | Description |
+|---|---|
+| Contradiction | Claims made in the LLM-generated response that go directly against the provided context |
+| Unsupported Claim | Claims made in the LLM-generated response that are not grounded in the context |
+
+Contradictions are always detected, while Unsupported Claims can be optionally included. For sensitive use cases, Datadog recommends including Unsupported Claims.
+
 ### Prompt Injection
 
 Prompt Injection evaluations identify attempts by unauthorized or malicious authors to manipulate the LLM's responses or redirect the conversation in ways not intended by the original author. This check maintains the integrity and authenticity of interactions between users and the LLM.
@@ -342,3 +408,5 @@ result = triage_agent.run_sync(
 [22]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#goal-completeness
 [23]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#tool-argument-correctness
 [24]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#tool-selection
+[25]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#hallucination
+[26]: /llm_observability/instrumentation/sdk?tab=python#prompt-tracking
diff --git a/content/en/llm_observability/evaluations/evaluation_compatibility.md b/content/en/llm_observability/evaluations/evaluation_compatibility.md
index ed17c51d9a9..9272e5a5cba 100644
--- a/content/en/llm_observability/evaluations/evaluation_compatibility.md
+++ b/content/en/llm_observability/evaluations/evaluation_compatibility.md
@@ -13,7 +13,6 @@ Managed evaluations are supported for the following configurations.
 
 | Evaluation                      | DD-trace version  | LLM Provider                  | Applicable span |
 | --------------------------------| ----------------- | ------------------------------| ----------------|
-| [Hallucination][4]              | v2.18+            | OpenAI                        | LLM only        |
 | [Language Mismatch][10]         | Fully supported   | Self hosted                   | All span kinds  |
 
 ### Custom LLM-as-a-judge evaluations
@@ -34,6 +33,7 @@ Existing templates for custom LLM-as-a-judge evaluations are supported for the f
 | Evaluation              | DD-trace version | LLM Provider                  | Applicable span |
 | ----------------------- | ---------------- | ----------------------------- | --------------- |
 | [Failure to Answer][5]  | Fully supported  | All third party LLM providers | All span kinds  |
+| [Hallucination][4]      | v2.18+           | OpenAI                        | LLM only        |
 | [Sentiment][6]          | Fully supported  | All third party LLM providers | All span kinds  |
 | [Toxicity][7]           | Fully supported  | All third party LLM providers | All span kinds  |
 | [Prompt Injection][8]   | Fully supported  | All third party LLM providers | All span kinds  |
@@ -46,7 +46,7 @@ Existing templates for custom LLM-as-a-judge evaluations are supported for the f
 [1]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#tool-selection
 [2]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#tool-argument-correctness
 [3]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#goal-completeness
-[4]: /llm_observability/evaluations/managed_evaluations#hallucination
+[4]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#hallucination
 [5]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#failure-to-answer
 [6]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#sentiment
 [7]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#toxicity
diff --git a/content/en/llm_observability/evaluations/managed_evaluations/_index.md b/content/en/llm_observability/evaluations/managed_evaluations/_index.md
index 051924c82ce..35a3cc54447 100644
--- a/content/en/llm_observability/evaluations/managed_evaluations/_index.md
+++ b/content/en/llm_observability/evaluations/managed_evaluations/_index.md
@@ -11,9 +11,6 @@ further_reading:
 - link: "/llm_observability/setup"
   tag: "Documentation"
   text: "Learn how to set up LLM Observability"
-- link: "https://www.datadoghq.com/blog/llm-observability-hallucination-detection/"
-  tag: "Blog"
-  text: "Detect hallucinations in your RAG LLM applications with Datadog LLM Observability"
 aliases:
 - /llm_observability/evaluations/ootb_evaluations
 ---
@@ -224,7 +221,7 @@ Each of these metrics has `ml_app`, `model_server`, `model_provider`, `model_nam
 [3]: https://app.datadoghq.com/dash/integration/llm_evaluations_token_usage
 [4]: /llm_observability/evaluations/managed_evaluations/quality_evaluations
 [5]: /llm_observability/evaluations/managed_evaluations/quality_evaluations#topic-relevancy
-[6]: /llm_observability/evaluations/managed_evaluations/quality_evaluations#hallucination
+[6]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#hallucination
 [7]: /llm_observability/evaluations/managed_evaluations/quality_evaluations#failure-to-answer
 [8]: /llm_observability/evaluations/managed_evaluations/quality_evaluations#language-mismatch
 [9]: /llm_observability/evaluations/managed_evaluations/quality_evaluations#sentiment
diff --git a/content/en/llm_observability/evaluations/managed_evaluations/quality_evaluations.md b/content/en/llm_observability/evaluations/managed_evaluations/quality_evaluations.md
index 34e097e81f5..f60fb94b9bb 100644
--- a/content/en/llm_observability/evaluations/managed_evaluations/quality_evaluations.md
+++ b/content/en/llm_observability/evaluations/managed_evaluations/quality_evaluations.md
@@ -8,78 +8,12 @@ further_reading:
 - link: "/llm_observability/setup"
   tag: "Documentation"
   text: "Learn how to set up LLM Observability"
-- link: "https://www.datadoghq.com/blog/llm-observability-hallucination-detection/"
-  tag: "Blog"
-  text: "Detect hallucinations in your RAG LLM applications with Datadog LLM Observability"
 aliases:
 - /llm_observability/evaluations/quality_evaluations
 ---
 
 Quality evaluations help ensure your LLM-powered applications generate accurate, relevant, and safe responses. Managed evaluations automatically score model outputs on key quality dimensions and attach results to traces, helping you detect issues, monitor trends, and improve response quality over time.
 
-#### Hallucination
-
-This check identifies instances where the LLM makes a claim that disagrees with the provided input context.
-
-{{< img src="llm_observability/evaluations/hallucination_5.png" alt="A Hallucination evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}
-
-| Evaluation Stage | Evaluation Method | Evaluation Definition |
-|---|---|---|
-| Evaluated on Output | Evaluated using LLM | Hallucination flags any output that disagrees with the context provided to the LLM. |
-
-##### Instrumentation
-You can use [Prompt Tracking][2] annotations to track your prompts and set them up for hallucination configuration. Annotate your LLM spans with the user query and context so hallucination detection can evaluate model outputs against the retrieved data.
-
-{{< code-block lang="python" >}}
-from ddtrace.llmobs import LLMObs
-from ddtrace.llmobs.types import Prompt
-
-# if your llm call is auto-instrumented...
-with LLMObs.annotation_context(
-    prompt=Prompt(
-        id="generate_answer_prompt",
-        template="Generate an answer to this question :{user_question}. Only answer based on the information from this article : {article}",
-        variables={"user_question": user_question, "article": article},
-        rag_query_variables=["user_question"],
-        rag_context_variables=["article"]
-    ),
-    name="generate_answer"
-):
-    oai_client.chat.completions.create(...) # autoinstrumented llm call
-
-# if your llm call is manually instrumented ...
-@llm(name="generate_answer")
-def generate_answer():
-    ...
-    LLMObs.annotate(
-        prompt=Prompt(
-            id="generate_answer_prompt",
-            template="Generate an answer to this question :{user_question}. Only answer based on the information from this article : {article}",
-            variables={"user_question": user_question, "article": article},
-            rag_query_variables=["user_question"],
-            rag_context_variables=["article"]
-        ),
-    )
-{{< /code-block >}}
-The `variables` dictionary should contain the key-value pairs your app uses to construct the LLM input prompt (for example, the messages for an OpenAI chat completion request). Use `rag_query_variables` and `rag_context_variables` to specify which variables represent the user query and which represent the retrieval context. A list of variables is allowed to account for cases where multiple variables make up the context (for example, multiple articles retrieved from a knowledge base).
-
-Hallucination detection does not run if either the rag query, the rag context, or the span output is empty.
-
-Prompt Tracking is available on python starting from the 3.15 version, It also requires an ID for the prompt and the template set up to monitor and track your prompt versions.
-You can find more examples of prompt tracking and instrumentation in the [SDK documentation][2].
-
-##### Hallucination configuration
-<div class="alert alert-info">Hallucination detection is only available for OpenAI.</div>
-Hallucination detection makes a distinction between two types of hallucinations, which can be configured when Hallucination is enabled.
-
-| Configuration Option | Description |
-|---|---|
-| Contradiction | Claims made in the LLM-generated response that go directly against the provided context |
-| Unsupported Claim | Claims made in the LLM-generated response that are not grounded in the context |
-
-Contradictions are always detected, while Unsupported Claims can be optionally included. For sensitive use cases, we recommend including Unsupported Claims.
-
-
 #### Language Mismatch
 
 This check identifies instances where the LLM generates responses in a different language or dialect than the one used by the user, which can lead to confusion or miscommunication. This check ensures that the LLM's responses are clear, relevant, and appropriate for the user's linguistic preferences and needs.
 
@@ -96,5 +30,3 @@ Afrikaans, Albanian, Arabic, Armenian, Azerbaijani, Belarusian, Bengali, Norwegi
 |---|---|---|
 | Evaluated on Input and Output | Evaluated using Open Source Model | Language Mismatch flags whether each prompt-response pair demonstrates that the LLM application answered the user's question in the same language that the user used. |
 
-[1]: https://app.datadoghq.com/llm/applications
-[2]: /llm_observability/instrumentation/sdk?tab=python#prompt-tracking
diff --git a/content/en/llm_observability/instrumentation/sdk.md b/content/en/llm_observability/instrumentation/sdk.md
index a3b787fc79d..165e8cfb5b5 100644
--- a/content/en/llm_observability/instrumentation/sdk.md
+++ b/content/en/llm_observability/instrumentation/sdk.md
@@ -1808,8 +1808,8 @@ Supported keys:
 - `template` (str): Template string with placeholders (for example, `"Translate {{text}} to {{lang}}"`).
 - `chat_template` (List[Message]): Multi-message template form. Provide a list of `{ "role": "", "content": "