From e3fe6ff53edd37134595dc05cce3bbc6b16c18a0 Mon Sep 17 00:00:00 2001
From: Greg Svigruha
Date: Mon, 16 Mar 2026 14:02:55 -0400
Subject: [PATCH 01/10] move hallucination doc

---
 .../template_evaluations.md                   | 70 ++++++++++++++++++-
 .../evaluations/evaluation_compatibility.md   |  4 +-
 .../evaluations/managed_evaluations/_index.md |  5 +-
 .../quality_evaluations.md                    | 68 ------------------
 .../llm_observability/instrumentation/sdk.md  |  8 +--
 5 files changed, 76 insertions(+), 79 deletions(-)

diff --git a/content/en/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations.md b/content/en/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations.md
index 7d0f150ac85..b7b2f9ace63 100644
--- a/content/en/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations.md
+++ b/content/en/llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations.md
@@ -8,6 +8,9 @@ further_reading:
 - link: "/llm_observability/setup"
   tag: "Documentation"
   text: "Learn how to set up LLM Observability"
+- link: "https://www.datadoghq.com/blog/llm-observability-hallucination-detection/"
+  tag: "Blog"
+  text: "Detect hallucinations in your RAG LLM applications with Datadog LLM Observability"
 aliases:
 - /llm_observability/evaluations/agent_evaluations
 - /llm_observability/evaluations/managed_evaluations/agent_evaluations
 - /llm_observability/evaluations/
 - /llm_observability/evaluations/managed_evaluations/session_level_evaluations
 ---
 
-Datadog provides LLM-as-a-judge templates for the following evaluations: [Failure to Answer][16], [Goal Completeness][22], [Prompt Injection][14], [Sentiment][12], [Tool Argument Correctness][23], [Tool Selection][24], [Topic Relevancy][15], and [Toxicity][13]. After you select a template, you can modify any aspect of the evaluation.
+Datadog provides LLM-as-a-judge templates for the following evaluations: [Failure to Answer][16], [Goal Completeness][22], [Hallucination][25], [Prompt Injection][14], [Sentiment][12], [Tool Argument Correctness][23], [Tool Selection][24], [Topic Relevancy][15], and [Toxicity][13]. After you select a template, you can modify any aspect of the evaluation.
 
 For best practices and details on how to create LLM-as-a-judge evaluations, read [Create a custom LLM-as-a-judge evaluation][17].
 
@@ -52,6 +55,70 @@ Datadog provides the following categories of Failure to Answer, listed in the fo
 | Redirection Response | Redirects the user to another source or suggests an alternative approach | If you have additional details, I'd be happy to include them|
 | Refusal Response | Explicitly declines to provide an answer or to complete the request | Sorry, I can't answer this question |
 
+### Hallucination
+
+Hallucination evaluations identify instances where the LLM makes a claim that disagrees with the provided input context. This check helps ensure your RAG applications stay grounded in retrieved data and do not fabricate information.
+
+{{< img src="llm_observability/evaluations/hallucination_5.png" alt="A Hallucination evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}
+
+| Evaluation Stage | Evaluation Definition |
+|---|---|
+| Evaluated on Output | Hallucination flags any output that disagrees with the context provided to the LLM. |
+
+<div class="alert alert-info">Hallucination detection is only available for OpenAI.</div>
+
+#### Configure a Hallucination evaluation
+
+Use [Prompt Tracking][26] annotations to track your prompts and set them up for hallucination detection. Annotate your LLM spans with the user query and context so hallucination detection can evaluate model outputs against the retrieved data.
+
+{{< code-block lang="python" >}}
+from ddtrace.llmobs import LLMObs
+from ddtrace.llmobs.decorators import llm
+from ddtrace.llmobs.types import Prompt
+
+# If your LLM call is auto-instrumented...
+with LLMObs.annotation_context(
+    prompt=Prompt(
+        id="generate_answer_prompt",
+        template="Generate an answer to this question: {user_question}. Only answer based on the information from this article: {article}",
+        variables={"user_question": user_question, "article": article},
+        rag_query_variables=["user_question"],
+        rag_context_variables=["article"]
+    ),
+    name="generate_answer"
+):
+    oai_client.chat.completions.create(...)  # auto-instrumented LLM call
+
+# If your LLM call is manually instrumented...
+@llm(name="generate_answer")
+def generate_answer():
+    ...
+    LLMObs.annotate(
+        prompt=Prompt(
+            id="generate_answer_prompt",
+            template="Generate an answer to this question: {user_question}. Only answer based on the information from this article: {article}",
+            variables={"user_question": user_question, "article": article},
+            rag_query_variables=["user_question"],
+            rag_context_variables=["article"]
+        ),
+    )
+{{< /code-block >}}
+
+The `variables` dictionary should contain the key-value pairs your app uses to construct the LLM input prompt (for example, the messages for an OpenAI chat completion request). Use `rag_query_variables` and `rag_context_variables` to specify which variables represent the user query and which represent the retrieval context. A list of variables is allowed to account for cases where multiple variables make up the context (for example, multiple articles retrieved from a knowledge base).
+
+Hallucination detection does not run if the RAG query, the RAG context, or the span output is empty.
+
+Prompt Tracking is available in the Python SDK starting with version 3.15. It also requires the prompt's ID and template to be set, so that your prompt versions can be monitored and tracked. You can find more examples of prompt tracking and instrumentation in the [SDK documentation][26].
+
+Hallucination detection makes a distinction between two types of hallucinations, which can be configured when Hallucination is enabled:
+
+| Configuration Option | Description |
+|---|---|
+| Contradiction | Claims made in the LLM-generated response that go directly against the provided context |
+| Unsupported Claim | Claims made in the LLM-generated response that are not grounded in the context |
+
+Contradictions are always detected, while Unsupported Claims can be optionally included. For sensitive use cases, Datadog recommends including Unsupported Claims.
+
 ### Prompt Injection
 
 Prompt Injection evaluations identify attempts by unauthorized or malicious authors to manipulate the LLM's responses or redirect the conversation in ways not intended by the original author. This check maintains the integrity and authenticity of interactions between users and the LLM.
@@ -342,3 +408,5 @@ result = triage_agent.run_sync(
 [22]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#goal-completeness
 [23]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#tool-argument-correctness
 [24]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#tool-selection
+[25]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#hallucination
+[26]: /llm_observability/instrumentation/sdk?tab=python#prompt-tracking
diff --git a/content/en/llm_observability/evaluations/evaluation_compatibility.md b/content/en/llm_observability/evaluations/evaluation_compatibility.md
index ed17c51d9a9..9272e5a5cba 100644
--- a/content/en/llm_observability/evaluations/evaluation_compatibility.md
+++ b/content/en/llm_observability/evaluations/evaluation_compatibility.md
@@ -13,7 +13,6 @@ Managed evaluations are supported for the following configurations.
 
 | Evaluation                      | DD-trace version  | LLM Provider                  | Applicable span |
 | --------------------------------| ----------------- | ------------------------------| ----------------|
-| [Hallucination][4]              | v2.18+            | OpenAI                        | LLM only        |
 | [Language Mismatch][10]         | Fully supported   | Self hosted                   | All span kinds  |
 
 ### Custom LLM-as-a-judge evaluations
@@ -34,6 +33,7 @@ Existing templates for custom LLM-as-a-judge evaluations are supported for the f
 | Evaluation              | DD-trace version | LLM Provider                  | Applicable span |
 | ----------------------- | ---------------- | ----------------------------- | --------------- |
 | [Failure to Answer][5]  | Fully supported  | All third party LLM providers | All span kinds  |
+| [Hallucination][4]      | v2.18+           | OpenAI                        | LLM only        |
 | [Sentiment][6]          | Fully supported  | All third party LLM providers | All span kinds  |
 | [Toxicity][7]           | Fully supported  | All third party LLM providers | All span kinds  |
 | [Prompt Injection][8]   | Fully supported  | All third party LLM providers | All span kinds  |
@@ -46,7 +46,7 @@ Existing templates for custom LLM-as-a-judge evaluations are supported for the f
 [1]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#tool-selection
 [2]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#tool-argument-correctness
 [3]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#goal-completeness
-[4]: /llm_observability/evaluations/managed_evaluations#hallucination
+[4]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#hallucination
 [5]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#failure-to-answer
 [6]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#sentiment
 [7]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#toxicity
diff --git a/content/en/llm_observability/evaluations/managed_evaluations/_index.md b/content/en/llm_observability/evaluations/managed_evaluations/_index.md
index 051924c82ce..35a3cc54447 100644
--- a/content/en/llm_observability/evaluations/managed_evaluations/_index.md
+++ b/content/en/llm_observability/evaluations/managed_evaluations/_index.md
@@ -11,9 +11,6 @@ further_reading:
 - link: "/llm_observability/setup"
   tag: "Documentation"
   text: "Learn how to set up LLM Observability"
-- link: "https://www.datadoghq.com/blog/llm-observability-hallucination-detection/"
-  tag: "Blog"
-  text: "Detect hallucinations in your RAG LLM applications with Datadog LLM Observability"
 aliases:
 - /llm_observability/evaluations/ootb_evaluations
 ---
@@ -224,7 +221,7 @@ Each of these metrics has `ml_app`, `model_server`, `model_provider`, `model_nam
 [3]: https://app.datadoghq.com/dash/integration/llm_evaluations_token_usage
 [4]: /llm_observability/evaluations/managed_evaluations/quality_evaluations
 [5]: /llm_observability/evaluations/managed_evaluations/quality_evaluations#topic-relevancy
-[6]: /llm_observability/evaluations/managed_evaluations/quality_evaluations#hallucination
+[6]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#hallucination
 [7]: /llm_observability/evaluations/managed_evaluations/quality_evaluations#failure-to-answer
 [8]: /llm_observability/evaluations/managed_evaluations/quality_evaluations#language-mismatch
 [9]: /llm_observability/evaluations/managed_evaluations/quality_evaluations#sentiment
diff --git a/content/en/llm_observability/evaluations/managed_evaluations/quality_evaluations.md b/content/en/llm_observability/evaluations/managed_evaluations/quality_evaluations.md
index 34e097e81f5..f60fb94b9bb 100644
--- a/content/en/llm_observability/evaluations/managed_evaluations/quality_evaluations.md
+++ b/content/en/llm_observability/evaluations/managed_evaluations/quality_evaluations.md
@@ -8,78 +8,12 @@ further_reading:
 - link: "/llm_observability/setup"
   tag: "Documentation"
   text: "Learn how to set up LLM Observability"
-- link: "https://www.datadoghq.com/blog/llm-observability-hallucination-detection/"
-  tag: "Blog"
-  text: "Detect hallucinations in your RAG LLM applications with Datadog LLM Observability"
 aliases:
 - /llm_observability/evaluations/quality_evaluations
 ---
 
 Quality evaluations help ensure your LLM-powered applications generate accurate, relevant, and safe responses. Managed evaluations automatically score model outputs on key quality dimensions and attach results to traces, helping you detect issues, monitor trends, and improve response quality over time.
 
-#### Hallucination
-
-This check identifies instances where the LLM makes a claim that disagrees with the provided input context.
-
-{{< img src="llm_observability/evaluations/hallucination_5.png" alt="A Hallucination evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}
-
-| Evaluation Stage | Evaluation Method | Evaluation Definition |
-|---|---|---|
-| Evaluated on Output | Evaluated using LLM | Hallucination flags any output that disagrees with the context provided to the LLM. |
-
-##### Instrumentation
-You can use [Prompt Tracking][2] annotations to track your prompts and set them up for hallucination configuration. Annotate your LLM spans with the user query and context so hallucination detection can evaluate model outputs against the retrieved data.
-
-{{< code-block lang="python" >}}
-from ddtrace.llmobs import LLMObs
-from ddtrace.llmobs.types import Prompt
-
-# if your llm call is auto-instrumented...
-with LLMObs.annotation_context(
-    prompt=Prompt(
-        id="generate_answer_prompt",
-        template="Generate an answer to this question :{user_question}. Only answer based on the information from this article : {article}",
-        variables={"user_question": user_question, "article": article},
-        rag_query_variables=["user_question"],
-        rag_context_variables=["article"]
-    ),
-    name="generate_answer"
-):
-    oai_client.chat.completions.create(...) # autoinstrumented llm call
-
-# if your llm call is manually instrumented ...
-@llm(name="generate_answer")
-def generate_answer():
-    ...
-    LLMObs.annotate(
-        prompt=Prompt(
-            id="generate_answer_prompt",
-            template="Generate an answer to this question :{user_question}. Only answer based on the information from this article : {article}",
-            variables={"user_question": user_question, "article": article},
-            rag_query_variables=["user_question"],
-            rag_context_variables=["article"]
-        ),
-    )
-{{< /code-block >}}
-The `variables` dictionary should contain the key-value pairs your app uses to construct the LLM input prompt (for example, the messages for an OpenAI chat completion request). Use `rag_query_variables` and `rag_context_variables` to specify which variables represent the user query and which represent the retrieval context. A list of variables is allowed to account for cases where multiple variables make up the context (for example, multiple articles retrieved from a knowledge base).
-
-Hallucination detection does not run if either the rag query, the rag context, or the span output is empty.
-
-Prompt Tracking is available on python starting from the 3.15 version, It also requires an ID for the prompt and the template set up to monitor and track your prompt versions.
-You can find more examples of prompt tracking and instrumentation in the [SDK documentation][2].
-
-##### Hallucination configuration
-<div class="alert alert-info">Hallucination detection is only available for OpenAI.</div>
-Hallucination detection makes a distinction between two types of hallucinations, which can be configured when Hallucination is enabled.
-
-| Configuration Option | Description |
-|---|---|
-| Contradiction | Claims made in the LLM-generated response that go directly against the provided context |
-| Unsupported Claim | Claims made in the LLM-generated response that are not grounded in the context |
-
-Contradictions are always detected, while Unsupported Claims can be optionally included. For sensitive use cases, we recommend including Unsupported Claims.
-
-
 #### Language Mismatch
 
 This check identifies instances where the LLM generates responses in a different language or dialect than the one used by the user, which can lead to confusion or miscommunication. This check ensures that the LLM's responses are clear, relevant, and appropriate for the user's linguistic preferences and needs.
 
@@ -96,5 +30,3 @@ Afrikaans, Albanian, Arabic, Armenian, Azerbaijani, Belarusian, Bengali, Norwegi
 |---|---|---|
 | Evaluated on Input and Output | Evaluated using Open Source Model | Language Mismatch flags whether each prompt-response pair demonstrates that the LLM application answered the user's question in the same language that the user used. |
 
-[1]: https://app.datadoghq.com/llm/applications
-[2]: /llm_observability/instrumentation/sdk?tab=python#prompt-tracking
diff --git a/content/en/llm_observability/instrumentation/sdk.md b/content/en/llm_observability/instrumentation/sdk.md
index a3b787fc79d..165e8cfb5b5 100644
--- a/content/en/llm_observability/instrumentation/sdk.md
+++ b/content/en/llm_observability/instrumentation/sdk.md
@@ -1808,8 +1808,8 @@ Supported keys:
 - `template` (str): Template string with placeholders (for example, `"Translate {{text}} to {{lang}}"`).
 - `chat_template` (List[Message]): Multi-message template form. Provide a list of `{ "role": "", "content": "