@@ -334,14 +334,28 @@ You can:
- **Document your rubric**: Clearly define what "Pass" and "Fail" mean to avoid drift over time.
- **Re-align your evaluator**: Reassess prompt and few-shot examples when the underlying LLM updates.

## Estimated token usage

You can monitor the token usage of your LLM evaluations using the [LLM Evaluations Token Usage dashboard][8].

If you need more details, the following metrics allow you to track the LLM resources consumed to power evaluations:

- `ml_obs.estimated_usage.llm.input.tokens`
- `ml_obs.estimated_usage.llm.output.tokens`
- `ml_obs.estimated_usage.llm.total.tokens`

Each of these metrics has `ml_app`, `model_server`, `model_provider`, `model_name`, and `evaluation_name` tags, allowing you to pinpoint specific applications, models, and evaluations contributing to your usage.
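
For example, the sketch below (assuming an application tagged `ml_app:my-llm-app` and the `datadog-api-client` Python package) queries total evaluation token usage over the past hour, broken down by evaluation:

{{< code-block lang="python" >}}
from time import time

from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.metrics_api import MetricsApi

# Reads DD_API_KEY and DD_APP_KEY from the environment.
configuration = Configuration()

with ApiClient(configuration) as api_client:
    api = MetricsApi(api_client)
    now = int(time())
    # "my-llm-app" is a placeholder; substitute your own ml_app tag value.
    response = api.query_metrics(
        _from=now - 3600,
        to=now,
        query="sum:ml_obs.estimated_usage.llm.total.tokens{ml_app:my-llm-app} by {evaluation_name}",
    )
    print(response)
{{< /code-block >}}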

## Further Reading

{{< partial name="whats-next/whats-next.html" >}}

[1]: https://app.datadoghq.com/llm/evaluations
[2]: /llm_observability/evaluations/managed_evaluations#connect-your-llm-provider-account
[2]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/connect_to_account
[3]: /events/explorer/facets/
[4]: /monitors/
[5]: https://arxiv.org/abs/2504.00050
[6]: /llm_observability/evaluations/evaluation_compatibility
[7]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations/
[8]: https://app.datadoghq.com/dash/integration/llm_evaluations_token_usage

@@ -0,0 +1,132 @@
---
title: Connect your LLM provider account
description: How to connect your LLM provider account to support LLM-as-a-judge evaluations
further_reading:
- link: "/llm_observability/evaluations/custom_llm_as_a_judge_evaluations"
tag: "Documentation"
text: "Learn about custom LLM-as-a-judge evaluations"
---

## Connect your LLM provider account

Configure the LLM provider you would like to use for bring-your-own-key (BYOK) evaluations. You only have to complete this step once.

{{< tabs >}}
{{% tab "OpenAI" %}}

<div class="alert alert-danger">If you are subject to HIPAA, you are responsible for ensuring that you connect only to an OpenAI account that is subject to a business associate agreement (BAA) and meets all requirements for HIPAA compliance.</div>

Connect your OpenAI account to LLM Observability with your OpenAI API key. LLM Observability uses the `GPT-4o mini` model for evaluations.

1. In Datadog, navigate to [**LLM Observability > Settings > Integrations**][1].
1. Select **Connect** on the OpenAI tile.
1. Follow the instructions on the tile.
- Provide your OpenAI API key. Ensure that this key has **write** permission for **model capabilities**.
1. Enable **Use this API key to evaluate your LLM applications**.

{{< img src="llm_observability/configuration/openai-tile.png" alt="The OpenAI configuration tile in LLM Observability. Lists instructions for configuring OpenAI and providing your OpenAI API key." style="width:100%;" >}}

LLM Observability does not support [data residency][2] for OpenAI.

[1]: https://app.datadoghq.com/llm/settings/integrations
[2]: https://platform.openai.com/docs/guides/your-data#which-models-and-features-are-eligible-for-data-residency
{{% /tab %}}
{{% tab "Azure OpenAI" %}}

<div class="alert alert-danger">If you are subject to HIPAA, you are responsible for ensuring that you connect only to an Azure OpenAI account that is subject to a business associate agreement (BAA) and meets all requirements for HIPAA compliance.</div>

Connect your Azure OpenAI account to LLM Observability with your Azure OpenAI API key. Datadog strongly recommends using the `GPT-4o mini` model for evaluations. The selected model version must support [structured outputs][8].

1. In Datadog, navigate to [**LLM Observability > Settings > Integrations**][1].
1. Select **Connect** on the Azure OpenAI tile.
1. Follow the instructions on the tile.
- Provide your Azure OpenAI API key. Ensure that this key has **write** permission for **model capabilities**.
- Provide the Resource Name, Deployment ID, and API version to complete the integration.

{{< img src="llm_observability/configuration/azure-openai-tile.png" alt="The Azure OpenAI configuration tile in LLM Observability. Lists instructions for configuring Azure OpenAI and providing your API Key, Resource Name, Deployment ID, and API Version." style="width:100%;" >}}

[1]: https://app.datadoghq.com/llm/settings/integrations
[8]: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/structured-outputs
{{% /tab %}}
{{% tab "Anthropic" %}}

<div class="alert alert-danger">If you are subject to HIPAA, you are responsible for ensuring that you connect only to an Anthropic account that is subject to a business associate agreement (BAA) and meets all requirements for HIPAA compliance.</div>

Connect your Anthropic account to LLM Observability with your Anthropic API key. LLM Observability uses the `Haiku` model for evaluations.

1. In Datadog, navigate to [**LLM Observability > Settings > Integrations**][1].
1. Select **Connect** on the Anthropic tile.
1. Follow the instructions on the tile.
- Provide your Anthropic API key. Ensure that this key has **write** permission for **model capabilities**.

{{< img src="llm_observability/configuration/anthropic-tile.png" alt="The Anthropic configuration tile in LLM Observability. Lists instructions for configuring Anthropic and providing your Anthropic API key." style="width:100%;" >}}

[1]: https://app.datadoghq.com/llm/settings/integrations
{{% /tab %}}
{{% tab "Amazon Bedrock" %}}

<div class="alert alert-danger">If you are subject to HIPAA, you are responsible for ensuring that you connect only to an Amazon Bedrock account that is subject to a business associate agreement (BAA) and meets all requirements for HIPAA compliance.</div>

Connect your Amazon Bedrock account to LLM Observability with your AWS account. LLM Observability uses the `Haiku` model for evaluations.

1. In Datadog, navigate to [**LLM Observability > Settings > Integrations**][1].
1. Select **Connect** on the Amazon Bedrock tile.
1. Follow the instructions on the tile.

{{< img src="llm_observability/configuration/amazon-bedrock-tile.png" alt="The Amazon Bedrock configuration tile in LLM Observability. Lists instructions for configuring Amazon Bedrock." style="width:100%;" >}}

4. Configure the **Invoke models from Amazon Bedrock** role to run evaluations. For more details about the `InvokeModel` action, see the [Amazon Bedrock API reference documentation][2]. A sample policy statement is shown after the image below.


{{< img src="llm_observability/configuration/amazon-bedrock-tile-step-2.png" alt="The second step in configuring Amazon Bedrock requiring users to add permissions to the integration account." style="width:100%;" >}}

[1]: https://app.datadoghq.com/llm/settings/integrations
[2]: https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_InvokeModel.html
{{% /tab %}}

{{% tab "GCP Vertex AI" %}}

<div class="alert alert-danger">If you are subject to HIPAA, you are responsible for ensuring that you connect only to a Google Cloud Platform account that is subject to a business associate agreement (BAA) and meets all requirements for HIPAA compliance.</div>

Connect Vertex AI to LLM Observability with your Google Cloud Platform account. LLM Observability uses the `gemini-2.5-flash` model for evaluations.

1. In Datadog, navigate to [**LLM Observability > Settings > Integrations**][1].
1. On the Google Cloud Vertex AI tile, click **Connect** to add a new GCP account, or click **Configure** next to an existing account to begin the onboarding process.
- This page lists all GCP accounts connected to Datadog. However, you must still complete the onboarding process for an account before you can use it in LLM Observability.
1. Follow the onboarding instructions to configure your account.
- Add the [**Vertex AI User**][2] role to your account and enable the [**Vertex AI API**][3]. A `gcloud` sketch of these steps is shown after the image below.

{{< img src="llm_observability/configuration/vertex-ai-pint.png" alt="The Vertex AI onboarding workflow. Follow steps to configure your GCP service account with the right Vertex AI permissions for use with LLM Observability." style="width:100%;" >}}

[1]: https://app.datadoghq.com/llm/settings/integrations
[2]: https://docs.cloud.google.com/vertex-ai/docs/general/access-control#aiplatform.user
[3]: https://console.cloud.google.com/apis/library/aiplatform.googleapis.com
{{% /tab %}}

{{% tab "AI Gateway" %}}
<div class="alert alert-danger">If you are subject to HIPAA, you are responsible for ensuring that you only connect to an AI Gateway that is subject to a business associate agreement (BAA) and meets all requirements for HIPAA compliance.</div>

Your AI Gateway must be compatible with the [OpenAI API specification][2].
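
For reference, this is a minimal sketch (with a hypothetical gateway URL and model name) of the OpenAI-style chat completions request your gateway needs to accept:

{{< code-block lang="python" >}}
from openai import OpenAI

# The base URL and model name are placeholders for your own gateway configuration.
client = OpenAI(
    base_url="https://my-ai-gateway.example.com/v1",
    api_key="<YOUR_GATEWAY_API_KEY>",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello from LLM Observability"}],
)
print(response.choices[0].message.content)
{{< /code-block >}}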

Connect your AI Gateway to LLM Observability with your base URL, API key, and headers.

1. In Datadog, navigate to [**LLM Observability > Settings > Integrations**][1].
1. Click the **Configure** tab, then click **New** to create a new gateway.
1. Follow the instructions on the tile.
- Provide a name for your gateway.
- Select your provider.
- Provide your base URL.
- Provide your API key and optionally any headers.

{{< img src="llm_observability/configuration/ai-gateway-tile-3.png" alt="The AI Gateway configuration tile in LLM Observability. Lists instructions for configuring an ai gateway" style="width:100%;" >}}

[1]: https://app.datadoghq.com/llm/settings/integrations
[2]: https://platform.openai.com/docs/api-reference/introduction
{{% /tab %}}
{{< /tabs >}}

If your LLM provider restricts IP addresses, you can obtain the required IP ranges from [Datadog's IP ranges documentation][2]: select your Datadog site, open the `GET` URL in your browser, and copy the `webhooks` section.
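
For example, a sketch like the following (using the US1 site URL; substitute the URL for your Datadog site) retrieves the `webhooks` IP prefixes programmatically:

{{< code-block lang="python" >}}
import requests

# Substitute the IP ranges URL for your Datadog site if you are not on US1.
response = requests.get("https://ip-ranges.datadoghq.com/")
response.raise_for_status()

# The "webhooks" section lists the prefixes to allowlist with your LLM provider.
webhooks = response.json()["webhooks"]
print(webhooks["prefixes_ipv4"])
print(webhooks["prefixes_ipv6"])
{{< /code-block >}}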

[1]: https://app.datadoghq.com/llm/settings/integrations
[2]: /api/latest/ip-ranges/

@@ -8,14 +8,17 @@ further_reading:
- link: "/llm_observability/setup"
tag: "Documentation"
text: "Learn how to set up LLM Observability"
- link: "https://www.datadoghq.com/blog/llm-observability-hallucination-detection/"
tag: "Blog"
text: "Detect hallucinations in your RAG LLM applications with Datadog LLM Observability"
aliases:
- /llm_observability/evaluations/agent_evaluations
- /llm_observability/evaluations/managed_evaluations/agent_evaluations
- /llm_observability/evaluations/session_level_evaluations
- /llm_observability/evaluations/managed_evaluations/session_level_evaluations
---

Datadog provides LLM-as-a-judge templates for the following evaluations: [Failure to Answer][16], [Goal Completeness][22], [Prompt Injection][14], [Sentiment][12], [Tool Argument Correctness][23], [Tool Selection][24], [Topic Relevancy][15], and [Toxicity][13]. After you select a template, you can modify any aspect of the evaluation.
Datadog provides LLM-as-a-judge templates for the following evaluations: [Failure to Answer][16], [Goal Completeness][22], [Hallucination][25], [Prompt Injection][14], [Sentiment][12], [Tool Argument Correctness][23], [Tool Selection][24], [Topic Relevancy][15], and [Toxicity][13]. After you select a template, you can modify any aspect of the evaluation.

For best practices and details on how to create LLM-as-a-judge evaluations, read [Create a custom LLM-as-a-judge evaluation][17].

@@ -52,6 +55,67 @@ Datadog provides the following categories of Failure to Answer, listed in the fo
| Redirection Response | Redirects the user to another source or suggests an alternative approach | If you have additional details, I'd be happy to include them|
| Refusal Response | Explicitly declines to provide an answer or to complete the request | Sorry, I can't answer this question |

### Hallucination

Hallucination evaluations identify instances where the LLM makes a claim that disagrees with the provided input context. This check helps ensure your RAG applications stay grounded in retrieved data and do not fabricate information.

{{< img src="llm_observability/evaluations/hallucination_5.png" alt="A Hallucination evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}

| Evaluation Stage | Evaluation Definition |
|---|---|
| Evaluated on Output | Hallucination flags any output that disagrees with the context provided to the LLM. |

#### Configure a Hallucination evaluation

Use [Prompt Tracking][26] annotations to track your prompts and set them up for hallucination detection. Annotate your LLM spans with the user query and context so hallucination detection can evaluate model outputs against the retrieved data.

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import llm
from ddtrace.llmobs.types import Prompt

# If your LLM call is auto-instrumented...
with LLMObs.annotation_context(
    prompt=Prompt(
        id="generate_answer_prompt",
        template="Generate an answer to this question: {user_question}. Only answer based on the information from this article: {article}",
        variables={"user_question": user_question, "article": article},
        rag_query_variables=["user_question"],
        rag_context_variables=["article"],
    ),
    name="generate_answer",
):
    oai_client.chat.completions.create(...)  # auto-instrumented LLM call

# If your LLM call is manually instrumented...
@llm(name="generate_answer")
def generate_answer():
    ...
    LLMObs.annotate(
        prompt=Prompt(
            id="generate_answer_prompt",
            template="Generate an answer to this question: {user_question}. Only answer based on the information from this article: {article}",
            variables={"user_question": user_question, "article": article},
            rag_query_variables=["user_question"],
            rag_context_variables=["article"],
        ),
    )
{{< /code-block >}}

The `variables` dictionary should contain the key-value pairs your app uses to construct the LLM input prompt (for example, the messages for an OpenAI chat completion request). Use `rag_query_variables` and `rag_context_variables` to specify which variables represent the user query and which represent the retrieval context. A list of variables is allowed to account for cases where multiple variables make up the context (for example, multiple articles retrieved from a knowledge base).
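
For example, a sketch (with hypothetical variable names) where two retrieved articles make up the context:

{{< code-block lang="python" >}}
from ddtrace.llmobs.types import Prompt

# article_1 and article_2 are hypothetical variables holding retrieved documents.
prompt = Prompt(
    id="generate_answer_prompt",
    template="Answer this question: {user_question}. Only use these articles: {article_1} {article_2}",
    variables={"user_question": user_question, "article_1": article_1, "article_2": article_2},
    rag_query_variables=["user_question"],
    rag_context_variables=["article_1", "article_2"],  # both articles form the retrieval context
)
{{< /code-block >}}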

Hallucination detection does not run if the RAG query, the RAG context, or the span output is empty.

Prompt Tracking is available for Python starting with SDK version 3.15, and requires a prompt ID and a template to monitor and track your prompt versions. You can find more examples of prompt tracking and instrumentation in the [SDK documentation][26].

Hallucination detection distinguishes between two types of hallucinations:

| Hallucination Type | Description |
|---|---|
| Contradiction | Claims made in the LLM-generated response that go directly against the provided context |
| Unsupported Claim | Claims made in the LLM-generated response that are not grounded in the context |

Contradictions are always detected; Unsupported Claims can optionally be included. For sensitive use cases, Datadog recommends including Unsupported Claims.

### Prompt Injection

Prompt Injection evaluations identify attempts by unauthorized or malicious authors to manipulate the LLM's responses or redirect the conversation in ways not intended by the original author. This check maintains the integrity and authenticity of interactions between users and the LLM.
@@ -342,3 +406,5 @@ result = triage_agent.run_sync(
[22]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#goal-completeness
[23]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#tool-argument-correctness
[24]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#tool-selection
[25]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#hallucination
[26]: /llm_observability/instrumentation/sdk?tab=python#prompt-tracking
@@ -13,7 +13,6 @@ Managed evaluations are supported for the following configurations.

| Evaluation | DD-trace version | LLM Provider | Applicable span |
| --------------------------------| ----------------- | ------------------------------| ----------------|
| [Hallucination][4] | v2.18+ | OpenAI | LLM only |
| [Language Mismatch][10] | Fully supported | Self hosted | All span kinds |

### Custom LLM-as-a-judge evaluations
@@ -34,19 +33,20 @@ Existing templates for custom LLM-as-a-judge evaluations are supported for the f
| Evaluation | DD-trace version | LLM Provider | Applicable span |
| ----------------------- | ---------------- | ----------------------------- | --------------- |
| [Failure to Answer][5] | Fully supported | All third party LLM providers | All span kinds |
| [Hallucination][4] | Fully supported | All third party LLM providers | LLM only |
| [Sentiment][6] | Fully supported | All third party LLM providers | All span kinds |
| [Toxicity][7] | Fully supported | All third party LLM providers | All span kinds |
| [Prompt Injection][8] | Fully supported | All third party LLM providers | All span kinds |
| [Topic Relevancy][9] | Fully supported | All third party LLM providers | All span kinds |
| [Tool Selection][1] | v3.12+ | All third party LLM providers | LLM only |
| [Tool Argument Correctness][2] | v3.12+ | All third party LLM providers | LLM only |
| [Goal Completeness][3] | Fully supported | All third party LLM providers | LLM only |
| [Tool Selection][1] | Fully supported | All third party LLM providers | LLM only |
| [Tool Argument Correctness][2] | Fully supported | All third party LLM providers | LLM only |
| [Goal Completeness][3] | Fully supported | All third party LLM providers | LLM only |


[1]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#tool-selection
[2]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#tool-argument-correctness
[3]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#goal-completeness
[4]: /llm_observability/evaluations/managed_evaluations#hallucination
[4]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#hallucination
[5]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#failure-to-answer
[6]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#sentiment
[7]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations/template_evaluations#toxicity