An educational project for learning about OpenTelemetry instrumentation for AI agents and implementing evaluations (evals) for LLM-based applications.
This project demonstrates:
- OpenTelemetry Integration: How to instrument AI agents with distributed tracing using OpenTelemetry semantic conventions
- Multi-Backend Export: Sending telemetry data to multiple observability backends simultaneously (Langfuse + Azure Application Insights)
- Agent Evals: Running experiments against datasets and evaluating agent performance with custom evaluators
- LLM Observability Patterns: Tracing generations, tool calls, chains, and agent loops
Data flow:

1. The agent generates traces using the OpenTelemetry SDK with GenAI semantic conventions
2. The OTLP exporter sends traces to the OpenTelemetry Collector via gRPC
3. The Collector batches and forwards traces to multiple backends:
   - Langfuse: for LLM-specific observability, prompt management, and evals
   - Azure Application Insights: for APM, dashboards, and alerting
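The multi-backend fan-out happens in `otel-collector-config.yaml`. A rough sketch of what such a pipeline can look like (the exporter names, endpoints, and env-var syntax below are illustrative assumptions, not this repo's actual config):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  # Langfuse ingests OTLP traces over HTTP with Basic auth
  otlphttp/langfuse:
    endpoint: ${env:LANGFUSE_BASE_URL}/api/public/otel
    headers:
      Authorization: Basic ${env:LANGFUSE_AUTH}
  # Azure Monitor exporter (from opentelemetry-collector-contrib)
  azuremonitor:
    connection_string: ${env:APPLICATIONINSIGHTS_CONNECTION_STRING}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/langfuse, azuremonitor]
```

Both exporters sit in the same `traces` pipeline, so every span is delivered to both backends.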
Repository layout:

```
agent-observability/
├── agent-core/                      # Core agent abstractions
│   ├── ChatCompletion/              # Chat completion interfaces and models
│   ├── Prompts/                     # Prompt provider interfaces
│   ├── Providers/                   # OpenAI provider, Langfuse prompt provider
│   └── Tools/                       # Tool registry and execution
│
├── agent-telemetry/                 # OpenTelemetry instrumentation
│   ├── Services/                    # AgentTelemetry implementation
│   ├── Models/                      # TelemetryScope, GenerationScope
│   └── Constants/                   # GenAI semantic attributes
│
├── agent-evals/                     # Evaluation framework
│   ├── Evaluators/                  # NLP and trajectory evaluators
│   ├── Services/                    # ExperimentRunner, EvaluationRunner
│   └── Models/                      # EvaluationContext, EvaluationResult
│
├── agent-cli/                       # CLI utilities for interactive chat
│
├── demo/
│   └── manual-instrumented-agent/   # Demo agent with manual instrumentation
│       ├── Commands/                # CLI commands (chat, experiment, evaluate)
│       └── Tools/                   # Sample tools (dice, cards)
│
├── docker-compose.yml               # OpenTelemetry Collector setup
├── otel-collector-config.yaml       # Collector pipeline configuration
└── env.template                     # Environment variables template
```
Prerequisites:

- .NET 8 SDK
- Docker (for the OpenTelemetry Collector)
- OpenAI API key
- Langfuse account (free tier available at cloud.langfuse.com)
- (Optional) Azure Application Insights resource
Copy the environment template and fill in your credentials:
```bash
cp env.template .env
```

Edit `.env` with your values:
```bash
# Langfuse - Base64 encode your keys: echo -n "pk-lf-xxx:sk-lf-xxx" | base64
LANGFUSE_BASE_URL=https://cloud.langfuse.com
LANGFUSE_AUTH=<base64-encoded-public:secret>

# Azure Application Insights (optional)
APPLICATIONINSIGHTS_CONNECTION_STRING=<your-connection-string>
```

Create `demo/manual-instrumented-agent/appsettings.local.json`:
```json
{
  "OpenAI": {
    "ApiKey": "sk-..."
  },
  "Langfuse": {
    "PublicKey": "pk-lf-...",
    "SecretKey": "sk-lf-..."
  }
}
```

Start the OpenTelemetry Collector:

```bash
docker-compose up -d
```

Verify it's running:
```bash
docker-compose logs otel-collector
```

Start a chat session with the agent:
```bash
cd demo/manual-instrumented-agent
dotnet run
```

Or explicitly run the chat command:
```bash
dotnet run -- chat
```

The agent has access to sample tools (dice rolling, card dealing), and all interactions are traced to your configured backends.
Execute the agent against a Langfuse dataset:
```bash
dotnet run -- experiment --dataset <dataset-name>
```

Options:

- `--dataset` or `-d`: Name of the Langfuse dataset to run against
- `--run` or `-r`: Custom name for this experiment run
Evaluate a completed experiment run:
```bash
dotnet run -- evaluate --dataset <dataset-name> --run <run-name>
```

Options:

- `--dataset` or `-d`: Name of the dataset
- `--run` or `-r`: Name of the run to evaluate
In Langfuse:

- Go to cloud.langfuse.com
- Navigate to Traces to see agent executions
- Click on a trace to see the full hierarchy:
  - Root trace (session)
  - Agent span
  - Generation spans (LLM calls)
  - Tool spans
In Azure Application Insights:

- Go to your Application Insights resource in the Azure Portal
- Navigate to Transaction Search or Application Map
- Use End-to-end transaction details to explore distributed traces
This project follows GenAI semantic conventions for LLM observability:
| Attribute | Description |
|---|---|
| `gen_ai.system` | The AI provider (e.g., "openai") |
| `gen_ai.request.model` | Model used for generation |
| `gen_ai.operation.name` | Type of operation (chat, tool, etc.) |
| `gen_ai.usage.input_tokens` | Input token count |
| `gen_ai.usage.output_tokens` | Output token count |
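In .NET, attributes like these are typically attached as tags on a `System.Diagnostics.Activity`. A minimal sketch of the pattern (the source name and token counts are made up for illustration; `agent-telemetry` may structure this differently):

```csharp
using System.Diagnostics;

ActivitySource source = new("Agent.Telemetry");

// Span name per GenAI conventions: "{operation} {model}"
using Activity? activity = source.StartActivity("chat gpt-4o-mini", ActivityKind.Client);
activity?.SetTag("gen_ai.system", "openai");
activity?.SetTag("gen_ai.operation.name", "chat");
activity?.SetTag("gen_ai.request.model", "gpt-4o-mini");

// ... invoke the model, then record usage from the response:
activity?.SetTag("gen_ai.usage.input_tokens", 42);
activity?.SetTag("gen_ai.usage.output_tokens", 128);
```

Note that `StartActivity` returns `null` when no listener (e.g. the OpenTelemetry SDK) is subscribed, hence the `?.` calls.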
A typical trace hierarchy looks like:

```
Trace: "Manual Agent"
└── Agent: "Manual Agent"
    ├── Chain: "Call LLM"
    │   └── Generation: "gpt-4o-mini"
    ├── Chain: "Tools"
    │   └── Tool: "roll_dice"
    └── Chain: "Call LLM"
        └── Generation: "gpt-4o-mini"
```
The agent-evals library includes:
- NLP Evaluators: Semantic similarity, keyword matching, response length
- Trajectory Evaluators: Tool usage validation, step counting, expected behavior verification
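As an illustration of the kind of logic a simple NLP evaluator contains, here is a hypothetical keyword-match scorer. The class name and method shape are assumptions for this sketch, not the actual `agent-evals` evaluator contract:

```csharp
using System;
using System.Linq;

// Hypothetical evaluator: scores the fraction of expected keywords
// that appear in the agent's output (case-insensitive).
public sealed class KeywordMatchEvaluator
{
    private readonly string[] _expectedKeywords;

    public KeywordMatchEvaluator(params string[] expectedKeywords) =>
        _expectedKeywords = expectedKeywords;

    public double Score(string agentOutput) =>
        _expectedKeywords.Length == 0
            ? 1.0
            : _expectedKeywords.Count(k =>
                  agentOutput.Contains(k, StringComparison.OrdinalIgnoreCase))
              / (double)_expectedKeywords.Length;
}
```

A real evaluator would plug into the framework's `EvaluationContext`/`EvaluationResult` models and be run by the `EvaluationRunner`.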
Build the solution:

```bash
dotnet build
```

To add a new tool:

- Create a tool class in `demo/manual-instrumented-agent/Tools/`
- Decorate it with the `[Tool]` attribute
- Register it in `DemoAgent.cs`
Example:

```csharp
[Tool("my_tool", "Description of what the tool does")]
public class MyTool
{
    public static string Execute(string param1, int param2)
    {
        return $"Result: {param1}, {param2}";
    }
}
```

No traces reaching the collector:

- Check collector logs: `docker-compose logs otel-collector`
- Verify the agent is pointing to `localhost:4317`
- Ensure no firewall is blocking the port
No traces in Langfuse:

- Verify your `LANGFUSE_AUTH` is correctly base64 encoded
- Check collector logs for HTTP errors
- Ensure your Langfuse keys have the correct permissions
No traces in Application Insights:

- Verify your connection string is complete
- Check for firewall/network restrictions
- Review collector logs for specific error messages
Resources:

- OpenTelemetry .NET SDK
- GenAI Semantic Conventions
- Langfuse Documentation
- Azure Monitor OpenTelemetry
This project is for educational purposes.
