feat(evaluate): add Azure OpenAI judge (--eval-engine azure-openai) by ashragrawal · Pull Request #463 · microsoft/Agent365-devTools

ashragrawal · 2026-06-18T21:12:34Z

Score each MCP tool-schema checklist item with your own Azure OpenAI deployment via the OpenAI Responses API, authenticated with Microsoft Entra ID (DefaultAzureCredential; API-key auth is not supported).

- Check-by-check: each assertion is one model call at temperature 0 with the full tool schema as context, fanned out in parallel (A365_EVAL_AZURE_OPENAI_MAX_CONCURRENCY, default 100). Avoids the output truncation the whole-tool approach hit on large tools.
- Single-flight Entra token cache so high concurrency does not storm `az`.
- Checklist checkpointed each round; a re-run resumes only unscored checks.
- Explicit-only engine (never auto-selected); fixed retry of 3 with backoff.
- New --report-name overrides the host-derived output/report name (needed when multiple servers share one gateway host).
- OpenAI pinned to 2.8.0 to keep System.Text.Json and System.Diagnostics.DiagnosticSource at 9.0.8 (no shared-dependency bump).
- MCP discovery now tracks Mcp-Session-Id for session-required servers.
- Docs (evaluate instructions + FAQ), CHANGELOG, and tests included.
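
For readers skimming the thread, the check-by-check fan-out with a concurrency cap and a fixed retry of 3 can be pictured roughly as below. This is an illustrative Python sketch, not the PR's C# code: `score_one`, `score_with_retry`, `score_all`, and the delay values are hypothetical names and numbers chosen for the example.

```python
import asyncio
import random


async def score_with_retry(score_one, check, attempts=3, base_delay=0.5):
    """Await score_one(check), retrying up to `attempts` times with backoff."""
    for attempt in range(attempts):
        try:
            return await score_one(check)
        except Exception:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the last error
            # Exponential backoff with jitter (delay values are assumptions).
            await asyncio.sleep(base_delay * (2 ** attempt) * (1 + random.random()))


async def score_all(checks, score_one, max_concurrency=100, base_delay=0.5):
    """Score every check in parallel with at most max_concurrency in flight."""
    gate = asyncio.Semaphore(max_concurrency)

    async def guarded(check):
        async with gate:  # one slot per in-flight model call
            verdict = await score_with_retry(score_one, check, base_delay=base_delay)
            return check, verdict

    pairs = await asyncio.gather(*(guarded(c) for c in checks))
    return dict(pairs)
```

Because each check is its own small call, a failure or truncated answer affects a single checklist item rather than the whole tool's result.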

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
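
The "single-flight" token cache mentioned in the description is a general pattern: when many concurrent calls need a token at once, only one of them fetches it and the rest reuse the result. A minimal Python sketch follows, assuming a hypothetical async `fetch_token` callable that returns `(token, expires_at)`; the class name and the 60-second refresh margin are illustrative and not taken from the PR.

```python
import asyncio
import time


class SingleFlightTokenCache:
    """Caches a token and lets only one caller refresh it at a time."""

    def __init__(self, fetch_token, refresh_margin=60.0):
        # fetch_token: hypothetical async callable -> (token, expires_at_epoch_seconds)
        self._fetch_token = fetch_token
        self._refresh_margin = refresh_margin
        self._lock = asyncio.Lock()
        self._token = None
        self._expires_at = 0.0

    def _fresh(self):
        return (
            self._token is not None
            and time.time() < self._expires_at - self._refresh_margin
        )

    async def get(self):
        if self._fresh():
            return self._token  # fast path: no lock needed
        async with self._lock:
            # Re-check under the lock: another caller may have refreshed
            # the token while this one was waiting.
            if not self._fresh():
                self._token, self._expires_at = await self._fetch_token()
            return self._token
```

With 100 checks starting at the same moment, this shape results in one credential call rather than 100.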

github-actions · 2026-06-18T21:12:57Z

⚠️ Deprecation Warning: The deny-licenses option is deprecated for possible removal in the next major release. For more information, see issue 997.

Dependency Review

The following issues were found:

✅ 0 vulnerable package(s)
✅ 0 package(s) with incompatible licenses
✅ 0 package(s) with invalid SPDX license definitions
⚠️ 1 package(s) with unknown licenses.

See the Details below.

License Issues

src/Microsoft.Agents.A365.DevTools.Cli/Microsoft.Agents.A365.DevTools.Cli.csproj

Package	Version	License	Issue Type
OpenAI	>= 0	Null	Unknown License

Denied Licenses:
GPL-3.0-only, AGPL-3.0-only

OpenSSF Scorecard

Package	Version	Score	Details
nuget/OpenAI	>= 0	Unknown	Unknown

Scanned Files

src/Microsoft.Agents.A365.DevTools.Cli/Microsoft.Agents.A365.DevTools.Cli.csproj

Copilot

Pull request overview

Adds an explicit Azure OpenAI-backed evaluation engine to develop-mcp evaluate, enabling per-check scoring via the OpenAI Responses API with Entra ID auth, plus supporting plumbing (concurrency, availability gating, report naming, MCP session header tracking) and accompanying tests/docs.

Changes:

- Introduces --eval-engine azure-openai with per-check parallel scoring and Entra ID authentication.
- Adds --report-name to override host-derived report/output naming.
- Updates MCP schema discovery to track and echo Mcp-Session-Id for session-required servers.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.

File	Description
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationPipelineServiceTests.cs	Adds tests for trust-boundary warning messaging per engine type.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/AzureOpenAiLauncherTests.cs	Adds pure-function tests for Azure OpenAI response parsing and prompt building.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/AzureOpenAiLauncherAvailabilityTests.cs	Adds non-parallel env-var based availability tests for the Azure OpenAI engine.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Commands/EvaluateCommandInvocationTests.cs	Updates invocation tests for new pipeline signature and adds forwarding tests.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckPrompts.cs	Adds per-check prompt builder for direct model endpoints.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs	Captures/echoes `Mcp-Session-Id` across discovery calls.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IEvaluationPipelineService.cs	Extends `RunAsync` with optional `reportName`.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ICodingAgentLauncher.cs	Extends launcher contract for concurrency, auto-detectability, per-check scoring, and hints.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvaluationPipelineService.cs	Adds Azure OpenAI-specific trust warning and `--report-name` override; parses `azure-openai`.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvalModelConstants.cs	Adds Azure OpenAI env vars, scope, and max-concurrency parsing.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/CodingAgentLauncherBase.cs	Provides default implementations for new launcher interface members.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs	Adds per-check scoring path and parallelizes whole-tool evaluation respecting engine concurrency.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/AzureOpenAiLauncher.cs	New engine implementation using OpenAI Responses API + Entra ID token bridge and retry logging.
src/Microsoft.Agents.A365.DevTools.Cli/Program.cs	Registers the Azure OpenAI launcher in DI.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluateEnums.cs	Adds `EvalEngine.AzureOpenAI`.
src/Microsoft.Agents.A365.DevTools.Cli/Microsoft.Agents.A365.DevTools.Cli.csproj	Adds OpenAI SDK package reference.
src/Microsoft.Agents.A365.DevTools.Cli/Commands/DevelopMcpCommand.cs	Updates help text and adds `--report-name` option forwarding to the pipeline.
src/Directory.Packages.props	Pins OpenAI SDK version (2.8.0) with rationale.
docs/agent365-guided-setup/a365-evaluate-instructions.md	Documents Azure OpenAI engine setup, env vars, and Entra auth.
docs/agent365-guided-setup/a365-evaluate-faq.md	Adds FAQ guidance for Azure OpenAI engine behavior and auth constraints.
CHANGELOG.md	Adds release notes entries for `azure-openai` engine and `--report-name`.
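
As a rough picture of the max-concurrency parsing listed for EvalModelConstants.cs, the Python sketch below reads A365_EVAL_AZURE_OPENAI_MAX_CONCURRENCY and uses 100 when the variable is unset. Falling back to the default for non-numeric or non-positive values is an assumption made for this example; the thread does not say how the PR handles invalid input, and `parse_max_concurrency` is a hypothetical name.

```python
import os

DEFAULT_MAX_CONCURRENCY = 100  # default stated in the PR description


def parse_max_concurrency(env=None):
    """Return the configured concurrency cap, or the default if unusable."""
    env = os.environ if env is None else env
    raw = env.get("A365_EVAL_AZURE_OPENAI_MAX_CONCURRENCY", "").strip()
    try:
        value = int(raw)
    except ValueError:
        # Assumption: unset or non-numeric input falls back to the default.
        return DEFAULT_MAX_CONCURRENCY
    # Assumption: zero or negative values are also treated as invalid.
    return value if value > 0 else DEFAULT_MAX_CONCURRENCY
```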

- SchemaDiscovery: replace the shared mutable _mcpSessionId field with a per-call McpSession holder so concurrent discoveries on the DI singleton no longer race; validate the server-supplied session id as visible ASCII (0x21-0x7E) and echo it via Headers.Add to block CR/LF header injection.
- AzureOpenAiLauncher: require an HTTPS endpoint in IsAvailableAsync (+ test) so tool metadata is never sent over plaintext http.
- Remove the --report-name option across command, interface, pipeline, tests, and CHANGELOG.
- Docs: recommend gpt-5.4 (the tested model) for A365_EVAL_AZURE_OPENAI_DEPLOYMENT.
- Trim the azure-openai CHANGELOG entry and remove the OpenAI version-pin rationale comment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
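
The two input checks in this commit message (session id restricted to visible ASCII 0x21-0x7E, and an HTTPS-only endpoint) are simple to state in code. The Python sketch below illustrates the idea only; `is_valid_session_id` and `is_https_endpoint` are hypothetical names, and the PR's C# implementation may differ in details such as length limits.

```python
from urllib.parse import urlsplit


def is_valid_session_id(value):
    """True if value is non-empty and every character is visible ASCII (0x21-0x7E)."""
    # CR (0x0D), LF (0x0A), and space (0x20) fall outside the range, so a
    # server-supplied id cannot smuggle an extra header line into a request.
    return bool(value) and all(0x21 <= ord(ch) <= 0x7E for ch in value)


def is_https_endpoint(url):
    """True if url parses with the https scheme and a non-empty host part."""
    parts = urlsplit(url)
    return parts.scheme == "https" and bool(parts.netloc)
```

Rejecting the id before it is echoed, rather than escaping it afterwards, keeps the check independent of how the HTTP client serializes headers.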

Copilot AI review requested due to automatic review settings June 18, 2026 21:12

ashragrawal requested review from a team as code owners June 18, 2026 21:12

github-actions Bot added the documentation and feature labels Jun 18, 2026

Copilot started reviewing on behalf of ashragrawal June 18, 2026 21:13

Copilot AI reviewed Jun 18, 2026

ashragrawal wants to merge 2 commits into main from users/ashragrawal/byollm
