Skip to content

feat(evaluate): add Azure OpenAI judge (--eval-engine azure-openai)#463

Open
ashragrawal wants to merge 2 commits into
mainfrom
users/ashragrawal/byollm
Open

feat(evaluate): add Azure OpenAI judge (--eval-engine azure-openai)#463
ashragrawal wants to merge 2 commits into
mainfrom
users/ashragrawal/byollm

Conversation

@ashragrawal

Copy link
Copy Markdown
Contributor

Score each MCP tool-schema checklist item with your own Azure OpenAI deployment via the OpenAI Responses API, authenticated with Microsoft Entra ID (DefaultAzureCredential; API-key auth is not supported).

  • Check-by-check: each assertion is one model call at temperature 0 with the full tool schema as context, fanned out in parallel (A365_EVAL_AZURE_OPENAI_MAX_CONCURRENCY, default 100). Avoids the output truncation the whole-tool approach hit on large tools.
  • Single-flight Entra token cache so high concurrency does not storm az.
  • Checklist checkpointed each round; a re-run resumes only unscored checks.
  • Explicit-only engine (never auto-selected); fixed retry of 3 with backoff.
  • New --report-name overrides the host-derived output/report name (needed when multiple servers share one gateway host).
  • OpenAI pinned to 2.8.0 to keep System.Text.Json and System.Diagnostics.DiagnosticSource at 9.0.8 (no shared-dependency bump).
  • MCP discovery now tracks Mcp-Session-Id for session-required servers.
  • Docs (evaluate instructions + FAQ), CHANGELOG, and tests included.

Score each MCP tool-schema checklist item with your own Azure OpenAI
deployment via the OpenAI Responses API, authenticated with Microsoft
Entra ID (DefaultAzureCredential; API-key auth is not supported).

- Check-by-check: each assertion is one model call at temperature 0 with
  the full tool schema as context, fanned out in parallel
  (A365_EVAL_AZURE_OPENAI_MAX_CONCURRENCY, default 100). Avoids the output
  truncation the whole-tool approach hit on large tools.
- Single-flight Entra token cache so high concurrency does not storm `az`.
- Checklist checkpointed each round; a re-run resumes only unscored checks.
- Explicit-only engine (never auto-selected); fixed retry of 3 with backoff.
- New --report-name overrides the host-derived output/report name (needed
  when multiple servers share one gateway host).
- OpenAI pinned to 2.8.0 to keep System.Text.Json and
  System.Diagnostics.DiagnosticSource at 9.0.8 (no shared-dependency bump).
- MCP discovery now tracks Mcp-Session-Id for session-required servers.
- Docs (evaluate instructions + FAQ), CHANGELOG, and tests included.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 18, 2026 21:12
@ashragrawal ashragrawal requested review from a team as code owners June 18, 2026 21:12
@github-actions

github-actions Bot commented Jun 18, 2026

Copy link
Copy Markdown

⚠️ Deprecation Warning: The deny-licenses option is deprecated for possible removal in the next major release. For more information, see issue 997.

Dependency Review

The following issues were found:
  • ✅ 0 vulnerable package(s)
  • ✅ 0 package(s) with incompatible licenses
  • ✅ 0 package(s) with invalid SPDX license definitions
  • ⚠️ 1 package(s) with unknown licenses.
See the Details below.

License Issues

src/Microsoft.Agents.A365.DevTools.Cli/Microsoft.Agents.A365.DevTools.Cli.csproj

PackageVersionLicenseIssue Type
OpenAI>= 0NullUnknown License
Denied Licenses: GPL-3.0-only, AGPL-3.0-only

OpenSSF Scorecard

PackageVersionScoreDetails
nuget/OpenAI >= 0 UnknownUnknown

Scanned Files

  • src/Microsoft.Agents.A365.DevTools.Cli/Microsoft.Agents.A365.DevTools.Cli.csproj

@github-actions github-actions Bot added documentation Improvements or additions to documentation feature labels Jun 18, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an explicit Azure OpenAI-backed evaluation engine to develop-mcp evaluate, enabling per-check scoring via the OpenAI Responses API with Entra ID auth, plus supporting plumbing (concurrency, availability gating, report naming, MCP session header tracking) and accompanying tests/docs.

Changes:

  • Introduces --eval-engine azure-openai with per-check parallel scoring and Entra ID authentication.
  • Adds --report-name to override host-derived report/output naming.
  • Updates MCP schema discovery to track and echo Mcp-Session-Id for session-required servers.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationPipelineServiceTests.cs Adds tests for trust-boundary warning messaging per engine type.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/AzureOpenAiLauncherTests.cs Adds pure-function tests for Azure OpenAI response parsing and prompt building.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/AzureOpenAiLauncherAvailabilityTests.cs Adds non-parallel env-var based availability tests for the Azure OpenAI engine.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Commands/EvaluateCommandInvocationTests.cs Updates invocation tests for new pipeline signature and adds forwarding tests.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckPrompts.cs Adds per-check prompt builder for direct model endpoints.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs Captures/echoes Mcp-Session-Id across discovery calls.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IEvaluationPipelineService.cs Extends RunAsync with optional reportName.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ICodingAgentLauncher.cs Extends launcher contract for concurrency, auto-detectability, per-check scoring, and hints.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvaluationPipelineService.cs Adds Azure OpenAI-specific trust warning and --report-name override; parses azure-openai.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvalModelConstants.cs Adds Azure OpenAI env vars, scope, and max-concurrency parsing.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/CodingAgentLauncherBase.cs Provides default implementations for new launcher interface members.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Adds per-check scoring path and parallelizes whole-tool evaluation respecting engine concurrency.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/AzureOpenAiLauncher.cs New engine implementation using OpenAI Responses API + Entra ID token bridge and retry logging.
src/Microsoft.Agents.A365.DevTools.Cli/Program.cs Registers the Azure OpenAI launcher in DI.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluateEnums.cs Adds EvalEngine.AzureOpenAI.
src/Microsoft.Agents.A365.DevTools.Cli/Microsoft.Agents.A365.DevTools.Cli.csproj Adds OpenAI SDK package reference.
src/Microsoft.Agents.A365.DevTools.Cli/Commands/DevelopMcpCommand.cs Updates help text and adds --report-name option forwarding to the pipeline.
src/Directory.Packages.props Pins OpenAI SDK version (2.8.0) with rationale.
docs/agent365-guided-setup/a365-evaluate-instructions.md Documents Azure OpenAI engine setup, env vars, and Entra auth.
docs/agent365-guided-setup/a365-evaluate-faq.md Adds FAQ guidance for Azure OpenAI engine behavior and auth constraints.
CHANGELOG.md Adds release notes entries for azure-openai engine and --report-name.

Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/AzureOpenAiLauncher.cs Outdated
Comment thread CHANGELOG.md Outdated
SchemaDiscovery: replace the shared mutable _mcpSessionId field with a per-call McpSession holder so concurrent discoveries on the DI singleton no longer race; validate the server-supplied session id as visible ASCII (0x21-0x7E) and echo it via Headers.Add to block CR/LF header injection.

AzureOpenAiLauncher: require an HTTPS endpoint in IsAvailableAsync (+ test) so tool metadata is never sent over plaintext http.

Remove the --report-name option across command, interface, pipeline, tests, and CHANGELOG.

Docs: recommend gpt-5.4 (the tested model) for A365_EVAL_AZURE_OPENAI_DEPLOYMENT. Trim the azure-openai CHANGELOG entry and remove the OpenAI version-pin rationale comment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants