feat(evaluate): add Azure OpenAI judge (--eval-engine azure-openai) #463
Open
ashragrawal wants to merge 2 commits into
Score each MCP tool-schema checklist item with your own Azure OpenAI deployment via the OpenAI Responses API, authenticated with Microsoft Entra ID (DefaultAzureCredential; API-key auth is not supported).

- Check-by-check: each assertion is one model call at temperature 0 with the full tool schema as context, fanned out in parallel (A365_EVAL_AZURE_OPENAI_MAX_CONCURRENCY, default 100). Avoids the output truncation the whole-tool approach hit on large tools.
- Single-flight Entra token cache so high concurrency does not storm `az`.
- Checklist checkpointed each round; a re-run resumes only unscored checks.
- Explicit-only engine (never auto-selected); fixed retry of 3 with backoff.
- New --report-name overrides the host-derived output/report name (needed when multiple servers share one gateway host).
- OpenAI pinned to 2.8.0 to keep System.Text.Json and System.Diagnostics.DiagnosticSource at 9.0.8 (no shared-dependency bump).
- MCP discovery now tracks Mcp-Session-Id for session-required servers.
- Docs (evaluate instructions + FAQ), CHANGELOG, and tests included.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
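The fan-out and single-flight token cache described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the PR's C# implementation. `SingleFlightTokenCache`, `score_checks`, `fake_fetch`, and `fake_score` are hypothetical names, and the fixed TTL is an assumption (a real cache would more likely honor the token's own expiry).

```python
import asyncio


class SingleFlightTokenCache:
    """Caches one token; concurrent callers share a single in-flight fetch."""

    def __init__(self, fetch_token, ttl_seconds=300.0):
        self._fetch_token = fetch_token  # hypothetical async callable returning a token
        self._ttl = ttl_seconds          # assumed fixed lifetime, for illustration only
        self._token = None
        self._expires_at = 0.0
        self._lock = asyncio.Lock()
        self.fetch_count = 0             # exposed so the demo can observe fetches

    async def get(self):
        loop = asyncio.get_running_loop()
        if self._token is not None and loop.time() < self._expires_at:
            return self._token           # fast path: cached token, no lock taken
        async with self._lock:
            # Re-check under the lock: another caller may have refreshed already.
            if self._token is None or loop.time() >= self._expires_at:
                self._token = await self._fetch_token()
                self._expires_at = loop.time() + self._ttl
                self.fetch_count += 1
            return self._token


async def score_checks(checks, score_one, tokens, max_concurrency=100):
    """Score every check with at most max_concurrency calls in flight."""
    gate = asyncio.Semaphore(max_concurrency)

    async def run(check):
        async with gate:
            token = await tokens.get()
            return check, await score_one(check, token)

    return dict(await asyncio.gather(*(run(c) for c in checks)))


async def _demo():
    async def fake_fetch():              # stands in for the credential call
        await asyncio.sleep(0.01)
        return "fake-token"

    async def fake_score(check, token):  # stands in for one model call
        await asyncio.sleep(0)
        return len(check) % 2 == 0

    cache = SingleFlightTokenCache(fake_fetch)
    checks = [f"check-{i}" for i in range(50)]
    results = await score_checks(checks, fake_score, cache, max_concurrency=8)
    return results, cache.fetch_count


demo_results, demo_fetches = asyncio.run(_demo())
```

In this sketch the semaphore bounds how many checks run at once, while the lock plus re-check ensures that callers arriving during a refresh wait for the one in-flight fetch instead of each starting their own.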
Dependency Review
The following issues were found:
License Issues: src/Microsoft.Agents.A365.DevTools.Cli/Microsoft.Agents.A365.DevTools.Cli.csproj
Pull request overview
Adds an explicit Azure OpenAI-backed evaluation engine to develop-mcp evaluate, enabling per-check scoring via the OpenAI Responses API with Entra ID auth, plus supporting plumbing (concurrency, availability gating, report naming, MCP session header tracking) and accompanying tests/docs.
Changes:
- Introduces `--eval-engine azure-openai` with per-check parallel scoring and Entra ID authentication.
- Adds `--report-name` to override host-derived report/output naming.
- Updates MCP schema discovery to track and echo `Mcp-Session-Id` for session-required servers.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationPipelineServiceTests.cs | Adds tests for trust-boundary warning messaging per engine type. |
| src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/AzureOpenAiLauncherTests.cs | Adds pure-function tests for Azure OpenAI response parsing and prompt building. |
| src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/AzureOpenAiLauncherAvailabilityTests.cs | Adds non-parallel env-var based availability tests for the Azure OpenAI engine. |
| src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Commands/EvaluateCommandInvocationTests.cs | Updates invocation tests for new pipeline signature and adds forwarding tests. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckPrompts.cs | Adds per-check prompt builder for direct model endpoints. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs | Captures/echoes Mcp-Session-Id across discovery calls. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IEvaluationPipelineService.cs | Extends RunAsync with optional reportName. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ICodingAgentLauncher.cs | Extends launcher contract for concurrency, auto-detectability, per-check scoring, and hints. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvaluationPipelineService.cs | Adds Azure OpenAI-specific trust warning and --report-name override; parses azure-openai. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvalModelConstants.cs | Adds Azure OpenAI env vars, scope, and max-concurrency parsing. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/CodingAgentLauncherBase.cs | Provides default implementations for new launcher interface members. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs | Adds per-check scoring path and parallelizes whole-tool evaluation respecting engine concurrency. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/AzureOpenAiLauncher.cs | New engine implementation using OpenAI Responses API + Entra ID token bridge and retry logging. |
| src/Microsoft.Agents.A365.DevTools.Cli/Program.cs | Registers the Azure OpenAI launcher in DI. |
| src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluateEnums.cs | Adds EvalEngine.AzureOpenAI. |
| src/Microsoft.Agents.A365.DevTools.Cli/Microsoft.Agents.A365.DevTools.Cli.csproj | Adds OpenAI SDK package reference. |
| src/Microsoft.Agents.A365.DevTools.Cli/Commands/DevelopMcpCommand.cs | Updates help text and adds --report-name option forwarding to the pipeline. |
| src/Directory.Packages.props | Pins OpenAI SDK version (2.8.0) with rationale. |
| docs/agent365-guided-setup/a365-evaluate-instructions.md | Documents Azure OpenAI engine setup, env vars, and Entra auth. |
| docs/agent365-guided-setup/a365-evaluate-faq.md | Adds FAQ guidance for Azure OpenAI engine behavior and auth constraints. |
| CHANGELOG.md | Adds release notes entries for azure-openai engine and --report-name. |
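
Two of the behaviors listed for `EvalModelConstants.cs` and `AzureOpenAiLauncher.cs`, max-concurrency parsing and the fixed retry, might look roughly like the sketch below. It is Python for illustration only, not the PR's C# code. The environment variable name and the default of 100 come from the PR description; the fallback for invalid values, the reading of "retry of 3" as three total attempts, and the delay schedule are all assumptions, and `parse_max_concurrency` and `call_with_retry` are hypothetical names.

```python
import os
import time

ENV_MAX_CONCURRENCY = "A365_EVAL_AZURE_OPENAI_MAX_CONCURRENCY"  # name from the PR description
DEFAULT_MAX_CONCURRENCY = 100                                    # default from the PR description


def parse_max_concurrency(env=os.environ):
    """Return the configured concurrency, falling back to the default.

    Assumption: missing, non-numeric, or non-positive values use the default.
    """
    raw = env.get(ENV_MAX_CONCURRENCY, "").strip()
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_MAX_CONCURRENCY
    return value if value > 0 else DEFAULT_MAX_CONCURRENCY


def call_with_retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run operation, allowing `attempts` tries in total with exponential backoff.

    Assumption: delays double each time (0.5 s, then 1.0 s); the PR does not
    state its actual schedule.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the retry helper testable without real delays, for example by supplying a list's `append` method to record the waits.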
SchemaDiscovery: replace the shared mutable _mcpSessionId field with a per-call McpSession holder so concurrent discoveries on the DI singleton no longer race; validate the server-supplied session id as visible ASCII (0x21-0x7E) and echo it via Headers.Add to block CR/LF header injection.

AzureOpenAiLauncher: require an HTTPS endpoint in IsAvailableAsync (+ test) so tool metadata is never sent over plaintext http.

Remove the --report-name option across command, interface, pipeline, tests, and CHANGELOG.

Docs: recommend gpt-5.4 (the tested model) for A365_EVAL_AZURE_OPENAI_DEPLOYMENT.

Trim the azure-openai CHANGELOG entry and remove the OpenAI version-pin rationale comment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
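
The two input checks in this commit, visible-ASCII validation of the session id and the HTTPS-only endpoint rule, can be sketched as follows. This is a Python illustration under stated assumptions, not the C# code in `SchemaDiscoveryService.cs` or `AzureOpenAiLauncher.cs`. `is_valid_session_id` and `is_https_endpoint` are hypothetical names, and treating an empty id as invalid is an assumption.

```python
from urllib.parse import urlparse


def is_valid_session_id(value):
    """True when value is non-empty and every character is visible ASCII (0x21-0x7E).

    CR (0x0D), LF (0x0A), and space (0x20) all fall outside that range, so a
    value that passes cannot smuggle an extra header line.
    """
    return bool(value) and all(0x21 <= ord(ch) <= 0x7E for ch in value)


def is_https_endpoint(url):
    """True when url parses as an absolute https URL with a host."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and bool(parsed.hostname)
```

For example, `is_valid_session_id("abc\r\nX-Evil: 1")` is false because of the CR and LF characters, and `is_https_endpoint("http://example.invalid/")` is false because the scheme is plaintext http.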