feat(perf): add --runtime [ort|openvino] to compare ORT vs OpenVINO #960

Open
xieofxie wants to merge 3 commits into main from hualxie/run_ov

@xieofxie

Contributor

What

Adds a --runtime [ort|openvino] flag to winml perf so the same ONNX file can be benchmarked on ONNX Runtime vs OpenVINO Runtime for a side-by-side comparison.

winml perf -m model.onnx --runtime openvino --device gpu
winml perf -m model.onnx --runtime ort --ep cpu   # ORT-native baseline
  • Default is ort — existing behavior is unchanged.
  • --runtime openvino reads the raw ONNX directly via OpenVINO Runtime (no quantize/optimize/compile build), which is the fair, simple comparison on the same graph. ONNX input only.

How

  • OpenVINOSession (session/openvino/openvino_session.py) mirrors the subset of WinMLSession the perf engine uses — compile() / run() / perf() plus io_config / device / ep_name / running_model_path. It reuses get_io_config, load_onnx, PerfStats, and WinMLSession._get_precision, so I/O metadata and reports match the ORT path. No model-specific logic.
  • _OpenVINOModel adapter in perf.py exposes the _single surface the benchmark engine reads, so _run_single / _run_benchmark* / reporting are untouched. PerfBenchmark._load_model() branches to it and skips WinMLAutoModel + ORT EP resolution entirely (OpenVINO is independent of ORT's EPs).
  • --device maps cpu/gpu/npu/auto → OpenVINO CPU/GPU/NPU/AUTO. compile() fails fast against Core().available_devices with a readable message instead of a raw backend stack trace.
  • RuntimeName Literal + RUNTIME_NAMES in constants.py (mirrors CompilerName) — the CLI choice list and the typed config field derive from one source.
  • CLI guards: --runtime openvino requires a .onnx input and rejects --module.
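The device mapping and fail-fast check described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `resolve_ov_device` and `_DEVICE_MAP` are hypothetical names, and the list of available devices is passed in so the logic can be exercised without OpenVINO installed (in a real session it would presumably come from `openvino.Core().available_devices`).

```python
# Hypothetical helper; names are illustrative and not taken from the PR.
_DEVICE_MAP = {"cpu": "CPU", "gpu": "GPU", "npu": "NPU", "auto": "AUTO"}


def resolve_ov_device(device: str, available: list) -> str:
    """Map a CLI --device value to an OpenVINO device name, failing fast if absent."""
    try:
        ov_device = _DEVICE_MAP[device.lower()]
    except KeyError:
        raise ValueError(
            f"Unsupported device {device!r}; expected one of {sorted(_DEVICE_MAP)}"
        ) from None

    # AUTO lets OpenVINO choose the target, so it is exempt from the pre-check.
    if ov_device == "AUTO":
        return ov_device

    # Accept both plain ("GPU") and indexed ("GPU.0") entries in the available list.
    for name in available:
        if name == ov_device or name.startswith(ov_device + "."):
            return ov_device

    raise RuntimeError(
        f"OpenVINO device {ov_device!r} is not available. OpenVINO sees: {available}"
    )
```

Calling `resolve_ov_device("npu", ["CPU", "GPU"])` raises the readable error before any model compilation is attempted, which is the behavior the bullet describes.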

Verified locally

  • --runtime openvino runs on CPU and GPU end-to-end; latency/throughput populated.
  • --monitor works on CPU and GPU (HW utilization via PDH; falls back to NullEPMonitor like most EPs — no OV-specific ep_proof telemetry yet).
  • Absent device (NPU here) → friendly error: OpenVINO device 'NPU' ... is not available. OpenVINO sees: ['CPU', 'GPU'].
  • New unit tests (tests/unit/session/test_openvino_session.py, gated on importorskip("openvino")) + CLI guard tests; all existing perf tests pass; ruff clean.

Notes / follow-ups

  • --ep and quant/optimize flags are intentional no-ops under --runtime openvino (raw ONNX) — documented in the flag help.
  • On machines where the WinML registry installs the OpenVINO EP, --runtime ort --device cpu already routes ORT→OpenVINO EP; use --runtime ort --ep cpu for a true ORT-native baseline.
  • Possible follow-up: surface EXECUTION_DEVICES in the report so AUTO-mode fallbacks are visible.

Closes #948

🤖 Generated with Claude Code

xieofxie added 2 commits June 24, 2026 16:04
- Add RuntimeName Literal + RUNTIME_NAMES to constants (mirrors CompilerName),
  thread it through BenchmarkConfig and the perf CLI instead of bare str.
- Fail fast in OpenVINOSession.compile() when the requested device is absent
  from Core().available_devices, with a readable message instead of a raw
  backend stack trace. AUTO is exempt; matches plain (GPU) and indexed (GPU.0)
  device names.
- Add a hardware-independent unit test for the unavailable-device path.
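One common way to keep a `Literal` type and a runtime choice list in sync is to derive the list with `typing.get_args`. The sketch below assumes that approach; the PR may define `RUNTIME_NAMES` differently, and `validate_runtime` is a hypothetical helper added only for illustration.

```python
from typing import Literal, get_args

# Single source of truth: the type checkers see the Literal,
# the CLI sees the tuple derived from it.
RuntimeName = Literal["ort", "openvino"]
RUNTIME_NAMES: tuple = get_args(RuntimeName)


def validate_runtime(value: str) -> str:
    """Hypothetical validator: reject anything outside RUNTIME_NAMES."""
    if value not in RUNTIME_NAMES:
        raise ValueError(f"--runtime must be one of {RUNTIME_NAMES}, got {value!r}")
    return value
```

With this layout, adding a third runtime means editing only the `Literal`; the CLI choices and the typed config field follow automatically.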
@xieofxie xieofxie requested a review from a team as a code owner June 24, 2026 08:25
…ssing

Wrap the openvino import in OpenVINOSession.compile() so an absent package
raises a clear install hint (pip install winml-cli[openvino]) instead of a
bare ModuleNotFoundError. Add a unit test that simulates the missing module.
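A lazy import with an install hint could look roughly like this. It is a sketch under assumptions: `load_optional` is a hypothetical helper, the module name is a parameter so the failure path can be tested without uninstalling anything, and the exception type raised by the PR is not stated, so `ImportError` is used here as a guess.

```python
import importlib


def load_optional(module_name: str, extra: str):
    """Import an optional dependency, replacing the bare error with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Assumption: the extra name matches the pip extra quoted in the commit message.
        raise ImportError(
            f"The '{module_name}' package is required for this runtime. "
            f"Install it with: pip install winml-cli[{extra}]"
        ) from exc
```

Inside `compile()`, the call would then be something like `ov = load_optional("openvino", "openvino")`, so importing the session module itself never requires OpenVINO.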

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Collaborator

Overall the PR is well-structured: the adapter pattern cleanly separates OpenVINO from the ORT pipeline, error handling is thoughtful (lazy import, device pre-check, file existence), and test coverage is solid. Three findings below.

f"(not a HuggingFace model ID), got: {hf_model}"
)
if module_class:
    raise click.UsageError("--runtime openvino does not support --module benchmarking.")

Collaborator

Silent discard of --ep / --ep-options when --runtime openvino — no user feedback

When --runtime openvino is combined with --ep cuda (or any EP), the value is forwarded into BenchmarkConfig but silently ignored because _load_model() returns before _resolve_device_ep() runs. A pattern already used in this file (--shape-config warning in --module mode) would work here — emit a yellow console warning so users know the flag had no effect.
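A sketch of the suggested feedback, kept independent of the CLI framework: the function below only builds the warning messages, and the caller would print them in yellow using whatever console helper the file already has. `ignored_flag_warnings` and its parameters are hypothetical names, not the PR's API.

```python
from typing import List, Mapping, Optional


def ignored_flag_warnings(
    runtime: str,
    ep: Optional[str],
    ep_options: Optional[Mapping[str, str]],
) -> List[str]:
    """Return one warning per flag that has no effect under --runtime openvino."""
    if runtime != "openvino":
        return []
    warnings = []
    if ep:
        warnings.append(f"--ep {ep} is ignored when --runtime openvino is used.")
    if ep_options:
        warnings.append("--ep-options is ignored when --runtime openvino is used.")
    return warnings
```

Returning messages rather than printing keeps the check easy to unit test next to the existing CLI guard tests.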

return self._io_config

@property
def device(self) -> str:

Collaborator

device property returns input string, not resolved OpenVINO device

After AUTO compilation self._ov_device holds the concrete target (e.g. 'GPU.0'), but this property still returns self._device (e.g. 'auto'). Since the perf engine reads model.device for report labelling, an AUTO-mode run will appear as device='auto' in JSON output rather than the true hardware. Consider return self._ov_device or self._device once compiled.
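The suggested property can be illustrated with a stripped-down stand-in class. Only the `_device` / `_ov_device` attribute names come from the comment above; `SessionSketch` and `mark_compiled` are hypothetical and stand in for the real session and its `compile()` step.

```python
from typing import Optional


class SessionSketch:
    """Minimal stand-in for the session, showing only the device property."""

    def __init__(self, device: str) -> None:
        self._device = device                     # what the user passed, e.g. "auto"
        self._ov_device: Optional[str] = None     # concrete target, known after compile

    def mark_compiled(self, resolved_device: str) -> None:
        # In the real session this would be set inside compile().
        self._ov_device = resolved_device

    @property
    def device(self) -> str:
        # Prefer the resolved device once it exists; fall back to the input string.
        return self._ov_device or self._device
```

Before compilation the property still returns the requested value, so callers that read `device` early are unaffected.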

# output-name normalization mismatches (the order of model.outputs
# matches the ONNX graph output order get_io_config reads).
out_names = self.io_config["output_names"]
return {name: np.asarray(result[i]) for i, name in enumerate(out_names)}

Collaborator

Output index mapping assumes io_config and compiled output count agree

If the two disagree (e.g. optional outputs interpreted differently), the dict comprehension silently truncates or raises an IndexError with no context. A defensive length check before the comprehension would make failures easier to diagnose.
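A defensive version of the mapping might look like the sketch below. `map_outputs` is a hypothetical free function used for illustration; the `np.asarray` conversion from the original line is omitted so the example stays dependency-free.

```python
from typing import Any, Dict, Sequence


def map_outputs(out_names: Sequence[str], results: Sequence[Any]) -> Dict[str, Any]:
    """Pair output names with results by position, failing loudly on a count mismatch."""
    if len(out_names) != len(results):
        raise ValueError(
            f"Output count mismatch: io_config lists {len(out_names)} outputs "
            f"({list(out_names)}), but the compiled model returned {len(results)}."
        )
    return {name: results[i] for i, name in enumerate(out_names)}
```

The explicit check turns both failure modes (silent truncation and a bare `IndexError`) into one error that names the outputs involved.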

xieofxie commented Jun 25, 2026

Contributor Author

Pros

  • Lets us check our top-200 models internally on ORT vs. Native
    • Should the cert team do this?
  • ISVs have asked to see the difference between ORT and Native

Cons

  • What if a user finds that Native is better than ORT?
  • Are we measuring correctly?
  • It will lead to many different implementations for IHVs

Development

Successfully merging this pull request may close these issues.

feat: implement openvino run in perf to compare ?
