fix(build): surface clear disk-full error instead of misleading opset ValueError #987

Open
timenick wants to merge 3 commits into main from timenick-fix-misleading-valueerror-on-disk-full
Conversation

@timenick

Collaborator

Closes #259

Problem

When the disk fills during the optimize step of winml build, the ONNX write fails: onnx.save_model truncates the target to 0 bytes and then write() raises OSError(ENOSPC), leaving a corrupt, zero-byte optimized.onnx behind. The quantize step then loads that file. onnx.load parses the empty bytes into an empty ModelProto with no opset_import, and ONNX Runtime (ORT) raises the opaque:

ValueError: Failed to find proper ai.onnx domain

which surfaces as Quantization failed: .... The real cause (out of disk space) is never reported, so users chase a phantom code bug.
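The truncate-then-fail sequence can be reproduced without ONNX. The following is a stdlib-only sketch, not the code path inside onnx.save_model: `failing_save` is a hypothetical stand-in that opens the target in `"wb"` mode (which truncates it) and then simulates the ENOSPC failure by raising it directly.

```python
import errno
import os
import tempfile
from typing import Tuple


def failing_save(path: str) -> None:
    """Hypothetical stand-in for a save that hits a full disk mid-write."""
    with open(path, "wb"):  # opening in "wb" mode truncates the file to 0 bytes
        # Simulated failure: on a real full disk this would come from write()/flush().
        raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))


def demo() -> Tuple[int, int]:
    """Return the file size before and after the failed save."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "optimized.onnx")
        with open(path, "wb") as f:
            f.write(b"previous valid model bytes")
        size_before = os.path.getsize(path)
        try:
            failing_save(path)
        except OSError as exc:
            if exc.errno != errno.ENOSPC:
                raise
        # The file still exists, but its earlier contents are gone.
        return size_before, os.path.getsize(path)
```

Calling `demo()` shows a non-empty file before the failed save and a zero-byte file afterwards, which is the artifact the quantize step later misreads.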

Fix

Root-cause fix at the ONNX write boundary, plus a defensive guard:

  • ONNXSaveError (onnx/persistence.py) — new OSError subclass with path and disk_full attributes. Exported from onnx/__init__.py.
  • save_onnx — both write paths (external-data and inline) now catch OSError, remove the partial .onnx/.data artifact, and raise ONNXSaveError. Disk-full (errno.ENOSPC / Windows ERROR_DISK_FULL 112) gets a clear "Insufficient disk space" message; other errors get a generic write-failure message.
  • copy_onnx_model (onnx/external_data.py) — same treatment for the copy boundary, cleaning up dst (+ sidecar).
  • Quantizer guard (quant/quantizer.py) — _quantize_single_pass now detects an empty/opset-less/unparseable input model up front and returns a clear QuantizeResult failure instead of letting ORT raise its opaque opset error.
  • Build hint (commands/build.py) — disk-space errors map to an actionable hint (free up disk / clear ~/.cache/winml).
  • Docs (docs/troubleshooting.md) — note on the new behavior and automatic partial-file cleanup.
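As a rough illustration of the write-boundary handling listed above, here is a minimal sketch. It is not the PR's actual code: `save_with_cleanup`, `is_disk_full`, the injected `write` callable, and the `.data` sidecar naming are assumptions made for the example. Only the exception name, its `path` / `disk_full` attributes, and the error codes come from the description.

```python
import errno
import os
import tempfile
from pathlib import Path
from typing import Callable

WINDOWS_ERROR_DISK_FULL = 112  # Windows error code named in the description


class ONNXSaveError(OSError):
    """OSError subclass carrying the target path and a disk-full flag."""

    def __init__(self, message: str, path: Path, disk_full: bool) -> None:
        super().__init__(message)
        self.path = path
        self.disk_full = disk_full


def is_disk_full(exc: OSError) -> bool:
    """True for ENOSPC or the Windows disk-full error code."""
    return (
        exc.errno == errno.ENOSPC
        or getattr(exc, "winerror", None) == WINDOWS_ERROR_DISK_FULL
    )


def save_with_cleanup(path: Path, write: Callable[[Path], None]) -> None:
    """Run `write`; on OSError remove partial artifacts and raise ONNXSaveError."""
    try:
        write(path)
    except OSError as exc:
        # Assumed sidecar name: same stem with a .data suffix.
        for artifact in (path, path.with_suffix(".data")):
            artifact.unlink(missing_ok=True)
        disk_full = is_disk_full(exc)
        message = (
            f"Insufficient disk space while writing {path}"
            if disk_full
            else f"Failed to write {path}: {exc}"
        )
        raise ONNXSaveError(message, path, disk_full) from exc


def demo() -> ONNXSaveError:
    """Simulate a full disk and return the error the caller would see."""

    def full_disk_write(target: Path) -> None:
        target.write_bytes(b"")  # leaves a zero-byte file behind
        raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))

    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "optimized.onnx"
        try:
            save_with_cleanup(target, full_disk_write)
        except ONNXSaveError as exc:
            assert not target.exists()  # partial artifact was removed
            return exc
    raise AssertionError("expected ONNXSaveError")
```

Chaining with `from exc` keeps the original OSError available as `__cause__` for debugging while the user-facing message stays short.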

Subclassing OSError keeps existing except OSError callers working and lets the build command's top-level handler surface the clear message verbatim. The write-boundary cleanup guarantees no truncated artifact is left for a later stage to misread.
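The compatibility argument can be sketched as follows. This is an illustration under assumptions, not the code in commands/build.py: the `ONNXSaveError` here is a stand-in with only the attributes named in the description, and `legacy_caller`, `build_hint`, `failing_write`, and the hint wording are hypothetical.

```python
import errno
from typing import Callable, Optional


class ONNXSaveError(OSError):
    """Stand-in for the PR's exception (attributes: path, disk_full)."""

    def __init__(self, message: str, path: str, disk_full: bool) -> None:
        super().__init__(message)
        self.path = path
        self.disk_full = disk_full


DISK_HINT = "Free up disk space or clear ~/.cache/winml, then re-run the build."


def build_hint(exc: BaseException) -> Optional[str]:
    """Return an actionable hint for disk-space failures, else None."""
    if getattr(exc, "disk_full", False):
        return DISK_HINT
    if isinstance(exc, OSError) and exc.errno == errno.ENOSPC:
        return DISK_HINT
    return None


def legacy_caller(action: Callable[[], None]) -> str:
    """An existing caller that only knows about OSError."""
    try:
        action()
    except OSError as exc:  # also catches ONNXSaveError
        return f"write failed: {exc}"
    return "ok"


def failing_write() -> None:
    """Hypothetical action that fails the way a full disk would."""
    raise ONNXSaveError(
        "Insufficient disk space while writing optimized.onnx",
        "optimized.onnx",
        True,
    )
```

Here `legacy_caller(failing_write)` still lands in the existing `except OSError` branch and reports the clear message verbatim, while `build_hint` returns a hint only for disk-space failures.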

Out of scope (decided during triage): no proactive free-disk pre-check before build.

timenick added 2 commits June 26, 2026 15:51
When the disk fills during the optimize step, onnx.save_model truncates the target to 0 bytes then raises OSError(ENOSPC), leaving a corrupt optimized.onnx. The quantize step loaded it into an empty ModelProto and ORT raised the opaque 'Failed to find proper ai.onnx domain', hiding the real cause (issue #259).

Catch OSError at every ONNX write boundary (save_onnx, copy_onnx_model), remove the partial .onnx/.data artifact, and raise a clear ONNXSaveError (subclass of OSError, attrs path/disk_full). Add a defensive guard in the quantizer that detects an empty/corrupt input model and returns a clear failure instead of ORT's opset error. Map disk-space errors to an actionable hint in the build command.
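The defensive guard in the quantizer might look roughly like the sketch below. It is an assumption-laden illustration rather than the code in quant/quantizer.py: `GuardResult` stands in for the real QuantizeResult, `check_input_model` is a hypothetical name, and the loader is injected so the example runs without ONNX installed. In real use the loader would presumably be onnx's `load_model`.

```python
import os
import tempfile
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Callable, Optional, Tuple


@dataclass
class GuardResult:
    """Hypothetical stand-in for the quantizer's result type."""

    ok: bool
    error: Optional[str] = None


def check_input_model(path: str, load: Callable[[str], Any]) -> GuardResult:
    """Reject empty, unparseable, or opset-less models before quantizing."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return GuardResult(
            False,
            f"Input model {path} is missing or empty; "
            "a previous step may have failed to write it.",
        )
    try:
        model = load(path)
    except Exception as exc:  # any parse failure becomes a clear message
        return GuardResult(False, f"Input model {path} could not be parsed: {exc}")
    if len(model.opset_import) == 0:
        return GuardResult(
            False, f"Input model {path} declares no opset; it is likely corrupt."
        )
    return GuardResult(True)


def demo() -> Tuple[GuardResult, GuardResult, GuardResult]:
    """Exercise the guard with stub loaders instead of a real ONNX parser."""
    with tempfile.TemporaryDirectory() as tmp:
        empty = os.path.join(tmp, "optimized.onnx")
        open(empty, "wb").close()  # zero-byte file, as left by a failed save
        other = os.path.join(tmp, "model.onnx")
        with open(other, "wb") as f:
            f.write(b"placeholder bytes")
        no_opset = lambda _path: SimpleNamespace(opset_import=[])
        with_opset = lambda _path: SimpleNamespace(opset_import=["ai.onnx"])
        return (
            check_input_model(empty, no_opset),
            check_input_model(other, no_opset),
            check_input_model(other, with_opset),
        )
```

Checking the file size first avoids parsing at all for the zero-byte case reported in the issue. The opset check covers files that parse but carry no usable model.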
@timenick timenick requested a review from a team as a code owner June 26, 2026 07:52
@timenick timenick changed the title Surface clear disk-full error instead of misleading opset ValueError fix(build): surface clear disk-full error instead of misleading opset ValueError Jun 26, 2026
Comment thread src/winml/modelkit/quant/quantizer.py Fixed
Comment thread tests/unit/test_quantizer.py Fixed
Comment thread tests/unit/test_quantizer.py Fixed
Comment thread tests/unit/test_quantizer.py Fixed
CodeQL's py/import-and-import-from flagged the quantizer guard and its tests for importing the onnx module via both 'import onnx' and 'from onnx import ...' in the same file (the guard's plain 'import onnx' collided by short name with the relative 'from ..onnx import save_onnx'). Use 'from onnx import load_model' in the guard and drop the redundant 'from onnx import' in the tests so each file uses a single, consistent import form.
Development

Successfully merging this pull request may close these issues.

bug: winml build shows misleading ValueError (opset domain) when disk is full during quantize step

2 participants