fix(build): surface clear disk-full error instead of misleading opset ValueError #987
Open
timenick wants to merge 3 commits into
When the disk fills during the optimize step, onnx.save_model truncates the target to 0 bytes then raises OSError(ENOSPC), leaving a corrupt optimized.onnx. The quantize step loaded it into an empty ModelProto and ORT raised the opaque 'Failed to find proper ai.onnx domain', hiding the real cause (issue #259). Catch OSError at every ONNX write boundary (save_onnx, copy_onnx_model), remove the partial .onnx/.data artifact, and raise a clear ONNXSaveError (subclass of OSError, attrs path/disk_full). Add a defensive guard in the quantizer that detects an empty/corrupt input model and returns a clear failure instead of ORT's opset error. Map disk-space errors to an actionable hint in the build command.
…ng-valueerror-on-disk-full
CodeQL's py/import-and-import-from flagged the quantizer guard and its tests for importing the onnx module via both 'import onnx' and 'from onnx import ...' in the same file (the guard's plain 'import onnx' collided by short name with the relative 'from ..onnx import save_onnx'). Use 'from onnx import load_model' in the guard and drop the redundant 'from onnx import' in the tests so each file uses a single, consistent import form.
Closes #259
Problem
When the disk fills during the optimize step of `winml build`, the ONNX write fails: `onnx.save_model` truncates the target to 0 bytes and then `write()` raises `OSError(ENOSPC)`, leaving a corrupt, zero-byte `optimized.onnx` behind. The quantize step then loads that file: `onnx.load` parses the empty bytes into an empty `ModelProto` with no `opset_import`, and ORT raises the opaque `Failed to find proper ai.onnx domain` error, which surfaces as `Quantization failed: ...`. The real cause (out of disk space) is never reported, so users chase a phantom code bug.

Fix

Root-cause fix at the ONNX write boundary, plus a defensive guard:

- `ONNXSaveError` (`onnx/persistence.py`): new `OSError` subclass with `path` and `disk_full` attributes. Exported from `onnx/__init__.py`.
- `save_onnx`: both write paths (external-data and inline) now catch `OSError`, remove the partial `.onnx`/`.data` artifact, and raise `ONNXSaveError`. Disk-full (`errno.ENOSPC` / Windows `ERROR_DISK_FULL` 112) gets a clear "Insufficient disk space" message; other errors get a generic write-failure message.
- `copy_onnx_model` (`onnx/external_data.py`): same treatment for the copy boundary, cleaning up `dst` (+ sidecar).
- `quant/quantizer.py`: `_quantize_single_pass` now detects an empty, opset-less, or unparseable input model up front and returns a clear `QuantizeResult` failure instead of letting ORT raise its opaque opset error.
- `commands/build.py`: disk-space errors map to an actionable hint (free up disk / clear `~/.cache/winml`).
- `docs/troubleshooting.md`: note on the new behavior and automatic partial-file cleanup.

Subclassing `OSError` keeps existing `except OSError` callers working and lets the build command's top-level handler surface the clear message verbatim. The write-boundary cleanup guarantees no truncated artifact is left for a later stage to misread.

Out of scope (decided during triage): no proactive free-disk pre-check before build.
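
For reviewers who want the shape of the write-boundary handling without opening the diff, here is a minimal standard-library sketch of the pattern described above. It is not the PR's code: `write_with_cleanup`, `is_disk_full`, and `hint_for` are hypothetical names, the constructor signature of `ONNXSaveError` is assumed, and the sidecar is assumed to sit next to the model as `<name>.data`. The real `save_onnx` and `copy_onnx_model` may differ in all of these.

```python
import contextlib
import errno
from pathlib import Path
from typing import Callable, Optional

# Windows reports a full disk as ERROR_DISK_FULL (112) via OSError.winerror.
WIN_ERROR_DISK_FULL = 112


class ONNXSaveError(OSError):
    """Illustrative stand-in: an OSError carrying `path` and `disk_full`."""

    def __init__(self, message: str, path: Path, disk_full: bool) -> None:
        super().__init__(message)
        self.path = path
        self.disk_full = disk_full


def is_disk_full(exc: OSError) -> bool:
    """True for ENOSPC or, on Windows, ERROR_DISK_FULL."""
    return (
        exc.errno == errno.ENOSPC
        or getattr(exc, "winerror", None) == WIN_ERROR_DISK_FULL
    )


def write_with_cleanup(path: Path, writer: Callable[[Path], None]) -> None:
    """Run `writer(path)`; on OSError remove partial files and re-raise clearly."""
    try:
        writer(path)
    except OSError as exc:
        # Assumed sidecar naming: "<model file name>.data" next to the model.
        sidecar = path.with_name(path.name + ".data")
        for leftover in (path, sidecar):
            # Best effort: a failed cleanup must not mask the original error.
            with contextlib.suppress(OSError):
                leftover.unlink(missing_ok=True)
        full = is_disk_full(exc)
        if full:
            message = f"Insufficient disk space while writing {path}"
        else:
            message = f"Failed to write ONNX model to {path}: {exc}"
        raise ONNXSaveError(message, path, full) from exc


def hint_for(exc: Exception) -> Optional[str]:
    """Map a disk-space failure to an actionable hint, else None."""
    if isinstance(exc, ONNXSaveError) and exc.disk_full:
        return "Free up disk space or clear ~/.cache/winml, then retry."
    return None
```

Because the sketch's `ONNXSaveError` derives from `OSError`, a caller that already wraps the write in `except OSError` keeps working unchanged, while a top-level handler can check `disk_full` to pick the hint.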