Pull request overview
This PR adds a new CUDA-accelerated Hierarchical Speculative Decoding (HSD) inference path for vision-language document parsing in oar-ocr-vl, aligns/bridges draft vs target output formats (notably tables), and improves observability + CI coverage. It also updates structure/formula handling in the core OCR pipeline and removes deprecated VLM backends (UniRec, LightOnOCR) from the VLM crate.
Changes:
- Introduce the `hsd` feature in `oar-ocr-vl` with new core modules (draft types, matching, prefix-tree batching, trimmable KV cache, backend utilities) and integrate KV trimming/gather support into multiple VLM backbones.
- Improve structure/formula/OCR batching and diagnostics (formula ORT config override, formula batch sizing, timing logs), plus table HTML↔OTSL conversion utilities + tests.
- Expand CI with rustdoc warnings-as-errors, feature-matrix checks, and a CUDA-toolkit-backed `hsd` job; update docs/READMEs and workspace versioning.
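To make the "prefix-tree batching" item concrete, here is a minimal sketch of the general idea: candidate token sequences that share a prefix are merged into one tree, so the shared tokens only need to be scored once. The `PrefixTree` type and its methods are hypothetical names chosen for this illustration and are not taken from `oar-ocr-vl/src/hsd/prefix_tree.rs`.

```rust
/// Hypothetical prefix tree over token ids (illustration only, not the crate's type).
#[derive(Debug, Default)]
pub struct PrefixTree {
    /// Token id stored at each node, in insertion order.
    pub tokens: Vec<u32>,
    /// Parent index of each node; `None` marks a direct child of the root.
    pub parents: Vec<Option<usize>>,
}

impl PrefixTree {
    /// Inserts one candidate, reusing nodes for any prefix already present.
    /// Returns the index of the node holding the last token, or `None` for an empty candidate.
    pub fn insert(&mut self, candidate: &[u32]) -> Option<usize> {
        let mut parent: Option<usize> = None;
        for &tok in candidate {
            // Linear scan for an existing child of `parent` carrying `tok`.
            let existing = (0..self.tokens.len())
                .find(|&i| self.parents[i] == parent && self.tokens[i] == tok);
            let node = match existing {
                Some(i) => i,
                None => {
                    self.tokens.push(tok);
                    self.parents.push(parent);
                    self.tokens.len() - 1
                }
            };
            parent = Some(node);
        }
        parent
    }

    /// Number of distinct nodes, i.e. tokens that would need to be scored.
    pub fn len(&self) -> usize {
        self.tokens.len()
    }

    /// True when no candidate token has been inserted yet.
    pub fn is_empty(&self) -> bool {
        self.tokens.is_empty()
    }
}
```

Inserting `[1, 2, 3]`, `[1, 2, 4]`, and `[1, 5]` produces five nodes instead of eight tokens. The linear child lookup keeps the sketch short; a real implementation would likely index children per node.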
Reviewed changes
Copilot reviewed 74 out of 75 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/oarocr/structure.rs | Adds formula-only ORT session override, formula batching, and per-stage timing logs in structure pipeline. |
| README.md | Updates crates badge link and adds HSD overview + docs link. |
| oar-ocr-vl/src/utils/table.rs | Adds HTML→OTSL conversion for PaddleOCR-VL table drafts plus unit tests. |
| oar-ocr-vl/src/utils/image.rs | Adds safety documentation around UnsafeSlice Send/Sync and parallel write contract. |
| oar-ocr-vl/src/utils.rs | Makes Metal device parsing/macOS gating explicit and improves error messaging/tests cfgs. |
| oar-ocr-vl/src/unirec/model.rs | Removes UniRec implementation (deleted). |
| oar-ocr-vl/src/unirec/mod.rs | Removes UniRec module exports (deleted). |
| oar-ocr-vl/src/unirec/decoder.rs | Removes UniRec decoder implementation (deleted). |
| oar-ocr-vl/src/unirec/config.rs | Removes UniRec config implementation (deleted). |
| oar-ocr-vl/src/paddleocr_vl/vision.rs | Adds chunked attention path for large seq lengths to avoid OOM while keeping small seq byte-stability. |
| oar-ocr-vl/src/paddleocr_vl/ernie.rs | Switches to TrimmableKvCache and adds HSD-only cache length/keep-index hooks. |
| oar-ocr-vl/src/mineru/text.rs | Switches to TrimmableKvCache and adds HSD-only cache length/keep-index hooks. |
| oar-ocr-vl/src/lightonocr/text.rs | Removes LightOnOCR text model (deleted). |
| oar-ocr-vl/src/lightonocr/processing.rs | Removes LightOnOCR image preprocessing (deleted). |
| oar-ocr-vl/src/lightonocr/mod.rs | Removes LightOnOCR module exports (deleted). |
| oar-ocr-vl/src/lightonocr/config.rs | Removes LightOnOCR config definitions (deleted). |
| oar-ocr-vl/src/lib.rs | Adds hsd module/feature wiring, root kv_trim module, and adjusts crate docs/exports accordingly. |
| oar-ocr-vl/src/hunyuanocr/processing.rs | Aligns resample behavior with upstream (ignores config resample) for output parity. |
| oar-ocr-vl/src/hunyuanocr/mod.rs | Re-exports HSD prompt helpers behind hsd feature. |
| oar-ocr-vl/src/hunyuanocr/llm.rs | Switches to TrimmableKvCache, aligns XDRoPE numerics (F32 mix then cast back), adds optional one-shot activation dumps, and adds HSD-only KV gather hooks. |
| oar-ocr-vl/src/hsd/types.rs | Introduces shared HSD/DSV configs, drafts, region kinds, and detailed timing/acceptance stats with tests. |
| oar-ocr-vl/src/hsd/prefix_tree.rs | Adds prefix-tree builder for candidate suffix batching with tests. |
| oar-ocr-vl/src/hsd/mod.rs | Adds HSD module entrypoint and re-exports core types + KV cache wrapper. |
| oar-ocr-vl/src/hsd/matching.rs | Adds sliding-window candidate collection + capping heuristics with extensive tests. |
| oar-ocr-vl/src/hsd/kv_trim.rs | Adds a trim/gather-capable KV cache wrapper with unit tests. |
| oar-ocr-vl/src/hsd/backend_util.rs | Adds shared helper utilities for pos-id and KV keep-index construction with tests. |
| oar-ocr-vl/src/glmocr/text.rs | Switches to TrimmableKvCache and adds HSD-only cache length/keep-index hooks. |
| oar-ocr-vl/src/doc_parser.rs | Removes UniRec/LightOnOCR backends from DocParser integration and updates docs/comments accordingly. |
| oar-ocr-vl/README.md | Updates supported models list, clarifies DocParser scope, adds HSD docs + updated batch API examples, and removes UniRec docs. |
| oar-ocr-vl/examples/utils/structure_match.rs | Adds source-aware structure-result matching utility + tests for OmniDocBench drafting. |
| oar-ocr-vl/examples/utils/mod.rs | Adds HSD example helpers (config helpers, auto-tune, stats printing) and exports structure_match. |
| oar-ocr-vl/examples/unirec.rs | Removes UniRec example (deleted). |
| oar-ocr-vl/examples/mineru.rs | Updates MinerU example docs to use models/... paths. |
| oar-ocr-vl/examples/hunyuanocr.rs | Updates HunyuanOCR example docs to use models/... paths. |
| oar-ocr-vl/examples/glmocr.rs | Updates GLM-OCR example docs to use models/... paths. |
| oar-ocr-vl/examples/doc_parser.rs | Restricts DocParser example to layout-first backends and simplifies layout handling. |
| oar-ocr-vl/Cargo.toml | Adds hsd feature (implies cuda) and adds a dev-dependency on oar-ocr. |
| oar-ocr-vl/build.rs | Adds build-time guard to prevent enabling metal feature on non-macOS targets. |
| oar-ocr-core/src/utils/dict.rs | Adjusts docs to use relative paths and improves missing-file test portability. |
| oar-ocr-core/src/processors/layout_sorting.rs | Adds unit tests for enhanced layout sorting and child-block association behavior. |
| oar-ocr-core/src/processors/formula_preprocess.rs | Improves static regex construction and fixes LaTeX thin-space handling in normalization loop. |
| oar-ocr-core/src/processors/decode.rs | Switches static regex init to expect for clearer panic context. |
| oar-ocr-core/src/predictors/table_structure_recognition.rs | Refines model-family detection/mapping and documents wired vs wireless sizing defaults. |
| oar-ocr-core/src/predictors/formula_recognition.rs | Adds default batch_size to formula predictor config. |
| oar-ocr-core/src/models/recognition/unimernet.rs | Adds vocab sentinel filtering and tests to avoid decoding invalid/padded token ids. |
| oar-ocr-core/src/models/recognition/pp_formulanet.rs | Adds robust output selection for token-id tensor, vocab sentinel filtering, and tests. |
| oar-ocr-core/src/domain/tasks/layout_detection.rs | Clarifies score threshold docs and updates source/reference notes. |
| oar-ocr-core/src/domain/tasks/formula_recognition.rs | Adds batch_size to task config with validation + defaults. |
| oar-ocr-core/src/domain/adapters/preprocessing.rs | Changes doc code fences to non-runnable snippets to avoid doctest/rustdoc issues. |
| oar-ocr-core/src/domain/adapters/formula_recognition_adapter.rs | Adds timing/debug diagnostics, vocab-size-aware filtering, and batch_size plumbing into builder + adapter. |
| oar-ocr-core/src/core/macros.rs | Changes macro docs code fences to text to avoid doctest/rustdoc issues. |
| oar-ocr-core/src/core/inference/tensor_output.rs | Adds dtype_name() helper for logging/diagnostics. |
| oar-ocr-core/src/core/inference/mod.rs | Adds OrtInfer::output_shapes() helper for adapters needing output introspection. |
| oar-ocr-core/src/core/errors/mod.rs | Adjusts module docs to avoid intra-doc-link warnings. |
| oar-ocr-core/Cargo.toml | Removes unused dependencies (toml, html-escape). |
| examples/utils/visualization.rs | Switches system font loading to fontdb for better cross-platform behavior. |
| examples/table_structure_recognition.rs | Updates dict-path examples to be relative paths. |
| examples/structure.rs | Adds CLI arg to control formula recognition batch size and passes it into config. |
| docs/hsd.md | Adds detailed HSD algorithm/config documentation and usage guidance. |
| Cargo.toml | Bumps workspace version to 0.7.0 and adds fontdb dev-dependency at workspace root. |
| .github/workflows/test.yml | Adds rustdoc warning checks, feature-matrix job, and CUDA-backed hsd clippy/tests job. |
Comments suppressed due to low confidence (1)
oar-ocr-vl/Cargo.toml:67
`oar-ocr` is added as a path-only dev-dependency. This will typically break `cargo publish`/crates.io builds because path dependencies are not allowed in published manifests unless they also specify a `version` (Cargo can then rewrite the path dep to the registry version). Consider adding `version = "0.7.0"` here or moving `oar-ocr` into `[workspace.dependencies]` and using `oar-ocr.workspace = true`.
```toml
[dev-dependencies]
clap = { version = "4.5.42", features = ["derive"] }
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
criterion = { version = "0.8", features = ["html_reports"] }
hayro = "0.6"
oar-ocr = { path = "..", default-features = false }
```
Code Review
This pull request introduces Hierarchical Speculative Decoding (HSD) to the oar-ocr-vl crate, providing a training-free CUDA acceleration path for document parsing VLMs. Key changes include the addition of a TrimmableKvCache to support HSD's tree-verification requirements, new HSD-specific backends for supported models (HunyuanOCR, GLM-OCR, MinerU, PaddleOCR-VL), and utilities for converting between HTML and OTSL table formats. My feedback highlights potential performance bottlenecks in the repetition penalty implementation and KV-cache management, as well as concerns regarding the robustness of regex-based HTML parsing for table conversion.
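As a rough illustration of what "trimmable" means for a KV cache in this context, the sketch below models one layer's cache as a list of per-position rows and supports the two operations the review mentions: truncating to a shorter length and gathering an arbitrary set of kept positions. `ToyKvCache` is a hypothetical stand-in written for this comment; the real `TrimmableKvCache` operates on device tensors and its API may differ.

```rust
/// Hypothetical stand-in for one layer's KV cache: one row per cached position.
#[derive(Debug, Default, Clone)]
pub struct ToyKvCache {
    rows: Vec<Vec<f32>>,
}

impl ToyKvCache {
    /// Appends the key (or value) vector for the next position.
    pub fn append(&mut self, row: Vec<f32>) {
        self.rows.push(row);
    }

    /// Number of cached positions.
    pub fn len(&self) -> usize {
        self.rows.len()
    }

    /// True when nothing is cached.
    pub fn is_empty(&self) -> bool {
        self.rows.is_empty()
    }

    /// Row stored at position `i`, if present.
    pub fn row(&self, i: usize) -> Option<&[f32]> {
        self.rows.get(i).map(|r| r.as_slice())
    }

    /// Drops every position at or beyond `new_len`, e.g. to roll back rejected tokens.
    pub fn trim(&mut self, new_len: usize) {
        self.rows.truncate(new_len);
    }

    /// Keeps only the positions listed in `keep`, in that order.
    /// Fails without modifying the cache if any index is out of range.
    pub fn gather(&mut self, keep: &[usize]) -> Result<(), String> {
        if let Some(&bad) = keep.iter().find(|&&i| i >= self.rows.len()) {
            return Err(format!(
                "keep index {bad} out of range for cache of length {}",
                self.rows.len()
            ));
        }
        self.rows = keep.iter().map(|&i| self.rows[i].clone()).collect();
        Ok(())
    }
}
```

After a tree verification step, `gather` would keep the committed prefix plus the accepted branch, while `trim` covers the simpler case where only a tail is discarded.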
/gemini review
Code Review
This pull request implements Hierarchical Speculative Decoding (HSD) for VLM acceleration, adds support for GLM-OCR and MinerU2.5, and includes significant numerical fixes for HunyuanOCR to achieve parity with upstream implementations. Additionally, it introduces batching for formula recognition, improves font loading via fontdb, and removes the UniRec and LightOnOCR backends. Feedback highlights opportunities to improve performance by caching environment variable lookups in vision encoder forward passes and reducing code duplication in the HunyuanOCR prefill logic.
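The suggestion about caching environment variable lookups can be sketched with `std::sync::OnceLock`, so the environment is read once rather than on every forward pass. The variable name `EXAMPLE_DUMP_ACTIVATIONS` and the function `dump_enabled` are placeholders invented for this example, not names used by the PR.

```rust
use std::sync::OnceLock;

/// Hypothetical variable name used only for this illustration.
const DUMP_ENV: &str = "EXAMPLE_DUMP_ACTIVATIONS";

/// Reads the flag from the environment on the first call and reuses the
/// cached answer on every later call.
pub fn dump_enabled() -> bool {
    static ENABLED: OnceLock<bool> = OnceLock::new();
    *ENABLED.get_or_init(|| std::env::var_os(DUMP_ENV).is_some())
}
```

A hot path would then call `dump_enabled()` instead of `std::env::var(...)`. The trade-off is that changes to the variable after the first call are no longer observed.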
/gemini review
Code Review
This pull request upgrades the project to version 0.7.0, introducing support for HunyuanOCR, GLM-OCR, and MinerU2.5 models while removing the UniRec implementation. A major addition is Hierarchical Speculative Decoding (HSD) for CUDA-accelerated inference, supported by architectural updates like a trimmable KV cache. The update also includes critical fixes for vision encoder drift and optimizations for large-image attention. Feedback from the review suggests several technical improvements: optimizing the HSD verification loop by performing argmax on-device to reduce host-transfer overhead, ensuring consistency between Stage 1 and Stage 2 draft surfaces, hoisting redundant RoPE operations out of transformer layers, and improving the robustness of the HTML table attribute parser.
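For the point about the HTML table attribute parser, one possible direction is a small hand-written scanner that walks attributes one at a time, so that quoting style, spacing around `=`, and attribute-name case do not matter, and a name appearing inside another attribute's value is not mistaken for the attribute itself. `attr_value` is a hypothetical helper written for this comment; it does not decode entities and is not a full HTML parser.

```rust
/// Returns the value of attribute `name` within a tag's attribute text,
/// e.g. ` colspan="2" rowspan='3'`. Name matching ignores ASCII case.
pub fn attr_value<'a>(attrs: &'a str, name: &str) -> Option<&'a str> {
    let mut rest = attrs.trim_start();
    while !rest.is_empty() {
        // The attribute name runs until whitespace or '='.
        let name_end = rest
            .find(|c: char| c.is_whitespace() || c == '=')
            .unwrap_or(rest.len());
        let (key, after) = rest.split_at(name_end);
        let after = after.trim_start();
        let (value, remaining) = if let Some(v) = after.strip_prefix('=') {
            let v = v.trim_start();
            match v.chars().next() {
                // Quoted value: read up to the matching quote.
                Some(q @ ('"' | '\'')) => {
                    let body = &v[1..];
                    match body.find(q) {
                        Some(end) => (&body[..end], &body[end + 1..]),
                        // Unterminated quote: treat the rest as the value.
                        None => (body, ""),
                    }
                }
                // Unquoted value: read up to the next whitespace.
                _ => {
                    let end = v.find(char::is_whitespace).unwrap_or(v.len());
                    (&v[..end], &v[end..])
                }
            }
        } else {
            // Attribute without a value.
            ("", after)
        };
        if key.eq_ignore_ascii_case(name) {
            return Some(value);
        }
        rest = remaining.trim_start();
    }
    None
}
```

With this shape, `class="colspan"` does not produce a false match for `colspan`, which is a typical failure mode of substring or loosely anchored regex searches.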
/gemini review
Code Review
This pull request introduces Hierarchical Speculative Decoding (HSD) to improve document parsing performance by leveraging a lightweight drafter pipeline and a target VLM verification path. Key changes include the addition of a TrimmableKvCache to support HSD's tree-verification requirements, a new hsd module for draft-target matching and prefix-tree construction, and integration across multiple VLM backends (HunyuanOCR, GLM-OCR, MinerU, PaddleOCR-VL). I have identified several areas for improvement, including correcting a Rust version typo in Cargo.toml, optimizing GPU-to-CPU data transfers in the HSD verification driver, and refactoring unstable let_chains syntax to ensure broader compiler compatibility.
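On the `let_chains` point: `if let ... && ...` chains were stabilized in Rust 1.88 for the 2024 edition, so code using them does not build on older toolchains or earlier editions. A chain can be rewritten with nested `if let` blocks without changing behavior. The function below is an invented example (`explicit_span` is not from the PR) showing the portable form, with the chained form kept as a comment for comparison.

```rust
/// Parses a span attribute such as `colspan`, returning it only when it is
/// a valid integer greater than one.
pub fn explicit_span(raw: Option<&str>) -> Option<u32> {
    // With let chains this would read:
    //     if let Some(text) = raw
    //         && let Ok(n) = text.trim().parse::<u32>()
    //         && n > 1
    //     { return Some(n); }
    // The nested form below is equivalent and compiles on older toolchains.
    if let Some(text) = raw {
        if let Ok(n) = text.trim().parse::<u32>() {
            if n > 1 {
                return Some(n);
            }
        }
    }
    None
}

/// The same logic written with `Option` combinators, as an alternative.
pub fn explicit_span_combinators(raw: Option<&str>) -> Option<u32> {
    raw.and_then(|text| text.trim().parse::<u32>().ok())
        .filter(|&n| n > 1)
}
```

Which form reads better depends on how much work happens in the innermost branch; the combinator version is compact when the body is a single value.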
/gemini review
Code Review
This pull request introduces Hierarchical Speculative Decoding (HSD), a training-free CUDA acceleration scheme for Vision-Language Model (VLM) backends including HunyuanOCR, GLM-OCR, and MinerU2.5. The implementation includes a new hsd module for prefix-tree batching and speculative verification, alongside a TrimmableKvCache to support rolling back unaccepted tokens. Significant improvements were made to HunyuanOCR to fix numerical drift through the use of gelu_erf and F32 attention in late layers, and PaddleOCR-VL now utilizes chunked vision attention to handle large images without exceeding VRAM. Review feedback highlights opportunities to improve performance by reducing redundant data transfers between device and host during logit processing, optimizing tensor selection in pp_formulanet.rs into a single pass, and ensuring robustness against zero-sized reshapes in the HSD driver.
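The "rolling back unaccepted tokens" step can be illustrated with the usual greedy acceptance rule from speculative decoding: accept draft tokens for as long as they agree with the target model's own top prediction, then keep the cache only up to that point. This is a generic sketch under that assumption, not the PR's verification driver, and both function names are made up for the example.

```rust
/// Counts how many leading draft tokens match the target model's greedy picks.
/// `target_argmax[i]` is assumed to be the target's prediction for the
/// position that `draft[i]` occupies.
pub fn accepted_prefix_len(draft: &[u32], target_argmax: &[u32]) -> usize {
    draft
        .iter()
        .zip(target_argmax)
        .take_while(|(d, t)| d == t)
        .count()
}

/// Cache length to keep after verification: the already committed positions
/// plus the accepted draft tokens. Everything beyond it would be trimmed.
pub fn kv_len_after_verify(committed: usize, draft: &[u32], target_argmax: &[u32]) -> usize {
    committed + accepted_prefix_len(draft, target_argmax)
}
```

An empty draft yields zero accepted tokens, so a caller could check for that case up front and skip the verification pass, which is one way to avoid the zero-sized reshape the review mentions.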
This pull request introduces Hierarchical Speculative Decoding (HSD) as a CUDA-accelerated inference path for Vision-Language Models (VLMs) in the `oar-ocr-vl` crate, updates documentation and usage guides, and improves CI coverage for new features. It also bumps the workspace version to `0.7.0` and updates dependencies accordingly. The most important changes are:
Major Feature: Hierarchical Speculative Decoding (HSD)
- Added `docs/hsd.md`, explaining the HSD algorithm, configuration, and usage, including references to the original paper and implementation-specific details.
- Updated `README.md` to introduce HSD, explain its acceleration benefits, and link to the new documentation.
Vision-Language Model (VLM) and Documentation Updates
- Updated `README.md` and `docs/usage.md` to document support for HunyuanOCR, GLM-OCR, and HSD, and improved code samples to use batch-oriented APIs.
Continuous Integration and Testing
- Added an `hsd` job to the GitHub Actions workflow that installs the CUDA toolkit and runs clippy and tests for the HSD feature, ensuring CI coverage for CUDA-dependent code paths.
- Added a `feature-matrix` job to test all feature combinations for `oar-ocr` and `oar-ocr-core` in CI.
Version and Dependency Updates
- Bumped the workspace version to `0.7.0` and updated dependencies in `Cargo.toml` to match, including `oar-ocr-core` and `oar-ocr-derive`.
- Added a `fontdb` dependency to `Cargo.toml` for improved font handling.
Miscellaneous
These changes collectively introduce a major new acceleration feature, improve documentation and usability, and ensure robust CI coverage for new and existing features.