GreatV · GreatV · May 15, 2026 · May 7, 2026 · May 11, 2026 · May 14, 2026
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -33,6 +33,34 @@ jobs:
           cargo clippy --all-targets --all -- -D warnings
           cargo clippy --all-targets -p oar-ocr-vl -- -D warnings
 
+      - name: Check rustdoc warnings
+        env:
+          RUSTDOCFLAGS: -D warnings
+        run: cargo doc --workspace --no-deps
+
+  feature-matrix:
+    name: Feature matrix (${{ matrix.package }})
+    needs: preflight
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        include:
+          - package: oar-ocr
+          - package: oar-ocr-core
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Install Rust toolchain
+        uses: dtolnay/rust-toolchain@stable
+
+      - name: Cache Cargo dependencies
+        uses: Swatinem/rust-cache@v2
+
+      - name: Check with all features enabled
+        run: cargo check -p ${{ matrix.package }} --all-features
+
   test:
     name: Test (${{ matrix.os }})
     needs: preflight

diff --git a/Cargo.toml b/Cargo.toml
@@ -3,7 +3,7 @@ members = [".", "oar-ocr-derive", "oar-ocr-core", "oar-ocr-vl"]
 resolver = "2"
 
 [workspace.package]
-version = "0.6.3"
+version = "0.7.0"
 edition = "2024"
 rust-version = "1.95"
 license = "Apache-2.0"
@@ -12,8 +12,8 @@ repository = "https://github.com/greatv/oar-ocr"
 homepage = "https://github.com/greatv/oar-ocr"
 
 [workspace.dependencies]
-oar-ocr-core = { version = "0.6.3", path = "oar-ocr-core", default-features = false }
-oar-ocr-derive = { version = "0.6.3", path = "oar-ocr-derive", default-features = false }
+oar-ocr-core = { version = "0.7.0", path = "oar-ocr-core", default-features = false }
+oar-ocr-derive = { version = "0.7.0", path = "oar-ocr-derive", default-features = false }
 
 [package]
 name = "oar-ocr"
@@ -68,5 +68,6 @@ tracing-subscriber = { version = "0.3", features = ["env-filter"] }
 clap = { version = "4.5.42", features = ["derive"] }
 tempfile = "3.19"
 ab_glyph = "0.2"
+fontdb = "0.23"
 hayro = "0.5"
 regex = "1"
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 # OAR-OCR
 
-![Crates.io Version](https://img.shields.io/crates/v/oar-ocr)
+[![Crates.io Version](https://img.shields.io/crates/v/oar-ocr)](https://crates.io/crates/oar-ocr)
 ![Crates.io Downloads (recent)](https://img.shields.io/crates/dr/oar-ocr)
 [![dependency status](https://deps.rs/repo/github/GreatV/oar-ocr/status.svg)](https://deps.rs/repo/github/GreatV/oar-ocr)
 ![GitHub License](https://img.shields.io/github/license/GreatV/oar-ocr)
@@ -84,12 +84,17 @@ fn main() -> Result<(), Box<dyn std::error::Error>> {
 
 ## Vision-Language Models (VLM)
 
-For advanced document understanding using Vision-Language Models (like PaddleOCR-VL, **PaddleOCR-VL-1.5**, UniRec, and MinerU2.5), check out the [`oar-ocr-vl`](oar-ocr-vl/README.md) crate.
+For advanced document understanding using Vision-Language Models (like PaddleOCR-VL, **PaddleOCR-VL-1.5**, GLM-OCR, HunyuanOCR, and MinerU2.5), check out the [`oar-ocr-vl`](oar-ocr-vl/README.md) crate.
+
+### Hierarchical Speculative Decoding (HSD)
+
+`oar-ocr-vl` ships a training-free CUDA acceleration scheme for the VLM backbones above. A cheap pipeline drafter (layout + OCR) proposes text candidates and the target VLM verifies them in batches via tree-attention, typically delivering several-fold wall-time speedups on document-heavy pages at `τ = 0.75`. Build with `--features hsd` (implies `cuda`); see [`docs/hsd.md`](docs/hsd.md) for the algorithm overview, config knobs, supported backbones, and AAL guidance.
 
 ## Documentation
 
 - [**Usage Guide**](docs/usage.md) - Detailed API usage, builder patterns, GPU configuration
 - [**Pre-trained Models**](docs/models.md) - Model download links and recommended configurations
+- [**HSD**](docs/hsd.md) - Hierarchical Speculative Decoding for VLM inference
 
 ## Examples
 
@@ -117,6 +122,4 @@ This project builds upon the excellent work of several open-source projects:
 
 - **[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)**: Baidu's awesome multilingual OCR toolkits based on PaddlePaddle. This project utilizes PaddleOCR's pre-trained models, which provide excellent accuracy and performance for text detection and recognition across multiple languages.
 
-- **[OpenOCR](https://github.com/Topdu/OpenOCR)**: An open-source toolkit for general OCR research and applications by the FVL Laboratory at Fudan University. We use the UniRec model for unified text, formula, and table recognition.
-
 - **[Candle](https://github.com/huggingface/candle)**: A minimalist ML framework for Rust by Hugging Face. We use Candle to implement Vision-Language model inference.
diff --git a/docs/hsd.md b/docs/hsd.md
@@ -0,0 +1,97 @@
+# Hierarchical Speculative Decoding (HSD)
+
+HSD is an optional CUDA acceleration path for document VLM decoding. It leaves the target model unchanged. A cheaper document pipeline — the paper uses PP-StructureV3 (layout analysis + element recognition) — prepares draft text once per page. The VLM then verifies those drafts with tree-batched speculative decoding and commits only accepted tokens.
+
+Reference: Liao et al., *"HSD: Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding"* (arXiv:2602.12957). Section references below cite that paper.
+
+## When to use it
+
+HSD helps when the draft is close to what the VLM would generate on its own. That is common on text-heavy pages, tables with regular structure, and repeated document boilerplate. A good draft lets one verify pass accept several tokens.
+
+It is not a general CPU speedup. The implementation is intended for CUDA, where a wider tree-attention verify pass is cheap compared with repeated single-token decoding. On CPU or Metal, the verify work is effectively serialized and the benefit usually disappears.
+
+The paper defines the acceptance threshold on the open interval $\tau \in (0, 1)$ (§3.2). Lower values accept more near-tie tokens, which can improve speed but may change the output. This implementation also accepts `tau = 1.0` as a degenerate boundary: at $\tau = 1.0$ the acceptance test collapses to "child equals the unrestricted argmax", so HSD follows the target model's greedy path. That extension is not part of the paper.
+
+## Document flow
+
+The document-level path has two stages (§3.1):
+
+- **Stage 1: region-level local verification.** For each region $r_i \in \mathcal{R}$, the target VLM verifies the region draft set $\tilde{\mathcal{Y}}^{(i)}$ on the cropped image $z_i = x|_{r_i}$:
+   $$\hat{y}^{(i)} = \mathrm{SpecDecode}(p_\theta, z_i, \tilde{\mathcal{Y}}^{(i)}).$$
+- **Stage 2: page-level global verification.** Stage 1 outputs are aggregated into an unordered page-level draft set
+   $$\tilde{\mathcal{Y}}^{\mathrm{pg}} = \{\hat{y}^{(i)} \mid r_i \in \mathcal{R}\},$$
+   which the target VLM then verifies in a single full-page pass: $\hat{y}^{\mathrm{pg}} = \mathrm{SpecDecode}(p_\theta, x, \tilde{\mathcal{Y}}^{\mathrm{pg}})$. Because the matcher scans each $\hat{y}^{(i)}$ independently, draft order has no semantic effect; the target model resolves the final reading order during verification.
+
+Backends that implement the full document path can turn either stage off through `HsdConfig`. PaddleOCR-VL is not evaluated in the paper; in this implementation it stays element-oriented by model design and uses only the region path.
+
+## One SpecDecode step
+
+For the accepted prefix $\hat{y}_{1:t}$ and a draft set $\tilde{\mathcal{Y}}$ (§3.2):
+
+1. **Draft-target matching.** Let the reference window be the most recent $n$ accepted tokens, $w = \hat{y}_{t-n+1:t}$. For each draft $\tilde{y} \in \tilde{\mathcal{Y}}$, let $\mathcal{J}(\tilde{y})$ be the set of start indices $j$ with $\tilde{y}_{j:j+n-1} = w$. Collect the suffixes that follow each match:
+   $$\mathcal{C} = \big\{\, \tilde{y}_{j+n:|\tilde{y}|} \,\big|\, \tilde{y} \in \tilde{\mathcal{Y}},\; j \in \mathcal{J}(\tilde{y}),\; j + n \le |\tilde{y}|\,\big\}.$$
+2. **Prefix-tree batching.** Merge $\mathcal{C}$ into a prefix tree $\mathcal{T}$ whose root represents the empty prefix and whose every root-to-leaf path is one element of $\mathcal{C}$. For a node $v$, $\pi(v)$ is the token sequence on the path root → $v$, and $\mathrm{Next}(v) = \{c_{|\pi(v)|+1} \mid c \in \mathcal{C},\; |c| > |\pi(v)|,\; c_{1:|\pi(v)|} = \pi(v)\}$ is the set of distinct next tokens shared by candidates that pass through $v$ and continue beyond it.
+3. **One tree-batched forward.** Linearize $\mathcal{T}$ into a packed sequence and run the target VLM under a tree-ancestry attention mask: a token at node $v$ attends only to $\hat{y}_{1:t}$ and to the tokens on $v$'s ancestor path. This produces $p_\theta(\cdot \mid z, \hat{y}_{1:t} \oplus \pi(v))$ at every node in one pass.
+4. **Greedy traversal with the $\tau$-test.** Start at the root $s$. At each step, select the best child token in the tree's local candidate set and compare it with the unrestricted argmax over the full vocabulary $\mathcal{V}$:
+   $$u^\star = \arg\max_{u \in \mathrm{Next}(s)} p_\theta(u \mid z, \hat{y}_{1:t} \oplus \pi(s)), \qquad \hat{u} = \arg\max_{u \in \mathcal{V}} p_\theta(u \mid z, \hat{y}_{1:t} \oplus \pi(s)).$$
+   Accept $u^\star$ and descend to its child node iff
+   $$\log p_\theta(u^\star \mid z, \hat{y}_{1:t} \oplus \pi(s)) - \log p_\theta(\hat{u} \mid z, \hat{y}_{1:t} \oplus \pi(s)) \ge \log \tau.$$
+   Stop when the test fails or when $\mathrm{Next}(s) = \emptyset$, that is, when $s$ is a leaf.
+5. **Bonus target token.** At the terminal node $s$, append the unrestricted argmax $\hat{u}$ to extend the accepted sequence by one extra target token:
+   $$\hat{y}_{1:t_\mathrm{new}} = \hat{y}_{1:t} \oplus \pi(s) \oplus \hat{u}.$$
+6. **Commit KV state.** Gather the KV cache so it keeps only the accepted prefix and the path through $s$, then continue decoding from $\hat{u}$.
+
+If $\mathcal{C} = \emptyset$ (no draft matches the current window), $\mathcal{T}$ contains only the root, $\mathrm{Next}(\mathrm{root}) = \emptyset$, the traversal stops immediately, and step 5 falls back to a single greedy token — the paper's algorithm with no special case.
+
+## Correctness at `tau = 1.0`
+
+The paper proves training-free, near-lossless acceleration over its stated domain $\tau \in (0, 1)$. This implementation also exposes $\tau = 1.0$ as a degenerate boundary: with $\log \tau = 0$, the acceptance test in step 4 reduces to $u^\star = \hat{u}$, so a child token is accepted only when it coincides with the unrestricted argmax. The committed sequence is then independent of the drafter and identical to the target model's greedy decode.
+
+By default this is enforced via a strict replay path (`strict_at_tau_one = true`, see Configuration). That replay path is an OAR-side correctness oracle, not part of the paper. Set `strict_at_tau_one = false` to keep $\tau = 1.0$ on the same tree-batched verify path the paper describes.
+
+The demo harness runs this oracle check and compares HSD output with baseline output byte-for-byte.
+
+## Reading AAL
+
+The main debug metric is **Average Acceptance Length (AAL)** (§4.2). At verification step $k$, let $\alpha_k$ be the number of consecutive draft tokens accepted before the first mismatch ($\alpha_k = 0$ on a full rejection). Over $N$ verification steps:
+
+$$\mathrm{AAL} = \frac{1}{N} \sum_{k=1}^{N} \alpha_k.$$
+
+The bonus target token appended at step 5 is not counted. Larger AAL means more decoding steps are saved per verify pass; the realized end-to-end speedup also depends on per-step verify overhead and parallel efficiency.
+
+For reference, the paper reports overall AAL on OmniDocBench v1.5 (Tab. 1) in the **2.5 to 4.6** range across the evaluated backbones (HunyuanOCR 4.55, dots.ocr 3.98, Qwen3-VL-8B 3.98, Qwen3-VL-2B 3.33, Qwen2.5-VL-7B 3.56, Qwen2.5-VL-3B 2.52). The ranges below are operational rules of thumb observed on this implementation, not paper numbers; use AAL as a draft-quality signal, not as a correctness metric:
+
+- `AAL` around `1`: the draft is doing little work. Check tokenization, window length, reading order, and whether the drafter output resembles the target output.
+- `AAL` from `3` to `6`: a normal range for many text-heavy pages with OCR drafts.
+- `AAL` from `8` to `15`: strong alignment, often from tables, lists, or repeated layout.
+- `AAL > 20`: usually a long exact span. Inspect output quality as well as speed.
+
+## Configuration
+
+`HsdConfig` controls the two-stage document path. Its `dsv` field contains the per-step speculative decoding knobs. The first two fields (`window_len`, `tau`) follow the paper's defaults (§4.3); the rest are OAR-side engineering knobs not present in the paper.
+
+| Field | Default | Source | Notes |
+|-------|---------|--------|-------|
+| `window_len` | `3` | paper §4.3 | Reference-window length $n$. Longer windows reduce false matches on repetitive text but also reduce draft hits. |
+| `tau` | `0.75` | paper §4.3 | Acceptance threshold. Paper domain is $\tau \in (0, 1)$; lower accepts more borderline tokens. `1.0` is a boundary extension that recovers greedy decoding. |
+| `max_candidates_per_step` | `32` | OAR addition | Bounds the number of candidate suffixes used to build each tree. The paper's ablations use uncapped trees. |
+| `max_suffix_len` | `256` | OAR addition | Bounds candidate depth so long drafts do not create oversized trees. |
+| `cold_start_full_draft` | `true` | OAR addition | Seeds the first step from draft prefixes before any accepted window exists. The paper's matcher has no cold-start fallback. |
+| `strict_at_tau_one` | `true` | OAR addition | When `true` and $\tau \ge 1.0$, route through a strict replay oracle. Set `false` to keep $\tau = 1.0$ on the paper's tree-batched verify path. |
+
+The candidate caps are guardrails for long tables, formulas, and repeated boilerplate. To reproduce a paper-faithful matcher, set both caps to `usize::MAX`, `cold_start_full_draft = false`, and `strict_at_tau_one = false`.
+
+## Running it
+
+Build the VLM crate with the `hsd` feature. It enables CUDA transitively:
+
+```bash
+cargo run -p oar-ocr-vl --release --features hsd,download-binaries \
+    --example hsd_demo -- \
+    --backend hunyuanocr \
+    --model-dir models/HunyuanOCR \
+    --device cuda \
+    --image document.jpg
+```
+
+The supported backbones expose `generate_hsd*` methods next to their baseline `generate` methods: `PaddleOcrVl`, `HunyuanOcr`, `GlmOcr`, and `MinerU`.
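
As a reading aid for the acceptance rule in step 4 of `docs/hsd.md`, here is a minimal sketch of the $\tau$-test at a single tree node. It is not code from `oar-ocr-vl`: the function name `accept_child`, its signature, and the use of a plain slice of log-probabilities indexed by token id are all assumptions made for illustration.

```rust
/// Hypothetical helper (not part of oar-ocr-vl): the step-4 acceptance test
/// at one tree node `s`.
///
/// `log_probs[u]` is assumed to hold log p_theta(u | z, y_hat ⊕ pi(s)) for
/// every token id `u` in the vocabulary, and `children` lists the token ids
/// in Next(s). Returns the accepted child token, or `None` when traversal
/// should stop at `s`.
fn accept_child(log_probs: &[f32], children: &[usize], tau: f32) -> Option<usize> {
    // u*: the most likely token among the tree's local candidates.
    // An empty candidate set (a leaf) yields `None` through `?`.
    let best_child = children
        .iter()
        .copied()
        .max_by(|&a, &b| log_probs[a].total_cmp(&log_probs[b]))?;

    // Log-probability of u-hat, the unrestricted argmax over the vocabulary.
    let top = log_probs.iter().copied().fold(f32::NEG_INFINITY, f32::max);

    // Accept iff log p(u*) - log p(u-hat) >= log tau.
    (log_probs[best_child] - top >= tau.ln()).then_some(best_child)
}
```

With probabilities 0.5, 0.4, and 0.1 for tokens 0, 1, and 2 and `children = [1, 2]`, the best child is token 1 with a probability ratio of 0.8 against the argmax. That passes at `tau = 0.75` and fails at `tau = 0.9`. At `tau = 1.0` a child is accepted only when it is itself the argmax, which matches the boundary behaviour the document describes.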