Merged
4 changes: 4 additions & 0 deletions Cargo.toml
@@ -48,6 +48,10 @@ openvino = ["oar-ocr-core/openvino"]
# Set ORT_LIB_LOCATION to use a custom installation (skips download).
# Use --no-default-features for offline/enterprise environments.
download-binaries = ["oar-ocr-core/download-binaries"]
# Auto-download OCR model files at runtime from ModelScope into `$OAR_HOME`
# (default `~/.oar`). When enabled, builder methods that take a model path
# accept either a filesystem path or a bare registered file name.
auto-download = ["oar-ocr-core/auto-download"]

[dependencies]
# Core library with all types and predictors
8 changes: 8 additions & 0 deletions README.md
@@ -21,6 +21,14 @@ With GPU support:
cargo add oar-ocr --features cuda
```

With auto-download of model files from ModelScope:

```bash
cargo add oar-ocr --features auto-download
```

Bare file names passed to the builders are then fetched from [`greatv/oar-ocr` on ModelScope](https://www.modelscope.cn/models/greatv/oar-ocr) into `$OAR_HOME` (default `~/.oar`) and verified against their expected SHA-256. See [docs/models.md](docs/models.md#auto-download-via-the-auto-download-feature) for the exact path resolution rules.

### Basic Usage

```rust
67 changes: 37 additions & 30 deletions docs/models.md
@@ -1,6 +1,6 @@
# Pre-trained Models

OAROCR provides pre-trained models for OCR and document understanding tasks. Download them from the [GitHub Releases](https://github.com/GreatV/oar-ocr/releases) page.
OAROCR provides pre-trained models for OCR and document understanding tasks. Download them manually from the [GitHub Releases](https://github.com/GreatV/oar-ocr/releases) page (linked in the tables below), or have the library fetch them on demand from ModelScope — see [§ Auto-download](#auto-download-via-the-auto-download-feature) at the bottom.

## Text Detection Models

@@ -155,41 +155,48 @@ Models for document structure analysis with `OARStructureBuilder`:
| Seal Detection (Mobile) | [`pp-ocrv4_mobile_seal_det.onnx`](https://github.com/GreatV/oar-ocr/releases/download/v0.3.0/pp-ocrv4_mobile_seal_det.onnx) | 4.6MB | Fast seal detection |
| Seal Detection (Server) | [`pp-ocrv4_server_seal_det.onnx`](https://github.com/GreatV/oar-ocr/releases/download/v0.3.0/pp-ocrv4_server_seal_det.onnx) | 108.2MB | Accurate seal detection |

## Recommended Configurations

### Fast Processing (Real-time)
## Auto-download (via the `auto-download` feature)

```bash
cargo add oar-ocr --features auto-download
```
Detection: pp-ocrv5_mobile_det.onnx
Recognition: pp-ocrv5_mobile_rec.onnx
Dictionary: ppocrv5_dict.txt

```rust,no_run
use oar_ocr::prelude::*;
let ocr = OAROCRBuilder::new(
"pp-ocrv5_mobile_det.onnx", // bare name → resolved via registry
"pp-ocrv5_mobile_rec.onnx",
"ppocrv5_dict.txt",
).build()?;
# Ok::<(), Box<dyn std::error::Error>>(())
```

### High Accuracy
When the feature is enabled, registered file names are fetched from [`greatv/oar-ocr` on ModelScope](https://www.modelscope.cn/models/greatv/oar-ocr) into `$OAR_HOME` (default `~/.oar`) and verified against the expected SHA-256 before use. Subsequent runs reuse the cached copy. The bundled registry lives at [`oar_ocr::download::REGISTRY`](../oar-ocr-core/src/core/download/registry.rs).

```
Detection: pp-ocrv5_server_det.onnx
Recognition: pp-ocrv5_server_rec.onnx
Dictionary: ppocrv5_dict.txt
```
### Path resolution rules

### Document Processing
For each model path argument the builder applies these checks in order:

```
Detection: pp-ocrv4_server_det.onnx
Recognition: pp-ocrv4_server_rec_doc.onnx
Dictionary: ppocrv4_doc_dict.txt
Orientation: pp-lcnet_x1_0_doc_ori.onnx
Rectification: uvdoc.onnx
```
1. **Existing file wins.** If the path refers to a real file on disk it is used as-is — no registry lookup, no hash check, no network. A `./pp-ocrv5_mobile_det.onnx` next to the binary always shadows the registry.
2. **Only bare names or `$OAR_HOME`-rooted paths are eligible for auto-download.** A path is considered for registry resolution only when it has no parent component (e.g. `"pp-ocrv5_mobile_det.onnx"`) or when its parent equals the cache directory. Explicit paths like `./models/foo.onnx` or `/data/foo.onnx` are returned verbatim even if their file name is registered — the library never silently overrides an explicit path.
3. **Registry hit → cache or download.** If the file name appears in `REGISTRY`:
- `$OAR_HOME/<name>` exists with matching size + SHA-256 → cached copy is used (no network).
- Cached copy is missing or its hash mismatches → download from ModelScope, verify SHA-256, atomically replace.
4. **Unregistered + missing → returned verbatim** so the builder produces its normal "model not found" error.

### Document Structure Analysis
| Input | On disk | Behaviour |
|---|---|---|
| `"pp-ocrv5_mobile_det.onnx"` | `./pp-ocrv5_mobile_det.onnx` exists | Use the local CWD file |
| `"pp-ocrv5_mobile_det.onnx"` | `$OAR_HOME/...` exists, hash OK | Use cached copy, no network |
| `"pp-ocrv5_mobile_det.onnx"` | absent or hash mismatch | Download to `$OAR_HOME`, verify, use |
| `"./models/det.onnx"` | absent | Returned as-is → "model not found" |
| `"$OAR_HOME/pp-ocrv5_mobile_det.onnx"` (absolute) | (any) | Parent equals the cache dir → same as bare name |
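
The rules and table above can be read as a small decision function. The following is only a sketch of that order of checks, not the library's implementation: `Resolution`, `resolve`, and the injected `exists` / `registered` closures are hypothetical names introduced here so the logic can be exercised without touching the disk or the network.

```rust
use std::path::{Path, PathBuf};

/// Outcome of resolving one model-path argument (hypothetical type).
#[derive(Debug, PartialEq)]
enum Resolution {
    /// Use the path exactly as given: existing file, explicit path,
    /// or a name that is not in the registry.
    Verbatim(PathBuf),
    /// Registered name: use `<cache_dir>/<name>`, downloading and
    /// verifying it first if the cached copy is missing or stale.
    CacheOrDownload(PathBuf),
}

/// Sketch of the documented check order. `exists` stands in for a
/// filesystem probe and `registered` for a lookup in the registry.
fn resolve(
    input: &Path,
    cache_dir: &Path,
    exists: impl Fn(&Path) -> bool,
    registered: impl Fn(&str) -> bool,
) -> Resolution {
    // Rule 1: an existing file always wins.
    if exists(input) {
        return Resolution::Verbatim(input.to_path_buf());
    }
    // Rule 2: only bare names (empty parent) or paths whose parent is
    // the cache directory are eligible. Paths are compared verbatim,
    // so `~` is never expanded here.
    let eligible = match input.parent() {
        Some(parent) => parent.as_os_str().is_empty() || parent == cache_dir,
        None => false,
    };
    // Rules 3 and 4: a registry hit goes to the cache; anything else
    // is returned unchanged so the builder reports "model not found".
    match input.file_name().and_then(|n| n.to_str()) {
        Some(name) if eligible && registered(name) => {
            Resolution::CacheOrDownload(cache_dir.join(name))
        }
        _ => Resolution::Verbatim(input.to_path_buf()),
    }
}
```

Under these assumptions, a bare registered name and the same name rooted at the cache directory both map to `CacheOrDownload`, while `./models/det.onnx` comes back as `Verbatim` even when its file name is registered.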

```
Layout: pp-doclayout_plus-l.onnx
Table Classification: pp-lcnet_x1_0_table_cls.onnx
Table Structure (Wired): slanext_wired.onnx
Table Structure (Wireless): slanet_plus.onnx
Table Structure Dict: table_structure_dict_ch.txt
Formula: pp-formulanet_plus-l.onnx
```
Note: the resolver compares paths verbatim — `~` is not expanded. Pass a bare filename, an absolute path under `$OAR_HOME`, or let your shell expand `~` for you.

### Cache layout

- Override the cache root with the `OAR_HOME` environment variable. Defaults to `~/.oar` (resolved via the platform home directory; the literal `~` is not expanded by the library).
- Files land at `$OAR_HOME/<name>`, flat (no per-revision subdirectories).
- Downloads stream into a unique `$OAR_HOME/.<name>.<pid>.<n>.part` and are renamed atomically once the SHA-256 matches, so a crash mid-download won't poison the cache and concurrent processes don't clobber each other.
- After verification a `$OAR_HOME/.<name>.sha256` sidecar records the verified hash. Future loads with a matching cache file + sidecar skip the multi-second rehash; deleting the sidecar forces a fresh hash check.
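
The write-then-rename pattern described in the list above can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the crate's code: `part_path` and `store_verified` are hypothetical helpers, the payload is passed in as bytes instead of being streamed over HTTPS, and the `verify` closure stands in for the SHA-256 comparison.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU64, Ordering};

/// Per-process counter so two downloads of the same name never share a temp file.
static COUNTER: AtomicU64 = AtomicU64::new(0);

/// Builds a unique temp path of the form `.<name>.<pid>.<n>.part` in `cache_dir`.
fn part_path(cache_dir: &Path, name: &str) -> PathBuf {
    let n = COUNTER.fetch_add(1, Ordering::Relaxed);
    cache_dir.join(format!(".{name}.{}.{n}.part", std::process::id()))
}

/// Writes `bytes` to a temp file and renames it to `<cache_dir>/<name>` only
/// if `verify` accepts the content. `verify` is a stand-in for a hash check.
fn store_verified(
    cache_dir: &Path,
    name: &str,
    bytes: &[u8],
    verify: impl Fn(&[u8]) -> bool,
) -> io::Result<PathBuf> {
    fs::create_dir_all(cache_dir)?;
    let tmp = part_path(cache_dir, name);
    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(bytes)?;
        // Flush to disk before the rename makes the file visible.
        file.sync_all()?;
    }
    if !verify(bytes) {
        // Rejected content never reaches the final name.
        fs::remove_file(&tmp)?;
        return Err(io::Error::new(io::ErrorKind::InvalidData, "checksum mismatch"));
    }
    let dest = cache_dir.join(name);
    // The rename stays within one directory, so readers see either the
    // old file or the complete new one, never a partial write.
    fs::rename(&tmp, &dest)?;
    Ok(dest)
}
```

Because the temporary name includes the process ID and a counter, two processes fetching the same model write to different `.part` files, and whichever rename lands last leaves a fully verified file in place.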
68 changes: 68 additions & 0 deletions examples/auto_download.rs
@@ -0,0 +1,68 @@
//! Minimal OCR pipeline using the `auto-download` feature.
//!
//! Demonstrates how the high-level builder transparently fetches missing model
//! files from ModelScope. A bare file name (or a `$OAR_HOME`-rooted path) that
//! isn't an on-disk file but matches an entry in [`oar_ocr::download::REGISTRY`]
//! is downloaded into `$OAR_HOME` (default `~/.oar`) and verified against its SHA-256.
//!
//! # Build
//!
//! ```bash
//! cargo run --features auto-download --example auto_download -- <image.jpg>
//! ```
//!
//! Without the `auto-download` feature this example still compiles, but only to
//! a stub that prints a hint and exits with status 2, so the intent is explicit.

#[cfg(not(feature = "auto-download"))]
fn main() {
eprintln!(
"This example requires the `auto-download` feature.\n\
Re-run with: cargo run --features auto-download --example auto_download -- <image.jpg>"
);
std::process::exit(2);
}

#[cfg(feature = "auto-download")]
fn main() -> Result<(), Box<dyn std::error::Error>> {
use oar_ocr::oarocr::OAROCRBuilder;
use oar_ocr::utils::load_image;
use std::path::PathBuf;
use tracing_subscriber::EnvFilter;

tracing_subscriber::fmt()
.with_env_filter(
EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")),
)
.init();

let image: PathBuf = std::env::args()
.nth(1)
.map(PathBuf::from)
.ok_or("usage: auto_download <image>")?;

println!("OAR cache: {}", oar_ocr::download::cache_dir().display());

// Bare file names → resolved through the registry on `build()`.
// The first run downloads to ~/.oar (or $OAR_HOME); subsequent runs reuse
// the cached copies once they have been verified against their SHA-256.
let ocr = OAROCRBuilder::new(
"pp-ocrv5_mobile_det.onnx",
"pp-ocrv5_mobile_rec.onnx",
"ppocrv5_dict.txt",
)
.with_text_line_orientation_classification("pp-lcnet_x1_0_textline_ori.onnx")
.build()?;

let img = load_image(&image)?;
let results = ocr.predict(vec![img])?;
for (i, page) in results.iter().enumerate() {
println!("--- page {i} ---");
for region in &page.text_regions {
if let Some(ref text) = region.text {
println!("{text}");
}
}
}
Ok(())
}
8 changes: 8 additions & 0 deletions oar-ocr-core/Cargo.toml
@@ -29,6 +29,11 @@ webgpu = ["ort/webgpu"]
openvino = ["ort/openvino"]
# Auto-download ONNX Runtime binaries during build (enabled by default).
download-binaries = ["ort/download-binaries", "ort/tls-native"]
# Auto-download OCR model files at runtime from ModelScope into the on-disk
# cache (`$OAR_HOME` or `~/.oar`). When enabled, model paths handed to the
# OCR builders that do not exist on disk but match the bundled registry are
# resolved by fetching the file over HTTPS and verifying its SHA-256.
auto-download = ["dep:ureq", "dep:sha2", "dep:dirs"]

[dependencies]
oar-ocr-derive.workspace = true
@@ -47,6 +52,9 @@ tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
tokenizers = { version = "0.23", default-features = false, features = ["progressbar", "onig"] }
clipper2-rust = "1.0.3"
ureq = { version = "3.0", default-features = false, features = ["rustls", "platform-verifier"], optional = true }
sha2 = { version = "0.10", optional = true }
dirs = { version = "5.0", optional = true }

[dev-dependencies]
tempfile = "3.19"