DocOcr is a lightweight, pipeline-based PHP library that turns documents like PDF, CSV, and XLSX into structured data using Mistral's OCR API.
- Ingestion pipelines
- AI preprocessing
- Finance / accounting docs
- Backend automation
- Supports PDF, CSV, XLSX
- Normalization layer for OCR-friendly input
- OCR powered by Mistral AI
- Extracts structured content (pages, text, tables)
- Fluent pipeline API (normalize → ocr → toArray)
- PHPUnit automated testing
- Custom OCR client injection
Most OCR libraries return raw text blobs. DocOcr focuses on pipeline-friendly, structured extraction designed for backend systems, AI preprocessing, and financial workflows.
It handles:
- File normalization (CSV/XLSX → OCR-friendly layout)
- OCR execution
- Predictable output for downstream processing
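To make the normalization idea concrete, here is a minimal sketch of turning CSV text into a fixed-width, pipe-separated layout. The function name `csvToAlignedText` is hypothetical and is not part of DocOcr; the library's actual normalization may work quite differently.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not DocOcr's API): renders CSV text as an
 * aligned, pipe-separated table. Assumes one record per line, so
 * quoted fields containing line breaks are not handled.
 */
function csvToAlignedText(string $csv): string
{
    // Parse each non-empty line into an array of cells.
    $rows = [];
    foreach (preg_split('/\R/', trim($csv)) as $line) {
        if ($line !== '') {
            $rows[] = str_getcsv($line, ',', '"', '\\');
        }
    }

    // Find the widest cell (in bytes) for every column.
    $widths = [];
    foreach ($rows as $row) {
        foreach ($row as $i => $cell) {
            $widths[$i] = max($widths[$i] ?? 0, strlen((string) $cell));
        }
    }

    // Pad every cell to its column width and join with " | ".
    $lines = [];
    foreach ($rows as $row) {
        $cells = [];
        foreach ($widths as $i => $width) {
            $cells[] = str_pad((string) ($row[$i] ?? ''), $width);
        }
        $lines[] = rtrim(implode(' | ', $cells));
    }

    return implode("\n", $lines);
}

echo csvToAlignedText("Item,Price,Qty\nApple,5.00,1\nOrange,1.99,2"), "\n";
```

The padding uses byte length, so columns with multibyte characters may not line up exactly; `mb_strlen` would be the usual substitute if that matters.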
composer require darlanschmeller/doc-ocr-php

Include in your project:
require __DIR__ . '/vendor/autoload.php';
use DocOcr\Document;

git clone https://github.com/DarlanSchmeller/doc-ocr-php.git

Include in your project:
require __DIR__ . '/src/Document.php';
use DocOcr\Document;

Set your Mistral API key in your .env file:
MISTRAL_API_KEY=your_api_key_here
MISTRAL_OCR_ENDPOINT=mistral_ocr_endpoint_here # (OPTIONAL) default included

$ocr = Document::from(__DIR__ . '<your_file_path>')
->normalize()
->ocr()
->toArray();
$ocrResult = $ocr->getResult();

To use a different API key or a custom OCR client, inject it like this:
$client = new MistralOcrClient(new OcrClient('<your_mistral_api_key>'));
return Document::fromWithClient(__DIR__ . $fixture, $client)
->normalize()
->ocr()
->toArray();

- normalize()
  - Converts CSV and XLSX files into OCR-friendly layouts
  - Reads PDFs and images as-is
- ocr()
  - Sends the document to Mistral OCR
  - Stores the raw OCR response
- toArray()
  - Decodes the OCR JSON response into a PHP array
All pipeline stages are idempotent and safe to call multiple times.
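One common way to get idempotent stages is to store each stage's result and skip the work on repeated calls. The sketch below illustrates that pattern with a hypothetical `SketchPipeline` class; it is an assumption about how such a guarantee could be implemented, not a description of DocOcr's internals.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for a pipeline object. DocOcr's real classes
 * are not shown in this README and may differ.
 */
final class SketchPipeline
{
    private string $input;
    private ?string $normalized = null;

    /** Counts how often the real work ran (for demonstration only). */
    public int $normalizeRuns = 0;

    public function __construct(string $input)
    {
        $this->input = $input;
    }

    public function normalize(): self
    {
        // Do the work only once; later calls reuse the stored result.
        if ($this->normalized === null) {
            $this->normalized = trim($this->input);
            $this->normalizeRuns++;
        }

        return $this;
    }

    public function result(): ?string
    {
        return $this->normalized;
    }
}

// Calling the stage twice leaves the result unchanged and runs the work once.
$pipeline = (new SketchPipeline('  raw text  '))->normalize()->normalize();
echo $pipeline->result(), ' (runs: ', $pipeline->normalizeRuns, ")\n";
```

Returning the same object from every stage is what keeps the chain fluent while still allowing a repeated call to be a no-op.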
[
'pages' => [
[
'index' => 0,
'markdown' => '
Invoice Number: #20130304
ATTENTION TO: Denny Gunawan
221 Queen St, Melbourne 3000
Total: $39.60
',
'images' => [],
'tables' => [
[
'id' => 'tbl-0.html',
'format' => 'html',
'content' => '
Organic Items | Price/kg | Quantity | Subtotal
Apple | $5.00 | 1 | $5.00
Orange | $1.99 | 2 | $3.98
'
]
]
]
]
]

| Format | Normalized | OCR |
|---|---|---|
| PDF | ✅ | ✅ |
| CSV | ✅ | ✅ |
| XLSX | ✅ | ✅ |
| Images (png, jpg, webp) | ⏭ skipped | ✅ |
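Whatever the input format, downstream code works with the same array shape as the sample output earlier in this README. As a rough sketch, the hypothetical helper `summarizePages` below walks that shape and collects the page text and table ids. It assumes only the keys visible in the sample (`pages`, `index`, `markdown`, `tables`, `id`) and tolerates any of them being absent.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of DocOcr): summarises an array shaped
 * like the sample output, with pages holding index, markdown and tables.
 */
function summarizePages(array $result): array
{
    $summary = [];
    foreach ($result['pages'] ?? [] as $page) {
        // Collect the id of every table found on this page.
        $tableIds = [];
        foreach ($page['tables'] ?? [] as $table) {
            $tableIds[] = $table['id'] ?? null;
        }

        $summary[] = [
            'index'  => $page['index'] ?? null,
            'text'   => trim((string) ($page['markdown'] ?? '')),
            'tables' => $tableIds,
        ];
    }

    return $summary;
}

// Trimmed-down copy of the sample output, used only for illustration.
$sample = [
    'pages' => [
        [
            'index'    => 0,
            'markdown' => "\nInvoice Number: #20130304\nTotal: \$39.60\n",
            'images'   => [],
            'tables'   => [
                ['id' => 'tbl-0.html', 'format' => 'html', 'content' => ''],
            ],
        ],
    ],
];

print_r(summarizePages($sample));
```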
./vendor/bin/phpunit tests

OCR tests are skipped automatically if MISTRAL_API_KEY is not set.
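A guard of this kind can be sketched as follows. The helper name `ocrTestsEnabled` is hypothetical, and the sketch assumes the key is visible through `getenv()`; the project's own tests may read the key differently, for example from `$_ENV` after loading the .env file.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of DocOcr): true when a non-empty
 * MISTRAL_API_KEY is available in the process environment.
 */
function ocrTestsEnabled(): bool
{
    $key = getenv('MISTRAL_API_KEY');

    return is_string($key) && trim($key) !== '';
}

// Inside a PHPUnit test method, the guard would typically look like:
//
//     if (!ocrTestsEnabled()) {
//         $this->markTestSkipped('MISTRAL_API_KEY is not set.');
//     }
```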