DarlanSchmeller/doc-ocr-php

🧾 Doc OCR PHP


DocOcr is a lightweight, pipeline-based PHP library that turns documents like PDF, CSV, and XLSX into structured data using Mistral's OCR API.

Designed for:

  • Ingestion pipelines
  • AI preprocessing
  • Finance / accounting docs
  • Backend automation

Features

  • Supports PDF, CSV, XLSX
  • Normalization layer for OCR-friendly input
  • OCR powered by Mistral AI
  • Extracts structured content (pages, text, tables)
  • Fluent pipeline API (normalize → ocr → toArray)
  • PHPUnit automated testing
  • Custom OCR client injection

Why DocOcr?

Most OCR libraries return raw text blobs. DocOcr focuses on pipeline-friendly, structured extraction designed for backend systems, AI preprocessing, and financial workflows.

It handles:

  • File normalization (CSV/XLSX → OCR-friendly layout)
  • OCR execution
  • Predictable output for downstream processing

Installation

Via Composer (recommended)

composer require darlanschmeller/doc-ocr-php

Include in your project:

require __DIR__ . '/vendor/autoload.php';

use DocOcr\Document;

From source (for development only)

git clone https://github.com/DarlanSchmeller/doc-ocr-php.git

Include in your project:

require __DIR__ . '/src/Document.php';

use DocOcr\Document;

Configuration

Set your Mistral API key in your .env file:

MISTRAL_API_KEY=your_api_key_here
MISTRAL_OCR_ENDPOINT=mistral_ocr_endpoint_here # (OPTIONAL) default included

Usage

Basic Usage

$ocr = Document::from(__DIR__ . '<your_file_path>')
    ->normalize()
    ->ocr()
    ->toArray();

$ocrResult = $ocr->getResult();

Injecting your own client instance

If you want to use a different API key or a custom OCR client, you can inject it like this:

// Build a client with an explicit API key instead of the one from .env
$client = new MistralOcrClient(new OcrClient('<your_mistral_api_key>'));

$ocr = Document::fromWithClient(__DIR__ . '<your_file_path>', $client)
    ->normalize()
    ->ocr()
    ->toArray();

Pipeline Stages

  1. normalize()

    • Converts CSV and XLSX files into OCR-friendly layouts
    • Reads PDFs and images as-is
  2. ocr()

    • Sends the document to Mistral OCR
    • Stores the raw OCR response
  3. toArray()

    • Decodes the OCR JSON response into a PHP array

All pipeline stages are idempotent and safe to call multiple times.
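
One way to get idempotent stages is to cache each stage's output and reuse it on repeated calls. The sketch below shows that idea with a fake, injected client. `PipelineSketch` and `OcrClientSketch` are hypothetical names invented for this example; this is not DocOcr's actual `Document` class or client interface, and the real implementation may work differently.

```php
<?php
declare(strict_types=1);

// Hypothetical client contract; DocOcr's real interface may look different.
interface OcrClientSketch
{
    public function process(string $payload): string; // returns raw JSON
}

// Illustrative fluent pipeline with cached (idempotent) stages.
final class PipelineSketch
{
    private ?string $normalized = null;
    private ?string $rawJson = null;
    private ?array $result = null;

    public function __construct(
        private string $input,
        private OcrClientSketch $client
    ) {
    }

    public function normalize(): self
    {
        // Cache the normalized form so repeated calls do no extra work.
        $this->normalized ??= trim($this->input);
        return $this;
    }

    public function ocr(): self
    {
        $this->normalize();
        // Call the OCR backend at most once per document.
        $this->rawJson ??= $this->client->process($this->normalized);
        return $this;
    }

    public function toArray(): self
    {
        $this->ocr();
        $this->result ??= json_decode($this->rawJson, true, 512, JSON_THROW_ON_ERROR);
        return $this;
    }

    public function getResult(): ?array
    {
        return $this->result;
    }
}

// Fake client that counts calls, standing in for a real OCR backend.
$fake = new class implements OcrClientSketch {
    public int $calls = 0;

    public function process(string $payload): string
    {
        $this->calls++;
        return json_encode(['pages' => [['index' => 0, 'markdown' => $payload]]]);
    }
};

// Calling stages more than once still hits the client only once.
$doc = (new PipelineSketch('  Total: $39.60  ', $fake))
    ->normalize()->ocr()->ocr()->toArray()->toArray();
```

Because each stage stores its result in a nullable property, a second call is a cheap no-op, which matters most for `ocr()` where every real request costs time and API quota.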

Output Example

[
  'pages' => [
    [
      'index' => 0,
      'markdown' => '
        Invoice Number: #20130304
        ATTENTION TO: Denny Gunawan
        221 Queen St, Melbourne 3000
        Total: $39.60
      ',
      'images' => [],
      'tables' => [
        [
          'id' => 'tbl-0.html',
          'format' => 'html',
          'content' => '
            Organic Items | Price/kg | Quantity | Subtotal
            Apple         | $5.00    | 1        | $5.00
            Orange        | $1.99    | 2        | $3.98
          '
        ]
      ]
    ]
  ]
]
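
To show how this structure might be consumed downstream, here is a minimal sketch that flattens a decoded result into per-page summaries. The `summarizePages` helper is hypothetical, and it assumes only the `pages`, `index`, `markdown`, and `tables` keys shown in the example above; real responses may carry additional fields.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of DocOcr): reduce a decoded OCR result
 * to the page index, trimmed text, and table ids of each page.
 */
function summarizePages(array $ocrResult): array
{
    $summaries = [];
    foreach ($ocrResult['pages'] ?? [] as $page) {
        $summaries[] = [
            'index'     => $page['index'] ?? null,
            'text'      => trim($page['markdown'] ?? ''),
            // array_column collects the 'id' of every table on the page.
            'table_ids' => array_column($page['tables'] ?? [], 'id'),
        ];
    }
    return $summaries;
}

// Sample input shaped like the output example above.
$sample = [
    'pages' => [[
        'index'    => 0,
        'markdown' => "\nInvoice Number: #20130304\nTotal: \$39.60\n",
        'images'   => [],
        'tables'   => [[
            'id'      => 'tbl-0.html',
            'format'  => 'html',
            'content' => 'Apple | $5.00 | 1 | $5.00',
        ]],
    ]],
];

$summaries = summarizePages($sample);
```

The `??` fallbacks keep the helper from raising warnings when a page has no tables or a response has no pages at all.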

📂 Supported Formats

Format Normalized OCR
PDF
CSV
XLSX
Images (png, jpg, webp) ⏭ skipped

Run automated tests

./vendor/bin/phpunit tests

OCR tests are skipped automatically if MISTRAL_API_KEY is not set.

About

Document OCR and ingestion pipeline library for PHP applications, powered by Mistral AI. Includes automated tests and GitHub Actions CI
