astonysh/DocuClaw

DocuClaw

🦀 Universal Sovereign Data Infrastructure for Individuals & Teams

English | 简体中文 | Deutsch | Français | Español | Italiano | 日本語


Your invoices. Your contracts. Your letters. Your data. Your rules.


🚀 Why DocuClaw?

In a world drowning in SaaS lock-in and cloud surveillance, DocuClaw gives you back control.

Whether you're a freelancer managing personal tax receipts, a startup juggling B2B invoices across borders, or a growing SME facing GoBD compliance audits, DocuClaw is your local-first, privacy-native, AI-powered document brain.

📄 Physical Mail → 📸 Scan    → 🤖 AI Extract → 📁 Local Markdown Archive
📧 Email Receipt → 🔗 Webhook → 🤖 AI Extract → 📁 Local Markdown Archive
🧾 API Invoice   → 🔌 Plugin  → 🤖 AI Extract → 📁 Local Markdown Archive

✨ Key Features

| Feature | Description |
| --- | --- |
| 🛡️ 100% Sovereign | All data stays on YOUR machine. Zero cloud dependency. Zero telemetry. |
| 🏢 Multi-Entity | Manage personal docs, company invoices, and team files, all in one install. |
| 🔌 Plugin Architecture | Country-specific parsers (DE, US, CN, ...) snap in like LEGO bricks. |
| 📝 Markdown-Native | Every document becomes a searchable .md file with structured YAML frontmatter. |
| 🤖 AI-Powered Extraction | Multimodal LLM extracts structured data from scans, photos, and emails. |
| ✅ Compliance-Ready | Designed with GoBD (Germany), GDPR, and audit-trail principles baked in. |
| 🔍 RAG-Ready | Full-text originals preserved for retrieval-augmented generation workflows. |

๐Ÿ—๏ธ Architecture

DocuClaw follows a Core Engine + Pluggable Parsers architecture, designed for enterprise-grade extensibility:

┌──────────────────────────────────────────────┐
│                  CLI / API                   │
├──────────────────────────────────────────────┤
│                 Core Engine                  │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐   │
│  │  Schema  │  │ Storage  │  │ Registry  │   │
│  │(Pydantic)│  │  Layer   │  │ (Plugin)  │   │
│  └──────────┘  └──────────┘  └───────────┘   │
├──────────────────────────────────────────────┤
│                Parser Plugins                │
│  ┌────────┐  ┌────────┐  ┌──────────────┐    │
│  │ DE 🇩🇪  │  │ US 🇺🇸  │  │  Custom ...  │    │
│  │Invoice │  │Invoice │  │ Your Parser  │    │
│  └────────┘  └────────┘  └──────────────┘    │
├──────────────────────────────────────────────┤
│           Input Adapters (Future)            │
│  📷 Scanner │ 📧 Email │ 🔗 Webhook │ 🔌 API   │
└──────────────────────────────────────────────┘

The Data Contract

Every document, whether a €10K enterprise invoice or a personal electricity bill, is normalized into a universal Markdown schema with structured YAML frontmatter:

---
id: doc_20260215_a1b2c3d4
entity_id: "org_acme_01"
entity_type: "company"
source_type: physical_mail
country: DE
document_type: b2b_invoice
date_received: "2026-02-15"
sender_name: "AWS EMEA SARL"
amount_total: 125.50
currency: EUR
status: pending
tags: [IT_Infrastructure, Q1_Expense]
---
### Raw Content
[Full OCR / email body preserved for compliance & RAG]

### AI Summary
This is the February AWS bill containing €20.04 input VAT...
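
To make the layout above concrete, here is a minimal standard-library sketch of how a record could be serialized into frontmatter plus the two body sections. The `Document` dataclass and `render_markdown` helper are hypothetical names for illustration only. DocuClaw's own schema is Pydantic-based and may differ in fields and formatting.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for the real schema; fields mirror the example above."""
    id: str
    entity_id: str
    entity_type: str
    source_type: str
    country: str
    document_type: str
    date_received: str
    sender_name: str
    amount_total: float
    currency: str
    status: str = "pending"
    tags: list[str] = field(default_factory=list)


def render_markdown(doc: Document, raw_content: str, summary: str) -> str:
    """Render frontmatter and body sections as a single Markdown string."""
    lines = ["---"]
    for key, value in asdict(doc).items():
        if isinstance(value, float):
            rendered = f"{value:.2f}"  # keep two decimals for amounts
        else:
            # JSON strings and lists are also valid YAML flow values.
            rendered = json.dumps(value, ensure_ascii=False)
        lines.append(f"{key}: {rendered}")
    lines += ["---", "### Raw Content", raw_content, "", "### AI Summary", summary, ""]
    return "\n".join(lines)
```

Quoting every string (rather than only some, as in the hand-written example) is a deliberate simplification: it stops YAML parsers from reinterpreting values such as dates or country codes.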

⚡ Quick Start

Installation

# Clone the repository
git clone https://github.com/openclaw-ai/docuclaw.git
cd docuclaw

# Install dependencies
pip install -e .

Usage

# Process a German invoice scan
docuclaw process \
  --entity-id "org_mycompany_01" \
  --entity-type company \
  --country DE \
  --source-type physical_mail \
  --input ./scans/invoice_aws_feb.png

# Output: ./docuclaw_data/org_mycompany_01/2026/02/doc_20260215_xxxx.md
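
The output path in the comment above suggests an entity/year/month directory layout. The sketch below shows one way to derive such a path. It assumes the year and month come from the date embedded in the document ID, which the README does not confirm, and `archive_path` is a hypothetical helper, not part of the `docuclaw` package.

```python
from pathlib import Path


def archive_path(base_path: str, entity_id: str, doc_id: str) -> Path:
    """Build <base>/<entity_id>/<YYYY>/<MM>/<doc_id>.md from an ID like doc_20260215_a1b2c3d4."""
    parts = doc_id.split("_")
    if len(parts) != 3 or len(parts[1]) != 8 or not parts[1].isdigit():
        raise ValueError(f"unexpected document id: {doc_id!r}")
    year, month = parts[1][:4], parts[1][4:6]
    return Path(base_path) / entity_id / year / month / f"{doc_id}.md"


path = archive_path("./docuclaw_data", "org_mycompany_01", "doc_20260215_a1b2c3d4")
print(path.as_posix())
```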

Python API

from docuclaw.schema import DocuClawDocument, EntityType, SourceType
from docuclaw.core.storage import MarkdownStorageEngine
from docuclaw.parsers.de_invoice_parser import DEInvoiceParser

# Initialize
storage = MarkdownStorageEngine(base_path="./docuclaw_data")
parser = DEInvoiceParser()

# Parse a document
doc = parser.parse(
    file_path="./scans/invoice.png",
    entity_id="org_mycompany_01",
    entity_type=EntityType.COMPANY,
)

# Persist as structured Markdown
output_path = storage.save(doc)
print(f"📄 Saved: {output_path}")

🧩 Writing Custom Parsers

Extend DocuClaw for any country or document type:

from docuclaw.parsers.base import BaseDocumentParser
from docuclaw.schema import DocuClawDocument

class USReceiptParser(BaseDocumentParser):
    """Parser for US retail receipts."""

    @property
    def supported_countries(self) -> list[str]:
        return ["US"]

    @property
    def supported_document_types(self) -> list[str]:
        return ["receipt", "b2c_invoice"]

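    def parse(self, file_path, entity_id, entity_type, **kwargs) -> DocuClawDocument:
        # Your extraction logic here: read file_path, extract the fields,
        # and return a populated DocuClawDocument.
        ...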
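
The architecture diagram lists a plugin registry, but this README does not show its interface. As a rough sketch of how a lookup by country and document type could work, assuming parsers expose the two properties shown above: `ParserRegistry`, `register`, `find`, and `DemoUSReceiptParser` are all hypothetical names.

```python
class ParserRegistry:
    """Hypothetical registry mapping (country, document_type) to a parser instance."""

    def __init__(self):
        self._parsers = {}

    def register(self, parser) -> None:
        # Index the parser under every country/document-type pair it declares.
        for country in parser.supported_countries:
            for doc_type in parser.supported_document_types:
                self._parsers[(country.upper(), doc_type)] = parser

    def find(self, country: str, document_type: str):
        # Country codes are matched case-insensitively; document types exactly.
        try:
            return self._parsers[(country.upper(), document_type)]
        except KeyError:
            raise LookupError(f"no parser for {country}/{document_type}") from None


class DemoUSReceiptParser:
    """Stand-in exposing the same two attributes as the parser above."""
    supported_countries = ["US"]
    supported_document_types = ["receipt", "b2c_invoice"]


registry = ParserRegistry()
registry.register(DemoUSReceiptParser())
```

With this shape, a later registration for the same country and document type silently replaces the earlier one. A real registry might instead reject duplicates or rank parsers by priority.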

🇪🇺 GDPR & EU Compliance

DocuClaw is architected from the ground up with EU General Data Protection Regulation (GDPR) compliance as a core design principle, not an afterthought.

Why DocuClaw is inherently GDPR-friendly

| GDPR Requirement | How DocuClaw Fulfills It |
| --- | --- |
| Art. 5(1)(c): Data Minimization | Only extracts and stores the structured fields you explicitly define. Zero telemetry, zero usage analytics, zero behavioral tracking. |
| Art. 5(1)(f): Integrity & Confidentiality | All data processing happens locally on your machine. No network transmission = no interception risk. |
| Art. 5(2): Accountability | Built-in audit logging with hash-chain integrity verification provides tamper-evident processing records. |
| Art. 17: Right to Erasure | Data is stored as plain Markdown files on your local filesystem. Deletion is as simple as removing a file: no vendor tickets, no retention policies to fight. |
| Art. 25: Data Protection by Design | Privacy-first architecture: local-only processing is not a feature toggle, it's the only mode. No cloud fallback exists. |
| Art. 44–49: International Transfers | No data ever leaves your machine. No third-party servers, no sub-processors, no cross-border transfers. Full compliance by architectural design. |

GoBD Compliance (Germany) 🇩🇪

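For users in Germany, DocuClaw additionally supports GoBD (Grundsätze zur ordnungsmäßigen Führung und Aufbewahrung von Büchern, Aufzeichnungen und Unterlagen in elektronischer Form sowie zum Datenzugriff, the German principles for properly keeping and retaining books, records, and documents in electronic form and for data access):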

  • Immutability: A hash-chain audit trail makes any later alteration of an archived document detectable
  • Traceability: Every processing step is logged with timestamps and checksums
  • Retention: Date-structured local storage supports long-term statutory retention (up to 10 years, depending on the document type)
  • Accessibility: Markdown-native format ensures documents remain human-readable without proprietary software
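
The README does not specify how the hash-chain audit trail is built. A minimal sketch of the general technique, in which each log entry's hash covers both its own content and the previous entry's hash, is shown below. The entry fields and the `append_entry` and `verify_chain` helpers are assumptions for illustration, not DocuClaw's actual log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same content always hashes identically.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(chain: list[dict], event: str, document_sha256: str, timestamp: str) -> dict:
    """Append a log entry whose hash covers its content and the previous entry's hash."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS
    body = {"event": event, "document_sha256": document_sha256,
            "timestamp": timestamp, "prev_hash": prev_hash}
    entry = {**body, "entry_hash": _digest(body)}
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash and check each entry links to the one before it."""
    prev_hash = GENESIS
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_hash"] != prev_hash or entry["entry_hash"] != _digest(body):
            return False
        prev_hash = entry["entry_hash"]
    return True


log: list[dict] = []
doc_hash = hashlib.sha256(b"example invoice bytes").hexdigest()
append_entry(log, "ingested", doc_hash, "2026-02-15T10:00:00Z")
append_entry(log, "extracted", doc_hash, "2026-02-15T10:00:05Z")
```

Editing or reordering entries, or deleting one from the middle, causes `verify_chain` to fail. Dropping entries from the end does not, so a real deployment would also need to record the latest hash somewhere outside the log.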

No Cloud? No Problem.

Unlike SaaS alternatives, DocuClaw never requires you to:

  • Sign a Data Processing Agreement (DPA) with a third party
  • Conduct a Data Protection Impact Assessment (DPIA) for cloud transfers
  • Maintain Records of Processing Activities (RoPA) for external processors
  • Worry about the adequacy decisions of third-country data flows

Your data stays on your machine. Period. That's the simplest, and most secure, compliance strategy.


🤖 AI-Powered Output: From Archive to Action

DocuClaw doesn't just archive your documents; it turns them into an actionable knowledge base. Through AI agent integration (via OpenClaw or any compatible LLM agent), your structured Markdown data becomes a living system that can answer questions, automate workflows, and feed directly into the tools you already use.

💬 Ask Your Documents

Talk to your document archive like you'd talk to a colleague:

You:    "How much did I spend on AWS in Q4 2025?"
Agent:  "Based on 3 invoices archived in DocuClaw, your total AWS spend
         in Q4 2025 was €387.42 (Oct: €125.50, Nov: €131.88, Dec: €130.04)."

You:    "When does my office lease expire?"
Agent:  "Your lease contract (doc_20240301_lease) shows an expiration date
         of March 31, 2027, with a 3-month notice period starting Jan 1, 2027."

Because documents are stored as structured Markdown with YAML frontmatter, any LLM or RAG pipeline can query, filter, and reason over your entire archive. With a locally hosted model, none of that data has to leave your machine.
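
As an illustration of why frontmatter makes the archive easy to query, the following standard-library sketch answers the AWS spend question from the first exchange without any LLM. `read_frontmatter` and `total_spend` are hypothetical helpers. The parser handles only flat `key: value` lines rather than full YAML, and the sum ignores the `currency` field.

```python
from pathlib import Path


def read_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs between the first two `---` lines (not full YAML)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return meta


def total_spend(archive_dir: str, sender_contains: str, start: str, end: str) -> float:
    """Sum amount_total for matching senders with start <= date_received <= end (ISO dates)."""
    total = 0.0
    for path in Path(archive_dir).rglob("*.md"):
        meta = read_frontmatter(path.read_text(encoding="utf-8"))
        if sender_contains.lower() not in meta.get("sender_name", "").lower():
            continue
        # ISO 8601 dates compare correctly as plain strings.
        if start <= meta.get("date_received", "") <= end:
            total += float(meta.get("amount_total", 0))
    return round(total, 2)
```

For example, `total_spend("./docuclaw_data", "aws", "2025-10-01", "2025-12-31")` would add up the Q4 invoices. An agent can use a deterministic filter like this to pick the relevant documents before passing them to a model.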

📅 Calendar, Reminders & To-Do Lists

Auto-extract actionable dates and deadlines from your documents:

| Source Document | Auto-Generated Action |
| --- | --- |
| Invoice with due date | 📅 Calendar event: "Pay AWS invoice €125.50" on Feb 28 |
| Contract with renewal clause | ⏰ Reminder: "Lease renewal notice deadline" 90 days before expiry |
| Insurance policy | ✅ To-Do: "Review and renew car insurance by April 15" |
| Quarterly VAT summary | 📋 Task: "Submit Q1 VAT return, total: €2,340.00" |

Push these directly to Apple Calendar, Google Calendar, Todoist, Things, Notion, or any task manager via standard APIs.
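
Most calendar applications can import iCalendar (`.ics`) files, so one tool-agnostic way to hand over a deadline is to emit a single all-day event. The sketch below is illustrative only: `due_date_event` is a hypothetical helper, the due date is assumed to have been extracted already (the schema shown earlier has no such field), and the `PRODID` and `UID` values are placeholders.

```python
from datetime import date, datetime, timezone


def escape_text(value: str) -> str:
    """Escape characters that are special in iCalendar TEXT values (RFC 5545, section 3.3.11)."""
    return (value.replace("\\", "\\\\").replace(";", "\\;")
                 .replace(",", "\\,").replace("\n", "\\n"))


def due_date_event(doc_id: str, summary: str, due: date, stamp: datetime) -> str:
    """Return one all-day VEVENT wrapped in a VCALENDAR; `stamp` must be timezone-aware."""
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//docuclaw-sketch//EN",  # placeholder product identifier
        "BEGIN:VEVENT",
        f"UID:{doc_id}@docuclaw.local",            # placeholder UID scheme
        f"DTSTAMP:{stamp.astimezone(timezone.utc):%Y%m%dT%H%M%SZ}",
        f"DTSTART;VALUE=DATE:{due:%Y%m%d}",
        f"SUMMARY:{escape_text(summary)}",
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    # RFC 5545 requires CRLF line endings; long-line folding is omitted here.
    return "\r\n".join(lines) + "\r\n"


ics = due_date_event(
    "doc_20260215_a1b2c3d4",
    "Pay AWS invoice, EUR 125.50",
    date(2026, 2, 28),
    datetime(2026, 2, 15, 10, 0, tzinfo=timezone.utc),
)
```

Writing `ics` to a file and opening it lets the user confirm the event before it reaches any calendar, which fits the principle that nothing leaves the machine without an explicit action.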

🧾 Tax Filing & Financial Reports

Generate tax-ready outputs directly from your archived documents:

  • Expense Summaries: Categorized by type, vendor, tax rate, and period
  • VAT/GST Reports: Pre-calculated input tax, output tax, and net amounts
  • Annual Tax Packages: Formatted for your accountant or tax advisor
  • ELSTER-ready data (Germany): Export structured data compatible with German tax filing
  • Custom Financial Reports: Monthly P&L, quarterly cash flow, annual overviews

🔗 Third-Party System Integration

DocuClaw generates data in the format required by external systems, ready for you to review and submit:

| System Type | Examples | What DocuClaw Generates |
| --- | --- | --- |
| Accounting Software | DATEV, Xero, QuickBooks, Lexware | Booking entries, invoice records, expense imports |
| ERP Systems | SAP, Odoo, ERPNext | Structured purchase orders, vendor records |
| Government Portals | ELSTER (DE), HMRC (UK), IRS (US) | Tax declarations, compliance reports |
| Banking Platforms | SWIFT, SEPA | Payment instructions, reconciliation data |
| CRM Systems | Salesforce, HubSpot | Contract metadata, vendor information |

All data stays local until you decide to export or submit. DocuClaw generates the output; you control when and where it goes.


๐Ÿ—บ๏ธ Roadmap

Our vision for DocuClaw is to become the ultimate Sovereign Data Hub for your personal and business documents. Here is what we are building next:

Phase 1: Core Engine & Expanded Parsers (Current)

  • Milestone 1: Core schema, storage engine, parser framework, CLI skeleton
  • Milestone 2: Email ingestion adapter (IMAP/POP3)
  • Milestone 3: Real multimodal LLM integration (Ollama, OpenAI Vision)
  • Milestone 4: Web UI dashboard (local-only, no cloud)
  • Milestone 5: GoBD-compliant audit trail with hash chains
  • Milestone 6: Multi-entity permission model & team collaboration
  • Milestone 7: Webhook & API ingestion endpoints
  • Multi-Country Parser Ecosystem: Specialized extraction logic for highly bureaucratic regions:
    • 🇩🇪 Germany (e.g., Steuerbescheid, GoBD compliance considerations)
    • 🇫🇷 France (e.g., CAF, URSSAF, CPAM documents)
    • 🇮🇹 Italy (e.g., Raccomandata, Fattura Elettronica)
    • 🇪🇸 Spain, 🇺🇸 United States (medical bills, IRS notices), 🇯🇵 Japan (Hanko documents)
  • Advanced OCR Pipeline: Better layout recognition for complex tabular data (e.g., invoices).

Phase 2: Omnichannel Ingestion (Meeting data where it lives)

  • Seamless Email Integration:
    • One-click OAuth for Gmail, Outlook, and iCloud.
    • Standard IMAP support for privacy-focused providers (ProtonMail) and regional giants (GMX, Web.de).
  • Native OS & Media Sync:
    • Apple Photos Integration: Auto-import receipts and documents directly from your macOS/iOS photo library.
    • Local Watchdogs: Auto-process files dropped into specific local folders (perfect for network scanners).

Phase 3: Pluggable AI Engines (Bring Your Own Brain)

  • Cloud AI Integration: Easy API key setup for OpenAI (GPT-4o), Anthropic (Claude), and Google (Gemini).
  • Local-First LLMs: Out-of-the-box support for local inference engines like Ollama and LM Studio. Process highly sensitive documents (like medical records) completely offline.

Phase 4: Automated Export Workflows (The Data Router)

  • Calendar & Tasks: Automatically push deadlines (e.g., invoice due dates) to Google Calendar, Apple iCal, or Todoist.
  • Tax & Accounting Sync: Export parsed financial data to tools like DATEV, Lexoffice, SevDesk (EU), or QuickBooks (US).
  • Knowledge Base Integrations: Seamlessly sync structured Markdown data into Obsidian or Notion for your Second Brain.

📦 Part of the OpenClaw Ecosystem

DocuClaw is a core component of openclaw.ai, an open-source ecosystem for sovereign AI-powered productivity tools.

| Project | Description |
| --- | --- |
| DocuClaw | Sovereign document intelligence & archival |
| DeepReader | AI-powered web content ingestion |
| ClawHub | Plugin marketplace & community hub |

🤝 Contributing

We welcome contributions, whether it's a new country parser, a bug fix, or a documentation improvement.

# Development setup
git clone https://github.com/openclaw-ai/docuclaw.git
cd docuclaw
pip install -e ".[dev]"

# Run tests
pytest

# Run linters
ruff check .
mypy docuclaw/

See CONTRIBUTING.md for detailed guidelines.


📄 License

Licensed under the MIT License. Use it freely. Own your data.


Built with 🦀 by the OpenClaw community
"Your data should work for you, not against you."
