Universal Sovereign Data Infrastructure for Individuals & Teams
Your invoices. Your contracts. Your letters. Your data. Your rules.
In a world drowning in SaaS lock-in and cloud surveillance, DocuClaw gives you back control.
Whether you're a freelancer managing personal tax receipts, a startup juggling B2B invoices across borders, or a growing SME facing GoBD compliance audits, DocuClaw is your local-first, privacy-native, AI-powered document brain.
Physical Mail → Scan → AI Extract → Local Markdown Archive
Email Receipt → Webhook → AI Extract → Local Markdown Archive
API Invoice → Plugin → AI Extract → Local Markdown Archive
| Feature | Description |
|---|---|
| 100% Sovereign | All data stays on YOUR machine. Zero cloud dependency. Zero telemetry. |
| Multi-Entity | Manage personal docs, company invoices, and team files, all in one install. |
| Plugin Architecture | Country-specific parsers (DE, US, CN, ...) snap in like LEGO bricks. |
| Markdown-Native | Every document becomes a searchable .md file with structured YAML frontmatter. |
| AI-Powered Extraction | Multimodal LLM extracts structured data from scans, photos, and emails. |
| Compliance-Ready | Designed with GoBD (Germany), GDPR, and audit-trail principles baked in. |
| RAG-Ready | Full-text originals preserved for retrieval-augmented generation workflows. |
DocuClaw follows a Core Engine + Pluggable Parsers architecture, designed for enterprise-grade extensibility:
- CLI / API
- Core Engine
  - Schema (Pydantic)
  - Storage Layer
  - Registry (Plugin)
- Parser Plugins
  - DE 🇩🇪 Invoice
  - US 🇺🇸 Invoice
  - Custom ... (Your Parser)
- Input Adapters (Future)
  - Scanner, Email, Webhook, API
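How the Registry layer might route a document to the right parser can be sketched as follows. This is a minimal illustration, not DocuClaw's actual registry: `SketchParser`, `ParserRegistry`, `register`, `resolve`, and `DEInvoiceSketch` are hypothetical names, and only the two `supported_*` properties mirror the parser interface shown in this README.

```python
from abc import ABC, abstractmethod


class SketchParser(ABC):
    """Hypothetical stand-in for a parser base class."""

    @property
    @abstractmethod
    def supported_countries(self) -> list[str]: ...

    @property
    @abstractmethod
    def supported_document_types(self) -> list[str]: ...


class ParserRegistry:
    """Maps (country, document_type) pairs to parser instances (assumed design)."""

    def __init__(self) -> None:
        self._parsers: dict[tuple[str, str], SketchParser] = {}

    def register(self, parser: SketchParser) -> None:
        # Index the parser under every combination it claims to support.
        for country in parser.supported_countries:
            for doc_type in parser.supported_document_types:
                self._parsers[(country.upper(), doc_type)] = parser

    def resolve(self, country: str, document_type: str) -> SketchParser:
        # Country codes are matched case-insensitively; document types exactly.
        try:
            return self._parsers[(country.upper(), document_type)]
        except KeyError:
            raise LookupError(
                f"no parser registered for {country}/{document_type}"
            ) from None


class DEInvoiceSketch(SketchParser):
    """Toy parser that only declares what it supports."""

    @property
    def supported_countries(self) -> list[str]:
        return ["DE"]

    @property
    def supported_document_types(self) -> list[str]:
        return ["b2b_invoice"]


registry = ParserRegistry()
registry.register(DEInvoiceSketch())
print(type(registry.resolve("de", "b2b_invoice")).__name__)  # DEInvoiceSketch
```

Keying the lookup on the pair of country and document type keeps the core engine unaware of any specific parser, which is one way to get the plug-in behavior the layers above describe.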
Every document, whether a €10K enterprise invoice or a personal electricity bill, is normalized into a universal Markdown schema with structured YAML frontmatter:
---
id: doc_20260215_a1b2c3d4
entity_id: "org_acme_01"
entity_type: "company"
source_type: physical_mail
country: DE
document_type: b2b_invoice
date_received: "2026-02-15"
sender_name: "AWS EMEA SARL"
amount_total: 125.50
currency: EUR
status: pending
tags: [IT_Infrastructure, Q1_Expense]
---
### Raw Content
[Full OCR / email body preserved for compliance & RAG]
### AI Summary
This is the February AWS bill containing €20.04 input VAT...
# Clone the repository
git clone https://github.com/openclaw-ai/docuclaw.git
cd docuclaw
# Install dependencies
pip install -e .
# Process a German invoice scan
docuclaw process \
  --entity-id "org_mycompany_01" \
  --entity-type company \
  --country DE \
  --source-type physical_mail \
  --input ./scans/invoice_aws_feb.png
# Output: ./docuclaw_data/org_mycompany_01/2026/02/doc_20260215_xxxx.md
from docuclaw.schema import DocuClawDocument, EntityType, SourceType
from docuclaw.core.storage import MarkdownStorageEngine
from docuclaw.parsers.de_invoice_parser import DEInvoiceParser
# Initialize
storage = MarkdownStorageEngine(base_path="./docuclaw_data")
parser = DEInvoiceParser()
# Parse a document
doc = parser.parse(
    file_path="./scans/invoice.png",
    entity_id="org_mycompany_01",
    entity_type=EntityType.COMPANY,
)
# Persist as structured Markdown
output_path = storage.save(doc)
print(f"Saved: {output_path}")
Extend DocuClaw for any country or document type:
from docuclaw.parsers.base import BaseDocumentParser
from docuclaw.schema import DocuClawDocument
class USReceiptParser(BaseDocumentParser):
    """Parser for US retail receipts."""
    @property
    def supported_countries(self) -> list[str]:
        return ["US"]
    @property
    def supported_document_types(self) -> list[str]:
        return ["receipt", "b2c_invoice"]
    def parse(self, file_path, entity_id, entity_type, **kwargs):
        # Your extraction logic here
        ...
DocuClaw is architected from the ground up with EU General Data Protection Regulation (GDPR) compliance as a core design principle, not an afterthought.
| GDPR Requirement | How DocuClaw Fulfills It |
|---|---|
| Art. 5(1)(c): Data Minimization | Only extracts and stores the structured fields you explicitly define. Zero telemetry, zero usage analytics, zero behavioral tracking. |
| Art. 5(1)(f): Integrity & Confidentiality | All data processing happens locally on your machine. No network transmission = no interception risk. |
| Art. 5(2): Accountability | Built-in audit logging with hash-chain integrity verification provides tamper-evident processing records. |
| Art. 17: Right to Erasure | Data is stored as plain Markdown files on your local filesystem. Deletion is as simple as removing a file: no vendor tickets, no retention policies to fight. |
| Art. 25: Data Protection by Design | Privacy-first architecture: local-only processing is not a feature toggle, it's the only mode. No cloud fallback exists. |
| Art. 44–49: International Transfers | No data ever leaves your machine. No third-party servers, no sub-processors, no cross-border transfers. Full compliance by architectural design. |
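For users in Germany, DocuClaw additionally supports GoBD (Grundsätze zur ordnungsmäßigen Führung und Aufbewahrung von Büchern, Aufzeichnungen und Unterlagen in elektronischer Form, the German principles for the proper keeping and retention of books, records, and documents in electronic form):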
- Immutability: Hash-chain audit trail ensures archived documents cannot be silently altered
- Traceability: Every processing step is logged with timestamps and checksums
- Retention: Local storage with structured dating supports statutory retention periods of up to 10 years
- Accessibility: Markdown-native format ensures documents remain human-readable without proprietary software
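The hash-chain idea behind the immutability and traceability points can be sketched in a few lines. This is an illustration under stated assumptions, not DocuClaw's audit-log format: the entry layout, the `GENESIS` placeholder, and the function names are hypothetical, SHA-256 is an assumed choice of hash, and timestamps are left out to keep the example deterministic.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], payload: dict) -> None:
    """Append a log entry that is linked to the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": entry_hash(prev_hash, payload),
    })


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: list[dict] = []
append_entry(audit_log, {"doc_id": "doc_20260215_a1b2c3d4", "step": "ingested"})
append_entry(audit_log, {"doc_id": "doc_20260215_a1b2c3d4", "step": "extracted"})
print(verify_chain(audit_log))  # True

audit_log[0]["payload"]["step"] = "tampered"
print(verify_chain(audit_log))  # False
```

Because each hash covers the one before it, changing an old entry invalidates every later entry. A chain like this makes edits detectable, not impossible, which is why it is usually described as tamper-evident.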
Unlike SaaS alternatives, DocuClaw never requires you to:
- Sign a Data Processing Agreement (DPA) with a third party
- Conduct a Data Protection Impact Assessment (DPIA) for cloud transfers
- Maintain Records of Processing Activities (RoPA) for external processors
- Worry about the adequacy decisions of third-country data flows
Your data stays on your machine. Period. That's the simplest, and most secure, compliance strategy.
DocuClaw doesn't just archive your documents; it turns them into an actionable knowledge base. Through AI agent integration (via OpenClaw or any compatible LLM agent), your structured Markdown data becomes a living system that can answer questions, automate workflows, and feed directly into the tools you already use.
Talk to your document archive like you'd talk to a colleague:
You: "How much did I spend on AWS in Q4 2025?"
Agent: "Based on 3 invoices archived in DocuClaw, your total AWS spend
in Q4 2025 was €387.42 (Oct: €125.50, Nov: €131.88, Dec: €130.04)."
You: "When does my office lease expire?"
Agent: "Your lease contract (doc_20240301_lease) shows an expiration date
of March 31, 2027, with a 3-month notice period starting Jan 1, 2027."
Because documents are stored as structured Markdown with YAML frontmatter, any LLM or RAG pipeline can instantly query, filter, and reason over your entire archive without sending data to the cloud.
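As a rough illustration of why frontmatter makes the archive easy to query, the sketch below answers the AWS spending question with plain Python and no LLM. It assumes flat `key: value` frontmatter only; a real pipeline would use a proper YAML parser, since this one leaves list values such as `tags` as raw strings. The helper names, the second vendor, and the invoice dates are invented for the example, while the amounts are the ones from the conversation above.

```python
from decimal import Decimal


def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs between the first two `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return meta


def total_by_sender(docs: list[str], sender: str, months: tuple[str, ...]) -> Decimal:
    """Sum amount_total for one sender over the given YYYY-MM prefixes."""
    total = Decimal("0")
    for text in docs:
        meta = parse_frontmatter(text)
        if meta.get("sender_name") != sender:
            continue
        if not meta.get("date_received", "").startswith(months):
            continue
        total += Decimal(meta["amount_total"])
    return total


def make_doc(date: str, sender: str, amount: str) -> str:
    """Build a tiny sample document (stands in for a file read from disk)."""
    return (
        "---\n"
        f'sender_name: "{sender}"\n'
        f'date_received: "{date}"\n'
        f"amount_total: {amount}\n"
        "currency: EUR\n"
        "---\n"
        "### Raw Content\n"
    )


archive = [
    make_doc("2025-10-14", "AWS EMEA SARL", "125.50"),
    make_doc("2025-11-13", "AWS EMEA SARL", "131.88"),
    make_doc("2025-12-12", "AWS EMEA SARL", "130.04"),
    make_doc("2025-12-20", "Other Vendor", "99.00"),
]
q4_2025 = ("2025-10", "2025-11", "2025-12")
print(total_by_sender(archive, "AWS EMEA SARL", q4_2025))  # 387.42
```

Amounts are summed as `Decimal` rather than `float` so that currency totals do not pick up binary rounding errors.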
Auto-extract actionable dates and deadlines from your documents:
| Source Document | Auto-Generated Action |
|---|---|
| Invoice with due date | Calendar event: "Pay AWS invoice €125.50" on Feb 28 |
| Contract with renewal clause | Reminder: "Lease renewal notice deadline" 90 days before expiry |
| Insurance policy | To-Do: "Review and renew car insurance by April 15" |
| Quarterly VAT summary | Task: "Submit Q1 VAT return, total: €2,340.00" |
Push these directly to Apple Calendar, Google Calendar, Todoist, Things, Notion, or any task manager via standard APIs.
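One portable way to hand a deadline to a calendar app is an iCalendar (.ics) file, which Apple Calendar and Google Calendar can both import. The sketch below builds a minimal all-day event by hand. It is an assumption about how such an export could look, not DocuClaw's implementation: the `PRODID` value, the UID scheme, and the function names are hypothetical, and the year 2026 is inferred from the sample invoice date.

```python
from datetime import date, datetime, timezone


def escape_text(value: str) -> str:
    """Escape the characters that iCalendar reserves inside TEXT values."""
    return (
        value.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )


def due_date_event(doc_id: str, summary: str, due: date, stamp: datetime) -> str:
    """Return a one-event calendar as text, using CRLF line endings."""
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//docuclaw-sketch//EN",  # hypothetical product identifier
        "BEGIN:VEVENT",
        f"UID:{doc_id}@docuclaw.local",  # hypothetical UID scheme
        f"DTSTAMP:{stamp.astimezone(timezone.utc):%Y%m%dT%H%M%SZ}",
        f"DTSTART;VALUE=DATE:{due:%Y%m%d}",  # all-day event on the due date
        f"SUMMARY:{escape_text(summary)}",
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    return "\r\n".join(lines) + "\r\n"


ics = due_date_event(
    doc_id="doc_20260215_a1b2c3d4",
    summary="Pay AWS invoice €125.50",
    due=date(2026, 2, 28),
    stamp=datetime(2026, 2, 15, 9, 0, tzinfo=timezone.utc),
)
print(ics)
```

Writing `ics` to a file with a `.ics` extension keeps the whole step local; task managers that only offer an HTTP API would need their own adapter.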
Generate tax-ready outputs directly from your archived documents:
- Expense Summaries: Categorized by type, vendor, tax rate, and period
- VAT/GST Reports: Pre-calculated input tax, output tax, and net amounts
- Annual Tax Packages: Formatted for your accountant or tax advisor
- ELSTER-ready data (Germany): Export structured data compatible with German tax filing
- Custom Financial Reports: Monthly P&L, quarterly cash flow, annual overviews
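The arithmetic behind a pre-calculated input-tax figure is small enough to show directly. The sketch below splits a VAT-inclusive total into net and tax, assuming the German standard rate of 19% and commercial rounding to cents; the rate, the rounding rule, and the `split_gross` helper are assumptions for illustration, not DocuClaw's reporting logic.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def split_gross(gross: Decimal, vat_rate: Decimal) -> tuple[Decimal, Decimal]:
    """Return (net, vat) for a VAT-inclusive amount, rounded to cents."""
    net = (gross / (Decimal("1") + vat_rate)).quantize(CENT, rounding=ROUND_HALF_UP)
    # Deriving VAT as the remainder guarantees net + vat == gross.
    return net, gross - net


net, vat = split_gross(Decimal("125.50"), Decimal("0.19"))
print(net, vat)  # 105.46 20.04
```

With the €125.50 sample invoice this yields €20.04 of input VAT, which matches the figure in the AI summary of the frontmatter example.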
DocuClaw generates and submits data in the exact format required by external systems:
| System Type | Examples | What DocuClaw Generates |
|---|---|---|
| Accounting Software | DATEV, Xero, QuickBooks, Lexware | Booking entries, invoice records, expense imports |
| ERP Systems | SAP, Odoo, ERPNext | Structured purchase orders, vendor records |
| Government Portals | ELSTER (DE), HMRC (UK), IRS (US) | Tax declarations, compliance reports |
| Banking Platforms | SWIFT, SEPA | Payment instructions, reconciliation data |
| CRM Systems | Salesforce, HubSpot | Contract metadata, vendor information |
All data stays local until you decide to export or submit. DocuClaw generates the output; you control when and where it goes.
Our vision for DocuClaw is to become the ultimate Sovereign Data Hub for your personal and business documents. Here is what we are building next:
- Milestone 1: Core schema, storage engine, parser framework, CLI skeleton
- Milestone 2: Email ingestion adapter (IMAP/POP3)
- Milestone 3: Real multimodal LLM integration (Ollama, OpenAI Vision)
- Milestone 4: Web UI dashboard (local-only, no cloud)
- Milestone 5: GoBD-compliant audit trail with hash chains
- Milestone 6: Multi-entity permission model & team collaboration
- Milestone 7: Webhook & API ingestion endpoints
- Multi-Country Parser Ecosystem: Specialized extraction logic for highly-bureaucratic regions:
  - 🇩🇪 Germany (e.g., Steuerbescheid, GoBD compliance considerations)
  - 🇫🇷 France (e.g., CAF, URSSAF, CPAM documents)
  - 🇮🇹 Italy (e.g., Raccomandata, Fattura Elettronica)
  - 🇪🇸 Spain, 🇺🇸 United States (Medical bills, IRS notices), 🇯🇵 Japan (Hanko documents).
- Advanced OCR Pipeline: Better layout recognition for complex tabular data (e.g., invoices).
- Seamless Email Integration:
  - One-click OAuth for Gmail, Outlook, and iCloud.
  - Standard IMAP support for privacy-focused providers (ProtonMail) and regional giants (GMX, Web.de).
- Native OS & Media Sync:
  - Apple Photos Integration: Auto-import receipts and documents directly from your macOS/iOS photo library.
  - Local Watchdogs: Auto-process files dropped into specific local folders (perfect for network scanners).
- Cloud AI Integration: Easy API key setup for OpenAI (GPT-4o), Anthropic (Claude), and Google (Gemini).
- Local-First LLMs: Out-of-the-box support for local inference engines like Ollama and LM Studio. Process highly sensitive documents (like medical records) completely offline.
- Calendar & Tasks: Automatically push deadlines (e.g., invoice due dates) to Google Calendar, Apple iCal, or Todoist.
- Tax & Accounting Sync: Export parsed financial data to tools like DATEV, Lexoffice, SevDesk (EU), or QuickBooks (US).
- Knowledge Base Integrations: Seamlessly sync structured Markdown data into Obsidian or Notion for your Second Brain.
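The Local Watchdogs item in the roadmap above could start out as simple folder polling. The sketch below is one possible shape for it, using only the standard library; the function names, the accepted file suffixes, and the `max_polls` parameter are assumptions, and it does not guard against files that are still being written by the scanner.

```python
import time
from pathlib import Path
from typing import Callable, Optional

SCAN_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}  # assumed input types


def find_new_files(inbox: Path, seen: set[Path]) -> list[Path]:
    """Return supported files in `inbox` not seen before, and remember them."""
    fresh = sorted(
        p for p in inbox.iterdir()
        if p.is_file() and p.suffix.lower() in SCAN_SUFFIXES and p not in seen
    )
    seen.update(fresh)
    return fresh


def watch(
    inbox: Path,
    handle: Callable[[Path], None],
    interval_s: float = 5.0,
    max_polls: Optional[int] = None,
) -> None:
    """Poll `inbox` and call `handle` once per new file.

    Runs forever when `max_polls` is None; a finite value is handy for tests.
    """
    seen: set[Path] = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for path in find_new_files(inbox, seen):
            handle(path)
        polls += 1
        time.sleep(interval_s)


# Example (blocks until interrupted):
# watch(Path("./inbox"), handle=print)
```

Polling trades a few seconds of latency for zero dependencies; an event-based file watcher would react faster at the cost of an extra library.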
DocuClaw is a core component of openclaw.ai โ an open-source ecosystem for sovereign AI-powered productivity tools.
| Project | Description |
|---|---|
| DocuClaw | Sovereign document intelligence & archival |
| DeepReader | AI-powered web content ingestion |
| ClawHub | Plugin marketplace & community hub |
We welcome contributions, whether it's a new country parser, a bug fix, or a documentation improvement.
# Development setup
git clone https://github.com/openclaw-ai/docuclaw.git
cd docuclaw
pip install -e ".[dev]"
# Run tests
pytest
# Run linters
ruff check .
mypy docuclaw/
See CONTRIBUTING.md for detailed guidelines.
Licensed under the MIT License. Use it freely. Own your data.
Built by the OpenClaw community
"Your data should work for you, not against you."