Multi-tenant document ingestion platform. Connects to cloud storage providers, watches for changes, and builds a searchable knowledge base with versioning, content extraction, AI summaries, and vector embeddings.
- Watch — Connects to cloud storage providers via flux and monitors files the tenant has registered for observation
- Extract — Pulls document content using format-specific extractors; delegates OCR to a Tesseract sidecar over gRPC for scanned/handwritten documents
- Version — Tracks every revision of every document, maintaining a complete history
- Enrich — Generates AI summaries via zyn and vector embeddings via vex for each document version
- Index — Stores extracted content, summaries, and embeddings in OpenSearch, providing full-text and semantic search across all ingested documents for a given tenant
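
The Version step above could be sketched as follows. This is a minimal in-memory illustration, not the project's actual storage model: the `History`, `Version`, and `Record` names are hypothetical, and skipping unchanged content by comparing SHA-256 digests is an assumption, not something the README states.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Version is a hypothetical record of one document revision.
type Version struct {
	Number int
	SHA256 string
}

// docKey scopes a document to its tenant so histories never mix.
type docKey struct {
	tenantID string
	docID    string
}

// History keeps every recorded revision per tenant and document.
type History struct {
	versions map[docKey][]Version
}

func NewHistory() *History {
	return &History{versions: make(map[docKey][]Version)}
}

// Record appends a new version when the content differs from the latest
// recorded one. It returns the current version and whether it was created.
func (h *History) Record(tenantID, docID string, content []byte) (Version, bool) {
	key := docKey{tenantID: tenantID, docID: docID}
	sum := sha256.Sum256(content)
	digest := hex.EncodeToString(sum[:])

	past := h.versions[key]
	if n := len(past); n > 0 && past[n-1].SHA256 == digest {
		return past[n-1], false // unchanged content, no new revision
	}
	v := Version{Number: len(past) + 1, SHA256: digest}
	h.versions[key] = append(past, v)
	return v, true
}

func main() {
	h := NewHistory()
	for _, body := range []string{"draft", "draft", "final"} {
		v, created := h.Record("tenant-a", "doc-1", []byte(body))
		fmt.Printf("version=%d created=%t\n", v.Number, created)
	}
}
```

Keying on tenant and document together keeps each tenant's history isolated, which matches the per-tenant scoping described for the index.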
Each provider implements a common interface and is developed independently:
- Google Drive
- OneDrive / SharePoint
- Dropbox
- Amazon S3
- Google Cloud Storage
- Azure Blob Storage
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, ODT, RTF, TXT, Markdown |
| Spreadsheets | XLSX, XLS, CSV, ODS |
| Presentations | PPTX, PPT, ODP |
| Images (OCR) | PNG, JPEG, TIFF, BMP, WebP |
| Scanned Documents | PDF (image-only), multi-page TIFF |
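
One way to route files to format-specific extractors is to classify them by extension first. The sketch below is illustrative only: the `Category` constants and `Classify` function are hypothetical, and the extension list is an assumption derived from the table above. Extension alone cannot distinguish an image-only PDF from a text PDF, so that check would have to inspect the file contents.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Category mirrors the rows of the supported-formats table.
type Category string

const (
	Documents     Category = "documents"
	Spreadsheets  Category = "spreadsheets"
	Presentations Category = "presentations"
	Images        Category = "images-ocr"
	Unsupported   Category = "unsupported"
)

// byExt maps lowercase file extensions to a category (assumed mapping).
var byExt = map[string]Category{
	".pdf": Documents, ".docx": Documents, ".doc": Documents, ".odt": Documents,
	".rtf": Documents, ".txt": Documents, ".md": Documents,
	".xlsx": Spreadsheets, ".xls": Spreadsheets, ".csv": Spreadsheets, ".ods": Spreadsheets,
	".pptx": Presentations, ".ppt": Presentations, ".odp": Presentations,
	".png": Images, ".jpg": Images, ".jpeg": Images, ".tif": Images,
	".tiff": Images, ".bmp": Images, ".webp": Images,
}

// Classify returns the category for a file name, ignoring extension case.
func Classify(name string) Category {
	ext := strings.ToLower(filepath.Ext(name))
	if c, ok := byExt[ext]; ok {
		return c
	}
	return Unsupported
}

func main() {
	for _, name := range []string{"Report.PDF", "budget.xlsx", "scan.tiff", "archive.zip"} {
		fmt.Printf("%s -> %s\n", name, Classify(name))
	}
}
```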
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Provider   │────▶│  Ingestion   │────▶│  Enrichment  │
│   Watchers   │     │   Pipeline   │     │   Pipeline   │
│    (flux)    │     │              │     │ (zyn + vex)  │
└──────────────┘     └──────┬───────┘     └──────┬───────┘
                            │                    │
                     ┌──────▼───────┐     ┌──────▼───────┐
                     │  Tesseract   │     │  OpenSearch  │
                     │   Sidecar    │     │   Cluster    │
                     │    (gRPC)    │     │              │
                     └──────────────┘     └──────────────┘
- Provider Watchers — flux capacitors monitor registered files/folders across cloud storage providers
- Ingestion Pipeline — Extracts content, normalises formats, manages document versions; delegates OCR to a Tesseract sidecar via gRPC
- Enrichment Pipeline — Generates AI summaries (zyn) and vector embeddings (vex) per document version
- OpenSearch — Full-text and semantic search index per tenant
The system is designed to scale horizontally: pipelines are queue-driven and stateless, so ingestion, OCR, and enrichment workloads can each be scaled independently.
| Component | Technology |
|---|---|
| Language | Go |
| Framework | sum |
| Configuration | flux |
| LLM Orchestration | zyn |
| Embeddings | vex |
| OCR | Tesseract (gRPC sidecar) |
| Search & Storage | OpenSearch |
| Database | PostgreSQL |
| Object Storage | MinIO (dev) / S3-compatible (prod) |
| Observability | OpenTelemetry |
make dev # Start local infrastructure
make run # Run the application
make test # Run tests
make check # Run tests + lint

License: MIT