Standalone OCR-first platform for digitizing Arabic books (starting with Lisan al-Arab) and serving citation-grade search APIs.
Status: active scaffold. API, worker, and admin apps are in place; OCR providers are currently stubbed.
Arabic language corpus, Arabic OCR, Lisan al-Arab, lexicography, digital humanities, computational linguistics, citation search, full-text search, Fastify, Next.js, BullMQ, PostgreSQL, MinIO, Supabase
- OCR ingestion pipeline with retries and a dead-letter queue (DLQ).
- Canonical citation chain: edition -> volume -> page -> line -> passage -> token -> bbox.
- Public versioned API (/api/v1/*) with API keys.
- Admin QA app for reviewer workflows.
- Monthly immutable release workflow.
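The citation chain above can be pictured as a single record that carries every level from edition down to bounding box. The following is a minimal sketch only: the `Citation` and `BBox` types, their field names, and `formatCitation` are hypothetical and are not taken from packages/core.

```typescript
// Hypothetical shapes; the real types in packages/core may differ.
interface BBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface Citation {
  edition: string; // edition identifier
  volume: number;
  page: number;
  line: number;
  passage: string; // passage identifier
  token: number;   // token index within the passage
  bbox: BBox;      // token location on the page image
}

// Render the chain in the same order the README lists it.
function formatCitation(c: Citation): string {
  const { x, y, width, height } = c.bbox;
  return [
    `edition:${c.edition}`,
    `volume:${c.volume}`,
    `page:${c.page}`,
    `line:${c.line}`,
    `passage:${c.passage}`,
    `token:${c.token}`,
    `bbox:${x},${y},${width},${height}`,
  ].join(" -> ");
}
```

Keeping every level in one record means a search hit can be traced back to the exact region of the scanned page it came from.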
apps/
  api/             # Fastify API
  admin/           # Next.js reviewer UI
workers/
  ingest/          # BullMQ jobs + Bull Board
packages/
  core/            # Shared types + normalization
  sdk-js/          # Public JS client
infra/
  db/migrations/   # SQL migrations
  fly/             # Fly.io configs
Use the full setup guide: docs/SETUP.md
Minimal quick start:
- Copy .env.example to .env (local scripts use .env; Docker Compose currently uses .env.example).
- Install dependencies: pnpm install.
- Run the migration: pnpm db:migrate.
- Start services: pnpm dev or docker compose up --build.
- Setup: docs/SETUP.md
- Deployment: docs/DEPLOYMENT.md
- API: docs/API.md
- Architecture: docs/ARCHITECTURE.md
- Governance: docs/GOVERNANCE.md
- Search gold set: docs/GOLD_SET.md
- GET /api/v1/books
- GET /api/v1/books/:bookId/editions
- GET /api/v1/books/:bookId/search?q=...
- GET /api/v1/passages/:passageId
- GET /api/v1/pages/:pageId
- GET /api/v1/pages/:pageId/image
- GET /api/v1/releases
- POST /api/v1/auth/keys
- POST /api/v1/auth/keys/:keyId/rotate
- DELETE /api/v1/auth/keys/:keyId
- POST /api/v1/ingest/jobs (admin)
- GET /api/v1/ingest/jobs/:jobId (admin)
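As an illustration of calling the search endpoint listed above, here is a small client sketch. Several details are assumptions, since this README does not state them: the base URL, the `x-api-key` header name, and the response shape (left as `unknown`). See docs/API.md for the actual contract.

```typescript
// Build the search URL for a book; the query is URL-encoded via URLSearchParams.
function buildSearchUrl(baseUrl: string, bookId: string, query: string): string {
  const url = new URL(`/api/v1/books/${encodeURIComponent(bookId)}/search`, baseUrl);
  url.searchParams.set("q", query);
  return url.toString();
}

// Assumption: the API key travels in an "x-api-key" header.
async function searchBook(
  baseUrl: string,
  bookId: string,
  query: string,
  apiKey: string,
): Promise<unknown> {
  const res = await fetch(buildSearchUrl(baseUrl, bookId, query), {
    headers: { "x-api-key": apiKey },
  });
  if (!res.ok) {
    throw new Error(`Search failed with HTTP ${res.status}`);
  }
  return res.json();
}
```

A call such as `searchBook("http://localhost:3000", "some-book-id", "كتاب", key)` would then return the parsed JSON body; the host, port, and book ID here are placeholders.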
- Install deps: pnpm install
- Dev all services: pnpm dev
- Build: pnpm build
- Typecheck: pnpm typecheck
- Test: pnpm test
- Admin UI: Supabase Auth session (or x-dev-user-id in non-production).
- Public API: custom API keys (hashed at rest) via Fastify middleware.
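To make "hashed at rest" concrete, the sketch below shows one common way to store and verify API keys: only a digest is persisted, and the raw key is shown to the caller once. The use of SHA-256, the key length, and the function names are assumptions for illustration; this README does not say which scheme the API uses.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

// Assumption: SHA-256 hex digest as the stored form of a key.
function hashApiKey(rawKey: string): string {
  return createHash("sha256").update(rawKey, "utf8").digest("hex");
}

// Generate a random key; persist only keyHash, return rawKey to the caller once.
function generateApiKey(): { rawKey: string; keyHash: string } {
  const rawKey = randomBytes(32).toString("base64url");
  return { rawKey, keyHash: hashApiKey(rawKey) };
}

// Compare digests in constant time to avoid leaking match position via timing.
function verifyApiKey(presentedKey: string, storedHash: string): boolean {
  const presented = Buffer.from(hashApiKey(presentedKey), "hex");
  const stored = Buffer.from(storedHash, "hex");
  return presented.length === stored.length && timingSafeEqual(presented, stored);
}
```

Because the key is high-entropy random data rather than a user-chosen password, a fast hash is generally considered acceptable in this kind of design; a middleware would call `verifyApiKey` against the stored digest on each request.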
- Exact normalized token match.
- Postgres full-text search on normalized passages.
- pg_trgm fallback for OCR/noise tolerance.
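One way to read the three strategies above is as a cascade: normalize the query, try the strictest match first, and fall back to looser matching only when nothing is found. The sketch below follows that reading. The normalization rules (stripping diacritics and tatweel, unifying alef variants), the tier names, and `tieredSearch` are all assumptions; the actual rules live in packages/core and may differ.

```typescript
type Tier = "exact_token" | "full_text" | "trigram";

// Assumed normalization: strip harakat and dagger alef, strip tatweel,
// and fold alef with madda/hamza into bare alef.
function normalizeArabic(text: string): string {
  return text
    .replace(/[\u064B-\u0652\u0670]/g, "")
    .replace(/\u0640/g, "")
    .replace(/[\u0622\u0623\u0625]/g, "\u0627")
    .trim();
}

// Each tier is a caller-supplied function returning matching passage IDs,
// e.g. a SQL query against token, tsvector, or pg_trgm indexes.
type SearchFn = (normalizedQuery: string) => Promise<string[]>;

async function tieredSearch(
  query: string,
  tiers: Array<{ tier: Tier; run: SearchFn }>,
): Promise<{ tier: Tier | null; passageIds: string[] }> {
  const normalized = normalizeArabic(query);
  for (const { tier, run } of tiers) {
    const passageIds = await run(normalized);
    if (passageIds.length > 0) {
      return { tier, passageIds }; // stop at the first tier with hits
    }
  }
  return { tier: null, passageIds: [] };
}
```

Returning the tier alongside the results lets a caller show how confident a match is, which matters when trigram hits may stem from OCR noise.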
- Curated maintainer-only ingestion.
- Monthly release manifests plus checksums.
- No publish without provenance metadata and QA pass.
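The "release manifests plus checksums" rule can be illustrated with a small manifest builder. This is a sketch under assumptions: the manifest fields, the choice of SHA-256, and the function names are hypothetical, and docs/GOVERNANCE.md remains the authority on the real format.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest entry: one record per released artifact.
interface ManifestEntry {
  path: string;
  sha256: string;
  bytes: number;
}

function sha256Hex(data: string | Buffer): string {
  return createHash("sha256").update(data).digest("hex");
}

// Build a manifest from in-memory file contents, sorted by path so that
// the same inputs always produce the same manifest.
function buildManifest(
  releaseId: string,
  files: Record<string, string>,
): { releaseId: string; entries: ManifestEntry[] } {
  const entries = Object.entries(files)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([path, content]) => ({
      path,
      sha256: sha256Hex(content),
      bytes: Buffer.byteLength(content, "utf8"),
    }));
  return { releaseId, entries };
}
```

Consumers can recompute each checksum after download and compare it with the manifest, which is what makes a published release verifiably immutable.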
- Contributing guide: CONTRIBUTING.md
- Security policy: SECURITY.md
This repository is currently UNLICENSED.