A simple Python pipeline for processing Norwegian legal documents from Lovdata into a searchable vector database.
# Install dependencies
make install
# Run pipeline (process all changed files)
uv run lg process --storage jsonl
One command processes all changed files. Each file completes fully (parse → chunk → embed → index) before the pipeline moves to the next. State tracking uses simple JSON.
New users: Start with Quick Start above, then read the complete User Guide
Developers: See Development Guide for architecture and extending the pipeline
Contributors: Read CONTRIBUTING.md for contribution guidelines and PR process
Need help? Check the Troubleshooting section in the User Guide
- User Guide - Complete manual covering installation, configuration, usage, and troubleshooting
- Development Guide - Architecture reference, extension guide, and testing
- Contributing Guide - Contribution guidelines, code standards, and PR process
Processes Norwegian legal documents into searchable vectors:
- Sync - Download from Lovdata (via lovlig library)
- Parse & Chunk - Extract and split articles into semantic chunks
- Embed - Generate vectors via OpenAI API
- Index - Store in JSONL files or ChromaDB
Atomic Processing: Each file completes all steps before the next file starts. If processing fails, the file is marked as failed and retried on the next run.
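The per-file behaviour described above can be sketched roughly as follows. This is an illustration only, not the project's actual code: the function name `process_all`, the `"done"`/`"failed"` status strings, and the layout of the state file are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Any, Callable


def process_all(
    files: list[str],
    steps: list[Callable[[Any], Any]],
    state_path: Path,
) -> dict[str, str]:
    """Run every step for one file before touching the next (hypothetical sketch)."""
    # Load the previous run's state, if any.
    state: dict[str, str] = (
        json.loads(state_path.read_text()) if state_path.exists() else {}
    )
    for name in files:
        if state.get(name) == "done":
            continue  # already processed; failed files fall through and are retried
        try:
            data: Any = name
            for step in steps:  # e.g. parse -> chunk -> embed -> index
                data = step(data)
            state[name] = "done"
        except Exception:
            # Any failing step marks the whole file as failed (all-or-nothing).
            state[name] = "failed"
        # Persist after each file so an interrupted run keeps its progress.
        state_path.write_text(json.dumps(state, indent=2))
    return state
```

Because a file is only marked `"done"` after its last step succeeds, a later run picks up every failed file again without extra bookkeeping.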
The pipeline follows a clean architecture design, with dependency injection and protocol-based interfaces for extensibility. See DEVELOPMENT.md for architecture details.
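As a rough illustration of what protocol-based interfaces with dependency injection can look like in Python, consider the sketch below. The names `ChunkStore`, `InMemoryStore`, and `Indexer` are hypothetical and are not taken from the project; DEVELOPMENT.md describes the real interfaces.

```python
from typing import Protocol


class ChunkStore(Protocol):
    """Hypothetical storage interface; any class with a matching add() satisfies it."""

    def add(self, chunk_id: str, vector: list[float], text: str) -> None: ...


class InMemoryStore:
    """Toy backend for the example; a JSONL or ChromaDB backend would have the same shape."""

    def __init__(self) -> None:
        self.rows: dict[str, tuple[list[float], str]] = {}

    def add(self, chunk_id: str, vector: list[float], text: str) -> None:
        self.rows[chunk_id] = (vector, text)


class Indexer:
    """Receives its store from the caller instead of constructing one itself."""

    def __init__(self, store: ChunkStore) -> None:
        self.store = store

    def index(self, chunks: list[tuple[str, list[float], str]]) -> int:
        for chunk_id, vector, text in chunks:
            self.store.add(chunk_id, vector, text)
        return len(chunks)
```

Since `Indexer` depends only on the protocol, swapping storage backends means passing a different object to its constructor, and tests can inject a small in-memory store.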
Processing:
- Atomic per-file execution (all-or-nothing)
- Automatic change detection and cleanup
- Simple JSON state tracking
- Single-command operation
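The README does not say how changes are detected. One common approach, shown here purely as an assumption, is to compare content hashes against the hashes recorded on the previous run. The function name `detect_changes` and the shape of its arguments are hypothetical.

```python
import hashlib


def detect_changes(
    current: dict[str, bytes],
    previous: dict[str, str],
) -> tuple[list[str], list[str], dict[str, str]]:
    """Compare current file contents with previously stored SHA-256 hashes.

    Returns (changed_or_new, removed, new_hashes). Hypothetical sketch only.
    """
    hashes = {name: hashlib.sha256(data).hexdigest() for name, data in current.items()}
    # New files have no previous hash, so they count as changed too.
    changed = sorted(name for name, h in hashes.items() if previous.get(name) != h)
    # Files that disappeared are candidates for cleanup in the index.
    removed = sorted(name for name in previous if name not in hashes)
    return changed, removed, hashes
```

Under this scheme, the `changed` list would feed the processing loop and the `removed` list would drive cleanup of stale entries.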
Storage:
- JSONL files (simple, portable) or ChromaDB (production-ready)
- Automatic migration between storage types
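For readers unfamiliar with JSONL, the sketch below shows the general idea of storing one JSON record per line. The record fields (`id`, `text`, `vector`) and the helper names are assumptions for illustration; the pipeline's actual schema may differ.

```python
import json
from pathlib import Path


def append_chunks(path: Path, chunks: list[dict]) -> None:
    """Append one JSON object per line (hypothetical helper)."""
    with path.open("a", encoding="utf-8") as fh:
        for chunk in chunks:
            # ensure_ascii=False keeps Norwegian characters readable in the file.
            fh.write(json.dumps(chunk, ensure_ascii=False) + "\n")


def read_chunks(path: Path) -> list[dict]:
    """Read every non-empty line back into a dict."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending line by line keeps the file portable and easy to inspect with ordinary text tools, which fits the "simple, portable" description above.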
Development:
- Quality tools (Ruff, Pylint, pre-commit hooks)
- Full test coverage with pytest
- Dev container for reproducible environment
- Python ≥ 3.11
- OpenAI API key (for embeddings)
MIT License
