lovdata-pipeline

A simple Python pipeline for processing Norwegian legal documents from Lovdata into a searchable vector database.

Quick Start

# Install dependencies
make install

# Run pipeline (process all changed files)
uv run lg process --storage jsonl

One command processes all changed files. Each file completes fully (parse → chunk → embed → index) before the pipeline moves to the next one. Progress is tracked in simple JSON state.

Documentation

New users: Start with Quick Start above, then read the complete User Guide

Developers: See Development Guide for architecture and extending the pipeline

Contributors: Read CONTRIBUTING.md for contribution guidelines and PR process

Need help? Check the Troubleshooting section in the User Guide

Available Documentation

User Guide - Complete manual covering installation, configuration, usage, and troubleshooting
Development Guide - Architecture reference, extension guide, and testing
Contributing Guide - Contribution guidelines, code standards, and PR process

What It Does

Processes Norwegian legal documents into searchable vectors:

Sync - Download from Lovdata (via lovlig library)
Parse & Chunk - Extract and split articles into semantic chunks
Embed - Generate vectors via OpenAI API
Index - Store in JSONL files or ChromaDB

Atomic Processing: Each file completes all steps before the next file starts. If processing fails, the file is marked as failed and retried on the next run.
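The per-file behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names (`load_state`, `save_state`, `run`), the status strings, and the state layout are all assumptions, and `process_file` stands in for the parse → chunk → embed → index steps.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_state(state_path: Path) -> dict[str, str]:
    """Return the saved {file: status} mapping, or an empty one on first run."""
    if state_path.exists():
        return json.loads(state_path.read_text(encoding="utf-8"))
    return {}


def save_state(state_path: Path, state: dict[str, str]) -> None:
    """Write the state to a temporary file, then swap it into place."""
    tmp = state_path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(state_path)


def run(
    files: Iterable[str],
    state_path: Path,
    process_file: Callable[[str], None],  # hypothetical: runs all steps for one file
) -> dict[str, str]:
    """Process files one at a time; anything not marked "done" is tried again."""
    state = load_state(state_path)
    for name in files:
        if state.get(name) == "done":
            continue
        try:
            process_file(name)  # parse -> chunk -> embed -> index
            state[name] = "done"
        except Exception:
            state[name] = "failed"  # picked up again on the next run
        save_state(state_path, state)  # persist after every file
    return state
```

Saving the state after each file means an interrupted run loses at most the file that was in progress.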

Architecture

The pipeline follows a clean-architecture design, using dependency injection and protocol-based interfaces for extensibility. See DEVELOPMENT.md for architecture details.
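For readers unfamiliar with the pattern, here is a small sketch of what a protocol-based interface combined with dependency injection can look like in Python. The names `VectorStore`, `InMemoryStore`, and `Indexer` are hypothetical and chosen only for illustration; the project's real interfaces are described in DEVELOPMENT.md.

```python
from dataclasses import dataclass, field
from typing import Protocol


class VectorStore(Protocol):
    """Hypothetical interface: anything with this method can act as a store."""

    def add(self, chunk_id: str, vector: list[float]) -> None: ...


@dataclass
class InMemoryStore:
    """Toy backend that satisfies VectorStore structurally (no inheritance)."""

    items: dict[str, list[float]] = field(default_factory=dict)

    def add(self, chunk_id: str, vector: list[float]) -> None:
        self.items[chunk_id] = vector


@dataclass
class Indexer:
    """Receives its store from the caller instead of constructing one."""

    store: VectorStore

    def index(self, vectors: dict[str, list[float]]) -> int:
        for chunk_id, vector in vectors.items():
            self.store.add(chunk_id, vector)
        return len(vectors)


store = InMemoryStore()
indexed = Indexer(store).index({"doc-1#0": [0.1, 0.2], "doc-1#1": [0.3, 0.4]})
```

Because `Indexer` depends only on the protocol, a different backend can be swapped in without changing the indexing code, and tests can pass in a lightweight fake.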

Key Features

Processing:

Atomic per-file execution (all-or-nothing)
Automatic change detection and cleanup
Simple JSON state tracking
Single-command operation
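One common way to implement change detection and cleanup is to compare content hashes between runs. The sketch below assumes that approach; the README does not say how the pipeline detects changes (it may be delegated to the lovlig library), and `fingerprint` and `detect_changes` are hypothetical names.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Content hash used to tell whether a file changed between runs."""
    return hashlib.sha256(data).hexdigest()


def detect_changes(
    previous: dict[str, str],  # file name -> hash recorded on the last run
    current: dict[str, bytes],  # file name -> content seen on this run
) -> tuple[list[str], list[str]]:
    """Return (files to process, files whose old data should be cleaned up)."""
    changed = sorted(
        name for name, data in current.items()
        if previous.get(name) != fingerprint(data)  # new or modified
    )
    removed = sorted(name for name in previous if name not in current)
    return changed, removed
```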

Storage:

JSONL files (simple, portable) or ChromaDB (production-ready)
Automatic migration between storage types
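To illustrate why JSONL is described as simple and portable, here is a minimal sketch of writing and reading chunk records as one JSON object per line. The record fields and the helper names `append_chunks` and `read_chunks` are assumptions for illustration, not the pipeline's actual schema or API.

```python
import json
from pathlib import Path


def append_chunks(path: Path, records: list[dict]) -> None:
    """Append one JSON object per line; existing lines are left untouched."""
    with path.open("a", encoding="utf-8") as fh:
        for record in records:
            # ensure_ascii=False keeps Norwegian characters (æ, ø, å) readable
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_chunks(path: Path) -> list[dict]:
    """Load every non-empty line back into a dictionary."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Each line is independent, so the file can be inspected with ordinary text tools and extended without rewriting earlier records.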

Development:

Quality tools (Ruff, Pylint, pre-commit hooks)
Full test coverage with pytest
Dev container for reproducible environment

Requirements

Python ≥ 3.11
OpenAI API key (for embeddings)

License

MIT License
