Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
113 changes: 113 additions & 0 deletions ARCHITECTURE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,113 @@
# Architecture

This document describes the high-level design of the `data-processing` repository, focused on the Kubeflow Pipelines (KFP) subsystem.

## System Purpose

This repository provides reference data-processing pipelines for Open Data Hub / Red Hat OpenShift AI. It converts documents (primarily PDFs) into structured formats (JSON, Markdown) and optionally chunks them for RAG workflows, using the [Docling](https://docling-project.github.io/docling/) toolkit.

## High-Level Data Flow

```
┌─────────────────────────────────────────────────────┐
│ Kubeflow Pipeline (KFP) │
│ │
PDF Source │ ┌────────────┐ ┌──────────────┐ ┌───────────┐ │ Output
(HTTP/S3) ────────────►│ │ import_pdfs│──►│create_splits │──►│ ParallelFor│ │──► JSON + MD
│ └────────────┘ └──────────────┘ │ │ │ (+ JSONL
│ │ ┌───────┐ │ │ chunks)
│ ┌──────────────────┐ │ │convert│ │ │
│ │download_models │────────────────►│ │ ──► │ │ │
│ └──────────────────┘ │ │ chunk │ │ │
│ │ └───────┘ │ │
│ └───────────┘ │
└─────────────────────────────────────────────────────┘
```

### Pipeline Steps

1. **`import_pdfs`** — Downloads PDFs from HTTP URLs or S3-compatible storage
2. **`create_pdf_splits`** — Divides PDF list into N splits for parallel processing
3. **`download_docling_models`** — Downloads ML models (layout, tableformer, or VLM)
4. **`docling_convert_*`** — Converts PDFs to Docling JSON + Markdown (runs in parallel via `ParallelFor`)
5. **`docling_chunk`** *(optional)* — Splits converted documents into semantic chunks (JSONL output)

## KFP Pipeline Architecture (`kubeflow-pipelines/`)

Two pipeline variants exist:

| Pipeline | Conversion Strategy | Models Used |
|---|---|---|
| **Standard** (`docling-standard/`) | Traditional OCR + layout analysis | Layout model, TableFormer |
| **VLM** (`docling-vlm/`) | Vision-language model inference | SmolVLM, SmolDocling, or remote API |

**Key architectural constraint**: KFP `@dsl.component` functions are serialized by the KFP SDK and executed in isolated containers. This means:
- All imports must be inside the function body (not at module top)
- Components cannot import from project-level modules at runtime
- Each component is self-contained with its own dependencies
- The `common/` module is only used at **compile time** to define shared component functions

See [ADR-001](docs/adr/001-kfp-component-serialization.md) for details.

**Component ownership**:
- `common/components.py` — Shared components used by both pipelines: `import_pdfs`, `create_pdf_splits`, `download_docling_models`, `docling_chunk`
- `docling-standard/standard_components.py` — Standard-specific: `docling_convert_standard`
- `docling-vlm/vlm_components.py` — VLM-specific: `docling_convert_vlm`

**Compiled YAML**: Each pipeline has a `*_compiled.yaml` file that is the KFP-deployable artifact. These are **generated code** — produced by running `python *_convert_pipeline.py`. CI enforces that committed YAML matches what the compiler produces.

## Container Images

Two base images are used across all KFP components, configurable via environment variables:

| Variable | Default | Purpose |
|---|---|---|
| `PYTHON_BASE_IMAGE` | `registry.access.redhat.com/ubi9/python-311:9.6-*` | Lightweight components (import, split) |
| `DOCLING_BASE_IMAGE` | `quay.io/fabianofranz/docling-ubi9:2.54.0` | Components needing Docling + ML models |

Defined in `kubeflow-pipelines/common/constants.py`.

## Secrets Architecture

KFP pipelines use Kubernetes Secrets mounted as volumes (not environment variables):

| Secret Name | Mount Path | Used When | Keys |
|---|---|---|---|
| `data-processing-docling-pipeline` | `/mnt/secrets` | `pdf_from_s3=True` | `S3_ENDPOINT_URL`, `S3_ACCESS_KEY`, `S3_SECRET_KEY`, `S3_BUCKET`, `S3_PREFIX` |
| `data-processing-docling-pipeline` | `/mnt/secrets` | `remote_model_enabled=True` (VLM only) | `REMOTE_MODEL_ENDPOINT_URL`, `REMOTE_MODEL_API_KEY`, `REMOTE_MODEL_NAME` |

Both use the same secret name. Components read individual keys as files from the mount path.

## Key Design Decisions

Documented as ADRs in `docs/adr/`:

- [ADR-001: KFP Component Serialization](docs/adr/001-kfp-component-serialization.md) — Why imports are inside function bodies
- [ADR-002: Compiled YAML as Source of Truth](docs/adr/002-compiled-yaml-source-of-truth.md) — Why generated YAML is committed
- [ADR-003: Shared Secret Name](docs/adr/003-shared-secret-name.md) — Why S3 and VLM configs share one secret

## Directory Map

```
data-processing/
├── kubeflow-pipelines/
│ ├── common/ ← Shared KFP components + constants
│ │ ├── components.py ← import_pdfs, create_pdf_splits, download_docling_models, docling_chunk
│ │ └── constants.py ← Base image defaults
│ ├── docling-standard/ ← Standard conversion pipeline
│ │ ├── standard_components.py ← docling_convert_standard
│ │ ├── standard_convert_pipeline.py ← Pipeline definition + compiler
│ │ ├── standard_convert_pipeline_compiled.yaml ← GENERATED — do not edit
│ │ └── local_run.py ← Docker-based local testing
│ └── docling-vlm/ ← VLM conversion pipeline
│ ├── vlm_components.py ← docling_convert_vlm
│ ├── vlm_convert_pipeline.py ← Pipeline definition + compiler
│ ├── vlm_convert_pipeline_compiled.yaml ← GENERATED — do not edit
│ └── local_run.py ← Docker-based local testing
├── tests/ ← Pytest suite
├── docs/
│ └── adr/ ← Architecture Decision Records
├── Makefile ← Build/test/format targets
├── pyproject.toml ← Ruff + pytest config
└── .pre-commit-config.yaml ← Pre-commit hooks
```
40 changes: 40 additions & 0 deletions docs/adr/001-kfp-component-serialization.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# ADR-001: KFP Component Serialization

## Status

Accepted

## Context

Kubeflow Pipelines (KFP) SDK v2 serializes `@dsl.component`-decorated functions into self-contained container steps. The KFP compiler inspects the function source code, extracts it, and packages it to run inside a container image at runtime.

This means the function must be entirely self-contained — any imports, helper functions, or constants it uses must either:
1. Be defined inside the function body
2. Come from packages listed in `packages_to_install`
3. Already exist in the `base_image`

Project-level modules (like a shared `logging_config.py` or utility library) are **not available at runtime** because the container only has the serialized function code, not the full repository.

## Decision

All KFP component functions place their imports inside the function body, not at module top level.

```python
@dsl.component(base_image=DOCLING_BASE_IMAGE)
def docling_convert_standard(...):
from pathlib import Path # Inside function
from docling.document_converter import ... # Inside function
...
```

The `# pylint: disable=import-outside-toplevel` comments throughout `components.py`, `standard_components.py`, and `vlm_components.py` exist because of this constraint, not as a style choice.

The `common/` module (`components.py`, `constants.py`) is used at **compile time only** — when `python standard_convert_pipeline.py` runs, it imports the component functions to build the pipeline graph and compile to YAML. At runtime in Kubernetes, each component runs in isolation.

## Consequences

- **Positive**: Components are fully portable — the compiled YAML works on any KFP instance without needing the source repository
- **Positive**: Dependencies are explicit per component via `base_image` and `packages_to_install`
- **Negative**: Code cannot be shared between components at runtime (some duplication is unavoidable)
- **Negative**: Import statements look unusual (inside functions) and trigger linter warnings
- **Constraint**: Any shared logging, error handling, or utility code must be duplicated in each component function or provided by the base image
33 changes: 33 additions & 0 deletions docs/adr/002-compiled-yaml-source-of-truth.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
# ADR-002: Compiled YAML as Source of Truth

## Status

Accepted

## Context

KFP pipelines are defined in Python (`*_convert_pipeline.py`) but deployed as compiled YAML files (`*_compiled.yaml`). Users install pipelines by uploading the compiled YAML to a KFP instance — they never run the Python source directly in production.

Two approaches exist for managing compiled YAML:
1. **Commit the compiled YAML** alongside source code
2. **Generate YAML in CI only** and publish as a release artifact

## Decision

Compiled YAML files are committed to the repository and treated as checked-in artifacts.

CI enforces consistency: the `compile-kfp.yml` workflow recompiles from source and diffs against the committed YAML. If they differ, CI fails with instructions to regenerate locally.

```
standard_convert_pipeline_compiled.yaml ← committed, GENERATED
vlm_convert_pipeline_compiled.yaml ← committed, GENERATED
```

## Consequences

- **Positive**: Users can install pipelines directly from the repository without building from source — just download the YAML via URL
- **Positive**: PR reviewers can see YAML changes alongside Python changes
- **Positive**: The `main` branch always has a deployable pipeline
- **Negative**: Contributors must remember to regenerate YAML after pipeline code changes (`python *_convert_pipeline.py`)
- **Negative**: PRs that modify pipeline code will have larger diffs (Python changes + YAML changes)
- **Mitigation**: CI catches stale YAML and provides the exact regeneration command in the failure message
31 changes: 31 additions & 0 deletions docs/adr/003-shared-secret-name.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
# ADR-003: Shared Secret Name for S3 and VLM Configuration

## Status

Accepted

## Context

KFP pipelines need credentials for two optional features:
1. **S3 access** — downloading PDFs from S3-compatible storage (`import_pdfs` component)
2. **Remote VLM access** — calling a remote vision-language model API (`docling_convert_vlm` component, VLM pipeline only)

These could use separate Kubernetes Secrets (e.g., `data-processing-s3` and `data-processing-vlm`) or a single shared secret.

## Decision

Both features use a single Kubernetes Secret named `data-processing-docling-pipeline`, mounted at `/mnt/secrets`. The secret is mounted as `optional: true` so pipelines work without it when using HTTP sources and local models.

S3 keys: `S3_ENDPOINT_URL`, `S3_ACCESS_KEY`, `S3_SECRET_KEY`, `S3_BUCKET`, `S3_PREFIX`
VLM keys: `REMOTE_MODEL_ENDPOINT_URL`, `REMOTE_MODEL_API_KEY`, `REMOTE_MODEL_NAME`

Components validate only the keys they need — `import_pdfs` checks for S3 keys when `from_s3=True`, and `docling_convert_vlm` checks for VLM keys when `remote_model_enabled=True`.

## Consequences

- **Positive**: Users manage one secret instead of two — simpler setup
- **Positive**: Single `kubernetes.use_secret_as_volume()` call per component
- **Positive**: Keys have distinct prefixes so there's no collision
- **Negative**: The secret can grow large if both S3 and VLM are configured
- **Negative**: Changing S3 credentials requires updating a secret that also contains VLM credentials (and vice versa)
- **Negative**: RBAC cannot be scoped separately to S3 vs VLM access
43 changes: 43 additions & 0 deletions docs/adr/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
# Architecture Decision Records (ADRs)

This directory contains Architecture Decision Records for the data-processing repository.

ADRs document significant technical decisions with their context, rationale, and consequences. They help contributors (and AI agents) understand *why* the code is structured the way it is.

## Index

| ADR | Title | Status |
|---|---|---|
| [001](001-kfp-component-serialization.md) | KFP Component Serialization | Accepted |
| [002](002-compiled-yaml-source-of-truth.md) | Compiled YAML as Source of Truth | Accepted |
| [003](003-shared-secret-name.md) | Shared Secret Name for S3 and VLM Configuration | Accepted |

## Creating a New ADR

1. Copy the template below
2. Name the file `NNN-short-title.md` (e.g., `004-new-decision.md`)
3. Fill in each section
4. Add an entry to the index table above
5. Submit as part of your PR

### Template

```markdown
# ADR-NNN: Title

## Status

Proposed | Accepted | Deprecated | Superseded by [ADR-NNN](NNN-title.md)

## Context

What is the issue that we're seeing that motivates this decision?

## Decision

What is the change that we're making?

## Consequences

What becomes easier or harder because of this change?
```