From 0b4e341a983d81f4bfdbb5cbc9a541db4f4c45c6 Mon Sep 17 00:00:00 2001 From: Ambient Code Bot Date: Thu, 26 Mar 2026 19:00:49 +0000 Subject: [PATCH] Add ARCHITECTURE.md and Architecture Decision Records Document KFP pipeline architecture including data flow, component ownership, container images, secrets architecture, and directory map. ADRs cover three key design decisions: - ADR-001: KFP component serialization (inline imports) - ADR-002: Compiled YAML as source of truth - ADR-003: Shared secret name for S3 and VLM configs Co-Authored-By: Claude Opus 4.6 --- ARCHITECTURE.md | 113 ++++++++++++++++++ docs/adr/001-kfp-component-serialization.md | 40 +++++++ docs/adr/002-compiled-yaml-source-of-truth.md | 33 +++++ docs/adr/003-shared-secret-name.md | 31 +++++ docs/adr/README.md | 43 +++++++ 5 files changed, 260 insertions(+) create mode 100644 ARCHITECTURE.md create mode 100644 docs/adr/001-kfp-component-serialization.md create mode 100644 docs/adr/002-compiled-yaml-source-of-truth.md create mode 100644 docs/adr/003-shared-secret-name.md create mode 100644 docs/adr/README.md diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md new file mode 100644 index 00000000..4e44cb97 --- /dev/null +++ b/ARCHITECTURE.md @@ -0,0 +1,113 @@ +# Architecture + +This document describes the high-level design of the `data-processing` repository, focused on the Kubeflow Pipelines (KFP) subsystem. + +## System Purpose + +This repository provides reference data-processing pipelines for Open Data Hub / Red Hat OpenShift AI. It converts documents (primarily PDFs) into structured formats (JSON, Markdown) and optionally chunks them for RAG workflows, using the [Docling](https://docling-project.github.io/docling/) toolkit. + +## High-Level Data Flow + +``` + ┌─────────────────────────────────────────────────────┐ + │ Kubeflow Pipeline (KFP) │ + │ │ + PDF Source │ ┌────────────┐ ┌──────────────┐ ┌───────────┐ │ Output + (HTTP/S3) ────────────►│ │ import_pdfs│──►│create_splits │──►│ ParallelFor│ │──► JSON + MD + │ └────────────┘ └──────────────┘ │ │ │ (+ JSONL + │ │ ┌───────┐ │ │ chunks) + │ ┌──────────────────┐ │ │convert│ │ │ + │ │download_models │────────────────►│ │ ──► │ │ │ + │ └──────────────────┘ │ │ chunk │ │ │ + │ │ └───────┘ │ │ + │ └───────────┘ │ + └─────────────────────────────────────────────────────┘ +``` + +### Pipeline Steps + +1. **`import_pdfs`** — Downloads PDFs from HTTP URLs or S3-compatible storage +2. **`create_pdf_splits`** — Divides PDF list into N splits for parallel processing +3. **`download_docling_models`** — Downloads ML models (layout, tableformer, or VLM) +4. **`docling_convert_*`** — Converts PDFs to Docling JSON + Markdown (runs in parallel via `ParallelFor`) +5. **`docling_chunk`** *(optional)* — Splits converted documents into semantic chunks (JSONL output) + +## KFP Pipeline Architecture (`kubeflow-pipelines/`) + +Two pipeline variants exist: + +| Pipeline | Conversion Strategy | Models Used | +|---|---|---| +| **Standard** (`docling-standard/`) | Traditional OCR + layout analysis | Layout model, TableFormer | +| **VLM** (`docling-vlm/`) | Vision-language model inference | SmolVLM, SmolDocling, or remote API | + +**Key architectural constraint**: KFP `@dsl.component` functions are serialized by the KFP SDK and executed in isolated containers. This means: +- All imports must be inside the function body (not at module top) +- Components cannot import from project-level modules at runtime +- Each component is self-contained with its own dependencies +- The `common/` module is only used at **compile time** to define shared component functions + +See [ADR-001](docs/adr/001-kfp-component-serialization.md) for details. + +**Component ownership**: +- `common/components.py` — Shared components used by both pipelines: `import_pdfs`, `create_pdf_splits`, `download_docling_models`, `docling_chunk` +- `docling-standard/standard_components.py` — Standard-specific: `docling_convert_standard` +- `docling-vlm/vlm_components.py` — VLM-specific: `docling_convert_vlm` + +**Compiled YAML**: Each pipeline has a `*_compiled.yaml` file that is the KFP-deployable artifact. These are **generated code** — produced by running `python *_convert_pipeline.py`. CI enforces that committed YAML matches what the compiler produces. + +## Container Images + +Two base images are used across all KFP components, configurable via environment variables: + +| Variable | Default | Purpose | +|---|---|---| +| `PYTHON_BASE_IMAGE` | `registry.access.redhat.com/ubi9/python-311:9.6-*` | Lightweight components (import, split) | +| `DOCLING_BASE_IMAGE` | `quay.io/fabianofranz/docling-ubi9:2.54.0` | Components needing Docling + ML models | + +Defined in `kubeflow-pipelines/common/constants.py`. + +## Secrets Architecture + +KFP pipelines use Kubernetes Secrets mounted as volumes (not environment variables): + +| Secret Name | Mount Path | Used When | Keys | +|---|---|---|---| +| `data-processing-docling-pipeline` | `/mnt/secrets` | `pdf_from_s3=True` | `S3_ENDPOINT_URL`, `S3_ACCESS_KEY`, `S3_SECRET_KEY`, `S3_BUCKET`, `S3_PREFIX` | +| `data-processing-docling-pipeline` | `/mnt/secrets` | `remote_model_enabled=True` (VLM only) | `REMOTE_MODEL_ENDPOINT_URL`, `REMOTE_MODEL_API_KEY`, `REMOTE_MODEL_NAME` | + +Both use the same secret name. Components read individual keys as files from the mount path. + +## Key Design Decisions + +Documented as ADRs in `docs/adr/`: + +- [ADR-001: KFP Component Serialization](docs/adr/001-kfp-component-serialization.md) — Why imports are inside function bodies +- [ADR-002: Compiled YAML as Source of Truth](docs/adr/002-compiled-yaml-source-of-truth.md) — Why generated YAML is committed +- [ADR-003: Shared Secret Name](docs/adr/003-shared-secret-name.md) — Why S3 and VLM configs share one secret + +## Directory Map + +``` +data-processing/ +├── kubeflow-pipelines/ +│ ├── common/ ← Shared KFP components + constants +│ │ ├── components.py ← import_pdfs, create_pdf_splits, download_docling_models, docling_chunk +│ │ └── constants.py ← Base image defaults +│ ├── docling-standard/ ← Standard conversion pipeline +│ │ ├── standard_components.py ← docling_convert_standard +│ │ ├── standard_convert_pipeline.py ← Pipeline definition + compiler +│ │ ├── standard_convert_pipeline_compiled.yaml ← GENERATED — do not edit +│ │ └── local_run.py ← Docker-based local testing +│ └── docling-vlm/ ← VLM conversion pipeline +│ ├── vlm_components.py ← docling_convert_vlm +│ ├── vlm_convert_pipeline.py ← Pipeline definition + compiler +│ ├── vlm_convert_pipeline_compiled.yaml ← GENERATED — do not edit +│ └── local_run.py ← Docker-based local testing +├── tests/ ← Pytest suite +├── docs/ +│ └── adr/ ← Architecture Decision Records +├── Makefile ← Build/test/format targets +├── pyproject.toml ← Ruff + pytest config +└── .pre-commit-config.yaml ← Pre-commit hooks +``` diff --git a/docs/adr/001-kfp-component-serialization.md b/docs/adr/001-kfp-component-serialization.md new file mode 100644 index 00000000..79a53b4b --- /dev/null +++ b/docs/adr/001-kfp-component-serialization.md @@ -0,0 +1,40 @@ +# ADR-001: KFP Component Serialization + +## Status + +Accepted + +## Context + +Kubeflow Pipelines (KFP) SDK v2 serializes `@dsl.component`-decorated functions into self-contained container steps. The KFP compiler inspects the function source code, extracts it, and packages it to run inside a container image at runtime. + +This means the function must be entirely self-contained — any imports, helper functions, or constants it uses must either: +1. Be defined inside the function body +2. Come from packages listed in `packages_to_install` +3. Already exist in the `base_image` + +Project-level modules (like a shared `logging_config.py` or utility library) are **not available at runtime** because the container only has the serialized function code, not the full repository. + +## Decision + +All KFP component functions place their imports inside the function body, not at module top level. + +```python +@dsl.component(base_image=DOCLING_BASE_IMAGE) +def docling_convert_standard(...): + from pathlib import Path # Inside function + from docling.document_converter import ... # Inside function + ... +``` + +The `# pylint: disable=import-outside-toplevel` comments throughout `components.py`, `standard_components.py`, and `vlm_components.py` exist because of this constraint, not as a style choice. + +The `common/` module (`components.py`, `constants.py`) is used at **compile time only** — when `python standard_convert_pipeline.py` runs, it imports the component functions to build the pipeline graph and compile to YAML. At runtime in Kubernetes, each component runs in isolation. + +## Consequences + +- **Positive**: Components are fully portable — the compiled YAML works on any KFP instance without needing the source repository +- **Positive**: Dependencies are explicit per component via `base_image` and `packages_to_install` +- **Negative**: Code cannot be shared between components at runtime (some duplication is unavoidable) +- **Negative**: Import statements look unusual (inside functions) and trigger linter warnings +- **Constraint**: Any shared logging, error handling, or utility code must be duplicated in each component function or provided by the base image diff --git a/docs/adr/002-compiled-yaml-source-of-truth.md b/docs/adr/002-compiled-yaml-source-of-truth.md new file mode 100644 index 00000000..8b1e6ee7 --- /dev/null +++ b/docs/adr/002-compiled-yaml-source-of-truth.md @@ -0,0 +1,33 @@ +# ADR-002: Compiled YAML as Source of Truth + +## Status + +Accepted + +## Context + +KFP pipelines are defined in Python (`*_convert_pipeline.py`) but deployed as compiled YAML files (`*_compiled.yaml`). Users install pipelines by uploading the compiled YAML to a KFP instance — they never run the Python source directly in production. + +Two approaches exist for managing compiled YAML: +1. **Commit the compiled YAML** alongside source code +2. **Generate YAML in CI only** and publish as a release artifact + +## Decision + +Compiled YAML files are committed to the repository and treated as checked-in artifacts. + +CI enforces consistency: the `compile-kfp.yml` workflow recompiles from source and diffs against the committed YAML. If they differ, CI fails with instructions to regenerate locally. + +``` +standard_convert_pipeline_compiled.yaml ← committed, GENERATED +vlm_convert_pipeline_compiled.yaml ← committed, GENERATED +``` + +## Consequences + +- **Positive**: Users can install pipelines directly from the repository without building from source — just download the YAML via URL +- **Positive**: PR reviewers can see YAML changes alongside Python changes +- **Positive**: The `main` branch always has a deployable pipeline +- **Negative**: Contributors must remember to regenerate YAML after pipeline code changes (`python *_convert_pipeline.py`) +- **Negative**: PRs that modify pipeline code will have larger diffs (Python changes + YAML changes) +- **Mitigation**: CI catches stale YAML and provides the exact regeneration command in the failure message diff --git a/docs/adr/003-shared-secret-name.md b/docs/adr/003-shared-secret-name.md new file mode 100644 index 00000000..441cdc28 --- /dev/null +++ b/docs/adr/003-shared-secret-name.md @@ -0,0 +1,31 @@ +# ADR-003: Shared Secret Name for S3 and VLM Configuration + +## Status + +Accepted + +## Context + +KFP pipelines need credentials for two optional features: +1. **S3 access** — downloading PDFs from S3-compatible storage (`import_pdfs` component) +2. **Remote VLM access** — calling a remote vision-language model API (`docling_convert_vlm` component, VLM pipeline only) + +These could use separate Kubernetes Secrets (e.g., `data-processing-s3` and `data-processing-vlm`) or a single shared secret. + +## Decision + +Both features use a single Kubernetes Secret named `data-processing-docling-pipeline`, mounted at `/mnt/secrets`. The secret is mounted as `optional: true` so pipelines work without it when using HTTP sources and local models. + +S3 keys: `S3_ENDPOINT_URL`, `S3_ACCESS_KEY`, `S3_SECRET_KEY`, `S3_BUCKET`, `S3_PREFIX` +VLM keys: `REMOTE_MODEL_ENDPOINT_URL`, `REMOTE_MODEL_API_KEY`, `REMOTE_MODEL_NAME` + +Components validate only the keys they need — `import_pdfs` checks for S3 keys when `from_s3=True`, and `docling_convert_vlm` checks for VLM keys when `remote_model_enabled=True`. + +## Consequences + +- **Positive**: Users manage one secret instead of two — simpler setup +- **Positive**: Single `kubernetes.use_secret_as_volume()` call per component +- **Positive**: Keys have distinct prefixes so there's no collision +- **Negative**: The secret can grow large if both S3 and VLM are configured +- **Negative**: Changing S3 credentials requires updating a secret that also contains VLM credentials (and vice versa) +- **Negative**: RBAC cannot be scoped separately to S3 vs VLM access diff --git a/docs/adr/README.md b/docs/adr/README.md new file mode 100644 index 00000000..a486d1f1 --- /dev/null +++ b/docs/adr/README.md @@ -0,0 +1,43 @@ +# Architecture Decision Records (ADRs) + +This directory contains Architecture Decision Records for the data-processing repository. + +ADRs document significant technical decisions with their context, rationale, and consequences. They help contributors (and AI agents) understand *why* the code is structured the way it is. + +## Index + +| ADR | Title | Status | +|---|---|---| +| [001](001-kfp-component-serialization.md) | KFP Component Serialization | Accepted | +| [002](002-compiled-yaml-source-of-truth.md) | Compiled YAML as Source of Truth | Accepted | +| [003](003-shared-secret-name.md) | Shared Secret Name for S3 and VLM Configuration | Accepted | + +## Creating a New ADR + +1. Copy the template below +2. Name the file `NNN-short-title.md` (e.g., `004-new-decision.md`) +3. Fill in each section +4. Add an entry to the index table above +5. Submit as part of your PR + +### Template + +```markdown +# ADR-NNN: Title + +## Status + +Proposed | Accepted | Deprecated | Superseded by [ADR-NNN](NNN-title.md) + +## Context + +What is the issue that we're seeing that motivates this decision? + +## Decision + +What is the change that we're making? + +## Consequences + +What becomes easier or harder because of this change? +```