⬆️(dependencies) switch base docker image to upgrade markitdown#360

Open
providenz wants to merge 1 commit into main from providenz/docker-slim
Conversation

@providenz
Collaborator

@providenz providenz commented Mar 24, 2026

Purpose

We could not upgrade markitdown because the appropriate compilation toolchain was missing.

Proposal

  • Switch base image from python:3.13.3-alpine to python:3.13.3-slim to unblock upgrading markitdown
  • Upgrade markitdown from 0.0.2 to 0.1.5 - newer versions depend on magika which depends on onnxruntime, and onnxruntime has no pre-built wheels for Alpine (musl libc)
  • Drop rust from the builder stage - no longer needed since Debian's glibc allows using pre-built manylinux wheels
  • Fix .dockerignore

Rationale

Alpine uses musl libc instead of glibc. A lot of Python packages with native extensions (like onnxruntime) only ship pre-built wheels for glibc. This was blocking the markitdown upgrade entirely - there is no practical way to compile onnxruntime on Alpine.
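To make the musl/glibc distinction concrete, here is a minimal sketch that reads the platform tag out of a wheel filename and reports which libc family it targets. The helper names and the `example_pkg` filenames are made up for illustration; only the `manylinux` (glibc) and `musllinux` (musl) tag prefixes come from the wheel packaging conventions.

```python
def wheel_platform_tags(filename: str) -> list[str]:
    """Return the platform tags encoded in a wheel filename.

    Wheel filenames look like
    name-version[-build]-pythontag-abitag-platformtag.whl,
    and several platform tags may be joined with ".".
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    platform_part = filename[: -len(".whl")].split("-")[-1]
    return platform_part.split(".")


def libc_family(filename: str) -> str:
    """Classify a wheel as targeting 'musl', 'glibc', or 'other'."""
    tags = wheel_platform_tags(filename)
    if any(tag.startswith("musllinux") for tag in tags):
        return "musl"  # installable on Alpine
    if any(tag.startswith("manylinux") for tag in tags):
        return "glibc"  # installable on Debian-slim, not on Alpine
    return "other"  # e.g. pure-Python "any", or non-Linux platforms


# Hypothetical filenames, for illustration only.
glibc_wheel = "example_pkg-1.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
musl_wheel = "example_pkg-1.0-cp313-cp313-musllinux_1_2_x86_64.whl"
pure_wheel = "example_pkg-1.0-py3-none-any.whl"
```

If a package publishes only `manylinux` wheels, pip on Alpine has to fall back to building from source, which is the situation described above for onnxruntime.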

Debian-slim is the standard choice for Python Docker images that need native dependency compatibility. The trade-off is a slightly bigger base image (~150MB vs ~50MB), but this is negligible compared to the dependencies themselves (onnxruntime alone is ~200MB).

Distroless images were considered, but in my understanding they have too many drawbacks:

  • the lack of a shell causes issues with the entrypoint, Helm migrate jobs, etc.
  • we need Python libs with native extensions (pango, cairo, etc.) and there is no package manager

Complexity outweighs the benefits imo.

Trivy reports several security issues, but most don't seem exploitable (sqlite, minizip…).

Summary by CodeRabbit

  • Chores

    • Switched container base to a Debian-slim variant for broader compatibility and improved dependency handling.
    • Upgraded markitdown to v0.1.5 for improved document conversion.
    • Broadened ignore rules so virtual environment folders are excluded at any depth.
  • Documentation

    • Added changelog note recording the base image update.

@coderabbitai

coderabbitai Bot commented Mar 24, 2026

Walkthrough

Updates: generalize .dockerignore venv patterns; bump markitdown from 0.0.2 → 0.1.5; replace Docker base image python:3.13.3-alpine with python:3.13.3-slim and convert all Docker stages from apk to apt-get; add changelog entry.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Docker & build config (Dockerfile, .dockerignore) | Switched base image to python:3.13.3-slim; replaced Alpine apk commands and packages with Debian/apt equivalents across build stages (back-builder, link-collector, core, backend-development, backend-production); added apt cache cleanup; .dockerignore changed venv/.venv to recursive **/venv, **/.venv. |
| Dependencies (src/backend/pyproject.toml) | Bumped markitdown dependency from 0.0.2 to 0.1.5. |
| Changelog (CHANGELOG.md) | Added an entry under Unreleased → Changed noting the Docker base image upgrade to support the markitdown update. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • qbey
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main change: switching the base Docker image from Alpine to Debian slim to enable the markitdown dependency upgrade. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
Dockerfile (1)

157-158: Redundant apt cleanup (harmless but unnecessary).

The apt lists are already removed after each apt-get install command throughout the Dockerfile (lines 9, 27, 58, 97, 136). This cleanup in the production stage won't find anything to remove since the core stage already cleaned up.

This is harmless but adds a small layer overhead. Consider removing if you want to minimize layers.
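The nitpick rests on how image layers work: deleting files in a later RUN adds a new layer and cannot shrink the layers before it. As a rough sketch, the check below flags RUN instructions whose only job is to clean the apt lists. The function name and the sample Dockerfile are hypothetical and are not taken from this PR's Dockerfile; only single-line RUN instructions are considered.

```python
import re

# Matches a command that does nothing except remove the apt lists.
APT_LIST_CLEANUP = re.compile(r"rm\s+-rf\s+/var/lib/apt/lists/\*")


def standalone_apt_cleanups(dockerfile_text: str) -> list[int]:
    """Return 1-based line numbers of RUN lines that only clean apt lists."""
    hits = []
    for number, line in enumerate(dockerfile_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.upper().startswith("RUN "):
            continue  # continuation lines and other instructions are skipped
        command = stripped[4:].strip()
        if APT_LIST_CLEANUP.fullmatch(command):
            hits.append(number)
    return hits


# Illustrative Dockerfile: the cleanup on line 3 is chained to the install
# and is fine; the one on line 5 is a separate layer.
SAMPLE = """\
FROM python:3.13.3-slim AS core
RUN apt-get update && apt-get install -y --no-install-recommends curl \\
    && rm -rf /var/lib/apt/lists/*
FROM core AS backend-production
RUN rm -rf /var/lib/apt/lists/*
"""

flagged = standalone_apt_cleanups(SAMPLE)
```

A cleanup chained to its `apt-get install` in the same RUN is the form that actually keeps the lists out of the image.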

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 157 - 158, The final production-stage RUN layer that
executes "rm -rf /var/lib/apt/lists/*" is redundant because earlier apt-get
install steps already clean apt lists in the core stage; remove the lone RUN rm
-rf /var/lib/apt/lists/* instruction from the production stage in the Dockerfile
to eliminate the unnecessary layer and reduce image size.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/backend/pyproject.toml`:
- Line 54: The dependency upgrade to markitdown==0.1.5 is breaking because
convert_stream() no longer accepts file_extension; either pin markitdown back to
0.0.2 in pyproject.toml or update the converter code: locate the _convert()
implementation in src/backend/chat/agent_rag/document_converter/markitdown.py
where convert_stream(document, file_extension=...) is called and remove the
file_extension kwarg and adapt to the new convert_stream API (call
convert_stream(document) and handle its return format), ensuring behavior and
tests remain unchanged.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 410c220f-2f1e-413d-bc9a-8a5e0609bbec

📥 Commits

Reviewing files that changed from the base of the PR and between f5a1b72 and bb06977.

⛔ Files ignored due to path filters (1)
  • src/backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • .dockerignore
  • CHANGELOG.md
  • Dockerfile
  • src/backend/pyproject.toml

 "lxml==5.4.0",
 "markdown==3.10",
-"markitdown==0.0.2",
+"markitdown==0.1.5",

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

markitdown python library 0.1.5 API convert_stream method

💡 Result:

The markitdown Python library (version 0.1.5, released 2026-02-20) is a Microsoft tool for converting files (PDF, Office docs, images, etc.) to Markdown for LLM use. The MarkItDown class has a primary convert(path) method, but also exposes convert_stream(stream) for advanced use.

The convert_stream method converts a binary file-like object (e.g., open(file, 'rb') or io.BytesIO) to Markdown. In v0.1.5 (post-0.1.0 breaking changes), it requires a binary stream only; text streams such as io.StringIO are not accepted. This aligns with DocumentConverter changes to use streams instead of paths, avoiding temp files.

Example usage (inferred from context):

    from markitdown import MarkItDown
    import io

    md = MarkItDown()
    with open('file.pdf', 'rb') as f:
        result = md.convert_stream(f)
    print(result.text_content)

Official docs are in the GitHub README at https://github.com/microsoft/markitdown (packages/markitdown contains source). No full API docs found beyond README; check source for details. PyPI: https://pypi.org/project/markitdown/0.1.5/.

Citations:


🏁 Script executed:

fd markitdown.py --type f

Repository: suitenumerique/conversations

Length of output: 202


🏁 Script executed:

cat -n src/backend/chat/agent_rag/document_converter/markitdown.py

Repository: suitenumerique/conversations

Length of output: 1799


🏁 Script executed:

cat -n src/backend/chat/tests/agent_rag/document_converter/test_markitdown.py

Repository: suitenumerique/conversations

Length of output: 1427


markitdown 0.1.5 breaks the file_extension parameter—code will fail at runtime.

The upgrade from 0.0.2 to 0.1.5 introduces a breaking change. The convert_stream() method no longer accepts the file_extension parameter. The codebase calls convert_stream(document, file_extension=...) at lines 39-41 of src/backend/chat/agent_rag/document_converter/markitdown.py, which will raise TypeError: convert_stream() got an unexpected keyword argument 'file_extension' when executed with 0.1.5.

Either revert to 0.0.2, refactor _convert() to use a different API, or wait for a version that restores compatibility.
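Since this finding comes from a web-query summary, it may be worth checking against the installed package before acting on it. The sketch below inspects a callable's signature to see whether it accepts a given keyword, then calls it accordingly. The helper names are hypothetical, and the two stub functions only stand in for the old and new call shapes; the sketch makes no claim about what markitdown 0.1.5 actually accepts.

```python
import inspect
from typing import Any, Callable


def accepts_keyword(func: Callable[..., Any], name: str) -> bool:
    """Return True if `func` can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        return params[name].kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        )
    # A **kwargs catch-all also accepts arbitrary keywords.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def convert_with_optional_extension(
    convert_stream: Callable[..., Any], document: Any, file_extension: str
) -> Any:
    """Pass file_extension only when the target callable accepts it."""
    if accepts_keyword(convert_stream, "file_extension"):
        return convert_stream(document, file_extension=file_extension)
    return convert_stream(document)


# Stubs standing in for the two call shapes discussed in the review.
def stub_with_extension(stream, file_extension=None):
    return ("with_extension", file_extension)


def stub_without_extension(stream):
    return ("without_extension", None)
```

Against the real library, the same check would be `accepts_keyword(MarkItDown().convert_stream, "file_extension")`, run in an environment where the pinned version is installed.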

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/pyproject.toml` at line 54, The dependency upgrade to
markitdown==0.1.5 is breaking because convert_stream() no longer accepts
file_extension; either pin markitdown back to 0.0.2 in pyproject.toml or update
the converter code: locate the _convert() implementation in
src/backend/chat/agent_rag/document_converter/markitdown.py where
convert_stream(document, file_extension=...) is called and remove the
file_extension kwarg and adapt to the new convert_stream API (call
convert_stream(document) and handle its return format), ensuring behavior and
tests remain unchanged.

Signed-off-by: Laurent Paoletti <lp@providenz.fr>
@providenz providenz force-pushed the providenz/docker-slim branch from bb06977 to 2b97474 Compare April 13, 2026 13:31
@sonarqubecloud

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
Dockerfile (1)

157-158: Redundant cleanup layer.

The core stage (lines 84-97) already removes /var/lib/apt/lists/* after installing packages. Since backend-production inherits from core, the apt lists are already gone. This RUN creates an additional image layer that performs a no-op.

🧹 Suggested removal
 # ---- Production image ----
 FROM core AS backend-production

-# Remove apt lists, we don't need them anymore
-RUN rm -rf /var/lib/apt/lists/*
-
 ARG CONVERSATIONS_STATIC_ROOT=/data/static
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 157 - 158, The Dockerfile contains a redundant RUN
rm -rf /var/lib/apt/lists/* in the backend-production stage that duplicates
cleanup already done in the core stage; remove that RUN line from the
backend-production stage so you don't add a no-op image layer (locate the
backend-production stage and delete the RUN rm -rf /var/lib/apt/lists/*
instruction, leaving the core stage's cleanup in place).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a1fdb546-4e91-4a35-98fd-ac29b9a66777

📥 Commits

Reviewing files that changed from the base of the PR and between bb06977 and 2b97474.

⛔ Files ignored due to path filters (1)
  • src/backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • .dockerignore
  • CHANGELOG.md
  • Dockerfile
  • src/backend/pyproject.toml
✅ Files skipped from review due to trivial changes (3)
  • .dockerignore
  • CHANGELOG.md
  • src/backend/pyproject.toml
