⬆️(dependencies) switch base docker image to upgrade markitdown #360
providenz wants to merge 1 commit into
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 3 passed
Actionable comments posted: 1
🧹 Nitpick comments (1)
Dockerfile (1)
157-158: Redundant apt cleanup (harmless but unnecessary).
The apt lists are already removed after each `apt-get install` command throughout the Dockerfile (lines 9, 27, 58, 97, 136). This cleanup in the production stage won't find anything to remove since the `core` stage already cleaned up.
This is harmless but adds a small layer overhead. Consider removing if you want to minimize layers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` around lines 157 - 158, The final production-stage RUN layer that executes "rm -rf /var/lib/apt/lists/*" is redundant because earlier apt-get install steps already clean apt lists in the core stage; remove the lone RUN rm -rf /var/lib/apt/lists/* instruction from the production stage in the Dockerfile to eliminate the unnecessary layer and reduce image size.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 410c220f-2f1e-413d-bc9a-8a5e0609bbec
⛔ Files ignored due to path filters (1)
- `src/backend/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (4)
- .dockerignore
- CHANGELOG.md
- Dockerfile
- src/backend/pyproject.toml
 "lxml==5.4.0",
 "markdown==3.10",
-"markitdown==0.0.2",
+"markitdown==0.1.5",
🧩 Analysis chain
🌐 Web query:
markitdown python library 0.1.5 API convert_stream method
💡 Result:
The markitdown Python library (version 0.1.5, released 2026-02-20) is a Microsoft tool for converting files (PDF, Office docs, images, etc.) to Markdown for LLM use. The MarkItDown class has a primary convert(path) method, but also exposes convert_stream(stream) for advanced use.
The convert_stream method converts a binary file-like object (e.g., open(file, 'rb') or io.BytesIO) to Markdown. In v0.1.5 (post-0.1.0 breaking changes), it requires a binary stream; text streams such as io.StringIO are not accepted. This aligns with DocumentConverter changes to use streams instead of paths, avoiding temp files.
Example usage (inferred from context):
    from markitdown import MarkItDown
    md = MarkItDown()
    with open('file.pdf', 'rb') as f:
        result = md.convert_stream(f)
    print(result.text_content)
Official docs are in the GitHub README at https://github.com/microsoft/markitdown (packages/markitdown contains source). No full API docs found beyond README; check source for details. PyPI: https://pypi.org/project/markitdown/0.1.5/.
Citations:
- 1: https://github.com/microsoft/markitdown
- 2: https://pypi.org/project/markitdown/
- 3: https://github.com/microsoft/markitdown/blob/main/README.md
- 4: https://www.library.umbc.edu
- 5: https://oreil.ly/pLnsJ
- 6: https://hotgithub.com/projects/markitdown/
🏁 Script executed:
fd markitdown.py --type f
Repository: suitenumerique/conversations
Length of output: 202
🏁 Script executed:
cat -n src/backend/chat/agent_rag/document_converter/markitdown.py
Repository: suitenumerique/conversations
Length of output: 1799
🏁 Script executed:
cat -n src/backend/chat/tests/agent_rag/document_converter/test_markitdown.py
Repository: suitenumerique/conversations
Length of output: 1427
markitdown 0.1.5 breaks the file_extension parameter: code will fail at runtime.
The upgrade from 0.0.2 to 0.1.5 introduces a breaking change. The convert_stream() method no longer accepts the file_extension parameter. The codebase calls convert_stream(document, file_extension=...) at lines 39-41 of src/backend/chat/agent_rag/document_converter/markitdown.py, which will raise TypeError: convert_stream() got an unexpected keyword argument 'file_extension' when executed with 0.1.5.
Either revert to 0.0.2, refactor _convert() to use a different API, or wait for a version that restores compatibility.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/backend/pyproject.toml` at line 54, The dependency upgrade to
markitdown==0.1.5 is breaking because convert_stream() no longer accepts
file_extension; either pin markitdown back to 0.0.2 in pyproject.toml or update
the converter code: locate the _convert() implementation in
src/backend/chat/agent_rag/document_converter/markitdown.py where
convert_stream(document, file_extension=...) is called and remove the
file_extension kwarg and adapt to the new convert_stream API (call
convert_stream(document) and handle its return format), ensuring behavior and
tests remain unchanged.
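Since the prompt asks to verify the finding first, one way to do that is to inspect the installed convert_stream() signature instead of assuming which keywords it accepts. The following is a minimal sketch, not the project's code: `call_convert_stream`, `OldStyleConverter` and `NewStyleConverter` are hypothetical names, and the two stub classes only stand in for an API that does or does not accept `file_extension`. They are not markitdown's real classes.

```python
import inspect
import io


def call_convert_stream(converter, stream, file_extension=None):
    """Call converter.convert_stream, passing file_extension only if accepted.

    Hypothetical helper: it inspects the bound method's signature so the
    caller does not hard-code an assumption about the library version.
    """
    params = inspect.signature(converter.convert_stream).parameters
    accepts_kwarg = "file_extension" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if file_extension is not None and accepts_kwarg:
        return converter.convert_stream(stream, file_extension=file_extension)
    return converter.convert_stream(stream)


class OldStyleConverter:
    """Stub (assumption): an API that takes a file_extension keyword."""

    def convert_stream(self, stream, file_extension=None):
        return f"old:{file_extension}:{len(stream.read())}"


class NewStyleConverter:
    """Stub (assumption): an API that takes only the binary stream."""

    def convert_stream(self, stream):
        return f"new:{len(stream.read())}"


old_result = call_convert_stream(OldStyleConverter(), io.BytesIO(b"abc"), ".pdf")
new_result = call_convert_stream(NewStyleConverter(), io.BytesIO(b"abc"), ".pdf")
print(old_result, new_result)
```

Running the same inspection against the real `MarkItDown().convert_stream` in the upgraded environment would show whether the keyword is still accepted, which is worth confirming before pinning back to 0.0.2 or refactoring _convert().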
Signed-off-by: Laurent Paoletti <lp@providenz.fr>
bb06977 to 2b97474
🧹 Nitpick comments (1)
Dockerfile (1)
157-158: Redundant cleanup layer.
The `core` stage (lines 84-97) already removes `/var/lib/apt/lists/*` after installing packages. Since `backend-production` inherits from `core`, the apt lists are already gone. This `RUN` creates an additional image layer that performs a no-op.
🧹 Suggested removal
 # ---- Production image ----
 FROM core AS backend-production

-# Remove apt lists, we don't need them anymore
-RUN rm -rf /var/lib/apt/lists/*
-
 ARG CONVERSATIONS_STATIC_ROOT=/data/static
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` around lines 157 - 158, The Dockerfile contains a redundant RUN rm -rf /var/lib/apt/lists/* in the backend-production stage that duplicates cleanup already done in the core stage; remove that RUN line from the backend-production stage so you don't add a no-op image layer (locate the backend-production stage and delete the RUN rm -rf /var/lib/apt/lists/* instruction, leaving the core stage's cleanup in place).
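A quick way to confirm where such standalone cleanup layers sit is to scan the Dockerfile text. This is a rough sketch under stated assumptions: `find_standalone_apt_cleanups` is a hypothetical helper, the sample Dockerfile (including the `debian:stable-slim` base) is illustrative and not the project's file, and multi-line `RUN` instructions joined with backslashes are not handled.

```python
import re

# Matches a command that consists solely of the apt-lists cleanup.
APT_CLEANUP = re.compile(r"rm\s+-rf\s+/var/lib/apt/lists/\*")


def find_standalone_apt_cleanups(dockerfile_text):
    """Return 1-based line numbers of RUN instructions that only clean apt lists."""
    hits = []
    for number, line in enumerate(dockerfile_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.upper().startswith("RUN "):
            continue
        command = stripped[4:].strip()
        # A cleanup chained onto apt-get install does not fullmatch and is skipped.
        if APT_CLEANUP.fullmatch(command):
            hits.append(number)
    return hits


# Illustrative Dockerfile, not the one under review.
SAMPLE = """\
FROM debian:stable-slim AS core
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
FROM core AS backend-production
RUN rm -rf /var/lib/apt/lists/*
"""

print(find_standalone_apt_cleanups(SAMPLE))  # [4]
```

Only the lone `RUN` in the final stage is reported; the cleanup chained to `apt-get install` in the same layer is left alone, since that is the one that actually keeps the lists out of the image.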
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: a1fdb546-4e91-4a35-98fd-ac29b9a66777
⛔ Files ignored due to path filters (1)
- `src/backend/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (4)
- .dockerignore
- CHANGELOG.md
- Dockerfile
- src/backend/pyproject.toml
✅ Files skipped from review due to trivial changes (3)
- .dockerignore
- CHANGELOG.md
- src/backend/pyproject.toml



Purpose
We could not upgrade markitdown because the appropriate compilation toolchain was missing.
Proposal
Rationale
Alpine uses musl libc instead of glibc. A lot of Python packages with native extensions (like onnxruntime) only ship pre-built wheels for glibc. This was blocking the markitdown upgrade entirely - there is no practical way to compile onnxruntime on Alpine.
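To see which side of the musl/glibc split a given image is on, the interpreter can be asked directly. A minimal sketch using only the standard library; `detect_libc` is a hypothetical helper name, and the assumption is that `platform.libc_ver()` reports a name only when it recognises the C library, so anything other than glibc (musl included) is reported here as "unknown".

```python
import platform


def detect_libc():
    """Return ("glibc", version) on glibc systems, else ("unknown", "")."""
    name, version = platform.libc_ver()
    if name == "glibc":
        return name, version
    # Not recognised as glibc (for example musl on Alpine, or a non-Linux host).
    return "unknown", ""


libc_name, libc_version = detect_libc()
print(f"libc: {libc_name} {libc_version}".strip())
```

Running this inside the old and new base images is a cheap sanity check that the switch actually puts the application on glibc, which is what the pre-built wheels target.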
Debian-slim is a common choice for Python Docker images that need native dependency compatibility. The trade-off is a slightly bigger base image (~150MB vs ~50MB), but this is negligible compared to the dependencies themselves (onnxruntime alone is ~200MB).
Distroless images were considered, but in my understanding they have too many drawbacks:
Complexity outweighs benefits imo.
Trivy reports several security issues, but most don't seem exploitable (sqlite, minizip…).