images extracted from documents as inputs in the final LLM answering. #281
chadiaitekioui wants to merge 47 commits into
Pull request overview
Adds end-to-end support for a multimodal (text + images) RAG flow by persisting image references at indexing time, retrieving them back into document metadata, and routing generation through a vision-capable LLM path when enabled.
Changes:
- Store image_paths (derived from image modalities) in Milvus during indexing; parse and attach them to retrieved Document.metadata.
- Add a vision branch to the RAG pipeline that loads images from retrieved image_paths and calls a multimodal LLM adapter.
- Update CLI output formatting and add a regression test for <attachment> modality chunking.
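The vision branch listed above can be pictured as a small aggregation step between retrieval and generation. The following is a minimal sketch, not the PR's code: `collect_image_paths` and its `max_images` argument are hypothetical names, and retrieved documents are modeled as plain dicts with a `metadata` mapping rather than LangChain `Document` objects.

```python
from typing import Any, Dict, List


def collect_image_paths(docs: List[Dict[str, Any]], max_images: int = 4) -> List[str]:
    """Gather unique image paths from retrieved docs, in retrieval order.

    Hypothetical helper: each doc is assumed to be a dict whose "metadata"
    entry may hold an "image_paths" list.
    """
    if max_images <= 0:
        return []
    seen = set()
    selected: List[str] = []
    for doc in docs:
        # A missing or None "image_paths" entry is treated as "no images".
        for path in doc.get("metadata", {}).get("image_paths") or []:
            if path in seen:
                continue  # the same image may be attached to several chunks
            seen.add(path)
            selected.append(path)
            if len(selected) >= max_images:
                return selected
    return selected
```

Capping the list before any image is loaded keeps the request size predictable, which is presumably the purpose of the max-images setting mentioned in the file summary.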
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tests/test_postprocessors.py | Adds a regression test ensuring a single <attachment> keeps its image modality through chunking. |
| src/mmore/process/post_processor/chunker/multimodal.py | Fixes an off-by-one boundary so the last/only modality is not dropped. |
| src/mmore/index/indexer.py | Persists image_paths to Milvus as a dynamic field when indexing. |
| src/mmore/rag/retriever.py | Requests and parses image_paths from Milvus results and adds it to Document.metadata. |
| src/mmore/rag/llm.py | Adds use_vision toggle to LLMConfig. |
| src/mmore/rag/context.py | New helper to format retrieved docs and load images from file paths. |
| src/mmore/rag/multimodal_llm.py | New multimodal LLM adapter layer (OpenAI-style + Hugging Face vision adapter). |
| src/mmore/rag/pipeline.py | Adds vision-mode branching in the RAG chain and config for max images per request. |
| src/mmore/run_rag.py | Enhances saved outputs with num_images_used / used_image_paths. |
| examples/rag/config.yaml | Adds commented vision example settings and max images knob. |
| docs/multimodal_rag.md | New documentation describing the multimodal RAG flow and changes. |
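To make the "OpenAI-style" adapter in the table more concrete, here is a minimal sketch of how a text prompt and loaded images could be packed into one message content list. The function names (`encode_image_bytes`, `build_multimodal_content`) and the prompt layout are assumptions for illustration; only the `text` / `image_url` part structure with a base64 data URL follows the OpenAI chat format.

```python
import base64
from typing import Any, Dict, List


def encode_image_bytes(data: bytes, mime_type: str = "image/png") -> str:
    """Return a base64 data URL for raw image bytes."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_multimodal_content(
    question: str, context: str, images: List[bytes]
) -> List[Dict[str, Any]]:
    """Build an OpenAI-style content list: one text part, then one part per image."""
    content: List[Dict[str, Any]] = [
        {"type": "text", "text": f"Context:\n{context}\n\nQuestion: {question}"}
    ]
    for data in images:
        content.append(
            {"type": "image_url", "image_url": {"url": encode_image_bytes(data)}}
        )
    return content
```

The resulting list would be used as the `content` of a single user message; with no images it degrades to a text-only message.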
c9300be to f8e3bf1
…ent metadata as dict for LangChain
- Retriever: use default output_fields on Milvus fallback when image_paths is missing from schema
- HuggingFaceVisionAdapter: simplify _load; image_utils: silently skip failed loads/encodes
- Tests: consolidate multimodal coverage in test_multimodal_core and test_multimodal_rag
JCHAVEROT left a comment
Hi @chadiaitekioui, good job with this big task! 👏👍
For now, I mostly left comments on the documentation and a few high-level considerations before reviewing the actual code (but from what I saw it looks very clean, so well done!)
…oui/mmore into issue-multimodal-llm
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…oui/mmore into issue-multimodal-llm
…che to avoid launching multiple from_pretrained
After tweaking your code a bit so that I could run it on my MPS machine, I tried to run the indexing pipeline on my computer with the example config.yaml file (after running the usual process and postprocess stages on the template data in the repo).
indexer:
  dense_model:
    #model_name: sentence-transformers/all-MiniLM-L6-v2
    #is_multimodal: false
    # To enable vision mode
    model_name: Qwen/Qwen2.5-VL-3B-Instruct
    is_multimodal: true
  sparse_model:
    model_name: splade
    is_multimodal: false
  db:
    uri: ./proc_demo.db
    name: my_db
collection_name: my_docs
documents_path: 'examples/postprocessor/outputs/merged/results.jsonl'
However, even though the Qwen/Qwen2.5-VL-3B-Instruct model is not that big (~8GB, maybe a bit more once uncompressed), I ended up with out-of-memory errors, as you can see in the stack trace below.
python -m mmore index --config-file examples/index/config.yaml --documents-path examples/postprocessor/outputs/merged/results.jsonl
[INDEX 🗂️ -- 2026-05-29 20:55:21] Creating the indexer...
/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/milvus_lite/__init__.py:15: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
from pkg_resources import DistributionNotFound, get_distribution
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|████| 2/2 [00:02<00:00, 1.05s/it]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
[INDEX 🗂️ -- 2026-05-29 20:55:28] my_docs already exists, adding documents to it
[INDEX 🗂️ -- 2026-05-29 20:55:28] --------------------------------------------------
[INDEX 🗂️ -- 2026-05-29 20:55:28] Collection stats (before inserting):
[INDEX 🗂️ -- 2026-05-29 20:55:28] - row_count: 384
[INDEX 🗂️ -- 2026-05-29 20:55:28] --------------------------------------------------
Indexing documents...: 0%| | 0/7 [00:00<?, ?it/s]The channel dimension is ambiguous. Got image shape torch.Size([3, 1, 1]). Assuming channels are the first dimension. Use the [input_data_format](https://huggingface.co/docs/transformers/main/internal/image_processing_utils#transformers.image_transforms.rescale.input_data_format) parameter to assign the channel dimension.
Indexing documents...: 86%|██████▊ | 6/7 [02:53<00:28, 28.92s/it]
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/src/mmore/__main__.py", line 4, in <module>
main()
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/click/core.py", line 1524, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/click/core.py", line 1445, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/click/core.py", line 1912, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/click/core.py", line 1308, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/click/core.py", line 877, in invoke
return callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/src/mmore/cli.py", line 96, in index
run_index(config_file, documents_path, collection_name)
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/src/mmore/profiler.py", line 95, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/src/mmore/run_index.py", line 48, in index
Indexer.from_documents(
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/src/mmore/index/indexer.py", line 98, in from_documents
indexer.index_documents(
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/src/mmore/index/indexer.py", line 243, in index_documents
inserted = self._index_documents(
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/src/mmore/index/indexer.py", line 184, in _index_documents
dense_embeddings = self.dense_model.embed_documents(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/src/mmore/rag/model/dense/multimodal.py", line 90, in embed_documents
outputs = self.model(**inputs, output_hidden_states=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/utils/generic.py", line 918, in wrapper
output = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1476, in forward
outputs = self.model(
^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1257, in forward
image_embeds = self.get_image_features(pixel_values, image_grid_thw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1170, in get_image_features
image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 453, in forward
hidden_states = blk(
^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/modeling_layers.py", line 94, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 281, in forward
hidden_states = hidden_states + self.attn(
^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 244, in forward
attn_outputs = [
^
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 245, in <listcomp>
attention_interface(
File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/integrations/sdpa_attention.py", line 106, in sdpa_attention_forward
attn_output = attn_output.transpose(1, 2).contiguous()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: MPS backend out of memory (MPS allocated: 19.77 GiB, other allocations: 45.85 GiB, max allowed: 63.65 GiB). Tried to allocate 47.80 MiB on shared pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
What hardware and/or data did you use during your tests for indexing? Did you encounter similar issues? (It could simply be that the QwenVL attention mechanism needs too much memory, which would be fair now that we deal with actual images.)
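One way to check whether the failure depends on how many items are embedded per forward pass is to rerun with smaller batches. The sketch below is an illustration only: `batched` and `embed_in_batches` are hypothetical helpers, `embed_fn` stands in for a call such as `self.dense_model.embed_documents`, and it has not been verified that smaller batches avoid the MPS error above.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
E = TypeVar("E")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    items: Sequence[T],
    embed_fn: Callable[[Sequence[T]], List[E]],
    batch_size: int = 1,
) -> List[E]:
    """Call embed_fn on small batches and concatenate the results in order."""
    embeddings: List[E] = []
    for batch in batched(items, batch_size):
        # Only one batch is passed to the model at a time, so the inputs
        # held for a single forward pass are bounded by batch_size.
        embeddings.extend(embed_fn(batch))
    return embeddings
```

If the error persists with `batch_size=1`, a single image (or its resolution) is the more likely cause than the batch size.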
Quite significant changes have been introduced in the latest commits, and I believe we need more time before merging to properly test and sanitize everything. Since this feature is not independent from the rest of the pipelines, this would avoid discovering bugs in them later on.
This feature is not really meant to run on local devices, so we shouldn't adapt the code so that it works with a particular config on a particular computer. Otherwise, when we try to use multimodality again in the future, we will have to figure out the parameters from scratch.
That's why we need configuration files that are known to work reliably on servers (in our case the RCP cluster) and that ensure reproducible behavior we can trust. Perhaps that is already the case with the default configurations, but we take no risk by being extra cautious.
Resolve conflicts: keep ColVision docs rename and multimodal CUDA embedding fix.
Goal
Enable multimodal RAG generation: final answers can use retrieved text + related images, while keeping text-only mode unchanged.
Main Changes vs origin/master
Indexing image references
In src/mmore/index/indexer.py, each inserted chunk now stores image_paths (JSON-serialized from image modalities).
Retrieval compatibility and parsing
In src/mmore/rag/retriever.py:
- adds _parse_image_paths(...) handling (None, list, JSON string, plain string),
- requests image_paths, then falls back if the field is missing in older Milvus schemas,
- Document.metadata now includes image_paths, page_numbers, and paragraph_numbers.
Vision toggle in config
In src/mmore/rag/llm.py, LLMConfig now includes use_vision: bool = False.
Lightweight vision module
New package: src/mmore/rag/model/vision/
- adapters.py: BaseMultimodalLLM, OpenAIMultimodalAdapter, HuggingFaceVisionAdapter, get_multimodal_llm(...),
- image_utils.py: image-path aggregation, image loading, base64 conversion, multimodal message content builder,
- __init__.py: public exports.
Pipeline split: text vs vision
In src/mmore/rag/pipeline.py:
- RAGConfig adds max_images_per_request,
- text mode: prompt | llm | StrOutputParser(),
- vision mode: invoke_with_images(...),
- output includes image_paths when vision is used.
Output schema and CLI export
- src/mmore/rag/types.py: MMOREOutput adds image_paths: List[str].
- src/mmore/run_rag.py: results export now keeps image_paths.
Supporting fixes
- src/mmore/rag/model/dense/multimodal.py: uses shared image-loading helper for safer image handling.
- src/mmore/process/post_processor/chunker/multimodal.py: fixed modality-boundary condition (>= len(sample.modalities)).
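The four input shapes listed for _parse_image_paths(...) (None, list, JSON string, plain string) can be illustrated with a round trip between indexing and retrieval. This is a minimal sketch under assumptions, not the code in this PR: `serialize_image_paths` and `parse_image_paths` are hypothetical names, and the exact fallback rules in retriever.py may differ.

```python
import json
from typing import Any, List


def serialize_image_paths(paths: List[str]) -> str:
    """Indexing side: store the list of paths as one JSON string field."""
    return json.dumps(paths)


def parse_image_paths(raw: Any) -> List[str]:
    """Retrieval side: normalize a stored field value to a list of paths."""
    if raw is None:
        return []  # field absent, e.g. a collection indexed before this change
    if isinstance(raw, (list, tuple)):
        return [str(p) for p in raw if p]
    if isinstance(raw, str):
        text = raw.strip()
        if not text:
            return []
        try:
            decoded = json.loads(text)
        except json.JSONDecodeError:
            return [text]  # not JSON: treat the value as a single plain path
        if decoded is None:
            return []
        if isinstance(decoded, list):
            return [str(p) for p in decoded if p]
        if isinstance(decoded, str):
            return [decoded] if decoded else []
        return [text]  # other JSON scalars: keep the original text as a path
    return []  # unsupported types are ignored rather than raising
```

Returning an empty list for anything unrecognized keeps text-only retrieval working even when the stored field has an unexpected form.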