Skip to content

images extracted from documents as inputs in the final LLM answering.#281

Open
chadiaitekioui wants to merge 47 commits into
EPFLiGHT:masterfrom
chadiaitekioui:issue-multimodal-llm
Open

images extracted from documents as inputs in the final LLM answering.#281
chadiaitekioui wants to merge 47 commits into
EPFLiGHT:masterfrom
chadiaitekioui:issue-multimodal-llm

Conversation

@chadiaitekioui

@chadiaitekioui chadiaitekioui commented Apr 24, 2026

Copy link
Copy Markdown
Contributor

Goal

Enable multimodal RAG generation: final answers can use retrieved text + related images, while keeping text-only mode unchanged.

Main Changes vs origin/master

Indexing image references

In src/mmore/index/indexer.py, each inserted chunk now stores image_paths (JSON-serialized from image modalities).

Retrieval compatibility and parsing

In src/mmore/rag/retriever.py:

  • added robust _parse_image_paths(...) handling (None, list, JSON string, plain string),
  • retrieval first requests image_paths, then falls back if the field is missing in older Milvus schemas,
  • returned Document.metadata now includes image_paths, page_numbers, and paragraph_numbers.

Vision toggle in config

In src/mmore/rag/llm.py, LLMConfig now includes use_vision: bool = False.

Lightweight vision module

New package: src/mmore/rag/model/vision/

  • adapters.py: BaseMultimodalLLM, OpenAIMultimodalAdapter, HuggingFaceVisionAdapter, get_multimodal_llm(...),
  • image_utils.py: image-path aggregation, image loading, base64 conversion, multimodal message content builder,
  • __init__.py: public exports.

Pipeline split: text vs vision

In src/mmore/rag/pipeline.py:

  • RAGConfig adds max_images_per_request,
  • pipeline supports two generation branches:
    • text branch: existing prompt | llm | StrOutputParser(),
    • vision branch: retrieve docs, aggregate/load images, then call invoke_with_images(...),
  • output now includes image_paths when vision is used.

Output schema and CLI export

  • src/mmore/rag/types.py: MMOREOutput adds image_paths: List[str].
  • src/mmore/run_rag.py: results export now keeps image_paths.

Supporting fixes

  • src/mmore/rag/model/dense/multimodal.py: uses shared image-loading helper for safer image handling.
  • src/mmore/process/post_processor/chunker/multimodal.py: fixed modality-boundary condition (>= len(sample.modalities)).

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds end-to-end support for a multimodal (text + images) RAG flow by persisting image references at indexing time, retrieving them back into document metadata, and routing generation through a vision-capable LLM path when enabled.

Changes:

  • Store image_paths (derived from image modalities) in Milvus during indexing; parse and attach them to retrieved Document.metadata.
  • Add a vision branch to the RAG pipeline that loads images from retrieved image_paths and calls a multimodal LLM adapter.
  • Update CLI output formatting and add a regression test for <attachment> modality chunking.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
tests/test_postprocessors.py Adds a regression test ensuring a single <attachment> keeps its image modality through chunking.
src/mmore/process/post_processor/chunker/multimodal.py Fixes an off-by-one boundary so the last/only modality is not dropped.
src/mmore/index/indexer.py Persists image_paths to Milvus as a dynamic field when indexing.
src/mmore/rag/retriever.py Requests and parses image_paths from Milvus results and adds it to Document.metadata.
src/mmore/rag/llm.py Adds use_vision toggle to LLMConfig.
src/mmore/rag/context.py New helper to format retrieved docs and load images from file paths.
src/mmore/rag/multimodal_llm.py New multimodal LLM adapter layer (OpenAI-style + Hugging Face vision adapter).
src/mmore/rag/pipeline.py Adds vision-mode branching in the RAG chain and config for max images per request.
src/mmore/run_rag.py Enhances saved outputs with num_images_used / used_image_paths.
examples/rag/config.yaml Adds commented vision example settings and max images knob.
docs/multimodal_rag.md New documentation describing the multimodal RAG flow and changes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/mmore/rag/multimodal_llm.py Outdated
Comment thread docs/multimodal_rag.md Outdated
Comment thread docs/multimodal_rag.md Outdated
Comment thread src/mmore/rag/retriever.py Outdated
Comment thread src/mmore/rag/pipeline.py Outdated
Comment thread src/mmore/rag/retriever.py
Comment thread src/mmore/rag/retriever.py
Comment thread src/mmore/rag/context.py Outdated
@chadiaitekioui chadiaitekioui force-pushed the issue-multimodal-llm branch from c9300be to f8e3bf1 Compare May 7, 2026 15:07
@chadiaitekioui chadiaitekioui marked this pull request as draft May 7, 2026 17:36
@chadiaitekioui chadiaitekioui marked this pull request as ready for review May 10, 2026 19:27
@fabnemEPFL fabnemEPFL requested a review from Copilot May 11, 2026 14:55

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.

Comment thread src/mmore/rag/retriever.py Outdated
Comment thread src/mmore/rag/pipeline.py Outdated
Comment thread src/mmore/rag/pipeline.py
Comment thread src/mmore/rag/model/vision/adapters.py
Comment thread src/mmore/run_rag.py Outdated
Comment thread docs/source/getting_started/multimodal_rag.md Outdated
Comment thread docs/source/getting_started/multimodal_rag.md Outdated

@JCHAVEROT JCHAVEROT left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @chadiaitekioui, good job with this big task ! 👏👍

For now I mostly left comments on the documentation and a few high level considerations before reviewing the actual code (but from what I saw it looks very clean so well done!)

Comment thread docs/source/getting_started/multimodal_rag.md Outdated
Comment thread docs/source/getting_started/multimodal_rag.md Outdated
Comment thread docs/source/getting_started/multimodal_rag.md Outdated
Chadi AIT EKIOUI and others added 18 commits May 20, 2026 20:57
…ent metadata as dict for LangChain- Retriever: use default output_fields on Milvus fallback when image_paths is missing from schema- HuggingFaceVisionAdapter: simplify _load; image_utils: silently skip failed loads/encodes- Tests: consolidate multimodal coverage in test_multimodal_core and test_multimodal_rag
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…che to avoid launching multiple from_pretrained

@JCHAVEROT JCHAVEROT left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After tweaking a bit your code so that I can run it on my MPS machine, I tried to run the Indexing pipeline on my computer with the config.yaml file given as example (after running the classical process and postprocess stages with template data in the repo).

indexer:
  dense_model:
    #model_name: sentence-transformers/all-MiniLM-L6-v2
    #is_multimodal: false
    # To enable vision mode
    model_name: Qwen/Qwen2.5-VL-3B-Instruct
    is_multimodal: true
  sparse_model:
    model_name: splade
    is_multimodal: false
  db:
    uri: ./proc_demo.db
    name: my_db
collection_name: my_docs
documents_path: 'examples/postprocessor/outputs/merged/results.jsonl'

However even if the Qwen/Qwen2.5-VL-3B-Instruct model is not that big (~8GB, maybe a bit more once uncompressed), as you can see in the stacktrace below I ended up having out of memory errors.

python -m mmore index --config-file examples/index/config.yaml --documents-path examples/postprocessor/outputs/merged/results.jsonl
[INDEX 🗂️  -- 2026-05-29 20:55:21] Creating the indexer...
/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/milvus_lite/__init__.py:15: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  from pkg_resources import DistributionNotFound, get_distribution
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|████| 2/2 [00:02<00:00,  1.05s/it]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
[INDEX 🗂️  -- 2026-05-29 20:55:28] my_docs already exists, adding documents to it
[INDEX 🗂️  -- 2026-05-29 20:55:28] --------------------------------------------------
[INDEX 🗂️  -- 2026-05-29 20:55:28] Collection stats (before inserting):
[INDEX 🗂️  -- 2026-05-29 20:55:28]   - row_count: 384
[INDEX 🗂️  -- 2026-05-29 20:55:28] --------------------------------------------------
Indexing documents...:   0%|                | 0/7 [00:00<?, ?it/s]The channel dimension is ambiguous. Got image shape torch.Size([3, 1, 1]). Assuming channels are the first dimension. Use the [input_data_format](https://huggingface.co/docs/transformers/main/internal/image_processing_utils#transformers.image_transforms.rescale.input_data_format) parameter to assign the channel dimension.
Indexing documents...:  86%|██████▊ | 6/7 [02:53<00:28, 28.92s/it]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/src/mmore/__main__.py", line 4, in <module>
    main()
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/click/core.py", line 1524, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/click/core.py", line 1445, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/click/core.py", line 1912, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/click/core.py", line 1308, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/click/core.py", line 877, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/src/mmore/cli.py", line 96, in index
    run_index(config_file, documents_path, collection_name)
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/src/mmore/profiler.py", line 95, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/src/mmore/run_index.py", line 48, in index
    Indexer.from_documents(
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/src/mmore/index/indexer.py", line 98, in from_documents
    indexer.index_documents(
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/src/mmore/index/indexer.py", line 243, in index_documents
    inserted = self._index_documents(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/src/mmore/index/indexer.py", line 184, in _index_documents
    dense_embeddings = self.dense_model.embed_documents(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/src/mmore/rag/model/dense/multimodal.py", line 90, in embed_documents
    outputs = self.model(**inputs, output_hidden_states=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/utils/generic.py", line 918, in wrapper
    output = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1476, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1257, in forward
    image_embeds = self.get_image_features(pixel_values, image_grid_thw)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1170, in get_image_features
    image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 453, in forward
    hidden_states = blk(
                    ^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/modeling_layers.py", line 94, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 281, in forward
    hidden_states = hidden_states + self.attn(
                                    ^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 244, in forward
    attn_outputs = [
                   ^
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 245, in <listcomp>
    attention_interface(
  File "/Users/chaverot/dev/thesis/mmore-multimodal-llm/.venv/lib/python3.11/site-packages/transformers/integrations/sdpa_attention.py", line 106, in sdpa_attention_forward
    attn_output = attn_output.transpose(1, 2).contiguous()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: MPS backend out of memory (MPS allocated: 19.77 GiB, other allocations: 45.85 GiB, max allowed: 63.65 GiB). Tried to allocate 47.80 MiB on shared pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

What hardware and/or data did you use during your tests for indexing? Did you encounter similar issues? (It could simply be QwenVL attention mechanism needing too much memory, which would be fair as now we deal with actual images)

Comment thread docs/source/getting_started/rag.md Outdated
Comment thread docs/source/getting_started/rag.md Outdated
Comment thread src/mmore/run_rag.py Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

Comment thread src/mmore/rag/model/vision/adapters.py
Comment thread src/mmore/rag/retriever.py Outdated
Comment thread src/mmore/rag/model/dense/multimodal.py
Comment thread docs/source/getting_started/rag.md Outdated

@JCHAVEROT JCHAVEROT left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite significant changes have been introduced in the latest commits, and I believe we need more time before merging to properly test and sanitize everything to avoid discovering bugs in the pipelines later on as it is not independent from the rest.

This feature is not really meant to run on local devices, so we shouldn't adapt the code so that it works with a particular config on a particular computer. Otherwise when in the future we will try to use multimodality again, we will have to figure out the parameters on our own from scratch.

That's why we need configuration files that are known to work reliably on servers (in our case the RCP cluster) and ensure a reproducible behavior we can trust. Perhaps it will be already the case with the default configurations, but at least we take no risk by being extra cautious.

Chadi AIT EKIOUI added 2 commits June 14, 2026 21:58
Resolve conflicts: keep ColVision docs rename and multimodal CUDA embedding fix.
@fabnemEPFL fabnemEPFL added the in-progress Being worked on by the core label Jun 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

in-progress Being worked on by the core

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants