Skip to content

feat(process): add NemotronVLMProcessor as alternative PDF backend#308

Open
perrin-arthur wants to merge 6 commits into
EPFLiGHT:masterfrom
perrin-arthur:feat/mistral-ocr-processor
Open

feat(process): add NemotronVLMProcessor as alternative PDF backend#308
perrin-arthur wants to merge 6 commits into
EPFLiGHT:masterfrom
perrin-arthur:feat/mistral-ocr-processor

Conversation

@perrin-arthur

@perrin-arthur perrin-arthur commented May 19, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds NVIDIA Nemotron Nano 12B V2 VL as a runtime-selectable PDF extraction backend, complementing the existing Marker/Surya pipeline. Default behaviour is unchanged.

  • New NemotronVLMProcessor rasterizes each PDF page with PyMuPDF and prompts the VLM via the OpenAI-compatible endpoint at integrate.api.nvidia.com, returning per-page Markdown concatenated into the same MultimodalSample shape as PDFProcessor.
  • Backend selection is driven by dispatcher_config.pdf_backend (marker | nemotron) in the process YAML; the dispatcher exports it to the MMORE_PDF_BACKEND env var so both processors can disambiguate in accepts() without colliding on .pdf.
  • Adds openai>=1.30 to the process extra (used as an OpenAI-compatible client for the NVIDIA endpoint).
  • Adds docs/nemotron_vlm_cost_estimate.md covering the build.nvidia.com free tier vs. managed NIM vs. self-hosted NIM on lab GPUs.

Note: an earlier revision of this PR used Mistral OCR. The approach pivoted to a generalist VLM backend so it can also run self-hosted (NIM container) on EPFL RCP / CSCS GPUs without per-call billing.

Why

The current Marker/Surya pipeline runs locally on GPU and gives full control, but on heterogeneous corpora (noisy scans, layout-heavy reports, figures) a VLM may offer better fidelity. Making the backend pluggable lets users (or downstream benches) pick the right tool without forking the pipeline.

Configuration

dispatcher_config:
  pdf_backend: marker         # or "nemotron" (requires NVIDIA_API_KEY)
  processor_config:
    NemotronVLMProcessor:
      - nemotron_model: "nvidia/nemotron-nano-12b-v2-vl"
      - nemotron_dpi: 200
      - nemotron_max_tokens: 4096
      - nemotron_temperature: 0.0

For the nemotron backend, set NVIDIA_API_KEY in the environment.

Test plan

  • mmore process with pdf_backend: marker on the sample corpus — unchanged behaviour.
  • mmore process with pdf_backend: nemotron + NVIDIA_API_KEY exported — verify markdown extraction.
  • Verify AutoProcessor.from_file() routes .pdf to the right processor under each flag.
  • Confirm the existing test suite still passes (no GPU/API code paths touched by default).

Copilot AI review requested due to automatic review settings May 19, 2026 13:47

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a runtime-selectable PDF extraction backend using Mistral’s hosted OCR API, alongside the existing local Marker/Surya-based PDFProcessor. Selection is driven by dispatcher_config.pdf_backend, with the default behavior remaining the existing Marker pipeline.

Changes:

  • Added MistralOCRProcessor that calls the Mistral OCR endpoint and returns MultimodalSample output for PDFs.
  • Added dispatcher_config.pdf_backend and export to MMORE_PDF_BACKEND so .pdf routing can be switched at runtime.
  • Added mistralai as a process extra dependency and documented OCR cost estimates.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/mmore/process/processors/mistral_ocr_processor.py New Mistral OCR-backed PDF processor and metadata/pagination logic.
src/mmore/process/processors/pdf_processor.py Steps aside when MMORE_PDF_BACKEND=mistral to avoid .pdf accept collisions.
src/mmore/process/dispatcher.py Adds pdf_backend config and sets MMORE_PDF_BACKEND during dispatcher config init.
pyproject.toml Adds mistralai>=2.4 to the process extra.
production-config/process/config.yaml Documents and wires pdf_backend + per-processor config for Mistral OCR.
docs/mistral_ocr_cost_estimate.md Adds cost estimate documentation for benchmarking Mistral OCR.
uv.lock Locks new dependencies for mistralai and transitive requirements.
Comments suppressed due to low confidence (1)

src/mmore/process/processors/mistral_ocr_processor.py:115

  • md is rewritten to include <attachment> tokens unconditionally, even when extract_images is false. Because image decoding/attachment creation is gated on extract_images, this can yield text containing attachment markers but no corresponding image modalities. Consider only substituting image markdown when extract_images is enabled, or strip image markdown entirely when it’s disabled.
                            f"Could not decode image on page {page_idx} of {file_path}: {e}"
                        )
            md = re.sub(IMG_REGEX, "<attachment>", md)
            page_texts.append((page_idx, md))

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

def __post_init__(self):
os.makedirs(self.output_path, exist_ok=True)
if self.pdf_backend:
os.environ["MMORE_PDF_BACKEND"] = self.pdf_backend.lower()
Comment on lines +25 to +29
class MistralOCRMetadata(DocumentMetadata):
paragraph_starts: List[Tuple[int, int, int]] = field(default_factory=list)
backend: str = "mistral-ocr"
model: str = "mistral-ocr-latest"

Comment on lines +55 to +59
def accepts(cls, file: FileDescriptor) -> bool:
if os.environ.get(PDF_BACKEND_ENV, "").lower() != MISTRAL_BACKEND:
return False
return file.file_extension.lower() == ".pdf"

Comment thread docs/mistral_ocr_cost_estimate.md Outdated

@JCHAVEROT JCHAVEROT May 19, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Fabrice said, you can remove it from the GitHub history completely, we don't really want this in the repo !

Some documentation: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository

@fabnemEPFL fabnemEPFL requested a review from JCHAVEROT May 19, 2026 17:22
@perrin-arthur perrin-arthur force-pushed the feat/mistral-ocr-processor branch from dc1fe6c to cb98f81 Compare May 19, 2026 18:24
@perrin-arthur perrin-arthur changed the title feat(process): add MistralOCRProcessor as alternative PDF backend feat(process): add NemotronVLMProcessor as alternative PDF backend May 19, 2026
@perrin-arthur perrin-arthur force-pushed the feat/mistral-ocr-processor branch from cb98f81 to 9936bcb Compare May 19, 2026 18:27
- NemotronVLMProcessor: NVIDIA API-based VLM backend for PDF extraction
- bench_omnidocbench.py: HF transformers driver for VLM document parsing
- CSCS SLURM job + Pyxis env for GH200 (NGC PyTorch container, ARM64)
@perrin-arthur perrin-arthur force-pushed the feat/mistral-ocr-processor branch from 357fa27 to 241eca2 Compare May 22, 2026 12:36
@JCHAVEROT JCHAVEROT added documentation Improvements or additions to documentation enhancement New feature or request dependencies Pull requests that update a dependency file labels Jun 1, 2026

@JCHAVEROT JCHAVEROT left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the NemotronVLMProcessor and it works pretty good! (a bit slow but it most likely comes from limitations inherent to the free NVIDIA API)

Please just add a short section in docs/source/getting_started/process.md to explain how to use the new processor


Note that I only gave the PDFs to process here
Image

Merged results:

{"text": "arXiv:2407.07895v2 [cs.CV] 28 Jul 2024\n\n# **LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3Din Large Multimodal Models**\n\n**Feng Li**<sup>1_,_2\u2217</sup>**, Renrui Zhang**<sup>1_,_3\u2217</sup>**, Hao Zhang**<sup>1_,_2\u2217</sup>**, Yuanhan Zhang**<sup>1_,_4</sup>**, Bo Li**<sup>1_,_4</sup>**, Wei Li**<sup>1</sup>**, Zejun Ma**<sup>1</sup>**, Chunyuan Li**<sup>1</sup>\n<sup>1</sup> ByteDance <sup>2</sup> HKUST <sup>3</sup> CUHK <sup>4</sup> NTU <sup>\u2217</sup> Core contributor\n`https://llava-vl.github.io/blog/2024-06-16-llava-next-interleave/`\n\nFigure 1. Performance comparison in three interleaved scenarios, including multi-image, multi-frame (video), and multi-view (3D). Our LLaVA-NeXT-Interleave model achieves SoTA performance across a variety of evaluation benchmarks.\n\n**Abstract**\n\n_Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, their applications to multi- image scenarios remains less explored. Additionally, prior LMM research separately tackles different scenarios, leaving it impossible to generalize cross scenarios with new emerging capabilities. To this end, we introduce **LLaVA-NeXT-Interleave**, which simultaneously tackles **M**ulti-image, **M**ulti-frame (video), **M**ulti-view (3D), and **M**ulti-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the **M4-Instruct** dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the **LLaVA- Interleave Bench** to comprehensively evaluate the multi- image performance of LMMs. Through extensive experi_\n\n_ments, LLaVA-NeXT-Interleave achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks. Besides, our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at `https://github.com/LLaVA-VL/ LLaVA-NeXT`._\n\n### 1 . Introduction\n\nRecent advancements in Large Multimodal Models (LMMs) [11, 12, 26, 37, 43, 64, 66] have showcased impressive capabilities in diverse multimodal contexts, advancing the pursuit of artificial general intelligence. With extensive vision-language data [46, 47], they empower Large Language Models (LLMs) [5, 8, 52, 53] with visual modality by aligning vision encoders [9, 44, 45]. This integration has propelled forward the field of AI, enabling complex image\nand language understanding tasks to be performed with unprecedented accuracy.\n\nHowever, most open-source LMMs [11, 24, 34, 36] have primarily focused on pushing the performance limit of the single-image scenario, the more complex multi-image scenarios remain largely less explored. This oversight is significant given that many real-world applications demand multi-image capabilities, such as comprehensive multi- image analyses. Traditionally, researchers have approached these challenges by training separate task-specific models for each application scenario, e.g., multi-image [1, 19, 27], video [7, 29, 67], and 3D [14, 15, 58]. This is both labor-intensive and time-consuming, resulting in fragmented methodologies that are inefficient and often unscalable. Considering the diverse range of computer vision settings and data formats, there is a pressing need to develop a general framework for LMMs that can operate effectively across these varied contexts.\n\nIn this paper, we observe that the image-text interleaved format can naturally serve as a general data template to unify different scenarios, e.g., single-image or multi-image as special cases, video as multi-frames, and 3D as multi- views, as illustrated in Figure 2. Therefore, we present _LLaVA-NeXT-Interleave_, an all-around LMM that extends the model capabilities to various real-world settings such as _M_ulti-image, _M_ulti-frame (videos), _M_ulti-view (3D) while maintains the performance of the _M_ulti-patch (single- image) performance. We denote the four settings as _M4_.\n\nThe core innovation of our approach lies in the perspective to leverage an image-text interleaved format as a universal data template capable of accommodating different scenarios, and construct the related visual instruction- following data. This perspective not only simplifies the training process across various domains, but also allow the model to emerge new capabilities due to cross-domain task composition.\n\nOur contributions are summarized as below:\n\n\u2022 **_Interleave data format unifies different tasks._** We convert multi-image, video, 3D, and single-image data all into an interleaved training format, which unifies different tasks in a single LMM.\n\n\u2022 **_New dataset and benchmark._** We compile a high- quality training dataset, **M4-Instruct**, with 1177.6 samples to empower LMMs with the M4 capabilities, which spans 4 primary domains (multi-image, video, 3D, and single-image) with 14 tasks and 41 datasets. We also curate LLaVA-Interleave Bench, a diverse set of benchmarks to evaluate the multi-image performance, including 7 newly collected and 13 existing in/out-domain benchmarks.\n\n\u2022 **_SoTA performance._** With a single model, LLaVA- NeXT-Interleave can achieve leading results across\n\ndifferent multi-image tasks compared to the previous SoTA, while maintaining the single-image performance, as exemplified in Figure 1.\n\n\u2022 **_Emerging capabilities with cross-task transfer._** By jointly training on a diverse set of tasks, our model showcases emerging capabilities to transfer tasks across different settings and modalities. e.g., from spotting differences between images to videos.\n\n## 2 . Related Work\n\n### Interleaved Image-text Training Data.\n\nAs a more general format, interleaved image-text data can enable LMMs with two distinctive capabilities: multimodal in-context learning (ICL) capability and instruction-following capability in real-world multi-image application scenarios. _The former in-context scenarios_ interleave several image-text examples within the prompt as task demonstrations, adapting LMMs to new tasks in the inference stage in a few- shot manner. Flamingo [1] is first model to demonstrate this capability, and thus is considered as GPT-3 moment for multimodal community. Typically, the multimodal ICL ability is emerged after pre-training on web-scale raw interleaved image-text sequences. In the open-source community, MMC4 [68] introduces a public 101.2M interleaved dataset spanning everyday topics, OBELICS [22] also presents a filtered dataset comprising 141M interleaved web pages. Kosmos-1 [18] curates a 71M multimodal corpora, including arbitrarily interleaved documents. To explicitly enable the ICL capability, MIMIC-IT [25] proposes an automatic pipeline to create 2.8M multimodal samples in the instruction-tuning stage. On the other hand, _the latter multi-image scenarios_ aim to tackle diverse real-world applications scenarios that involve multi-images. The training data of VPG-C [27] collected 4 new datasets with ChatGPT. Mantis-Instruct [19] compiles existing 11 interleaved datasets and creates 4 new datasets. The proposed M4- Instruct [19] compiles existing 41 interleaved datasets and creates 6 new datasets, covering a much higher scenarios diversity than Mantis-Instruct.\n\n### Interleaved LMMs.\n\nAs representative _closed-source LMMs_, both GPT-4V [42] and Gemini [12] support real- world multi-image application scenarios with leading performance. With various public datasets aforementioned, the community has developed _open-source LMMs_ equipped with remarkable multi-image proficiency. The ICL performance is typically considered to evaluate multimodal pre-training, which has been adopted in several known LMMs, such as OpenFlamingo [2], IDEFICS series [22, 23], VILA [33] and MM1 [41], Emu2 [51]. Otter [25] is initialized from OpenFlamingo, and is fine-tuned on the MIMIC-IT dataset to further improve ICL ability with\ninstruction-tuning. In contrast, the use of instruction- tuning in LMMs for various real-world multi-image applications has been less explored, despite of Mantis [19]. The proposed LLaVA-NeXT-Interleave not only broadens the multi-image scenario itself as demonstrated by the improved experimental results, but also generalize the settings to diverse scenarios with one model, e.g., video, 3D, and single-image. The cross-scenario training leads to emerging capabilities, achieving zero-shot task composition in new multi-image contexts.\n\n#### Interleaved Benchmarks.\n\nTo assess the interleaved multi-image capabilities of LMMs, there have been several high-quality benchmarks in various scenarios. The ICL benchmarks [20, 49] for LMMs comprehensively evaluate their interleaved skills from few-shot to many-shot settings. For the more challenging multi-image scenarios, previous works mainly focus on a specific domain for evaluation, including NLVR2 [50] for daily-life VQA, MMMU [61] for colleague-level problem-solving, MathVerse-mv [65] and SciVerse-mv [13] for mathematical and scientific reasoning, BLINK [10] to challenge LMMs, and Mantis-Eval [19] for multi-image understanding. To further evaluate LMMs on a collection of multi-image scenarios, DEMON [27] is the first benchmark that compiles dozens of datasets with 477K samples. With the large amount of data and high diversity, DEMON lays a good foundation for multi-image research. Unfortunately, it also inherits a significant amount of low-quality data samples from existing datasets. To facilitate evaluation, the proposed LLaVA-Interleave Bench curate high-quality samples, comprising both specific (synthetic, mathematical, low-level) and general (daily, real- world, text-rich) multi-image scenarios. With 9 newly curated and 13 existing datasets, we categorize them into in- domain (12.9K) and out-domain (4.1K) schemes. Concurrent multi-image evaluation benchmarks include MuirBench [54] and ReMI [21].\n\n## 3 . Interleaved Multi-image Tasks & Data\n\n### 3.1 . Task Overview\n\nWe observe different computer vision scenarios can be generally represented by the interleaved multi-image format, such as video, 3D, and single-image data. Therefore, to endow LLaVA-Interleave with diverse capabilities, as shown in Figure 2, we adopt the interleaved multi-image format to unify the data input of the following four tasks:\n\n**Multi-image scenarios** include visual instructions incorporating interleaved vision-language input with multiple images. This setting covers 12 challenging real-world tasks included in our training data, such as spotting the difference, visual story telling, image editing instruction generation, interleaved multi-image dialogue, multi-image puzzle, low-level multi-image assessment, etc.\n\n**Multi-frame scenarios** refer to taking video as input data by sampling it into multiple frames, preserving temporal visual cues across the multi-image sequence. We mainly focus on 2 tasks: video detailed captioning and video VQA.\n\n**Multi-view scenarios** depict 3D environments by multi- view images from different perspectives, where the visual correspondence and disparity can indicate spatial information in the 3D world. For 3D perception, we include 2 tasks: embodied VQA (dialogue and planning), and 3D scene VQA (captioning and grounding).\n\n**Multi-patch scenarios** represent the conventional single- image tasks. With the design of \u2018any resolution\u2019 in LLaVA- NeXT [36], we divide a high-resolution image into multiple low-resolution patches for efficient visual encoding, compatible with our interleaved multi-image format.\n\n### 3.2 . M4-Instruct\n\nTo empower all-round multi-image capabilities, we meticulously curate a comprehensive training dataset including 1177.6K instances, termed M4-Instruct, widely\n\nFigure 2. Tasks in our M4-Instruct. (a) showcases an example of interleaved multi-image scenarios (visual story telling). (b), (c), and (d) indicate that video, 3D and single-image data can also be organized as the interleaved data format for unified processing.\nspanning multi-image, multi-frame, and multi-view scenarios with 14 tasks and 41 datasets, along with multi-patch data to preserve basic single-image performance. We showcase task examples of the first three scenarios in Figure 3.\n\nWe exhibit a data overview of M4-Instruct in Figure 4, and the detailed data statistics in Table 15. For the multi- image data, most of the datasets are collected from previous public efforts and rigorously converted into our unified format with task-specific instructions, some inspired by DE-\n\nMON [27] and Mantis [19]. On top of that, we also utilize GPT-4V [43] to annotate 3 new tasks to enable more diverse capabilities, i.e., Real-world Difference, Synthetic Difference, and Twitter Post. For the video data, we collect a 255K subset from LLaVA-Hound [63], including 240K video VQA and 15K video detailed captioning. We also include NExT-QA [57] and STAR [55] to expand our video training data. For the 3D data, we widely gather the training set from nuScenes QA [6], ALFRED [48], ScanQA [3], and\n\nFigure 3. Task examples of M4-Instruct, containing diverse scenarios in multi-image, multi-frame (video), and multi-view (3D).\n\nFigure 4. M4-Instruct training data statistics.\n\nFigure 5. LLaVA-Interleave Bench statistics.\n3D-LLM [16], covering both outdoor and indoor scenarios. For the single-image data, we randomly sample 40% of the stage-2 fine-tuning data from LLaVA-NeXT [24], which aims to preserve the single-image capacity.\n\nTo comprehensively evaluate the interleaved multi- image performance, we introduce the LLaVA-Interleave Bench for LMMs, consisting of 13 challenging tasks with 17K instances. We present a data overview of the benchmark in Figure 3, and the detailed data statistics in Table 16. In detail, we categorize multi-image tasks into two classes:\n\n\u2022 _In-domain Evaluation_ includes tasks that have been \u2018seen\u2019 during our training, designed to verify the model performance within familiar scenarios. We adopt 5 newly curated multi-image tasks corresponding to training datasets, and 2 existing benchmarks, Q- Bench [56] and NLVR2 [50], with 12.9K in total.\n\n\u2022 _Out-domain Evaluation_ involves tasks that don\u2019t overlap with training scenarios, aiming to reveal the generalization capacity of LMMs. We construct 2 new tasks for multi-image mathematical (MathVerse [65]) and scientific (SciVerse [13]) comprehension, and utilize 3 existing benchmarks, Mantis-Eval [19], BLINK [10], and MMMU [60], with 4.1K in total.\n\n## 4 . Interleaved Visual Instruction Tuning\n\nIn this section, we introduce several key techniques during the interleaved visual instruction tuning of LLaVA- NeXT-Interleave. For architecture designs, we follow LLaVA-NeXT [24] to adopt the most general framework, i.e., a vision encoder [62], an intermediate projector, and a powerful LLM [4]. Then, we consider the following three techniques to achieve improved multi-image performance.\n\n#### Technique 1: Continue training from single-image models.\n\nThe interleaved multi-image tasks can be regarded as an extension of single-image scenarios, more flexible in formats and challenging in reasoning. Therefore, to better leverage the pre-trained single-image proficiency, we adopt an off-the-shelf LLaVA-NeXT-Image [24] as the base model, which has gone through a stage-1 image-caption pre-training and a stage-2 single-image fine-tuning. On top of this model, we perform the interleaved multi-image instruction tuning with our M4-Instruct dataset.\n\n#### Technique 2: Mixed Interleaved data formats during training.\n\nWe adopt two format choices for the positions of image tokens during the interleaved multi-image training. The first is to place all the image tokens in front of the prompt, while maintaining the placeholders \\(\\langle\\)image\\(\\rangle\\) within the text, denoted as \u2018In-the-front format\u2019. The second preserves the interleaved format to put image tokens in the\n\nplace they are originally in, i.e., the positions of \\(\\langle\\)image\\(\\rangle\\), denoted as \u2018interleaved format\u2019. In this way, LLaVA-NeXT- Interleave supports more flexible inference modes, exhibiting robustness to different input formats.\n\n#### Technique 3: Combining different data scenarios improves individual task performance.\n\nMost existing works conduct supervised fine-tuning with only one type of data source, e.g., multi-image tuning of Mantis [19] and multi-frame tuning of LLaMA-VID [31]. Instead, we utilize the M4-Instruct to simultaneously conduct instruction tuning with four different tasks (multi- image/frame/view/patch). With a unified interleaved format, distinct data scenarios have the potential to provide complementary semantics and instruction-following capabilities.\n\n## 5 . Experiments\n\nIn Section 5.1, we first introduce our evaluation schemes and implementation details. Then, in Section 5.2, we report and analyze the quantitative results in four interleaved multi-image scenarios.\n\n### 5.1 . Settings\n\n#### Evaluation Schemes.\n\nWe evaluate our LLaVA-NeXT- Interleave model on four real-world interleaved scenarios, i.e., multi-image, multi-frame (video), multi-view (3D), and multi-patch (single-image).\n\n\u2022 _For multi-image evaluation_, we adopt the proposed LLaVA-Interleave Bench covering comprehensive in- domain and out-domain tasks.\n\n\u2022 _For video evaluation_, we utilize the existing NExT- QA [57], MVBench [30], Video Detailed Description (VDD) [67], and ActivityNet-QA (Act) [59]. For ActivityNet-QA, we present both the accuracy and GPT score (Acc/Score). We also evaluate on VideoChat-GPT (VCG) [40] with five metrics: CI (Correctness of Information), DO (Detail Orientation), CU (Context Understanding), TU (Temporal Understanding), and CO (Consistency).\n\n\u2022 _For 3D evaluation_, we select ScanQA [3], two tasks from 3D-LLM [16], i.e., 3D-assisted Dialogue and Task Decomposition, and also curate two new test set from nuScenes VQA [6] and ALFRED [48].\n\n#### Implementation Details.\n\nFollowing the same architecture in LLaVA-NeXT [24], our LLaVA-NeXT-Interleave adopts Qwen 1.5 [5] as the base LLM with 0.5B, 7B and 14B parameters, SigLIP-400M [62] with 384\\(\\times\\)384 resolutions as the vision encoder, and a two-layer MLP as the projection layer.\n## 5.2 . Main Results\n\n### Multi-image Results.\n\nAs reported in Table 1, the average multi-image performance of LLaVA-NeXT-Interleave surpasses previous open-source models in both in- and out- domain benchmarks. For in-domain evaluation, our model demonstrates significant advantages across various tasks as expected, due to the multi-image instruction tuning with M4-Instruct. For out-domain evaluation, LLaVA-NeXT-\n\nInterleave also showcases superior generalization capacity within novel scenarios, e.g., comparable to GPT-4V on Mantis-Eval and BLINK.\n\n### Multi-frame (Video) Results.\n\nCompared with previous video-based LMMs under similar model sizes, LLaVA- NeXT-Interleave achieves superior results on many benchmarks in Table 2, though not specifically designed for video tasks. We also follow LLaVA-Hound to add DPO training\n\n\\begin{tabular}{c c c c c c c c c c c c c c c c}\n\\multirow{2}{*}{Model} & \\multicolumn{8}{c}{In-domain Evaluation} & \\multicolumn{7}{c}{Out-domain Evaluation} \\\\\n & Avg & SD & IE & VST & TRVQA & MIVQA & Puzzle & QB & NLVR2 & Avg & Math & Sci & Mantis & BLINK & MMMU-mv \\\\\nGPT-4V [43] & 39.2 & 12.5 & 11.0 & 10.9 & 54.5 & 52.0 & 17.1 & 76.5 & 88.8 & 57.8 & 60.3 & 66.9 & 62.7 & 51.1 & 47.9 \\\\\nLLaVA-NeXT-Image (7B) [36] & 32.4 & 12.9 & 13.2 & 10.1 & 59.6 & 39.4 & 9.0 & 51.0 & 68.0 & 29.4 & 13.5 & 12.2 & 46.1 & 41.8 & 33.5 \\\\\nVPG-C (7B) [28] & 35.8 & 27.8 & 15.2 & 21.5 & 38.9 & 46.8 & 2.4 & 57.6 & 73.2 & 34.5 & 24.3 & 23.1 & 52.4 & 43.1 & 29.4 \\\\\nMantis (7B) [19] & 39.6 & 17.6 & 11.2 & 12.5 & 45.2 & 52.5 & 25.7 & 69.9 & 87.4 & 39.3 & 27.2 & 29.3 & 59.5 & 46.4 & 34.1 \\\\\nLLaVA-NeXT-Interleave &  &  &  &  &  &  &  &  &  &  &  &  &  &  &  \\\\\n0.5B & 43.9 & 34.3 & 21.6 & 29.7 & 63.9 & 54.8 & 35.4 & 52.0 & 67.8 & 33.1 & 13.3 & 12.2 & 45.6 & 39.2 & 28.6 \\\\\n7B & 58.6 & 37.1 & 24.3 & 33.1 & 76.1 & 87.5 & 48.7 & 74.2 & 88.8 & 42.8 & 32.8 & 31.6 & 62.7 & 52.6 & 34.5 \\\\\n14B & **62.3** & **40.5** & **24.5** & **33.3** & **78.6** & **95.0** & **59.9** & **76.7** & **91.1** & **44.3** & **33.4** & **32.7** & **66.4** & **52.1** & **37.1** \\\\\n\\end{tabular}\n\n\\begin{tabular}{c c c c c c c c c c}\n\\multirow{2}{*}{Model} & \\multirow{2}{*}{NExTQA} & \\multirow{2}{*}{MVBench} & \\multirow{2}{*}{ActivityNet-QA} & \\multirow{2}{*}{VDD} & \\multicolumn{5}{c}{VideoChat-GPT} \\\\\n &  &  &  &  & CI & DO & CU & TU & CO \\\\\nGPT-4V [43] & - & - & - & 4.00 & 4.09 & 3.88 & 4.37 & 3.94 & 4.02 \\\\\nVideoChatGPT (7B) [39] & - & - & 35.2/2.70 & - & 2.40 & 2.52 & 2.62 & 1.98 & 2.37 \\\\\nVideo-LLaVA (7B) [32] & - & - & 45.3/3.30 & - & 2.87 & 2.94 & 3.44 & 2.45 & 2.51 \\\\\nVISTA-LLaMA (7B) [38] & - & - & 48.3/3.30 & - & 2.44 & 2.31 & 2.64 & 3.18 & 2.26 \\\\\nVideoChat2 (7B) [29] & 68.6 & 51.9 & 49.1/3.30 & - & 3.02 & 2.88 & 3.51 & 2.66 & 2.81 \\\\\nLLaMA-VID (7B) [31] & - & 50.2 & 47.4/3.30 & 2.84 & 3.01 & 2.97 & 3.54 & 2.53 & 2.60 \\\\\nLLaVA-NeXT-Video (7B) [67] & - & - & 53.5/3.20 & 3.32 & 3.39 & 3.29 & 3.92 & 2.60 & 3.12 \\\\\nLLaVA-NeXT-Video-DPO (7B) & - & - & 60.2/3.50 & 3.72 & 3.64 & 3.45 & 4.17 & 2.95 & 4.08 \\\\\nLLaVA-NeXT-Video-DPO (34B) & - & - & **64.4**/**3.60** & 3.84 & 3.81 & 3.55 & 4.24 & 3.14 & 4.12 \\\\\nLLaVA-NeXT-Interleave &  &  &  &  &  &  &  &  &  \\\\\n0.5B & 59.5 & 45.6 & 48.0/2.84 & 3.25 & 3.12 & 2.97 & 3.62 & 2.36 & 3.27 \\\\\n7B & 78.2 & 53.1 & 55.3/3.13 & 3.57 & 3.51 & 3.28 & 3.89 & 2.77 & 3.68 \\\\\n14B & 79.1 & **54.9** & 56.2/3.19 & 3.59 & 3.65 & 3.37 & 3.98 & 2.74 & 3.67 \\\\\n7B (DPO) & **77.9** & 52.3 & 55.0/3.13 & **3.90** & **3.99** & **3.61** & **4.24** & **3.19** & **4.12** \\\\\n\\end{tabular}\n\n\\begin{tabular}{c c c c c c c}\n\\multirow{3}{*}{Model} & \\multicolumn{6}{c}{In-domain Evaluation} \\\\\n & \\multirow{2}{*}{Avg} & 3D-assisted & Task & ScanQA & ALFRED & nuScenes \\\\\n &  & Dialogue & Decomposition & (val) &  & VQA \\\\\nFlamingo [1] & 20.5 & 27.9 & 33.2 & 31.1 & 5.3 & 4.9 \\\\\nGPT-4V [43] & 34.6 & 31.2 & 35.4 & 32.6 & 10.3 & 63.7 \\\\\nPoint-Bind \\& LLM [14] & 22.5 & 38.3 & 35.8 & 34.6 & 0.6 & 3.3 \\\\\n3D-LLM [17] & 22.9 & 39.3 & 37.8 & 35.7 & 1.4 & 0.4 \\\\\nMantis (7B) [19] & 18.7 & 2.60 & 14.7 & 16.1 & 14.0 & 46.2 \\\\\nLLaVA-NeXT-Interleave &  &  &  &  &  &  \\\\\n0.5B & 53.0 & 67.2 & 48.5 & 29.3 & 57.0 & 62.8 \\\\\n7B & 58.2 & 69.3 & 51.4 & 32.2 & 61.6 & 76.5 \\\\\n14B & **59.2** & **70.6** & **52.2** & **34.5** & **62.0** & **76.7** \\\\\n\\end{tabular}\n\n\\begin{tabular}{c c c c c c c c c}\nModel & LLM & Avg & AI2D & ChartQA & DocVQA & MME & SciQA & POPE \\\\\n\\multirow{2}{*}{Single\nInterleave} & \\multirow{2}{*}{0.5B} & 59.8 & 51.7 & 50.2 & 59.1 & 52.8 & 60.0 & 85.4 \\\\\n &  & **60.5** & 52.2 & 52.2 & 59.2 & 52.0 & 60.6 & 86.8 \\\\\n\\multirow{2}{*}{Single\nInterleave} & \\multirow{2}{*}{7B} & 72.3 & 72.7 & 66.3 & 75.6 & 61.0 & 71.1 & 86.9 \\\\\n &  & **73.3** & 73.9 & 67.2 & 75.7 & 63.5 & 72.6 & 86.8 \\\\\n\\multirow{2}{*}{Single\nInterleave} & \\multirow{2}{*}{14B} & **77.2** & 77.5 & 72.1 & 80.0 & 67.7 & 78.9 & 87.3 \\\\\n &  & 76.4 & 76.5 & 71.2 & 78.9 & 66.2 & 77.4 & 87.9 \\\\\n\\end{tabular}\n\nTable 1. Results on our LLaVA-Interleave Bench. SD: Spot the Difference, IE: Image Edit Instruction, VST: Visual Story Telling, TRVQA: Text-rich VQA, MIVQA: Multi-image VQA, QB: Q-Bench, SQ: ScanQA, Math: MathVerse-mv, Sci: SciVerse-mv.\n\nTable 2. Results on multi-frame (video) benchmarks. VDD: Video Detailed Description. CI (Correctness of Information), DO (Detail Orientation), CU (Context Understanding), TU (Temporal Understanding), and CO (Consistency).\n\nTable 3. Results on 3D benchmarks. 3D-assisted Dialogue and Task Decomposition are evaluation tasks from 3D-LLM.\n\nTable 4. Results on multi-patch (single-image) benchmarks with different LLM sizes. \u2018Single\u2019 and \u2018Interleave\u2019 denote LLaVA- NeXT-Image and our model, respectively.\nafter our M4-Instruct tuning. After adding DPO, our 7B model attains SoTA performance on VDD and VideoChatGPT benchmarks, surpassing the previous LLaVA-NeXT- Video (34B). This demonstrates the effective temporal understanding and reasoning capabilities of our model across sequential frames. Note that we calculate the average scores by multiplying a weight of 10 times by the score of Video Detailed Description and VideoChat-GPT.\n\n### Multi-view (3D) Results.\n\nFor 3D perception in Table 3, our model also obtains leading results for both indoor and outdoor scenarios on five in-domain benchmarks. Compared to 3D-LLM and Point-LLM with additional point clouds as input, LLaVA-NeXT-Interleave only accepts multi-view images to interpret the 3D world, attaining significantly higher scores in challenging 3D scenarios.\n\n### Multi-patch (single-image) Results.\n\nWe also add 307k (40%) of original LLaVA-NeXT single-image data, which makes our model capable of doing single-image tasks. We use the _anyres_ training for single-image data, which divides an image into multiple patches, forming another multi- image setting. As shown in Table 4, we maintain the single- image performance of LLaVA-NeXT-Image. As single- image data is of high quality and diversity, adding single- image data also improves the instruction-following ability and enables task transfer from single-image to multi-image, which is demonstrated in Section 6.\n\n### 5.3 . Ablations of Proposed Techniques\n\nWe study the effectiveness of the three proposed training techniques in Section 4 as below.\n\n\u2022 In Table 5, we compare training strategies. It is seen that initialization from a good single-image model checkpoint (from Stage-2) can consistently enhance the interleaved multi-image performance, than directly from a Stage-1 model checkpoint.\n\n\u2022 In Table 6, our mixed-format training can benefit the results of both two input formats.\n\n\u2022 In Table 7, we progressively incorporate single-image and multi-image data upon the video data. The integration of more sources contributes to enhanced performance, compared with models from individual visual scenarios.\n\n## 6 . Emerging Capabilities\n\nIn this section, we show some example to demonstrate the emerging capabilities of our model. Emerging capabilities means the capabilities do not trained during training but demonstrated when inference. We mainly showcase the emerging capabilities from three aspects:\n\n1. **Task Transfer from Single-image to Multi-image:** The capability to reason over one image and tell the funny part is initially observed in single-image models [35], and not included in our multi-image training. As shown in Table 8, our model is capable of _analyzing the fun part within multiple images_. This new task is probably emerged by the composition of the single- image capability and multi-image VQA training.\n\n2. **Task Transfer from Image to Video:** We only include the multi-image Twitter post task in the M4- Instruct training, while our model can directly perform\n\n\\begin{tabular}{c c c c c c c c c c c c c c}\n\\multirow{2}{*}{Continue training} & \\multicolumn{4}{c}{Multi-image} & \\multicolumn{3}{c}{Multi-frame} & Multi-view & \\multicolumn{5}{c}{Single-image} \\\\\n & Mantis & BLINK & QB & NLVR2 & Act & MVB & VDD & ScanQA & AI2D & ChartQA & DocVQA & MME\\* & POPE & SQA \\\\\nFrom stage-1 pre-training & 41.0 & 37.6 & 47.0 & 54.0 & 44.7/2.17 & 43.0 & 2.96 & 27.7 & 46.3 & 38.3 & 47.5 & 47.1 & 85.4 & 59.4 \\\\\nFrom single-image models & 45.6 & 39.2 & 52.0 & 67.8 & 48.0/2.84 & 45.6 & 3.25 & 29.3 & 52.2 & 52.2 & 59.2 & 52.0 & 86.8 & 60.6 \\\\\n\\end{tabular}\n\n\\begin{tabular}{c c c c c c c}\nTraining & Inference & \\multirow{2}{*}{Avg} & \\multirow{2}{*}{Spot the\nDifference} & \\multirow{2}{*}{Visual Story\nTelling} & \\multirow{2}{*}{Text-rich\nVQA} & \\multirow{2}{*}{Q-Bench} \\\\\nSetting & Setting &  &  &  &  &  \\\\\n\\multirow{2}{*}{In-the-front} & Interleaved & 52.9 & 36.8 & 30.5 & 70.1 & 74.0 \\\\\n & In-the-front & 54.3 & 36.6 & 32.8 & 74.7 & 75.3 \\\\\n\\multirow{2}{*}{Interleaved} & Interleaved & 55.4 & 37.8 & 32.9 & 76.2 & 76.0 \\\\\n & In-the-front & 52.4 & 36.1 & 29.0 & 72.9 & 71.8 \\\\\n\\multirow{2}{*}{Mixed} & Interleaved & 57.0 & 38.3 & 32.5 & 78.1 & 76.9 \\\\\n & In-the-front & 56.6 & 37.9 & 32.5 & 78.4 & 76.3 \\\\\n\\end{tabular}\n\n\\begin{tabular}{c c c c c c c c}\n\\multirow{2}{*}{Data} & \\multirow{2}{*}{NEXT-QA} & \\multirow{2}{*}{VDD} & \\multicolumn{5}{c}{VideoChatGPT} \\\\\n &  &  & CI & DO & CU & TU & CO \\\\\nVideo & 42.6 & 3.46 & 3.47 & 3.27 & 3.87 & 2.74 & 3.61 \\\\\nVideo + Single-image & 67.7 & 3.49 & 3.46 & 3.30 & 3.85 & 2.71 & 3.60 \\\\\nVideo + Multi-image & 77.7 & 3.50 & 3.50 & 3.31 & 3.90 & 2.70 & 3.63 \\\\\nVideo + Both & 78.2 & 3.58 & 3.50 & 3.27 & 3.87 & 2.77 & 3.68 \\\\\n\\end{tabular}\n\nTable 5. Ablation on whether to continue training from single-image models. QB: Q-Bench, Act: ActivityNet-QA, MVB: MVBench, VDD: Video Detailed Description, MME\\*: Throughout our paper, we convert MME\u2019s score to accuracy by summing up the perception and cognition scores and dividing 2800, SQA: Scienceqa-IMG.\n\nTable 6. Ablation on mixed interleaved data formats. We select several tasks within LLaVA-Interleave Bench for ablation.\n\nTable 7. Ablation on the improvement of combined data scenarios for video tasks. CI (Correctness of Information), DO (Detail Orientation), CU (Context Understanding), TU (Temporal Understanding), and CO (Consistency).\n the _witter post from a video_, as shown in Table 9. This new task is probably composed by the training data of multi-image Twitter post and video VQA tasks.\n\n3. **Real-world Applications:** In Tables 10, 11, and 12, we showcase three real-world scenarios that are not explicitly contained in our interleaved training data, which are multi-image painting style recognition, PPT summary & QA, and multi-doc VQA. This demonstrates our generalization potentials to a broader spectrum of applications.\n\nMore interesting demos can be found in our project page<sup>1</sup>.\n\n## 7 . Conclusion\n\nIn conclusion, our research highlights the transformative potential of LLaVA-NeXT-Interleave in unifying and advancing the capabilities of Large Multimodal Models (LMMs) across diverse visual tasks. By leveraging the interleaved data format, we effectively integrate multi-image, video, 3D, and single-image scenarios, offering a cohesive approach to handling twoose varied challenges. The introduction of the comprehensive **M4-Instruct** dataset and the **LLaVA-Interleave Bench** provides a solid foundation for training and evaluating LMMs, ensuring high-quality performance across multiple domains. Our extensive experiments validate that LLaVA-NeXT-Interleave not only sets new state-of-the-art benchmarks in multi-image tasks but also maintains exceptional performance in single-image tasks. Furthermore, the model exhibits promising emerging capabilities, such as cross-task transfer, showcasing its versatility and potential for broader applications. This work sets a new precedent in the field, paving the way for future advancements in multimodal AI and complex visual understanding tasks.\n\n<sup>1</sup>`https://llava-v1.github.io/blog/2024-06-16-llava-next-interleave/`\nTable 8. An example of emerging capability: multi-image fun part analyze.\n\nTable 9. An example of emerging capability: Twitter post for videos.\n\nTable 10. An example of emerging capability: recognizing artists\u2019 painting styles.\n\n\\begin{tabular}{c c}\n\\multicolumn{2}{c}{**Task Transfer from Single-image to Multi-image**} \\\\\nUser & What is fun about the images? \\\\\nOurs & The humor in these images comes from the anthropomorphized birds engaging in human-like activities and conversations. In the first image, one bird is asking about going to get some fries on the pier, which is a human activity. The second image shows a bird asking about the \"culminating result of consciousness,\" which is a philosophical question rather than a literal one. The third image has a bird expressing its desire to do something with the fries on the pier, again referencing a human activity. The anthropomorphism of the birds adds a layer of humor by attributing human thoughts and desires to animals. \\\\\n\\end{tabular}\n\n\\begin{tabular}{c c}\n\\multicolumn{2}{c}{**Task Transfer from Image to Video**} \\\\\nUser & Help me write a twitter post to describe this video. \\\\\nOurs & \"Feeling the chill in the air, but the cherry blossoms are a sight to behold! Walking down the street, each person bundled up against the winter chill. The snow-covered trees and market stalls add to the serene beauty of this winter wonderland. #CherryBlossoms #WinterWalk #CityScenes\". \\\\\n\\end{tabular}\n\n\\begin{tabular}{c c}\n\\multicolumn{2}{c}{**Real-World Applications**} \\\\\nUser & Do you kown who draw these paintings? \\\\\nOurs & Yes, these paintings are by the renowned artist Vincent van Gogh. The first painting is his famous work \"The Starry Night,\" which depicts a swirling night sky with a bright crescent moon and stars. The second painting is his self-portrait, which shows him smoking a pipe and wearing a hat. The third painting is also a self-portrait, but it features a different expression and hairstyle. The fourth painting is by another artist, not Van Gogh, and depicts a person seated with flowers in the background. \\\\\n\\end{tabular}\nTable 11. An example of emerging capability: PPT summary & VQA.\nTable 12. An example of emerging capability: multi-doc VQA.\n## References\n\n[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. _arXiv preprint_, 2022.\n\n[2] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. _arXiv preprint arXiv:2308.01390_, 2023.\n\n[3] Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In _proceedings of the IEEE/CVF conference on computer vision and pattern recognition_, pages 19129-19139, 2022.\n\n[4] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. _arXiv preprint arXiv:2309.16609_, 2023.\n\n[5] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. _arXiv preprint arXiv:2308.12966_, 2023.\n\n[6] Ankan Bansal, Yuting Zhang, and Rama Chellappa. Visual question answering on image sets. In _Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXI 16_, pages 51-67. Springer, 2020.\n\n[7] Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yeli Wang, Tu Qiao, Tong Lu, et al. Videollm: Modeling video sequence with large language models. _arXiv preprint arXiv:2305.13292_, 2023.\n\n[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. _arXiv preprint arXiv:1810.04805_, 2018.\n\n[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,\n\nThomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. _arXiv preprint arXiv:2010.11929_, 2020.\n\n[10] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei- Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. _arXiv preprint arXiv:2404.12390_, 2024.\n\n[11] Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, et al. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. _arXiv preprint arXiv:2402.05935_, 2024.\n\n[12] Google Gemini Team. Gemini: a family of highly capable multimodal models. _arXiv preprint arXiv:2312.11805_, 2023.\n\n[13] Ziyu Guo, Renrui Zhang, Hao Chen, Jialin Gao, Peng Gao, Hongsheng Li, and Pheng-Ann Heng. Sciverse. _https://sciverse-cuhk.github.io_, 2024.\n\n[14] Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. _arXiv preprint arXiv:2309.00615_, 2023.\n\n[15] Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, et al. Imagebind-llm: Multi-modality instruction tuning. _arXiv preprint arXiv:2309.03905_, 2023.\n\n[16] Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. _Advances in Neural Information Processing Systems_, 36:20482-20494, 2023.\n\n[17] Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models, 2023.\n\n[18] Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and Furu Wei. Language is not all you need: Aligning perception with language models. _ArXiv_, abs/2302.14045, 2023.\n\n[19] Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Inter\nTitle Suppressed Due to Excessive Length 21\n\nleaved multi-image instruction tuning. _arXiv preprint arXiv:2405.01483_, 2024.\n\n[20] Yixing Jiang, Jeremy Irvin, Ji Hun Wang, Muhammad Ahmed Chaudhry, Jonathan H Chen, and Andrew Y Ng. Many-shot in-context learning in multimodal foundation models. _arXiv preprint arXiv:2405.09798_, 2024.\n\n[21] Mehran Kazemi, Nishanth Dikkala, Ankit Anand, Petar Devic, Ishita Dasgupta, Fangyu Liu, Bahare Fatemi, Pranjal Awasthi, Dee Guo, Sreenivas Gollapudi, et al. Remi: A dataset for reasoning with multiple images. _arXiv preprint arXiv:2406.09175_, 2024.\n\n[22] Hugo Lauren\u00e7on, Lucile Saulnier, L\u00e9o Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. _Advances in Neural Information Processing Systems_, 36, 2024.\n\n[23] Hugo Lauren\u00e7on, L\u00e9o Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision- language models? _arXiv preprint arXiv:2405.02246_, 2024.\n\n[24] Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024.\n\n[25] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. _arXiv preprint arXiv:2306.05425_, 2023.\n\n[26] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In _International Conference on Machine Learning_, pages 12888-12900. PMLR, 2022.\n\n[27] Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Fine-tuning multimodal llms to follow zero-shot demonstrative instructions. In _The Twelfth International Conference on Learning Representations_, 2023.\n\n[28] Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Fine-tuning multimodal llms to follow zero-shot demonstrative instructions, 2024.\n\n[29] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. _arXiv preprint arXiv:2305.06355_, 2023.\n\n[30] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In _Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition_, pages 22195-22206, 2024.\n\n[31] Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. _arXiv preprint arXiv:2311.17043_, 2023.\n\n[32] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection, 2023.\n\n[33] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre- training for visual language models. In _Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition_, pages 26689-26699, 2024.\n\n[34] Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. _arXiv preprint arXiv:2311.07575_, 2023.\n\n[35] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.\n\n[36] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava- next: Improved reasoning, ocr, and world knowledge, January 2024.\n\n[37] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In _NeurIPS_, 2023.\n\n[38] Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, and Yi Yang. Vista-llama: Reliable video narrator via equal distance to visual tokens, 2023.\n\n[39] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models, 2024.\n\n[40] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In _Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)_, 2024.\n\n[41] Brandon McKenzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, et al. Mm1: Methods, analysis & insights\nfrom multimodal llm pre-training. _arXiv preprint arXiv:2403.09611_, 2024.\n\n[42] OpenAI. Gpt-4 technical report, 2023.\n\n[43] OpenAI. GPT-4V(ision) system card, 2023.\n\n[44] Maxime Oquab, Timoth\u00e9e Darce, Th\u00e9o Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. _arXiv preprint arXiv:2304.07193_, 2023.\n\n[45] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In _International Conference on Machine Learning_, pages 8748-8763. PMLR, 2021.\n\n[46] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip- filtered 400 million image-text pairs. _arXiv preprint arXiv:2111.02114_, 2021.\n\n[47] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In _Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)_, pages 2556-2565, 2018.\n\n[48] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In _Proceedings of the IEEE/CVF conference on computer vision and pattern recognition_, pages 10740- 10749, 2020.\n\n[49] Mustafa Shukor, Alexandre Rame, Corentin Dancette, and Matthieu Cord. Beyond task performance: Evaluating and reducing the flaws of large multimodal models with in-context learning. _arXiv preprint arXiv:2310.00647_, 2023.\n\n[50] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. _arXiv preprint arXiv:1811.00491_, 2018.\n\n[51] Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In _Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition_, pages 14398-14409, 2024.\n\n[52] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. _arXiv preprint arXiv:2302.13971_, 2023.\n\n[53] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine- tuned chat models. _arXiv preprint arXiv:2307.09288_, 2023.\n\n[54] Fei Wang, Xingyu Fu, James Y Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. Muirbench: A comprehensive benchmark for robust multi-image understanding. _arXiv preprint arXiv:2406.09411_, 2024.\n\n[55] Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos. _arXiv preprint arXiv:2405.09711_, 2024.\n\n[56] Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, et al. Q-bench: A benchmark for general-purpose foundation models on low-level vision. _arXiv preprint arXiv:2309.14181_, 2023.\n\n[57] Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In _Proceedings of the IEEE/CVF conference on computer vision and pattern recognition_, pages 9777-9786, 2021.\n\n[58] Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Pointllm: Empowering large language models to understand point clouds. _arXiv preprint arXiv:2308.16911_, 2023.\n\n[59] Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In _Proceedings of the AAAI Conference on Artificial Intelligence_, volume 33, pages 9127-9134, 2019.\n\n[60] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In _Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition_, pages 9556-9567, 2024.\n\n[61] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu\nJiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. _arXiv preprint arXiv:2311.16502_, 2023.\n\n[62] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In _Proceedings of the IEEE/CVF International Conference on Computer Vision_, pages 11975-11986, 2023.\n\n[63] Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, et al. Direct preference optimization of video large multimodal models from language model reward. _arXiv preprint arXiv:2404.01258_, 2024.\n\n[64] Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. LLaMA-adapter: Efficient fine-tuning of large language models with zero-initialized attention. In _The Twelfth International Conference on Learning Representations_, 2024.\n\n[65] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? _arXiv preprint arXiv:2403.14624_, 2024.\n\n[66] Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, et al. Mavis: Mathematical visual instruction tuning. _arXiv preprint arXiv:2407.08739_, 2024.\n\n[67] Yuanhan Zhang, Bo Li, haotian Liu, Yong jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, April 2024.\n\n[68] Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. _Advances in Neural Information Processing Systems_, 36, 2024.\nTable 13. Ablation to compare pooling and not pooling.\n\nTable 14. Ablation on the impact of video dpo on the performance of other tasks. QB: Q-Bench, Act: ActivityNet-QA, MVB: MVBench, VDD: Video Detailed Description, MME\\*: Throughout our paper, we convert MME\u2019s score to accuracy by summing up the perception and cognition scores and dividing 2800, SQA: Scienceqa-IMG.\n\n\\begin{tabular}{c c c c c c c c c c c c}\nTraining & Inference & #frames & # Image tokens & Act & Avg & VDD & \\multicolumn{5}{c}{VideoChatGPT} \\\\\n &  &  &  &  &  &  & CI & DO & CU & TU & CO \\\\\nPooling 1/4 & Pooling 1/4 & 40 & 40x729x1/4=10x729 & 52.8/3.53 & 3.35 & 3.38 & 3.46 & 3.25 & 3.87 & 2.59 & 3.57 \\\\\nPooling 1/4 & Pooling 1/4 & 64 & 64x729x1/4=16x729 & 52.7/3.53 & 3.33 & 3.38 & 3.45 & 3.23 & 3.86 & 2.49 & 3.55 \\\\\nNot Pooling & Not Pooling & 10 & 10x729 & 52.9/3.48 & 3.38 & 3.46 & 3.43 & 3.26 & 3.85 & 2.64 & 3.61 \\\\\nNot Pooling & Not Pooling & 16 & 16x729 & **54.4/3.51** & **3.41** & **3.46** & **3.48** & **3.28** & **3.87** & **2.74** & **3.62** \\\\\n\\end{tabular}\n\n\\begin{tabular}{c c c c c c c c c c c c}\n\\multirow{2}{*}{Setting} & \\multicolumn{4}{c}{Multi-image} & Multi-view & \\multicolumn{6}{c}{Single-image} \\\\\n & Mantis & BLINK & QB & NLVR2 & ScanQA & AI2D & ChartQA & DocVQA & MME\\* & POPE & SQA \\\\\nBefore Video-DPO & 62.7 & 52.7 & 73 & 89.1 & 32.2 & 73.9 & 67.2 & 75.7 & 63.5 & 85.4 & 72.6 \\\\\nAfter Video-DPO & 60.8 & 51.7 & 86.8 & 87.7 & 25.5 & 72.2 & 56.1 & 73.1 & 62.6 & 86.6 & 71.7 \\\\\n\\end{tabular}\n\n## A . Data Statistics\n\nThe detailed data statistics of M4-Instruct is shown in Table 15.\n\nThe detailed data statistics of LLaVA-Interleave Bench is shown in Table 16.\n\n## B . Ablation Study\n\n### B.1 . Pool vs not Pool Vision Tokens for video tasks.\n\nSimilar to LLaVA-NEXT-Video, we adopt a \"Pooling to 1/4\" strategy for which we pool the width and heights of feature maps to 1/2 therefore reducing the number to totals to 1/4. We study the impact of image token pooling. We train and infer our model under two settings: pooling to 1/4 and not pooling with ShareGPTVideo-Caption+QA(255K) data. Pooling to a 1/4 setting is similar to LLaVA-NEXT-Video, which uses the pooling technique to trade-off between the number of image tokens and the number of frames. In our experiment, we find that not pooling yields better performance under similar #image tokens. During training, we sample 10 frames for videos. In this table, we also observe that adding more frames (from 10 to 16) during inference improves performance.\n\n### B.2 . Impact of video DPO training on other tasks.\n\nIn Table 14, we compare the results of doing video DPO on other tasks. Though DPO significantly improves the video performance as shown in Table 2, it slightly impacts the performance of other tasks.\nTable 15. M4-Instruct detailed datasets.\n\n\\begin{tabular}{c c c c}\nTask & Dataset & Scenario & # Samples \\\\\n\\multicolumn{4}{c}{**Multi-image Scenarios**} \\\\\nSpot the Difference(42.6K) & Real-world Difference & Realistic & 6.7K \\\\\n & Synthetic Difference & Synthetic & 7.0K \\\\\n & Spot-the-Diff & Surveillance & 10.8K \\\\\n & Birds-to-Words & Birds & 14.2K \\\\\n & CLEVR-Change & Solids & 3.9K \\\\\nImage Edit Instruction(67.7K) & HQ-Edit & Synthetic & 50K \\\\\n & MagicBrush & Realistic & 14.2K \\\\\n & IEdit & Realistic & 3.5K \\\\\nVisual Story Telling(67.5K) & AESOP & Cartoon & 6.9K \\\\\n & FlintstonesSV & Cartoon & 22.3K \\\\\n & PororoSV & Cartoon & 12.3K \\\\\n & VIST & Realistic & 26K \\\\\nText-rich VQA(21.3K) & WebQA & Webpage & 9.3K \\\\\n & TQA & Textbook & 8.2K \\\\\n & OCR-VQA & OCR & 1.9K \\\\\n & DocVQA & Document & 1.9K \\\\\nMulti-image VQA(153.5K) & NLVR2 & Realistic & 86.4K \\\\\n & MIT-States\\_StateCoherence & General & 1.9K \\\\\n & MIT-States\\_PropertyCoherence & General & 1.9K \\\\\n & RecipeQA\\_ImageCoherence & Recipe & 8.7K \\\\\n & VISION & Industrial & 9.9K \\\\\n & Multi-VQA & General & 5K \\\\\n & IconQA & General & 34.6K \\\\\nLow-level Comparison(65.9K) & Coinstruct & Low-level & 50K \\\\\n & Dreamsim & Low-level & 15.9K \\\\\nImage-caption Comprehension (41.8K) & ImageCoDe & General & 16.6K \\\\\n & Contrast-Caption & General & 25.2K \\\\\nDaily Scenarios (5.7K) & MMChat\\_Twitter\\_Post & General & 5.7K \\\\\nMulti-image Puzzle (35K) & Raven & Abstract & 35K \\\\\n\\multicolumn{4}{c}{**Multi-frame (Video) Scenarios**} \\\\\nVideo QA(246.9K) & NExT-QA & General & 3.9K \\\\\n & STAR & General & 3K \\\\\n & ShareGPTVideo-VQA & General & 240K \\\\\nVideo Detailed Captioning (15K) & ShareGPTVideo-Caption & General & 15K \\\\\n\\multicolumn{4}{c}{**Multi-view (3D) Scenarios**} \\\\\nScene VQA(45.4K) & Nuscenes & Outdoor & 9.8K \\\\\n & ScanQA & Indoor Realistic & 25.6k \\\\\n & 3D-LLM-Scene & Indoor Realistic & 10K \\\\\nEmbodied VQA(62.5K) & ALFRED & Indoor Synthetic & 22.6K \\\\\n & 3D-LLM-Dialogue & Indoor Realistic & 20K \\\\\n & 3D-LLM-Planning & Indoor Realistic & 19.9K \\\\\n\\multicolumn{4}{c}{**Single-image Scenarios**} \\\\\nSingle-image Tasks(307K) & Randomly sampling 40% SFT data of LLaVA-NeXT & General & 307K \\\\\n\\end{tabular}\nTable 16. LLaVA-Interleave Bench detailed datasets.\n\n\\begin{tabular}{c c c c}\nTask & Dataset & Scenario & # Samples \\\\\n\\multicolumn{4}{c}{**In-domain Evaluation - Newly Curated Benchmarks**} \\\\\nSpot the Difference(0.3K) & Spot-the-Diff & Surveillance & 0.1K \\\\\n & Birds-to-Words & Birds & 0.1K \\\\\n & CLEVR-Change & Solids & 0.1K \\\\\nImage Edit Instruction(2K) & HQ-Edit & Sythentic & 1K \\\\\n & MagicBrush & Realistic & 0.9K \\\\\n & IEdit & Realistic & 0.1K \\\\\nVisual Story Telling(0.4K) & AESOP & Cartoon & 0.1K \\\\\n & FlintstonesSV & Cartoon & 0.1K \\\\\n & PororoSV & Cartoon & 0.1K \\\\\n & VIST & Realistic & 0.1K \\\\\nText-rich VQA(0.4K) & WebQA & Webpage & 0.1K \\\\\n & TQA & Textbook & 0.1K \\\\\n & OCR-VQA & OCR & 0.1K \\\\\n & DocVQA & Document & 0.1K \\\\\nMulti-image VQA(0.4K) & MIT-States\\_StateCoherence & General & 0.1K \\\\\n & MIT-States\\_PropertyCoherence & General & 0.1K \\\\\n & RecipeQA\\_ImageCoherence & Recipe & 0.1K \\\\\n & VISION & Industrial & 0.1K \\\\\nPuzzle (1.4K) & Raven & Abstract & 1.4K \\\\\n\\multicolumn{4}{c}{**In-domain Evaluation - Existing Benchmarks**} \\\\\nNLVR2 (7K) & NLVR2 & Realistic & 7K \\\\\nQ-Bench (1K) & Q-Bench & Low-level & 1K \\\\\n\\multicolumn{4}{c}{**Out-domain Evaluation - Newly Curated Benchmarks**} \\\\\nMathVerse-mv (0.8K) & MathVerse (Vision Dominant) & Math Diagram & 0.8K \\\\\nSciVerse-mv (0.4K) & SciVerse (Vision Dominant) & Scientific Diagram & 0.4K \\\\\n\\multicolumn{4}{c}{**Out-domain Evaluation - Existing Benchmarks**} \\\\\nMantis-Eval (0.2K) & Mantis-Eval & General & 0.2K \\\\\nBLINK (1.9K) & BLINK & General & 1.9k \\\\\nMMMU-mv (test) (0.8K) & MMMU & Scientific Diagram & 0.8K \\\\\n\\end{tabular}\n", "modalities": [{"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152802_r4de2ucp.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152802_mg2vqz1p.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152802_hpwgrq2k.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152802_b6klls2s.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152802_ejmfvr0l.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152802_9fcpl73z.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152802_hl06dlhk.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152803_4fqb97no.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152803_yz90uixk.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152803_lumj7pxn.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152803_06zpgd77.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152803_6li3_4el.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152803_vl9d_0x_.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152803_3xs9qfn3.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152804_0elcfmvy.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152804_i15bew73.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152804_c0t1owfo.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152804_5rh4zp46.png"}], "metadata": {"file_path": "examples/sample_data/pdf/llava-interleave.pdf", "processed_at": "2026-06-01T15:28:04.462872", "processor_type": "NemotronVLMProcessor", "paragraph_starts": [[0, 0, 0], [40, 0, 1], [132, 0, 2], [545, 0, 3], [777, 0, 4], [791, 0, 5], [1753, 0, 6], [2104, 0, 7], [2126, 0, 8], [2582, 1, 0], [2661, 1, 1], [3547, 1, 2], [4133, 1, 3], [4548, 1, 4], [4592, 1, 5], [4794, 1, 6], [5236, 1, 7], [5342, 1, 8], [5478, 1, 9], [5742, 1, 10], [5763, 1, 11], [5806, 1, 12], [7387, 1, 13], [7410, 1, 14], [8041, 2, 0], [8580, 2, 1], [8610, 2, 2], [10017, 2, 3], [10062, 2, 4], [10087, 2, 5], [10428, 2, 6], [10822, 2, 7], [11056, 2, 8], [11374, 2, 9], [11660, 2, 10], [11683, 2, 11], [11843, 2, 12], [12106, 3, 0], [12338, 3, 1], [12634, 3, 2], [13143, 3, 3], [13271, 3, 4], [13320, 3, 5], [13365, 4, 0], [13577, 4, 1], [13917, 4, 2], [14226, 4, 3], [14583, 4, 4], [14629, 4, 5], [15022, 4, 6], [15085, 4, 7], [15583, 4, 8], [15650, 4, 9], [16010, 4, 10], [16252, 4, 11], [16344, 4, 12], [16806, 4, 13], [16826, 4, 14], [17025, 4, 15], [17045, 4, 16], [17071, 4, 17], [17248, 4, 18], [17382, 4, 19], [17809, 4, 20], [18006, 4, 21], [18036, 4, 22], [18303, 5, 0], [18326, 5, 1], [18352, 5, 2], [18721, 5, 3], [18858, 5, 4], [18892, 5, 5], [19137, 5, 6], [20392, 5, 7], [21828, 5, 8], [22581, 5, 9], [23209, 5, 10], [23447, 5, 11], [23670, 5, 12], [23784, 5, 13], [23948, 6, 0], [24381, 6, 1], [24411, 6, 2], [24768, 6, 3], [24809, 6, 4], [25380, 6, 5], [25424, 6, 6], [25517, 6, 7], [25765, 6, 8], [25857, 6, 9], [26080, 6, 10], [26110, 6, 11], [26378, 6, 12], [26804, 6, 13], [26964, 6, 14], [27549, 6, 15], [28217, 6, 16], [28678, 6, 17], [28989, 6, 18], [29111, 6, 19], [29329, 7, 0], [29489, 7, 1], [29829, 7, 2], [29899, 7, 3], [29918, 7, 4], [31020, 7, 5], [31100, 8, 0], [31175, 8, 1], [31245, 8, 2], [31329, 8, 3], [32102, 8, 4], [32559, 8, 5], [33207, 9, 0], [33271, 10, 0], [33331, 11, 0], [33346, 11, 1], [33589, 11, 2], [33880, 11, 3], [34129, 11, 4], [34775, 11, 5], [35038, 11, 6], [35278, 11, 7], [35506, 11, 8], [35700, 11, 9], [35792, 11, 10], [36004, 11, 11], [36243, 11, 12], [36509, 11, 13], [36630, 11, 14], [36777, 11, 15], [37083, 11, 16], [37301, 11, 17], [37535, 11, 18], [37700, 11, 19], [38037, 11, 20], [38135, 12, 0], [38180, 12, 1], [38261, 12, 2], [38471, 12, 3], [38727, 12, 4], [39040, 12, 5], [39201, 12, 6], [39401, 12, 7], [39604, 12, 8], [39851, 12, 9], [40145, 12, 10], [40368, 12, 11], [40562, 12, 12], [40858, 12, 13], [41010, 12, 14], [41178, 12, 15], [41427, 12, 16], [41698, 12, 17], [41816, 12, 18], [41984, 12, 19], [42094, 12, 20], [42251, 12, 21], [42422, 12, 22], [42698, 12, 23], [42889, 13, 0], [42965, 13, 1], [43009, 13, 2], [43056, 13, 3], [43329, 13, 4], [43641, 13, 5], [43914, 13, 6], [44222, 13, 7], [44546, 13, 8], [44775, 13, 9], [44973, 13, 10], [45278, 13, 11], [45546, 13, 12], [45810, 13, 13], [46058, 13, 14], [46238, 13, 15], [46499, 13, 16], [46747, 13, 17], [46944, 13, 18], [47220, 13, 19], [47568, 13, 20], [47667, 14, 0], [47981, 14, 1], [48204, 14, 2], [48484, 14, 3], [48771, 14, 4], [49033, 14, 5], [49264, 14, 6], [49450, 14, 7], [49746, 15, 0], [49802, 15, 1], [50118, 15, 2], [50795, 15, 3], [51242, 15, 4], [51266, 15, 5], [51333, 15, 6], [51411, 15, 7], [51434, 15, 8], [51493, 15, 9], [52259, 15, 10], [52315, 15, 11], [52515, 16, 0], [52557, 16, 1], [54695, 17, 0], [54748, 17, 1], [56307, -1, -1]], "backend": "nemotron-vlm", "model": "nvidia/nemotron-nano-12b-v2-vl"}}
{"text": "# How to get your dynamically updating EPFL calendar link?\n\n1. Open **EPFLCampus** app and go to **Calendar**\n\n2. Click **sync icon**\n\n3. Copy your personal dynamic calendar link\n\n[Figure: EPFL Campus app screenshot with \"Mein Stundenplan\" highlighted]\n[Figure: Calendar view with scheduled events]\n[Figure: Calendar integration settings with \"Copy URL\" highlighted]\n# Setup your dynamically synced EPFL calendar to your favourite calendar app\n\n## For Google Calendar, from a computer:\n\n[Figure: Google Calendar settings page with annotations]\n", "modalities": [{"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_151928_wki8rp_d.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_151928_esccjzdo.png"}], "metadata": {"file_path": "examples/sample_data/pdf/calendar.pdf", "processed_at": "2026-06-01T15:28:04.462872", "processor_type": "NemotronVLMProcessor", "paragraph_starts": [[0, 0, 0], [60, 0, 1], [111, 0, 2], [135, 0, 3], [180, 0, 4], [367, 1, 0], [445, 1, 1], [487, 1, 2], [544, -1, -1]], "backend": "nemotron-vlm", "model": "nvidia/nemotron-nano-12b-v2-vl"}}
{"text": "Indian Journal of Gastroenterology (May\u2013June 2020) 39(3):220\u2013231\nhttps://doi.org/10.1007/s12664-020-01075-2\n\n# Corona Virus Disease-19 pandemic: The gastroenterologists\u2019 perspective\n\n**Jahnvi Dhar**\\(^1\\cdot\\) **Jayanta Samanta**\\(^1\\)**<|unk|>** \u2022 **Rakesh Kochhar**\\(^1\\)\n\nReceived: 21 May 2020 / Accepted: 8 July 2020 / Published online: 12 August 2020\n\n**Abstract** The world is witnessing a serious public health threat in the wake of the third corona virus pandemic, a novel corona virus (severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2]). The Corona Virus Disease-19 (COVID-19) is not limited to the respiratory system but has widespread involvement including the gastrointestinal (GI) tract and liver, with evidence of prolonged fecal shedding and feco-oral transmission. This finding has stirred up a hornet\u2019s nest of not only a newer modality of the spread of the virus but also a risk of the unpredictable duration of the infective potential of the shedders. We reviewed the literature on fecal shedding and possible implications on prevention and surveillance strategies. The pandemic is changing the management of underlying chronic diseases such as inflammatory bowel disease (IBD) and other diseases. Moreover, for the gastroenterologist, doing endoscopic procedures in this COVID-19 era poses a high risk of contamination, as it is an aerosol-generating procedure. There is a daily influx of data on this disease, and multiple societies are coming up with various recommendations. We provide a comprehensive review of all the reported GI manifestations of COVID-19 infection and the side effects of confounding drugs. We have summarized the management recommendations for diseases such as IBD with COVID-19 and nutritional recommendations and provided a concise review of the endoscopy guidelines by the various societies. This review provides a comprehensive account and a lucid guide covering various aspects of gastroenterology practice during this COVID-19 pandemic.\n\n**Keywords** Colon \\(\\cdot\\) COVID-19 \\(\\cdot\\) Endoscopy \\(\\cdot\\) Fecal \\(\\cdot\\) Gastrointestinal \\(\\cdot\\) Inflammatory bowel disease \\(\\cdot\\) Novel coronavirus \\(\\cdot\\) Pandemic \\(\\cdot\\) Severe acute respiratory syndrome \\(\\cdot\\) Viral pneumonia\n\n### Introduction\n\nSevere acute respiratory syndrome coronavirus 2 (SARS- CoV-2), a novel coronavirus, was first detected in the Wuhan City of China in December 2019. Since then, it has spread in the last 5 months to become a worldwide pandemic. Even as we write this article, the virus has already affected more than 6 million people and claimed 300,000 lives [1].\n\nIndia has already reported more than 200,000 cases and 6000 deaths. While the virus was initially thought to be a respiratory pathogen, the extra-pulmonary effects of the virus and the mode of transmission gained limelight when the first diagnosed case of Corona Virus Disease-19 (COVID-19) from the USA had gastrointestinal (GI) complaints [2]. Since then, more and more studies have looked into the effects of COVID-19 on the GI tract and liver with increasing reports of higher frequency than initially reported. We provide a concise review of the relevant published articles and have summarized the overall incidence rates reported to date for the various GI manifestations. The liver manifestations are beyond the scope of this article. The possibility of feco-oral transmission has been suggested with fecal sample positivity for SARS-CoV-2 strain. This not only highlights a newer modality of the spread of the virus but also makes it difficult to assess the duration for which a viral shedder is infective. Thus, it has important implications for the prevention and surveillance strategies that need to be adopted. COVID-19 can also adversely affect the already existing GI diseases,\n\n<|unk|> Jayanta Samanta\n\ndj\\_samanta@yahoo.co.in\n\n<sup>1</sup> Department of Gastroenterology, Post Graduate Institute of Medical Education and Research, Chandigarh, Sector \u2013 12, Chandigarh 160 012, India\n\nSpringer\nIndian J Gastroenterol (May-June 2020) 39(3):220-231 221\n\nand hence, the optimum management strategies need to be reviewed. Various endoscopic societies have clearly highlighted the increased risk of endoscopic procedures during this pandemic and thus the need for extra precautions for the practicing gastroenterologists. Thus, in this review, we give a comprehensive overview of various aspects of COVID-19 for the practicing gastroenterologist in a user-friendly way for daily practice and the precautions to be taken.\n\n## Information sources and literature search\n\nA review of literature was carried out through PubMed, Medline, and Google Scholar search engine for all relevant English-language articles/abstracts using a search query constructed with the following medical subject heading (MeSH) terms: (\u201csevere acute respiratory syndrome coronavirus 2\u201d OR \u201cCOVID-19\u201d OR \u201ccoronavirus 2019\u201d OR \u201cSARS-CoV- 2\u201d) AND (\u201cGastroenterology\u201d OR \u201cGastrointestinal\u201d OR \u201csigns and symptoms, digestive\u201d OR \u201cGI\u201d OR \u201cdiarrhea\u201d OR \u201cgastr*\u201d OR \u201cInflammatory bowel disease\u201d OR \u201cIBD\u201d OR \u201cendoscopy\u201d OR \u201cColon*\u201d OR \u201cfecal\u201d OR \u201cStool\u201d). The reference list of the papers, webpages of major gastroenterology and hepatology journals, and websites of the World Health Organization (WHO) and Center for Disease Control (CDC) publications were reviewed and all relevant data extracted. Additional published and unpublished studies from other platforms such as medRxiv and Social Science Research Network (SSRN) were also searched. For the recommendations, all the latest guidelines of the major gastrointestinal and endoscopy societies were reviewed (including online suggestions and pre-proof) by the authors and all framed into a narrative review.\n\n## Corona virus pathogen\n\nSARS-CoV-2 is a zoonotic, single-stranded, positive-sense, enveloped, ribonucleic acid (RNA) beta-coronavirus. This family also includes the severe acute respiratory syndrome-related coronavirus (SARS-CoV) and Middle East respiratory syndrome-related coronavirus (MERS-CoV), which bear a phylogenetic similarity of 79% and 50%, respectively with SARS-CoV-2 strain [3]. Electron micrograph images reveal its diameter to be 60-140 nm and the appearance of a solar corona (crown-like) due to the presence of spikes (diameter 9-12 nm).\n\nThe world has already witnessed the pandemics of former 2 strains, in 2002-2003 (SARS) and in 2012 (MERS). The third coronavirus pandemic started as pneumonia of unknown etiology, which was designated as SARS-CoV-2 by the International Committee on Taxonomy of Viruses and later rechristened as COVID-19 by the WHO [4].\n\nThe virus has been thought to have a zoonotic origin, due to phylogenetic similarity to horseshoe bats. It has an easy\n\nperson-to-person transmission, even in the asymptomatic phase of the disease. Furthermore, community transmission of the virus has been proven across all affected continents.\n\nA large epidemiological study from China of 72,314 cases showed an overall case fatality rate of 2.3%, with 3.8% affected being health care workers (HCWs) [5]. A recent meta-analysis estimated mortality at 2.0% to 4.4% [6]. COVID-19 has already exceeded the morbidity and mortality of the previous coronavirus outbreaks (SARS, 774 deaths in 2002- 2003, and MERS, 848 deaths in 2012) and at this rate, it might be as catastrophic as the Spanish flu of 1918 [7, 8].\n\n## Classical symptoms\n\nThe 3 main members of the coronavirus family (SARS-CoV, MERS-CoV, and SARS-CoV-2) are categorized mainly as respiratory pathogens following a principle of human-to- human transmission by droplet, aerosols, and contact route. The main symptomatology of COVID-19 pertains to respiratory system with patients presenting predominantly with fever, cough, sore throat, and shortness of breath, and acute respiratory distress syndrome (ARDS) in patients with severe disease [4].\n\nThe pathogenesis encompasses attachment of the virus to angiotensin-converting enzyme type 2 (ACE2) receptor on the lung type 2 alveolar cells (AT2). Once the spike protein (S) attaches to the alveolar cells, it leads to a cytokine storm leading to alveolar flooding and denudation of the lining epithelium, hampering oxygen exchange and manifesting clinically as ARDS [9].\n\nIn the first study from Wuhan, China, out of 41 cases, 27 (66%) had positive contact history. Common symptoms were fever (98%), cough (76%), shortness of breath (55%), and fatigue (44%), and 28% had sputum production and only one case had diarrhea [4]. Hence, most of the studies published initially discussed primarily the respiratory complaints and the diagnosis was based on testing of oro-pharyngeal swabs by reverse transcriptase-polymerase chain reaction (RT-PCR) for SARS-CoV-2. Eventually, GI symptoms gained precedence and more and more reports hinted at a larger proportion of patients having GI involvement.\n\n## GI manifestations: pathogenesis and reported literature\n\nThe first confirmed case detected in the USA had presented with cough, nausea, and vomiting followed 2 days later by diarrhea and the fecal specimen was positive on the 7th day [2]. This led to heightened attention towards GI tract being a potential route of virus spread. Previous studies had revealed that 10.6% of SARS and 30% of MERS cases had diarrhea\n\nSpringer\n222 Indian J Gastroenterol (May-June 2020) 39(3):220-231\n\n[10]. In the early stage of the pandemic, only 3% to 3.8% cases of diarrhea and 5% cases of nausea and/or vomiting were reported [4, 11], but later studies cited as high as 79% of cases having digestive symptoms [12]. In fact, studies reported from countries outside China reported a higher prevalence of GI symptoms as compared with those from China [12]. Still more fascinating are the cases presenting with only GI involvement without respiratory findings.\n\nThe various GI manifestations reported currently include diarrhea, loss of appetite, nausea, vomiting, and/or pain in the abdomen. Loss of taste and smell and even GI bleeding either at the time of admission or during the hospital stay have also been reported. A summary of the various GI manifestations have been outlined in Appendix Panel 1.\n\n## Pathophysiology\n\nThe multisystem involvement of the virus, including the GI tract, can be explained by looking at its pathophysiology. The entry of the SARS-CoV-2 is mediated by the interaction of the viral spike protein and the host ACE2 cell receptor (a regulator of intestinal inflammation), which combines with host cellular transmembrane serine protease 2 (TMPRSS2) [13]. SARS- Co-V-2 has greater efficiency to bind to receptor S protein, responsible for virus invasion [14]. ACE2 and TMPRSS2 were found not only in the lung AT2 cells but also in abundance in esophageal epithelial cells and absorptive enterocytes of the ileum and colon [15]. Liang et al. [16] demonstrated a high ACE2 expression in the proximal and distal enterocytes. Moreover, ACE2 is associated with neutral amino acid transporter, B0AT1, and has a role in the regulation of intestinal microflora [17]. Hence, the interaction of SARS-CoV-2 with ACE2 may lead to dysbiosis and inflammation. Endoscopy performed in cases with diarrhea showed evidence of virus expression in the esophagus, stomach, duodenum, and rectum, proving that the virus can flourish throughout the GI tract [18]. The glycosylation of S protein, the evolution of intrinsic resistance, or the formation of tight complexes with mucins could possibly explain the ability of the virus to sustain the adverse milieu of low pH of the stomach and the bile salts [19]. Thus, the augmented invasive property of the virus along with the abundance of its attaching receptors along the GI tract can explain its GI manifestations. To date, only a single autopsy report detailed the GI findings of an 85-year-old male patient with COVID-19, showing segmental dilation and stenosis of the small intestine [20]. Whether this finding is specific for COVID-19 is only speculative and more data is warranted.\n\nThe mechanisms postulated for diarrhea in SARS-CoV-2 infection are [13, 21-23] (i) direct virus entry through the ACE2 receptor (leading to malabsorption, unbalanced intestinal secretion, and activated enteric nervous system); (ii) direct/ indirect damage to the intestinal epithelium by an\n\ninflammatory response (probably related to cytokine storm; interleukin-6); (iii) antibiotic and/or antiviral drugs induced intestinal dysbiosis leading to diarrhea or exacerbation of the underlying condition; (iv) the virus itself causing disorders of the intestinal flora (which needs further evaluation by luminal flora and stool flora specimen analysis and comparison); and (v) the disturbance of the \u201cgut-lung axis\u201d wherein respiratory flora adversely affects the digestive system by immune regulation, a possible explanation for COVID-19 pneumonia cases having diarrhea.\n\n## Reported literature\n\nIn a systematic review published by Tian et al. [24], anorexia was the most common symptom in adults (39.9% to 50.2%), followed by diarrhea (both in adults and children) (2% to 49.5%) while vomiting was more common in children (6.5% to 66.7%) and abdominal pain more predominant in sick patients. Anorexia, though much more commonly reported than other digestive symptoms, is difficult to assess objectively. Hence, the most common definitive GI symptom would be diarrhea (1% to 36%) [25].\n\nDiarrhea might be the first indication of an underlying SARS-CoV-2 infection. COVID-19 presenting as diarrhea at onset was first reported in a patient from China [26] and later reported from other countries as well [27, 28]. Fang et al. [29] reported 22.2% of cases presenting with diarrhea before the clinical diagnosis of COVID-19, while Wang et al. [30] reported 14 patients (of 138 cases) to have diarrhea and nausea 1-2 days before the onset of fever or dyspnoea. The pooled estimate of GI symptoms as presenting manifestations has been found to be 9.3% to 10% [12, 31]. It usually occurs around 1-8 days from the onset of symptoms and is usually self-limiting. The frequency has been reported to be 3.3\\(\\pm\\)1.6/ day (range 2-10) with a mean duration of 4.1\\(\\pm\\)2.5 days (range 1-14 days) [29]. Interestingly, patients who presented with GI symptoms had prolonged interval of symptom-onset to hospital admission [31].\n\nWhether GI symptoms have any relationship with the severity of the disease is not fully established. A few studies have reported no significant difference in GI symptoms between severe and mild cases [11, 29]. On the contrary, Wang et al. [30] reported more anorexia and abdominal pain in intensive care unit (ICU) patients (anorexia 66.7% vs. 30.4%; abdominal pain 8.3% vs. 0%) while Jin et al. [32] showed higher GI manifestations in critical cases (22.97% vs. 8.14%). These and others raise the possibility of GI symptoms being associated with more severe disease [31]. A recent meta-analysis pointed out that hospitalized patients had a higher prevalence of diarrhea compared with the outpatients (10.4% vs. 4%) [12]. Interestingly, pooled analysis reported abdominal pain to be associated with increased COVID-19\n\nSpringer\nIndian J Gastroenterol (May-June 2020) 39(3):220-231 223\n\nseverity, not the presence of diarrhea, loss of appetite, or nausea/vomiting [31, 33].\n\nIn a study of 74 cases of COVID-19 with GI symptoms, Jin et al. [32] highlighted that 28% had no respiratory symptoms. Similarly, Ping et al. [34] and Pan et al. [35] have reported 9 and 7 cases with GI complaints without respiratory symptoms, respectively. In a larger study by Luo et al. [36], 16% (183 of 1141) of patients presented with only GI symptoms. This cohort had more of loss of appetite (15.8%) and nausea/vomiting (11.7%) and less of diarrhea (6%) and abdominal pain (3.9%). These cases emphasize that pure GI form of COVID-19 exists, though less common, and can confuse and confound the clinical scenario. This pure GI form poses a red flag sign for the gastroenterologists and warrants better self- protection from unsuspected cases in this COVID-19 era.\n\nChildren have been reported to have milder COVID-19 infection with similar rates of diarrhea (9.6% to 15%) but possible higher rates of vomiting (6.5% to 66.7%) [24, 31, 37]. Supplementary Table 1 highlights all the relevant,\n\nselected studies in both children and adults, based on the manifestations of COVID-19 on the GI tract.\n\nDisturbances in the olfactory and gustatory function have of late been reported to be associated with SARS-CoV-2 infection. Since the initial description of this finding by Mao et al. [38], multiple reports of new-onset olfactory and gustatory dysfunction in association with other COVID-19 have surfaced. A recent meta-analysis reported a prevalence of 43.9% for gustatory dysfunction and 86.6% (using validated instruments) for olfactory dysfunction [39]. Recently, CDC has added \u201cnew loss of taste or smell\u201d to its list of symptoms for COVID-19, which may appear 2\u201314 days after the exposure [40].\n\nA host of drugs have been tried for the COVID-19 treatment including antivirals, anti-malarial, and various monoclonal antibodies. While their efficacy is yet to be established, it is crucial that gastroenterologist be acquainted with the adverse effects of these medications on the GI tract and liver and not equate everything to direct manifestations of SARS- CoV-2 infection [41] (Table 1).\n\nSpringer\n\n\\begin{tabular}{c c c c}\nCOVID-19 treatments & Dose for COVID-19 infection & Dosing route & Gastrointestinal-/liver-related side effects \\\\\nRemdesivir & 200 mg single dose followed by 100 mg OD for 10 days & Intravenous & Nausea, vomiting, deranged liver enzymes \\\\\nHydroxychloroquine & 1200/800 mg loading dose on day 1 followed by 400 mg daily (prophylaxis: 400 mg once weekly for 8 weeks) & Oral & Nausea, vomiting, weight loss, abdominal pain \\\\\nChloroquine & 500 mg BD & Oral & Increased liver enzymes, anorexia, nausea, vomiting, diarrhea, abdominal cramps \\\\\nAzithromycin & 500 mg OD & Oral & Diarrhea, nausea/vomiting, pain abdomen \\\\\nTocilizumab & 8 mg/kg IV once (can combine with steroids); max 3 doses & Intravenous & Elevated liver enzymes, bowel perforation, pancreatitis, abdominal pain, reactivation of chronic hepatitis B \\\\\nLopinavir/ritonavir & 400 mg/100 mg BD & Oral & Nausea and vomiting (5\u201310%); abdominal pain (1\u201310%); diarrhea (10\u201330%); dysgeusia (< 2%); increased serum amylase/lipase. Deranged liver enzymes; in few cases, jaundice reported in HIV-infected people \\\\\nFavipiravir & 1000\u20131600 mg on the first day, followed by 400\u2013800 mg BD for 4\u201313 days (being tried in clinical trials) & Oral & Nausea/vomiting (5\u201315%); diarrhea (5%) \\\\\nIvermectin & 200 mcg/kg of body weight taken as one dose & Oral & Nausea, vomiting, diarrhea (very few reports on elevated liver enzymes or jaundice: uncommon) \\\\\nSarilumab & NA & Intravenous in COVID-19 trials & Increased ALT; few cases of gastrointestinal perforation \\\\\nBaricitinib & 2 mg OD & Oral & Bowel perforation, hepatitis B reactivation, nausea, vomiting \\\\\n\\end{tabular}\n\nTable 1 Drugs used in Corona Virus Disease-19 (COVID-19), dosing, and their gastrointestinal-/liver-related side effects\n\n_COVID-19_ corona virus disease-19, _GI_ gastrointestinal, _OD_ once daily, _BD_ twice daily, _IV_ intravenous, _NA_ not applicable, _HIV_ human immunodeficiency virus, _ALT_ alanine aminotransferase\n224 Indian J Gastroenterol (May-June 2020) 39(3):220-231\n\n**Fecal shedding: objective evidence of GI involvement**\n\nAll the suspected cases of COVID-19 are tested by nucleic acid amplification tests (NAAT) on the samples from the upper/lower respiratory tract. These samples are monitored for clearance of the virus during the resolution of the disease. Feco-oral route of transmission has earlier been described in SARS-CoV and MERS-CoV during the course of the illness. A study by Corman et al. [42] verified MERS-CoV RNA in 14.6% of stool specimens. SARS-CoV-2, belonging to the same family, follows suit.\n\nThe first case reported from the USA tested positive for SARS-CoV-2 had fecal positivity on day 7 of illness [2]. Chen et al. [43] conclusively showed that 28 (66.67%) of 42 laboratory-confirmed COVID-19 patients tested positive for SARS-CoV-2 RNA in stool specimens. The positivity rate is usually neither associated with the presence of GI symptoms nor the severity of illness, though Cheung et al. [25] did report high stool RNA in those presenting with diarrhea (38.5% vs. 8.7%) (Table 2).\n\nThe fecal test becomes positive around 2-5 days after oro- pharyngeal swab positivity, and the positivity lasts for 1- 16 days. It can stay positive for a period longer than the respiratory samples (28 days vs. 17 days), maximum reported up to 5 weeks [44, 46]. In a recent meta-analysis, 48.1% had positive stool viral RNA during the illness [25]. Intriguingly 70.3% (range 23.3% to 88.3%) of patients had persistent fecal positivity, for a mean period of 11 (9-16) days, even after the respiratory samples have become negative [22, 25, 45]. Additionally, only half of them had diarrhea [45]. Another systematic review showed that as high as 62.8% (125/199) of the positive fecal viral RNA cases had persistence of the virus in the stool after the oro-pharyngeal swab had turned negative [47].\n\nWorldwide, the decision to discharge the patient from the hospital is based on negative RT-PCR test result from at least two sequential respiratory tract specimens collected at an interval of \\(\\ge\\)24 h [24]. However, the longest fecal shedding (after negative respiratory viral RNA) is around 33 days, making discharging without fecal testing a tricky proposition [46]. Thus, fecal sampling can be used as an adjunct for initial diagnosis, in case of negative respiratory sample and high clinical suspicion. Moreover, it might be advisable to at least test for fecal/anal samples for RT-PCR at the time of discharge and adopt better control measures during the convalescent phase.\n\nThis provides hard evidence for the fact that the GI tract acts as a potent source of viral shedding. Its doubtful correlation with clinical symptomatology makes it difficult to ascertain the time when a case can be labelled non-infective during the convalescent phase. Moreover, it has been demonstrated that the virus is viable for 3 h in aerosol form and 2-3 days on plastic and stainless-steel surfaces [48]. Samples collected from the surface of toilet bowl, sink bowl, and door handle\n\nSpringer\n\n\\begin{tabular}{c c c c c c c c c}\nPlace & Stool remained positive for (duration) & Total patients & Respiratory samples tested positive by RT-PCR & Positive fecal samples by RT-PCR & Positive fecal sample; nut negative respiratory samples by RT-PCR & Duration between negative respiratory and fecal samples & Number of fecal samples tested by RT-PCR & \\\\\nXiao et al. [22] & Guangdong Province, China & 73 & .. & 39 (53.4%) & 17 (23.3%) & .. & 73 & \\\\\nZhang et al. [44] & Jinhua, China & 14 & 14 & 5 (35.7%) & .. & .. & 14 & \\\\\nLing et al. [45] & Shanghai, China & 66 & 66 & 66 & 55 (88.33%) & 11 (9-16) days & 66 & \\\\\nWu et al. [46] & Guangdong Province, China & 98 & 74 (76%) & 41 (55%) & ... & 33 days: longest duration mentioned & 74 (76%) & \\\\\nYoung et al. [27] & Singapore & 18 & 18 (100%) & 50% (4 out of 8 cases tested) & None & .. & 8 & \\\\\nChen et al. [43] & Wuhan, China & 42 & 42 (100%) & 28 (66.67%) & 18 (64.29%) patients & 7 (6-10) days & 42 & \\\\\nCheung et al. [25] & Hong Kong & 15 (out of 59) & .. & 9 (15.3%) & 70.3% (meta-analysis) & .. & 59 & \\\\\n\\end{tabular}\n\n**Table 2** Major studies on fecal reverse transcriptase polymerase chain reaction (RT-PCR) test in patients with Severe Acute Respiratory Syndrome Corona Virus-2 (SARS-CoV-2) infection\n\n_SARS-CoV-2_ severe acute respiratory syndrome coronavirus 2, _RT-PCR_ reverse transcriptase-polymerase chain reaction\nIndian J Gastroenterol (May\u2013June 2020) 39(3):220\u2013231 225\n\nof the washroom used by a fecal positive patient were found to be positive for SARS-CoV-2 before disinfection, emphasizing the importance of hygiene maintenance [49]. Interestingly, in a proof of concept study, SARS-CoV-2 RNA was isolated from a waste-water catchment area in Australia, further highlighting the implication of this route of transmission [50]. Prolonged fecal shedding along with extended in vitro survival and toilet fume generation makes the virus a potent agent for efficient surface transmission using the feco-oral route.\n\nOf late, CDC states that replication-competent virus could not be cultured from respiratory swab beyond 9 days of onset of illness and the infective virus could not be reliably cultured from the feces. Thus, the fecal route might contribute very little to the overall risk of transmission [51].\n\nRecently, guidelines have been published to test stool donors for fecal microbiota transplantation (FMT) who have typical COVID-19 symptoms in the previous 30 days and those having travel history to COVID-19 prone regions or contact with COVID-19 suspects/proven cases in the preceding 30 days [52].\n\n## COVID-19 concomitant with other GI conditions\n\nBesides the GI effects of SARS-CoV-2, the infection can itself possibly aggravate pre-existing GI diseases. Pre-existing conditions might flare up like inflammatory bowel disease (IBD) or the baseline immunosuppression in these sub-groups of individuals might lead them to have severe COVID-19.\n\n## COVID-19 and IBD\n\nAlthough data is lacking, IBD patients are presumed to have increased susceptibility to COVID-19. The high-risk population in IBD cohorts include elderly patients, smokers, those on prolonged high-dose steroids (>20 mg/day), pregnant women, children, those with underlying comorbidities, and those having active disease. IBD patients, however, do not seem to have an increased risk of acquiring SARS-CoV-2 infection [53]. Real- world data on the actual outcome of SARS-CoV-2 infection in IBD patients are scanty [54]. An international registry (Surveillance Epidemiology of Coronavirus Under Research Exclusion [SECURE-IBD]) has been formulated to determine the outcome of these patients [55]. Of late, the International Organization of IBD (IOIBD) has opined that use of ustekinumab and vedolizumab does not increase the infection risk, but thiopurines, anti-tumor necrosis factor (anti- TNF) agents, and Janus kinase-2 inhibitors have a debatable risk [56]. Suspension of regular outpatient services during this pandemic has reportedly resulted in an interruption in medications and sometimes disease\n\nexacerbation. Probably, \u201ctele-medicine\u201d as being advocated recently can help in unravelling this complex socio-medical issue. A detailed review of the interactions of COVID-19 and IBD will be dealt with in a separate article in this issue [57].\n\nThe various society guidelines (Asian-Pacific Association of Gastroenterology IBD group [58], European Society for Pediatric Gastroenterology Hepatology and Nutrition [59], European Crohn\u2019s and Colitis Organization [60], British Society of Gastroenterology [61], Crohn\u2019s and Colitis Foundation [62], American Gastroenterological Association [63], World Endoscopy Organization [64]) have been enlisted in Supplementary Table 2 and a summary of the recommendations for IBD management during the COVID-19 pandemic have been outlined in Appendix Panel 2.\n\n## Effects of COVID-19 on the pancreas\n\nIn a study of 52 cases of SARS-CoV-2 pneumonia, 17% had a pancreatic injury (defined by elevated amylase >90 U/L or lipase >70 U/L). Those with pancreatic injury had a higher incidence of anorexia and diarrhea and more severe illness on admission. Three possible explanations include ACE2 receptors on pancreatic islets causing acute diabetes, cytokine storm, and drug-induced pancreatic injury [65]. In a case series, 2 out of 3 family members were diagnosed to have severe acute pancreatitis related to COVID-19 [66].\n\n## Nutrition therapy for COVID-19 patients\n\nThe strategy for nutritional therapy to be adopted for COVID- 19 patients has been outlined in the joint guidance statements by the European Society for Clinical Nutrition and Metabolism (ESPEN) [67]. This has been summarized in (Appendix Panel 3).\n\n## Endoscopic practices during COVID-19 pandemic: guidelines and summary of the recommendations for practicing gastroenterologists\n\nLast but not the least, from a gastroenterologist\u2019s perspective, the practice of endoscopy to be followed in the wake of the COVID-19 pandemic is of utmost importance. In one of the earliest reports from Wuhan, around 29% positive cases (40 out of 138 cases) were healthcare workers [30], suggesting an increased risk of transmission, and hence greater importance of learning the art of donning and doffing the personal protective equipment (PPE). The possible routes of transmission during an endoscopic procedure can be person to person, droplet mode, aerosols generated in the positive pressure\n\nSpringer\n226 Indian J Gastroenterol (May-June 2020) 39(3):220-231\n\nroom, and contact with contaminated bodily fluids and fecal matter. The risk of transmission enhances manifold due to multitude of factors: (i) fomites transferred from the patient\u2019s respiratory secretions into the endoscopy room, more so when the viral load is high; (ii) endoscopic procedures are high aerosol-generating procedures, because of coughing and retching during upper GI endoscopy and passage of flatus during colonoscopy; (iii) suctioning and exchange of accessories during endoscopy pose a further risk by splashing and spreading of infective material; (iv) biopsy specimens too\n\nare infective as pointed out earlier. Thus, various societies (Asian-Pacific Society for Digestive Endoscopy [68], American Gastroenterological Association [69], European Society of Gastrointestinal Endoscopy [70], World Endoscopy Organization [71], American Society for Gastrointestinal Endoscopy [72], British Society of Gastroenterology [73], Indian Society of Gastroenterology/ Society of Gastrointestinal Endoscopy of India/Indian National Association for the Study of Liver joint guidance [74]) have come up with guidance statements to be strictly\n\nSpringer\n\n[Figure: Flowchart summarizing recommendations for endoscopy during the COVID-19 pandemic. _GI_ gastrointestinal, _EHBO_ extrahepatic biliary obstruction, _RT_ Ryle\u2019s tube, _NJ_ naso-jejunal tube, _PEG_ percutaneous endoscopic gastrostomy, _FTOCC_ fever, travel, occupation, clustering, contact, _COVID-19_ corona virus disease-19, _PPE_ personal protective equipment, _ASGE_ American Society for Gastrointestinal Endoscopy]\nIndian J Gastroenterol (May\u2013June 2020) 39(3):220\u2013231 227\n\nfollowed by practicing gastroenterologists (Supplementary Table 3). Based on these statements, a flowchart has been proposed in Fig. 1, and recommendations have been summarized in Appendix Panel 4.\n\n## Management of GI-related symptoms due to COVID-19\n\nThe management of GI symptoms necessitates the same steps to be followed like for any other disease. The presence of diarrhea could be due to the virus itself, drug-related, or dysbiosis in the GI tract. Proper hydration (especially the use of oral rehydration solution) to maintain electrolyte balance is essential. Sometimes, loperamide or other anti-diarrheal agents can be recommended to tide over the situation. Probiotics (to treat dysbiosis) and antispasmodics for treating abdominal pain can be added. Various digestive symptoms have diverse etiologies which should be looked into if symptomatic management does not tide over the situation.\n\nFor any new-onset GI symptoms such as diarrhea, patients should be evaluated for (i) risk of contact exposure, (ii) the detailed history of COVID-19-related symptoms, (iii) history for other GI symptoms, and (iv) in cases of high prevalence setting, monitoring of the cases for later development of respiratory symptoms. For patients undergoing drug therapy in hospitalized cases, evaluation for drug-related side effects is to be monitored [12].\n\n## Prevention of feco-oral transmission\n\nGI involvement with fecal shedding provides hard evidence to this route being a potential health hazard. It is important to prevent it to curb its spread. All over the world, especially in developing countries, and more importantly as gastroenterologists, the need of the hour is to encompass the conventional practice of hand hygiene taught to us for ages. Everyone must avoid the conventional five \u201cF\u201d factors responsible for feco- oral transmission: fingers, flies, fluids, food, and fields. Community education programs, if possible by tele-communication, to raise awareness about safe food practices, hand hygiene, and forbidding open defecation are cornerstones in preventing the spread of the disease and need to be meticulously implemented.\n\n## Hiatus in the available literature\n\nThe data available until now is not only inadequate but also rigged with bias. Duration of GI symptoms, isolated or along with other manifestations, and follow-up have not been systematically evaluated. A significant publication bias was\n\nnoted for GI symptoms [31]. Whether GI symptoms are due to virus replication or because of mucosal immunity interactions is not clear. For cases presenting with GI symptoms initially, whether the GI tract is the first site of infection is not known. The true impact of this infection on pre-existing diseases such as IBD is still in its nascent stage. While early data pointed towards no increased risk, more data over time will spell out the real scenario. The impact of fecal shedding on the transmission dynamics is unclear. Whether the virus isolated in the fecal samples can be adequately cultured is not known, and if yes, whether that factor is sufficient enough to render the stool infective enough for transmission or not needs to be established. Moreover, the implication of virus isolation in sewage water on community transmission has to be explored. These and a host of other questions have to be answered before we can confidently embark on future management recommendations. Finally, peer review in the publication process is essential in maintaining the quality of publications and has to be ensured even for COVID-19 data as well.\n\n## Conclusion\n\nGI manifestations are not uncommon in patients with COVID- 19, and more intriguing is the presence of a sub-group of these cases presenting only with GI symptoms. Fecal shedding of the virus objectively establishes GI tract involvement but also underscores its implications on formulating preventive strategies. Future studies and data are needed to define the role of fecal testing for initial diagnosis or during discharge. The data and hence the recommendations for optimum management of difficult situations such as IBD with COVID-19 are still evolving. For the practicing gastroenterologists, not only patient management but also personal safety is of prime importance. Resorting to \u201ctele-medicine\u201d facilities for patient management, restricting unnecessary procedures, and following strict protective strategies are key to help sail through these difficult times. During this era of COVID-19 pandemic, as more and more data keep pouring in every day, we need to unlearn many old habits and learn a few new ones to protect ourselves and our patients and tread the path more carefully.\n\n**Author contributions** All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Jahnvi Dhar and Jayanta Samanta. The first draft of the manuscript was written by Jahnvi Dhar and Jayanta Samanta, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript. Jahnvi Dhar and Jayanta Samanta contributed equally to the work.\n\n## Compliance with ethical standards\n\n**Conflict of interests** JD, JS, and RK declare that they have no conflict of interest.\n\nSpringer\n228 Indian J Gastroenterol (May-June 2020) 39(3):220-231\n\n## Disclaimer\n\nThe authors are solely responsible for the data and the contents of the paper. In no way, the Honorary Editor-in-Chief, Editorial Board Members, or the printer/publishers are responsible for the results/ findings and content of this article.\n\n## Appendix\n\n### Panel 1 Overall reported incidence rates of gastrointestinal manifestations\n\n\u2022 Cumulative digestive symptoms: 2%-57%\n\n\u2022 Diarrhea: 1%-36.6%\n\n\u2022 Nausea: 1%-22%\n\n\u2022 Vomiting: 3.6%-15.9%\n\n\u2022 Abdominal pain: 1.3%-9%\n\n\u2022 Loss of appetite: 1%-79%\n\n\u2022 Gastrointestinal bleeding: 4%-13.7%\n\n\u2022 Loss of taste (gustatory dysfunction): 5.6%-92.6%\n\n\u2022 Loss of smell (olfactory dysfunction): 5.1%-98.3%\n\n\u2022 Stool ribonucleic acid positivity: 36%-53%\n\n### Panel 2 Key recommendations for managing inflammatory bowel disease patients during the COVID-19 pandemic\n\n\u2022 High-risk population in the IBD cohort\n\n\u2022 Elderly >65 years\n\n\u2022 Those with underlying comorbidities: hypertension, diabetes mellitus, chronic liver diseases\n\n\u2022 Pregnancy with IBD\n\n\u2022 Those not in clinical and/or endoscopic remission: especially with moderate/active disease\n\n\u2022 Those on immunosuppressive medications: especially on prolonged, high-dose steroids >20 mg/day, followed by thiopurines, biological agents, JAK inhibitors\n\n\u2022 Summary of the recommendations by various society guidelines regarding use of medications in IBD patients\n\n\u2022 Protective measures to be followed: social distancing, facial masks, hand hygiene\n\n\u2022 Emphasis on tele-medicine services to overcome decreased hospital visits\n\n\u2022 Can safely continue 5-ASAs (amino salicylates) in both presumed and active COVID-19\n\n\u2022 To continue steroids but subsequently taper to the lowest effective dose during active infection; budesonide can be an alternative\n\n\u2022 To continue with thiopurines, biological agents, JAK inhibitors, but to stop all during active COVID-19\n\n\u2022 For the use of biological agents: prefer monotherapy; and no switching of class; can receive infusions in a facility having SARS-CoV-2 testing protocol; If infliximab infusion not possible, consider switching to adalimumab (subcutaneous injection) at home (only during the period of the pandemic)\n\n### Panel 2 (continued)\n\n\u2022 Vedolizumab and ustekinumab do not increase the risk of COVID-19: to be continued safely\n\n\u2022 Exclusive enteral nutrition to be used if biological is not available\n\n\u2022 Non-invasive monitoring: CRP, fecal calprotectin, procalcitonin recommended by few societies\n\n\u2022 If a case is SARS-CoV-2 positive but asymptomatic: steroids to be reduced to <20 mg/day, or switch to budesonide, stop thiopurines, methotrexate, tofacitinib, delay the dosing of monoclonal for 2 weeks and monitor for COVID-19\n\n\u2022 If symptomatic COVID-19 present: only continue 5 ASAs and local therapy, oral budesonide recommended by some, restart all medications to stop. Restart all the above after 2 weeks of resolution of symptoms\n\n\u2022 For clinical trials, new enrolment should be postponed. For existing ones, can continue\n\n\u2022 In children: continue all the medications in the usual dose (no dose reduction even if SARS-CoV-2 positive)\n\n\u2022 Endoscopy in IBD patients during COVID-19 pandemic\n\n\u2022 Defer all elective cases\n\n\u2022 Emergency situations include newly diagnosed moderate/active IBD, acute flare of IBD, inflammatory intestinal obstruction necessitating endoscopic dilatation, to rule out CMV (Cytomegalovirus) colitis, managing cholangitis (especially dominant stricture associated) in primary sclerosing cholangitis with IBD\n\n\u2022 Always triage with FTOCC protocol: fever, travel, occupation, contact, clustering (especially in the last 14 days)\n\n\u2022 Always test for COVID-19: naso/oro-pharyngeal swab with RT-PCR and CECT chest if needed\n\n\u2022 In newly diagnosed cases and acute flare of IBD:\n\n\u2022 Rule out infectious causes (fecal calprotectin/CRP levels)\n\n\u2022 Perform stool CDTA (_Clostridium difficile_), CMV DNA, stool cultures\n\n\u2022 In moderate/severe signs of infection: perform sigmoidoscopy/colonoscopy with biopsies\n\n\u2022 For mild disease, 5-ASA and/or budesonide are reasonable\n\n\u2022 For moderate/severe disease, requiring steroid treatment, strict social distancing, and precautions to be adopted. Upfront biologicals (subcutaneous) may be considered\n\n\u2022 In cases of IBD with intestinal obstruction:\n\n\u2022 Perform abdominal CT/MRI in all\n\n\u2022 If inflammatory stenosis: perform sigmoidoscopy/colonoscopy with biopsies with endoscopic stricture dilatation\n\n\u2022 If fibrotic stenosis: refer to surgery\n\n_IBD inflammatory bowel disease, COVID-19 corona virus disease-19, JAK Janus kinase, SARS-CoV-2 severe acute respiratory syndrome coronavirus 2, ASA amino salicylate, CRP C-reactive protein, CMV cytomegalovirus, RT-PCR reverse transcriptase-polymerase chain reaction, CECT contrast-enhanced computed tomography, CDTA Clostridium difficile toxin assay, DNA deoxyribonucleic acid, MRI magnetic resonance imaging_\n\nSpringer\nIndian J Gastroenterol (May-June 2020) 39(3):220-231 229\n\nPanel 3 Key recommendations for nutrition therapy in COVID-19 patients\n\n\u2022 Patients at high risk for poor outcome such as elderly and those with multiple comorbidities should be evaluated for malnutrition\n\n\u2022 Those with malnutrition should have optimized nutritional therapy by diet counselling using weight-based formulae:\n\na. 27 kcal/kg/day for age >65 years with multiple comorbidities\n\nb. 30 kcal/kg/day for severely malnourished with multiple comorbidities\n\nc. Protein at the rate of 1 g/kg body weight for older individuals. For multiple comorbidities, may consider \\(\\ge\\)1 g/kg of protein\n\n\u2022 Adequate supplementation with vitamins and minerals in cases of malnutrition\n\n\u2022 Regular physical activity for those in quarantine\n\n\u2022 Oral nutritional supplements may be advocated in situation where diet counselling and food fortification are inadequate\n\n\u2022 For intensive care unit (ICU) admitted patients:\n\na. Enteral nutrition (EN) preferred over parenteral nutrition (PN) (when gastrointestinal (GI) symptoms absent): placement of 10-12 F nasogastric tube. Consider post pyloric feeding if the above fails\n\nb. PN preferred over EN when GI symptoms present and transitioning to EN when they subside.\n\nc. Initiation of early EN within 24-36 h of admission to the ICU or within 12 h of intubation; continuous EN preferred over bolus feeding\n\nd. Early PN in high-risk cases (shock, bowel ischemia, high positive pressure support is required); multi-chamber bags to be used to minimize exposure while handling\n\ne. Confirmatory abdominal X-rays should be clustered with chest X-ray timings\n\nf. To start with hypocaloric feeding, then increasing within 1 week to goal of 15-20 kcal/kg actual body weight (ABW)/day and protein of 1.2-2.0 g/kg ABW/day\n\ng. Monitoring of serum triglyceride levels in those receiving propofol and/or intravenous lipid emulsions early in their course (as COVID-19 leads to secondary hemophagocytosis in some reported cases)\n\nh. Even in prone position: EN to be considered over PN but with a reverse trendelenburg position to avoid gastric aspiration\n\nPanel 4 Key recommendations for performing endoscopic procedures during the COVID-19 pandemic\n\n\u2022 Pre-procedure\n\n\u2022 Triage of indications on the basis of level of urgency\n\n\u2022 Procedures not time-sensitive should be postponed\n\n\u2022 Regular tele-monitoring of postponed patients to ensure that condition does not turn urgent\n\n\u2022 Risk stratification of cases on the basis of low, intermediate, and high risk for COVID-19\n\n\u2022 N95 masks recommended for all GI endoscopy procedures\n\n\u2022 Proper separate donning and doffing area: adequate training of HCWs\n\n\u2022 All patients should wear surgical masks\n\n\u2022 Adequate informed consent\n\n\u2022 In-procedure Room\n\n\u2022 Minimize the number of personnel: only 1 endoscopist and 2 assistants adequate\n\n\u2022 Avoid personnel switching during procedures\n\n\u2022 Proper hand hygiene to be followed\n\n\u2022 Standard PPE for negative cases. Enhanced PPE for suspected/positive cases\n\n\u2022 Use of double gloves preferable\n\n\u2022 Goggles/face shield to be used\n\n\u2022 Use of washable work boots to be used during the endoscopy session\n\n\u2022 Negative pressure room/ HEPA filter/ use of exhaust fans\n\n\u2022 During procedure\n\n\u2022 Avoid aggressive suctioning and multiple catheter exchanges\n\n\u2022 Minimum positive insufflation during endoscopy\n\n\u2022 While using the accessory channel, the handle of the scope should be directed down and towards the left to minimize exposures\n\n\u2022 All specimen, including biopsies, to be handled with extra precautions\n\n\u2022 Precaution to be followed during colonoscopy as well\n\n\u2022 Use of gauze piece to cover instrument channel and mouth of the scope after removal\n\n\u2022 Endoscopist should alert the team during scope withdrawal\n\n\u2022 Post-procedure\n\n\u2022 Adequate disinfection with standard agents\n\n\u2022 Disinfection of non-critical surfaces such as bedside tables, bed rails, computers, and phones to be done after each procedure\n\n\u2022 Disposable devices not to be reused\n\n\u2022 A gap of at least 30 min between two procedures\n\n\u2022 Follow-up of negative patients and health care workers for any new-onset symptoms\n\nCOVID-19 corona virus disease-19, GI gastrointestinal, HCWs healthcare workers, PPE personal protective equipment, HEPA high-efficiency particulate air\n\nSpringer\n230 Indian J Gastroenterol (May-June 2020) 39(3):220-231\n\n## References\n\n1. WHO. Coronavirus disease 2019 (COVID-19) situation report \u2013 120. 2020. https://www.who.int/docs/default-source/coronaviruse/ situation-reports/20200519-covid-19-sitrep-120.pdf?sfvrsn= 515cabfb\\_2. Accessed 19 May 2020.\n\n2. Holshue ML, DeBolt C, Lindquist S, et al. First case of 2019 novel coronavirus in the United States. N Engl J Med. 2020;382:929-36.\n\n3. Lu R, Zhao X, Li J, et al. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. Lancet. 2020;395:565-74.\n\n4. Huang C, Wang Y, Li X, et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet. 2020;395: 497-506.\n\n5. Epidemiology Working Group for NCIP Epidemic Response. Chinese Center for Disease Control and Prevention. Zhonghua Liu Xing Bing Xue Za Zhi. 2020;41:145-51.\n\n6. Hu Y, Sun J, Dai Z, et al. Prevalence and severity of corona virus disease 2019 (COVID-19): A systematic review and meta-analysis. J Clin Virol. 2020;127:104371.\n\n7. Hui DS, Azhar EI, Kim YJ, et al. Middle East respiratory syndrome coronavirus: risk factors and determinants of primary, household, and nosocomial transmission. Lancet Infect Dis. 2018;18:e217- e27.\n\n8. Stadler K, Masignani V, Eickmann M, et al. SARS -beginning to understand a new virus. Nat Rev Microbiol. 2003;1:209-18.\n\n9. Zhou P, Yang XL, Wang XG, et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020;579:270-3.\n\n10. Chan JF, Yuan S, Kok KH, et al. A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to- person transmission: a study of a family cluster. Lancet. 2020;395: 514-23.\n\n11. Guan WJ, Ni ZY, Hu Y, et al. Clinical characteristics of coronavirus disease 2019 in China. N Engl J Med. 2020;382:1708-20.\n\n12. Sultan S, Altayar O, Siddique SM, et al. AGA institute rapid review of the GI and liver manifestations of COVID-19, meta-analysis of international data, and recommendations for the consultative management of patients with COVID-19. Gastroenterology. 2020;159: 320-34.e27.\n\n13. Yan R, Zhang Y, Li Y, et al. Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2. Science. 2020;367: 1444-8.\n\n14. Wan Y, Shang J, Graham R, et al. Receptor recognition by the novel coronavirus from Wuhan: an analysis based on decade-long structural studies of SARS coronavirus. J Virol. 2020;94:e00127-0.\n\n15. Zhang H, Kang Z, Gong H, et al. Digestive system is a potential route of COVID-19: an analysis of single-cell coexpression pattern of key proteins in viral entry process. Gut. 2020;69:1010-8.\n\n16. Liang W, Feng Z, Rao S, et al. Diarrhoea may be underestimated: a missing link in 2019 novel coronavirus. Gut. 2020;69:1141-3.\n\n17. Hashimoto T, Perlot T, Rehman A, et al. ACE2 links amino acid malnutrition to microbial ecology and intestinal inflammation. Nature. 2012;487:477-81.\n\n18. Lin L, Jiang X, Zhang Z, et al. Gastrointestinal symptoms of 95 cases with SARS-CoV-2 infection. Gut. 2020;69:997-1001.\n\n19. Holmes KV. Enteric infections with coronaviruses and toroviruses. In Novartis Foundation Symposium 2001 Jun 29. Chichester: John Wiley; 1999.\n\n20. Liu Q, Wang RS, Qu GQ, et al. Gross examination report of a COVID-19 death autopsy. Fa Yi Xue Za Zhi. 2020;36:21-3.\n\n21. Budden KF, Gellatly SL, Wood DL, et al. Emerging pathogenic links between microbiota and the gut-lung axis. Nat Rev Microbiol. 2017;15:55-63.\n\n22. Xiao F, Tang M, Zheng X, et al. Evidence for gastrointestinal infection of SARS-CoV-2. Gastroenterology. 2020;158:1831-3.e3.\n\n23. Xie C, Jiang L, Huang G, et al. Comparison of different samples for 2019 novel coronavirus detection by nucleic acid amplification tests. Int J Infect Dis. 2020;93:264-7.\n\n24. Tian Y, Rong L, Nian W, He Y. Review article: gastrointestinal features in COVID-19 and the possibility of faecal transmission. Aliment Pharmacol Ther. 2020;51:843-51.\n\n25. Cheung KS, Hung IF, Chan PP, et al. Gastrointestinal manifestations of SARS-CoV-2 infection and virus load in fecal samples from the Hong Kong Cohort: Systematic review and meta-analysis. Gastroenterology. 2020;159:81-95.\n\n26. Song Y, Liu P, Shi XL, et al. SARS-CoV-2 induced diarrhoea as onset symptom in patient with COVID-19. Gut. 2020;69:1143-4.\n\n27. Young BE, Ong SWX, Kalimuddin S, et al. Epidemiologic features and clinical course of patients infected with SARS-CoV-2 in Singapore. JAMA. 2020;323:1488-94.\n\n28. Hosoda T, Sakamoto M, Shimizu H, Okabe N. SARS-CoV-2 enterocolitis with persisting to excrete the virus for approximately two weeks after recovering from diarrhea: a case report. Infect Control Hosp Epidemiol. 2020;41:753-4.\n\n29. Fang Dan M, Guan J, Wang M. Manifestations of digestive system in hospitalized patients with novel coronavirus pneumonia in Wuhan, China: a single-center, descriptive study. Chin J Dig. 2020; 40(3).\n\n30. Wang D, Hu B, Hu C, et al. Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan, China. JAMA. 2020;323:1061-9.\n\n31. Mao R, Qiu Y, He JS, et al. Manifestations and prognosis of gastrointestinal and liver involvement in patients with COVID-19: a systematic review and meta-analysis. Lancet Gastroenterol Hepatol. 2020;5:667-78.\n\n32. Jin X, Lian JS, Hu JH, et al. Epidemiological, clinical and virological characteristics of 74 cases of coronavirus-infected disease 2019 (COVID-19) with gastrointestinal symptoms. Gut. 2020;69:1002- 9.\n\n33. Henry BM, de Oliveira MHS, Benoit J, Lippi G. Gastrointestinal symptoms associated with severity of coronavirus disease 2019 (COVID-19): a pooled analysis. Intern Emerg Med. 2020;1-3.\n\n34. An P, Chen H, Jiang X, et al. Clinical features of 2019 novel coronavirus pneumonia presented gastrointestinal symptoms but without fever onset. 2020. https://doi.org/10.2139/ssrn.3532530.\n\n35. Pan L, Mu M, Yang P, et al. Clinical characteristics of COVID-19 patients with digestive symptoms in Hubei, China: a descriptive, cross-sectional, Multicenter Study. Am J Gastroenterol. 2020;115: 766-73.\n\n36. Luo S, Zhang X, Xu H. Don\u2019t overlook digestive symptoms in patients with 2019 novel coronavirus disease (COVID-19). Clin Gastroenterol Hepatol. 2020;18:1636-7.\n\n37. Liu W, Zhang Q, Chen J, et al. Detection of Covid-19 in children in early January 2020 in Wuhan, China. N Engl J Med. 2020;382: 1370-1.\n\n38. Mao L, Jin H, Wang M, et al. Neurologic manifestations of hospitalized patients with coronavirus disease 2019 in Wuhan, China. JAMA Neurol. 2020;77:1-9.\n\n39. Tong JY, Wong A, Zhu D, et al. The prevalence of olfactory and gustatory dysfunction in COVID-19 patients: a systematic review and meta-analysis. Otolaryngol Head Neck Surg. 2020;163:3-11.\n\n40. CDC. Symptoms of coronavirus. 2020. https://www.cdc.gov/ coronavirus/2019-ncov/symptoms-testing/symptoms.html. Accessed 13 May 2020.\n\n41. Sanders JM, Monogue ML, Jodlowski TZ, Cutrell JB. Pharmacologic treatments for coronavirus disease 2019 (COVID- 19): a review. JAMA. 2020. https://doi.org/10.1001/jama.2020. 6019.\n\nSpringer\nIndian J Gastroenterol (May-June 2020) 39(3):220-231 231\n\n42. Corman VM, Albarrak AM, Omrani AS, et al. Viral shedding and antibody response in 37 patients with Middle East respiratory syndrome coronavirus infection. Clin Infect Dis. 2016;62:477-83.\n\n43. Chen Y, Chen L, Deng Q, et al. The presence of SARS-CoV-2 RNA in the feces of COVID-19 patients. J Med Virol. 2020;92: 833-40.\n\n44. Zhang J, Wang S, Xue Y. Fecal specimen diagnosis 2019 novel coronavirus-infected pneumonia. J Med Virol. 2020;92:680-2.\n\n45. Ling Y, Xu SB, Lin YX, et al. Persistence and clearance of viral RNA in 2019 novel coronavirus disease rehabilitation patients. Chin Med J (Engl). 2020;133:1039-43.\n\n46. Wu Y, Guo C, Tang L, et al. Prolonged presence of SARS-CoV-2 viral RNA in faecal samples. Lancet Gastroenterol Hepatol. 2020;5: 434-5.\n\n47. Gupta S, Parker J, Smits S, et al. Persistent viral shedding of SARS- CoV-2 in faeces - a rapid review. Color Dis. 2020;22:611-20.\n\n48. van Doremalen N, Bushmaker T, Morris DH, et al. Aerosol and surface stability of SARS-CoV-2 as compared with SARS-CoV-1. N Engl J Med. 2020;382:1564-7.\n\n49. Ong SWX, Tan YK, Chia PY, et al. Air, surface environmental, and personal protective equipment contamination by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) from a symptomatic patient. JAMA. 2020;323:1610-2.\n\n50. Ahmed W, Angel N, Edson J, et al. First confirmed detection of SARS-CoV-2 in untreated wastewater in Australia: a proof of concept for the wastewater surveillance of COVID-19 in the community. Sci Total Environ. 2020;728:138764.\n\n51. CDC. Symptom-based strategy to discontinue isolation for persons with COVID-19. 2020. https://www.cdc.gov/coronavirus/2019- ncov/community/strategy-discontinueisolation.html? deliveryName=USCDC\\_2067-DM27395. Accessed 3 May 2020.\n\n52. Ianiro G, Mullish BH, Kelly CR, et al. Screening of faecal microbiota transplant donors during the COVID-19 outbreak: suggestions for urgent updates from an international expert panel. Lancet Gastroenterol Hepatol. 2020;5:430-2.\n\n53. Norsa L, Indriolo A, Sansotta N, Cosimo P, Greco S, D\u2019Antiga L. Uneventful course in patients with inflammatory bowel disease during the severe acute respiratory syndrome coronavirus 2 outbreak in northern Italy. Gastroenterology. 2020;S0016- 5085(20)30445-5.\n\n54. Bezzo C, Saibeni S, Variola A, et al. Outcomes of COVID-19 in 79 patients with IBD in Italy: an IG-IBD study. Gut. 2020;69:1213-7.\n\n55. Brenner EJ UR, Colombel JF, Kappelman MD. SECURE-IBD Database Public Data Update. 2020. https://covidibd.org/current- data/. Accessed 18 May 2020.\n\n56. IOIBD. IOIBD update on COVID19 for patients with Crohn\u2019s disease and ulcerative colitis. 2020. https://www.ioibd.org/ioibd- update-on-covid19-for-patients-with-crohns-disease-and- ulcerative-colitis/. Accessed 13 April 2020.\n\n57. Baryah ANS, Midha V, Mahajan R, Sood A. Impact of corona virus disease - 19 (COVID-19) pandemic on gastrointestinal disorders. Indian J Gastroenterol. 2020; 39. https://doi.org/10.1007/ s12664-020-01071-6.\n\n58. Ling KL, Hilmi I, Raja Ali RA, et al. Asian Pacific Association of Gastroenterology (APAGE) Inflammatory Bowel Disease (IBD) Working Party guidelines on IBD management during the COVID-19 pandemic. JGH Open. 2020;4:320-3.\n\n59. Turner D, Huang Y, Mart\u00edn-de-Carpi J, et al. COVID-19 and paediatric inflammatory bowel diseases: global experience and provisional guidance (March 2020) from the PAEDIATRIC IBD Porto\n\nGroup of European Society of Paediatric Gastroenterology, Hepatology, and Nutrition. J Pediatr Gastroenterol Nutr. 2020;70: 727-33.\n\n60. ECCO. 1st Interview COVID-19 ECCO Taskforce. 2020. https:// www.ecco-ibd.eu/images/6\\_Publication/6\\_8\\_Surveys/1st\\_ interview\\_COVID-19%20ECCOTaskforce\\_published.pdf. Accessed 13 April 2020.\n\n61. Kennedy NA, Jones GR, Lamb CA, et al. British Society of Gastroenterology guidance for management of inflammatory bowel disease during the COVID-19 pandemic. Gut. 2020;69:984-90.\n\n62. Foundation CsaC. Resources for IBD healthcare professionals: 2019 novel coronavirus (COVID-19). 2020. https://www. crohnscolitisfoundation.org/coronavirus/professional-resources. Accessed 6 April 2020.\n\n63. Rubin DT, Feuerstein JD, Wang AY, Cohen RD. AGA clinical practice update on management of inflammatory bowel disease during the COVID-19 pandemic: expert commentary. Gastroenterology. 2020;159:350-7.\n\n64. Neumann H, Emura F, Bokemeyer B, et al. Practical advice for management of IBD patients during the COVID-19 pandemic: a world endoscopy organization statement. Dig Endosc. 2020;32: 658-62.\n\n65. Wang F, Wang H, Fan J, Zhang Y, Wang H, Zhao Q. Pancreatic injury patterns in patients with COVID-19 pneumonia. Gastroenterology. 2020;159:367-70.\n\n66. Hadi A, Werge MP, Kristiansen KT, et al. Coronavirus disease-19 (COVID-19) associated with severe acute pancreatitis: case report on three family members. Pancreatology. 2020;20:665-7.\n\n67. Barazzoni R, Bischoff SC, Breda J, et al. ESPEN expert statements and practical guidance for nutritional management of individuals with SARS-CoV-2 infection. Clin Nutr. 2020;39:1631-8.\n\n68. Chiu PWY, Ng SC, Inoue H, et al. Practice of endoscopy during COVID-19 pandemic: position statements of the Asian Pacific Society for Digestive Endoscopy (APSDE-COVID statements). Gut. 2020;69:991-6.\n\n69. Sultan S, Lim JK, Altayar O, et al. AGA institute rapid recommendations for gastrointestinal procedures during the COVID-19 pandemic. Gastroenterology. 2020;S0016-5085(20)30458-3.\n\n70. Gralnek IM, Hassan C, Beilenhoff U, et al. ESGE and ESGENA position statement on gastrointestinal endoscopy and the COVID- 19 pandemic. Endoscopy. 2020;52:483-90.\n\n71. WEO. WEO recommendations on digestive endoscopy and the COVID-19 pandemic. 2020. http://www.worldendo.org/wp- content/uploads/2020/04/200409\\_WEO-Advice-to-Endoscopists- COVID-19-Update-April-9-2020.pdf. Accessed 9 April 2020.\n\n72. Repici A, Maselli R, Colombo M, et al. Coronavirus (COVID-19) outbreak: what the department of endoscopy should know. Gastrointest Endosc. 2020;92:192-7.\n\n73. Guidelines BS. Endoscopy activity and COVID-19: BSG and JAG guidance. 2020. https://www.bsg.org.uk/covid-19-advice/ endoscopy-activity-and-covid-19-bsg-and-jag-guidance/. Accessed 3 April 2020.\n\n74. Philip M, Lakhtakia S, Aggarwal R, et al. Joint guidance from SGEI, ISG and INASL for gastroenterologists and gastrointestinal endoscopists on the prevention, care and management of patients with COVID-19. J Clin Exp Hepatol. 2020;10:266-70.\n\n**Publisher\u2019s note** Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.\n\nSpringer\n", "modalities": [{"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152644_0qpsbjf1.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152644_pej_68h4.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152644_sav0zhfz.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152644_qcdwt0oy.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152644_2cogklfy.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152645_qngvugzz.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152645_daskuwfy.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152645_6cxqealk.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152645_xpig20v6.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152645_430cgibd.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152645_jyff52y6.png"}, {"type": "image", "value": "/Users/chaverot/dev/thesis/mmore-nemotron/examples/process/outputs_nemotron/images/20260601_152645_0sfqyj8_.png"}], "metadata": {"file_path": "examples/sample_data/pdf/Corona Virus Disease-19 pandemic.pdf", "processed_at": "2026-06-01T15:28:04.462872", "processor_type": "NemotronVLMProcessor", "paragraph_starts": [[0, 0, 0], [109, 0, 1], [183, 0, 2], [275, 0, 3], [357, 0, 4], [2000, 0, 5], [2256, 0, 6], [2274, 0, 7], [2622, 0, 8], [3815, 0, 9], [3840, 0, 10], [3865, 0, 11], [4021, 0, 12], [4030, 1, 0], [4088, 1, 1], [4553, 1, 2], [4599, 1, 3], [5759, 1, 4], [5785, 1, 5], [6318, 1, 6], [6639, 1, 7], [6759, 1, 8], [6935, 1, 9], [7400, 1, 10], [7423, 1, 11], [7896, 1, 12], [8271, 1, 13], [8891, 1, 14], [8951, 1, 15], [9309, 1, 16], [9318, 2, 0], [9376, 2, 1], [9837, 2, 2], [10182, 2, 3], [10202, 2, 4], [12023, 2, 5], [12315, 2, 6], [12892, 2, 7], [12916, 2, 8], [13407, 2, 9], [14335, 2, 10], [15154, 2, 11], [15163, 3, 0], [15221, 3, 1], [15309, 3, 2], [16081, 3, 3], [16308, 3, 4], [16412, 3, 5], [17014, 3, 6], [17409, 3, 7], [17419, 3, 8], [19066, 3, 9], [19188, 3, 10], [19388, 4, 0], [19446, 4, 1], [19504, 4, 2], [19998, 4, 3], [20493, 4, 4], [21289, 4, 5], [21973, 4, 6], [22465, 4, 7], [22475, 4, 8], [23555, 4, 9], [23742, 4, 10], [23861, 5, 0], [23919, 5, 1], [24463, 5, 2], [24759, 5, 3], [25060, 5, 4], [25110, 5, 5], [25406, 5, 6], [25427, 5, 7], [26531, 5, 8], [26777, 5, 9], [27329, 5, 10], [27369, 5, 11], [27890, 5, 12], [27934, 5, 13], [28184, 5, 14], [28316, 5, 15], [28915, 5, 16], [28924, 6, 0], [28982, 6, 1], [29577, 6, 2], [30132, 6, 3], [30142, 6, 4], [30567, 7, 0], [30625, 7, 1], [30824, 7, 2], [30878, 7, 3], [31528, 7, 4], [31976, 7, 5], [32017, 7, 6], [32767, 7, 7], [32806, 7, 8], [33045, 7, 9], [34194, 7, 10], [34209, 7, 11], [35300, 7, 12], [35754, 7, 13], [35792, 7, 14], [35882, 7, 15], [35891, 8, 0], [35949, 8, 1], [35964, 8, 2], [36207, 8, 3], [36220, 8, 4], [36301, 8, 5], [36342, 8, 6], [36364, 8, 7], [36382, 8, 8], [36406, 8, 9], [36433, 8, 10], [36461, 8, 11], [36500, 8, 12], [36553, 8, 13], [36606, 8, 14], [36652, 8, 15], [36763, 8, 16], [36805, 8, 17], [36826, 8, 18], [36922, 8, 19], [36944, 8, 20], [37038, 8, 21], [37196, 8, 22], [37305, 8, 23], [37390, 8, 24], [37466, 8, 25], [37553, 8, 26], [37687, 8, 27], [37794, 8, 28], [38094, 8, 29], [38119, 8, 30], [38211, 8, 31], [38284, 8, 32], [38380, 8, 33], [38610, 8, 34], [38818, 8, 35], [38909, 8, 36], [39020, 8, 37], [39074, 8, 38], [39102, 8, 39], [39414, 8, 40], [39532, 8, 41], [39623, 8, 42], [39675, 8, 43], [39737, 8, 44], [39810, 8, 45], [39900, 8, 46], [39960, 8, 47], [40131, 8, 48], [40179, 8, 49], [40214, 8, 50], [40328, 8, 51], [40370, 8, 52], [40782, 8, 53], [40791, 9, 0], [40849, 9, 1], [40921, 9, 2], [41054, 9, 3], [41172, 9, 4], [41237, 9, 5], [41310, 9, 6], [41445, 9, 7], [41525, 9, 8], [41578, 9, 9], [41701, 9, 10], [41753, 9, 11], [41955, 9, 12], [42048, 9, 13], [42187, 9, 14], [42354, 9, 15], [42433, 9, 16], [42592, 9, 17], [42794, 9, 18], [42920, 9, 19], [43015, 9, 20], [43032, 9, 21], [43090, 9, 22], [43143, 9, 23], [43238, 9, 24], [43332, 9, 25], [43389, 9, 26], [43460, 9, 27], [43503, 9, 28], [43532, 9, 29], [43553, 9, 30], [43635, 9, 31], [43682, 9, 32], [43720, 9, 33], [43798, 9, 34], [43833, 9, 35], [43867, 9, 36], [43937, 9, 37], [43997, 9, 38], [44017, 9, 39], [44080, 9, 40], [44130, 9, 41], [44259, 9, 42], [44333, 9, 43], [44389, 9, 44], [44476, 9, 45], [44537, 9, 46], [44555, 9, 47], [44601, 9, 48], [44731, 9, 49], [44770, 9, 50], [44821, 9, 51], [44906, 9, 52], [45059, 9, 53], [45068, 10, 0], [45126, 10, 1], [45141, 10, 2], [45364, 10, 3], [45500, 10, 4], [45678, 10, 5], [45823, 10, 6], [45984, 10, 7], [46150, 10, 8], [46353, 10, 9], [46477, 10, 10], [46617, 10, 11], [46827, 10, 12], [46956, 10, 13], [47233, 10, 14], [47372, 10, 15], [47568, 10, 16], [47765, 10, 17], [47897, 10, 18], [48052, 10, 19], [48177, 10, 20], [48324, 10, 21], [48445, 10, 22], [48592, 10, 23], [48722, 10, 24], [48898, 10, 25], [49071, 10, 26], [49298, 10, 27], [49426, 10, 28], [49589, 10, 29], [49819, 10, 30], [50023, 10, 31], [50196, 10, 32], [50411, 10, 33], [50618, 10, 34], [50807, 10, 35], [51001, 10, 36], [51210, 10, 37], [51375, 10, 38], [51516, 10, 39], [51674, 10, 40], [51868, 10, 41], [52006, 10, 42], [52191, 10, 43], [52200, 11, 0], [52258, 11, 1], [52451, 11, 2], [52583, 11, 3], [52708, 11, 4], [52878, 11, 5], [53018, 11, 6], [53154, 11, 7], [53311, 11, 8], [53540, 11, 9], [53774, 11, 10], [54009, 11, 11], [54243, 11, 12], [54508, 11, 13], [54644, 11, 14], [54796, 11, 15], [55026, 11, 16], [55237, 11, 17], [55464, 11, 18], [55653, 11, 19], [55786, 11, 20], [55985, 11, 21], [56169, 11, 22], [56376, 11, 23], [56581, 11, 24], [56775, 11, 25], [56927, 11, 26], [57117, 11, 27], [57307, 11, 28], [57512, 11, 29], [57697, 11, 30], [57865, 11, 31], [58097, 11, 32], [58256, 11, 33], [58455, 11, 34], [58702, 11, 35], [58843, 11, 36], [58852, -1, -1]], "backend": "nemotron-vlm", "model": "nvidia/nemotron-nano-12b-v2-vl"}}

Comment on lines +17 to +19
# Env var that selects the PDF backend. When set to "nemotron", this processor
# accepts .pdf files and the default PDFProcessor (Marker) steps aside.
PDF_BACKEND_ENV = "MMORE_PDF_BACKEND"

@JCHAVEROT JCHAVEROT Jun 1, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit weird to have to export this new env var just to enable the nemotron pdf processor, I think you should remove this MMORE_PDF_BACKEND

It's better if within the .yaml config file you add a field inside the dispatcher_config section

You seem to have done it already in production-config/process/config.yaml with a parameter pdf_backend, please also update the oder template process configs

@JCHAVEROT JCHAVEROT Jun 1, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What were the results of the benchmark ?
You can just add screenshots if you happen to have some

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants