MoDora is an LLM-powered framework for semi-structured document analysis. It introduces the Component-Correlation Tree (CCTree) to model semi-structured documents with diverse elements and complicated layouts. It combines preprocessing with OCR-parsed elements, tree construction, and tree-based analysis, without the need for extra training or fine-tuning. Experiments on two datasets covering varied documents and question types show that it outperforms existing methods.
- 2026.2: Our paper "MoDora: Tree-Based Semi-Structured Document Analysis System" has been accepted by SIGMOD 2026.
- 🎯 Management & Analysis: Extract document structures and features, and answer user questions based on content retrieval. Support multi-document management and cross-document analysis.
- 🔄 Visualization & Integration: Visualize the document structure (tree) and how often each node participates in analysis (heatmap). Users can further modify the tree structure in the user interface according to their needs.
- 👫 Flexibility & Options: Provide a configuration mechanism for flexibly switching backend models (local/API). Different modules can also use different models.
We provide frontend deployment as follows (click to jump to demo video):

Examples
| Question | Answer | MoDora | UDOP | DocOwl2 | M3DocRAG | SV-RAG | TextRAG | ZenDB | QUEST | GPT-5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Is Bitmex launched earlier than South Korea's ban on Bitcoin derivatives? | Yes | Yes—BitMEX launched in Jan 2014, while South Korea’s ban was on Dec 11, 2017. | no | Yes | Yes | Yes, Bitmex was launched in 2014, while South Korea's ban on Bitcoin derivatives was announced in 2017. | Yes. BitMEX launched in 2014, while South Korea’s Bitcoin derivatives ban began on 2017-12-11. | Not specified. | Yes. | |
| What is the title at the top of the third page? | Preface | PREFACE | The title of the third page is "The Changing World" | PREFACE | 2018 Global Digital Asset Derivatives Research Report | 2018 Global Digital Asset Derivatives Research Report | LEGAL NOTICES | Digital Asset Derivatives | Not found. | Working with Apps |
| When was this document published? | November 30, 2023 | November 30, 2023 | 2023 September 6, 2000 | 2023-11-10 | 2023-11-13 | The document was published on November 30, 2023. | The publication date is not stated in the provided text. | November 30, 2023 | 2023-11-30 | November 30, 2023 |
| What image is in the center of page 2? | QR code | A QR code. | The image is in the center of page 2. | not answerable | A QR code. | QR code | Not specified in the provided information. | Not specified. | Not specified. | A QR code. |
| What was the decline in the electrical equipment industry last week? | 2.36% | Down 2.36%. | 3% | not answerable | -2.36% | The decline in the electrical equipment industry last week was 0.6%. | Down 2.36%. | Not specified in the provided context. | -2.36% | -3.4% |
Click to view more examples
| Question | Answer | MoDora | UDOP | DocOwl2 | M3DocRAG | SV-RAG | TextRAG | ZenDB | QUEST | GPT-5 |
|---|---|---|---|---|---|---|---|---|---|---|
| What is the fourth section of the report? | Investment Strategy and Key Recommendations | Investment Strategy and Key Recommendations | scientific report | not answerable | Risk warning | The fourth section of the report is titled "2.2.1 New Energy Source." | Not specified in the provided information. | Investment Strategy and Key Recommendations | Unknown | Real estate market |
| How many references are there? | 13 | 13 | 3 | 37 | 13 | There are 5 references in the text. | 11 | 13 | 13 | |
| What is the actual number of the second page? | 266 | 266 | 240 | 2 | 265 | The actual number of the second page is 266. | 3070 | 266 | Not found | 266 |
| What was the amount of the year-on-year increase in the largest proportion of RMB loans in November 2022? | 23.2 billion | 23.2 billion yuan | 9% | 12.40% | 58.6 billion yuan | The largest proportion of RMB loans in November 2022 increased by 7.8%. | Down by 60 billion yuan year on year (1210 billion yuan vs 1270 billion yuan) | 23.2 billion yuan | N/A | 0.20 trillion yuan |
| Which color is used to emphasize key text? | Red | Red | Blue | Yellow | Red | The color used to emphasize key text is red. | Red | Not specified. | None | Red |
| What is the stock code of the company discussed in the document? | 002624 | 002624 | HK | 002624 | 002624 | 002624 | 002624 | 002624 | 002891.SZ | |
| Which kind of companies have more workers deployed in trades irrelevent to their skills certification | Small companies | Smaller firms. | ad hoc | small firms | Small companies. | smaller firms | Smaller companies. | Small companies | Main contractor (general building) companies. | |
| What is the required page length for manuscripts submitted? | Between 16 and 20 pages | 16–20 double-spaced pages. | 240 | between 16 and 20 double-spaced pages | 16–20 double-spaced pages. | Manuscripts should be between 16 and 20 double-spaced typed pages, with margins of at least one inch. | Between 16 and 20 double-spaced typed pages. | Between 16 and 20 double-spaced typed pages. | None | 16–20 double-spaced pages. |
| What is the main underwriting amount of the company's equity financing scale in 2018? | 1783 | 178.3 billion yuan | 57.8 | not answerable | 95.3 billion yuan | The main underwriting amount of the company's equity financing scale in 2018 is 541. | Not specified in the provided text (it appears only in the chart, which isn’t readable here). | 91.116 billion yuan | N/A | 56.6 billion yuan |
| What is the name of the second document in the image? | Traffic Engineering Report | Traffic Engineering Report | The second document in the image is called Peachtree Industrial Boulevard. | not answerable | Traffic Engineering Report | Signal Clearance Intervals | PED CLEARANCE | Cannot be determined from the provided context. | 39132487.txt | Traffic Engineering Report |
| What is the percentage yield of compound IX? | 65% | 65% | 10.7% | 65% | 65% | The percentage yield of compound IX is 65%. | Not specified in the provided information. | Not specified. | N/A | 65% |
| In which solvent are the title amides readily soluble? | DMSO | DMSO. | Common organic solvents but dissolve readily in DMSO. | common organic solvents | DMSO | The title amides are readily soluble in common organic solvents but dissolve readily in DMSO. | DMSO (dimethyl sulfoxide). | DMSO | Not specified. | DMSO |
| Who is the author of "Logic as Algebra"? | Paul Halmos and Steven Givant | Paul Halmos and Steven Givant. | David Greenberg | Paul Halmos | Paul R. Halmos and Steven Givant | Paul Halmos and Steven Givant | Paul Halmos and Steven Givant. | Paul R. Halmos and Steven Givant | None | Paul R. Halmos and Steven Givant |
| What is the serial number associated with this package? | 41850 | U202141850 | 00228 | 41850 | A1850 | 41850 | U202141850 | Not provided. | U202141850 | 41850 |
| Which knitting technique is used to join the yarn in a circle? | Joining in Round | Joining in the round. | Using a bobbin of yarn | Crochet Cast-On | Joining in the round. | K2tog | Join in the round (knitting in the round). | Pass Slipped Stitch Over (PSSO) after casting on one extra stitch. | Knitting in the round. | Joining in the round. |
| How many yards of worsted weight yarn is needed to makes 2 gloves? | 135 yards | 135 yards | 240 lbs | 135 yards/60g | 100 yards | 135 yards | About 135 yards (60 g) of worsted weight yarn for a pair (2 gloves). | 135 yards | 135 yards | About 150 yards. |
| What is the unemployment rate in the year with the lowest labor force participation rate in the 16-24 age group? | 15% | About 15% (in 2020). | 9% | 6.0% | 25% | The unemployment rate in the year with the lowest labor force participation rate in the 16-24 age group is 13.0%. | Cannot be determined from the provided information. | 14.9% | 10% | |
| Around what date did the outbreak of the epidemic occur in China? | 2022-03-04 | Around March 2022. | 86-10-08-90 | 2022-01-01 | Around February 2020. | The outbreak of the epidemic occurred in China around 4 months ago. | Around July 2022. | Around the first quarter of 2020. | Around January 20, 2020. | |
| What is the HTI ESG of Triangle Tyre? | 3.0-4.0-4.0 | 3.0–4.0–4.0 (E–S–G) | HTI ESG of Triangle Tyre is 0-5. | Maintain Outperform | 3 | The HTI ESG of Triangle Tyre is 601163 CH. | 3.0-4.0-4.0 | 3.0-4.0-4. | 3.0-4.0-4.0 | BBB |
| What are the main colors of text in the document? | Blue and black | Black and blue (with occasional gray). | Text is blue | white, orange, black | Black and blue. | The main colors of text in the document are black and blue. | Not specified. The provided text contains no color information, so the main text colors can’t be determined. | Not specified. | Black and blue. | |
| What points does SWOT refer to? | Strength, weakness, opportunities, threatens | Strengths, Weaknesses, Opportunities, and Threats. | SWOT refers to the number of points a person has earned in a given year. | Strength, Weakness, Opportunity, Threat | Strengths, Weaknesses, Opportunities, and Threats. | SWOT refers to Strengths, Weaknesses, Opportunities, and Threats. | Strengths, Weaknesses, Opportunities, and Threats. | Strengths, Weaknesses, Opportunities, Threats. | Not specified in the document. | Strengths, Weaknesses, Opportunities, Threats. |
| What is the willingness of residents to travel in the first quarter of 2022? | 85.32% | 85.32% | 0% | 85.32% | 86.3% | 78% | 85.32% | High. | 53.7% | |
| How many charts are in the page 4? | 2 charts | 2 | 3 | 8 | 6 | There are no charts on page 4. | 3 | 3 | 0 | |
| What type of needle is needed for the Little Star Cowl? | 12 mm (US 17) 16” circular needle | A 12 mm (US 17) 16-inch circular needle. | A 12 mm (US 17) circular needle | Circular needle | A 16-inch circular needle. | US 17) 16" circular needle | A 16" circular needle, size 12 mm (US 17). | A 16-inch circular needle (12 mm/US 17). | 12 mm (US 17) 16-inch circular needle. | 12 mm (US 17) 16-inch circular needle |
| What color is the textile shown in the document? | Red | Red | Blue | Red | Red | Red | Not specified. | Not specified. | Unknown | Red |
| What is the main section following 'ABBREVIATIONS'? | CONSTRUCTION | CONSTRUCTION: | The main section following 'ABBREVIATIONS' is the main section following 'ABBREVIATIONS'. | CONSTRUCTION | CONSTRUCTION | The main section following 'ABBREVIATIONS' is the 'SPECIFICATIONS' section. | Needles. | Pattern | CONSTRUCTION | |
| What are the horizontal and vertical axes of Figure 4? | Distance and densitu profile | Horizontal: Distance x; Vertical: Density profile. | The horizontal and vertical axes of Figure 4 are a horizontal and vertical axis. | x-axis: distance, y-axis: gauss | Horizontal: Orientation (degrees); Vertical: Normal error (degrees). | The horizontal axis is labeled as \\( d_x \\) and the vertical axis is labeled as \\( d_y \\). | Horizontal axis: Orientation angle (degrees) Vertical axis: Angular error (radians) | Horizontal: distance from the surface (x − x0). Vertical: density D (0–1). | Horizontal axis: None; Vertical axis: None. | x and y |
| According to the current investment rating, how much does the stock rise at least relative to the Shanghai and Shenzhen 300 index? | 20% | At least 20% | 12% | The stock rises at least 200 relative to the Shanghai and Shenzhen 300 index | ≥20% | The stock rises at least 194.86% relative to the Shanghai and Shenzhen 300 index. | At least 20% above the CSI 300 (Shanghai and Shenzhen 300) index. | 20% | 20% | |
| What is the temperature range for the testing of paper chromatography reagents? | 110–120°C | 110–120°C | 10 - 20 wg. | 0 to 20°C | 110–120 °C | The temperature range for the testing of paper chromatography reagents is 110-120°C. | 20–25 °C. | 110–120°C | 110–120°C, 60–80°C, and 40–45°C. | 110–120 °C |
| What is the distribution coefficient of Benzol? | 160 | 160 | The distribution coefficient of Benzol is K. | 160 | 160 | The distribution coefficient of Benzol (Benzol) is 160. | 160 | 10.4 | 150 | |
The full results of MoDora and baselines are shown in Results.
MMDA is a benchmark with 537 documents and 1065 questions curated from over one million real-world documents. We perform layout-emphasized clustering to obtain these representative documents, most of which are semi-structured. QA pairs are then annotated by combining automatic LLM generation with manual verification. The questions cover different aspects of a document (e.g. hierarchy, text, table, chart, image, location, formatted) to comprehensively evaluate semi-structured document analysis performance.
The dataset is available here: MMDA. Some documents involving sensitive data are hidden. If you believe any content in this open source dataset infringes upon your copyright, please contact us, and we will remove it.
The following chart shows the AIC-Acc and ACNLS scores of different methods. AIC-Acc and ACNLS are modified versions of the Acc and ANLS metrics, respectively.
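For orientation, the standard ANLS metric that ACNLS builds on can be sketched as below. This is the commonly used definition (normalized Levenshtein similarity with a 0.5 cut-off, best match over the reference answers, averaged over questions), not MoDora's ACNLS, whose modification is not described here. The function names, the lowercasing and whitespace stripping, and the requirement that every question has at least one reference are assumptions of this sketch.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def nls(prediction: str, reference: str, threshold: float = 0.5) -> float:
    """Normalized Levenshtein similarity, set to 0 once the distance reaches the threshold."""
    # Assumption: compare case-insensitively and ignore surrounding whitespace.
    pred, ref = prediction.strip().lower(), reference.strip().lower()
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0
    distance = levenshtein(pred, ref) / longest
    return 1.0 - distance if distance < threshold else 0.0


def anls(predictions: Sequence[str], references: Sequence[Sequence[str]]) -> float:
    """Average over questions of the best similarity against any reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    scores = [max(nls(pred, ref) for ref in refs)
              for pred, refs in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

With two of the example answers above, `anls(["QR code", "3%"], [["QR code"], ["2.36%"]])` evaluates to 0.5: the first answer matches exactly, while "3%" is more than half its length away from "2.36%" and scores 0.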
Baselines
Content extraction methods:
Structure extraction methods:
End-to-End model methods:
Retrieval-Augmented Generation methods:
First, grant execution permissions and initialize your configuration:
# Grant execution permissions
chmod +x setup.sh run.sh start_backend.sh start_frontend.sh
# Initialize config from template
cp local.example.json local.json

Then, open `local.json` and fill in the required fields.
Critical configurations in local.json:
- API Keys (Required): Ensure that the `api_key` for at least one model instance in `model_instances` (e.g., `GPT-5`) is filled in correctly so that the system can run.
- Vector Search (Optional): If you enable `enable_vector_search`, you MUST fill in `embedding_api_key` (and `rerank_api_key` if a rerank model is used). These are not required if vector search is disabled.
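The two rules above can be checked before starting the services. The following is a minimal sketch, not part of MoDora: `check_config` and `check_config_file` are hypothetical helpers, and the sketch assumes that `enable_vector_search` and `embedding_api_key` are top-level keys of `local.json` and that `model_instances` maps instance names to objects. It does not check `rerank_api_key`, because how a rerank model is declared is not described here.

```python
import json
from pathlib import Path
from typing import Any


def check_config(config: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems: list[str] = []

    # Rule 1: at least one model instance must carry a non-empty api_key.
    instances = config.get("model_instances", {})
    has_key = any(
        isinstance(inst, dict) and str(inst.get("api_key", "")).strip()
        for inst in instances.values()
    )
    if not has_key:
        problems.append("no entry in model_instances has an api_key")

    # Rule 2: vector search requires an embedding key.
    # Assumption: both fields are top-level keys of local.json.
    if config.get("enable_vector_search"):
        if not str(config.get("embedding_api_key", "")).strip():
            problems.append("enable_vector_search is on but embedding_api_key is empty")

    return problems


def check_config_file(path: str = "local.json") -> list[str]:
    """Load the JSON config from disk and run the same checks."""
    return check_config(json.loads(Path(path).read_text(encoding="utf-8")))
```

Running `check_config_file()` from the repository root after editing `local.json` returns the list of problems, so an empty list is the signal to proceed.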
We provide a one-click setup script that automatically creates a virtual environment and installs all dependencies (including PyTorch, LMDeploy, FlashAttention, and PaddleOCR):
# Run the setup script (this may take a while)
./setup.sh

Local models are NOT automatically downloaded. By default, MoDora uses remote models (GPT-5).
Recommendation: If you have a GPU with 24GB+ VRAM, we highly recommend using Qwen3-VL-8B-Instruct as a local model for better performance and privacy.
To use a local model, download it and update your local.json:
# Activate venv after setup.sh
source MoDora-backend/venv/bin/activate
# Download the recommended model: Qwen3-VL-8B-Instruct
python download_model.py --repo_id Qwen/Qwen3-VL-8B-Instruct --local_dir ./MoDora-backend/models/Qwen3-VL-8B-Instruct

Example `local.json` update:

"model_instances": {
  "Qwen3-VL-8B-Instruct": {
    "type": "local",
    "model": "MoDora-backend/models/Qwen3-VL-8B-Instruct",
    "port": 9001,
    "device": "0"
  }
},
"ui_settings": {
  "pipelines": {
    "enrichment": { "modelInstance": "Qwen3-VL-8B-Instruct" },
    "retriever": { "modelInstance": "Qwen3-VL-8B-Instruct" }
  }
}

Once installation and configuration are complete, you can use `run.sh` to start both backend and frontend services simultaneously:
./run.sh

Alternatively, you can start the backend and frontend separately in two terminal tabs:
# Terminal 1: Backend
./start_backend.sh
# Terminal 2: Frontend
./start_frontend.sh

Note on Ports: Both scripts automatically detect the API port from `local.json` (defaulting to `8005`). This ensures the frontend proxy correctly points to the backend even when started manually. You can customize the port by modifying the `"api_port"` field in your `local.json`.
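The port lookup described in this note can be reproduced as follows, for example to point a custom client at the backend. This is a sketch of the described behaviour, not the code the start scripts use: `port_from_config` and `read_api_port` are hypothetical names, and `api_port` is assumed to be a top-level key of `local.json`.

```python
import json
from pathlib import Path
from typing import Any

DEFAULT_API_PORT = 8005  # default stated in the note on ports


def port_from_config(config: dict[str, Any]) -> int:
    """Return api_port from a parsed config, or the default when the key is absent."""
    # Assumption: api_port is a top-level key.
    return int(config.get("api_port", DEFAULT_API_PORT))


def read_api_port(config_path: str = "local.json") -> int:
    """Read api_port from the config file, falling back to the default if the file is missing."""
    path = Path(config_path)
    if not path.is_file():
        return DEFAULT_API_PORT
    return port_from_config(json.loads(path.read_text(encoding="utf-8")))
```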
Note: The CLI is primarily designed for experimental purposes, such as offline dataset preprocessing and batch evaluation. Before using the CLI, please ensure you have downloaded the MMDA dataset to the `datasets/MMDA` directory.
MoDora provides a comprehensive CLI for offline experiments, dataset preprocessing, and batch evaluation.
First, activate the virtual environment:
source MoDora-backend/venv/bin/activate

Basic usage:
# General help
modora --help
# Subcommand help
modora <command> --help

- OCR & Component Extraction: Process raw PDFs to extract layout blocks and components.

  modora ocr --dataset datasets/MMDA --cache-dir MoDora-backend/cache_v5

- Tree Construction: Build document hierarchy trees (`tree.json`) from extracted components.

  modora build-tree --dataset datasets/MMDA --cache-dir MoDora-backend/cache_v5

- Single Document QA: Ask a question about a specific document using its constructed tree.

  modora qa <pdf_path> <tree_json_path> "Your question here"

- Batch QA Experiment: Run multiple questions from a dataset against the constructed trees.

  modora batch-qa --dataset datasets/MMDA/test.json --cache MoDora-backend/cache_v5 --output MoDora-backend/tmp

- Evaluation & Analysis: Calculate metrics (Accuracy, ANLS, ACNLS) and generate analysis charts.

  # By default, evaluation results and charts will be saved alongside the result.json file
  modora evaluate --input datasets/MMDA/test.json --result MoDora-backend/tmp/result.json

  This command will:
  - Update `result.json` with calculated metrics (Accuracy, ANLS, etc.).
  - Save the detailed evaluation to `evaluation.jsonl` in the same directory as `result.json`.
  - Generate accuracy/ACNLS bar charts and summary CSVs in the same directory.
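The batch stages above (OCR, tree construction, batch QA, evaluation) can be chained from a small driver script. The sketch below only assembles the commands exactly as listed and prints them by default; `pipeline_commands` and `run_pipeline` are hypothetical helpers, and the sketch assumes that `batch-qa` writes `result.json` into its `--output` directory, as the paths in the commands above suggest.

```python
import shlex
import subprocess


def pipeline_commands(
    dataset_dir: str = "datasets/MMDA",
    cache_dir: str = "MoDora-backend/cache_v5",
    output_dir: str = "MoDora-backend/tmp",
) -> list[list[str]]:
    """Build the four batch-stage commands in execution order."""
    questions = f"{dataset_dir}/test.json"
    return [
        ["modora", "ocr", "--dataset", dataset_dir, "--cache-dir", cache_dir],
        ["modora", "build-tree", "--dataset", dataset_dir, "--cache-dir", cache_dir],
        ["modora", "batch-qa", "--dataset", questions,
         "--cache", cache_dir, "--output", output_dir],
        # Assumption: batch-qa leaves result.json inside output_dir.
        ["modora", "evaluate", "--input", questions,
         "--result", f"{output_dir}/result.json"],
    ]


def run_pipeline(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print("$", shlex.join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing stage.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline(pipeline_commands())
```

Keeping `dry_run=True` as the default makes it easy to inspect the exact commands before committing to a long OCR run; pass `dry_run=False` inside the activated virtual environment to execute them.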
If you like this project, please cite our paper:
@article{xu2026modora,
author = {Bangrui Xu and Qihang Yao and Zirui Tang and Xuanhe Zhou and Yeye He and Shihan Yu and Qianqian Xu and Bin Wang and Guoliang Li and Conghui He and Fan Wu},
title = {MoDora: Tree-Based Semi-Structured Document Analysis System},
journal = {ACM SIGMOD},
year = {2026}
}