v0.10.0 — Word (.docx) to PDF Conversion

@shps951023 released this 06 Mar 01:15 · 232 commits to main since this release

Highlights

This release adds DOCX-to-PDF conversion through a new, zero-dependency Word document renderer. MiniPdf can now convert .docx files to PDF with paragraph, table, and image support. Across 60 benchmark test cases, its output averages an overall score of 97.4% when compared against LibreOffice reference output.

New Features

DOCX Reader (DocxReader.cs — 727 lines)

• Full OOXML paragraph parsing: text runs, bold/italic/underline/strikethrough, font sizes, font colors, highlight colors
• Heading styles (Heading1–Heading9) with automatic font size mapping from styles.xml
• Paragraph alignment (left, center, right, justify) and indentation (left, right, hanging, firstLine)
• Bullet and numbered list support with numId/ilvl detection
• Tab stop parsing with position and alignment (left, center, right, decimal)
• Paragraph shading / background color support
• Table parsing with cell content, borders, shading, column spans (gridSpan), and grid column widths
• Embedded image extraction via relationships with EMU-to-point dimension conversion
• Page layout reading from sectPr: page size, margins, orientation
• Page break detection (w:br with type="page", and lastRenderedPageBreak)
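
Two of the reader steps above, EMU-to-point conversion and page break detection, can be illustrated with a short Python sketch. This is not the DocxReader.cs code: the function names and the sample XML are hypothetical, and only the OOXML namespaces and the unit ratio (914400 EMU per inch, 72 points per inch) are standard.

```python
import xml.etree.ElementTree as ET

# Standard OOXML namespaces for WordprocessingML and inline drawings.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"

# 914400 EMU per inch divided by 72 points per inch.
EMU_PER_POINT = 12700


def emu_to_points(emu: int) -> float:
    """Convert English Metric Units to PDF points."""
    return emu / EMU_PER_POINT


def has_page_break(paragraph: ET.Element) -> bool:
    """True if the paragraph holds an explicit or a rendered page break."""
    for br in paragraph.iter(f"{{{W_NS}}}br"):
        if br.get(f"{{{W_NS}}}type") == "page":
            return True
    return paragraph.find(f".//{{{W_NS}}}lastRenderedPageBreak") is not None


def image_size_points(extent: ET.Element) -> tuple[float, float]:
    """Read cx/cy (in EMU) from a wp:extent element and return points."""
    return emu_to_points(int(extent.get("cx"))), emu_to_points(int(extent.get("cy")))


# Hypothetical paragraph: one page break and one inline image of 1in x 0.5in.
SAMPLE = f"""<w:p xmlns:w="{W_NS}" xmlns:wp="{WP_NS}">
  <w:r><w:br w:type="page"/></w:r>
  <w:r><w:drawing><wp:inline>
    <wp:extent cx="914400" cy="457200"/>
  </wp:inline></w:drawing></w:r>
</w:p>"""

if __name__ == "__main__":
    p = ET.fromstring(SAMPLE)
    print(has_page_break(p))
    print(image_size_points(p.find(f".//{{{WP_NS}}}extent")))
```

A w:br without a type attribute is an ordinary line break, which is why the sketch checks the attribute value rather than the element alone.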

DOCX-to-PDF Converter (DocxToPdfConverter.cs — 682 lines)

• Paragraph rendering with mixed formatting runs, line wrapping, and proper line spacing
• Heading rendering with bold weight and scaled font sizes
• Text alignment: left, center, right, justified
• List rendering with bullet (•) and numbered (1., 2., …) prefixes at correct indentation
• Tab stop handling with leader positioning
• Paragraph shading rendered as filled rectangles behind text
• Table rendering with cell borders, shading fills, column-width distribution, and automatic row height
• Image embedding as inline JPEG XObjects with aspect-ratio-aware scaling
• Page layout support: reads page dimensions and margins from DOCX sectPr
• Automatic page breaks: content overflow and explicit w:br type="page" handling
• Configurable ConversionOptions: font size, margins, line spacing, page dimensions
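
As an illustration of aspect-ratio-aware scaling, the sketch below shrinks an image to fit the available content box while keeping its proportions. The exact rule in DocxToPdfConverter.cs is not given in these notes, so this is an assumption: the function name is hypothetical, and the choice never to upscale small images is also assumed.

```python
def fit_image(width_pt: float, height_pt: float,
              max_width_pt: float, max_height_pt: float) -> tuple[float, float]:
    """Scale an image down to fit a box, preserving its aspect ratio.

    Assumption: images smaller than the box keep their original size.
    """
    if width_pt <= 0 or height_pt <= 0:
        raise ValueError("image dimensions must be positive")
    # One factor for both axes keeps the ratio; 1.0 caps it so we never upscale.
    scale = min(1.0, max_width_pt / width_pt, max_height_pt / height_pt)
    return width_pt * scale, height_pt * scale


if __name__ == "__main__":
    # A 900 x 450 pt image placed in a 450 x 700 pt content area.
    print(fit_image(900, 450, 450, 700))
```

Using the smaller of the two axis ratios guarantees that neither dimension overflows the box.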

Unified API

• MiniPdf.ConvertToPdf() now auto-detects .docx files by extension, so existing callers need no API change
• New MiniPdf.ConvertDocxToPdf(Stream) method for stream-based DOCX conversion
• Updated NuGet description and tags to include word and docx
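
The extension-based detection can be pictured with the following Python sketch. It is not the MiniPdf.cs implementation: pick_converter and its return labels are hypothetical, and the case-insensitive comparison is an assumption about how the library matches extensions.

```python
from pathlib import Path


def pick_converter(path: str) -> str:
    """Return a label for the conversion route chosen from the file extension.

    Hypothetical helper: "docx" stands for the new Word route, "default"
    for whatever the entry point did before this release.
    """
    # Assumption: matching ignores case, so "Report.DOCX" is treated as Word.
    suffix = Path(path).suffix.lower()
    return "docx" if suffix == ".docx" else "default"


if __name__ == "__main__":
    for name in ("report.docx", "Report.DOCX", "data.bin"):
        print(name, "->", pick_converter(name))
```

Dispatching inside the existing entry point is what lets callers keep a single call for every supported input type.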

Tests

• 9 new unit tests in DocxToPdfConverterTests.cs covering: simple documents, bold text, tables, empty documents, multi-paragraph, stream input, and file output
• 60 DOCX benchmark test cases with visual comparison against LibreOffice reference PDFs

Benchmark

• 60 DOCX test cases (classic01–classic60): single paragraph, multiple paragraphs, headings, bold/italic, font sizes, font colors, alignment, bullet lists, numbered lists, simple tables, table shading, mixed content, images, long documents, multi-page tables, comprehensive reports, and more
• Average Overall Score: 0.9739 (text similarity + visual comparison vs LibreOffice)
• Benchmark scripts: Run-Benchmark_docx.ps1, generate_reference_pdfs_docx.py, compare_pdfs.py with DOCX mode

Other Changes

• Added .gitattributes to configure GitHub Linguist for Python scripts
• Updated README badges: replaced .NET badge with Gitee link across all language variants (EN, zh-CN, zh-TW, ja, ko, fr, it)
• README now includes DOCX benchmark visual comparison table with MiniPdf vs Reference side-by-side images

Files Changed

• DocxReader.cs (+727 lines, new): OOXML document parser
• DocxToPdfConverter.cs (+682 lines, new): DOCX-to-PDF rendering engine
• MiniPdf.cs (+23 lines): .docx auto-detection and ConvertDocxToPdf() API
• MiniPdf.csproj: updated description and package tags
• 192 files changed in total (including benchmark images, reports, and scripts)

Full Changelog: v0.9.0...v0.10.0