# FileFlux Architecture

Architectural overview of the document processing SDK for RAG systems.
FileFlux follows clean architecture principles with a two-package structure:

- **FileFlux.Core**: Pure document extraction with zero AI dependencies
  - Standard document readers (PDF, DOCX, XLSX, PPTX, MD, TXT, JSON, CSV, HTML)
  - Core interfaces and domain models
  - AI service interface definitions (no implementations)
- **FileFlux**: Full RAG pipeline (interface-driven)
  - MultiModal document readers (AI-enhanced)
  - AI service interfaces for consumer implementation
  - Chunking strategies (FluxCurator)
  - Content enhancement (FluxImprover)
  - Processing orchestration

Design principles:

- Extensible plugin architecture
- Loose coupling through dependency injection
- Strategy and Factory patterns
- Simplified property names: `Properties` → `Props`, `ChunkIndex` → `Index`
- Props dictionary pattern: extensible metadata storage
- Guid-based traceability: track entire pipeline stages
- Simplified `Quality`: changed from a complex object to a `double`
- Unified API: batch and streaming integrated through `IDocumentProcessor`
```mermaid
graph TB
    A[Client Application] --> B[IDocumentProcessor]
    B --> C[DocumentProcessor]
    C --> D[IDocumentReaderFactory]
    C --> E[IChunkingStrategyFactory]
    C --> M[IMetadataEnricher]
    D --> F[PdfReader]
    D --> G[WordReader]
    D --> H[ExcelReader]
    D --> I[PowerPointReader]
    D --> J[MarkdownReader]
    D --> K[TextReader]
    D --> L[JsonReader]
    D --> N[CsvReader]
    D --> O[HtmlReader]
    E --> P[AutoChunkingStrategy]
    E --> Q[SmartChunkingStrategy]
    E --> R[IntelligentChunkingStrategy]
    E --> S[SemanticChunkingStrategy]
    E --> T[ParagraphChunkingStrategy]
    E --> U[FixedSizeChunkingStrategy]
    E --> V[MemoryOptimizedIntelligentStrategy]
    M --> X[AIMetadataEnricher]
    M --> Y[RuleBasedMetadataExtractor]
    C --> W["DocumentChunk[]"]
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#fff3e0
    style W fill:#e8f5e8
    style M fill:#e8eaf6
```
FileFlux uses a two-package architecture for flexibility:
```
FileFlux.Core/                     # Extraction-Only Package (Zero AI Dependencies)
├── Exceptions/                    # Exception types
│   ├── FileFluxException
│   ├── DocumentProcessingException
│   └── UnsupportedFileFormatException
├── Infrastructure/
│   └── Readers/                   # Standard Document Readers
│       ├── PdfDocumentReader
│       ├── WordDocumentReader
│       ├── ExcelDocumentReader
│       ├── PowerPointDocumentReader
│       ├── MarkdownDocumentReader
│       ├── HtmlDocumentReader
│       ├── TextDocumentReader
│       ├── JsonDocumentReader
│       └── CsvDocumentReader
├── Utils/                         # Utilities
│   └── FileNameHelper
├── IDocumentReader.cs             # Reader interface
├── IDocumentParser.cs             # Parser interface
├── IChunkingStrategy.cs           # Strategy interface
├── DocumentChunk.cs               # Chunk model
├── RawContent.cs                  # Extraction result model
├── ParsedContent.cs               # Parsed content model
└── ChunkingOptions.cs             # Options model

FileFlux/                          # Full RAG Pipeline Package
├── Core/                          # AI Service Interfaces
│   ├── IDocumentProcessor
│   ├── IDocumentAnalysisService   # AI text generation interface
│   ├── IImageToTextService        # Vision AI interface
│   ├── IImageRelevanceEvaluator   # Image relevance interface
│   ├── IEmbeddingService          # Embedding generation interface
│   ├── IMetadataEnricher
│   └── Factories/
├── Infrastructure/
│   ├── Readers/                   # MultiModal Readers (AI-enhanced)
│   │   ├── MultiModalPdfDocumentReader
│   │   ├── MultiModalWordDocumentReader
│   │   ├── MultiModalExcelDocumentReader
│   │   └── MultiModalPowerPointDocumentReader
│   ├── Strategies/                # Chunking Strategies
│   │   ├── AutoChunkingStrategy
│   │   ├── SmartChunkingStrategy
│   │   ├── IntelligentChunkingStrategy
│   │   └── SemanticChunkingStrategy
│   ├── Languages/                 # Language Profiles
│   │   └── LanguageProfiles.cs
│   ├── Services/                  # Processing Services
│   │   ├── AIMetadataEnricher
│   │   ├── FluxCurator
│   │   └── FluxImprover
│   └── Factories/                 # Factory implementations
└── DocumentProcessor.cs           # Main orchestrator
```
```
┌─────────────────────────────────────────────────────────────┐
│                         Client Layer                        │
│  • Application Code                                         │
│  • RAG Systems Integration                                  │
│  • AI Service Implementation                                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────┐    ┌─────────────────────────┐ │
│  │      FileFlux.Core      │    │        FileFlux         │ │
│  │    (Extraction Only)    │    │   (Full RAG Pipeline)   │ │
│  │                         │    │                         │ │
│  │ • Document Readers      │────│ • Chunking Strategies   │ │
│  │ • Core Interfaces       │    │ • FluxCurator           │ │
│  │ • Domain Models         │    │ • FluxImprover          │ │
│  │ • AI Service Contracts  │    │ • DocumentProcessor     │ │
│  │ • Zero AI Dependencies  │    │ • Orchestration         │ │
│  │                         │    │                         │ │
│  └─────────────────────────┘    └─────────────────────────┘ │
│                                                             │
│  Use Case:                      Use Case:                   │
│  - Extract documents only       - Full processing pipeline  │
│  - Implement own chunking       - Use built-in strategies   │
│  - Minimal dependencies         - AI-enhanced features      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
## Core Components

### 1. IDocumentProcessor

**Role**: Single entry point for all document processing with explicit state management

**Key Methods**:
- `ExtractAsync()`: Stage 1 - Extract raw content
- `RefineAsync()`: Stage 2 - Refine and structure analysis
- `ChunkAsync()`: Stage 3 - Apply chunking strategy
- `EnrichAsync()`: Stage 4 - LLM-powered enrichment
- `ProcessAsync()`: Run complete pipeline

**Properties**:
- `State`: Current processor state (Created → Extracted → Refined → Chunked → Enriched)
- `Result`: Accumulated results across all stages
- `FilePath`: Source document path

**Responsibilities**: Pipeline orchestration, state management, error handling, result validation
**5-Stage Processing Pipeline**:

📂 Extract → 🔄 Refine → 🤖 LLM-Refine → 📦 Chunk → ✨ Enrich

| Stage | Interface | Description | Output |
|-------|-----------|-------------|--------|
| Extract | `IDocumentReader` | Raw content extraction from files | `RawContent` |
| Refine | `IDocumentRefiner` | Text cleaning, normalization, structure analysis | `RefinedContent` |
| LLM-Refine | `ILlmRefiner` | LLM-powered noise removal, sentence restoration (optional, skipped if LLM unavailable) | `LlmRefinedContent` |
| Chunk | `IChunkerFactory` | Text segmentation into chunks | `IReadOnlyList<DocumentChunk>` |
| Enrich | `IDocumentEnricher` | LLM-powered summaries, keywords, contextual text | `DocumentGraph` |

**State Flow**:

Created → Extracted → Refined → LlmRefined → Chunked → Enriched → Disposed
**Auto-Dependency Resolution**:
- Calling `RefineAsync()` automatically runs `ExtractAsync()` if not already completed
- Calling `LlmRefineAsync()` automatically runs `ExtractAsync()` and `RefineAsync()` if needed
- Calling `ChunkAsync()` automatically runs `ExtractAsync()` and `RefineAsync()` if needed
- Each stage is idempotent: calling a stage twice returns cached results
**Dependency Management**:
- Optional logger support via NullLogger pattern
- Optional IDocumentRefiner for refinement stage
- Optional IDocumentEnricher for enrichment stage
- All optional dependencies use graceful fallback strategies
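A minimal usage sketch of the staged pipeline, assuming a DI-registered `IDocumentProcessor`; `ProcessAsync(path, options)` appears later in this document, while the exact parameters of the per-stage calls are assumptions:

```csharp
// Staged pipeline sketch; stage method names follow the list above,
// but exact parameters are assumptions.
var processor = serviceProvider.GetRequiredService<IDocumentProcessor>();

// Option A: run the complete pipeline in one call.
var chunks = await processor.ProcessAsync("document.pdf", new ChunkingOptions
{
    Strategy = "Auto",
    MaxChunkSize = 1024,
    OverlapSize = 128
});

// Option B: drive the stages explicitly. Auto-dependency resolution means
// ChunkAsync() first runs Extract and Refine if they have not completed,
// and repeated calls return cached results (idempotent stages).
await processor.ExtractAsync();   // State: Created   -> Extracted
await processor.RefineAsync();    // State: Extracted -> Refined
await processor.ChunkAsync();     // State: Refined   -> Chunked
```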
### 2. Extract Stage Design

**RawContent Definition**:

`RawContent` is not the original file bytes. It is "normalized text with semantic structure preserved, extracted without AI dependencies".
This design decision optimizes for RAG pipeline quality:
- HTML tags are noise for RAG → Extract converts to Markdown
- PDF layout info is valuable → Extract preserves structure hints
- Markdown is already structured → Extract preserves as-is
**Extract vs Refine Role Separation**:
| Stage | Primary Responsibility | Examples |
|---|---|---|
| Extract | Format normalization | HTML→Markdown, metadata extraction, noise tag removal |
| Refine | Quality enhancement | OCR correction, image→text, whitespace cleanup, header/footer removal |
**Reader-specific Extract Output**:

| Reader | Extract Output | Rationale |
|--------|----------------|-----------|
| `HtmlDocumentReader` | Markdown (headings, lists, links, code blocks) | HTML tags are RAG noise; preserve semantic structure only |
| `MarkdownDocumentReader` | Original Markdown (parsed and validated) | Already structured; no conversion needed |
| `PdfDocumentReader` | Text + structural hints (has_tables, page_count) | Layout info aids chunking decisions |
| `TextDocumentReader` | Original text | No transformation required |
| `JsonDocumentReader` | Flattened text with key paths | Preserve hierarchy context |
| `CsvDocumentReader` | Text with row/column structure | Tabular context for RAG |
**Example: HTML Extract Output**:

````markdown
# Document Title

## Section 1

This is paragraph content with [a link](https://example.com).

- List item 1
- List item 2
  - Nested item

--- TABLE: User Data ---
Name | Age | Role
John | 30 | Developer
--- END TABLE ---

```javascript
console.log("Code block preserved");
```
````
**Why HTML Extract outputs Markdown**:
1. **Single conversion point**: HTML→Markdown happens once in Extract, not twice
2. **RAG optimization**: Chunking stage receives structured input for better segmentation
3. **Consistency**: All Readers output text suitable for immediate chunking
4. **Performance**: No intermediate format transformation overhead
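For illustration, a direct-reader sketch; the `ExtractAsync` shape mirrors the custom-reader example later in this document:

```csharp
// Sketch: extract an HTML file straight to Markdown-normalized RawContent.
var reader = new HtmlDocumentReader();
RawContent raw = await reader.ExtractAsync("page.html");

// raw.Text is Markdown: headings, lists, links, and code blocks preserved,
// presentation tags (<div>, <span>, CSS classes) stripped.
Console.WriteLine(raw.Text);
```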
### 3. IDocumentRefiner (Stage 2)
**Role**: Content refinement and structure extraction
**Key Methods**:
- `RefineAsync(RawContent, RefineOptions)`: Refine raw content
**Output (RefinedContent)**:
- `Text`: Cleaned and normalized text
- `Sections`: Document structure (headings, paragraphs)
- `Structures`: Structured elements (code blocks, tables, images)
- `Metadata`: Document metadata (filename, type, created date)
- `Quality`: Refinement quality scores
**RefineOptions**:
- `CleanNoise`: Remove extra whitespace, normalize formatting
- `ConvertToMarkdown`: Convert to markdown for better structure
- `BuildSections`: Extract document sections from headings
- `ExtractStructures`: Extract code blocks, tables, images
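A hedged sketch of a Refine call using the options above; how the `IDocumentRefiner` instance is obtained, and the shape of `Sections`, are assumptions:

```csharp
// Sketch: refine previously extracted RawContent. Option names come from
// the RefineOptions list above; refiner resolution is an assumption.
var refiner = serviceProvider.GetRequiredService<IDocumentRefiner>();

var refined = await refiner.RefineAsync(rawContent, new RefineOptions
{
    CleanNoise = true,         // remove extra whitespace, normalize formatting
    ConvertToMarkdown = true,  // convert to markdown for better structure
    BuildSections = true,      // extract document sections from headings
    ExtractStructures = true   // extract code blocks, tables, images
});

// RefinedContent exposes the cleaned text plus structure and quality data.
Console.WriteLine(refined.Text);
Console.WriteLine($"Sections: {refined.Sections.Count}");
```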
### 4. IDocumentEnricher (Stage 4)
**Role**: LLM-powered chunk enrichment and graph building
**Key Methods**:
- `EnrichAsync(chunks, refinedContent, options)`: Enrich chunks with LLM
- `EnrichStreamAsync(chunks, refinedContent)`: Streaming enrichment
- `BuildGraphAsync(enrichedChunks, options)`: Build document graph
**Output (EnrichmentResult)**:
- `Chunks`: List of enriched chunks with metadata
- `Graph`: Document graph showing inter-chunk relationships
- `Stats`: Enrichment statistics
**EnrichedDocumentChunk**:
- `Chunk`: Original DocumentChunk
- `Summary`: LLM-generated chunk summary
- `Keywords`: Extracted keywords with relevance scores
- `ContextualText`: Context for RAG retrieval
- `SearchableText`: Combined text for search
**DocumentGraph**:
- `Nodes`: Chunk nodes with metadata
- `Edges`: Relationships (sequential, hierarchical, semantic)
- `NodeCount`, `EdgeCount`: Graph statistics
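A sketch of the enrichment stage using the member names listed above; how the chunks, refined content, and options are obtained is assumed:

```csharp
// Sketch: enrich chunks with an LLM-backed IDocumentEnricher.
var enricher = serviceProvider.GetRequiredService<IDocumentEnricher>();
var result = await enricher.EnrichAsync(chunks, refinedContent, options);

foreach (var enriched in result.Chunks)
{
    Console.WriteLine(enriched.Summary);        // LLM-generated chunk summary
    Console.WriteLine(enriched.SearchableText); // combined text for search
}

// The graph captures sequential, hierarchical, and semantic relationships.
Console.WriteLine($"Graph: {result.Graph.NodeCount} nodes, {result.Graph.EdgeCount} edges");
```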
### 5. Legacy DocumentProcessor
**5-Stage Processing Pipeline** (maintained for backward compatibility):
📂 Extract → 📄 Parse → 🔄 Refine → 📦 Chunk → ✨ Enhance
| Stage | Description | Component |
|-------|-------------|-----------|
| **Extract** | Raw content extraction from files | IDocumentReader |
| **Parse** | Structure analysis and parsing | IDocumentParser |
| **Refine** | Content transformation (Markdown, Image-to-Text) | IMarkdownConverter, IImageToTextService |
| **Chunk** | Text segmentation into chunks | IChunkerFactory (FluxCurator) |
| **Enhance** | Metadata enrichment and quality scoring | FluxImprover |
**Refine Stage (Default Enabled)**:
- `ConvertToMarkdown = true`: Markdown conversion for structure preservation (enabled by default)
- `ProcessImagesToText = false`: Image text extraction (opt-in, cost consideration)
### 6. IDocumentReader (Content Extraction)
**Current Implementations**:
- **PdfDocumentReader**: PDF text and image extraction
- **WordDocumentReader**: DOCX with style preservation
- **ExcelDocumentReader**: XLSX multi-sheet support
- **PowerPointDocumentReader**: PPTX slide extraction
- **MarkdownDocumentReader**: Markdown structure preservation
- **HtmlDocumentReader**: HTML content extraction
- **TextDocumentReader**: Plain text processing
- **JsonDocumentReader**: JSON structured data
- **CsvDocumentReader**: CSV table data
### 7. IChunkingStrategy (Content Splitting)
**Strategy Types**:
- **AutoChunkingStrategy**: Automatic strategy selection (recommended)
- **SmartChunkingStrategy**: Sentence boundary-based with high completeness
- **IntelligentChunkingStrategy**: LLM-based semantic boundary detection
- **MemoryOptimizedIntelligentChunkingStrategy**: Memory-efficient intelligent chunking
- **SemanticChunkingStrategy**: Sentence-based semantic chunking
- **ParagraphChunkingStrategy**: Paragraph-level segmentation
- **FixedSizeChunkingStrategy**: Fixed-size token-based chunking
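Strategies are selected by name through `ChunkingOptions` (see Main Settings below); for example:

```csharp
// Choose a strategy from the list above by name.
var options = new ChunkingOptions
{
    Strategy = "Intelligent", // LLM-based semantic boundary detection
    MaxChunkSize = 512,
    OverlapSize = 64
};

var chunks = await processor.ProcessAsync("spec.docx", options);
```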
### 8. ILanguageProfile (Multilingual Text Segmentation)
**Purpose**: Language-specific rules for accurate sentence boundary detection and text segmentation.
**Key Properties**:
- `LanguageCode`: ISO 639-1 code (en, ko, zh, ja, etc.)
- `ScriptCode`: ISO 15924 script code (Latn, Hang, Hans, Arab, Deva, Cyrl, Jpan)
- `WritingDirection`: LTR, RTL, or TopToBottom
- `NumberFormat`: Decimal/thousands separator conventions
- `QuotationMarks`: Language-specific quote characters
- `SentenceEndPattern`: Regex for sentence boundaries
- `Abbreviations`: Non-breaking abbreviation list
- `CategorizedAbbreviations`: Typed abbreviations (Prepositive/Postpositive/General)
**Supported Languages** (11):
| Language | Script | Direction | Number Format |
|----------|--------|-----------|---------------|
| English | Latn | LTR | Standard (1,234.56) |
| Korean | Hang | LTR | Standard |
| Chinese | Hans | LTR | NoGrouping |
| Japanese | Jpan | LTR | Standard |
| Spanish | Latn | LTR | European (1.234,56) |
| French | Latn | LTR | SpaceSeparated (1 234,56) |
| German | Latn | LTR | European |
| Arabic | Arab | **RTL** | Standard |
| Hindi | Deva | LTR | Standard |
| Portuguese | Latn | LTR | European |
| Russian | Cyrl | LTR | SpaceSeparated |
**Provider Pattern**:
- `ILanguageProfileProvider`: Manages language profile lookup and auto-detection
- `DefaultLanguageProfileProvider`: Built-in provider with Unicode script analysis
- Auto-detection analyzes text Unicode ranges to determine language
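A sketch of profile lookup; the provider types are named above, but the detection method name used here is an assumption:

```csharp
// Sketch: auto-detect a language profile from text. The provider types are
// from the provider pattern above; DetectProfile is a hypothetical member.
ILanguageProfileProvider provider = new DefaultLanguageProfileProvider();

// Auto-detection analyzes Unicode script ranges in the text.
var profile = provider.DetectProfile("안녕하세요. 만나서 반갑습니다.");

Console.WriteLine(profile.LanguageCode);      // "ko"
Console.WriteLine(profile.ScriptCode);        // "Hang"
Console.WriteLine(profile.WritingDirection);  // LTR
```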
## Processing Pipeline
```mermaid
graph TB
    A[Document Input] --> B[Type Detection]
    B --> C[Reader Selection]
    C --> D[Content Extraction]
    D --> E[Metadata Enrichment]
    E --> F[Structure Parsing]
    F --> G[Strategy Selection]
    G --> H[Chunking Process]
    H --> I[Post Processing]
    I --> J["DocumentChunk[]"]
    style A fill:#e1f5fe
    style E fill:#e8eaf6
    style J fill:#e8f5e8
```
**Input & Validation**:
- File path or stream input support
- File existence and access permission validation
- Supported format verification

**Content Extraction**:
- Dedicated reader for each document type
- Text content and metadata extraction
- Document structure preservation

**Metadata Enrichment**:
- AI-powered metadata extraction with `IDocumentAnalysisService`
- Three-tier fallback: AI → Hybrid → Rule-based
- Automatic caching based on file content hash
- Schema-based extraction (General, ProductManual, TechnicalDoc)
- Enriched metadata stored in CustomProperties with "enriched_" prefix

**Chunking**:
- Content splitting based on selected strategy
- Overlap between chunks
- Metadata propagation and indexing

**IDocumentReaderFactory** (see the resolution sketch after these lists):
- File extension-based reader selection
- New reader registration and management
- Unsupported format exception handling
- Extension discovery API

**IChunkingStrategyFactory**:
- Strategy name-based selection system
- Default and fallback strategy management
- Dynamic strategy registration support
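A hedged sketch of factory resolution; the lookup member names are assumptions based on the feature lists above:

```csharp
// Sketch: resolve a reader and a strategy through the factories.
// Member names (GetReader, GetStrategy) are assumptions.
var readerFactory = serviceProvider.GetRequiredService<IDocumentReaderFactory>();
var reader = readerFactory.GetReader("report.pdf"); // extension-based selection
// Unknown extensions raise UnsupportedFileFormatException.

var strategyFactory = serviceProvider.GetRequiredService<IChunkingStrategyFactory>();
var strategy = strategyFactory.GetStrategy("Smart"); // name-based selection with fallback
```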
**Main Settings**:
- `Strategy`: Chunking strategy name ("Auto", "Smart", "Intelligent", etc.)
- `MaxChunkSize`: Maximum chunk size (default: 1024 tokens)
- `OverlapSize`: Overlap size between chunks (default: 128 tokens)
- `PreserveStructure`: Whether to preserve document structure
- `StrategyOptions`: Strategy-specific detailed options
- `CustomProperties`: Extensible configuration dictionary for features like metadata enrichment
**Metadata Enrichment Configuration**:

```csharp
var options = new ChunkingOptions
{
    Strategy = "Auto",
    CustomProperties = new Dictionary<string, object>
    {
        ["enableMetadataEnrichment"] = true,
        ["metadataSchema"] = MetadataSchema.General,
        ["metadataOptions"] = new MetadataEnrichmentOptions
        {
            ExtractionStrategy = MetadataExtractionStrategy.Smart,
            MinConfidence = 0.7
        }
    }
};
```

**Basic Registration (no AI)**:

```csharp
services.AddFileFlux(); // Pure extraction + chunking
```

**With AI Services (consumer-provided implementations)**:

```csharp
// Use your own AI service implementations
services.AddScoped<IDocumentAnalysisService, YourLLMService>();
services.AddScoped<IImageToTextService, YourVisionService>();
services.AddScoped<IEmbeddingService, YourEmbeddingService>();
services.AddFileFlux();
```

**With LMSupply (CLI example - local AI processing)**:

```csharp
// LMSupply is NOT a dependency of FileFlux.
// Consumer applications reference LMSupply directly.
// See cli/FileFlux.CLI/Services/LMSupply for implementation examples.
var lmSupplyOptions = new LMSupplyOptions
{
    UseGpuAcceleration = true,
    EmbeddingModel = "default",
    GeneratorModel = "microsoft/Phi-4-mini-instruct-onnx"
};

// Create and register LMSupply services
var embedder = await LMSupplyEmbedderService.CreateAsync(lmSupplyOptions);
services.AddSingleton<IEmbeddingService>(embedder);
services.AddFileFlux();
```

**Extension Registration**: Add custom readers/strategies
**Resource Management**:
- Stream-based processing for large files
- IDisposable pattern for resource cleanup
- ConfigureAwait(false) to minimize context switching

**Thread Safety**:
- All public interfaces are thread-safe
- Factories use immutable collections
- No shared mutable state

**Memory Efficiency**:
- Minimal memory allocation
- Efficient string processing
- Reusable component design
**Custom Document Reader**:
- Implement `IDocumentReader` interface
- Register in DI container
- Implement `SupportedExtensions` and `CanRead` methods
Example:

```csharp
using FileFlux.Core;

public class CustomDocumentReader : IDocumentReader
{
    public string ReaderType => "CustomReader";

    public IEnumerable<string> SupportedExtensions => [".custom"];

    public bool CanRead(string fileName) =>
        Path.GetExtension(fileName).Equals(".custom", StringComparison.OrdinalIgnoreCase);

    public Task<ReadResult> ReadAsync(string filePath, CancellationToken cancellationToken)
    {
        // Stage 0: Return document structure (page count, metadata)
        throw new NotImplementedException();
    }

    public Task<RawContent> ExtractAsync(string filePath, ExtractOptions? options = null, CancellationToken cancellationToken = default)
    {
        // Stage 1: Return extracted text content
        throw new NotImplementedException();
    }
}

// Registration
services.AddTransient<IDocumentReader, CustomDocumentReader>();
```

FileFlux uses `IChunker` from FluxCurator for chunking. Register a custom `IChunker` implementation to provide a custom chunking strategy:
- Implement `IChunker` from `FluxCurator.Core.Core`
- Define `StrategyName`
- Implement chunking logic in `ChunkAsync`
Example:

```csharp
using FluxCurator.Core.Core;
using FluxCurator.Core.Domain;

public class CustomChunker : IChunker
{
    public string StrategyName => "Custom";

    public bool RequiresEmbedder => false;

    public Task<IReadOnlyList<DocumentChunk>> ChunkAsync(
        string text,
        ChunkOptions options,
        CancellationToken cancellationToken = default)
    {
        // Implementation: split text into DocumentChunk instances here
        throw new NotImplementedException();
    }

    public int EstimateChunkCount(string text, ChunkOptions options) => 1;
}

// Registration
services.AddTransient<IChunker, CustomChunker>();
```

**Exception Hierarchy**:
- `FileFluxException`: Base class for all exceptions
- `UnsupportedFileFormatException`: Unsupported file format
- `DocumentProcessingException`: Error during document processing
- `ChunkingException`: Error during chunking process

**Error Handling Principles**:
- Early error detection through input validation
- Meaningful error messages with context
- Preserve inner exceptions for cause tracking
- Include debugging information (filename, strategy name)
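A sketch of catching these exceptions by specificity, from most to least specific:

```csharp
// The exception types come from the hierarchy above.
try
{
    var chunks = await processor.ProcessAsync("document.xyz", options);
}
catch (UnsupportedFileFormatException ex)
{
    Console.Error.WriteLine($"No reader for this format: {ex.Message}");
}
catch (DocumentProcessingException ex)
{
    Console.Error.WriteLine($"Processing failed: {ex.Message}");
    Console.Error.WriteLine(ex.InnerException); // inner exceptions preserved for cause tracking
}
catch (FileFluxException ex)
{
    Console.Error.WriteLine($"FileFlux error: {ex.Message}");
}
```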
**DocumentChunk**:
- `Id` (Guid): Unique chunk identifier
- `Content` (string): Chunk text content
- `Index` (int): Chunk order index
- `Location` (SourceLocation): StartChar/EndChar position info
- `Quality` (double): Quality score (0.0~1.0)
- `Props` (Dictionary<string, object>): Extensible metadata

**RawContent**:
- `Text`: Extracted raw text
- `File` (SourceFileInfo): File information
- `ReaderType`: Reader type used
- `ExtractedAt`: Extraction timestamp
- `Warnings`: Processing warnings
- `Hints`: Processing hints

**ParsedDocumentContent**:
- `Content`: Parsed text content
- `Sections`: Structured sections (optional)
- `RawId`: Reference to RawContent.Id
- `Props`: Additional metadata

**SourceFileInfo**:
- `Name`: Filename
- `Extension`: File extension
- `Size`: File size
- `Path`: File path (optional)
- Streaming Processing: Sequential processing per chunk with ProcessStreamAsync
- Batch Processing: Collect all chunks then batch process
- Pipeline Processing: Simultaneous chunk generation and embedding generation
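A sketch of the streaming pattern; `ProcessStreamAsync` is named above, while its exact signature and the `embeddingService`/`vectorStore` components are hypothetical placeholders:

```csharp
// Sketch: consume chunks as a stream and index them immediately
// (pipeline processing: chunk generation and embedding run together).
await foreach (var chunk in processor.ProcessStreamAsync("large-document.pdf", options))
{
    // embeddingService and vectorStore stand in for your own components.
    var embedding = await embeddingService.GenerateEmbeddingAsync(chunk.Content);
    await vectorStore.UpsertAsync(chunk.Id, embedding, chunk.Content);
}
```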
**Props Dictionary Pattern**:

```csharp
// Extensible metadata with Props dictionary
chunk.Props["ContextualHeader"] = "Document: Technical";
chunk.Props["DocumentDomain"] = "Technical";
chunk.Props["HasImages"] = true;

// Maintain backward compatibility with extension methods
public static string? ContextualHeader(this DocumentChunk chunk)
    => chunk.Props.TryGetValue("ContextualHeader", out var v) ? v?.ToString() : null;
```

**Enriched Metadata Storage**:

```csharp
// Enriched metadata storage in CustomProperties
chunk.Metadata.CustomProperties["enriched_topics"] = new[] { "AI", "Machine Learning" };
chunk.Metadata.CustomProperties["enriched_keywords"] = new[] { "neural networks", "deep learning" };
chunk.Metadata.CustomProperties["enriched_description"] = "Introduction to AI concepts";
chunk.Metadata.CustomProperties["enriched_confidence"] = 0.92;
chunk.Metadata.CustomProperties["enriched_extractionMethod"] = "ai";

// Access enriched metadata
var topics = chunk.Metadata.CustomProperties.GetValueOrDefault("enriched_topics") as string[];
var confidence = Convert.ToDouble(chunk.Metadata.CustomProperties.GetValueOrDefault("enriched_confidence", 0.0));
```

**Guid-based Traceability**:

```
RawContent.Id (Guid)
    ↓
ParsedDocumentContent.RawId → RawContent.Id
    ↓
DocumentChunk.RawId → RawContent.Id
DocumentChunk.ParsedId → ParsedDocumentContent.Id
```
## Design Philosophy

FileFlux focuses on transforming documents into structured chunks optimized for RAG systems.

**Two-Package Strategy**:

- **FileFlux.Core**: Pure extraction, zero AI dependencies
  - Standard document readers
  - Core interfaces and models
  - AI service interface definitions
  - For users implementing custom pipelines
- **FileFlux**: Full RAG pipeline (interface-driven)
  - MultiModal readers (AI-enhanced)
  - FluxCurator and FluxImprover
  - No direct AI service implementations

**Interface-Driven AI**: FileFlux defines AI service interfaces without implementations:

- `IDocumentAnalysisService`: Text generation for intelligent chunking
- `IImageToTextService`: Image captioning and OCR
- `IEmbeddingService`: Embedding generation for semantic search
**Consumer Responsibility**: Applications provide AI service implementations:
- Use OpenAI, Anthropic, Azure OpenAI, or other cloud providers
- Use LMSupply for local AI processing (see CLI for examples)
- Implement custom providers as needed
**Package Selection Guide**:

```csharp
// Extraction only - implement your own chunking
using FileFlux.Core;

var reader = new PdfDocumentReader();
var rawContent = await reader.ReadAsync("document.pdf");
```

```csharp
// Full RAG pipeline with custom AI providers
using FileFlux;

services.AddScoped<IDocumentAnalysisService, OpenAIService>();
services.AddScoped<IEmbeddingService, OpenAIEmbeddingService>();
services.AddFileFlux();

var processor = serviceProvider.GetRequiredService<IDocumentProcessor>();
var chunks = await processor.ProcessAsync("document.pdf", options);
```

## FAQ

**Q: Why does the HTML Reader output Markdown instead of raw HTML?**

**Short answer**: HTML tags are noise for RAG systems. Extract converts to Markdown to preserve semantic structure while removing presentation markup.
**Detailed explanation**:

- **RAG Optimization**: Raw HTML contains `<div>`, `<span>`, CSS classes, and other tags that add no semantic value for retrieval. Markdown preserves meaning (headings, lists, links) without noise.
- **Chunking Quality**: The Chunk stage needs structured text input. If Extract outputs raw HTML, the Refine stage would need to parse and convert it, adding complexity and potential errors.
- **Single Conversion Point**: Converting HTML→Markdown once in Extract is more efficient than maintaining an intermediate format and converting again in Refine.
- **Consistency**: All Readers output text that can be immediately chunked. The HTML Reader's Markdown output follows this pattern.
**What if I need the original HTML?**

- Read the file directly using `File.ReadAllText()` before calling FileFlux, as shown below
- FileFlux is designed for RAG preprocessing, not general-purpose HTML processing
**Q: Does Extract require AI?**

| Stage | Focus | AI Required |
|-------|-------|-------------|
| Extract | Format normalization (HTML→Markdown, PDF→Text) | No |
| Refine | Quality enhancement (OCR fix, image→text, cleanup) | Optional |

Extract uses deterministic libraries (HtmlAgilityPack, PdfPig). Refine can optionally use AI services for advanced processing.
**Q: Why does Refine also offer `ConvertToMarkdown`?**

`ConvertToMarkdown` in Refine handles cases where:
- Extract output is plain text (from TextReader, legacy sources)
- Additional structure detection is needed (e.g., inferring headings from formatting)
- If input is already Markdown, Refine performs cleanup only (no re-conversion)
- Tutorial - Detailed usage guide and examples
- Changelog - Version history and release notes
- GitHub Repository
- NuGet Package