A Python tool that analyzes PDF documents using both visual and textual features to detect risks, anomalies, and important patterns. Unlike traditional text-only analysis, this tool also examines each page as an image, so it can pick up cues such as signatures and stamps that text extraction alone would miss.
Traditional PDF analysis: PDF → Extract Text → Analyze Text
This approach: PDF → Extract Text + Visual Features → Combined Analysis
The analyzer can detect:
- 📝 Missing or suspicious signatures
- 🔴 Visual tampering indicators
- 💰 Financial risks and monetary amounts
- ⚖️ Legal risk keywords and patterns
- 🎯 Layout inconsistencies
- Python 3.8+
- Windows users need Poppler for PDF to image conversion
- Tesseract OCR (optional, for enhanced text extraction)
- Clone this repository:
    git clone https://github.com/chintanparekh2510/pdf-risk-analyzer.git
    cd pdf-risk-analyzer
- Install Python dependencies:
    pip install -r requirements.txt
- For Windows users:
  - Download Poppler
  - Extract it and add the bin folder to your PATH
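To confirm the external tools are reachable before running the analyzer, a small check along these lines may help. It is a minimal sketch: the helper name find_missing_tools is hypothetical, and it assumes that Poppler is detected through its pdftoppm utility and Tesseract through the tesseract executable.

```python
import shutil


def find_missing_tools(tools=("pdftoppm", "tesseract")):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        # Tesseract is optional, so a missing entry is not always a problem.
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Poppler (pdftoppm) and Tesseract are both on PATH.")
```

If pdftoppm is reported missing on Windows, the bin folder is most likely not on PATH yet.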
- Place your PDF file in the project directory and name it test_document.pdf (or run python create_test_pdf.py to generate one)
- Run the analyzer:
    python document_analyzer_simple.py
- Check the results:
  - Console output shows the formatted report
  - analysis_results.json contains detailed analysis data
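The layout of analysis_results.json is not described here, so the sketch below makes no assumptions about its keys. It only loads the file and lists what is at the top level. The helper names load_results and top_level_summary are hypothetical.

```python
import json
from pathlib import Path


def load_results(path="analysis_results.json"):
    """Load the analyzer's JSON output and return it as Python data."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def top_level_summary(results):
    """Map each top-level key to the type name of its value."""
    if not isinstance(results, dict):
        return {"<root>": type(results).__name__}
    return {key: type(value).__name__ for key, value in results.items()}


if __name__ == "__main__":
    if Path("analysis_results.json").exists():
        for key, kind in top_level_summary(load_results()).items():
            print(f"{key}: {kind}")
    else:
        print("analysis_results.json not found; run the analyzer first.")
```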
╔══════════════════════════════════════════════════════════════╗
║ MULTIMODAL DOCUMENT ANALYSIS REPORT ║
╚══════════════════════════════════════════════════════════════╝
📄 Document: contract.pdf
📅 Analysis Date: 2024-01-19
🎯 Overall Risk Score: 72.5/100
Text Risk: 65/100
Visual Risk: 85/100
📊 Document Statistics:
• Pages: 5
• Words: 2,341
• Risk Keywords: 13
💰 Monetary Amounts Found:
• $5,000,000
• $250,000
• $1.5M
🖼️ Visual Features:
• Signatures Detected: 0
• Official Stamps: 3
• Layout Consistency: 0.82
⚠️ HIGH RISK: Recommend legal review before signing
📝 No signatures detected - ensure proper signing
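The report above shows an overall score next to separate text and visual scores. The formula the tool uses is not stated here. As a minimal sketch, assuming a simple weighted average (the function name combine_risk and the text_weight parameter are hypothetical), the combination could look like this:

```python
def combine_risk(text_risk, visual_risk, text_weight=0.5):
    """Weighted average of two 0-100 scores.

    text_weight is the share given to the text score; the visual score
    gets the remainder. The weighting is an assumption for illustration.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be between 0 and 1")
    return text_weight * text_risk + (1.0 - text_weight) * visual_risk


if __name__ == "__main__":
    print(combine_risk(65, 85))                      # equal weights
    print(combine_risk(65, 85, text_weight=0.625))   # text weighted higher
```

With equal weights, the sample's 65 and 85 average to 75.0. A text weight of 0.625 happens to give the 72.5 shown in the report, but that is only an observation about the arithmetic, not a statement of how the tool weights its inputs.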
- Text Analysis: Extracts text and searches for risk keywords, monetary amounts, and concerning clauses
- Visual Analysis: Processes document as images to detect signatures, stamps, and visual anomalies
- Risk Scoring: Combines both analyses to generate a comprehensive risk score
- Recommendations: Provides actionable insights based on findings
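The text-analysis step can be illustrated with a short sketch. This is not the tool's actual implementation: the function names, the regular expression, and the keyword list are assumptions chosen to show how keyword counting and amount extraction might work on already-extracted text.

```python
import re

# Illustrative keyword list; the analyzer's real list may differ.
RISK_KEYWORDS = ["liability", "penalty", "breach", "termination"]

# Dollar amounts such as $5,000,000, $250,000 or $1.5M (assumed format).
MONEY_PATTERN = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?[KMB]?\b")


def count_risk_keywords(text, keywords=RISK_KEYWORDS):
    """Count whole-word, case-insensitive occurrences of each keyword."""
    lowered = text.lower()
    return {
        kw: len(re.findall(rf"\b{re.escape(kw)}\b", lowered))
        for kw in keywords
    }


def find_amounts(text):
    """Return every dollar amount found in the text, in order."""
    return MONEY_PATTERN.findall(text)


if __name__ == "__main__":
    sample = (
        "Breach of this clause incurs a penalty of $250,000. "
        "Total liability is capped at $1.5M."
    )
    print(count_risk_keywords(sample))
    print(find_amounts(sample))
```

Whole-word matching keeps a keyword like "breach" from also counting longer words such as "breaches", which may or may not be what a given use case wants.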
You can modify risk keywords and visual patterns in the MultimodalDocumentAnalyzer class:
    self.risk_keywords = [
        'liability', 'penalty', 'breach', 'termination',
        # Add your domain-specific keywords
    ]
    self.visual_patterns = {
        'signature_region': {'min_area': 5000, 'aspect_ratio_range': (2, 6)},
        # Adjust detection parameters
    }
- Contract Review: Identify high-risk clauses before signing
- Document Verification: Detect potential tampering or forgery
- Compliance Checking: Ensure documents meet visual standards
- Due Diligence: Quick risk assessment of legal documents
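For anyone tuning the signature_region parameters in the configuration shown earlier, the sketch below illustrates how thresholds of that shape could filter candidate regions. It is an assumption-laden illustration, not the tool's code: the function name is hypothetical, area is taken as width times height in pixels, and aspect ratio is taken as width divided by height.

```python
# Same values as the signature_region entry in the configuration example.
SIGNATURE_REGION = {"min_area": 5000, "aspect_ratio_range": (2, 6)}


def is_signature_candidate(width, height, params=SIGNATURE_REGION):
    """Return True if a bounding box passes the area and aspect-ratio checks."""
    if width <= 0 or height <= 0:
        return False
    low, high = params["aspect_ratio_range"]
    aspect_ratio = width / height  # assumed definition: wider than tall
    return width * height >= params["min_area"] and low <= aspect_ratio <= high


if __name__ == "__main__":
    print(is_signature_candidate(200, 50))   # wide and large enough
    print(is_signature_candidate(60, 50))    # too small
    print(is_signature_candidate(400, 400))  # square, outside the ratio range
```

Under these assumptions, raising min_area discards small marks, while widening aspect_ratio_range admits squarer or more elongated regions.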
- This is a demonstration tool, not a replacement for legal review
- Visual analysis accuracy depends on document quality
- Best results with standard business documents
Feel free to submit issues and enhancement requests!
Created by @ChintanParekhAI
MIT License - feel free to use in your projects!