## This repository is no longer maintained. Please see the new and improved version here: https://github.com/jcaperella29/Document_cleaning_CLI
This repository contains a Python pipeline for cleaning scanned document images. It denoises each image with a DnCNN-based convolutional neural network, then applies adaptive thresholding and post-processing to produce clean output that is easy to read and suitable for text mining.

- Denoising with DnCNN: Uses a pre-trained DnCNN (denoising convolutional neural network) to remove noise while preserving important text details.
- Adaptive Thresholding: Sharpens text, enhances contrast, and evens out the background for better readability and machine processing.
- PDF Conversion: Converts cleaned images into grayscale, high-resolution PDFs for archival and text mining.
- Batch Processing: Processes all images in a folder and generates cleaned images and PDFs in bulk.
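
To make the thresholding and batch steps more concrete, here is a minimal, dependency-free sketch. It is not the code in this repository: the function names (`adaptive_threshold`, `plan_batch`), the `block` and `offset` parameters, the `_cleaned` output naming, and the list of image extensions are all assumptions chosen for illustration. The DnCNN step is left out because it needs the pre-trained weights, and a real pipeline would more likely call an image library routine than loop over pixels in pure Python.

```python
from pathlib import Path
from typing import List, Tuple

Image = List[List[int]]  # grayscale image as rows of 0-255 values

# Assumed set of input extensions; adjust to whatever the scans actually use.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def adaptive_threshold(img: Image, block: int = 3, offset: int = 5) -> Image:
    """Adaptive mean thresholding (illustrative sketch).

    A pixel becomes white (255) if it is brighter than the mean of its
    block x block neighbourhood minus `offset`, otherwise black (0).
    Neighbourhoods are clipped at the image border.
    """
    h, w = len(img), len(img[0])
    r = block // 2

    # Summed-area table: sat[y][x] holds the sum of img[0:y][0:x].
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum

    out: Image = []
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        row = []
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            total = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
            mean = total / ((y1 - y0) * (x1 - x0))
            row.append(255 if img[y][x] > mean - offset else 0)
        out.append(row)
    return out


def plan_batch(input_dir: str, output_dir: str) -> List[Tuple[Path, Path, Path]]:
    """Pair each image in `input_dir` with hypothetical output paths.

    Returns (source, cleaned_image, cleaned_pdf) tuples in sorted order.
    Nothing is read or written here; this only plans the batch.
    """
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    jobs = []
    for src in sorted(in_dir.iterdir()):
        if src.is_file() and src.suffix.lower() in IMAGE_SUFFIXES:
            jobs.append((
                src,
                out_dir / f"{src.stem}_cleaned.png",
                out_dir / f"{src.stem}_cleaned.pdf",
            ))
    return jobs


if __name__ == "__main__":
    # A dim page (background 100) with one dark "ink" pixel in the middle.
    page = [[100, 100, 100], [100, 40, 100], [100, 100, 100]]
    for row in adaptive_threshold(page):
        print(row)
```

Because each pixel is compared with its local mean rather than one global cutoff, a dim but uniform background still comes out white while darker ink stays black, which is why this kind of thresholding suits unevenly lit scans.
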

To get the code, clone the repository:

    git clone https://github.com/jcaperella29/document-cleaning-pipeline.git
    cd document-cleaning-pipeline

