Document Watermark Cleaner

A conditional GAN that removes watermarks, binarizes, deblurs, and cleans scanned document images.

Overview

Scanned documents in the wild are messy — overlaid watermarks, faint copies, blur, and noise all hurt downstream OCR and analysis. This project implements DE-GAN (Souibgui & Kessentini, 2020), a conditional Generative Adversarial Network that learns to map degraded document images to their clean counterparts.

The same architecture handles four related tasks (watermark removal, binarization, deblurring, and general cleaning) by changing only the training data. After reading this README, you will be able to run inference with pretrained weights and train your own variants.

Highlights

🎯 Four document-enhancement tasks with a single model architecture
🖼️ Paired-image training (degraded → ground-truth) keeps the loss interpretable
🐍 Pure Python + TensorFlow 2.x — no exotic dependencies
📜 Inspired by published research (DE-GAN, IEEE TPAMI 2020)

Capabilities

Task	Description	Use case
Watermark removal	Removes overlaid watermarks from documents	Archival recovery, redaction reversal
Binarization	Color/grayscale → clean binary	OCR preprocessing
Deblurring	Sharpens blurred documents	Phone-captured scans
Cleaning	General noise and artifact removal	Quality normalization

How it works

DE-GAN follows the conditional-GAN recipe (Pix2Pix–style):

Generator: a U-Net-style encoder/decoder that maps the degraded input to a candidate clean image.
Discriminator: a PatchGAN classifier that distinguishes real (degraded, clean) pairs from generated ones.
Combined loss: an adversarial term plus a heavily weighted pixel-wise reconstruction term, which keeps outputs faithful to the ground truth rather than merely "plausibly clean."
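
The combined loss in the last item can be illustrated with a small framework-free sketch. It assumes a binary cross-entropy adversarial term and a mean-absolute-error (L1) reconstruction term. The function names and the weight `recon_weight=100.0` are illustrative assumptions, not values taken from this repository's training scripts.

```python
import math


def binary_cross_entropy(predictions, targets, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(predictions, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(predictions)


def l1_loss(generated, ground_truth):
    """Mean absolute pixel difference between two flattened images."""
    return sum(abs(g - y) for g, y in zip(generated, ground_truth)) / len(generated)


def generator_loss(disc_scores_on_fake, generated, ground_truth, recon_weight=100.0):
    """Adversarial term plus weighted reconstruction term (assumed weighting)."""
    # The generator wants the discriminator to output 1 ("real") on its fakes.
    real_labels = [1.0] * len(disc_scores_on_fake)
    adversarial = binary_cross_entropy(disc_scores_on_fake, real_labels)
    reconstruction = l1_loss(generated, ground_truth)
    return adversarial + recon_weight * reconstruction


# Toy example: four PatchGAN patch scores and a four-pixel image in [0, 1].
patch_scores = [0.5, 0.5, 0.5, 0.5]
fake_pixels = [0.9, 0.1, 0.8, 0.2]
clean_pixels = [1.0, 0.0, 1.0, 0.0]
loss = generator_loss(patch_scores, fake_pixels, clean_pixels)
```

With a large reconstruction weight, the pixel term dominates the total, which is what pushes the generator toward the ground truth rather than toward any output the discriminator happens to accept.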

Each training script alternates discriminator and generator updates. Augmentation utilities in augmentation/ randomly perturb training pairs to improve generalization.
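
Because training is paired, any geometric perturbation has to be applied identically to the degraded image and its ground truth. The sketch below shows that idea with random flips on nested-list images. The function name `augment_pair` and the choice of flips are assumptions for illustration; the utilities in augmentation/ may work differently.

```python
import random


def augment_pair(degraded, clean, rng):
    """Apply the same random flips to a degraded image and its ground truth.

    Images are lists of rows (lists of pixel values). Hypothetical helper,
    not the repository's augmentation code.
    """
    if rng.random() < 0.5:  # horizontal flip: reverse each row
        degraded = [row[::-1] for row in degraded]
        clean = [row[::-1] for row in clean]
    if rng.random() < 0.5:  # vertical flip: reverse the row order
        degraded = degraded[::-1]
        clean = clean[::-1]
    return degraded, clean


rng = random.Random(0)  # seeded so runs are reproducible
degraded = [[10, 20], [30, 40]]
clean = [[1, 2], [3, 4]]
aug_degraded, aug_clean = augment_pair(degraded, clean, rng)
```

Whatever flips are drawn, each degraded pixel still lines up with its ground-truth pixel, so the pixel-wise loss remains meaningful.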

Project structure

.
├── augmentation/        # Data augmentation utilities
├── common/
│   ├── Generator.py     # U-Net generator network
│   └── utils.py         # Helpers (data loading, image ops)
├── service/             # Service wrapper (for inference deployment)
├── predict.py           # Inference entry point
├── train_wm.py          # Train the watermark-removal variant
├── train_dn.py          # Train the denoise variant
├── requirements.txt
└── LICENSE

Getting started

Prerequisites

Python 3.7+
TensorFlow 2.x
An NVIDIA GPU is strongly recommended for training (8 GB+ VRAM); CPU is fine for inference on small batches.

Installation

git clone https://github.com/aifriend/doc_watermark_cleaner.git
cd doc_watermark_cleaner
pip install -r requirements.txt

Download pretrained weights

Pretrained checkpoints are hosted on Google Drive ([TODO: real link]). Save them to ./weights/.

Usage

Inference

# Watermark removal
python predict.py --task unwatermark --input ./data_wm --output ./results

# Binarization
python predict.py --task binarize --input ./input --output ./output

# Deblurring
python predict.py --task deblur --input ./input --output ./output
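
To run several tasks from a script, the commands above can be assembled programmatically. This sketch only builds argument lists from the `--task`, `--input`, and `--output` flags shown above. The helper name `build_predict_command` is hypothetical, and the task set is limited to the three values that appear in this README.

```python
import sys

KNOWN_TASKS = ("unwatermark", "binarize", "deblur")  # task names shown in this README


def build_predict_command(task, input_dir, output_dir):
    """Return the argument list for one predict.py run (hypothetical helper)."""
    if task not in KNOWN_TASKS:
        raise ValueError(f"unknown task: {task!r}")
    return [sys.executable, "predict.py", "--task", task,
            "--input", input_dir, "--output", output_dir]


command = build_predict_command("deblur", "./input", "./output")
# To execute it from the repository root: subprocess.run(command, check=True)
```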

Training

Prepare paired datasets (degraded + ground-truth images, same filenames in matching folders), then:

python train_wm.py   # Watermark removal
python train_dn.py   # Denoising

Training configuration (epochs, batch size, learning rate) is defined at the top of each script; edit those values directly to change it.
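
Before training, it can help to confirm that every degraded image has a ground-truth partner. The sketch below pairs files by name across two folders. The function name `list_pairs` and its folder arguments are placeholders, since the actual data paths are set inside each training script.

```python
from pathlib import Path


def list_pairs(degraded_dir, clean_dir):
    """Return (degraded_path, clean_path) tuples for each degraded image.

    Files are matched by filename. Raises ValueError if a degraded image
    has no ground-truth counterpart.
    """
    degraded = {p.name: p for p in Path(degraded_dir).iterdir() if p.is_file()}
    clean = {p.name: p for p in Path(clean_dir).iterdir() if p.is_file()}
    missing = sorted(set(degraded) - set(clean))
    if missing:
        raise ValueError(f"no ground truth for: {missing}")
    return [(degraded[name], clean[name]) for name in sorted(degraded)]
```

Sorting by filename keeps the pair order deterministic across runs, which makes train/validation splits reproducible.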

Results

Task	Metric	This implementation
Watermark removal	PSNR (test set)	to be reported
Binarization	F-measure (DIBCO)	to be reported
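
For reference, the two metrics named in the table have standard definitions. The sketch below computes PSNR for 8-bit images and the F-measure for binary images on flattened pixel lists. The function names are illustrative, and treating text pixels as 1 is an assumption about the label convention.

```python
import math


def psnr(reference, output, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - o) ** 2 for r, o in zip(reference, output)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def f_measure(reference, output):
    """F-measure (harmonic mean of precision and recall), text pixels = 1."""
    tp = sum(1 for r, o in zip(reference, output) if r == 1 and o == 1)
    fp = sum(1 for r, o in zip(reference, output) if r == 0 and o == 1)
    fn = sum(1 for r, o in zip(reference, output) if r == 1 and o == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```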

Roadmap

Publish quantitative results on standard benchmarks (DIBCO, custom watermark set)
Add ONNX export for lightweight deployment
Wrap inference in a FastAPI service (started in service/)
Replace TF training loop with a PyTorch Lightning implementation for portability

Citation

If you use this work, please cite the original paper that motivated the architecture:

@article{souibgui2020degan,
  title   = {DE-GAN: A Conditional Generative Adversarial Network for Document Enhancement},
  author  = {Souibgui, Mohamed Ali and Kessentini, Yousri},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year    = {2020}
}

License

This project is for academic and research purposes only. For commercial use, please contact the maintainer. See LICENSE for details.

Author

Jose Lopez — AI engineer in Madrid, working on document intelligence and the intersection of biological and artificial intelligence.

GitHub: @aifriend
LinkedIn: jafdl

Acknowledgments

DE-GAN paper authors for the architecture
The Pix2Pix family of work (Isola et al.) for the broader cGAN-for-image-translation framework
