OCR rendering
Although we did not know this initially, some of the PDF files had already been rendered through OCR in their original form. We nevertheless re-rendered every PDF through OCR. The first OCR pass was done by Haley Boles in 2017 and produced the version stored in Google Drive. The second pass was done in 2018 by Junwen Li, this time applying binary erosion to the files before processing. Binary erosion helps optical character recognition by darkening and thickening the strokes of characters, especially where they are represented in halftone with sparse pixels. Halftone rendering of fonts usually occurs when the original was in color and the PDF is a scanned or photocopied version of it. In Spring 2019, a team of undergraduate Electrical Engineering and Computer Science students ran tests to refine the binary erosion rules for processing files and found a scale that improved text-rendering quality; the erosion grid that best improved OCR quality was 3x3.

We do not yet have an estimate of the overall OCR accuracy for the full dataset, in part because original data quality varies widely across pages: some pages contain tables, forms, handwriting, logos, pictures, etc., or were scanned in, while most of the strictly email-based pages, which were exported more directly, transferred with virtually 100% accuracy.

The process for OCR rendering was as follows:
- splitting each PDF file by page and converting each page to a .png file
- applying binary erosion to each page image
- passing the eroded image through OCR (using ___ package)
- saving the OCR output as a separate text file
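
To make the binary erosion step more concrete, here is a minimal sketch of a 3x3 erosion in plain Python. It is an illustration under stated assumptions, not the code the project actually ran: the function name `erode_3x3`, the pixel encoding (1 = white paper, 0 = black ink), and the choice to treat pixels outside the image as white are all hypothetical. In practice a library routine such as `scipy.ndimage.binary_erosion` or OpenCV's `cv2.erode` would likely be used instead.

```python
def erode_3x3(image):
    """Binary erosion with a 3x3 square structuring element.

    `image` is a list of rows; 1 = white paper, 0 = black ink (assumed encoding).
    A pixel stays white only if its whole 3x3 neighbourhood is white, so
    white regions shrink and dark strokes grow by one pixel in every direction.
    Pixels outside the image are treated as white (an assumption).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            keep_white = 1
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 0:
                        keep_white = 0
            row.append(keep_white)
        result.append(row)
    return result


# A sparse, halftone-like vertical stroke: ink only on alternating rows.
page = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
]

eroded = erode_3x3(page)
for row in eroded:
    # Draw ink as '#' and paper as '.'
    print("".join("#" if px == 0 else "." for px in row))
```

In this toy page the two isolated ink pixels merge into one solid stroke three pixels wide, which is the thickening effect described above. A larger grid would thicken strokes further but could also fill in the gaps inside letters.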
Soon I will organize the pages into a kind of table of contents/outline below.