Auto-Correct OCR: A Novel Method for Enhancing Character Recognition Accuracy through Error Correction
This paper proposes the Auto-Correct OCR framework, which consists of a training-free basic recognition module and a domain-adaptive post-processing module named Structure-Aware Correction (SAC).
This repository releases the experimental data used in the paper and the core code of the post-processing methods for all scenarios it describes, including two algorithmic frameworks: SAC for Fixed-Length strings (SACFL) and SAC for Variable-Length strings (SACVL).
We pre-extracted the text from the image data of the two scenes with the PaddlePaddle OCR engine and saved the recognition results to two files: result_with_fixstr_fixed.xlsx for the brewery scene and ppocr_resultFL_cleaned_with_noise.xlsx for the automotive parts factory scene. The fields of the two files are described below, in that order:
| Field | Type | Description |
|---|---|---|
| date | String | Date to which the sample belongs |
| pic | String | Filename of the sample image data |
| pse | String | Initially extracted string record |
| Identify | Bool | Whether it is completely correct (label field) |
| Fixstr | String | Fully correct string (label field) |

Fields of ppocr_resultFL_cleaned_with_noise.xlsx (automotive parts factory scene):

| Field | Type | Description |
|---|---|---|
| cinvaddcode | String | Inventory ID; filename without suffix |
| cinvname | String | Inventory name |
| cinvstd | String | Specification / model (label field) |
| iinvrcost | Float | Inventory receipt cost; not used in this study |
| inventorycode | String | Inventory code (label field) |
| ppocrstr1 | String | Initially extracted inventory code string |
| ppocrstr2 | String | Initially extracted specification/model string |
| ppocrstr3 | String | Initially extracted specification/model string (augmented) |
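
As a quick sanity check on these records, the share of raw OCR strings that already match their labels can be computed directly from the fields above. The following is a minimal sketch and not part of the released code: the function name `exact_match_rate` and the sample rows are hypothetical, and it assumes the rows are available as dictionaries (for example via `pandas.read_excel(path).to_dict("records")`, which needs `openpyxl` for .xlsx files) and that converting every cell to `str` before comparison is acceptable.

```python
from typing import Iterable, Mapping


def exact_match_rate(rows: Iterable[Mapping[str, object]],
                     pred_key: str, label_key: str) -> float:
    """Return the fraction of rows whose extracted string equals its label."""
    total = 0
    correct = 0
    for row in rows:
        total += 1
        # Assumption: cells are compared as strings, since Excel cells
        # may be loaded as numbers.
        if str(row[pred_key]) == str(row[label_key]):
            correct += 1
    return correct / total if total else 0.0


# Hypothetical rows using the brewery-scene field names (pse, Fixstr).
sample_rows = [
    {"pse": "20240918A01", "Fixstr": "20240918A01"},
    {"pse": "2O240918A02", "Fixstr": "20240918A02"},  # letter O instead of zero
]
print(exact_match_rate(sample_rows, "pse", "Fixstr"))  # prints 0.5
```

For the automotive parts factory file, the same function could be called with `ppocrstr1` and `inventorycode`, or with `ppocrstr2` and `cinvstd`.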
├── SACFL #SACFL experimental script and core algorithm code
│   ├── cross_time_validation_experiment.py
│   ├── ppocr_resultFL_cleaned_with_noise.xlsx
│   ├── result_with_fixstr_fixed.xlsx
│   └── structure_aware_corrector.py #algorithm code
│
├── SACVL #SACVL experimental script and core algorithm code
│   ├── baselines
│   │   └── rbp
│   ├── checkpoints
│   ├── checkpoints_final
│   ├── logs_structure_aware
│   ├── train_figs
│   ├── vocabularies_dual
│   ├── dual_vocab_builder.py
│   ├── ppocr_resultFL_cleaned_with_noise.xlsx
│   ├── structure_aware_corrector_v5.py #algorithm code
│   └── train_structure_corrector.ipynb #train and test script
├── imgdata_Auto_parts_factory
│   ├── BCQ3270000001.png
│   ├── BCQ3270000002.png
│   └── ...
├── imgdata_Brewery
│   ├── may
│   │   ├── 10002.jpg
│   │   ├── 10005.jpg
│   │   └── ...
│   └── sept
│       ├── IMG_20240918_081229_782.png
│       ├── IMG_20240918_081230_913.png
│       └── ...
├── README.md
└── requirements.txt
Due to the large size of the dataset (approximately 6GB), it is not included in this repository. Please download it from the link below:
Baidu Netdisk: https://pan.baidu.com/s/1nMapMPpCWdsg0AdADt5_pQ?pwd=1234
Extraction Code: 1234
- The repository already includes character record files that were preliminarily extracted using open-source OCR engines.
- Reproducing this extraction step is not recommended, as it is unnecessary and time-consuming.
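- After downloading, make sure the dataset directories match the layout shown in the directory tree above (imgdata_Auto_parts_factory and imgdata_Brewery at the repository root).
- If files cannot be found at run time, check the data paths set in the configuration or source code.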
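
The layout check mentioned in the notes above can be automated. The sketch below is illustrative only and not part of the released code: the helper name `missing_dataset_dirs` is hypothetical, and it assumes the downloaded image folders belong at the repository root exactly as drawn in the directory tree.

```python
from pathlib import Path

# Directory names taken from the directory tree in this README.
# Assumption: they sit directly under the repository root.
EXPECTED_DIRS = (
    "imgdata_Auto_parts_factory",
    "imgdata_Brewery/may",
    "imgdata_Brewery/sept",
)


def missing_dataset_dirs(repo_root):
    """Return the expected dataset directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs(".")
    if missing:
        print("Missing dataset directories:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

Running the script from the repository root lists any image folder that still needs to be downloaded or moved into place.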
If the link is unavailable or you encounter any problems, please contact the project maintainer.