Auto-Correct OCR: A Novel Method for Enhancing Character Recognition Accuracy through Error Correction

Introduction

This paper proposes the Auto-Correct OCR framework, which consists of a training-free basic recognition module and a domain-adaptive post-processing module named Structure-Aware Correction (SAC).

This repository releases the experimental data used in the paper and the core code of the post-processing methods for every scenario it discusses, including two algorithmic frameworks: SAC for Fixed-Length strings (SACFL) and SAC for Variable-Length strings (SACVL).

We preliminarily extracted the text from the image data of the two scenes using the PaddlePaddle OCR engine and saved the recognition results to two files: result_with_fixstr_fixed.xlsx for the brewery scene and ppocr_resultFL_cleaned_with_noise.xlsx for the automotive parts factory scene. The fields of each file are described below.

Brewery Scene

Field	Type	Description
date	String	Date to which the sample belongs
pic	String	Filename of the sample image data
pse	String	Initially extracted string record
Identify	Bool	Whether it is completely correct (label field)
Fixstr	String	Fully correct string (label field)
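As a rough illustration of how these fields fit together, the sketch below scores the raw OCR output (`pse`) against the label (`Fixstr`) with exact-match accuracy and character error rate. It is not the evaluation code from the paper: the helper names `edit_distance`, `baseline_scores` and `load_brewery_pairs` are hypothetical, the default path is assumed from the directory structure in this README, and reading .xlsx files with pandas additionally needs `openpyxl`. The edit distance is written in plain Python to keep the sketch self-contained; `Levenshtein.distance` from the pinned python_Levenshtein package could be used instead.

```python
from typing import Iterable, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def baseline_scores(pairs: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (exact-match accuracy, character error rate) for (ocr, label) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no records to score")
    exact = sum(ocr == label for ocr, label in pairs)
    errors = sum(edit_distance(ocr, label) for ocr, label in pairs)
    chars = sum(len(label) for _, label in pairs)
    return exact / len(pairs), errors / max(chars, 1)


def load_brewery_pairs(path: str = "SACFL/result_with_fixstr_fixed.xlsx") -> List[Tuple[str, str]]:
    """Read (pse, Fixstr) pairs; the path is an assumption, adjust it to your checkout."""
    import pandas as pd  # imported lazily so the scoring helpers need no third-party package

    df = pd.read_excel(path, dtype={"pse": str, "Fixstr": str})
    df = df.dropna(subset=["Fixstr"])  # rows without a label cannot be scored
    return list(zip(df["pse"].fillna(""), df["Fixstr"]))
```

Calling `baseline_scores(load_brewery_pairs())` gives a starting point against which a corrector's output can be compared; the `Identify` column should agree with the exact-match flag computed here.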

Automotive Parts Factory Scene

Field	Type	Description
cinvaddcode	String	Inventory ID; filename without suffix
cinvname	String	Inventory name
cinvstd	String	Specification / model (label field)
iinvrcost	Float	Inventory receipt cost; not used in this study
inventorycode	String	Inventory code (label field)
ppocrstr1	String	Initially extracted inventory code string
ppocrstr2	String	Initially extracted specification/model string
ppocrstr3	String	Initially extracted specification/model string (augmented)
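The table implies a pairing between OCR columns and label columns, which the following sketch makes explicit. The mapping is read off the field descriptions above and is an assumption, not something confirmed by the released scripts; `OCR_TO_LABEL`, `image_path` and `field_matches` are hypothetical names, and the `.png` suffix and `imgdata_Auto_parts_factory` folder are taken from the directory listing in this README.

```python
from pathlib import Path
from typing import Dict, Mapping

# OCR column -> label column, as suggested by the field descriptions (assumption).
OCR_TO_LABEL = {
    "ppocrstr1": "inventorycode",
    "ppocrstr2": "cinvstd",
    "ppocrstr3": "cinvstd",
}


def image_path(cinvaddcode: str, root: str = "imgdata_Auto_parts_factory") -> Path:
    """Build the image path from cinvaddcode, which is the filename without suffix."""
    return Path(root) / f"{cinvaddcode}.png"


def field_matches(row: Mapping[str, object]) -> Dict[str, bool]:
    """For one record, report whether each OCR string equals its label exactly."""
    result = {}
    for ocr_col, label_col in OCR_TO_LABEL.items():
        ocr = "" if row.get(ocr_col) is None else str(row[ocr_col]).strip()
        label = "" if row.get(label_col) is None else str(row[label_col]).strip()
        result[ocr_col] = ocr == label
    return result


# Hypothetical record for illustration only; the values are not from the dataset.
example = {
    "cinvaddcode": "BCQ3270000001",
    "inventorycode": "X-1",
    "cinvstd": "M8x20",
    "ppocrstr1": "X-1",
    "ppocrstr2": "M8x2O",  # letter O misread for digit 0
    "ppocrstr3": "M8x20",
}
```

With a pandas DataFrame, `field_matches` can be applied per row via `df.apply(field_matches, axis=1)`; missing cells would first need to be replaced with empty strings.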

Requirements

catboost==1.2.10

numpy==1.23.0

pandas==1.2.5

python_Levenshtein==0.27.3

scikit_learn==1.8.0

torch==2.5.1+cu121
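A quick way to see whether an environment matches these pins is sketched below using the standard-library `importlib.metadata`. The helper names `parse_pins` and `mismatches` are hypothetical, and the comparison is a plain string match, so a CPU-only torch build would be reported as differing from the `+cu121` (CUDA 12.1) build pinned here.

```python
from importlib import metadata
from typing import Dict, List

PINS = """
catboost==1.2.10
numpy==1.23.0
pandas==1.2.5
python_Levenshtein==0.27.3
scikit_learn==1.8.0
torch==2.5.1+cu121
"""


def parse_pins(text: str) -> Dict[str, str]:
    """Turn 'name==version' lines into a {name: version} mapping."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def mismatches(pins: Dict[str, str]) -> List[str]:
    """List packages that are missing or whose installed version differs from the pin."""
    problems = []
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (wanted {wanted})")
            continue
        if found != wanted:
            problems.append(f"{name}: found {found}, wanted {wanted}")
    return problems
```

Running `print("\n".join(mismatches(parse_pins(PINS))) or "all pins satisfied")` prints one line per discrepancy.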

Directory Structure

├── SACFL     #SACFL experimental script and core algorithm code
│   ├── cross_time_validation_experiment.py  
│   ├── ppocr_resultFL_cleaned_with_noise.xlsx
│   ├── result_with_fixstr_fixed.xlsx
│   └── structure_aware_corrector.py   #algorithm code
│
├── SACVL     #SACVL experimental script and core algorithm code
│   ├── baselines
│   │   └── rbp
│   ├── checkpoints
│   ├── checkpoints_final
│   ├── logs_structure_aware
│   ├── train_figs
│   ├── vocabularies_dual
│   ├── dual_vocab_builder.py
│   ├── ppocr_resultFL_cleaned_with_noise.xlsx
│   ├── structure_aware_corrector_v5.py  #algorithm code
│   └── train_structure_corrector.ipynb  #train and test script
├── imgdata_Auto_parts_factory
│   ├── BCQ3270000001.png
│   ├── BCQ3270000002.png
│   └── ...
├── imgdata_Brewery
│   ├── may
│   │   ├── 10002.jpg
│   │   ├── 10005.jpg
│   │   └── ...
│   └── sept
│       ├── IMG_20240918_081229_782.png
│       ├── IMG_20240918_081230_913.png
│       └── ...
├── README.md
└── requirements.txt

Dataset

The dataset is large (approximately 6 GB) and is therefore not included in this repository. Please download it from the link below:

🔗 Download Link

Baidu Netdisk: https://pan.baidu.com/s/1nMapMPpCWdsg0AdADt5_pQ?pwd=1234

Extraction Code: 1234

⚠️ Notes

The repository already includes character record files that were preliminarily extracted using open-source OCR engines.
Reproducing this extraction step is not recommended, as it is unnecessary and time-consuming.
Please ensure the dataset structure matches the project requirements.
Check data paths in the configuration or source code if any issues occur.
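One way to check the layout after downloading is sketched below. The list of expected paths is copied from the directory structure in this README and is an assumption about what the scripts actually read; `EXPECTED` and `missing_paths` are hypothetical names.

```python
from pathlib import Path
from typing import List, Sequence

# Paths relative to the repository root, taken from the directory listing (assumption).
EXPECTED = [
    "SACFL/result_with_fixstr_fixed.xlsx",
    "SACFL/ppocr_resultFL_cleaned_with_noise.xlsx",
    "SACVL/ppocr_resultFL_cleaned_with_noise.xlsx",
    "imgdata_Auto_parts_factory",
    "imgdata_Brewery/may",
    "imgdata_Brewery/sept",
]


def missing_paths(root: str = ".", expected: Sequence[str] = EXPECTED) -> List[str]:
    """Return the expected files or folders that do not exist under root."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).exists()]
```

An empty list from `missing_paths()` run at the repository root suggests the extracted dataset sits where the tree above places it; any entries returned point to folders that need to be moved or renamed.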

If the link is unavailable or you encounter any problems, please contact the project maintainer.
