Handwriting Recognition Under Data Scarcity: Ukrainian Use Case

Overview

The initial aim of this project was to develop a system for recognizing Ukrainian characters. While a dataset containing 28,000 samples of Cyrillic handwritten characters (CoMNIST) was available, it was created in Russia and featured Soviet Union-related imagery. As a Ukrainian I have ethical considerations that made me to choose not to use this dataset. Instead I generated a small datset myself (yes, manually). I integrated my small dataset with external dataset 'Rukopys' of the same size, resulting in 20-30 samples/category dataset with 34 categories. I explored varaious CNN architectured for OCR however all faied due to data scarcity. This data scarcity condition prompted a shift in focus to explore data augmentation techniques, applying machine learning frameworks under conditions of data scarcity. The goal became analogous to developing OCR for a rare language with minimal written samples, necessitating innovative approaches to overcome the limitations.

Objectives

Develop a model for Ukrainian handwriting recognition with minimal data.
Explore and implement data augmentation techniques to increase the dataset’s effectiveness.
Train and evaluate machine learning models under constrained data conditions.

Features

Dataset Generation: A manually created dataset was merged with the external 'Rukopys' dataset to expand the variety of samples.
Data Augmentation: Techniques such as Gaussian noise reduction, grayscale conversion, thresholding, and normalization were applied to standardize the images.
Dimensionality Reduction: PCA and t-SNE were used for feature extraction and visualization.
Variational Autoencoders (VAEs): Implemented for generating synthetic data and improving the model's ability to generalize.

Installation

Ensure that you have the following libraries installed:

pip install numpy pandas matplotlib seaborn scikit-image scikit-learn tensorflow keras opencv-python

Usage

Data Preprocessing: Preprocess images using the provided scripts to convert them to grayscale, apply Gaussian blur, and resize.
Dataset Augmentation: Utilize ImageDataGenerator for augmenting the dataset.
Model Training: Train models using VAE or CNN architectures.
Evaluation: Evaluate the models using the test set and visualize the results using t-SNE and PCA.

Results

The final dataset included original and augmented samples, improving model robustness.
PCA and t-SNE visualizations highlighted the feature distribution, guiding further tuning of the models.
The VAE-based synthetic data generation proved effective in balancing the dataset and enhancing model performance.

Ethical Considerations

The decision to create a new dataset stems from the ethical implications of using datasets associated with unwanted imagery. This approach ensures that the project aligns with ethical research practices while promoting inclusive and responsible AI development.

References

Rukopys Dataset

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
Handwriting_recognition_under_conditions_of_data_scarcity_Ukrainian_use_case.ipynb		Handwriting_recognition_under_conditions_of_data_scarcity_Ukrainian_use_case.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Handwriting Recognition Under Data Scarcity: Ukrainian Use Case

Overview

Objectives

Features

Installation

Usage

Results

Ethical Considerations

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Handwriting Recognition Under Data Scarcity: Ukrainian Use Case

Overview

Objectives

Features

Installation

Usage

Results

Ethical Considerations

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages