Releases · stjiris/OCR

14 Oct 00:23

AdventurousGui

v1.4.1

a99e5a5

1.4.1 Latest

What's Changed

Hotfixes to file types and segmentation page by @AdventurousGui in #288
Allow restarting worker pool from Flower by @AdventurousGui in #289
Fixup in call to custom autosegmentation and set version 1.4.1 by @AdventurousGui in #290
Fix dynamic maximum private space age by @AdventurousGui in #291
Avoid generating thumbnails for files from API by @AdventurousGui in #292
Ensure Storage Manager correctly orders spaces/docs by size as float by @AdventurousGui in #293
Update word counts on result edit by @AdventurousGui in #294

Full Changelog: v1.4.0...v1.4.1

Contributors

AdventurousGui

02 Oct 23:14

AdventurousGui

v1.4.0

11a541f

1.4.0

What's Changed

Disable Search feature and ElasticSearch container by @AdventurousGui in #280
UI changes and deactivation of search feature by @AdventurousGui in #281
Performance-related changes by @AdventurousGui in #282
Reimplement segment grouping/replicating functionality by @AdventurousGui in #283
Fix bugs related to upload process by @AdventurousGui in #284
Allow creation of extra results after result editing by @AdventurousGui in #285
Sync dev v1.4.0 into main by @AdventurousGui in #286
Allow altering max private space age in Storage Manager by @AdventurousGui in #287

Details of Most Important Changes

UI

Document rows and their context menus now show a thumbnail of the first page.
Folder rows list the total size of their contents (calculated recursively), and the size listed on document rows now covers all storage occupied by files related to the document, including extracted pages, OCR results as JSON, etc.
Added informative tooltips to several buttons, particularly some explaining why they are disabled.
Redesigned the interface with a colour palette and identity that is more in line with other STJ tools.
In the Results Editing menu, it's now possible to move around the page image on the left by dragging it.
Implemented skipping to a certain page on the Segmentation and Results Editing interfaces by writing the desired number, like in most document viewers.
Improved Segmentation functionality. See #283
Blocked attempts to make segmentation changes while automatic segmentation is being performed by the server. They would be overwritten anyway.
Website favicon changed.
Fixed TIFF page images not rendering when switching between pages in the Results Editing interface.
Ensured all Results Editing functionality works in Edge and Chrome, which likely means it also works in other Chromium-based browsers.
Other small tweaks to the UI
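
The recursive folder size mentioned above can be sketched as follows. This is only an illustration, assuming a folder's contents are ordinary files on disk; `folder_size` is a hypothetical helper and not a function taken from the project.

```python
import os


def folder_size(path: str) -> int:
    """Return the total size in bytes of all files under `path`, recursively.

    Hypothetical helper for illustration; not taken from the project.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so that linked files are not counted twice.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total
```

Under the same assumption, the size shown on a document row would be the same calculation applied to the directory holding the original file, the extracted pages, and the JSON results.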

Functionality

It is now possible to obtain more result types for a document without having to request OCR again:
1 - Select additional result types in the Configuration interface.
2 - Make any edit in the Result Editing interface.
3 - Request regeneration of result files. The newly selected types will also be generated.

Admin Tools

Admin users can now remotely alter the maximum age of private spaces on the Storage Manager page. Previously, it didn't make much sense to change the automatic cleanup to a monthly schedule, since the maximum acceptable age would probably stay as the default.

Performance

Removed spellchecking/dictionary from the Results Editing interface. For documents with hundreds of thousands of identified words, this functionality was heavy enough to make the interface unusable and even freeze the entire browser. It also only had any value for Portuguese and English texts where correct spelling would be more important than faithfully representing any errors in the original.
Increased allowed request size for the endpoints that edit segmentations and result text, specifically to support edits to dozens of megabytes of text results.
Disabled indexing and searching, as it was an underdeveloped feature and allocated a lot of memory to Elasticsearch.
Temporary images used for generating PDF results are now reused when PDFs with and without an index are generated in succession. They are also no longer stored until the end of generation, each being deleted immediately after use.
Implemented a locking system for reads and writes of _data.json files. There was a recurrent problem where a high volume of concurrent read and write attempts would sometimes lead to the file being read as empty, causing errors.
Reduced the rate of status updates when generating the PDF, as it was unnecessary I/O: the browser app doesn't refresh fast enough to show every progress increment.
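
A locking scheme of the kind described for _data.json files could look like the sketch below. It is not the project's implementation: it assumes a Unix host (it uses `fcntl.flock` advisory locks), and `locked_file`, `read_data`, and `write_data` are hypothetical names chosen for illustration.

```python
import fcntl
import json
from contextlib import contextmanager


@contextmanager
def locked_file(path, mode, lock_type):
    """Open `path` and hold an advisory lock on it for the duration of the block."""
    with open(path, mode) as handle:
        fcntl.flock(handle, lock_type)  # blocks until the lock is granted
        try:
            yield handle
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def read_data(path):
    """Read the JSON file under a shared lock, so readers do not block each other."""
    with locked_file(path, "r", fcntl.LOCK_SH) as handle:
        return json.load(handle)


def write_data(path, data):
    """Replace the JSON file's contents under an exclusive lock."""
    # "a+" creates the file if needed but does not truncate it on open,
    # so the old contents stay intact until the exclusive lock is held.
    with locked_file(path, "a+", fcntl.LOCK_EX) as handle:
        handle.seek(0)
        handle.truncate()
        json.dump(data, handle)
        handle.flush()  # make sure the data is written before unlocking
```

One design point worth noting: opening the file with mode "w" would empty it before the lock is acquired, which is one way a concurrent reader could see an empty file. Whether that was the cause in the project is not stated in the release notes.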

Other

Replaced Google Docs manual with a GitBook manual.
A few bugfixes for problems that only manifest when deploying on a path other than root.

Full Changelog: v1.3.0...v1.4.0

Contributors

AdventurousGui

16 Sep 00:24

AdventurousGui

v1.3.0

5c03e1d

1.3.0

What's Changed

Render TIFF images in the browser interface by @AdventurousGui in #264
Load image once to OCR all text boxes when using tesserOCR by @AdventurousGui in #265
Split multi-page TIFF files to allow processing all pages by @AdventurousGui in #267
Merge TIFF changes into main by @AdventurousGui in #269
UI improvements by @AdventurousGui in #273
Simplify React imports by @AdventurousGui in #274
Allow partial admin config presets by @AdventurousGui in #275
Set fixed filesystem table height and allow scroll with sticky header by @AdventurousGui in #276
Backend improvements to performance and resources by @AdventurousGui in #277
Improve manual segmentation by @AdventurousGui in #278
Reimplement auto-segmentation at block level with tesserOCR by @AdventurousGui in #270

Details of Most Important Changes

Changed "session" terminology to "space" to better reflect the nature of public and private workspaces.
Overhauled the file system interface in the browser client:
- files related to a document are listed in a dropdown when clicking on the document name, similarly to the behaviour of version 0.22, and the original file is always available, even during OCR;
- action buttons for folders and documents are accessible within a context menu;
- in some cases, action buttons show explanatory messages about why they're disabled;
- similarly to the OCR Configuration action button, the button for Layout Editing now reflects whether the document has already been manually segmented;
- buttons for new folder/document are moved out of the dashboard seen in all menus and placed within the table, where they are relevant;
- folders and documents are consistently grouped separately from each other in the list, similarly to file explorers in operating systems;
- the table can now be sorted not only alphabetically, but also by date created, number of pages/content, and by size occupied;
- the table is collapsed and can be navigated by scrolling with a sticky header when it gets too long, ensuring the dashboard at the top is always accessible;
- the "new private space" button is placed in the top right corner, to emphasize the separation between private spaces and the public space.
Added support for multi-page TIFF files.
TIFF pages are now rendered in the browser and support editing the layout and results.
The Layout Editing menu no longer misbehaves when the mouse hovers over existing segment boxes while creating new boxes or resizing existing ones. Overlapping segments are fully supported.
The Results Editing menu can now colour the obtained words according to the level of confidence, with green for high, orange for medium, and red for low. This can speed up the revision of the results by highlighting potential problems.
Spacing is improved in the Results Editing menu to appropriately reflect the different paragraphs that were recognised.
The OCR configuration menu now shows the order in which languages were selected, considering it is relevant to Tesseract.
If the OCR configuration includes the appropriate parameters, the system now extracts the estimated font names to the JSON results used internally. For single-page documents, they are included in the hOCR output.
The browser client now uploads files in much larger chunks, effectively ensuring no chunking is used in the vast majority of use cases. The excessive requests caused by small chunk sizes only overburdened the server, dramatically increasing its response time and raising the chance of an upload being interrupted by the user leaving the page before all requests are completed.
In the admin tools, API docs and private spaces are ordered by size to better identify cleanup candidates.
Admins can now create partial OCR configuration presets; this prevents other parameters from being overwritten with unnecessary values on the user side.
Admins can now delete OCR configuration presets remotely from the editing menu.
Performance improvements:
- When using TesserOCR as the engine, processing documents with many segment boxes per page becomes much faster.
- Celery concurrency is set to autoscale between 8 and 16 (the expected number of cores in a deployment machine), which lowers baseline resource usage and ensures maximum speed when processing a large volume of tasks.
- Celery worker processes are recreated after performing a single task, ensuring memory is freed, with no noticeable impact on speed, since process recreation is significantly faster than the heaviest and most numerous tasks (page OCR).
- Celery tasks now have a priority system, to ensure light or time-sensitive tasks are processed before potentially heavy ones (page OCR). This was crucial specifically for automatic segmentation and admin-related operations, which were prone to being delayed excessively when a document with over a hundred pages was taking up all worker processes.
- Celery now performs as little prefetching as possible, guaranteeing the previous point takes effect. With prefetching, a higher-priority task would still be delayed until all low-priority prefetched tasks were completed.
- Task results are explicitly ignored when not necessary, so Redis no longer stores results that are never freed and slowly take up memory.
- Simple PDF creation no longer unnecessarily handles a sorted list of the identified words.
Redis is switched to a lighter image to minimize the required storage space.
The seemingly defective invisible font used for result PDFs is replaced with invisible Times New Roman so that it can actually be searched and copied.
Fonts are self-hosted to ensure the browser interface is consistent even in off-the-grid deployments.
All React components are removed from the unnecessary "Geral" folders wrapping them, which was complicating importing.
Small improvements in UI wording, namely of buttons and page titles.
Updates to dependency versions.
General bugfixes.
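
The Celery-related performance changes listed above correspond roughly to the settings sketched below. This is an approximation, not the project's actual configuration: the values for `task_default_priority` and `priority_steps` are assumptions, the priority options assume Redis is also used as the message broker, and the settings are kept in a plain dict (`CELERY_SETTINGS`, a hypothetical name) so the snippet runs without Celery installed.

```python
# Hypothetical settings dict; with Celery installed it could be applied
# through `app.conf.update(CELERY_SETTINGS)`.
CELERY_SETTINGS = {
    # Recreate each worker process after a single task so its memory is freed.
    "worker_max_tasks_per_child": 1,
    # Reserve as few tasks as possible per process, so prefetched
    # low-priority tasks do not hold up higher-priority ones.
    "worker_prefetch_multiplier": 1,
    "task_acks_late": True,
    # Do not store task results unless a task explicitly opts in.
    "task_ignore_result": True,
    # Assumed default priority; individual tasks can override it.
    "task_default_priority": 5,
    # Priority support for a Redis broker (assumed values).
    "broker_transport_options": {
        "priority_steps": list(range(10)),
        "queue_order_strategy": "priority",
    },
}

# Autoscaling is a worker command-line option, written as max,min.
AUTOSCALE_OPTION = "--autoscale=16,8"
```

Tasks whose results are needed would then set `ignore_result=False` individually, and time-sensitive tasks would be sent with an explicit `priority` value.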

Full Changelog: v1.2.0...v1.3.0

Contributors

AdventurousGui

05 Aug 13:18

AdventurousGui

v1.2.0

58313db

1.2.0

What's Changed

Implement API for third-party and command-line usage by @AdventurousGui in #257
Implement creation and usage of OCR presets by @AdventurousGui in #258
Correct config verification for tesserOCR and fix indexing feature by @AdventurousGui in #259
Implement OCR requests for entire folder by @AdventurousGui in #260
Fix OCR pipeline issues by @AdventurousGui in #261

Detailed Summary

Admins can create named OCR presets, which users can select and apply.
OCR can be configured and requested for a folder, which will process all of its direct document contents.
- This works as a shortcut to configuring and requesting the OCR of each document. Each document is processed separately and results are made available in the same way.
- Folder OCR configurations are stored with respect to the folder and only applied to requests directed at the entire folder. Configurations that have been set up for each document are stored and used when requesting an individual OCR of the specific document.
- User-defined segments for each document are still considered in the batched request.
- Sub-folders and their contents are ignored, and must be handled as folders with their own configuration.
Implementation of a basic API for third-party or command-line use.
TesserOCR can now also speed up processing by directly producing PDF, text, hOCR, and ALTO outputs for single-page files.
The engine modules for PyTesseract and TesserOCR now accept either a PIL image file or a filename as input.
The engine module for TesserOCR now accepts and uses all the same parameters as the one for PyTesseract.
Requesting a new OCR of an indexed file now removes its previous results from the Elasticsearch index.
The hOCR result is no longer always shown regardless of the output types selected in the OCR request. An hOCR file is still always used internally to generate results.
OSD_ONLY segmentation option is removed due to hOCR output always being expected internally. It may be allowed in the future.
Changes to the UI (titles, return button, loading circle when fetching info for LayoutMenu and EditingMenu)
Minor fixes to browser interface bugs.
Fixes to bugs that appeared on deployment when serving the app from a non-root path.
Security updates to dependency versions.

Full Changelog: v1.1.1...v1.2.0

Contributors

AdventurousGui

16 Jul 15:00

AdventurousGui

v1.1.1

9f4ab84

1.1.1

What's Changed

Bugfix: incorrect use of dict.get()
Set up REACT_APP_BASENAME and APP_BASENAME to properly route the admin functionalities when the app is deployed under a path other than /
Rate-limit the login endpoint to 1 request per second
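
A limit of one request per second, as applied to the login endpoint, can be illustrated with the minimal sketch below. It is not the project's implementation, which may rely on a rate-limiting library; `MinIntervalLimiter` is a hypothetical class, and tracking the limit per client key is an assumption.

```python
import time


class MinIntervalLimiter:
    """Allow at most one request per `interval` seconds for each client key."""

    def __init__(self, interval=1.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock  # injectable so the limiter can be tested
        self._last_allowed = {}

    def allow(self, key):
        """Return True if the request may proceed, False if it is too soon."""
        now = self.clock()
        last = self._last_allowed.get(key)
        if last is not None and now - last < self.interval:
            return False
        self._last_allowed[key] = now
        return True
```

A login handler would call `allow()` with something like the client address and respond with HTTP 429 (Too Many Requests) when it returns False.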

Full Changelog: v1.1.0...v1.1.1

14 Jul 15:29

AdventurousGui

v1.1.0

7c5b951

1.1.0

What's Changed

Fix client-side validation of DPI parameter and validate it server-side by @AdventurousGui in #250
Implement admin page by @AdventurousGui in #251
Replace additional info in server response with async requests from client by @AdventurousGui in #252
Implement index as list of page numbers where words appear by @AdventurousGui in #253
Fix API call parameter value for pytesseract engine by @AdventurousGui in #254
Update pre-commit-config and cleanup python code by @AdventurousGui in #255
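
An index in the form of a list of page numbers where each word appears (#253) can be illustrated with the sketch below. The tokenisation (lowercasing and splitting on whitespace) and the function name `build_page_index` are assumptions made for illustration, not details taken from the project.

```python
from collections import defaultdict


def build_page_index(pages):
    """Map each word to the sorted list of 1-based page numbers where it appears.

    `pages` is a list with the recognised text of each page, in order.
    """
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in text.lower().split():
            index[word].add(page_number)
    # Sort the words and the page numbers so the output is stable.
    return {word: sorted(numbers) for word, numbers in sorted(index.items())}
```

For a three-page document with the texts "the court ruled", "the appeal", and "court appeal", the entry for "court" would list pages 1 and 3.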

Full Changelog: v1.0.0...v1.1.0

Contributors

AdventurousGui

23 Jun 12:18

AdventurousGui

v1.0.0

7c8d7d7

1.0.0

What's Changed

Storage format of page images changed from JPEG to PNG
Bump flask-cors from 4.0.0 to 4.0.2 in /server/requirements by @dependabot in #215
Bump opencv-python-headless from 4.7.0.72 to 4.8.1.78 in /server/requirements by @dependabot in #216
Bump requests from 2.31.0 to 2.32.2 in /server/requirements by @dependabot in #217
Error in celery worker by @DiogoAPFernandes in #220
Load component classes at start of files by @AdventurousGui in #223
Enhance private sessions by @AdventurousGui in #221
fix import in PrivateFileRow.js by @AdventurousGui in #224
Fix path variable on endpoints to add and remove index by @AdventurousGui in #225
Standardize menu toolbar across menus by @AdventurousGui in #226
Add support for more file types by @AdventurousGui in #227
Bump gunicorn from 22.0.0 to 23.0.0 in /server/requirements by @dependabot in #230
Improve editing menu by @AdventurousGui in #228
Fix incorrect image URLs for indexed files by @AdventurousGui in #231
List default component props for documentation by @AdventurousGui in #229
Refactor filesystem into reusable component by @AdventurousGui in #232
Fix auto-segmentation and disable auto-segment for unsupported files by @AdventurousGui in #233
Adjust Layout menu style and implement zoom reset by @AdventurousGui in #234
Fix images zip not being created by @AdventurousGui in #235
Fix error on editing text with special characters by @AdventurousGui in #239
Keep layout menu available on OCR error by @AdventurousGui in #241
Split server and worker dependencies by @AdventurousGui in #236
Improve search page by @AdventurousGui in #237
Improve Layout Menu by @AdventurousGui in #238
Integrate Thesis work by @AdventurousGui in #242
Update deployment instructions and configuration of request paths by @AdventurousGui in #243
Fix dotenv module missing from server by @AdventurousGui in #244
Upgrade python version to 3.12 by @AdventurousGui in #245
Implement OCR options menu and update browser interface by @AdventurousGui in #246
Enable TesserOCR and NER options by @AdventurousGui in #247
hotfix: ensure non-default parameter values are still numeric by @AdventurousGui in #248
Update README.md by @AdventurousGui in #249

New Contributors

@DiogoAPFernandes made their first contribution in #220
@AdventurousGui made their first contribution in #223

Full Changelog: v0.22.6...v1.0.0

Contributors

dependabot, AdventurousGui, and DiogoAPFernandes
