Releases: stjiris/OCR
1.4.1
What's Changed
- Hotfixes to file types and segmentation page by @AdventurousGui in #288
- Allow restarting worker pool from Flower by @AdventurousGui in #289
- Fixup in call to custom autosegmentation and set version 1.4.1 by @AdventurousGui in #290
- Fix dynamic maximum private space age by @AdventurousGui in #291
- Avoid generating thumbnails for files from API by @AdventurousGui in #292
- Ensure Storage Manager correctly orders spaces/docs by size as float by @AdventurousGui in #293
- Update word counts on result edit by @AdventurousGui in #294
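As an illustration of why ordering by size as a float (#293) matters, the sketch below compares sorting sizes kept as strings with sorting them as numbers. The `spaces` data and its layout are made up for this example and are not taken from the Storage Manager code.

```python
# Hypothetical sizes in MB, held as strings as they might arrive from metadata.
spaces = [("space-a", "9.5"), ("space-b", "120.0"), ("space-c", "15.25")]

# Sorting the raw strings compares character by character,
# so "9.5" ranks above "120.0" in a largest-first listing.
by_string = sorted(spaces, key=lambda item: item[1], reverse=True)

# Converting to float first gives the intended largest-first order.
by_float = sorted(spaces, key=lambda item: float(item[1]), reverse=True)

print([name for name, _ in by_string])
print([name for name, _ in by_float])
```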
Full Changelog: v1.4.0...v1.4.1
1.4.0
What's Changed
- Disable Search feature and ElasticSearch container by @AdventurousGui in #280
- UI changes and deactivation of search feature by @AdventurousGui in #281
- Performance-related changes by @AdventurousGui in #282
- Reimplement segment grouping/replicating functionality by @AdventurousGui in #283
- Fix bugs related to upload process by @AdventurousGui in #284
- Allow creation of extra results after result editing by @AdventurousGui in #285
- Sync dev v1.4.0 into main by @AdventurousGui in #286
- Allow altering max private space age in Storage Manager by @AdventurousGui in #287
Details of Most Important Changes
UI
- Document rows and their context menus now show a thumbnail of the first page.
- Folder rows now list the total size of their contents (calculated recursively), and the total size listed on document rows now refers to all storage occupied by files related to the document, including extracted pages, OCR results as JSON, etc.
- Added informative tooltips to several buttons, particularly some explaining why they are disabled.
- Redesigned the interface with a colour palette and identity that is more in line with other STJ tools.
- In the Results Editing menu, it's now possible to move around the page image on the left by dragging it.
- Implemented skipping to a certain page on the Segmentation and Results Editing interfaces by writing the desired number, like in most document viewers.
- Improved Segmentation functionality. See #283
- Blocked attempts to make segmentation changes while the server is performing automatic segmentation, since they would be overwritten anyway.
- Website favicon changed.
- Fixed TIFF page images not rendering when switching between pages in the Results Editing interface.
- Ensured all Results Editing functionality works in Edge and Chrome, which likely means it also works in other Chromium-based browsers.
- Other small tweaks to the UI
Functionality
- It is now possible to obtain more result types for a document without having to request OCR again:
1 - Select additional result types in the Configuration interface.
2 - Make any edit in the Result Editing interface.
3 - Request regeneration of result files. The newly selected types will also be generated.
Admin Tools
- Admin users can now remotely alter the maximum age of private spaces on the Storage Manager page. Previously, it didn't make much sense to change the automatic cleanup to a monthly schedule, since the maximum acceptable age would probably stay as the default.
Performance
- Removed spellchecking/dictionary from the Results Editing interface. For documents with hundreds of thousands of identified words, this functionality was heavy enough to make the interface unusable and even freeze the entire browser. It was also only useful for Portuguese and English texts, and only where correct spelling mattered more than faithfully reproducing possible errors in the original.
- Increased allowed request size for the endpoints that edit segmentations and result text, specifically to support edits to dozens of megabytes of text results.
- Disabled indexing and searching, as it was an underdeveloped feature and allocated a lot of memory to Elasticsearch.
- Temporary images used for generating PDF results are now reused when PDFs with and without an index are generated in succession. They are also no longer stored until the end of generation; each is deleted immediately after use.
- Implemented a lock system on reads and writes of _data.json files. There was a recurring problem where a high volume of concurrent read and write attempts would lead to the file sometimes being read as empty, causing errors.
- Reduced the rate of status updates when generating the PDF, as it was unnecessary I/O: the browser app doesn't refresh fast enough to show every progress increment.
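The locking on _data.json files described above could look roughly like the sketch below. It is not the project's implementation: it assumes a POSIX system, uses advisory locks from the standard `fcntl` module on a sidecar `.lock` file, and the helper names `locked`, `read_data`, and `write_data` are hypothetical.

```python
import fcntl
import json
from contextlib import contextmanager


@contextmanager
def locked(path, exclusive):
    """Hold an advisory lock on a sidecar '<path>.lock' file (assumed layout)."""
    with open(path + ".lock", "a") as lock_file:
        # Readers share the lock; a writer needs it exclusively.
        fcntl.flock(lock_file, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def read_data(path):
    """Read the JSON file while no writer holds the lock."""
    with locked(path, exclusive=False):
        with open(path, encoding="utf-8") as f:
            return json.load(f)


def write_data(path, data):
    """Rewrite the JSON file; readers wait, so they never see it truncated."""
    with locked(path, exclusive=True):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f)
```

Opening a file in write mode truncates it before the new content is written, which is one way a concurrent reader could see an empty file. Holding the exclusive lock for the whole rewrite closes that window for readers that also take the lock.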
Other
- Replaced Google Docs manual with a GitBook manual.
- A few bugfixes for problems that only manifest when deploying on a path other than root.
Full Changelog: v1.3.0...v1.4.0
1.3.0
What's Changed
- Render TIFF images in the browser interface by @AdventurousGui in #264
- Load image once to OCR all text boxes when using tesserOCR by @AdventurousGui in #265
- Split multi-page TIFF files to allow processing all pages by @AdventurousGui in #267
- Merge TIFF changes into main by @AdventurousGui in #269
- UI improvements by @AdventurousGui in #273
- Simplify React imports by @AdventurousGui in #274
- Allow partial admin config presets by @AdventurousGui in #275
- Set fixed filesystem table height and allow scroll with sticky header by @AdventurousGui in #276
- Backend improvements to performance and resources by @AdventurousGui in #277
- Improve manual segmentation by @AdventurousGui in #278
- Reimplement auto-segmentation at block level with tesserOCR by @AdventurousGui in #270
Details of Most Important Changes
- Changed "session" terminology to "space" to better reflect the nature of public and private workspaces.
- Overhauled the file system interface in the browser client:
- files related to a document are listed in a dropdown when clicking on the document name, similarly to the behaviour of version 0.22, and the original file is always available, even during OCR;
- action buttons for folders and documents are accessible within a context menu;
- in some cases, action buttons show explanatory messages about why they're disabled;
- similarly to the OCR Configuration action button, the button for Layout Editing now reflects whether the document has already been manually segmented;
- buttons for new folder/document are moved out of the dashboard seen in all menus and placed within the table, where they are relevant;
- folders and documents are consistently grouped separately from each other in the list, similarly to file explorers in operating systems;
- the table can now be sorted not only alphabetically, but also by date created, number of pages/content, and by size occupied;
- the table is collapsed and can be navigated by scrolling with a sticky header when it gets too long, ensuring the dashboard at the top is always accessible;
- the "new private space" button is placed in the top right corner, to emphasize the separation between private spaces and the public space.
- Added support for multi-page TIFF files.
- TIFF pages are now rendered in the browser and support editing the layout and results.
- The Layout Editing menu no longer misbehaves when the mouse hovers over existing segment boxes while creating new boxes or resizing existing ones. Overlapping segments are fully supported.
- The Results Editing menu can now colour the obtained words according to the level of confidence, with green for high, orange for medium, and red for low. This can speed up the revision of the results by highlighting potential problems.
- Spacing is improved in the Results Editing menu to appropriately reflect the different paragraphs that were recognised.
- The OCR configuration menu now shows the order in which languages were selected, considering it is relevant to Tesseract.
- If the OCR configuration includes the appropriate parameters, the system now extracts the estimated font names to the JSON results used internally. For single-page documents, they are included in the hOCR output.
- The browser client now uploads much larger chunk sizes, effectively ensuring no chunking is used in the vast majority of use cases. The excessive requests using small chunk sizes only overburdened the server, dramatically reducing its response time and increasing the chance of an interrupted upload by the user leaving the page before all requests are completed.
- In the admin tools, API docs and private spaces are ordered by size to better identify cleanup candidates.
- Admins can now create partial OCR configuration presets; this prevents other parameters from being overwritten with unnecessary values on the user side.
- Admins can now delete OCR configuration presets remotely from the editing menu.
- Performance improvements:
- When using TesserOCR as the engine, processing documents with many segment boxes per page becomes much faster.
- Celery concurrency is set to autoscale between 8 and 16 (the expected number of cores in a deployment machine), which lowers baseline resource usage and ensures maximum speed when processing a large volume of tasks.
- Celery worker processes are recreated after performing a single task, ensuring memory is freed, with no noticeable impact on speed, since process recreation is significantly faster than the heaviest and most numerous tasks (page OCR).
- Celery tasks now have a priority system, to ensure light or time-sensitive tasks are processed before potentially heavy ones (page OCR). This was crucial specifically for automatic segmentation and admin-related operations, which were prone to being delayed excessively when a document with over a hundred pages was taking up all worker processes.
- Celery now performs as little prefetching as possible, guaranteeing the previous point takes effect. With prefetching, a higher-priority task would still be delayed until all low-priority prefetched tasks were completed.
- Task results are explicitly ignored when not necessary, so Redis no longer stores results that are never freed and that slowly took up memory.
- Simple PDF creation no longer unnecessarily handles a sorted list of the identified words.
- Redis is switched to a lighter image to minimize the required storage space.
- The seemingly defective invisible font used for result PDFs is replaced with invisible Times New Roman so that it can actually be searched and copied.
- Fonts are self-hosted to ensure the browser interface is consistent even in off-the-grid deployments.
- All React components are removed from the unnecessary "Geral" folders wrapping them, which was complicating importing.
- Small improvements in UI wording, namely of buttons and page titles.
- Updates to dependency versions.
- General bugfixes.
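The Celery changes listed under performance improvements map onto a handful of standard Celery options. The sketch below collects them in one place; the setting keys and the `--autoscale` flag are Celery's own, but the priority step values, the two priority constants, and the `celery_app` module name are assumptions for illustration rather than the project's actual configuration.

```python
# Standard Celery setting keys; values follow the notes above where stated.
CELERY_SETTINGS = {
    # Recreate each worker process after one task so its memory is released.
    "worker_max_tasks_per_child": 1,
    # Keep prefetching minimal so reserved heavy tasks cannot hold up urgent ones.
    "worker_prefetch_multiplier": 1,
    # Do not store return values unless a task opts in.
    "task_ignore_result": True,
    # Redis transport: emulate priorities with per-step queues (assumed steps).
    "broker_transport_options": {
        "priority_steps": list(range(10)),
        "queue_order_strategy": "priority",
    },
}

# Hypothetical constants; on the Redis transport lower numbers are served first.
PRIORITY_URGENT = 0
PRIORITY_PAGE_OCR = 9


def worker_command(app_module: str, min_procs: int = 8, max_procs: int = 16) -> list[str]:
    """Build a worker command line; Celery's --autoscale takes max,min."""
    return ["celery", "-A", app_module, "worker", f"--autoscale={max_procs},{min_procs}"]


print(worker_command("celery_app"))
```

In a real deployment these would typically be applied with `app.conf.update(CELERY_SETTINGS)`, and a light task would be queued with something like `task.apply_async(priority=PRIORITY_URGENT)`.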
Full Changelog: v1.2.0...v1.3.0
1.2.0
What's Changed
- Implement API for third-party and command-line usage by @AdventurousGui in #257
- Implement creation and usage of OCR presets by @AdventurousGui in #258
- Correct config verification for tesserOCR and fix indexing feature by @AdventurousGui in #259
- Implement OCR requests for entire folder by @AdventurousGui in #260
- Fix OCR pipeline issues by @AdventurousGui in #261
Detailed Summary
- Admins can create named OCR presets, which users can select and apply.
- OCR can be configured and requested for a folder, which will process all of its direct document contents.
- This works as a shortcut to configuring and requesting the OCR of each document. Each document is processed separately and results are made available in the same way.
- Folder OCR configurations are stored with respect to the folder and only applied to requests directed at the entire folder. Configurations that have been set up for each document are stored and used when requesting an individual OCR of the specific document.
- User-defined segments for each document are still considered in the batched request.
- Sub-folders and their contents are ignored, and must be handled as folders with their own configuration.
- Implementation of a basic API for third-party or command-line use.
- TesserOCR can now also speed up processing by directly producing PDF, text, hOCR, and ALTO outputs for single-page files.
- The engine modules for PyTesseract and TesserOCR now accept either a PIL image file or a filename as input.
- The engine module for TesserOCR now accepts and uses all the same parameters as the one for PyTesseract.
- Requesting a new OCR of an indexed file now removes its previous results from the Elasticsearch index.
- The hOCR result is now shown only when it is among the output types of the OCR request, instead of always being shown. An hOCR file is still always generated and used internally to produce the other results.
- OSD_ONLY segmentation option is removed due to hOCR output always being expected internally. It may be allowed in the future.
- Changes to the UI (titles, return button, loading circle when fetching info for LayoutMenu and EditingMenu)
- Minor fixes to browser interface bugs.
- Fixes to bugs that appeared on deployment when serving the app from a non-root path.
- Security updates to dependency versions.
Full Changelog: v1.1.1...v1.2.0
1.1.1
What's Changed
- Bugfix: incorrect use of dict.get()
- Set up REACT_APP_BASENAME and APP_BASENAME to properly route the admin functionalities when the app is deployed under a path other than /
- Rate-limit the login endpoint to 1 request per second
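For readers unfamiliar with this kind of limit, here is a minimal sketch of a one-request-per-second limiter keyed by client. It is a generic illustration, not the mechanism used by the server: the class name, the per-client keying, and the injectable clock are all assumptions.

```python
import time


class PerClientRateLimiter:
    """Allow at most one request per `interval` seconds for each client key."""

    def __init__(self, interval: float = 1.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._last_allowed = {}  # client key -> time of the last accepted request

    def allow(self, client_key: str) -> bool:
        now = self.clock()
        last = self._last_allowed.get(client_key)
        if last is not None and now - last < self.interval:
            return False  # too soon: a server would typically answer HTTP 429
        self._last_allowed[client_key] = now
        return True
```

A login handler would call `allow()` with, for example, the client's IP address and reject the request when it returns `False`.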
Full Changelog: v1.1.0...v1.1.1
1.1.0
What's Changed
- Fix client-side validation of DPI parameter and validate it server-side by @AdventurousGui in #250
- Implement admin page by @AdventurousGui in #251
- Replace additional info in server response with async requests from client by @AdventurousGui in #252
- Implement index as list of page numbers where words appear by @AdventurousGui in #253
- Fix API call parameter value for pytesseract engine by @AdventurousGui in #254
- Update pre-commit-config and cleanup python code by @AdventurousGui in #255
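The idea behind #253, an index that lists the page numbers where each word appears, can be sketched as follows. The tokenization (lowercasing and splitting on whitespace) and the function name `build_page_index` are assumptions for illustration; the actual implementation may differ.

```python
from collections import defaultdict


def build_page_index(pages):
    """Map each word to the sorted list of page numbers on which it appears."""
    index = defaultdict(set)
    # Pages are numbered from 1, as in a printed index.
    for page_number, text in enumerate(pages, start=1):
        for word in text.lower().split():
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


pages = ["Supremo Tribunal", "tribunal de justiça", "supremo"]
print(build_page_index(pages))
```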
Full Changelog: v1.0.0...v1.1.0
1.0.0
What's Changed
- Storage format of page images changed from JPEG to PNG
- Bump flask-cors from 4.0.0 to 4.0.2 in /server/requirements by @dependabot in #215
- Bump opencv-python-headless from 4.7.0.72 to 4.8.1.78 in /server/requirements by @dependabot in #216
- Bump requests from 2.31.0 to 2.32.2 in /server/requirements by @dependabot in #217
- Error in celery worker by @DiogoAPFernandes in #220
- Load component classes at start of files by @AdventurousGui in #223
- Enhance private sessions by @AdventurousGui in #221
- fix import in PrivateFileRow.js by @AdventurousGui in #224
- Fix path variable on endpoints to add and remove index by @AdventurousGui in #225
- Standardize menu toolbar across menus by @AdventurousGui in #226
- Add support for more file types by @AdventurousGui in #227
- Bump gunicorn from 22.0.0 to 23.0.0 in /server/requirements by @dependabot in #230
- Improve editing menu by @AdventurousGui in #228
- Fix incorrect image URLs for indexed files by @AdventurousGui in #231
- List default component props for documentation by @AdventurousGui in #229
- Refactor filesystem into reusable component by @AdventurousGui in #232
- Fix auto-segmentation and disable auto-segment for unsupported files by @AdventurousGui in #233
- Adjust Layout menu style and implement zoom reset by @AdventurousGui in #234
- Fix images zip not being created by @AdventurousGui in #235
- Fix error on editing text with special characters by @AdventurousGui in #239
- Keep layout menu available on OCR error by @AdventurousGui in #241
- Split server and worker dependencies by @AdventurousGui in #236
- Improve search page by @AdventurousGui in #237
- Improve Layout Menu by @AdventurousGui in #238
- Integrate Thesis work by @AdventurousGui in #242
- Update deployment instructions and configuration of request paths by @AdventurousGui in #243
- Fix dotenv module missing from server by @AdventurousGui in #244
- Upgrade python version to 3.12 by @AdventurousGui in #245
- Implement OCR options menu and update browser interface by @AdventurousGui in #246
- Enable TesserOCR and NER options by @AdventurousGui in #247
- hotfix: ensure non-default parameter values are still numeric by @AdventurousGui in #248
- Update README.md by @AdventurousGui in #249
New Contributors
- @DiogoAPFernandes made their first contribution in #220
- @AdventurousGui made their first contribution in #223
Full Changelog: v0.22.6...v1.0.0