Releases: stjiris/OCR

1.4.1

14 Oct 00:23
a99e5a5

What's Changed

Full Changelog: v1.4.0...v1.4.1

1.4.0

02 Oct 23:14
11a541f

What's Changed

Details of Most Important Changes

UI

  • Document rows and their context menus now show a thumbnail of the first page.
  • Folder rows now list the total size of their contents (calculated recursively). The size listed on document rows now covers all storage occupied by files related to the document, including extracted pages, OCR results as JSON, etc.
  • Added informative tooltips to several buttons, particularly some explaining why they are disabled.
  • Redesigned the interface with a colour palette and identity that is more in line with other STJ tools.
  • In the Results Editing menu, it's now possible to move around the page image on the left by dragging it.
  • Implemented jumping to a specific page in the Segmentation and Results Editing interfaces by typing the desired page number, as in most document viewers.
  • Improved Segmentation functionality. See #283
  • Blocked attempts to make segmentation changes while the server is performing automatic segmentation, since they would be overwritten anyway.
  • Website favicon changed.
  • Fixed TIFF page images not rendering when switching between pages in the Results Editing interface.
  • Ensured that all Results Editing functionality works in Edge and Chrome, which likely means it also works in other Chromium-based browsers.
  • Other small tweaks to the UI

Functionality

  • It is now possible to obtain more result types for a document without having to request OCR again:
    1 - Select additional result types in the Configuration interface.
    2 - Make any edit in the Results Editing interface.
    3 - Request regeneration of result files. The newly selected types will also be generated.

Admin Tools

  • Admin users can now remotely alter the maximum age of private spaces on the Storage Manager page. Previously, it didn't make much sense to change the automatic cleanup to a monthly schedule, since the maximum acceptable age would probably stay as the default.

Performance

  • Removed spellchecking/dictionary from the Results Editing interface. For documents with hundreds of thousands of identified words, this functionality was heavy enough to make the interface unusable and even freeze the entire browser. It was also only useful for Portuguese and English texts in which correct spelling matters more than faithfully reproducing any errors in the original.
  • Increased allowed request size for the endpoints that edit segmentations and result text, specifically to support edits to dozens of megabytes of text results.
  • Disabled indexing and searching, as it was an underdeveloped feature and allocated a lot of memory to Elasticsearch.
  • Temporary images used for generating PDF results are now reused when the PDF is generated with and without an index in succession. They are also no longer kept until the end of generation: each is deleted immediately after use.
  • Implemented a lock system for reads and writes of _data.json files. Previously, a high volume of concurrent read and write attempts would sometimes cause the file to be read as empty, leading to errors.
  • Reduced the rate of status updates when generating the PDF, as they caused unnecessary I/O: the browser app doesn't refresh fast enough to show every progress increment.
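
The locking of _data.json files mentioned above could look roughly like the following. This is a minimal sketch, not the project's actual implementation: the helper names (locked, read_data, write_data) and the sidecar ".lock" file are assumptions, and fcntl.flock is only available on POSIX systems.

```python
import fcntl
import json
from contextlib import contextmanager


@contextmanager
def locked(path, exclusive):
    """Hold an advisory lock on a sidecar file next to `path` (hypothetical helper)."""
    # Locking a separate file means truncating the data file never affects the lock.
    with open(path + ".lock", "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def read_data(path):
    # Shared lock: several readers may proceed together, but never during a write.
    with locked(path, exclusive=False):
        with open(path, encoding="utf-8") as f:
            return json.load(f)


def write_data(path, data):
    # Exclusive lock: no reader can observe the truncated, half-written file.
    with locked(path, exclusive=True):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f)
```

The lock is advisory, so it only helps if every reader and writer of the file goes through the same helpers.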

Other

  • Replaced Google Docs manual with a GitBook manual.
  • A few bugfixes for problems that only manifest when deploying on a path other than root.

Full Changelog: v1.3.0...v1.4.0

1.3.0

16 Sep 00:24
5c03e1d

What's Changed

Details of Most Important Changes

  • Changed "session" terminology to "space" to better reflect the nature of public and private workspaces.

  • Overhauled the file system interface in the browser client:

    • files related to a document are listed in a dropdown when clicking on the document name, similarly to the behaviour of version 0.22, and the original file is always available, even during OCR;
    • action buttons for folders and documents are accessible within a context menu;
    • in some cases, action buttons show explanatory messages about why they're disabled;
    • similarly to the OCR Configuration action button, the button for Layout Editing now reflects whether the document has already been manually segmented;
    • buttons for new folder/document are moved out of the dashboard seen in all menus and placed within the table, where they are relevant;
    • folders and documents are consistently grouped separately from each other in the list, similarly to file explorers in operating systems;
    • the table can now be sorted not only alphabetically, but also by date created, number of pages/content, and by size occupied;
    • the table is collapsed and can be navigated by scrolling with a sticky header when it gets too long, ensuring the dashboard at the top is always accessible;
    • the "new private space" button is placed in the top right corner, to emphasize the separation between private spaces and the public space.
  • Added support for multi-page TIFF files.

  • TIFF pages are now rendered in the browser and support editing the layout and results.

  • The Layout Editing menu no longer misbehaves when the mouse hovers over existing segment boxes while creating new boxes or resizing existing ones. Overlapping segments are fully supported.

  • The Results Editing menu can now colour the obtained words according to the level of confidence, with green for high, orange for medium, and red for low. This can speed up the revision of the results by highlighting potential problems.

  • Spacing is improved in the Results Editing menu to appropriately reflect the different paragraphs that were recognised.

  • The OCR configuration menu now shows the order in which languages were selected, since the order is relevant to Tesseract.

  • If the OCR configuration includes the appropriate parameters, the system now extracts the estimated font names to the JSON results used internally. For single-page documents, they are included in the hOCR output.

  • The browser client now uploads files in much larger chunks, so that no chunking occurs in the vast majority of use cases. The excessive number of requests caused by small chunk sizes only overburdened the server, dramatically increasing its response time and raising the chance of an upload being interrupted by the user leaving the page before all requests completed.

  • In the admin tools, API docs and private spaces are ordered by size to better identify cleanup candidates.

  • Admins can now create partial OCR configuration presets; this prevents other parameters from being overwritten with unnecessary values on the user side.

  • Admins can now delete OCR configuration presets remotely from the editing menu.

  • Performance improvements:

    • When using TesserOCR as the engine, processing documents with many segment boxes per page becomes much faster.
    • Celery concurrency is set to autoscale between 8 and 16 (the expected number of cores in a deployment machine), which lowers baseline resource usage and ensures maximum speed when processing a large volume of tasks.
    • Celery worker processes are recreated after performing a single task, ensuring memory is freed. This has no noticeable impact on speed, since recreating a process is significantly faster than running the heaviest and most numerous tasks (page OCR).
    • Celery tasks now have a priority system, to ensure light or time-sensitive tasks are processed before potentially heavy ones (page OCR). This was crucial specifically for automatic segmentation and admin-related operations, which were prone to being delayed excessively when a document with over a hundred pages was taking up all worker processes.
    • Celery now performs as little prefetching as possible, guaranteeing the previous point takes effect. With prefetching, a higher-priority task would still be delayed until all low-priority prefetched tasks were completed.
    • Task results are explicitly ignored when not needed, so Redis no longer accumulates unfreed results that slowly take up memory.
    • Simple PDF creation no longer unnecessarily handles a sorted list of the identified words.
  • Redis is switched to a lighter image to minimize the required storage space.

  • The seemingly defective invisible font used for result PDFs is replaced with invisible Times New Roman so that it can actually be searched and copied.

  • Fonts are self-hosted to ensure the browser interface is consistent even in off-the-grid deployments.

  • All React components are moved out of the unnecessary "Geral" folders wrapping them, which were complicating imports.

  • Small improvements in UI wording, namely of buttons and page titles.

  • Updates to dependency versions.

  • General bugfixes.
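
The Celery-related performance changes listed above correspond to a handful of standard Celery settings. The fragment below is a sketch only, not the project's actual configuration: the module name, the assumption that Redis also serves as the broker, the use of task_acks_late, and the chosen priority values are all illustrative.

```python
# celeryconfig.py (hypothetical module name), using Celery's lowercase setting names.

# Recreate each worker process after a single task so its memory is freed.
worker_max_tasks_per_child = 1

# Reserve as few tasks as possible per process, so queued high-priority
# tasks are not stuck behind already-prefetched low-priority ones.
worker_prefetch_multiplier = 1

# Assumption: acknowledging late is the usual companion to the setting above
# when the goal is to keep prefetching to a minimum.
task_acks_late = True

# Do not store task results by default; a task whose result is needed
# can opt back in with ignore_result=False on its decorator.
task_ignore_result = True

# Assumption: Redis is the broker. These transport options enable
# priority-ordered queues; with Redis, 0 is the highest priority.
broker_transport_options = {
    "priority_steps": list(range(10)),
    "sep": ":",
    "queue_order_strategy": "priority",
}

# Illustrative default for tasks that do not set their own priority.
task_default_priority = 5
```

Autoscaling is a worker command-line option rather than a setting in this file. A worker could be started with something like `celery -A ocr_app worker --autoscale=16,8`, where ocr_app is a placeholder name and the maximum is given before the minimum.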

Full Changelog: v1.2.0...v1.3.0

1.2.0

05 Aug 13:18
58313db

What's Changed

Detailed Summary

  • Admins can create named OCR presets, which users can select and apply.
  • OCR can be configured and requested for a folder, which will process all of its direct document contents.
    • This works as a shortcut to configuring and requesting the OCR of each document. Each document is processed separately and results are made available in the same way.
    • Folder OCR configurations are stored with respect to the folder and only applied to requests directed at the entire folder. Configurations that have been set up for each document are stored and used when requesting an individual OCR of the specific document.
    • User-defined segments for each document are still considered in the batched request.
    • Sub-folders and their contents are ignored, and must be handled as folders with their own configuration.
  • Implementation of a basic API for third-party or command-line use.
  • TesserOCR can now also speed up processing by directly producing PDF, text, hOCR, and ALTO outputs for single-page files.
  • The engine modules for PyTesseract and TesserOCR now accept either a PIL image file or a filename as input.
  • The engine module for TesserOCR now accepts and uses all the same parameters as the one for PyTesseract.
  • Requesting a new OCR of an indexed file now removes its previous results from the Elasticsearch index.
  • The hOCR result is no longer shown unless it is among the output types in the OCR request. An hOCR file is still always used internally to generate results.
  • The OSD_ONLY segmentation option is removed because hOCR output is always expected internally. It may be allowed in the future.
  • Changes to the UI (titles, return button, loading circle when fetching info for LayoutMenu and EditingMenu)
  • Minor fixes to browser interface bugs.
  • Fixes to bugs that appeared on deployment when serving the app from a non-root path.
  • Security updates to dependency versions.

Full Changelog: v1.1.1...v1.2.0

1.1.1

16 Jul 15:00
9f4ab84

What's Changed

  • Bugfix: incorrect use of dict.get()
  • Set up REACT_APP_BASENAME and APP_BASENAME to properly route the admin functionalities when the app is deployed under a path other than /
  • Rate-limit the login endpoint to 1 request per second
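
As an illustration of the rate limit mentioned above, a limit of one request per second can be enforced by remembering when each client was last allowed through. The notes do not say how the limit is implemented or whether it applies per client or globally, so the class below (IntervalLimiter, a hypothetical name) is only a minimal sketch that assumes a per-key limit.

```python
import time


class IntervalLimiter:
    """Allow at most one call per `min_interval` seconds for each key."""

    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._last_allowed = {}

    def allow(self, key):
        now = self.clock()
        last = self._last_allowed.get(key)
        if last is not None and now - last < self.min_interval:
            return False  # too soon: the caller should reject the request
        self._last_allowed[key] = now
        return True
```

A login handler would call allow() with something like the client address and respond with an error status when it returns False.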

Full Changelog: v1.1.0...v1.1.1

1.1.0

14 Jul 15:29
7c5b951

What's Changed

Full Changelog: v1.0.0...v1.1.0

1.0.0

23 Jun 12:18
7c8d7d7

What's Changed

New Contributors

Full Changelog: v0.22.6...v1.0.0