
fix: Fix documents losing associated comments when changing storage path#349

Draft
pauliyobo wants to merge 15 commits into develop from fix/issue229

Conversation

@pauliyobo
Collaborator

Link to issue number:

closes #229

Summary of the issue:

Until now, annotations for a document were associated with it only through the document's title and storage URI.
While this has worked in most cases, changing the location of the original document would invalidate the association, even though neither the annotations nor the document itself had been modified.

Description of how this pull request fixes the issue:

This PR allows documents to also be found by their content hash.
This covers cases where the storage location has changed. It does not cover cases where the content of a document has been modified; however, as long as the storage location stays the same, there should be no change in behaviour.

Testing performed:

Manual and unit testing

Known issues with pull request:

While this isn't really an issue, I fear that the migration may take a significant amount of time for large collections. It would need to be tested.

PS: Sorry for the horrible review experience, ruff format touched way more than I'd have expected.

@cary-rowen I believe you were particularly affected by this issue. Any thoughts?
Also, I wonder if the user should be prompted whenever the storage location of the document differs from the actual path stored in the database.

@pauliyobo pauliyobo marked this pull request as draft November 15, 2025 00:40
@cary-rowen
Collaborator

Hello @pauliyobo
Great work! I like this.

You wrote:

Sorry for the horrible review experience, ruff format touched way more than I'd have expected.

Is it possible for you to put the ruff linting changes in a separate PR?

Also, I wonder if the user should be prompted whenever the storage location of the document differs from the actual path stored in the database.

I think showing a dialog might be better, but I haven't tested this yet. I hope to test it soon, it's so cool.

@pauliyobo
Collaborator Author

Is it possible for you to put the ruff linting changes in a separate PR?

Probably, though I accidentally ran format before committing the prior change, which makes things a bit complicated.

@cary-rowen cary-rowen marked this pull request as ready for review November 20, 2025 13:38
@cary-rowen cary-rowen marked this pull request as draft November 20, 2025 13:38
@cary-rowen cary-rowen self-assigned this Nov 20, 2025
@cary-rowen cary-rowen assigned pauliyobo and unassigned cary-rowen Nov 20, 2025
@cary-rowen cary-rowen closed this Nov 20, 2025
@cary-rowen cary-rowen reopened this Nov 20, 2025
@cary-rowen
Collaborator

Hello @pauliyobo
I downloaded the build of the current PR, but it seems that it cannot be started at all. Can you confirm it?

@pauliyobo
Collaborator Author

Hello
@cary-rowen it should be fixed now.

@cary-rowen
Collaborator

Thanks @pauliyobo
I installed the latest build and encountered the following error:

Failed to execute script 'Bookworm' due to unhandled exception: 'gbk' codec can't encode character '\u2022' in position 99: illegal multibyte sequence
INFO - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Starting Bookworm.
INFO - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Bookworm Version: 2025.1
INFO - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Python version: 3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
INFO - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Platform: Windows-10-10.0.19045-SP0
INFO - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
OS description: Windows 10 (build 19045), 64-bit edition
INFO - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Application architecture: x64
INFO - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Running an installed copy of Bookworm.
INFO - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Debug mode is off.
DEBUG - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Setting up application subsystems.
DEBUG - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Setting up the configuration subsystem.
DEBUG - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Setting up the internationalization subsystem.
DEBUG - bookworm.bookworm.i18n.core - 07/12/2025 20:00:31 - MainThread (4720):
Setting application locale to zh_CN.
DEBUG - bookworm.bookworm.i18n.wx_i18n - 07/12/2025 20:00:31 - MainThread (4720):
Setting wx locale to LocaleInfo(identifier="zh_CN");language=zh.
DEBUG - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Initializing the database subsystem.
DEBUG - bookworm.bookworm.database - 07/12/2025 20:00:31 - MainThread (4720):
Using url sqlite:///C:\Users\cary\AppData\Roaming\bookworm\database\database.sqlite 
INFO - bookworm.bookworm.database - 07/12/2025 20:00:31 - MainThread (4720):
Current revision is 52e39c4f7494
INFO - bookworm.bookworm.database - 07/12/2025 20:00:31 - MainThread (4720):
Running database migrations and setup

@pauliyobo
Collaborator Author

@cary-rowen apologies for the delay.
Can you try with the new build? It was likely an encoding problem caused by the fact that I was explicitly using UTF-8 when generating the content hash.

@cary-rowen
Collaborator

Hi @pauliyobo
Thanks for your work.

Same error:

Failed to execute script 'Bookworm' due to unhandled exception: 'gbk' codec can't encode character '\u2022' in position 99: illegal multibyte sequence
Traceback (most recent call last):
  File "C:\Program Files\Bookworm\_internal\alembic\versions\707543f03b6d_add_content_hash_column.py", line 31, in update_content_hashes
    document = create_document(doc.uri)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "bookworm\document\__init__.py", line 44, in create_document
  File "bookworm\document\formats\pdf.py", line 113, in read
  File "bookworm\document\formats\fitz.py", line 85, in read
  File "bookworm\document\base.py", line 192, in get_file_system_path
bookworm.document.exceptions.DocumentIOError: File D:\MyData\Desktop\Python编程:从入门到实践(第3版) ([美] 埃里克 • 马瑟斯(Eric Matthes)) (Z-Library).pdf does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bookworm\bootstrap.py", line 219, in run
  File "bookworm\bootstrap.py", line 199, in init_app_and_run_main_loop
  File "bookworm\bootstrap.py", line 157, in setupSubsystems
  File "bookworm\database\__init__.py", line 70, in init_database
  File "alembic\command.py", line 406, in upgrade
  File "alembic\script\base.py", line 582, in run_env
  File "alembic\util\pyfiles.py", line 95, in load_python_file
  File "alembic\util\pyfiles.py", line 113, in load_module_py
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "C:\Program Files\Bookworm\_internal\alembic\env.py", line 103, in <module>
    run_migrations_online()
  File "C:\Program Files\Bookworm\_internal\alembic\env.py", line 97, in run_migrations_online
    context.run_migrations()
  File "<string>", line 8, in run_migrations
  File "alembic\runtime\environment.py", line 946, in run_migrations
  File "alembic\runtime\migration.py", line 628, in run_migrations
  File "C:\Program Files\Bookworm\_internal\alembic\versions\707543f03b6d_add_content_hash_column.py", line 54, in upgrade
    update_content_hashes(session, Book)
  File "C:\Program Files\Bookworm\_internal\alembic\versions\707543f03b6d_add_content_hash_column.py", line 35, in update_content_hashes
    print(f"Failed to apply content hash to {doc.title}, {e}")
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 107: illegal multibyte sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "logging\__init__.py", line 1113, in emit
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 736: illegal multibyte sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "logging\handlers.py", line 75, in emit
  File "logging\__init__.py", line 1230, in emit
  File "logging\__init__.py", line 1118, in emit
  File "logging\__init__.py", line 1032, in handleError
  File "traceback.py", line 125, in print_exception
  File "traceback.py", line 1022, in print
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 99: illegal multibyte sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bookworm\bookworm.py", line 52, in main
  File "bookworm\bootstrap.py", line 228, in run
  File "logging\__init__.py", line 1536, in critical
  File "logging\__init__.py", line 1634, in _log
  File "logging\__init__.py", line 1644, in handle
  File "logging\__init__.py", line 1706, in callHandlers
  File "logging\__init__.py", line 978, in handle
  File "logging\handlers.py", line 77, in emit
  File "logging\__init__.py", line 1032, in handleError
  File "traceback.py", line 125, in print_exception
  File "traceback.py", line 1022, in print
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 99: illegal multibyte sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "logging\__init__.py", line 1113, in emit
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 635: illegal multibyte sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Bookworm.py", line 18, in <module>
  File "bookworm\bookworm.py", line 58, in main
  File "bookworm\bookworm.py", line 32, in report_fatal_error
  File "logging\__init__.py", line 1524, in exception
  File "logging\__init__.py", line 1518, in error
  File "logging\__init__.py", line 1634, in _log
  File "logging\__init__.py", line 1644, in handle
  File "logging\__init__.py", line 1706, in callHandlers
  File "logging\__init__.py", line 978, in handle
  File "logging\__init__.py", line 1118, in emit
  File "logging\__init__.py", line 1032, in handleError
  File "traceback.py", line 125, in print_exception
  File "traceback.py", line 1022, in print
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 99: illegal multibyte sequence

cary-rowen

This comment was marked as outdated.

@pauliyobo
Collaborator Author

@cary-rowen
Upon further investigation, the error itself originated from a missing implementation for the content hash generation. Can you try again?
The encoding error likely happens when the document title is printed in the related migration whenever that error occurs.
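One way to make the migration's error reporting robust against a console codepage that cannot represent every title character (GBK here chokes on the '\u2022' bullet) is to replace unencodable characters before printing. A minimal sketch with a hypothetical helper name, not the PR's actual fix:

```python
import sys


def safe_report(message):
    """Print a message without crashing when the console encoding
    (e.g. GBK on a Chinese-locale Windows) cannot represent every
    character: unencodable characters are replaced instead."""
    encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
    safe = message.encode(encoding, errors="replace").decode(encoding)
    print(safe)
```

The same `errors="replace"` idea applies to log handlers; logging the failure with `logger.warning` instead of `print` would also route it through Bookworm's own log files rather than the console.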

@cary-rowen
Collaborator

@pauliyobo
Sorry, I actually tried your latest commit, but the problem persisted. I investigated further locally and wrote code review suggestions; however, I forgot to post them.

@cary-rowen
Collaborator

Even after solving the above problems, I have discovered two other noteworthy issues:

  1. Migration time: I ran into an extreme case where some of the books recorded in the database are stored in OneDrive, so those files need to be downloaded first. The migration then takes a long time, including the time needed to download the files.
  2. Downgrade issue: if the migration is completed using the current build and the user then installs an earlier version, it fails with the following log:
03/01/2026 23:20:42 root CRITICAL: Failed to start Bookworm
03/01/2026 23:20:42 root CRITICAL: A fatal error has occured. Please check the log for more details.
The log has been written to the file:
C:\Users\cary\bookworm.errors.log
03/01/2026 23:20:42 root ERROR: ERROR DETAILS:
Traceback (most recent call last):
  File "alembic\script\base.py", line 250, in _catch_revision_errors
  File "alembic\script\base.py", line 458, in _upgrade_revs
  File "alembic\script\revision.py", line 814, in iterate_revisions
  File "alembic\script\revision.py", line 1475, in _collect_upgrade_revisions
  File "alembic\script\revision.py", line 542, in get_revisions
  File "alembic\script\revision.py", line 542, in <listcomp>
  File "alembic\script\revision.py", line 565, in get_revisions
  File "alembic\script\revision.py", line 566, in <genexpr>
  File "alembic\script\revision.py", line 637, in _revision_for_ident
alembic.script.revision.ResolutionError: No such revision or branch '707543f03b6d'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "bookworm\bookworm.py", line 52, in main
  File "bookworm\bootstrap.py", line 229, in run
  File "bookworm\bootstrap.py", line 219, in run
  File "bookworm\bootstrap.py", line 199, in init_app_and_run_main_loop
  File "bookworm\bootstrap.py", line 157, in setupSubsystems
  File "bookworm\database\__init__.py", line 69, in init_database
  File "alembic\command.py", line 406, in upgrade
  File "alembic\script\base.py", line 582, in run_env
  File "alembic\util\pyfiles.py", line 95, in load_python_file
  File "alembic\util\pyfiles.py", line 113, in load_module_py
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "C:\Program Files\Bookworm\_internal\alembic\env.py", line 92, in <module>
    run_migrations_online()
  File "C:\Program Files\Bookworm\_internal\alembic\env.py", line 86, in run_migrations_online
    context.run_migrations()
  File "<string>", line 8, in run_migrations
  File "alembic\runtime\environment.py", line 946, in run_migrations
  File "alembic\runtime\migration.py", line 616, in run_migrations
  File "alembic\command.py", line 395, in upgrade
  File "alembic\script\base.py", line 446, in _upgrade_revs
  File "contextlib.py", line 158, in __exit__
  File "alembic\script\base.py", line 282, in _catch_revision_errors
alembic.util.exc.CommandError: Can't locate revision identified by '707543f03b6d'

@pauliyobo
Collaborator Author

@cary-rowen
Regarding the first point, I can make it so the content hash is generated when the book is loaded, without populating the data from the migration itself. This would, however, require each document to be loaded at least once, rather than having the content hashes ready from the get-go. Alternatively, I could try generating the data in parallel.
For the second point, I don't really have a solution at the moment, other than documenting it.
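The lazy variant described here could look roughly like this; `BookRecord` and `ensure_content_hash` are hypothetical stand-ins for the real database model and loading code, not the PR's implementation:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BookRecord:
    # Hypothetical stand-in for the database model.
    path: str
    content_hash: Optional[str] = None


def ensure_content_hash(record: BookRecord,
                        compute: Callable[[str], str]) -> str:
    """Populate the hash lazily: compute it only the first time the
    document is opened, so the schema migration itself stays fast."""
    if record.content_hash is None:
        record.content_hash = compute(record.path)
    return record.content_hash
```

Called from the document-loading path, this makes the expensive hashing incremental: each book pays the cost once, on first open, instead of all books paying it during the migration.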

@pauliyobo
Collaborator Author

pauliyobo commented Jan 3, 2026

Apologies for the double comment. How likely is it that a user will update their version and then run an earlier version against the same database? Perhaps we could offer a mechanism to downgrade the database from the latest version back to the previous revision, but I don't know if it's worth the effort.

@cary-rowen
Collaborator

Regarding the first point, is it possible for us to design a migration dialog, even if it only shows progress, instead of working completely silently in the background? Working silently could give the user the impression that the software isn't running properly.
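A progress-reporting bulk scan of the kind suggested here could be wired up roughly like this (hypothetical names, with dict records standing in for database rows); the `report_progress` callback is where a dialog would hook in:

```python
def populate_hashes(records, compute_hash, report_progress):
    """Bulk-populate missing content hashes, reporting progress after
    each record so a dialog can show how far along the scan is."""
    total = len(records)
    for done, record in enumerate(records, start=1):
        if record.get("content_hash") is None:
            try:
                record["content_hash"] = compute_hash(record["path"])
            except OSError:
                pass  # missing or unreadable file: skip, retry next run
        report_progress(done, total)
```

Running this on a worker thread and having `report_progress` update a wx progress dialog would keep the UI responsive even when files have to be fetched from OneDrive first.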

@cary-rowen
Collaborator

On the second point, I don't think downgrading is a use case that should be supported.
By the way, just to make sure you saw my code suggestion: maybe you have a better way to solve the Chinese character problem.

@cary-rowen
Collaborator

Hey @pauliyobo
To keep this PR clean, I have reverted those files that contained only styling/whitespace changes. I verified that all logic changes (including the content_hash implementation and SQLite infrastructure updates) are preserved.

I'm sorry to have interfered with your work.

Co-authored-by: wencong <manchen_0528@outlook.com>
@pauliyobo
Collaborator Author

@cary-rowen
No problem at all, I appreciate the suggestion. I have gone ahead and included the ASCII code change.

@pauliyobo
Collaborator Author

I guess one other way could also be to just run the migration with only the schema changes and allow the user to manually populate the content hashes from the app itself. That should solve the long migration time and the progress issues.
@cary-rowen thoughts?

@cary-rowen
Collaborator

That sounds like a good plan. We could run the schema migration at startup but defer the hash calculation until the user opens a specific document.
It wouldn't need to be manually triggered by the user; it could just happen automatically in the background upon opening. This effectively avoids the issue of freezing the app due to long download times (e.g. from OneDrive) for the entire library.

@pauliyobo
Collaborator Author

Sorry, perhaps I was unclear.
By manual, I meant that perhaps we could add an option to populate all the content hashes, which is basically what the data migration currently does.
Even if this operation is never performed, the content hash would still be generated whenever a document is loaded, but of course you wouldn't benefit from this change until all the relevant documents have been opened at least once.

@cary-rowen
Collaborator

I understand. However, I suggest that we implement a popup dialog to ask the user if they want to perform this manual migration (e.g., after the update). This would greatly improve the feature's discoverability, ensuring users know they have the option to update everything at once.

Or, do you have other thoughts?

Conflicts resolved by moving 'blake3' dependency from requirements-app.txt to pyproject.toml.
@cary-rowen
Collaborator

cary-rowen commented Jan 16, 2026

Looking forward to your further updates @pauliyobo. Also, is it possible for us to release a new version in the near future? I'd like to consolidate database migration work into a single version.

@cary-rowen
Collaborator

Actually I'm really looking forward to this, if you need me to do anything, please let me know.



Development

Successfully merging this pull request may close these issues.

Comments and notes are lost after changing the save location of the document

2 participants