
fix: Fix documents losing associated comments when changing storage path#349

Draft
pauliyobo wants to merge 15 commits into develop from fix/issue229

Conversation

@pauliyobo
Collaborator

Link to issue number:

closes #229

Summary of the issue:

Until now, annotations for a document were associated with it only through the document's title and storage URI.
While this has worked in most cases, changing the location of the original document would invalidate the association, even though neither the annotations nor the document itself had been modified.

Description of how this pull request fixes the issue:

This PR allows documents to also be found by their content hash.
This covers cases where the storage location has changed. It does not cover cases where the content of a document has been modified; however, as long as the storage location stays the same, there should be no change in behaviour.

Testing performed:

Manual and unit testing

Known issues with pull request:

While this isn't really an issue, I fear that the migration may take a significant amount of time for large collections. It would need to be tested.

PS: Sorry for the horrible review experience, ruff format touched way more than I'd have expected.

@cary-rowen I believe you were particularly affected by this issue. Any thoughts?
Also, I wonder if the user should be prompted whenever the storage location of the document differs from the actual path stored in the database.

@pauliyobo pauliyobo marked this pull request as draft November 15, 2025 00:40
@cary-rowen
Collaborator

Hello @pauliyobo
Great work! I like this.

You wrote:

Sorry for the horrible review experience, ruff format touched way more than I'd have expected.

Is it possible for you to put the ruff linting changes in a separate PR?

Also, I wonder if the user should be prompted whenever the storage location of the document differs from the actual path stored in the database.

I think showing a dialog might be better, but I haven't tested this yet. I hope to test it soon, it's so cool.

@pauliyobo
Collaborator Author

Is it possible for you to put the ruff linting changes in a separate PR?

Probably, though I accidentally ran format before committing the prior change, which makes things a bit complicated.

@cary-rowen cary-rowen marked this pull request as ready for review November 20, 2025 13:38
@cary-rowen cary-rowen marked this pull request as draft November 20, 2025 13:38
@cary-rowen cary-rowen self-assigned this Nov 20, 2025
@cary-rowen cary-rowen assigned pauliyobo and unassigned cary-rowen Nov 20, 2025
@cary-rowen cary-rowen closed this Nov 20, 2025
@cary-rowen cary-rowen reopened this Nov 20, 2025
@cary-rowen
Collaborator

Hello @pauliyobo
I downloaded the build of the current PR, but it seems that it cannot be started at all. Can you confirm it?

@pauliyobo
Collaborator Author

Hello
@cary-rowen it should be fixed now.

@cary-rowen
Collaborator

Thanks @pauliyobo
I installed the latest build and encountered the following error:

Failed to execute script 'Bookworm' due to unhandled exception: 'gbk' codec can't encode character '\u2022' in position 99: illegal multibyte sequence
INFO - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Starting Bookworm.
INFO - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Bookworm Version: 2025.1
INFO - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Python version: 3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
INFO - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Platform: Windows-10-10.0.19045-SP0
INFO - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
OS description: Windows 10 (build 19045), 64-bit edition
INFO - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Application architecture: x64
INFO - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Running an installed copy of Bookworm.
INFO - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Debug mode is off.
DEBUG - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Setting up application subsystems.
DEBUG - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Setting up the configuration subsystem.
DEBUG - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Setting up the internationalization subsystem.
DEBUG - bookworm.bookworm.i18n.core - 07/12/2025 20:00:31 - MainThread (4720):
Setting application locale to zh_CN.
DEBUG - bookworm.bookworm.i18n.wx_i18n - 07/12/2025 20:00:31 - MainThread (4720):
Setting wx locale to LocaleInfo(identifier="zh_CN");language=zh.
DEBUG - bookworm.bookworm.bootstrap - 07/12/2025 20:00:31 - MainThread (4720):
Initializing the database subsystem.
DEBUG - bookworm.bookworm.database - 07/12/2025 20:00:31 - MainThread (4720):
Using url sqlite:///C:\Users\cary\AppData\Roaming\bookworm\database\database.sqlite 
INFO - bookworm.bookworm.database - 07/12/2025 20:00:31 - MainThread (4720):
Current revision is 52e39c4f7494
INFO - bookworm.bookworm.database - 07/12/2025 20:00:31 - MainThread (4720):
Running database migrations and setup

@pauliyobo
Collaborator Author

@cary-rowen apologies for the delay.
Can you try with the new build? It was likely an encoding problem caused by the fact that I was explicitly using UTF-8 when generating the content hash.

@cary-rowen
Collaborator

Hi @pauliyobo
Thanks for your work.

Same error:

Failed to execute script 'Bookworm' due to unhandled exception: 'gbk' codec can't encode character '\u2022' in position 99: illegal multibyte sequence
Traceback (most recent call last):
  File "C:\Program Files\Bookworm\_internal\alembic\versions\707543f03b6d_add_content_hash_column.py", line 31, in update_content_hashes
    document = create_document(doc.uri)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "bookworm\document\__init__.py", line 44, in create_document
  File "bookworm\document\formats\pdf.py", line 113, in read
  File "bookworm\document\formats\fitz.py", line 85, in read
  File "bookworm\document\base.py", line 192, in get_file_system_path
bookworm.document.exceptions.DocumentIOError: File D:\MyData\Desktop\Python编程:从入门到实践(第3版) ([美] 埃里克 • 马瑟斯(Eric Matthes)) (Z-Library).pdf does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bookworm\bootstrap.py", line 219, in run
  File "bookworm\bootstrap.py", line 199, in init_app_and_run_main_loop
  File "bookworm\bootstrap.py", line 157, in setupSubsystems
  File "bookworm\database\__init__.py", line 70, in init_database
  File "alembic\command.py", line 406, in upgrade
  File "alembic\script\base.py", line 582, in run_env
  File "alembic\util\pyfiles.py", line 95, in load_python_file
  File "alembic\util\pyfiles.py", line 113, in load_module_py
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "C:\Program Files\Bookworm\_internal\alembic\env.py", line 103, in <module>
    run_migrations_online()
  File "C:\Program Files\Bookworm\_internal\alembic\env.py", line 97, in run_migrations_online
    context.run_migrations()
  File "<string>", line 8, in run_migrations
  File "alembic\runtime\environment.py", line 946, in run_migrations
  File "alembic\runtime\migration.py", line 628, in run_migrations
  File "C:\Program Files\Bookworm\_internal\alembic\versions\707543f03b6d_add_content_hash_column.py", line 54, in upgrade
    update_content_hashes(session, Book)
  File "C:\Program Files\Bookworm\_internal\alembic\versions\707543f03b6d_add_content_hash_column.py", line 35, in update_content_hashes
    print(f"Failed to apply content hash to {doc.title}, {e}")
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 107: illegal multibyte sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "logging\__init__.py", line 1113, in emit
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 736: illegal multibyte sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "logging\handlers.py", line 75, in emit
  File "logging\__init__.py", line 1230, in emit
  File "logging\__init__.py", line 1118, in emit
  File "logging\__init__.py", line 1032, in handleError
  File "traceback.py", line 125, in print_exception
  File "traceback.py", line 1022, in print
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 99: illegal multibyte sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bookworm\bookworm.py", line 52, in main
  File "bookworm\bootstrap.py", line 228, in run
  File "logging\__init__.py", line 1536, in critical
  File "logging\__init__.py", line 1634, in _log
  File "logging\__init__.py", line 1644, in handle
  File "logging\__init__.py", line 1706, in callHandlers
  File "logging\__init__.py", line 978, in handle
  File "logging\handlers.py", line 77, in emit
  File "logging\__init__.py", line 1032, in handleError
  File "traceback.py", line 125, in print_exception
  File "traceback.py", line 1022, in print
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 99: illegal multibyte sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "logging\__init__.py", line 1113, in emit
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 635: illegal multibyte sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Bookworm.py", line 18, in <module>
  File "bookworm\bookworm.py", line 58, in main
  File "bookworm\bookworm.py", line 32, in report_fatal_error
  File "logging\__init__.py", line 1524, in exception
  File "logging\__init__.py", line 1518, in error
  File "logging\__init__.py", line 1634, in _log
  File "logging\__init__.py", line 1644, in handle
  File "logging\__init__.py", line 1706, in callHandlers
  File "logging\__init__.py", line 978, in handle
  File "logging\__init__.py", line 1118, in emit
  File "logging\__init__.py", line 1032, in handleError
  File "traceback.py", line 125, in print_exception
  File "traceback.py", line 1022, in print
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 99: illegal multibyte sequence

cary-rowen

This comment was marked as outdated.

@pauliyobo
Collaborator Author

@cary-rowen
Upon further investigation, the error itself originated from a missing implementation for the content hash generation. Can you try again?
The encoding error likely happens when the document title is printed in the related migration whenever that error occurs.
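One way to make the migration's error reporting robust against a console codepage that cannot represent every title character (GBK here chokes on the '\u2022' bullet) is to replace unencodable characters before printing. A minimal sketch with a hypothetical helper name, not the PR's actual fix:

```python
import sys


def safe_report(message):
    """Print a message without crashing when the console encoding
    (e.g. GBK on a Chinese-locale Windows) cannot represent every
    character: unencodable characters are replaced instead."""
    encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
    safe = message.encode(encoding, errors="replace").decode(encoding)
    print(safe)
```

The same `errors="replace"` idea applies to log handlers; logging the failure with `logger.warning` instead of `print` would also route it through Bookworm's own log files rather than the console.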

@cary-rowen
Collaborator

@pauliyobo
Sorry, I actually tried your latest commit, but the problem persisted. I investigated further locally and wrote code review suggestions; however, I forgot to post them.

@cary-rowen
Collaborator

Even after solving the above problems, I have discovered two other noteworthy issues:

  1. Migration time: I ran into an extreme case where some of the books recorded in the database are stored in OneDrive, so those files need to be downloaded first. The migration then takes a long time, including the time needed to download the files.
  2. Downgrade issue: if the migration is completed using the current build and the user then installs an earlier version, it fails with the following log:
03/01/2026 23:20:42 root CRITICAL: Failed to start Bookworm
03/01/2026 23:20:42 root CRITICAL: A fatal error has occured. Please check the log for more details.
The log has been written to the file:
C:\Users\cary\bookworm.errors.log
03/01/2026 23:20:42 root ERROR: ERROR DETAILS:
Traceback (most recent call last):
  File "alembic\script\base.py", line 250, in _catch_revision_errors
  File "alembic\script\base.py", line 458, in _upgrade_revs
  File "alembic\script\revision.py", line 814, in iterate_revisions
  File "alembic\script\revision.py", line 1475, in _collect_upgrade_revisions
  File "alembic\script\revision.py", line 542, in get_revisions
  File "alembic\script\revision.py", line 542, in <listcomp>
  File "alembic\script\revision.py", line 565, in get_revisions
  File "alembic\script\revision.py", line 566, in <genexpr>
  File "alembic\script\revision.py", line 637, in _revision_for_ident
alembic.script.revision.ResolutionError: No such revision or branch '707543f03b6d'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "bookworm\bookworm.py", line 52, in main
  File "bookworm\bootstrap.py", line 229, in run
  File "bookworm\bootstrap.py", line 219, in run
  File "bookworm\bootstrap.py", line 199, in init_app_and_run_main_loop
  File "bookworm\bootstrap.py", line 157, in setupSubsystems
  File "bookworm\database\__init__.py", line 69, in init_database
  File "alembic\command.py", line 406, in upgrade
  File "alembic\script\base.py", line 582, in run_env
  File "alembic\util\pyfiles.py", line 95, in load_python_file
  File "alembic\util\pyfiles.py", line 113, in load_module_py
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "C:\Program Files\Bookworm\_internal\alembic\env.py", line 92, in <module>
    run_migrations_online()
  File "C:\Program Files\Bookworm\_internal\alembic\env.py", line 86, in run_migrations_online
    context.run_migrations()
  File "<string>", line 8, in run_migrations
  File "alembic\runtime\environment.py", line 946, in run_migrations
  File "alembic\runtime\migration.py", line 616, in run_migrations
  File "alembic\command.py", line 395, in upgrade
  File "alembic\script\base.py", line 446, in _upgrade_revs
  File "contextlib.py", line 158, in __exit__
  File "alembic\script\base.py", line 282, in _catch_revision_errors
alembic.util.exc.CommandError: Can't locate revision identified by '707543f03b6d'

@pauliyobo
Collaborator Author

@cary-rowen
Regarding the first point, I can make it so the content hash is generated when the book is loaded, without populating the data from the migration itself. This would, however, require each document to be loaded at least once, rather than having the content hashes ready from the get-go. Alternatively, I could try generating the data in parallel.
For the second point, I don't really have a solution at the moment, other than documenting it.
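The lazy variant described here could look roughly like this; `BookRecord` and `ensure_content_hash` are hypothetical stand-ins for the real database model and loading code, not the PR's implementation:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BookRecord:
    # Hypothetical stand-in for the database model.
    path: str
    content_hash: Optional[str] = None


def ensure_content_hash(record: BookRecord,
                        compute: Callable[[str], str]) -> str:
    """Populate the hash lazily: compute it only the first time the
    document is opened, so the schema migration itself stays fast."""
    if record.content_hash is None:
        record.content_hash = compute(record.path)
    return record.content_hash
```

Called from the document-loading path, this makes the expensive hashing incremental: each book pays the cost once, on first open, instead of all books paying it during the migration.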

@pauliyobo
Collaborator Author

pauliyobo commented Jan 3, 2026

Apologies for the double comment. How likely is it that a user will update their version and then run an earlier version against the same database? Perhaps we could offer a mechanism to downgrade the database from the latest version back to the previous revision, but I don't know if it's worth the effort.

@cary-rowen
Collaborator

Regarding the first point, is it possible for us to design a migration dialog, even if it only shows progress, instead of working completely silently in the background? Working silently could give the user the impression that the software isn't running properly.
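A progress-reporting bulk scan of the kind suggested here could be wired up roughly like this (hypothetical names, with dict records standing in for database rows); the `report_progress` callback is where a dialog would hook in:

```python
def populate_hashes(records, compute_hash, report_progress):
    """Bulk-populate missing content hashes, reporting progress after
    each record so a dialog can show how far along the scan is."""
    total = len(records)
    for done, record in enumerate(records, start=1):
        if record.get("content_hash") is None:
            try:
                record["content_hash"] = compute_hash(record["path"])
            except OSError:
                pass  # missing or unreadable file: skip, retry next run
        report_progress(done, total)
```

Running this on a worker thread and having `report_progress` update a wx progress dialog would keep the UI responsive even when files have to be fetched from OneDrive first.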

@cary-rowen
Collaborator

On the second point, I don't think downgrading is a use case that should be supported.
By the way, just to make sure you saw my code suggestion: maybe you have a better way to solve the Chinese character problem.

@cary-rowen
Collaborator

Hey @pauliyobo
To keep this PR clean, I have reverted those files that contained only styling/whitespace changes. I verified that all logic changes (including the content_hash implementation and SQLite infrastructure updates) are preserved.

I'm sorry to have interfered with your work.

Co-authored-by: wencong <manchen_0528@outlook.com>
@pauliyobo
Collaborator Author

@cary-rowen
No problem at all, I appreciate the suggestion. I have gone ahead and included the ASCII code change.

@pauliyobo
Collaborator Author

I guess one other way could also be to just run the migration with only the schema changes and allow the user to manually populate the content hashes from the app itself. That should solve the long migration time and the progress issues.
@cary-rowen thoughts?

@cary-rowen
Collaborator

That sounds like a good plan. We could run the schema migration at startup but defer the hash calculation until the user opens a specific document.
It wouldn't need to be manually triggered by the user; it could just happen automatically in the background upon opening. This effectively avoids the issue of freezing the app due to long download times (e.g. from OneDrive) for the entire library.

@pauliyobo
Collaborator Author

Sorry, perhaps I was unclear.
By manual, I meant that perhaps we could add an option to populate all the content hashes, which is basically what the data migration currently does.
Even if this operation is never performed, the content hash would still be generated whenever a document is loaded, but of course you wouldn't benefit from this change until all the relevant documents have been opened at least once.

@cary-rowen
Collaborator

I understand. However, I suggest that we implement a popup dialog to ask the user if they want to perform this manual migration (e.g., after the update). This would greatly improve the feature's discoverability, ensuring users know they have the option to update everything at once.

Or, do you have other thoughts?

Conflicts resolved by moving 'blake3' dependency from requirements-app.txt to pyproject.toml.
@cary-rowen
Collaborator

cary-rowen commented Jan 16, 2026

Looking forward to your further updates @pauliyobo. Also, is it possible for us to release a new version in the near future? I'd like to consolidate database migration work into a single version.

@cary-rowen
Collaborator

Actually I'm really looking forward to this, if you need me to do anything, please let me know.



Development

Successfully merging this pull request may close these issues.

Comments and notes are lost after changing the save location of the document

2 participants