Upgrade torch, replace torchtext with fasttext, add docstrings, fix bugs #6

Open
Copilot wants to merge 2 commits into main from copilot/upgrade-torch-and-replace-torchtext

Copilot AI commented Apr 24, 2026

Drops the torchtext dependency (deprecated/unmaintained) and replaces it with direct fasttext-wheel. Upgrades to torch>=2.0. Adds docstrings throughout and fixes several bugs.

Dependency changes

  • torchtext → fasttext-wheel in install_requires
  • torch → torch>=2.0

torchtext → fasttext replacement (diacritics_dataset.py)

The old approach iterated over the entire fasttext vocabulary and averaged the vectors of diacritics variants. The new approach calls ft.get_word_vector(word) directly for each vocab token, which is simpler and also covers out-of-vocabulary (OOV) tokens through subword composition:

# Before (torchtext)
embedding = FastText("ro")
for word, index in embedding.stoi.items():
    word = remove_diacritics(word)
    if word in self.vocab["stoi"]:
        idx = self.vocab["stoi"][word]  # row of the diacritics-free token
        self.vocab["vectors"][idx] += embedding.vectors[index]

# After (fasttext)
ft = _load_fasttext_model()  # downloads cc.ro.300.bin to .model/ cache
for word, idx in self.vocab["stoi"].items():
    self.vocab["vectors"][idx] = torch.tensor(ft.get_word_vector(word))
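
The per-token lookup can be sketched without downloading the model. In the minimal sketch below, `StubModel` is a hypothetical stand-in for the object returned by `_load_fasttext_model()`: it only mimics `get_word_vector`, and plain lists stand in for the tensor rows. `build_vectors` is an illustrative name, not a function from this PR, and the sketch assumes the vocab indices are contiguous from 0.

```python
from typing import Dict, List, Protocol, Sequence


class WordVectorModel(Protocol):
    """Anything exposing get_word_vector, as a loaded fasttext model does."""

    def get_word_vector(self, word: str) -> Sequence[float]: ...


class StubModel:
    """Hypothetical stand-in for a loaded fasttext model (assumption)."""

    def __init__(self, dim: int = 4) -> None:
        self.dim = dim

    def get_word_vector(self, word: str) -> List[float]:
        # Deterministic fake vector derived from the sum of character codes.
        seed = sum(ord(ch) for ch in word)
        return [((seed + i) % 7) / 7.0 for i in range(self.dim)]


def build_vectors(
    stoi: Dict[str, int], model: WordVectorModel, dim: int
) -> List[List[float]]:
    # One row per vocab token; assumes indices run from 0 to len(stoi) - 1.
    vectors = [[0.0] * dim for _ in range(len(stoi))]
    for word, idx in stoi.items():
        # One lookup per token, instead of scanning the model's vocabulary.
        vectors[idx] = list(model.get_word_vector(word))
    return vectors


stoi = {"casa": 0, "masa": 1, "tara": 2}
vectors = build_vectors(stoi, StubModel(dim=4), dim=4)
```

With a real model the loop has the same shape; only the row type changes (a tensor built from the returned array instead of a list).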

Bug fixes

  • diacritics_inference.py: if device == "cpu" compared a torch.device to a str and was always False, so the GPU warning never fired. Fixed to device.type == "cpu".
  • diacritics_inference.py: Added explicit weights_only=False to torch.load. The checkpoint contains pickled Python objects, and PyTorch 2.6 changed the default to weights_only=True, so omitting the argument makes loading fail on newer releases.
  • diacritics_dataset.py: Unclosed file handle in load_texts pickle branch replaced with context manager.
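
The file-handle fix can be illustrated with a minimal sketch. `load_pickled_texts` is a hypothetical name used only here (the PR's function is `load_texts`, whose full signature is not shown in this description), and the sample file is written to a temporary directory so the snippet runs on its own.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any


def load_pickled_texts(path: Path) -> Any:
    # The with-block closes the handle even if pickle.load raises.
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "texts.pkl"
    # Write a small sample file so there is something to load.
    with open(sample_path, "wb") as handle:
        pickle.dump(["Ana are mere.", "Casa este mare."], handle)
    texts = load_pickled_texts(sample_path)
```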

Pythonic / optimizations

  • diacritics_utils.py: Regex patterns pre-compiled at module level (previously recompiled on every call).
  • diacritics_dataset.py: PunktSentenceTokenizer instantiated once as self._sent_tokenizer; __iter__ uses yield from; removed unused text_plain variable.
  • diacritics_inference.py: Extracted inline anonymous DS(IterableDataset) class to module-level _TensorListDataset; .get() for dict access.
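
As a rough illustration of the module-level pattern, the sketch below compiles one regex at import time and reuses it on every call. The pattern, the `_PLAIN` mapping, and this body of `remove_diacritics` are assumptions made for the example; the actual patterns in diacritics_utils.py are not shown in this description.

```python
import re

# Compiled once at import time; every call reuses the same pattern object.
_DIACRITICS_RE = re.compile("[ăâîșşțţĂÂÎȘŞȚŢ]")

# Maps each Romanian diacritic (comma-below and cedilla forms) to its base letter.
_PLAIN = {
    "ă": "a", "â": "a", "î": "i", "ș": "s", "ş": "s", "ț": "t", "ţ": "t",
    "Ă": "A", "Â": "A", "Î": "I", "Ș": "S", "Ş": "S", "Ț": "T", "Ţ": "T",
}


def remove_diacritics(text: str) -> str:
    # Replace every matched diacritic with its plain counterpart.
    return _DIACRITICS_RE.sub(lambda m: _PLAIN[m.group(0)], text)
```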

Docstrings

Added module, class, and method docstrings to all five source files.

Copilot AI and others added 2 commits April 24, 2026 10:54
…rings, fix bugs, pythonic improvements

Agent-Logs-Url: https://github.com/AndyTheFactory/RO-Diacritics/sessions/42b8a0f4-b9a9-4152-8b7f-13083dd68e9b

Co-authored-by: AndyTheFactory <863810+AndyTheFactory@users.noreply.github.com>
…nizer, rename lines to sentence_spans

Agent-Logs-Url: https://github.com/AndyTheFactory/RO-Diacritics/sessions/42b8a0f4-b9a9-4152-8b7f-13083dd68e9b

Co-authored-by: AndyTheFactory <863810+AndyTheFactory@users.noreply.github.com>