Upgrade torch, replace torchtext with fasttext, add docstrings, fix bugs by Copilot · Pull Request #6 · AndyTheFactory/RO-Diacritics

Copilot · 2026-04-24T11:23:26Z

Drops the deprecated, unmaintained torchtext dependency and uses fasttext-wheel directly instead. Raises the torch requirement to torch>=2.0, adds docstrings throughout, and fixes several bugs.

Dependency changes

torchtext → fasttext-wheel in install_requires
torch → torch>=2.0

torchtext → fasttext replacement (`diacritics_dataset.py`)

The old approach iterated over the entire fasttext vocabulary and averaged the vectors of diacritics variants that collapse to the same stripped word. The new approach calls ft.get_word_vector(word) directly for each vocab token, which is simpler and also covers out-of-vocabulary (OOV) words via subword composition:

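# Before (torchtext)
embedding = FastText("ro")
for word, index in embedding.stoi.items():
    word = remove_diacritics(word)
    if word in self.vocab["stoi"]:
        # Row of the stripped word in our own vocab (lookup assumed; not shown in the original snippet)
        idx = self.vocab["stoi"][word]
        # Accumulate this variant's vector (the averaging step is not shown here)
        self.vocab["vectors"][idx] += embedding.vectors[index]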

# After (fasttext)
ft = _load_fasttext_model()  # downloads cc.ro.300.bin to .model/ cache
for word, idx in self.vocab["stoi"].items():
    self.vocab["vectors"][idx] = torch.tensor(ft.get_word_vector(word))
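The per-token lookup above can be sketched without downloading the real model. This is a minimal illustration, not the PR's code: `build_embedding_matrix` and `FakeSubwordModel` are hypothetical names, and the stand-in only imitates the one property that matters here, namely that `get_word_vector` returns a vector for any string, including words the model has never seen.

```python
from typing import Callable, Dict, List, Sequence


def build_embedding_matrix(
    stoi: Dict[str, int],
    get_vector: Callable[[str], Sequence[float]],
    dim: int,
) -> List[List[float]]:
    """Fill one row per vocabulary token by querying the model directly."""
    matrix = [[0.0] * dim for _ in range(len(stoi))]
    for word, idx in stoi.items():
        vector = list(get_vector(word))
        if len(vector) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(vector)}")
        matrix[idx] = vector
    return matrix


class FakeSubwordModel:
    """Hypothetical stand-in for a fastText model (illustration only)."""

    def __init__(self, dim: int = 4) -> None:
        self.dim = dim

    def get_word_vector(self, word: str) -> List[float]:
        # Build the vector from character trigrams, so unseen words still get one.
        padded = f"<{word}>"
        grams = [padded[i:i + 3] for i in range(len(padded) - 2)] or [padded]
        vec = [0.0] * self.dim
        for gram in grams:
            code_sum = sum(map(ord, gram))
            for j in range(self.dim):
                vec[j] += (code_sum * (j + 1)) % 7
        return [v / len(grams) for v in vec]


# Usage: every vocab token gets a row, whether or not the model "knows" it.
stoi = {"<pad>": 0, "masa": 1, "si": 2}
model = FakeSubwordModel(dim=4)
vectors = build_embedding_matrix(stoi, model.get_word_vector, model.dim)
```

With the real library, the callable would be the `get_word_vector` method of the loaded fastText model, and the rows would be converted to tensors as in the "After" snippet.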

Bug fixes

diacritics_inference.py: if device == "cpu" compared a torch.device to a str, so it was always False and the GPU warning never fired. Fixed to device.type == "cpu".
diacritics_inference.py: Added explicit weights_only=False to torch.load. The checkpoint contains pickled Python objects, so loading it fails under the weights_only=True default that PyTorch adopted in 2.6.
diacritics_dataset.py: The pickle branch of load_texts left its file handle unclosed; it now uses a context manager.
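The file-handle fix follows the usual context-manager pattern. The sketch below is an assumption about its shape, not the actual load_texts code: `load_pickled_texts` is a hypothetical helper, and the temporary file exists only to make the example runnable.

```python
import pickle
import tempfile
from pathlib import Path
from typing import List


def load_pickled_texts(path: Path) -> List[str]:
    """Load a pickled list of texts; the handle is closed even if unpickling fails."""
    with path.open("rb") as handle:
        return pickle.load(handle)


# Usage: write a small sample file, then read it back through the helper.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "texts.pkl"
    with sample_path.open("wb") as handle:
        pickle.dump(["Ana are mere.", "Si pere."], handle)
    loaded_texts = load_pickled_texts(sample_path)
```

A bare `pickle.load(open(path, "rb"))` leaves closing the file to the garbage collector, whereas the `with` block closes it deterministically.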

Pythonic / optimizations

diacritics_utils.py: Regex patterns pre-compiled at module level (previously recompiled on every call).
diacritics_dataset.py: PunktSentenceTokenizer instantiated once as self._sent_tokenizer; __iter__ uses yield from; removed unused text_plain variable.
diacritics_inference.py: Moved the inline DS(IterableDataset) class to a module-level _TensorListDataset; dict lookups now use .get().
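Two of the patterns listed above, module-level compiled regexes and `yield from` in `__iter__`, can be illustrated in a few lines. This is a sketch under assumptions: the patterns, `collapse_whitespace`, and `TokenStream` are hypothetical and are not taken from diacritics_utils.py or diacritics_dataset.py.

```python
import re
from typing import Iterable, Iterator, List

# Compiled once at import time and reused by every call.
_WHITESPACE_RE = re.compile(r"\s+")
_TOKEN_RE = re.compile(r"\w+")


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with one space and trim both ends."""
    return _WHITESPACE_RE.sub(" ", text).strip()


class TokenStream:
    """Iterates over the tokens of several documents, one document at a time."""

    def __init__(self, documents: Iterable[str]) -> None:
        self._documents: List[str] = list(documents)

    def _tokens(self, document: str) -> Iterator[str]:
        for match in _TOKEN_RE.finditer(document):
            yield match.group(0)

    def __iter__(self) -> Iterator[str]:
        for document in self._documents:
            # Delegate to the per-document generator instead of looping manually.
            yield from self._tokens(collapse_whitespace(document))


# Usage
tokens = list(TokenStream(["Ana  are\tmere", "si pere"]))
```

The `re` module keeps an internal cache of recently compiled patterns, so the gain from precompiling is mostly skipping the cache lookup on each call and keeping the patterns named in one place.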

Docstrings

Added module, class, and method docstrings to all five source files.

…rings, fix bugs, pythonic improvements Agent-Logs-Url: https://github.com/AndyTheFactory/RO-Diacritics/sessions/42b8a0f4-b9a9-4152-8b7f-13083dd68e9b Co-authored-by: AndyTheFactory <863810+AndyTheFactory@users.noreply.github.com>

…nizer, rename lines to sentence_spans Agent-Logs-Url: https://github.com/AndyTheFactory/RO-Diacritics/sessions/42b8a0f4-b9a9-4152-8b7f-13083dd68e9b Co-authored-by: AndyTheFactory <863810+AndyTheFactory@users.noreply.github.com>

Copilot AI and others added 2 commits April 24, 2026 10:54

Copilot AI assigned Copilot and AndyTheFactory Apr 24, 2026

Copilot created this pull request from a session on behalf of AndyTheFactory April 24, 2026 11:23

AndyTheFactory marked this pull request as ready for review April 24, 2026 11:23

Copilot wants to merge 2 commits into main from copilot/upgrade-torch-and-replace-torchtext
