[WIP] Update main.py for improved Заміни decoding quality by Copilot · Pull Request #5 · QmFkLVBp/decryptoinator

Copilot · 2025-12-01T22:02:05Z

Original prompt

Scope
Update ONLY main.py to significantly improve the "Заміни" decoding quality while keeping existing UI structure, function names, and workflow intact. You approved minor UI hints and labels; we will add small non-breaking text/tooltips and result header info. No new widgets beyond simple labels/tooltips created within the existing frames.

Core goals

Mixed numeric tokens (1- and 2-digit) both represent 1 letter. Maintain longest-match-first (2-digit priority) and greedy grouping for long digit runs without breaking text when a token is unknown.

Strongly improve readability by upgrading scoring to reflect real Ukrainian language constraints and your provided frequency tables.

Enhance hillclimb auto replacement to converge on plausible Ukrainian plaintext using token-aware selective swaps and multi-start seeding.

Minor UI improvements: inline hints for multi-digit tokens, a compact reference note for UA bigrams/rules, and a result header that shows Score and a few diagnostics (vowel ratio, bad repeats, max consonant-run). No changes to button wiring or widget hierarchy.

Algorithmic updates (main.py):

Replace/extend _subst_score_plaintext to include:

Updated UA letter frequencies from your table (mono-frequency list).

Vowel vs consonant ratio target (≈41% vowels / 59% consonants) with penalty for deviation.

Forbidden pattern penalties (ЬЬ, ЙЙ, long repeated letters like ГГГГ, ЖЖЖ, ФФФ). Heavy penalty ≥4 identical letters; disallow 5+ consonants in a row (strong penalty), light penalty for 4, none for ≤3.

Word-start penalties: words starting with Ь (severe), Й (strong). Soft penalties for И/Є/Ю/Ї starts.

Syllable shape rewards/penalties: reward CV and VC; penalize VV; moderate CC.

Rewards for common UA bigrams and trigrams (НА, ОН, СТ, ПР, РО, ЗА, ТА, ЛИ; ПРО, ПРИ, СТО, НИК, АНИ, ЕНН).

Retain chi-squared frequency component and bigram log-probabilities; combine into a single score.

Update auto_suggest_substitution / subst_hillclimb pipelines:

Frequency-based seeding for digit tokens using your 6.2 table (very high -> vowels; high -> N/T/R/S/L/V; medium -> L/K/M/D/P/U; low -> rare consonants).

Token-aware selective swaps: swap within digit-token mappings or within char mappings primarily; occasional cross swaps to escape local minima.

Multi-start: frequency-seeded, random, and bigram-biased.

Early-stopping when score stagnates; return best mapping and plaintext.

Keep existing tokenizers and mixed application (tokenize_digits_with_mapping, tokenize_text_two_digit_mode, apply_mapping_mixed_tokens, apply_substitution_mapping, detokenize_apply_mapping) as previously updated for longest-match-first and case-aware char mapping.

Minor UI improvements (allowed):

Add a short inline hint near the multi-digit checkbox: "Підказка: якщо у мапі є 7 і 78 → токен 78 має пріоритет." (no new complex UI elements, simple label).

Under bigram frequency area, add a small static text note listing top UA bigrams for reference.

In the result output header, append diagnostic info: "Score=..., Vowels=...%, BadRepeats=..., MaxCCRun=...". These are computed from the current plaintext.

Acceptance examples:

Numeric tokens: mapping {"12":"П","34":"Р","5":"И","67":"К","8":"А"}, text "12 34 5 678" -> "ПРИКА" (spaces preserved if present). Long runs greedily grouped with longest-match-first; unknown tokens left intact.

Case mapping: {"A":"x","b":"y"}, text "ABC abc" -> "xYC xyc".

Readability: Auto replacement produces Ukrainian-like plaintext (visible words, proper vowel placement, no forbidden repeats), score shown with diagnostics.

Constraints

Only change main.py.

Keep all function names and signatures unchanged.

Do not alter menu bindings, commands, or external modules.

Minor UI text changes are allowed per your approval; no heavy refactors.

Please create a branch and open a PR with these changes; include concise descriptions of scoring updates and UI hints.

This pull request was created as a result of the following prompt from Copilot chat.

Scope
Update ONLY main.py to significantly improve the "Заміни" decoding quality while keeping existing UI structure, function names, and workflow intact. You approved minor UI hints and labels; we will add small non-breaking text/tooltips and result header info. No new widgets beyond simple labels/tooltips created within the existing frames.

Core goals

Mixed numeric tokens (1- and 2-digit) both represent 1 letter. Maintain longest-match-first (2-digit priority) and greedy grouping for long digit runs without breaking text when a token is unknown.

Strongly improve readability by upgrading scoring to reflect real Ukrainian language constraints and your provided frequency tables.

Enhance hillclimb auto replacement to converge on plausible Ukrainian plaintext using token-aware selective swaps and multi-start seeding.

Minor UI improvements: inline hints for multi-digit tokens, a compact reference note for UA bigrams/rules, and a result header that shows Score and a few diagnostics (vowel ratio, bad repeats, max consonant-run). No changes to button wiring or widget hierarchy.

Algorithmic updates (main.py):

Replace/extend _subst_score_plaintext to include:

Updated UA letter frequencies from your table (mono-frequency list).

Vowel vs consonant ratio target (≈41% vowels / 59% consonants) with penalty for deviation.

Forbidden pattern penalties (ЬЬ, ЙЙ, long repeated letters like ГГГГ, ЖЖЖ, ФФФ). Heavy penalty ≥4 identical letters; disallow 5+ consonants in a row (strong penalty), light penalty for 4, none for ≤3.

Word-start penalties: words starting with Ь (severe), Й (strong). Soft penalties for И/Є/Ю/Ї starts.

Syllable shape rewards/penalties: reward CV and VC; penalize VV; moderate CC.

Rewards for common UA bigrams and trigrams (НА, ОН, СТ, ПР, РО, ЗА, ТА, ЛИ; ПРО, ПРИ, СТО, НИК, АНИ, ЕНН).

Retain chi-squared frequency component and bigram log-probabilities; combine into a single score.

Update auto_suggest_substitution / subst_hillclimb pipelines:

Frequency-based seeding for digit tokens using your 6.2 table (very high -> vowels; high -> N/T/R/S/L/V; medium -> L/K/M/D/P/U; low -> rare consonants).

Token-aware selective swaps: swap within digit-token mappings or within char mappings primarily; occasional cross swaps to escape local minima.

Multi-start: frequency-seeded, random, and bigram-biased.

Early-stopping when score stagnates; return best mapping and plaintext.

Keep existing tokenizers and mixed application (tokenize_digits_with_mapping, tokenize_text_two_digit_mode, apply_mapping_mixed_tokens, apply_substitution_mapping, detokenize_apply_mapping) as previously updated for longest-match-first and case-aware char mapping.

Minor UI improvements (allowed):

Add a short inline hint near the multi-digit checkbox: "Підказка: якщо у мапі є 7 і 78 → токен 78 має пріоритет." (no new complex UI elements, simple label).

Under bigram frequency area, add a small static text note listing top UA bigrams for reference.

In the result output header, append diagnostic info: "Score=..., Vowels=...%, BadRepeats=..., MaxCCRun=...". These are computed from the current plaintext.

Acceptance examples:

Numeric tokens: mapping {"12":"П","34":"Р","5":"И","67":"К","8":"А"}, text "12 34 5 678" -> "ПРИКА" (spaces preserved if present). Long runs greedily grouped with longest-match-first; unknown tokens left intact.

Case mapping: {"A":"x","b":"y"}, text "ABC abc" -> "xYC xyc".

Readability: Auto replacement produces Ukrainian-like plaintext (visible words, proper vowel placement, no forbidden repeats), score shown with diagnostics.

Constraints

Only change main.py.

Keep all function names and signatures unchanged.

Do not alter menu bindings, commands, or external modules.

Minor UI text changes are allowed per your approval; no heavy refactors.

Please create a branch and open a PR with these changes; include concise descriptions of scoring updates and UI hints.

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Initial plan

4ce2bcf

Copilot AI assigned Copilot and QmFkLVBp Dec 1, 2025

Copilot started work on behalf of QmFkLVBp December 1, 2025 22:02 View session

QmFkLVBp marked this pull request as ready for review December 1, 2025 22:04

QmFkLVBp merged commit 9661556 into main Dec 1, 2025
1 check failed

Copilot AI requested a review from QmFkLVBp December 1, 2025 22:04

Copilot stopped work on behalf of QmFkLVBp due to an error December 1, 2025 22:04
Copilot has encountered an error. See logs for additional details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[WIP] Update main.py for improved Заміни decoding quality#5

[WIP] Update main.py for improved Заміни decoding quality#5
QmFkLVBp merged 1 commit into
mainfrom
copilot/improve-decoding-quality

Copilot AI commented Dec 1, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Copilot AI commented Dec 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Dec 1, 2025 •

edited

Loading