Skip to content

[WIP] Update main.py for improved Заміни decoding quality#5

Merged
QmFkLVBp merged 1 commit into
mainfrom
copilot/improve-decoding-quality
Dec 1, 2025
Merged

[WIP] Update main.py for improved Заміни decoding quality#5
QmFkLVBp merged 1 commit into
mainfrom
copilot/improve-decoding-quality

Conversation

Copilot AI commented Dec 1, 2025

Copy link
Copy Markdown
Contributor
  • Understand the existing codebase and substitution cipher implementation
  • Update _subst_score_plaintext function with improved Ukrainian scoring:
    • Add updated Ukrainian letter frequencies from provided table
    • Add vowel/consonant ratio target (~41% vowels / 59% consonants) with penalty
    • Add forbidden pattern penalties (ЬЬ, ЙЙ, long repeated letters, 5+ consonants)
    • Add word-start penalties (Ь severe, Й strong, И/Є/Ю/Ї soft)
    • Add syllable shape rewards/penalties (CV, VC rewards; VV, CC penalties)
    • Add rewards for common UA bigrams and trigrams
    • Combine chi-squared frequency + bigram log-probs into single score
  • Update auto_suggest_substitution / subst_hillclimb pipelines:
    • Add frequency-based seeding for digit tokens
    • Add token-aware selective swaps
    • Add multi-start seeding (frequency, random, bigram-biased)
    • Add early-stopping when score stagnates
  • Add minor UI improvements:
    • Add inline hint near multi-digit checkbox
    • Add reference note for top UA bigrams under bigram frequency area
    • Add diagnostic info in result output header (Score, Vowels%, BadRepeats, MaxCCRun)
  • Test changes manually
  • Run code review and address feedback
  • Run CodeQL security check
Original prompt

Scope
Update ONLY main.py to significantly improve the "Заміни" decoding quality while keeping existing UI structure, function names, and workflow intact. You approved minor UI hints and labels; we will add small non-breaking text/tooltips and result header info. No new widgets beyond simple labels/tooltips created within the existing frames.

Core goals

  1. Mixed numeric tokens (1- and 2-digit) both represent 1 letter. Maintain longest-match-first (2-digit priority) and greedy grouping for long digit runs without breaking text when a token is unknown.
  2. Strongly improve readability by upgrading scoring to reflect real Ukrainian language constraints and your provided frequency tables.
  3. Enhance hillclimb auto replacement to converge on plausible Ukrainian plaintext using token-aware selective swaps and multi-start seeding.
  4. Minor UI improvements: inline hints for multi-digit tokens, a compact reference note for UA bigrams/rules, and a result header that shows Score and a few diagnostics (vowel ratio, bad repeats, max consonant-run). No changes to button wiring or widget hierarchy.

Algorithmic updates (main.py):

  • Replace/extend _subst_score_plaintext to include:
    • Updated UA letter frequencies from your table (mono-frequency list).
    • Vowel vs consonant ratio target (≈41% vowels / 59% consonants) with penalty for deviation.
    • Forbidden pattern penalties (ЬЬ, ЙЙ, long repeated letters like ГГГГ, ЖЖЖ, ФФФ). Heavy penalty ≥4 identical letters; disallow 5+ consonants in a row (strong penalty), light penalty for 4, none for ≤3.
    • Word-start penalties: words starting with Ь (severe), Й (strong). Soft penalties for И/Є/Ю/Ї starts.
    • Syllable shape rewards/penalties: reward CV and VC; penalize VV; moderate CC.
    • Rewards for common UA bigrams and trigrams (НА, ОН, СТ, ПР, РО, ЗА, ТА, ЛИ; ПРО, ПРИ, СТО, НИК, АНИ, ЕНН).
    • Retain chi-squared frequency component and bigram log-probabilities; combine into a single score.
  • Update auto_suggest_substitution / subst_hillclimb pipelines:
    • Frequency-based seeding for digit tokens using your 6.2 table (very high -> vowels; high -> N/T/R/S/L/V; medium -> L/K/M/D/P/U; low -> rare consonants).
    • Token-aware selective swaps: swap within digit-token mappings or within char mappings primarily; occasional cross swaps to escape local minima.
    • Multi-start: frequency-seeded, random, and bigram-biased.
    • Early-stopping when score stagnates; return best mapping and plaintext.
  • Keep existing tokenizers and mixed application (tokenize_digits_with_mapping, tokenize_text_two_digit_mode, apply_mapping_mixed_tokens, apply_substitution_mapping, detokenize_apply_mapping) as previously updated for longest-match-first and case-aware char mapping.

Minor UI improvements (allowed):

  • Add a short inline hint near the multi-digit checkbox: "Підказка: якщо у мапі є 7 і 78 → токен 78 має пріоритет." (no new complex UI elements, simple label).
  • Under bigram frequency area, add a small static text note listing top UA bigrams for reference.
  • In the result output header, append diagnostic info: "Score=..., Vowels=...%, BadRepeats=..., MaxCCRun=...". These are computed from the current plaintext.

Acceptance examples:

  • Numeric tokens: mapping {"12":"П","34":"Р","5":"И","67":"К","8":"А"}, text "12 34 5 678" -> "ПРИКА" (spaces preserved if present). Long runs greedily grouped with longest-match-first; unknown tokens left intact.
  • Case mapping: {"A":"x","b":"y"}, text "ABC abc" -> "xYC xyc".
  • Readability: Auto replacement produces Ukrainian-like plaintext (visible words, proper vowel placement, no forbidden repeats), score shown with diagnostics.

Constraints

  • Only change main.py.
  • Keep all function names and signatures unchanged.
  • Do not alter menu bindings, commands, or external modules.
  • Minor UI text changes are allowed per your approval; no heavy refactors.

Please create a branch and open a PR with these changes; include concise descriptions of scoring updates and UI hints.

This pull request was created as a result of the following prompt from Copilot chat.

Scope
Update ONLY main.py to significantly improve the "Заміни" decoding quality while keeping existing UI structure, function names, and workflow intact. You approved minor UI hints and labels; we will add small non-breaking text/tooltips and result header info. No new widgets beyond simple labels/tooltips created within the existing frames.

Core goals

  1. Mixed numeric tokens (1- and 2-digit) both represent 1 letter. Maintain longest-match-first (2-digit priority) and greedy grouping for long digit runs without breaking text when a token is unknown.
  2. Strongly improve readability by upgrading scoring to reflect real Ukrainian language constraints and your provided frequency tables.
  3. Enhance hillclimb auto replacement to converge on plausible Ukrainian plaintext using token-aware selective swaps and multi-start seeding.
  4. Minor UI improvements: inline hints for multi-digit tokens, a compact reference note for UA bigrams/rules, and a result header that shows Score and a few diagnostics (vowel ratio, bad repeats, max consonant-run). No changes to button wiring or widget hierarchy.

Algorithmic updates (main.py):

  • Replace/extend _subst_score_plaintext to include:
    • Updated UA letter frequencies from your table (mono-frequency list).
    • Vowel vs consonant ratio target (≈41% vowels / 59% consonants) with penalty for deviation.
    • Forbidden pattern penalties (ЬЬ, ЙЙ, long repeated letters like ГГГГ, ЖЖЖ, ФФФ). Heavy penalty ≥4 identical letters; disallow 5+ consonants in a row (strong penalty), light penalty for 4, none for ≤3.
    • Word-start penalties: words starting with Ь (severe), Й (strong). Soft penalties for И/Є/Ю/Ї starts.
    • Syllable shape rewards/penalties: reward CV and VC; penalize VV; moderate CC.
    • Rewards for common UA bigrams and trigrams (НА, ОН, СТ, ПР, РО, ЗА, ТА, ЛИ; ПРО, ПРИ, СТО, НИК, АНИ, ЕНН).
    • Retain chi-squared frequency component and bigram log-probabilities; combine into a single score.
  • Update auto_suggest_substitution / subst_hillclimb pipelines:
    • Frequency-based seeding for digit tokens using your 6.2 table (very high -> vowels; high -> N/T/R/S/L/V; medium -> L/K/M/D/P/U; low -> rare consonants).
    • Token-aware selective swaps: swap within digit-token mappings or within char mappings primarily; occasional cross swaps to escape local minima.
    • Multi-start: frequency-seeded, random, and bigram-biased.
    • Early-stopping when score stagnates; return best mapping and plaintext.
  • Keep existing tokenizers and mixed application (tokenize_digits_with_mapping, tokenize_text_two_digit_mode, apply_mapping_mixed_tokens, apply_substitution_mapping, detokenize_apply_mapping) as previously updated for longest-match-first and case-aware char mapping.

Minor UI improvements (allowed):

  • Add a short inline hint near the multi-digit checkbox: "Підказка: якщо у мапі є 7 і 78 → токен 78 має пріоритет." (no new complex UI elements, simple label).
  • Under bigram frequency area, add a small static text note listing top UA bigrams for reference.
  • In the result output header, append diagnostic info: "Score=..., Vowels=...%, BadRepeats=..., MaxCCRun=...". These are computed from the current plaintext.

Acceptance examples:

  • Numeric tokens: mapping {"12":"П","34":"Р","5":"И","67":"К","8":"А"}, text "12 34 5 678" -> "ПРИКА" (spaces preserved if present). Long runs greedily grouped with longest-match-first; unknown tokens left intact.
  • Case mapping: {"A":"x","b":"y"}, text "ABC abc" -> "xYC xyc".
  • Readability: Auto replacement produces Ukrainian-like plaintext (visible words, proper vowel placement, no forbidden repeats), score shown with diagnostics.

Constraints

  • Only change main.py.
  • Keep all function names and signatures unchanged.
  • Do not alter menu bindings, commands, or external modules.
  • Minor UI text changes are allowed per your approval; no heavy refactors.

Please create a branch and open a PR with these changes; include concise descriptions of scoring updates and UI hints.


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

@QmFkLVBp QmFkLVBp marked this pull request as ready for review December 1, 2025 22:04
@QmFkLVBp QmFkLVBp merged commit 9661556 into main Dec 1, 2025
1 check failed
Copilot AI requested a review from QmFkLVBp December 1, 2025 22:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants