Fix misplaced U+0313 combining comma above (9828 occurrences, 111 files) #2862

Closed
chrisdrymon wants to merge 1 commit into OpenGreekAndLatin:master from cherithanalytics:fix-misplaced-elision-marks-u0313

@chrisdrymon
Contributor

Summary

Throughout the corpus, the elision apostrophe in forms like ὑπʼ, ἐφʼ, ὑφʼ is encoded as U+0313 COMBINING COMMA ABOVE placed after the consonant, rather than as U+02BC MODIFIER LETTER APOSTROPHE, which is the canonical Greek elision marker.

U+0313 is a combining diacritic; it composes with vowels under NFC normalization to produce smooth-breathing precomposed characters (α + ̓ → ἀ). After a consonant it has no valid composition and remains as a dangling combining mark in NFC text, producing strings like ὑπ̓ where the mark is structurally attached to π even though semantically it represents the elided syllable.
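
The NFC behaviour described above can be reproduced with Python's standard `unicodedata` module. This is only an illustrative sketch, not part of the patch; the variable names are made up for the example.

```python
import unicodedata

COMBINING_COMMA_ABOVE = "\u0313"

# Vowel + U+0313: NFC composes the pair into one precomposed codepoint.
vowel = unicodedata.normalize("NFC", "\u03b1" + COMBINING_COMMA_ABOVE)

# Consonant + U+0313: no precomposed form exists, so the mark survives NFC.
consonant = unicodedata.normalize("NFC", "\u03c0" + COMBINING_COMMA_ABOVE)

print([f"U+{ord(c):04X}" for c in vowel])      # ['U+1F00']
print([f"U+{ord(c):04X}" for c in consonant])  # ['U+03C0', 'U+0313']
```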

The bug is easy to miss because rendering varies by font and platform. Side-by-side comparison:

  • Current: ὑπ̓ αὐτοῦ — the U+0313 sits on the π, looking like a smooth breathing on a consonant.
  • Fixed: ὑπʼ αὐτοῦ — the U+02BC is a proper Greek elision apostrophe.

Why it matters

  • Search and tokenization treat ὑπ̓ as one token (consonant + combining mark) rather than apostrophe-separated. Indexers don't roll the elided form up to lemma ὑπό correctly.
  • NFC normalization can't repair this because there is no precomposed consonant + smooth breathing codepoint — the combining mark stays detached.
  • Rendering layers that fall back to combining-mark display show a comma floating awkwardly above the π.
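
One way to see why tools handle the two characters differently is to compare their Unicode properties. The sketch below uses only the standard `unicodedata` module; `describe` is a hypothetical helper name, and how any particular tokenizer or indexer reacts to these properties is not something this snippet can show.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, int]:
    """Return (name, general category, canonical combining class) for ch."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


# U+0313 is a nonspacing mark (Mn) that attaches to the preceding character.
print(describe("\u0313"))  # ('COMBINING COMMA ABOVE', 'Mn', 230)

# U+02BC is a standalone modifier letter (Lm) with combining class 0.
print(describe("\u02bc"))  # ('MODIFIER LETTER APOSTROPHE', 'Lm', 0)
```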

This came up while building a search engine over the Perseus / First1KGreek corpora (korakes-ingest) — a parallel PR is going up against PerseusDL/canonical-greekLit for ~864 occurrences there.

Fix

Three-case replacement of U+0313 in invalid positions (i.e., where the preceding character won't NFC-compose with it):

  1. Consonant + U+0313 (the elision-marker case, e.g., ὑπ̓ ἐφ̓ ὑφ̓ κατ̓ μετ̓ ἀντ̓): replace U+0313 with U+02BC. Bulk of changes.
  2. Already-composed vowel-with-breathing + U+0313 (duplicate-breathing case, e.g., ἐ̓ where the ἐ already carries a smooth breathing): drop the duplicate combining mark.
  3. Ano teleia + U+0313: drop the combining mark.
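
For readers who want to experiment, the three cases could be sketched roughly as follows. This is not the script used to produce the patch (which is not shown in this PR); the function name, the consonant set, and the decision to leave any other context untouched are all assumptions made for illustration, and the sketch assumes NFC input.

```python
import unicodedata

COMBINING_COMMA_ABOVE = "\u0313"
MODIFIER_APOSTROPHE = "\u02bc"
BREATHINGS = {"\u0313", "\u0314"}  # smooth and rough breathing marks

# Assumed consonant set. Rho is left out on purpose: rho + U+0313 does
# compose under NFC (to U+1FE4), so it is not an "invalid position".
CONSONANTS = set("βγδζθκλμνξπσςτφχψ")

# U+0387 GREEK ANO TELEIA normalizes to U+00B7 under NFC, so accept both.
ANO_TELEIA = {"\u0387", "\u00b7"}


def fix_misplaced_u0313(text: str) -> str:
    """Hypothetical sketch of the three-case replacement (expects NFC text)."""
    out = []
    for ch in text:
        prev = out[-1] if out else ""
        if ch != COMBINING_COMMA_ABOVE or not prev:
            out.append(ch)
        elif prev.lower() in CONSONANTS:
            out.append(MODIFIER_APOSTROPHE)  # case 1: elision marker
        elif BREATHINGS & set(unicodedata.normalize("NFD", prev)):
            continue  # case 2: duplicate breathing, drop the mark
        elif prev in ANO_TELEIA:
            continue  # case 3: mark after ano teleia, drop it
        else:
            out.append(ch)  # unknown context: leave untouched
    return "".join(out)


sample = "\u1f51\u03c0\u0313 \u03b1\u1f50\u03c4\u03bf\u1fe6"  # ὑπ + U+0313, then αὐτοῦ
fixed = fix_misplaced_u0313(sample)
print([f"U+{ord(c):04X}" for c in fixed[:3]])  # ['U+1F51', 'U+03C0', 'U+02BC']
```

Leaving unrecognized contexts untouched is a deliberately conservative choice here: it keeps the sketch from rewriting sequences that none of the three cases describe.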

9828 replacements across 111 files.

Test plan

  • CI passes
  • Spot-check a handful of diffs — every change is U+0313 → U+02BC (mostly) or removal of a stray U+0313
  • NFC-normalize affected files post-merge and confirm no information loss
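
A possible way to automate part of the spot-check is to NFC-normalize a file's text and list any U+0313 that survives, since a mark that NFC could not compose is a candidate stray. This is a minimal sketch with a hypothetical helper name, not a check that ships with this PR; the offsets refer to the normalized text, not the original bytes.

```python
import unicodedata

COMBINING_COMMA_ABOVE = "\u0313"


def stray_u0313_offsets(text: str) -> list[int]:
    """Return offsets (in the NFC form of text) of uncomposed U+0313 marks."""
    nfc = unicodedata.normalize("NFC", text)
    return [i for i, ch in enumerate(nfc) if ch == COMBINING_COMMA_ABOVE]


# Vowel + U+0313 composes, so nothing is reported.
print(stray_u0313_offsets("\u03b1\u0313"))        # []
# Consonant + U+0313 stays decomposed and is reported.
print(stray_u0313_offsets("\u1f51\u03c0\u0313"))  # [2]
```

An empty result for every affected file after the fix would be consistent with the patch having removed all stray marks, though it says nothing about whether each replacement was the right one.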

🤖 Generated with Claude Code

Throughout the corpus, the elision apostrophe in forms like ὑπʼ, ἐφʼ, ὑφʼ
was encoded as U+0313 (COMBINING COMMA ABOVE) sitting after the
consonant, rather than U+02BC (MODIFIER LETTER APOSTROPHE) which is
the canonical Greek elision marker.

U+0313 is a combining diacritic; it composes with vowels under NFC
normalization to produce smooth-breathing precomposed characters
(α + ̓ → ἀ). But after a consonant it has no valid composition and
remains as a dangling combining mark in NFC text, producing strings
like 'ὑπ̓' where the combining mark is structurally attached to π
even though semantically it represents the elided syllable.

This breaks downstream consumers in ways that are easy to miss:
search indexers tokenize 'ὑπ̓ ἄλλου' as 'ὑπ̓' and 'ἄλλου' rather
than treating the apostrophe as a punctuation mark and producing the
underlying lemma form ὑπό; rendering layers that fall back to
combining-mark display show a comma floating awkwardly above the π.
NFC normalization can't fix this because there is no precomposed
'consonant + smooth breathing' codepoint.

The fix is a three-case replacement of U+0313 in invalid positions:

- Consonant + U+0313 (the elision-marker case): replace U+0313 with
  U+02BC. Most common in elided prepositions (ὑφʼ, ἐφʼ, ἀφʼ before
  rough-breathing vowels; ὑπʼ, ἐπʼ, ἀπʼ, ἀντʼ, μετʼ, κατʼ before
  smooth-breathing ones). The bulk of the changes.

- Already-composed vowel-with-breathing + U+0313 (the duplicate-
  breathing case, e.g., ἐ̓): drop the duplicate combining mark.
  About two dozen occurrences.

- Ano teleia / other non-letter + U+0313: drop the combining mark.
  Rare.

9828 replacements total across 111 files. All other content is
byte-identical.
@chrisdrymon
Contributor Author

Closing — author found mistakes in the patch and is going to rework it.
