Fix misplaced U+0313 combining comma above (9828 occurrences, 111 files) #2862
Closed
chrisdrymon wants to merge 1 commit into
Throughout the corpus, the elision apostrophe in forms like ὑπʼ, ἐφʼ, ὑφʼ was encoded as U+0313 (COMBINING COMMA ABOVE) sitting after the consonant, rather than U+02BC (MODIFIER LETTER APOSTROPHE), which is the canonical Greek elision marker.
U+0313 is a combining diacritic; it composes with vowels under NFC normalization to produce smooth-breathing precomposed characters (α + ̓ → ἀ). But after a consonant it has no valid composition and remains as a dangling combining mark in NFC text, producing strings like 'ὑπ̓' where the combining mark is structurally attached to π even though semantically it represents the elided syllable.
This breaks downstream consumers in ways that are easy to miss: search indexers tokenize 'ὑπ̓ ἄλλου' as 'ὑπ̓' and 'ἄλλου' rather than treating the apostrophe as a punctuation mark and producing the underlying lemma form ὑπό; rendering layers that fall back to combining-mark display show a comma floating awkwardly above the π. NFC normalization can't fix this because there is no precomposed 'consonant + smooth breathing' codepoint.
The fix is a three-case replacement of U+0313 in invalid positions:
- Consonant + U+0313 (the elision-marker case): replace U+0313 with U+02BC. Most common in elided prepositions before a vowel-initial word (ὑφʼ, ἐφʼ, ἀφʼ, ὑπʼ, ἐπʼ, ἀπʼ, ἀντʼ, μετʼ, κατʼ, etc.). These make up the bulk of the changes.
- Already-composed vowel-with-breathing + U+0313 (the duplicate-breathing case, e.g., ἐ̓): drop the duplicate combining mark. About two dozen occurrences.
- Ano teleia / other non-letter + U+0313: drop the combining mark. Rare.
9828 replacements total across 111 files. All other content is byte-identical.
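The NFC behaviour described above can be checked with Python's standard `unicodedata` module. This is only a small illustration, not part of the patch; the helper name `nfc_codepoints` is made up for the example.

```python
import unicodedata

COMBINING_COMMA_ABOVE = "\u0313"


def nfc_codepoints(text: str) -> list[str]:
    """Return the code points of `text` after NFC normalization, as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", text)]


# Vowel + U+0313 composes into a single precomposed code point (alpha with psili).
print(nfc_codepoints("\u03b1" + COMBINING_COMMA_ABOVE))  # ['U+1F00']

# Consonant (pi) + U+0313 has no precomposed form, so the mark survives NFC.
print(nfc_codepoints("\u03c0" + COMBINING_COMMA_ABOVE))  # ['U+03C0', 'U+0313']
```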
Closing: the author found mistakes in the patch and is going to rework it.
Summary
Throughout the corpus, the elision apostrophe in forms like ὑπʼ, ἐφʼ, ὑφʼ is encoded as U+0313 COMBINING COMMA ABOVE sitting after the consonant, rather than U+02BC MODIFIER LETTER APOSTROPHE, which is the canonical Greek elision marker.
U+0313 is a combining diacritic; it composes with vowels under NFC normalization to produce smooth-breathing precomposed characters (α + ̓ → ἀ). After a consonant it has no valid composition and remains as a dangling combining mark in NFC text, producing strings like ὑπ̓ where the mark is structurally attached to π even though semantically it represents the elided syllable.
The bug is easy to miss because rendering varies by font and platform. Side-by-side comparison:
- ὑπ̓ αὐτοῦ: the U+0313 sits on the π, looking like a smooth breathing on a consonant.
- ὑπʼ αὐτοῦ: the U+02BC is a proper Greek elision apostrophe.
Why it matters
- Tokenizers see ὑπ̓ as one token (consonant + combining mark) rather than apostrophe-separated, so indexers don't roll the elided form up to the lemma ὑπό correctly.
- NFC normalization can't fix it: there is no precomposed consonant + smooth breathing codepoint, so the combining mark stays detached.
This came up while building a search engine over the Perseus / First1KGreek corpora (korakes-ingest); a parallel PR is going up against PerseusDL/canonical-greekLit for ~864 occurrences there.
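To make the difference between the two characters concrete, the sketch below prints the Unicode general category and name of each character in a token. It uses only the standard `unicodedata` module; the helper `describe` and the sample string are illustrative and are not taken from korakes-ingest or any particular indexer.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str, str]]:
    """List (code point, general category, name) for each character in `text`."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Sample phrase: upsilon-with-dasia, pi, U+0313, space, then the next word.
# A plain whitespace split keeps the dangling mark inside the first token.
tokens = "\u1f51\u03c0\u0313 \u1f04\u03bb\u03bb\u03bf\u03c5".split()
for row in describe(tokens[0]):
    print(row)
# The last row is ('U+0313', 'Mn', 'COMBINING COMMA ABOVE'): a nonspacing mark
# that belongs to the preceding pi.

print(describe("\u02bc"))
# [('U+02BC', 'Lm', 'MODIFIER LETTER APOSTROPHE')]: a spacing modifier letter.
```

How a given tokenizer treats category `Mn` versus `Lm` characters depends on that tokenizer, so this only shows what the input looks like, not how any specific search engine behaves.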
Fix
Three-case replacement of U+0313 in invalid positions (i.e., where the preceding character won't NFC-compose with it):
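- Consonant + U+0313 (ὑπ̓, ἐφ̓, ὑφ̓, κατ̓, μετ̓, ἀντ̓): replace U+0313 with U+02BC. Bulk of changes.
- Already-composed vowel-with-breathing + U+0313 (ἐ̓ where the ἐ already carries a smooth breathing): drop the duplicate combining mark.
- Ano teleia / other non-letter + U+0313: drop the combining mark.
9828 replacements across 111 files.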
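The three cases could be sketched roughly as follows. This is not the script used for the patch (which was withdrawn for rework); it is a minimal illustration that assumes the input is already NFC-normalized, and the function name, the vowel set, and the decision to leave every other position untouched are all assumptions made for the example.

```python
import unicodedata

COMMA_ABOVE = "\u0313"  # COMBINING COMMA ABOVE
APOSTROPHE = "\u02bc"   # MODIFIER LETTER APOSTROPHE
GREEK_VOWELS = set("αεηιουω")


def fix_misplaced_comma_above(text: str) -> str:
    """Hypothetical helper: rewrite U+0313 marks left dangling in NFC text."""
    out: list[str] = []
    for ch in text:
        if ch != COMMA_ABOVE:
            out.append(ch)
            continue
        prev = out[-1] if out else ""
        if prev and unicodedata.category(prev).startswith("M"):
            out.append(ch)  # assumption: stacked combining marks are left alone
            continue
        if not prev.isalpha():
            continue  # case 3: ano teleia / other non-letter (or start of text)
        if "GREEK" not in unicodedata.name(prev, ""):
            out.append(ch)  # assumption: non-Greek letters are out of scope
            continue
        decomposed = unicodedata.normalize("NFD", prev)
        base = decomposed[0].lower()
        if base not in GREEK_VOWELS:
            out.append(APOSTROPHE)  # case 1: consonant + U+0313
        elif COMMA_ABOVE in decomposed:
            continue  # case 2: vowel already carries a smooth breathing
        else:
            out.append(ch)  # any other vowel: not covered by the three cases
    return "".join(out)


print(fix_misplaced_comma_above("\u1f51\u03c0\u0313") == "\u1f51\u03c0\u02bc")  # True
```

One caveat worth keeping in mind: ρ + U+0313 does compose under NFC (to U+1FE4, GREEK SMALL LETTER RHO WITH PSILI), so a check that only looks for a surviving U+0313 would not see elided forms ending in ρ. Handling that is outside this sketch.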
Test plan
🤖 Generated with Claude Code