Correctly find the IPA language code for sal-apa#790
Conversation
Changed Files
|
794b117 to
c8cd90e
Compare
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #790 +/- ##
==========================================
+ Coverage 82.97% 82.99% +0.02%
==========================================
Files 47 47
Lines 4158 4163 +5
Branches 611 612 +1
==========================================
+ Hits 3450 3455 +5
Misses 576 576
Partials 132 132 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
c8cd90e to
7e59f4e
Compare
roedoejet
left a comment
There was a problem hiding this comment.
Just a couple comments for now. not requesting these changes necessarily, but let's discuss next week.
| sal_apa_g2p = get_g2p_engine("sal-apa") | ||
| self.assertEqual(sal_apa_g2p("ac"), list("ats")) | ||
|
|
||
| # but iku-sro goes to iku-sro-ipa, not iku-ipa |
There was a problem hiding this comment.
Hm, why does this not go to iku-ipa?
There was a problem hiding this comment.
I guess we kind of assumed this *-ipa convention that isn't enforced
There was a problem hiding this comment.
g2p show-mappings | grep iku will tell you that we have iku->iku-equiv->iku-ipa->eng-ipa as the path from syllabics, and the path iku-sro->iku-sri-ipa->iku-sro-ipa->eng-ipa for romanized, and those two paths are just not connected. I don't know why we made the choice, but since we never had an official policy or way to declare "this is the IPA code for language X", whoever wrote the mapping thought that was intuitive to them.
Actually, the git logs tell me that's from back in 2019, with a commit log "first attempt at consolidating langs", so I'm going to guess this might have been an artefact of the merging process. We could change things in g2p, and probably we should add a function to the API that returns the IPA code for any non-IPA code that leads to IPA in a way or another. But my problem would remain that any such solution would be future only, it would not be compatible with older versions of g2p, hence my solution here.
| if lang_id + "-ipa" in LANGS_NETWORK.nodes: | ||
| return lang_id + "-ipa" | ||
| else: | ||
| return lang_id[:3] + "-ipa" |
There was a problem hiding this comment.
hm, what about lang_id.split('-')[0] + "-ipa" ? Do we enforce the initial code will be 3 letters somewhere (I mean, ISO639-3 stipulates this but I don't think we actually test this anywhere.
There was a problem hiding this comment.
NRC-ILT/g2p#489 addresses this question. I was hoping you'd review both PRs together. I'm not attached to any given solution, as long as we pick one and apply it identically to the two PRs.
There was a problem hiding this comment.
But actually, splitting on dash is a nice idea, more forward looking that taking the first 3 characters, so yeah, I'll make this change in both PRs.
8c5b7ad to
0fba6df
Compare
using the technique documented and tested in NRC-ILT/g2p#489 Fixes #789
I was keeping v2 for its speed, but v3.1 uses a cache correctly so it is fast again.
0fba6df to
3419832
Compare
roedoejet
left a comment
There was a problem hiding this comment.
Approving, but please rebase https://github.com/EveryVoiceTTS/wav2vec2aligner/tree/dev.ej/ci-fix-ffmpeg onto https://github.com/EveryVoiceTTS/wav2vec2aligner/tree/main
PR Goal?
Correctly find the IPA language code for
sal-apaFixes?
Fixes #789
Feedback sought?
regular review
Priority?
high
Tests added?
yes
How to test?
run through the wizard with data in
sal-apa, and see it pass by the language selection step (exception dumps there before this PR).Confidence?
high
Version change?
no, but we're due
Related PRs?
NRC-ILT/g2p#489