
Correctly find the IPA language code for sal-apa #790

Merged: joanise merged 4 commits into main from dev.ej/fix-sal-apa on Apr 28, 2026

Conversation

Member

@joanise joanise commented Apr 24, 2026

PR Goal?

Correctly find the IPA language code for sal-apa

Fixes?

Fixes #789

Feedback sought?

regular review

Priority?

high

Tests added?

yes

How to test?

Run through the wizard with data in sal-apa and confirm that it gets past the language selection step (before this PR, an exception was raised there).

Confidence?

high

Version change?

no, but we're due

Related PRs?

NRC-ILT/g2p#489


semanticdiff-com Bot commented Apr 24, 2026

Review changes with  SemanticDiff

Changed Files
File Status
  everyvoice/text/phonemizer.py  28% smaller
  .github/workflows/matrix-tests.yml  0% smaller
  .github/workflows/test.yml  0% smaller
  everyvoice/model/aligner/wav2vec2aligner  0% smaller
  everyvoice/tests/test_custom_g2p.py  0% smaller

@joanise joanise force-pushed the dev.ej/fix-sal-apa branch from 794b117 to c8cd90e Compare April 24, 2026 20:54
@joanise joanise changed the title Dev.ej/fix sal apa Correctly find the IPA language code for sal-apa Apr 24, 2026
Contributor

github-actions Bot commented Apr 24, 2026

CLI load time: 0:00.20
Pull Request HEAD: 341983224c5f1c555886c61b39cbc46b04fd6d88
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package


codecov Bot commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.99%. Comparing base (459f4f1) to head (3419832).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #790      +/-   ##
==========================================
+ Coverage   82.97%   82.99%   +0.02%     
==========================================
  Files          47       47              
  Lines        4158     4163       +5     
  Branches      611      612       +1     
==========================================
+ Hits         3450     3455       +5     
  Misses        576      576              
  Partials      132      132              


@joanise joanise force-pushed the dev.ej/fix-sal-apa branch from c8cd90e to 7e59f4e Compare April 24, 2026 21:10
@joanise joanise requested a review from roedoejet April 24, 2026 21:12
Member

@roedoejet roedoejet left a comment


Just a couple of comments for now. I'm not necessarily requesting these changes, but let's discuss next week.

sal_apa_g2p = get_g2p_engine("sal-apa")
self.assertEqual(sal_apa_g2p("ac"), list("ats"))

# but iku-sro goes to iku-sro-ipa, not iku-ipa
Member


Hm, why does this not go to iku-ipa?

Member


I guess we kind of assumed this *-ipa convention, which isn't enforced.

Member Author


g2p show-mappings | grep iku will tell you that we have iku->iku-equiv->iku-ipa->eng-ipa as the path from syllabics, and iku-sro->iku-sri-ipa->iku-sro-ipa->eng-ipa as the path for romanized text, and those two paths are simply not connected. I don't know why we made that choice, but since we never had an official policy or a way to declare "this is the IPA code for language X", whoever wrote the mapping chose what was intuitive to them.

Actually, the git logs tell me that's from back in 2019, with the commit message "first attempt at consolidating langs", so I'm going to guess this might have been an artefact of the merging process. We could change things in g2p, and we should probably add a function to the API that returns the IPA code for any non-IPA code that leads to IPA one way or another. But my problem would remain: any such solution would be future-only and would not be compatible with older versions of g2p, hence my solution here.
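
For readers following along, here is a minimal sketch of what "not connected" means for the two paths quoted above. It is a toy stand-in that uses only the Python standard library, not the real g2p network: the node names are copied as written in the comment, and `PATHS`, `build_graph`, and `reachable` are hypothetical names introduced for illustration.

```python
from collections import deque

# Toy data: the two paths exactly as quoted in the comment above.
# This is NOT the real g2p network, only an illustration of it.
PATHS = [
    ["iku", "iku-equiv", "iku-ipa", "eng-ipa"],
    ["iku-sro", "iku-sri-ipa", "iku-sro-ipa", "eng-ipa"],
]


def build_graph(paths):
    """Turn a list of paths into a directed adjacency mapping."""
    graph = {}
    for path in paths:
        for src, dst in zip(path, path[1:]):
            graph.setdefault(src, set()).add(dst)
            graph.setdefault(dst, set())
    return graph


def reachable(graph, start):
    """Return every node reachable from start (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


GRAPH = build_graph(PATHS)

# Both starting points eventually reach eng-ipa, but neither path passes
# through the other path's IPA node.
print("iku-ipa" in reachable(GRAPH, "iku-sro"))  # False
print("iku-sro-ipa" in reachable(GRAPH, "iku"))  # False
```

Under these assumptions, no amount of path-following from iku-sro will ever land on iku-ipa, which is why the lookup cannot simply rely on the first three letters of the code.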

Comment thread everyvoice/text/phonemizer.py Outdated
if lang_id + "-ipa" in LANGS_NETWORK.nodes:
    return lang_id + "-ipa"
else:
    return lang_id[:3] + "-ipa"
Member


Hm, what about lang_id.split('-')[0] + "-ipa"? Do we enforce anywhere that the initial code will be 3 letters? (I mean, ISO 639-3 stipulates this, but I don't think we actually test it anywhere.)

Member Author


NRC-ILT/g2p#489 addresses this question. I was hoping you'd review both PRs together. I'm not attached to any given solution, as long as we pick one and apply it identically to the two PRs.

Member Author


But actually, splitting on dash is a nice idea, more forward-looking than taking the first 3 characters, so yeah, I'll make this change in both PRs.
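
The lookup discussed in this thread, with the split-on-dash fallback, could look roughly like the sketch below. This is not necessarily the merged code. `find_ipa_code` and `KNOWN_CODES` are hypothetical names; `KNOWN_CODES` is a small stand-in for `LANGS_NETWORK.nodes`, and the presence of `sal-ipa` in it (with `sal-apa-ipa` absent) is an assumption based on the linked issue title.

```python
def find_ipa_code(lang_id: str, known_codes: set) -> str:
    """Guess the IPA code for lang_id (illustrative sketch only)."""
    candidate = lang_id + "-ipa"
    if candidate in known_codes:
        # A node named <lang_id>-ipa exists, so use it directly.
        return candidate
    # Otherwise fall back to the part before the first dash.
    return lang_id.split("-")[0] + "-ipa"


# Hypothetical stand-in for LANGS_NETWORK.nodes, not the real node list.
KNOWN_CODES = {
    "sal-apa", "sal-ipa",
    "iku", "iku-ipa",
    "iku-sro", "iku-sro-ipa",
    "eng-ipa",
}

print(find_ipa_code("sal-apa", KNOWN_CODES))  # sal-ipa
print(find_ipa_code("iku-sro", KNOWN_CODES))  # iku-sro-ipa
```

Splitting on the dash rather than slicing the first three characters avoids baking in the assumption that every base code is exactly three letters long.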

@joanise joanise force-pushed the dev.ej/fix-sal-apa branch 2 times, most recently from 8c5b7ad to 0fba6df Compare April 27, 2026 19:12
@joanise joanise force-pushed the dev.ej/fix-sal-apa branch from 0fba6df to 3419832 Compare April 27, 2026 20:20
@roedoejet roedoejet self-requested a review April 27, 2026 20:58
Member

@roedoejet roedoejet left a comment


Member


needs to rebase onto main

@joanise joanise merged commit 3419832 into main Apr 28, 2026
13 checks passed
@joanise joanise deleted the dev.ej/fix-sal-apa branch April 28, 2026 12:44

Development

Successfully merging this pull request may close these issues.

using sal-apa as language triggers InvalidLanguageCode on sal-apa-ipa

2 participants