Multipron align #57

Merged: lenzo-ka merged 12 commits into master from multipron-align on Apr 9, 2026

lenzo-ka commented Apr 8, 2026

Multipron alignment, robust builds, optional vocab dict

Summary

Multipron is on by default: after context-independent (CI) training (stage 20), stage 21 runs multipron forced alignment (sphinx3_align via multipron_align) so context-dependent (CD) and later stages can use
$CFG_BASE_DIR/multipron_align/$CFG_EXPTNAME.multipron.transcription when that file exists. Set CFG_MULTIPRON = 'no' in etc/sphinx_train.cfg to skip stage 21 and use only the reference transcript everywhere.

Stage 21 driver and driver UX: the Perl wrapper imports dirname for the lib path and no longer passes a redundant etc argument into multipron_align.pl (which would confuse multipron_align.py). scripts/sphinxtrain now decodes the POSIX wait status returned by os.system(), so failed Perl stages return a non-zero shell exit code on Unix.
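For readers unfamiliar with the wait-status issue: on Unix, os.system() returns an encoded wait status rather than the child's exit code, so a child that exits with code 3 produces the value 768, and passing that straight to sys.exit() truncates to exit code 0. The following is a minimal sketch of the decoding, not the code in scripts/sphinxtrain; the function names `exit_code_from_wait_status` and `run_stage` are hypothetical.

```python
import os


def exit_code_from_wait_status(status: int) -> int:
    """Convert the value returned by os.system() into a shell-style exit code."""
    if os.name != "posix":
        # On Windows, os.system() already returns the command's exit code.
        return status
    if os.WIFEXITED(status):
        # Normal termination: the exit code is stored in the high byte.
        return os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        # Killed by a signal: follow the common shell convention 128 + signo.
        return 128 + os.WTERMSIG(status)
    # Anything else (e.g. stopped) is treated as a generic failure.
    return 1


def run_stage(command: str) -> int:
    """Run one stage command (hypothetical helper) and return its exit code."""
    return exit_code_from_wait_status(os.system(command))
```

A caller would then use something like `sys.exit(run_stage(cmd))` so that a failing stage fails the enclosing shell script or CI job. On Python 3.9 and later, `os.waitstatus_to_exitcode()` is a standard-library alternative, although it reports signals as negative numbers rather than 128 + signo.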

Also tightens CMake handling on macOS when Xcode or SDK paths change, hardens sphinx3_align (no assert on bad alignment; fallback outsent), and adds an optional vocabulary-restricted dictionary (00a.vocab_dict) / cmusphinx.vocab_dict.

Docs: README.md describes default multipron behavior and how to disable it.

User-visible behavior

  • CFG_MULTIPRON: defaults to yes. Setting it to no disables automatic multipron alignment and multipron transcripts. Training uses the multipron file only when it exists (after stage 21); context-independent training still uses CFG_TRANSCRIPTFILE until then.
  • sphinx3_align: alignment failures do not abort the whole run where fallback applies.
  • CFG_VOCAB_DICT / CFG_VOCAB_DICTIONARY: optional vocabulary-restricted dict step; defaults documented in sphinx_train.cfg.
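
The transcript selection described for CFG_MULTIPRON can be sketched as follows. This is an illustrative Python rendering under stated assumptions, not the Perl in Util::ShouldUseMultipronTranscript: the config is modeled as a plain dict, and the function names are hypothetical.

```python
import os


def multipron_transcript_path(base_dir: str, exptname: str) -> str:
    """Path layout taken from the PR text: BASE_DIR/multipron_align/EXPT.multipron.transcription."""
    return os.path.join(base_dir, "multipron_align", f"{exptname}.multipron.transcription")


def choose_transcript(cfg: dict) -> str:
    """Pick the transcript for stages after stage 21 (hypothetical helper).

    The multipron transcript is used only when CFG_MULTIPRON is set, is not
    'no', and the file already exists; otherwise fall back to the reference
    transcript in CFG_TRANSCRIPTFILE.
    """
    setting = cfg.get("CFG_MULTIPRON")
    if setting and setting != "no":
        candidate = multipron_transcript_path(cfg["CFG_BASE_DIR"], cfg["CFG_EXPTNAME"])
        if os.path.isfile(candidate):
            return candidate
    return cfg["CFG_TRANSCRIPTFILE"]
```

Checking for the file rather than only the flag means a run that skipped or failed stage 21 still trains on the reference transcript instead of aborting on a missing path.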

Migration

Projects that want the old single-transcript-only behavior: set CFG_MULTIPRON = 'no' in etc/sphinx_train.cfg.

Testing

  • Full CMake build including sphinx3_align.
  • CI: Apple-specific CMake logic is guarded with if(APPLE); Ubuntu builds should behave as before.

lenzo-ka added 8 commits April 8, 2026 13:52
When the cached sysroot path is missing, set CMAKE_OSX_SYSROOT from xcrun. Before add_subdirectory(src), clear cached FindBLAS, FindLAPACK, and MATH paths that reference a removed SDK or a different SDK tree than the active sysroot. Remove duplicate find_library(MATH_LIBRARY) in src/CMakeLists.txt.
Return -1 from align_frame on bad alignment instead of asserting. On failure, emit a reference outsent line when possible and continue. Log how many outsent fallbacks were written at the end of a run.
Add CFG_MULTIPRON to sphinx_train.cfg: when yes, Baum-Welch and related steps use BASE_DIR/multipron_align/EXPT.multipron.transcription. Extend GetLists and the untied-HMM, lattice, and BW slaves accordingly. Add multipron_align.pl and multipron_align.py to merge dicts, run sphinx3_align, and write that transcript (beam from CFG_FORCE_ALIGN_BEAM or default 1e-308).
Python reduces the dictionary to words in the transcript vocabulary while keeping pronunciation variants. Perl script is invoked from the optional 00a.vocab_dict Makefile step; uses python like other SphinxTrain drivers. Define CFG_VOCAB_DICT and CFG_VOCAB_DICTIONARY in sphinx_train.cfg.
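
As an illustration of the dictionary reduction described in this commit, here is a minimal sketch. It assumes CMU-style dictionary lines (`WORD PH PH ...`, with alternate pronunciations written `WORD(2)`, `WORD(3)`, ...) and transcript lines that end in a parenthesized utterance ID; the function names are hypothetical and this is not the script added by the PR.

```python
import re

# Alternate-pronunciation suffix such as "(2)" at the end of a dictionary headword.
_VARIANT_SUFFIX = re.compile(r"\(\d+\)$")
# Trailing utterance ID such as " (utt_001)" at the end of a transcript line.
_UTT_ID = re.compile(r"\s*\([^()\s]*\)\s*$")


def transcript_vocabulary(transcript_lines):
    """Collect every whitespace-separated token that appears in the transcript."""
    vocab = set()
    for line in transcript_lines:
        text = _UTT_ID.sub("", line)  # drop the utterance ID, keep the words
        vocab.update(text.split())
    return vocab


def restrict_dictionary(dict_lines, vocab):
    """Keep dictionary entries whose base word is in vocab, including all variants."""
    kept = []
    for line in dict_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        base_word = _VARIANT_SUFFIX.sub("", fields[0])  # READ(2) -> READ
        if base_word in vocab:
            kept.append(line.rstrip("\n"))
    return kept
```

For example, with a transcript containing only READ, THE and BOOK, an entry such as `READ(2) R IY D` is kept alongside `READ R EH D`, while an unrelated entry such as `ZEBRA Z IY B R AH` is dropped.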
Keeps optional local design notes out of the repository.
Insert training stage 21.multipron_align after CI to run multipron_align.pl
when CFG_MULTIPRON is set and not no. Use multipron transcript for CD and
later only when the file exists (Util::ShouldUseMultipronTranscript). Default
CFG_MULTIPRON to yes in sphinx_train.cfg; omitting or unsetting the variable
keeps legacy behavior without stage 21.

Stage 21 driver: import dirname for lib path; do not pass an extra etc path
to multipron_align.pl. sphinxtrain decodes POSIX wait status from os.system()
so failed Perl stages fail the shell on Unix.
Describe default multipron alignment after CI, sphinx3_align, and how to
disable via CFG_MULTIPRON. Fix Acknowledgments heading spelling.

lenzo-ka commented Apr 8, 2026

trigger ci

lenzo-ka added 4 commits April 8, 2026 15:01
CMake installs SCRIPTDIRS including 22.ci_hmm_multipron; the directory was
listed but not in the tree on CI, so cmake --install failed.
lenzo-ka requested a review from dhdaines April 9, 2026 12:05
lenzo-ka merged commit eabc949 into master Apr 9, 2026
8 checks passed
lenzo-ka deleted the multipron-align branch April 9, 2026 15:32