fix(asr): clamp diarization cluster count to max_num_speakers#15835
Open
vprosoho wants to merge 2 commits into
For short sessions, SpeakerClustering.forward_infer estimates the speaker count via getEnhancedSpeakerCount(), which constructs NMESC with max_num_speakers=emb.shape[0] (the number of embedding segments) instead of the configured max_num_speakers. The resulting est_num_of_spk_enhanced is then consumed in forward_unit_infer without re-applying the limit, so a short audio file can be clustered into more speakers than max_num_speakers allows.

Clamp n_clusters to max_num_speakers after the speaker count is selected. This is a no-op for the oracle and standard NME estimation paths (both already bounded by max_num_speakers) and fixes the over-counting that can occur on the enhanced-count path.

Signed-off-by: Vadym Prokopov <vprokopov@sohosquared.com>
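The select-then-clamp step described in this commit can be sketched as below. This is a standalone illustration, not the NeMo implementation: `select_n_clusters` is a hypothetical helper, and the precedence between the oracle, enhanced, and standard estimates (with non-positive values meaning "not provided") is an assumption. Only the final `min(...)` mirrors the change in this PR.

```python
def select_n_clusters(
    oracle_num_speakers: int,
    est_num_of_spk: int,
    est_num_of_spk_enhanced: int,
    max_num_speakers: int,
) -> int:
    """Hypothetical helper: pick a cluster count, then cap it at max_num_speakers."""
    # Assumed precedence: oracle > enhanced estimate > standard NME estimate.
    # A value <= 0 is treated as "not provided" in this sketch.
    if oracle_num_speakers > 0:
        n_clusters = oracle_num_speakers
    elif est_num_of_spk_enhanced > 0:
        n_clusters = est_num_of_spk_enhanced
    else:
        n_clusters = est_num_of_spk
    # The fix: re-apply the configured limit after the count is selected.
    return min(n_clusters, max_num_speakers)
```

With an enhanced estimate of 8 and `max_num_speakers=2`, the helper returns 2. When the selected count is already within the limit, the clamp leaves it unchanged.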
getEnhancedSpeakerCount estimates the speaker count with max_num_speakers=emb.shape[0], so for short sessions est_num_of_spk_enhanced can exceed the requested max_num_speakers.

Add a CPU unit test that calls SpeakerClustering.forward_unit_infer with an enhanced count larger than max_num_speakers and asserts the number of output clusters is capped at max_num_speakers. Fails before the clamp fix (returns 8 clusters), passes after (capped at 2/3).

Signed-off-by: Vadym Prokopov <vprokopov@sohosquared.com>
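The shape of such a check can be illustrated without NeMo. In the sketch below, `toy_cluster_labels` is a hypothetical stand-in for the clustering call (it is not `forward_unit_infer`), and the choice of 8 segments is an assumption made to match the 8 clusters reported before the fix. The caps of 2 and 3 come from the commit message.

```python
def toy_cluster_labels(
    num_segments: int, est_num_of_spk_enhanced: int, max_num_speakers: int
) -> list[int]:
    """Hypothetical stand-in for a clustering call: one label per segment."""
    # The clamp under test: never use more clusters than the configured limit.
    n_clusters = min(est_num_of_spk_enhanced, max_num_speakers)
    # Round-robin assignment, so every one of the n_clusters labels is used.
    return [i % n_clusters for i in range(num_segments)]


def count_output_clusters(max_num_speakers: int) -> int:
    """Run the toy clustering with an enhanced count of 8 and count distinct labels."""
    labels = toy_cluster_labels(
        num_segments=8,  # assumed session size for illustration
        est_num_of_spk_enhanced=8,
        max_num_speakers=max_num_speakers,
    )
    return len(set(labels))


# Mirrors the described test: the output is capped at max_num_speakers.
for cap in (2, 3):
    assert count_output_clusters(cap) == cap
```

Removing the `min(...)` line from the stand-in makes the assertions fail with 8 distinct labels, which is the failure mode the real test is meant to catch.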
What does this PR do ?
Small fix to speaker over-counting in clustering diarization for short sessions: the final number of clusters could exceed the configured max_num_speakers.
Collection: ASR (speaker diarization / clustering)
Changelog
nemo/collections/asr/parts/utils/offline_clustering.py: in SpeakerClustering.forward_unit_infer, limit the chosen cluster count with n_clusters = min(n_clusters, max_num_speakers).
tests/collections/speaker_tasks/utils/test_diar_utils.py: add test_offline_speaker_clustering_enhanced_count_respects_max_num_speakers_cpu, a unit test verifying that an enhanced count larger than max_num_speakers is capped.
Usage
No usage change. The API is unchanged; the only behavioral difference is that the final cluster count can no longer exceed max_num_speakers.
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Hopefully these are the right CCs, judging by the git history...
cc @tango4j & @nithinraok
Apologies if not!
Additional Information