"last one means y is already too long, shouldn't happen, but put it here"

While working with the model, I ran into cases where the target text was removed but nothing was generated in its place.
After some investigation, I found the following line in your code:
https://github.com/jasonppy/VoiceCraft/blob/a702dfd2ced6d4fd6b04bdc160c832c6efc8f6c5/models/voicecraft.py#L752
It checks whether y_input > 10 * x_lens and, if so, generates nothing.

Why do we need this check?
I'm not sure why the target transcript length and the input size should limit generation.
According to the comment in the code, this shouldn't happen, but it can happen when the audio contains few words and is long only because of the silences in it.
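
To illustrate how a short transcript can trip this condition, here is a rough sketch of the arithmetic. It is not the repository's code: the function name `exceeds_length_cap`, the 50 frames-per-second codec rate, and the token counts are all assumptions made for illustration, and it assumes y_input is measured in audio frames and x_lens in transcript tokens.

```python
# Hypothetical stand-in for the check discussed above, not the actual VoiceCraft code.
FRAME_RATE_HZ = 50  # assumed codec frame rate, chosen only for illustration


def exceeds_length_cap(num_audio_frames: int, num_text_tokens: int, ratio: int = 10) -> bool:
    """Return True when the audio length exceeds ratio * transcript length."""
    return num_audio_frames > ratio * num_text_tokens


# A 4.5 second clip at the assumed frame rate.
frames = int(4.5 * FRAME_RATE_HZ)

# Same clip length, two hypothetical transcripts of different length.
dense_speech = exceeds_length_cap(frames, num_text_tokens=30)
sparse_speech = exceeds_length_cap(frames, num_text_tokens=20)

print(frames, dense_speech, sparse_speech)
```

Under these assumptions, the same clip length passes the check with the longer transcript and fails it with the shorter one, which matches the case of audio that is mostly silence.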

All the audio clips I tested are 4 to 5 seconds long, the length you suggest works best.

I tried removing this check, and the few examples I ran gave good results.

"last one means y is already too long, shouldn't happen, but put it here" #175

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

"last one means y is already too long, shouldn't happen, but put it here" #175

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions