Hi,
Thanks for sharing this interesting work! I need to run inference on videos that are approximately 13 minutes long (~780 seconds). It seems that the inference script trims the audio to 30 seconds with `speech = whisper.pad_or_trim(speech.astype(np.float32))`. When I commented out this step, I got an `AssertionError: incorrect audio shape`. Is there a supported way to handle longer inputs, or do I have to work around this myself, for example by chunking the audio?
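For reference, the chunking workaround I have in mind would look roughly like the sketch below. It assumes the waveform is mono and already resampled to 16 kHz, and that each 30-second window can be fed through the existing inference path independently; `split_into_chunks` is a hypothetical helper of mine, not something from this repo.

```python
import numpy as np

SAMPLE_RATE = 16000                      # assumption: audio is already 16 kHz mono
CHUNK_SECONDS = 30                       # window length that pad_or_trim enforces
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def split_into_chunks(speech, chunk_samples=CHUNK_SAMPLES):
    """Split a 1-D waveform into fixed-length chunks, zero-padding the last one.

    Hypothetical helper: every returned chunk has exactly `chunk_samples`
    samples, so it has the same shape pad_or_trim would produce.
    """
    speech = np.asarray(speech, dtype=np.float32)
    if speech.ndim != 1:
        raise ValueError("expected a mono 1-D waveform")

    chunks = []
    for start in range(0, len(speech), chunk_samples):
        chunk = speech[start:start + chunk_samples]
        if len(chunk) < chunk_samples:
            # Zero-pad the final, shorter chunk up to the full window length.
            chunk = np.pad(chunk, (0, chunk_samples - len(chunk)))
        chunks.append(chunk)
    return chunks


# Example: a ~780 s clip is split into 30 s windows.
speech = np.zeros(780 * SAMPLE_RATE, dtype=np.float32)
chunks = split_into_chunks(speech)
print(len(chunks), chunks[0].shape)
```

Each chunk would then replace the output of `pad_or_trim` in the inference script, and the per-chunk results would be merged afterwards. I am not sure whether hard cuts at the window boundaries hurt quality (a word may be split across two chunks), so overlapping windows might be needed.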
Thanks.