I cloned Step-Audio at commit "64cf0a6", Step-Audio-Tokenizer at "af7e5a3", and Step-Audio-TTS-3B at "9ddb7cb". No code was changed. I ran tts_inference.py with the "synthesis-type" parameter set to "tts" and then to "clone". In both cases the generated audio is pure noise. What's wrong?
output_tts.wav
output_clone.wav