A speech dataset synthesis project with speaker-style and emotion control. It contains two independent pipelines:
- tts_speaker_control/: first generates {text, answer} pairs with Qwen3, then synthesizes only the answer-side speech with CosyVoice Instruct. Here, text is the speaker style/emotion description (used as the instruction) and answer is the line to read (TTS content).
- s2s_emo_control/: takes existing (text, answer) pairs and synthesizes both query-side and answer-side speech with CosyVoice zero-shot plus an emotion instruction.
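As a rough illustration of how the two fields are used in tts_speaker_control/, the sketch below maps one {text, answer} JSONL line onto an instruction and a TTS content string. The function name `to_tts_request` and the output keys `instruction` and `content` are hypothetical, chosen only for this example; the actual code in the repository may be organized differently.

```python
import json


def to_tts_request(line: str) -> dict:
    """Map one {text, answer} JSONL line onto the two TTS inputs.

    The output keys are hypothetical names used for illustration only.
    """
    record = json.loads(line)
    return {
        # text: speaker style / emotion description, used as the instruction
        "instruction": record["text"].strip(),
        # answer: the line that is actually read aloud
        "content": record["answer"].strip(),
    }


# A made-up record in the {text, answer} shape described above.
sample = json.dumps(
    {"text": "A calm, elderly narrator, slightly amused",
     "answer": "Well, that was unexpected."}
)
request = to_tts_request(sample)
print(request["instruction"])
print(request["content"])
```

Only the answer side is synthesized in this pipeline, so the text field never becomes audio; it only steers the voice.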
Framework boundary: text generation and audio post-processing use MindSpore, CosyVoice inference uses torch_npu, and ark/scp + JSONL utilities are shared via common/. Requires an Ascend/CANN environment.
common/                      shared ark / audio / JSONL utilities
tts_speaker_control/
    build_text_pairs.py      Step 1: Qwen3 generates {text, answer} pairs
    merge_metadata.py        Step 2: filter moods, tag into TTS records
    synthesize_dataset.py    Step 3: CosyVoice Instruct synthesizes answer speech
    scripts/run_pipeline.sh
s2s_emo_control/
    build_dataset.py         entry point: synthesize query / answer speech
    pipeline.py
    scripts/run_shard.sh
cd VoiceStyleControl/tts_speaker_control
bash scripts/run_pipeline.sh all # dedup → text → merge → tts
# single step: bash scripts/run_pipeline.sh {dedup|text|merge|tts}

cd VoiceStyleControl/s2s_emo_control
python build_dataset.py \
--meta_path ../personify_text_answer_clean_language_part1.jsonl \
--output_path output/s2s_emo_control/part1 \
--model_path ../pretrained_models \
--speaker_zh ../prompt_speech \
--speaker_en ../en_prompt_speech \
--id 0 --device npu:0

After synthesis, ark pointers and wav paths are appended to the metadata JSONL:
- tts_speaker_control: answer_token_25hz / answer_audio_ark / answer_audio_path
- s2s_emo_control: in addition to the answer fields above, also appends query_token_25hz / query_audio_ark / query_audio_path
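A quick way to sanity-check a synthesized metadata file is to confirm that each record carries the fields listed above. The sketch below does that for a single record. The helper names, the sample values, and the `path.ark:offset` pointer layout are assumptions: that layout is the common Kaldi scp convention, and this project's ark pointers may be formatted differently.

```python
ANSWER_FIELDS = ("answer_token_25hz", "answer_audio_ark", "answer_audio_path")
QUERY_FIELDS = ("query_token_25hz", "query_audio_ark", "query_audio_path")


def missing_fields(record: dict, pipeline: str) -> list:
    """Return the expected output fields that are absent from a record."""
    expected = list(ANSWER_FIELDS)
    if pipeline == "s2s_emo_control":
        # This pipeline also synthesizes the query side.
        expected += QUERY_FIELDS
    return [name for name in expected if name not in record]


def split_ark_pointer(pointer: str) -> tuple:
    """Split an assumed Kaldi-style 'path.ark:offset' pointer."""
    path, _, offset = pointer.rpartition(":")
    return path, int(offset)


# Hypothetical record; all values are invented for illustration.
record = {
    "text": "gentle, reassuring",
    "answer": "Everything will be fine.",
    "answer_token_25hz": "out/answer_token.ark:17",
    "answer_audio_ark": "out/answer_audio.ark:1024",
    "answer_audio_path": "out/wav/000001_answer.wav",
}

print(missing_fields(record, "tts_speaker_control"))
print(missing_fields(record, "s2s_emo_control"))
print(split_ark_pointer(record["answer_audio_ark"]))
```

A record produced by tts_speaker_control should report no missing fields, while the same record checked against s2s_emo_control reports the three query-side fields.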
Dependencies: mindspore, mindformers, transformers, cosyvoice, torch, torch_npu, scipy, tqdm. An Ascend/CANN runtime is required.