VoiceStyleControl Synthesis MindSpore Pipeline

A speech dataset synthesis project with speaker-style and emotion control. It contains two independent pipelines:

tts_speaker_control/: first generates {text, answer} pairs with Qwen3, then synthesizes speech for the answer side only, using CosyVoice Instruct. In each pair, text is the speaker style/emotion description (used as the instruction) and answer is the line to read (the TTS content).
s2s_emo_control/: takes existing {text, answer} pairs and synthesizes both query-side and answer-side speech with CosyVoice zero-shot plus an emotion instruction.
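
As a rough illustration of the pair format, the sketch below shows how one {text, answer} record could be split into a style instruction and the content to be spoken, as described for tts_speaker_control. Only the field names text and answer come from this README; the sample record, the to_tts_request helper, and the instruction/content key names are hypothetical and are not taken from the repository code.

```python
import json

# Hypothetical sample record. Only the field names "text" and "answer"
# come from this README; the values are made up for illustration.
line = '{"text": "A calm elderly man speaking slowly", "answer": "Take your time, there is no rush."}'


def to_tts_request(record):
    """Split a pair into the style instruction and the content to be spoken."""
    return {
        "instruction": record["text"],  # speaker style / emotion description
        "content": record["answer"],    # the line that gets synthesized
    }


request = to_tts_request(json.loads(line))
print(request["instruction"])
print(request["content"])
```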

Framework boundary: text generation and audio post-processing use MindSpore; CosyVoice inference uses torch_npu; the ark/scp and JSONL utilities are shared via common/. An Ascend/CANN environment is required.

Layout

common/                 shared ark / audio / JSONL utilities
tts_speaker_control/
  build_text_pairs.py   Step 1: Qwen3 generates {text, answer} pairs
  merge_metadata.py     Step 2: filter moods, tag into TTS records
  synthesize_dataset.py Step 3: CosyVoice Instruct synthesizes answer speech
  scripts/run_pipeline.sh
s2s_emo_control/
  build_dataset.py      entry point: synthesize query / answer speech
  pipeline.py
  scripts/run_shard.sh

Run

tts_speaker_control

cd VoiceStyleControl/tts_speaker_control
bash scripts/run_pipeline.sh all     # dedup → text → merge → tts
# single step: bash scripts/run_pipeline.sh {dedup|text|merge|tts}

s2s_emo_control

cd VoiceStyleControl/s2s_emo_control
python build_dataset.py \
    --meta_path ../personify_text_answer_clean_language_part1.jsonl \
    --output_path output/s2s_emo_control/part1 \
    --model_path ../pretrained_models \
    --speaker_zh ../prompt_speech \
    --speaker_en ../en_prompt_speech \
    --id 0 --device npu:0

Output fields

After synthesis, ark pointers and wav paths are appended to the metadata JSONL:

tts_speaker_control: answer_token_25hz / answer_audio_ark / answer_audio_path
s2s_emo_control: the answer fields above, plus query_token_25hz / query_audio_ark / query_audio_path
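
A quick way to sanity-check a finished run is to scan the metadata JSONL for records that lack the expected fields. The sketch below is a minimal example under two assumptions: the file holds one JSON object per line, and the fields sit at the top level of each record. The helper names missing_fields and check_metadata are hypothetical; only the field names are taken from the list above.

```python
import json

# Field names as listed in this README.
ANSWER_FIELDS = ["answer_token_25hz", "answer_audio_ark", "answer_audio_path"]
QUERY_FIELDS = ["query_token_25hz", "query_audio_ark", "query_audio_path"]


def missing_fields(record, pipeline):
    """Return the expected output fields that are absent from one record."""
    expected = list(ANSWER_FIELDS)
    if pipeline == "s2s_emo_control":
        expected += QUERY_FIELDS
    return [name for name in expected if name not in record]


def check_metadata(path, pipeline):
    """Return (line_number, missing_fields) for every incomplete record."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines
            missing = missing_fields(json.loads(line), pipeline)
            if missing:
                problems.append((number, missing))
    return problems
```

For example, check_metadata("metadata.jsonl", "s2s_emo_control") would return an empty list when every record carries all six fields; the file name here is a placeholder, not a path produced by the pipelines.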

Dependencies

mindspore, mindformers, transformers, cosyvoice, torch, torch_npu, scipy, tqdm, plus an Ascend/CANN runtime.
