diff --git a/docs/en/part12/ch42_speech_audio_interaction_data_engineering.md b/docs/en/part12/ch42_speech_audio_interaction_data_engineering.md
index 91ca6e37..9d1c0d6c 100644
--- a/docs/en/part12/ch42_speech_audio_interaction_data_engineering.md
+++ b/docs/en/part12/ch42_speech_audio_interaction_data_engineering.md
@@ -67,7 +67,7 @@ VoiceStyleControl should therefore not be understood simply as a TTS dataset. Th
 
 ### VoiceStyleControl.3: Sample Schema: Separate Modeling of the Semantic Channel and Style Channel
 
-![Figure 42-1: Dual-channel schema for semantic response and style control](../../images/part12/ch42_fig02_dual_channel_schema.svg)
+![Figure 42-1: Dual-channel schema for semantic response and style control](../../images/part12/ch42_fig02_dual_channel_schema_en.svg)
 
 *Figure 42-1: Dual-channel schema for semantic response and style control. The semantic channel answers "what to say," the style channel answers "with which voice and emotion to say it," and the acoustic supervision channel binds both to audio files, speech tokens, and sampling configuration.*
 
@@ -268,7 +268,7 @@ Once training samples enter the dataloader, they are projected from the standard
 
 ### VoiceStyleControl.4: Construction Pipeline: From Text Conversation to Controllable Voice Records
 
-![Figure 42-2: VoiceStyleControl data construction pipeline](../../images/part12/ch42_fig01_data_pipeline.svg)
+![Figure 42-2: VoiceStyleControl data construction pipeline](../../images/part12/ch42_fig01_data_pipeline_en.svg)
 
 *Figure 42-2: VoiceStyleControl data construction pipeline. Text conversation or style content is first assigned speaker and emotion conditions, then audio is generated or collected through the authorized reference voice pool, and finally the samples are tokenized, quality-checked, balanced, and packaged.*
 
@@ -362,7 +362,7 @@ The packaging artifacts include not only JSONL, Parquet, or Hugging Face Dataset
 
 ### VoiceStyleControl.5: Quality Assessment and Closed-Loop Remediation
 
-![Figure 42-3: Quality assessment and data flywheel closed loop](../../images/part12/ch42_fig03_quality_loop.svg)
+![Figure 42-3: Quality assessment and data flywheel closed loop](../../images/part12/ch42_fig03_quality_loop_en.svg)
 
 *Figure 42-3: Quality assessment and data flywheel closed loop. Automated validation, reverse ASR, style assessment, and manual sampling together form a defective-sample queue that feeds back into re-synthesis, re-annotation, downweighting, or removal.*
 
diff --git a/docs/images/part12/ch42_fig01_data_pipeline_en.svg b/docs/images/part12/ch42_fig01_data_pipeline_en.svg
new file mode 100644
index 00000000..006cbd9e
--- /dev/null
+++ b/docs/images/part12/ch42_fig01_data_pipeline_en.svg
@@ -0,0 +1,69 @@
+
+
+
+
+
+
+
+
+ Text / style content
+ Qwen3-8B or human
+
+
+
+ Style attributes
+ gender / mood / language
+
+
+
+ Authorized voice pool
+ multi-speaker x emotion
+
+
+
+ Speech synth / collect
+ CosyVoice2 zero-shot
+
+
+
+ Speech tokens
+ S3Tokenizer 25 Hz
+
+
+
+ QC and balance
+ ASR / voice / emotion
+
+
+
+ Package / release
+ JSONL / audio / token
+
+
+
+
+
+
+ target
+ style
+
+
+
+ reference choice
+
+
+
+
+
+
+
+
+
+
+S2SEmoControl: synthesize query and answer separately
+TTSSpeakerControl: synthesize style description and answer
+
diff --git a/docs/images/part12/ch42_fig02_dual_channel_schema_en.svg b/docs/images/part12/ch42_fig02_dual_channel_schema_en.svg
new file mode 100644
index 00000000..8e6d9c74
--- /dev/null
+++ b/docs/images/part12/ch42_fig02_dual_channel_schema_en.svg
@@ -0,0 +1,16 @@
+
+
+
+
+
+
+Dual-channel schema for semantic response and style control
+Semantic Channel • query / spoken user query • answer / assistant response • task: S2S or TTS • language: zh
+Style Channel • query_gender / answer_gender • query_mood / answer_mood • gender / mood • query_id / answer_id
+Acoustic Target • query_audio_path / answer_audio_path • query_token_25hz • answer_token_25hz • speech_token_25hz • sample_rate: 16000
+Training Record • Text loss: semantic match • Speech-token loss: pronounceable • Speaker constraint: identity • Emotion constraint: style
+text alignment
+control conditions
+unified package
+Core principle: model semantic correctness and vocal expression separately, then merge them into one trainable record.
+
diff --git a/docs/images/part12/ch42_fig03_quality_loop_en.svg b/docs/images/part12/ch42_fig03_quality_loop_en.svg
new file mode 100644
index 00000000..8c2189bb
--- /dev/null
+++ b/docs/images/part12/ch42_fig03_quality_loop_en.svg
@@ -0,0 +1,12 @@
+
+
+
+Quality evaluation and data-flywheel loop
+Automatic checks | schema / duration / rate
+Reverse ASR | text consistency / CER
+Style evaluation | speaker / emotion
+Human sampling | naturalness / misuse
+Repair and versioning | resynth / relabel / downweight / remove
+issue samples | rule updates
+Online feedback enters review; only samples passing semantic, style, audio, and safety gates enter the next training set.
+