Inconsistent pacing and emphasis during synthesis #15

@DragonbornElric

Environment: VoXtream v0.2.0, herimor/voxtream2 model, RTX 3060 12GB, CUDA 12.8, Ubuntu Linux, Python 3.12

Description:
The model applies inconsistent pacing and emphasis during synthesis. Some words receive dramatic emphasis that doesn't match the sentence context, and the speaking rate varies unpredictably, sometimes speeding up mid-sentence and sometimes slowing down. This may be related to the base model's training data influencing the output cadence.

Steps to reproduce:
Send several synthesis requests with conversational text. Observe that emphasis placement and speaking speed vary unpredictably between and within utterances.
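
To make the pacing variation measurable rather than purely subjective, something like the following sketch could be used. It does not call VoXtream. It assumes each request's text and the duration of the resulting audio have already been recorded by hand, and the `samples` values below are hypothetical placeholders, not real measurements. The helper names `speaking_rate` and `rate_spread` are made up for illustration.

```python
import statistics


def speaking_rate(text: str, duration_s: float) -> float:
    """Return words per second for one synthesized utterance."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(text.split()) / duration_s


def rate_spread(samples):
    """Return (mean rate, coefficient of variation) across utterances.

    `samples` is a list of (text, duration_in_seconds) pairs.
    """
    rates = [speaking_rate(text, dur) for text, dur in samples]
    mean = statistics.mean(rates)
    # stdev needs at least two data points; report 0.0 for a single sample.
    cv = statistics.stdev(rates) / mean if len(rates) > 1 else 0.0
    return mean, cv


# Placeholder values only: replace with the text sent to the model and
# the measured length of each output audio file.
samples = [
    ("Hey, how was your weekend?", 1.6),
    ("I think we should leave around noon.", 2.9),
    ("That sounds good to me.", 1.1),
]

mean, cv = rate_spread(samples)
print(f"mean rate: {mean:.2f} words/s, coefficient of variation: {cv:.2f}")
```

A higher coefficient of variation across similar conversational sentences would point to less consistent pacing between utterances. This only captures variation between utterances; the speed changes within a single sentence would need word-level timestamps, which this sketch does not attempt.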

Expected behavior:
Pacing and emphasis should be relatively consistent and contextually appropriate.
