Long text (>500 chars) causes repetition and audio degradation #13

@DragonbornElric

Environment: VoXtream v0.2.0, herimor/voxtream2 model, RTX 3060 12GB, CUDA 12.8, Ubuntu Linux, Python 3.12

Description:
When synthesis input exceeds approximately 500 characters, the model begins repeating the last sentence of the output multiple times and audio quality degrades progressively. The documentation suggests a 1000-character limit, but in practice the model becomes unreliable well before that.

Steps to reproduce:

  1. Send a synthesis request with 500-800 characters of text.
  2. Observe that the model sometimes repeats the final sentence several times. The repetition is intermittent rather than guaranteed, and shows up on longer inputs.
  3. Observe that audio quality degrades toward the end of the output (pacing issues, garbled speech).

Expected behavior:
Model should synthesize the full text without looping or repeating sentences.

Workaround:
Split the input text into chunks of ~250 characters before sending it to the model. At 250 characters the output improved, but pacing issues still occur even at shorter input lengths.
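
For anyone who wants to try the same workaround, here is a minimal sketch of the chunking step. It is plain Python and does not call any VoXtream API. The helper name `chunk_text` and the `max_chars` parameter are my own (hypothetical), and splitting at sentence boundaries is an assumption about what sounds most natural, not something the model requires.

```python
import re


def chunk_text(text: str, max_chars: int = 250) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends.

    Hypothetical helper for the workaround above; not part of VoXtream.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut at a word boundary,
        # or hard-cut when it contains no space at all.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        # Pack whole sentences into the current chunk while they still fit.
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own synthesis request, with the resulting audio concatenated afterwards. That part depends on how you call the model, so it is left out here.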
