- Accepted to Interspeech 2025
- Additional results (Baseline, SSL Frozen, SSL FT in Figure 4) can be found here: espnet/dsu_baseline
This repository provides code and scripts for streaming discrete speech unit (DSU) extraction and evaluation, targeting on-device scenarios. The workflow is organized into several stages, each with corresponding shell scripts for easy execution. This document details each stage and how to run it.
We recommend using Conda for environment management.
conda create -p ./envs python=3.10
conda activate ./envs
conda install -y pytorch=2.4.0 torchaudio=2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install -r requirements.txt

- Shell scripts are provided to automate most stages.
- Core Python scripts implement dataset creation, training, evaluation, and analysis.
Key files and directories include:
- create_dataset.sh: Dataset creation
- train_kmeans.sh: K-means clustering for unit extraction
- run.sh: Main training
- evaluate.sh: Model evaluation
- submit.sh, submit_everything.sh: Submission utilities
- config/: YAML configuration files
- src/, egs/, unit_analyze/, etc.: Code and experiments
Prepare the required dataset by running:
bash create_dataset.sh

- This will invoke create_dataset.py and process raw audio/text into the required format.
- Adjust the script or configs as needed for your dataset location.
To perform K-means clustering on features (for discrete unit extraction):
bash train_kmeans.sh

- This will run train_kmeans.py using parameters set in the shell script.
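
For readers new to discrete units, the idea behind this stage can be sketched in a few lines of plain Python: fit centroids on feature frames, then label each frame with the index of its nearest centroid. This is only an illustrative sketch, not the code in train_kmeans.py; the function names `train_kmeans` and `assign_units`, the toy 2-D frames, and the naive random initialization are all assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_units(frames, centroids):
    """Map each feature frame to the index of its nearest centroid (its unit)."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(frame, centroids[k]))
        for frame in frames
    ]


def train_kmeans(frames, num_units, num_iters=10, seed=0):
    """Naive Lloyd's algorithm; hypothetical helper, not the repository's trainer."""
    rng = random.Random(seed)
    # Initialize centroids from randomly chosen frames (assumed init scheme).
    centroids = [list(f) for f in rng.sample(frames, num_units)]
    for _ in range(num_iters):
        units = assign_units(frames, centroids)
        for k in range(num_units):
            members = [f for f, u in zip(frames, units) if u == k]
            if members:  # keep the previous centroid if a cluster is empty
                dim = len(members[0])
                centroids[k] = [sum(m[d] for m in members) / len(members) for d in range(dim)]
    return centroids


if __name__ == "__main__":
    # Toy 2-D "features"; real SSL features have hundreds of dimensions.
    frames = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
    centroids = train_kmeans(frames, num_units=2)
    print(assign_units(frames, centroids))
```

In practice, a library implementation with better initialization and mini-batching is preferable for large feature sets; the result of naive initialization depends on which frames are drawn as starting centroids.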
Train your streaming DSU model by running:
bash run.sh

- This runs train.py using the configuration specified in run.sh (default: config/soundstream/soundstream.yaml).
- You can override parameters by editing the script or passing options as environment variables.
Evaluate the trained model:
bash evaluate.sh

- This will call evaluate.py and/or evaluate_unit2text.py to compute metrics and generate results.
For challenge or benchmark submissions:
bash submit.sh
# or
bash submit_everything.sh

- These scripts package results for submission or evaluation.
- All major stages reference configuration files under config/.
- Edit YAML files (e.g., soundstream.yaml) to change model, training, or preprocessing parameters.
- calc_gflops.py: Measure model computational cost.
- visualize_cooccurrence.py: Analyze and visualize unit co-occurrences.
- grid2tsv.py: Convert alignment grids for analysis.
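
As a rough illustration of what a unit co-occurrence analysis involves, the sketch below counts how often each discrete unit coincides with each frame-level label and normalizes the counts into P(unit | label). It is not taken from visualize_cooccurrence.py; the function names, the use of frame-aligned phone labels, and the toy sequences are assumptions made for this example.

```python
from collections import Counter


def cooccurrence_counts(units, labels):
    """Count (label, unit) pairs over two frame-aligned sequences."""
    if len(units) != len(labels):
        raise ValueError("units and labels must be frame-aligned (same length)")
    return Counter(zip(labels, units))


def conditional_unit_probs(counts):
    """Turn joint counts into P(unit | label) for every observed pair."""
    label_totals = Counter()
    for (label, _unit), n in counts.items():
        label_totals[label] += n
    return {
        (label, unit): n / label_totals[label]
        for (label, unit), n in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical frame-level unit IDs and phone labels for one utterance.
    units = [3, 3, 7, 7, 7]
    labels = ["a", "a", "a", "b", "b"]
    probs = conditional_unit_probs(cooccurrence_counts(units, labels))
    for (label, unit), p in sorted(probs.items()):
        print(label, unit, round(p, 3))
```

A table of this form can be rendered as a heatmap, with labels on one axis and units on the other, to inspect how consistently units track the labels.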
- Each shell script sets up the environment via path.sh and parses options with parse_options.sh.
- GPU resources are managed via SLURM directives in the scripts; adjust as needed for your compute cluster.
- For further details on baseline and ablation results, see: espnet/dsu_baseline.
- For the complete list of files and scripts, visit the repository contents.
If you use this repository, please cite our Interspeech 2025 paper.
@inproceedings{choi25_interspeech,
title={On-device Streaming Discrete Speech Units},
author={Kwanghee Choi and Masao Someki and Emma Strubell and Shinji Watanabe},
booktitle={Interspeech},
year={2025}
}
For questions, please open an issue or contact the authors.