On-device Streaming Discrete Speech Units

Accepted to Interspeech 2025
Additional results (Baseline, SSL Frozen, and SSL FT in Figure 4) are available at espnet/dsu_baseline.

Overview

This repository provides code and scripts for streaming discrete speech unit (DSU) extraction and evaluation, targeting on-device scenarios. The workflow is organized into several stages, each with corresponding shell scripts for easy execution. This document details each stage and how to run it.

Installation

We recommend using Conda for environment management.

conda create -p ./envs python=3.10
conda activate ./envs
conda install -y pytorch=2.4.0 torchaudio=2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install -r requirements.txt

Project Structure

Shell scripts are provided to automate most stages.
Core Python scripts implement dataset creation, training, evaluation, and analysis.

Key files and directories include:

create_dataset.sh — Dataset creation
train_kmeans.sh — K-means clustering for unit extraction
run.sh — Main training
evaluate.sh — Model evaluation
submit.sh, submit_everything.sh — Submission utilities
config/ — YAML configuration files
src/, egs/, unit_analyze/, etc. — Code and experiments

Stage-by-Stage Usage

1. Dataset Creation

Prepare the required dataset by running:

bash create_dataset.sh

This invokes create_dataset.py, which processes raw audio and text into the format used by the later stages.
Adjust the script or configs to point to your dataset location.

2. K-means Training (Unit Extraction)

To perform K-means clustering on features (for discrete unit extraction):

bash train_kmeans.sh

This will run train_kmeans.py using parameters set in the shell script.
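
For readers new to discrete units, the idea behind this stage can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in train_kmeans.py: it assumes the usual recipe of clustering frame-level feature vectors and using the index of the nearest centroid as the unit. The function names, the toy 2-D "features", and the evenly spaced initialization are all assumptions made for the example; a real run would likely use high-dimensional features and a library k-means with a stronger initialization.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_unit(frame, centroids):
    """Index of the centroid closest to the frame; this index is the discrete unit."""
    distances = [squared_distance(frame, c) for c in centroids]
    return distances.index(min(distances))


def assign_units(frames, centroids):
    """Convert a sequence of feature frames into a sequence of unit indices."""
    return [nearest_unit(frame, centroids) for frame in frames]


def train_kmeans(frames, num_units, num_iters=20):
    """Plain Lloyd's algorithm with a deterministic, evenly spaced initialization."""
    if num_units > len(frames):
        raise ValueError("num_units must not exceed the number of frames")
    step = len(frames) // num_units
    centroids = [list(frames[i * step]) for i in range(num_units)]
    for _ in range(num_iters):
        units = assign_units(frames, centroids)
        for k in range(num_units):
            members = [f for f, u in zip(frames, units) if u == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Toy data: six 2-D "feature frames" forming two well-separated groups.
frames = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
          [5.0, 5.1], [5.1, 4.9], [4.9, 5.0]]
centroids = train_kmeans(frames, num_units=2)
units = assign_units(frames, centroids)
print(units)  # -> [0, 0, 0, 1, 1, 1]
```

The evenly spaced initialization is used here only to keep the toy example reproducible; with random initialization, Lloyd's algorithm can settle in a poor local optimum on some seeds.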

3. Main Model Training

Train your streaming DSU model by running:

bash run.sh

This runs train.py using the configuration specified in run.sh (default: config/soundstream/soundstream.yaml).
You can override parameters by editing the script or passing options as environment variables.
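
The override pattern described above (defaults that environment variables can replace) can be sketched as follows. This is an illustration of the general idea only, not how run.sh or train.py resolve their settings. Apart from the default config path, which comes from this README, every setting name here (num_gpus, learning_rate) is hypothetical, and the variable names actually read by run.sh may differ.

```python
import os

# Only the config path comes from the README; the other keys are hypothetical.
DEFAULTS = {
    "config": "config/soundstream/soundstream.yaml",
    "num_gpus": 1,          # hypothetical setting
    "learning_rate": 1e-4,  # hypothetical setting
}


def resolve_settings(defaults, environ=None):
    """Return a copy of defaults, replacing any key that is also set in environ.

    Values from the environment are strings, so each one is cast to the type
    of its default (str, int, or float in this sketch).
    """
    environ = os.environ if environ is None else environ
    resolved = dict(defaults)
    for key, default in defaults.items():
        value = environ.get(key)
        if value is not None:
            resolved[key] = type(default)(value)
    return resolved


# A stand-in for the process environment, so the example is reproducible.
example_env = {"num_gpus": "4", "unrelated": "ignored"}
settings = resolve_settings(DEFAULTS, example_env)
print(settings)
```

Variables that do not match a known setting are ignored, so an unrelated environment entry cannot silently add a new option.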

4. Model Evaluation

Evaluate the trained model:

bash evaluate.sh

This will call evaluate.py and/or evaluate_unit2text.py to compute metrics and generate results.

5. Submission

For challenge or benchmark submissions:

bash submit.sh
# or
bash submit_everything.sh

These scripts package results for submission or evaluation.

Configuration

All major stages reference configuration files under config/.
Edit YAML files (e.g., soundstream.yaml) to change model, training, or preprocessing parameters.

Useful Scripts

calc_gflops.py — Measure model computational cost.
visualize_cooccurrence.py — Analyze and visualize unit co-occurrences.
grid2tsv.py — Convert alignment grids for analysis.
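
As background for the computational-cost measurement, the sketch below estimates FLOPs analytically for a stack of 1-D convolutions. It is not what calc_gflops.py does (that script may rely on a profiler instead), and the two layers are hypothetical, chosen only to make the arithmetic concrete. It uses the standard convolution output-length formula and counts each multiply-accumulate as two FLOPs, ignoring bias terms and activations.

```python
def conv1d_output_length(length, kernel_size, stride=1, padding=0, dilation=1):
    """Number of output frames of a 1-D convolution (standard formula)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def conv1d_flops(in_channels, out_channels, kernel_size, out_length):
    """FLOPs of one 1-D convolution, counting a multiply-accumulate as 2 FLOPs."""
    macs = in_channels * out_channels * kernel_size * out_length
    return 2 * macs


# Hypothetical two-layer encoder applied to 1 second of 16 kHz audio.
layers = [
    {"in_channels": 1, "out_channels": 32, "kernel_size": 7, "stride": 1, "padding": 3},
    {"in_channels": 32, "out_channels": 64, "kernel_size": 4, "stride": 2, "padding": 1},
]

length = 16000  # input samples
total_flops = 0
for layer in layers:
    length = conv1d_output_length(
        length, layer["kernel_size"], layer["stride"], layer["padding"]
    )
    total_flops += conv1d_flops(
        layer["in_channels"], layer["out_channels"], layer["kernel_size"], length
    )

print(f"{total_flops / 1e9:.5f} GFLOPs per second of audio")
```

For streaming models, reporting cost per second of input audio makes it easier to compare against a real-time budget on the target device.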

Tips

Each shell script sets up the environment via path.sh and parses options with parse_options.sh.
GPU resources are managed via SLURM directives in the scripts; adjust as needed for your compute cluster.

Additional Notes

For further details on baseline and ablation results, see: espnet/dsu_baseline.
For the complete list of files and scripts, visit the repository contents.

Citation

If you use this repository, please cite our Interspeech 2025 paper.

@inproceedings{choi25_interspeech,
  title={On-device Streaming Discrete Speech Units},
  author={Kwanghee Choi and Masao Someki and Emma Strubell and Shinji Watanabe},
  booktitle={Interspeech},
  year={2025}
}

For questions, please open an issue or contact the authors.
