Large-scale text-to-speech (TTS) systems are bottlenecked by the scarcity of clean, multilingual recordings. Sidon tackles this by pairing a fast, open-source speech restoration model with reproducible tooling so researchers can turn noisy in-the-wild corpora into studio-quality datasets that scale across dozens of languages.
Sidon consists of two stages: a w2v-BERT 2.0 feature predictor finetuned to cleanse representations from degraded speech, and a vocoder trained to synthesise restored waveforms from those features. The stack achieves restoration quality comparable to Miipher—Google's internal speech restoration pipeline—while running up to 500× faster than real time on a single GPU. We also observe that training downstream TTS models on Sidon-cleansed automatic speech recognition corpora improves zero-shot synthesis quality. This repository releases the code, configs, and models needed to reproduce Sidon's dataset cleansing workflow for the community.
This repository ships two models:
- Sidon (arXiv:2509.17052) — the single-speaker speech restoration pipeline described above.
- DialogueSidon (arXiv:2604.09344) — a diffusion-based two-speaker dialogue separator that reuses the Sidon backbone. See the DialogueSidon section.
- Python 3.10+
- Recent PyTorch / CUDA stack (tested with `torch>=2.8`, `torchaudio>=2.8`)
- `uv` for dependency management (or an equivalent toolchain you are comfortable with)
Install project dependencies:
```bash
uv sync
```

If you rely on a different environment manager, replicate the dependencies listed in `pyproject.toml`.
- `src/sidon/model/sidon/lightning_module.py` — Feature predictor, decoder, and discriminator Lightning modules.
- `src/sidon/data` — WebDataset helpers, preprocessing augmentations, and the `PreprocessedDataModule` used for training.
- `src/sidon/preprocess.py` — Parallel writer that turns augmented samples into on-disk shards.
- `config/` — Hydra configuration tree with defaults for preprocessing, data, models, and trainer settings.
- `scripts/` — Utility scripts plus PBS job templates for batch processing.
Training consumes WebDataset shards that contain tensors expected by the `PreprocessedDataModule`:

- `input_wav.pth` and `noisy_input_wav.pth` — paired clean / degraded waveforms stored as 1D float tensors.
- Optional SSL features (`ssl_inputs.pickle`, `noisy_ssl_inputs.pickle`) that provide contextual embeddings for the model.
- `sr.index` and other metadata entries produced by the preprocessing pipeline.
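To sanity-check a shard before training, a minimal inspection loop might look like the following sketch (the shard path is a placeholder; assumes the `webdataset` package):

```python
# Peek at the keys and tensors inside one preprocessed shard.
import io

import torch
import webdataset as wds  # raw iteration yields dicts of key -> bytes

for sample in wds.WebDataset("path/to/shard-000000.tar"):
    print(sample["__key__"], sorted(sample.keys()))
    clean = torch.load(io.BytesIO(sample["input_wav.pth"]))
    noisy = torch.load(io.BytesIO(sample["noisy_input_wav.pth"]))
    print(clean.shape, noisy.shape)  # paired 1D float tensors
    break
```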
Update `config/data/preprocessed.yaml` with the locations of your prepared shards. You can point the `train_urls` and `val_urls` entries at directories of `.tar` / `.tar.gz` files, or text manifests containing S3 URIs. Set `is_s3=true` to stream from object storage via the AWS CLI.
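A manifest is just one shard URL per line. A hypothetical helper to generate manifests could look like this (the data directory is a placeholder):

```python
# Hypothetical helper: write train/val manifests, one shard path (or S3 URI)
# per line, for use as train_urls / val_urls.
from pathlib import Path

for split in ("train", "val"):
    shards = sorted(Path(f"/data/my_preprocessed_run/{split}").glob("*.tar*"))
    Path(f"{split}_manifest.txt").write_text("\n".join(map(str, shards)) + "\n")
```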
Use the Hydra-driven preprocessing entrypoint to convert raw WebDataset collections into the tensorised format described above.
1. Choose the base configuration in `config/preprocess.yaml` (e.g. `webdataset_preprocess_24k` or `webdataset_preprocess_48k`). These configs reference the augmentation pipeline, SSL encoders, and noise sources defined in `config/data/webdataset_preprocess_*.yaml`.

2. Set output parameters in `config/preprocess/default.yaml` (target directory, shard size, number of writer processes).

3. Launch preprocessing locally:

   ```bash
   uv run python -m sidon.preprocess \
     data=webdataset_preprocess_24k \
     preprocess.writer_name=my_preprocessed_run
   ```

   Hydra creates run-specific subdirectories under `outputs/` and writes shards into `${preprocess.output_root}/{writer_name}/{split}/{job_id}`.

4. On PBS-based clusters, adapt the templates in `scripts/pbs/` (e.g. `preprocess_24k.sh`) to submit distributed jobs. The scripts activate a local virtual environment, set MPI-friendly environment variables, and forward Hydra overrides to the preprocessing entrypoint.
Utilities such as `scripts/summarise_shard_durations.py` can help audit the duration distribution of generated shards before training.
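If you prefer an ad-hoc check, a rough audit along the same lines might be (assuming 24 kHz shards and the tensor keys described earlier):

```python
# Rough total-duration estimate over a directory of shards; SR is an assumption.
import glob
import io

import torch
import webdataset as wds

SR = 24_000  # use 48_000 for the 48 kHz pipeline
total = 0.0
for shard in sorted(glob.glob("/data/my_preprocessed_run/train/*.tar")):
    for sample in wds.WebDataset(shard):
        total += torch.load(io.BytesIO(sample["input_wav.pth"])).numel() / SR
print(f"total: {total / 3600:.1f} h")
```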
Sidon training runs in three sequential stages. Every invocation of `python -m sidon.train` resolves a Hydra config and writes artefacts under `outputs/<timestamped_run>/`.
1. **Feature predictor pretraining** — LoRA-adapts the SSL encoder to denoise representations before they are fed to the vocoder (see the LoRA sketch after this list).

   ```bash
   uv run python -m sidon.train \
     model=sidon_feature_predictor \
     data=preprocessed
   ```

   The resulting checkpoint (e.g. `outputs/<run>/checkpoints/last.ckpt`) becomes the `model.cfg.ssl_model_name` input for the finetuning stage.

2. **Vocoder pretraining** — Trains the decoder and discriminator on clean SSL features while the SSL encoder remains frozen.

   ```bash
   uv run python -m sidon.train \
     model=sidon_vocoder_pretrain \
     data=preprocessed
   ```

   Capture the checkpoint path; it will be referenced as `model.cfg.pretrain_path` during finetuning.

3. **Vocoder finetuning** — Warm-starts from the pretraining weights and swaps in the denoised SSL features predicted by the feature predictor.

   ```bash
   uv run python -m sidon.train \
     model=sidon_vocoder_finetune \
     data=preprocessed_48k \
     model.cfg.ssl_model_name=/path/to/feature_predictor.ckpt \
     model.cfg.pretrain_path=/path/to/vocoder_pretrain.ckpt
   ```
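For intuition, LoRA-adapting the encoder might look like the sketch below using `peft`; the rank and target modules here are illustrative assumptions, not the values Sidon actually uses:

```python
# Illustrative LoRA wrapping of the w2v-BERT 2.0 encoder; rank and target
# modules are assumptions, not Sidon's actual settings.
from peft import LoraConfig, get_peft_model
from transformers import Wav2Vec2BertModel

encoder = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["linear_q", "linear_v"])
encoder = get_peft_model(encoder, lora)
encoder.print_trainable_parameters()  # only the LoRA adapters remain trainable
```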
Adjust optimiser, scheduler, or trainer parameters via the files in `config/model/` and `config/train/`, and use `train.ckpt_path` to resume a run.
Full-duplex dialogue audio, in which each speaker is recorded on a separate track, is an important resource for spoken dialogue research, but is difficult to collect at scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural mixtures, making it unsuitable for systems requiring clean speaker-wise signals. We propose DialogueSidon, a model for joint restoration and separation of degraded monaural two-speaker dialogue audio. DialogueSidon combines a variational autoencoder (VAE) that operates on speech self-supervised learning (SSL) model features, compressing them into a compact latent space, with a diffusion-based latent predictor that recovers speaker-wise latent representations from the degraded mixture. Experiments on English, multilingual, and in-the-wild dialogue datasets show that DialogueSidon substantially improves intelligibility and separation quality over a baseline, while also achieving much faster inference.
This repository implements DialogueSidon on top of the Sidon feature backbone: a diffusion transformer head predicts per-speaker latents over a frozen SSL-VAE, conditioned on features from a LoRA-adapted w2v-BERT encoder.
- **SSL encoder** — a LoRA-adapted `facebook/w2v-bert-2.0` student encodes the noisy mixture into frame-level features.
- **SSL-VAE** — a pretrained `SSLVAE` (loaded from `cfg.vae_checkpoint_path`) provides the target latents; its weights are frozen during training.
- **Conditioning heads** — two linear projections (`output_linear1`, `output_linear2`) map SSL features to per-speaker VAE latents used as a conditioning signal.
- **Diffusion transformer head** — a DiT with AdaLN conditioning, RoPE attention, and sinusoidal timestep embeddings predicts the noise (or `v` target) for the concatenated two-speaker latents.
- **DDPM training** — noise is sampled with a `DDPMScheduler` (`prediction_type=v_prediction` by default, 1000 training timesteps). Speaker assignment is resolved with permutation-invariant training (PIT) on the conditioning heads (see the sketch after this list).
- **Latent normalisation** — running mean/std buffers are initialised from the first training batch and reused at inference to stabilise diffusion.
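As a minimal sketch of the two-speaker PIT step (tensor shapes and the MSE criterion are assumptions for illustration):

```python
# Permutation-invariant training for two speakers: score both speaker
# assignments and keep the cheaper one per example.
import torch
import torch.nn.functional as F

def pit_loss(pred_a, pred_b, tgt_1, tgt_2):
    """All tensors: (batch, frames, latent_dim) per-speaker latents."""
    def pair(p1, p2, t1, t2):
        return (F.mse_loss(p1, t1, reduction="none").mean(dim=(1, 2))
                + F.mse_loss(p2, t2, reduction="none").mean(dim=(1, 2)))

    straight = pair(pred_a, pred_b, tgt_1, tgt_2)
    swapped = pair(pred_a, pred_b, tgt_2, tgt_1)
    return torch.minimum(straight, swapped).mean()
```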
The matching inference script is `infer.py` (not `infer_geneses.py`, which is reserved for the flow-matching GENESES separator).
Available under `config/model/`:

| Config | Head hidden | Head layers | Heads | Notes |
|---|---|---|---|---|
| `diffusion_dialogue_sidon` | 768 | 8 | 12 | default |
| `diffusion_dialogue_sidon_small` | 384 | 12 | 6 | small |
| `diffusion_dialogue_sidon_xsmall` | 384 | 6 | 6 | xsmall |
| `diffusion_dialogue_sidon_ac` | 768 | 8 | 12 | activation checkpointing |
| `diffusion_dialogue_sidon_wo_diffusion_head` | — | — | — | baseline without diffusion head |
| `diffusion_dialogue_sidon_wo_vae_latent` | 768 | 8 | 12 | ablation without VAE latent conditioning |
| `diffusion_dialogue_sidon_decoder_finetune` | — | — | — | decoder finetuning stage |
DialogueSidon requires a pretrained SSL-VAE checkpoint. Train it first with `model=ssl_vae`, then pass the resulting checkpoint into the diffusion run via `model.cfg.vae_checkpoint_path`.
1. **SSL-VAE pretraining** — learns the latent space that the diffusion head will predict over.

   ```bash
   uv run python -m sidon.train \
     model=ssl_vae \
     data=dialogue_preprocessed
   ```

2. **Diffusion training** — point `model.cfg.vae_checkpoint_path` at the SSL-VAE checkpoint from step 1.

   ```bash
   uv run python -m sidon.train \
     model=diffusion_dialogue_sidon \
     data=dialogue_preprocessed \
     model.cfg.vae_checkpoint_path=/path/to/ssl_vae.ckpt
   ```

PBS templates for each variant are provided in `scripts/pbs/diffusion_dialogue_sidon*.sh`.
`infer.py` runs chunked inference with overlap, resolves speaker permutation across chunks by cosine similarity in VAE latent space, concatenates the per-chunk latents, and performs a single VAE decode at the end (see the sketch below).
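A minimal sketch of that cross-chunk permutation check (tensor shapes are assumptions for illustration):

```python
# Decide whether a new chunk's speaker order should be swapped to match the
# previous chunk, using cosine similarity over the overlap region.
import torch
import torch.nn.functional as F

def needs_swap(prev_tail, new_head):
    """prev_tail, new_head: (2, overlap_frames, latent_dim) per-speaker latents."""
    sim = torch.empty(2, 2)
    for i in range(2):
        for j in range(2):
            sim[i, j] = F.cosine_similarity(
                prev_tail[i].flatten(), new_head[j].flatten(), dim=0
            )
    # Swap when the crossed assignment matches the previous chunk better.
    return (sim[0, 1] + sim[1, 0]) > (sim[0, 0] + sim[1, 1])
```

The command-line entry points below wrap the full pipeline: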
```bash
# Batch mode — directory of wav files
python infer.py \
  --checkpoint sidon/<run_id> \
  --input-dir ./wavs \
  --output-dir ./out \
  --device cuda:0 \
  --num-steps 30 \
  --chunk-seconds 20 \
  --overlap-seconds 5

# Single audio or video file (replaces the audio track when given a video)
python infer.py \
  --checkpoint sidon/<run_id> \
  --input-video input.mp4 \
  --output-wav separated.wav \
  --output-video output.mp4
```

Use `scripts/pbs/infer_dialogue.sh` to submit the same job on an `rt_QG` (single GPU) PBS queue.
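For intuition, `--num-steps` controls a DDPM denoising loop over the two-speaker latents. A runnable toy sketch with the `diffusers` scheduler named earlier (the head and latent shape are stand-ins, not the trained model):

```python
# Toy denoising loop mirroring --num-steps; the head is a placeholder that
# predicts zeros, and the latent shape is an assumption for illustration.
import torch
from diffusers import DDPMScheduler

def diffusion_head(latents, t):
    return torch.zeros_like(latents)  # stand-in for the trained DiT head

scheduler = DDPMScheduler(num_train_timesteps=1000, prediction_type="v_prediction")
scheduler.set_timesteps(30)  # mirrors --num-steps 30

latents = torch.randn(1, 500, 2 * 64)  # (batch, frames, 2 speakers x latent dim)
for t in scheduler.timesteps:
    latents = scheduler.step(diffusion_head(latents, t), t, latents).prev_sample
```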
- Perform a quick syntax sweep with `python -m compileall src` before submitting jobs.
- Ensure CUDA kernels are available and match the Torch build; most Sidon experiments assume a GPU-backed environment.
- If streaming from S3, check that the AWS CLI is installed and accessible in your job environment.
- The stack is ported from an internal codebase and only partially smoke-checked; if something breaks, please open an issue with details so we can follow up.