Large-scale text-to-speech (TTS) systems are bottlenecked by the scarcity of clean, multilingual recordings. Sidon tackles this by pairing a fast, open-source speech restoration model with reproducible tooling so researchers can turn noisy in-the-wild corpora into studio-quality datasets that scale across dozens of languages.
Sidon consists of two stages: a w2v-BERT 2.0 feature predictor finetuned to cleanse representations from degraded speech, and a vocoder trained to synthesise restored waveforms from those features. The stack achieves restoration quality comparable to Miipher—Google's internal speech restoration pipeline—while running up to 500× faster than real time on a single GPU. We also observe that training downstream TTS models on Sidon-cleansed automatic speech recognition corpora improves zero-shot synthesis quality. This repository releases the code, configs, and models needed to reproduce Sidon's dataset cleansing workflow for the community.
This repository ships two models:
- Sidon (arXiv:2509.17052) — the single-speaker speech restoration pipeline described above.
- DialogueSidon (arXiv:2604.09344) — a diffusion-based two-speaker dialogue separator that reuses the Sidon backbone. See the DialogueSidon section.
- Python 3.10+
- Recent PyTorch / CUDA stack (tested with `torch>=2.8`, `torchaudio>=2.8`)
- `uv` for dependency management (or an equivalent toolchain you are comfortable with)
Install project dependencies:
```bash
uv sync
```

If you rely on a different environment manager, replicate the dependencies listed in `pyproject.toml`.
- `src/sidon/model/sidon/lightning_module.py` — Feature predictor, decoder, and discriminator Lightning modules.
- `src/sidon/data` — WebDataset helpers, preprocessing augmentations, and the `PreprocessedDataModule` used for training.
- `src/sidon/preprocess.py` — Parallel writer that turns augmented samples into on-disk shards.
- `config/` — Hydra configuration tree with defaults for preprocessing, data, models, and trainer settings.
- `scripts/` — Utility scripts plus PBS job templates for batch processing.
Training consumes WebDataset shards that contain tensors expected by the `PreprocessedDataModule`:

- `input_wav.pth` and `noisy_input_wav.pth` — paired clean / degraded waveforms stored as 1D float tensors.
- Optional SSL features (`ssl_inputs.pickle`, `noisy_ssl_inputs.pickle`) that provide contextual embeddings for the model.
- `sr.index` and other metadata entries produced by the preprocessing pipeline.
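To sanity-check a shard before training, a minimal inspection loop might look like the following sketch (the shard path is a placeholder; assumes the `webdataset` package):

```python
# Peek at the keys and tensors inside one preprocessed shard.
import io

import torch
import webdataset as wds  # raw iteration yields dicts of key -> bytes

for sample in wds.WebDataset("path/to/shard-000000.tar"):
    print(sample["__key__"], sorted(sample.keys()))
    clean = torch.load(io.BytesIO(sample["input_wav.pth"]))
    noisy = torch.load(io.BytesIO(sample["noisy_input_wav.pth"]))
    print(clean.shape, noisy.shape)  # paired 1D float tensors
    break
```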
Update `config/data/preprocessed.yaml` with the locations of your prepared shards. You can point the `train_urls` and `val_urls` entries at directories of `.tar` / `.tar.gz` files, or text manifests containing S3 URIs. Set `is_s3=true` to stream from object storage via the AWS CLI.
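A manifest is just one shard URL per line. A hypothetical helper to generate manifests could look like this (the data directory is a placeholder):

```python
# Hypothetical helper: write train/val manifests, one shard path (or S3 URI)
# per line, for use as train_urls / val_urls.
from pathlib import Path

for split in ("train", "val"):
    shards = sorted(Path(f"/data/my_preprocessed_run/{split}").glob("*.tar*"))
    Path(f"{split}_manifest.txt").write_text("\n".join(map(str, shards)) + "\n")
```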
Use the Hydra-driven preprocessing entrypoint to convert raw WebDataset collections into the tensorised format described above.
1. Choose the base configuration in `config/preprocess.yaml` (e.g. `webdataset_preprocess_24k` or `webdataset_preprocess_48k`). These configs reference the augmentation pipeline, SSL encoders, and noise sources defined in `config/data/webdataset_preprocess_*.yaml`.

2. Set output parameters in `config/preprocess/default.yaml` (target directory, shard size, number of writer processes).

3. Launch preprocessing locally:

   ```bash
   uv run python -m sidon.preprocess \
     data=webdataset_preprocess_24k \
     preprocess.writer_name=my_preprocessed_run
   ```

   Hydra creates run-specific subdirectories under `outputs/` and writes shards into `${preprocess.output_root}/{writer_name}/{split}/{job_id}`.

4. On PBS-based clusters, adapt the templates in `scripts/pbs/` (e.g. `preprocess_24k.sh`) to submit distributed jobs. The scripts activate a local virtual environment, set MPI-friendly environment variables, and forward Hydra overrides to the preprocessing entrypoint.
Utilities such as `scripts/summarise_shard_durations.py` can help audit the duration distribution of generated shards before training.
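If you prefer an ad-hoc check, a rough audit along the same lines might be (assuming 24 kHz shards and the tensor keys described earlier):

```python
# Rough total-duration estimate over a directory of shards; SR is an assumption.
import glob
import io

import torch
import webdataset as wds

SR = 24_000  # use 48_000 for the 48 kHz pipeline
total = 0.0
for shard in sorted(glob.glob("/data/my_preprocessed_run/train/*.tar")):
    for sample in wds.WebDataset(shard):
        total += torch.load(io.BytesIO(sample["input_wav.pth"])).numel() / SR
print(f"total: {total / 3600:.1f} h")
```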
Sidon training runs in three sequential stages. Every invocation of `python -m sidon.train` resolves a Hydra config and writes artefacts under `outputs/<timestamped_run>/`.
1. **Feature predictor pretraining** — LoRA-adapts the SSL encoder to denoise representations before they are fed to the vocoder (see the LoRA sketch after this list).

   ```bash
   uv run python -m sidon.train \
     model=sidon_feature_predictor \
     data=preprocessed
   ```

   The resulting checkpoint (e.g. `outputs/<run>/checkpoints/last.ckpt`) becomes the `model.cfg.ssl_model_name` input for the finetuning stage.

2. **Vocoder pretraining** — Trains the decoder and discriminator on clean SSL features while the SSL encoder remains frozen.

   ```bash
   uv run python -m sidon.train \
     model=sidon_vocoder_pretrain \
     data=preprocessed
   ```

   Capture the checkpoint path; it will be referenced as `model.cfg.pretrain_path` during finetuning.

3. **Vocoder finetuning** — Warm-starts from the pretraining weights and swaps in the denoised SSL features predicted by the feature predictor.

   ```bash
   uv run python -m sidon.train \
     model=sidon_vocoder_finetune \
     data=preprocessed_48k \
     model.cfg.ssl_model_name=/path/to/feature_predictor.ckpt \
     model.cfg.pretrain_path=/path/to/vocoder_pretrain.ckpt
   ```
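For intuition, LoRA-adapting the encoder might look like the sketch below using `peft`; the rank and target modules here are illustrative assumptions, not the values Sidon actually uses:

```python
# Illustrative LoRA wrapping of the w2v-BERT 2.0 encoder; rank and target
# modules are assumptions, not Sidon's actual settings.
from peft import LoraConfig, get_peft_model
from transformers import Wav2Vec2BertModel

encoder = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["linear_q", "linear_v"])
encoder = get_peft_model(encoder, lora)
encoder.print_trainable_parameters()  # only the LoRA adapters remain trainable
```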
Adjust optimiser, scheduler, or trainer parameters via the files in `config/model/` and `config/train/`, and use `train.ckpt_path` to resume a run.
Full-duplex dialogue audio, in which each speaker is recorded on a separate track, is an important resource for spoken dialogue research, but is difficult to collect at scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural mixtures, making it unsuitable for systems requiring clean speaker-wise signals. We propose DialogueSidon, a model for joint restoration and separation of degraded monaural two-speaker dialogue audio. DialogueSidon combines a variational autoencoder (VAE) that operates on speech self-supervised learning (SSL) model features, compressing them into a compact latent space, with a diffusion-based latent predictor that recovers speaker-wise latent representations from the degraded mixture. Experiments on English, multilingual, and in-the-wild dialogue datasets show that DialogueSidon substantially improves intelligibility and separation quality over a baseline, while also achieving much faster inference.
This repository implements DialogueSidon on top of the Sidon feature backbone: a diffusion transformer head predicts per-speaker latents over a frozen SSL-VAE, conditioned on features from a LoRA-adapted w2v-BERT encoder.
- **SSL encoder** — a LoRA-adapted `facebook/w2v-bert-2.0` student encodes the noisy mixture into frame-level features.
- **SSL-VAE** — a pretrained `SSLVAE` (loaded from `cfg.vae_checkpoint_path`) provides the target latents; its weights are frozen during training.
- **Conditioning heads** — two linear projections (`output_linear1`, `output_linear2`) map SSL features to per-speaker VAE latents used as a conditioning signal.
- **Diffusion transformer head** — a DiT with AdaLN conditioning, RoPE attention, and sinusoidal timestep embeddings predicts the noise (or `v` target) for the concatenated two-speaker latents.
- **DDPM training** — noise is sampled with a `DDPMScheduler` (`prediction_type=v_prediction` by default, 1000 training timesteps). Speaker assignment is resolved with permutation-invariant training (PIT) on the conditioning heads (see the sketch after this list).
- **Latent normalisation** — running mean/std buffers are initialised from the first training batch and reused at inference to stabilise diffusion.
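As a minimal sketch of the two-speaker PIT step (tensor shapes and the MSE criterion are assumptions for illustration):

```python
# Permutation-invariant training for two speakers: score both speaker
# assignments and keep the cheaper one per example.
import torch
import torch.nn.functional as F

def pit_loss(pred_a, pred_b, tgt_1, tgt_2):
    """All tensors: (batch, frames, latent_dim) per-speaker latents."""
    def pair(p1, p2, t1, t2):
        return (F.mse_loss(p1, t1, reduction="none").mean(dim=(1, 2))
                + F.mse_loss(p2, t2, reduction="none").mean(dim=(1, 2)))

    straight = pair(pred_a, pred_b, tgt_1, tgt_2)
    swapped = pair(pred_a, pred_b, tgt_2, tgt_1)
    return torch.minimum(straight, swapped).mean()
```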
The matching inference script is `infer.py` (not `infer_geneses.py`, which is reserved for the flow-matching GENESES separator).
Available under `config/model/`:

| Config | Head hidden | Head layers | Heads | Notes |
|---|---|---|---|---|
| `diffusion_dialogue_sidon` | 768 | 8 | 12 | default |
| `diffusion_dialogue_sidon_small` | 384 | 12 | 6 | small |
| `diffusion_dialogue_sidon_xsmall` | 384 | 6 | 6 | xsmall |
| `diffusion_dialogue_sidon_ac` | 768 | 8 | 12 | activation checkpointing |
| `diffusion_dialogue_sidon_wo_diffusion_head` | — | — | — | baseline without diffusion head |
| `diffusion_dialogue_sidon_wo_vae_latent` | 768 | 8 | 12 | ablation without VAE latent conditioning |
| `diffusion_dialogue_sidon_decoder_finetune` | — | — | — | decoder finetuning stage |
DialogueSidon requires a pretrained SSL-VAE checkpoint. Train it first with `model=ssl_vae`, then pass the resulting checkpoint into the diffusion run via `model.cfg.vae_checkpoint_path`.
1. **SSL-VAE pretraining** — learns the latent space that the diffusion head will predict over.

   ```bash
   uv run python -m sidon.train \
     model=ssl_vae \
     data=dialogue_preprocessed
   ```

2. **Diffusion training** — point `model.cfg.vae_checkpoint_path` at the SSL-VAE checkpoint from step 1.

   ```bash
   uv run python -m sidon.train \
     model=diffusion_dialogue_sidon \
     data=dialogue_preprocessed \
     model.cfg.vae_checkpoint_path=/path/to/ssl_vae.ckpt
   ```

PBS templates for each variant are provided in `scripts/pbs/diffusion_dialogue_sidon*.sh`.
`infer.py` runs chunked inference with overlap, resolves speaker permutation across chunks by cosine similarity in VAE latent space, concatenates the per-chunk latents, and performs a single VAE decode at the end (see the sketch below).
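A minimal sketch of that cross-chunk permutation check (tensor shapes are assumptions for illustration):

```python
# Decide whether a new chunk's speaker order should be swapped to match the
# previous chunk, using cosine similarity over the overlap region.
import torch
import torch.nn.functional as F

def needs_swap(prev_tail, new_head):
    """prev_tail, new_head: (2, overlap_frames, latent_dim) per-speaker latents."""
    sim = torch.empty(2, 2)
    for i in range(2):
        for j in range(2):
            sim[i, j] = F.cosine_similarity(
                prev_tail[i].flatten(), new_head[j].flatten(), dim=0
            )
    # Swap when the crossed assignment matches the previous chunk better.
    return (sim[0, 1] + sim[1, 0]) > (sim[0, 0] + sim[1, 1])
```

The command-line entry points below wrap the full pipeline: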
```bash
# Batch mode — directory of wav files
python infer.py \
  --checkpoint sidon/<run_id> \
  --input-dir ./wavs \
  --output-dir ./out \
  --device cuda:0 \
  --num-steps 30 \
  --chunk-seconds 20 \
  --overlap-seconds 5

# Single audio or video file (replaces the audio track when given a video)
python infer.py \
  --checkpoint sidon/<run_id> \
  --input-video input.mp4 \
  --output-wav separated.wav \
  --output-video output.mp4
```

Use `scripts/pbs/infer_dialogue.sh` to submit the same job on an `rt_QG` (single GPU) PBS queue.
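For intuition, `--num-steps` controls a DDPM denoising loop over the two-speaker latents. A runnable toy sketch with the `diffusers` scheduler named earlier (the head and latent shape are stand-ins, not the trained model):

```python
# Toy denoising loop mirroring --num-steps; the head is a placeholder that
# predicts zeros, and the latent shape is an assumption for illustration.
import torch
from diffusers import DDPMScheduler

def diffusion_head(latents, t):
    return torch.zeros_like(latents)  # stand-in for the trained DiT head

scheduler = DDPMScheduler(num_train_timesteps=1000, prediction_type="v_prediction")
scheduler.set_timesteps(30)  # mirrors --num-steps 30

latents = torch.randn(1, 500, 2 * 64)  # (batch, frames, 2 speakers x latent dim)
for t in scheduler.timesteps:
    latents = scheduler.step(diffusion_head(latents, t), t, latents).prev_sample
```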
- Perform a quick syntax sweep with `python -m compileall src` before submitting jobs.
- Ensure CUDA kernels are available and match the Torch build; most Sidon experiments assume a GPU-backed environment.
- If streaming from S3, check that the AWS CLI is installed and accessible in your job environment.
- The stack is ported from an internal codebase and only partially smoke-checked; if something breaks, please open an issue with details so we can follow up.