This guide provides instructions for evaluating the PAST model using the metrics reported in the paper.
Before running the evaluation:

- 📥 Clone the repository:

  ```bash
  git clone https://github.com/slp-rl/PAST.git
  cd PAST
  ```

- 🛠️ Set up the environment: follow the instructions in the main README.
- 📁 Prepare the dataset: see our Data Preparation Guide.
✅ Note: For evaluation only, you do not need transcription alignments — this makes the process much faster!
The PAST model is evaluated on both acoustic and phonetic criteria:
- SISNR (Scale-Invariant Signal-to-Noise Ratio): Measures fidelity of waveform reconstruction.
- PESQ (Perceptual Evaluation of Speech Quality): Quantifies perceptual speech quality.
- ViSQOL: Approximates human MOS scores using perceptual similarity.
- PNMI (Phone-Normalized Mutual Information): Measures how informative the token sequence is about the phonemes.
- ABX: Evaluates whether phonetic distinctions are preserved after tokenization.
- WER (Word Error Rate): Evaluates ASR quality based on discrete tokens.
- sWUGGY: Evaluates spoken language modeling performance.
Use this script to compute reconstruction metrics:

```bash
python scripts/eval_acoustic.py \
    --model-cp PAST \
    --output-path <PATH_TO_OUTPUT_DIR> \
    --acoustic-menifest <YOUR_MANIFESTS_DIR>/LibriSpeech_test.jsonl
```

You can replace PAST with:

- A different HuggingFace model (e.g. PAST_streamable)
- A path to a local checkpoint
You can also use any manifest you wish, as long as it points to your audio files and follows the Audiocraft manifest format.
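For reference, an Audiocraft-format manifest is a JSONL file with one JSON object per audio file. A minimal sketch of a single line (field set based on Audiocraft's standard audio-dataset manifests; the path and duration here are illustrative):

```json
{"path": "/data/LibriSpeech/test-clean/1089/134686/1089-134686-0000.flac", "duration": 10.43, "sample_rate": 16000, "amplitude": null, "weight": null, "info_path": null}
```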
⚠️ Note: To compute the PESQ metric, you must install the pypesq dependency manually, as it's not included in the project's environment by default.
Run the following command inside your conda environment:

```bash
pip install git+https://github.com/vBaiCai/python-pesq.git
```

If you skip this step, the evaluation will still run, but PESQ will not be computed.
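A quick way to verify the install is a smoke test like the sketch below. It assumes pypesq's single `pesq(ref, deg, fs)` entry point; the random signals are only for checking that the package imports and runs, not a meaningful quality measurement:

```python
import numpy as np
from pypesq import pesq  # installed via the pip command above

fs = 16000  # PESQ supports 8 kHz and 16 kHz input
ref = np.random.randn(fs).astype(np.float32)                # 1 s stand-in reference
deg = ref + 0.01 * np.random.randn(fs).astype(np.float32)   # slightly degraded copy

print("PESQ:", pesq(ref, deg, fs))  # a near-identical copy should score high
```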
Use this script to compute PNMI using phoneme labels:

```bash
python scripts/eval_phonme.py \
    --model-cp PAST \
    --output-path <PATH_TO_OUTPUT_DIR> \
    --timit-manifest <YOUR_MANIFESTS_DIR>/timit_test.jsonl
```

As before, you can use any manifest and model you wish.
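For intuition, PNMI is the mutual information between phoneme labels and token IDs, normalized by the phoneme entropy. A minimal sketch of that computation (assuming you already have frame-aligned integer phoneme and token sequences; the script above handles all of this for you):

```python
import numpy as np

def pnmi(phones: np.ndarray, tokens: np.ndarray) -> float:
    """PNMI = I(phone; token) / H(phone), from frame-aligned integer labels."""
    # Joint count matrix over (phone, token) pairs.
    joint = np.zeros((phones.max() + 1, tokens.max() + 1))
    np.add.at(joint, (phones, tokens), 1)
    p_joint = joint / joint.sum()
    p_phone = p_joint.sum(axis=1, keepdims=True)  # marginal over phones, shape (P, 1)
    p_token = p_joint.sum(axis=0, keepdims=True)  # marginal over tokens, shape (1, T)
    # Mutual information, skipping empty cells.
    nz = p_joint > 0
    mi = (p_joint[nz] * np.log(p_joint[nz] / (p_phone @ p_token)[nz])).sum()
    # Phoneme entropy for normalization.
    h_phone = -(p_phone[p_phone > 0] * np.log(p_phone[p_phone > 0])).sum()
    return mi / h_phone

# Toy example: tokens perfectly predict phones -> PNMI == 1.0
print(pnmi(np.array([0, 0, 1, 1]), np.array([5, 5, 7, 7])))
```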
WER is computed using the DASB benchmark.
🔗 Guide for tokenizer integration:
→ Incorporating Your Audio Tokenizer
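The exact interface is specified in the DASB guide linked above; roughly, integration amounts to wrapping PAST's encode/decode behind the benchmark's tokenizer API. A sketch of that shape only (the class and method names below are hypothetical placeholders, not DASB's actual API):

```python
import torch

class PastTokenizerWrapper:
    """Hypothetical adapter: real method names and signatures must
    follow the DASB tokenizer-integration guide linked above."""

    def __init__(self, model):
        self.model = model  # a loaded PAST checkpoint

    @torch.no_grad()
    def sig_to_tokens(self, wav: torch.Tensor) -> torch.Tensor:
        # Assumption: EnCodec-style encode() returning (codes, scale).
        codes, _ = self.model.encode(wav)
        return codes

    @torch.no_grad()
    def tokens_to_sig(self, codes: torch.Tensor) -> torch.Tensor:
        return self.model.decode(codes, None)
```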
ABX tests how well the tokenizer preserves phoneme distinctions using continuous embeddings.
We use the official tools from the Libri-light repo.
In our evaluation, we extracted reconstructed embeddings using:

```python
model.decode(codes, scale, return_latent=True)
```

This allows evaluation directly on the latent representations after RVQ.
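A minimal sketch of dumping these latents to disk for the Libri-light ABX tooling. The `encode()` call is an assumption based on Audiocraft-style compression models, and the tensor layout and file format are illustrative; adapt them to how you load the checkpoint and to what the ABX script expects:

```python
from pathlib import Path

import torch
import torchaudio

@torch.no_grad()
def dump_abx_features(model, wav_paths, out_dir):
    """Save per-file post-RVQ latents for ABX evaluation."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav_path in wav_paths:
        wav, sr = torchaudio.load(wav_path)            # [channels, time]
        codes, scale = model.encode(wav.unsqueeze(0))  # assumption: EnCodec-style API
        latent = model.decode(codes, scale, return_latent=True)  # assumed [1, dim, frames]
        feats = latent.squeeze(0).transpose(0, 1).cpu()          # [frames, dim]
        torch.save(feats, out / (Path(wav_path).stem + ".pt"))
```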
ViSQOL is computed using Audiocraft’s wrapper around Google’s implementation.
📘 Documentation:
→ Audiocraft Metrics
To evaluate spoken language modeling performance, we trained language models over different tokenizers’ outputs.
We then used the salmon library to compute the sWUGGY metric.
This benchmark tests the model’s ability to assign higher likelihoods to real words over pseudo-words.
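Concretely, sWUGGY is scored as the fraction of word/pseudo-word pairs where the language model assigns the real word a higher likelihood. A toy sketch of that final scoring step (assuming you already have per-utterance log-likelihoods; the salmon library handles the full pipeline):

```python
def swuggy_accuracy(pairs):
    """pairs: iterable of (ll_word, ll_pseudo) log-likelihood tuples."""
    pairs = list(pairs)
    correct = sum(ll_word > ll_pseudo for ll_word, ll_pseudo in pairs)
    return correct / len(pairs)

# Example: 2 of 3 pairs rank the real word higher -> 0.666...
print(swuggy_accuracy([(-12.3, -15.1), (-9.8, -9.5), (-20.0, -22.4)]))
```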
Happy evaluating! 🎧📈