This guide walks you through the process of preparing audio datasets (LibriSpeech and TIMIT) for model training, including the generation of aligned transcriptions using a pre-trained Wav2Vec2 model. The output is a structured manifest directory ready for use in downstream training tasks.
The provided script performs the following steps:

- Scans audio files under the provided dataset roots.
- Optionally aligns transcriptions using `facebook/wav2vec2-base-960h`.
- Saves audio metadata and alignment files to a manifest directory.
The output will contain:

- `*.jsonl` files for `train`, `dev`, and `test` splits
- A `transcriptions_alignment/` directory with aligned character timestamps
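These manifests are plain JSON Lines, so they can be inspected with the standard library alone. A minimal loading sketch (the field names in the usage comment, such as `audio_path` and `duration`, are illustrative assumptions; check the generated files for the real schema):

```python
import json

def load_manifest(path):
    """Read a JSON-lines manifest: one JSON object per non-empty line."""
    entries = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries

# Hypothetical usage (field names are assumptions, not guaranteed):
# entries = load_manifest("/data/manifests/LibriSpeech_train.jsonl")
# print(len(entries), entries[0].get("audio_path"))
```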
You must manually download and extract the datasets before using the script.
| Dataset | Link |
|---|---|
| LibriSpeech | Official Site |
| TIMIT | LDC Catalog (LDC93S1) (licensed) |
Expected input structure:

```
<LibriSpeech_root>/train-clean-100/...
<TIMIT_root>/TRAIN/DR1/...
```
Output structure:

```
<output_dir>/
├── LibriSpeech_train.jsonl
├── LibriSpeech_dev.jsonl
├── LibriSpeech_test.jsonl
├── timit_train.jsonl
├── timit_test.jsonl
└── transcriptions_alignment/
    └── [mirrors dataset folder structure]
```
```bash
python scripts/audio_dataset_extraction.py \
  --LibriSpeech_root /path/to/LibriSpeech \
  --timit_root /path/to/TIMIT \
  --output_dir /path/to/output_dir
```

| Argument | Required | Description |
|---|---|---|
| `--LibriSpeech_root` | Yes | Path to extracted LibriSpeech dataset |
| `--timit_root` | No | Path to TIMIT dataset (required for phoneme head training) |
| `--output_dir` | Yes | Where to save manifest files and alignments |
| `--skip_transcriptions_alignment` | No | Speeds up processing, but disables auxiliary head training |
| `--debug` | No | Limits file count for quick testing |
| `--num_processes` | No | Number of parallel processes (default: 8) |
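For readers extending the script, the table above corresponds to a CLI that could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the table, not the script's actual source; any behavior not stated in the table is an assumption:

```python
import argparse

def build_parser():
    # Sketch of a CLI matching the argument table; not the real script's code.
    p = argparse.ArgumentParser(
        description="Prepare LibriSpeech/TIMIT manifests for training")
    p.add_argument("--LibriSpeech_root", required=True,
                   help="Path to extracted LibriSpeech dataset")
    p.add_argument("--timit_root",
                   help="Path to TIMIT dataset (required for phoneme head training)")
    p.add_argument("--output_dir", required=True,
                   help="Where to save manifest files and alignments")
    p.add_argument("--skip_transcriptions_alignment", action="store_true",
                   help="Speeds up processing, but disables auxiliary head training")
    p.add_argument("--debug", action="store_true",
                   help="Limits file count for quick testing")
    p.add_argument("--num_processes", type=int, default=8,
                   help="Number of parallel processes")
    return p
```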
- Alignment uses HuggingFace's `facebook/wav2vec2-base-960h` to compute character-level timestamps.
- Alignment is required for training the ASR auxiliary head.
- Skipping alignment will produce valid `.jsonl` metadata, but alignment files will be missing.
- The TIMIT dataset is used for the phoneme classification auxiliary task.
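The core idea behind CTC-based character alignment can be illustrated without the model itself: take the character predicted at each output frame, collapse repeats and blanks, and convert frame indices to seconds using the model's frame stride (roughly 20 ms per frame for wav2vec2-base). A simplified sketch of this collapsing step; the script's actual alignment logic may differ:

```python
def ctc_char_timestamps(frame_chars, frame_stride_s=0.02, blank="_"):
    """Collapse per-frame CTC predictions into (char, start_s, end_s) spans.

    frame_chars: the character predicted at each model output frame.
    blank: the CTC blank symbol, which also separates repeated characters.
    """
    spans = []
    prev = None
    for i, ch in enumerate(frame_chars):
        if ch == blank:
            prev = None  # a blank breaks any run of repeated characters
            continue
        if ch == prev:
            # Same character continuing: extend the current span's end time.
            c, start, _ = spans[-1]
            spans[-1] = (c, start, (i + 1) * frame_stride_s)
        else:
            spans.append((ch, i * frame_stride_s, (i + 1) * frame_stride_s))
        prev = ch
    return spans
```

For example, frames `["h", "h", "_", "i"]` collapse to two spans: "h" covering the first two frames and "i" covering the last one.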
| Setting | Recommendation |
|---|---|
| Machine Type | Use a machine with a GPU (preferably multi-GPU) |
| Number of Processes | Use a high number (e.g., 8-32) for speed |
| Alignment Skipped | Much faster, but no support for the ASR auxiliary head |
```bash
python scripts/audio_dataset_extraction.py \
  --LibriSpeech_root /data/LibriSpeech \
  --timit_root /data/TIMIT \
  --output_dir /data/manifests \
  --num_processes 16
```

```bash
python scripts/audio_dataset_extraction.py \
  --LibriSpeech_root /data/LibriSpeech \
  --output_dir /data/manifests \
  --skip_transcriptions_alignment \
  --num_processes 16
```

Q: Can I run with just LibriSpeech?
A: Yes, but training the phoneme auxiliary head will not be possible (it requires TIMIT).
Q: What happens if I don't pass `--skip_transcriptions_alignment`?
A: The script generates aligned transcriptions using Wav2Vec2. This is slower but enables alignment-based training.
Q: What model is used for alignment?
A: `facebook/wav2vec2-base-960h` from HuggingFace Transformers.