validate_dataset checks file counts but not stem-name pairing #86

@dennisthemenacing

Description

Opened on behalf of @BeckettFrey

Summary

In src/voxkit/storage/datasets.py, the validate_dataset function checks that the number of audio files equals the number of label files per speaker directory, but it never verifies that each audio file has a matching label file by stem name.

Example

Given a speaker directory:

speaker_001/
├── recording_A.wav
└── recording_B.lab

This passes validation (1 audio file, 1 label file, so the counts match), even though recording_A has no label and recording_B has no audio. The dataset could then fail downstream during alignment or training, without a clear error pointing back to the unpaired files.
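The example can be reproduced with a minimal sketch like the one below. It does not call the real validate_dataset; `count_only_check`, `build_example_dir`, and the two extension sets are hypothetical stand-ins that only mimic the count comparison described above.

```python
import tempfile
from pathlib import Path

# Hypothetical extension sets; the ones voxkit actually accepts may differ.
AUDIO_EXTS = {".wav"}
LABEL_EXTS = {".lab"}


def count_only_check(speaker_path: Path) -> bool:
    """Mimic a count-based check: True when audio and label counts are equal."""
    files = [p for p in speaker_path.iterdir() if p.is_file()]
    audio_files = [p for p in files if p.suffix.lower() in AUDIO_EXTS]
    label_files = [p for p in files if p.suffix.lower() in LABEL_EXTS]
    return len(audio_files) == len(label_files)


def build_example_dir(root: Path) -> Path:
    """Create the speaker_001 layout from the example above."""
    speaker = root / "speaker_001"
    speaker.mkdir()
    (speaker / "recording_A.wav").touch()
    (speaker / "recording_B.lab").touch()
    return speaker


with tempfile.TemporaryDirectory() as tmp:
    speaker = build_example_dir(Path(tmp))
    passes_count_check = count_only_check(speaker)
    stems = sorted(p.stem for p in speaker.iterdir())

print(passes_count_check)  # True: one audio file, one label file
print(stems)               # ['recording_A', 'recording_B']: no shared stem
```

The count comparison succeeds even though the two files share no stem, which is the gap this issue describes.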

Relevant code

https://github.com/BrainBehaviorAnalyticsLab/voxkit-desktop/blob/main/src/voxkit/storage/datasets.py — in validate_dataset, around the per-speaker loop:

if len(audio_files) != len(label_files):
    return (
        False,
        f"Mismatch between number of audio and label files in speaker "
        f"directory '{speaker_path}'.",
    )

Suggested fix

After the count check, add a stem-name comparison:

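# Assumes `from pathlib import Path` is available in datasets.py.
# Collect the stems (file names without extension) on each side.
audio_stems = {Path(f).stem for f in audio_files}
label_stems = {Path(f).stem for f in label_files}
# Stems present on only one side; sorted so the message is deterministic.
unmatched = sorted(audio_stems.symmetric_difference(label_stems))
if unmatched:
    return (
        False,
        f"Unpaired audio/label files in '{speaker_path}': {unmatched}",
    )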

This keeps the scope minimal: one additional check in the existing validation loop.
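For testing the pairing logic in isolation, the same idea could be pulled into a small helper. This is only a sketch: `find_unpaired_stems` is a hypothetical name, not an existing voxkit function, and it assumes the file lists are path strings or Path objects.

```python
from pathlib import Path
from typing import Iterable, List, Tuple, Union

PathLike = Union[str, Path]


def find_unpaired_stems(
    audio_files: Iterable[PathLike],
    label_files: Iterable[PathLike],
) -> Tuple[List[str], List[str]]:
    """Return (audio stems with no label, label stems with no audio)."""
    audio_stems = {Path(f).stem for f in audio_files}
    label_stems = {Path(f).stem for f in label_files}
    # Set differences report each side separately, unlike a symmetric difference.
    missing_labels = sorted(audio_stems - label_stems)
    missing_audio = sorted(label_stems - audio_stems)
    return missing_labels, missing_audio


# The speaker_001 example from this issue: nothing pairs up.
print(find_unpaired_stems(["recording_A.wav"], ["recording_B.lab"]))
# (['recording_A'], ['recording_B'])
```

Reporting the two sides separately makes the error message say which files lack a label and which lack audio. The existing count check is still worth keeping alongside it: stems are collected into sets, so a directory with a.wav, a.flac, and a.lab would have matching stems but mismatched counts.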
