A Python script that calculates the word error rate (WER) for pairs of reference and generated TXT, VTT, or SRT files and writes the results to a CSV file.
python wer_calc.py [path/to/reference-directory] [path/to/generated-directory] [path/to/output.csv]
Before running the script, create a CSV file (for example, "output.csv") to hold the results, with the following headers: "Reference", "Generated", "WER". List the name of each reference file in the "Reference" column and the name of its corresponding generated file in the "Generated" column, leaving the "WER" column empty. For example:
| Reference | Generated | WER |
|---|---|---|
| reference1.srt | generated1.srt | |
| reference2.txt | generated2.txt | |
The script matches reference files with generated files by looking up the pairs of filenames in each row. The corresponding WER will be written to the "WER" column in the same row.
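The row-by-row pairing described above can be sketched with the standard library alone. This is an illustration only, not the code in wer_calc.py: the function name `fill_wer_column` and the `compute_wer` callback are hypothetical, and the sketch assumes the files are already plain text.

```python
import csv
from pathlib import Path


def fill_wer_column(csv_path, ref_dir, gen_dir, compute_wer):
    """Read each row's file pair, compute its WER, and write it back to that row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    for row in rows:
        # The filenames in the row are looked up in the two directories.
        ref_text = (Path(ref_dir) / row["Reference"]).read_text(encoding="utf-8")
        gen_text = (Path(gen_dir) / row["Generated"]).read_text(encoding="utf-8")
        row["WER"] = compute_wer(ref_text, gen_text)

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["Reference", "Generated", "WER"], extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Passing the WER function in as a callback keeps the CSV handling separate from the scoring, so the same loop works whichever library computes the score.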
Files can be in TXT (".txt"), SRT (".srt"), or VTT (".vtt") format.
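Subtitle files carry cue numbers and timestamps that must not be counted as words. The following is a minimal sketch of how SRT and VTT content could be reduced to plain text before scoring. It is not the conversion used by this script, and the names `TIMING` and `subtitle_to_text` are hypothetical. It also simplifies: VTT NOTE/STYLE blocks and named cue identifiers are not handled.

```python
import re

# Matches SRT ("00:00:01,000 --> 00:00:04,000") and VTT ("00:01.000 --> 00:04.000") timing lines.
TIMING = re.compile(
    r"^\s*(\d+:)?\d{2}:\d{2}[.,]\d{3}\s+-->\s+(\d+:)?\d{2}:\d{2}[.,]\d{3}"
)


def subtitle_to_text(raw):
    """Drop headers, cue numbers, and timing lines; join the remaining caption text."""
    kept = []
    lines = raw.lstrip("\ufeff").splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped or stripped.startswith("WEBVTT") or TIMING.match(stripped):
            continue
        # A bare number directly followed by a timing line is an SRT cue index.
        if stripped.isdigit() and i + 1 < len(lines) and TIMING.match(lines[i + 1]):
            continue
        kept.append(re.sub(r"<[^>]+>", "", stripped))  # remove inline tags such as <i>
    return " ".join(kept)
```

Plain TXT input passes through unchanged apart from line breaks being joined with spaces, so the same function can be applied to all three formats.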
Before running this script, install werpy: pip install werpy or pip3 install werpy
This script relies on werpy to do the following:
- preprocess/normalize the input text: remove punctuation, collapse duplicated spaces, strip leading/trailing blanks, and convert all words to lowercase
- calculate the word error rate (WER) for each pair of reference and hypothesis texts
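To make the two steps concrete, here is a standard-library sketch of normalization followed by WER, computed as the word-level edit distance divided by the number of reference words. It is an illustration of the idea only, not werpy's implementation; the names `normalize` and `word_error_rate` here are local to the sketch, and the script itself delegates both steps to werpy.

```python
import re


def normalize(text):
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For instance, `word_error_rate("The quick brown fox.", "the quick  brown dog")` gives 0.25: after normalization, one of the four reference words is substituted.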
This script is released under the MIT License.
werpy is released under the terms of the BSD 3-Clause License. Please refer to its LICENSE file for full details.
werpy also includes third-party packages distributed under the BSD 3-Clause License (NumPy, Pandas) and the Apache License 2.0 (Cython). The full NumPy, Pandas, and Cython licenses can be found in the werpy LICENSES directory. They can also be found in the following source repositories:
- NumPy - https://github.com/numpy/numpy/blob/main/LICENSE.txt
- Pandas - https://github.com/pandas-dev/pandas/blob/main/LICENSE
- Cython - https://github.com/cython/cython/blob/master/LICENSE.txt
Conversion from SRT is adapted from srt2text. Please refer to its LICENSE for full details.
Feedback, comments, suggestions, etc. are welcome!