A locally deployed video transcription tool using OpenVINO, tailored to the Intel NPU for low-power, high-performance transcription and diarization.
`transcriber.py` is the main script: it transcribes video (or audio) into text with speaker labels. You pick a video file and an output path; the script converts the audio track to 16 kHz WAV, runs speaker diarization to determine “who spoke when,” then transcribes each segment.
- Transcription is done by a local Whisper model (OpenVINO) running on the Intel NPU, so inference stays on-device and power-efficient.
- Speaker diarization (“who spoke when”) is handled by Pyannote and runs on the CPU. You can set the number of speakers in the dialog (default: 2; allowed range: 1–50). If you leave it unset (Cancel), the pipeline is called without `num_speakers`, so Pyannote auto-detects the number of speakers.
The result is a timestamped transcript saved as a `.txt` file (e.g. `[0.0s - 5.2s] SPEAKER_00: Hello everyone.`).
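The segment-to-line assembly described above can be sketched in a few lines (a minimal illustration; the helper names and the `(start, end, speaker, text)` tuple shape are assumptions, not `transcriber.py`'s actual internals):

```python
# Illustrative sketch only -- not transcriber.py's real internals.

def format_segment(start: float, end: float, speaker: str, text: str) -> str:
    """Render one diarized, transcribed segment as a transcript line."""
    return f"[{start:.1f}s - {end:.1f}s] {speaker}: {text}"

def write_transcript(segments, out_path: str) -> None:
    """segments: iterable of (start, end, speaker, text) tuples."""
    with open(out_path, "w", encoding="utf-8") as f:
        for start, end, speaker, text in segments:
            f.write(format_segment(start, end, speaker, text) + "\n")
```

For example, `format_segment(0.0, 5.2, "SPEAKER_00", "Hello everyone.")` yields exactly the sample line shown above.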
Before running NPUminator, complete these steps:
- **Ensure hardware is NPU-enabled**
  The app targets Intel® Core™ Ultra series processors (and compatible NPU-enabled hardware). Confirm your system has an NPU and that it is enabled in BIOS/firmware if applicable.
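Once the drivers below are installed, you can confirm OpenVINO actually sees the NPU by querying the runtime's device list (a quick check, assuming the `openvino` package from `requirements.txt`):

```python
def available_ov_devices() -> list:
    """Return the devices the OpenVINO runtime can see; 'NPU' should
    appear once the driver is installed. Returns an empty list if the
    openvino package is not installed yet."""
    try:
        from openvino import Core  # openvino >= 2023.1 exposes Core here
    except ImportError:
        return []
    return Core().available_devices

print(available_ov_devices())
```

On a correctly set-up machine the printed list should include `"NPU"`; an empty list means the `openvino` package isn't installed yet.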
- **Install Intel NPU Driver for Windows** (if not already installed)
  Download and install from: Intel® NPU Driver - Windows
- **Install Intel Graphics Driver** (if not already installed)
  Download and install from: Intel® Arc™ Graphics - Windows
- **Install Microsoft Visual C++ Redistributable** (if not already installed)
  Download and install the latest supported version for your architecture (x64 recommended): Latest supported Visual C++ Redistributable downloads
- **Install FFmpeg and add it to PATH**
  - Download FFmpeg from the FFmpeg download page (e.g. a Windows build from gyan.dev or BtbN).
  - Unzip it to a local folder (e.g. `C:\ffmpeg`).
  - Add that folder’s `bin` directory to your system PATH environment variable.
  - A reboot may be required for PATH changes to take effect.
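The conversion step described earlier (video track → 16 kHz WAV) relies on this FFmpeg install. A hedged sketch of the command involved, wrapped for use from Python (the exact flags and helper names are illustrative; `transcriber.py` may invoke FFmpeg differently):

```python
import shutil

def ffmpeg_extract_cmd(video_path: str, wav_path: str) -> list:
    """Build the ffmpeg command that converts a video's audio track to
    16 kHz mono WAV, the format the transcription pipeline expects."""
    return [
        "ffmpeg", "-y",      # overwrite the output file without prompting
        "-i", video_path,    # input video (or audio) file
        "-vn",               # drop the video stream
        "-ac", "1",          # downmix to mono
        "-ar", "16000",      # resample to 16 kHz
        wav_path,
    ]

def ffmpeg_on_path() -> bool:
    """True when the ffmpeg binary is reachable via PATH."""
    return shutil.which("ffmpeg") is not None
```

Once `ffmpeg_on_path()` returns `True`, something like `subprocess.run(ffmpeg_extract_cmd("talk.mp4", "talk.wav"), check=True)` performs the conversion.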
- **Clone the repository**
  Linux (bash):
  ```bash
  git clone https://github.com/crackthedata/NPUminator.git
  cd NPUminator
  ```
  Windows (CMD):
  ```cmd
  git clone https://github.com/crackthedata/NPUminator.git
  cd NPUminator
  ```
- **Create a virtual environment**
  ```
  python -m venv venv
  ```
- **Activate the environment**
  Linux (bash):
  ```bash
  source venv/bin/activate
  ```
  Windows (CMD):
  ```cmd
  venv\Scripts\activate
  ```
- **Install dependencies**
  Run with the virtual environment activated:
  ```
  pip install -r requirements.txt
  ```
- **Download the required Whisper OpenVINO model**
  Run with the virtual environment activated:
  ```
  optimum-cli export openvino --model openai/whisper-base --trust-remote-code whisper-base-ov
  ```
  This exports the `openai/whisper-base` model to the `whisper-base-ov` directory for use with OpenVINO/NPU.
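After the export finishes, the model directory can be loaded through optimum-intel and pinned to the NPU device. A sketch under the assumption that `optimum-intel` and `transformers` are present from `requirements.txt` (the actual loading code in `transcriber.py` may differ):

```python
MODEL_DIR = "whisper-base-ov"  # directory produced by the optimum-cli export above

def load_whisper_npu(model_dir: str = MODEL_DIR):
    """Load the exported Whisper model onto the Intel NPU.
    Imports are deferred so this module can be read without the
    heavy dependencies installed."""
    from optimum.intel import OVModelForSpeechSeq2Seq  # from requirements.txt
    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained(model_dir)
    model = OVModelForSpeechSeq2Seq.from_pretrained(model_dir, device="NPU")
    return processor, model
```

The `device="NPU"` argument keeps inference on-device; swapping in `"CPU"` or `"GPU"` falls back to other OpenVINO targets.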
- **Create a Hugging Face access token and accept Pyannote license terms**
  The transcriber uses Pyannote for speaker diarization; Pyannote models require a Hugging Face token and an accepted license.
  - Go to Hugging Face → Access Tokens and create a token (read access is enough).
  - Open the pyannote/speaker-diarization-3.1 model page and accept the license terms if you haven’t already. Do the same for pyannote/segmentation-3.0 if the pipeline prompts you to.
  - In the project root, create or edit a `.env` file and add:
    ```
    HF_TOKEN=your_token_here
    ```
    Replace `your_token_here` with your actual token. The script loads this via `python-dotenv` and uses it for the Pyannote pipeline.
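The pieces above fit together roughly like this: python-dotenv puts `HF_TOKEN` into the environment, and `num_speakers` is forwarded to the Pyannote pipeline only when the user supplied one. A hedged sketch (the helper name is an assumption, and the actual Pipeline call is shown commented out):

```python
import os

try:
    from dotenv import load_dotenv   # python-dotenv, from requirements.txt
    load_dotenv()                    # reads HF_TOKEN from the project-root .env
except ImportError:
    pass                             # not installed yet; env vars still work

HF_TOKEN = os.environ.get("HF_TOKEN")

def diarization_kwargs(num_speakers=None) -> dict:
    """Forward num_speakers only when the user chose one (1-50);
    otherwise let Pyannote auto-detect the speaker count."""
    if num_speakers is None:
        return {}
    if not 1 <= num_speakers <= 50:
        raise ValueError("num_speakers must be between 1 and 50")
    return {"num_speakers": num_speakers}

# from pyannote.audio import Pipeline
# pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
#                                     use_auth_token=HF_TOKEN)
# diarization = pipeline("audio.wav", **diarization_kwargs(2))
```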
Run with the virtual environment activated:
```
python transcriber.py
```
When you run the script, the following happens:
- A file explorer dialog opens to select the video file to transcribe.
- A file explorer dialog opens to choose where to save the transcript as a `.txt` file.
- A dialog box asks how many speakers should be identified in the conversation. If you don't know, press Cancel and the pipeline will try to detect the number of speakers automatically, though auto-detection can be wrong.
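The three prompts described above map naturally onto Python's standard Tk dialogs (a sketch only, assuming a Tkinter-based UI; `transcriber.py` may use a different toolkit or dialog flow):

```python
def ask_user_inputs():
    """Collect the video path, output path, and optional speaker count
    via standard Tk dialogs. Returns (video, out_txt, num_speakers);
    num_speakers is None when the user cancels that dialog."""
    from tkinter import Tk, filedialog, simpledialog  # deferred: needs a display

    root = Tk()
    root.withdraw()  # no main window, dialogs only
    video = filedialog.askopenfilename(title="Select the video file to transcribe")
    out_txt = filedialog.asksaveasfilename(
        title="Save transcript as", defaultextension=".txt")
    num_speakers = simpledialog.askinteger(
        "Speakers", "How many speakers? (Cancel = auto-detect)",
        initialvalue=2, minvalue=1, maxvalue=50)  # matches the 1-50 range above
    root.destroy()
    return video, out_txt, num_speakers
```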
- Expand coverage to other Windows computers (without an NPU) and to Mac computers, using the GPU for transcription.
- Evaluate performance of other models that perform transcription locally.