Important
MaidWhisper is preparing for its first public beta. Features, packaging, and GPT-SoVITS integration may still change during beta testing.
Caution
An NVIDIA GPU is strongly recommended. GPT-SoVITS may start on some CPU-only systems, but synthesis is usually very slow and can feel unsuitable for daily reading. For the intended real-time Select-to-Speak experience, use an NVIDIA GPU with enough VRAM and a compatible CUDA / PyTorch environment.
MaidWhisper is a Windows Select-to-Speak desktop tool for reading selected text with character-based GPT-SoVITS voices.
Instead of opening a heavy WebUI every time, you select text in any app, press a global hotkey, and use a lightweight floating panel to choose the character, language, tone, speed, volume, and playback state.
Select text -> Global hotkey -> Language detection -> Optional LLM preprocessing -> Segmentation -> GPT-SoVITS synthesis -> Playback + lyrics sync
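The segmentation step in the pipeline above can be illustrated with a small sketch. MaidWhisper's actual segmentation rules are not documented here, so the function name `segment_text`, the `max_chars` parameter, and the punctuation set are all assumptions made for illustration only.

```python
import re

# Hypothetical rule: a segment is a run of text followed by its closing
# sentence punctuation (Western or CJK). Not MaidWhisper's real rule set.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def segment_text(text: str, max_chars: int = 80) -> list[str]:
    """Split text into sentence-like pieces, merging neighbors up to max_chars."""
    pieces = [p.strip() for p in _SENTENCE.findall(text) if p.strip()]
    segments: list[str] = []
    for piece in pieces:
        # Merge with the previous segment when the joined length still fits.
        if segments and len(segments[-1]) + 1 + len(piece) <= max_chars:
            segments[-1] = f"{segments[-1]} {piece}"
        else:
            segments.append(piece)
    return segments
```

Each returned segment could then be synthesized on its own and mapped to one clickable lyrics line, which is one plausible way to keep playback and lyrics in sync.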
| Highlight | Description |
|---|---|
| System-level trigger | Works from browsers, readers, documents, chat apps, and most selectable text areas. |
| Floating control panel | Control character, language, tone, playback, speed, and volume without returning to a main window. |
| Easy character switching | Manage multiple characters, languages, tones, reference audio, and model files. |
| Daily-use workflow | Designed for repeated reading, not one-off manual generation. |
| No WebUI routine | GPT-SoVITS still generates the voice, while MaidWhisper handles capture, routing, segmentation, playback, and lyrics. |
| Optional AI preprocessing | Translate into the character language or rewrite text toward a configured character speaking style. |
- Select text, trigger MaidWhisper, and read with a character voice
- Translate with AI preprocessing, then read in the selected character language
- Floating playback panel
- Clickable lyrics and segment navigation
- Character model settings
- AI character style and translation
| Demo | Audio |
|---|---|
| Character voice demo 1 | DemoVoice_1.wav |
| Character voice demo 2 | DemoVoice_2.wav |
| Character voice demo 3 | DemoVoice_3.wav |
| Character voice demo 4 | DemoVoice_4.wav |
| Character voice demo 5 | DemoVoice_5.wav |
Demo audio is provided only to show the reading workflow and voice output style. Actual results depend on your GPT-SoVITS model, reference audio, synthesis speed, and playback settings.
| Item | Requirement |
|---|---|
| OS | Windows 10 or later |
| GPU / VRAM | NVIDIA GPU recommended; CPU-only synthesis is usually very slow |
| GPT-SoVITS model | SoVITS .pth, GPT .ckpt, reference audio, and matching reference text |
| LLM provider | Optional; required only for AI character style or AI translation |
- Download the latest MaidWhisper-*-Setup.exe from GitHub Releases.
- Run the installer.
- If you do not already have GPT-SoVITS, enable the GPT-SoVITS installation option.
- Wait for the download and extraction to finish. This usually takes about 5 to 10 minutes depending on network and disk speed.
- Open MaidWhisper and go to Settings -> Characters.
- Add a character, then add a language and tone configuration.
- Select the .pth, .ckpt, reference audio, and reference text.
- Select text in another app and press the default hotkey: Alt+Shift+M.
- Choose the character, language, and tone from the floating panel, then press Play.
MaidWhisper starts the GPT-SoVITS API service when needed. It does not launch the training UI or the full GPT-SoVITS WebUI.
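One way to picture "starts the API service when needed" is a reachability check followed by an on-demand launch. The sketch below is not MaidWhisper's implementation; `is_port_open`, `ensure_api_running`, the launcher command, and the timeout values are hypothetical names and defaults chosen for illustration.

```python
import socket
import subprocess
import time


def is_port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ensure_api_running(launcher: list[str], host: str, port: int,
                       wait_seconds: float = 60.0) -> bool:
    """Launch the service only if its port is not already reachable."""
    if is_port_open(host, port):
        return True  # already running, nothing to start
    subprocess.Popen(launcher)  # start the launcher without blocking
    deadline = time.monotonic() + wait_seconds
    while time.monotonic() < deadline:
        if is_port_open(host, port):
            return True
        time.sleep(1.0)
    return False  # the service did not come up in time
```

A call might look like `ensure_api_running(["cmd", "/c", r"C:\path\to\launcher.bat"], "127.0.0.1", 9880)`, where both the path and the port are placeholders; use the launcher and API URL configured in your own Settings.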
LLM preprocessing is optional.
| Feature | Description |
|---|---|
| AI character style | Rewrites text toward the speaking style configured for the selected character. |
| AI translation | Translates input text into the selected character language. |
| Smart skip | If the detected input language already matches the target language and no style rewrite is needed, MaidWhisper skips the LLM call. |
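The smart skip rule in the table above reduces to a small decision function. This is a minimal sketch under stated assumptions: the `PreprocessRequest` type, its field names, and `should_call_llm` are hypothetical and only mirror the behavior described in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreprocessRequest:
    detected_lang: str         # result of language detection, e.g. "en"
    target_lang: str           # language of the selected character
    translation_enabled: bool  # AI translation switch
    style_enabled: bool        # AI character style switch


def should_call_llm(req: PreprocessRequest) -> bool:
    """Return False when the LLM call can be skipped entirely."""
    if req.style_enabled:
        return True  # a style rewrite always needs the LLM
    if req.translation_enabled and req.detected_lang != req.target_lang:
        return True  # the text must be translated into the target language
    return False     # languages match and no rewrite is needed: skip
```

For example, Japanese input read by a Japanese character with style rewriting off yields `False`, so the text goes straight to segmentation and synthesis.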
API keys are stored in Windows Credential Manager instead of normal settings JSON.
MaidWhisper is primarily local. Settings, character metadata, model paths, generated audio cache, and playback preferences are stored in the user data folder.
Text is sent to an external LLM provider only when AI character style or AI translation is enabled. If those features are disabled, MaidWhisper can read text through the local GPT-SoVITS workflow without calling an external LLM service.
| Resource | Purpose |
|---|---|
| GPT-SoVITS official project | Installation, versions, and model format documentation |
| Official Windows package | Windows integrated package used by the setup flow |
| Hugging Face model search | Community models |
| ModelScope model search | Community models, especially Chinese resources |
| GPT-SoVITS Model Collection | Community model collection |
Warning
Community models may have unclear licenses or unsafe files. Verify the source, license, model version, and usage rights before using them.
MaidWhisper does not provide or host third-party character models, voices, or reference audio. You need to prepare compatible GPT-SoVITS assets yourself.
To use an existing GPT-SoVITS installation, open Settings and point MaidWhisper to your existing .bat / .cmd launcher and GPT-SoVITS API URL.
CPU-only usage may work in some environments, but it is usually too slow for daily reading. An NVIDIA GPU is recommended.
MaidWhisper is not a training tool. It focuses on daily reading, hotkeys, floating controls, and playback. Training, slicing, labeling, and fine-tuning still belong in the GPT-SoVITS toolchain.
LLM API keys are stored in Windows Credential Manager. Normal settings JSON stores only non-sensitive provider and model settings.
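The separation between secrets and normal settings can be sketched as a split performed before anything is written to disk. The field name `api_key`, the `SECRET_FIELDS` set, and `split_settings` are hypothetical; MaidWhisper's real settings schema is not shown in this document.

```python
import json

# Hypothetical list of fields that must never reach the settings JSON.
SECRET_FIELDS = {"api_key"}


def split_settings(settings: dict) -> tuple[dict, dict]:
    """Separate non-sensitive settings from secrets."""
    public = {k: v for k, v in settings.items() if k not in SECRET_FIELDS}
    secrets = {k: v for k, v in settings.items() if k in SECRET_FIELDS}
    return public, secrets


def to_settings_json(settings: dict) -> str:
    """Serialize only the non-sensitive part of the settings."""
    public, _ = split_settings(settings)
    return json.dumps(public, indent=2, sort_keys=True)
```

In a Python tool, the secret part could then be handed to a credential store, for instance through the third-party `keyring` package, which uses Windows Credential Manager on Windows. Whether MaidWhisper does it this way is an assumption.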
| Document | Description |
|---|---|
| Project Guide | Architecture overview and entry points |
| System Design | Pipeline contracts and restart strategy |
| Development Plan | Remaining milestone tasks |
| Release Checklist | Release preparation checklist |
| Release Acceptance Matrix | Manual acceptance cases |
| CI/CD and Windows Installer | Automated build and release workflow |
| 2026-06-05 Analysis | Coupling, data separation, and beta risk review |
MaidWhisper is a local voice-generation workflow tool. It does not provide, host, or license third-party character models, voice assets, text content, or LLM services.
Users are responsible for ensuring they have the rights to use their models, reference audio, text, character names, and generated output. Do not use generated audio for impersonation, fraud, harassment, misinformation, or other harmful activity.
| Method | Address |
|---|---|
| USDT TRC-20 | TXmghw1R9QaQSKGWEWP9ydr1xj8G1MqHrL |
| ETH ERC-20 | 0xe7b06b924cbca4b922585a4ccd665cd2c3d0e02c |
MaidWhisper is released under the GNU General Public License v3.0.






