Hold to talk. Release to transcribe. Human vibe coding, faster.
Offline voice input for writing, coding, prompting, and human vibe coding on Windows.
Part of the Yongan Toolkit for coding and academic research: Everything to MD for Agent · Gemini Workflows for Agents
Quick Start · Features · Local API · How It Works · Yongan Toolkit · 中文 · 日本語 · 한국어
Most voice input tools are built for chat, not for real desktop work. SpeakFlow is optimized for direct text entry into whatever you are already doing:
- writing papers, notes, prompts, and documentation
- coding with hands mostly on mouse and keyboard
- dictating in Chinese, English, Cantonese, Japanese, or Korean
- working offline without sending audio to cloud services
The workflow is simple:
Hold trigger -> speak -> release -> text appears at the cursor
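That flow can be sketched as a tiny hold-to-record state machine. This is a hypothetical illustration, not SpeakFlow's actual code; the `HoldToTalk` class and its callback are made up for the sketch:

```python
class HoldToTalk:
    """Tracks the press/release cycle: press starts recording,
    release stops it and hands the audio to a transcribe callback."""

    def __init__(self, transcribe):
        self.transcribe = transcribe
        self.recording = False
        self.frames = []

    def press(self):
        self.recording = True
        self.frames = []

    def feed(self, frame):            # called by the audio capture loop
        if self.recording:
            self.frames.append(frame)

    def release(self):
        self.recording = False
        return self.transcribe(b"".join(self.frames))

h = HoldToTalk(lambda audio: f"{len(audio)} bytes captured")
h.feed(b"ignored")                    # frames before press are dropped
h.press()
h.feed(b"\x01\x02")
h.feed(b"\x03\x04")
print(h.release())                    # -> "4 bytes captured"
```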
- Hold-to-record voice input with mouse or keyboard triggers
- Auto-paste transcribed text into the active app
- Offline multilingual ASR with SenseVoice
- GPU acceleration with CUDA, plus CPU fallback
- Browser-based setup wizard for first-time configuration
- Tray app with status icons and quick controls
- Startup integration for always-available dictation
- Built-in `doctor` command for environment diagnostics
- Optional emotion and event text output such as `(情感:高兴)`
- Local HTTP API for other local projects
- Runtime logs written to `logs/speakflow.log`
```shell
git clone https://github.com/YonganZhang/SpeakFlow.git
cd SpeakFlow
pip install -r speakflow/requirements.txt
python -m speakflow
```

On first launch:
- The setup wizard opens in your browser.
- Choose microphone, trigger key, and language settings.
- Save the configuration.
- The speech model downloads on first use.
- SpeakFlow stays in the system tray and is ready to use.
```shell
# Start the app
python -m speakflow

# Run environment diagnostics
python -m speakflow doctor

# Re-open the setup wizard
python -m speakflow setup
```

- Windows 10 or 11
- Python 3.10+
- NVIDIA GPU with CUDA recommended
- CPU mode also works, but is slower
When enabled, SpeakFlow exposes a local-only API for other apps on the same machine:
```
GET  http://127.0.0.1:18360/health
POST http://127.0.0.1:18360/api/v1/transcribe
```
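For orchestration from another local tool, a quick liveness probe against the health endpoint can look like this (a standard-library sketch; it only returns `True` when SpeakFlow is actually running with the API enabled):

```python
import urllib.request

def is_speakflow_up(base: str = "http://127.0.0.1:18360",
                    timeout: float = 2.0) -> bool:
    """Return True if the local SpeakFlow API answers GET /health."""
    try:
        with urllib.request.urlopen(f"{base}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:            # connection refused, timeout, DNS, ...
        return False

print(is_speakflow_up())       # False unless SpeakFlow is already running
```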
Form parameters:
| Parameter | Required | Description |
|---|---|---|
| `file` | Yes | Audio file such as wav or mp3 |
| `language` | No | `auto`, `zh`, `en`, `yue`, `ja`, or `ko` |
| `emotion_mode` | No | `text` or `off` |
| `event_mode` | No | `text` or `off` |
Response JSON:
| Field | Description |
|---|---|
| `text` | Final formatted text |
| `plain_text` | Plain transcription without appended labels |
| `emotion` | Recognized emotion label |
| `event` | Recognized event label |
| `language` | Detected language |
| `raw_text` | Raw SenseVoice output |
Python example:
```python
import requests

with open("test.wav", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:18360/api/v1/transcribe",
        files={"file": ("test.wav", f, "audio/wav")},
        data={
            "language": "auto",
            "emotion_mode": "text",
            "event_mode": "off",
        },
        timeout=120,
    )
print(resp.json()["text"])
```

Recommended reuse pattern:
- Start SpeakFlow once in the background.
- Let your other project send audio files to `http://127.0.0.1:18360/api/v1/transcribe`.
- Use `text` directly, or split into `plain_text`, `emotion`, and `event`.
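If you only have the combined `text`, appended labels such as `(情感:高兴)` can be stripped client-side. This sketch assumes that exact parenthesized `情感:`/`事件:` suffix format, inferred from the example label shown earlier — when in doubt, prefer the API's own `plain_text`, `emotion`, and `event` fields:

```python
import re

# Matches a trailing "(情感:...)" or "(事件:...)" label, with either
# ASCII or full-width parentheses and colon (an assumption about the
# label format, not a documented wire format).
LABEL_RE = re.compile(r"\s*[((](情感|事件)[::]([^()()]+)[))]\s*$")

def split_labels(text: str) -> dict:
    """Peel appended emotion/event labels off the end of `text`."""
    labels = {}
    while (m := LABEL_RE.search(text)):
        key = "emotion" if m.group(1) == "情感" else "event"
        labels[key] = m.group(2)
        text = text[:m.start()]
    labels["plain_text"] = text.strip()
    return labels

print(split_labels("你好世界 (情感:高兴)"))
# -> {'emotion': '高兴', 'plain_text': '你好世界'}
```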
The default config is generated at `~/.speakflow/config.yaml`:

```yaml
trigger:
  type: mouse
  mouse_button: middle
  keyboard_hotkey: f9
  mode: hold
audio:
  sample_rate: 16000
  device: null
asr:
  model: "iic/SenseVoiceSmall"
  device: "cuda:0"
  language: auto
output:
  auto_paste: true
  notification: true
  emotion_mode: text
  event_mode: off
api:
  enabled: true
  host: 127.0.0.1
  port: 18360
```

Trigger listener -> audio recorder -> SenseVoice ASR -> clipboard / paste output
Key components in this repo:
- `speakflow/input_trigger.py` listens for mouse or keyboard triggers
- `speakflow/audio.py` records microphone audio
- `speakflow/transcriber.py` runs ASR via FunASR / SenseVoice
- `speakflow/local_api.py` exposes a local HTTP API
- `speakflow/output.py` handles clipboard and auto-paste
- `speakflow/setup_server.py` powers the browser-based setup wizard
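How the components hand off to each other can be sketched with hypothetical stand-ins (the real modules above have richer APIs; these functions are invented for illustration):

```python
# Stand-ins for the pipeline stages; not SpeakFlow's actual functions.
def record_audio() -> bytes:
    """Pretend recorder: 1 second of 16-bit silence at 16 kHz."""
    return b"\x00\x00" * 16000

def transcribe(pcm: bytes) -> str:
    """Stand-in for SenseVoice ASR."""
    return f"<{len(pcm)} bytes transcribed>"

def paste(text: str) -> str:
    """Stand-in for clipboard / auto-paste output."""
    return text

def on_trigger_release() -> str:
    """Trigger -> record -> ASR -> output, as in the diagram above."""
    return paste(transcribe(record_audio()))

print(on_trigger_release())   # -> "<32000 bytes transcribed>"
```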
- dictating research notes into Obsidian, Notion, or Word
- speaking prompts into Claude, ChatGPT, Gemini, or coding agents
- writing code comments, commit messages, and docs faster
- low-friction multilingual text entry on Windows
- reusing the ASR service from other local tools
This repo is one part of the Yongan Toolkit: a small collection of coding and research tools that work well together.
| Project | What It Helps With |
|---|---|
| speakflow-for-human-vibe-coding | speak ideas, notes, and prompts directly into any Windows app |
| gemini-workflows-for-agents | run Gemini-powered workflows for agents |
| everything-to-md-for-agent | turn documents and equations into Markdown for agents |
Recommended flow: draft with speakflow-for-human-vibe-coding, research with gemini-workflows-for-agents, then convert papers with everything-to-md-for-agent.
```
speakflow-for-human-vibe-coding/
├── speakflow/
│   ├── __main__.py
│   ├── app.py
│   ├── audio.py
│   ├── doctor.py
│   ├── input_trigger.py
│   ├── local_api.py
│   ├── output.py
│   ├── setup_server.py
│   ├── transcriber.py
│   └── resources/
├── build.ps1
├── install_startup.ps1
├── SpeakFlow.bat
├── SpeakFlow.vbs
└── README.md
```
SpeakFlow is an offline voice input tool for Windows aimed at productivity scenarios, not just a speech-to-text demo.
The core experience:
- Hold the trigger key to start recording
- Release to transcribe automatically
- The text is pasted directly at the current cursor position
It now also supports:
- A local HTTP API that other local projects can call directly
- Emotion/event text output
- Logs written to `logs/speakflow.log`
Most common commands:
```shell
python -m speakflow
python -m speakflow doctor
python -m speakflow setup
```

SpeakFlow is an offline voice input tool for Windows. Hold a button to speak; when you release it, the speech is transcribed and auto-pasted at the cursor position.
- Offline voice input
- Supports Chinese, English, Cantonese, Japanese, and Korean
- Fast inference with CUDA
- Reusable from other apps via the local API
Main commands:
```shell
python -m speakflow
python -m speakflow doctor
python -m speakflow setup
```

SpeakFlow is an offline voice input tool for Windows. Hold the button while speaking; on release, the speech is transcribed and automatically pasted at the current cursor position.
- Offline voice input
- Supports Chinese, English, Cantonese, Japanese, and Korean
- Fast inference with CUDA
- Reusable from other projects via the local API
Main commands:
```shell
python -m speakflow
python -m speakflow doctor
python -m speakflow setup
```

MIT. See LICENSE.
SenseVoice is developed by the FunAudioLLM team. Please also follow the upstream model license when using that model.