Hold to talk. Release to transcribe. Human vibe coding, faster.
Offline voice input for writing, coding, prompting, and human vibe coding on Windows.
Part of the Yongan Toolkit for coding and academic research: Everything to MD for Agent · Gemini Workflows for Agents
Quick Start · Features · Local API · How It Works · Yongan Toolkit · 中文 · 日本語 · 한국어
Most voice input tools are built for chat, not for real desktop work. SpeakFlow is optimized for direct text entry into whatever you are already doing:
- writing papers, notes, prompts, and documentation
- coding with hands mostly on mouse and keyboard
- dictating in Chinese, English, Cantonese, Japanese, or Korean
- working offline without sending audio to cloud services
The workflow is simple:
Hold trigger -> speak -> release -> text appears at the cursor
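That flow can be sketched as a tiny hold-to-record state machine. This is a hypothetical illustration, not SpeakFlow's actual code; the `HoldToTalk` class and its callback are made up for the sketch:

```python
class HoldToTalk:
    """Tracks the press/release cycle: press starts recording,
    release stops it and hands the audio to a transcribe callback."""

    def __init__(self, transcribe):
        self.transcribe = transcribe
        self.recording = False
        self.frames = []

    def press(self):
        self.recording = True
        self.frames = []

    def feed(self, frame):            # called by the audio capture loop
        if self.recording:
            self.frames.append(frame)

    def release(self):
        self.recording = False
        return self.transcribe(b"".join(self.frames))

h = HoldToTalk(lambda audio: f"{len(audio)} bytes captured")
h.feed(b"ignored")                    # frames before press are dropped
h.press()
h.feed(b"\x01\x02")
h.feed(b"\x03\x04")
print(h.release())                    # -> "4 bytes captured"
```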
- Hold-to-record voice input with mouse or keyboard triggers
- Auto-paste transcribed text into the active app
- Offline multilingual ASR with SenseVoice
- GPU acceleration with CUDA, plus CPU fallback
- Browser-based setup wizard for first-time configuration
- Tray app with status icons and quick controls
- Startup integration for always-available dictation
- Built-in `doctor` command for environment diagnostics
- Optional emotion and event text output such as `(情感:高兴)`
- Local HTTP API for other local projects
- Runtime logs written to `logs/speakflow.log`
```shell
git clone https://github.com/YonganZhang/SpeakFlow.git
cd SpeakFlow
pip install -r speakflow/requirements.txt
python -m speakflow
```

On first launch:
- The setup wizard opens in your browser.
- Choose microphone, trigger key, and language settings.
- Save the configuration.
- The speech model downloads on first use.
- SpeakFlow stays in the system tray and is ready to use.
```shell
# Start the app
python -m speakflow

# Run environment diagnostics
python -m speakflow doctor

# Re-open the setup wizard
python -m speakflow setup
```

- Windows 10 or 11
- Python 3.10+
- NVIDIA GPU with CUDA recommended
- CPU mode also works, but is slower
When enabled, SpeakFlow exposes a local-only API for other apps on the same machine:
```
GET  http://127.0.0.1:18360/health
POST http://127.0.0.1:18360/api/v1/transcribe
```
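For orchestration from another local tool, a quick liveness probe against the health endpoint can look like this (a standard-library sketch; it only returns `True` when SpeakFlow is actually running with the API enabled):

```python
import urllib.request

def is_speakflow_up(base: str = "http://127.0.0.1:18360",
                    timeout: float = 2.0) -> bool:
    """Return True if the local SpeakFlow API answers GET /health."""
    try:
        with urllib.request.urlopen(f"{base}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:            # connection refused, timeout, DNS, ...
        return False

print(is_speakflow_up())       # False unless SpeakFlow is already running
```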
Form parameters:
| Parameter | Required | Description |
|---|---|---|
| `file` | Yes | Audio file such as wav or mp3 |
| `language` | No | `auto`, `zh`, `en`, `yue`, `ja`, or `ko` |
| `emotion_mode` | No | `text` or `off` |
| `event_mode` | No | `text` or `off` |
Response JSON:
| Field | Description |
|---|---|
| `text` | Final formatted text |
| `plain_text` | Plain transcription without appended labels |
| `emotion` | Recognized emotion label |
| `event` | Recognized event label |
| `language` | Detected language |
| `raw_text` | Raw SenseVoice output |
Python example:
```python
import requests

with open("test.wav", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:18360/api/v1/transcribe",
        files={"file": ("test.wav", f, "audio/wav")},
        data={
            "language": "auto",
            "emotion_mode": "text",
            "event_mode": "off",
        },
        timeout=120,
    )
print(resp.json()["text"])
```

Recommended reuse pattern:
- Start SpeakFlow once in the background.
- Let your other project send audio files to `http://127.0.0.1:18360/api/v1/transcribe`.
- Use `text` directly, or split into `plain_text`, `emotion`, and `event`.
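If you only have the combined `text`, appended labels such as `(情感:高兴)` can be stripped client-side. This sketch assumes that exact parenthesized `情感:`/`事件:` suffix format, inferred from the example label shown earlier — when in doubt, prefer the API's own `plain_text`, `emotion`, and `event` fields:

```python
import re

# Matches a trailing "(情感:...)" or "(事件:...)" label, with either
# ASCII or full-width parentheses and colon (an assumption about the
# label format, not a documented wire format).
LABEL_RE = re.compile(r"\s*[((](情感|事件)[::]([^()()]+)[))]\s*$")

def split_labels(text: str) -> dict:
    """Peel appended emotion/event labels off the end of `text`."""
    labels = {}
    while (m := LABEL_RE.search(text)):
        key = "emotion" if m.group(1) == "情感" else "event"
        labels[key] = m.group(2)
        text = text[:m.start()]
    labels["plain_text"] = text.strip()
    return labels

print(split_labels("你好世界 (情感:高兴)"))
# -> {'emotion': '高兴', 'plain_text': '你好世界'}
```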
The default config is generated at `~/.speakflow/config.yaml`:

```yaml
trigger:
  type: mouse
  mouse_button: middle
  keyboard_hotkey: f9
  mode: hold
audio:
  sample_rate: 16000
  device: null
asr:
  model: "iic/SenseVoiceSmall"
  device: "cuda:0"
  language: auto
output:
  auto_paste: true
  notification: true
  emotion_mode: text
  event_mode: off
api:
  enabled: true
  host: 127.0.0.1
  port: 18360
```

Trigger listener -> audio recorder -> SenseVoice ASR -> clipboard / paste output
Key components in this repo:
- `speakflow/input_trigger.py` listens for mouse or keyboard triggers
- `speakflow/audio.py` records microphone audio
- `speakflow/transcriber.py` runs ASR via FunASR / SenseVoice
- `speakflow/local_api.py` exposes a local HTTP API
- `speakflow/output.py` handles clipboard and auto-paste
- `speakflow/setup_server.py` powers the browser-based setup wizard
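How the components hand off to each other can be sketched with hypothetical stand-ins (the real modules above have richer APIs; these functions are invented for illustration):

```python
# Stand-ins for the pipeline stages; not SpeakFlow's actual functions.
def record_audio() -> bytes:
    """Pretend recorder: 1 second of 16-bit silence at 16 kHz."""
    return b"\x00\x00" * 16000

def transcribe(pcm: bytes) -> str:
    """Stand-in for SenseVoice ASR."""
    return f"<{len(pcm)} bytes transcribed>"

def paste(text: str) -> str:
    """Stand-in for clipboard / auto-paste output."""
    return text

def on_trigger_release() -> str:
    """Trigger -> record -> ASR -> output, as in the diagram above."""
    return paste(transcribe(record_audio()))

print(on_trigger_release())   # -> "<32000 bytes transcribed>"
```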
- dictating research notes into Obsidian, Notion, or Word
- speaking prompts into Claude, ChatGPT, Gemini, or coding agents
- writing code comments, commit messages, and docs faster
- low-friction multilingual text entry on Windows
- reusing the ASR service from other local tools
This repo is one part of the Yongan Toolkit: a small collection of coding and research tools that work well together.
| Project | What It Helps With |
|---|---|
| speakflow-for-human-vibe-coding | speak ideas, notes, and prompts directly into any Windows app |
| gemini-workflows-for-agents | run Gemini-powered workflows for agents |
| everything-to-md-for-agent | turn documents and equations into Markdown for agents |
Recommended flow: draft with speakflow-for-human-vibe-coding, research with gemini-workflows-for-agents, then convert papers with everything-to-md-for-agent.
```
speakflow-for-human-vibe-coding/
├── speakflow/
│   ├── __main__.py
│   ├── app.py
│   ├── audio.py
│   ├── doctor.py
│   ├── input_trigger.py
│   ├── local_api.py
│   ├── output.py
│   ├── setup_server.py
│   ├── transcriber.py
│   └── resources/
├── build.ps1
├── install_startup.ps1
├── SpeakFlow.bat
├── SpeakFlow.vbs
└── README.md
```
SpeakFlow is an offline voice input tool for Windows aimed at productivity scenarios, not just a speech-to-text demo.
The core experience:
- Hold the trigger key to start recording
- Release to transcribe automatically
- The text is pasted directly at the current cursor position
It now also supports:
- A local HTTP API that other local projects can call directly
- Emotion/event text output
- Logs written to `logs/speakflow.log`
Most common commands:
```shell
python -m speakflow
python -m speakflow doctor
python -m speakflow setup
```

SpeakFlow is an offline voice input tool for Windows. Hold a button to speak; when you release it, the speech is transcribed and auto-pasted at the cursor position.
- Offline voice input
- Supports Chinese, English, Cantonese, Japanese, and Korean
- Fast inference with CUDA
- Reusable from other apps via the local API
Main commands:
```shell
python -m speakflow
python -m speakflow doctor
python -m speakflow setup
```

SpeakFlow is an offline voice input tool for Windows. Hold the button while speaking; on release, the speech is transcribed and automatically pasted at the current cursor position.
- Offline voice input
- Supports Chinese, English, Cantonese, Japanese, and Korean
- Fast inference with CUDA
- Reusable from other projects via the local API
Main commands:
```shell
python -m speakflow
python -m speakflow doctor
python -m speakflow setup
```

MIT. See LICENSE.
SenseVoice is developed by the FunAudioLLM team. Please also follow the upstream model license when using that model.