VoiceScribe

A native macOS menu bar app that turns your voice into clean, formatted text, anywhere your cursor is. Press a hotkey, speak, and watch the cleaned-up text stream into your active text field. Like Wispr Flow, but open source and bring-your-own-OpenAI-key.


Features

  • Global hotkey ⌥Space from any app (configurable)
  • Floating overlay — translucent pill that appears over your work without stealing focus
  • Live waveform — visual feedback as you speak
  • Whisper transcription — OpenAI Whisper API; accent-aware, multilingual
  • GPT formatting — GPT-4o-mini cleans grammar, removes filler words, preserves your voice and intent
  • Streaming responses — formatted text appears token-by-token in the overlay as GPT generates it
  • Smart text injection — types at your cursor via Accessibility API, falls back to clipboard paste in apps that don't expose native text fields (Cursor, VS Code, Chrome, Slack, etc.)
  • Silence auto-stop — configurable pause length (default 3 s) before recording auto-stops
  • Silence detection — empty recordings are dropped before they reach Whisper, so you never get ghost transcripts like "Thank you" or "You" from silence hallucinations
  • Edit-before-inject (optional) — preview the formatted text in an editable field; ⌘↵ to inject, Esc to cancel
  • Recent transcripts — last 10 transcripts available from the menu bar for one-click copy
  • Custom system prompt — fully editable in Settings
  • Launch at login — toggleable
  • Clipboard-history hygiene — uses the org.nspasteboard.TransientType convention so text injections aren't recorded by Raycast / Paste / Maccy / Pastebot
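
The silence-detection feature above can be pictured as a level check on the recorded samples before anything is uploaded. The following is a minimal sketch, not the app's actual code: the function names `rmsLevel` and `isEffectivelySilent` and the 0.01 threshold are assumptions made for illustration.

```swift
import Foundation

/// Root-mean-square level of a buffer of PCM samples in the range -1.0...1.0.
func rmsLevel(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

/// Hypothetical gate: treat a recording as empty when its RMS level is below the threshold.
/// The 0.01 default is an illustrative value, not the app's setting.
func isEffectivelySilent(_ samples: [Float], threshold: Float = 0.01) -> Bool {
    rmsLevel(samples) < threshold
}

// One second of near-silence and one second of a 440 Hz tone, both at 16 kHz.
let silence = [Float](repeating: 0.001, count: 16_000)
let tone = (0..<16_000).map { Float(sin(2.0 * Double.pi * 440.0 * Double($0) / 16_000.0)) * 0.5 }

print(isEffectivelySilent(silence))  // true
print(isEffectivelySilent(tone))     // false
```

A recording that fails this kind of check would simply be discarded instead of being sent for transcription.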

Requirements

  • macOS 13 (Ventura) or later
  • Xcode 15+ installed (the Swift compiler + frameworks are needed; you never need to open Xcode itself)
  • An OpenAI API key — covers both Whisper transcription and GPT-4o-mini formatting

Quick start

git clone https://github.com/opeoyeleke/voicescribe.git
cd voicescribe/VoiceScribe

# (Recommended, one-time) Create a stable signing identity in your login
# keychain so the macOS Accessibility grant survives across rebuilds.
./scripts/setup-signing.sh

# Build a release .app bundle
./scripts/build.sh

# Move it wherever you want and launch
mkdir -p ~/Applications   # the per-user Applications folder may not exist yet
mv .build/release/VoiceScribe.app ~/Applications/
open ~/Applications/VoiceScribe.app

On first launch:

  1. Grant Microphone access when prompted
  2. Open System Settings → Privacy & Security → Accessibility → enable VoiceScribe
  3. Click the menu bar icon → Settings… → API tab → paste your OpenAI key
  4. Press ⌥Space anywhere and start speaking

Usage

  1. Place your cursor in any text field (any app)
  2. Press ⌥Space
  3. Speak naturally
  4. Press ⌥Space again, or pause for ~3 s — recording auto-stops
  5. Whisper transcribes → GPT formats → text appears at your cursor

If you've enabled Preview before injecting in Settings, an editable overlay appears after formatting. ⌘↵ injects, Esc cancels.
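
The usage steps above amount to a small state machine driven by the hotkey, the silence timer, and the two API responses. The sketch below models that flow under stated assumptions: the `Phase`, `Event`, and `Session` types are hypothetical names chosen for illustration and are not taken from the VoiceScribe sources.

```swift
/// Hypothetical phases of one dictation session.
enum Phase: Equatable {
    case idle, recording, transcribing, formatting
    case previewing(String)   // formatted text shown in the editable overlay
    case injecting(String)    // text on its way to the focused app
}

/// Hypothetical events that move a session forward.
enum Event {
    case hotkeyPressed
    case silenceTimeout
    case transcriptReady(String)
    case formattedTextReady(String)
    case confirmed(String)    // ⌘↵ in the preview, carrying the possibly edited text
    case cancelled            // Esc in the preview
    case injected
}

struct Session {
    var phase: Phase = .idle
    var previewBeforeInjecting = false

    mutating func handle(_ event: Event) {
        switch (phase, event) {
        case (.idle, .hotkeyPressed):
            phase = .recording
        case (.recording, .hotkeyPressed), (.recording, .silenceTimeout):
            phase = .transcribing
        case (.transcribing, .transcriptReady):
            phase = .formatting
        case (.formatting, .formattedTextReady(let text)):
            phase = previewBeforeInjecting ? .previewing(text) : .injecting(text)
        case (.previewing, .confirmed(let edited)):
            phase = .injecting(edited)
        case (.previewing, .cancelled), (.injecting, .injected):
            phase = .idle
        default:
            break   // ignore events that make no sense in the current phase
        }
    }
}

var session = Session(previewBeforeInjecting: true)
session.handle(.hotkeyPressed)                  // start recording
session.handle(.silenceTimeout)                 // auto-stop after the pause
session.handle(.transcriptReady("um hello"))
session.handle(.formattedTextReady("Hello."))
print(session.phase == .previewing("Hello."))   // true
```

With `previewBeforeInjecting` left at `false`, the same sequence goes straight from formatting to injecting.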


Settings

Open via the menu bar icon → Settings…

  • General
    • Hotkey (default ⌥Space)
    • Auto-inject text into focused app
    • Auto-stop after silence (1.0–6.0 s)
    • Preview before injecting
    • Launch at login
  • API
    • OpenAI API key
    • Formatting model (GPT-4o-mini or GPT-4o)
    • Test Connection button
  • Prompt
    • Edit the system prompt that GPT receives
    • Reset to default
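
To make the "Auto-stop after silence (1.0–6.0 s)" setting concrete, here is a minimal sketch of a countdown that resets whenever speech is heard. The `SilenceAutoStop` type, its method names, and the 0.01 level threshold are assumptions for illustration; only the 1.0–6.0 s range and the 3 s default come from this README.

```swift
import Foundation

/// Hypothetical helper that tracks how long the input level has stayed below a threshold.
struct SilenceAutoStop {
    /// Allowed range of the pause setting, in seconds.
    static let allowedPause: ClosedRange<TimeInterval> = 1.0...6.0

    let pause: TimeInterval
    let threshold: Float
    private var silentSince: TimeInterval?

    init(pause: TimeInterval = 3.0, threshold: Float = 0.01) {
        // Clamp out-of-range values instead of rejecting them.
        self.pause = min(max(pause, Self.allowedPause.lowerBound), Self.allowedPause.upperBound)
        self.threshold = threshold
    }

    /// Feed one level-meter reading; returns true once silence has lasted `pause` seconds.
    mutating func shouldStop(level: Float, at time: TimeInterval) -> Bool {
        if level >= threshold {
            silentSince = nil        // speech resets the countdown
            return false
        }
        let start = silentSince ?? time
        silentSince = start
        return time - start >= pause
    }
}

var autoStop = SilenceAutoStop(pause: 10)   // out of range, clamped to 6.0
print(autoStop.pause)                       // 6.0
```

The caller would feed this from its level meter and stop the recorder the first time `shouldStop` returns `true`.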

Architecture

VoiceScribe/
├── Package.swift                     SPM manifest (one dep: KeyboardShortcuts)
├── Resources/
│   ├── Info.plist                    Bundle metadata, LSUIElement, permission strings
│   └── AppIcon.icns                  App icon (regenerate with scripts/generate-icon.sh)
├── scripts/
│   ├── build.sh                      Release build → .build/release/VoiceScribe.app
│   ├── setup-signing.sh              One-time: self-signed cert in login keychain
│   ├── generate-icon.swift           CoreGraphics icon renderer
│   └── generate-icon.sh              Wraps Swift renderer + sips + iconutil
└── Sources/VoiceScribe/
    ├── App/
    │   ├── VoiceScribeApp.swift      @main entry; placeholder Settings scene
    │   ├── AppDelegate.swift         Menu bar, hotkey, settings window, recent submenu
    │   └── OverlayWindow.swift       Borderless NSWindow; key only in edit-confirm mode
    ├── Models/
    │   └── AppState.swift            Observable state + @AppStorage settings
    ├── Services/
    │   ├── AudioRecorderService.swift  AVAudioRecorder → 16 kHz mono PCM .wav, level meter, silence auto-stop
    │   ├── WhisperService.swift        POST .wav to OpenAI Whisper API → raw transcript
    │   ├── GPTFormatterService.swift   POST transcript to OpenAI Chat API → clean text (streaming + non-streaming)
    │   ├── TextInjectionService.swift  AXUIElement injection, clipboard fallback (transient pasteboard type)
    │   └── RecordingCoordinator.swift  Owns the pipeline: record → transcribe → format → inject
    └── Views/
        ├── OverlayView.swift         SwiftUI floating pill UI
        └── SettingsView.swift        Tabbed Settings: General / API / Prompt
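
The streaming path listed for GPTFormatterService.swift consumes server-sent events from the Chat Completions API: each line starts with `data: `, carries a JSON chunk with a `delta`, and the stream ends with `data: [DONE]`. The sketch below shows one way to turn such lines into text tokens. The `StreamChunk`, `StreamLine`, and `parseStreamLine` names are assumptions for illustration, not the service's real types.

```swift
import Foundation

/// Subset of a streamed Chat Completions chunk; only the fields this sketch reads.
struct StreamChunk: Decodable {
    struct Choice: Decodable {
        struct Delta: Decodable { let content: String? }
        let delta: Delta
    }
    let choices: [Choice]
}

/// Result of parsing one server-sent-event line.
enum StreamLine: Equatable {
    case token(String)
    case done
    case ignored
}

func parseStreamLine(_ line: String) -> StreamLine {
    let prefix = "data: "
    guard line.hasPrefix(prefix) else { return .ignored }   // blank lines, comments
    let payload = String(line.dropFirst(prefix.count))
    if payload == "[DONE]" { return .done }
    guard let chunk = try? JSONDecoder().decode(StreamChunk.self, from: Data(payload.utf8)),
          let text = chunk.choices.first?.delta.content else { return .ignored }
    return .token(text)
}

let lines = [
    #"data: {"choices":[{"delta":{"content":"Hello"}}]}"#,
    "",
    #"data: {"choices":[{"delta":{"content":", world."}}]}"#,
    "data: [DONE]",
]
var streamedText = ""
for line in lines {
    if case .token(let piece) = parseStreamLine(line) { streamedText += piece }
}
print(streamedText)   // Hello, world.
```

In the app, each token would be appended to the overlay as it arrives rather than collected into a single string.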

Customisation

Change language

VoiceScribe currently sends language=en to Whisper. To change it, edit WhisperService.swift: adjust the language init parameter, or expose it in Settings.

Custom formatting prompt

Settings → Prompt tab. The default prompt instructs GPT to act as a transcription editor (not a chatbot), preserve greetings, and never answer questions or invent content. Reset to default if you want to pick up prompt updates after a git pull.
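
For orientation, the system prompt travels as the first message of a Chat Completions request, followed by the raw transcript as the user message. The sketch below builds such a request under stated assumptions: `makeFormattingRequest`, `ChatRequest`, and `ChatMessage` are hypothetical names, and the prompt text is a made-up stand-in, not the app's default prompt.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking   // URLRequest lives here on Linux; not needed on macOS
#endif

struct ChatMessage: Codable, Equatable {
    let role: String
    let content: String
}

struct ChatRequest: Codable {
    let model: String
    let messages: [ChatMessage]
    let stream: Bool
}

/// Hypothetical request builder; the endpoint and JSON field names follow the public OpenAI API.
func makeFormattingRequest(apiKey: String, systemPrompt: String, transcript: String,
                           model: String = "gpt-4o-mini", stream: Bool = true) throws -> URLRequest {
    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/chat/completions")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    let payload = ChatRequest(
        model: model,
        messages: [
            ChatMessage(role: "system", content: systemPrompt),   // the editable prompt
            ChatMessage(role: "user", content: transcript),       // raw Whisper output
        ],
        stream: stream
    )
    request.httpBody = try JSONEncoder().encode(payload)
    return request
}

// Illustrative prompt only; the real default lives in the app's Settings.
let examplePrompt = "You are a transcription editor. Clean up the text; do not answer questions."
let formattingRequest = try makeFormattingRequest(apiKey: "YOUR_OPENAI_KEY",
                                                  systemPrompt: examplePrompt,
                                                  transcript: "um so hello there")
print(String(data: formattingRequest.httpBody!, encoding: .utf8)!)
```

Keeping the transcript in the user message and the instructions in the system message is what lets a prompt tell the model to edit the text instead of responding to it.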

Different Whisper biasing

The Whisper API call passes a prompt parameter to discourage common training-set hallucinations ("Thank you", "Subscribe to my channel"). Tune it in buildMultipartBody() in WhisperService.swift.
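
As a rough picture of where the language and prompt values end up, the sketch below assembles a multipart/form-data body for the transcription endpoint. It is not the project's buildMultipartBody(): the function name `makeTranscriptionBody`, its parameters, and the sample prompt are assumptions for illustration. The form field names (`model`, `language`, `prompt`, `file`) and the `whisper-1` model name follow the public OpenAI API.

```swift
import Foundation

/// Hypothetical body builder for POST https://api.openai.com/v1/audio/transcriptions.
func makeTranscriptionBody(audio: Data, filename: String, boundary: String,
                           language: String = "en", prompt: String? = nil) -> Data {
    var body = Data()

    // Appends one plain-text form field.
    func appendField(_ name: String, _ value: String) {
        let part = "--\(boundary)\r\n"
            + "Content-Disposition: form-data; name=\"\(name)\"\r\n\r\n"
            + "\(value)\r\n"
        body.append(Data(part.utf8))
    }

    appendField("model", "whisper-1")
    appendField("language", language)
    if let prompt = prompt { appendField("prompt", prompt) }

    // The audio file part, followed by the closing boundary.
    let fileHeader = "--\(boundary)\r\n"
        + "Content-Disposition: form-data; name=\"file\"; filename=\"\(filename)\"\r\n"
        + "Content-Type: audio/wav\r\n\r\n"
    body.append(Data(fileHeader.utf8))
    body.append(audio)
    body.append(Data("\r\n--\(boundary)--\r\n".utf8))
    return body
}

let boundary = "Boundary-\(UUID().uuidString)"
let fakeAudio = Data("RIFF".utf8)   // stand-in bytes, not a real .wav file
let transcriptionBody = makeTranscriptionBody(audio: fakeAudio, filename: "clip.wav",
                                              boundary: boundary, language: "fr",
                                              prompt: "Dictation of notes and messages.")
print(transcriptionBody.count)
```

The request that carries this body needs a `Content-Type` header of `multipart/form-data; boundary=` followed by the same boundary string.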


Building from source

Dev loop (debug build, fast iteration)

swift build
cp .build/debug/VoiceScribe .build/VoiceScribe.app/Contents/MacOS/
codesign --force --deep --sign "VoiceScribe Dev" .build/VoiceScribe.app
open .build/VoiceScribe.app

This assumes you have already run setup-signing.sh once. Without it, substitute --sign - for ad-hoc signing, but be aware that every ad-hoc rebuild invalidates the macOS Accessibility grant.

Release build

./scripts/build.sh

Produces a signed .app at .build/release/VoiceScribe.app. Uses "VoiceScribe Dev" if available (run setup-signing.sh first), otherwise falls back to ad-hoc with a warning.

Regenerate app icon

./scripts/generate-icon.sh

Renders a fresh Resources/AppIcon.icns from the Swift CoreGraphics script. To change the colours or the SF Symbol, edit scripts/generate-icon.swift.


Distribution

This repo is set up for bring-your-own-build: you clone, run the build script, and use the resulting .app. There is no signed or notarised build on GitHub Releases.

If you do distribute a signed .app to others, two things matter:

  • Ad-hoc signed (--sign -) builds can be opened by anyone, but each macOS install treats them as unidentified, requiring a right-click → Open the first time. Accessibility grants are tied to the code identity, so an update built on a different machine requires re-granting.
  • Properly notarised distribution requires the Apple Developer Program ($99/yr) so you can sign with a Developer ID certificate and submit the bundle to Apple's notarisation service. Once notarised, downloads open without the unidentified-developer block from Gatekeeper.

Privacy

  • Audio is sent to OpenAI's Whisper API for transcription
  • The raw transcript is sent to OpenAI's Chat Completions API for formatting
  • Both endpoints are reached directly from your machine via URLSession; no third-party servers are involved
  • Your OpenAI API key is stored in macOS UserDefaults (com.opeoyeleke.voicescribe), never transmitted anywhere except OpenAI
  • Recent transcripts are kept locally in UserDefaults (last 10 entries; clearable from the menu)
  • Refer to OpenAI's data usage policy for what they do with API audio and text
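
The "last 10 entries; clearable" behaviour of the recent-transcripts list can be sketched as a small capped collection. The `RecentTranscripts` type below is a hypothetical illustration kept in memory only; it does not show how the app actually reads or writes UserDefaults.

```swift
/// Hypothetical capped history: newest first, at most `limit` entries.
struct RecentTranscripts {
    static let limit = 10
    private(set) var items: [String] = []

    mutating func add(_ transcript: String) {
        items.insert(transcript, at: 0)
        if items.count > Self.limit {
            items.removeLast(items.count - Self.limit)   // drop the oldest entries
        }
    }

    mutating func clear() {
        items.removeAll()
    }
}

var recents = RecentTranscripts()
for index in 1...12 {
    recents.add("Transcript \(index)")
}
print(recents.items.count)    // 10
print(recents.items.first!)   // Transcript 12
```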

Roadmap

Done:

  • Whisper transcription via OpenAI API
  • GPT-4o-mini formatting (streaming)
  • Native AX text injection + clipboard fallback
  • Configurable silence auto-stop
  • Edit-before-inject mode
  • Recent transcripts menu
  • Whisper hallucination biasing
  • Stable code-signing (no permission churn on rebuild)

Possible next steps:

  • Local Whisper via whisper.cpp for offline, low-latency transcription
  • Per-app prompt profiles (different prompt for code vs prose vs Slack)
  • Voice commands ("new paragraph", "delete that", "scratch that")
  • Sparkle auto-updater
  • Notarised distribution

Contributing

PRs welcome. Each service has a single responsibility and is easy to swap or extend:

  • New transcription backend → conform to the WhisperService interface (file URL in, transcript out)
  • New formatter → conform to GPTFormatterService.format / formatStreaming shape
  • Different injection strategy → the TextInjectionService interface is one method: inject(text:) -> Bool
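
As an illustration of the injection shape described above, the sketch below wraps several strategies behind one `inject(text:) -> Bool` call and falls through to the next strategy when one fails. Only that method signature comes from this README. The protocol name `TextInjecting` and the `FallbackInjector`, `RefusingInjector`, and `RecordingInjector` types are assumptions, with the last two serving as stand-ins for real Accessibility and clipboard code.

```swift
/// Hypothetical protocol mirroring the one-method interface.
protocol TextInjecting {
    /// Returns true when the text was delivered to the focused app.
    func inject(text: String) -> Bool
}

/// Tries each strategy in order and stops at the first one that succeeds.
struct FallbackInjector: TextInjecting {
    let strategies: [any TextInjecting]

    func inject(text: String) -> Bool {
        strategies.contains { $0.inject(text: text) }
    }
}

/// Stand-in for a strategy that cannot reach the focused field.
struct RefusingInjector: TextInjecting {
    func inject(text: String) -> Bool { false }
}

/// Stand-in for a strategy that always succeeds and remembers what it was given.
final class RecordingInjector: TextInjecting {
    private(set) var received: [String] = []

    func inject(text: String) -> Bool {
        received.append(text)
        return true
    }
}

let clipboardStandIn = RecordingInjector()
let injector = FallbackInjector(strategies: [RefusingInjector(), clipboardStandIn])
print(injector.inject(text: "Hello."))   // true
print(clipboardStandIn.received)         // ["Hello."]
```

Because `contains(where:)` stops at the first success, later strategies never run once an earlier one has delivered the text.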

Build before opening a PR and check that the command below prints no errors or warnings:

swift build 2>&1 | grep -E "error:|warning:"

If you're adding UI, also do a manual smoke test of the overlay flow (record → format → inject) in at least TextEdit (native AX) and one Electron app (clipboard fallback).


License

MIT
