TommyTalker Pro

Privacy-first voice-to-text for macOS — local STT via mlx-whisper with app-aware formatting, AI post-processing, and push-to-talk dictation.

TommyTalker Pro is the enhanced evolution of the open-source TommyTalker. It extends the core push-to-talk dictation engine with a multi-mode system, cloud LLM post-processing, transcription history, speaker diarization, file transcription, auto-activation rules, and a 5-panel sidebar dashboard. All speech recognition remains 100% on-device via mlx-whisper on Apple Silicon Metal — cloud APIs are used only for optional AI text formatting, and only when explicitly enabled per mode.

Note: This is a showcase repository. It contains architecture documentation, design decisions, and technical specifications — not source code. The full implementation is maintained in a private repository. See ARCHITECTURE.md for the complete system design.

What TommyTalker Pro Adds Over TommyTalker

TommyTalker provides the foundation: push-to-talk recording, local mlx-whisper transcription, app-aware formatting for 97 apps, and a menu bar interface. TommyTalker Pro builds on every layer of that architecture:

Capability	TommyTalker (Base)	TommyTalker Pro
Operating Modes	Single mode (voice-to-text)	6 presets: Voice to Text, Message, Email, Note, Custom, Meeting
AI Processing	Engine-only (Ollama, not wired)	Per-mode LLM formatting via Claude, GPT, or Groq — context-aware with clipboard + selection
Hardware Detection	3 tiers (RAM only)	4 tiers with GPU awareness (Metal/CUDA detection)
Recording UI	None	Classic waveform (320x120) + mini compact (180x36), color-coded status
History	None	SQLite searchable history with date grouping, retention cleanup, per-record metadata
Speaker Diarization	Engine-only (not wired)	Wired into Meeting mode — pyannote.audio speaker identification
File Transcription	None	External audio/video transcription (MP3, MP4, WAV, M4A, FLAC, OGG, WEBM)
Auto-Activation	None	Rule-based automatic mode switching by frontmost app (2.5s polling)
Word Replacements	None	Custom STT correction rules for recurring misrecognitions
Session Recording	None	WAV archival of recording sessions at 44.1 kHz
Dashboard	3-tab settings	5-panel sidebar: Configuration, Modes, Sound, Vocabulary, History
Menu Bar	Basic controls	Mode switcher submenu, file transcription action, recording state with mode name
Security	Basic .gitignore	9-phase pre-commit security scanner + .env credential isolation
Test Coverage	76 tests	356 tests across 13 files

Architecture Highlights

Local-First ML Pipeline

Speech recognition runs entirely on-device through mlx-whisper, Apple's optimized ML framework for Apple Silicon. Audio is captured at 16 kHz mono via PortAudio (sounddevice), bypassing high-level audio APIs for low-latency push-to-talk performance. The 4-tier hardware detection system probes RAM capacity and GPU presence (Metal via system_profiler, CUDA via nvidia-smi) to select the appropriate Whisper model — from distil-whisper-small on 8GB machines to distil-whisper-large-v3 on 32GB+ systems with dedicated GPU.

Multi-Mode AI Pipeline

Six operating modes (Voice to Text, Message, Email, Note, Custom, Meeting) each carry independent configuration: voice model, AI toggle, LLM provider, and custom prompt instructions. When AI is enabled for a mode, the pipeline gathers context — selected text and clipboard contents from the frontmost application — and sends the raw transcription plus context to the configured LLM (Anthropic Claude, OpenAI GPT, or Groq) for formatting. The LLM client abstracts provider differences behind a unified interface with per-provider configuration (API keys, base URLs, model selection).

Quartz Event Tap Hotkey System

Global hotkeys use macOS Quartz Event Taps to intercept keyboard events at the system level — including modifier-only keys (Right Command) that standard hotkey libraries cannot capture. The event tap handles flagsChanged events for modifier keys and keyDown/keyUp events for standard combos (Option+R, Option+D), with a 500ms debounce to prevent rapid retrigger.

Feature Table

Feature	Technical Implementation	Business Value
Push-to-Talk Dictation	Quartz Event Tap + sounddevice PortAudio stream at 16 kHz	Hands-free text input without leaving the current application
App-Aware Formatting	NSWorkspace bundle ID detection, 97 JSON profiles with regex fallback	Text arrives correctly formatted — no manual cleanup
Multi-Mode System	6 preset configurations with per-mode voice model + AI + prompt settings	One app handles email, notes, meetings, and code dictation
AI Post-Processing	Pluggable LLM client (Anthropic/OpenAI/Groq) with context injection	Raw speech becomes polished, professional text
Transcription History	SQLite with full-text search, date grouping, configurable retention	Searchable record of every dictation with one-click retrieval
Speaker Diarization	pyannote.audio pipeline integrated into Meeting mode	Meeting transcripts identify who said what
File Transcription	Format detection + pydub conversion → Whisper pipeline	Transcribe existing recordings without re-recording
Auto-Activation	QTimer polling at 2.5s + NSWorkspace app detection + rule matching	Mode switches automatically when switching between apps
Recording Windows	PyQt6 frameless overlays with real-time waveform (QPainter)	Visual recording feedback without obscuring the workspace
Hardware-Aware Models	psutil + Metal/CUDA probing → 4-tier model selection	Optimal performance on any Apple Silicon Mac
Session Archival	Parallel 44.1 kHz WAV capture alongside 16 kHz STT stream	Lossless recording backup for compliance or review
Security Pipeline	9-phase bash scanner (secrets, PII, paths, patterns, commercial markers)	No credentials or sensitive data reach version control

Screenshots

Screenshots and demo recordings will be added in a future update.

Metrics

Metric	Value
Test Coverage	356 tests across 13 test files
App Profiles	97 macOS applications with format detection
Operating Modes	6 presets with full per-mode configuration
LLM Providers	3 (Anthropic Claude, OpenAI GPT, Groq)
Hardware Tiers	4 (Light, Basic, Standard, Pro)
Development Phases	9 phases (11-19), each with PRD and TDD
STT Sample Rate	16,000 Hz (Whisper-optimized mono)
Archival Sample Rate	44,100 Hz (CD-quality WAV)
Hotkey Debounce	500 ms
Auto-Activation Interval	2,500 ms
Dashboard Panels	5 (Configuration, Modes, Sound, Vocabulary, History)

Technology Stack

Layer	Technology	Purpose
Language	Python 3.12+	Core application and engine
GUI Framework	PyQt6	Menu bar, dashboard, recording windows, onboarding
Speech-to-Text	mlx-whisper (Metal)	On-device transcription on Apple Silicon
AI Processing	Anthropic SDK, OpenAI SDK, Groq SDK	Per-mode LLM text formatting
Audio Capture	sounddevice (PortAudio) + soundfile	Dual-stream recording (16 kHz STT + 44.1 kHz archival)
Speaker ID	pyannote.audio	Speaker diarization for meeting transcripts
History Storage	SQLite (via Python stdlib)	Searchable transcription history with retention
Hotkey System	Quartz Event Tap (pyobjc)	Modifier-only global hotkeys
App Detection	NSWorkspace (pyobjc-Cocoa)	Frontmost app identification for context + auto-activation
Text Insertion	pyautogui + pasteboard	Cursor-position text pasting
Hardware Detection	psutil + system_profiler + nvidia-smi	Cross-platform GPU-aware tier selection
Security	Custom 9-phase bash scanner	Pre-commit secret/PII/path detection
Packaging	PyInstaller	macOS .app bundle with LaunchAgent

System Design

See ARCHITECTURE.md for the complete system architecture including:

C4 System Context and Container diagrams
Push-to-talk data flow (record → transcribe → format → AI → paste)
Key design decisions with rationale and alternatives considered
Security posture and credential management approach
Hardware detection and model selection logic

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.github/ISSUE_TEMPLATE		.github/ISSUE_TEMPLATE
assets/images		assets/images
ARCHITECTURE.md		ARCHITECTURE.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TommyTalker Pro

What TommyTalker Pro Adds Over TommyTalker

Architecture Highlights

Local-First ML Pipeline

Multi-Mode AI Pipeline

Quartz Event Tap Hotkey System

Feature Table

Screenshots

Metrics

Technology Stack

System Design

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

TommyTalker Pro

What TommyTalker Pro Adds Over TommyTalker

Architecture Highlights

Local-First ML Pipeline

Multi-Mode AI Pipeline

Quartz Event Tap Hotkey System

Feature Table

Screenshots

Metrics

Technology Stack

System Design

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages