Source
https://github.com/soniqo/speech-core — C++17 voice-agent pipeline engine, Apache 2.0
What speech-core provides
speech-core is a C++17 orchestration core plus LiteRT/ONNX model backends for on-device voice. It has three layers, along with a C API and Android support:
- Orchestration core (zero ML deps): VoicePipeline state machine with turn detection, interruption handling, speech queue, conversation context, and streaming VAD hysteresis. Covers the full audio → VAD → STT → LLM → TTS loop.
- Abstract interfaces (STTInterface, TTSInterface, VADInterface, EnhancerInterface, EchoCancellerInterface, LLMInterface, plus diarization interfaces): pure-virtual C++ classes. Consumers plug in any backend.
- Reference model implementations (opt-in per backend):
  - LiteRT backend (SPEECH_CORE_WITH_LITERT): Silero VAD v5, Parakeet TDT 0.6B INT8 (~595MB), Nemotron Streaming STT 0.6B (true cache-aware streaming), Omnilingual ASR CTC 300M (multilingual), Pyannote Segmentation 3.0 (diarization), WeSpeaker ResNet34-LM (speaker embedding), VoxCPM2 2B TTS (~4.6GB, 48kHz)
  - ONNX backend (SPEECH_CORE_WITH_ONNX): Silero VAD v5, Parakeet TDT 0.6B, Kokoro 82M TTS, DeepFilterNet3 (speech enhancement)
- C API for FFI: vtable-based, designed for Swift/Kotlin consumption. No C++ ABI exposure to consuming languages.
- Android is a first-class target: explicit Android ABIs in the CMake build, NNAPI ONNX execution provider, Android example app (speech-android sibling repo).
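To make the "pure-virtual interface plus vtable-based C API" combination concrete, here is a minimal sketch of how such a bridge can work in general. Every name in it (SketchVad, sketch_vad_vtable, VtableVad, EnergyCtx, energy_process, run_chunk) is hypothetical and chosen for illustration; none of it is taken from speech-core's headers, and the real signatures may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical C++-side interface; NOT copied from speech-core's headers.
class SketchVad {
public:
    virtual ~SketchVad() = default;
    // Returns a speech score in [0, 1] for one chunk of mono float PCM.
    virtual float process(const float* samples, std::size_t count) = 0;
};

extern "C" {
// Hypothetical C-visible vtable: an opaque context plus plain function
// pointers, so no C++ ABI detail crosses the language boundary.
struct sketch_vad_vtable {
    void* ctx;
    float (*process)(void* ctx, const float* samples, std::size_t count);
    void (*destroy)(void* ctx);  // may be null when the caller owns ctx
};
}

// Adapter: lets the C++ core call a foreign backend through the interface.
class VtableVad final : public SketchVad {
public:
    explicit VtableVad(sketch_vad_vtable vt) : vt_(vt) {
        assert(vt_.process != nullptr);
    }
    ~VtableVad() override {
        if (vt_.destroy != nullptr) vt_.destroy(vt_.ctx);
    }
    VtableVad(const VtableVad&) = delete;
    VtableVad& operator=(const VtableVad&) = delete;

    float process(const float* samples, std::size_t count) override {
        return vt_.process(vt_.ctx, samples, count);
    }

private:
    sketch_vad_vtable vt_;
};

// Toy backend written in C style: thresholds the mean energy of the chunk.
struct EnergyCtx {
    float threshold;
    int calls;
};

static float energy_process(void* ctx, const float* samples, std::size_t count) {
    auto* c = static_cast<EnergyCtx*>(ctx);
    ++c->calls;
    if (count == 0) return 0.0f;
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) sum += samples[i] * samples[i];
    const float mean = sum / static_cast<float>(count);
    return mean >= c->threshold ? 1.0f : 0.0f;
}

// Stand-in for pipeline code: it only ever sees the abstract base class.
float run_chunk(SketchVad& vad, const float* samples, std::size_t count) {
    return vad.process(samples, count);
}
```

On the Kotlin side, a JNI shim would fill in the function pointers and keep a reference to the Kotlin object in ctx. That part is omitted because it depends on how speech-core's actual C API is laid out.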
Current state of our voice stack
| Capability | Current implementation | Approach |
| --- | --- | --- |
| STT | Vosk (default), Android Native, Sherpa-ONNX (Zipformer/SenseVoice/Whisper/Paraformer) | Reflection-based Sherpa integration, per-controller state management |
| TTS | Android TTS, Sherpa Piper, Sherpa Kokoro 82M | Reflection-based Sherpa integration |
| VAD | Sherpa-ONNX VAD, openWakeWord (3-stage ONNX) | ONNX Runtime via ort-java |
| Wake word | OnnxWakeWordDetector (3-stage openWakeWord pipeline) | ONNX Runtime, custom mel spectrogram |
| Pipeline | Manual threading + state per controller (VoiceInputController, VoiceOutputController interfaces) | No unified pipeline state machine |
| Diarization | None | — |
| Echo cancellation | None | — |
| Speech enhancement | None | — |
Value assessment
High-value matches
- Unified voice pipeline state machine — We currently hand-roll turn detection, interruption, and state transitions per controller. VoicePipeline provides a well-tested, thread-safe state machine with:
  - 5-state FSM (Idle→Listening→Transcribing→Thinking→Speaking)
  - Hysteresis-gated VAD turn detection via StreamingVAD
  - Deferred interruption with retroactive detection and recovery timeout
  - Eager STT (starts transcription before silence is confirmed, saves ~0.3s)
  - Force-split for long utterances (prevents unbounded memory growth)
  - Empty/low-confidence STT recovery
  - Conversation context with history trimming
- LiteRT-native STT models (Parakeet TDT 0.6B, Nemotron Streaming 0.6B) — Both run on the same libLiteRt runtime we already use for Gemma-4 and EmbeddingGemma. Parakeet INT8 (~595MB) is the candidate to compare against our Sherpa Zipformer. Nemotron offers true streaming (cache-aware RNN-T, not chunk-accumulate-and-retranscribe). Both have Qualcomm-specific variants.
- Silero VAD v5 (LiteRT) — 1.3MB model, 512-sample chunks @ 16kHz (32 ms per chunk), the same 16kHz sample rate openWakeWord expects. Could replace the Sherpa-ONNX VAD path with a smaller, same-runtime dependency.
- C API for Kotlin FFI — The vtable-based C API is designed for Kotlin/Swift consumption. Much cleaner than our current reflection-based Sherpa integration. Would allow VoicePipeline to be driven from Kotlin with minimal JNI boilerplate.
- Same LiteRT runtime — Everything speech-core does with LiteRT runs on libLiteRt, the same runtime we already ship for Gemma-4. Reduced APK bloat vs. running both ONNX Runtime (wake word, Sherpa) + LiteRT (LLM).
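The hysteresis-gated turn detection and the five-state FSM listed above can be sketched as follows. This is an illustration of the general technique under stated assumptions, not speech-core's implementation. The thresholds, chunk counts, signal names, and the immediate barge-in transition are all hypothetical, and speech-core's deferred interruption logic is deliberately left out.

```cpp
#include <cstddef>
#include <vector>

// Duration of one VAD chunk in milliseconds (512 samples @ 16 kHz -> 32 ms).
double chunk_ms(std::size_t samples, int sample_rate_hz) {
    return 1000.0 * static_cast<double>(samples) / sample_rate_hz;
}

enum class VadEvent { None, SpeechStart, SpeechEnd };

// Hypothetical tuning values, chosen only for illustration.
struct HysteresisConfig {
    float onset = 0.5f;           // probability needed to enter speech
    float offset = 0.35f;         // probability below which speech may end
    int min_speech_chunks = 3;    // ~96 ms of speech before SpeechStart
    int min_silence_chunks = 10;  // ~320 ms of silence before SpeechEnd
};

// Two thresholds plus minimum run lengths keep a noisy probability stream
// from toggling the speech state on every chunk.
class HysteresisGate {
public:
    explicit HysteresisGate(HysteresisConfig cfg = {}) : cfg_(cfg) {}

    VadEvent push(float prob) {
        if (!in_speech_) {
            run_ = (prob >= cfg_.onset) ? run_ + 1 : 0;
            if (run_ >= cfg_.min_speech_chunks) {
                in_speech_ = true;
                run_ = 0;
                return VadEvent::SpeechStart;
            }
        } else {
            run_ = (prob < cfg_.offset) ? run_ + 1 : 0;
            if (run_ >= cfg_.min_silence_chunks) {
                in_speech_ = false;
                run_ = 0;
                return VadEvent::SpeechEnd;
            }
        }
        return VadEvent::None;
    }

    bool in_speech() const { return in_speech_; }

private:
    HysteresisConfig cfg_;
    bool in_speech_ = false;
    int run_ = 0;  // consecutive chunks supporting a state change
};

enum class State { Idle, Listening, Transcribing, Thinking, Speaking };
enum class Signal { SpeechStart, SpeechEnd, TranscriptReady, ReplyReady, PlaybackDone };

// Transition table for the five states. Signals that do not apply to the
// current state are ignored, so the state is returned unchanged.
State next_state(State s, Signal sig) {
    switch (s) {
        case State::Idle:
            if (sig == Signal::SpeechStart) return State::Listening;
            break;
        case State::Listening:
            if (sig == Signal::SpeechEnd) return State::Transcribing;
            break;
        case State::Transcribing:
            if (sig == Signal::TranscriptReady) return State::Thinking;
            break;
        case State::Thinking:
            if (sig == Signal::ReplyReady) return State::Speaking;
            break;
        case State::Speaking:
            if (sig == Signal::PlaybackDone) return State::Idle;
            // Simplified barge-in: treated as immediate in this sketch.
            if (sig == Signal::SpeechStart) return State::Listening;
            break;
    }
    return s;
}
```

Keeping the gate separate from the transition function means either part can be unit-tested with plain vectors of probabilities or signals, without audio hardware or a model in the loop.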
Medium-value matches
- Speaker diarization — Completely new capability. Pyannote Segmentation + WeSpeaker Embedding + agglomerative clustering. Could enable multi-speaker meeting transcription with speaker labels.
- Nemotron streaming STT — True streaming with cache-aware partial results, unlike Sherpa's chunk-accumulate approach. Enables real-time transcription UX during speech, not just after silence.
- Echo cancellation interface — EchoCancellerInterface with feed_reference(TTS output) + cancel_echo(mic input). Our current push-to-talk has no AEC. Essential for barge-in voice mode.
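As a rough illustration of the feed_reference / cancel_echo call pattern, the sketch below puts a textbook NLMS adaptive filter behind an interface of that shape. Only the two method names come from the notes above. The parameter lists, the class names, the tap count, and the choice of NLMS are assumptions made for this example and say nothing about what a speech-core backend actually does.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical interface shape; only the two method names are from the notes.
class SketchEchoCanceller {
public:
    virtual ~SketchEchoCanceller() = default;
    virtual void feed_reference(const float* samples, std::size_t count) = 0;
    virtual void cancel_echo(const float* mic, float* out, std::size_t count) = 0;
};

// Normalized LMS: learns how the reference (TTS output) leaks into the mic
// signal and subtracts the estimated echo sample by sample.
class NlmsEchoCanceller final : public SketchEchoCanceller {
public:
    explicit NlmsEchoCanceller(std::size_t taps = 8, float mu = 0.5f)
        : weights_(std::max<std::size_t>(taps, 1), 0.0f),
          history_(std::max<std::size_t>(taps, 1), 0.0f),
          mu_(mu) {}

    void feed_reference(const float* samples, std::size_t count) override {
        pending_.insert(pending_.end(), samples, samples + count);
    }

    void cancel_echo(const float* mic, float* out, std::size_t count) override {
        std::size_t read = 0;
        for (std::size_t n = 0; n < count; ++n) {
            // Shift the reference history; the newest sample sits at index 0.
            for (std::size_t k = history_.size() - 1; k > 0; --k)
                history_[k] = history_[k - 1];
            history_[0] = (read < pending_.size()) ? pending_[read++] : 0.0f;

            float estimate = 0.0f;
            float energy = 1e-6f;  // regularizer avoids division by zero
            for (std::size_t k = 0; k < history_.size(); ++k) {
                estimate += weights_[k] * history_[k];
                energy += history_[k] * history_[k];
            }
            const float err = mic[n] - estimate;  // echo-reduced output sample
            const float step = mu_ * err / energy;
            for (std::size_t k = 0; k < history_.size(); ++k)
                weights_[k] += step * history_[k];
            out[n] = err;
        }
        // Drop the reference samples consumed by this call.
        pending_.erase(pending_.begin(),
                       pending_.begin() + static_cast<std::ptrdiff_t>(read));
    }

private:
    std::vector<float> weights_, history_, pending_;
    float mu_;
};

// Synthetic check: the mic hears only a delayed, attenuated copy of the
// reference. Returns residual energy / mic energy over the final chunk.
double demo_residual_ratio() {
    const std::size_t total = 4096, chunk = 512;
    std::vector<float> ref(total), mic(total), out(total);
    std::uint32_t s = 12345u;
    for (std::size_t n = 0; n < total; ++n) {
        s = s * 1664525u + 1013904223u;  // simple LCG noise source
        ref[n] = static_cast<float>(s >> 8) / 16777216.0f * 2.0f - 1.0f;
        mic[n] = (n >= 2) ? 0.6f * ref[n - 2] : 0.0f;
    }
    NlmsEchoCanceller aec;
    for (std::size_t off = 0; off < total; off += chunk) {
        aec.feed_reference(&ref[off], chunk);
        aec.cancel_echo(&mic[off], &out[off], chunk);
    }
    double mic_e = 0.0, out_e = 0.0;
    for (std::size_t n = total - chunk; n < total; ++n) {
        mic_e += static_cast<double>(mic[n]) * mic[n];
        out_e += static_cast<double>(out[n]) * out[n];
    }
    return out_e / mic_e;
}
```

A real device echo path has far more delay and reverberation than this two-sample toy, so a production canceller needs many more taps plus double-talk handling. The sketch only shows the order of calls a barge-in pipeline would make.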
Lower-value / caution areas
- VoxCPM2 TTS (4.6GB) — Too large for mobile today. Slow on CPU, no GPU/NPU delegate wired yet. Our current Kokoro 82M (~130MB) is more practical. Would only make sense after Qualcomm HTP/NPU delegate support lands.
- Build complexity — Adding a C++ static library to an Android Kotlin project adds NDK/CMake/JNI build complexity. The speech-android sibling repo (mentioned in speech-core's AGENTS.md but not yet public) would mitigate this if it ships a prebuilt AAR.
- Project maturity — 25 stars, 1 fork, relatively new. But the architecture, documentation, and test coverage are exceptional for a project at this stage. Sibling ecosystem (speech-swift, speech-models on HF, speech-cloud) suggests active development.
Key risks
- NDK/JNI integration overhead — speech-core is C++, so we'd need JNI wrappers around the C API. The speech-android sibling repo (not yet public) reportedly provides this.
- Model download size — Parakeet INT8 (~595MB) is significant but ≤ our current Gemma-4 model sizes. Nemotron streaming is similar. VoxCPM2 (4.6GB) is a non-starter for now.
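- Competing runtimes — We already ship both ONNX Runtime (wake word, Sherpa) and LiteRT (Gemma-4), so adopting speech-core's LiteRT models adds no new runtime. Moving VAD to LiteRT Silero would take ONNX Runtime out of the VAD path, but Silero is a VAD, not a wake word detector: the openWakeWord pipeline would still need ONNX Runtime unless it is ported as well.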
- Sherpa investment — We have significant existing investment in Sherpa-ONNX (Zipformer, SenseVoice, Whisper, Paraformer, Piper, Kokoro). Migrating would mean replacing or maintaining both paths.
- VoxCPM2 CPU perf — Docs note it's "slow on CPU" with no GPU delegate wired. Would need Qualcomm HTP/NPU support for acceptable latency.
- speech-android not public — The Android JNI bridge/sibling repo isn't publicly available yet, so the integration story is incomplete.
Recommendations
- Track speech-core as a medium-term candidate — Not an immediate replacement, but the strongest candidate for a unified voice pipeline when we're ready to move beyond per-controller state management.
- Wait for speech-android to go public — The sibling Android SDK will make or break the integration value proposition. A prebuilt AAR with JNI wrappers changes the calculus significantly.
- Evaluate Parakeet TDT vs. Sherpa Zipformer — If speech-core's Parakeet INT8 offers materially better accuracy at comparable size, it's worth a spike.
- Consider Silero VAD v5 (LiteRT) — Small enough (1.3MB) to spike now. Could replace Sherpa VAD with a same-runtime dependency and free us from ONNX Runtime for VAD.
- Pipeline architecture is the real prize — Even if we don't use speech-core's models, the VoicePipeline state machine design is worth studying. Our current per-controller state management is fragile compared to a proven FSM with interruption recovery, eager STT, and empty-STT recovery.
- Kokoro 82M parity — Both projects use Kokoro 82M. If speech-core's ONNX Kokoro implementation has better phonemization or output quality than Sherpa's, it's an easy win.
Related issues/PRs
- docs/research/stt-engine-comparison.md
Links