
Assess soniqo/speech-core — C++ voice-agent pipeline for Android integration value #1081

@lokhor

Source

https://github.com/soniqo/speech-core — C++17 voice-agent pipeline engine, Apache 2.0

What speech-core provides

speech-core is a C++17 orchestration core with LiteRT/ONNX model backends for on-device voice. It has three layers (items 1 to 3 below), plus a C API and Android build support (items 4 and 5):

  1. Orchestration core (zero ML deps): VoicePipeline state machine with turn detection, interruption handling, speech queue, conversation context, streaming VAD hysteresis. Covers the full audio → VAD → STT → LLM → TTS loop.
  2. Abstract interfaces (STTInterface, TTSInterface, VADInterface, EnhancerInterface, EchoCancellerInterface, LLMInterface, plus diarization interfaces): pure-virtual C++ classes. Consumers plug in any backend.
  3. Reference model implementations (opt-in per backend):
    • LiteRT backend (SPEECH_CORE_WITH_LITERT): Silero VAD v5, Parakeet TDT 0.6B INT8 (~595MB), Nemotron Streaming STT 0.6B (true cache-aware streaming), Omnilingual ASR CTC 300M (multilingual), Pyannote Segmentation 3.0 (diarization), WeSpeaker ResNet34-LM (speaker embedding), VoxCPM2 2B TTS (~4.6GB, 48kHz)
    • ONNX backend (SPEECH_CORE_WITH_ONNX): Silero VAD v5, Parakeet TDT 0.6B, Kokoro 82M TTS, DeepFilterNet3 (speech enhancement)
  4. C API for FFI: vtable-based, designed for Swift/Kotlin consumption. No C++ ABI exposure to consuming languages.
  5. Android is a first-class target: explicit Android ABIs in CMake build, NNAPI ONNX execution provider, Android example app (speech-android sibling repo).
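
The split between pure-virtual interfaces (item 2) and a vtable-based C API (item 4) can be pictured roughly as follows. This is a minimal sketch only: `stt_vtable`, `VtableSTT`, `fake_transcribe`, and the `transcribe` signature are hypothetical names invented for illustration and are not taken from speech-core's headers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a pure-virtual backend interface.
// speech-core's real STTInterface may look different.
class STTInterface {
public:
    virtual ~STTInterface() = default;
    virtual std::string transcribe(const std::vector<float>& pcm, int sample_rate) = 0;
};

// Hypothetical C-side vtable: plain function pointers plus an opaque context,
// so a Kotlin/JNI or Swift caller never touches the C++ ABI.
extern "C" {
struct stt_vtable {
    void* ctx;
    // Writes a NUL-terminated transcript into out (capacity out_cap); returns its length.
    size_t (*transcribe)(void* ctx, const float* pcm, size_t n, int sample_rate,
                         char* out, size_t out_cap);
};
}

// Adapter: lets C++ pipeline code treat a C vtable as an STTInterface.
class VtableSTT final : public STTInterface {
public:
    explicit VtableSTT(stt_vtable vt) : vt_(vt) {}
    std::string transcribe(const std::vector<float>& pcm, int sample_rate) override {
        char buf[256] = {0};
        size_t len = vt_.transcribe(vt_.ctx, pcm.data(), pcm.size(), sample_rate,
                                    buf, sizeof buf);
        if (len >= sizeof buf) len = sizeof buf - 1;  // guard against a misbehaving backend
        return std::string(buf, len);
    }
private:
    stt_vtable vt_;
};

// Toy backend for illustration only: reports how many samples it received.
extern "C" size_t fake_transcribe(void*, const float*, size_t n, int,
                                  char* out, size_t out_cap) {
    if (out_cap == 0) return 0;
    std::string s = "samples=" + std::to_string(n);
    size_t len = s.size() < out_cap - 1 ? s.size() : out_cap - 1;
    s.copy(out, len);
    out[len] = '\0';
    return len;
}

// Usage check: route one call through the adapter.
inline void example_vtable_roundtrip() {
    VtableSTT stt(stt_vtable{nullptr, &fake_transcribe});
    std::string text = stt.transcribe(std::vector<float>(512, 0.0f), 16000);
    assert(text == "samples=512");
}
```

On Android, the function pointers in such a struct would typically be filled in by a thin JNI layer, which is the part the speech-android repo is expected to cover.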

Current state of our voice stack

| Capability | Current implementation | Approach |
| --- | --- | --- |
| STT | Vosk (default), Android Native, Sherpa-ONNX (Zipformer/SenseVoice/Whisper/Paraformer) | Reflection-based Sherpa integration, per-controller state management |
| TTS | Android TTS, Sherpa Piper, Sherpa Kokoro 82M | Reflection-based Sherpa integration |
| VAD | Sherpa-ONNX VAD, openWakeWord (3-stage ONNX) | ONNX Runtime via ort-java |
| Wake word | OnnxWakeWordDetector (3-stage openWakeWord pipeline) | ONNX Runtime, custom mel spectrogram |
| Pipeline | Manual threading + state per controller (VoiceInputController, VoiceOutputController interfaces) | No unified pipeline state machine |
| Diarization | None | |
| Echo cancellation | None | |
| Speech enhancement | None | |

Value assessment

High-value matches

  1. Unified voice pipeline state machine — We currently hand-roll turn detection, interruption, and state transitions per controller. VoicePipeline provides a thread-safe state machine with:

    • 5-state FSM (Idle→Listening→Transcribing→Thinking→Speaking)
    • Hysteresis-gated VAD turn detection via StreamingVAD
    • Deferred interruption with retroactive detection and recovery timeout
    • Eager STT (start transcription before silence confirms, saves ~0.3s)
    • Force-split for long utterances (prevents unbounded memory)
    • Empty/low-confidence STT recovery
    • Conversation context with history trimming
  2. LiteRT-native STT models (Parakeet TDT 0.6B, Nemotron Streaming 0.6B) — Both run on the same libLiteRt runtime we already use for Gemma-4 and EmbeddingGemma. Parakeet INT8 is ~595MB, which still needs a side-by-side size and accuracy comparison against our Sherpa Zipformer. Nemotron offers true streaming (cache-aware RNN-T) rather than chunk-accumulate-and-retranscribe. Both have Qualcomm-specific variants.

  3. Silero VAD v5 (LiteRT) — 1.3MB model, 512-sample chunks @ 16kHz, matching openWakeWord audio parameters. Could replace the Sherpa-ONNX VAD path with a smaller, same-runtime dependency.

  4. C API for Kotlin FFI — The vtable-based C API is designed for Kotlin/Swift consumption. Much cleaner than our current reflection-based Sherpa integration. Would allow VoicePipeline to be driven from Kotlin with minimal JNI boilerplate.

  5. Same LiteRT runtime — speech-core's LiteRT backend runs on libLiteRt, the same runtime we already ship for Gemma-4. Moving voice models onto it would reduce APK size compared with shipping both ONNX Runtime (wake word, Sherpa) and LiteRT (LLM), provided the ONNX Runtime paths are retired.
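
To make the 5-state FSM from item 1 concrete, here is a minimal sketch of how such a state machine could be structured. The state names come from the list above; the `Event` names, the `PipelineFsm` class, and the transition rules are assumptions made for illustration. In particular, it models barge-in as an immediate transition, whereas speech-core is described as using deferred interruption with a recovery timeout.

```cpp
#include <cassert>
#include <mutex>

// States as listed in the issue.
enum class State { Idle, Listening, Transcribing, Thinking, Speaking };

// Hypothetical event names, chosen for illustration.
enum class Event {
    SpeechStart, SpeechEnd, TranscriptReady, EmptyTranscript, ResponseReady, PlaybackDone
};

class PipelineFsm {
public:
    State state() const {
        std::lock_guard<std::mutex> lock(mu_);
        return state_;
    }

    // Applies an event; returns true if it caused a transition.
    // Events that are not valid in the current state are ignored.
    bool on_event(Event e) {
        std::lock_guard<std::mutex> lock(mu_);
        State next = state_;
        switch (state_) {
            case State::Idle:
                if (e == Event::SpeechStart) next = State::Listening;
                break;
            case State::Listening:
                if (e == Event::SpeechEnd) next = State::Transcribing;
                break;
            case State::Transcribing:
                if (e == Event::TranscriptReady) next = State::Thinking;
                else if (e == Event::EmptyTranscript) next = State::Idle;  // empty-STT recovery
                break;
            case State::Thinking:
                if (e == Event::ResponseReady) next = State::Speaking;
                break;
            case State::Speaking:
                if (e == Event::PlaybackDone) next = State::Idle;
                else if (e == Event::SpeechStart) next = State::Listening;  // simplified barge-in
                break;
        }
        bool changed = (next != state_);
        state_ = next;
        return changed;
    }

private:
    mutable std::mutex mu_;  // one lock guards every read and write of state_
    State state_ = State::Idle;
};

// Usage check: one full turn, Idle back to Idle.
inline void example_full_turn() {
    PipelineFsm fsm;
    fsm.on_event(Event::SpeechStart);
    fsm.on_event(Event::SpeechEnd);
    fsm.on_event(Event::TranscriptReady);
    fsm.on_event(Event::ResponseReady);
    fsm.on_event(Event::PlaybackDone);
    assert(fsm.state() == State::Idle);
}
```

Keeping every transition in one guarded function is the main contrast with our per-controller approach, where the equivalent rules are spread across VoiceInputController and VoiceOutputController implementations.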

Medium-value matches

  1. Speaker diarization — Completely new capability. Pyannote Segmentation + WeSpeaker Embedding + agglomerative clustering. Could enable multi-speaker meeting transcription with speaker labels.

  2. Nemotron streaming STT — True streaming with cache-aware partial results, unlike Sherpa's chunk-accumulate approach. Enables real-time transcription UX during speech, not just after silence.

  3. Echo cancellation interface — EchoCancellerInterface with feed_reference(TTS output) + cancel_echo(mic input). Our current push-to-talk has no AEC. Essential for barge-in voice mode.
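
The data flow behind the echo cancellation interface in item 3 can be sketched as below. The method names `feed_reference` and `cancel_echo` come from the description above; their signatures and the `NaiveSubtractingCanceller` class are assumptions. The toy implementation subtracts a fixed-gain, zero-delay copy of the reference, which is not how a production AEC works (those use adaptive filters) and is only meant to show which audio goes where.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Method names are from the issue text; the signatures are assumptions.
class EchoCancellerInterface {
public:
    virtual ~EchoCancellerInterface() = default;
    // Audio we are about to play through the speaker (TTS output).
    virtual void feed_reference(const std::vector<float>& tts_pcm) = 0;
    // Microphone audio in, echo-reduced audio out.
    virtual std::vector<float> cancel_echo(const std::vector<float>& mic_pcm) = 0;
};

// Toy implementation: assumes the echo equals the reference scaled by a fixed
// gain with zero delay. Real cancellers estimate both adaptively.
class NaiveSubtractingCanceller final : public EchoCancellerInterface {
public:
    explicit NaiveSubtractingCanceller(float echo_gain) : gain_(echo_gain) {}

    void feed_reference(const std::vector<float>& tts_pcm) override {
        ref_.insert(ref_.end(), tts_pcm.begin(), tts_pcm.end());
    }

    std::vector<float> cancel_echo(const std::vector<float>& mic_pcm) override {
        std::vector<float> out(mic_pcm.size());
        for (std::size_t i = 0; i < mic_pcm.size(); ++i) {
            float ref = 0.0f;  // no reference queued means nothing to subtract
            if (!ref_.empty()) {
                ref = ref_.front();
                ref_.pop_front();
            }
            out[i] = mic_pcm[i] - gain_ * ref;
        }
        return out;
    }

private:
    float gain_;
    std::deque<float> ref_;  // reference samples not yet matched against mic input
};

// Usage check: mic = user speech (0.25) plus half-gain echo of the reference.
inline void example_cancel() {
    NaiveSubtractingCanceller aec(0.5f);
    aec.feed_reference({1.0f, -1.0f});
    std::vector<float> clean = aec.cancel_echo({0.75f, -0.25f});
    assert(clean[0] == 0.25f && clean[1] == 0.25f);
}
```

For barge-in, the point is that the VAD would see the output of `cancel_echo` rather than the raw microphone signal, so the assistant's own TTS playback does not trigger an interruption.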

Lower-value / caution areas

  1. VoxCPM2 TTS (4.6GB) — Too large for mobile today. Slow on CPU, no GPU/NPU delegate wired yet. Our current Kokoro 82M (~130MB) is more practical. Would only make sense after Qualcomm HTP/NPU delegate support lands.

  2. Build complexity — Adding a C++ static library to an Android Kotlin project adds NDK/CMake/JNI build complexity. The speech-android sibling repo (mentioned in speech-core's AGENTS.md but not yet public) would mitigate this if it ships a prebuilt AAR.

  3. Project maturity — 25 stars, 1 fork, relatively new. But the architecture, documentation, and test coverage are exceptional for a project at this stage. Sibling ecosystem (speech-swift, speech-models on HF, speech-cloud) suggests active development.

Key risks

  • NDK/JNI integration overhead — speech-core is C++, so we'd need JNI wrappers for the C API. The speech-android sibling repo (not yet public) is expected to provide these.
  • Model download size — Parakeet INT8 (~595MB) is significant but ≤ our current Gemma-4 model sizes. Nemotron streaming is similar. VoxCPM2 (4.6GB) is a non-starter for now.
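  • Competing runtimes — We already ship ONNX Runtime (wake word, Sherpa) alongside LiteRT (LLM). Adopting speech-core's LiteRT voice models keeps both runtimes in the APK for as long as any ONNX path remains. Replacing Sherpa VAD with LiteRT Silero VAD removes ONNX Runtime from the VAD path only; openWakeWord would also have to move off ONNX Runtime to get down to one runtime.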
  • Sherpa investment — We have significant existing investment in Sherpa-ONNX (Zipformer, SenseVoice, Whisper, Paraformer, Piper, Kokoro). Migrating would mean replacing or maintaining both paths.
  • VoxCPM2 CPU perf — Docs note it's "slow on CPU" with no GPU delegate wired. Would need Qualcomm HTP/NPU support for acceptable latency.
  • speech-android not public — The Android JNI bridge/sibling repo isn't publicly available yet, so the integration story is incomplete.

Recommendations

  1. Track speech-core as a medium-term candidate — Not an immediate replacement, but the strongest candidate for a unified voice pipeline when we're ready to move beyond per-controller state management.
  2. Wait for speech-android to go public — The sibling Android SDK will make or break the integration value proposition. A prebuilt AAR with JNI wrappers changes the calculus significantly.
  3. Evaluate Parakeet TDT vs. Sherpa Zipformer — If speech-core's Parakeet INT8 offers materially better accuracy at comparable size, it's worth a spike.
  4. Consider Silero VAD v5 (LiteRT) — Small enough (1.3MB) to spike now. Could replace Sherpa VAD with a same-runtime dependency and free us from ONNX Runtime for VAD.
  5. Pipeline architecture is the real prize — Even if we don't use speech-core's models, the VoicePipeline state machine design is worth studying. Our current per-controller state management is fragile compared to a proven FSM with interruption recovery, eager STT, and empty-STT recovery.
  6. Kokoro 82M parity — Both projects use Kokoro 82M. If speech-core's ONNX Kokoro implementation has better phonemization or output quality than Sherpa's, it's an easy win.
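
For the Silero VAD spike in recommendation 4, the hysteresis gating mentioned under the high-value matches could be prototyped along these lines. The chunk size and sample rate come from this issue (512 samples at 16kHz, i.e. 32 ms per chunk); the thresholds, the silence hangover, and the `HysteresisGate` class are illustrative assumptions, not speech-core's StreamingVAD defaults.

```cpp
#include <cassert>

// 512-sample chunks at 16 kHz (from the issue) give 32 ms per chunk.
constexpr int kSampleRate = 16000;
constexpr int kChunkSamples = 512;
constexpr int kChunkMs = kChunkSamples * 1000 / kSampleRate;
static_assert(kChunkMs == 32, "512 samples at 16 kHz is 32 ms");

// Illustrative values only; not speech-core defaults.
struct HysteresisConfig {
    float on_threshold = 0.5f;    // probability needed to open a turn
    float off_threshold = 0.35f;  // probability below which silence is counted
    int min_silence_ms = 320;     // silence required before closing the turn
};

enum class VadEvent { None, SpeechStart, SpeechEnd };

class HysteresisGate {
public:
    explicit HysteresisGate(HysteresisConfig cfg = {}) : cfg_(cfg) {}

    // Feed one per-chunk speech probability (e.g. the output of a VAD model).
    VadEvent push(float speech_prob) {
        if (!in_speech_) {
            if (speech_prob >= cfg_.on_threshold) {
                in_speech_ = true;
                silence_ms_ = 0;
                return VadEvent::SpeechStart;
            }
            return VadEvent::None;
        }
        if (speech_prob < cfg_.off_threshold) {
            silence_ms_ += kChunkMs;
            if (silence_ms_ >= cfg_.min_silence_ms) {
                in_speech_ = false;
                return VadEvent::SpeechEnd;
            }
        } else {
            silence_ms_ = 0;  // probability recovered, keep the turn open
        }
        return VadEvent::None;
    }

private:
    HysteresisConfig cfg_;
    bool in_speech_ = false;
    int silence_ms_ = 0;
};

// Usage check: a brief dip between the two thresholds does not end the turn.
inline void example_dip_is_ignored() {
    HysteresisGate gate;
    assert(gate.push(0.6f) == VadEvent::SpeechStart);
    assert(gate.push(0.4f) == VadEvent::None);
}
```

Using two thresholds instead of one avoids rapid start/stop flapping when the probability hovers around a single cut-off, which is the failure mode a per-controller implementation tends to patch with ad hoc timers.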
