Assess soniqo/speech-core — C++ voice-agent pipeline for Android integration value

## Source

https://github.com/soniqo/speech-core — C++17 voice-agent pipeline engine, Apache 2.0

## What speech-core provides

speech-core is a **C++17 orchestration core + LiteRT/ONNX model backends** for on-device voice. It has three layers (items 1-3 below), plus a C API and Android build support (items 4-5):

1. **Orchestration core** (zero ML deps): `VoicePipeline` state machine with turn detection, interruption handling, speech queue, conversation context, streaming VAD hysteresis. Covers the full `audio → VAD → STT → LLM → TTS` loop.
2. **Abstract interfaces** (`STTInterface`, `TTSInterface`, `VADInterface`, `EnhancerInterface`, `EchoCancellerInterface`, `LLMInterface`, plus diarization interfaces): pure-virtual C++ classes. Consumers plug in any backend.
3. **Reference model implementations** (opt-in per backend):
   - **LiteRT backend** (`SPEECH_CORE_WITH_LITERT`): Silero VAD v5, Parakeet TDT 0.6B INT8 (~595MB), Nemotron Streaming STT 0.6B (true cache-aware streaming), Omnilingual ASR CTC 300M (multilingual), Pyannote Segmentation 3.0 (diarization), WeSpeaker ResNet34-LM (speaker embedding), VoxCPM2 2B TTS (~4.6GB, 48kHz)
   - **ONNX backend** (`SPEECH_CORE_WITH_ONNX`): Silero VAD v5, Parakeet TDT 0.6B, Kokoro 82M TTS, DeepFilterNet3 (speech enhancement)
4. **C API** for FFI: vtable-based, designed for Swift/Kotlin consumption. No C++ ABI exposure to consuming languages.
5. **Android is a first-class target**: explicit Android ABIs in CMake build, NNAPI ONNX execution provider, Android example app (`speech-android` sibling repo).
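
To make the interface and C API layers concrete, here is a minimal sketch of the general pattern: a C-compatible vtable (function pointers plus an opaque context pointer) adapted into a pure-virtual C++ interface. Every name in it (`demo_stt_vtable`, `SttInterface`, `VtableStt`, `fake_transcribe`) is hypothetical and chosen for illustration. It is not speech-core's actual header, whose signatures have not been verified here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical C-compatible vtable: plain function pointers plus an opaque
// context pointer, so no C++ types cross the FFI boundary.
extern "C" {
typedef struct {
    void* user_data;
    // Writes a NUL-terminated transcript into `out`; returns 0 on success.
    int (*transcribe)(void* user_data, const float* samples, size_t count,
                      char* out, size_t out_cap);
} demo_stt_vtable;
}

// The C++ core depends only on an abstract interface (assumed shape).
class SttInterface {
public:
    virtual ~SttInterface() = default;
    virtual bool transcribe(const float* samples, size_t count,
                            char* out, size_t out_cap) = 0;
};

// Adapter: wraps a foreign vtable so it can be plugged in as a C++ backend.
class VtableStt final : public SttInterface {
public:
    explicit VtableStt(demo_stt_vtable vt) : vt_(vt) {}
    bool transcribe(const float* samples, size_t count,
                    char* out, size_t out_cap) override {
        return vt_.transcribe != nullptr &&
               vt_.transcribe(vt_.user_data, samples, count, out, out_cap) == 0;
    }
private:
    demo_stt_vtable vt_;
};

// Stand-in for a callback that Kotlin (through JNI) or Swift would supply.
extern "C" int fake_transcribe(void* user_data, const float*, size_t count,
                               char* out, size_t out_cap) {
    ++*static_cast<int*>(user_data);  // count invocations in the caller's context
    int n = std::snprintf(out, out_cap, "%zu samples", count);
    return (n >= 0 && static_cast<size_t>(n) < out_cap) ? 0 : -1;
}

// Usage: build the vtable, wrap it, and call through the C++ interface.
inline bool demo_roundtrip() {
    int calls = 0;
    VtableStt stt(demo_stt_vtable{&calls, &fake_transcribe});
    const float pcm[4] = {0.f, 0.f, 0.f, 0.f};
    char text[32];
    bool ok = stt.transcribe(pcm, 4, text, sizeof text);
    assert(calls == 1);
    return ok && std::strcmp(text, "4 samples") == 0;
}
```

The appeal for us is that the Kotlin side would only need to fill in function pointers through a thin JNI shim, rather than mirror C++ class layouts.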

## Current state of our voice stack

| Capability | Current implementation | Approach |
|---|---|---|
| **STT** | Vosk (default), Android Native, Sherpa-ONNX (Zipformer/SenseVoice/Whisper/Paraformer) | Reflection-based Sherpa integration, per-controller state management |
| **TTS** | Android TTS, Sherpa Piper, Sherpa Kokoro 82M | Reflection-based Sherpa integration |
| **VAD** | Sherpa-ONNX VAD, openWakeWord (3-stage ONNX) | ONNX Runtime via ort-java |
| **Wake word** | OnnxWakeWordDetector (3-stage openWakeWord pipeline) | ONNX Runtime, custom mel spectrogram |
| **Pipeline** | Manual threading + state per controller (`VoiceInputController`, `VoiceOutputController` interfaces) | No unified pipeline state machine |
| **Diarization** | None | — |
| **Echo cancellation** | None | — |
| **Speech enhancement** | None | — |

## Value assessment

### High-value matches

1. **Unified voice pipeline state machine** — We currently hand-roll turn detection, interruption, and state transitions per controller. `VoicePipeline` provides a thread-safe, test-covered state machine with:
   - 5-state FSM (Idle→Listening→Transcribing→Thinking→Speaking)
   - Hysteresis-gated VAD turn detection via `StreamingVAD`
   - Deferred interruption with retroactive detection and recovery timeout
   - Eager STT (start transcription before silence confirms, saves ~0.3s)
   - Force-split for long utterances (prevents unbounded memory)
   - Empty/low-confidence STT recovery
   - Conversation context with history trimming

2. **LiteRT-native STT models (Parakeet TDT 0.6B, Nemotron Streaming 0.6B)** — Both run on the same `libLiteRt` runtime we already use for Gemma-4 and EmbeddingGemma. Parakeet INT8 is ~595MB; how it compares with our Sherpa Zipformer on size and accuracy is still open (see recommendation 3). Nemotron offers true streaming (cache-aware RNN-T, not chunk-accumulate-and-retranscribe). Both have Qualcomm-specific variants.

3. **Silero VAD v5 (LiteRT)** — 1.3MB model, 512-sample chunks @ 16kHz (32ms per chunk), the same sample rate our openWakeWord path consumes. Could replace the Sherpa-ONNX VAD path with a smaller, same-runtime dependency.

4. **C API for Kotlin FFI** — The vtable-based C API is designed for Kotlin/Swift consumption. Much cleaner than our current reflection-based Sherpa integration. Would allow `VoicePipeline` to be driven from Kotlin with minimal JNI boilerplate.

5. **Same LiteRT runtime** — Everything speech-core does with LiteRT runs on `libLiteRt`, the same runtime we already ship for Gemma-4. Reduced APK bloat vs. running both ONNX Runtime (wake word, Sherpa) + LiteRT (LLM).
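
The hysteresis-gated turn detection from item 1, fed with the 512-sample chunks from item 3, can be sketched roughly as follows. This is an illustration of the general technique, not speech-core's `StreamingVAD`. The class name, the two thresholds, and the 10-chunk silence window are assumptions picked for the example.

```cpp
#include <cassert>

// Illustrative values only; none of these come from speech-core.
struct HysteresisConfig {
    float onset_threshold = 0.6f;    // probability needed to enter speech
    float offset_threshold = 0.35f;  // probability below which a chunk counts as quiet
    int min_silence_chunks = 10;     // consecutive quiet chunks that end a turn
};

enum class VadEvent { None, SpeechStart, SpeechEnd };

class HysteresisGate {
public:
    explicit HysteresisGate(HysteresisConfig cfg = HysteresisConfig{}) : cfg_(cfg) {}

    // Feed one speech probability per audio chunk.
    VadEvent push(float speech_prob) {
        if (!in_speech_) {
            if (speech_prob >= cfg_.onset_threshold) {
                in_speech_ = true;
                quiet_chunks_ = 0;
                return VadEvent::SpeechStart;
            }
            return VadEvent::None;
        }
        if (speech_prob < cfg_.offset_threshold) {
            if (++quiet_chunks_ >= cfg_.min_silence_chunks) {
                in_speech_ = false;
                quiet_chunks_ = 0;
                return VadEvent::SpeechEnd;
            }
        } else {
            quiet_chunks_ = 0;  // any non-quiet chunk resets the silence run
        }
        return VadEvent::None;
    }

    bool in_speech() const { return in_speech_; }

private:
    HysteresisConfig cfg_;
    bool in_speech_ = false;
    int quiet_chunks_ = 0;
};

// Duration of one chunk in milliseconds: 512 samples at 16 kHz is 32 ms.
constexpr double chunk_ms(int samples, int sample_rate_hz) {
    return 1000.0 * samples / sample_rate_hz;
}
```

Because the onset threshold sits above the offset threshold, a probability hovering between the two neither starts nor ends a turn, which is what suppresses flicker. With the assumed 10-chunk window, end-of-turn is confirmed after 10 × 32ms = 320ms of quiet.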

### Medium-value matches

6. **Speaker diarization** — Completely new capability. Pyannote Segmentation + WeSpeaker Embedding + agglomerative clustering. Could enable multi-speaker meeting transcription with speaker labels.

7. **Nemotron streaming STT** — True streaming with cache-aware partial results, unlike Sherpa's chunk-accumulate approach. Enables real-time transcription UX during speech, not just after silence.

8. **Echo cancellation interface** — `EchoCancellerInterface` with `feed_reference(TTS output)` + `cancel_echo(mic input)`. Our current push-to-talk has no AEC. Essential for barge-in voice mode.
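
As a rough picture of how the two echo-cancellation calls from item 8 would be wired, here is a sketch that reuses the method names `feed_reference` and `cancel_echo` from the text above. The signatures and the class names `EchoCancellerSketch` and `FixedGainCanceller` are assumptions. The backend is a deliberately naive toy (fixed gain, zero delay) meant only to show the data flow. A real canceller would use an adaptive filter.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Method names follow the issue text; the signatures are assumed.
class EchoCancellerSketch {
public:
    virtual ~EchoCancellerSketch() = default;
    // Called with the samples being sent to the speaker (TTS output).
    virtual void feed_reference(const float* tts_samples, std::size_t count) = 0;
    // Called with microphone samples; returns them with the echo estimate removed.
    virtual std::vector<float> cancel_echo(const float* mic_samples, std::size_t count) = 0;
};

// Toy backend: assumes the echo is the reference scaled by a fixed gain,
// perfectly time-aligned with the microphone signal.
class FixedGainCanceller final : public EchoCancellerSketch {
public:
    explicit FixedGainCanceller(float echo_gain) : echo_gain_(echo_gain) {}

    void feed_reference(const float* tts_samples, std::size_t count) override {
        reference_.insert(reference_.end(), tts_samples, tts_samples + count);
    }

    std::vector<float> cancel_echo(const float* mic_samples, std::size_t count) override {
        std::vector<float> out(mic_samples, mic_samples + count);
        // Subtract the scaled reference sample by sample, consuming it as we go.
        for (std::size_t i = 0; i < count && !reference_.empty(); ++i) {
            out[i] -= echo_gain_ * reference_.front();
            reference_.pop_front();
        }
        return out;
    }

private:
    float echo_gain_;
    std::deque<float> reference_;
};
```

The point for barge-in is the ordering: the TTS output has to reach the canceller before the matching microphone frames do, so any integration needs a tap on the playback path.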

### Lower-value / caution areas

9. **VoxCPM2 TTS (4.6GB)** — Too large for mobile today. Slow on CPU, no GPU/NPU delegate wired yet. Our current Kokoro 82M (~130MB) is more practical. Would only make sense after Qualcomm HTP/NPU delegate support lands.

10. **Build complexity** — Adding a C++ static library to an Android Kotlin project adds NDK/CMake/JNI build complexity. The `speech-android` sibling repo (mentioned in speech-core's AGENTS.md but not yet public) would mitigate this if it ships a prebuilt AAR.

11. **Project maturity** — 25 stars, 1 fork, relatively new. But the architecture, documentation, and test coverage are exceptional for a project at this stage. Sibling ecosystem (speech-swift, speech-models on HF, speech-cloud) suggests active development.

## Key risks

- **NDK/JNI integration overhead** — speech-core is C++, so we'd need JNI wrappers for the C API. The `speech-android` sibling repo (not yet public) reportedly provides this.
- **Model download size** — Parakeet INT8 (~595MB) is significant but ≤ our current Gemma-4 model sizes. Nemotron streaming is similar. VoxCPM2 (4.6GB) is a non-starter for now.
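- **Competing runtimes** — We already ship both ONNX Runtime (wake word, Sherpa) and LiteRT (Gemma-4, EmbeddingGemma). Adopting speech-core's LiteRT models adds no new runtime, but ONNX Runtime stays in the APK until the openWakeWord and Sherpa paths are migrated or dropped. Silero VAD v5 on LiteRT can replace the Sherpa-ONNX VAD, but it is not a wake-word detector.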
- **Sherpa investment** — We have significant existing investment in Sherpa-ONNX (Zipformer, SenseVoice, Whisper, Paraformer, Piper, Kokoro). Migrating would mean replacing or maintaining both paths.
- **VoxCPM2 CPU perf** — Docs note it's "slow on CPU" with no GPU delegate wired. Would need Qualcomm HTP/NPU support for acceptable latency.
- **`speech-android` not public** — The Android JNI bridge/sibling repo isn't publicly available yet, so the integration story is incomplete.

## Recommendations

1. **Track speech-core as a medium-term candidate** — Not an immediate replacement, but the strongest candidate for a unified voice pipeline when we're ready to move beyond per-controller state management.
2. **Wait for `speech-android` to go public** — The sibling Android SDK will make or break the integration value proposition. A prebuilt AAR with JNI wrappers changes the calculus significantly.
3. **Evaluate Parakeet TDT vs. Sherpa Zipformer** — If speech-core's Parakeet INT8 offers materially better accuracy at comparable size, it's worth a spike.
4. **Consider Silero VAD v5 (LiteRT)** — Small enough (1.3MB) to spike now. Could replace Sherpa VAD with a same-runtime dependency and free us from ONNX Runtime for VAD.
5. **Pipeline architecture is the real prize** — Even if we don't use speech-core's models, the `VoicePipeline` state machine design is worth studying. Our current per-controller state management is fragile compared to a proven FSM with interruption recovery, eager STT, and empty-STT recovery.
6. **Kokoro 82M parity** — Both projects use Kokoro 82M. If speech-core's ONNX Kokoro implementation has better phonemization or output quality than Sherpa's, it's an easy win.
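
As a starting point for the design study in recommendation 5, the 5-state flow can be written down as a small transition function. The state names come from the FSM listed under high-value matches. The event names and the exact transitions, including barge-in returning to Listening and an empty transcript returning to Idle, are simplifying assumptions. speech-core's deferred interruption and recovery timeout are not modeled.

```cpp
#include <cassert>
#include <initializer_list>

enum class State { Idle, Listening, Transcribing, Thinking, Speaking };

// Hypothetical event set, chosen for this sketch.
enum class Event {
    SpeechStart, SpeechEnd, TranscriptReady, EmptyTranscript,
    ResponseReady, PlaybackDone, BargeIn
};

// Returns the next state; events that do not apply leave the state unchanged.
constexpr State next_state(State s, Event e) {
    switch (s) {
        case State::Idle:
            if (e == Event::SpeechStart) return State::Listening;
            break;
        case State::Listening:
            if (e == Event::SpeechEnd) return State::Transcribing;
            break;
        case State::Transcribing:
            if (e == Event::TranscriptReady) return State::Thinking;
            if (e == Event::EmptyTranscript) return State::Idle;  // empty-STT recovery
            break;
        case State::Thinking:
            if (e == Event::ResponseReady) return State::Speaking;
            break;
        case State::Speaking:
            if (e == Event::PlaybackDone) return State::Idle;
            if (e == Event::BargeIn) return State::Listening;  // user interrupts TTS
            break;
    }
    return s;
}

// Applies a sequence of events in order and returns the final state.
inline State run(State start, std::initializer_list<Event> events) {
    State s = start;
    for (Event e : events) s = next_state(s, e);
    return s;
}

// Compile-time check of one transition.
static_assert(next_state(State::Idle, Event::SpeechStart) == State::Listening,
              "speech onset should move Idle to Listening");
```

Keeping every transition in one pure function makes the table easy to unit-test from either C++ or Kotlin, which is the property our per-controller state handling lacks today.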

## Related issues/PRs

- #678 — Android native STT alongside Vosk
- #700 — Parakeet CTC evaluation
- #703 — whisper.cpp vs Vosk evaluation
- STT research: `docs/research/stt-engine-comparison.md`

## Links

- speech-core: https://github.com/soniqo/speech-core
- speech-core models on HF: https://huggingface.co/soniqo
- Parakeet TDT 0.6B LiteRT INT8: https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8
- Nemotron Streaming LiteRT: https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT
- Silero VAD v5 LiteRT: https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT
