Implement on-device voice processing infrastructure (STT/TTS) #68
Co-authored-by: itsnothuy <158990013+itsnothuy@users.noreply.github.com>
Pull Request Overview
This PR implements comprehensive on-device voice processing infrastructure for the Iris AI assistant, adding Speech-to-Text (STT), Text-to-Speech (TTS), and audio processing capabilities. The implementation provides complete interfaces, type systems, and placeholder implementations ready for native model integration (Whisper.cpp for STT, Piper for TTS). All processing remains on-device to maintain privacy.
Key Changes:
- Complete voice processing infrastructure with VAD, streaming recognition, and audio synthesis
- Android AudioRecord/AudioTrack integration for real-time audio capture and playback
- Comprehensive type system covering STT/TTS models, audio formats, and recognition results
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| docs/pages/voice-processing.md | Comprehensive documentation (478 lines) covering architecture, usage, and future work |
| core-multimodal/voice/VoiceTypes.kt | Type definitions for STT/TTS models, audio formats, and speech recognition results (273 lines) |
| core-multimodal/voice/VoiceInterfaces.kt | Interface definitions for SpeechToTextEngine, TextToSpeechEngine, and AudioProcessor |
| core-multimodal/voice/SpeechToTextEngineImpl.kt | STT implementation with streaming recognition, VAD, and device validation (443 lines) |
| core-multimodal/voice/TextToSpeechEngineImpl.kt | TTS implementation with synthesis, streaming, and playback (328 lines) |
| core-multimodal/voice/Mock*.kt | Mock implementations for testing (3 files, 207 lines total) |
| core-multimodal/audio/AudioProcessorImpl.kt | Audio capture/playback with preprocessing (AGC, noise reduction) (331 lines) |
| core-multimodal/audio/AudioTypes.kt | Audio data types and enums for streaming and file formats |
| core-multimodal/di/MultimodalModule.kt | Dependency injection bindings for voice components |
| core-multimodal/README.md | Updated documentation with voice processing usage examples (+263 lines) |
| common/models/DeviceProfile.kt | Added MICROPHONE hardware capability enum |
| common/error/IrisException.kt | Added VoiceException for voice processing errors |
| app/events/EventBus.kt | Added STT/TTS model load events |
Comments suppressed due to low confidence (1)
docs/pages/voice-processing.md:1
- The documentation references Section 8 for Voice Processing Engine, but Section 8 of architecture.md is actually 'Tool Calling & Android Integration'. The correct reference should be Section 7 'Multimodal (ASR & Vision) Architecture' which covers audio processing. Update to reference Section 7.
# Voice Processing & Speech Engine

    while (audioTrack?.playState == AudioTrack.PLAYSTATE_PLAYING && isPlaying) {
        Thread.sleep(100)
    }

Using Thread.sleep() in a suspend function defeats the purpose of coroutines. This blocks the thread instead of suspending cooperatively. Replace with delay(100) from kotlinx.coroutines, which properly suspends the coroutine without blocking the thread.

    return withContext(Dispatchers.Default) {
        try {
            // Calculate RMS energy
            val rms = sqrt(audioSamples.map { it * it }.average().toFloat())

The VAD calculation creates intermediate collections via map and average() on every audio chunk. For real-time processing, this adds unnecessary allocations. Consider using a manual loop to calculate RMS in-place: var sum = 0f; audioSamples.forEach { sum += it * it }; val rms = sqrt(sum / audioSamples.size)
Suggested change:

    - val rms = sqrt(audioSamples.map { it * it }.average().toFloat())
    + var sum = 0f
    + for (sample in audioSamples) {
    +     sum += sample * sample
    + }
    + val rms = sqrt(sum / audioSamples.size)

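The single-pass RMS calculation suggested here can be pulled into a small helper. The following is a minimal sketch, not the PR's code: the name `rmsEnergy`, the `Double` accumulator, and the choice to return 0 for an empty chunk are all assumptions made for illustration.

```kotlin
import kotlin.math.sqrt

/**
 * Root-mean-square energy of a chunk of normalized samples.
 * Accumulates in a single pass, so no intermediate list is allocated.
 * Returning 0 for an empty chunk is an assumption of this sketch.
 */
fun rmsEnergy(samples: FloatArray): Float {
    if (samples.isEmpty()) return 0f
    var sumOfSquares = 0.0 // Double accumulator limits rounding error on long chunks
    for (sample in samples) {
        sumOfSquares += sample.toDouble() * sample
    }
    return sqrt(sumOfSquares / samples.size).toFloat()
}
```

Guarding the empty case avoids the `NaN` that a plain `sum / size` would produce when a zero-length chunk arrives.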

    val samples = FloatArray(audioBytes.size / 2) { index ->
        buffer.getShort(index * 2) / 32768.0f

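The WAV file loader reads samples with the absolute overload buffer.getShort(index * 2). That overload takes a byte offset and does not move the buffer position, so index * 2 does address the index-th 16-bit sample, and the indexing itself is correct. The relative buffer.getShort(), which advances the position by two bytes per call, or buffer.asShortBuffer() are equivalent alternatives that avoid the manual index calculation. In either form the buffer must be set to little-endian byte order, since WAV PCM data is little-endian and ByteBuffer defaults to big-endian.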
Suggested change:

    - val samples = FloatArray(audioBytes.size / 2) { index ->
    -     buffer.getShort(index * 2) / 32768.0f
    + val samples = FloatArray(audioBytes.size / 2) {
    +     buffer.getShort() / 32768.0f

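For comparison, here is a hedged sketch of the `asShortBuffer()` route for 16-bit PCM conversion. The function name `pcm16ToFloat` is hypothetical, and the sketch assumes the input contains only raw sample bytes (no header) in little-endian order.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

/**
 * Converts 16-bit little-endian PCM bytes to floats in [-1.0, 1.0).
 * A trailing odd byte, if any, is ignored.
 */
fun pcm16ToFloat(audioBytes: ByteArray): FloatArray {
    val shorts = ByteBuffer.wrap(audioBytes)
        .order(ByteOrder.LITTLE_ENDIAN) // WAV PCM data is little-endian
        .asShortBuffer()
    // ShortBuffer indices count samples, not bytes, so no manual "* 2" is needed.
    return FloatArray(shorts.remaining()) { index -> shorts.get(index) / 32768.0f }
}
```

Because the short view indexes by sample, the byte-offset arithmetic disappears entirely, which removes the source of confusion discussed in this thread.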

    }

    private fun applyAutomaticGainControl(samples: FloatArray): FloatArray {
        val rms = sqrt(samples.map { it * it }.average().toFloat())

Similar to the VAD issue, the AGC calculation creates intermediate collections for every audio frame. Replace with: var sum = 0f; samples.forEach { sum += it * it }; val rms = sqrt(sum / samples.size) to avoid allocations in the real-time audio processing path.
Suggested change:

    - val rms = sqrt(samples.map { it * it }.average().toFloat())
    + var sum = 0f
    + samples.forEach { sum += it * it }
    + val rms = sqrt(sum / samples.size)


    silenceCount++

    // End of speech detection
    if (silenceCount >= config.endOfSpeechSilenceMs / VAD_WINDOW_MS) {

Integer division may cause incorrect endpoint detection timing. If endOfSpeechSilenceMs is not a multiple of VAD_WINDOW_MS, the calculation truncates. For example, 1500ms / 100ms = 15, but 1550ms / 100ms also = 15, losing 50ms precision. Consider using float division and rounding: silenceCount >= (config.endOfSpeechSilenceMs.toFloat() / VAD_WINDOW_MS).roundToInt()
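One way to make the window count explicit is a small helper that rounds up instead of truncating. This is a sketch under assumptions: the name `silenceWindowsRequired` is hypothetical, and it uses ceiling division rather than the `roundToInt()` proposed in the comment, so that the configured silence duration is never cut short.

```kotlin
/**
 * Number of consecutive silent VAD windows needed to cover at least
 * `silenceMs` of silence, rounding up to a whole window.
 */
fun silenceWindowsRequired(silenceMs: Int, windowMs: Int): Int {
    require(windowMs > 0) { "windowMs must be positive" }
    require(silenceMs >= 0) { "silenceMs must be non-negative" }
    return (silenceMs + windowMs - 1) / windowMs // integer ceiling division
}
```

With 100 ms windows, 1500 ms maps to 15 windows and 1550 ms maps to 16, so the endpoint fires slightly late rather than slightly early.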

    // TODO: Synthesize through native engine
    // For now, generate simple tone as placeholder
    val sampleRate = currentTTSModel!!.audioFormat.sampleRate
    val duration = text.length * 0.05 // ~50ms per character

Magic number 0.05 is used without explanation. Add a named constant such as private const val SECONDS_PER_CHARACTER = 0.05f (the value is in seconds, about 50 ms per character), with a comment explaining that this is a placeholder estimate until native TTS integration.
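A placeholder of this kind could look roughly like the sketch below. Everything here is assumed for illustration: the constant name, the `placeholderTone` function, the 440 Hz default, and the choice to keep the per-character estimate in integer milliseconds so the unit is unambiguous.

```kotlin
import kotlin.math.PI
import kotlin.math.sin

// Placeholder estimate: roughly 50 ms of audio per input character.
val PLACEHOLDER_MS_PER_CHARACTER = 50

/**
 * Generates a sine tone whose length scales with the text length.
 * Stands in for real synthesis; the 440 Hz default is an arbitrary choice.
 */
fun placeholderTone(text: String, sampleRate: Int, frequencyHz: Double = 440.0): FloatArray {
    require(sampleRate > 0) { "sampleRate must be positive" }
    val durationMs = text.length.toLong() * PLACEHOLDER_MS_PER_CHARACTER
    val sampleCount = (sampleRate * durationMs / 1000).toInt()
    return FloatArray(sampleCount) { i ->
        sin(2.0 * PI * frequencyHz * i / sampleRate).toFloat()
    }
}
```

Keeping the duration in integer milliseconds also makes the sample count exact, which is convenient when tests assert on buffer sizes.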

    val bytes = file.readBytes()

    // Skip WAV header (44 bytes)
    val dataStart = if (bytes.size > 44) 44 else 0

Magic number 44 represents WAV header size but lacks documentation. Add a constant: private const val WAV_HEADER_SIZE = 44 and add a comment explaining this assumes standard WAV format without extended chunks.
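If files with extra chunks (for example a `LIST` chunk before the audio) need to be supported, the fixed 44-byte offset can be replaced by a scan for the `data` chunk. The sketch below is an assumption-laden illustration, not the PR's loader: `findWavDataOffset` is a hypothetical name, and the function only locates the sample data without validating the `fmt ` chunk.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

/**
 * Returns the byte offset of the first sample in the "data" chunk of a RIFF/WAVE
 * file, or -1 if the bytes are not a WAVE file or contain no "data" chunk.
 */
fun findWavDataOffset(bytes: ByteArray): Int {
    fun tagAt(offset: Int) = String(bytes, offset, 4, Charsets.US_ASCII)

    if (bytes.size < 12 || tagAt(0) != "RIFF" || tagAt(8) != "WAVE") return -1
    val buffer = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    var offset = 12 // first chunk follows the 12-byte RIFF header
    while (offset + 8 <= bytes.size) {
        val chunkSize = buffer.getInt(offset + 4)
        if (tagAt(offset) == "data") return offset + 8
        if (chunkSize < 0) return -1 // sizes above 2 GiB are not handled in this sketch
        // Chunks are padded to an even length.
        val next = offset.toLong() + 8 + chunkSize + (chunkSize and 1)
        if (next > bytes.size) return -1
        offset = next.toInt()
    }
    return -1
}
```

For a canonical file with a 16-byte `fmt ` chunk followed directly by `data`, this returns 44, matching the constant discussed in the comment.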

    val recordingDuration = System.currentTimeMillis() - currentRecordingSession!!.startTime
    if (recordingDuration > MAX_RECORDING_DURATION_MS) {

[nitpick] Recording duration is calculated on every audio chunk (potentially every 100ms), creating unnecessary overhead. Consider calculating this check less frequently, perhaps only every 5-10 chunks, since the max duration is 60 seconds and overshooting by a few hundred milliseconds is acceptable.

    data class PartialTranscriptionResult(
        val text: String,
        val confidence: Float,
        val isFinal: Boolean
    )

The PartialTranscriptionResult data class is defined but never used in the implementation. The sealed class SpeechRecognitionResult.PartialTranscription is used instead. Consider removing this unused type to reduce confusion and maintain a single representation for partial transcriptions.
Suggested change:

    - data class PartialTranscriptionResult(
    -     val text: String,
    -     val confidence: Float,
    -     val isFinal: Boolean
    - )


    /**
     * Final transcription result (complete audio)
     */
    data class FinalTranscriptionResult(
        val text: String,
        val confidence: Float,
        val segments: List<TranscriptionSegment>
    )

The FinalTranscriptionResult data class appears redundant with the already-defined TranscriptionResult type (lines 108-114) which has the same fields plus duration and language. Consider consolidating these types or documenting why both are needed.
Suggested change:

    - /**
    -  * Final transcription result (complete audio)
    -  */
    - data class FinalTranscriptionResult(
    -     val text: String,
    -     val confidence: Float,
    -     val segments: List<TranscriptionSegment>
    - )
    + // FinalTranscriptionResult removed: use TranscriptionResult instead.

Summary
Adds complete voice processing infrastructure to core-multimodal: Speech-to-Text engine with VAD and streaming recognition, Text-to-Speech engine with synthesis and playback, and AudioProcessor with real-time capture/playback. All processing remains on-device. Native model integration points are prepared for Whisper.cpp and Piper.
2,623 lines added across 9 new Kotlin files, plus comprehensive documentation.
Type of Change
Core Components
AudioProcessor (331 lines)
SpeechToTextEngine (443 lines)
Flow<SpeechRecognitionResult>

    sttEngine.loadSTTModel(whisperModel).getOrThrow()
    sttEngine.startListening(config).collect { result ->
        when (result) {
            is PartialTranscription -> updateUI(result.text)
            is FinalTranscription -> processText(result.text)
        }
    }

TextToSpeechEngine (328 lines)
speak() method
Type System (273 lines)
- STTModelDescriptor, TTSModelDescriptor
- SpeechRecognitionResult variants (sealed class)
- AudioConfig, AudioFormat, AudioRequirements
Mock Implementations (207 lines)
Test infrastructure without hardware/models for all three engines.
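The usage snippet in this section branches on variants of a sealed result type. As a rough illustration of why a sealed class suits that pattern, the sketch below defines a hypothetical two-variant hierarchy. Only the type names and the `text` property appear in the PR description; the constructor shapes and the `describe` helper are assumptions, and the real type may have more variants and fields.

```kotlin
// Hypothetical shapes: only the type names and the `text` property appear in the PR.
sealed class SpeechRecognitionResult {
    data class PartialTranscription(val text: String) : SpeechRecognitionResult()
    data class FinalTranscription(val text: String) : SpeechRecognitionResult()
}

// `when` used as an expression must be exhaustive, so adding a new variant
// makes the compiler flag every handler that has not been updated.
fun describe(result: SpeechRecognitionResult): String = when (result) {
    is SpeechRecognitionResult.PartialTranscription -> "partial: ${result.text}"
    is SpeechRecognitionResult.FinalTranscription -> "final: ${result.text}"
}
```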
Architecture
Integrates with:
- core-hw: Device capability detection
- app: EventBus for model load events
- common: Error types (VoiceException), device profiles

Native integration ready via TODO-marked placeholders for Whisper.cpp (STT) and Piper (TTS).
Testing
Architecture Compliance
Code Quality
Security & Privacy
Privacy: All audio processing on-device. No external transmission. Audio buffers cleared post-processing.
Documentation
Screenshots / Notes
No UI changes. Infrastructure-only implementation. Placeholder implementations for native integration:
Performance Impact
Characteristics: 4x buffer multiplier for latency, efficient PCM conversion, memory-efficient streaming.
Follow-ups
Known Limitations: Requires native model integration for actual STT/TTS. Current implementations use placeholders with TODO markers.
Files Added: 9 Kotlin files (audio/, voice/), 1 doc page
Files Modified: EventBus, IrisException, DeviceProfile, MultimodalModule, README
Module: core-multimodal
Dependencies: None (uses existing Android APIs)
Original prompt
This section details the original issue you should resolve
<issue_title>Issue #8: Voice Processing & Speech Engine</issue_title>
<issue_description>### Scope / page(s)
🎯 Epic: Voice AI Capabilities
Priority: P2 (Medium)
Estimate: 10-12 days
Dependencies: #37 (Core Architecture), #51 (Chat Engine), #57 (Multimodal Support)
Architecture Reference: docs/architecture.md - Section 8 Voice Processing Engine
📋 Overview
Implement comprehensive voice processing capabilities including Speech-to-Text (STT), Text-to-Speech (TTS), and voice activity detection. This system enables hands-free interaction with the AI assistant through natural speech input and audio response output.
🎯 Goals
📝 Detailed Tasks
1. Speech-to-Text Engine
1.1 STT Engine Implementation
Create core-voice/src/main/kotlin/SpeechToTextEngine.kt: