fix: repair documentation encoding and voice feature #130
AI Chat Mode Development Guide
AI Chat Mode (AI_CHAT) is a new voice interaction feature in InkSight that allows users to trigger voice conversations by pressing buttons on the device, enabling real-time voice communication with an AI assistant. This mode is designed for devices equipped with microphones and speakers, and leverages Alibaba's Bailian (百炼) platform for real-time ASR, streaming TTS, and LLM capabilities to deliver an end-to-end voice conversation experience.
1. Feature Overview
The core pipeline of AI Chat Mode is:
Key capabilities include:
- Real-time ASR: uses the qwen3-asr-flash-realtime model; audio streams are uploaded in real time via WebSocket, Chinese recognition is supported, and intermediate results (partial transcripts) can be returned during recognition
- Streaming TTS: uses the cosyvoice-v3-plus model; speech is synthesized in streaming mode, with playback while generating to reduce first-byte latency
Two Interaction Modes
AI Chat Mode supports two usage paths:
2. Hardware Wiring
AI Chat Mode is currently available only on the ESP32-WROOM-32E development board. Voice support for the ESP32-C3 series is on the roadmap, but the current firmware does not yet include it.
2.1 Recommended Audio Modules
2.2 Pin Mapping
The voice-related pins of ESP32-WROOM-32E are defined in the firmware as follows:
Microphone (INMP441)
Speaker (MAX98357A)
2.3 Voice Trigger Button
2.4 Wiring Diagram
2.5 Firmware Configuration Macros
In firmware/platformio.ini, ensure the build target corresponds to WROOM32E, for example by using the environment epd_42_wsv2_ssd1683_wroom32e.
If you want the device to enter AI Chat Mode automatically on power-up, define in platformio.ini:
Voice-related firmware parameters:
- SAMPLE_RATE
- ENABLE_OPUS
- AI_CHAT_BTN_HOLD_MS
2.6 Audio Parameters
3. Prerequisites
Hardware Requirements
Server Requirements
Bailian API Key Application
- qwen3-asr-flash-realtime
- cosyvoice-v3-plus
4. .env Configuration (For Development)
All voice-related parameters are configured in the backend .env file. The following explains each parameter's function, default value, and tuning suggestions, grouped by feature.
4.1 API Key Configuration
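The key fallback this subsection describes can be sketched in Python as below. This is a minimal illustration, not the backend's actual code: the function name `resolve_voice_keys` is hypothetical, and the rule that a service-specific key overrides the shared key is an assumption, since the guide only states that VOICE_DASHSCOPE_API_KEY takes priority over DASHSCOPE_API_KEY.

```python
import os
from typing import Mapping, Optional


def resolve_voice_keys(env: Mapping[str, str] = os.environ) -> dict:
    """Pick the API key for STT and TTS (hypothetical helper)."""

    def first_set(*names: str) -> Optional[str]:
        # Return the first variable that is present and non-empty.
        for name in names:
            value = env.get(name, "").strip()
            if value:
                return value
        return None

    # Documented order: the voice-specific key first, then the general key.
    shared = first_set("VOICE_DASHSCOPE_API_KEY", "DASHSCOPE_API_KEY")
    return {
        # Assumption: a per-service key, when set, overrides the shared key.
        "stt": first_set("VOICE_STT_API_KEY") or shared,
        "tts": first_set("VOICE_TTS_API_KEY") or shared,
    }
```

Passing a plain dictionary instead of `os.environ` makes the fallback easy to check without touching the real environment.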
Explanation: VOICE_DASHSCOPE_API_KEY takes priority. If it is left empty, DASHSCOPE_API_KEY is used automatically. To use different keys for STT and TTS, set VOICE_STT_API_KEY and VOICE_TTS_API_KEY separately.
4.2 Speech Recognition (STT) Configuration
Tuning suggestions:
- VOICE_REALTIME_ASR_IDLE_TIMEOUT_SECONDS: If users pause for longer between utterances, increase this value (e.g., to 30); if there are many false triggers, decrease it (e.g., to 15)
- VOICE_REALTIME_ASR_LANGUAGE: zh (Chinese) is the currently supported value; change it to en if English is needed
4.3 Text-to-Speech (TTS) Configuration
Tuning suggestions:
- VOICE_STREAMING_TTS_VOLUME: Adjust according to speaker sensitivity; reduce to 30-40 if too loud
- VOICE_STREAMING_TTS_SPEED: Reduce to 0.8-0.9 if speech is too fast; increase to 1.1-1.2 if too slow
- VOICE_STREAMING_TTS_PITCH: Fine-tune for different voice styles (best results in the 0.8-1.2 range)
- VOICE_STREAMING_TTS_VOICE
4.4 Streaming Synthesis and Warmup Control
These parameters control the coordination strategy between LLM streaming generation and TTS streaming synthesis, affecting conversation response speed and fluency.
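To make the idle-based segmentation behind VOICE_TTS_DELTA_IDLE_MS concrete, here is a minimal offline sketch. It assumes the LLM output arrives as `(timestamp_ms, text)` pairs and flushes the buffered text whenever the gap between two deltas exceeds the idle threshold. The function name and the input shape are illustrative assumptions, and a live backend would more likely flush from a timer than wait for the next delta to arrive.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def segment_by_idle(
    deltas: Iterable[Tuple[float, str]],
    idle_ms: float = 320.0,  # mirrors the 320 ms value quoted for VOICE_TTS_DELTA_IDLE_MS
) -> Iterator[str]:
    """Yield text segments, splitting wherever the LLM paused longer than idle_ms."""
    buffer: List[str] = []
    last_ts: Optional[float] = None
    for ts, text in deltas:
        # A long pause since the previous delta closes the current segment.
        if buffer and last_ts is not None and ts - last_ts > idle_ms:
            yield "".join(buffer)
            buffer = []
        buffer.append(text)
        last_ts = ts
    if buffer:
        # Whatever remains at end of stream forms the final segment.
        yield "".join(buffer)
```

Each yielded segment would then be handed to TTS on its own, so playback can start before the full reply is generated; a smaller `idle_ms` produces more, shorter segments.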
Tuning suggestions:
- VOICE_LLM_STREAMING=1 and VOICE_TTS_STREAMING=1 enable streaming; set them to 0 for non-streaming mode (slower but more stable)
- VOICE_TTS_DELTA_IDLE_MS: AI thinking pauses exceeding 320ms trigger segmented playback, so users are not left waiting for long
- VOICE_WS_PARTIAL_WARMUP_STABLE_MS: Response generation starts 180ms after the ASR result stabilizes, reducing perceived latency
4.5 Server-side VAD (Voice Activity Detection)
Server-side VAD allows the backend to automatically trigger speech recognition submission when silence is detected, without waiting for the device to actively send a commit signal.
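The silence-triggered commit can be sketched as a small state machine. This is an illustration under stated assumptions, not the backend's implementation: the class name `SilenceCommitter` is hypothetical, and the per-frame speech/non-speech decision is taken as an input because the guide does not say how the backend classifies frames. The 800 ms default mirrors the silence window this section mentions (tunable via VOICE_SERVER_VAD_SILENCE_MS).

```python
class SilenceCommitter:
    """Signal a commit once enough silence follows detected speech (hypothetical)."""

    def __init__(self, silence_ms: int = 800) -> None:
        self.silence_ms = silence_ms
        self._heard_speech = False
        self._silent_for = 0

    def feed(self, is_speech: bool, frame_ms: int) -> bool:
        """Process one audio frame; return True when the utterance should be committed."""
        if is_speech:
            self._heard_speech = True
            self._silent_for = 0
            return False
        if not self._heard_speech:
            # Leading silence before any speech never triggers a commit.
            return False
        self._silent_for += frame_ms
        if self._silent_for >= self.silence_ms:
            # Reset so the next utterance starts from a clean state.
            self._heard_speech = False
            self._silent_for = 0
            return True
        return False
```

Raising `silence_ms` makes the detector more tolerant of pauses inside a sentence, at the cost of a slower response after the user stops speaking.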
Explanation: With server-side VAD enabled, the backend can process speech automatically after 800ms of silence even if the device firmware does not actively commit it. This helps when the network is unstable, but may increase false triggers. If the device firmware already implements local VAD, it is recommended to set VOICE_SERVER_VAD_ENABLED to 0.
4.6 Session Cache Configuration
Explanation: Each voice conversation generates a turn record containing the recognized text, response text, audio data, and conversation image. Within the TTL, this data can be retrieved via the API for scenarios such as playback and resuming interrupted transfers. The 900-second TTL (about 15 minutes) is sufficient for daily use and rarely needs changing.
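A TTL-bounded turn cache of this kind might look like the sketch below. The class name `TurnCache`, its methods, and the in-memory dictionary are assumptions made for illustration; the guide does not describe how the backend actually stores turn records.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TurnCache:
    """In-memory store whose entries expire after ttl_seconds (hypothetical)."""

    def __init__(
        self,
        ttl_seconds: float = 900.0,  # the 900 s TTL described above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, Any]] = {}

    def put(self, turn_id: str, record: Any) -> None:
        # Store the record together with its absolute expiry time.
        self._items[turn_id] = (self._clock() + self._ttl, record)

    def get(self, turn_id: str) -> Optional[Any]:
        entry = self._items.get(turn_id)
        if entry is None:
            return None
        expires_at, record = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped lazily on access.
            del self._items[turn_id]
            return None
        return record
```

Injecting the clock keeps expiry testable without sleeping; a production cache would also need periodic cleanup so that unread entries do not accumulate.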
5. Mode Switching Commands
Users can switch device modes through natural language commands during conversations. The system recognizes the following types of commands:
Explanation: After the device detects a mode switching command, it generates a brief confirmation voice (e.g., "Okay, switching for you") and then automatically switches to the target mode.
6. Usage Flow
6.1 Initial Configuration
- Configure VOICE_DASHSCOPE_API_KEY in the backend .env (or configure VOICE_STT_API_KEY and VOICE_TTS_API_KEY separately)
6.2 Daily Use
7. Troubleshooting
Common Issues
1. Speech recognition not responding or returning empty text
- Check that VOICE_DASHSCOPE_API_KEY is correctly configured
- Look for [VOICE_STT] related output
2. TTS playback abnormal (no sound or noise)
- Confirm that VOICE_STREAMING_TTS_SAMPLE_RATE matches the device audio sample rate (usually 16000Hz)
- Check that VOICE_STREAMING_TTS_ENABLED is set to 1
3. Mode switching not working
- Check the switch_to_mode handling logic
4. High response latency
- Reduce VOICE_TTS_DELTA_IDLE_MS (e.g., to 200) to speed up segmentation
- Check whether VOICE_WS_PARTIAL_WARMUP_STABLE_MS is reasonable
5. Server-side VAD false triggers
- Set VOICE_SERVER_VAD_ENABLED to 0
- Increase VOICE_SERVER_VAD_SILENCE_MS (e.g., to 1200)
8. Related Documentation
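AI Chat Mode Development Guide (AI 对话模式开发指南)
AI Chat Mode (AI_CHAT) is a voice interaction feature newly added to InkSight. It lets users start a voice conversation by pressing a button on the device and talk with an AI assistant in real time. The mode is intended for devices with a microphone and speaker attached; under the hood it calls Alibaba Bailian (阿里百炼) real-time ASR, streaming TTS, and a large language model to deliver an end-to-end voice conversation experience.
1. Feature Overview
The core pipeline of AI Chat Mode is:
Specific capabilities include:
- Real-time ASR: uses the qwen3-asr-flash-realtime model; audio streams are uploaded in real time via WebSocket, Chinese recognition is supported, and intermediate results (partial transcripts) can be returned during recognition
- Streaming TTS: uses the cosyvoice-v3-plus model; speech is synthesized in streaming mode, with playback while generating to shorten first-byte latency
Two Interaction Modes
AI Chat Mode supports two usage paths: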
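2. Hardware Wiring
AI Chat Mode is currently available only on the ESP32-WROOM-32E development board. Voice support for the ESP32-C3 series is on the roadmap but is not yet included in the current firmware.
2.1 Recommended Audio Modules
2.2 Pin Mapping
The voice-related pins of the ESP32-WROOM-32E are defined in the firmware as follows:
Microphone (INMP441)
Speaker (MAX98357A)
2.3 Voice Trigger Button
2.4 Wiring Diagram
2.5 Firmware Configuration Macros
In firmware/platformio.ini, make sure the build target corresponds to WROOM32E, for example by using the environment epd_42_wsv2_ssd1683_wroom32e.
To have the device enter AI Chat Mode automatically after power-up, define in platformio.ini:
Voice-related firmware parameters:
- SAMPLE_RATE
- ENABLE_OPUS
- AI_CHAT_BTN_HOLD_MS
2.6 Audio Parameters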
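3. Prerequisites
Hardware Requirements
Server Requirements
Bailian API Key Application
- qwen3-asr-flash-realtime
- cosyvoice-v3-plus
4. .env Parameter Configuration (Personal Development)
All voice-related parameters are configured in the backend .env file. The following describes each parameter's purpose, default value, and tuning suggestions, grouped by feature.
4.1 API Key Configuration
Explanation: VOICE_DASHSCOPE_API_KEY is used first. If it is left empty, DASHSCOPE_API_KEY is used automatically. To use different keys for STT and TTS, specify them individually with VOICE_STT_API_KEY and VOICE_TTS_API_KEY.
4.2 Speech Recognition (STT) Configuration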
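Tuning suggestions:
- VOICE_REALTIME_ASR_IDLE_TIMEOUT_SECONDS: If users pause for longer between utterances, increase it moderately (e.g., 30); if the device has many false triggers, decrease it (e.g., 15)
- VOICE_REALTIME_ASR_LANGUAGE: zh (Chinese) is the currently supported value; change it to en if English is needed
4.3 Text-to-Speech (TTS) Configuration
Tuning suggestions:
- VOICE_STREAMING_TTS_VOLUME: Adjust according to speaker sensitivity; lower to 30-40 if too loud
- VOICE_STREAMING_TTS_SPEED: Lower to 0.8-0.9 if speech is too fast; raise to 1.1-1.2 if too slow
- VOICE_STREAMING_TTS_PITCH: Fine-tune for a different voice style (the 0.8-1.2 range works best)
- VOICE_STREAMING_TTS_VOICE
4.4 Streaming Synthesis and Warmup Control
These parameters control how LLM streaming generation and TTS streaming synthesis are coordinated, which affects the responsiveness and fluency of the conversation.
Tuning suggestions:
- VOICE_LLM_STREAMING=1 and VOICE_TTS_STREAMING=1 enable streaming; set them to 0 to switch to non-streaming mode (slower but stable)
- VOICE_TTS_DELTA_IDLE_MS: Pauses longer than 320ms while the AI is thinking trigger segmented playback, so users are not left waiting too long
- VOICE_WS_PARTIAL_WARMUP_STABLE_MS: Reply generation starts once the ASR result has been stable for 180ms, shortening perceived latency
4.5 Server-side VAD (Voice Activity Detection)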
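Server-side VAD lets the backend submit speech for recognition automatically once silence is detected, without waiting for the device to send a commit signal.
Explanation: With server-side VAD enabled, the backend can process speech automatically after 800ms of silence even if the device firmware does not actively commit it. This helps when the network is unstable, but may increase false triggers. If the device firmware already implements local VAD, it is recommended to set VOICE_SERVER_VAD_ENABLED to 0.
4.6 Session Cache Configuration
Explanation: Each voice conversation generates a turn record containing the recognized text, reply text, audio data, and conversation image. Within the TTL, this data can be retrieved via the API for scenarios such as playback and resuming interrupted transfers. 900 seconds (about 15 minutes) is enough for daily use and does not need frequent changes.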
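5. Mode Switching Commands
Users can switch device modes with natural language commands during a conversation. The system recognizes the following types of commands:
Explanation: After the device detects a mode switching command, it generates a short confirmation voice message (e.g., "Okay, switching for you") and then switches to the target mode automatically.
6. Usage Flow
6.1 Initial Configuration
- Configure VOICE_DASHSCOPE_API_KEY in .env (or configure VOICE_STT_API_KEY and VOICE_TTS_API_KEY individually)
6.2 Daily Use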
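7. Troubleshooting
Common Issues
1. Speech recognition does not respond or returns empty text
- Check whether VOICE_DASHSCOPE_API_KEY is correctly configured
- Look for [VOICE_STT] related output
2. TTS playback is abnormal (no sound or noise)
- Confirm that VOICE_STREAMING_TTS_SAMPLE_RATE matches the device audio sample rate (usually 16000Hz)
- Check whether VOICE_STREAMING_TTS_ENABLED is 1
3. Mode switching does not take effect
- Check the switch_to_mode handling logic
4. Response latency is too high
- Reduce VOICE_TTS_DELTA_IDLE_MS (e.g., to 200) to speed up segmentation
- Check whether VOICE_WS_PARTIAL_WARMUP_STABLE_MS is reasonable
5. Server-side VAD false triggers
- Set VOICE_SERVER_VAD_ENABLED to 0
- Increase VOICE_SERVER_VAD_SILENCE_MS (e.g., 1200)
8. Related Documentation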