GetStream · amosgyamfi · Mar 20, 2026
diff --git a/plugins/xai/grok_tts/.env.example b/plugins/xai/grok_tts/.env.example
@@ -0,0 +1,9 @@
+
+# Stream API credentials
+STREAM_API_KEY=...
+STREAM_API_SECRET=...
+EXAMPLE_BASE_URL=https://demo.visionagents.ai
+# xAI, Deepgram, and Google API credentials
+XAI_API_KEY=...
+DEEPGRAM_API_KEY=...
+GOOGLE_API_KEY=...
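The example scripts in this PR load these keys with `load_dotenv()`. A minimal sketch of a startup check that could catch an incomplete `.env` early is shown here. It assumes the literal `...` placeholder from `.env.example` should count as unset. `missing_env_vars` and `REQUIRED_KEYS` are hypothetical names for illustration, not part of Vision Agents.

```python
import os
from typing import Mapping, Optional

# Keys that basic_example.py documents as required environment variables.
REQUIRED_KEYS = (
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
    "XAI_API_KEY",
    "DEEPGRAM_API_KEY",
    "GOOGLE_API_KEY",
)


def missing_env_vars(environ: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return required keys that are absent, empty, or still set to '...'."""
    env = os.environ if environ is None else environ
    return [
        key
        for key in REQUIRED_KEYS
        if env.get(key, "").strip() in ("", "...")
    ]
```

Calling `missing_env_vars()` right after `load_dotenv()` and exiting with a clear message when the list is non-empty is one way to use it.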
diff --git a/plugins/xai/grok_tts/.gitignore b/plugins/xai/grok_tts/.gitignore
@@ -0,0 +1,10 @@
+# Python-generated files
+__pycache__/
+*.py[oc]
+build/
+dist/
+wheels/
+*.egg-info
+
+# Virtual environments
+.venv
diff --git a/plugins/xai/grok_tts/README.md b/plugins/xai/grok_tts/README.md
diff --git a/plugins/xai/grok_tts/plugins/grok_tts/README.md b/plugins/xai/grok_tts/plugins/grok_tts/README.md
@@ -0,0 +1,119 @@
+# Grok TTS Plugin
+
+A Text-to-Speech (TTS) plugin for [Vision Agents](https://github.com/GetStream/Vision-Agents) powered by [xAI's Grok Voice API](https://x.ai/api/voice). Provides five expressive voices with inline speech tags for fine-grained delivery control.
+
+## Features
+
+- Five distinct voices: Eve, Ara, Leo, Rex, Sal
+- Inline speech tags for expressive delivery (`[laugh]`, `[pause]`, `<whisper>`, etc.)
+- Multiple output codecs: PCM, MP3, WAV, mu-law, A-law
+- Configurable sample rate (8 kHz – 48 kHz)
+- 20+ supported languages with automatic detection
+- Built-in retry with exponential backoff
+- Async HTTP via aiohttp for non-blocking synthesis
+
+## Installation
+
+```bash
+uv add "vision-agents[grok-tts]"
+# or directly
+uv add vision-agents-plugins-grok-tts
+```
+
+## Usage
+
+```python
+from vision_agents.plugins import grok_tts
+
+# Default voice (eve) — energetic, upbeat
+tts = grok_tts.TTS()
+
+# Specify a voice
+tts = grok_tts.TTS(voice="ara")   # warm, friendly
+tts = grok_tts.TTS(voice="leo")   # authoritative, strong
+tts = grok_tts.TTS(voice="rex")   # confident, clear
+tts = grok_tts.TTS(voice="sal")   # smooth, balanced
+
+# Custom output format
+tts = grok_tts.TTS(
+    voice="rex",
+    codec="mp3",
+    sample_rate=44100,
+    bit_rate=192000,
+)
+
+# Explicit API key (otherwise reads XAI_API_KEY env var)
+tts = grok_tts.TTS(api_key="xai-your-key-here")
+```
+
+## Configuration
+
+| Parameter     | Type   | Default   | Description                                                           |
+|---------------|--------|-----------|-----------------------------------------------------------------------|
+| `api_key`     | str    | env var   | xAI API key. Falls back to `XAI_API_KEY` environment variable.        |
+| `voice`       | str    | `"eve"`   | Voice ID: `"eve"`, `"ara"`, `"leo"`, `"rex"`, or `"sal"`.            |
+| `language`    | str    | `"en"`    | BCP-47 language code or `"auto"` for detection.                       |
+| `codec`       | str    | `"pcm"`   | Output codec: `"pcm"`, `"mp3"`, `"wav"`, `"mulaw"`, `"alaw"`.       |
+| `sample_rate` | int    | `24000`   | Sample rate: `8000`–`48000` Hz.                                       |
+| `bit_rate`    | int    | `None`    | MP3 bit rate (only used with `codec="mp3"`).                          |
+| `base_url`    | str    | `None`    | Override the xAI TTS API endpoint.                                    |
+| `session`     | object | `None`    | Optional pre-existing `aiohttp.ClientSession`.                        |
+
+## Voices
+
+| Voice | Tone                     | Best For                                      |
+|-------|--------------------------|-----------------------------------------------|
+| `eve` | Energetic, upbeat        | Demos, announcements, upbeat content (default) |
+| `ara` | Warm, friendly           | Conversational interfaces, hospitality         |
+| `leo` | Authoritative, strong    | Instructional, educational, healthcare         |
+| `rex` | Confident, clear         | Business, corporate, customer support          |
+| `sal` | Smooth, balanced         | Versatile — works for any context              |
+
+## Speech Tags
+
+Add expressiveness to synthesized speech with inline and wrapping tags:
+
+**Inline tags** (placed where the expression should occur):
+- Pauses: `[pause]` `[long-pause]` `[hum-tune]`
+- Laughter and crying: `[laugh]` `[chuckle]` `[giggle]` `[cry]`
+- Mouth sounds: `[tsk]` `[tongue-click]` `[lip-smack]`
+- Breathing: `[breath]` `[inhale]` `[exhale]` `[sigh]`
+
+**Wrapping tags** (wrap text to change delivery):
+- Volume: `<soft>text</soft>` `<loud>text</loud>` `<shout>text</shout>`
+- Pitch/speed: `<high-pitch>text</high-pitch>` `<low-pitch>text</low-pitch>` `<slow>text</slow>` `<fast>text</fast>`
+- Style: `<whisper>text</whisper>` `<sing>text</sing>`
+
+## Supported Languages
+
+| Language              | Code    |
+|-----------------------|---------|
+| English               | `en`    |
+| Chinese (Simplified)  | `zh`    |
+| French                | `fr`    |
+| German                | `de`    |
+| Spanish (Spain)       | `es-ES` |
+| Spanish (Mexico)      | `es-MX` |
+| Japanese              | `ja`    |
+| Korean                | `ko`    |
+| Portuguese (Brazil)   | `pt-BR` |
+| Italian               | `it`    |
+| Hindi                 | `hi`    |
+| Arabic (Egypt)        | `ar-EG` |
+| Russian               | `ru`    |
+| Turkish               | `tr`    |
+| Vietnamese            | `vi`    |
+| Auto-detect           | `auto`  |
+
+## Dependencies
+
+- Python 3.10+
+- aiohttp >= 3.9
+- vision-agents (core)
+- Optional: pydub (for MP3 decoding)
+
+## Getting Your API Key
+
+1. Go to [console.x.ai](https://console.x.ai/team/default/api-keys)
+2. Create a new API key
+3. Set the `XAI_API_KEY` environment variable or pass it directly to the plugin
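The speech tags listed in the README are plain text markers, so they can be composed with ordinary string helpers. The sketch below validates tag names against the lists given in the Speech Tags section before formatting them. `inline`, `wrap`, `INLINE_TAGS`, and `WRAPPING_TAGS` are hypothetical names for illustration. The plugin itself may or may not ship anything similar.

```python
# Tag names copied from the "Speech Tags" section of the README.
INLINE_TAGS = frozenset({
    "pause", "long-pause", "hum-tune",
    "laugh", "chuckle", "giggle", "cry",
    "tsk", "tongue-click", "lip-smack",
    "breath", "inhale", "exhale", "sigh",
})
WRAPPING_TAGS = frozenset({
    "soft", "loud", "shout",
    "high-pitch", "low-pitch", "slow", "fast",
    "whisper", "sing",
})


def inline(tag: str) -> str:
    """Format an inline tag such as [pause], rejecting unknown names."""
    if tag not in INLINE_TAGS:
        raise ValueError(f"unknown inline tag: {tag!r}")
    return f"[{tag}]"


def wrap(tag: str, text: str) -> str:
    """Wrap text in a delivery tag such as <whisper>...</whisper>."""
    if tag not in WRAPPING_TAGS:
        raise ValueError(f"unknown wrapping tag: {tag!r}")
    return f"<{tag}>{text}</{tag}>"


# Build a sentence that mixes one inline tag and one wrapping tag.
line = f"Welcome back. {inline('pause')} {wrap('whisper', 'Your table is ready.')}"
```

The resulting `line` is ordinary text and would be passed to the TTS like any other string. Validating names up front avoids sending a misspelled tag that might be read aloud literally.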
diff --git a/plugins/xai/grok_tts/plugins/grok_tts/example/README.md b/plugins/xai/grok_tts/plugins/grok_tts/example/README.md
@@ -0,0 +1,100 @@
+# Grok TTS Examples
+
+This directory contains examples demonstrating how to use the Grok TTS plugin with Vision Agents. Each example showcases a different use case with a voice selected to match the persona.
+
+## Examples
+
+| Example                            | File                                | Voice | Persona                      |
+|------------------------------------|-------------------------------------|-------|------------------------------|
+| Basic                              | `basic_example.py`                  | Eve   | Friendly AI assistant        |
+| Restaurant Host                    | `restaurant_host_example.py`        | Ara   | Upscale Italian restaurant   |
+| Medical Receptionist               | `medical_receptionist_example.py`   | Sal   | Family practice front desk   |
+| Customer Support                   | `customer_support_example.py`       | Rex   | SaaS product support agent   |
+| Real Estate Agent                  | `real_estate_agent_example.py`      | Eve   | Property sales agent         |
+| Healthcare Information             | `healthcare_example.py`             | Leo   | Telehealth wellness guide    |
+| Hotel Concierge                    | `hotel_concierge_example.py`        | Ara   | Luxury hotel concierge       |
+
+## Setup
+
+1. Install dependencies:
+
+```bash
+cd plugins/xai/grok_tts/plugins/grok_tts/example
+uv sync
+```
+
+2. Create a `.env` file with your API keys:
+
+```bash
+# Required for Grok TTS
+XAI_API_KEY=your_xai_api_key
+
+# Required for speech-to-text
+DEEPGRAM_API_KEY=your_deepgram_api_key
+
+# Required for LLM
+GOOGLE_API_KEY=your_google_api_key
+
+# Required for real-time transport
+STREAM_API_KEY=your_stream_api_key
+STREAM_API_SECRET=your_stream_api_secret
+```
+
+## Running the Examples
+
+Each example follows the same pattern — pick any one:
+
+```bash
+# Basic assistant
+uv run basic_example.py run
+
+# Restaurant host
+uv run restaurant_host_example.py run
+
+# Medical receptionist
+uv run medical_receptionist_example.py run
+
+# Customer support
+uv run customer_support_example.py run
+
+# Real estate agent
+uv run real_estate_agent_example.py run
+
+# Healthcare information
+uv run healthcare_example.py run
+
+# Hotel concierge
+uv run hotel_concierge_example.py run
+```
+
+## Voice Selection Guide
+
+Each example uses a voice that matches its persona:
+
+- **Eve** (energetic, upbeat) — Great default for demos and enthusiastic roles like real estate
+- **Ara** (warm, friendly) — Perfect for hospitality: restaurant hosts, hotel concierges
+- **Leo** (authoritative, strong) — Ideal for healthcare and instructional content
+- **Rex** (confident, clear) — Best for professional roles: support agents, business
+- **Sal** (smooth, balanced) — Versatile choice for calm, reassuring roles like medical reception
+
+## Customization
+
+To swap voices or adjust settings, change the `tts=` argument passed to `Agent(...)` in any example:
+
+```python
+# Change voice
+tts=grok_tts.TTS(voice="leo")
+
+# Change language
+tts=grok_tts.TTS(voice="ara", language="es-ES")
+
+# Use MP3 output
+tts=grok_tts.TTS(voice="eve", codec="mp3", sample_rate=44100, bit_rate=192000)
+```
+
+## Additional Resources
+
+- [xAI TTS Documentation](https://docs.x.ai/developers/model-capabilities/audio/text-to-speech)
+- [xAI Voice API](https://x.ai/api/voice)
+- [Vision Agents Documentation](https://visionagents.ai)
+- [Vision Agents Plugin Guide](https://visionagents.ai/integrations/create-your-own-plugin)
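When customizing the examples, it can help to check the settings against the values documented in the plugin README before constructing the TTS. The sketch below is one possible check. It assumes any integer between 8000 and 48000 is an acceptable sample rate, and it rejects `bit_rate` for non-MP3 codecs, which is stricter than the README's "only used with `codec="mp3"`". `check_tts_settings` is a hypothetical helper, not part of the plugin.

```python
from typing import Optional

# Allowed values as documented in the plugin README's Configuration table.
VOICES = {"eve", "ara", "leo", "rex", "sal"}
CODECS = {"pcm", "mp3", "wav", "mulaw", "alaw"}


def check_tts_settings(
    voice: str = "eve",
    codec: str = "pcm",
    sample_rate: int = 24000,
    bit_rate: Optional[int] = None,
) -> dict:
    """Validate settings and return them as keyword arguments."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if codec not in CODECS:
        raise ValueError(f"unknown codec: {codec!r}")
    # Assumption: the documented 8 kHz to 48 kHz range is continuous.
    if not 8000 <= sample_rate <= 48000:
        raise ValueError(f"sample_rate out of range: {sample_rate}")
    # Assumption: treat bit_rate on a non-MP3 codec as a mistake.
    if bit_rate is not None and codec != "mp3":
        raise ValueError("bit_rate only applies when codec is 'mp3'")

    settings = {"voice": voice, "codec": codec, "sample_rate": sample_rate}
    if bit_rate is not None:
        settings["bit_rate"] = bit_rate
    return settings
```

The returned dictionary uses the parameter names from the Configuration table, so it could be unpacked as `grok_tts.TTS(**check_tts_settings(voice="eve", codec="mp3", sample_rate=44100, bit_rate=192000))`.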
diff --git a/plugins/xai/grok_tts/plugins/grok_tts/example/__init__.py b/plugins/xai/grok_tts/plugins/grok_tts/example/__init__.py
diff --git a/plugins/xai/grok_tts/plugins/grok_tts/example/basic_example.py b/plugins/xai/grok_tts/plugins/grok_tts/example/basic_example.py
@@ -0,0 +1,67 @@
+"""
+Grok TTS — Basic Example
+
+A minimal Vision Agents setup that demonstrates Grok text-to-speech
+with Deepgram STT, Gemini LLM, and Stream's real-time edge transport.
+
+Requirements (environment variables):
+    XAI_API_KEY          — xAI / Grok API key
+    DEEPGRAM_API_KEY     — Deepgram STT key
+    GOOGLE_API_KEY       — Google Gemini key
+    STREAM_API_KEY       — Stream API key
+    STREAM_API_SECRET    — Stream API secret
+"""
+
+import asyncio
+import logging
+
+from dotenv import load_dotenv
+from vision_agents.core import Agent, Runner, User
+from vision_agents.core.agents import AgentLauncher
+from vision_agents.plugins import deepgram, gemini, getstream, smart_turn
+from vision_agents.plugins import grok_tts
+
+logger = logging.getLogger(__name__)
+
+load_dotenv()
+
+
+async def create_agent(**kwargs) -> Agent:
+    """Create an agent with Grok TTS using the default 'eve' voice."""
+    agent = Agent(
+        edge=getstream.Edge(),
+        agent_user=User(name="Grok Voice AI", id="agent"),
+        instructions=(
+            "You are a friendly and helpful voice assistant powered by Grok. "
+            "Keep your responses concise and conversational."
+        ),
+        tts=grok_tts.TTS(voice="eve"),
+        stt=deepgram.STT(eager_turn_detection=True),
+        llm=gemini.LLM(),
+        turn_detection=smart_turn.TurnDetection(
+            silence_duration_ms=2000,
+            speech_probability_threshold=0.5,
+        ),
+    )
+    return agent
+
+
+async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
+    """Join a call and greet the user."""
+    call = await agent.create_call(call_type, call_id)
+
+    logger.info("Starting Grok TTS Agent (basic example)...")
+
+    async with agent.join(call):
+        logger.info("Agent joined call")
+
+        await asyncio.sleep(3)  # short delay before the greeting is sent
+        await agent.llm.simple_response(
+            text="Hello! I'm your voice assistant running on Grok TTS. How can I help?"
+        )
+
+        await agent.finish()
+
+
+if __name__ == "__main__":
+    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
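The order of operations in `join_call` above (join, wait, greet, finish, then leave when the `async with` block exits) can be exercised without any API keys by using a stand-in object. The sketch below is only an illustration of that ordering. `FakeAgent` and its `greet` method are hypothetical and do not reflect the real `Agent` class, and `greeting_delay` is an assumed parameter added so the delay can be shortened in tests.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeAgent:
    """Stand-in that records calls; NOT the Vision Agents Agent class."""

    def __init__(self) -> None:
        self.events: list[str] = []

    @asynccontextmanager
    async def join(self, call: str):
        # Record entry and exit so the ordering can be inspected later.
        self.events.append(f"join:{call}")
        try:
            yield
        finally:
            self.events.append("leave")

    async def greet(self, text: str) -> None:
        self.events.append(f"greet:{text}")

    async def finish(self) -> None:
        self.events.append("finish")


async def join_call(agent: FakeAgent, call: str, greeting_delay: float = 0.0) -> None:
    """Mirror the join, delay, greet, finish sequence of the example."""
    async with agent.join(call):
        await asyncio.sleep(greeting_delay)
        await agent.greet("Hello!")
        await agent.finish()


agent = FakeAgent()
asyncio.run(join_call(agent, "demo-call"))
```

After the run, `agent.events` holds the calls in the order they happened, which makes it easy to confirm that the greeting is sent inside the joined context.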