9 changes: 9 additions & 0 deletions plugins/xai/grok_tts/.env.example
@@ -0,0 +1,9 @@

# Stream API credentials
STREAM_API_KEY=...
STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://demo.visionagents.ai
# xAI, Deepgram, and Google API credentials
XAI_API_KEY=...
DEEPGRAM_API_KEY=...
GOOGLE_API_KEY=...
10 changes: 10 additions & 0 deletions plugins/xai/grok_tts/.gitignore
@@ -0,0 +1,10 @@
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv
Empty file added plugins/xai/grok_tts/README.md
Empty file.
119 changes: 119 additions & 0 deletions plugins/xai/grok_tts/plugins/grok_tts/README.md
@@ -0,0 +1,119 @@
# Grok TTS Plugin

A Text-to-Speech (TTS) plugin for [Vision Agents](https://github.com/GetStream/Vision-Agents) powered by [xAI's Grok Voice API](https://x.ai/api/voice). Provides five expressive voices with inline speech tags for fine-grained delivery control.

## Features

- Five distinct voices: Eve, Ara, Leo, Rex, Sal
- Inline speech tags for expressive delivery (`[laugh]`, `[pause]`, `<whisper>`, etc.)
- Multiple output codecs: PCM, MP3, WAV, mu-law, A-law
- Configurable sample rate (8 kHz – 48 kHz)
- 20+ supported languages with automatic detection
- Built-in retry with exponential backoff
- Async HTTP via aiohttp for non-blocking synthesis
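
The retry behaviour listed above can be pictured with a small sketch. This is not the plugin's actual implementation: the helper name `retry_with_backoff`, its parameters, and the jitter strategy are assumptions chosen only to illustrate exponential backoff around an async call.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    call: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> T:
    """Hypothetical helper: retry an async call, doubling the delay each time."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay grows as base_delay * 2^attempt, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # A little jitter keeps concurrent clients from retrying in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
    raise RuntimeError("max_attempts must be at least 1")
```

In a real client you would likely catch only transient errors (timeouts, HTTP 429/5xx) instead of every `Exception`.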

## Installation

```bash
uv add "vision-agents[grok-tts]"
# or directly
uv add vision-agents-plugins-grok-tts
```

## Usage

```python
from vision_agents.plugins import grok_tts

# Default voice (eve) — energetic, upbeat
tts = grok_tts.TTS()

# Specify a voice
tts = grok_tts.TTS(voice="ara") # warm, friendly
tts = grok_tts.TTS(voice="leo") # authoritative, strong
tts = grok_tts.TTS(voice="rex") # confident, clear
tts = grok_tts.TTS(voice="sal") # smooth, balanced

# Custom output format
tts = grok_tts.TTS(
    voice="rex",
    codec="mp3",
    sample_rate=44100,
    bit_rate=192000,
)

# Explicit API key (otherwise reads XAI_API_KEY env var)
tts = grok_tts.TTS(api_key="xai-your-key-here")
```

## Configuration

| Parameter | Type | Default | Description |
|---------------|--------|-----------|-----------------------------------------------------------------------|
| `api_key` | str | env var | xAI API key. Falls back to `XAI_API_KEY` environment variable. |
| `voice` | str | `"eve"` | Voice ID: `"eve"`, `"ara"`, `"leo"`, `"rex"`, or `"sal"`. |
| `language` | str | `"en"` | BCP-47 language code or `"auto"` for detection. |
| `codec` | str | `"pcm"` | Output codec: `"pcm"`, `"mp3"`, `"wav"`, `"mulaw"`, `"alaw"`. |
| `sample_rate` | int | `24000` | Sample rate: `8000`–`48000` Hz. |
| `bit_rate` | int | `None` | MP3 bit rate (only used with `codec="mp3"`). |
| `base_url` | str | `None` | Override the xAI TTS API endpoint. |
| `session` | object | `None` | Optional pre-existing `aiohttp.ClientSession`. |
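
To catch typos before a request is made, the constraints in the table can be checked up front. The sketch below is a standalone illustration, not part of the plugin: `TTSOptions` is a hypothetical container, and rejecting `bit_rate` for non-MP3 codecs is a stricter choice than the table implies (the table only says the value is unused in that case).

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values copied from the configuration table.
VOICES = {"eve", "ara", "leo", "rex", "sal"}
CODECS = {"pcm", "mp3", "wav", "mulaw", "alaw"}


@dataclass(frozen=True)
class TTSOptions:
    """Hypothetical options holder mirroring the table's defaults."""

    voice: str = "eve"
    language: str = "en"
    codec: str = "pcm"
    sample_rate: int = 24000
    bit_rate: Optional[int] = None

    def __post_init__(self) -> None:
        if self.voice not in VOICES:
            raise ValueError(f"unknown voice: {self.voice!r}")
        if self.codec not in CODECS:
            raise ValueError(f"unknown codec: {self.codec!r}")
        if not 8000 <= self.sample_rate <= 48000:
            raise ValueError("sample_rate must be between 8000 and 48000 Hz")
        # Assumption: treat a bit rate on a non-MP3 codec as a mistake.
        if self.bit_rate is not None and self.codec != "mp3":
            raise ValueError("bit_rate only applies when codec='mp3'")
```

Validated values could then be forwarded as keyword arguments, for example `grok_tts.TTS(voice=opts.voice, codec=opts.codec, sample_rate=opts.sample_rate)`.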

## Voices

| Voice | Tone | Best For |
|-------|--------------------------|-----------------------------------------------|
| `eve` | Energetic, upbeat | Demos, announcements, upbeat content (default) |
| `ara` | Warm, friendly | Conversational interfaces, hospitality |
| `leo` | Authoritative, strong | Instructional, educational, healthcare |
| `rex` | Confident, clear | Business, corporate, customer support |
| `sal` | Smooth, balanced | Versatile — works for any context |

## Speech Tags

Add expressiveness to synthesized speech with inline and wrapping tags:

**Inline tags** (placed where the expression should occur):
- Pauses: `[pause]` `[long-pause]` `[hum-tune]`
- Laughter: `[laugh]` `[chuckle]` `[giggle]` `[cry]`
- Mouth sounds: `[tsk]` `[tongue-click]` `[lip-smack]`
- Breathing: `[breath]` `[inhale]` `[exhale]` `[sigh]`

**Wrapping tags** (wrap text to change delivery):
- Volume: `<soft>text</soft>` `<loud>text</loud>` `<shout>text</shout>`
- Pitch/speed: `<high-pitch>text</high-pitch>` `<low-pitch>text</low-pitch>` `<slow>text</slow>` `<fast>text</fast>`
- Style: `<whisper>text</whisper>` `<sing>text</sing>`
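
Because the tags are plain text embedded in the input string, they can be composed with ordinary string helpers. The following is a minimal sketch; the function names `inline` and `wrap` are hypothetical and not part of the plugin, and the tag sets are copied from the lists above.

```python
INLINE_TAGS = {
    "pause", "long-pause", "hum-tune",
    "laugh", "chuckle", "giggle", "cry",
    "tsk", "tongue-click", "lip-smack",
    "breath", "inhale", "exhale", "sigh",
}
WRAPPING_TAGS = {
    "soft", "loud", "shout",
    "high-pitch", "low-pitch", "slow", "fast",
    "whisper", "sing",
}


def inline(tag: str) -> str:
    """Return an inline tag such as [pause], rejecting unknown names."""
    if tag not in INLINE_TAGS:
        raise ValueError(f"unknown inline tag: {tag!r}")
    return f"[{tag}]"


def wrap(text: str, tag: str) -> str:
    """Wrap text in a delivery tag such as <whisper>...</whisper>."""
    if tag not in WRAPPING_TAGS:
        raise ValueError(f"unknown wrapping tag: {tag!r}")
    return f"<{tag}>{text}</{tag}>"


# Build a line that pauses, then whispers the second half.
line = f"Welcome back. {inline('pause')} {wrap('This stays between us.', 'whisper')}"
```

Validating tag names locally avoids sending a misspelled tag that would otherwise be read aloud as literal text or rejected by the API.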

## Supported Languages

| Language | Code |
|-----------------------|---------|
| English | `en` |
| Chinese (Simplified) | `zh` |
| French | `fr` |
| German | `de` |
| Spanish (Spain) | `es-ES` |
| Spanish (Mexico) | `es-MX` |
| Japanese | `ja` |
| Korean | `ko` |
| Portuguese (Brazil) | `pt-BR` |
| Italian | `it` |
| Hindi | `hi` |
| Arabic (Egypt) | `ar-EG` |
| Russian | `ru` |
| Turkish | `tr` |
| Vietnamese | `vi` |
| Auto-detect | `auto` |

## Dependencies

- Python 3.10+
- aiohttp >= 3.9
- vision-agents (core)
- Optional: pydub (for MP3 decoding)

## Getting Your API Key

1. Go to [console.x.ai](https://console.x.ai/team/default/api-keys)
2. Create a new API key
3. Set the `XAI_API_KEY` environment variable or pass it directly to the plugin
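
The two options in step 3 amount to a simple precedence rule: an explicit key wins, otherwise the environment variable is used. A minimal sketch of that rule follows; `resolve_api_key` is a hypothetical helper for illustration, and the plugin's own lookup may differ in detail.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Prefer an explicit key, then fall back to the XAI_API_KEY env var."""
    key = api_key or os.environ.get("XAI_API_KEY")
    if not key:
        raise ValueError(
            "No xAI API key found: pass api_key=... or set XAI_API_KEY"
        )
    return key
```

Failing early with a clear message is usually easier to debug than an authentication error on the first synthesis request.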
100 changes: 100 additions & 0 deletions plugins/xai/grok_tts/plugins/grok_tts/example/README.md
@@ -0,0 +1,100 @@
# Grok TTS Examples

This directory contains examples demonstrating how to use the Grok TTS plugin with Vision Agents. Each example showcases a different use case with a voice selected to match the persona.

## Examples

| Example | File | Voice | Persona |
|------------------------------------|-------------------------------------|-------|------------------------------|
| Basic | `basic_example.py` | Eve | Friendly AI assistant |
| Restaurant Host | `restaurant_host_example.py` | Ara | Upscale Italian restaurant |
| Medical Receptionist | `medical_receptionist_example.py` | Sal | Family practice front desk |
| Customer Support | `customer_support_example.py` | Rex | SaaS product support agent |
| Real Estate Agent | `real_estate_agent_example.py` | Eve | Property sales agent |
| Healthcare Information | `healthcare_example.py` | Leo | Telehealth wellness guide |
| Hotel Concierge | `hotel_concierge_example.py` | Ara | Luxury hotel concierge |

## Setup

1. Install dependencies:

```bash
cd plugins/grok_tts/example
uv sync
```

2. Create a `.env` file with your API keys:

```bash
# Required for Grok TTS
XAI_API_KEY=your_xai_api_key

# Required for speech-to-text
DEEPGRAM_API_KEY=your_deepgram_api_key

# Required for LLM
GOOGLE_API_KEY=your_google_api_key

# Required for real-time transport
STREAM_API_KEY=your_stream_api_key
STREAM_API_SECRET=your_stream_api_secret
```
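
Since every example needs all five variables, it can help to verify them before launching an agent. The snippet below is a sketch only: `missing_env_vars` is a hypothetical helper, not something shipped with the examples, and it treats an empty value as missing.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the .env listing above.
REQUIRED_VARS = (
    "XAI_API_KEY",
    "DEEPGRAM_API_KEY",
    "GOOGLE_API_KEY",
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
)


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or empty."""
    source = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not source.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All required environment variables are set.")
```

If you load the `.env` file with `python-dotenv`, call `load_dotenv()` before running the check.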

## Running the Examples

Each example follows the same pattern — pick any one:

```bash
# Basic assistant
uv run basic_example.py run

# Restaurant host
uv run restaurant_host_example.py run

# Medical receptionist
uv run medical_receptionist_example.py run

# Customer support
uv run customer_support_example.py run

# Real estate agent
uv run real_estate_agent_example.py run

# Healthcare information
uv run healthcare_example.py run

# Hotel concierge
uv run hotel_concierge_example.py run
```

## Voice Selection Guide

Each example uses a voice that matches its persona:

- **Eve** (energetic, upbeat) — Great default for demos and enthusiastic roles like real estate
- **Ara** (warm, friendly) — Perfect for hospitality: restaurant hosts, hotel concierges
- **Leo** (authoritative, strong) — Ideal for healthcare and instructional content
- **Rex** (confident, clear) — Best for professional roles: support agents, business
- **Sal** (smooth, balanced) — Versatile choice for calm, reassuring roles like medical reception

## Customization

You can swap voices or adjust settings by changing the `tts=` argument passed to `Agent(...)` in any example:

```python
# Change voice
tts=grok_tts.TTS(voice="leo")

# Change language
tts=grok_tts.TTS(voice="ara", language="es-ES")

# Use MP3 output
tts=grok_tts.TTS(voice="eve", codec="mp3", sample_rate=44100, bit_rate=192000)
```

## Additional Resources

- [xAI TTS Documentation](https://docs.x.ai/developers/model-capabilities/audio/text-to-speech)
- [xAI Voice API](https://x.ai/api/voice)
- [Vision Agents Documentation](https://visionagents.ai)
- [Vision Agents Plugin Guide](https://visionagents.ai/integrations/create-your-own-plugin)
Empty file.
67 changes: 67 additions & 0 deletions plugins/xai/grok_tts/plugins/grok_tts/example/basic_example.py
@@ -0,0 +1,67 @@
"""
Grok TTS — Basic Example

A minimal Vision Agents setup that demonstrates Grok text-to-speech
with Deepgram STT, Gemini LLM, and Stream's real-time edge transport.

Requirements (environment variables):
    XAI_API_KEY        - xAI / Grok API key
    DEEPGRAM_API_KEY   - Deepgram STT key
    GOOGLE_API_KEY     - Google Gemini key
    STREAM_API_KEY     - Stream API key
    STREAM_API_SECRET  - Stream API secret
"""

import asyncio
import logging

from dotenv import load_dotenv
from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import deepgram, gemini, getstream, smart_turn
from vision_agents.plugins import grok_tts

logger = logging.getLogger(__name__)

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    """Create an agent with Grok TTS using the default 'eve' voice."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Grok Voice AI", id="agent"),
        instructions=(
            "You are a friendly and helpful voice assistant powered by Grok. "
            "Keep your responses concise and conversational."
        ),
        tts=grok_tts.TTS(voice="eve"),
        stt=deepgram.STT(eager_turn_detection=True),
        llm=gemini.LLM(),
        turn_detection=smart_turn.TurnDetection(
            silence_duration_ms=2000,
            speech_probability_threshold=0.5,
        ),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join a call and greet the user."""
    call = await agent.create_call(call_type, call_id)

    logger.info("Starting Grok TTS Agent (basic example)...")

    async with agent.join(call):
        logger.info("Agent joined call")

        await asyncio.sleep(3)
        await agent.llm.simple_response(
            text="Hello! I'm your voice assistant running on Grok TTS. How can I help?"
        )

        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()