Expressive Voice Agents

Recent advances in speech and language models have enabled the generation of high-quality, natural-sounding speech. However, current voice chat systems often cannot generate appropriately emotive and personalized responses. In this project, we present an integrated voice agent system that builds on these advances to create expressive and personalized voice agents.

Our system consists of three main components: speech transcription, language model generation with question answering, and text-to-speech (TTS) synthesis. We personalize the language model and speech synthesis components to specific target individuals, focusing on the use case of casual coffee chats or "get-to-know-you" conversations. Our experiments demonstrate the effectiveness of retrieval-augmented generation for providing realistic and personalized answers, and of fine-tuned TTS models for generating high-quality personalized speech. Furthermore, we introduce AudioRAG, a novel method for retrieving emotion-matching audio samples to generate expressive speech. The resulting voice agent system achieves high quality in terms of realism, expressiveness, and personalization with reasonable latency, offering a promising direction for AI-driven personal voice clones.
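To make the retrieval idea concrete, here is a minimal sketch of similarity-based retrieval over reference audio clips. It is not the AudioRAG implementation from this repo: the clip names, the 3-dimensional "emotion" embeddings, and the helper functions `cosine_similarity` and `retrieve_top_k` are all hypothetical, and a real system would obtain embeddings from a learned model. The sketch only illustrates how a clip might be chosen by ranking stored embeddings against a query embedding.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float], corpus: Dict[str, Sequence[float]], k: int = 1
) -> List[str]:
    """Return the k corpus keys whose embeddings are most similar to the query."""
    ranked = sorted(
        corpus,
        key=lambda name: cosine_similarity(query, corpus[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings for three reference clips (illustrative values only).
reference_clips = {
    "cheerful_greeting.wav": [0.9, 0.1, 0.0],
    "calm_explanation.wav": [0.1, 0.9, 0.1],
    "surprised_reaction.wav": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the response text the agent is about to speak.
query_embedding = [0.8, 0.3, 0.1]

best_match = retrieve_top_k(query_embedding, reference_clips, k=1)
print(best_match)
```

With these toy values, the clip closest to the query is the cheerful one. In a setup like this, the retrieved clip would then serve as the prompt or reference audio for the TTS model.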

Paper Report

https://drive.google.com/file/d/1IOI0cnyRXUW1d1VGFaHj29nqUgneSPZV/view?usp=drive_link

File structure

React frontend (src/frontend/)
FastAPI server (src/app.py)
Whisper transcription module (src/transcriber.py)
Tortoise text-to-speech module (src/tts.py)
Zephyr language model module (src/llm_zephyr.py)

Read the accompanying docs for a detailed look at each of these components.

Developing locally

Requirements

modal installed in your current Python virtual environment (pip install modal)
A Modal account
A Modal token set up in your environment (modal token new)

To get the VoiceCraft submodule, run the command below (on a fresh clone, add --init so the submodule is initialized first):

git submodule update

VoiceCraft currently requires the EnCodec model file to be downloaded manually. Run wget https://huggingface.co/pyp1/VoiceCraft/resolve/main/encodec_4cb2048_giga.th and make sure the file ends up in the directory VoiceCraft/pretrained_models/
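If you would rather script this step than run wget by hand, the following is a small sketch that downloads the file only when it is missing. The helper name `ensure_checkpoint` and its `repo_root` parameter are assumptions made for illustration and are not part of this repo; the URL and target directory are the ones given above.

```python
import urllib.request
from pathlib import Path

CHECKPOINT_URL = (
    "https://huggingface.co/pyp1/VoiceCraft/resolve/main/encodec_4cb2048_giga.th"
)
CHECKPOINT_DIR = Path("VoiceCraft") / "pretrained_models"


def ensure_checkpoint(repo_root: Path = Path("."), url: str = CHECKPOINT_URL) -> Path:
    """Return the checkpoint path, downloading the file first if it is missing."""
    target_dir = repo_root / CHECKPOINT_DIR
    # Use the last URL segment as the file name (encodec_4cb2048_giga.th).
    target = target_dir / url.rsplit("/", 1)[-1]
    if target.exists():
        return target  # already present, so skip the network call
    target_dir.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(target))
    return target
```

Calling `ensure_checkpoint()` from the root directory of this repo should leave the file in VoiceCraft/pretrained_models/, the same location the wget instructions describe.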

Develop on Modal

To serve the app on Modal, run this command from the root directory of this repo:

modal serve src.app

In the terminal output, you'll find a URL that you can visit to use your app. While the modal serve process is running, changes to any of the project files will be automatically applied. Ctrl+C will stop the app.

Deploy to Modal

Once you're happy with your changes, deploy your app:

modal deploy src.app

[Note that leaving the app deployed on Modal doesn't cost you anything! Modal apps are serverless and scale to 0 when not in use.]
