Jarvis 2.0 is a next-generation multimodal conversational AI assistant π£οΈ, designed for real-time β‘, low-latency, and emotionally intelligent β€οΈ interaction.
This project integrates π the high-performance, websocket-based audio streaming π architecture of Unmute with the powerful audio-language reasoning 𦩠of Audio Flamingo 3.
We utilize Unmute's robust Voice Activity Detection (VAD) ποΈ and its integration with Kyutai's STT/TTS models to create a seamless, responsive conversational pipeline. Instead of a standard text LLM, Jarvis 2.0 uses Nvidia's Audio Flamingo 3 as its central "brain" π§ , allowing for a deeper understanding π of not just what is said, but how it's said.
Jarvis 2.0 functions by creating a real-time, bidirectional audio stream ππ between the user and the AI.
- VAD & Streaming: π€ The frontend captures user audio and, using Unmute's VAD implementation, streams it over a websocket πΈοΈ to the backend as the user speaks.
- Transcription: βοΈ The backend forwards this audio to Kyutai's Speech-to-Text (STT) model, which generates a live transcription.
- Core Reasoning: π‘ The transcribed text is sent to the Audio Flamingo 3 𦩠model. This advanced Audio-Language Model (ALM) generates a context-aware, nuanced, and intelligent response.
- Speech Synthesis: π£οΈ The text response from Audio Flamingo 3 is streamed, as it's generated, to Kyutai's Text-to-Speech (TTS) model.
- Response: π§ The TTS model generates audio, which is streamed back to the user's browser π», enabling a fluid, low-latency conversation.
graph LR
UVI[User Voice Input] --> F(Frontend)
F -->|Audio File| B(Backend)
B <-->|WEB SOCKET| STT(STT)
B <-->|WEB SOCKET| TTS(TTS)
B <-->|HTTP| AF3(AF3)
B <--> LLM(LLM)
LLM <--> SDK(OpenAI Agent SDK)
SDK <--> TC(Tool Calling)
- β‘ Extremely Low Latency: Built on Unmute's architecture, streaming STT, LLM, and TTS tokens simultaneously for lower time-to-first-word."
- π§ Advanced AI Reasoning: Powered by Audio Flamingo 3 π¦©, providing state-of-the-art responses.
- π Real-time Streaming: Full-duplex audio transport over websockets.
- ποΈ Robust VAD: Intelligently detects end-of-speech or natural spaces to provide a natural turn-taking experience.
- π§© Modular: Easily swap out the core model (Audio Flamingo 3) for other backends like GPT-4o, Ollama, or Mistral.
- π Spatial & Emotion Detection: The core model (Audio Flamingo 3) understands audio and is able to detect the surrounding environment π and the user's tone ππ’ from the input audio, something which has not yet been achieved by other open source models.
Alternatively, you can run all services manually. This is more complex due to dependencies.
π» Software requirements:
uv: Install withcurl -LsSf https://astral.sh/uv/install.sh | shcargo: Install withcurl https://sh.rustup.rs -sSf | shpnpm: Install withcurl -fsSL https://get.pnpm.io/install.sh | sh -cuda 12.1: Needed for the Rust processes (tts and stt).
./dockerless/start_frontend.sh
./dockerless/start_backend.sh
./dockerless/start_llm.sh # Requires GPU VRAM
./dockerless/start_stt.sh # Requires GPU VRAM
./dockerless/start_tts.sh # Requires GPU VRAMThe website should now be accessible at π http://localhost:3000.
If you're running Jarvis 2.0 on a remote machine (e.g., jarvis-box) and accessing it from your local machine, you must use SSH port forwarding.
Note
π Browsers restrict microphone π€ access on non-secure (http://) connections, except for localhost. Port forwarding makes the remote server accessible via your localhost, bypassing this restriction.
π³ For Docker Compose: The default setup runs on port 80. Forward this to your local port 3333 π:
ssh -N -L 3333:localhost:80 jarvis-boxNow open http://localhost:3333 in your browser.
π οΈ For Dockerless: You must forward the frontend (3000) and backend (8000) ports separately π:
ssh -N -L 8000:localhost:8000 -L 3000:localhost:3000 jarvis-boxNow open http://localhost:3000 in your browser.
For simplicity, HTTPS is not included in the default setups. For production deployments, we recommend using a reverse proxy like Caddy or Nginx, or adapting the Docker Swarm documentation provided by the Unmute project.
- Press "S" to toggle subtitles for both you and Jarvis.
- A dev mode can be enabled in
useKeyboardShortcuts.tsby changingALLOW_DEV_MODEtotrue. Press "D" to see the debug view.
All character prompts, voices, and system messages are defined in voices.yaml. To add a new character, simply add a new entry. The backend caches this file on startup, so you will need to restart the backend service to see changes.
The backend is compatible with any OpenAI-compatible API. While it's configured for our VLLM-hosted Audio Flamingo 3 by default, you can easily point it to another service.
Edit your docker-compose.yml and change the environment variables for the backend service.
Example: Using Ollama (π¦)
backend:
image: jarvis-backend:latest
[..]
environment:
[..]
- KYUTAI_LLM_URL=http://host.docker.internal:11434
- KYUTAI_LLM_MODEL=llama3 # or any model you have pulled
- KYUTAI_LLM_API_KEY=ollama
extra_hosts:
- "host.docker.internal:host-gateway"Example: Using OpenAI (π€)
backend:
image: jarvis-backend:latest
[..]
environment:
[..]
- KYUTAI_LLM_URL=https://api.openai.com/v1
- KYUTAI_LLM_MODEL=gpt-4o
- KYUTAI_LLM_API_KEY=sk-..If you use an external API, you can remove the llm (VLLM) service from your docker-compose.yml to save πΎ GPU resources.
Tool calling is not yet natively supported by the backend, but it's a highly requested feature.
The easiest way to integrate it is to make it invisible to the Jarvis backend. You can create a small FastAPI server that wraps VLLM, intercepts the requests, performs tool calls, and then returns the final response. See this comment for a conceptual overview.
Jarvis 2.0 stands on the shoulders of giants π§βπ¬. This project would not be possible without the foundational work from the Kyutai team on Unmute. We extend our sincere thanks π to them for open-sourcing their high-performance audio pipeline, which serves as the backbone of this project.
This project is licensed under the MIT License. See the LICENSE file for details.