Talk to a 3D avatar that listens, replies in a real voice, lip-syncs, and remembers you — all running on your Mac.
Only the LLM call ever leaves your machine.
AI Companion is a locally-running, voice-driven companion. You speak; a friendly female upper-torso avatar in your browser listens, thinks, and talks back in a natural voice with real-time lip-sync. Over time it gets to know you — it asks questions, remembers your answers and the name you give it, and draws connections across conversations.
Everything runs on the host Mac — microphone noise suppression, voice-activity detection, speech-to-text, text-to-speech, the 3D avatar, and the memory store. The only thing that leaves your machine is the LLM call (and the small fact-extraction call for memory).
Pipecat is the brain; TalkingHead is the mouth. Pipecat owns mic → noise → VAD → STT → memory → LLM → sentence-chunking → Kokoro synthesis, plus barge-in detection. It does not play the TTS audio: each synthesized sentence (audio + word/phoneme timestamps) is shipped to the browser, which plays it through
TalkingHead.speakAudio() so visemes ride the audio clock. You can interrupt it at any time (barge-in).
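The sentence-chunking stage can be pictured with a minimal sketch like the one below. It is not the repo's implementation: `chunk_sentences` and its boundary regex are hypothetical, and the sketch assumes the LLM streams plain text fragments. It buffers fragments and emits a sentence as soon as terminal punctuation is followed by whitespace.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: ., ! or ? (plus optional closing quote/bracket), then whitespace.
_BOUNDARY = re.compile(r'([.!?]["\')\]]*)\s+')


def chunk_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of text fragments (hypothetical helper)."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _BOUNDARY.search(buffer)
            if match is None:
                break
            sentence = buffer[:match.end(1)].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever is left when the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


stream = ["Hi", " there", "!", " How", " are", " you", " today", "?", " I", " am", " fine."]
sentences = list(chunk_sentences(stream))
print(sentences)
```

This naive rule also splits after abbreviations such as "e.g. ", and it waits for the fragment after the punctuation before emitting, so a production chunker would likely need more care.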
- 🎙️ Real voice conversation — local Whisper STT, streaming LLM, local Kokoro TTS.
- 🧑‍🦰 3D talking avatar with phoneme-accurate lip-sync (TalkingHead + three.js).
- 🧠 Persistent memory — remembers facts about you and the name you give it, across sessions (Hindsight).
- ⚡ Low latency — ~1.5–2.5 s from "you stop" to "it speaks"; sentences pipeline as they stream.
- ✋ Barge-in — talk over it and it stops and listens.
- 🔒 Local-first & private — audio, models, and memory stay on your machine.
- 🧪 Mock mode — run the whole loop with no models and no API key.
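To make the lip-sync feature above more concrete, here is a rough sketch of a phoneme-to-viseme reduction. Everything in it is an assumption for illustration: the mapping table is a small hand-picked IPA subset, the Oculus-style viseme names are assumed rather than taken from the repo, and `reduce_to_visemes` spreads phonemes evenly across a word's time span instead of using real phoneme timestamps.

```python
# Hypothetical, incomplete table: IPA-like phonemes -> Oculus-style viseme names.
PHONEME_TO_VISEME = {
    "p": "PP", "b": "PP", "m": "PP",
    "f": "FF", "v": "FF",
    "θ": "TH", "ð": "TH",
    "t": "DD", "d": "DD",
    "k": "kk", "g": "kk",
    "s": "SS", "z": "SS",
    "ʃ": "CH", "ʒ": "CH",
    "n": "nn", "l": "nn",
    "ɹ": "RR", "r": "RR",
    "a": "aa", "ɑ": "aa", "æ": "aa",
    "e": "E", "ɛ": "E", "ə": "E",
    "i": "I", "ɪ": "I",
    "o": "O", "ɔ": "O",
    "u": "U", "ʊ": "U",
}


def reduce_to_visemes(phonemes, start, end):
    """Map phonemes to visemes over [start, end] seconds, merging adjacent repeats."""
    if not phonemes:
        return []
    step = (end - start) / len(phonemes)
    visemes = []
    for index, phoneme in enumerate(phonemes):
        # Assumption: unknown phonemes fall back to the silent mouth shape.
        name = PHONEME_TO_VISEME.get(phoneme, "sil")
        if visemes and visemes[-1]["viseme"] == name:
            visemes[-1]["duration"] += step
        else:
            visemes.append({"viseme": name, "time": start + index * step, "duration": step})
    return visemes


hello = reduce_to_visemes(["h", "ə", "l", "o", "ʊ"], start=0.0, end=0.5)
print([v["viseme"] for v in hello])
```

Merging repeated visemes keeps the mouth from re-triggering the same shape on consecutive phonemes such as "m" followed by "b".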
flowchart LR
MIC["🎙️ Browser mic"]:::edge
subgraph PC["Pipecat — local voice pipeline (your Mac)"]
direction LR
RNN["RNNoise<br/>noise suppression"] --> VAD["Silero VAD<br/>turns + barge-in"] --> ASR["Whisper<br/>speech to text"] --> MEM["Hindsight<br/>memory · optional"] --> LLM["LLM<br/>OpenAI · remote"]:::remote --> TTS["Kokoro TTS<br/>audio + timestamps"]
end
TH["TalkingHead<br/>3D avatar + lip-sync"]:::edge
MIC --> RNN
TTS --> TH
TH -. "barge in anytime" .-> MIC
classDef edge fill:#1b2a3a,stroke:#9ecbff,color:#e7eaee
classDef remote fill:#3a2330,stroke:#d06a86,color:#ffffff
The avatar plays each sentence as soon as it's synthesized, while the next one is still being
generated. Full design, rationale, and the hard-won gotchas are in
ai_companion_build_plan.md.
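The overlap described above (play one sentence while the next is synthesized) is a producer/consumer pattern. The sketch below illustrates it with `asyncio`; the function names and the sleep-based stand-ins for synthesis and playback are hypothetical, not the repo's code.

```python
import asyncio


async def synthesize(sentence: str) -> dict:
    """Stand-in for TTS: pretend synthesis takes a short, fixed time."""
    await asyncio.sleep(0.005)
    return {"text": sentence, "audio": b"\x00" * 10}


async def producer(sentences, queue, log):
    # Synthesize sentences one by one and hand each off as soon as it is ready.
    for sentence in sentences:
        packet = await synthesize(sentence)
        log.append(("synth", sentence))
        await queue.put(packet)
    await queue.put(None)  # sentinel: nothing more to play


async def player(queue, log):
    # Stand-in for the browser: playback takes longer than synthesis.
    while True:
        packet = await queue.get()
        if packet is None:
            break
        log.append(("play-start", packet["text"]))
        await asyncio.sleep(0.1)
        log.append(("play-end", packet["text"]))


async def run(sentences):
    queue = asyncio.Queue()
    log = []
    await asyncio.gather(producer(sentences, queue, log), player(queue, log))
    return log


log = asyncio.run(run(["One.", "Two.", "Three."]))
for event in log:
    print(event)
```

In the log, the second sentence finishes synthesis while the first is still playing, which is the effect the pipeline relies on to hide TTS latency.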
- macOS on Apple Silicon (for MLX/Metal Whisper; a CPU fallback works anywhere)
- Python 3.10–3.12 (Kokoro/torch need <3.13); this repo pins 3.12.8 via pyenv
- espeak-ng (Kokoro G2P): brew install espeak-ng
- Node/npm (bundles the Pipecat JS client; setup.sh runs the build)
- An OPENAI_API_KEY, the only required secret (powers the LLM and memory's fact extraction)
scripts/setup.sh # pyenv 3.12.8 + venv + deps + espeak-ng + frontend bundle
cp .env.example .env # add your OPENAI_API_KEY (set MEMORY_ENABLED=1 to enable memory)
scripts/run.sh # runs in this terminal (logs here); opens the browser
# stop with Ctrl-C

Open http://localhost:8080, click Start, and talk. The first run downloads the Whisper,
Kokoro, and Silero weights into ./models (they persist). 🎧 Headphones recommended — they stop
the avatar's voice from re-entering your mic.
Every stage has a mock, so you can verify the full transport + lip-sync loop with zero downloads:
. .venv/bin/activate
KOKORO_MOCK=1 ASR_ENGINE=mock LLM_MOCK=1 uvicorn backend.server:app --host 0.0.0.0 --port 8080

The avatar boots, connects, and lip-syncs a canned reply over a sine tone.
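For a sense of what a mock TTS stage might produce, the following sketch builds a sine-tone WAV with evenly spaced word timestamps using only the standard library. The packet fields (`text`, `audio_wav`, `words`), the 24 kHz sample rate, and the 0.3 s per word pacing are assumptions for illustration; the real schema lives in backend/protocol.py.

```python
import io
import math
import struct
import wave


def sine_wav(duration_s: float, freq_hz: float = 220.0, sample_rate: int = 24000) -> bytes:
    """Return a mono 16-bit PCM WAV containing a quiet sine tone."""
    n_frames = round(duration_s * sample_rate)
    frames = b"".join(
        struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(n_frames)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return buffer.getvalue()


def mock_packet(text: str, seconds_per_word: float = 0.3) -> dict:
    """Build a hypothetical sentence packet: tone audio plus fake word timings."""
    words = text.split()
    timestamps = [
        {"word": word, "start": i * seconds_per_word, "end": (i + 1) * seconds_per_word}
        for i, word in enumerate(words)
    ]
    return {
        "text": text,
        "audio_wav": sine_wav(len(words) * seconds_per_word),
        "words": timestamps,
    }


packet = mock_packet("Hello from mock mode")
print(len(packet["audio_wav"]), [w["word"] for w in packet["words"]])
```

Because the timestamps are derived from the same duration as the audio, a client can exercise its lip-sync path without any model being loaded.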
All configuration is via .env (see .env.example). Key knobs:
| Var | Default | Meaning |
|---|---|---|
| OPENAI_API_KEY | (none) | required (LLM + memory extraction) |
| OPENAI_MODEL | gpt-4o-mini | LLM model; keep it non-reasoning for low latency |
| ASR_ENGINE | whisper-mlx | whisper-mlx (Apple GPU) · whisper-cpu · mock |
| KOKORO_VOICE | af_heart | Kokoro voice id |
| WHISPER_NO_SPEECH_PROB | 0.6 | drop non-speech segments (anti-hallucination) |
| MEMORY_ENABLED | 0 | 1 = persistent Hindsight memory (recall + retain) |
| MEMORY_USER_ID | default | per-user memory bank |
| KOKORO_MOCK / LLM_MOCK | 0 | mock TTS / mock LLM for model-free runs |
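As an illustration of how these variables could be read, here is a minimal sketch of env parsing with the defaults from the table. The `Settings` class, `load_settings`, and the validation rules are hypothetical; the project's actual parsing is in backend/config.py and may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    openai_model: str
    asr_engine: str
    kokoro_voice: str
    whisper_no_speech_prob: float
    memory_enabled: bool
    memory_user_id: str
    kokoro_mock: bool
    llm_mock: bool


def _flag(env, name: str) -> bool:
    """Treat "1" as on and anything else (or unset) as off."""
    return env.get(name, "0").strip() == "1"


def load_settings(env=None) -> Settings:
    env = os.environ if env is None else env
    settings = Settings(
        openai_api_key=env.get("OPENAI_API_KEY", ""),
        openai_model=env.get("OPENAI_MODEL", "gpt-4o-mini"),
        asr_engine=env.get("ASR_ENGINE", "whisper-mlx"),
        kokoro_voice=env.get("KOKORO_VOICE", "af_heart"),
        whisper_no_speech_prob=float(env.get("WHISPER_NO_SPEECH_PROB", "0.6")),
        memory_enabled=_flag(env, "MEMORY_ENABLED"),
        memory_user_id=env.get("MEMORY_USER_ID", "default"),
        kokoro_mock=_flag(env, "KOKORO_MOCK"),
        llm_mock=_flag(env, "LLM_MOCK"),
    )
    if settings.asr_engine not in {"whisper-mlx", "whisper-cpu", "mock"}:
        raise ValueError(f"unknown ASR_ENGINE: {settings.asr_engine}")
    # Assumption: the key is only needed when a real LLM call will be made.
    if not settings.llm_mock and not settings.openai_api_key:
        raise ValueError("OPENAI_API_KEY is required unless LLM_MOCK=1")
    return settings


mock_settings = load_settings({"LLM_MOCK": "1", "ASR_ENGINE": "mock", "KOKORO_MOCK": "1"})
print(mock_settings)
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test without touching the real environment.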
Set MEMORY_ENABLED=1 and run.sh also starts a local Hindsight server (embedded
Postgres + pgvector, local embeddings). Before each reply it recalls relevant memories; after each
exchange it extracts and stores facts. Nothing about you leaves your machine except the small
OpenAI extraction call. Tell it your name and a few things about yourself, restart, and it'll greet
you knowing them.
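The recall-then-retain loop described above can be sketched with a toy in-memory store. This does not use or imitate the Hindsight API: `ToyMemory`, `extract_facts`, and `take_turn` are hypothetical stand-ins, keyword overlap replaces vector search, and a regex replaces the LLM extraction call.

```python
import re


def _words(text: str) -> set:
    return set(re.findall(r"[a-z']+", text.lower()))


class ToyMemory:
    """Toy stand-in for a memory store: keyword overlap instead of embeddings."""

    def __init__(self):
        self.facts = []

    def retain(self, fact: str) -> None:
        if fact not in self.facts:
            self.facts.append(fact)

    def recall(self, query: str, limit: int = 3) -> list:
        query_words = _words(query)
        scored = [(len(query_words & _words(fact)), fact) for fact in self.facts]
        ranked = sorted(scored, key=lambda pair: -pair[0])
        return [fact for score, fact in ranked if score > 0][:limit]


def extract_facts(user_text: str) -> list:
    """Stand-in for the LLM fact-extraction call: only recognizes a stated name."""
    match = re.search(r"\bmy name is (\w+)", user_text, re.IGNORECASE)
    return [f"The user's name is {match.group(1)}."] if match else []


def take_turn(memory: ToyMemory, user_text: str, reply_fn) -> str:
    recalled = memory.recall(user_text)      # before the reply: recall
    reply = reply_fn(user_text, recalled)
    for fact in extract_facts(user_text):    # after the exchange: retain
        memory.retain(fact)
    return reply


def fake_llm(user_text, recalled):
    return f"Known facts: {recalled}"


memory = ToyMemory()
first = take_turn(memory, "Hi, my name is Ana", fake_llm)
second = take_turn(memory, "Do you remember my name?", fake_llm)
print(first)
print(second)
```

The first turn has nothing to recall; the second turn sees the fact stored after the first, which mirrors the "restart and it greets you by name" behavior at a much smaller scale.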
backend/
server.py FastAPI: /ws, /health, /config, serves the frontend
main.py builds + runs one Pipecat pipeline per WS connection
config.py env parsing + the companion persona
protocol.py server→client message schemas (single source of truth)
kokoro_synth.py Kokoro synthesis + WAV helpers
pipeline/ transport · noise · vad · asr_whisper · llm · memory
tts_kokoro · interrupt · mock_services · timing
frontend/
index.html app.js avatar.js lipsync.js ui.js # source
package.json # esbuild → app.bundle.js
scripts/ setup.sh · run.sh · stop.sh
tests/ test_contract.py · lipsync_check.mjs
docs/ banner.svg · screenshot.png
. .venv/bin/activate
python -m py_compile backend/*.py backend/pipeline/*.py # everything compiles
python tests/test_contract.py # mock synth + packet schema
node tests/lipsync_check.mjs # phoneme→viseme reduction

Built with: Pipecat · TalkingHead · Kokoro TTS · Whisper (MLX / faster-whisper) · Silero VAD · three.js · Hindsight · OpenAI
The full architecture, every decision, the measured latency budget, and the debugging lessons live
in ai_companion_build_plan.md — written so someone could rebuild
this from scratch.
MIT © Xavier Anguera. Built on excellent open-source projects (see Built with); their respective licenses apply to those components.
