GitHub - xanguera/aicompanion: Local-first, voice-driven AI companion — 3D avatar with lip-sync, streaming voice, and persistent memory. Runs on your Mac; only the LLM call is remote.

Talk to a 3D avatar that listens, replies in a real voice, lip-syncs, and remembers you — all running on your Mac.
_{Only the LLM call ever leaves your machine.}

What is this?

AI Companion is a locally-running, voice-driven companion. You speak; a friendly female upper-torso avatar in your browser listens, thinks, and talks back in a natural voice with real-time lip-sync. Over time it gets to know you — it asks questions, remembers your answers and the name you give it, and draws connections across conversations.

Everything runs on the host Mac — microphone noise suppression, voice-activity detection, speech-to-text, text-to-speech, the 3D avatar, and the memory store. The only thing that leaves your machine is the LLM call (and the small fact-extraction call for memory).

Pipecat is the brain; TalkingHead is the mouth. Pipecat owns mic → noise → VAD → STT → memory → LLM → sentence-chunking → Kokoro synthesis and barge-in detection. It does not play the TTS audio — each synthesized sentence (audio + word/phoneme timestamps) is shipped to the browser, which plays it through TalkingHead.speakAudio() so visemes ride the audio clock. You can interrupt it at any time (barge-in).

✨ Features

🎙️ Real voice conversation — local Whisper STT, streaming LLM, local Kokoro TTS.
🧑‍🦰 3D talking avatar with phoneme-accurate lip-sync (TalkingHead + three.js).
🧠 Persistent memory — remembers facts about you and the name you give it, across sessions (Hindsight).
⚡ Low latency — ~1.5–2.5 s from "you stop" to "it speaks"; sentences pipeline as they stream.
✋ Barge-in — talk over it and it stops and listens.
🔒 Local-first & private — audio, models, and memory stay on your machine.
🧪 Mock mode — run the whole loop with no models and no API key.

🎬 How it works

flowchart LR
    MIC["🎙️ Browser mic"]:::edge
    subgraph PC["Pipecat — local voice pipeline (your Mac)"]
        direction LR
        RNN["RNNoise<br/>noise suppression"] --> VAD["Silero VAD<br/>turns + barge-in"] --> ASR["Whisper<br/>speech to text"] --> MEM["Hindsight<br/>memory · optional"] --> LLM["LLM<br/>OpenAI · remote"]:::remote --> TTS["Kokoro TTS<br/>audio + timestamps"]
    end
    TH["TalkingHead<br/>3D avatar + lip-sync"]:::edge
    MIC --> RNN
    TTS --> TH
    TH -. "barge in anytime" .-> MIC
    classDef edge fill:#1b2a3a,stroke:#9ecbff,color:#e7eaee
    classDef remote fill:#3a2330,stroke:#d06a86,color:#ffffff

The avatar plays each sentence as soon as it's synthesized, while the next one is still being generated. Full design, rationale, and the hard-won gotchas are in ai_companion_build_plan.md.

📋 Requirements

macOS on Apple Silicon (for MLX/Metal Whisper; a CPU fallback works anywhere)
Python 3.10–3.12 (Kokoro/torch need <3.13); this repo pins 3.12.8 via pyenv
espeak-ng (Kokoro G2P): brew install espeak-ng
Node/npm (bundles the Pipecat JS client — setup.sh runs the build)
An OPENAI_API_KEY — the only required secret (powers the LLM and memory's fact extraction)

🚀 Quick start

scripts/setup.sh          # pyenv 3.12.8 + venv + deps + espeak-ng + frontend bundle
cp .env.example .env      # add your OPENAI_API_KEY (set MEMORY_ENABLED=1 to enable memory)
scripts/run.sh            # runs in this terminal (logs here); opens the browser
# stop with Ctrl-C

Open http://localhost:8080, click Start, and talk. The first run downloads the Whisper, Kokoro, and Silero weights into ./models (they persist). 🎧 Headphones recommended — they stop the avatar's voice from re-entering your mic.

Try it with no models or API key

Every stage has a mock, so you can verify the full transport + lip-sync loop with zero downloads:

. .venv/bin/activate
KOKORO_MOCK=1 ASR_ENGINE=mock LLM_MOCK=1 uvicorn backend.server:app --host 0.0.0.0 --port 8080

The avatar boots, connects, and lip-syncs a canned reply over a sine tone.

⚙️ Configuration

All via .env (see .env.example). Key knobs:

Var	Default	Meaning
`OPENAI_API_KEY`	—	required (LLM + memory extraction)
`OPENAI_MODEL`	`gpt-4o-mini`	LLM model — keep it non-reasoning for low latency
`ASR_ENGINE`	`whisper-mlx`	`whisper-mlx` (Apple GPU) · `whisper-cpu` · `mock`
`KOKORO_VOICE`	`af_heart`	Kokoro voice id
`WHISPER_NO_SPEECH_PROB`	`0.6`	drop non-speech segments (anti-hallucination)
`MEMORY_ENABLED`	`0`	`1` = persistent Hindsight memory (recall + retain)
`MEMORY_USER_ID`	`default`	per-user memory bank
`KOKORO_MOCK` / `LLM_MOCK`	`0`	mock TTS / mock LLM for model-free runs

🧠 How memory works

Set MEMORY_ENABLED=1 and run.sh also starts a local Hindsight server (embedded Postgres + pgvector, local embeddings). Before each reply it recalls relevant memories; after each exchange it extracts and stores facts. Nothing about you leaves your machine except the small OpenAI extraction call. Tell it your name and a few things about yourself, restart, and it'll greet you knowing them.

🗂️ Project layout

backend/
  server.py            FastAPI: /ws, /health, /config, serves the frontend
  main.py              builds + runs one Pipecat pipeline per WS connection
  config.py            env parsing + the companion persona
  protocol.py          server→client message schemas (single source of truth)
  kokoro_synth.py      Kokoro synthesis + WAV helpers
  pipeline/            transport · noise · vad · asr_whisper · llm · memory
                       tts_kokoro · interrupt · mock_services · timing
frontend/
  index.html app.js avatar.js lipsync.js ui.js   # source
  package.json                                    # esbuild → app.bundle.js
scripts/  setup.sh · run.sh · stop.sh
tests/    test_contract.py · lipsync_check.mjs
docs/     banner.svg · screenshot.png

✅ Verify

. .venv/bin/activate
python -m py_compile backend/*.py backend/pipeline/*.py    # everything compiles
python tests/test_contract.py                              # mock synth + packet schema
node tests/lipsync_check.mjs                                # phoneme→viseme reduction

🛠️ Built with

Pipecat · TalkingHead · Kokoro TTS · Whisper (MLX / faster-whisper) · Silero VAD · three.js · Hindsight · OpenAI

📖 Design doc

The full architecture, every decision, the measured latency budget, and the debugging lessons live in ai_companion_build_plan.md — written so someone could rebuild this from scratch.

📄 License

MIT © Xavier Anguera. Built on excellent open-source projects (see Built with); their respective licenses apply to those components.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
backend		backend
docs		docs
frontend		frontend
plan		plan
scripts		scripts
tasks		tasks
tests		tests
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

What is this?

✨ Features

🎬 How it works

📋 Requirements

🚀 Quick start

Try it with no models or API key

⚙️ Configuration

🧠 How memory works

🗂️ Project layout

✅ Verify

🛠️ Built with

📖 Design doc

📄 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

What is this?

✨ Features

🎬 How it works

📋 Requirements

🚀 Quick start

Try it with no models or API key

⚙️ Configuration

🧠 How memory works

🗂️ Project layout

✅ Verify

🛠️ Built with

📖 Design doc

📄 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages