xanguera/aicompanion

AI Companion

python 3.12 · pipecat 1.4.0 · macOS Apple Silicon · local-first · license MIT

Talk to a 3D avatar that listens, replies in a real voice, lip-syncs, and remembers you — all running on your Mac.
Only the LLM call ever leaves your machine.

AI Companion in action — the avatar greeting the user with live captions


What is this?

AI Companion is a locally-running, voice-driven companion. You speak; a friendly female upper-torso avatar in your browser listens, thinks, and talks back in a natural voice with real-time lip-sync. Over time it gets to know you — it asks questions, remembers your answers and the name you give it, and draws connections across conversations.

Everything runs on the host Mac — microphone noise suppression, voice-activity detection, speech-to-text, text-to-speech, the 3D avatar, and the memory store. The only thing that leaves your machine is the LLM call (and the small fact-extraction call for memory).

Pipecat is the brain; TalkingHead is the mouth. Pipecat owns mic → noise → VAD → STT → memory → LLM → sentence-chunking → Kokoro synthesis and barge-in detection. It does not play the TTS audio — each synthesized sentence (audio + word/phoneme timestamps) is shipped to the browser, which plays it through TalkingHead.speakAudio() so visemes ride the audio clock. You can interrupt it at any time (barge-in).
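
As a rough illustration of the sentence-chunking step in that chain, the sketch below splits a stream of LLM text deltas into complete sentences so that each one can be handed to synthesis as soon as it ends. The name `chunk_sentences` and the punctuation rule are assumptions made for this example, not the repo's actual implementation.

```python
import re
from typing import Iterable, Iterator

# Assumed rule for this sketch: a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_sentences(deltas: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text deltas."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

Feeding it the deltas `"Hi th"`, `"ere. How "`, `"are you? I"`, `"'m fine"` yields three sentences, the first two before the stream has finished.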

✨ Features

  • 🎙️ Real voice conversation — local Whisper STT, streaming LLM, local Kokoro TTS.
  • 🧑‍🦰 3D talking avatar with phoneme-accurate lip-sync (TalkingHead + three.js).
  • 🧠 Persistent memory — remembers facts about you and the name you give it, across sessions (Hindsight).
  • Low latency — ~1.5–2.5 s from "you stop" to "it speaks"; sentences pipeline as they stream.
  • Barge-in — talk over it and it stops and listens.
  • 🔒 Local-first & private — audio, models, and memory stay on your machine.
  • 🧪 Mock mode — run the whole loop with no models and no API key.
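
To make the barge-in behaviour more concrete, here is a minimal sketch of a playback queue that drops everything still pending when the user starts talking. `SpeechQueue` and its methods are hypothetical names for illustration; the real interruption logic lives in the Pipecat pipeline and the browser.

```python
from collections import deque
from typing import Deque, List, Optional


class SpeechQueue:
    """Toy model of queued sentences with barge-in (hypothetical, not the repo's code)."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()
        self.speaking: Optional[str] = None

    def enqueue(self, sentence: str) -> None:
        self._pending.append(sentence)

    def play_next(self) -> Optional[str]:
        """Start the next queued sentence, or go idle if nothing is queued."""
        self.speaking = self._pending.popleft() if self._pending else None
        return self.speaking

    def barge_in(self) -> List[str]:
        """User spoke: stop the current sentence and drop the rest. Returns what was cut."""
        dropped = ([self.speaking] if self.speaking else []) + list(self._pending)
        self._pending.clear()
        self.speaking = None
        return dropped
```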

🎬 How it works

flowchart LR
    MIC["🎙️ Browser mic"]:::edge
    subgraph PC["Pipecat — local voice pipeline (your Mac)"]
        direction LR
        RNN["RNNoise<br/>noise suppression"] --> VAD["Silero VAD<br/>turns + barge-in"] --> ASR["Whisper<br/>speech to text"] --> MEM["Hindsight<br/>memory · optional"] --> LLM["LLM<br/>OpenAI · remote"]:::remote --> TTS["Kokoro TTS<br/>audio + timestamps"]
    end
    TH["TalkingHead<br/>3D avatar + lip-sync"]:::edge
    MIC --> RNN
    TTS --> TH
    TH -. "barge in anytime" .-> MIC
    classDef edge fill:#1b2a3a,stroke:#9ecbff,color:#e7eaee
    classDef remote fill:#3a2330,stroke:#d06a86,color:#ffffff

The avatar plays each sentence as soon as it's synthesized, while the next one is still being generated. Full design, rationale, and the hard-won gotchas are in ai_companion_build_plan.md.
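
The overlap between playback and synthesis can be sketched with a producer/consumer pair around an `asyncio.Queue`. Everything here (`synthesize`, `run_pipeline`, the sleep durations) is a stand-in chosen for the example. In the real system, synthesis is Kokoro on the backend and playback happens in the browser.

```python
import asyncio
from typing import List, Optional


async def synthesize(sentence: str) -> str:
    """Stand-in for TTS: pretend synthesis takes a little time."""
    await asyncio.sleep(0.01)
    return f"audio({sentence})"


async def producer(sentences: List[str], queue: asyncio.Queue, log: List[str]) -> None:
    for sentence in sentences:
        audio = await synthesize(sentence)
        log.append(f"synth:{sentence}")
        await queue.put(audio)  # hand off immediately, then start the next sentence
    await queue.put(None)  # sentinel: nothing more to play


async def consumer(queue: asyncio.Queue, log: List[str]) -> None:
    while True:
        audio: Optional[str] = await queue.get()
        if audio is None:
            break
        log.append(f"play:{audio}")
        await asyncio.sleep(0.03)  # stand-in for playback time


async def run_pipeline(sentences: List[str]) -> List[str]:
    queue: asyncio.Queue = asyncio.Queue()
    log: List[str] = []
    # Both coroutines run concurrently, so synthesis continues during playback.
    await asyncio.gather(producer(sentences, queue, log), consumer(queue, log))
    return log
```

Calling `asyncio.run(run_pipeline(["A.", "B.", "C."]))` returns an event log. Sentences always play in order, and later sentences are typically synthesized while an earlier one is still playing.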

📋 Requirements

  • macOS on Apple Silicon (for MLX/Metal Whisper; a CPU fallback works anywhere)
  • Python 3.10–3.12 (Kokoro/torch need <3.13); this repo pins 3.12.8 via pyenv
  • espeak-ng (Kokoro G2P): brew install espeak-ng
  • Node/npm (bundles the Pipecat JS client — setup.sh runs the build)
  • An OPENAI_API_KEY — the only required secret (powers the LLM and memory's fact extraction)

🚀 Quick start

scripts/setup.sh          # pyenv 3.12.8 + venv + deps + espeak-ng + frontend bundle
cp .env.example .env      # add your OPENAI_API_KEY (set MEMORY_ENABLED=1 to enable memory)
scripts/run.sh            # runs in this terminal (logs here); opens the browser
# stop with Ctrl-C

Open http://localhost:8080, click Start, and talk. The first run downloads the Whisper, Kokoro, and Silero weights into ./models (they persist). 🎧 Headphones recommended — they stop the avatar's voice from re-entering your mic.

Try it with no models or API key

Every stage has a mock, so you can verify the full transport + lip-sync loop with zero downloads:

. .venv/bin/activate
KOKORO_MOCK=1 ASR_ENGINE=mock LLM_MOCK=1 uvicorn backend.server:app --host 0.0.0.0 --port 8080

The avatar boots, connects, and lip-syncs a canned reply over a sine tone.
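
For a sense of what a mock synthesizer might emit, the sketch below builds a short sine-tone WAV using only the standard library. The function name `sine_wav`, the 220 Hz pitch, and the 24 kHz sample rate are assumptions for illustration; the repo's own mock lives in backend/pipeline/mock_services and may differ.

```python
import io
import math
import struct
import wave


def sine_wav(freq_hz: float = 220.0, seconds: float = 0.5, sample_rate: int = 24000) -> bytes:
    """Return a mono 16-bit PCM WAV file (as bytes) containing a sine tone."""
    n_samples = int(seconds * sample_rate)
    frames = bytearray()
    for i in range(n_samples):
        # Amplitude 0.3 keeps the tone well below clipping.
        sample = 0.3 * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        frames += struct.pack("<h", int(sample * 32767))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return buf.getvalue()
```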

⚙️ Configuration

All via .env (see .env.example). Key knobs:

| Var | Default | Meaning |
| --- | --- | --- |
| OPENAI_API_KEY | | required (LLM + memory extraction) |
| OPENAI_MODEL | gpt-4o-mini | LLM model; keep it non-reasoning for low latency |
| ASR_ENGINE | whisper-mlx | whisper-mlx (Apple GPU) · whisper-cpu · mock |
| KOKORO_VOICE | af_heart | Kokoro voice id |
| WHISPER_NO_SPEECH_PROB | 0.6 | drop non-speech segments (anti-hallucination) |
| MEMORY_ENABLED | 0 | 1 = persistent Hindsight memory (recall + retain) |
| MEMORY_USER_ID | default | per-user memory bank |
| KOKORO_MOCK / LLM_MOCK | 0 | mock TTS / mock LLM for model-free runs |
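
One way these variables could be read into a typed settings object is sketched below. The `Settings` class and `load_settings` function are hypothetical, and the rule that the key is only required when `LLM_MOCK` is off is an assumption inferred from the mock-mode section. The repo's real parsing is in backend/config.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

ASR_ENGINES = {"whisper-mlx", "whisper-cpu", "mock"}


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    openai_model: str
    asr_engine: str
    kokoro_voice: str
    whisper_no_speech_prob: float
    memory_enabled: bool
    memory_user_id: str
    kokoro_mock: bool
    llm_mock: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from an environment mapping, using the defaults listed above."""
    env = os.environ if env is None else env

    def flag(name: str) -> bool:
        return env.get(name, "0") == "1"

    asr_engine = env.get("ASR_ENGINE", "whisper-mlx")
    if asr_engine not in ASR_ENGINES:
        raise ValueError(f"ASR_ENGINE must be one of {sorted(ASR_ENGINES)}")
    key = env.get("OPENAI_API_KEY", "")
    # Assumption: the key is only needed when the real LLM is in use.
    if not key and not flag("LLM_MOCK"):
        raise ValueError("OPENAI_API_KEY is required unless LLM_MOCK=1")
    return Settings(
        openai_api_key=key,
        openai_model=env.get("OPENAI_MODEL", "gpt-4o-mini"),
        asr_engine=asr_engine,
        kokoro_voice=env.get("KOKORO_VOICE", "af_heart"),
        whisper_no_speech_prob=float(env.get("WHISPER_NO_SPEECH_PROB", "0.6")),
        memory_enabled=flag("MEMORY_ENABLED"),
        memory_user_id=env.get("MEMORY_USER_ID", "default"),
        kokoro_mock=flag("KOKORO_MOCK"),
        llm_mock=flag("LLM_MOCK"),
    )
```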

🧠 How memory works

With MEMORY_ENABLED=1, run.sh also starts a local Hindsight server (embedded Postgres + pgvector, local embeddings). Before each reply, the companion recalls relevant memories; after each exchange, it extracts and stores new facts. Nothing about you leaves your machine except the small OpenAI extraction call. Tell it your name and a few things about yourself, restart, and it'll greet you knowing them.
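
The recall-before, retain-after loop can be sketched with a toy in-memory store. `MemoryBank`, `respond`, and the word-overlap ranking are illustrative assumptions only. This is not the Hindsight API, which uses embeddings and a real database, and in the actual system fact extraction is an LLM call rather than the stub passed in here.

```python
from typing import Callable, List


class MemoryBank:
    """Toy per-user fact store ranked by word overlap (not the Hindsight API)."""

    def __init__(self) -> None:
        self._facts: List[str] = []

    def retain(self, facts: List[str]) -> None:
        for fact in facts:
            if fact not in self._facts:
                self._facts.append(fact)

    def recall(self, query: str, limit: int = 3) -> List[str]:
        words = set(query.lower().split())
        scored = [(len(words & set(f.lower().split())), f) for f in self._facts]
        # Keep only facts sharing at least one word, best matches first.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [fact for _, fact in ranked[:limit]]


def respond(
    user_text: str,
    bank: MemoryBank,
    llm: Callable[[str, List[str]], str],
    extract_facts: Callable[[str, str], List[str]],
) -> str:
    memories = bank.recall(user_text)             # before the reply
    reply = llm(user_text, memories)
    bank.retain(extract_facts(user_text, reply))  # after the exchange
    return reply
```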

🗂️ Project layout

backend/
  server.py            FastAPI: /ws, /health, /config, serves the frontend
  main.py              builds + runs one Pipecat pipeline per WS connection
  config.py            env parsing + the companion persona
  protocol.py          server→client message schemas (single source of truth)
  kokoro_synth.py      Kokoro synthesis + WAV helpers
  pipeline/            transport · noise · vad · asr_whisper · llm · memory
                       tts_kokoro · interrupt · mock_services · timing
frontend/
  index.html app.js avatar.js lipsync.js ui.js   # source
  package.json                                    # esbuild → app.bundle.js
scripts/  setup.sh · run.sh · stop.sh
tests/    test_contract.py · lipsync_check.mjs
docs/     banner.svg · screenshot.png

✅ Verify

. .venv/bin/activate
python -m py_compile backend/*.py backend/pipeline/*.py    # everything compiles
python tests/test_contract.py                              # mock synth + packet schema
node tests/lipsync_check.mjs                                # phoneme→viseme reduction
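
For readers unfamiliar with phoneme→viseme reduction, here is a small sketch of the idea: map each timed phoneme to a mouth shape and merge adjacent repeats. The table is a deliberately tiny, hypothetical subset using Oculus-style viseme names, and `reduce_to_visemes` is an illustrative name. The project's real mapping is in frontend/lipsync.js and is what the Node check exercises.

```python
from typing import List, Tuple

# Hypothetical, incomplete phoneme -> viseme table (Oculus-style viseme names).
PHONEME_TO_VISEME = {
    "p": "PP", "b": "PP", "m": "PP",
    "f": "FF", "v": "FF",
    "t": "DD", "d": "DD",
    "k": "kk", "g": "kk",
    "s": "SS", "z": "SS",
    "n": "nn", "l": "nn",
    "r": "RR",
    "a": "aa", "e": "E", "i": "I", "o": "O", "u": "U",
}

Timed = Tuple[str, int, int]  # (symbol, start_ms, duration_ms)


def reduce_to_visemes(phonemes: List[Timed]) -> List[Timed]:
    """Map timed phonemes to visemes, merging back-to-back identical visemes."""
    out: List[Timed] = []
    for phoneme, start, duration in phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme)
        if viseme is None:
            continue  # phonemes missing from the table are skipped
        if out and out[-1][0] == viseme and out[-1][1] + out[-1][2] == start:
            # Same mouth shape with no gap: extend the previous entry.
            _, prev_start, prev_duration = out[-1]
            out[-1] = (viseme, prev_start, prev_duration + duration)
        else:
            out.append((viseme, start, duration))
    return out
```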

🛠️ Built with

Pipecat · TalkingHead · Kokoro TTS · Whisper (MLX / faster-whisper) · Silero VAD · three.js · Hindsight · OpenAI

📖 Design doc

The full architecture, every decision, the measured latency budget, and the debugging lessons live in ai_companion_build_plan.md — written so someone could rebuild this from scratch.

📄 License

MIT © Xavier Anguera. Built on excellent open-source projects (see Built with); their respective licenses apply to those components.
