An advanced AI-powered desktop assistant that understands voice commands, analyzes your screen, and performs intelligent automation
CoBrain is a sophisticated desktop agent that combines voice recognition, computer vision, and AI automation to create an intelligent assistant that truly understands your digital environment. It's like having a personal AI that can see your screen, understand your voice commands, and take actions on your behalf.
- 🎤 Voice-Activated: Wake word detection with natural speech processing
- 👁️ Screen Understanding: AI vision that can analyze what's on your screen
- 🤖 Smart Automation: Performs complex macOS automation tasks
- 💬 Contextual Responses: Answers questions using both knowledge and screen context
- 🔍 Visual Search: Analyzes highlighted text, errors, and screen content
- ⚡ Intent Detection: Automatically distinguishes between questions and actions
- 🪟 Floating UI: Beautiful, transparent, always-on-top interface
- Wake Word Detection: Just say "CoBrain" to activate
- Real-time Transcription: Powered by Deepgram's Nova-3 model
- Conversational Filtering: Ignores casual conversations automatically
- Multi-language Support: Understands natural speech patterns
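The real-time transcription above is handled by `transcription-handler.js`; its internals aren't shown in this README, so here is a minimal sketch assuming the official `@deepgram/sdk` (v3+). The function name and audio wiring are illustrative, not the project's exact code:

```javascript
// Sketch: live transcription with Deepgram Nova-3 (assumes @deepgram/sdk v3+).
// Function name and audio source are hypothetical.
const { createClient, LiveTranscriptionEvents } = require('@deepgram/sdk');

function startLiveTranscription(onFinalTranscript) {
  const deepgram = createClient(process.env.DEEPGRAM_API_KEY);
  const connection = deepgram.listen.live({
    model: 'nova-3',
    smart_format: true,
    interim_results: true,
  });

  connection.on(LiveTranscriptionEvents.Transcript, (data) => {
    const text = data.channel?.alternatives?.[0]?.transcript || '';
    if (text && data.is_final) onFinalTranscript(text);
  });

  // The caller feeds raw microphone audio, e.g. micStream.on('data', (c) => connection.send(c));
  return connection;
}

module.exports = { startLiveTranscription };
```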
- Screenshot Analysis: AI can see and understand your current screen
- Highlighted Text Recognition: Explain selected content instantly
- Error Detection: Automatically opens relevant help for coding errors
- Visual Context: Combines screen content with your questions
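To make the screen-analysis flow above concrete, here is a rough sketch of capturing the screen from Electron's main process and sending it to a vision-capable OpenAI model. It is not the project's actual `chatgpt-handler.js` code; the model name and prompt handling are placeholders:

```javascript
// Sketch: capture the primary screen and ask a vision-capable model about it.
// Not the project's exact implementation.
const { desktopCapturer, screen } = require('electron');
const OpenAI = require('openai');

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function describeScreen(question) {
  const { width, height } = screen.getPrimaryDisplay().size;
  const sources = await desktopCapturer.getSources({
    types: ['screen'],
    thumbnailSize: { width, height },
  });
  const dataUrl = sources[0].thumbnail.toDataURL(); // base64 PNG data URL

  const response = await openai.chat.completions.create({
    model: 'gpt-4o', // placeholder: any vision-capable model
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: question },
          { type: 'image_url', image_url: { url: dataUrl } },
        ],
      },
    ],
  });
  return response.choices[0].message.content;
}
```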
- macOS Integration: Uses MCP (Model Context Protocol) for system control
- Application Control: Open, close, and manage applications
- File Operations: Git operations, file management, project navigation
- Cursor IDE Integration: Special error handling and AI chat activation
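The README doesn't spell out how `main.js` hands actions to the automation agent; one plausible sketch, assuming `agent.ts` is invoked as a child process the way the Troubleshooting section's `npx tsx agent.ts "test command"` implies:

```javascript
// Sketch: run the automation agent as a child process and collect its output.
// The real IPC between main.js and agent.ts may differ.
const { execFile } = require('child_process');

function runAutomation(command) {
  return new Promise((resolve, reject) => {
    execFile('npx', ['tsx', 'agent.ts', command], { cwd: __dirname }, (err, stdout, stderr) => {
      if (err) return reject(new Error(stderr || err.message));
      // The agent logs lines like "AUTOMATION_ACTION: ..." (see Troubleshooting).
      resolve(stdout.trim());
    });
  });
}

// Example: runAutomation('Open browser').then(console.log).catch(console.error);
module.exports = { runAutomation };
```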
- GPT-4 Vision: Multi-modal AI that processes text and images
- Web Search Integration: Access to real-time information
- Context Awareness: Remembers conversation history
- Intent Classification: Smart routing between questions and actions
- Transparent Widget: Elegant floating interface
- Status Indicators: Visual feedback for all operations
- Dynamic Expansion: UI adapts based on content
- Click-through Mode: Non-intrusive when not in use
- Drag & Drop: Repositionable interface
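The floating UI above corresponds to standard Electron `BrowserWindow` options. A minimal sketch; the sizes and option values are illustrative rather than the project's exact settings:

```javascript
// Sketch: a transparent, frameless, always-on-top widget window in Electron.
const { app, BrowserWindow } = require('electron');

function createWidget() {
  const win = new BrowserWindow({
    width: 360,
    height: 120,
    frame: false,        // no OS chrome
    transparent: true,   // see-through background
    alwaysOnTop: true,   // stays above other windows
    resizable: false,
    hasShadow: false,
    webPreferences: { contextIsolation: true },
  });
  win.loadFile('index.html');
  return win;
}

app.whenReady().then(createWidget);
```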
- Node.js v16+
- Python v3.8+
- macOS (required for automation features)
- Docker (optional, for Qdrant vector search)
- Clone and Install

  ```bash
  git clone <repository-url>
  cd desktop-agent
  npm install
  ```

- Setup Python Environment

  ```bash
  # macOS/Linux
  ./setup-python.sh

  # Windows
  setup-python.bat
  ```

- Install Additional Dependencies

  ```bash
  # Install OpenAI Agents framework
  npm install @openai/agents @openai/agents-openai

  # Install TypeScript runtime
  npm install --save-dev @types/node tsx
  ```

- Configure Environment

  Create a `.env` file with your API keys:

  ```env
  # Required API Keys
  OPENAI_API_KEY=your_openai_api_key_here
  DEEPGRAM_API_KEY=your_deepgram_api_key_here

  # Optional Configuration
  WAKE_WORD_MODEL=alexa_v0.1.onnx
  SPEECH_COMPLETION_DELAY=2000
  SCREENSHOT_CAPTURE_ENABLED=true
  QDRANT_URL=http://localhost:6333
  ```
- Optional: Setup Qdrant (for browsing history)

  ```bash
  docker run -d -p 6333:6333 -p 6334:6334 --name qdrant qdrant/qdrant
  ```
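With Qdrant running, the browsing-history lookup boils down to a vector search against this instance. A rough sketch using the official `@qdrant/js-client-rest` client; the collection name and embedding model are assumptions, not values taken from this project:

```javascript
// Sketch: query Qdrant for stored screen/browsing snippets similar to a question.
// Collection name and embedding model are hypothetical.
const { QdrantClient } = require('@qdrant/js-client-rest');
const OpenAI = require('openai');

const qdrant = new QdrantClient({ url: process.env.QDRANT_URL || 'http://localhost:6333' });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function searchHistory(query) {
  // Embed the query text first.
  const emb = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query,
  });

  // Return the closest stored snippets.
  return qdrant.search('browsing_history', {
    vector: emb.data[0].embedding,
    limit: 5,
  });
}
```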
- OpenAI API: platform.openai.com/api-keys
- Deepgram API: console.deepgram.com
Start the agent:

```bash
npm start
```

- Activation: Click "Start" or the app auto-starts
- Wake Word: Say "CoBrain" to activate listening
- Command: Speak your question or action request
- Response: Get intelligent responses or automated actions
"What is this?" (analyzes current screen)
"Explain this error" (opens Cursor AI if in IDE)
"What's the weather today?"
"Who is the president of America?"
"Open browser"
"Clone this repo" (gets URL from browser)
"Pull latest repo and open it" (compound commands)
"Close this window"
"Take a screenshot"
"Tell him I'll reply later"
"Let them know I'm busy"
"I'll talk to you soon"
- Highlight text on any webpage and ask "What does this mean?"
- Error debugging: Ask about errors while in Cursor IDE
- Visual questions: "What's on my screen?" "Describe this interface"
- "Latest repo" =
~/Desktop/demo/desktop-agent - "Open it" = Opens in Cursor IDE
- Multi-step commands: Executes each step sequentially
- 🖱️ Button: Toggle click-through mode manually
- Drag anywhere: Reposition the floating widget
- Auto-expansion: UI grows/shrinks based on content
- Smart hiding: Becomes transparent when not needed
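The click-through toggle above maps onto Electron's `setIgnoreMouseEvents`. A sketch of how the main process might handle a toggle request from the renderer; the IPC channel name is hypothetical:

```javascript
// Sketch: toggling click-through mode via IPC (channel name is illustrative).
const { ipcMain, BrowserWindow } = require('electron');

let clickThrough = false;
ipcMain.on('toggle-click-through', (event) => {
  clickThrough = !clickThrough;
  const win = BrowserWindow.fromWebContents(event.sender);
  // forward: true keeps delivering mouse-move events so hover styling still works
  win.setIgnoreMouseEvents(clickThrough, { forward: true });
});

// renderer.js would send the request, e.g. ipcRenderer.send('toggle-click-through');
```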
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ main.js │ │ chatgpt- │ │ agent.ts │
│ (orchestrator) │◄─│ handler.js │◄─│ (automation) │
│ │ │ (AI brain) │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
▲ ▲ ▲
│ │ │
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ transcription- │ │ screenpipe- │ │ wakeword- │
│ handler.js │ │ handler.js │ │ handler.js │
│ (Deepgram) │ │ (Qdrant + OCR) │ │ (OpenWakeWord) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
- Audio Input → Wake Word Detection
- Activation → Speech Transcription
- Intent Detection → Question vs Action routing
- Screen Capture → Visual context analysis
- AI Processing → GPT-4 with vision/tools
- Response/Action → UI display or system automation
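Putting the diagram and the flow above together, the orchestration in `main.js` conceptually reduces to the sketch below. The handler modules are real files in this repo, but the imported function names and the toy intent check are hypothetical:

```javascript
// Conceptual sketch of the main.js pipeline; function names are hypothetical.
const { onWakeWord } = require('./wakeword-handler');
const { transcribeUtterance } = require('./transcription-handler');
const { captureScreenContext } = require('./screenpipe-handler');
const { askWithVision } = require('./chatgpt-handler');
const { runAutomation } = require('./agent-bridge'); // hypothetical bridge to agent.ts

// Toy stand-in for the real intent classifier.
const detectIntent = (text) =>
  /^(open|close|clone|pull|take|run)\b/i.test(text) ? 'action' : 'question';

onWakeWord(async () => {
  const text = await transcribeUtterance();              // speech → text
  if (detectIntent(text) === 'action') {
    await runAutomation(text);                           // system automation
  } else {
    const screenshot = await captureScreenContext();     // visual context
    const reply = await askWithVision(text, screenshot); // GPT-4 with vision
    console.log(reply);                                  // shown in the widget in practice
  }
});
```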
desktop-agent/
├── 🎛️ Core Engine
│ ├── main.js # Main orchestrator
│ ├── renderer.js # UI controller
│ └── index.html # Interface
├── 🤖 AI Components
│ ├── chatgpt-handler.js # OpenAI integration
│ ├── agent.ts # Automation agent
│ └── screenpipe-handler.js # Visual context
├── 🎤 Audio Processing
│ ├── wakeword-handler.js # Wake word detection
│ ├── transcription-handler.js # Speech-to-text
│ └── wakeword_detector.py # Python wake word
├── 🔧 Configuration
│ ├── package.json # Node dependencies
│ ├── requirements.txt # Python packages
│ └── .env # API keys & settings
└── 📁 Data
├── screenshots/ # Screen captures
├── temp/ # Temporary files
└── *.onnx # Wake word models
```env
# 🔑 API Keys (Required)
OPENAI_API_KEY=sk-...                   # OpenAI API access
DEEPGRAM_API_KEY=...                    # Speech transcription

# 🎤 Audio Settings
WAKE_WORD_MODEL=alexa_v0.1.onnx         # Wake word model
SPEECH_COMPLETION_DELAY=2000            # Delay before processing (ms)

# 📸 Screenshot Settings
SCREENSHOT_CAPTURE_ENABLED=true         # Enable screen analysis
SCREENSHOT_CAPTURE_INTERVAL=5           # Periodic capture interval

# 🔍 Search Integration
QDRANT_URL=http://localhost:6333        # Vector database URL

# 🎨 UI Settings
TRANSPARENCY_LEVEL=0.9                  # Window transparency
ALWAYS_ON_TOP=true                      # Keep widget visible
```
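The README doesn't show how these variables are consumed; a minimal sketch of loading them on the Node side, assuming the standard `dotenv` package (the helper module itself is hypothetical):

```javascript
// config.js (hypothetical helper) — loads .env values with fallbacks.
require('dotenv').config();

const config = {
  openaiApiKey: process.env.OPENAI_API_KEY,      // required
  deepgramApiKey: process.env.DEEPGRAM_API_KEY,  // required
  wakeWordModel: process.env.WAKE_WORD_MODEL || 'alexa_v0.1.onnx',
  speechCompletionDelay: Number(process.env.SPEECH_COMPLETION_DELAY || 2000),
  screenshotCaptureEnabled: process.env.SCREENSHOT_CAPTURE_ENABLED !== 'false',
  qdrantUrl: process.env.QDRANT_URL || 'http://localhost:6333',
};

if (!config.openaiApiKey || !config.deepgramApiKey) {
  throw new Error('OPENAI_API_KEY and DEEPGRAM_API_KEY must be set in .env');
}

module.exports = config;
```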
Replace the `.onnx` file and update `WAKE_WORD_MODEL`:

```env
# Available models: alexa, hey_jarvis, hey_siri, etc.
WAKE_WORD_MODEL=co_brain.onnx
```

Edit prompts in `chatgpt-handler.js` and `agent.ts`:
```javascript
// Make responses more/less verbose
systemPrompt: "You are a concise AI assistant..."
```

Customize CSS in `index.html`:
```css
.widget-container {
  background: rgba(10, 10, 15, 0.95);
  backdrop-filter: blur(10px);
}
```

🎤 Audio/Microphone Issues
- Check permissions: macOS → System Preferences → Security & Privacy → Microphone
- Test audio: `npm run test-audio`
- Restart audio: Stop/start the agent
- Check devices: Ensure correct microphone is selected
🐍 Python Environment Issues
```bash
# Test Python setup
npm run test-venv

# Recreate virtual environment
rm -rf venv
./setup-python.sh

# Manual troubleshooting
source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows
python -c "import openwakeword, pyaudio; print('OK')"
```

🔑 API Key Issues
- Verify keys: Check `.env` file format
- Test connectivity: App console shows connection status
- Check quotas: Ensure sufficient API credits
- Key format: OpenAI keys start with `sk-`; Deepgram keys are UUID format
🖼️ Screenshot/Vision Issues
- Permissions: Grant screen recording permissions to Terminal/app
- Test vision: Ask "What do you see?" with content visible
- Debug logs: Check console for screenshot capture messages
- Model limits: Ensure images aren't too large for GPT-4 Vision
🤖 Automation Issues
```bash
# Test agent framework
npx tsx agent.ts "test command"

# Check MCP server
npm install @steipete/macos-automator-mcp

# Debug automation
# Check console for "AUTOMATION_ACTION:" messages
```

Run with detailed logging:

```bash
DEBUG=* npm start
```

- Reduce screenshot frequency if system is slow
- Use smaller wake word models for faster response
- Disable Qdrant if not using browsing history features
- Adjust `SPEECH_COMPLETION_DELAY` for your speaking pace
```bash
# Test individual components
npm run test-venv        # Python environment
npm run test-audio       # Audio capture
npm run test-wakeword    # Wake word detection
npm run test-screenpipe  # Visual analysis

# Development mode with DevTools
npm run dev
```

- Voice Commands: Add patterns to `detectIntent()` in `main.js`
- Automation: Extend prompts in `agent.ts`
- UI Components: Modify `renderer.js` and `index.html`
- AI Capabilities: Enhance `chatgpt-handler.js`
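As a concrete example of the first extension point: the internals of `detectIntent()` aren't shown in this README, but adding a new voice-command pattern might look roughly like this; the pattern table and return values are assumptions:

```javascript
// Hypothetical shape of detectIntent() in main.js; extend ACTION_PATTERNS
// to teach the agent new action phrases.
const ACTION_PATTERNS = [
  /^open\s+/i,
  /^close\s+/i,
  /^clone\s+/i,
  /^take a screenshot/i,
  /^lock the screen/i, // example of a newly added pattern
];

function detectIntent(transcript) {
  const text = transcript.trim();
  return ACTION_PATTERNS.some((re) => re.test(text)) ? 'action' : 'question';
}

module.exports = { detectIntent };
```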
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Commit changes: `git commit -m 'Add amazing feature'`
- Push to branch: `git push origin feature/amazing-feature`
- Open a Pull Request
- Multi-language support for wake words and transcription
- Custom automation workflows with visual editor
- Plugin system for third-party integrations
- Voice training for improved wake word accuracy
- Batch operations for complex multi-step tasks
- Desktop notification integration
- Cross-platform support (Windows, Linux)
- ✅ Smart conversational filtering - Ignores casual speech
- ✅ Enhanced screen analysis - Better visual understanding
- ✅ Improved error handling - Cursor IDE integration
- ✅ Click-through interface - Non-intrusive UI mode
- ✅ Multi-step automation - Complex command sequences
- ✅ Intent classification - Smart question vs action routing
MIT License - see LICENSE file for details.
- OpenAI - For GPT-4 and API access
- Deepgram - For speech transcription technology
- Electron - For cross-platform desktop framework
- OpenWakeWord - For wake word detection
- Qdrant - For vector search capabilities
Built with ❤️ for the future of human-computer interaction