🧠 CoBrain - Intelligent Desktop Agent

An advanced AI-powered desktop assistant that understands voice commands, analyzes your screen, and performs intelligent automation

Electron OpenAI Node.js Python

CoBrain Demo


🌟 What is CoBrain?

CoBrain is a sophisticated desktop agent that combines voice recognition, computer vision, and AI automation to create an intelligent assistant that truly understands your digital environment. It's like having a personal AI that can see your screen, understand your voice commands, and take actions on your behalf.

🎯 Key Capabilities

  • 🎤 Voice-Activated: Wake word detection with natural speech processing
  • 👁️ Screen Understanding: AI vision that can analyze what's on your screen
  • 🤖 Smart Automation: Performs complex macOS automation tasks
  • 💬 Contextual Responses: Answers questions using both knowledge and screen context
  • 🔍 Visual Search: Analyzes highlighted text, errors, and screen content
  • Intent Detection: Automatically distinguishes between questions and actions
  • 🪟 Floating UI: Beautiful, transparent, always-on-top interface

🚀 Features

🎙️ Advanced Voice Processing

  • Wake Word Detection: Just say "CoBrain" to activate
  • Real-time Transcription: Powered by Deepgram's Nova-3 model (see the sketch below)
  • Conversational Filtering: Ignores casual conversations automatically
  • Multi-language Support: Understands natural speech patterns
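
The exact code in transcription-handler.js isn't reproduced in this README, but a minimal sketch of streaming audio to Deepgram's Nova-3 model via the @deepgram/sdk v3 live API might look like this (getMicStream() is a hypothetical helper standing in for microphone capture):

const { createClient, LiveTranscriptionEvents } = require("@deepgram/sdk");

const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

// Open a live transcription socket against the Nova-3 model.
const connection = deepgram.listen.live({ model: "nova-3", smart_format: true });

connection.on(LiveTranscriptionEvents.Open, () => {
  // getMicStream() is hypothetical: any source of raw PCM chunks works here.
  getMicStream().on("data", (chunk) => connection.send(chunk));
});

connection.on(LiveTranscriptionEvents.Transcript, (event) => {
  const text = event.channel.alternatives[0].transcript;
  if (text) console.log("Heard:", text);
});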

🖼️ Intelligent Screen Analysis

  • Screenshot Analysis: AI can see and understand your current screen (see the sketch below)
  • Highlighted Text Recognition: Explain selected content instantly
  • Error Detection: Automatically opens relevant help for coding errors
  • Visual Context: Combines screen content with your questions
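
The vision call itself is standard OpenAI chat-completions usage; a rough, self-contained sketch (the model name and file path are illustrative, not necessarily what chatgpt-handler.js uses) could look like:

const fs = require("fs");
const OpenAI = require("openai");

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function askAboutScreen(question, screenshotPath) {
  // Encode the captured screenshot as a base64 data URL.
  const image = fs.readFileSync(screenshotPath).toString("base64");
  const response = await openai.chat.completions.create({
    model: "gpt-4o",  // any vision-capable model; illustrative choice
    messages: [{
      role: "user",
      content: [
        { type: "text", text: question },
        { type: "image_url", image_url: { url: `data:image/png;base64,${image}` } },
      ],
    }],
  });
  return response.choices[0].message.content;
}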

🔄 Smart Automation

  • macOS Integration: Uses MCP (Model Context Protocol) for system control (see the sketch below)
  • Application Control: Open, close, and manage applications
  • File Operations: Git operations, file management, project navigation
  • Cursor IDE Integration: Special error handling and AI chat activation
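
How main.js hands a spoken action to the automation agent isn't documented here; one plausible wiring, modeled on the manual test command shown under Troubleshooting (npx tsx agent.ts "test command"), is to spawn the agent as a child process:

const { spawn } = require("child_process");

// Forward an action request to the TypeScript automation agent.
// Mirrors the manual CLI invocation: npx tsx agent.ts "<command>"
function runAutomation(command) {
  return new Promise((resolve, reject) => {
    const child = spawn("npx", ["tsx", "agent.ts", command], { cwd: __dirname });
    let output = "";
    child.stdout.on("data", (chunk) => (output += chunk));
    child.on("close", (code) =>
      code === 0 ? resolve(output) : reject(new Error(`agent exited with code ${code}`))
    );
  });
}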

🧠 AI-Powered Intelligence

  • GPT-4 Vision: Multi-modal AI that processes text and images
  • Web Search Integration: Access to real-time information
  • Context Awareness: Remembers conversation history
  • Intent Classification: Smart routing between questions and actions

🎨 Modern Interface

  • Transparent Widget: Elegant floating interface (see the Electron sketch below)
  • Status Indicators: Visual feedback for all operations
  • Dynamic Expansion: UI adapts based on content
  • Click-through Mode: Non-intrusive when not in use
  • Drag & Drop: Repositionable interface
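
These behaviors map onto standard Electron BrowserWindow options; a minimal sketch (window size and file name are illustrative) looks like:

const { app, BrowserWindow } = require("electron");

app.whenReady().then(() => {
  const widget = new BrowserWindow({
    width: 360,
    height: 140,
    frame: false,       // no native title bar
    transparent: true,  // see-through widget background
    alwaysOnTop: true,  // keep the widget above other windows
    resizable: false,
  });
  widget.loadFile("index.html");

  // Click-through mode: mouse events pass to whatever is underneath,
  // while { forward: true } still lets the renderer detect hover.
  widget.setIgnoreMouseEvents(true, { forward: true });
});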

🛠️ Installation

Prerequisites

  • Node.js v16+
  • Python v3.8+
  • macOS (required for automation features)
  • Docker (optional, for Qdrant vector search)

Quick Setup

  1. Clone and Install

    git clone <repository-url>
    cd desktop-agent
    npm install
  2. Set Up the Python Environment

    # macOS/Linux
    ./setup-python.sh
    
    # Windows
    setup-python.bat
  3. Install Additional Dependencies

    # Install OpenAI Agents framework
    npm install @openai/agents @openai/agents-openai
    
    # Install TypeScript runtime
    npm install --save-dev @types/node tsx
  4. Configure Environment

    Create a .env file with your API keys:

    # Required API Keys
    OPENAI_API_KEY=your_openai_api_key_here
    DEEPGRAM_API_KEY=your_deepgram_api_key_here
    
    # Optional Configuration
    WAKE_WORD_MODEL=alexa_v0.1.onnx
    SPEECH_COMPLETION_DELAY=2000
    SCREENSHOT_CAPTURE_ENABLED=true
    QDRANT_URL=http://localhost:6333
  5. Optional: Set Up Qdrant (for browsing history)

    docker run -d -p 6333:6333 -p 6334:6334 --name qdrant qdrant/qdrant
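
With the container running, the agent talks to Qdrant over HTTP. A quick connectivity check with the official JavaScript client (@qdrant/js-client-rest is assumed here; the README doesn't state which client the project uses) looks like:

const { QdrantClient } = require("@qdrant/js-client-rest");

const qdrant = new QdrantClient({ url: process.env.QDRANT_URL || "http://localhost:6333" });

// List existing collections as an "is Qdrant reachable?" smoke test.
qdrant.getCollections()
  .then((res) => console.log("Qdrant collections:", res.collections.map((c) => c.name)))
  .catch((err) => console.error("Qdrant not reachable:", err.message));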

Get API Keys

  • OpenAI API key: https://platform.openai.com/
  • Deepgram API key: https://console.deepgram.com/


🎮 Usage

Starting CoBrain

npm start

Basic Workflow

  1. Activation: Click "Start" or the app auto-starts
  2. Wake Word: Say "CoBrain" to activate listening
  3. Command: Speak your question or action request
  4. Response: Get intelligent responses or automated actions

Voice Commands

📖 Questions (Displayed in UI)

"What is this?" (analyzes current screen)
"Explain this error" (opens Cursor AI if in IDE)
"What's the weather today?"
"Who is the president of America?"

Actions (Executes automation)

"Open browser"
"Clone this repo" (gets URL from browser)
"Pull latest repo and open it" (compound commands)
"Close this window"
"Take a screenshot"

💬 Conversational (Ignored automatically)

"Tell him I'll reply later"
"Let them know I'm busy"
"I'll talk to you soon"

Advanced Features

Screen Analysis

  • Highlight text on any webpage and ask "What does this mean?"
  • Error debugging: Ask about errors while in Cursor IDE
  • Visual questions: "What's on my screen?" "Describe this interface"

Automation Shortcuts

  • "Latest repo" = ~/Desktop/demo/desktop-agent
  • "Open it" = Opens in Cursor IDE
  • Multi-step commands: Executes each step sequentially

UI Controls

  • 🖱️ Button: Toggle click-through mode manually
  • Drag anywhere: Reposition the floating widget
  • Auto-expansion: UI grows/shrinks based on content
  • Smart hiding: Becomes transparent when not needed

🏗️ Architecture

Core Components

┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│   main.js       │  │  chatgpt-       │  │   agent.ts      │
│  (orchestrator) │◄─│  handler.js     │◄─│ (automation)    │
│                 │  │  (AI brain)     │  │                 │
└─────────────────┘  └─────────────────┘  └─────────────────┘
         ▲                      ▲                      ▲
         │                      │                      │
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│ transcription-  │  │  screenpipe-    │  │   wakeword-     │
│ handler.js      │  │  handler.js     │  │   handler.js    │
│ (Deepgram)      │  │ (Qdrant + OCR)  │  │ (OpenWakeWord)  │
└─────────────────┘  └─────────────────┘  └─────────────────┘

Data Flow

  1. Audio Input → Wake Word Detection
  2. Activation → Speech Transcription
  3. Intent Detection → Question vs Action routing
  4. Screen Capture → Visual context analysis
  5. AI Processing → GPT-4 with vision/tools
  6. Response/Action → UI display or system automation

File Structure

desktop-agent/
├── 🎛️ Core Engine
│   ├── main.js                 # Main orchestrator
│   ├── renderer.js             # UI controller  
│   └── index.html              # Interface
├── 🤖 AI Components
│   ├── chatgpt-handler.js      # OpenAI integration
│   ├── agent.ts                # Automation agent
│   └── screenpipe-handler.js   # Visual context
├── 🎤 Audio Processing
│   ├── wakeword-handler.js     # Wake word detection
│   ├── transcription-handler.js # Speech-to-text
│   └── wakeword_detector.py    # Python wake word
├── 🔧 Configuration
│   ├── package.json            # Node dependencies
│   ├── requirements.txt        # Python packages
│   └── .env                    # API keys & settings
└── 📁 Data
    ├── screenshots/            # Screen captures
    ├── temp/                   # Temporary files
    └── *.onnx                  # Wake word models

⚙️ Configuration

Environment Variables

# 🔑 API Keys (Required)
OPENAI_API_KEY=sk-...                    # OpenAI API access
DEEPGRAM_API_KEY=...                     # Speech transcription

# 🎤 Audio Settings
WAKE_WORD_MODEL=alexa_v0.1.onnx         # Wake word model
SPEECH_COMPLETION_DELAY=2000             # Delay before processing (ms)

# 📸 Screenshot Settings  
SCREENSHOT_CAPTURE_ENABLED=true          # Enable screen analysis
SCREENSHOT_CAPTURE_INTERVAL=5            # Periodic capture interval

# 🔍 Search Integration
QDRANT_URL=http://localhost:6333         # Vector database URL

# 🎨 UI Settings
TRANSPARENCY_LEVEL=0.9                   # Window transparency
ALWAYS_ON_TOP=true                       # Keep widget visible
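
These settings would typically be loaded at startup with dotenv (an assumption here, given the project ships a .env file) and read from process.env:

require("dotenv").config();  // load .env into process.env

const config = {
  openaiKey: process.env.OPENAI_API_KEY,
  deepgramKey: process.env.DEEPGRAM_API_KEY,
  wakeWordModel: process.env.WAKE_WORD_MODEL || "alexa_v0.1.onnx",
  speechDelayMs: parseInt(process.env.SPEECH_COMPLETION_DELAY || "2000", 10),
  screenshotsEnabled: process.env.SCREENSHOT_CAPTURE_ENABLED !== "false",
  qdrantUrl: process.env.QDRANT_URL || "http://localhost:6333",
};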

Customization Options

Change Wake Word

Replace the .onnx file and update WAKE_WORD_MODEL:

# Available models: alexa, hey_jarvis, hey_siri, etc.
WAKE_WORD_MODEL=co_brain.onnx

Modify AI Behavior

Edit prompts in chatgpt-handler.js and agent.ts:

// Make responses more/less verbose
systemPrompt: "You are a concise AI assistant..."
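
In the OpenAI chat API the system prompt is simply the first message of the conversation, so a customized prompt would be applied roughly like this (a sketch only; the handler's real structure may differ):

const OpenAI = require("openai");
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function answer(transcript) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",  // illustrative model choice
    messages: [
      { role: "system", content: "You are a concise AI assistant..." },  // the prompt being edited
      { role: "user", content: transcript },
    ],
  });
  return response.choices[0].message.content;
}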

UI Theming

Customize CSS in index.html:

.widget-container {
    background: rgba(10, 10, 15, 0.95);
    backdrop-filter: blur(10px);
}

🐛 Troubleshooting

Common Issues

🎤 Audio/Microphone Issues
  • Check permissions: macOS → System Preferences → Security & Privacy → Microphone
  • Test audio: npm run test-audio
  • Restart audio: Stop/start the agent
  • Check devices: Ensure correct microphone is selected
🐍 Python Environment Issues
# Test Python setup
npm run test-venv

# Recreate virtual environment  
rm -rf venv
./setup-python.sh

# Manual troubleshooting
source venv/bin/activate  # macOS/Linux
venv\Scripts\activate     # Windows
python -c "import openwakeword, pyaudio; print('OK')"
🔑 API Key Issues
  • Verify keys: Check .env file format
  • Test connectivity: App console shows connection status
  • Check quotas: Ensure sufficient API credits
  • Key format: OpenAI keys start with sk-, Deepgram keys are UUID format
🖼️ Screenshot/Vision Issues
  • Permissions: Grant screen recording permissions to Terminal/app
  • Test vision: Ask "What do you see?" with content visible
  • Debug logs: Check console for screenshot capture messages
  • Model limits: Ensure images aren't too large for GPT-4 Vision
🤖 Automation Issues
# Test agent framework
npx tsx agent.ts "test command"

# Check MCP server
npm install @steipete/macos-automator-mcp

# Debug automation
# Check console for "AUTOMATION_ACTION:" messages

Debug Mode

Run with detailed logging:

DEBUG=* npm start

Performance Tips

  • Reduce screenshot frequency if system is slow
  • Use smaller wake word models for faster response
  • Disable Qdrant if not using browsing history features
  • Adjust SPEECH_COMPLETION_DELAY for your speaking pace

🚧 Development

Running Tests

# Test individual components
npm run test-venv          # Python environment
npm run test-audio         # Audio capture
npm run test-wakeword      # Wake word detection
npm run test-screenpipe    # Visual analysis

# Development mode with DevTools
npm run dev

Adding New Features

  1. Voice Commands: Add patterns to detectIntent() in main.js (see the sketch below)
  2. Automation: Extend prompts in agent.ts
  3. UI Components: Modify renderer.js and index.html
  4. AI Capabilities: Enhance chatgpt-handler.js
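
The internals of detectIntent() aren't shown in this README; a hypothetical pattern-based version, purely to illustrate where a new command pattern would slot in, might look like:

// Hypothetical shape of detectIntent(); every pattern below is illustrative.
const ACTION_PATTERNS = [
  /^open\s/i, /^close\s/i, /^clone\s/i, /take a screenshot/i,
  /^launch my dev setup/i,  // <- a newly added voice command pattern
];
const IGNORE_PATTERNS = [/tell (him|her|them)/i, /i'll talk to you/i];

function detectIntent(transcript) {
  if (IGNORE_PATTERNS.some((re) => re.test(transcript))) return "conversational";
  if (ACTION_PATTERNS.some((re) => re.test(transcript))) return "action";
  return "question";
}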

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit changes: git commit -m 'Add amazing feature'
  4. Push to branch: git push origin feature/amazing-feature
  5. Open a Pull Request

📋 Roadmap

🎯 Planned Features

  • Multi-language support for wake words and transcription
  • Custom automation workflows with visual editor
  • Plugin system for third-party integrations
  • Voice training for improved wake word accuracy
  • Batch operations for complex multi-step tasks
  • Desktop notification integration
  • Cross-platform support (Windows, Linux)

🔄 Recent Updates

  • Smart conversational filtering - Ignores casual speech
  • Enhanced screen analysis - Better visual understanding
  • Improved error handling - Cursor IDE integration
  • Click-through interface - Non-intrusive UI mode
  • Multi-step automation - Complex command sequences
  • Intent classification - Smart question vs action routing

📄 License

MIT License - see LICENSE file for details.


🙏 Acknowledgments

  • OpenAI - For GPT-4 and API access
  • Deepgram - For speech transcription technology
  • Electron - For cross-platform desktop framework
  • OpenWakeWord - For wake word detection
  • Qdrant - For vector search capabilities

Built with ❤️ for the future of human-computer interaction

Report Bug · Request Feature · Documentation
