A GitHub Copilot CLI plugin that adds image and audio input to your Copilot conversations.
- Phase 1 (Image): Drag image files into the CLI or paste screenshots from your clipboard
- Phase 2 (Audio): Record audio with `/record` and have it transcribed and answered
Requirements:

- GitHub Copilot CLI installed
- Node.js 20+
- For audio (Phase 2): `sox` on macOS/Windows, `arecord` on Linux
- For transcription (Phase 2): an OpenAI API key
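To confirm the Node.js requirement before building, a small shell check may help. This is a sketch, not part of the plugin: `node_major_ok` is a hypothetical helper that parses the `vMAJOR.MINOR.PATCH` string printed by `node --version`.

```shell
# Hypothetical helper (not shipped with the plugin): succeed if a Node.js
# version string such as "v20.11.1" has a major version of 20 or later.
node_major_ok() {
  local version="${1:-}"
  version="${version#v}"          # drop the leading "v"
  local major="${version%%.*}"    # keep the text before the first "."
  # Reject empty or non-numeric major versions.
  case "$major" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$major" -ge 20 ]
}

# Usage (requires node on PATH):
#   node_major_ok "$(node --version)" && echo "Node.js OK" || echo "Node.js 20+ required"
```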
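Build the MCP server:

```shell
cd mcp-server
npm install
npm run build
cd ..
```

Then install the plugin and confirm it is registered:

```shell
copilot plugin install ./plugin
copilot plugin list
```

You should see `copilot-multimodal` in the list.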
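The build, install, and check steps can also be scripted. The sketch below is a convenience wrapper under stated assumptions, not part of the project: `install_multimodal_plugin` and `has_multimodal_plugin` are hypothetical names, and any `copilot plugin list` output containing `copilot-multimodal` is treated as success because the exact output format is not documented here.

```shell
# Hypothetical helper: succeed if text on stdin mentions the plugin name.
# Assumes `copilot plugin list` prints plugin names as plain text.
has_multimodal_plugin() {
  grep -q 'copilot-multimodal'
}

# Hypothetical wrapper around the documented commands; run from the repo root.
install_multimodal_plugin() {
  # Build the MCP server in a subshell so the caller's directory is unchanged.
  (cd mcp-server && npm install && npm run build) || return 1
  copilot plugin install ./plugin || return 1
  copilot plugin list | has_multimodal_plugin
}

# Usage:
#   install_multimodal_plugin && echo "copilot-multimodal installed"
```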
For transcription (Phase 2), set your OpenAI API key:

```shell
export OPENAI_API_KEY=your-openai-api-key
# Add to ~/.bashrc or ~/.zshrc to make permanent
```

To use drag-and-drop:

- Start a Copilot CLI session
- Drag an image file into the terminal window — the file path appears in the input
- Type your question after the path and press Enter
For example:

```text
/Users/me/screenshots/error.png What is causing this error?
```
The multimodal agent automatically detects the image path and includes it in the analysis.
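How the agent recognizes a dropped path is not documented here. As a rough illustration only, the hypothetical `first_image_path` function below picks the first whitespace-separated token that ends in a common image extension; the extension list is an assumption, and paths containing spaces are not handled.

```shell
# Hypothetical illustration of spotting a dragged-in image path in a prompt.
# Assumes the path is one whitespace-free token with an image extension.
first_image_path() {
  # Split the prompt into one token per line, then keep the first image-like one.
  printf '%s\n' "$1" \
    | tr -s ' \t' '\n\n' \
    | grep -iE '\.(png|jpe?g|gif|webp)$' \
    | head -n 1
}

# Example:
#   first_image_path '/Users/me/screenshots/error.png What is causing this error?'
#   prints /Users/me/screenshots/error.png
```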
Use the clipboard watcher to make Ctrl+V work just like drag-and-drop:
Open a separate PowerShell window and run:
```powershell
.\clipboard-watcher\start.ps1
```

It prints a ready message and keeps running in the background.
- Press `Win+Shift+S` and select a region; the screenshot goes to the clipboard
- Switch back to Copilot CLI
- Press `Ctrl+V`; this pastes `[📷 copilot-image-xxxx.png]` into the input
- Type your question and press `Enter`
For example:

```text
[📷 copilot-image-cf6510.png] what does this UI error mean?
```
The agent reads the image token exactly like drag-and-drop.
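The token format above suggests a simple pattern. A hypothetical sketch for pulling the file name out of a pasted prompt, assuming names always follow the `copilot-image-<id>.png` shape seen in the examples (where the watcher saves the file is not covered here):

```shell
# Hypothetical helper: extract "copilot-image-<id>.png" from a pasted prompt.
# The <id> is assumed to be letters and digits only, as in the examples.
clipboard_image_name() {
  printf '%s\n' "$1" | grep -oE 'copilot-image-[A-Za-z0-9]+\.png' | head -n 1
}

# Example:
#   clipboard_image_name '[📷 copilot-image-cf6510.png] what does this UI error mean?'
#   prints copilot-image-cf6510.png
```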
macOS / Linux:
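Press `Cmd+Shift+Ctrl+4` (macOS) or use your distro's screenshot tool to copy a region to the clipboard, then paste into Copilot CLI with your terminal's paste shortcut (`Cmd+V` on macOS). The watcher is a PowerShell script (PowerShell 5.1+), so on macOS and Linux it needs a PowerShell installation (`pwsh`) to run.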
If you don't run the watcher, you can still ask about a screenshot by describing it visually:
```text
Here's the screenshot I just took: what does this error mean?
```

The agent will call `read_clipboard_image` when it detects visual-intent language.
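The exact cue is not specified beyond "visual-intent language". The sketch below only illustrates the idea with keyword matching; `has_visual_intent` and its keyword list are assumptions, not the agent's actual rule.

```shell
# Hypothetical keyword check for "visual intent" in a prompt.
# The keyword list is an illustrative assumption, not the agent's real rule.
has_visual_intent() {
  printf '%s\n' "$1" \
    | grep -qiE 'screenshot|screen shot|image|picture|clipboard|what does this (error|ui|screen)'
}

# Example:
#   has_visual_intent "Here's the screenshot I just took: what does this error mean?" && echo "would read clipboard"
```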
Start a recording session:
```text
/record
```
Copilot responds: `🎙️ Recording... type /stop when done`
Speak your question, then type:
```text
/stop
```
Copilot transcribes the audio and answers your spoken question.
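Transcription uses the OpenAI API key from the setup step. The plugin's exact request is not shown in this README; as a sketch under stated assumptions, a call to OpenAI's audio transcription endpoint with `curl` could look like this. The `whisper-1` model and the hypothetical `transcribe_audio` name are assumptions.

```shell
# Sketch of an OpenAI transcription request (not necessarily what the plugin sends).
# Assumptions: the recording is a local audio file and the model is "whisper-1".
transcribe_audio() {
  local file="$1"
  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY is not set" >&2
    return 1
  fi
  if [ ! -f "$file" ]; then
    echo "Audio file not found: $file" >&2
    return 1
  fi
  # Multipart upload; the JSON response carries the transcript in its "text" field.
  curl -sS --fail https://api.openai.com/v1/audio/transcriptions \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -F model=whisper-1 \
    -F "file=@$file"
}

# Usage:
#   transcribe_audio recording.wav
```

The two local checks fail fast with a readable message instead of sending a request that the API would reject.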
To cancel a recording:
```text
/cancel
```
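Recording relies on the tools listed in the requirements. The plugin's actual invocation is not documented here, so the hypothetical `recorder_command` below only illustrates a plausible per-platform choice; the flags are standard for `sox` and `arecord`, but whether the plugin uses them is an assumption.

```shell
# Hypothetical helper: print a recording command for a platform name
# (as reported by `uname -s`) and an output WAV path.
#   sox -d FILE         records from the default input device until interrupted
#   arecord -f cd FILE  records 16-bit, 44.1 kHz stereo until interrupted
recorder_command() {
  local os="$1" out="$2"
  case "$os" in
    Linux) echo "arecord -f cd $out" ;;
    Darwin|MINGW*|MSYS*|CYGWIN*) echo "sox -d $out" ;;  # macOS and Windows shells
    *) return 1 ;;
  esac
}

# Usage:
#   recorder_command "$(uname -s)" recording.wav
```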
Note: `/record`, `/stop`, and `/cancel` are slash commands typed inside a Copilot CLI session. Shell aliases in `~/.bashrc` or `~/.zshrc` run in your shell, not in Copilot CLI, so they cannot trigger these commands (an alias named `rec` would also shadow the `rec` command that sox installs). Due to GitHub Copilot CLI Plugin API limitations, audio recording is triggered via slash commands rather than a keyboard shortcut. See ARCHITECTURE.md for details.
The plugin provides one agent:
| Agent | Description |
|---|---|
| `multimodal` | Full image + audio support: handles drag-and-drop, clipboard paste, and voice recording |
Switch agents in your Copilot CLI session:
```text
/agent multimodal
```
Troubleshooting:

If no image is found on the clipboard, make sure you've copied an image (not just text). Use your OS screenshot tool to capture to the clipboard, not to a file.
If an image is rejected as too large, resize or crop it before pasting. Most screenshots are well under 5MB; oversized images typically come from RAW photos or very large exports.
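To check a file before pasting, a quick size test can help. This sketch assumes the limit is 5 MB counted as 5 * 1024 * 1024 bytes; the README does not say whether binary or decimal megabytes are meant, and `image_under_limit` is a hypothetical name.

```shell
# Hypothetical helper: succeed if FILE is smaller than LIMIT bytes
# (default 5 MiB, an assumed reading of the 5MB figure above).
# Returns 2 if FILE does not exist.
image_under_limit() {
  local file="$1" limit="${2:-5242880}"
  [ -f "$file" ] || return 2
  local size
  # wc -c works on GNU and BSD; strip the padding spaces macOS adds.
  size=$(wc -c < "$file" | tr -d ' ')
  [ "$size" -lt "$limit" ]
}

# Usage:
#   image_under_limit ~/Desktop/shot.png || echo "Too large: resize or crop first"
```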
If audio recording fails, install the recording tool for your platform:

- macOS: `brew install sox`
- Windows: `choco install sox` (or download from https://sox.sourceforge.net)
- Linux: `sudo apt install sox`, or use `arecord` (part of `alsa-utils`, preinstalled on many distributions)
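To see which of these tools is already on your `PATH`, a hypothetical helper like `find_recorder` can be used. It prefers `sox` and falls back to `arecord`; that order is an assumption, not one the README specifies.

```shell
# Hypothetical helper: print the first supported recorder found on PATH.
find_recorder() {
  local tool
  for tool in sox arecord; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      return 0
    fi
  done
  echo "No recorder found: install sox (or use arecord on Linux)" >&2
  return 1
}

# Usage:
#   find_recorder || echo "Install a recorder before using /record"
```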
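If transcription fails, make sure your OpenAI API key is set:

```shell
export OPENAI_API_KEY=sk-...
```

After changing the plugin or MCP server source, rebuild and re-install to pick up the changes: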
```shell
npm run build -C mcp-server
copilot plugin install ./plugin
```

To run the test suite:

```shell
cd mcp-server
npm test
```

Tests run on Windows, macOS, and Linux via GitHub Actions CI.
See ARCHITECTURE.md for full architecture documentation and data flow diagrams.