copilot-multimodal

A GitHub Copilot CLI plugin that adds image and audio input to your Copilot conversations.

Phase 1 (Image): Drag image files into the CLI or paste screenshots from your clipboard
Phase 2 (Audio): Record audio with /record and have it transcribed and answered

Requirements

GitHub Copilot CLI installed
Node.js 20+
For audio (Phase 2): sox on macOS/Windows, arecord on Linux
For transcription (Phase 2): OpenAI API key

Installation

1. Build the MCP server

cd mcp-server
npm install
npm run build
cd ..

2. Install the plugin

copilot plugin install ./plugin

3. Verify installation

copilot plugin list

You should see copilot-multimodal in the list.

4. (Phase 2 only) Set up transcription

export OPENAI_API_KEY=your-openai-api-key
# Add to ~/.bashrc or ~/.zshrc to make permanent

Usage

Image Input — Drag & Drop

Start a Copilot CLI session
Drag an image file into the terminal window — the file path appears in the input
Type your question after the path and press Enter

/Users/me/screenshots/error.png What is causing this error?

The multimodal agent automatically detects the image path and includes it in the analysis.
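
As a rough illustration of what detecting a leading image path could look like, the sketch below splits an input line into a path and a question. It is not the plugin's actual logic: the function name `splitImagePath`, the extension list, and the handling of quoted paths are all assumptions made for this example.

```typescript
// Hypothetical helper; the plugin's real detection logic is not shown in this README.
// Assumption: an image path is the first token and ends in one of these extensions.
// A double-quoted path (for paths containing spaces) is also accepted.
const IMAGE_PATH =
  /^\s*(?:"([^"]+\.(?:png|jpe?g|gif|webp|bmp))"|(\S+\.(?:png|jpe?g|gif|webp|bmp)))(?:\s+|$)/i;

interface ParsedInput {
  imagePath: string | null; // null when the input does not start with an image path
  question: string;         // everything after the path, trimmed
}

function splitImagePath(input: string): ParsedInput {
  const match = IMAGE_PATH.exec(input);
  if (match === null) {
    return { imagePath: null, question: input.trim() };
  }
  // Group 1 is the quoted form, group 2 the unquoted form.
  const imagePath = match[1] || match[2] || null;
  const question = input.slice(match[0].length).trim();
  return { imagePath: imagePath, question: question };
}
```

Called with the example line above, this would separate `/Users/me/screenshots/error.png` from the question text that follows it.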

Image Input — Screenshot Paste (Ctrl+V)

Use the clipboard watcher to make Ctrl+V work just like drag-and-drop:

1. Start the clipboard watcher (once per session)

Open a separate PowerShell window and run:

.\clipboard-watcher\start.ps1

It prints a ready message and stays running in the background.

2. Take a screenshot and paste it

Press Win+Shift+S and select a region — screenshot goes to clipboard
Switch back to Copilot CLI
Press Ctrl+V — pastes [📷 copilot-image-xxxx.png] into the input
Type your question and press Enter

[📷 copilot-image-cf6510.png] what does this UI error mean?

The agent reads the image token exactly like drag-and-drop.
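
To make the token format concrete, here is a minimal sketch of pulling the file name out of a pasted token. The token shape is inferred only from the example `[📷 copilot-image-cf6510.png]`; the function name `extractImageToken` and the assumption that the ID is lowercase alphanumeric are hypothetical, and the watcher's real format may differ.

```typescript
// Assumed token shape, inferred from the example "[📷 copilot-image-cf6510.png]".
// The ID portion is assumed to be lowercase letters and digits.
const IMAGE_TOKEN = /\[📷 (copilot-image-[0-9a-z]+\.png)\]/;

interface TokenResult {
  fileName: string | null; // null when no image token is present
  question: string;        // the input with the token removed, trimmed
}

function extractImageToken(input: string): TokenResult {
  const match = IMAGE_TOKEN.exec(input);
  if (match === null) {
    return { fileName: null, question: input.trim() };
  }
  const fileName = match[1] || null;
  // Remove only the first token occurrence and keep the rest as the question.
  const question = input.replace(match[0], "").trim();
  return { fileName: fileName, question: question };
}
```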

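macOS / Linux: capture to the clipboard with Cmd+Shift+Ctrl+4 (macOS) or your distro's screenshot tool, then press Ctrl+V. The watcher script requires PowerShell 5.1 or later. Windows PowerShell 5.1 ships only on Windows, so on macOS and Linux this means the cross-platform PowerShell (pwsh, version 6 or later).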

Without the watcher

If you don't run the watcher, you can still ask about a screenshot on your clipboard by referring to it in your message:

Here's the screenshot I just took — what does this error mean?

The agent will call read_clipboard_image when it detects visual-intent language.

Audio Input — Recording (Phase 2)

Start a recording session:

/record

Copilot responds: 🎙️ Recording... type /stop when done

Speak your question, then type:

/stop

Copilot transcribes the audio and answers your spoken question.

To cancel a recording:

/cancel
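
The three commands behave like a small two-state machine. The sketch below models that flow for illustration only; the type names, the `nextState` function, and the action labels are assumptions, not the plugin's internals.

```typescript
// Hypothetical model of the /record, /stop, /cancel flow described above.
type RecorderState = "idle" | "recording";
type RecorderCommand = "/record" | "/stop" | "/cancel";

interface Transition {
  state: RecorderState;
  // What the caller would do next: start capturing, transcribe the
  // captured audio, discard it, or do nothing.
  action: "start" | "transcribe" | "discard" | "none";
}

function nextState(state: RecorderState, command: RecorderCommand): Transition {
  if (state === "idle") {
    // Only /record has an effect while idle.
    if (command === "/record") {
      return { state: "recording", action: "start" };
    }
    return { state: "idle", action: "none" };
  }
  // state === "recording"
  if (command === "/stop") {
    return { state: "idle", action: "transcribe" };
  }
  if (command === "/cancel") {
    return { state: "idle", action: "discard" };
  }
  // Assumption: a second /record while already recording is ignored.
  return { state: "recording", action: "none" };
}
```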

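Power users: you can define shell aliases as shortcuts. Aliases are expanded by bash or zsh at the shell prompt, not inside a running Copilot CLI session, so confirm that they behave as expected in your setup: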

# .bashrc / .zshrc
alias rec='/record'
alias srec='/stop'
alias crec='/cancel'

Note: Due to GitHub Copilot CLI Plugin API limitations, audio recording is triggered via slash commands rather than a keyboard shortcut. See ARCHITECTURE.md for details.

Choosing an Agent

The plugin provides one agent:

Agent	Description
`multimodal`	Full image + audio support — handles drag-and-drop, clipboard paste, and voice recording

Switch agents in your Copilot CLI session:

/agent multimodal

Troubleshooting

"No image found in clipboard"

Make sure you've copied an image (not just text). Use your OS screenshot tool to capture to clipboard, not to a file.

"Image too large — exceeds 5MB"

Resize or crop the image before pasting. Most screenshots are well under 5MB; this typically happens with RAW photos or very large exports.
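
If you want to check a file before pasting it, a size guard along these lines would catch oversized images early. This is a sketch under assumptions: it treats "5MB" as 5 × 1024 × 1024 bytes (the plugin may use 5,000,000), and the function name and message wording are hypothetical.

```typescript
// Assumption: the "5MB" limit means 5 * 1024 * 1024 bytes.
const MAX_IMAGE_BYTES = 5 * 1024 * 1024;

// Returns an explanatory message when the image is over the limit,
// or null when the size is acceptable.
function checkImageSize(sizeInBytes: number): string | null {
  if (sizeInBytes > MAX_IMAGE_BYTES) {
    const sizeMb = (sizeInBytes / (1024 * 1024)).toFixed(1);
    return "Image too large (" + sizeMb + " MB): exceeds 5MB";
  }
  return null;
}
```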

"sox not found" / "arecord not found"

Install the audio recording tool for your platform:

macOS: brew install sox
Windows: choco install sox (or download from https://sox.sourceforge.net)
Linux: sudo apt install sox, or use arecord (part of the alsa-utils package, which many distros preinstall)
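
The platform-to-tool mapping from the Requirements section can be summarized as a small lookup. The sketch below assumes the platform strings are the values Node.js reports in `process.platform` (`darwin`, `win32`, `linux`); the function names are hypothetical, and the install hints simply repeat the commands listed above.

```typescript
type RecorderTool = "sox" | "arecord";

// Maps a Node.js process.platform value to the recording tool this README
// names for that platform. Returns null for platforms the README does not cover.
function recorderFor(platform: string): RecorderTool | null {
  switch (platform) {
    case "darwin": // macOS
    case "win32":  // Windows
      return "sox";
    case "linux":
      return "arecord";
    default:
      return null;
  }
}

// Returns the install command suggested in this README, or null if unknown.
function installHint(platform: string): string | null {
  switch (platform) {
    case "darwin":
      return "brew install sox";
    case "win32":
      return "choco install sox";
    case "linux":
      return "sudo apt install sox";
    default:
      return null;
  }
}
```

In a Node.js script you could pass `process.platform` to both functions.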

"Transcription requires OPENAI_API_KEY"

Set your OpenAI API key:

export OPENAI_API_KEY=sk-...
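
For scripts that wrap the plugin, a pre-flight check like the following can surface a missing key before a recording is made. It is only a sketch: the function name is hypothetical, and treating a blank value as missing is an assumption rather than documented plugin behavior.

```typescript
// Returns the error text from this section when the key is absent or blank,
// or null when a key is present. The environment is passed in as a plain
// object so the function stays easy to test.
function transcriptionKeyError(env: { [name: string]: string | undefined }): string | null {
  const key = env["OPENAI_API_KEY"];
  // Assumption: a whitespace-only value counts as missing.
  if (key === undefined || key.trim() === "") {
    return "Transcription requires OPENAI_API_KEY";
  }
  return null;
}
```

In Node.js, pass `process.env` as the argument.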

Plugin not loading after changes

Rebuild the MCP server and reinstall the plugin to pick up changes:

npm run build -C mcp-server
copilot plugin install ./plugin

Development

Running tests

cd mcp-server
npm test

Tests run on Windows, macOS, and Linux via GitHub Actions CI.

Project structure

See ARCHITECTURE.md for full architecture documentation and data flow diagrams.


dotnetspark/copilot-multimodal
