An agentic VLM (Vision Language Model) framework that gives a language model access to a stateful Jupyter notebook running inside a Docker container. The agent can iteratively write and execute Python code to process images, run computations, and produce visualizations — all within a sandboxed environment.
SWE-Vision/
├── swe_vision/ # Core library
│ ├── __init__.py # Package exports
│ ├── config.py # Constants, logging, tool definitions, system prompt
│ ├── kernel.py # JupyterNotebookKernel — Docker-based Jupyter runtime
│ ├── image_utils.py # Image encoding, MIME detection, OpenAI content parts
│ ├── file_manager.py # NotebookFileManager — host ↔ container file sharing
│ ├── trajectory.py # TrajectoryRecorder — saves full agent traces to disk
│ ├── agent.py # VLMToolCallAgent — agentic loop with tool calling
│ ├── cli.py # CLI entry point
│ └── eval_utils.py # LLM judge prompt, answer extraction utilities
│
├── apps/ # Standalone applications
│ ├── web_app.py # ChatGPT-style web UI (Flask + SSE streaming)
│ └── trajectory_viewer.py # Trajectory visualization dashboard (Flask)
│
├── env/ # Docker environment (Dockerfile for the kernel)
├── requirements.txt
└── README.md
pip install -r requirements.txtexport OPENAI_API_KEY="sk-..."
export OPENAI_BASE_URL="https://openrouter.ai/api/v1" # custom API endpoint
export OPENAI_MODEL="openai/gpt-5.2" # default modelThe agent runs code inside a Docker container. Make sure Docker is installed and running, then place a Dockerfile in the env/ directory. A minimal example:
docker build -t swe-vision -f ./env/Dockerfile ./envWe provide a script to run the agent with a single command.
bash run.shYou can also run the agent manually.
# Single query with an image
python -m swe_vision.cli --image photo.png "What objects are in this image?"
# Multiple images
python -m swe_vision.cli -i img1.png -i img2.png "What is the difference between these two images?"A ChatGPT-style interface with real-time streaming of the agent's reasoning, code execution, and results:
python apps/web_app.py --port 8080
# Open http://localhost:8080Every agent run saves a trajectory (JSON + images) to ./trajectories/. Browse them with the viewer:
python apps/trajectory_viewer.py --port 5050
# Open http://localhost:5050 User Query (+ images)
│
▼
┌──────────────────────┐
│ LLM (e.g. GPT-5.2) │◄───────────────────────┐
│ │ │
│ Tool Calls: │ │
│ ┌────────────────┐ │ ┌──────────────┐ │
│ │ execute_code │─┼────►│Jupyter Kernel│ │
│ └────────────────┘ │ │ (Docker) │ │
│ ┌────────────────┐ │ └──────┬───────┘ │
│ │ finish │─┼──► Answer │ (Output) │
│ └────────────────┘ │ │ │
└──────────────────────┘ text + images ──────┘
Key components:
| Module | Responsibility |
|---|---|
config.py |
All constants, tool schemas, system prompt |
kernel.py |
Builds Docker image, starts container, manages Jupyter kernel via ZMQ |
agent.py |
Orchestrates the agentic loop: LLM calls → tool dispatch → result collection |
trajectory.py |
Records every step with timestamps, code, images; saves to JSON |
image_utils.py |
Base64 encoding, compression, OpenAI content part builders |
file_manager.py |
Copies files into the Docker mount so the kernel can access them |
usage: python -m swe_vision.cli [-h] [--image IMAGE] [--interactive]
[--model MODEL] [--api-key API_KEY]
[--base-url BASE_URL]
[--max-iterations MAX_ITERATIONS]
[--save-trajectory SAVE_TRAJECTORY]
[--verbose] [--quiet]
[--reasoning | --no-reasoning]
[query]
| Flag | Description |
|---|---|
--image, -i |
Image file path (repeatable) |
--interactive |
Multi-turn interactive mode |
--model, -m |
Model name (default: gpt-4o or $OPENAI_MODEL) |
--reasoning / --no-reasoning |
Enable/disable extended reasoning |
--save-trajectory |
Custom trajectory output directory |
--quiet, -q |
Minimal console output |
| Variable | Description | Default |
|---|---|---|
OPENAI_API_KEY |
API key for the LLM provider | (required) |
OPENAI_BASE_URL |
Custom API base URL | OpenAI default |
OPENAI_MODEL |
Default model name | gpt-4o |
VLM_DOCKER_IMAGE |
Docker image name for the kernel | swe-vision:latest |
VLM_DOCKERFILE_DIR |
Path to the Dockerfile directory | ./env/ |
VLM_HOST_WORK_DIR |
Host-side working directory for file sharing | ~/tmp/vlm_docker_workdir |
VLM_WEB_SESSION_DIR |
Session storage for the web app | /tmp |
import asyncio
from swe_vision import VLMToolCallAgent
async def main():
agent = VLMToolCallAgent(
model="openai/gpt-5.2",
api_key="sk-...",
reasoning=True,
)
try:
answer = await agent.run(
"Analyze this chart and summarize the trends",
image_paths=["chart.png"],
)
print(answer)
finally:
await agent.cleanup()
asyncio.run(main())MIT

