Access Google's Gemini models via Vertex AI for enterprise use
Install this plugin in the same environment as LLM.
Note: This package is not published on PyPI (the `llm-vertex` name on PyPI is an unrelated project). Install directly from GitHub:

```bash
llm install "llm-vertex @ git+https://github.com/c0ffee0wl/llm-vertex.git"
```

This plugin uses Google Cloud Vertex AI, which supports three authentication methods.
**Option 1: API key.** Fastest setup, but recommended for testing only:
```bash
# Set via llm keys command
llm keys set vertex

# Or via environment variable
export GOOGLE_CLOUD_API_KEY="YOUR_API_KEY"
```

Get your API key from the Google Cloud Console.
Note: Vertex AI API keys are different from Google AI Studio keys. Make sure to create a Vertex AI-compatible API key in your GCP project. API keys are convenient for development and testing but not recommended for production. For production, use Application Default Credentials (Option 2).
**Option 2: Application Default Credentials (ADC).** If you're already using Google Cloud, authenticate with:
```bash
gcloud auth application-default login
```

This sets up Application Default Credentials (ADC) that the plugin will automatically use.
**Option 3: Service account.**

- Create a service account in your GCP project with Vertex AI User permissions
- Download the JSON key file
- Set the environment variable:
```bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```

Or configure it via the plugin:

```bash
llm vertex set-credentials /path/to/service-account.json
```

Set your Google Cloud project ID:

```bash
# Via environment variable
export GOOGLE_CLOUD_PROJECT="your-project-id"

# Or via plugin config
llm vertex set-project your-project-id
```

The plugin defaults to the global endpoint. However, the global endpoint has important limitations:
- ⚠️ Does not support tuning, batch prediction, or RAG corpus creation
- ⚠️ Does not guarantee region-specific ML processing
- ⚠️ Does not provide data residency compliance
For production use or if you need specific features, use a regional endpoint:
```bash
# Via environment variable
export GOOGLE_CLOUD_REGION="us-central1"

# Or via plugin config
llm vertex set-region us-central1
```

To see all available regions:

```bash
llm vertex list-regions
```

Available regions include:
- United States: `us-central1`, `us-east1`, `us-east4`, `us-east5`, `us-south1`, `us-west1`, `us-west4`
- Canada: `northamerica-northeast1`
- South America: `southamerica-east1`
- Europe: `europe-west1`, `europe-west2`, `europe-west3`, `europe-west4`, `europe-west6`, `europe-west8`, `europe-west9`, `europe-north1`, `europe-southwest1`, `europe-central2`
- Asia Pacific: `asia-east1`, `asia-northeast1`, `asia-northeast3`, `asia-southeast1`, `asia-south1`, `australia-southeast1`, `australia-southeast2`
- Middle East: `me-central1`, `me-central2`, `me-west1`
For the latest region availability and model-specific regional support, see the official Vertex AI locations documentation.
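If you script region selection, it can help to validate the value before handing it to the plugin. The region names below come from the list above; the helper itself and its fall-back-to-global behavior are an illustrative sketch, not part of the plugin:

```python
import os

# Regions listed in this README; check the Vertex AI locations docs for updates.
KNOWN_REGIONS = {
    "us-central1", "us-east1", "us-east4", "us-east5", "us-south1",
    "us-west1", "us-west4", "northamerica-northeast1", "southamerica-east1",
    "europe-west1", "europe-west2", "europe-west3", "europe-west4",
    "europe-west6", "europe-west8", "europe-west9", "europe-north1",
    "europe-southwest1", "europe-central2", "asia-east1", "asia-northeast1",
    "asia-northeast3", "asia-southeast1", "asia-south1",
    "australia-southeast1", "australia-southeast2",
    "me-central1", "me-central2", "me-west1",
}

def resolve_region(env: dict) -> str:
    """Return the configured region, defaulting to the global endpoint."""
    region = env.get("GOOGLE_CLOUD_REGION", "global")
    if region != "global" and region not in KNOWN_REGIONS:
        raise ValueError(f"Unknown Vertex AI region: {region}")
    return region

print(resolve_region({}))                                      # global
print(resolve_region({"GOOGLE_CLOUD_REGION": "us-central1"}))  # us-central1
```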
Check your configuration:

```bash
llm vertex config
```

Now run the model using `-m vertex/gemini-2.5-flash`, for example:

```bash
llm -m vertex/gemini-2.5-flash "A short joke about a pelican and a walrus"
```

You can set the default model to avoid the extra `-m` option:

```bash
llm models default vertex/gemini-2.5-flash
llm "A joke about a pelican and a walrus"
```

All Gemini models are available through Vertex AI:
- `vertex/gemini-3-pro-preview`: Gemini 3 Pro preview (global region only)
- `vertex/gemini-3-pro-preview-11-2025`: Gemini 3 Pro November 2025 version (global region only)
- `vertex/gemini-3-pro-preview-11-2025-thinking`: Gemini 3 Pro with thinking mode (global region only)
- `vertex/gemini-3-flash-preview`: Gemini 3 Flash preview (global region only)
Note: Gemini 3 models automatically use the global endpoint regardless of your configured region.
Thinking Levels: Gemini 3 models support configurable thinking levels:
- Flash (`gemini-3-flash-preview`): `minimal`, `low`, `medium`, `high`
- Pro (`gemini-3-pro-preview`): `low`, `high`

```bash
llm -m vertex/gemini-3-flash-preview -o thinking_level high 'complex reasoning task'
```

- `vertex/gemini-3.1-pro-preview`: Gemini 3.1 Pro preview (global region only)
- `vertex/gemini-3.1-pro-preview-customtools`: Gemini 3.1 Pro with custom tools support (global region only)
- `vertex/gemini-3.1-flash-lite-preview`: Gemini 3.1 Flash Lite preview (global region only)
Thinking Levels: Gemini 3.1 models support configurable thinking levels:
- Pro (`gemini-3.1-pro-preview`): `low`, `medium`, `high`
- Flash Lite (`gemini-3.1-flash-lite-preview`): `minimal`, `low`, `medium`, `high`

```bash
llm -m vertex/gemini-3.1-pro-preview -o thinking_level high 'complex reasoning task'
```

Note: Gemini 3.1 models automatically use the global endpoint regardless of your configured region.
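Because the allowed thinking levels differ per model, scripts that set `-o thinking_level` may want to validate the value first. The level sets below are taken from the lists above; the helper function is a hypothetical sketch, not the plugin's own code:

```python
# Thinking levels per model, as documented above (illustrative only).
THINKING_LEVELS = {
    "gemini-3-flash-preview": {"minimal", "low", "medium", "high"},
    "gemini-3-pro-preview": {"low", "high"},
    "gemini-3.1-pro-preview": {"low", "medium", "high"},
    "gemini-3.1-flash-lite-preview": {"minimal", "low", "medium", "high"},
}

def check_thinking_level(model: str, level: str) -> str:
    """Raise if `level` is not supported by `model`; otherwise return it."""
    allowed = THINKING_LEVELS.get(model)
    if allowed is None:
        raise ValueError(f"{model} does not support thinking levels")
    if level not in allowed:
        raise ValueError(f"{model} supports {sorted(allowed)}, got {level!r}")
    return level

print(check_thinking_level("gemini-3.1-pro-preview", "high"))  # high
```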
- `vertex/gemma-3-1b-it`: Gemma 3 1B (text only, no vision)
- `vertex/gemma-3-4b-it`: Gemma 3 4B
- `vertex/gemma-3-12b-it`: Gemma 3 12B
- `vertex/gemma-3-27b-it`: Gemma 3 27B
- `vertex/gemma-3n-e4b-it`: Gemma 3n E4B (text only, no vision)
Note: Gemma models do not support structured output schemas or media resolution settings. `gemma-3-1b-it` and `gemma-3n-e4b-it` do not support vision/image inputs.
- `vertex/gemini-2.5-flash-lite-preview-09-2025`
- `vertex/gemini-2.5-flash-preview-09-2025`
- `vertex/gemini-flash-lite-latest`: Latest Gemini Flash Lite
- `vertex/gemini-flash-latest`: Latest Gemini Flash
- `vertex/gemini-2.5-flash-lite`: Gemini 2.5 Flash Lite
- `vertex/gemini-2.5-pro`: Gemini 2.5 Pro
- `vertex/gemini-2.5-flash`: Gemini 2.5 Flash
- `vertex/gemini-2.0-flash`: Gemini 2.0 Flash
- `vertex/gemini-2.0-flash-thinking-exp-01-21`: Experimental "thinking" model
- `vertex/gemini-1.5-flash-8b-latest`: The least expensive model
- `vertex/gemini-1.5-pro-latest`
- `vertex/gemini-1.5-flash-latest`
And many more. Use the `vertex/` prefix to reference models:

```bash
llm -m vertex/gemini-1.5-flash-8b-latest --schema 'name,age int,bio' 'invent a dog'
```

Note: This plugin provides Gemini models via Vertex AI (enterprise API). For the public Google AI Studio API, use the separate llm-gemini plugin.
Different Gemini models are available in different regions. Here's the detailed availability for the main models:
The gemini-2.5-flash (GA) model is available in the following regions:
| Region Code | Geographic Location | Notes |
|---|---|---|
| Global | Global endpoint | Limited features (no tuning, batch prediction, or RAG) |
| United States | ||
| us-central1 | Iowa, USA | |
| us-east1 | South Carolina, USA | |
| us-east4 | Northern Virginia, USA | |
| us-east5 | Columbus, Ohio, USA | |
| us-south1 | Dallas, Texas, USA | |
| us-west1 | Oregon, USA | |
| us-west4 | Las Vegas, Nevada, USA | |
| Europe | ||
| europe-central2 | Warsaw, Poland | |
| europe-north1 | Hamina, Finland | |
| europe-southwest1 | Madrid, Spain | |
| europe-west1 | St. Ghislain, Belgium | |
| europe-west4 | Eemshaven, Netherlands | |
| europe-west8 | Milan, Italy | |
| Canada | ||
| northamerica-northeast1 | Montréal, Canada | |
| Asia Pacific | ||
| asia-northeast1 | Tokyo, Japan | 128K context window only* |
| asia-northeast3 | Seoul, South Korea | 128K context window only* |
| asia-south1 | Mumbai, India | 128K context window only* |
| asia-southeast1 | Jurong West, Singapore | 128K context window only* |
| australia-southeast1 | Sydney, Australia | 128K context window only* |
*Regions marked with asterisk have limitations: 128K context window only, supervised fine-tuning not supported.
The preview model (gemini-2.5-flash-preview-09-2025) is available only via the Global endpoint.
The gemini-2.5-pro model is available in the following regions:
| Region Code | Geographic Location | Notes |
|---|---|---|
| Global | Global endpoint | Limited features (no tuning, batch prediction, or RAG) |
| United States | ||
| us-central1 | Iowa, USA | |
| us-east1 | South Carolina, USA | |
| us-east4 | Northern Virginia, USA | |
| us-east5 | Columbus, Ohio, USA | |
| us-south1 | Dallas, Texas, USA | |
| us-west1 | Oregon, USA | |
| us-west4 | Las Vegas, Nevada, USA | |
| Europe | ||
| europe-central2 | Warsaw, Poland | |
| europe-north1 | Hamina, Finland | |
| europe-southwest1 | Madrid, Spain | |
| europe-west1 | St. Ghislain, Belgium | |
| europe-west4 | Eemshaven, Netherlands | |
| europe-west8 | Milan, Italy | |
| europe-west9 | Paris, France | |
| Asia Pacific | ||
| asia-northeast1 | Tokyo, Japan | 128K context window only; supervised fine-tuning not supported |
Important Notes:
- For production use cases requiring specific features (tuning, batch prediction, RAG corpus), use a regional endpoint instead of the global endpoint
- Region availability may change; check the official Vertex AI locations documentation for the latest information
The Gemini 3 models (Pro launched November 18, 2025; Flash launched December 17, 2025) are currently only available via the Global endpoint.
| Model ID | Availability | Thinking Levels | Auto-Region Override |
|---|---|---|---|
| gemini-3-pro-preview | Global only | low, high | Yes |
| gemini-3-pro-preview-11-2025 | Global only | low, high | Yes |
| gemini-3-pro-preview-11-2025-thinking | Global only | low, high | Yes |
| gemini-3-flash-preview | Global only | minimal, low, medium, high | Yes |
| gemini-3.1-pro-preview | Global only | low, medium, high | Yes |
| gemini-3.1-pro-preview-customtools | Global only | low, medium, high | Yes |
| gemini-3.1-flash-lite-preview | Global only | minimal, low, medium, high | Yes |
Key Features:
- 1 million token context window
- 64K token output limit
- Multimodal support (text, images, audio, video)
- Google Search grounding
- Configurable thinking levels (Flash has 4 levels, Pro has 2)
- Knowledge cutoff: January 2025
Important: These models automatically use the global endpoint regardless of your configured region setting. You don't need to change your GOOGLE_CLOUD_REGION configuration - the plugin handles this automatically.
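The override rule described above can be summarized in a few lines. This is an illustrative sketch of the behavior, not the plugin's actual implementation:

```python
# Gemini 3 and 3.1 model IDs are only served from the global endpoint.
GLOBAL_ONLY_PREFIXES = ("gemini-3-", "gemini-3.1-")

def effective_endpoint(model_id: str, configured_region: str) -> str:
    """Return the endpoint actually used for a request."""
    if model_id.startswith(GLOBAL_ONLY_PREFIXES):
        return "global"  # configured region is ignored for these models
    return configured_region

print(effective_endpoint("gemini-3-flash-preview", "us-central1"))  # global
print(effective_endpoint("gemini-2.5-flash", "us-central1"))        # us-central1
```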
For more information, see the official Vertex AI documentation.
Gemini models are multi-modal. You can provide images, audio or video files as input like this:
```bash
llm -m vertex/gemini-2.5-flash 'extract text' -a image.jpg
```

Or with a URL:

```bash
llm -m vertex/gemini-2.5-flash-lite 'describe image' \
  -a https://static.simonwillison.net/static/2024/pelicans.jpg
```

Audio works too:

```bash
llm -m vertex/gemini-2.5-flash 'transcribe audio' -a audio.mp3
```

And video:

```bash
llm -m vertex/gemini-2.5-flash 'describe what happens' -a video.mp4
```

Use `-o json_object 1` to force the output to be JSON:
```bash
llm -m vertex/gemini-2.5-flash -o json_object 1 \
  '3 largest cities in California, list of {"name": "..."}'
```

Outputs:

```json
{"cities": [{"name": "Los Angeles"}, {"name": "San Diego"}, {"name": "San Jose"}]}
```

Gemini models can write and execute code: they can decide to write Python code, execute it in a secure sandbox and use the result as part of their response.
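Because `-o json_object 1` yields plain JSON, the output is easy to consume from a script. A minimal example parsing the response shown above (captured here as a string):

```python
import json

# Response text as shown above, e.g. captured from the llm command's stdout
raw = '{"cities": [{"name": "Los Angeles"}, {"name": "San Diego"}, {"name": "San Jose"}]}'

data = json.loads(raw)
names = [city["name"] for city in data["cities"]]
print(names)  # ['Los Angeles', 'San Diego', 'San Jose']
```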
To enable this feature, use `-o code_execution 1`:

```bash
llm -m vertex/gemini-2.5-flash -o code_execution 1 \
  'use python to calculate (factorial of 13) * 3'
```

Some Gemini models support Grounding with Google Search, where the model can run a Google search and use the results as part of answering a prompt.
To run a prompt with Google search enabled, use `-o google_search 1`:

```bash
llm -m vertex/gemini-2.5-flash -o google_search 1 \
  'What happened in Ireland today?'
```

Gemini models support a URL context tool which, when enabled, allows the models to fetch additional content from URLs as part of their execution.
You can enable that with the `-o url_context 1` option:

```bash
llm -m vertex/gemini-2.5-flash -o url_context 1 'Latest headline on simonwillison.net'
```

To chat interactively with the model, run `llm chat`:
```bash
llm chat -m vertex/gemini-2.5-flash
```

By default there is no timeout against the Vertex AI API. You can use the `timeout` option to protect against API requests that hang indefinitely:

```bash
llm -m vertex/gemini-2.5-flash 'epic saga about mice' -o timeout 1.5
```

This plugin provides access to Vertex AI embedding models. All embedding model IDs use the `vertex/` prefix to distinguish them from other plugins.
| Model ID | Description | Dimensions |
|---|---|---|
| `vertex/gemini-embedding-001` | Latest state-of-the-art model | 3072 (default) |
| `vertex/gemini-embedding-001-768` | Truncated to 768 dimensions | 768 |
| `vertex/gemini-embedding-001-1536` | Truncated to 1536 dimensions | 1536 |
| `vertex/text-embedding-005` | Text embedding model | 768 |
| `vertex/text-embedding-004` | Legacy text embedding model | 768 |
| `vertex/text-multilingual-embedding-002` | Multilingual embedding model | 768 |
Deprecated models (October 2025):

- `vertex/gemini-embedding-exp-03-07` (and truncated variants -128, -256, -512, -1024, -2048)
```bash
# Using the recommended model
llm embed -m vertex/gemini-embedding-001 -c 'hello world'

# Using a smaller dimension variant for efficiency
llm embed -m vertex/gemini-embedding-001-768 -c 'hello world'

# Using the multilingual model
llm embed -m vertex/text-multilingual-embedding-002 -c 'bonjour le monde'
```

See the LLM embeddings documentation for further details.
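The smaller dimension variants trade some quality for storage and compute efficiency. As a mental model (an assumption worth verifying against the Vertex AI embeddings docs, not a statement about the plugin's internals): a truncated embedding behaves like taking the first N components of the full vector and re-normalizing to unit length:

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` components and rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [3.0, 4.0, 5.0, 1.0]  # stand-in for a full 3072-dim embedding
small = truncate_and_normalize(full, 2)
print(small)  # [0.6, 0.8]
```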
- Create a GCP project at https://console.cloud.google.com
- Enable the Vertex AI API for your project
- Set up billing for your project
- Go to IAM & Admin > Service Accounts in GCP Console
- Create a new service account
- Grant it the "Vertex AI User" role
- Create and download a JSON key
- Configure the plugin with the path to this key (see Configuration above)
Vertex AI charges for model usage. See Vertex AI pricing for details.
To set up this plugin locally, first check out the code. Then create a new virtual environment:
```bash
cd llm-vertex
python3 -m venv venv
source venv/bin/activate
```

Now install the dependencies and test dependencies:

```bash
llm install -e '.[test]'
```

To run the tests:

```bash
pytest
```

Make sure you've set the `GOOGLE_CLOUD_PROJECT` environment variable or run:
```bash
llm vertex set-project your-project-id
```

Verify your authentication setup:

```bash
llm vertex config
```

For API key:

```bash
llm keys set vertex
```

For ADC:

```bash
gcloud auth application-default login
```

For service account, ensure `GOOGLE_APPLICATION_CREDENTIALS` points to a valid JSON file.

Enable the Vertex AI API:

```bash
gcloud services enable aiplatform.googleapis.com --project=your-project-id
```