Access Google's Gemini models via Vertex AI for enterprise use
Install this plugin in the same environment as LLM.
Note: This package is not published on PyPI (the `llm-vertex` name on PyPI is an unrelated project). Install directly from GitHub:

```bash
llm install "llm-vertex @ git+https://github.com/c0ffee0wl/llm-vertex.git"
```

This plugin uses Google Cloud Vertex AI, which supports three authentication methods.
**Option 1: API key.** Fastest setup, but recommended for testing only:
```bash
# Set via llm keys command
llm keys set vertex

# Or via environment variable
export GOOGLE_CLOUD_API_KEY="YOUR_API_KEY"
```

Get your API key from the Google Cloud Console.
Note: Vertex AI API keys are different from Google AI Studio keys. Make sure to create a Vertex AI-compatible API key in your GCP project. API keys are convenient for development and testing but not recommended for production. For production, use Application Default Credentials (Option 2).
**Option 2: Application Default Credentials (ADC).** If you're already using Google Cloud, authenticate with:
```bash
gcloud auth application-default login
```

This sets up Application Default Credentials (ADC) that the plugin will automatically use.
**Option 3: Service account.**

- Create a service account in your GCP project with Vertex AI User permissions
- Download the JSON key file
- Set the environment variable:
```bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```

Or configure it via the plugin:

```bash
llm vertex set-credentials /path/to/service-account.json
```

Set your Google Cloud project ID:

```bash
# Via environment variable
export GOOGLE_CLOUD_PROJECT="your-project-id"

# Or via plugin config
llm vertex set-project your-project-id
```

The plugin defaults to the global endpoint. However, the global endpoint has important limitations:
- ⚠️ Does not support tuning, batch prediction, or RAG corpus creation
- ⚠️ Does not guarantee region-specific ML processing
- ⚠️ Does not provide data residency compliance
For production use or if you need specific features, use a regional endpoint:
```bash
# Via environment variable
export GOOGLE_CLOUD_REGION="us-central1"

# Or via plugin config
llm vertex set-region us-central1
```

To see all available regions:

```bash
llm vertex list-regions
```

Available regions include:
- United States: `us-central1`, `us-east1`, `us-east4`, `us-east5`, `us-south1`, `us-west1`, `us-west4`
- Canada: `northamerica-northeast1`
- South America: `southamerica-east1`
- Europe: `europe-west1`, `europe-west2`, `europe-west3`, `europe-west4`, `europe-west6`, `europe-west8`, `europe-west9`, `europe-north1`, `europe-southwest1`, `europe-central2`
- Asia Pacific: `asia-east1`, `asia-northeast1`, `asia-northeast3`, `asia-southeast1`, `asia-south1`, `australia-southeast1`, `australia-southeast2`
- Middle East: `me-central1`, `me-central2`, `me-west1`
For the latest region availability and model-specific regional support, see the official Vertex AI locations documentation.
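If you script region selection, it can help to validate the value before handing it to the plugin. The region names below come from the list above; the helper itself and its fall-back-to-global behavior are an illustrative sketch, not part of the plugin:

```python
import os

# Regions listed in this README; check the Vertex AI locations docs for updates.
KNOWN_REGIONS = {
    "us-central1", "us-east1", "us-east4", "us-east5", "us-south1",
    "us-west1", "us-west4", "northamerica-northeast1", "southamerica-east1",
    "europe-west1", "europe-west2", "europe-west3", "europe-west4",
    "europe-west6", "europe-west8", "europe-west9", "europe-north1",
    "europe-southwest1", "europe-central2", "asia-east1", "asia-northeast1",
    "asia-northeast3", "asia-southeast1", "asia-south1",
    "australia-southeast1", "australia-southeast2",
    "me-central1", "me-central2", "me-west1",
}

def resolve_region(env: dict) -> str:
    """Return the configured region, defaulting to the global endpoint."""
    region = env.get("GOOGLE_CLOUD_REGION", "global")
    if region != "global" and region not in KNOWN_REGIONS:
        raise ValueError(f"Unknown Vertex AI region: {region}")
    return region

print(resolve_region({}))                                      # global
print(resolve_region({"GOOGLE_CLOUD_REGION": "us-central1"}))  # us-central1
```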
Check your configuration:

```bash
llm vertex config
```

Now run the model using `-m vertex/gemini-2.5-flash`, for example:

```bash
llm -m vertex/gemini-2.5-flash "A short joke about a pelican and a walrus"
```

You can set the default model to avoid the extra `-m` option:

```bash
llm models default vertex/gemini-2.5-flash
llm "A joke about a pelican and a walrus"
```

All Gemini models are available through Vertex AI:
- `vertex/gemini-3-pro-preview`: Gemini 3 Pro preview (global region only)
- `vertex/gemini-3-pro-preview-11-2025`: Gemini 3 Pro November 2025 version (global region only)
- `vertex/gemini-3-pro-preview-11-2025-thinking`: Gemini 3 Pro with thinking mode (global region only)
- `vertex/gemini-3-flash-preview`: Gemini 3 Flash preview (global region only)
Note: Gemini 3 models automatically use the global endpoint regardless of your configured region.
Thinking Levels: Gemini 3 models support configurable thinking levels:
- Flash (`gemini-3-flash-preview`): `minimal`, `low`, `medium`, `high`
- Pro (`gemini-3-pro-preview`): `low`, `high`

```bash
llm -m vertex/gemini-3-flash-preview -o thinking_level high 'complex reasoning task'
```

- `vertex/gemini-3.1-pro-preview`: Gemini 3.1 Pro preview (global region only)
- `vertex/gemini-3.1-pro-preview-customtools`: Gemini 3.1 Pro with custom tools support (global region only)
- `vertex/gemini-3.1-flash-lite-preview`: Gemini 3.1 Flash Lite preview (global region only)
Thinking Levels: Gemini 3.1 models support configurable thinking levels:
- Pro (`gemini-3.1-pro-preview`): `low`, `medium`, `high`
- Flash Lite (`gemini-3.1-flash-lite-preview`): `minimal`, `low`, `medium`, `high`

```bash
llm -m vertex/gemini-3.1-pro-preview -o thinking_level high 'complex reasoning task'
```

Note: Gemini 3.1 models automatically use the global endpoint regardless of your configured region.
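Because the allowed thinking levels differ per model, scripts that set `-o thinking_level` may want to validate the value first. The level sets below are taken from the lists above; the helper function is a hypothetical sketch, not the plugin's own code:

```python
# Thinking levels per model, as documented above (illustrative only).
THINKING_LEVELS = {
    "gemini-3-flash-preview": {"minimal", "low", "medium", "high"},
    "gemini-3-pro-preview": {"low", "high"},
    "gemini-3.1-pro-preview": {"low", "medium", "high"},
    "gemini-3.1-flash-lite-preview": {"minimal", "low", "medium", "high"},
}

def check_thinking_level(model: str, level: str) -> str:
    """Raise if `level` is not supported by `model`; otherwise return it."""
    allowed = THINKING_LEVELS.get(model)
    if allowed is None:
        raise ValueError(f"{model} does not support thinking levels")
    if level not in allowed:
        raise ValueError(f"{model} supports {sorted(allowed)}, got {level!r}")
    return level

print(check_thinking_level("gemini-3.1-pro-preview", "high"))  # high
```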
- `vertex/gemma-3-1b-it`: Gemma 3 1B (text only, no vision)
- `vertex/gemma-3-4b-it`: Gemma 3 4B
- `vertex/gemma-3-12b-it`: Gemma 3 12B
- `vertex/gemma-3-27b-it`: Gemma 3 27B
- `vertex/gemma-3n-e4b-it`: Gemma 3n E4B (text only, no vision)
Note: Gemma models do not support structured output schemas or media resolution settings. `gemma-3-1b-it` and `gemma-3n-e4b-it` do not support vision/image inputs.
- `vertex/gemini-2.5-flash-lite-preview-09-2025`
- `vertex/gemini-2.5-flash-preview-09-2025`
- `vertex/gemini-flash-lite-latest`: Latest Gemini Flash Lite
- `vertex/gemini-flash-latest`: Latest Gemini Flash
- `vertex/gemini-2.5-flash-lite`: Gemini 2.5 Flash Lite
- `vertex/gemini-2.5-pro`: Gemini 2.5 Pro
- `vertex/gemini-2.5-flash`: Gemini 2.5 Flash
- `vertex/gemini-2.0-flash`: Gemini 2.0 Flash
- `vertex/gemini-2.0-flash-thinking-exp-01-21`: Experimental "thinking" model
- `vertex/gemini-1.5-flash-8b-latest`: The least expensive model
- `vertex/gemini-1.5-pro-latest`
- `vertex/gemini-1.5-flash-latest`
And many more. Use the `vertex/` prefix to reference models:

```bash
llm -m vertex/gemini-1.5-flash-8b-latest --schema 'name,age int,bio' 'invent a dog'
```

Note: This plugin provides Gemini models via Vertex AI (enterprise API). For the public Google AI Studio API, use the separate llm-gemini plugin.
Different Gemini models are available in different regions. Here's the detailed availability for the main models:
The gemini-2.5-flash (GA) model is available in the following regions:
| Region Code | Geographic Location | Notes |
|---|---|---|
| Global | Global endpoint | Limited features (no tuning, batch prediction, or RAG) |
| United States | ||
| us-central1 | Iowa, USA | |
| us-east1 | South Carolina, USA | |
| us-east4 | Northern Virginia, USA | |
| us-east5 | Columbus, Ohio, USA | |
| us-south1 | Dallas, Texas, USA | |
| us-west1 | Oregon, USA | |
| us-west4 | Las Vegas, Nevada, USA | |
| Europe | ||
| europe-central2 | Warsaw, Poland | |
| europe-north1 | Hamina, Finland | |
| europe-southwest1 | Madrid, Spain | |
| europe-west1 | St. Ghislain, Belgium | |
| europe-west4 | Eemshaven, Netherlands | |
| europe-west8 | Milan, Italy | |
| Canada | ||
| northamerica-northeast1 | Montréal, Canada | |
| Asia Pacific | ||
| asia-northeast1 | Tokyo, Japan | 128K context window only* |
| asia-northeast3 | Seoul, South Korea | 128K context window only* |
| asia-south1 | Mumbai, India | 128K context window only* |
| asia-southeast1 | Jurong West, Singapore | 128K context window only* |
| australia-southeast1 | Sydney, Australia | 128K context window only* |
*Regions marked with asterisk have limitations: 128K context window only, supervised fine-tuning not supported.
The preview model (gemini-2.5-flash-preview-09-2025) is available only via the Global endpoint.
The gemini-2.5-pro model is available in the following regions:
| Region Code | Geographic Location | Notes |
|---|---|---|
| Global | Global endpoint | Limited features (no tuning, batch prediction, or RAG) |
| United States | ||
| us-central1 | Iowa, USA | |
| us-east1 | South Carolina, USA | |
| us-east4 | Northern Virginia, USA | |
| us-east5 | Columbus, Ohio, USA | |
| us-south1 | Dallas, Texas, USA | |
| us-west1 | Oregon, USA | |
| us-west4 | Las Vegas, Nevada, USA | |
| Europe | ||
| europe-central2 | Warsaw, Poland | |
| europe-north1 | Hamina, Finland | |
| europe-southwest1 | Madrid, Spain | |
| europe-west1 | St. Ghislain, Belgium | |
| europe-west4 | Eemshaven, Netherlands | |
| europe-west8 | Milan, Italy | |
| europe-west9 | Paris, France | |
| Asia Pacific | ||
| asia-northeast1 | Tokyo, Japan | 128K context window only; supervised fine-tuning not supported |
Important Notes:
- For production use cases requiring specific features (tuning, batch prediction, RAG corpus), use a regional endpoint instead of the global endpoint
- Region availability may change; check the official Vertex AI locations documentation for the latest information
The Gemini 3 models (Pro launched November 18, 2025; Flash launched December 17, 2025) are currently only available via the Global endpoint.
| Model ID | Availability | Thinking Levels | Auto-Region Override |
|---|---|---|---|
| gemini-3-pro-preview | Global only | low, high | Yes |
| gemini-3-pro-preview-11-2025 | Global only | low, high | Yes |
| gemini-3-pro-preview-11-2025-thinking | Global only | low, high | Yes |
| gemini-3-flash-preview | Global only | minimal, low, medium, high | Yes |
| gemini-3.1-pro-preview | Global only | low, medium, high | Yes |
| gemini-3.1-pro-preview-customtools | Global only | low, medium, high | Yes |
| gemini-3.1-flash-lite-preview | Global only | minimal, low, medium, high | Yes |
Key Features:
- 1 million token context window
- 64K token output limit
- Multimodal support (text, images, audio, video)
- Google Search grounding
- Configurable thinking levels (Flash has 4 levels, Pro has 2)
- Knowledge cutoff: January 2025
Important: These models automatically use the global endpoint regardless of your configured region setting. You don't need to change your GOOGLE_CLOUD_REGION configuration - the plugin handles this automatically.
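The override rule described above can be summarized in a few lines. This is an illustrative sketch of the behavior, not the plugin's actual implementation:

```python
# Gemini 3 and 3.1 model IDs are only served from the global endpoint.
GLOBAL_ONLY_PREFIXES = ("gemini-3-", "gemini-3.1-")

def effective_endpoint(model_id: str, configured_region: str) -> str:
    """Return the endpoint actually used for a request."""
    if model_id.startswith(GLOBAL_ONLY_PREFIXES):
        return "global"  # configured region is ignored for these models
    return configured_region

print(effective_endpoint("gemini-3-flash-preview", "us-central1"))  # global
print(effective_endpoint("gemini-2.5-flash", "us-central1"))        # us-central1
```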
For more information, see the official Vertex AI documentation.
Gemini models are multi-modal. You can provide images, audio or video files as input like this:
```bash
llm -m vertex/gemini-2.5-flash 'extract text' -a image.jpg
```

Or with a URL:

```bash
llm -m vertex/gemini-2.5-flash-lite 'describe image' \
  -a https://static.simonwillison.net/static/2024/pelicans.jpg
```

Audio works too:

```bash
llm -m vertex/gemini-2.5-flash 'transcribe audio' -a audio.mp3
```

And video:

```bash
llm -m vertex/gemini-2.5-flash 'describe what happens' -a video.mp4
```

Use `-o json_object 1` to force the output to be JSON:
```bash
llm -m vertex/gemini-2.5-flash -o json_object 1 \
  '3 largest cities in California, list of {"name": "..."}'
```

Outputs:

```json
{"cities": [{"name": "Los Angeles"}, {"name": "San Diego"}, {"name": "San Jose"}]}
```

Gemini models can write and execute code: they can decide to write Python code, execute it in a secure sandbox and use the result as part of their response.
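Because `-o json_object 1` yields plain JSON, the output is easy to consume from a script. A minimal example parsing the response shown above (captured here as a string):

```python
import json

# Response text as shown above, e.g. captured from the llm command's stdout
raw = '{"cities": [{"name": "Los Angeles"}, {"name": "San Diego"}, {"name": "San Jose"}]}'

data = json.loads(raw)
names = [city["name"] for city in data["cities"]]
print(names)  # ['Los Angeles', 'San Diego', 'San Jose']
```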
To enable this feature, use `-o code_execution 1`:

```bash
llm -m vertex/gemini-2.5-flash -o code_execution 1 \
  'use python to calculate (factorial of 13) * 3'
```

Some Gemini models support Grounding with Google Search, where the model can run a Google search and use the results as part of answering a prompt.
To run a prompt with Google search enabled, use `-o google_search 1`:

```bash
llm -m vertex/gemini-2.5-flash -o google_search 1 \
  'What happened in Ireland today?'
```

Gemini models support a URL context tool which, when enabled, allows the models to fetch additional content from URLs as part of their execution.
You can enable that with the `-o url_context 1` option:

```bash
llm -m vertex/gemini-2.5-flash -o url_context 1 'Latest headline on simonwillison.net'
```

To chat interactively with the model, run `llm chat`:
```bash
llm chat -m vertex/gemini-2.5-flash
```

By default there is no timeout against the Vertex AI API. You can use the `timeout` option to protect against API requests that hang indefinitely:

```bash
llm -m vertex/gemini-2.5-flash 'epic saga about mice' -o timeout 1.5
```

This plugin provides access to Vertex AI embedding models. All embedding model IDs use the `vertex/` prefix to distinguish them from other plugins.
| Model ID | Description | Dimensions |
|---|---|---|
| `vertex/gemini-embedding-001` | Latest state-of-the-art model | 3072 (default) |
| `vertex/gemini-embedding-001-768` | Truncated to 768 dimensions | 768 |
| `vertex/gemini-embedding-001-1536` | Truncated to 1536 dimensions | 1536 |
| `vertex/text-embedding-005` | Text embedding model | 768 |
| `vertex/text-embedding-004` | Legacy text embedding model | 768 |
| `vertex/text-multilingual-embedding-002` | Multilingual embedding model | 768 |
Deprecated models (October 2025):

- `vertex/gemini-embedding-exp-03-07` (and truncated variants -128, -256, -512, -1024, -2048)
```bash
# Using the recommended model
llm embed -m vertex/gemini-embedding-001 -c 'hello world'

# Using a smaller dimension variant for efficiency
llm embed -m vertex/gemini-embedding-001-768 -c 'hello world'

# Using the multilingual model
llm embed -m vertex/text-multilingual-embedding-002 -c 'bonjour le monde'
```

See the LLM embeddings documentation for further details.
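The smaller dimension variants trade some quality for storage and compute efficiency. As a mental model (an assumption worth verifying against the Vertex AI embeddings docs, not a statement about the plugin's internals): a truncated embedding behaves like taking the first N components of the full vector and re-normalizing to unit length:

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` components and rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [3.0, 4.0, 5.0, 1.0]  # stand-in for a full 3072-dim embedding
small = truncate_and_normalize(full, 2)
print(small)  # [0.6, 0.8]
```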
- Create a GCP project at https://console.cloud.google.com
- Enable the Vertex AI API for your project
- Set up billing for your project
- Go to IAM & Admin > Service Accounts in GCP Console
- Create a new service account
- Grant it the "Vertex AI User" role
- Create and download a JSON key
- Configure the plugin with the path to this key (see Configuration above)
Vertex AI charges for model usage. See Vertex AI pricing for details.
To set up this plugin locally, first check out the code. Then create a new virtual environment:
```bash
cd llm-vertex
python3 -m venv venv
source venv/bin/activate
```

Now install the dependencies and test dependencies:

```bash
llm install -e '.[test]'
```

To run the tests:

```bash
pytest
```

Make sure you've set the `GOOGLE_CLOUD_PROJECT` environment variable or run:
```bash
llm vertex set-project your-project-id
```

Verify your authentication setup:

```bash
llm vertex config
```

For API key:

```bash
llm keys set vertex
```

For ADC:

```bash
gcloud auth application-default login
```

For service account, ensure `GOOGLE_APPLICATION_CREDENTIALS` points to a valid JSON file.

Enable the Vertex AI API:

```bash
gcloud services enable aiplatform.googleapis.com --project=your-project-id
```