This project lets you self-host an OpenAI-compatible chat completion API endpoint powered by one of the Google Gemma 3 models (defaulting to `google/gemma-3-12b-it`). Deployed as a FastAPI server inside a Docker container, the solution supports multimodal inputs with a focus on text and images, leveraging NVIDIA GPU acceleration for efficient inference.
Tested on an AWS G6 instance.
- Multimodal Capabilities: Process text and image inputs seamlessly
- OpenAI Compatibility: Integrate easily with the OpenAI Python client or any HTTP client
- GPU Support: Utilizes NVIDIA GPUs with automatic model distribution across available devices
- Persistent Caching: Stores model weights in a Docker volume to avoid repeated downloads
- Advanced Generation Parameters: Supports fine-tuned control over generation with parameters like temperature, top_p, top_k, min_p, and repetition penalty
- Health Monitoring: Built-in health check endpoint
Before you begin, ensure the following are installed on your system:
- Docker: For containerization
- Docker Compose: For managing multi-container setups
- NVIDIA Container Toolkit: For GPU support within Docker
- Compatible NVIDIA GPU(s): With at least 24GB VRAM per GPU for optimal performance
- Hugging Face Account: To access the model weights (free account required)
Follow these steps to build and run the self-hosted endpoint.
```bash
git clone https://github.com/anastasiosyal/google-gemma3-inference-server.git
cd google-gemma3-inference-server
```
Create a `.env` file in the project root:
```
# Required: Your Hugging Face token for downloading model weights
HF_TOKEN=your_hugging_face_token_here

# Optional: Change the model (defaults to google/gemma-3-12b-it)
MODEL_ID=google/gemma-3-12b-it
```
You can get your Hugging Face token from your account settings.
Simply run:
```bash
docker compose up -d
```
This command will:
- Build the container if needed
- Mount necessary volumes for model caching
- Configure GPU access
- Set up environment variables
- Start the service in detached mode
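For orientation, the compose setup behind these steps typically looks something like the sketch below. This is illustrative only: the service name `gemma3-server` and port 8100 come from this README, but the volume name, build context, and cache path are assumptions, so consult the repository's actual `docker-compose.yml`.

```yaml
# Illustrative sketch only -- the repository ships its own docker-compose.yml.
# Volume name and cache path below are assumptions.
services:
  gemma3-server:
    build: .
    ports:
      - "8100:8100"
    env_file:
      - .env                                    # supplies HF_TOKEN and MODEL_ID
    volumes:
      - model-cache:/root/.cache/huggingface    # persistent model weight cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all                        # expose all available GPUs
              capabilities: [gpu]

volumes:
  model-cache:
```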
The API will be available at http://localhost:8100.
To view the logs:
```bash
docker compose logs -f
```
To stop the service:
```bash
docker compose down
```
You can test the model directly using the official Hugging Face example script included in the container:
```bash
# Connect to the running container
docker compose exec -it gemma3-server bash

# Navigate to the app directory (where test_model.py is located)
cd /app

# Run the test script
python3 test_model.py
```
This runs the example script from the official Gemma 3 Hugging Face repository, which demonstrates the model's basic capabilities without going through the API endpoint. It's a good way to verify that:
- The model loads correctly
- Your GPU setup is working
- Token authentication is successful
- Basic inference is working as expected
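Beyond the bundled test script, you can also confirm that the HTTP API itself is reachable. A minimal sketch using only the standard library, targeting the `/health` endpoint described later in this README (the exact response body is not assumed, only the status code):

```python
import urllib.request
import urllib.error

def check_health(base_url="http://localhost:8100", timeout=5):
    """Return True if the server's /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Server not running, port closed, or request timed out
        return False

if __name__ == "__main__":
    print("healthy" if check_health() else "unreachable")
```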
Interact with the endpoint using the OpenAI Python client or any HTTP client. Below are examples demonstrating different types of requests.
```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8100/v1",
    api_key="test"  # API key not required in current implementation
)

response = client.chat.completions.create(
    model="google/gemma-3-12b-it",
    messages=[{
        "role": "user",
        "content": [{
            "type": "text",
            "text": "Explain quantum computing in simple terms."
        }]
    }],
    max_completion_tokens=1000,
    temperature=0.7,
    top_p=0.95,
    # top_k, min_p and repetition_penalty are not part of the standard
    # OpenAI schema, so they must be passed via extra_body
    extra_body={
        "top_k": 64,
        "min_p": 0.01,
        "repetition_penalty": 1.0
    }
)

print(response.choices[0].message.content)
```
```python
import openai
import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

image_path = "image.jpg"
image_base64 = encode_image(image_path)

client = openai.OpenAI(
    base_url="http://localhost:8100/v1",
    api_key="test"
)

response = client.chat.completions.create(
    model="google/gemma-3-12b-it",
    messages=[{
        "role": "user",
        "content": [{
            "type": "text",
            "text": "Describe this image in detail."
        }, {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{image_base64}"
            }
        }]
    }],
    max_completion_tokens=1000
)

print(response.choices[0].message.content)
```
The endpoint supports the following parameters for fine-tuned control:
- `max_completion_tokens` (default: 1000): Maximum number of tokens to generate
- `temperature` (default: 1.0): Controls randomness in generation
- `top_p` (default: 0.95): Nucleus sampling parameter
- `top_k` (default: 64): Top-k sampling parameter
- `min_p` (default: 0.01): Minimum probability for token selection
- `repetition_penalty` (default: 1.0): Penalty for repeating tokens
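Since the endpoint is plain HTTP, any client can send these parameters directly in the JSON body. A minimal standard-library sketch, assuming the server accepts the extended sampling fields as top-level keys alongside the usual chat-completion payload (as the parameter list above suggests):

```python
import json
import urllib.request

def build_payload(prompt, model="google/gemma-3-12b-it"):
    """Assemble a chat-completion request body with the extended sampling parameters."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
        "max_completion_tokens": 1000,
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 64,
        "min_p": 0.01,
        "repetition_penalty": 1.0,
    }

def chat(prompt, base_url="http://localhost:8100"):
    """POST the payload to the chat completions route and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Calling chat(...) requires the server to be running locally;
    # here we just show the request body that would be sent.
    print(json.dumps(build_payload("Explain quantum computing in simple terms."), indent=2))
```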
Monitor the service status using the health endpoint:
```bash
curl http://localhost:8100/health
```
The system architecture includes several key components:
- Uses the Gemma 3 12B instruction-tuned model by default
- Automatically distributes the model across available GPUs using Hugging Face's `device_map="auto"`
- Uses bfloat16 precision for optimal performance
- Configures a 24GB memory limit per GPU
- Request logging and error handling
- Support for system prompts and chat history
- Flash Attention 2.7.4 for efficient attention computation
- Automatic model sharding across multiple GPUs
- Persistent caching of model weights
- GPU Memory Issues: Ensure you have sufficient VRAM (24GB per GPU recommended)
- Model Loading Errors:
  - Verify your Hugging Face token is correct in the `.env` file
  - Check the logs for detailed error messages using `docker compose logs -f`
  - Verify the Hugging Face cache permissions
- Generation Issues: Verify input format and adjust generation parameters as needed
- Performance Problems: Monitor GPU utilization and adjust worker count if necessary
Contributions are welcome! Please feel free to submit issues or pull requests.
This project is released under the MIT License.