A FastAPI-based AI agent that provides customer service support for Koozie Group using Vertex AI's Gemini 2.5 Flash model with streaming responses for minimal latency.
- ✅ Streaming Responses: Token-by-token streaming for voice applications (minimizes latency)
- ✅ Koozie Context: Full product catalog and support information loaded into every request
- ✅ Two Endpoints: `/chat` (streaming) and `/chat/sync` (non-streaming)
- ✅ Hot Reload Development: Docker Compose setup for rapid testing
- ✅ GCP Ready: Cloud Build configuration for automated deployment
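The token-by-token streaming above can be illustrated with a small generator that encodes model tokens into the SSE wire format `/chat` emits. This is a simplified sketch, not the actual `main.py` implementation; the hard-coded token list stands in for the Vertex AI streaming iterator.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Encode a stream of text tokens as Server-Sent Events.

    Each event carries a JSON payload with the token text and a
    `done` flag; a final empty event signals end of stream.
    """
    for token in tokens:
        yield f'data: {json.dumps({"text": token, "done": False})}\n\n'
    yield f'data: {json.dumps({"text": "", "done": True})}\n\n'


# In the real service these tokens would come from the Vertex AI
# streaming response rather than a hard-coded list.
events = list(sse_events(["We", " offer", " a wide"]))
```

In FastAPI, a generator like this would typically be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`.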
Create a `.env` file (see `.env.example`):

```env
GCP_PROJECT_ID=heyai-backend
GCP_REGION=us-central1
GCP_PROJECT_NUMBER=127756525541
VERTEX_AI_LOCATION=us-central1
VERTEX_AI_MODEL=gemini-2.5-flash
```

Prerequisites:

- Docker and Docker Compose installed
- GCP credentials configured (via `gcloud auth application-default login` or a service account)
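`main.py` presumably reads these variables through environment lookups; a minimal sketch of that pattern, using the variable names from the `.env` example above (the fallback values here are illustrative assumptions, not confirmed from the source):

```python
import os

# Read configuration from the environment, falling back to the
# values shown in .env.example (fallbacks are illustrative only).
config = {
    "project_id": os.getenv("GCP_PROJECT_ID", "heyai-backend"),
    "location": os.getenv("VERTEX_AI_LOCATION", "us-central1"),
    "model": os.getenv("VERTEX_AI_MODEL", "gemini-2.5-flash"),
}
```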
```bash
# Start the development server with hot reload
docker-compose -f docker-compose.dev.yml up

# The server will be available at http://localhost:8080
```

```bash
# Make the test script executable
chmod +x test_endpoints.sh

# Run tests
./test_endpoints.sh
```

Or test manually:
```bash
# Health check
curl http://localhost:8080/health

# Streaming chat (for voice apps)
curl -sN -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is a Koozie?"}'

# Synchronous chat (for testing)
curl -X POST http://localhost:8080/chat/sync \
  -H "Content-Type: application/json" \
  -d '{"message": "Tell me about your pens."}'
```

### `GET /health`

Health check endpoint. Returns server status and configuration.
Response:

```json
{
  "status": "healthy",
  "project_id": "heyai-backend",
  "location": "us-central1",
  "model": "gemini-2.5-flash",
  "context_loaded": true,
  "vertex_ai_initialized": true
}
```

### `POST /chat`

Streaming chat endpoint. Returns Server-Sent Events (SSE) with tokens as they're generated.
Request:

```json
{
  "message": "What products do you offer?",
  "conversation_history": [
    {"role": "user", "content": "Hello"},
    {"role": "model", "content": "Hi! How can I help you?"}
  ]
}
```

Response: a Server-Sent Events stream

```
data: {"text": "We", "done": false}
data: {"text": " offer", "done": false}
data: {"text": " a wide", "done": false}
...
data: {"text": "", "done": true}
```
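On the client side, a stream like this can be consumed by parsing each `data:` line and concatenating tokens until `done` is true. The sketch below covers the parsing only (a real voice client would read lines from the HTTP response as they arrive; `collect_sse_message` is a hypothetical helper, not part of this repo):

```python
import json
from typing import Iterable


def collect_sse_message(lines: Iterable[str]) -> str:
    """Reassemble the full reply from a /chat SSE stream."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines between events
        event = json.loads(line[len("data:"):].strip())
        if event.get("done"):
            break
        parts.append(event.get("text", ""))
    return "".join(parts)


# The example stream from the section above:
stream = [
    'data: {"text": "We", "done": false}',
    'data: {"text": " offer", "done": false}',
    'data: {"text": " a wide", "done": false}',
    'data: {"text": "", "done": true}',
]
message = collect_sse_message(stream)
```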
### `POST /chat/sync`

Synchronous chat endpoint. Returns the complete response in a single JSON payload.

Request: Same as `/chat`

Response:

```json
{
  "status": "success",
  "message": "We offer a wide range of promotional products..."
}
```

Deployment prerequisites:

- GCP project with Vertex AI API enabled
- Artifact Registry repository created
- Cloud Build trigger configured
The `cloudbuild.yaml` is configured to:
- Build Docker image
- Push to Artifact Registry
- Deploy to Cloud Run
Simply push to your repository and the Cloud Build trigger will handle deployment.
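A `cloudbuild.yaml` covering those three steps typically looks something like the sketch below. The image path and the Artifact Registry repository name `koozie-repo` are assumptions for illustration; the actual file in this repo may differ.

```yaml
steps:
  # Build the Docker image
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'us-central1-docker.pkg.dev/$PROJECT_ID/koozie-repo/koozie-agent-service', '.']
  # Push to Artifact Registry
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'us-central1-docker.pkg.dev/$PROJECT_ID/koozie-repo/koozie-agent-service']
  # Deploy to Cloud Run
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: gcloud
    args: ['run', 'deploy', 'koozie-agent-service',
           '--image', 'us-central1-docker.pkg.dev/$PROJECT_ID/koozie-repo/koozie-agent-service',
           '--region', 'us-central1']
images:
  - 'us-central1-docker.pkg.dev/$PROJECT_ID/koozie-repo/koozie-agent-service'
```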
To deploy manually:

```bash
# Build and push the image
gcloud builds submit --config cloudbuild.yaml

# Or deploy directly to Cloud Run
gcloud run deploy koozie-agent-service \
  --source . \
  --region us-central1 \
  --allow-unauthenticated \
  --set-env-vars="GCP_PROJECT_ID=heyai-backend,VERTEX_AI_LOCATION=us-central1,VERTEX_AI_MODEL=gemini-2.5-flash"
```

Project structure:

```
test-agent/
├── main.py                  # FastAPI server with Vertex AI integration
├── context.txt              # Koozie Group product catalog and support info
├── requirements.txt         # Python dependencies
├── Dockerfile               # Production container
├── Dockerfile.dev           # Development container
├── docker-compose.dev.yml   # Hot reload development setup
├── cloudbuild.yaml          # GCP Cloud Build configuration
├── .env.example             # Environment variable template
├── .gitignore               # Git ignore rules
└── test_endpoints.sh        # Test script
```
- The server loads `context.txt` at startup and includes it in every request via system instructions
- The streaming endpoint uses Server-Sent Events (SSE) for real-time token delivery
- GCP credentials are automatically detected via Application Default Credentials
- The service is configured for Cloud Run deployment with auto-scaling