diff --git a/BUILDZ_AI_FIXES.md b/BUILDZ_AI_FIXES.md new file mode 100644 index 00000000000..a97ea24b353 --- /dev/null +++ b/BUILDZ_AI_FIXES.md @@ -0,0 +1,168 @@ +# buildz.ai Error Fixes + +This document outlines the fixes applied to resolve the 500 error and WebSocket connection issues on buildz.ai/workspace. + +## Issues Identified + +1. **API 500 Error**: `/api/workspaces` endpoint was failing due to insufficient error handling +2. **WebSocket Connection Failure**: `wss://buildz.ai/socket.io/` connection was being closed before establishment + +## Root Causes + +### 1. API Error Handling +- The `/api/workspaces` endpoint lacked comprehensive try-catch error handling +- Database connection errors or session issues were not properly caught and logged +- Error responses didn't provide sufficient debugging information + +### 2. WebSocket Configuration Issues +- Missing WebSocket-specific ingress annotations for GKE +- No BackendConfig for proper WebSocket connection handling +- Potential CORS configuration issues for buildz.ai domain +- Client-side socket URL validation needed improvement + +## Fixes Applied + +### 1. Enhanced API Error Handling + +**File**: `apps/sim/app/api/workspaces/route.ts` + +- Wrapped the entire GET function in try-catch block +- Added comprehensive logging for debugging +- Enhanced error responses with detailed error messages +- Added user context logging for better troubleshooting + +### 2. WebSocket Infrastructure Improvements + +**File**: `helm/sim/examples/ingress-buildz.yaml` + +- Added WebSocket-specific annotations: + - `nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"` + - `nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"` + - `cloud.google.com/backend-config` reference for WebSocket support + +**File**: `helm/sim/examples/backend-config-buildz.yaml` (NEW) + +- Created BackendConfig for WebSocket connections: + - Connection draining for graceful shutdowns + - Extended timeout for WebSocket connections (3600s) + - Session affinity with CLIENT_IP + - Health check configuration pointing to `/health` endpoint + +### 3. Socket Server CORS Configuration + +**File**: `apps/sim/socket-server/config/socket.ts` + +- Explicitly added buildz.ai domains to allowed origins: + - `https://buildz.ai` + - `https://www.buildz.ai` + +### 4. Client-Side Socket Configuration + +**File**: `apps/sim/contexts/socket-context.tsx` + +- Added validation to detect socket URL misconfigurations +- Enhanced logging for socket connection debugging +- Added environment variable debugging information + +## Deployment Instructions + +### Automatic Deployment + +Run the deployment script: + +```bash +./deploy-buildz-fix.sh +``` + +### Manual Deployment + +1. **Apply BackendConfig**: + ```bash + kubectl apply -f helm/sim/examples/backend-config-buildz.yaml + ``` + +2. **Update Ingress**: + ```bash + kubectl apply -f helm/sim/examples/ingress-buildz.yaml + ``` + +3. **Upgrade Helm Deployment**: + ```bash + helm upgrade sim-gcp ./helm/sim \ + --namespace simstudio \ + --values helm/sim/examples/values-gcp-buildz.yaml \ + --wait \ + --timeout=10m + ``` + +4. **Verify Deployment**: + ```bash + kubectl get pods -n simstudio -l app.kubernetes.io/name=sim-gcp + kubectl get ingress -n simstudio sim-ingress + ``` + +## Verification Steps + +### 1. API Endpoint Test +```bash +# Test the workspaces API (requires authentication) +curl -H "Authorization: Bearer " https://buildz.ai/api/workspaces +``` + +### 2. WebSocket Health Check +```bash +# Test WebSocket server health +curl https://ws.buildz.ai/health +``` + +### 3. WebSocket Connection Test +- Open browser DevTools on https://buildz.ai +- Navigate to Network tab, filter by "WS" +- Look for successful socket.io connections to ws.buildz.ai + +### 4. Diagnostic Script +Run the diagnostic script for comprehensive checks: +```bash +./diagnose-buildz-issue.sh +``` + +## Expected Behavior After Fixes + +1. **API Endpoints**: Should return proper JSON responses or detailed error messages instead of generic 500 errors +2. **WebSocket Connections**: Should successfully connect to `wss://ws.buildz.ai/socket.io/` +3. **Real-time Features**: Collaborative editing, presence indicators, and live updates should work properly + +## Monitoring and Troubleshooting + +### Log Monitoring +```bash +# Monitor application logs +kubectl logs -n simstudio -l app.kubernetes.io/name=sim-gcp -f + +# Monitor WebSocket server logs +kubectl logs -n simstudio -l app=sim-gcp-realtime -f +``` + +### Common Issues and Solutions + +1. **DNS Resolution**: Ensure ws.buildz.ai resolves correctly +2. **SSL Certificate**: Verify certificate covers both buildz.ai and ws.buildz.ai +3. **Environment Variables**: Check NEXT_PUBLIC_SOCKET_URL is set correctly in pods +4. **Load Balancer**: Ensure GKE ingress properly routes WebSocket traffic + +## Files Modified + +1. `apps/sim/app/api/workspaces/route.ts` - Enhanced error handling +2. `apps/sim/socket-server/config/socket.ts` - Added buildz.ai CORS origins +3. `apps/sim/contexts/socket-context.tsx` - Added URL validation and debugging +4. `helm/sim/examples/ingress-buildz.yaml` - Added WebSocket annotations +5. `helm/sim/examples/backend-config-buildz.yaml` - NEW: WebSocket backend config + +## Files Created + +1. `deploy-buildz-fix.sh` - Automated deployment script +2. `diagnose-buildz-issue.sh` - Diagnostic script +3. `helm/sim/examples/backend-config-buildz.yaml` - WebSocket backend configuration +4. `BUILDZ_AI_FIXES.md` - This documentation + +The fixes address both the immediate 500 error and the underlying WebSocket connectivity issues, providing a more robust and debuggable system. \ No newline at end of file diff --git a/apps/sim/app/api/workspaces/route.ts b/apps/sim/app/api/workspaces/route.ts index b184ca6e864..48f059f328d 100644 --- a/apps/sim/app/api/workspaces/route.ts +++ b/apps/sim/app/api/workspaces/route.ts @@ -10,46 +10,62 @@ const logger = createLogger('Workspaces') // Get all workspaces for the current user export async function GET() { - const session = await getSession() + try { + const session = await getSession() - if (!session?.user?.id) { - return NextResponse.json({ error: 'Unauthorized' }, { status: 401 }) - } + if (!session?.user?.id) { + return NextResponse.json({ error: 'Unauthorized' }, { status: 401 }) + } - // Get all workspaces where the user has permissions - const userWorkspaces = await db - .select({ - workspace: workspace, - permissionType: permissions.permissionType, - }) - .from(permissions) - .innerJoin(workspace, eq(permissions.entityId, workspace.id)) - .where(and(eq(permissions.userId, session.user.id), eq(permissions.entityType, 'workspace'))) - .orderBy(desc(workspace.createdAt)) + logger.info('Fetching workspaces for user', { userId: session.user.id }) + + // Get all workspaces where the user has permissions + const userWorkspaces = await db + .select({ + workspace: workspace, + permissionType: permissions.permissionType, + }) + .from(permissions) + .innerJoin(workspace, eq(permissions.entityId, workspace.id)) + .where(and(eq(permissions.userId, session.user.id), eq(permissions.entityType, 'workspace'))) + .orderBy(desc(workspace.createdAt)) - if (userWorkspaces.length === 0) { - // Create a default workspace for the user - const defaultWorkspace = await createDefaultWorkspace(session.user.id, session.user.name) + if (userWorkspaces.length === 0) { + logger.info('No workspaces found, creating default workspace', { userId: session.user.id }) + // Create a default workspace for the user + const defaultWorkspace = await createDefaultWorkspace(session.user.id, session.user.name) - // Migrate existing workflows to the default workspace - await migrateExistingWorkflows(session.user.id, defaultWorkspace.id) + // Migrate existing workflows to the default workspace + await migrateExistingWorkflows(session.user.id, defaultWorkspace.id) - return NextResponse.json({ workspaces: [defaultWorkspace] }) - } + return NextResponse.json({ workspaces: [defaultWorkspace] }) + } + + // If user has workspaces but might have orphaned workflows, migrate them + await ensureWorkflowsHaveWorkspace(session.user.id, userWorkspaces[0].workspace.id) - // If user has workspaces but might have orphaned workflows, migrate them - await ensureWorkflowsHaveWorkspace(session.user.id, userWorkspaces[0].workspace.id) + // Format the response with permission information + const workspacesWithPermissions = userWorkspaces.map( + ({ workspace: workspaceDetails, permissionType }) => ({ + ...workspaceDetails, + role: permissionType === 'admin' ? 'owner' : 'member', // Map admin to owner for compatibility + permissions: permissionType, + }) + ) - // Format the response with permission information - const workspacesWithPermissions = userWorkspaces.map( - ({ workspace: workspaceDetails, permissionType }) => ({ - ...workspaceDetails, - role: permissionType === 'admin' ? 'owner' : 'member', // Map admin to owner for compatibility - permissions: permissionType, + logger.info('Successfully fetched workspaces', { + userId: session.user.id, + workspaceCount: workspacesWithPermissions.length }) - ) - return NextResponse.json({ workspaces: workspacesWithPermissions }) + return NextResponse.json({ workspaces: workspacesWithPermissions }) + } catch (error) { + logger.error('Failed to fetch workspaces:', error) + return NextResponse.json( + { error: 'Failed to fetch workspaces', details: error instanceof Error ? error.message : 'Unknown error' }, + { status: 500 } + ) + } } // POST /api/workspaces - Create a new workspace diff --git a/apps/sim/contexts/socket-context.tsx b/apps/sim/contexts/socket-context.tsx index 56160cf3e9c..bcdf10f0339 100644 --- a/apps/sim/contexts/socket-context.tsx +++ b/apps/sim/contexts/socket-context.tsx @@ -165,12 +165,24 @@ export function SocketProvider({ children, user }: SocketProviderProps) { const token = await generateSocketToken() const socketUrl = getEnv('NEXT_PUBLIC_SOCKET_URL') || 'http://localhost:3002' + + // Validate that we have a proper socket URL and it's not defaulting to the main domain + if (socketUrl.includes('buildz.ai') && !socketUrl.includes('ws.buildz.ai')) { + logger.error('Invalid socket URL detected - should use ws.buildz.ai subdomain', { + socketUrl, + envVar: getEnv('NEXT_PUBLIC_SOCKET_URL'), + processEnv: process.env.NEXT_PUBLIC_SOCKET_URL, + }) + throw new Error('Socket server URL misconfiguration detected') + } logger.info('Attempting to connect to Socket.IO server', { url: socketUrl, userId: user?.id || 'no-user', hasToken: !!token, timestamp: new Date().toISOString(), + envVar: getEnv('NEXT_PUBLIC_SOCKET_URL'), + processEnv: process.env.NEXT_PUBLIC_SOCKET_URL, }) const socketInstance = io(socketUrl, { diff --git a/apps/sim/socket-server/config/socket.ts b/apps/sim/socket-server/config/socket.ts index eeb36dbdb74..c1bd110f62d 100644 --- a/apps/sim/socket-server/config/socket.ts +++ b/apps/sim/socket-server/config/socket.ts @@ -15,6 +15,9 @@ function getAllowedOrigins(): string[] { env.NEXT_PUBLIC_VERCEL_URL, 'http://localhost:3000', 'http://localhost:3001', + // Explicitly add buildz.ai domains + 'https://buildz.ai', + 'https://www.buildz.ai', ...(env.ALLOWED_ORIGINS?.split(',') || []), ].filter((url): url is string => Boolean(url)) diff --git a/deploy-buildz-fix.sh b/deploy-buildz-fix.sh new file mode 100755 index 00000000000..76802f5f78d --- /dev/null +++ b/deploy-buildz-fix.sh @@ -0,0 +1,59 @@ +#!/bin/bash + +# Deployment script for buildz.ai WebSocket and API fixes +# This script applies the necessary configurations to fix the 500 error and WebSocket connection issues + +set -e + +echo "🚀 Deploying fixes for buildz.ai..." + +# Check if kubectl is available +if ! command -v kubectl &> /dev/null; then + echo "❌ kubectl not found. Please install kubectl and configure it for your cluster." + exit 1 +fi + +# Check if we're in the right directory +if [[ ! -f "helm/sim/examples/values-gcp-buildz.yaml" ]]; then + echo "❌ Please run this script from the project root directory" + exit 1 +fi + +echo "📋 Applying configurations..." + +# Apply the backend config for WebSocket support +echo "1. Creating BackendConfig for WebSocket support..." +kubectl apply -f helm/sim/examples/backend-config-buildz.yaml + +# Apply the updated ingress configuration +echo "2. Updating Ingress configuration..." +kubectl apply -f helm/sim/examples/ingress-buildz.yaml + +# Update the Helm deployment with the latest configurations +echo "3. Upgrading Helm deployment..." +helm upgrade sim-gcp ./helm/sim \ + --namespace simstudio \ + --values helm/sim/examples/values-gcp-buildz.yaml \ + --wait \ + --timeout=10m + +echo "4. Waiting for pods to be ready..." +kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=sim-gcp --namespace=simstudio --timeout=300s + +echo "5. Checking deployment status..." +kubectl get pods -n simstudio -l app.kubernetes.io/name=sim-gcp + +echo "6. Checking ingress status..." +kubectl get ingress -n simstudio sim-ingress + +echo "✅ Deployment completed successfully!" +echo "" +echo "🔍 To verify the fixes:" +echo "1. Check API endpoint: curl -H 'Authorization: Bearer ' https://buildz.ai/api/workspaces" +echo "2. Check WebSocket health: curl https://ws.buildz.ai/health" +echo "3. Monitor logs: kubectl logs -n simstudio -l app=sim-gcp-realtime -f" +echo "" +echo "📝 If issues persist, check:" +echo "- DNS resolution for ws.buildz.ai" +echo "- SSL certificate for ws.buildz.ai subdomain" +echo "- Environment variable NEXT_PUBLIC_SOCKET_URL in the app pods" \ No newline at end of file diff --git a/diagnose-buildz-issue.sh b/diagnose-buildz-issue.sh new file mode 100755 index 00000000000..e8c0dbf01d7 --- /dev/null +++ b/diagnose-buildz-issue.sh @@ -0,0 +1,149 @@ +#!/bin/bash + +# Diagnostic script for buildz.ai issues +# This script helps identify the root cause of the API 500 error and WebSocket connection issues + +set -e + +echo "🔍 Diagnosing buildz.ai issues..." + +# Function to check if a command exists +command_exists() { + command -v "$1" >/dev/null 2>&1 +} + +# Function to test HTTP endpoint +test_http() { + local url="$1" + local description="$2" + echo "Testing $description: $url" + + if command_exists curl; then + response=$(curl -s -o /dev/null -w "%{http_code}" "$url" 2>/dev/null || echo "000") + if [[ "$response" == "200" ]]; then + echo "✅ $description: OK ($response)" + else + echo "❌ $description: Failed ($response)" + fi + else + echo "⚠️ curl not available, skipping HTTP test" + fi +} + +# Function to test DNS resolution +test_dns() { + local domain="$1" + echo "Testing DNS resolution for $domain..." + + if command_exists nslookup; then + if nslookup "$domain" >/dev/null 2>&1; then + echo "✅ DNS resolution for $domain: OK" + else + echo "❌ DNS resolution for $domain: Failed" + fi + else + echo "⚠️ nslookup not available, skipping DNS test" + fi +} + +# Function to check Kubernetes resources +check_k8s_resources() { + if ! command_exists kubectl; then + echo "⚠️ kubectl not available, skipping Kubernetes checks" + return + fi + + echo "📋 Checking Kubernetes resources..." + + # Check if namespace exists + if kubectl get namespace simstudio >/dev/null 2>&1; then + echo "✅ Namespace 'simstudio' exists" + else + echo "❌ Namespace 'simstudio' not found" + return + fi + + # Check pods + echo "Pod status:" + kubectl get pods -n simstudio -l app.kubernetes.io/name=sim-gcp 2>/dev/null || echo "❌ No app pods found" + kubectl get pods -n simstudio -l app=sim-gcp-realtime 2>/dev/null || echo "❌ No realtime pods found" + + # Check services + echo "Service status:" + kubectl get svc -n simstudio sim-gcp-app 2>/dev/null || echo "❌ App service not found" + kubectl get svc -n simstudio sim-gcp-realtime 2>/dev/null || echo "❌ Realtime service not found" + + # Check ingress + echo "Ingress status:" + kubectl get ingress -n simstudio sim-ingress 2>/dev/null || echo "❌ Ingress not found" + + # Check backend config + echo "BackendConfig status:" + kubectl get backendconfig -n simstudio sim-gcp-realtime-backendconfig 2>/dev/null || echo "⚠️ BackendConfig not found (may not be applied yet)" +} + +# Function to check environment variables in pods +check_env_vars() { + if ! command_exists kubectl; then + echo "⚠️ kubectl not available, skipping environment variable checks" + return + fi + + echo "🔧 Checking environment variables in pods..." + + # Get first app pod + local app_pod=$(kubectl get pods -n simstudio -l app.kubernetes.io/name=sim-gcp -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) + if [[ -n "$app_pod" ]]; then + echo "Checking environment variables in app pod: $app_pod" + echo "NEXT_PUBLIC_APP_URL:" + kubectl exec -n simstudio "$app_pod" -- printenv NEXT_PUBLIC_APP_URL 2>/dev/null || echo "❌ Not set" + echo "NEXT_PUBLIC_SOCKET_URL:" + kubectl exec -n simstudio "$app_pod" -- printenv NEXT_PUBLIC_SOCKET_URL 2>/dev/null || echo "❌ Not set" + echo "DATABASE_URL:" + kubectl exec -n simstudio "$app_pod" -- printenv DATABASE_URL 2>/dev/null | sed 's/.*@/***@/' || echo "❌ Not set" + else + echo "❌ No app pods found" + fi + + # Get first realtime pod + local realtime_pod=$(kubectl get pods -n simstudio -l app=sim-gcp-realtime -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) + if [[ -n "$realtime_pod" ]]; then + echo "Checking environment variables in realtime pod: $realtime_pod" + echo "SOCKET_SERVER_URL:" + kubectl exec -n simstudio "$realtime_pod" -- printenv SOCKET_SERVER_URL 2>/dev/null || echo "❌ Not set" + echo "NEXT_PUBLIC_APP_URL:" + kubectl exec -n simstudio "$realtime_pod" -- printenv NEXT_PUBLIC_APP_URL 2>/dev/null || echo "❌ Not set" + else + echo "❌ No realtime pods found" + fi +} + +echo "🌐 Testing DNS resolution..." +test_dns "buildz.ai" +test_dns "ws.buildz.ai" + +echo "" +echo "🔗 Testing HTTP endpoints..." +test_http "https://buildz.ai" "Main app" +test_http "https://ws.buildz.ai/health" "WebSocket server health" + +echo "" +check_k8s_resources + +echo "" +check_env_vars + +echo "" +echo "📝 Manual checks to perform:" +echo "1. Verify SSL certificate covers both buildz.ai and ws.buildz.ai" +echo "2. Check if NEXT_PUBLIC_SOCKET_URL is properly set in the browser (DevTools -> Application -> Local Storage)" +echo "3. Test WebSocket connection manually:" +echo " - Open browser DevTools on https://buildz.ai" +echo " - Go to Network tab, filter by WS" +echo " - Look for socket.io connections and check for errors" +echo "4. Check application logs:" +echo " kubectl logs -n simstudio -l app.kubernetes.io/name=sim-gcp --tail=100" +echo " kubectl logs -n simstudio -l app=sim-gcp-realtime --tail=100" + +echo "" +echo "🔍 Diagnosis complete!" \ No newline at end of file diff --git a/helm/sim/examples/backend-config-buildz.yaml b/helm/sim/examples/backend-config-buildz.yaml new file mode 100644 index 00000000000..b3a0ee16cab --- /dev/null +++ b/helm/sim/examples/backend-config-buildz.yaml @@ -0,0 +1,24 @@ +apiVersion: cloud.google.com/v1 +kind: BackendConfig +metadata: + name: sim-gcp-realtime-backendconfig + namespace: simstudio +spec: + # Enable connection draining for graceful shutdowns + connectionDraining: + drainingTimeoutSec: 60 + # Configure timeout for WebSocket connections + timeoutSec: 3600 + # Enable session affinity for WebSocket connections + sessionAffinity: + affinityType: "CLIENT_IP" + affinityCookieTtlSec: 3600 + # Health check configuration + healthCheck: + checkIntervalSec: 10 + timeoutSec: 5 + healthyThreshold: 1 + unhealthyThreshold: 3 + type: HTTP + requestPath: /health + port: 3002 \ No newline at end of file diff --git a/helm/sim/examples/ingress-buildz.yaml b/helm/sim/examples/ingress-buildz.yaml index 6826fbc8b73..4ab423b400c 100644 --- a/helm/sim/examples/ingress-buildz.yaml +++ b/helm/sim/examples/ingress-buildz.yaml @@ -7,6 +7,11 @@ metadata: kubernetes.io/ingress.global-static-ip-name: "simstudio-ip" networking.gke.io/managed-certificates: "buildz-ssl-cert" kubernetes.io/ingress.class: "gce" + # Enable WebSocket support for the realtime service + nginx.ingress.kubernetes.io/proxy-read-timeout: "3600" + nginx.ingress.kubernetes.io/proxy-send-timeout: "3600" + # Support WebSocket upgrades + cloud.google.com/backend-config: '{"ports": {"3002":"sim-gcp-realtime-backendconfig"}}' spec: rules: - host: buildz.ai