Check overall system health:
```bash
curl https://ttb-verifier.unitedentropy.com/health | jq .
```

Expected responses:
Healthy (all backends available):
```json
{
  "status": "healthy",
  "backends": {
    "ollama": {"available": true, "error": null}
  },
  "capabilities": {
    "ocr_backends": ["ollama"],
    "degraded_mode": false
  }
}
```

Degraded (Ollama unavailable):
```json
{
  "status": "degraded",
  "backends": {
    "ollama": {"available": false, "error": "Model 'llama3.2-vision' not found"}
  },
  "capabilities": {
    "ocr_backends": [],
    "degraded_mode": true
  }
}
```

Check EC2 instance:
```bash
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=ttb-verifier" \
  --query 'Reservations[0].Instances[0].[InstanceId,State.Name,LaunchTime]' \
  --output table
```

Check container status:
```bash
INSTANCE_ID=$(aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=ttb-verifier" \
  --query 'Reservations[0].Instances[0].InstanceId' \
  --output text)

aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker ps"]' \
  --query 'Command.CommandId' \
  --output text

# Wait a few seconds, then get output:
aws ssm get-command-invocation \
  --command-id <COMMAND_ID> \
  --instance-id "$INSTANCE_ID" \
  --query 'StandardOutputContent' \
  --output text
```

Check model download progress (if in degraded mode):
```bash
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["tail -20 /var/log/ollama-model-download.log 2>&1 || echo \"Log not found\""]' \
  --query 'Command.CommandId' \
  --output text
```

Restart just the verifier container:
```bash
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["cd /app && docker-compose restart verifier"]' \
  --query 'Command.CommandId' \
  --output text
```

Restart all containers:
```bash
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["cd /app && docker-compose restart"]' \
  --query 'Command.CommandId' \
  --output text
```

Verifier container logs:
```bash
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker logs ttb-verifier --tail 50"]' \
  --query 'Command.CommandId' \
  --output text
```

Ollama container logs:
```bash
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker logs ttb-ollama --tail 50"]' \
  --query 'Command.CommandId' \
  --output text
```

Check disk space:

```bash
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["df -h"]' \
  --query 'Command.CommandId' \
  --output text
```

Test Ollama backend:
```bash
curl -X POST https://ttb-verifier.unitedentropy.com/verify \
  -F "image=@test_samples/label_good_001.jpg" \
  -F "ocr_backend=ollama" | jq .
```

Symptoms:
- `/health` returns `"status": "degraded"`
- Ollama backend shows `"available": false`
- Requests with `ocr_backend=ollama` return errors
Diagnosis:
- Check how long the instance has been running:

```bash
aws ec2 describe-instances \
  --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[0].Instances[0].LaunchTime'
```

- If < 12 minutes: Expected - model still downloading in background
- If > 12 minutes: Problem - model download may have failed
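To make the 12-minute comparison concrete, a small helper (illustrative only, not part of the deployment; assumes GNU `date` with the `-d` flag) can turn the `LaunchTime` value from the call above into an age in minutes:

```bash
# instance_age_minutes: minutes elapsed since an ISO 8601 LaunchTime.
# Requires GNU date; the function name is a local convention, not an AWS tool.
instance_age_minutes() {
  local launch="$1"          # e.g. "2024-01-01T12:00:00+00:00"
  local launch_s now_s
  launch_s=$(date -d "$launch" +%s) || return 1
  now_s=$(date +%s)
  echo $(( (now_s - launch_s) / 60 ))
}

# Usage (LAUNCH_TIME captured from aws ec2 describe-instances):
# instance_age_minutes "$LAUNCH_TIME"
```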
Resolution:
If < 12 minutes (normal):
- Wait for background download to complete
- Monitor via health endpoint
- System is operational in degraded mode
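The monitoring can be scripted while waiting. This is a sketch of my own (not shipped with the service) that uses `jq` to test the `degraded_mode` flag from the health response shown earlier:

```bash
# is_healthy: succeeds (exit 0) when a /health JSON document on stdin
# reports capabilities.degraded_mode == false. Requires jq.
is_healthy() {
  jq -e '.capabilities.degraded_mode == false' >/dev/null
}

# Poll until healthy (uncomment to run against the live endpoint;
# the 30-second interval is an arbitrary choice):
# until curl -sf https://ttb-verifier.unitedentropy.com/health | is_healthy; do
#   echo "Still degraded; retrying in 30s..."
#   sleep 30
# done
```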
If > 12 minutes (abnormal):
- Check model download logs:

```bash
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["cat /var/log/ollama-model-download.log"]'
```

- Check for disk space issues:

```bash
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["df -h", "du -sh /home /tmp"]'
```

- Check if model file exists:

```bash
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker exec ttb-ollama ollama list"]'
```

- If model download failed, manually trigger:

```bash
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["cd /app && docker-compose exec -T ollama ollama pull llama3.2-vision"]'
```
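Every send-command/get-command-invocation pair in this runbook follows the same pattern, which can be wrapped in a helper. This is a convenience sketch, not an official tool: it assumes `$INSTANCE_ID` is set as earlier, and the fixed 5-second sleep is a simplification (production scripts should poll the invocation `Status` instead):

```bash
# ssm_run: run one shell command on the instance via SSM and print its stdout.
# Assumes the aws CLI is configured and $INSTANCE_ID is set (see above).
# Note: the simple quoting breaks if the command itself contains double quotes.
ssm_run() {
  local cmd="$1"
  [ -n "$cmd" ] || { echo "usage: ssm_run '<command>'" >&2; return 1; }
  local cid
  cid=$(aws ssm send-command \
    --instance-ids "$INSTANCE_ID" \
    --document-name "AWS-RunShellScript" \
    --parameters "commands=[\"$cmd\"]" \
    --query 'Command.CommandId' \
    --output text) || return 1
  sleep 5   # simplification: poll get-command-invocation Status in real scripts
  aws ssm get-command-invocation \
    --command-id "$cid" \
    --instance-id "$INSTANCE_ID" \
    --query 'StandardOutputContent' \
    --output text
}

# Usage:
# ssm_run "docker ps"
```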
Symptoms:
- ALB health check failing
- 502/503 errors from load balancer
- `/health` endpoint times out
Diagnosis:
```bash
# Check container status
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker ps -a", "docker logs ttb-verifier --tail 50"]'
```

Resolution:
If container stopped:
```bash
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["cd /app && docker-compose up -d"]'
```

If container running but unhealthy:
```bash
# Restart container
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["cd /app && docker-compose restart verifier"]'
```

If persistent issues:
```bash
# Redeploy application
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["/app/workflow_deploy.sh"]'
```

Symptoms:
- Requests taking > 10 seconds
- Timeout errors
Diagnosis:
- Check which backend is being used:
  - Ollama: expected 30-60 seconds per image
- Check system load:

```bash
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["uptime", "free -h", "docker stats --no-stream"]'
```
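To put a number on "slow", curl can report the total request time directly using its standard `-w` write-out variables. This wraps the verify test from earlier in this runbook (run manually; it needs network access to the endpoint):

```bash
# verify_latency: POST a label image and print HTTP status plus total time.
# Sample image path matches the test used earlier in this runbook.
verify_latency() {
  curl -s -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
    -X POST https://ttb-verifier.unitedentropy.com/verify \
    -F "image=@test_samples/label_good_001.jpg" \
    -F "ocr_backend=ollama"
}

# Usage:
# verify_latency
```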
Resolution:
If Ollama is slow:
- Expected behavior - Ollama takes 30-60 seconds per image
- Consider scaling to larger instance type (t3.large)
If system is slow:
- Check system resources (CPU, memory)
- Check for concurrent requests overwhelming system
- Consider implementing rate limiting
Symptoms:
- Container fails to start
- Model download fails
- Application logs show write errors
Diagnosis:
```bash
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["df -h", "docker system df"]'
```

Resolution:
Clean up Docker:
```bash
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["docker system prune -af"]'
```

Remove old model files:
```bash
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["rm -f /home/model.tar.gz /tmp/model.tar.gz"]'
```

If persistent: Increase root volume size in infrastructure/instance.tf
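Before editing infrastructure/instance.tf, it may help to confirm the current root volume size. This is a standard EC2 API call wrapped in an illustrative helper (assumes `$INSTANCE_ID` is set as earlier):

```bash
# root_volume_size: print the size in GiB of the first EBS volume
# attached to the instance. Helper name is illustrative, not an AWS tool.
root_volume_size() {
  aws ec2 describe-volumes \
    --filters "Name=attachment.instance-id,Values=$INSTANCE_ID" \
    --query 'Volumes[0].Size' \
    --output text
}

# Usage:
# root_volume_size
```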