A production-ready AWS infrastructure platform designed for testing and validating DevOps AI agents' incident response capabilities. This demo environment simulates real-world infrastructure incidents with fast CloudWatch alarms, automated recovery mechanisms, and comprehensive monitoring - perfect for training and validating AI-powered automation agents.
- Interactive Dashboard: Python-based web application with one-click incident simulation
- 6 Incident Types: Unhealthy hosts, crashes, slow responses, 5xx floods, and shutdowns
- Fast Alarms: 10-second CloudWatch evaluation periods for rapid incident detection (2-4 minutes)
- Auto-Recovery: Built-in 5-minute recovery timer to automatically restore healthy state
- Cost-Optimized: Auto-shutdown Lambda stops instances after 2 hours, saving $50-100/month
- GitHub Actions Integration: Automated incident triggering via CI/CD workflows
- Quick Start
- Architecture
- The Web Application
- Incident Simulation
- Monitoring & Alarms
- DevOps Agent Integration
- Infrastructure Details
- Development
- Advanced Topics
- Documentation Index
- Technical Reference
Get the demo infrastructure running in under 5 minutes.
- AWS account with credentials configured
- Terraform >= 1.0 (installation guide)
# 1. Clone the repository
git clone <repository-url>
cd aws-devops-agent-demo
# 2. Initialize Terraform
terraform init
# 3. Deploy to dev environment
terraform apply -var-file=environments/dev/dev.tfvars
# 4. Get the application URL
terraform output -raw alb_url# Get your ALB URL
ALB_URL=$(terraform output -raw alb_url)
# Trigger an unhealthy host incident via the dashboard
open ${ALB_URL} # Opens interactive dashboard in browser
# Or trigger via curl
curl "${ALB_URL}/simulate/unhealthy"
# Check CloudWatch alarms (should trigger in 2-3 minutes)
# The system will auto-recover in 5 minutesThe infrastructure creates a complete testing environment with Application Load Balancer, EC2 instances, CloudWatch monitoring, and automated recovery.
graph TB
Internet[Internet Users]
ALB[Application Load Balancer<br/>Port 80]
EC2_1[EC2 Instance #1<br/>Python Web App<br/>Port 80]
EC2_2[EC2 Instance #2<br/>Python Web App<br/>Port 80]
CW[CloudWatch Alarms<br/>10-second periods]
Lambda[Auto-Shutdown Lambda<br/>2-hour timer]
Internet -->|HTTP Requests| ALB
ALB -->|Health Checks| EC2_1
ALB -->|Health Checks| EC2_2
EC2_1 -->|Custom Metrics<br/>HealthStatus, Incidents| CW
EC2_2 -->|Custom Metrics<br/>HealthStatus, Incidents| CW
Lambda -.->|Stop Instances<br/>Cost Savings| EC2_1
Lambda -.->|Stop Instances<br/>Cost Savings| EC2_2
subgraph VPC["VPC (10.0.0.0/16)"]
ALB
subgraph PublicSubnets["Public Subnets (Multi-AZ)"]
EC2_1
EC2_2
end
end
style CW fill:#FF9900,color:#fff
style Lambda fill:#FF9900,color:#fff
style ALB fill:#0066CC,color:#fff
style EC2_1 fill:#0066CC,color:#fff
style EC2_2 fill:#0066CC,color:#fff
| Component | Description | Purpose |
|---|---|---|
| VPC | 10.0.0.0/16 network | Isolated network environment |
| Application Load Balancer | Internet-facing ALB | Routes traffic, health checks |
| EC2 Instances | t3.micro (default 2x) | Run Python web application |
| Python Web App | HTTP server on port 80 | Simulation endpoints, metrics |
| CloudWatch Alarms | 10-second evaluation periods | Fast incident detection |
| Auto-Shutdown Lambda | Runs every 2 hours | Cost savings for idle demos |
| IAM Roles | EC2 instance profile | CloudWatch metrics, SSM access |
- Public Subnets: Multi-AZ deployment for high availability
- Internet Gateway: Direct internet access for ALB
- Security Groups: ALB (port 80 ingress) → EC2 (port 80 from ALB only)
- Route Tables: Public routing to internet gateway
- main.tf: VPC, ALB, EC2 instances, networking
- monitoring.tf: CloudWatch alarms (lines 5-77)
- auto_shutdown.tf: Lambda function for cost optimization (lines 20-96)
- templates/userdata.sh.tpl: Python application installation
A Python 3.12 HTTP server running as a systemd service on port 80, providing an interactive dashboard and simulation endpoints for incident testing.
- Technology: Python 3 built-in
http.servermodule - Deployment: Systemd service with auto-restart
- Metrics: CloudWatch custom metrics via boto3
- Auto-Recovery: 5-minute timer for automatic health restoration
Access the dashboard at your ALB URL to:
- View real-time health status
- Monitor instance information (ID, environment)
- Trigger incidents with one-click buttons
- View all available API endpoints
- Auto-refreshes every 5 seconds
| Endpoint | Method | Description | Auto-Recovery |
|---|---|---|---|
/simulate/unhealthy |
GET | Fails health checks (returns 503) | 5 minutes |
/simulate/healthy |
GET | Restores healthy status (returns 200) | N/A |
/simulate/crash |
GET | Crashes application in 10 seconds | systemd restart |
/simulate/slow-health |
GET | Triggers slow/timeout responses | 5 minutes |
GET /health - ALB health check endpoint
Responses:
- 200 OK:
{"status": "healthy"}- Instance is healthy - 503 Service Unavailable:
{"status": "unhealthy", "reason": "..."}- Health check failing
ALB health check configuration:
- Interval: 30 seconds
- Timeout: 5 seconds
- Healthy threshold: 2 consecutive successes
- Unhealthy threshold: 2 consecutive failures
The application publishes custom metrics every 60 seconds:
| Metric | Values | Purpose |
|---|---|---|
| HealthStatus | 1.0 (healthy) / 0.0 (unhealthy) | Track health state over time |
| IncidentSimulations | Count | Track incident trigger frequency |
Namespace: CustomApp/HealthDemo
Dimensions: InstanceId, Environment, IncidentType
When triggering unhealthy or slow-health incidents, the application automatically schedules recovery:
- Delay: 300 seconds (5 minutes)
- Action: Restores
healthy = truestatus - Cancellation: Calling
/simulate/healthycancels pending recovery - Metrics: Publishes updated health status to CloudWatch
- Logging: All recovery events logged to
/var/log/user-data.log
This allows testing agent detection and response without manual intervention.
Source: templates/userdata.sh.tpl (lines 33-650)
Key Features:
- Python 3.12 with boto3 for AWS SDK
- Multi-threaded: Main HTTP server + background metrics publisher
- Graceful error handling for CloudWatch API failures
- IMDSv2 for secure instance metadata retrieval
- Systemd service with automatic restart on crash
Systemd Service:
# Check service status
sudo systemctl status webapp
# View logs
sudo journalctl -u webapp -f
# Restart service
sudo systemctl restart webappTest DevOps agent incident response capabilities with 6 realistic failure scenarios.
Incidents can be triggered via:
- Interactive Dashboard: One-click buttons in web UI
- Direct API: curl commands to ALB URL
- GitHub Actions: Automated workflow (
.github/workflows/trigger-incidents.yml) - AWS SSM: Direct commands to instances
All incidents generate CloudWatch alarms within 2-4 minutes for rapid agent testing.
Behavior: Single instance fails health checks
Triggers: unhealthy-hosts alarm (UnHealthyHostCount ≥ 1)
Expected Alarm Time: 2-3 minutes
Auto-Recovery: 5 minutes
Use Case: Test single-host failure detection and recovery
# Via API
curl "${ALB_URL}/simulate/unhealthy"
# Via GitHub Actions
gh workflow run trigger-incidents.yml \
-f environment=dev \
-f incident_type=unhealthy_hostBehavior: All instances fail health checks simultaneously
Triggers: unhealthy-hosts alarm (UnHealthyHostCount ≥ instance_count)
Expected Alarm Time: 2-3 minutes
Auto-Recovery: 5 minutes
Use Case: Test total service outage detection
# Via GitHub Actions (requires SSM)
gh workflow run trigger-incidents.yml \
-f environment=dev \
-f incident_type=unhealthy_allBehavior: Application exits with code 1 after 10 seconds
Triggers: unhealthy-hosts alarm (systemd restarts app in ~5s)
Expected Alarm Time: 2-4 minutes
Auto-Recovery: systemd automatic restart
Use Case: Test application crash detection and systemd recovery
# Via API
curl "${ALB_URL}/simulate/crash"
# Note: Application responds "will crash in 10 seconds" then exitsBehavior: Health endpoint becomes unresponsive/slow
Triggers: high-response-time alarm (TargetResponseTime > 2s)
Expected Alarm Time: 2-3 minutes
Auto-Recovery: 5 minutes
Use Case: Test performance degradation detection
# Via API
curl "${ALB_URL}/simulate/slow-health"Behavior: Lambda function stops all EC2 instances
Triggers: All alarms (unhealthy hosts, response time, 5xx)
Expected Alarm Time: 3-5 minutes
Auto-Recovery: Manual (requires aws ec2 start-instances)
Use Case: Test infrastructure shutdown detection
Requires: enable_auto_shutdown = true
# Via GitHub Actions
gh workflow run trigger-incidents.yml \
-f environment=dev \
-f incident_type=shutdown_instancesBehavior: Generates 15 concurrent HTTP 503 errors
Triggers: 5xx-errors alarm (HTTPCode_Target_5XX_Count > 10)
Expected Alarm Time: 2-3 minutes
Auto-Recovery: None (errors are transient)
Use Case: Test error rate spike detection
# Via API (generate multiple requests)
for i in {1..15}; do
curl -s "${ALB_URL}/simulate/unhealthy" &
done
waitFor automated, repeatable incident testing with alarm monitoring and auto-restore capabilities.
File: .github/workflows/trigger-incidents.yml
Features:
- Infrastructure auto-discovery via Terraform outputs
- Pre-flight health checks
- CloudWatch alarm state monitoring
- Optional auto-restore with configurable delay
- Detailed execution summary with timelines
Example Usage:
# Trigger workflow via GitHub CLI
gh workflow run trigger-incidents.yml \
-f environment=dev \
-f incident_type=unhealthy_host \
-f target_instance_index=0 \
-f restore_after_incident=true \
-f restore_delay_seconds=300 \
-f wait_for_alarm=trueSee .github/workflows/README-incidents.md for comprehensive workflow documentation.
For direct instance access without SSH:
# Get instance ID
INSTANCE_ID=$(terraform output -json instance_ids | jq -r '.[0]')
# Trigger unhealthy via SSM
aws ssm send-command \
--instance-ids ${INSTANCE_ID} \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["curl http://localhost/simulate/unhealthy"]'
# Restore healthy
aws ssm send-command \
--instance-ids ${INSTANCE_ID} \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["curl http://localhost/simulate/healthy"]'Fast CloudWatch alarms designed for rapid incident detection (10-second evaluation periods).
Configuration: monitoring.tf (enabled when enable_monitoring = true)
| Alarm Name | Metric | Threshold | Period | Evaluation | Expected Trigger Time |
|---|---|---|---|---|---|
| unhealthy-hosts | UnHealthyHostCount | ≥ 1 | 10 seconds | 1 period | 2-3 minutes |
| high-response-time | TargetResponseTime | > 2 seconds | 10 seconds | 1 period | 2-3 minutes |
| 5xx-errors | HTTPCode_Target_5XX_Count | > 10 | 10 seconds | 1 period | 2-3 minutes |
Namespace: AWS/ApplicationELB
Dimensions: TargetGroup, LoadBalancer
Traditional CloudWatch alarms use 60-300 second periods. This demo uses 10-second periods for:
- Faster Detection: Alarms trigger in 2-3 minutes vs 5-10 minutes
- Better Agent Testing: Rapid feedback loop for AI agent validation
- Realistic Demo: Keeps testing sessions short and interactive
Trade-off: Higher CloudWatch API costs (~$0.10/alarm/month). For production, use 60-300 second periods.
Published by the Python application every 60 seconds:
Namespace: CustomApp/HealthDemo
| Metric | Values | Use Case |
|---|---|---|
| HealthStatus | 1.0 / 0.0 | Track application health over time |
| IncidentSimulations | Count by type | Monitor incident testing frequency |
Query Example:
aws cloudwatch get-metric-statistics \
--namespace CustomApp/HealthDemo \
--metric-name HealthStatus \
--dimensions Name=Environment,Value=dev \
--start-time 2026-01-24T00:00:00Z \
--end-time 2026-01-24T23:59:59Z \
--period 300 \
--statistics Average- Baseline First: Deploy infrastructure and observe normal metrics for 10 minutes
- Document Timelines: Record incident trigger → alarm transition times
- Test One Type at a Time: Isolate variables for accurate agent validation
- Use Auto-Restore: For repeatable automated testing
- Disable Auto-Restore: When validating agent remediation actions
- Monitor CloudWatch Costs: 10-second periods increase API usage
# List all alarms
aws cloudwatch describe-alarms \
--query 'MetricAlarms[*].[AlarmName,StateValue]' \
--output table
# Get specific alarm state
ALARM_NAME=$(terraform output -json cloudwatch_alarm_names | jq -r '.unhealthy_hosts')
aws cloudwatch describe-alarms --alarm-names ${ALARM_NAME}
# View alarm history
aws cloudwatch describe-alarm-history \
--alarm-name ${ALARM_NAME} \
--max-records 10Configure and test AI-powered DevOps agents using this demo infrastructure.
Grant your agent IAM permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"cloudwatch:DescribeAlarms",
"cloudwatch:GetMetricStatistics",
"ec2:DescribeInstances",
"elasticloadbalancing:DescribeTargetHealth",
"elasticloadbalancing:DescribeLoadBalancers",
"logs:GetLogEvents"
],
"Resource": "*"
}
]
}For agent remediation testing, add:
{
"Effect": "Allow",
"Action": [
"ssm:SendCommand",
"ec2:StartInstances",
"ec2:RebootInstances"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"aws:RequestedRegion": "us-east-1"
}
}
}Configure agent to discover infrastructure via tags:
# Get environment tag for filtering
ENVIRONMENT=$(terraform output -raw environment_tag)
# Discover EC2 instances
aws ec2 describe-instances \
--filters "Name=tag:Environment,Values=${ENVIRONMENT}" \
--query 'Reservations[*].Instances[*].[InstanceId,State.Name,Tags[?Key==`Name`].Value|[0]]'
# Get alarm names
terraform output -json cloudwatch_alarm_namesResource Tags:
Environment: dev/qa/prod (for filtering)ManagedBy: Terraform (automated provisioning)Name: Descriptive resource name
Provide agent with alarm names for monitoring:
terraform output cloudwatch_alarm_names
# Example output:
# {
# "unhealthy_hosts": "dev-devops-demo-unhealthy-hosts",
# "high_response_time": "dev-devops-demo-high-response-time",
# "http_5xx_errors": "dev-devops-demo-5xx-errors"
# }Objective: Validate agent can detect and diagnose incidents
# Trigger unhealthy host incident
curl "${ALB_URL}/simulate/unhealthy"
echo "Incident triggered at $(date)"# Monitor alarm state
ALARM_NAME=$(terraform output -json cloudwatch_alarm_names | jq -r '.unhealthy_hosts')
while true; do
STATE=$(aws cloudwatch describe-alarms \
--alarm-names ${ALARM_NAME} \
--query 'MetricAlarms[0].StateValue' \
--output text)
echo "[$(date +%H:%M:%S)] Alarm state: ${STATE}"
[ "$STATE" == "ALARM" ] && break
sleep 10
doneYour agent should:
- Detect: Identify alarm state change (polling or EventBridge)
- Identify: Determine affected resources (EC2 instances, ALB target group)
- Diagnose: Correlate alarm with health check failures
- Propose: Suggest remediation actions
Expected Agent Output:
Incident Detected: Unhealthy Host Alarm
- Alarm: dev-devops-demo-unhealthy-hosts
- Metric: UnHealthyHostCount = 1
- Affected Instance: i-0123456789abcdef0
- Root Cause: Health check endpoint returning 503
Proposed Remediation:
1. Restart application: systemctl restart webapp
2. Or restore via API: curl http://localhost/simulate/healthy
3. Or wait for auto-recovery (5 minutes)
| Capability | Description | Validation Method |
|---|---|---|
| Alarm Detection | Detect CloudWatch alarm state changes | Trigger incident, verify agent logs |
| Resource Correlation | Identify affected EC2/ALB resources | Check agent identifies correct instances |
| Root Cause Analysis | Diagnose health check failures | Verify agent determines 503 response |
| Remediation Proposal | Suggest fix actions | Review proposed commands |
| Action Execution | Execute SSM commands or API calls | Test with read-write IAM permissions |
| Recovery Validation | Verify alarm returns to OK state | Confirm agent monitors post-fix |
For agents with remediation capabilities (requires IAM permissions):
# Agent executes
INSTANCE_ID="<affected-instance>"
aws ssm send-command \
--instance-ids ${INSTANCE_ID} \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["sudo systemctl restart webapp"]'# Agent executes
INSTANCE_ID="<affected-instance>"
aws ssm send-command \
--instance-ids ${INSTANCE_ID} \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["curl http://localhost/simulate/healthy"]'# For shutdown_instances incident
INSTANCE_IDS=$(terraform output -json instance_ids | jq -r '.[]')
aws ec2 start-instances --instance-ids ${INSTANCE_IDS}After agent remediation, verify:
- Alarm state returned to OK (within 2-3 minutes)
- Target health shows "healthy" in ALB
- Application responds with 200 on
/health - CloudWatch metrics show HealthStatus = 1.0
- No new alarms triggered
For testing AI agents with natural language interfaces:
You are a DevOps agent monitoring AWS infrastructure in us-east-1.
Environment: dev
Available Alarms:
- dev-devops-demo-unhealthy-hosts
- dev-devops-demo-high-response-time
- dev-devops-demo-5xx-errors
Task: Monitor CloudWatch alarms and respond to incidents.
When an alarm triggers:
1. Identify affected resources
2. Diagnose root cause
3. Propose remediation
4. Execute fix (if approved)
5. Validate recovery
Start monitoring now.
# Poll CloudWatch every 30 seconds
while True:
alarms = cloudwatch.describe_alarms(AlarmNames=alarm_list)
for alarm in alarms['MetricAlarms']:
if alarm['StateValue'] == 'ALARM':
handle_incident(alarm)
time.sleep(30){
"source": ["aws.cloudwatch"],
"detail-type": ["CloudWatch Alarm State Change"],
"detail": {
"state": {
"value": ["ALARM"]
}
}
}Configure webhook to notify agent when incidents are triggered via workflow.
Three pre-configured environments with different resource profiles:
| Setting | dev | qa | prod |
|---|---|---|---|
| Instance Type | t3.micro | t3.small | t3.medium |
| Instance Count | 2 | 2 | 3 |
| Auto-Shutdown | Enabled (2hr) | Enabled (2hr) | Disabled |
| Monitoring | Enabled | Enabled | Enabled |
| SSH Access | Optional | Optional | Disabled |
| Region | us-east-1 | us-east-1 | us-east-1 |
Cost Estimates (with auto-shutdown):
- dev: ~$15-20/month (active 8 hours/day)
- qa: ~$25-30/month (active 8 hours/day)
- prod: ~$80-100/month (always-on, larger instances)
Edit environment-specific tfvars files:
# environments/dev/dev.tfvars
env = "dev"
region = "us-east-1"
instance_type = "t3.micro"
instance_count = 2
enable_auto_shutdown = true
enable_monitoring = true
enable_ssh_access = falseCommon Customizations:
# Increase instance count for multi-host testing
instance_count = 4
# Use larger instances for performance testing
instance_type = "t3.small"
# Disable auto-shutdown for long-running tests
enable_auto_shutdown = false
# Enable SSH for direct instance access
enable_ssh_access = true
ssh_allowed_cidrs = ["1.2.3.4/32"] # Your IP
key_pair_name = "my-key" # Existing EC2 key pairApply changes:
terraform apply -var-file=environments/dev/dev.tfvarsFile: auto_shutdown.tf
Enabled by default: dev, qa (disabled in prod)
How It Works:
- EventBridge rule triggers Lambda every 2 hours
- Lambda stops all EC2 instances in the environment
- Saves ~60% on EC2 costs for idle demos
Manual Control:
# Stop instances manually
INSTANCE_IDS=$(terraform output -json instance_ids | jq -r '.[]')
aws ec2 stop-instances --instance-ids ${INSTANCE_IDS}
# Start instances
aws ec2 start-instances --instance-ids ${INSTANCE_IDS}Disable Auto-Shutdown:
# environments/dev/dev.tfvars
enable_auto_shutdown = falseCost Savings Example:
- Without auto-shutdown: 2x t3.micro × 730 hours = $15/month
- With auto-shutdown (8hr/day): 2x t3.micro × 240 hours = $5/month
- Savings: $10/month (67%)
# Completely remove infrastructure
terraform destroy -var-file=environments/dev/dev.tfvarsDefault: SSH disabled for security
To Enable:
- Create EC2 key pair in AWS console
- Update tfvars:
enable_ssh_access = true
ssh_allowed_cidrs = ["YOUR_IP/32"]
key_pair_name = "your-key-name"- Apply changes:
terraform apply -var-file=environments/dev/dev.tfvars- SSH to instance:
INSTANCE_IP=$(aws ec2 describe-instances \
--instance-ids $(terraform output -json instance_ids | jq -r '.[0]') \
--query 'Reservations[0].Instances[0].PublicIpAddress' \
--output text)
ssh -i ~/.ssh/your-key.pem ec2-user@${INSTANCE_IP}Default: Local state (.terraform/terraform.tfstate)
For Team Collaboration: Use S3 backend
Update versions.tf:
terraform {
backend "s3" {
bucket = "my-terraform-state-bucket"
key = "aws-devops-demo/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-state-locks"
encrypt = true
}
}Create backend resources:
# Create S3 bucket
aws s3 mb s3://my-terraform-state-bucket --region us-east-1
# Create DynamoDB table for locking
aws dynamodb create-table \
--table-name terraform-state-locks \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--billing-mode PAY_PER_REQUEST \
--region us-east-1# Install pre-commit hooks (REQUIRED before first commit)
pre-commit install
# Initialize Terraform
terraform init
# Run pre-commit checks manually
pre-commit run --all-files
# Validate Terraform
terraform fmt -recursive
terraform validate
# Run TFLint (downloads AWS plugin first time)
tflint --init
tflint --config=.tflint.hcl
# Run Checkov security scan
checkov --config-file .checkov.yml
# Generate documentation
terraform-docs markdown table --output-file README.md .Automatically run before each commit:
| Check | Purpose |
|---|---|
| terraform_fmt | Format .tf files |
| terraform_validate | Validate syntax |
| terraform_docs | Generate README documentation |
| terraform_tflint | Lint with AWS rules |
| terraform_checkov | Security scanning |
| gitleaks | Secret detection |
| trailing-whitespace | Fix whitespace |
| check-yaml | Validate YAML files |
Configuration: .pre-commit-config.yaml
Location: .github/workflows/
| Workflow | Trigger | Purpose |
|---|---|---|
| pre-commit-ci.yaml | Pull request | Run all quality checks |
| terraform-aws.yml | Manual (workflow_dispatch) | Deploy infrastructure |
| trigger-incidents.yml | Manual (workflow_dispatch) | Trigger incidents for testing |
| release.yaml | Push to main | Semantic versioning and changelog |
Follow Conventional Commits for automated versioning:
# Patch version bump (1.0.0 → 1.0.1)
git commit -m "fix(monitoring): correct alarm threshold"
# Minor version bump (1.0.0 → 1.1.0)
git commit -m "feat(incidents): add new slow-response incident type"
# Major version bump (1.0.0 → 2.0.0)
git commit -m "feat(api): change health endpoint path
BREAKING CHANGE: health endpoint moved from /health to /api/v2/health"
# No version bump
git commit -m "docs: update README with new examples"
git commit -m "chore: update pre-commit hooks"Types:
feat: New feature (minor bump)fix: Bug fix (patch bump)perf: Performance improvement (patch bump)refactor: Code refactoring (patch bump)docs: Documentation only (no bump)chore: Maintenance tasks (no bump)test: Test additions (no bump)
Critical Rule: NEVER commit directly to main
Workflow:
- Create feature branch:
git checkout -b feat/your-feature - Make changes with semantic commits
- Push and create pull request
- Wait for CI/CD checks to pass
- Merge via pull request
Branch Naming:
feat/description- New featuresfix/description- Bug fixeschore/description- Maintenance tasks
Pre-configured development environment available:
# Open in VSCode
code .
# Prompt: "Reopen in Container" → Accept
# Or manually
docker-compose -f .devcontainer/docker-compose.yml upIncludes: Terraform, TFLint, Checkov, AWS CLI, pre-commit
For complete development guidelines, see CLAUDE.md.
Symptoms: ALB returns 502/503 errors, health checks failing
Diagnosis:
# Check instance state
INSTANCE_ID=$(terraform output -json instance_ids | jq -r '.[0]')
aws ec2 describe-instances --instance-ids ${INSTANCE_ID} \
--query 'Reservations[0].Instances[0].State.Name'
# Check service status via SSM
aws ssm send-command \
--instance-ids ${INSTANCE_ID} \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["systemctl status webapp"]'
# View application logs
aws ssm send-command \
--instance-ids ${INSTANCE_ID} \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["journalctl -u webapp -n 50"]'Solutions:
- Restart webapp:
systemctl restart webapp - Check UserData logs:
/var/log/user-data.log - Verify boto3 installed:
python3 -c "import boto3" - Restart instance if needed
Symptoms: Incident triggered but alarm stays OK
Diagnosis:
# Check alarm configuration
ALARM_NAME=$(terraform output -json cloudwatch_alarm_names | jq -r '.unhealthy_hosts')
aws cloudwatch describe-alarms --alarm-names ${ALARM_NAME}
# Check actual metric values
aws cloudwatch get-metric-statistics \
--namespace AWS/ApplicationELB \
--metric-name UnHealthyHostCount \
--dimensions Name=TargetGroup,Value=$(terraform output -raw alb_arn | cut -d: -f6) \
--start-time $(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Maximum
# Verify target health
aws elbv2 describe-target-health \
--target-group-arn $(aws elbv2 describe-target-groups \
--load-balancer-arn $(terraform output -raw alb_arn) \
--query 'TargetGroups[0].TargetGroupArn' --output text)Solutions:
- Wait full evaluation period (10-60 seconds)
- Verify instance actually unhealthy in ALB
- Check alarm dimensions match resources
- Confirm
enable_monitoring = true
Symptoms: Instances not stopping after 2 hours
Diagnosis:
# Check Lambda function
LAMBDA_ARN=$(terraform output -raw auto_shutdown_lambda_arn)
aws lambda get-function --function-name ${LAMBDA_ARN}
# Check EventBridge rule
aws events list-rules --name-prefix "dev-devops-demo-auto-shutdown"
# View Lambda logs
aws logs tail /aws/lambda/dev-devops-demo-auto-shutdown --followSolutions:
- Invoke Lambda manually for testing
- Check Lambda IAM permissions for ec2:StopInstances
- Verify EventBridge rule is ENABLED
- Check instance IDs in Lambda environment variables
Deploy to multiple regions using Terraform workspaces:
# Create workspace for us-west-2
terraform workspace new us-west-2
# Update provider region
export TF_VAR_region=us-west-2
# Deploy
terraform apply -var-file=environments/dev/dev.tfvars
# Switch back to us-east-1
terraform workspace select defaultOr use separate state files per region:
# Deploy to us-east-1
terraform apply -var-file=environments/dev/dev.tfvars -var="region=us-east-1"
# Deploy to us-west-2 with different state
terraform apply -var-file=environments/dev/dev.tfvars -var="region=us-west-2" \
-state=terraform-us-west-2.tfstateExtend the web application with custom incidents:
File: templates/userdata.sh.tpl (line 545+)
Example: Add database connection failure simulation
elif self.path == '/simulate/db-timeout':
health_status["healthy"] = False
health_status["reason"] = "Database connection timeout"
health_status["last_updated"] = datetime.now().isoformat()
schedule_auto_recovery(300)
publish_health_metric()
publish_incident_metric('db-timeout')
self.send_json_response(200, {
"message": "Simulating database timeout (auto-recovery in 5 minutes)"
})Redeploy:
terraform taint aws_instance.web[0] # Force UserData re-run
terraform apply -var-file=environments/dev/dev.tfvarsSend CloudWatch alarms to external systems:
SNS Topics (add to monitoring.tf):
resource "aws_sns_topic" "alarms" {
name = "${local.name_prefix}-alarm-notifications"
}
resource "aws_sns_topic_subscription" "email" {
topic_arn = aws_sns_topic.alarms.arn
protocol = "email"
endpoint = "your-email@example.com"
}
# Add to alarms
resource "aws_cloudwatch_metric_alarm" "unhealthy_hosts" {
# ... existing config ...
alarm_actions = [aws_sns_topic.alarms.arn]
}Webhook Integration:
resource "aws_sns_topic_subscription" "webhook" {
topic_arn = aws_sns_topic.alarms.arn
protocol = "https"
endpoint = "https://your-agent-endpoint.com/alarms"
}| Document | Purpose |
|---|---|
| README.md (this file) | Main documentation, quick start, architecture |
| CLAUDE.md | Development guidelines, branching strategy, CI/CD |
| BEST-PRACTICES.md | Terraform best practices, security, performance |
| .github/workflows/README-incidents.md | Comprehensive incident workflow documentation |
| CHANGELOG.md | Version history and release notes |
| LICENSE | MIT license terms |
- Terraform AWS Provider: https://registry.terraform.io/providers/hashicorp/aws/latest/docs
- CloudWatch Alarms: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
- ALB Health Checks: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
- AWS SSM Session Manager: https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html
- Semantic Commits: https://www.conventionalcommits.org/
Auto-generated Terraform documentation (updated by terraform-docs pre-commit hook).
| Name | Version |
|---|---|
| terraform | >= 1.0 |
| archive | ~> 2.7 |
| aws | ~> 6.28 |
| Name | Version |
|---|---|
| archive | 2.7.1 |
| aws | 6.28.0 |
No modules.
| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| enable_auto_shutdown | Enable automatic shutdown of instances after 2 hours (cost savings for demo) | bool |
true |
no |
| enable_monitoring | Enable CloudWatch alarms for ALB and targets | bool |
true |
no |
| enable_ssh_access | Enable SSH access to EC2 instances (requires ssh_allowed_cidrs) | bool |
false |
no |
| env | The environment to deploy the resources | string |
"dev" |
no |
| instance_count | Number of EC2 instances to create | number |
2 |
no |
| instance_type | EC2 instance type for the demo web servers | string |
"t3.micro" |
no |
| key_pair_name | EC2 Key Pair name for SSH access (optional, leave empty to disable SSH key) | string |
"" |
no |
| prefix | The prefix for all resource's names | string |
"dev" |
no |
| region | The region to deploy the resources | string |
"us-east-1" |
no |
| ssh_allowed_cidrs | List of CIDR blocks allowed to SSH to EC2 instances | list(string) |
[] |
no |
| vpc_cidr | CIDR block for the VPC | string |
"10.0.0.0/16" |
no |
| Name | Description |
|---|---|
| alb_arn | ARN of the Application Load Balancer |
| alb_dns_name | DNS name of the Application Load Balancer |
| alb_url | URL to access the application |
| auto_shutdown_enabled | Whether auto-shutdown is enabled |
| auto_shutdown_lambda_arn | ARN of the auto-shutdown Lambda function (if enabled) |
| cloudwatch_alarm_names | Map of alarm types to alarm names for incident testing |
| cloudwatch_alarms | Names of CloudWatch alarms (if monitoring enabled) |
| environment | The environment name |
| environment_tag | Tag value to use in DevOps Agent Space for resource discovery |
| health_check_url | URL to check health endpoint |
| instance_ids | IDs of the EC2 instances |
| instance_private_ips | Private IP addresses of the EC2 instances |
| public_subnet_ids | IDs of the public subnets |
| region | The AWS region where resources are deployed |
| resource_prefix | The prefix used for resource naming |
| restore_health_command | Command to restore healthy status |
| trigger_failure_command | Command to trigger health check failure (run via SSM or SSH) |
| vpc_id | ID of the VPC |