Cost Optimization and Incident Prevention with AWS Lambda Schedulers

Introduction

This is a draft of an article for the AWS Community blog.

In modern cloud infrastructure, cost optimization and proactive incident prevention are crucial for maintaining efficient operations. This document outlines our implementation of AWS Lambda-based scheduling and monitoring systems that help reduce costs and prevent potential issues before they impact production.

Resource Scheduling System

Our infrastructure utilizes several specialized Lambda functions to manage different AWS resources:

1. ASG (Auto Scaling Group) Scheduler

The ASG scheduler manages compute resources based on time schedules:

import logging

logger = logging.getLogger(__name__)

# is_weekend, is_working_hours and update_asg_capacity are helpers not shown here.
def lambda_handler(event, context):
    """
    Manages Auto Scaling Groups based on schedule:
    Working hours (9:00-18:00): Normal capacity
    Off hours: Minimum capacity
    Weekends: Zero capacity (for non-production)
    """
    try:
        asg_name = event.get('asg_name')
        if is_weekend():
            update_asg_capacity(asg_name, min=0, desired=0, max=0)
        elif is_working_hours():
            update_asg_capacity(asg_name, min=1, desired=2, max=4)
        else:
            update_asg_capacity(asg_name, min=1, desired=1, max=2)
    except Exception as e:
        logger.error(f"ASG scheduling failed: {str(e)}")
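The handler above relies on helpers that this README does not show. A minimal sketch of how they might look follows. The function names `capacity_for` and `apply_schedule`, the explicit `now` argument, and the injected `client` are assumptions made for illustration; the client is expected to behave like `boto3.client("autoscaling")`, and the hours are interpreted in whatever timezone the caller supplies.

```python
from datetime import datetime

WORK_START_HOUR = 9   # 9:00, from the schedule described above
WORK_END_HOUR = 18    # 18:00


def is_weekend(now: datetime) -> bool:
    # Monday is 0, so Saturday and Sunday are 5 and 6.
    return now.weekday() >= 5


def is_working_hours(now: datetime) -> bool:
    return WORK_START_HOUR <= now.hour < WORK_END_HOUR


def capacity_for(now: datetime) -> dict:
    """Pick ASG sizes using the same numbers as the handler above."""
    if is_weekend(now):
        return {"MinSize": 0, "DesiredCapacity": 0, "MaxSize": 0}
    if is_working_hours(now):
        return {"MinSize": 1, "DesiredCapacity": 2, "MaxSize": 4}
    return {"MinSize": 1, "DesiredCapacity": 1, "MaxSize": 2}


def apply_schedule(client, asg_name: str, now: datetime) -> dict:
    """Apply the sizes for `now`; `client` is an Auto Scaling client (assumed)."""
    sizes = capacity_for(now)
    client.update_auto_scaling_group(AutoScalingGroupName=asg_name, **sizes)
    return sizes
```

Passing `now` and `client` in explicitly keeps the schedule logic testable without AWS credentials; inside Lambda the caller would supply the current time and a real client.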

2. RDS (Relational Database Service) Maintenance Scheduler

The RDS scheduler handles database maintenance tasks:

def lambda_handler(event, context):
    """
    Manages RDS instances:
    Stops development databases during off-hours
    Maintains production databases 24/7
    Schedules maintenance windows
    """
    for instance in get_rds_instances():
        if instance.tags.get('Environment') != 'production':
            if is_off_hours():
                stop_rds_instance(instance.id)
            else:
                start_rds_instance(instance.id)
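The stop/start decision above can be separated from the AWS calls so that it is easy to check. The sketch below assumes the instances arrive as dictionaries in the shape returned by the boto3 RDS `describe_db_instances` call (`DBInstanceIdentifier`, `DBInstanceStatus`, `TagList`); the function names and the injected `client` are hypothetical.

```python
def tags_to_dict(tag_list):
    # RDS returns tags as [{"Key": ..., "Value": ...}, ...].
    return {t["Key"]: t["Value"] for t in tag_list or []}


def plan_rds_actions(db_instances, off_hours: bool):
    """Return (action, identifier) pairs for non-production instances."""
    actions = []
    for db in db_instances:
        tags = tags_to_dict(db.get("TagList"))
        if tags.get("Environment") == "production":
            continue  # production databases stay up 24/7
        status = db["DBInstanceStatus"]
        # Only act when the instance is in a state the call can succeed from.
        if off_hours and status == "available":
            actions.append(("stop", db["DBInstanceIdentifier"]))
        elif not off_hours and status == "stopped":
            actions.append(("start", db["DBInstanceIdentifier"]))
    return actions


def apply_rds_actions(client, actions):
    """`client` is assumed to behave like boto3.client("rds")."""
    for action, identifier in actions:
        if action == "stop":
            client.stop_db_instance(DBInstanceIdentifier=identifier)
        else:
            client.start_db_instance(DBInstanceIdentifier=identifier)
```

As in the handler above, an instance without an `Environment` tag is treated as non-production. Note also that RDS automatically restarts a stopped instance after seven days, so a scheduler that runs daily will stop it again on the next off-hours pass.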

3. EKS (Elastic Kubernetes Service) Scheduler

The EKS scheduler manages Kubernetes clusters:

def lambda_handler(event, context):
    """
    Manages EKS node groups:
    Scales down during off-hours
    Adjusts capacity based on workload patterns
    """
    for nodegroup in list_nodegroups():
        if should_scale_down(nodegroup):
            update_nodegroup_size(nodegroup, desired=0)
        else:
            restore_nodegroup_capacity(nodegroup)
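One detail worth making explicit when scaling a managed node group to zero: the desired size cannot be lower than the minimum size, so both have to be lowered together. The sketch below shows one way to build the scaling configuration; `scaling_config`, `apply_nodegroup_size`, and the `normal` dictionary holding the working-hours sizes are hypothetical, and `client` is assumed to behave like `boto3.client("eks")`.

```python
def scaling_config(scale_down: bool, normal: dict) -> dict:
    """Build the scalingConfig for off-hours or for normal operation.

    `normal` holds the working-hours sizes, for example
    {"minSize": 1, "maxSize": 4, "desiredSize": 2} (illustrative values).
    """
    if scale_down:
        # desiredSize must not be below minSize, so both go to zero;
        # maxSize is left unchanged.
        return {"minSize": 0, "maxSize": normal["maxSize"], "desiredSize": 0}
    return dict(normal)


def apply_nodegroup_size(client, cluster: str, nodegroup: str, config: dict) -> None:
    client.update_nodegroup_config(
        clusterName=cluster,
        nodegroupName=nodegroup,
        scalingConfig=config,
    )
```

Where the working-hours sizes are stored (tags, a parameter store, or the function's configuration) is a design decision this README does not describe.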

Incident Prevention System

CloudWatch Metrics Monitoring

Our system implements proactive monitoring of critical metrics:

Database Metrics
- Storage space utilization
- CPU usage
- Connection count
- IOPS utilization
Application Metrics
- Response times
- Error rates
- Queue lengths
- Memory usage
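As an illustration of how one of the database metrics listed above could be turned into an alarm, the sketch below builds the keyword arguments for the CloudWatch `put_metric_alarm` call on the RDS `FreeStorageSpace` metric, which is reported in bytes. The alarm name pattern, the period, the number of evaluation periods, and the SNS topic are assumptions, not values taken from this project.

```python
def free_storage_alarm(db_instance_id: str, threshold_bytes: int, sns_topic_arn: str) -> dict:
    """Keyword arguments for cloudwatch.put_metric_alarm (illustrative values)."""
    return {
        "AlarmName": f"{db_instance_id}-free-storage-low",
        "Namespace": "AWS/RDS",
        "MetricName": "FreeStorageSpace",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 300,              # seconds per datapoint (assumed)
        "EvaluationPeriods": 3,     # consecutive breaching datapoints (assumed)
        "Threshold": float(threshold_bytes),
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 available, the dictionary would be passed as `boto3.client("cloudwatch").put_metric_alarm(**free_storage_alarm(...))`.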

Automated Prevention Actions

Example of automated response to metrics:

def handle_metric_alarm(event, context):
    """
    Responds to CloudWatch alarms:
    Executes database maintenance (VACUUM)
    Adjusts resource capacity
    Sends notifications
    """
    metric_name = event['detail']['metricName']
    if metric_name == 'FreeStorageSpace':
        execute_vacuum_maintenance()
    elif metric_name == 'CPUUtilization':
        scale_compute_resources()
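The handler above reads the metric name from `event['detail']['metricName']`. Events delivered by EventBridge for a CloudWatch alarm state change are, as far as this sketch assumes, nested differently (under `detail.configuration.metrics[].metricStat.metric.name`), so it is worth checking a real event before relying on either path. The sketch below accepts both shapes and routes to an action through a lookup table; `metric_name_from_event` and `dispatch` are hypothetical names.

```python
def metric_name_from_event(event: dict):
    """Return the metric name from either event shape, or None if absent."""
    detail = event.get("detail", {})
    if "metricName" in detail:
        # Flat shape used by the handler above.
        return detail["metricName"]
    # Assumed shape of a native "CloudWatch Alarm State Change" event.
    for metric in detail.get("configuration", {}).get("metrics", []):
        name = metric.get("metricStat", {}).get("metric", {}).get("name")
        if name:
            return name
    return None


def dispatch(event: dict, actions: dict):
    """Call the action registered for the event's metric, if any."""
    action = actions.get(metric_name_from_event(event))
    if action is None:
        return None  # unknown metric: nothing to do
    return action()
```

A lookup table keeps the mapping from metric to response in one place, which makes it easier to add a new metric without extending an `if`/`elif` chain.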

Slack Notifications

Our system sends notifications to Slack channels. Notification is the final step of the alarm handler (handle_metric_alarm) shown under Automated Prevention Actions, after database maintenance and capacity adjustment.
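The README does not show how the Slack message is sent. A common approach is a Slack incoming webhook, which accepts a JSON body with a `text` field. The sketch below uses only the standard library; the function names and the idea of passing the webhook URL as an argument are assumptions.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Prepare a POST request carrying a plain-text Slack message."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message(webhook_url: str, text: str, timeout: float = 5.0) -> int:
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

The webhook URL is a secret and would normally be read from an environment variable or a secrets store rather than hard-coded in the function.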

IAM Security Configuration

Each Lambda function has specific IAM roles with least-privilege access:

EC2 Scheduler Role

resource "aws_iam_role" "ec2_scheduler_lambda" {
  name = "Ec2SchedulerLambda"
  # Permissions for EC2 management
}

RDS Scheduler Role

resource "aws_iam_role" "rds_scheduler_lambda" {
  name = "RDSSchedulerLambda"
  # Permissions for RDS management
}

EKS Scheduler Role

resource "aws_iam_role" "eks_scheduler_lambda" {
  name = "eksSchedulerLambda"
  # Permissions for EKS management
}
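The role definitions above leave the permissions themselves out. To make "least privilege" concrete, the sketch below builds a policy document for the RDS scheduler as a Python dictionary. The exact action list and the split between read-only and start/stop statements are assumptions about what such a scheduler needs, not the project's actual policy.

```python
import json


def rds_scheduler_policy(instance_arns) -> dict:
    """Policy document limited to describing, starting and stopping instances."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadInstances",
                "Effect": "Allow",
                "Action": ["rds:DescribeDBInstances", "rds:ListTagsForResource"],
                "Resource": "*",
            },
            {
                "Sid": "StartStopListedInstances",
                "Effect": "Allow",
                "Action": ["rds:StartDBInstance", "rds:StopDBInstance"],
                # Restrict the write actions to the instances that are scheduled.
                "Resource": list(instance_arns),
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical ARN, for illustration only.
    print(json.dumps(rds_scheduler_policy(
        ["arn:aws:rds:eu-west-1:123456789012:db:dev-db"]), indent=2))
```

The resulting JSON could be attached to the role through whatever mechanism the Terraform configuration already uses.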

Cost Optimization Features

Automated Resource Management
- Scheduled start/stop of development resources
- Capacity adjustment based on usage patterns
- Weekend and holiday scheduling
Preventive Maintenance
- Automated database VACUUM operations
- Storage space monitoring
- Performance optimization
Resource Right-sizing
- Regular utilization analysis
- Automatic scaling adjustments
- Cost-effective resource allocation

Benefits Achieved

Cost Reduction
- 40-60% reduction in development environment costs
- Elimination of idle resource costs
- Optimized resource utilization
Improved Reliability
- Zero downtime due to storage issues
- Proactive issue detection
- Automated maintenance procedures
Operational Efficiency
- Reduced manual intervention
- Consistent resource management
- Automated incident response

Best Practices

Resource Tagging

tags = {
    Name = "resource-name"
    Environment = "dev"
    Schedule = "business-hours"
}
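Tags like these let a scheduler decide which resources it may touch. The sketch below shows one possible opt-in rule: a resource is managed only if it is not production and carries the matching `Schedule` tag. The function names and the opt-in rule itself are assumptions; the schedulers shown earlier only check the `Environment` tag.

```python
def is_scheduled(tags: dict, schedule: str = "business-hours") -> bool:
    """True if the resource opted in to `schedule` and is not production."""
    return tags.get("Environment") != "production" and tags.get("Schedule") == schedule


def scheduled_names(resources, schedule: str = "business-hours"):
    """Names of resources (dicts with a "tags" mapping) that follow `schedule`."""
    return [
        r["tags"].get("Name", "")
        for r in resources
        if is_scheduled(r["tags"], schedule)
    ]
```

Requiring an explicit `Schedule` tag means an untagged resource is left alone, which is the safer default when a tag is forgotten.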

Monitoring Configuration
- Set appropriate thresholds based on historical data
- Implement graduated response actions
- Maintain comprehensive monitoring documentation
Security Measures
- Use least-privilege IAM roles
- Implement proper error handling
- Maintain audit logs

Implementation Example

Here's an example of our RDS maintenance implementation:

def maintenance_task():
    try:
        # Connect to database
        conn = connect_to_database()
        # Execute maintenance
        execute_vacuum_full()
        # Notify success
        send_notification("Maintenance completed successfully")
    finally:
        # Auto-terminate instance
        terminate_instance()
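The body of the vacuum step is not shown above. In PostgreSQL, `VACUUM` cannot run inside a transaction block, so the connection has to be in autocommit mode before the statement is executed. The sketch below assumes a DB-API style connection with an `autocommit` attribute, as psycopg2 provides; `run_vacuum` is a hypothetical name.

```python
def run_vacuum(conn, full: bool = False) -> str:
    """Run VACUUM on the whole database and return the statement that was sent."""
    # VACUUM refuses to run inside a transaction block.
    conn.autocommit = True
    statement = "VACUUM (FULL, ANALYZE)" if full else "VACUUM (ANALYZE)"
    cursor = conn.cursor()
    try:
        cursor.execute(statement)
    finally:
        cursor.close()
    return statement
```

`VACUUM FULL` rewrites each table under an exclusive lock, so it blocks reads and writes while it runs; the plain form is usually the safer choice for automated runs against a database that is in use.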

Monitoring and Alerting

Critical Metrics
- Database storage utilization
- Application error rates
- Resource utilization patterns
- Performance metrics
Alert Thresholds
- Warning: 70% utilization
- Critical: 85% utilization
- Emergency: 95% utilization
Response Actions
- Automated maintenance
- Resource scaling
- Team notifications
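The three alert thresholds above can be expressed as a small classification function. This is a sketch: treating each threshold as inclusive (70% itself already counts as a warning) is an assumption, and `severity` is a hypothetical name.

```python
# Ordered from most to least severe so the first match wins.
THRESHOLDS = [("emergency", 95.0), ("critical", 85.0), ("warning", 70.0)]


def severity(utilization_percent: float):
    """Return "warning", "critical", "emergency", or None below all thresholds."""
    for level, limit in THRESHOLDS:
        if utilization_percent >= limit:
            return level
    return None
```

A handler can then map each level to one of the response actions listed above, for example notifying the team on a warning and reserving automated maintenance for the higher levels.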

Conclusion

Our AWS Lambda-based scheduling and monitoring system has proven highly effective in:

- Reducing operational costs through automated resource management
- Preventing incidents through proactive monitoring
- Improving system reliability through automated maintenance
- Reducing team workload through automation

The combination of scheduled resource management and proactive monitoring ensures optimal resource utilization while maintaining system stability and performance.

igorgorovoy/cost-optimization-using-aws-lambda