A demonstration of monitoring and observability using Prometheus and Grafana on AWS infrastructure, showcasing the use of Histogram metrics to present the 95 precentil response time and the heat map of the API latency.
- Architecture Overview
- Components Description
- Demo Flow
- Prerequisites
- Step-by-Step Setup Guide
- Testing Procedures
- Cleanup
This demo deploys 4 EC2 instances on AWS to demonstrate comprehensive monitoring and observability:
┌──────────────────────────────────┐
│ Load Generation Server │
│ (Locust/Custom Python) │
│ - Normal loading │
│ - High read requests │
│ - High write requests │
│ - DB CPU stress │
└──────────────┬───────────────────┘
│
│ HTTP Requests (port 5000)
│
▼
┌──────────────────┐ ┌──────────────────┐
│ Web Server │ │ Monitor Server │
│ (Flask App) │◄────────│ (Prometheus & │
│ - Query API │ Metrics │ Grafana) │
│ - Write API │ (8000, │ │
│ - Metrics │ 9100) │ - Prometheus │
│ endpoints │ │ - Grafana │
└────────┬─────────┘ │ - Dashboards │
│ └────────┬─────────┘
│ MySQL Query │
│ (port 3306) │
│ Metrics │
│ (9100, 9104)│
▼ ▼
┌────────────────┐
│ MySQL DB │
│ │
│ - customer │
│ - order │
│ - orderitems │
│ - mysqld_ │
│ exporter │
│ - node_ │
│ exporter │
└────────────────┘
- All instances deployed in a single VPC with security groups for controlled access
- Web Server: Public/Private subnet (HTTP ports 5000, 8000 for metrics)
- MySQL DB: Private subnet (Port 3306, Prometheus exporter on 9104)
- Monitor Server: Public subnet (Prometheus 9090, Grafana 3000) - monitors Web Server and DB Server only
- Load Gen Server: Public subnet (Locust 8089) - loads Web Server only
Purpose: Serves API endpoints for order management with built-in metrics collection
Specifications:
- Instance Type: t3.medium
- OS: Amazon Linux 2 or Ubuntu 22.04
- Application: Python 3.9+ with Flask framework
- Metrics: Prometheus client library
Features:
GET /api/order/<order_id> - Query order by ID
POST /api/order - Create new order
GET /metrics - Prometheus metrics endpoint
POST /api/order request body:
{
"customer_id": 10,
"customer": {
"name": "Jane Doe",
"email": "jane@example.com"
},
"items": [
{
"product_id": 101,
"quantity": 2,
"unit_price": 19.99
}
]
}customer_idanditemsare required.customeris optional.- If
customer_iddoes not exist in thecustomertable, the API creates a customer row usingcustomer.name/customer.emailwhen provided, or generated fallback values.
Metrics Exposed:
- HTTP request count (by endpoint, method, status)
- HTTP request duration (histogram)
- API health metrics focused on request volume, latency, and status distribution
Purpose: Persistent data storage for order management system
Specifications:
- Instance Type: t3.medium
- OS: Amazon Linux 2 or Ubuntu 22.04
- Database: MySQL 8.0
- Metrics Export: mysqld_exporter (Prometheus)
Schema:
CREATE TABLE customer (
customer_id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(255),
email VARCHAR(255),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE orders (
order_id INT PRIMARY KEY AUTO_INCREMENT,
customer_id INT,
total_amount DECIMAL(10,2),
status VARCHAR(50),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (customer_id) REFERENCES customer(customer_id)
);
CREATE TABLE order_items (
item_id INT PRIMARY KEY AUTO_INCREMENT,
order_id INT,
product_id INT,
quantity INT,
unit_price DECIMAL(10,2),
FOREIGN KEY (order_id) REFERENCES orders(order_id)
);Metrics Exposed:
- Connection count
- Query performance metrics
- Replication lag (if applicable)
- InnoDB metrics
- Table statistics
Purpose: Centralized monitoring and visualization
Specifications:
- Instance Type: t3.small (expandable based on metrics volume)
- OS: Amazon Linux 2 or Ubuntu 22.04
- Prometheus: Latest stable version
- Grafana: Latest stable version
Prometheus Targets:
- Web Server metrics (port 8000)
- MySQL Database exporter (port 9104)
- Node exporter on all servers (port 9100)
Grafana Dashboards (3 main dashboards):
-
Infrastructure Dashboard
- CPU utilization (Web Server and DB Server)
- Network I/O (Web Server and DB Server)
- Memory usage
- Disk I/O
-
Application Performance Dashboard
- 95th percentile response time for read API
- Request latency distribution (heatmap)
- Request count per second
- Error rate
-
Database Dashboard
- Connection count
- Query execution time
- Slow queries
- InnoDB buffer pool efficiency
Purpose: Simulate realistic and stress-test scenarios
Specifications:
- Instance Type: t3.medium
- OS: Amazon Linux 2 or Ubuntu 22.04
- Load Testing: Locust (Python-based)
Load Scenarios:
- Stage 1 - Normal Loading: 10 users, steady-state for 5 minutes
- Stage 2 - High Read Requests: 50 concurrent users, 80% read / 20% write for 5 minutes
- Stage 3 - High Write Requests: 30 concurrent users, 20% read / 80% write for 5 minutes
- Stage 4 - DB CPU Stress: Combined 40 concurrent users + stress-ng on DB server for 10 minutes
- All servers operational and connected to Prometheus
- Baseline metrics collected
- Grafana dashboards initialized with historical data
- Load generator ramps to 10 users
- Observe steady-state metrics on Grafana
- Verify API response times (target <100ms)
- Monitor resource utilization (CPU <30%, Network <50%)
- Load increases to 50 concurrent users
- 80% read operations (query API), 20% write
- Observe response time distribution heatmap
- Monitor database connection pool
- Verify 95th percentile response time (target <300ms)
- Load changes to 30 concurrent users
- 20% read, 80% write operations
- Observe increased database write performance
- Monitor disk I/O on database server
- Check for write contention
- 40 concurrent users (mixed read/write)
- stress-ng adds CPU load to database (50-80% CPU)
- Observe system behavior under stress
- Verify graceful degradation
- Monitor error rates and timeout behavior
- Gradually reduce load to zero
- Document peak metrics
- Export metrics/dashboards for analysis
- IAM user with EC2, VPC, and Security Group permissions
- VPC with appropriate CIDR blocks (e.g., 10.0.0.0/16)
- At least 4 elastic IPs or public subnet access
- AWS CLI configured with appropriate credentials
- SSH client installed
- Terraform or CloudFormation (optional, for IaC)
- Basic knowledge of AWS, Linux, and networking
- Ubuntu 22.04 LTS or Amazon Linux 2 AMI
- Minimum 2 vCPU, 4GB RAM per instance (except t3.small for monitor)
- 20GB EBS root volume
Description: Create foundational AWS infrastructure using the default VPC and deploy all 4 EC2 instances.
What this step does:
-
Uses the default VPC for your AWS account
-
Selects the first available subnet from the default VPC
-
Creates 4 Security Groups with proper ingress/egress rules:
- Web Server SG: Allows SSH (from your IP), Flask API (5000 from Load Gen), Metrics (8000, 9100 from Monitor)
- DB Server SG: Allows SSH (from your IP), MySQL (3306 from Web Server), Exporters (9100, 9104 from Monitor)
- Monitor Server SG: Allows SSH (from your IP), Prometheus (9090 from your IP), Grafana (3000 from your IP)
- Load Gen Server SG: Allows SSH (from your IP), Locust (8089 from your IP)
-
Launches 4 EC2 Instances in the selected subnet with public IP assignment:
- Web Server (t3.medium): Ubuntu 22.04 LTS - Flask application host
- DB Server (t3.medium): Ubuntu 22.04 LTS - MySQL database host
- Monitor Server (t3.small): Ubuntu 22.04 LTS - Prometheus & Grafana host
- Load Gen Server (t3.medium): Ubuntu 22.04 LTS - Locust load testing host
-
Configures networking:
- All instances in the same subnet for direct connectivity
- Public IPs automatically assigned for external access
- Security groups enforce least-privilege access
Terraform Files Location: /terraform/
Files Included:
variables.tf- Input variable definitionsmain.tf- Infrastructure code (VPC, Security Groups, EC2 Instances)outputs.tf- Connection information and resource IDsterraform.tfvars- Default configuration valuesREADME.md- Detailed Terraform setup instructions
How to Proceed:
-
Navigate to the terraform directory:
cd terraform -
Update
terraform.tfvarswith your IP address:your_ip = "203.0.113.0/32" # Replace with your actual IP
-
Review the infrastructure:
terraform init terraform plan
-
Deploy the infrastructure:
terraform apply
-
Retrieve connection information:
terraform output ssh_commands terraform output access_urls
See terraform/README.md for detailed instructions, troubleshooting, and advanced configurations.
Description: Configure the Flask-based API server with Prometheus metrics collection.
What this step does:
- Installs Python 3, pip, and virtual environment tools
- Creates and activates a Python virtual environment
- Installs required dependencies: Flask, Prometheus client, and PyMySQL
- Deploys Flask application with two APIs:
- GET
/api/order/<order_id>- Query orders from database - POST
/api/order- Create new orders with items
- GET
- Exposes Prometheus metrics endpoint at
/metrics - Implements HTTP request counting and request duration histograms for API health monitoring
- Configures systemd service for automatic Flask app startup
- Sets environment variables for database connectivity
Source layout: The Flask app source lives under src/ at the repo root (src/app.py, src/requirements.txt, src/order-api.service). Terraform injects these into the web server via user_data.
Ask me: "Generate Terraform and shell scripts for Step 2 to setup Web Server"
Description: Configure the MySQL database server with schema, data, and monitoring exporters.
What this step does:
- Installs MySQL Server 8.0 and enables the service
- Creates
orders_dbdatabase and three tables:customertable with customer informationorderstable with order detailsorder_itemstable with individual items per order
- Creates foreign key relationships and performance indexes on key columns
- Creates database user
app_userwith appropriate permissions for the Flask app - Seeds initial test data (sample customers, orders, and items)
- Installs and configures
mysqld_exporterfor Prometheus metrics collection - Creates exporter user with minimal privileges for security
- Installs and configures
node_exporterfor system-level metrics (CPU, memory, disk, network) - Enables both exporters as systemd services
Source layout: Database schema and seed data live under db/ at the repo root (db/schema.sql, db/seed.sql). Terraform injects these into the DB server via user_data.
Ask me: "Generate Terraform and shell scripts for Step 3 to setup MySQL Server"
Description: Deploy Prometheus time-series database and Grafana visualization platform for centralized monitoring.
What this step does:
- Installs Prometheus with proper user permissions and directory structure
- Configures Prometheus to scrape metrics from:
- Web Server on port 8000 (Flask app metrics) and port 9100 (node_exporter)
- DB Server on port 9104 (mysqld_exporter) and port 9100 (node_exporter)
- Prometheus itself on port 9090
- Sets up Prometheus systemd service with 15-second scrape intervals
- Installs Grafana server and enables auto-start
- Configures Prometheus as Grafana's primary data source
- Sets up 3 main Grafana dashboards:
- Infrastructure Dashboard: CPU, network I/O, memory, disk I/O metrics
- Application Performance Dashboard: Response times, latency distribution, request rates, error rates
- Database Dashboard: Connection counts, query performance, table statistics
- Ensures Prometheus and Grafana services start automatically on reboot
Ask me: "Generate Terraform and shell scripts for Step 4 to setup Monitor Server"
Description: Configure load generation server with Locust for simulating different traffic patterns.
What this step does:
- Installs Python 3, pip, and stress-ng tools
- Installs Locust framework for distributed load testing
- Creates Locust test file with balanced read/write operations:
- Read tasks: Query existing orders via GET
/api/order/<order_id> - Write tasks: Create new orders with items via POST
/api/order - Both operations use random data to simulate realistic behavior
- Read tasks: Query existing orders via GET
- Configures wait times between requests (1-3 seconds) to simulate user think time
- Prepares server for different load scenarios (normal, read-heavy, write-heavy, stressed)
Ask me: "Generate Terraform and shell scripts for Step 5 to setup Load Generation Server"
Description: Set up visualization dashboards in Grafana to monitor system and application metrics.
What this step does:
-
Adds Prometheus as a data source to Grafana
-
Creates Infrastructure Dashboard showing:
- CPU utilization for Web and DB servers
- Network I/O throughput
- Memory and disk usage
- Real-time system health metrics
-
Creates Application Performance Dashboard showing:
- 95th percentile response time for read API
- Response time distribution as a heatmap
- HTTP request rate per second
- Error rate percentage
-
Creates Database Dashboard showing:
- Active database connections
- Query execution time
- Table-level statistics
- InnoDB buffer pool efficiency
-
Configures appropriate refresh intervals and time ranges for real-time monitoring
-
Sets up alert thresholds for key metrics (optional)
Ask me: "Generate Terraform and shell scripts for Step 6 to configure Grafana Dashboards"
- Verify all services are running:
# From web server
curl http://localhost:5000/health
# From db server
mysql -u root -p orders_db -e "SELECT COUNT(*) FROM orders;"
mysqld_exporter —check
# From monitor server
curl http://localhost:9090/-/healthy
curl http://localhost:3000/api/health- Verify connectivity:
# From web server to DB
nc -zv db-server-private-ip 3306
# From monitor to web/db servers
nc -zv web-server-private-ip 9100
nc -zv db-server-private-ip 9100
nc -zv db-server-private-ip 9104- Verify data in database:
# From web server
python3 << 'EOF'
import pymysql
conn = pymysql.connect(host='db-server-private-ip', user='app_user', password='app_password', database='orders_db')
cursor = conn.cursor()
cursor.execute('SELECT COUNT(*) FROM orders')
count = cursor.fetchone()
print(f"Orders in database: {count[0]}")
conn.close()
EOFStart from load gen server:
locust -f locustfile.py -H http://web-server-public-ip:5000 --headless -u 10 -r 2 -t 5mMonitor on Grafana:
- Target CPU utilization: 10-20%
- Response times: 50-150ms
- Request rate: ~20 req/sec
- Error rate: <0.1%
Create modified locustfile for read-heavy workload (90% read, 10% write):
locust -f locustfile_read_heavy.py -H http://web-server-public-ip:5000 --headless -u 50 -r 5 -t 5mMonitor on Grafana:
- Web server CPU: 40-60%
- DB server CPU: 30-50%
- 95th percentile response time: <300ms
- Request rate: ~100 req/sec
locust -f locustfile_write_heavy.py -H http://web-server-public-ip:5000 --headless -u 30 -r 3 -t 5mMonitor:
- DB write operations increasing
- Disk I/O on DB server: 50-80 Mbps
- Memory usage increasing (order items buffering)
- Response times: 100-500ms
Start load test:
locust -f locustfile.py -H http://web-server-public-ip:5000 --headless -u 40 -r 4 -t 10m &In separate session on DB server, apply stress:
sudo stress-ng --cpu "$(nproc)" --cpu-load 95 --io 4 --vm 2 --vm-bytes 70% --timeout 10m --metrics-briefLighter option (safer for controlled demos):
sudo stress-ng --cpu 2 --cpu-load 70 --io 1 --vm 1 --vm-bytes 25% --timeout 5m --metrics-briefMonitor:
- DB CPU usage: 80-100%
- Response times degrading: 300-1000ms
- Request queuing visible
- Error rate increasing if load too high
- Graceful degradation verification
After each scenario:
- No error rate spikes above 5%
- Response times within acceptable bounds
- Database maintains consistency (check row counts)
- No resource exhaustion (CPU <95%, Memory <90%)
- Prometheus collecting metrics (check
/api/v1/targets) - Grafana dashboards updating in real-time
# On load gen server
sudo pkill -f stress-ng
pkill -f locust# On each server
sudo systemctl stop flask-app # Web server
sudo systemctl stop mysqld_exporter # DB server
sudo systemctl stop mysqld # DB server
sudo systemctl stop prometheus # Monitor server
sudo systemctl stop grafana-server # Monitor server# Terminate instances
aws ec2 terminate-instances --instance-ids i-xxxxx i-xxxxx i-xxxxx i-xxxxx
# Delete security groups
aws ec2 delete-security-group --group-id sg-xxxxx
# Delete VPC
aws ec2 delete-vpc --vpc-id vpc-xxxxx- 4 x t3.medium instances: ~$0.104/hour
- 1 x t3.small instance: ~$0.052/hour
- Total: ~$0.468/hour
- 8-hour demo cost: ~$3.74
# Check logs
sudo journalctl -u flask-app -n 50
# Test database connection
python3 -c "import pymysql; pymysql.connect(host='db-ip', user='app_user', password='app_password', database='orders_db')"# Check targets
curl http://localhost:9090/api/v1/targets
# Check if exporters are running
curl http://web-server-private-ip:8000/metrics
curl http://db-server-private-ip:9104/metrics# Verify data source connection
curl http://localhost:3000/api/datasources
# Check Prometheus for data
curl 'http://localhost:9090/api/v1/query?query=up'- Check if DB indexes are created
- Monitor DB connection pool (max_connections)
- Check network bandwidth between servers
- Verify no other processes consuming resources
- High Availability: Deploy in multi-AZ with RDS for database
- Scalability: Use Auto Scaling Groups and load balancers
- Security: Enable encryption, use private subnets, VPN for access
- Persistent Storage: Use EBS volumes with snapshots
- Backup: Regular database backups and disaster recovery testing
- Monitoring: Setup alerting rules and escalation policies
- Logging: Centralized logging with CloudWatch or ELK stack
- Performance: Enable query caching, optimize slow queries
- Prometheus Documentation
- Grafana Documentation
- Flask Documentation
- Locust Load Testing
- AWS EC2 Documentation
MIT License - Feel free to modify and distribute
For issues or questions, please open an issue on this repository.