Automate alerts with AWS CloudWatch: Stop guessing about infrastructure health
A CloudWatch metric alarm watches a single metric (or a metric math expression) and triggers an action when it crosses a threshold. That is the foundation. The problem with a naive implementation is alert noise: an alarm that fires whenever CPU utilisation exceeds 70% for a single five-minute period will fire constantly on workloads with variable CPU, training your team to ignore the alerts. An alarm that fires only when CPU has been above 85% for three consecutive periods, while memory utilisation is elevated and error rates are rising, is actionable.
This guide covers how to build CloudWatch alerting that signals real problems rather than generating noise.
Configure IAM permissions before starting
CloudWatch alarm management requires IAM permissions. Create a role or policy for monitoring administrators with at minimum:
- cloudwatch:PutMetricAlarm, cloudwatch:DeleteAlarms, cloudwatch:DescribeAlarms
- sns:CreateTopic, sns:Subscribe, sns:Publish for alert routing
- logs:CreateLogGroup, logs:PutLogEvents, logs:PutMetricFilter for log-based metrics
For operators who need to view alarms without modifying them, the AWS managed CloudWatchReadOnlyAccess policy is sufficient. Separate the viewer and modifier roles: on-call engineers need to view and acknowledge alarms; only platform engineers need to modify alarm configurations.
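As a rough illustration, the modifier permissions listed above could be expressed as an IAM policy document like the one below. This is a minimal sketch rather than a recommended production policy: the function name and statement IDs are hypothetical, `"Resource": "*"` is used only to keep the example short, and `logs:PutMetricFilter` is included on the assumption that you will create metric filters for log-based metrics.

```python
import json


def monitoring_admin_policy() -> dict:
    """Build an IAM policy document for engineers who modify alarms.

    Resource is "*" for brevity only; in practice you would likely
    narrow it to specific alarm, topic, and log group ARNs.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ManageAlarms",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": [
                    "cloudwatch:PutMetricAlarm",
                    "cloudwatch:DeleteAlarms",
                    "cloudwatch:DescribeAlarms",
                ],
                "Resource": "*",
            },
            {
                "Sid": "RouteAlerts",
                "Effect": "Allow",
                "Action": ["sns:CreateTopic", "sns:Subscribe", "sns:Publish"],
                "Resource": "*",
            },
            {
                "Sid": "LogBasedMetrics",
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:PutLogEvents",
                    # Creating a metric filter is what turns log lines into a metric.
                    "logs:PutMetricFilter",
                ],
                "Resource": "*",
            },
        ],
    }


# Render the document as JSON, ready to paste into the IAM console or a template.
policy_json = json.dumps(monitoring_admin_policy(), indent=2)
```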
Tag all CloudWatch resources (alarms, log groups, dashboards) with the same tagging schema as the resources they monitor. This enables cost allocation for monitoring spend and makes it clear which team owns each alarm.
Create alarms on the metrics that matter
Not all metrics warrant alarms. The metrics worth alerting on are those where an unexpected value indicates a problem that requires human attention. CPU at 100% on a batch processing job is expected; CPU at 100% on a web server is a problem.
EC2 key metrics:
- CPUUtilization > 85% for 3 consecutive 5-minute periods
- StatusCheckFailed = 1 (instance or system check failed)
- NetworkOut spike above 3x the baseline (potential data exfiltration or misconfigured service)
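The first two EC2 conditions map fairly directly onto the parameters of the CloudWatch `PutMetricAlarm` API. The sketch below only builds the parameter dictionaries; the function names, alarm names, and the choice of `Average` and `Maximum` statistics are assumptions for illustration. With boto3 installed, each dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.

```python
def cpu_alarm_params(instance_id: str, topic_arn: str) -> dict:
    """CPUUtilization > 85% for 3 consecutive 5-minute periods."""
    return {
        "AlarmName": f"CPUUtilization-High-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds, i.e. one 5-minute period
        "EvaluationPeriods": 3,   # look at the last 3 periods
        "DatapointsToAlarm": 3,   # all 3 must breach, so a single spike is ignored
        "Threshold": 85.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


def status_check_alarm_params(instance_id: str, topic_arn: str) -> dict:
    """StatusCheckFailed reports 1 when the instance or system check fails."""
    return {
        "AlarmName": f"StatusCheckFailed-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",   # any failed check within the period counts
        "Period": 60,
        "EvaluationPeriods": 2,   # assumption: two failed minutes before paging
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],
    }
```

Setting `DatapointsToAlarm` equal to `EvaluationPeriods` is what encodes "consecutive": every one of the evaluated periods has to breach before the alarm changes state.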
RDS key metrics:
- FreeStorageSpace < 10% of provisioned storage (the metric is reported in bytes, so convert the percentage into a byte threshold; alert before the database runs out of disk)
- DatabaseConnections > 80% of max_connections (connection exhaustion is disruptive to debug under pressure)
- ReplicaLag > 30 seconds for read replicas (indicates replication falling behind)
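CloudWatch alarm thresholds are absolute numbers, so the percentage-based RDS conditions above have to be converted first. The helpers below are a small sketch of that arithmetic, assuming allocated storage is given in GiB and that you already know the instance's `max_connections` value; the function names are hypothetical.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def free_storage_threshold_bytes(allocated_gib: int, percent: int = 10) -> int:
    """Byte value that corresponds to `percent` of the allocated storage.

    FreeStorageSpace is reported in bytes, so the alarm threshold must be too.
    Integer arithmetic avoids floating-point rounding on large values.
    """
    return allocated_gib * GIB * percent // 100


def connection_threshold(max_connections: int, percent: int = 80) -> int:
    """DatabaseConnections value that corresponds to `percent` of the limit."""
    return max_connections * percent // 100


# Example: a 100 GiB instance that allows 500 connections.
storage_threshold = free_storage_threshold_bytes(100)   # 10 GiB expressed in bytes
connections_threshold = connection_threshold(500)       # 400 connections
```

These values would go in the `Threshold` field of the alarm, with `LessThanThreshold` for free storage and `GreaterThanThreshold` for connections.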
Lambda key metrics:
- Errors count > 0 per minute sustained (function errors are usually silent without this)
- Throttles count > 10 per minute (Lambda hitting concurrency limits)
- Duration P99 approaching the function timeout (near-timeout executions indicate performance degradation)
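"Approaching the function timeout" needs a concrete number. One possible reading, sketched below, is to alarm when p99 Duration exceeds 90% of the configured timeout. The 90% figure, the 3-of-5 evaluation window, and the function and alarm names are all assumptions; Lambda reports Duration in milliseconds, while timeouts are configured in seconds, so the sketch converts between the two.

```python
def lambda_duration_alarm_params(
    function_name: str,
    timeout_seconds: int,
    topic_arn: str,
    percent_of_timeout: int = 90,
) -> dict:
    """Alarm when p99 Duration gets close to the function timeout."""
    # Duration is in milliseconds; the timeout is configured in seconds.
    threshold_ms = timeout_seconds * 1000 * percent_of_timeout / 100
    return {
        "AlarmName": f"Duration-P99-NearTimeout-{function_name}",  # hypothetical
        "Namespace": "AWS/Lambda",
        "MetricName": "Duration",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        # Percentiles use ExtendedStatistic rather than Statistic.
        "ExtendedStatistic": "p99",
        "Period": 60,
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 3,   # assumption: 3 slow minutes out of the last 5
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # idle functions emit no datapoints
        "AlarmActions": [topic_arn],
    }
```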
ALB key metrics:
- HTTPCode_Target_5XX_Count > 0 sustained (5xx errors from backends)
- TargetResponseTime P99 > your SLO threshold
- UnHealthyHostCount > 0 (backend instance failing health checks)
Use composite alarms to reduce noise
A composite alarm combines multiple alarms into a single logical condition using AND, OR, and NOT operators. Instead of alerting individually on every metric breach, a composite alarm fires only when multiple conditions are true simultaneously, which removes many false positives.
Example: a web server composite alarm that fires only when all three are true:
- CPUUtilization-High alarm is in ALARM state AND
- HTTPCode_Target_5XX-Elevated alarm is in ALARM state AND
- TargetResponseTime-P99-Elevated alarm is in ALARM state
Create this in CloudWatch under Alarms > Create composite alarm. Define the rule expression using the alarm names or ARNs:
ALARM("CPUUtilization-High") AND ALARM("HTTPCode_Target_5XX-Elevated") AND ALARM("TargetResponseTime-P99-Elevated")
This composite alarm fires only when the application is CPU-constrained, returning errors, and slow. A brief CPU spike without corresponding errors and latency increase does not trigger it. The composite alarm replaces three individual noisy alerts with one actionable one.
Composite alarms can also use OR logic for critical single-signal conditions: fire if the disk is full OR the instance status check fails. Either condition demands immediate attention regardless of other metrics.
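The same rule expressions can be generated in code rather than typed by hand, which helps when many services share a pattern. The sketch below builds the rule string and the parameters for the CloudWatch `PutCompositeAlarm` API; the helper names and the composite alarm name are hypothetical. With boto3, the result could be passed as `boto3.client("cloudwatch").put_composite_alarm(**params)`, which requires the `cloudwatch:PutCompositeAlarm` permission.

```python
def alarm_term(alarm_name: str) -> str:
    """Rule term that is true while the named alarm is in ALARM state."""
    return f'ALARM("{alarm_name}")'


def all_of(*alarm_names: str) -> str:
    """AND rule: every child alarm must be in ALARM state."""
    return " AND ".join(alarm_term(name) for name in alarm_names)


def any_of(*alarm_names: str) -> str:
    """OR rule: any single child alarm in ALARM state is enough."""
    return " OR ".join(alarm_term(name) for name in alarm_names)


def composite_alarm_params(name: str, rule: str, topic_arn: str) -> dict:
    """Parameters for a composite alarm that notifies one SNS topic."""
    return {
        "AlarmName": name,
        "AlarmRule": rule,
        "AlarmActions": [topic_arn],
        "ActionsEnabled": True,
    }


# The web server example from the text, built from its three child alarms.
web_rule = all_of(
    "CPUUtilization-High",
    "HTTPCode_Target_5XX-Elevated",
    "TargetResponseTime-P99-Elevated",
)
```

If the child alarms keep their own notification actions, the team still receives three alerts plus the composite one. A common arrangement is to leave the child alarms without actions and attach the SNS topic only to the composite alarm.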
Enable anomaly detection for dynamic thresholds
Fixed thresholds are inappropriate for metrics with cyclical patterns. An e-commerce site has higher request rates on weekday lunchtimes and lower rates at 3am. A fixed alert threshold set for peak traffic will never fire during off-peak; a threshold set for off-peak traffic will fire constantly during normal peak hours.
CloudWatch Anomaly Detection trains a statistical model on up to two weeks of a metric's historical data and dynamically adjusts the expected range based on the time of day and day of week. You set the band width: the number of standard deviations outside the expected range that triggers the alarm.
Enable anomaly detection on:
- Request rates and response times (highly cyclical, fixed thresholds are wrong)
- Error rates (anomalous spikes matter more than absolute levels)
- Data transfer metrics (unexpected patterns may indicate a data leak or misconfiguration)
Anomaly detection alarms require a training period. Enable them during a representative period: not during a known unusual event like a product launch or a holiday with atypical traffic.
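An anomaly detection alarm is still created through `PutMetricAlarm`, but instead of a fixed `Threshold` it references a band produced by the `ANOMALY_DETECTION_BAND` metric math function. The sketch below builds such a parameter set for an ALB request count; the function name, alarm name, metric IDs, and the default band width of 2 are assumptions chosen for illustration.

```python
def anomaly_alarm_params(load_balancer: str, topic_arn: str, band_width: int = 2) -> dict:
    """Alarm when ALB RequestCount leaves its expected band in either direction."""
    return {
        "AlarmName": f"RequestCount-Anomaly-{load_balancer}",  # hypothetical
        # Fire when the metric goes below the lower or above the upper bound.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,
        "Metrics": [
            {
                "Id": "m1",  # the raw metric being watched
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/ApplicationELB",
                        "MetricName": "RequestCount",
                        "Dimensions": [
                            {"Name": "LoadBalancer", "Value": load_balancer}
                        ],
                    },
                    "Period": 300,
                    "Stat": "Sum",
                },
            },
            {
                "Id": "ad1",  # the expected band, band_width standard deviations wide
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "RequestCount (expected)",
                "ReturnData": True,
            },
        ],
        # Points the alarm at the band instead of a fixed Threshold value.
        "ThresholdMetricId": "ad1",
        "AlarmActions": [topic_arn],
    }
```

Use `GreaterThanUpperThreshold` instead if only spikes matter, for example on error rates where an unusually low value is not a problem.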
Route alerts to the right destination
CloudWatch alarms trigger actions via SNS topics. An SNS topic routes notifications to subscribers: email, SMS, Lambda function, HTTP/HTTPS endpoint, or SQS queue.
For operational teams using Slack, route SNS notifications to Slack either through AWS Chatbot (since renamed Amazon Q Developer in chat applications), which handles the notification types it supports natively, or through a Lambda function that formats the CloudWatch alarm payload into a Slack message and posts it to a webhook.
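The Lambda route can be sketched with the standard library alone. The example below assumes the function is subscribed to the SNS topic and that the Slack incoming webhook URL is supplied in a `SLACK_WEBHOOK_URL` environment variable (a hypothetical name). The fields it reads (`AlarmName`, `NewStateValue`, `NewStateReason`, `Region`) are part of the JSON that CloudWatch publishes to SNS on a state change; the message layout itself is only an illustration.

```python
import json
import os
import urllib.request


def format_slack_message(sns_message: str) -> dict:
    """Turn the alarm JSON carried in an SNS message into a Slack payload."""
    alarm = json.loads(sns_message)
    name = alarm.get("AlarmName", "unknown alarm")
    state = alarm.get("NewStateValue", "UNKNOWN")
    region = alarm.get("Region", "unknown region")
    reason = alarm.get("NewStateReason", "")
    return {"text": f"*{name}* is now {state} ({region})\n{reason}"}


def lambda_handler(event, context):
    """Post one Slack message per SNS record delivered to the function."""
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # assumed configuration
    for record in event["Records"]:
        payload = format_slack_message(record["Sns"]["Message"])
        request = urllib.request.Request(
            webhook_url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # A failed post raises, so Lambda records the error and SNS can retry.
        with urllib.request.urlopen(request, timeout=5) as response:
            response.read()
    return {"posted": len(event["Records"])}
```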
Route different severity alerts to different channels:
- Critical alerts (instance down, database connection exhausted): primary on-call channel with PagerDuty or similar escalation
- Warning alerts (elevated error rates, high CPU): engineering ops channel for awareness
- Informational alerts (quota approaching, scheduled task completed): low-priority channel or email only
Do not route all alerts to the same channel. Alert routing discipline is what separates teams that respond to alerts from teams that are drowning in them and ignoring everything.
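One way to keep this discipline is to make severity an explicit input whenever an alarm is defined, so that nobody picks a topic by hand. The sketch below assumes one SNS topic per severity; the topic ARNs, account ID, and function name are placeholders, not real resources.

```python
# Hypothetical topics, one per severity level described above.
TOPIC_BY_SEVERITY = {
    "critical": "arn:aws:sns:eu-west-1:123456789012:alerts-critical",  # pages on-call
    "warning": "arn:aws:sns:eu-west-1:123456789012:alerts-warning",    # ops channel
    "info": "arn:aws:sns:eu-west-1:123456789012:alerts-info",          # email only
}


def alarm_actions_for(severity: str) -> list:
    """Return the AlarmActions list for a severity, rejecting unknown values."""
    try:
        return [TOPIC_BY_SEVERITY[severity.lower()]]
    except KeyError:
        allowed = ", ".join(sorted(TOPIC_BY_SEVERITY))
        raise ValueError(
            f"unknown severity {severity!r}; expected one of: {allowed}"
        ) from None
```

Failing loudly on an unknown severity is deliberate here: a silent default would recreate the single shared channel this section warns against.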
Where Critical Cloud comes in
CloudWatch alerting is foundational observability infrastructure, but it has well-known gaps: it operates at the infrastructure layer without application-layer context, composite alarms require careful design to avoid both noise and coverage gaps, and it does not correlate AWS infrastructure signals with application traces or business metrics. As the world's first Powered by Datadog accredited partner, we extend beyond CloudWatch to provide correlated infrastructure, application, and business-level observability in a single platform. CloudWatch signals feed into a broader operational picture rather than being the whole picture.