Amazon CloudWatch

Core concept: CloudWatch is AWS's central observability service: metrics, logs, alarms, and events in one place.


Metrics

Default Metrics (Free)

Auto-collected from AWS services:

| Service | Metrics |
|---|---|
| EC2 | CPUUtilization, NetworkIn/NetworkOut, DiskReadOps/DiskWriteOps, DiskReadBytes/DiskWriteBytes |
| Lambda | Invocations, Errors, Duration, Throttles, ConcurrentExecutions |
| SQS | NumberOfMessagesSent, ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage |
| API Gateway | Count, Latency, 4XXError, 5XXError |
| DynamoDB | ConsumedReadCapacityUnits, SuccessfulRequestLatency |

EC2: RAM is NOT default

MemoryUtilization requires installing the CloudWatch Agent on the EC2 instance. This is a common exam trick question.

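Custom Metrics

import java.time.Instant;
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
import software.amazon.awssdk.services.cloudwatch.model.StandardUnit;

CloudWatchClient cloudWatch = CloudWatchClient.create();

// Standard resolution (1-minute granularity) is the default.
// Custom metrics are billed per metric beyond the free-tier allowance.
cloudWatch.putMetricData(PutMetricDataRequest.builder()
    .namespace("MyApp/OrderProcessing")
    .metricData(MetricDatum.builder()
        .metricName("OrdersProcessed")
        .value(42.0)
        .unit(StandardUnit.COUNT)
        .timestamp(Instant.now())
        .dimensions(
            Dimension.builder().name("Environment").value("prod").build(),
            Dimension.builder().name("Service").value("order-service").build()
        )
        .build())
    .build());

| Resolution | Granularity | Cost |
|---|---|---|
| Standard | 1 minute | Standard custom-metric pricing (free only within the free tier) |
| High Resolution | 1 second | Higher (more frequent API calls; high-resolution alarms cost more) |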

CloudWatch Alarms

Metric → Threshold → Alarm State → Action

Alarm States

| State | Meaning |
|---|---|
| OK | Metric is within threshold |
| ALARM | Metric breached threshold |
| INSUFFICIENT_DATA | Not enough data to determine the state (usually right after the alarm starts) |
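
The three states can be illustrated with a minimal sketch. It assumes a deliberately simplified model: a "greater than threshold" comparison in which every one of the last N datapoints must breach. Real alarms also support "M out of N" evaluation and configurable missing-data handling, which are not modelled here. The class and method names are hypothetical, not part of any AWS SDK.

```java
import java.util.List;

// Hypothetical, simplified model of alarm evaluation; not the CloudWatch implementation.
class AlarmStateSketch {
    enum State { OK, ALARM, INSUFFICIENT_DATA }

    // Evaluates the most recent `evaluationPeriods` datapoints against a
    // "greater than threshold" rule. All of them must breach to enter ALARM.
    static State evaluate(List<Double> datapoints, double threshold, int evaluationPeriods) {
        if (evaluationPeriods < 1) {
            throw new IllegalArgumentException("evaluationPeriods must be >= 1");
        }
        if (datapoints.size() < evaluationPeriods) {
            return State.INSUFFICIENT_DATA; // e.g. right after the alarm is created
        }
        List<Double> recent =
            datapoints.subList(datapoints.size() - evaluationPeriods, datapoints.size());
        boolean allBreaching = recent.stream().allMatch(v -> v > threshold);
        return allBreaching ? State.ALARM : State.OK;
    }

    public static void main(String[] args) {
        // CPU threshold of 80% over 3 evaluation periods (illustrative values)
        System.out.println(evaluate(List.of(), 80.0, 3));                 // INSUFFICIENT_DATA
        System.out.println(evaluate(List.of(50.0, 85.0, 90.0), 80.0, 3)); // OK
        System.out.println(evaluate(List.of(85.0, 90.0, 95.0), 80.0, 3)); // ALARM
    }
}
```

Requiring several consecutive breaching periods is one common way to avoid alarming on a single spike.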

Alarm Actions

| Action | Example |
|---|---|
| SNS notification | Email ops team |
| Auto Scaling | Scale EC2 fleet |
| EC2 action | Stop/reboot/recover/terminate instance |
| Lambda invoke | Custom remediation |

Composite Alarms

ALARM if:
(CPUAlarm AND MemoryAlarm)
OR
(ErrorRateAlarm AND LatencyAlarm)

Reduce alert noise by combining multiple alarms.
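
The boolean rule above can be sketched in plain Java to show why it cuts noise: a single child alarm firing on its own does not trigger the composite. This is only an illustration of the logic, using the alarm names from this section. In CloudWatch itself the rule is written as an expression over child alarm states (for example `ALARM(CPUAlarm) AND ALARM(MemoryAlarm)`), and the class below is hypothetical.

```java
import java.util.Map;

// Hypothetical sketch of the composite rule; alarm names are the ones used in this section.
class CompositeAlarmSketch {
    // Returns true when the composite alarm should be in ALARM.
    // `inAlarm` maps each child alarm name to whether it is currently in ALARM.
    static boolean evaluate(Map<String, Boolean> inAlarm) {
        boolean resourcePressure = isAlarm(inAlarm, "CPUAlarm") && isAlarm(inAlarm, "MemoryAlarm");
        boolean userImpact = isAlarm(inAlarm, "ErrorRateAlarm") && isAlarm(inAlarm, "LatencyAlarm");
        return resourcePressure || userImpact;
    }

    // Child alarms missing from the map are treated as not in ALARM (an assumption of this sketch).
    private static boolean isAlarm(Map<String, Boolean> inAlarm, String name) {
        return inAlarm.getOrDefault(name, false);
    }

    public static void main(String[] args) {
        // Only the CPU alarm fires: no composite alert, so no page for a lone CPU spike.
        System.out.println(evaluate(Map.of("CPUAlarm", true)));
        // Error rate and latency both fire: the composite goes to ALARM.
        System.out.println(evaluate(Map.of("ErrorRateAlarm", true, "LatencyAlarm", true)));
    }
}
```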


CloudWatch Logs

Log Groups and Log Streams

Log Group: /aws/lambda/my-function
├── Log Stream: 2024/01/15/[$LATEST]abc123...
└── Log Stream: 2024/01/15/[$LATEST]def456...
  • Log Group = application / function (you define)
  • Log Stream = single instance / execution environment

Retention Policies

# CloudFormation
LogGroup:
  Type: AWS::Logs::LogGroup
  Properties:
    LogGroupName: /aws/lambda/my-function
    RetentionInDays: 30 # Never expires by default!

Default = Never Expire

By default, log groups never expire, so storage costs accumulate. Always set a retention policy!

Logs Insights: Query Language

# Find Lambda errors in the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20

# Lambda cold start detection
filter @message like /Init Duration/
| stats avg(@initDuration), count() by bin(5m)

# P99 latency for Lambda
filter @type = "REPORT"
| stats pct(@duration, 99) as p99Latency by bin(1h)
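
To make the P99 query concrete, here is a minimal sketch of what a 99th percentile means for a set of durations. It uses the nearest-rank method, which is an assumption chosen for simplicity; Logs Insights may compute `pct` differently. The class name and sample values are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration of a percentile; not how Logs Insights is implemented.
class PercentileSketch {
    // Nearest-rank percentile: the smallest sample such that at least p% of
    // all samples are less than or equal to it.
    static double percentile(double[] samples, int p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p < 1 || p > 100) {
            throw new IllegalArgumentException("p must be in 1..100");
        }
        double[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Durations in milliseconds for five invocations (made-up values)
        double[] durationsMs = {120, 80, 95, 300, 110};
        System.out.println("p50 = " + percentile(durationsMs, 50));
        System.out.println("p99 = " + percentile(durationsMs, 99));
    }
}
```

The slow 300 ms invocation dominates the p99 while barely moving the median, which is why tail percentiles are the usual choice for latency alarms.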

Metric Filters

Create a CloudWatch Metric from log patterns:

Log: "[ERROR] Payment failed for order-456"
↓ Metric Filter: "ERROR"
CloudWatch Metric: ErrorCount (increments by 1)
↓
CloudWatch Alarm: ErrorCount > 5 per minute → SNS alert
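
The flow above can be sketched locally as: match a term, count matches per minute, compare against the threshold. This is a simplified model, assuming a plain case-sensitive substring match and one-minute buckets. Real metric filter patterns are richer (for example JSON and space-delimited patterns). All names here are hypothetical.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local model of "metric filter + alarm"; not an AWS API.
class MetricFilterSketch {
    static final class LogEvent {
        final Instant timestamp;
        final String message;

        LogEvent(Instant timestamp, String message) {
            this.timestamp = timestamp;
            this.message = message;
        }
    }

    // Counts events whose message contains `term`, grouped into one-minute
    // buckets. This plays the role of the ErrorCount metric.
    static Map<Instant, Integer> countPerMinute(List<LogEvent> events, String term) {
        Map<Instant, Integer> counts = new TreeMap<>();
        for (LogEvent e : events) {
            if (e.message.contains(term)) {
                Instant minute = e.timestamp.truncatedTo(ChronoUnit.MINUTES);
                counts.merge(minute, 1, Integer::sum);
            }
        }
        return counts;
    }

    // True if any one-minute bucket exceeds the threshold (ErrorCount > threshold).
    static boolean breaches(Map<Instant, Integer> counts, int threshold) {
        return counts.values().stream().anyMatch(c -> c > threshold);
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2024-01-15T10:00:00Z");
        List<LogEvent> events = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            events.add(new LogEvent(start.plusSeconds(i), "[ERROR] Payment failed for order-" + i));
        }
        events.add(new LogEvent(start.plusSeconds(70), "[INFO] Payment succeeded"));

        Map<Instant, Integer> errorCount = countPerMinute(events, "ERROR");
        System.out.println("Alarm: " + breaches(errorCount, 5));
    }
}
```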

CloudWatch Agent (EC2)

For collecting:

  • Memory utilization
  • Disk space
  • Custom application logs
  • Process-level metrics
# Install and configure on EC2
sudo yum install amazon-cloudwatch-agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
sudo systemctl start amazon-cloudwatch-agent

EventBridge (formerly CloudWatch Events)

Schedule or react to AWS events:

# SAM: trigger Lambda every 5 minutes
MyScheduledFunction:
  Type: AWS::Serverless::Function
  Properties:
    Events:
      ScheduleEvent:
        Type: Schedule
        Properties:
          Schedule: rate(5 minutes)
          # Or cron: cron(0 12 * * ? *) → every day at noon UTC

Lambda-Specific Metrics

| Metric | Description | Alert on |
|---|---|---|
| Errors | Invocation errors | > 0 |
| Throttles | Throttled invocations | > 0 in prod |
| Duration | Execution time (ms) | > 80% of timeout |
| ConcurrentExecutions | Executions running at the same time | Near account limit |
| IteratorAge | For event source mappings (ESM): age of Kinesis/DynamoDB Streams records | Growing = consumer falling behind |
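
The "80% of timeout" rule for Duration needs a unit conversion: the function timeout is configured in seconds, while the Duration metric is reported in milliseconds. A minimal sketch of that arithmetic, with a hypothetical helper name:

```java
// Hypothetical helper for deriving a Duration alarm threshold; not an AWS API.
class LambdaAlarmThresholds {
    // Converts a timeout in seconds into a Duration threshold in milliseconds
    // at the given percentage of the timeout.
    static long durationThresholdMs(int timeoutSeconds, int percentOfTimeout) {
        if (timeoutSeconds <= 0 || percentOfTimeout <= 0 || percentOfTimeout > 100) {
            throw new IllegalArgumentException("timeout must be > 0 and percent in 1..100");
        }
        return timeoutSeconds * 1000L * percentOfTimeout / 100;
    }

    public static void main(String[] args) {
        // A 30-second timeout alarms when Duration exceeds 24000 ms
        System.out.println(durationThresholdMs(30, 80));
    }
}
```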

🧪 Practice Questions

Q1. A developer notices that Lambda is being throttled in production. They want to be alerted when throttles exceed 10 per minute. What should they set up?

A) X-Ray tracing
B) CloudWatch Alarm on the Throttles metric with threshold > 10
C) CloudWatch Log Metric Filter on Lambda logs
D) SNS subscription to Lambda error notifications

✅ Answer & Explanation

B: Throttles is a built-in Lambda metric in CloudWatch. Create a CloudWatch Alarm on it with threshold > 10 over a 1-minute period and add an SNS notification action for immediate alerting.


Q2. A developer needs to monitor memory usage on an EC2 instance. They set up a CloudWatch alarm on MemoryUtilization but no data appears. Why?

A) EC2 doesn't support memory metrics
B) The metric namespace is wrong
C) Memory is not a default EC2 metric; the CloudWatch Agent must be installed
D) The IAM role doesn't allow CloudWatch access

✅ Answer & Explanation

C: EC2 default metrics only cover CPU, network, and disk I/O. Memory (RAM) metrics require the CloudWatch Agent to be installed and configured on the instance.


Q3. A Lambda function logs errors in JSON format. A developer wants to count the number of "status": "FAILED" occurrences per minute. What is the correct approach?

A) Use X-Ray to count errors
B) Create a CloudWatch Metric Filter on the log group matching FAILED, then create an alarm on the resulting metric
C) Query CloudWatch Logs Insights every minute via cron
D) Use CloudWatch Contributor Insights

✅ Answer & Explanation

B: A Metric Filter continuously monitors log data and increments a custom metric whenever the pattern matches. This is more efficient than periodic Logs Insights queries and supports near-real-time alarming.


Q4. What happens to CloudWatch Logs if no retention policy is set on a Log Group?

A) Logs are deleted after 90 days
B) Logs are archived to S3 after 30 days
C) Logs are kept indefinitely (never expire), so costs accumulate
D) Logs are deleted after the default 7 days

✅ Answer & Explanation

C: By default, CloudWatch Log Groups have no expiration, so logs are stored indefinitely and you keep paying for storage. Always set a retention policy (RetentionInDays).


🔗 Resources