Amazon CloudWatch is a monitoring service for AWS cloud resources and the applications customers run on AWS. It’s built for DevOps engineers, developers, site reliability engineers (SREs), and IT managers. Amazon CloudWatch collects monitoring and operational data in the form of logs, metrics, and events, providing customers with a unified view of AWS resources, applications, and services that run on AWS and on-premises servers. Using Amazon CloudWatch, customers can collect and track metrics, collect and monitor log files, and set alarms.
- Amazon CloudWatch can monitor AWS resources such as Amazon EC2 instances, Amazon DynamoDB tables, and Amazon RDS DB instances, as well as custom metrics generated by customers' applications and services, and any log files their applications generate.
- With CloudWatch, AWS customers gain system-wide visibility into resource utilization, application performance, and operational health.
- Using the metrics, customers can calculate statistics and then present the data graphically in the CloudWatch console.
- To provide additional scalability and reliability, each data center facility is located in a specific geographical area (Region). Each Region is designed to be completely isolated from the other Regions, to achieve the greatest possible failure isolation and stability.
Amazon CloudWatch Features
Amazon CloudWatch dashboards enable customers to create reusable graphs and help them visualize their cloud resources and applications in a unified view. Customers can correlate log patterns with a specific metric and set alarms to be proactively alerted about performance and operational issues.
- This gives customers system-wide visibility into operational health and the ability to quickly troubleshoot issues, reducing Mean Time to Resolution (MTTR).
- Amazon CloudWatch alarms allow customers to set a threshold on metrics and trigger an action.
- Real-time alarms on metrics and events enable customers to minimize downtime and potential business impact.
By correlating metrics and logs, Amazon CloudWatch helps customers quickly go from diagnosing a problem to understanding its root cause. Amazon CloudWatch Application Insights for .NET and SQL Server enables customers to monitor .NET and SQL Server applications, so that they can get visibility into the health of such applications.
- It helps identify and set up key metrics and logs across customers' application resources and technology stack.
Using Amazon CloudWatch ServiceLens, customers can visualize and analyze the health, performance, and availability of their applications in a single place. CloudWatch ServiceLens ties together CloudWatch metrics and logs as well as traces from AWS X-Ray to give customers a complete view of the applications and their dependencies.
- This enables customers to quickly pinpoint performance bottlenecks, isolate root causes of application issues, and determine which users are impacted.
- Customers can gain visibility into their applications in three ways: infrastructure monitoring, transaction monitoring, and end-user monitoring.
Amazon CloudWatch Synthetics allows AWS customers to monitor application endpoints more easily. It runs tests on the endpoints every minute, 24×7, and alerts them as soon as their application endpoints don’t behave as expected.
- These tests can be customized to check for availability, latency, transactions, broken or dead links, step-by-step task completions, page load errors, load latencies for UI assets, complex wizard flows, or checkout flows in customers' applications.
- It can also be used to isolate alarming application endpoints and map them back to underlying infrastructure issues to reduce mean time to resolution.
- Amazon CloudWatch Synthetics supports monitoring of customers' REST APIs, URLs, and website content, checking for unauthorized changes from phishing, code injection, and cross-site scripting.
Auto Scaling enables AWS customers to automate capacity and resource planning. They can set a threshold to alarm on a key metric and trigger an automated Auto Scaling action.
- Amazon CloudWatch Events provides a near real-time stream of system events that describe changes to customer AWS resources.
- It allows customers to respond quickly to operational changes and take corrective action.
The Amazon CloudWatch Logs service allows customers to collect and store logs from their resources, applications, and services in near real-time.
- There are three main categories of logs: vended logs, logs that are published by AWS services, and custom logs.
Amazon CloudWatch enables customers to collect default metrics from more than 70 AWS services, such as Amazon EC2, Amazon DynamoDB, Amazon S3, Amazon ECS, AWS Lambda, and Amazon API Gateway.
Using Amazon CloudWatch, customers can collect custom metrics from their own applications to monitor operational performance, troubleshoot issues, and spot trends. Container Insights simplifies the collection and aggregation of curated metrics and container ecosystem logs.
- It collects compute performance metrics such as CPU, memory, network, and disk information from each container as performance events.
AWS Batch allows customers to set up multiple queues with different priority levels. Batch jobs are stored in the queues until compute resources are available to execute the job. The AWS Batch scheduler evaluates when, where, and how to run jobs that have been submitted to a queue based on the resource requirements of each job.
- The scheduler evaluates the priority of each queue and runs jobs in priority order on optimal compute resources (for example, memory-optimized versus CPU-optimized), as long as those jobs have no outstanding dependencies.
Container Insights provides automatic dashboards in the CloudWatch console. These dashboards summarize the compute performance, errors, and alarms by cluster, pod/task, and service.
- For Amazon EKS and k8s, dashboards are also available for nodes/EC2 instances and namespaces.
Amazon CloudWatch Anomaly Detection applies machine-learning algorithms to continuously analyze data of a metric and identify anomalous behavior. It enables customers to create alarms that auto-adjust thresholds based on natural metric patterns, such as time of day, day of week seasonality, or changing trends.
- AWS customers can visualize metrics with anomaly detection bands on dashboards, which enables them to monitor, isolate, and troubleshoot unexpected changes in their metrics.
Amazon CloudWatch allows its customers to monitor trends and seasonality with 15 months of metric data (storage and retention), which helps them perform historical analysis to fine-tune resource utilization.
- Amazon CloudWatch Metric Math enables customers to perform calculations across multiple metrics for real-time analysis so that they can derive insights from the existing CloudWatch metrics.
- Amazon CloudWatch Logs Insights enables customers to derive actionable intelligence from their logs to address operational issues.
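The kind of calculation that Metric Math performs across metrics can be illustrated locally. The following is a minimal sketch, not the CloudWatch implementation: it assumes two hypothetical metrics, `errors` and `requests`, already aggregated into aligned periods, and derives an error-rate percentage per period. Treating a period with zero requests as 0.0 is a choice made for this sketch only.

```python
def error_rate(errors, requests):
    """Derive a per-period error-rate percentage from two aligned series.

    Both lists are hypothetical, already-aggregated data points (one value
    per period). Periods with zero requests are reported as 0.0 here.
    """
    return [100 * e / r if r else 0.0 for e, r in zip(errors, requests)]


# Three consecutive periods of made-up data.
rates = error_rate(errors=[5, 0, 2], requests=[100, 50, 0])
print(rates)
```

In Metric Math itself, a comparable calculation would be written as an expression over metric IDs, such as `100 * errors / requests`.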
Container Insights simplifies the analysis of observability data from metrics, logs, and traces. Its automatic dashboards link directly to granular performance events, application logs (stdout/stderr), custom logs, predefined Amazon EC2 instance logs, Amazon EKS/k8s data plane logs, and Amazon EKS control plane logs, all of which can be queried with the CloudWatch Logs Insights advanced query language.
Amazon CloudWatch Integrations
Amazon Simple Notification Service (Amazon SNS) coordinates and manages the delivery or sending of messages to subscribing endpoints or clients.
- Using Amazon SNS with CloudWatch, customers can send messages when an alarm threshold has been reached.
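As an illustration of wiring an alarm to an SNS topic, the sketch below only builds the request parameters as a plain dictionary, so it runs without AWS credentials. The parameter names follow the boto3 `put_metric_alarm` call; the instance ID, topic ARN, threshold, and alarm name are hypothetical placeholders chosen for this example.

```python
def build_cpu_alarm(instance_id, topic_arn, threshold=80.0):
    """Build keyword arguments for a CPU alarm that notifies an SNS topic."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,               # seconds per evaluation period
        "EvaluationPeriods": 2,      # breach must persist for two periods
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # SNS topic notified on ALARM
    }


# Placeholder identifiers, not real resources.
alarm = build_cpu_alarm(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```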
Amazon EC2 Auto Scaling enables customers to automatically launch or terminate Amazon EC2 instances based on user-defined policies, health status checks, and schedules.
- AWS customers can use a CloudWatch alarm with Amazon EC2 Auto Scaling to scale their EC2 instances based on demand.
AWS CloudTrail enables customers to monitor the calls made to the Amazon CloudWatch API for their account, including calls made by the AWS Management Console, AWS CLI, and other services.
- When CloudTrail logging is turned on, CloudWatch writes log files to the Amazon S3 bucket that customers specified when they configured CloudTrail.
AWS Identity and Access Management (IAM) is a web service that helps AWS customers securely control access to AWS resources for their users.
- Using IAM, customers can control who can use their AWS resources (authentication) and what resources those users can use in which ways (authorization).
Amazon CloudWatch is essentially a metrics repository. An AWS service, such as Amazon EC2, puts metrics into the repository, and AWS customers retrieve statistics based on those metrics.
Amazon CloudWatch Concepts
Statistics are metric data aggregations over specified periods of time. CloudWatch provides statistics based on the metric data points provided by customers' custom data or by other AWS services to CloudWatch.
- Aggregations are made using the namespace, metric name, dimensions, and the data point unit of measure, within the time period that customers specify.
Each statistic has a unit of measure, such as Bytes, Seconds, Count, and Percent. Customers can specify a unit when they create a custom metric. Units help provide conceptual meaning to the data. Though Amazon CloudWatch attaches no significance to a unit internally, other applications can derive semantic information based on the unit.
- Metric data points that specify a unit of measure are aggregated separately.
- If customers get statistics without specifying a unit, Amazon CloudWatch aggregates all data points of the same unit together.
- If there are two otherwise identical metrics with different units, two separate data streams are returned, one for each unit.
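The per-unit aggregation described above can be sketched locally. This is a simplified illustration under stated assumptions, not how CloudWatch is implemented: data points are modeled as hypothetical `(value, unit)` pairs for one metric and one period, and each unit yields its own stream of statistics.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_unit(data_points):
    """Group (value, unit) data points by unit and compute basic statistics."""
    groups = defaultdict(list)
    for value, unit in data_points:
        groups[unit].append(value)
    return {
        unit: {
            "SampleCount": len(values),
            "Sum": sum(values),
            "Average": mean(values),
            "Minimum": min(values),
            "Maximum": max(values),
        }
        for unit, values in groups.items()
    }


# Made-up data points published with two different units.
points = [(120, "Bytes"), (80, "Bytes"), (2, "Seconds")]
stats = aggregate_by_unit(points)
for unit, summary in stats.items():
    print(unit, summary)
```

Because the points carry two different units, the result contains two separate summaries, one for Bytes and one for Seconds.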
Metrics are the fundamental concept in CloudWatch. A metric represents a time-ordered set of data points that are published to CloudWatch. Think of a metric as a variable to monitor, and the data points as representing the values of that variable over time.
- AWS services send metrics to CloudWatch, and customers can also send their own custom metrics. Customers can add the data points in any order and at any rate they choose.
- Metrics exist only in the Region in which they are created. Although metrics cannot be deleted, they automatically expire after 15 months if no new data is published to them.
- Data points older than 15 months expire on a rolling basis; as new data points come in, data older than 15 months is dropped.
- Metrics are uniquely defined by a name, a namespace, and zero or more dimensions. Each data point in a metric has a time stamp and, optionally, a unit of measure.
CloudWatch treats each unique combination of dimensions as a separate metric, even if the metrics have the same metric name. Customers can only retrieve statistics using combinations of dimensions that they specifically published.
- When retrieving statistics, customers should specify the same values for the namespace, metric name, and dimension parameters that were used when the metrics were created.
- They can also specify the start and end times for CloudWatch to use for aggregation.
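The idea that each unique combination of dimensions is a separate metric can be modeled with a small in-memory store. This is a toy sketch for illustration only: the `store` dictionary, the `MyApp` namespace, and the dimension names are all hypothetical, and dimension order is assumed not to matter.

```python
def metric_key(namespace, name, dimensions=None):
    """Identity of a metric: namespace, name, and the set of dimension pairs."""
    return (namespace, name, frozenset((dimensions or {}).items()))


store = {}  # toy stand-in for a metrics repository


def publish(namespace, name, value, dimensions=None):
    """Append a value to the metric identified by the full key."""
    store.setdefault(metric_key(namespace, name, dimensions), []).append(value)


publish("MyApp", "Latency", 12, {"Server": "Prod", "Domain": "Frankfurt"})
publish("MyApp", "Latency", 30, {"Server": "Prod"})
# Same dimension pairs as the first call, so this lands in the same metric.
publish("MyApp", "Latency", 18, {"Domain": "Frankfurt", "Server": "Prod"})

print(len(store))  # number of distinct metrics created
```

Looking up `metric_key("MyApp", "Latency")` with no dimensions finds nothing in this model, which mirrors the rule that statistics can only be retrieved for dimension combinations that were actually published.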
A period is the length of time associated with a specific Amazon CloudWatch statistic. Each statistic represents an aggregation of the metrics data collected for a specified period of time. Periods are defined in numbers of seconds, and valid values for period are 1, 5, 10, 30, or any multiple of 60.
- Only custom metrics that customers define with a storage resolution of 1 second support sub-minute periods.
- Even though the option to set a period below 60 is always available in the console, customers should select a period that aligns with how the metric is stored.
- Periods are important for CloudWatch alarms. When creating an alarm to monitor a specific metric, customers are asking CloudWatch to compare that metric to the threshold value that they specified.
- Customers can not only specify the period over which the comparison is made, but they can also specify how many evaluation periods are used to arrive at a conclusion.
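The rule for valid periods, and how a period combines with a number of evaluation periods, can be expressed in a few lines. This is a minimal sketch based only on the values listed above; the function names are made up for this example.

```python
def is_valid_period(seconds):
    """Return True for 1, 5, 10, 30, or any positive multiple of 60."""
    return seconds in (1, 5, 10, 30) or (seconds > 0 and seconds % 60 == 0)


def evaluation_window(period_seconds, evaluation_periods):
    """Total number of seconds an alarm looks at before reaching a conclusion."""
    if not is_valid_period(period_seconds):
        raise ValueError(f"invalid period: {period_seconds}")
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    return period_seconds * evaluation_periods


for candidate in (1, 30, 45, 60, 300):
    print(candidate, is_valid_period(candidate))

# A 60-second period evaluated over 3 periods spans 180 seconds.
print(evaluation_window(60, 3))
```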
A namespace is a container for CloudWatch metrics. Metrics in different namespaces are isolated from each other, so that metrics from different applications are not mistakenly aggregated into the same statistics.
- There is no default namespace. Customers need to specify a namespace for each data point they publish to CloudWatch.
- Customers can specify a namespace name during the creation of a metric. These names must contain valid XML characters, and be fewer than 256 characters in length.
Each metric data point must be associated with a time stamp. The time stamp can be up to two weeks in the past and up to two hours into the future. If customers do not provide a time stamp, CloudWatch creates one based on the time the data point was received.
- Time stamps are dateTime objects, with the complete date plus hours, minutes, and seconds.
- Amazon CloudWatch alarms check metrics based on the current time in Coordinated Universal Time (UTC). Custom metrics sent to CloudWatch with time stamps other than the current UTC time can cause alarms to display the Insufficient Data state or result in delayed alarms.
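The time stamp rules above can be checked on the client side before publishing. The following is a sketch under stated assumptions: the function name is hypothetical, the accepted window uses the two-week and two-hour limits from the text, and rejecting naive (timezone-less) values is a choice of this example rather than documented CloudWatch behavior.

```python
from datetime import datetime, timedelta, timezone

MAX_PAST = timedelta(weeks=2)    # oldest accepted time stamp
MAX_FUTURE = timedelta(hours=2)  # furthest accepted future time stamp


def resolve_timestamp(timestamp=None, now=None):
    """Return a UTC time stamp for a data point, defaulting to the current time."""
    now = now or datetime.now(timezone.utc)
    if timestamp is None:
        return now  # mirrors the default of using the time of receipt
    if timestamp.tzinfo is None:
        raise ValueError("use a timezone-aware time stamp to avoid UTC drift")
    if not (now - MAX_PAST <= timestamp <= now + MAX_FUTURE):
        raise ValueError("time stamp is outside the accepted window")
    return timestamp.astimezone(timezone.utc)


fixed_now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
print(resolve_timestamp(fixed_now - timedelta(days=13), fixed_now))
```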
A dimension is a name/value pair that is part of the identity of a metric, and AWS customers can assign up to 10 dimensions to a metric. Every metric has specific characteristics that describe it. Dimensions can be described as categories for those characteristics.
- Dimensions help customers design a structure for their statistics plan. Because dimensions are part of the unique identifier for a metric, whenever customers add a unique name/value pair to one of the metrics, by default they are creating a new variation of that metric.
- AWS services that send data to CloudWatch attach dimensions to each metric. Customers can use dimensions to filter the results that CloudWatch returns.
- For metrics produced by certain AWS services, such as Amazon EC2, CloudWatch can aggregate data across dimensions.
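To make namespaces and dimensions concrete, the sketch below builds a request for publishing one custom data point. It only constructs a dictionary, so it runs without AWS access. The parameter names follow the boto3 `put_metric_data` call; the namespace, metric name, and dimension values are hypothetical, and the dimension limit is taken from the text above rather than verified against current service quotas.

```python
MAX_DIMENSIONS = 10  # limit as stated in the text above


def build_metric_datum(name, value, unit, dimensions):
    """Build one MetricData entry with name/value dimension pairs."""
    if len(dimensions) > MAX_DIMENSIONS:
        raise ValueError(f"at most {MAX_DIMENSIONS} dimensions per metric")
    return {
        "MetricName": name,
        # Sorted only to make the output deterministic in this example.
        "Dimensions": [{"Name": k, "Value": v} for k, v in sorted(dimensions.items())],
        "Value": value,
        "Unit": unit,
    }


request = {
    "Namespace": "MyApp/Checkout",  # hypothetical; there is no default namespace
    "MetricData": [
        build_metric_datum(
            "OrderLatency", 212.0, "Milliseconds",
            {"Stage": "prod", "InstanceType": "m5.large"},
        )
    ],
}
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch").put_metric_data(**request)
print(request)
```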
CloudWatch retains metric data as follows:
- Data points with a period of less than 60 seconds are available for 3 hours. These data points are high-resolution custom metrics.
- Data points with a period of 60 seconds (1 minute) are available for 15 days.
- Data points with a period of 300 seconds (5 minutes) are available for 63 days.
- Data points with a period of 3600 seconds (1 hour) are available for 455 days (15 months).
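The retention schedule above can be captured as a small lookup. This is a sketch, not an official API: the function name is made up, and mapping a period that falls between the listed tiers (for example, 120 seconds) to the next finer tier is an assumption of this example.

```python
from datetime import timedelta

# (minimum period in seconds, retention) from coarsest to finest tier.
RETENTION_TIERS = [
    (3600, timedelta(days=455)),
    (300, timedelta(days=63)),
    (60, timedelta(days=15)),
]
SUB_MINUTE_RETENTION = timedelta(hours=3)  # high-resolution custom metrics


def retention_for(period_seconds):
    """Return how long data points at the given period remain available."""
    if period_seconds <= 0:
        raise ValueError("period must be positive")
    if period_seconds < 60:
        return SUB_MINUTE_RETENTION
    for minimum_period, retention in RETENTION_TIERS:
        if period_seconds >= minimum_period:
            return retention


for period in (1, 60, 300, 3600):
    print(period, retention_for(period))
```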
A percentile indicates the relative standing of a value in a dataset. Percentiles help customers to get a better understanding of the distribution of their metric data. Percentiles are often used to isolate anomalies.
- Using percentiles, customers can monitor the 95th percentile of CPU utilization to check for instances with an unusually heavy load.
- Some CloudWatch metrics support percentiles as a statistic. For these metrics, customers can monitor their systems and applications using percentiles as they would when using the other CloudWatch statistics.
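To show what a percentile conveys, the sketch below computes percentiles over made-up CPU utilization samples using the nearest-rank method. The method is an assumption chosen for simplicity; the text does not state how CloudWatch computes percentiles internally, so results may differ from what the service reports.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("no data points")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Twenty hypothetical CPU utilization samples with one extreme outlier.
cpu = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13,
       12, 11, 15, 14, 13, 12, 16, 14, 13, 97]

print("p50:", percentile(cpu, 50))
print("p95:", percentile(cpu, 95))
print("max:", max(cpu))
```

Comparing the 95th percentile with the maximum shows whether a high reading reflects the bulk of the data or a single outlier.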
AWS customers can use an alarm to automatically initiate actions on their behalf. An alarm watches a single metric over a specified time period and performs one or more specified actions, based on the value of the metric relative to a threshold over time.
- The action is a notification sent to an Amazon SNS topic or an Auto Scaling policy.
- Alarms invoke actions for sustained state changes only. CloudWatch alarms do not invoke actions simply because they are in a particular state. The state must have changed and been maintained for a specified number of periods.
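The "sustained state change" behavior can be sketched as a small simulation. This is a simplified model, not the actual CloudWatch evaluation logic: it assumes one value per period, treats the alarm as breaching only when every value in the last `evaluation_periods` periods exceeds the threshold, and records an action only when the state changes. The state names are used here for illustration.

```python
def run_alarm(values, threshold, evaluation_periods):
    """Return (period index, new state) for every state change."""
    state = "INSUFFICIENT_DATA"
    transitions = []
    for i in range(len(values)):
        window = values[max(0, i - evaluation_periods + 1): i + 1]
        if len(window) < evaluation_periods:
            continue  # not enough periods yet to reach a conclusion
        breaching = all(v > threshold for v in window)
        new_state = "ALARM" if breaching else "OK"
        if new_state != state:
            # An action (for example, an SNS notification) would fire here.
            transitions.append((i, new_state))
            state = new_state
    return transitions


# Made-up CPU averages per period; the lone spike of 85 is not sustained.
cpu = [50, 85, 60, 90, 92, 95, 96, 70]
print(run_alarm(cpu, threshold=80, evaluation_periods=3))
```

In this model the alarm fires only once the threshold has been exceeded for three consecutive periods, and remaining in the ALARM state during the following period triggers nothing further.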
Amazon CloudWatch Dashboards
Amazon CloudWatch dashboards are customizable home pages in the CloudWatch console that customers can use to monitor their resources in a single view, even those resources that are spread across different Regions. Customers can use CloudWatch dashboards to create customized views of the metrics and alarms for their AWS resources.
Using dashboards, AWS customers can create the following:
- A single view for selected metrics and alarms to help them assess the health of resources and applications across one or more Regions. Customers can select the color used for each metric on each graph, so that they can easily track the same metric across multiple graphs. They can also create dashboards that display graphs and other widgets from multiple AWS accounts and multiple Regions. For more information, see Cross-Account Cross-Region CloudWatch Console.
- An operational playbook that provides guidance for team members during operational events about how to respond to specific incidents.
- A common view of critical resource and application measurements that can be shared by team members for faster communication flow during operational events.
You can create dashboards by using the console, the AWS CLI, or the
To access CloudWatch dashboards, you need one of the following:
- The AdministratorAccess policy
- The CloudWatchFullAccess policy
- A custom policy that includes one or more of these specific permissions:
cloudwatch:ListDashboards to be able to view dashboards
cloudwatch:PutDashboard to be able to create or modify dashboards
cloudwatch:DeleteDashboards to be able to delete dashboards
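A custom policy granting some of these permissions could be assembled as shown below. This is a sketch under stated assumptions: the function and its flags are hypothetical, the action names are the three listed above, and `"Resource": "*"` is used for brevity rather than as a recommendation. The output is a standard IAM JSON policy document.

```python
import json


def dashboard_policy(can_view=True, can_edit=False, can_delete=False):
    """Build an IAM policy document for CloudWatch dashboard access."""
    actions = []
    if can_view:
        actions.append("cloudwatch:ListDashboards")
    if can_edit:
        actions.append("cloudwatch:PutDashboard")
    if can_delete:
        actions.append("cloudwatch:DeleteDashboards")
    if not actions:
        raise ValueError("select at least one permission")
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Resource "*" is an assumption of this sketch.
            {"Effect": "Allow", "Action": actions, "Resource": "*"}
        ],
    }


# A policy for team members who may view and edit, but not delete.
policy = dashboard_policy(can_view=True, can_edit=True)
print(json.dumps(policy, indent=2))
```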