AWS Batch

Batch computing is used by developers, scientists, and engineers to access large amounts of compute resources. AWS Batch efficiently provisions resources in response to submitted jobs in order to eliminate capacity constraints, reduce compute costs, and deliver results quickly. It automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads.

  • AWS Batch dynamically provisions the optimal quantity and type of compute resources (for example, CPU- or memory-optimized instances) based on the volume and specific resource requirements of the batch jobs submitted.
  • AWS Batch plans, schedules, and executes the client’s batch computing workloads across the full range of AWS compute services, such as Amazon EC2 and Spot Instances.

Batch features

AWS Batch efficiently and dynamically provisions and scales Amazon EC2 and Spot Instances based on the requirements of the jobs. Customers can configure their AWS Batch Managed Compute Environments with requirements such as:

  • Type of EC2 instances 
  • VPC subnet configurations
  • The min/max/desired vCPUs across all instances
  • The amount they are willing to pay for Spot Instances as a % of the On-Demand Instance price.
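
The settings listed above can be collected into one request structure. The following is a minimal sketch, not a definitive implementation: the helper name and all sample values are hypothetical, and the dictionary keys are assumed to mirror the computeResources fields of the AWS Batch CreateComputeEnvironment API. A real request needs further fields (for example security groups and an instance role) that are omitted here.

```python
def build_compute_resources(instance_types, subnets, min_vcpus, max_vcpus,
                            desired_vcpus, bid_percentage=None):
    """Assemble a computeResources-style dict for a managed compute environment."""
    if not 0 <= min_vcpus <= desired_vcpus <= max_vcpus:
        raise ValueError("expected 0 <= min <= desired <= max vCPUs")
    resources = {
        "type": "EC2",  # On-Demand unless a Spot bid percentage is given
        "instanceTypes": list(instance_types),
        "subnets": list(subnets),
        "minvCpus": min_vcpus,
        "maxvCpus": max_vcpus,
        "desiredvCpus": desired_vcpus,
    }
    if bid_percentage is not None:
        if not 1 <= bid_percentage <= 100:
            raise ValueError("bid percentage must be between 1 and 100")
        # Percentage of the On-Demand price the customer is willing to pay.
        resources["type"] = "SPOT"
        resources["bidPercentage"] = bid_percentage
    return resources


# Hypothetical values, for illustration only.
spot_resources = build_compute_resources(
    instance_types=["c5.large", "m5.large"],
    subnets=["subnet-11111111"],
    min_vcpus=0, max_vcpus=64, desired_vcpus=8,
    bid_percentage=60,
)
```

Validating the vCPU bounds locally catches a misordered min/max before any request is sent.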

Alternatively, AWS customers can elect to provision and manage their own compute resources within AWS Batch Unmanaged Compute Environments if they need configurations that managed environments do not provide, such as larger EBS volumes or a different operating system for their EC2 instances.

  • In that case, they need to provision EC2 instances that include the Amazon ECS agent and run supported versions of Linux and Docker.

AWS Batch supports multi-node parallel jobs, which enables users to run single jobs that span multiple EC2 instances. This feature allows AWS customers to use AWS Batch to easily and efficiently run workloads such as large-scale, tightly-coupled High Performance Computing (HPC) applications or distributed GPU model training. 

  • AWS Batch also supports Elastic Fabric Adapter, a network interface that enables users to run applications that require high levels of inter-node communication at scale on AWS.

AWS Batch supports EC2 Launch Templates, which help users build customized templates for their compute resources and allow Batch to scale instances with those requirements.

  • Users can specify their EC2 Launch Template to add storage volumes, specify network interfaces, or configure permissions, among other capabilities. 
  • EC2 Launch Templates reduce the number of steps required to configure Batch environments by capturing launch parameters within one resource.

AWS Batch enables customers to specify resource requirements, such as vCPU and memory, AWS Identity and Access Management (IAM) roles, volume mount points, container properties, and environment variables, to define how jobs are to be run. 

  • AWS Batch executes jobs as containerized applications running on Amazon ECS.
  • Batch also enables customers to define dependencies between different jobs.
    • For example, users can create three jobs with different resource requirements where each successive job depends on the previous job.
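
The three-job chain described above can be sketched as follows. This is an illustration under stated assumptions: `FakeBatchClient` is a hypothetical stand-in so the example runs without AWS access, the queue and job definition names are invented, and the request keys (`jobName`, `jobQueue`, `jobDefinition`, `dependsOn`) are assumed to match the AWS Batch SubmitJob API.

```python
class FakeBatchClient:
    """Hypothetical stand-in for a Batch client; records requests only."""

    def __init__(self):
        self.requests = []

    def submit_job(self, **request):
        self.requests.append(request)
        return {"jobId": f"job-{len(self.requests)}"}


def submit_chain(client, queue, steps):
    """Submit jobs so that each one depends on the job submitted before it."""
    job_ids = []
    for name, definition in steps:
        request = {"jobName": name, "jobQueue": queue, "jobDefinition": definition}
        if job_ids:
            # The new job stays PENDING until the previous job succeeds.
            request["dependsOn"] = [{"jobId": job_ids[-1]}]
        response = client.submit_job(**request)
        job_ids.append(response["jobId"])
    return job_ids


client = FakeBatchClient()
chain_ids = submit_chain(
    client,
    "example-queue",
    [("extract", "small-def"), ("transform", "large-def"), ("load", "small-def")],
)
```

Each step can reference a different job definition, which is how the jobs in the chain get different resource requirements.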

AWS Batch allows customers to set up multiple queues with different priority levels. Batch jobs are stored in the queues until compute resources are available to execute the job. The AWS Batch scheduler evaluates when, where, and how to run jobs that have been submitted to a queue based on the resource requirements of each job. 

  • The scheduler evaluates the priority of each queue and runs jobs in priority order on optimal compute resources (for example, memory-optimized versus CPU-optimized), as long as those jobs have no outstanding dependencies.

AWS Batch supports GPU scheduling, which allows customers to specify the number and type of accelerators their jobs require as job definition input variables.

  • A Graphics Processing Unit (GPU) is a processor designed to handle graphics operations. This includes both 2D and 3D calculations, though GPUs primarily excel at rendering 3D graphics.
  • AWS Batch will scale up instances appropriate for the customers’ jobs based on the required number of GPUs and isolate the accelerators according to each job’s needs, so only the appropriate containers can access them.
  • All instance types in a compute environment that will run GPU jobs should be from the p2, p3, g3, g3s, or g4 instance families. If this is not done, a GPU job could get stuck in the RUNNABLE status.
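
A small pre-flight check can catch the misconfiguration described in the last bullet. The sketch below is hedged: the helper names are hypothetical, matching families by name prefix is a simplifying heuristic, and the `{"type": "GPU", "value": ...}` shape is assumed to follow the resourceRequirements entry of a Batch job definition.

```python
# Family prefixes taken from the list above; "g3" also covers g3s.
GPU_FAMILY_PREFIXES = ("p2", "p3", "g3", "g4")


def non_gpu_instance_types(instance_types):
    """Return the instance types whose family is not in the GPU list."""
    return [
        instance_type
        for instance_type in instance_types
        if not instance_type.split(".")[0].startswith(GPU_FAMILY_PREFIXES)
    ]


def gpu_requirement(count):
    """Build a resource requirement entry asking for `count` GPUs."""
    if count < 1:
        raise ValueError("a GPU job needs at least one GPU")
    return {"type": "GPU", "value": str(count)}


offenders = non_gpu_instance_types(["p3.2xlarge", "g4dn.xlarge", "c5.large"])
```

An empty `offenders` list suggests that every instance type in the environment can satisfy a GPU job.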

Compute resources

There are three ways to allocate compute resources. These methods factor in throughput as well as price when deciding how AWS Batch should scale instances on the customer’s behalf.

  • Best Fit: In the case of Best Fit, AWS Batch selects an instance type that best fits the needs of the jobs with a preference for the lowest-cost instance type. If additional instances of the selected instance type are not available, AWS Batch will wait for the additional instances to be available.
  • Best Fit Progressive: In this case, AWS Batch will select additional instance types that are large enough to meet the requirements of the jobs in the queue, with a preference for instance types with a lower cost per unit vCPU. 
    • If additional instances of the previously selected instance types are not available, AWS Batch will select new instance types.
  • Spot Capacity Optimized: In this method AWS Batch will select one or more instance types that are large enough to meet the requirements of the jobs in the queue, with a preference for instance types that are less likely to be interrupted. 
    • This allocation strategy is only available for Spot Instance compute resources.
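
The constraint on the Spot-only strategy can be checked before a compute environment is created. This is a minimal sketch: the function name is hypothetical, and the strategy identifiers are assumed to be the values accepted by the allocationStrategy field of the Batch API.

```python
ALLOCATION_STRATEGIES = (
    "BEST_FIT",
    "BEST_FIT_PROGRESSIVE",
    "SPOT_CAPACITY_OPTIMIZED",
)


def check_allocation_strategy(strategy, resource_type):
    """Return the matching settings, or raise if the combination is invalid."""
    if strategy not in ALLOCATION_STRATEGIES:
        raise ValueError(f"unknown allocation strategy: {strategy}")
    if strategy == "SPOT_CAPACITY_OPTIMIZED" and resource_type != "SPOT":
        # This strategy is only available for Spot Instance compute resources.
        raise ValueError("SPOT_CAPACITY_OPTIMIZED requires Spot compute resources")
    return {"type": resource_type, "allocationStrategy": strategy}


spot_settings = check_allocation_strategy("SPOT_CAPACITY_OPTIMIZED", "SPOT")
```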


AWS Batch can be integrated with commercial and open-source workflow engines and languages such as:


Luigi is a workflow management system to efficiently launch a group of tasks with defined dependencies between them.

  • Luigi provides a Python-based API that builds and executes pipelines of Hadoop jobs, and it can also be used to create workflows with external jobs written in R, Scala, or Spark.


Metaflow is a human-friendly Python library that helps scientists and engineers build and manage real-life data science projects. 

  • Metaflow was originally developed at Netflix to boost productivity of data scientists who work on a wide variety of projects from classical statistics to state-of-the-art deep learning.

Apache Airflow

  • Apache Airflow, also known as Airflow, is a platform to programmatically author, schedule, and monitor workflows. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
    • Customers can use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks.
    • The Airflow scheduler executes tasks on an array of workers while following the specified dependencies.
    • Rich command line utilities make performing complex surgeries on DAGs a snap.
    • It has a rich user interface that makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

Pegasus WMS

Pegasus WMS is a scientific workflow management system that can manage the execution of complex workflows on distributed resources. It is funded by the National Science Foundation.

  • Pegasus WMS has been used in a number of scientific domains including astronomy, bioinformatics, earthquake science, gravitational wave physics, ocean science, limnology, and others.


Nextflow is a relatively lightweight Java application that a single user can easily manage. A Nextflow workflow makes it easy to run any analysis while transparently managing the issues that tend to crop up when running a shell script (missing dependencies, not enough resources, failures that are hard to locate, and results that are not easily published or transferred to collaborators).

  • Nextflow has robust reporting features, real-time status updates, and workflow failure or completion handling and notifications via a user-definable on-completion process.

AWS Step Functions

AWS Step Functions enables customers to coordinate multiple AWS services into serverless workflows, so that they can build and update apps quickly. 

  • Using Step Functions, customers can design and run workflows that stitch together services, such as AWS Lambda, AWS Fargate, and Amazon SageMaker, into feature-rich applications. 
  • Workflows are made up of a series of steps, with the output of one step acting as input into the next.

Job States

Jobs are the unit of work executed by AWS Batch, and they are executed as containerized applications running on Amazon ECS container instances in an ECS cluster. Containerized jobs can reference a container image, command, and parameters.

When customers submit a job to an AWS Batch job queue, the job enters the SUBMITTED state. It then passes through the following states until it succeeds (exits with code 0) or fails (exits with a non-zero code). AWS Batch jobs can have the following states:

  • SUBMITTED: A job that has been submitted to the queue and has not yet been evaluated by the scheduler. The scheduler evaluates the job to determine if it has any outstanding dependencies on the successful completion of any other jobs. If there are dependencies, the job is moved to PENDING. If there are no dependencies, the job is moved to RUNNABLE.
  • PENDING: A job that resides in the queue and is not yet able to run due to a dependency on another job or resource. After the dependencies are satisfied, the job is moved to RUNNABLE.
  • RUNNABLE: A job that resides in the queue, has no outstanding dependencies, and is therefore ready to be scheduled to a host. Jobs in this state are started as soon as sufficient resources are available in one of the compute environments that are mapped to the job’s queue. However, jobs can remain in this state indefinitely when sufficient resources are unavailable.
  • STARTING: These jobs have been scheduled to a host and the relevant container initiation operations are underway. After the container image is pulled and the container is up and running, the job transitions to RUNNING.
  • RUNNING: The job is running as a container job on an Amazon ECS container instance within a compute environment. When the job’s container exits, the process exit code determines whether the job succeeded or failed. An exit code of 0 indicates success, and any non-zero exit code indicates failure. If the job associated with a failed attempt has any remaining attempts left in its optional retry strategy configuration, the job is moved to RUNNABLE again.
  • SUCCEEDED: The job has successfully completed with an exit code of 0. The job state for SUCCEEDED jobs is persisted in AWS Batch for 24 hours.
  • FAILED: The job has failed all available attempts. The job state for FAILED jobs is persisted in AWS Batch for 24 hours.
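
The job states in this section form a small state machine. The sketch below models only the transitions described in the text. It is a local illustration with hypothetical helper names, not a call into AWS Batch, and it leaves out paths the text does not cover, such as cancelled jobs.

```python
# Allowed transitions, taken from the state descriptions in this section.
TRANSITIONS = {
    "SUBMITTED": {"PENDING", "RUNNABLE"},
    "PENDING": {"RUNNABLE"},
    "RUNNABLE": {"STARTING"},
    "STARTING": {"RUNNING"},
    "RUNNING": {"SUCCEEDED", "FAILED", "RUNNABLE"},
    "SUCCEEDED": set(),
    "FAILED": set(),
}


def state_after_exit(exit_code, attempts_used, max_attempts=1):
    """Decide where a RUNNING job goes once its container exits."""
    if exit_code == 0:
        return "SUCCEEDED"
    if attempts_used < max_attempts:
        return "RUNNABLE"  # retry strategy still has attempts left
    return "FAILED"


def is_valid_path(states):
    """Check that every consecutive pair of states is an allowed transition."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(states, states[1:]))
```

For example, `state_after_exit(1, attempts_used=1, max_attempts=3)` sends the job back to RUNNABLE, while the same exit code on the third attempt ends in FAILED.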

Batch Components

Node Groups

A node group is an identical group of job nodes, where all nodes share the same container properties. AWS Batch lets customers specify up to five distinct node groups for each job. 

  • Each group can have its own container images, commands, environment variables, and so on.
  • Alternatively, they can use all of the nodes in their job as a single node group, and the application code can differentiate node roles (main node versus child node).

Multi-node parallel jobs

Multi-node parallel jobs are used to run single jobs that span multiple Amazon EC2 instances. Batch multi-node parallel jobs can run large-scale, tightly coupled, high performance computing applications and distributed GPU model training without the need to launch, configure, and manage Amazon EC2 resources directly. An AWS Batch multi-node parallel job is compatible with any framework that supports IP-based, internode communication, such as Apache MXNet, TensorFlow, Caffe2, or Message Passing Interface (MPI).

  • Multi-node parallel jobs are submitted as a single job. The job definition, or the node overrides supplied at job submission, specifies the number of nodes to create for the job and what node groups to create.
  • Each multi-node parallel job contains a main node, a single subtask that AWS Batch monitors to determine the outcome of the submitted multi-node job.
    • The main node is launched first and moves to the STARTING status. Once the main node is up and running, the child nodes are launched and started.
    • If the main node exits, the job is considered finished, and the child nodes are stopped. For more information, see Node Groups.
  • Multi-node parallel job nodes are single-tenant, meaning that only a single job container is run on each Amazon EC2 instance.
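
The node count, main node, and node groups can be expressed as one structure. The following is a sketch under assumptions: the helper name and container images are hypothetical, and the keys (`numNodes`, `mainNode`, `nodeRangeProperties`, `targetNodes`) are assumed to follow the nodeProperties section of a Batch job definition, where a range such as `1:3` covers node indexes 1 through 3.

```python
def build_node_properties(num_nodes, groups, main_node=0):
    """Build nodeProperties from (first_index, last_index, container) groups."""
    if not 1 <= len(groups) <= 5:
        raise ValueError("a job may define between one and five node groups")
    if not 0 <= main_node < num_nodes:
        raise ValueError("main node index must refer to an existing node")
    covered = []
    ranges = []
    for first, last, container in groups:
        covered.extend(range(first, last + 1))
        ranges.append({"targetNodes": f"{first}:{last}", "container": container})
    if sorted(covered) != list(range(num_nodes)):
        raise ValueError("node groups must cover every node exactly once")
    return {
        "numNodes": num_nodes,
        "mainNode": main_node,
        "nodeRangeProperties": ranges,
    }


# Hypothetical images: one main node plus three identical worker nodes.
node_properties = build_node_properties(
    4,
    [(0, 0, {"image": "example/main"}), (1, 3, {"image": "example/worker"})],
)
```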

Array job

An array job is a job that shares common parameters, such as the job definition, vCPUs, and memory. It runs as a collection of related, yet separate, basic jobs that may be distributed across multiple hosts and may run concurrently. Array jobs are the most efficient way to execute embarrassingly parallel jobs such as Monte Carlo simulations, parametric sweeps, or large rendering jobs.

  • AWS Batch array jobs are submitted just like regular jobs. However, customers also specify an array size between 2 and 10,000 to define how many child jobs should run in the array.
  • If the submitted job has an array size of 1000, a single job runs and spawns 1000 child jobs. The array job is a reference or pointer to manage all the child jobs. This allows customers to submit large workloads with a single query.
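
The array size rule and the way a child job finds its share of the work can be sketched as follows. Assumptions are labeled: the helper names are hypothetical, the `{"size": ...}` shape is assumed to match the arrayProperties field of SubmitJob, child job IDs are assumed to take the form `parent:index`, and each child is assumed to receive its index in the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable.

```python
def array_properties(size):
    """Validate the array size and return an arrayProperties-style dict."""
    if not 2 <= size <= 10_000:
        raise ValueError("array size must be between 2 and 10,000")
    return {"size": size}


def child_job_ids(parent_job_id, size):
    """List the child job IDs, assuming the parent:index naming scheme."""
    return [f"{parent_job_id}:{index}" for index in range(size)]


def work_item_for_child(items, environ):
    """Pick the input that belongs to this child, based on its array index."""
    index = int(environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    return items[index]


# Inside a container the mapping would normally be os.environ.
chosen = work_item_for_child(
    ["a.csv", "b.csv", "c.csv"], {"AWS_BATCH_JOB_ARRAY_INDEX": "2"}
)
```

Passing the environment as an argument keeps the function easy to exercise outside a container.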

Job Definitions

AWS Batch job definitions specify how jobs are to be run. While each job must reference a job definition, many of the parameters that are specified in the job definition can be overridden at runtime. The following are some of the attributes that a job definition specifies:

  • Which Docker image to use with the container in your job
  • How many vCPUs and how much memory to use with the container
  • The command the container should run when it is started
  • What (if any) environment variables should be passed to the container when it starts
  • Any data volumes that should be used with the container
  • What (if any) IAM role your job should use for AWS permissions
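
The attributes above map onto a job definition roughly as shown below, together with a local model of runtime overrides. This is a sketch, not the service's behavior: every name, ARN, and value is hypothetical, the keys are assumed to follow the containerProperties section of the Batch RegisterJobDefinition API, and `apply_overrides` only imitates how overrides supplied at submission replace definition values.

```python
import copy

# Hypothetical job definition covering the attributes listed above.
JOB_DEFINITION = {
    "jobDefinitionName": "example-job",
    "type": "container",
    "containerProperties": {
        "image": "busybox",                      # Docker image
        "vcpus": 1,                              # vCPUs
        "memory": 512,                           # memory in MiB
        "command": ["echo", "hello"],            # command run at start
        "environment": [{"name": "STAGE", "value": "test"}],
        "volumes": [{"name": "scratch", "host": {"sourcePath": "/tmp"}}],
        "mountPoints": [{"sourceVolume": "scratch", "containerPath": "/scratch"}],
        "jobRoleArn": "arn:aws:iam::123456789012:role/example-job-role",
    },
}


def apply_overrides(definition, overrides):
    """Return the effective container settings after runtime overrides."""
    merged = copy.deepcopy(definition["containerProperties"])
    for key in ("vcpus", "memory", "command"):
        if key in overrides:
            merged[key] = overrides[key]
    # Environment variables are merged by name; overrides win.
    env = {entry["name"]: entry["value"] for entry in merged.get("environment", [])}
    env.update({e["name"]: e["value"] for e in overrides.get("environment", [])})
    merged["environment"] = [{"name": n, "value": v} for n, v in env.items()]
    return merged


effective = apply_overrides(
    JOB_DEFINITION,
    {"memory": 1024, "environment": [{"name": "STAGE", "value": "prod"}]},
)
```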


Job Queues

Jobs are submitted to a job queue, where they reside until they are able to be scheduled to run in a compute environment. Any AWS account can have multiple job queues. 

  • Customers can create a queue that uses Amazon EC2 On-Demand instances for high priority jobs and another queue that uses Amazon EC2 Spot Instances for low-priority jobs. 
  • Job queues have a priority that is used by the scheduler to determine which jobs in which queue should be evaluated for execution first.
  • The AWS Batch scheduler evaluates when, where, and how to run jobs that have been submitted to a job queue. Jobs run in approximately the order in which they are submitted as long as all dependencies on other jobs have been met.

Compute Environments

Job queues are generally mapped to one or more compute environments. The compute environments contain the Amazon ECS container instances that are used to run containerized batch jobs. Within a job queue, the associated compute environments each have an order that is used by the scheduler to determine where to place jobs that are ready to be executed. 

  • If the first compute environment has free resources, then the job is scheduled to a container instance within that compute environment. 
  • If the compute environment is unable to provide a suitable compute resource, the scheduler attempts to run the job on the next compute environment.
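
The ordered fallback between compute environments can be illustrated with a toy model. It is a deliberate simplification with hypothetical names and numbers: the real scheduler considers more than free vCPUs.

```python
def place_job(required_vcpus, compute_environments):
    """Return the first environment, by order, with enough free vCPUs."""
    for env in sorted(compute_environments, key=lambda e: e["order"]):
        if env["free_vcpus"] >= required_vcpus:
            return env["name"]
    return None  # the job stays RUNNABLE until capacity appears


# Hypothetical queue mapping: try On-Demand first, then fall back to Spot.
ENVIRONMENTS = [
    {"name": "spot-env", "order": 2, "free_vcpus": 32},
    {"name": "on-demand-env", "order": 1, "free_vcpus": 4},
]

small_job_target = place_job(2, ENVIRONMENTS)
large_job_target = place_job(16, ENVIRONMENTS)
```

In this model a 2-vCPU job lands in the first environment, while a 16-vCPU job falls through to the second.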

Unmanaged Compute Environments

In an unmanaged compute environment, customers are responsible for managing their own compute resources.

  • Customers need to make sure that the AMI in use for their compute resources meets the Amazon ECS container instance AMI specification.
  • Once the unmanaged compute environment is created, customers can use the DescribeComputeEnvironments API operation to view the compute environment details.
    • They then find the Amazon ECS cluster that is associated with the environment and manually launch their container instances into that Amazon ECS cluster.

Managed Compute Environments

Managed compute environments allow customers to describe their business requirements. In a managed compute environment, AWS Batch manages the capacity and instance types of the compute resources within the environment, based on the compute resource specification that they define when they create the compute environment.

  • AWS customers can choose between two Amazon EC2 purchasing options: On-Demand Instances or Spot Instances.
  • Managed compute environments launch Amazon ECS container instances into the VPC and subnets that the clients specify when they create the compute environment.

Amazon Elastic Container Service (ECS)

Amazon Elastic Container Service (Amazon ECS) is one of the compute services provided by Amazon. It is a highly scalable, fast container management service that makes it easy to run, stop, and manage Docker containers on a cluster. Amazon ECS lets customers launch and stop container-based applications with simple API calls, get the state of the cluster from a centralized service, and access many familiar Amazon EC2 features. ECS is a great choice to run containers for several reasons:

  • AWS customers are able to run their ECS clusters using AWS Fargate, which is serverless compute for containers.
  • It natively integrates with other services such as Amazon Route 53, Secrets Manager, AWS Identity and Access Management (IAM), and Amazon CloudWatch, providing a familiar experience for deploying and scaling containers.
  • ECS is used extensively within Amazon to power services such as Amazon SageMaker, AWS Batch, Amazon Lex, and Amazon.com’s recommendation engine, ensuring ECS is tested extensively for security, reliability, and availability.


Images are built from a Dockerfile, a plain text file that specifies all of the components that are included in the container. These images are then stored in a registry from which they can be downloaded and run on the cluster.

  • A Docker container is a standardized unit of software development, containing everything that the client’s software application needs to run, including code, runtime, system tools, and system libraries. Containers are created from a read-only template called an image.
  • To deploy applications on Amazon ECS, the client’s application components need to be architected to run in containers.

After creating a task definition for the application within Amazon ECS, customers can specify the number of tasks that will run on their cluster. A task is the instantiation of a task definition within a cluster. 

  • Each task that uses the Fargate launch type has its own isolation boundary and does not share the underlying kernel, CPU resources, memory resources, or elastic network interface with another task.
  • The Amazon ECS task scheduler is responsible for placing tasks within the cluster.

Amazon ECS allows customers to define tasks through a declarative JSON template called a Task Definition. Within a Task Definition they can specify one or more containers that are required for the task, including the Docker repository and image, memory and CPU requirements, shared data volumes, and how the containers are linked to each other.

  • The API actions provided by ECS allow customers to create and delete clusters, register and deregister tasks, launch and terminate Docker containers, and obtain detailed information about the state of their cluster and its instances.
  • Customers can upload a new version of their application task definition, and the Amazon ECS scheduler automatically starts new containers using the updated image and stops containers running the previous version.
  • Amazon ECS will automatically recover unhealthy containers to ensure that the desired number of containers is supporting the application.

The task definition is a text file, in JSON format, that describes one or more containers, up to a maximum of ten, that form your application. It can be thought of as a blueprint for your application. Task definitions specify various parameters for your application.

  • The specific parameters available for the task definition depend on which launch type you are using.
  • For an application to run on Amazon ECS, customers need to create a task definition.
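
A task definition of the kind described above might look like the following sketch. The family name, image, and sizes are hypothetical, the keys are assumed to follow the Amazon ECS task definition format, and the validator checks only the two rules stated in this document: a family and container definitions are required, and a task definition holds at most ten containers.

```python
import json


def validate_task_definition(task_definition):
    """Return a list of problems found in a task definition dict."""
    problems = []
    if not task_definition.get("family"):
        problems.append("missing family")
    containers = task_definition.get("containerDefinitions", [])
    if not 1 <= len(containers) <= 10:
        problems.append("expected between 1 and 10 container definitions")
    return problems


# Hypothetical single-container task definition.
TASK_DEFINITION = {
    "family": "example-web",
    "containerDefinitions": [
        {
            "name": "web",
            "image": "nginx:latest",
            "cpu": 128,
            "memory": 256,
            "essential": True,
        }
    ],
}

# The JSON text form is what gets stored or registered.
task_definition_json = json.dumps(TASK_DEFINITION, indent=2)
```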


An Amazon ECS container instance is an Amazon EC2 instance that runs the Amazon ECS container agent. Amazon ECS downloads the client’s container images from a registry that they specify, and runs those images within the cluster. When running tasks using Amazon ECS, users place them on a cluster, which is a logical grouping of resources.

  • When using the Fargate launch type with tasks within the customer’s cluster, Amazon ECS manages the cluster resources.
  • When using the EC2 launch type, the customer’s clusters are a group of container instances that they manage.

Amazon ECS is integrated with AWS Cloud Map, which helps customers discover and connect their containerized services with each other. Cloud Map enables customers to define custom names for application resources, and it maintains the updated location of these dynamically changing resources.

  • Service mesh makes it easy to build and run complex microservices applications by standardizing how every microservice in the application communicates.
  • Amazon Elastic Container Service supports Docker networking and integrates with Amazon VPC to provide isolation for containers. 
  • Amazon ECS is integrated with Elastic Load Balancing, allowing customers to distribute traffic across their containers using Application Load Balancers or Network Load Balancers.
  • Amazon ECS allows clients to specify an IAM role for each ECS task. This allows the Amazon ECS container instances to have a minimal role.

AWS Fargate

AWS Fargate platform versions are used to refer to a specific runtime environment for Fargate task infrastructure. A platform version is a combination of the kernel and container runtime versions.

  • FireLens for Amazon ECS enables customers to use task definition parameters to route logs to an AWS service or AWS Partner Network (APN) destination for log storage and analytics.
  • It supports task recycling for Fargate tasks, which is the process of refreshing tasks that are a part of an Amazon ECS service.
  • It provides task definition parameters that enable customers to define a proxy configuration, dependencies for container startup and shutdown, and a per-container start and stop timeout value.
  • It supports injecting sensitive data into containers by storing the data in either AWS Secrets Manager secrets or AWS Systems Manager Parameter Store parameters and then referencing them in the container definition.
  • It enables CloudWatch Container Insights and supports the Spot capacity provider.

Fargate Platform Version‐1.2.0 enables private registry authentication using AWS Secrets Manager.

Fargate Platform Version‐1.1.0

It has the Amazon ECS task metadata endpoint, supports Docker health checks in container definitions, and it also supports Amazon ECS service discovery. 

  • Clusters are Region-specific
  • A cluster may contain a mix of tasks using either the Fargate or EC2 launch types.
  • A cluster may contain a mix of both Auto Scaling group capacity providers and Fargate capacity providers, however when specifying a capacity provider strategy they may only contain one or the other but not both.
  • It allows customers to create custom IAM policies to allow or restrict user access to specific clusters.

AWS Fargate is a technology that is used with Amazon ECS to run containers without having to manage servers or clusters of Amazon EC2 instances. AWS Fargate is a serverless compute engine for containers that works with both Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS).

Fargate provisions and scales clusters, patches and updates each server, handles task placement, and manages the availability of containers. All the user needs to do is define the application’s requirements and select Fargate as the launch type in the console or CLI, and Fargate takes care of the rest.

  • Each Fargate task has its own isolation boundary, and it does not share the underlying kernel, CPU resources, memory resources, or elastic network interface with another task.
  • Amazon ECS EC2 launch type enables customers to manage a cluster of servers and schedule placement of containers on the servers. 
  • In order to take full advantage of Fargate, customers are required to do the following:
    • Amazon ECS task definitions for Fargate network mode need to be set to awsvpc. The awsvpc network mode provides each task with its own elastic network interface.
    • Customers need to specify CPU and memory at the task level. They can also specify CPU and memory at the container level for Fargate tasks, if they desire to do so. Most use cases are satisfied by only specifying these resources at the task level.
  • Individual ECS tasks or EKS pods each run in their own dedicated kernel runtime environment and do not share CPU, memory, storage, or network resources with other tasks and pods. This ensures workload isolation and improved security for each task or pod.
  • Fargate has built-in integrations with other AWS services, including Amazon CloudWatch Container Insights. Customers can also gather metrics and logs for monitoring their applications through an extensive selection of third-party tools with open interfaces.
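
The two Fargate requirements named above (the awsvpc network mode, and CPU and memory set at the task level) can be checked with a small helper. This is a sketch limited to those two rules: the function name and sample values are hypothetical, and real Fargate task definitions are subject to further constraints that are not modeled here.

```python
def fargate_problems(task_definition):
    """Return the Fargate requirements that a task definition does not meet."""
    problems = []
    if task_definition.get("networkMode") != "awsvpc":
        problems.append("networkMode must be awsvpc")
    for key in ("cpu", "memory"):
        if key not in task_definition:
            problems.append(f"task-level {key} is required")
    return problems


# Hypothetical task definitions, for illustration only.
ready = {"family": "example", "networkMode": "awsvpc", "cpu": "256", "memory": "512"}
not_ready = {"family": "example", "networkMode": "bridge", "cpu": "256"}

ready_report = fargate_problems(ready)
not_ready_report = fargate_problems(not_ready)
```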


The type of instance that a client specifies determines the hardware of the host computer used for their instance. Each instance type offers different compute, memory, and storage capabilities, and instance types are grouped into instance families based on these capabilities. Each instance type provides higher or lower minimum performance from a shared resource.

ECS Cluster Auto Scaling

ECS Cluster Auto Scaling (CAS) is a capability provided by AWS that lets ECS manage the scaling of EC2 Auto Scaling groups (ASGs). With CAS, customers can configure ECS to scale the ASG automatically. Each cluster has one or more capacity providers and an optional default capacity provider strategy.

  • ECS will ensure the ASG scales in and out as needed with no further intervention required. CAS relies on ECS capacity providers, which provide the link between ECS cluster and the ASGs. Each ASG is associated with a capacity provider, and each such capacity provider has only one ASG, but many capacity providers can be associated with one ECS cluster.
  • When managed scaling is enabled, Amazon ECS manages the scale-in and scale-out actions of the Auto Scaling group. Amazon ECS creates an AWS Auto Scaling scaling plan with a target tracking scaling policy based on the target capacity value the customer specified.

Storage Optimised

Storage optimized instances are designed for workloads that require high, sequential read and write access to very large data sets on local storage. They are optimized to deliver tens of thousands of low-latency, random I/O operations per second (IOPS) to applications.

  • H1 and D2 instances feature up to 16 TB and 48 TB of HDD-based local storage respectively. Both deliver high disk throughput and a balance of compute and memory. D2 instances offer the lowest price per disk throughput performance on Amazon EC2.
  • The I3 and I3en instance families provide Non-Volatile Memory Express (NVMe) SSD-backed instance storage. I3 is optimized for low latency, very high random I/O performance, and high sequential read throughput. I3en provides high IOPS and high sequential disk throughput, and offers the lowest price per GB of SSD instance storage on Amazon EC2.

A task definition is the recipe that Amazon ECS uses to run tasks on the customer’s cluster. Task definitions are written as JSON statements. A task definition is required to run Docker containers in Amazon ECS. Task definitions are split into separate parts: the task family, the IAM task role, the network mode, container definitions, volumes, task placement constraints, and launch types.

  • The family and container definitions are required in a task definition, while task role, network mode, volumes, task placement constraints, and launch type are optional.
  • Amazon ECS provides a GPU-optimized AMI that comes ready with pre-configured NVIDIA kernel drivers and a Docker GPU runtime.
  • Amazon ECS enables you to inject sensitive data into your containers by storing your sensitive data in either AWS Secrets Manager secrets or AWS Systems Manager Parameter Store parameters and then referencing them in your container definition.

Capacity Providers

Capacity Providers manage compute capacity for containers and allow the application to define its requirements for how it uses the capacity. They can be used to define flexible rules for how containerized workloads run on different types of compute capacity, and to manage the scaling of the capacity. Using Capacity Providers improves the availability, scalability, and cost of running tasks and services on ECS.

  • A capacity provider is used in association with a cluster to determine the infrastructure that a task runs on. For Amazon ECS on Amazon EC2 users, a capacity provider consists of a name, an Auto Scaling group, and the settings for managed scaling and managed termination protection.
  • A default capacity provider strategy is associated with each Amazon ECS cluster. It determines the capacity provider strategy the cluster will use if no other capacity provider strategy or launch type is specified when running a task or creating a service.
  • A capacity provider strategy gives customers control over how their tasks use one or more capacity providers. The capacity provider strategy consists of one or more capacity providers with an optional base and weight specified for each provider.
  • Capacity Providers work with both EC2 and Fargate, so that customers can create a Capacity Provider associated with an EC2 Auto Scaling Group (ASG).
  • Splitting running tasks and services across multiple Capacity Providers enables new capabilities such as running a service in a predefined split percentage across Fargate and Fargate Spot, or ensuring that a service runs an equal number of tasks in multiple availability zones without requiring the service to rebalance.
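
The base and weight settings of a capacity provider strategy can be illustrated with a small calculation. This is a model under stated assumptions, not the ECS placement algorithm: the base is taken as a minimum task count filled first, the remaining tasks are split in proportion to the weights, and the largest-remainder rounding used here is a choice made for the example. The function name is hypothetical.

```python
def split_tasks(total_tasks, strategy):
    """Distribute tasks over capacity providers using base, then weight."""
    counts = {item["capacityProvider"]: 0 for item in strategy}
    remaining = total_tasks
    # Step 1: satisfy each provider's base (minimum task count) first.
    for item in strategy:
        take = min(item.get("base", 0), remaining)
        counts[item["capacityProvider"]] += take
        remaining -= take
    if remaining == 0:
        return counts
    # Step 2: split what is left in proportion to the weights.
    total_weight = sum(item.get("weight", 0) for item in strategy)
    if total_weight == 0:
        raise ValueError("remaining tasks need at least one non-zero weight")
    exact = {
        item["capacityProvider"]: remaining * item.get("weight", 0) / total_weight
        for item in strategy
    }
    shares = {name: int(value) for name, value in exact.items()}
    leftover = remaining - sum(shares.values())
    # Hand out rounding leftovers to the largest fractional parts.
    by_fraction = sorted(exact, key=lambda n: exact[n] - shares[n], reverse=True)
    for name in by_fraction[:leftover]:
        shares[name] += 1
    for name, share in shares.items():
        counts[name] += share
    return counts


STRATEGY = [
    {"capacityProvider": "FARGATE", "base": 2, "weight": 1},
    {"capacityProvider": "FARGATE_SPOT", "weight": 3},
]
ten_task_split = split_tasks(10, STRATEGY)
```

With ten tasks, two go to FARGATE as its base and the other eight are split 1:3, giving four tasks on FARGATE and six on FARGATE_SPOT in this model.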

Compute Optimised

Compute Optimized instances are ideal for compute bound applications that benefit from high performance processors. Instances belonging to this family are well suited for batch processing workloads, media transcoding, high performance web servers, high performance computing (HPC), scientific modeling, dedicated gaming servers and ad server engines, machine learning inference and other compute intensive applications.

  • C5n instances are ideal for high compute applications (including High Performance Computing (HPC) workloads, data lakes, and network appliances such as firewalls and routers) that can take advantage of improved network throughput and packet rate performance. C5n instances offer up to 100 Gbps network bandwidth and increased memory over comparable C5 instances.
  • C5 instances are optimized for compute-intensive workloads and deliver cost-effective high performance at a low price per compute ratio. C5 instances offer a choice of processors based on the size of the instance.
    • C5 instances are ideal for applications where you prioritize raw compute power, such as gaming servers, scientific modeling, high-performance web servers, and media transcoding. 
  • C4 instances are a previous generation of Compute Optimized instances. When they were introduced, they featured the highest performing processors and the lowest price/compute performance in EC2.

Amazon Elastic Container Registry (ECR)

Amazon Elastic Container Registry (ECR) is a fully-managed Docker container registry that helps developers to store, manage, and deploy Docker container images, and it is secure, scalable, and reliable. Amazon ECR is integrated with Amazon ECS, which allows AWS customers to store, run, and manage container images for applications running on Amazon ECS.

  • Amazon ECR enables private Docker repositories with resource-based permissions using AWS IAM so that specific users or Amazon EC2 instances can access repositories and images.
  • Amazon ECR hosts customers’ images in a highly available and scalable architecture, allowing them to deploy containers for their applications.
  • Amazon ECR transfers container images over HTTPS and automatically encrypts those images at rest.

ECR Features

Amazon ECR supports Docker Registry HTTP API V2, which allows clients to use Docker CLI commands or any preferred Docker tools to interact with Amazon ECR.

  • Docker is a software platform that allows customers to build, test, and deploy applications quickly. 
  • Docker packages software into standardized units called containers that have everything the software needs to run including libraries, system tools, code, and runtime. 
  • Using Docker, customers can quickly deploy and scale applications into any environment and their code will run smoothly.

AWS Marketplace for Containers enables customers to find container products in AWS Marketplace and the Amazon Elastic Container Service (Amazon ECS) console. They can deploy container products from AWS Marketplace on Amazon container services such as Amazon ECS, Amazon Elastic Kubernetes Service (Amazon EKS), and AWS Fargate.

  • Customers can find software-as-a-service (SaaS) products that help manage, monitor, and protect their container applications.
  • With the new software delivery option in AWS Marketplace, customers can find free, bring-your-own-license (BYOL), and paid container products with both fixed monthly and usage-based pricing.

Amazon ECR automatically encrypts images at rest using Amazon S3 server-side encryption and transfers customers’ container images over HTTPS. Customers can configure policies to manage permissions and control access to their images using AWS Identity and Access Management (IAM) users and roles.

  • Amazon ECR stores customers’ container images in Amazon S3, where they are redundantly stored across multiple facilities and multiple devices in each facility.

Amazon ECR supports the ability to define and organize repositories in a customer’s registry using namespaces, which allows customers to organize repositories based on their team’s existing workflows.

  • Through resource-level policies, customers can set which API actions (including create, list, describe, delete, and get) another user may perform on their repository.
  • Through IAM, customers can define policies to allow users within the same AWS account or other accounts to access their container images.

AWS Container Competency Partners have a technology product or solution on AWS that offers support to run workloads on containers. The product or solution integrates with AWS services in a way that improves the AWS customer’s ability to run workloads using containers on AWS.

  • Customers can integrate Amazon ECR into their continuous integration and delivery process allowing them to maintain the existing development workflow.

Amazon ECR is integrated with third-party developer tools, so customers can add it to their continuous integration and delivery process without changing their existing development workflow. These third-party tools include:

  • Docker Enterprise: in collaboration with AWS, delivers a highly reliable and cost-efficient way to quickly deploy, scale, and manage business-critical applications with containerization and cloud.
  • HashiCorp: cloud infrastructure automation with consistent workflows to provision, secure, connect, and run any infrastructure for any application.
  • Others include D2iQ (Mesosphere), Pivotal Cloud Foundry, Red Hat OpenShift, and Spotinst Elastigroup.

ECR Components


A customer’s Docker client must authenticate to Amazon ECR registries as an AWS user in order to push and pull images.

  • An authorization token represents the customer’s IAM authentication credentials and can be used to access any Amazon ECR registry that the IAM principal has access to.
  • An authorization token’s permission scope matches that of the IAM principal used to retrieve the token.
  • An authorization token is valid for 12 hours.
  • The authorizationToken returned is a base64-encoded string that can be decoded and used in a docker login command to authenticate to a registry. The AWS CLI offers a get-login-password command that simplifies the login process.
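
The decoding step can be sketched in a few lines of Python. This is a minimal illustration, assuming the decoded token has the form AWS:password (a user name of AWS followed by the password to pass to docker login). The function name and the sample token are hypothetical; a real token is returned by the GetAuthorizationToken API.

```python
import base64


def decode_authorization_token(authorization_token):
    """Split a base64-encoded ECR authorization token into user and password."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    # The decoded value has the form "user:password"; split on the first colon
    # only, since the password part may itself contain colons.
    username, password = decoded.split(":", 1)
    return username, password


# Placeholder token built locally for illustration only.
sample_token = base64.b64encode(b"AWS:example-password").decode("ascii")
username, password = decode_authorization_token(sample_token)
print(username)
```

The resulting user name and password are what a docker login command against the registry would use. The get-login-password command of the AWS CLI performs the equivalent decoding for the customer.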

Repository policy

Amazon ECR uses resource-based permissions to control access to repositories. Resource-based permissions let customers specify which IAM users or roles have access to a repository and what actions they can perform on it. Customers control access to their repositories, and to the images within them, through repository policies.

  • Amazon ECR repository policies are a subset of IAM policies that are scoped for, and specifically used for, controlling access to individual Amazon ECR repositories. 
  • IAM policies are generally used to apply permissions for the entire Amazon ECR service but can also be used to control access to specific resources.
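
As an illustration, a repository policy that grants pull-only access could look like the sketch below, written as a Python dictionary and serialized to JSON. The account ID and role name are hypothetical placeholders; the three actions are the ECR actions needed to pull an image. A repository policy has no Resource element because it is attached to the repository itself.

```python
import json

# Hypothetical account ID and role name, used only for illustration.
PULL_ONLY_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPull",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111122223333:role/example-pull-role"
            },
            "Action": [
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "ecr:BatchCheckLayerAvailability",
            ],
        }
    ],
}

# Serialize to the JSON text that would be attached to the repository.
policy_text = json.dumps(PULL_ONLY_POLICY, indent=2)
print(policy_text)
```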


Amazon ECR registries host customers’ container images in a highly available and scalable architecture, allowing them to deploy containers for their applications. By default, an Amazon ECR registry is provided to each AWS account, so customers can create image repositories in the registry and store images in them.

  • It can be used as a registry to manage image repositories consisting of Docker and Open Container Initiative (OCI) images. 
  • Using the AWS Management Console, the AWS CLI, or the AWS SDKs, customers can create and manage repositories. They can also use those methods to perform some actions on images, such as listing or deleting them.
  • Amazon ECR provides a Docker credential helper, which allows customers to store and use Docker credentials when pushing and pulling images to and from Amazon ECR.


An Amazon ECR image repository contains customers’ Docker or Open Container Initiative (OCI) images. ECR provides API operations to create, monitor, and delete image repositories and set permissions that control who can access them. Amazon ECR also integrates with the Docker CLI, allowing customers to push and pull images between their development environments and their repositories.

  • By default, only the repository owner has access to a repository. Resource-based permissions let customers grant other IAM users or roles access and define what actions they can perform on it.
  • Repositories can be controlled with both IAM user access policies and repository policies.
  • Repository names can support namespaces, which customers can use to group similar repositories.
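
The sketch below shows how a namespaced repository name fits into a full image reference. It assumes the standard registry host form account-id.dkr.ecr.region.amazonaws.com. The account ID, region, and repository names are hypothetical, as are the helper function names.

```python
def image_uri(account_id, region, repository, tag="latest"):
    """Build the full image reference for a repository in a private registry."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repository}:{tag}"


def namespace_of(repository):
    """Return the namespace part of a repository name, or "" if there is none."""
    namespace, _, _ = repository.rpartition("/")
    return namespace


# "team-a" acts as a namespace that groups this team's repositories.
uri = image_uri("111122223333", "us-east-1", "team-a/web-app", "1.0")
print(uri)
print(namespace_of("team-a/web-app"))
```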

Amazon Elastic Kubernetes Service (EKS)

Amazon Elastic Kubernetes Service (Amazon EKS) is a fully managed Kubernetes service. Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications. Amazon EKS runs Kubernetes control plane instances across multiple Availability Zones to ensure high availability. Amazon EKS automatically detects and replaces unhealthy control plane instances, and it provides automated version upgrades and patching for them.

  • Customers can run their EKS clusters using AWS Fargate, which is serverless compute for containers. Fargate removes the need to provision and manage servers and lets customers specify the resources each application needs.
  • EKS is integrated with services including Amazon CloudWatch, Auto Scaling Groups, AWS Identity and Access Management (IAM), and Amazon Virtual Private Cloud (VPC). These integrations provide a seamless experience to monitor, scale, and load-balance applications.
  • EKS integrates with AWS App Mesh and provides a Kubernetes native experience to consume service mesh features and bring rich observability, traffic controls and security features to applications.

EKS Features

Amazon EKS automatically manages the availability and scalability of the Kubernetes control plane nodes that are responsible for starting and stopping containers, scheduling containers on virtual machines, storing cluster data, and other tasks. Amazon EKS automatically detects and replaces unhealthy control plane nodes for each cluster.

  • Amazon EKS provides a scalable and highly available control plane that runs across multiple AWS Availability Zones.
  • Amazon EKS lets customers create, update, or terminate worker nodes for their cluster with a single command.

Amazon EKS clusters are integrated with AWS services and technology partner solutions. For example, IAM provides fine-grained access control, and Amazon VPC isolates Kubernetes clusters from other customers.

  • Using AWS Cloud Map, a cloud resource discovery service, customers can define custom names for their application resources. Cloud Map then maintains the updated location of these dynamically changing resources.
  • Customers can also use a service mesh to build and run complex microservices applications by standardizing how every microservice in the application communicates. AWS App Mesh is an AWS service that makes it easy to configure each part of the application for end-to-end visibility and high availability.
  • Since EKS clusters run in an Amazon VPC, customers can use their own VPC security groups and network ACLs. This gives them a high level of isolation and helps them use Amazon EKS to build highly secure and reliable applications.
  • Customers can assign RBAC roles directly to each IAM entity, which gives them the ability to granularly control access permissions to Kubernetes masters.
  • Amazon EKS enables customers to assign IAM permissions to their Kubernetes service accounts, which helps them control access to other containerized services, AWS resources external to the cluster such as databases and secrets, or third-party services and applications running outside of AWS.
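
Assigning IAM permissions to a service account is typically expressed as an annotation on the Kubernetes ServiceAccount object. The sketch below builds such a manifest as a Python dictionary. It assumes the eks.amazonaws.com/role-arn annotation key; the service account name, namespace, role ARN, and helper function name are hypothetical.

```python
import json


def service_account_manifest(name, namespace, role_arn):
    """Build a ServiceAccount manifest annotated with an IAM role ARN."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Pods using this service account assume the referenced IAM role.
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


manifest = service_account_manifest(
    "example-reader",
    "default",
    "arn:aws:iam::111122223333:role/example-reader-role",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON is also valid YAML, so under these assumptions it could be applied to a cluster with the usual Kubernetes tooling.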

Customers can use EKS on AWS Outposts to run containerized applications that require particularly low latencies to on-premises systems. AWS Outposts is a fully managed service that extends AWS infrastructure, AWS services, APIs, and tools to virtually any connected site. 

  • With EKS on Outposts, customers are able to manage containers on-premises.


Amazon EKS is integrated with AWS CloudTrail to provide visibility into, and an audit history of, customers’ cluster and user activity.

  • AWS CloudTrail can be used to view API calls to the Amazon EKS API. Amazon EKS also delivers Kubernetes control plane logs to Amazon CloudWatch.
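
As a small illustration of reviewing that audit history, the sketch below picks the Amazon EKS API calls out of a list of CloudTrail records. It assumes each record is a dictionary with the eventSource and eventName fields and that EKS calls carry the event source eks.amazonaws.com. The sample records and the function name are made up for the example.

```python
def eks_api_calls(records):
    """Return the names of the API calls that were made to Amazon EKS."""
    return [
        record["eventName"]
        for record in records
        if record.get("eventSource") == "eks.amazonaws.com"
    ]


# Hypothetical, heavily trimmed CloudTrail records.
sample_records = [
    {"eventSource": "eks.amazonaws.com", "eventName": "CreateCluster"},
    {"eventSource": "s3.amazonaws.com", "eventName": "PutObject"},
    {"eventSource": "eks.amazonaws.com", "eventName": "DescribeCluster"},
]
calls = eks_api_calls(sample_records)
print(calls)
```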

An Amazon EKS cluster consists of two primary components: the Amazon EKS control plane and the Amazon EKS worker nodes that are registered with the control plane.

  • The Amazon EKS control plane consists of control plane nodes that run the Kubernetes software, such as etcd and the Kubernetes API server. The control plane runs in an account managed by AWS, and the Kubernetes API is exposed via the Amazon EKS endpoint associated with the customer’s cluster.
    • Each Amazon EKS cluster control plane is single-tenant and unique, and runs on its own set of Amazon EC2 instances.
  • Amazon EKS worker nodes run in the customer’s AWS account and connect to the cluster’s control plane via the API server endpoint and a certificate file that is created for the cluster.

Worker Nodes

Amazon EKS integrates Kubernetes with AWS Fargate by using controllers that are built by AWS using the upstream, extensible model provided by Kubernetes. These controllers run as part of the Amazon EKS managed Kubernetes control plane and are responsible for scheduling native Kubernetes pods onto Fargate. The Fargate controllers include a new scheduler that runs alongside the default Kubernetes scheduler in addition to several mutating and validating admission controllers. 

Amazon EKS supports several types of Kubernetes autoscaling:

  • Cluster Autoscaler: The Kubernetes Cluster Autoscaler automatically adjusts the number of nodes in a cluster when pods fail to launch due to lack of resources or when nodes in the cluster are underutilized and their pods can be rescheduled onto other nodes in the cluster.
  • Horizontal Pod Autoscaler: The Kubernetes Horizontal Pod Autoscaler automatically scales the number of pods in a deployment, replication controller, or replica set based on that resource’s CPU utilization.
  • Vertical Pod Autoscaler: The Kubernetes Vertical Pod Autoscaler automatically adjusts the CPU and memory reservations for pods to help “right size” applications.
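
The scaling rule behind the Horizontal Pod Autoscaler can be sketched as follows. The Kubernetes documentation describes the core calculation as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The function below is a simplified model of that rule only: it ignores minimum and maximum replica bounds, stabilization windows, and pods that are not ready.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Compute the replica count suggested by the basic HPA formula."""
    ratio = current_value / target_value
    # No scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 3 pods averaging 200m CPU against a 100m target.
print(desired_replicas(3, 200, 100))
# 5 pods averaging 50% CPU against a 100% target.
print(desired_replicas(5, 50, 100))
```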

Customers can deploy one or more worker nodes into a node group. Nodes are the worker machines in Kubernetes. Amazon EKS worker nodes run in customers’ AWS accounts and connect to their cluster’s control plane via the cluster API server endpoint. A node group is one or more Amazon EC2 instances that are deployed in an Amazon EC2 Auto Scaling group.

A cluster can contain several node groups, and each node group can contain several worker nodes. Managed node groups have a limit on the number of nodes they can contain. All instances in a node group must:

  • Be the same instance type
  • Be running the same Amazon Machine Image (AMI)
  • Use the same Amazon EKS Worker Node IAM Role.
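
The three requirements above amount to a simple consistency check, sketched below. The record layout (instance_type, ami_id, and iam_role keys), the sample values, and the function name are hypothetical and used only to illustrate the rule.

```python
def node_group_is_consistent(instances):
    """Check that all instances share instance type, AMI, and node IAM role."""
    required_keys = ("instance_type", "ami_id", "iam_role")
    # Each attribute may take at most one distinct value across the group.
    return all(
        len({instance[key] for instance in instances}) <= 1
        for key in required_keys
    )


# Hypothetical instance records.
group = [
    {"instance_type": "c5.large", "ami_id": "ami-example", "iam_role": "node-role"},
    {"instance_type": "c5.large", "ami_id": "ami-example", "iam_role": "node-role"},
]
mixed = group + [
    {"instance_type": "c5.xlarge", "ami_id": "ami-example", "iam_role": "node-role"}
]
print(node_group_is_consistent(group))
print(node_group_is_consistent(mixed))
```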

Amazon EKS provides a specialized Amazon Machine Image (AMI) called the Amazon EKS-optimized AMI. This AMI is built on top of Amazon Linux 2, and is configured to serve as the base image for Amazon EKS worker nodes.

  • The AMI is configured to work with Amazon EKS out of the box, and it includes Docker, kubelet, and the AWS IAM Authenticator. The AMI also contains a specialized bootstrap script that allows it to discover and connect to the control plane of the customer’s cluster automatically.