AWS Batch


Batch computing is used by developers, scientists, and engineers to access large amounts of compute resources. AWS Batch efficiently provisions resources in response to submitted jobs in order to eliminate capacity constraints, reduce compute costs, and deliver results quickly. It automatically provisions compute resources and optimizes workload distribution based on the quantity and scale of the workloads.

  • AWS Batch dynamically provisions the optimal quantity and type of compute resources (such as CPU- or memory-optimized instances) based on the volume and specific resource requirements of the batch jobs submitted.
  • AWS Batch plans, schedules, and executes customers' batch computing workloads across the full range of AWS compute services, such as Amazon EC2 On-Demand and Spot Instances.

Batch features

AWS Batch efficiently and dynamically provisions and scales Amazon EC2 and Spot Instances based on the requirements of the jobs. Customers can configure their AWS Batch Managed Compute Environments with requirements such as the following (see the sketch after this list):

  • Type of EC2 instances
  • VPC subnet configurations
  • The min/max/desired vCPUs across all instances, and
  • The amount they are willing to pay for Spot Instances as a % of the On-Demand Instance price.
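
As an illustration of those settings, here is a minimal boto3 sketch of a managed compute environment using Spot capacity. The environment name, subnet and security group IDs, and role ARNs are placeholders and would come from the customer's own account.

```python
import boto3

batch = boto3.client("batch")

# Hypothetical names, subnets, and roles -- replace with values from your account.
response = batch.create_compute_environment(
    computeEnvironmentName="example-managed-ce",
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "SPOT",                           # use EC2 Spot capacity
        "minvCpus": 0,
        "maxvCpus": 256,
        "desiredvCpus": 0,
        "instanceTypes": ["c5", "m5"],            # type of EC2 instances
        "subnets": ["subnet-0123456789abcdef0"],  # VPC subnet configuration
        "securityGroupIds": ["sg-0123456789abcdef0"],
        "bidPercentage": 60,                      # pay at most 60% of the On-Demand price
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
        "spotIamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-tagging-role",
    },
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
)
print(response["computeEnvironmentArn"])
```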

Customers who elect to provision and manage their own compute resources use AWS Batch Unmanaged Compute Environments. This is the right choice when they need configurations that managed environments do not provide, such as larger EBS volumes or a different operating system for their EC2 instances.

  • Once they select their compute environment, they need to provision EC2 instances that include the Amazon ECS agent and run supported versions of Linux and Docker.

AWS Batch supports multi-node parallel jobs, which enables users to run single jobs that span multiple EC2 instances. This feature allows AWS customers to use AWS Batch to easily and efficiently run workloads such as large-scale, tightly-coupled High Performance Computing (HPC) applications or distributed GPU model training.

  • AWS Batch also supports Elastic Fabric Adapter, a network interface that enables users to run applications that require high levels of inter-node communication at scale on AWS.

AWS Batch supports EC2 Launch Templates, which let users build customized templates for their compute resources and allow Batch to scale instances with those requirements (see the sketch after the list below).

  • Users can specify their EC2 Launch Template to add storage volumes, specify network interfaces, or configure permissions, among other capabilities.
  • EC2 Launch Templates reduce the number of steps required to configure Batch environments by capturing launch parameters within one resource.
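
As a rough sketch, a launch template is referenced inside the computeResources block of a managed compute environment, so the capacity Batch scales out inherits the template's storage, networking, and other launch parameters. The template name here is hypothetical and assumed to exist already in the account.

```python
# Fragment of the computeResources argument to create_compute_environment.
# "my-batch-launch-template" is a hypothetical, pre-existing EC2 Launch Template.
compute_resources = {
    "type": "EC2",
    "minvCpus": 0,
    "maxvCpus": 64,
    "instanceTypes": ["optimal"],
    "subnets": ["subnet-0123456789abcdef0"],
    "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
    # Instances that Batch launches for this environment use the template's
    # storage volumes, network interfaces, and permissions.
    "launchTemplate": {
        "launchTemplateName": "my-batch-launch-template",
        "version": "$Latest",
    },
}
```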

AWS Batch enables customers to specify resource requirements, such as vCPU and memory, AWS Identity and Access Management (IAM) roles, volume mount points, container properties, and environment variables, to define how jobs are to be run.

  • AWS Batch executes jobs as containerized applications running on Amazon ECS.
  • Batch also enables customers to define dependencies between different jobs.
    • With dependencies, users can create three jobs with different resource requirements where each successive job depends on the previous job.
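
A minimal boto3 sketch of that three-job chain, assuming a job queue and job definitions with the placeholder names shown already exist; each submit_job call passes the previous job's ID in dependsOn.

```python
import boto3

batch = boto3.client("batch")

# Hypothetical queue and job definition names.
queue = "example-queue"

prepare = batch.submit_job(
    jobName="prepare-data",
    jobQueue=queue,
    jobDefinition="prepare-def",              # small resource footprint
)
process = batch.submit_job(
    jobName="process-data",
    jobQueue=queue,
    jobDefinition="process-def",              # larger resource requirements
    dependsOn=[{"jobId": prepare["jobId"]}],  # waits for prepare-data to succeed
)
publish = batch.submit_job(
    jobName="publish-results",
    jobQueue=queue,
    jobDefinition="publish-def",
    dependsOn=[{"jobId": process["jobId"]}],  # waits for process-data to succeed
)
```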

AWS Batch allows customers to set up multiple queues with different priority levels. Batch jobs are stored in the queues until compute resources are available to execute the job. The AWS Batch scheduler evaluates when, where, and how to run jobs that have been submitted to a queue based on the resource requirements of each job.

  • The scheduler evaluates the priority of each queue and runs jobs in priority order on optimal compute resources (such as memory-optimized vs. CPU-optimized instances), as long as those jobs have no outstanding dependencies.

AWS Batch supports GPU scheduling, which allows customers to specify the number and type of accelerators their jobs require as job definition input variables (see the sketch after this list).

  • A Graphics Processing Unit (GPU) is a processor designed to handle graphics operations, including both 2D and 3D calculations, though GPUs primarily excel at rendering 3D graphics.
  • AWS Batch will scale up instances appropriate for the customers' jobs based on the required number of GPUs and isolate the accelerators according to each job's needs, so only the appropriate containers can access them.
  • All instance types in a compute environment that will run GPU jobs should be from the p2, p3, g3, g3s, or g4 instance families. Otherwise, a GPU job could get stuck in the RUNNABLE status.
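
A hedged sketch of the container properties for a GPU job definition; the image and counts are placeholders, and the accelerator requirement goes into resourceRequirements.

```python
# Fragment of containerProperties for register_job_definition.
# The image is hypothetical; the compute environment must contain GPU instance
# families (p2, p3, g3, g3s, g4) or the job may sit in RUNNABLE indefinitely.
container_properties = {
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/training:latest",
    "command": ["python", "train.py"],
    "resourceRequirements": [
        {"type": "GPU", "value": "1"},         # number of accelerators per job
        {"type": "VCPU", "value": "4"},
        {"type": "MEMORY", "value": "16384"},  # MiB
    ],
}
```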

Compute resources

There are three strategies for allocating compute resources. These strategies factor in throughput as well as price when deciding how AWS Batch should scale instances on customers' behalf (a sketch of the API values follows the list).

  • Best Fit: In the case of Best Fit, AWS Batch selects an instance type that best fits the needs of the jobs with a preference for the lowest-cost instance type. If additional instances of the selected instance type are not available, AWS Batch will wait for the additional instances to be available.
  • Best Fit Progressive: In this case, AWS Batch will select additional instance types that are large enough to meet the requirements of the jobs in the queue, with a preference for instance types with a lower cost per unit vCPU.
    • If additional instances of the previously selected instance types are not available, AWS Batch will select new instance types.
  • Spot Capacity Optimized: In this method AWS Batch will select one or more instance types that are large enough to meet the requirements of the jobs in the queue, with a preference for instance types that are less likely to be interrupted.
    • This allocation strategy is only available for Spot Instance compute resources.
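
In the create_compute_environment API these options correspond to the allocationStrategy field of computeResources. A small sketch summarizing the three values, plus a fragment showing where the field goes (the remaining fields would match the earlier compute environment sketch):

```python
# allocationStrategy values accepted in computeResources (AWS Batch API naming):
ALLOCATION_STRATEGIES = {
    "BEST_FIT": "pick the lowest-cost instance type that fits; wait if it is unavailable",
    "BEST_FIT_PROGRESSIVE": "prefer lowest cost per vCPU, fall back to additional types",
    "SPOT_CAPACITY_OPTIMIZED": "Spot only; prefer pools least likely to be interrupted",
}

# Example fragment: a Spot compute environment that optimizes for capacity.
compute_resources = {
    "type": "SPOT",
    "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
    # ... remaining computeResources fields as in the earlier sketch ...
}
```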

Integration

AWS Batch can be integrated with commercial and open-source workflow engines and languages such as:

LUIGI

Luigi is a workflow management system to efficiently launch a group of tasks with defined dependencies between them.

  • Luigi is not only a Python-based API that builds and executes pipelines of Hadoop jobs; it can also be used to create workflows with external jobs written in R, Scala, or Spark.

METAFLOW

Metaflow is a human-friendly Python library that helps scientists and engineers build and manage real-life data science projects.

  • Metaflow was originally developed at Netflix to boost productivity of data scientists who work on a wide variety of projects from classical statistics to state-of-the-art deep learning.

APACHE AIRFLOW

  • Apache Airflow, also known simply as Airflow, is a platform to programmatically author, schedule, and monitor workflows. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
    • Customers can use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks.
    • The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.
    • Rich command line utilities make performing complex surgeries on DAGs a snap.
    • It has a rich user interface that makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
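
As a rough illustration, a DAG task can submit an AWS Batch job through the Amazon provider package's BatchOperator. This is a sketch only: the queue and job definition names are hypothetical, and the operator's keyword arguments vary somewhat between provider versions (older releases name the class AwsBatchOperator).

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.batch import BatchOperator

# Hypothetical queue and job definition names; the operator submits the Batch
# job and waits for it to finish before downstream tasks run.
with DAG(
    dag_id="batch_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    run_batch_job = BatchOperator(
        task_id="run_batch_job",
        job_name="nightly-report",
        job_queue="example-queue",
        job_definition="report-def",
    )
```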

PEGASUS WMS

Pegasus WMS is a scientific workflow management system that can manage the execution of complex workflows on distributed resources. It is funded by the National Science Foundation.

  • Pegasus WMS has been used in a number of scientific domains including astronomy, bioinformatics, earthquake science, gravitational wave physics, ocean science, limnology, and others.

NEXTFLOW

Nextflow is a relatively lightweight Java application that a single user can easily manage. A Nextflow workflow makes it easy to run any analysis while transparently managing the issues that tend to crop up when running a shell script: missing dependencies, insufficient resources, difficulty telling where failures come from, and results that are not easily published or transferred to collaborators.

  • Nextflow has robust reporting features, real-time status updates, and handling of workflow failure or completion, with notifications via a user-definable on-completion process.

AWS STEP FUNCTIONS

AWS Step Functions enables customers to coordinate multiple AWS services into serverless workflows, so that they can build and update apps quickly.

  • Using Step Functions, customers can design and run workflows that stitch together services, such as AWS Lambda, AWS Fargate, and Amazon SageMaker, into feature-rich applications.
  • Workflows are made up of a series of steps, with the output of one step acting as input into the next.
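
A minimal sketch of a Step Functions state machine that runs an AWS Batch job and waits for it to complete, using the documented batch:submitJob.sync service integration. The state machine name, queue, job definition, and role ARN are placeholders.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Single-state workflow: submit a Batch job and wait for it to succeed or fail.
definition = {
    "Comment": "Run one AWS Batch job as a workflow step",
    "StartAt": "RunBatchJob",
    "States": {
        "RunBatchJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "step-functions-job",
                "JobQueue": "example-queue",     # hypothetical queue
                "JobDefinition": "example-def",  # hypothetical job definition
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="batch-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsBatchRole",  # placeholder
)
```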

Job States

Jobs are the unit of work executed by AWS Batch, and they are executed as containerized applications running on Amazon ECS container instances in an ECS cluster. Containerized jobs can reference a container image, command, and parameters.

When customers submit a job to an AWS Batch job queue, the job enters the SUBMITTED state. It then passes through the following states until it succeeds (exits with code 0) or fails (exits with a non-zero code). AWS Batch jobs can have the following states (a polling sketch follows the list):

  • SUBMITTED A job that has been submitted to the queue and has not yet been evaluated by the scheduler. The scheduler evaluates the job to determine if it has any outstanding dependencies on the successful completion of any other jobs. If there are dependencies, the job is moved to PENDING. If there are no dependencies, the job is moved to RUNNABLE.
  • PENDING A job that resides in the queue and is not yet able to run due to a dependency on another job or resource. After the dependencies are satisfied, the job is moved to RUNNABLE.
  • RUNNABLE A job that resides in the queue, has no outstanding dependencies, and is therefore ready to be scheduled to a host. Jobs in this state are started as soon as sufficient resources are available in one of the compute environments that are mapped to the job’s queue. However, jobs can remain in this state indefinitely when sufficient resources are unavailable.
  • STARTING The job has been scheduled to a host and the relevant container initiation operations are underway. After the container image is pulled and the container is up and running, the job transitions to RUNNING.
  • RUNNING The job is running as a container job on an Amazon ECS container instance within a compute environment. When the job’s container exits, the process exit code determines whether the job succeeded or failed. An exit code of 0 indicates success, and any non-zero exit code indicates failure. If the job associated with a failed attempt has any remaining attempts left in its optional retry strategy configuration, the job is moved to RUNNABLE again.
  • SUCCEEDED The job has successfully completed with an exit code of 0. The job state for SUCCEEDED jobs is persisted in AWS Batch for 24 hours.
  • FAILED The job has failed all available attempts. The job state for FAILED jobs is persisted in AWS Batch for 24 hours.
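
To observe these transitions, a small polling sketch with boto3; the job ID would come from a previous submit_job call.

```python
import time

import boto3

batch = boto3.client("batch")


def wait_for_job(job_id, poll_seconds=30):
    """Poll an AWS Batch job until it reaches SUCCEEDED or FAILED."""
    while True:
        job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
        status = job["status"]  # SUBMITTED, PENDING, RUNNABLE, STARTING, RUNNING, ...
        print(f"{job['jobName']}: {status}")
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(poll_seconds)
```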

Batch Components

NODE GROUPS

A node group is an identical group of job nodes, where all nodes share the same container properties. AWS Batch lets customers specify up to five distinct node groups for each job.

  • Each group can have its own container images, commands, environment variables, and so on.
  • In addition, they can use all of the nodes in their job as a single node group, and the application code can differentiate node roles (main node versus child nodes).

MULTI-NODE PARALLEL JOBS

Multi-node parallel jobs are used to run single jobs that span multiple Amazon EC2 instances. Batch multi-node parallel jobs can run large-scale, tightly coupled, high performance computing applications and distributed GPU model training without the need to launch, configure, and manage Amazon EC2 resources directly. An AWS Batch multi-node parallel job is compatible with any framework that supports IP-based, internode communication, such as Apache MXNet, TensorFlow, Caffe2, or Message Passing Interface (MPI).

  • Multi-node parallel jobs are submitted as a single job; the job definition (or node overrides supplied at submission) specifies the number of nodes to create for the job and which node groups to create (see the sketch after this list).
  • Each multi-node parallel job contains a main node, which needs to be launched first. Once the main node is up and running, the child nodes will be launched and started.
    • If the main node exits, the job is considered finished, and the child nodes are stopped. For more information, see Node Groups.
  • Multi-node parallel job nodes are single-tenant, meaning that only a single job container is run on each Amazon EC2 instance.
  • The main node is a single subtask that AWS Batch monitors to determine the outcome of the submitted multi-node job.
    • The main node is launched first and it moves to the STARTING status.
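
A rough sketch of a four-node, two-node-group, multi-node parallel job definition; the name and image are placeholders, and node 0 is the main node that AWS Batch monitors.

```python
import boto3

batch = boto3.client("batch")

# Hypothetical definition name and image; one node group for the main node,
# another for the child (worker) nodes.
batch.register_job_definition(
    jobDefinitionName="mnp-training-def",
    type="multinode",
    nodeProperties={
        "numNodes": 4,
        "mainNode": 0,                 # subtask Batch monitors for the job outcome
        "nodeRangeProperties": [
            {
                "targetNodes": "0:0",  # node group for the main node
                "container": {
                    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/trainer:latest",
                    "command": ["python", "train.py", "--role", "main"],
                    "vcpus": 8,
                    "memory": 32768,
                },
            },
            {
                "targetNodes": "1:3",  # node group for the child nodes
                "container": {
                    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/trainer:latest",
                    "command": ["python", "train.py", "--role", "worker"],
                    "vcpus": 8,
                    "memory": 32768,
                },
            },
        ],
    },
)
```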

ARRAY JOB

An array job is a job that shares common parameters, such as the job definition, vCPUs, and memory. It runs as a collection of related, yet separate, basic jobs that may be distributed across multiple hosts and may run concurrently. Array jobs are the most efficient way to execute embarrassingly parallel jobs such as Monte Carlo simulations, parametric sweeps, or large rendering jobs.

  • AWS Batch array jobs are submitted just like regular jobs. However, the array size needs to be between 2 and 10,000 to define how many child jobs should run in the array.
  • If the submitted job has an array size of 1000, a single job runs and spawns 1000 child jobs. The array job is a reference or pointer to manage all the child jobs. This allows customers to submit large workloads with a single query.
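
Submitting an array job is a single API call with arrayProperties; a minimal sketch assuming the queue and job definition names shown already exist.

```python
import boto3

batch = boto3.client("batch")

# One request spawns 1000 child jobs; each child can read its index from the
# AWS_BATCH_JOB_ARRAY_INDEX environment variable.
parent = batch.submit_job(
    jobName="monte-carlo-sweep",
    jobQueue="example-queue",        # hypothetical queue
    jobDefinition="simulation-def",  # hypothetical job definition
    arrayProperties={"size": 1000},
)
print(parent["jobId"])  # parent job ID; child job IDs are <parent-id>:<index>
```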

JOB DEFINITIONS

AWS Batch job definitions specify how jobs are to be run. While each job must reference a job definition, many of the parameters that are specified in the job definition can be overridden at runtime. The following are some of those attributes (a sketch follows the list):

  • Which Docker image to use with the container in your job
  • How many vCPUs and how much memory to use with the container
  • The command the container should run when it is started
  • What (if any) environment variables should be passed to the container when it starts
  • Any data volumes that should be used with the container
  • What (if any) IAM role your job should use for AWS permissions
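
A hedged sketch of a job definition covering the attributes above; every name, ARN, image, and path is a placeholder.

```python
import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="report-def",  # hypothetical name
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/report:latest",
        "vcpus": 2,
        "memory": 4096,              # MiB
        "command": ["python", "report.py"],
        "environment": [{"name": "LOG_LEVEL", "value": "INFO"}],
        "jobRoleArn": "arn:aws:iam::123456789012:role/BatchJobRole",
        # A host volume mounted into the container at /scratch.
        "volumes": [{"name": "scratch", "host": {"sourcePath": "/tmp/scratch"}}],
        "mountPoints": [{"sourceVolume": "scratch", "containerPath": "/scratch"}],
    },
    retryStrategy={"attempts": 2},   # optional retry configuration
)
```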

JOB QUEUES

Jobs are submitted to a job queue, where they reside until they are able to be scheduled to run in a compute environment. Any AWS account can have multiple job queues.

  • Customers can create a queue that uses Amazon EC2 On-Demand instances for high priority jobs and another queue that uses Amazon EC2 Spot Instances for low-priority jobs.
  • Job queues have a priority that is used by the scheduler to determine which jobs in which queue should be evaluated for execution first.
  • The AWS Batch scheduler evaluates when, where, and how to run jobs that have been submitted to a job queue. Jobs run in approximately the order in which they are submitted as long as all dependencies on other jobs have been met.
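
A minimal sketch of the two-queue setup described above, assuming an On-Demand and a Spot compute environment already exist under the placeholder names shown.

```python
import boto3

batch = boto3.client("batch")

# Hypothetical compute environment names created beforehand.
batch.create_job_queue(
    jobQueueName="high-priority",
    state="ENABLED",
    priority=100,  # higher values are evaluated before lower ones
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "ondemand-ce"},
    ],
)

batch.create_job_queue(
    jobQueueName="low-priority",
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "spot-ce"},
    ],
)
```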

Compute Environments

Job queues are generally mapped to one or more compute environments. The compute environments contain the Amazon ECS container instances that are used to run containerized batch jobs. Within a job queue, the associated compute environments each have an order that is used by the scheduler to determine where to place jobs that are ready to be executed.

  • If the first compute environment has free resources, then the job is scheduled to a container instance within that compute environment.
  • If the compute environment is unable to provide a suitable compute resource, the scheduler attempts to run the job on the next compute environment.

UNMANAGED COMPUTE ENVIRONMENTS

In an unmanaged compute environment, customers are responsible for provisioning and managing their own compute resources.

  • Customers need to make sure that the AMI in use for their compute resources meets the Amazon ECS container instance AMI specification.
  • Once the unmanaged compute environment is created, customers can use the DescribeComputeEnvironments API operation to view the compute environment details.
    • Find the Amazon ECS cluster that is associated with the environment and then manually launch your container instances into that Amazon ECS cluster.
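
A small sketch of that lookup with boto3; the environment name is a placeholder, and the returned ecsClusterArn identifies the ECS cluster to launch container instances into.

```python
import boto3

batch = boto3.client("batch")

# Hypothetical unmanaged compute environment name.
response = batch.describe_compute_environments(
    computeEnvironments=["example-unmanaged-ce"],
)
environment = response["computeEnvironments"][0]

# The associated ECS cluster; container instances must be launched into it manually.
print(environment["ecsClusterArn"])
```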

MANAGED COMPUTE ENVIRONMENTS

Managed compute environments allow customers to describe their business requirements. In a managed compute environment, AWS Batch manages the capacity and instance types of the compute resources within the environment, based on the compute resource specification that they define when they create the compute environment.

  • AWS customers have two choices for Amazon EC2 capacity: On-Demand Instances or Spot Instances.
  • Managed compute environments launch Amazon ECS container instances into the VPC and subnets that customers specify when they create the compute environment.