Amazon.com Inc.

08/28/2024 | Press release | Distributed by Public on 08/28/2024 12:04

Announcing AWS Parallel Computing Service to run HPC workloads at virtually any scale

Today we are announcing AWS Parallel Computing Service (AWS PCS), a new managed service that helps customers set up and manage high performance computing (HPC) clusters so they seamlessly run their simulations at virtually any scale on AWS. Using the Slurm scheduler, they can work in a familiar HPC environment, accelerating their time to results instead of worrying about infrastructure.

In November 2018, we introduced AWS ParallelCluster, an AWS supported open-source cluster management tool that helps you to deploy and manage HPC clusters in the AWS Cloud. With AWS ParallelCluster, customers can also quickly build and deploy proof of concept and production HPC compute environments. They can use AWS ParallelCluster Command-Line interface, API, Python library, and the user interface installed from open source packages. They are responsible for updates, which can include tearing down and redeploying clusters. Many customers, though, have asked us for a fully managed AWS service to eliminate operational jobs in building and operating HPC environments.

AWS PCS simplifies HPC environments managed by AWS and is accessible through the AWS Management Console, AWS SDK, and AWS Command-Line Interface (AWS CLI). Your system administrators can create managed Slurm clusters that use their compute and storage configurations, identity, and job allocation preferences. AWS PCS uses Slurm, a highly scalable, fault-tolerant job scheduler used across a wide range of HPC customers, for scheduling and orchestrating simulations. End users such as scientists, researchers, and engineers can log in to AWS PCS clusters to run and manage HPC jobs, use interactive software on virtual desktops, and access data. You can bring their workloads to AWS PCS quickly, without significant effort to port code.

You can use fully managed NICE DCV remote desktops for remote visualization, and access job telemetry or application logs to enable specialists to manage your HPC workflows in one place.

AWS PCS is designed for a wide range of traditional and emerging, compute or data-intensive, engineering and scientific workloads across areas such as computational fluid dynamics, weather modeling, finite element analysis, electronic design automation, and reservoir simulations using familiar ways of preparing, executing, and analyzing simulations and computations.

Getting started with AWS Parallel Computing Service
To try out AWS PCS, you can use our tutorial for creating a simple cluster in the AWS documentation. First, you create a virtual private cloud (VPC) with an AWS CloudFormation template and shared storage in Amazon Elastic File System (Amazon EFS) within your account for the AWS Region where you will try AWS PCS. To learn more, visit Create a VPC and Create shared storage in the AWS documentation.

1. Create a cluster
In the AWS PCS console, choose Create cluster, a persistent resource for managing resources and running workloads.

Next, enter your cluster name and choose the controller size of your Slurm scheduler. You can choose Small (up to 32 nodes and 256 jobs), Medium (up to 512 nodes and 8,192 jobs), or Large (up to 2,048 nodes and 16,384 jobs) for the limits of cluster workloads. In the Networking section, choose your created VPC, subnet to launch the cluster, and security group applied to your cluster.

Optionally, you can set the Slurm configuration such as an idle time before compute nodes will scale down, a Prolog and Epilog scripts directory on launched compute nodes, and a resource selection algorithm parameter used by Slurm.

Choose Create cluster. It takes some time for the cluster to be provisioned.

2. Create compute node groups
After creating your cluster, you can create compute node groups, a virtual collection of Amazon Elastic Compute Cloud (Amazon EC2) instances that AWS PCS uses to provide interactive access to a cluster or run jobs in a cluster. When you define a compute node group, you specify common traits such as EC2 instance types, minimum and maximum instance count, target VPC subnets, Amazon Machine Image (AMI), purchase option, and custom launch configuration. Compute node groups require an instance profile to pass an AWS Identity and Access Management (IAM) role to an EC2 instance and an EC2 launch template that AWS PCS uses to configure EC2 instances it launches. To learn more, visit Create a launch template And Create an instance profile in the AWS documentation.

To create a compute node group in the console, go to your cluster and choose the Compute node groups tab and the Create compute node group button.

You can create two compute node groups: a login node group to be accessed by end users and a job node group to run HPC jobs.

To create a compute node group running HPC jobs, enter a compute node name and select a previously-created EC2 launch template, IAM instance profile, and subnets to launch compute nodes in your cluster VPC.

Next, choose your preferred EC2 instance types to use when launching compute nodes and the minimum and maximum instance count for scaling. I chose the hpc6a.48xlarge instance type and scale limit up to eight instances. For a login node, you can choose a smaller instance, such as one c6i.xlarge instance. You can also choose either the On-demand or Spot EC2 purchase option if the instance type supports. Optionally, you can choose a specific AMI.

Choose Create. It takes some time for the compute node group to be provisioned. To learn more, visit Create a compute node group to run jobs and Create a compute node group for login nodes in the AWS documentation.

3. Create and run your HPC jobs
After creating your compute node groups, you submit a job to a queue to run it. The job remains in the queue until AWS PCS schedules it to run on a compute node group, based on available provisioned capacity. Each queue is associated with one or more compute node groups, which provide the necessary EC2 instances to do the processing.

To create a queue in the console, go to your cluster and choose the Queues tab and the Create queue button.

Enter your queue name and choose your compute node groups assigned to your queue.

Choose Create and wait while the queue is being created.

When the login compute node group is active, you can use AWS Systems Manager to connect to the EC2 instance it created. Go to the Amazon EC2 console and choose your EC2 instance of the login compute node group. To learn more, visit Create a queue to submit and manage jobs and Connect to your cluster in the AWS documentation.

To run a job using Slurm, you prepare a submission script that specifies the job requirements and submit it to a queue with the sbatch command. Typically, this is done from a shared directory so the login and compute nodes have a common space for accessing files.

You can also run a message passing interface (MPI) job in AWS PCS using Slurm. To learn more, visit Run a single node job with Slurm or Run a multi-node MPI job with Slurm in the AWS documentation.

You can connect a fully-managed NICE DCV remote desktop for visualization. To get started, use the CloudFormation template from HPC Recipes for AWS GitHub repository.

In this example, I used the OpenFOAMmotorBike simulation to calculate the steady flow around a motorcycle and rider. This simulation was run with 288 cores of three hpc6a instances. The output can be visualized in the ParaView session after logging in to the web interface of DCV instance.

Finally, after you are done HPC jobs with the cluster and node groups that you created, you should delete the resources that you created to avoid unnecessary charges. To learn more, visit Delete your AWS resources in the AWS documentation.

Things to know
Here are a couple of things that you should know about this feature:

  • Slurm versions - AWS PCS initially supports Slurm 23.11 and offers mechanisms designed to enable customers to upgrade their Slurm major versions once new versions are added. Additionally, AWS PCS is designed to automatically update the Slurm controller with patch versions. To learn more, visit Slurm versions in the AWS documentation.
  • Capacity Reservations - You can reserve EC2 capacity in a specific Availability Zone and for a specific duration using On-Demand Capacity Reservations to make sure that you have the necessary compute capacity available when you need it. To learn more, visit Capacity Reservations in the AWS documentation.
  • Network file systems - You can attach network storage volumes where data and files can be written and accessed, including Amazon FSx for NetApp ONTAP, Amazon FSx for OpenZFS, and Amazon File Cache as well as Amazon EFS and Amazon FSx for Lustre. You can also use self-managed volumes, such as NFS servers. To learn more, visit Network file systems in the AWS documentation.

Now available
AWS Parallel Computing Service is now available in the US East (N. Virginia), AWS US East (Ohio), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), Europe (Stockholm), AWS GovCloud (US-East) and AWS GovCloud (US-West) Regions.

AWS PCS launches all resources in your AWS account. You will be billed appropriately for those resources. For more information, see the AWS PCS Pricing page.

Give it a try and send feedback to the AWS re:Post for AWS PCS or through your usual AWS Support contacts.

- Channy

P.S. Special thanks to Matthew Vaughn, a principal developer advocate at AWS for his contribution in creating a HPC testing environment.