Deploying CryoSPARC on AWS
Last updated: Version 1.0 (May 10, 2021)
This deployment guide is based on version 2 of AWS ParallelCluster. For an updated configuration using a newer version of AWS ParallelCluster, please see this AWS sample.
This Deployment Guide provides end-to-end sample instructions for deploying CryoSPARC™, a state-of-the-art scientific software platform for cryo-EM, on AWS using AWS ParallelCluster. CryoSPARC is developed by Structura Biotechnology Inc. Additional information about CryoSPARC, including licensing, is available at guide.cryosparc.com.
Cryo-electron microscopy (cryo-EM) is a biophysical technique that allows scientists to determine the structure of biological macromolecules and assemblies. The technology, which was awarded the 2017 Nobel Prize in Chemistry, uses advanced microscopes to reveal 3D structures of biomolecules in near-native states. Cryo-EM is rapidly becoming the go-to technique for protein structure determination in the life sciences and drug discovery. Recently, cryo-EM was used to produce the first atomic-level 3D structure of the spike protein of SARS-CoV-2, the virus that causes COVID-19.
Storing the micrographs (images produced by the microscope) requires enormous storage capacity, and the processing workflow requires massive computing power. This workload is therefore an ideal use case for High-Performance Computing on Amazon Web Services.
A typical cryo-EM workflow involves biological sample preparation, data collection, and finally computation. Images of a sample are collected with a transmission electron microscope. The raw data, which is generally 5-10 TB in size, is then processed to reconstruct a three-dimensional structure of the protein of interest from the two-dimensional images. Resolving a 3D structure often requires multiple iterations through the entire processing pipeline, or parts of the pipeline, for a near-atomic resolution result. In many cases, the entire workflow starting from data collection must be repeated to achieve state-of-the-art results.
This guide is accompanied by a Performance Benchmarks document outlining the steps of a CryoSPARC workflow, along with timings and cost estimates.
The benchmark was performed on the EMPIAR-10288 dataset of approximately 2800 micrographs totalling 476GB. For the deployment guide, we recommend running the smaller T20S tutorial that ships with CryoSPARC. Approximate costs for the T20S example on different EC2 instance types are:
p4d.24xlarge: $37 USD
p3dn.24xlarge: $42 USD
g4dn.metal: $10 USD
NOTE: This guide serves as an example of possible installation options, performance and cost, but each user’s results may vary. Performance and costs may scale differently depending on the specific compute setup, data being processed, how long AWS compute resources are being used, specific steps used in processing, etc.
This deployment guide uses Amazon EC2, Amazon Simple Storage Service (S3), Amazon FSx for Lustre, and Amazon CloudFormation. See the architecture diagram in section 11.3 for details.
This deployment guide assumes minimal AWS knowledge; however, there are a few prerequisites. The first step is to create an AWS account, as described here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/get-set-up-for-amazon-ec2.html#sign-up-for-aws
You will also need the following:
A computer running macOS or Linux with internet access. For Windows users, a terminal emulator.
An Internet browser such as Chrome or Firefox
Familiarity with Linux terminal commands
Time to request EC2 Service Quota increases (at least 24 hours)
IAM Permissions
To log into the AWS Management Console, navigate to https://aws.amazon.com/console/ and click the link labelled “Already have an account? Sign in.” You will be prompted for an Account ID or alias, IAM user name, and password. If you have just created a new account, you will need to sign in with your root user email and password.
Once you are logged into the AWS Management Console, spend some time becoming familiar with the interface. This page is a central place you can use to find and learn about AWS services as well as to manage and monitor your account. Some key items to note are the Search Bar and the AWS region menu. The latter allows you to select the geographical region where your AWS resources will be located. AWS currently has 24 regions spread across the globe, each containing multiple Availability Zones (AZs), which in turn contain one or more data centers where you can run your cryo-EM analysis.
Note that some newer compute instances may not be available in all AZs. Please select the AZ that has all the resources required for your workflow.
The AWS Command Line Interface (AWS CLI) is an open-source tool that enables you to interact with AWS services from your command-line shell. Download AWS CLI using the commands below:
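```bash
# Standard AWS CLI v2 installation on Linux x86_64 (requires sudo):
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# On macOS, use the official installer package instead:
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /
```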
If you do not have superuser (sudo) permissions, AWS CLI can also be installed using pip with regular user permissions. You can find more instructions here.
Verify that the AWS CLI was installed correctly with the following command (example output included):
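```bash
aws --version
# aws-cli/2.1.38 Python/3.8.8 Linux/5.4.0 exe/x86_64  (example output)
```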
In order to access services associated with your AWS account, you must first configure the AWS CLI. You will need to provide an Access Key ID and a Secret access key. For more information on configuring the AWS CLI, see this article.
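A typical interactive session (the keys shown are placeholders):

```bash
aws configure
# AWS Access Key ID [None]: <your-access-key-id>
# AWS Secret Access Key [None]: <your-secret-access-key>
# Default region name [None]: us-east-1
# Default output format [None]: json
```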
Whether you are using a new or existing AWS account to deploy CryoSPARC in the cloud, it’s best to create a new IAM user specifically for this purpose. Doing so allows you to give the IAM user specific policies that are scoped to the resources and actions needed to complete the deployment.
This article explains how to create a new IAM user in your AWS account. The type of access required is “Programmatic Access”, as this IAM user will only be used for the `deploy-cryosparc.sh` script via the CLI.
During testing, the following AWS managed policies were attached to the IAM user deploying CryoSPARC:
AmazonEC2FullAccess
AmazonFSxFullAccess
AmazonS3FullAccess
AmazonDynamoDBFullAccess
CloudWatchLogsFullAccess
AmazonRoute53FullAccess
AWSCloudFormationFullAccess
AWSLambda_FullAccess
as well as the following custom managed policy:
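For illustration only, a customer managed policy takes the following general JSON shape; the statement below is a placeholder, not the policy used in testing:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PlaceholderStatement",
      "Effect": "Allow",
      "Action": [
        "iam:PassRole",
        "iam:CreateRole"
      ],
      "Resource": "*"
    }
  ]
}
```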
See this article for more details on creating custom IAM policies.
Amazon EC2 is a service that provides scalable and flexible computing capacity in the AWS cloud. Using Amazon EC2 eliminates the need to invest in computing hardware and enables quick development and deployment of applications. It allows many virtual servers to be configured with security, networking and storage management. EC2 virtual servers, also known as instances, are the building blocks for supercomputing on AWS.
The EC2 dashboard displays resources and provides the ability to launch an instance. On the left-hand side of the dashboard, there are links to EC2 limits, instances, AMIs, Security Groups and keys.
All accounts initially have low service limits to protect against fraud. Increasing these limits requires a simple request specifying your region and the instances you need. Please verify that GPU-based p* and g* instances are available in your region. This deployment guide uses the us-east-1 (US East - N. Virginia) region.
From the AWS console, search for Service Quotas (AWS Services – Service Quotas) and go to the Elastic Compute Cloud section. Type ‘on-demand’ to filter. Select the type of instance for which you wish to increase the vCPU limit and click ‘Request Limit Increase’. For this workload, request a limit increase for the p*, g*, and c* instance types. Select the desired region, enter a case description explaining why you want the limits increased, and click Submit. You should receive a response from AWS Support within 24 hours confirming your limits have been increased. You can read more about service limits here.
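The same request can also be sketched from the CLI via the Service Quotas API; the quota code differs per instance family, so list the quotas first:

```bash
# List EC2 on-demand vCPU quotas and their codes
aws service-quotas list-service-quotas --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'On-Demand')].[QuotaName, QuotaCode, Value]" \
  --output table

# Request an increase (example: 96 vCPUs; substitute the quota code found above)
aws service-quotas request-service-quota-increase \
  --service-code ec2 --quota-code <quota-code> --desired-value 96
```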
A key pair, consisting of a private key and a public key, is a set of security credentials required to prove your identity when connecting to an instance. Create an EC2 key pair in the region you plan to deploy the CryoSPARC cluster. This can be done in two ways:
Go to the EC2 dashboard in your AWS console. Select Key pairs, then Create key pair. Give it an appropriate name and download the key pair.
Open your terminal and type the following command:
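```bash
# Create the key pair and save the private key locally (standard AWS CLI form):
aws ec2 create-key-pair --key-name key-cryoSPARC --region name-of-region \
  --query KeyMaterial --output text > ~/.ssh/key-cryoSPARC
chmod 600 ~/.ssh/key-cryoSPARC
```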
Substitute name-of-region for the region where you want to deploy your cluster. For this guide, enter us-east-1.

Your key is stored in the `~/.ssh` directory and is called `key-cryoSPARC`.
Amazon Virtual Private Cloud (Amazon VPC) lets you launch AWS resources in a virtual network that you have defined. A VPC is a virtual network dedicated to your AWS account; it is logically isolated from other virtual networks. A subnet is a range of IP addresses in your VPC in which you can launch AWS resources. This virtual network closely resembles a traditional network operating in an on-premises data center, with the benefits of the scalable infrastructure of AWS.
A network access control list (ACL) is an optional layer of security for your VPC that acts as a firewall, controlling traffic in and out of one or more subnets. A security group acts as a virtual firewall for your instance, controlling inbound and outbound traffic. Security groups operate at the instance level, not the subnet level; therefore, each instance in a subnet in your VPC can be assigned to a different set of security groups.
Before proceeding further, download the zip folder containing all the files required for this deployment here: https://github.com/cryoem-uoft/aws-deployment-guide/archive/refs/tags/v1.0.zip
The folder contains 5 files:
README.md
cryosparc-pcluster.config.template
deploy-cryosparc_v1.sh
install-cryosparc_v1.sh
vpc-cryosparc.template
Optionally, browse the code on GitHub at https://github.com/cryoem-uoft/aws-deployment-guide.
Amazon Simple Storage Service (Amazon S3) is an object storage service for storing data and securing it from unauthorized access. Use Amazon S3 to store the raw images for CryoSPARC to analyze.
A single bucket is required for this guide. In the following instructions, replace the given example bucket name with your own, as bucket names must be globally unique. The name must not contain any uppercase characters. Since the bucket will store raw data, make sure your bucket is private.
The S3 bucket and the compute resources required to run a CryoSPARC workflow must be in the same region (see Section 10.2). Note that some newer compute instances may not be available in all of a region's AZs. Please create the bucket in a region with an AZ that has all the resources required for your workflow.
Create the bucket `cryosparc-test-data-np` for raw data. This S3 bucket will be linked to the Amazon FSx for Lustre service for high-performance read and write storage operations.

To create the bucket and upload the raw `.tif` and `.mrc` movies (multi-frame micrographs) to `cryosparc-test-data-np`:
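```bash
# /path/to/movies is a placeholder for your local raw-data directory.
aws s3 mb s3://cryosparc-test-data-np --region us-east-1
aws s3 sync /path/to/movies s3://cryosparc-test-data-np/
```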
From the S3 management console, verify that all files were successfully uploaded to the bucket.
Amazon FSx for Lustre is a fully managed service that provides cost-effective, high-performance storage for compute workloads. Many workloads such as machine learning, high-performance computing (HPC), video rendering and financial simulations depend on compute instances accessing the same set of data through high-performance shared storage. Amazon FSx for Lustre file systems can also be linked to Amazon S3 buckets, enabling data to be accessed and processed concurrently from a high-performance file system. Amazon FSx for Lustre can also be configured to back up data to Amazon S3, and further to Amazon S3 Glacier, to optimize data backup costs.
You can choose between two file systems when using Amazon FSx for Lustre:
Persistent File Systems - these are ideal for long-term storage.
Scratch File Systems - these are ideal for temporary and short-term storage.
This guide uses a Scratch file system, which is the most performant and cost-effective option for this workload. If you require data resilience, consider a Persistent file system deployment instead.
AWS CloudFormation provides a way to model a collection of related AWS and third-party resources, provision them quickly and consistently, and manage them throughout their lifecycles, by treating infrastructure as code. A CloudFormation template describes your desired resources and their dependencies so you can launch and configure them together as a stack. You can use a template to create, update and delete an entire stack as a single unit, as often as you need to, instead of managing resources individually. You can manage and provision stacks across multiple AWS accounts and AWS Regions.
The script `vpc-cryosparc.template` is a CloudFormation template that deploys the VPC and subnets. As is, the script creates a VPC and two subnets: the subnet for the head instance is public, and the compute instances are placed in a private subnet. From outside the VPC, you can only log into the head instance in the public subnet (via SSH).
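The deploy script launches this template for you; to launch it manually instead, a CloudFormation call would look like this (the stack name is illustrative):

```bash
aws cloudformation create-stack --stack-name cryosparc-vpc \
  --template-body file://vpc-cryosparc.template \
  --region us-east-1
```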
AWS ParallelCluster enables you to quickly build an HPC environment on AWS. It automatically sets up the required compute resources, a shared filesystem, and offers a variety of batch schedulers. You define all the resources you need in a config file.
This deployment will use ParallelCluster 2.10.1. CryoSPARC requires the multiple queues feature in ParallelCluster, which is only supported by versions 2.9.0 and later.
To install ParallelCluster 2.10.1, run the following command on your local machine:
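```bash
pip3 install --upgrade --user aws-parallelcluster==2.10.1
```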
This enables access to the terminal command-line tool `pcluster`. First, confirm that you have the correct version installed:
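```bash
pcluster version
# 2.10.1
```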
Open `cryosparc-pcluster.config.template` in a text editor; here you provide the details of the cluster required to deploy and run the CryoSPARC workflow.
An AWS ParallelCluster configuration is defined in multiple sections. A section starts with the section name in square brackets, followed by parameters and configuration. See this page for more information about the config file. At the time of deployment, a `cryosparc.config` file is created from the `cryosparc.config.template` file (more on this later).
Look through the values that are not explicitly defined in `cryosparc-pcluster.config.template`:

`aws_region_name` - the name of the region
`key_name` - the key pair name
`post_install` - the path to the install script that runs after the cluster is created
`s3_read_resource` - the S3 bucket with raw data
`vpc_id` - the VPC ID
`master_subnet_id` - the subnet ID where the head node resides
`compute_subnet_id` - the subnet ID where the compute nodes reside
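For orientation, here is a trimmed sketch of how these parameters fit into the ParallelCluster 2.x section layout. All values are placeholders, and the full template ships in the downloaded bundle:

```ini
[aws]
aws_region_name = <region>

[cluster cryosparc]
key_name = <key-pair-name>
base_os = alinux2
master_instance_type = c5n.9xlarge
post_install = <path-to-install-script>
s3_read_resource = arn:aws:s3:::<data-bucket>*
vpc_settings = cryosparc-vpc
fsx_settings = cryosparc-fsx
queue_settings = gpu-large,gpu-med,gpu-small

[vpc cryosparc-vpc]
vpc_id = <vpc-id>
master_subnet_id = <head-node-subnet-id>
compute_subnet_id = <compute-node-subnet-id>
```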
You will provide the region, key pair name, S3 bucket name, and CryoSPARC license ID when you deploy the cluster. The `deploy-cryosparc.sh` script fetches the remaining values from the networking infrastructure created by `vpc-cryosparc.template`.
This config file uses an Amazon EC2 c5n.9xlarge instance as the head node. The c5n.9xlarge uses an Intel Xeon Platinum (Skylake) processor with 36 vCPUs. All instances run Amazon Linux 2, AWS's Linux server operating system. None of the compute instances run until they are needed. Three compute queues are defined: gpu-large, gpu-med, and gpu-small. Each queue can host multiple EC2 instance types; for example, the gpu-large queue comprises p4d.24xlarge, p3.16xlarge, and g4dn.metal instances. Multiple queues are very useful since different steps of the CryoSPARC workflow run better on different instances. See the attached benchmarking guide for instance recommendations.
In your local directory, verify that these three files exist:
cryosparc-pcluster.config.template
deploy-cryosparc.sh
vpc-cryosparc.template
Before launching the cluster, choose the Availability Zone (AZ) where the cluster will run. All EC2 instance types you choose to instantiate should be available within the same AZ in the region where your S3 bucket resides. If a required EC2 instance type is not available in any AZ in your region, re-create your S3 bucket in a region that has them available.
Run the command below to see where a given instance type is available.
`--region` - the region you want to deploy the cluster in
`Values=` - the type of instance
For example, this command outputs the AZs in the us-east-1 region where g4dn.12xlarge instances are available:
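```bash
aws ec2 describe-instance-type-offerings --location-type availability-zone \
  --filters Name=instance-type,Values=g4dn.12xlarge \
  --region us-east-1 --output table
```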
Finally, run the deploy script to launch the cluster:
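```bash
# A sketch of the invocation, using the example names from this guide;
# the script file in the downloaded bundle is named deploy-cryosparc_v1.sh.
./deploy-cryosparc_v1.sh \
  --region us-east-1 \
  --cluster-name cryosparc \
  --az us-east-1a \
  --data-bucket cryosparc-test-data-np \
  --aws-key key-cryoSPARC \
  --cryosparc-license-id <your-license-id>
```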
`--region` - the region in which to deploy the cluster
`--cluster-name` - the name of the cluster, for identification purposes
`--az` - the AZ in which to deploy the cluster; make sure the instance types you need are available in this AZ
`--data-bucket` - the existing S3 bucket created earlier; it will be linked to the cluster via Amazon FSx for Lustre, and all movies will be uploaded here
`--aws-key` - the name (not the path) of the SSH key you created earlier
`--cryosparc-license-id` - the license ID provided by Structura Bio for your CryoSPARC instance
During deployment, a `cryosparc.config` file is created from the `cryosparc.config.template` file with all the values for the specified variables. This `cryosparc.config` is specific to the cluster you just deployed. Check the `cryosparc.config` file and ensure all the information is correct. Pay particular attention to the `[cluster cryosparc]`, `[vpc cryosparc-vpc]` and `[fsx cryosparc-fsx]` sections.
Retain the `vpc-cryosparc.template` file for later use. The deployment will take about 30 minutes.
The script deploys a VPC in the region you selected with two subnets. A cluster called cryosparc with a c5n.9xlarge head node is also deployed. The head node resides in the public subnet, and the compute nodes (launched as needed) reside in the private subnet. The head node hosts both the CryoSPARC web interface and the database. Additionally, the script creates and mounts an EBS volume of type gp2 as a shared file system, as well as an Amazon FSx for Lustre file system.
As the process completes, navigate to the AWS Management Console and take a look at which resources were deployed. In your terminal, look for instructions on connecting to CryoSPARC's web interface. Following the prompts, log into the head node of the cluster over SSH. The login command will look like:
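```bash
# The deploy script prints the exact command; its general form is:
ssh -i ~/.ssh/key-cryoSPARC ec2-user@<head-node-public-dns>
```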
Once logged in, create a new CryoSPARC user.
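A typical invocation of the bundled `cryosparcm` utility looks like the following; the account details are placeholders, and the exact flags vary with the CryoSPARC version:

```bash
cryosparcm createuser --email "ada@example.com" --password "Password123" \
  --username "ada" --firstname "Ada" --lastname "Lovelace"
```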
The password should not contain special characters. When finished, log out of the head node. Then set up an SSH tunnel to the CryoSPARC head node to connect to the CryoSPARC web interface:
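```bash
# Forward local port 45000 to the CryoSPARC port on the head node
# (45000 in this deployment); the hostname placeholder is as above.
ssh -N -L 45000:localhost:45000 -i ~/.ssh/key-cryoSPARC ec2-user@<head-node-public-dns>
```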
Open a web browser to http://localhost:45000 and log into CryoSPARC with the username and password you created.
Before you start your production workload, familiarize yourself with the CryoSPARC environment. Follow the instructions here and run your first CryoSPARC workload. More information about CryoSPARC configuration and management is available here.
After the workload is completed, download the data from the Amazon FSx for Lustre file system to Amazon S3 using a data repository task. Use the `create-data-repository-task` command to export data from your Amazon FSx for Lustre file system back to the data bucket:
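```bash
# <file-system-id> and path1 are placeholders; see the parameter notes below.
aws fsx create-data-repository-task \
  --file-system-id <file-system-id> \
  --type EXPORT_TO_REPOSITORY \
  --paths path1 \
  --report Enabled=false
```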
`--file-system-id` - the ID of the Lustre file system created; find this in the AWS console, under Amazon FSx
`--paths` - the paths of the directory or file you want to export, relative to the mount point of the file system; if the mount point is `/mnt/fsx` and `/mnt/fsx/path1` is a directory or file you want to export, provide `path1`
Unless you plan to run more analysis immediately, we recommend tearing down the cluster to avoid incurring costs. The cluster scales down based on your config file. The current config file ensures the head node and the Amazon FSx for Lustre file system continue running. You can also completely destroy the cluster and spin up a new cluster as needed.
From the AWS command line, execute the following (the cluster and stack names shown match the examples used in this guide; substitute your own):
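```bash
pcluster delete cryosparc --region us-east-1
```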
to delete the cluster. This includes instances, attached volumes, FSx file system, etc.
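```bash
# Stack name as used in the CloudFormation example earlier in this guide
aws cloudformation delete-stack --stack-name cryosparc-vpc --region us-east-1
```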
to delete all networking-related infrastructure. This deletes the VPC and subnets.
These two commands delete the infrastructure initialized in this guide; however, they retain the S3 buckets with data and installation scripts. Delete the buckets if they are no longer needed, or archive them at minimal cost.
After about 15 minutes, log into your AWS console and check for any resources that are no longer required; delete them.