Performance Benchmarks

Version 1.0 (May 10, 2021)

This Benchmark Guide accompanies the Deploying cryoSPARC on AWS Guide. This guide provides an overview of benchmarks performed on a sample cryo-EM data processing workflow in cryoSPARC on AWS ParallelCluster. A typical workflow in cryoSPARC involves multiple steps, and it’s important to understand the computational requirements of each step to build a cluster on AWS that is both performant and economical.

In addition to benchmarking results, this guide also presents best practices around EC2 instance selection, cryoSPARC configuration, and file system considerations.

NOTE: This guide serves as an example of possible installation options, performance and cost, but each user’s results may vary. Performance and costs may scale differently depending on the specific compute setup, data being processed, how long AWS compute resources are being used, specific steps used in processing, etc.

Benchmark Data

The dataset used for benchmarking is EMPIAR-10288 (cannabinoid receptor 1-G protein complex). It consists of 2,756 TIFF images totaling 476 GB. The dataset is moderate in size compared to production cryo-EM workloads; it was chosen because its size allowed a large number of benchmark runs across a range of architecture options. Later benchmarking efforts will build on the analysis presented here and apply it to larger datasets.

Cluster Configuration

AWS ParallelCluster 2.10.0 was used to create an HPC cluster on which to benchmark cryoSPARC. The main requirements for cryo-EM workloads are access to large numbers of GPUs (and CPUs for some applications) as well as a high-performance file system. AWS ParallelCluster provides a simple-to-use mechanism to create a cluster that meets those requirements.

The specific cluster architecture for these benchmarks is as follows (deployed in us-east-1):

Main Node

  • EC2 instance: c5n.9xlarge

  • 36 vCPUs

  • 96 GB memory

  • Network Bandwidth: 50 Gbps

  • 100 GB local storage (EBS gp2)

Compute Nodes

Multiple queues were configured to dynamically provision the following instance types:

  • g4dn.16xlarge

    • Intel Cascade Lake CPU (64 vCPUs, 256GB memory)

    • 1 x NVIDIA T4 GPU

    • 1 x 900 GB NVMe local storage

  • g4dn.metal

    • Intel Cascade Lake CPU (96 vCPUs, 384GB memory)

    • 8 x NVIDIA T4 GPUs

    • 2 x 900 GB NVMe local storage

  • p3.2xlarge

    • Intel Broadwell CPU (8 vCPUs, 61GB memory)

    • 1 x NVIDIA Tesla V100 GPU

  • p3.8xlarge

    • Intel Broadwell CPU (32 vCPUs, 244GB memory)

    • 4 x NVIDIA Tesla V100 GPUs

  • p3.16xlarge

    • Intel Broadwell CPU (64 vCPUs, 488GB memory)

    • 8 x NVIDIA Tesla V100 GPUs

  • p3dn.24xlarge

    • Intel Skylake CPU (96 vCPUs, 768GB memory)

    • 8 x NVIDIA Tesla V100 GPUs

    • 2 x 900 GB NVMe local storage

  • p4d.24xlarge

    • Intel Cascade Lake CPU (96 vCPUs, 1152GB memory)

    • 8 x NVIDIA A100 GPUs

    • 8 x 1 TB NVMe local storage

Further details about the above EC2 instance types can be found in the Amazon EC2 instance types documentation.
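As a minimal sketch, the GPU instance list above can be held in a small catalog and filtered by GPU count or by the presence of local NVMe storage. The class name `GpuInstance`, the `CATALOG` list, and the `candidates` helper are hypothetical names introduced for illustration; the values are copied from the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GpuInstance:
    """One compute-node instance type from the list above (hypothetical helper type)."""
    name: str
    gpu_model: str
    gpu_count: int
    nvme_drives: int  # 0 means no local NVMe storage is listed for this type


# Values copied from the Compute Nodes list; nothing here is queried from AWS.
CATALOG = [
    GpuInstance("g4dn.16xlarge", "T4", 1, 1),
    GpuInstance("g4dn.metal", "T4", 8, 2),
    GpuInstance("p3.2xlarge", "V100", 1, 0),
    GpuInstance("p3.8xlarge", "V100", 4, 0),
    GpuInstance("p3.16xlarge", "V100", 8, 0),
    GpuInstance("p3dn.24xlarge", "V100", 8, 2),
    GpuInstance("p4d.24xlarge", "A100", 8, 8),
]


def candidates(min_gpus: int, need_nvme: bool = False) -> List[str]:
    """Return instance names with at least min_gpus GPUs, optionally requiring NVMe."""
    return [
        inst.name
        for inst in CATALOG
        if inst.gpu_count >= min_gpus and (inst.nvme_drives > 0 or not need_nvme)
    ]


if __name__ == "__main__":
    print(candidates(min_gpus=8, need_nvme=True))
```

A filter like this is only a starting point for building queues; price and per-stage scaling still decide which candidate is appropriate.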

File Systems

  • /fsx

    • 12 TB FSx for Lustre (2.4 GB/s throughput)

    • Used as the primary working directory for all cryoSPARC jobs

  • /shared

    • 100 GB EBS volume mounted on the cluster head node

    • Shared with compute nodes via NFS

    • Used as application installation directory

  • /scratch

    • Local storage on compute nodes

    • Only used to test cache performance in certain cryoSPARC steps

    • Only available on specific EC2 instances (those with additional NVMe or SSD)
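The 2.4 GB/s figure for /fsx is consistent with a file system whose throughput scales with capacity. The sketch below assumes a baseline of 200 MB/s per TB of storage (an assumption about the scratch deployment type, not a value stated in this guide) and estimates an idealized lower bound for reading the 476 GB benchmark dataset once. The function names are hypothetical.

```python
FSX_CAPACITY_TB = 12        # /fsx capacity from the list above
DATASET_GB = 476            # EMPIAR-10288 size from the Benchmark Data section
ASSUMED_MBPS_PER_TB = 200   # ASSUMPTION: per-TB baseline throughput, not stated in this guide


def aggregate_throughput_gbps(capacity_tb: float, mbps_per_tb: float) -> float:
    """Aggregate throughput in GB/s for a capacity-scaled file system."""
    return capacity_tb * mbps_per_tb / 1000.0


def ideal_read_seconds(dataset_gb: float, throughput_gbps: float) -> float:
    """Best-case time to stream the dataset once, ignoring all overheads."""
    return dataset_gb / throughput_gbps


throughput = aggregate_throughput_gbps(FSX_CAPACITY_TB, ASSUMED_MBPS_PER_TB)
read_time = ideal_read_seconds(DATASET_GB, throughput)

if __name__ == "__main__":
    print(f"aggregate throughput: {throughput:.1f} GB/s")
    print(f"ideal single pass over dataset: {read_time:.0f} s")
```

Real jobs will not reach this bound, since metadata operations, small reads, and client network limits all reduce effective throughput.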

Figure: Storage architecture of a p4d.24xlarge in AWS ParallelCluster

Software

  • cryoSPARC v3.0.0

  • AWS ParallelCluster v2.10.0

  • Slurm 20.02.4

  • CUDA 11.0

  • NVIDIA Driver 450.80.02

EMPIAR-10288 Pipeline

The pipeline used to benchmark the EMPIAR-10288 dataset is composed of the following cryoSPARC steps:

  • Import Movies

  • Patch Motion Correction

  • Patch CTF Estimation

  • Blob Picker

  • Template Picker

  • Extract From Micrographs

  • 2D Classification

  • Ab-initio Reconstruction

  • Non-Uniform Refinement

Performance Analysis

Each stage was run on 1, 2, 4 or 8 GPUs on each of the listed EC2 instance types (certain stages only make use of a single GPU and are noted in the results). Each pipeline step was run on the attached FSx for Lustre filesystem, and several of the steps were also run using local NVMe drives as a cache to compare performance.

The total runtime for the EMPIAR-10288 pipeline on each instance type is shown below:

The p4d instance provides the best overall performance, but the analysis pipeline does not use all 8 GPUs for its entire duration, and the cost of running the whole pipeline on a single p4d instance may be too high for some users.

The g4dn.metal instance is the most cost-effective option when a single instance type is used for the entire analysis.

An ideal approach is to match EC2 instance types to cryoSPARC pipeline stages, allowing us to make efficient use of the compute resources and keep compute costs down. The benchmarks for each step are listed below and will be used to help identify which instance types to use for which step.

The movie import step does not make use of GPUs, so its performance is determined by the host CPU. The g4dn and p4d instances have Intel Cascade Lake processors, the p3dn.24xlarge has a Skylake processor, and the other p3 instances have Broadwell processors.

The Patch Motion Correction and Patch CTF Estimation steps scale well with GPU count, so an instance with 8 GPUs is recommended for these stages.

The Blob Picker and Template Picker stages make use of a single GPU, and there is minimal difference in performance between GPU types. Here a low-cost, single-GPU EC2 instance is recommended (e.g. g4dn.16xlarge).

The Extract from Micrographs step shows scaling only up to 2 GPUs. Currently, there are no EC2 instances that offer only 2 GPUs, so a 4-GPU instance is recommended (e.g. p3.8xlarge). The performance difference is minimal across GPU architectures, so price should be the driving factor in choosing an instance for this stage.

The 2D Classification step also shows scaling only up to 2 GPUs. At present, EC2 instances are available with 1, 4, or 8 GPUs, so a 4-GPU instance is recommended (e.g. p3.8xlarge). Optionally, a lower-cost instance like the g4dn.metal (with 8 GPUs) may be used, with the expectation that scaling may be limited. The performance difference is minimal across GPU architectures, so price should be the driving factor in choosing an instance for this stage.

The Ab-initio Reconstruction step makes use of a single GPU, so the g4dn.16xlarge instance is the best option in terms of price and performance. The larger p3 and p4 instances offer the best performance but increase the overall cost.

The Non-Uniform Refinement stage makes use of 1 GPU and contributes the most to the total runtime. For this stage, a user should consider whether cost or time-to-solution is more important; a p4d.24xlarge will produce results 3.4x faster than a g4dn.16xlarge, but at a higher cost.

The following data shows a breakdown of each stage as a percentage of the total runtimes.

Several aspects of the pipeline should be considered when choosing instance types:

  • The Patch Motion Correction and Patch CTF Estimation stages scale, and so an 8-GPU instance is recommended.

  • Ab-initio Reconstruction and Non-Uniform Refinement dominate the runtime as single-GPU stages.

  • 2D Classification benefits somewhat from scaling, but the short run time for this stage means cost should be a deciding factor in choosing an instance.

Because cryoSPARC is an analysis pipeline in which each stage uses compute resources differently, instance types should be selected with care for each stage.
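The per-stage guidance in this section can be captured as a simple lookup, sketched below. The dictionary `RECOMMENDATIONS` and the function `recommend` are hypothetical names. The "cost" and "time" choices follow the text above and the instance types used in the Cost Analysis section; for Ab-initio Reconstruction the text only says that larger p3 and p4 instances are faster, so the "time" entry there is one illustrative pick.

```python
# Stage -> number of GPUs the stage can use effectively, plus an instance
# type for each priority. Entries summarize this section's recommendations.
RECOMMENDATIONS = {
    "Patch Motion Correction":  {"gpus": 8, "cost": "g4dn.metal",    "time": "p4d.24xlarge"},
    "Patch CTF Estimation":     {"gpus": 8, "cost": "g4dn.metal",    "time": "p4d.24xlarge"},
    "Blob Picker":              {"gpus": 1, "cost": "g4dn.16xlarge", "time": "g4dn.16xlarge"},
    "Template Picker":          {"gpus": 1, "cost": "g4dn.16xlarge", "time": "g4dn.16xlarge"},
    "Extract From Micrographs": {"gpus": 2, "cost": "p3.8xlarge",    "time": "p3.8xlarge"},
    "2D Classification":        {"gpus": 2, "cost": "g4dn.metal",    "time": "p3.8xlarge"},
    # "time" entry below is an illustrative choice among the faster p3/p4 options.
    "Ab-initio Reconstruction": {"gpus": 1, "cost": "g4dn.16xlarge", "time": "p4d.24xlarge"},
    "Non-Uniform Refinement":   {"gpus": 1, "cost": "g4dn.16xlarge", "time": "p4d.24xlarge"},
}


def recommend(stage: str, priority: str = "cost") -> str:
    """Return the suggested instance type for a stage.

    priority is "cost" (cheapest reasonable option) or "time" (fastest option).
    """
    if priority not in ("cost", "time"):
        raise ValueError("priority must be 'cost' or 'time'")
    return RECOMMENDATIONS[stage][priority]


if __name__ == "__main__":
    for stage in RECOMMENDATIONS:
        print(f"{stage}: {recommend(stage, 'cost')} / {recommend(stage, 'time')}")
```

In a cluster with one queue per instance type, a lookup like this could decide which queue a given cryoSPARC job is submitted to.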

Storage Performance

Another key component in the cryoSPARC pipeline is the file system. It is generally recommended to use fast local storage on the compute node (e.g. NVMe) as a cache. Local NVMe drives are available on only a subset of EC2 instance types, generally the larger ones (e.g. p4d.24xlarge, p3dn.24xlarge, and g4dn.metal). To provide a cost-optimal solution and allow the use of smaller GPU instance types, FSx for Lustre was benchmarked to determine whether it could provide the necessary performance.

Below is benchmark data for the 2D Classification step (one of the steps in cryoSPARC that can make use of a local cache). It shows the percentage decrease in runtime when using FSx for Lustre instead of a local cache.

In all but one case, FSx for Lustre provided better performance than using local storage as a cache.

The main reason for this is how the local file system is created. Larger instances have multiple drives (e.g. the p4d.24xlarge has 8 NVMe drives). AWS ParallelCluster creates a single logical volume from those drives, and that logical volume limits performance. Additionally, the 2D Classification step can make use of multiple GPUs, so performance drops when multiple cryoSPARC tasks share the same local file system. FSx for Lustre, by contrast, is designed to support I/O from a large number of processes.

Cost Analysis

Using a single EC2 instance type for an entire cryoSPARC workload is not an optimal solution. Instances like the p4d.24xlarge provide the best performance, but without regard to cost. General guidelines were given above as to which instances to pick for which stage. Below are two examples of how a user might implement those recommendations, along with the total cost (including storage) and runtime. Prices are listed in USD for the us-east-1 region.

| Pipeline Stage | Configuration 1: Instance Type | Configuration 1: Cost (USD) | Configuration 1: Runtime (min) | Configuration 2: Instance Type | Configuration 2: Cost (USD) | Configuration 2: Runtime (min) |
|---|---|---|---|---|---|---|
| Patch Motion Correction | p4d.24xlarge | $12.25 | 22.4 | g4dn.metal | $3.97 | 30.5 |
| Patch CTF Estimation | p4d.24xlarge | $7.46 | 13.7 | g4dn.metal | $1.44 | 11.1 |
| Blob Picker | g4dn.16xlarge | $0.72 | 9.9 | g4dn.16xlarge | $0.72 | 9.9 |
| Template Picker | g4dn.16xlarge | $1.77 | 24.4 | g4dn.16xlarge | $1.77 | 24.4 |
| Micrograph Extraction | p3.8xlarge | $1.65 | 8.25 | g4dn.16xlarge | $0.90 | 12.5 |
| 2D Classification | g4dn.metal | $1.76 | 13.5 | g4dn.metal | $1.76 | 13.5 |
| Ab-initio Reconstruction | g4dn.16xlarge | $2.56 | 35.3 | p3.16xlarge | $6.83 | 23.3 |
| Non-Uniform Refinement | g4dn.16xlarge | $9.43 | 130 | p4d.24xlarge | $33.66 | 82.5 |
| Compute Cost | | $37.60 | | | $51.05 | |
| Lustre Cost (12 TB Scratch) | | $9.52 | | | $8.03 | |
| Total | | $47.12 | 257.55 (4.29 hrs) | | $59.08 | 206.8 (3.45 hrs) |
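The totals in the table above can be reproduced from the per-stage rows. The following is a small sketch of that arithmetic; the names `CONFIG_1`, `CONFIG_2`, and `totals` are hypothetical, and all numbers are copied from the table.

```python
# Each row: (stage, compute cost in USD, runtime in minutes), from the table above.
CONFIG_1 = [
    ("Patch Motion Correction", 12.25, 22.4),
    ("Patch CTF Estimation", 7.46, 13.7),
    ("Blob Picker", 0.72, 9.9),
    ("Template Picker", 1.77, 24.4),
    ("Micrograph Extraction", 1.65, 8.25),
    ("2D Classification", 1.76, 13.5),
    ("Ab-initio Reconstruction", 2.56, 35.3),
    ("Non-Uniform Refinement", 9.43, 130),
]
CONFIG_2 = [
    ("Patch Motion Correction", 3.97, 30.5),
    ("Patch CTF Estimation", 1.44, 11.1),
    ("Blob Picker", 0.72, 9.9),
    ("Template Picker", 1.77, 24.4),
    ("Micrograph Extraction", 0.90, 12.5),
    ("2D Classification", 1.76, 13.5),
    ("Ab-initio Reconstruction", 6.83, 23.3),
    ("Non-Uniform Refinement", 33.66, 82.5),
]


def totals(rows, lustre_cost):
    """Return (compute cost, compute + Lustre cost, summed runtime in minutes)."""
    compute = round(sum(cost for _, cost, _ in rows), 2)
    runtime = round(sum(minutes for _, _, minutes in rows), 2)
    return compute, round(compute + lustre_cost, 2), runtime


if __name__ == "__main__":
    print("Configuration 1:", totals(CONFIG_1, lustre_cost=9.52))
    print("Configuration 2:", totals(CONFIG_2, lustre_cost=8.03))
```

The summed costs match the table ($37.60 and $51.05 compute, $47.12 and $59.08 total). Summing the per-stage runtimes as printed gives 257.45 and 207.7 minutes, slightly different from the published totals of 257.55 and 206.8; the source does not explain the difference, so the table values are left as published.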

For comparison, here are costs and runtimes if we ran the entire benchmark on a single EC2 instance type:

  • p4d.24xlarge

    • 2.79 hours

    • $91.40

  • g4dn.metal

    • 4.8 hours

    • $37.60

  • p3.16xlarge

    • 3.4 hours

    • $83.00
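One way to read the single-instance figures above is to derive the effective hourly rate each implies and the speedup relative to the cheapest option. The sketch below does only that arithmetic; `SINGLE_INSTANCE`, `implied_hourly_rate`, and `speedup_vs` are hypothetical names, and the derived rates are effective values computed from this guide's numbers, not quoted AWS prices.

```python
# instance type -> (runtime in hours, cost in USD), from the list above
SINGLE_INSTANCE = {
    "p4d.24xlarge": (2.79, 91.40),
    "g4dn.metal": (4.8, 37.60),
    "p3.16xlarge": (3.4, 83.00),
}


def implied_hourly_rate(instance: str) -> float:
    """Effective USD per hour implied by the benchmark cost and runtime."""
    hours, cost = SINGLE_INSTANCE[instance]
    return cost / hours


def speedup_vs(instance: str, baseline: str = "g4dn.metal") -> float:
    """How many times faster `instance` finished the pipeline than `baseline`."""
    return SINGLE_INSTANCE[baseline][0] / SINGLE_INSTANCE[instance][0]


if __name__ == "__main__":
    for name in SINGLE_INSTANCE:
        print(f"{name}: ${implied_hourly_rate(name):.2f}/hr, "
              f"{speedup_vs(name):.2f}x vs g4dn.metal")
```

On these figures the p4d.24xlarge finishes roughly 1.7 times faster than the g4dn.metal at roughly 2.4 times the cost, which is the trade-off the mixed configurations try to soften.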

Conclusion

The benchmark results presented demonstrate how AWS can be used to accelerate cryo-EM workloads. By making use of AWS ParallelCluster, users can easily create an HPC cluster with a range of GPU instance types, allowing them to best match compute resources with the requirements of each analysis step.

The benchmarks presented here were run by a single user; in practice, multiple users are likely to share the cluster for analysis. Further cost optimization can be achieved by running multiple jobs on a larger instance. For example, while a p4d.24xlarge (with 8 A100 GPUs) has a higher cost, running multiple single-GPU stages such as Non-Uniform Refinement at the same time helps amortize that cost.
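The amortization argument can be made concrete with the Non-Uniform Refinement figures from the Cost Analysis section ($33.66 on a p4d.24xlarge versus $9.43 on a g4dn.16xlarge). The sketch below assumes that concurrent single-GPU jobs on one p4d.24xlarge each take the same time as a job running alone, which this guide did not measure; the function names are hypothetical.

```python
import math

P4D_STAGE_COST = 33.66   # Non-Uniform Refinement on p4d.24xlarge (Cost Analysis table)
G4DN_STAGE_COST = 9.43   # Same stage on g4dn.16xlarge (Cost Analysis table)
P4D_GPUS = 8             # GPUs available on a p4d.24xlarge


def cost_per_job(instance_cost: float, concurrent_jobs: int, gpus: int = P4D_GPUS) -> float:
    """Per-job cost when single-GPU jobs share one instance.

    ASSUMPTION: jobs do not slow each other down, so the instance is billed
    for the same wall-clock time regardless of how many jobs run on it.
    """
    if not 1 <= concurrent_jobs <= gpus:
        raise ValueError("concurrent_jobs must be between 1 and the GPU count")
    return instance_cost / concurrent_jobs


def break_even_jobs(instance_cost: float, alternative_cost: float) -> int:
    """Smallest number of concurrent jobs at which sharing is no more expensive."""
    return math.ceil(instance_cost / alternative_cost)


if __name__ == "__main__":
    n = break_even_jobs(P4D_STAGE_COST, G4DN_STAGE_COST)
    print(f"break-even at {n} concurrent jobs")
    print(f"per-job cost with all {P4D_GPUS} GPUs busy: "
          f"${cost_per_job(P4D_STAGE_COST, P4D_GPUS):.2f}")
```

Under that assumption, four or more concurrent refinement jobs make the p4d.24xlarge cheaper per job than a g4dn.16xlarge while also finishing sooner. Contention for CPU, memory, or file system bandwidth would shift the break-even point.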

Summarized below are the key points a user should consider when creating a cryo-EM cluster.

  • Amazon FSx for Lustre provides a high-performance file system that can meet the requirements of a cryo-EM analysis pipeline. This also allows flexibility in choosing instance types (e.g. those that lack local, fast NVMe storage).

  • A range of GPU instances should be employed. AWS ParallelCluster can be used to easily create different queues with different EC2 instances for this purpose.

  • p3dn.24xlarge instances are not recommended. They can provide excellent performance, but the p4d.24xlarge is priced very close to the p3dn.24xlarge and is fast enough that, in most cases, a processing stage will use less compute time on it and thus cost less.

  • g4dn instances will likely make up the bulk of the compute resources. They provide performance at an excellent price point.