Performance Benchmarks
Version 1.0 (May 10, 2021)
This Benchmark Guide accompanies the Deploying CryoSPARC on AWS Guide and provides an overview of benchmarks performed on a sample cryo-EM data processing workflow run in CryoSPARC on AWS ParallelCluster. A typical CryoSPARC workflow involves multiple steps, and it's important to understand the computational requirements of each step in order to build a cluster on AWS that is both performant and economical.
In addition to benchmarking results, this guide also presents best practices around EC2 instance selection, CryoSPARC configuration, and file system considerations.
NOTE: This guide serves as an example of possible installation options, performance and cost, but each user’s results may vary. Performance and costs may scale differently depending on the specific compute setup, data being processed, how long AWS compute resources are being used, specific steps used in processing, etc.
Benchmark Data
The dataset used for benchmarking is EMPIAR-10288 (cannabinoid receptor 1-G protein complex). It is composed of 2756 TIFF images constituting 476 GB of data. The dataset is moderate in size compared to production cryo-EM workloads, but was chosen because its size allowed a large number of benchmark runs across a range of different architecture options. Later benchmarking efforts will build on the analysis presented here and will be applied to larger datasets.
A few files in the raw dataset need to be discarded before processing because their image dimensions do not match the rest of the dataset. These files are:
CB1__00004_Feb18_23.33.18.tif
CB1__00005_Feb18_23.34.19.tif
CB1__00724_Feb19_12.00.25.tif
The raw dataset includes two gain reference files in .dm4 format, which CryoSPARC does not currently support. Instead, use the following file (converted to .mrc from one of the .dm4 files) as the gain reference for the entire dataset:
https://structura-assets.s3.amazonaws.com/files/CountRef_CB1__00826_Feb19_14.19.25.mrc
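As an illustration, the sketch below stages the dataset for import: it downloads the converted gain reference and sets aside the three mismatched TIFFs so the import wildcard skips them. The directory layout (/fsx/empiar/10288/data) is an assumption; adjust it to match your own FSx mount.

```python
# Minimal staging sketch. The movie directory below is a hypothetical
# location on the FSx for Lustre mount; verify paths before running.
from pathlib import Path
from urllib.request import urlretrieve

DATA_DIR = Path("/fsx/empiar/10288/data")  # assumed location of the EMPIAR-10288 movies
GAIN_URL = ("https://structura-assets.s3.amazonaws.com/files/"
            "CountRef_CB1__00826_Feb19_14.19.25.mrc")

# Movies whose image dimensions do not match the rest of the dataset.
EXCLUDE = [
    "CB1__00004_Feb18_23.33.18.tif",
    "CB1__00005_Feb18_23.34.19.tif",
    "CB1__00724_Feb19_12.00.25.tif",
]

# Download the converted .mrc gain reference (CryoSPARC cannot read the .dm4 files).
gain_path = DATA_DIR / GAIN_URL.rsplit("/", 1)[-1]
if not gain_path.exists():
    urlretrieve(GAIN_URL, str(gain_path))

# Rename the mismatched movies so the *.tif import wildcard skips them.
for name in EXCLUDE:
    bad = DATA_DIR / name
    if bad.exists():
        bad.rename(bad.with_name(bad.name + ".excluded"))

movies = sorted(DATA_DIR.glob("*.tif"))
print(f"{len(movies)} movies ready for import; gain reference at {gain_path}")
```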
Processing Steps
1. Import Movies
Movies data path: .../10288/data/*.tif
Gain reference path: .../10288/data/CountRef_CB1__00826_Feb19_14.19.25.mrc
Flip gain ref in Y? True
Raw pixel size (Å): 0.86
Accelerating Voltage (kV): 300
Spherical Aberration (mm): 2.7
Total exposure dose (e/Å^2): 58
2. Patch Motion Correction
3. Patch CTF Estimation
4. Blob Picker
Minimum particle diameter (Å): 100
Maximum particle diameter (Å): 150
Use circular blob: True
Use elliptical blob: True
Number of micrographs to process: 100
5. Extract from Micrographs
Extraction Box Size: 320
Fourier crop to box size: 64
6. 2D Classification
7. Select 2D (INTERACTIVE)
Select all views that look resolvable; these classes will be used to identify particle locations on the full dataset. Try to select the following views:
8. Template Picker
Particle Diameter: 160
9. Inspect Picks (INTERACTIVE)
Move sliders to match the following values:
NCC score > 0.340
Local power > 936.000
Local power < 1493.000
10. Extract from Micrographs
Extraction Box Size: 360
Fourier crop to box size: 256
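The two extraction jobs (steps 5 and 10) trade resolution for speed through Fourier cropping. A quick worked calculation using only the numbers above shows the effect: the effective pixel size after cropping sets the Nyquist limit, which for the second pass must stay below the ~3 Å target of the final refinement.

```python
# Effective pixel size and Nyquist limit after Fourier cropping.
# Raw pixel size (0.86 Å) and box sizes are taken from the job parameters above.
def fourier_crop(raw_apix, box, crop_box):
    eff_apix = raw_apix * box / crop_box  # Å per pixel after cropping
    nyquist = 2 * eff_apix                # best achievable resolution (Å)
    return eff_apix, nyquist

raw_apix = 0.86  # Å

for box, crop in [(320, 64), (360, 256)]:
    eff, nyq = fourier_crop(raw_apix, box, crop)
    print(f"box {box} -> {crop}: {eff:.2f} Å/pix, Nyquist {nyq:.2f} Å")

# box 320 -> 64:  4.30 Å/pix, Nyquist 8.60 Å (fast; sufficient for initial 2D classification)
# box 360 -> 256: 1.21 Å/pix, Nyquist 2.42 Å (supports the ~3 Å refinement target)
```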
11. 2D Classification
12. Select 2D (INTERACTIVE)
13. Ab-Initio Reconstruction
14. Non-Uniform Refinement
The Non-Uniform Refinement should yield a 3 Å resolution structure, at which point the data processing pipeline is complete.
Your tree view should look like this:
Cluster Configuration
AWS ParallelCluster 2.10.0 was used to create an HPC cluster on which to benchmark CryoSPARC. The main requirements for cryo-EM workloads are access to large numbers of GPUs (and CPUs for some applications) as well as a high-performance file system. AWS ParallelCluster provides a simple-to-use mechanism to create a cluster that meets those requirements.
The specific cluster architecture for these benchmarks is as follows (deployed in us-east-1):
Main Node
EC2 instance: c5n.9xlarge
36 vCPUs
96 GB memory
Network Bandwidth: 50 Gbps
100 GB local storage (EBS gp2)
Compute Nodes
Multiple queues were configured to dynamically provision the following instance types:
g4dn.16xlarge
Intel Cascade Lake CPU (32 vCPUs, 256GB memory)
1 x NVIDIA T4 GPU
1 x 900 GB NVMe local storage
g4dn.metal
Intel Cascade Lake CPU (32 vCPUs, 256GB memory)
8 x NVIDIA T4 GPUs
2 x 900 GB NVMe local storage
p3.2xlarge
Intel Broadwell CPU (8 vCPUs, 61GB memory)
1 x NVIDIA Tesla V100 GPU
p3.8xlarge
Intel Broadwell CPU (32 vCPUs, 244GB memory)
4 x NVIDIA Tesla V100 GPUs
p3.16xlarge
Intel Broadwell CPU (64 vCPUs, 488GB memory)
8 x NVIDIA Tesla V100 GPUs
p3dn.24xlarge
Intel Skylake CPU (96 vCPUs, 768GB memory)
8 x NVIDIA Tesla V100 GPUs
2 x 900 GB NVMe local storage
p4d.24xlarge
Intel Cascade Lake CPU (96 vCPUs, 1152GB memory)
8 x NVIDIA A100 GPUs
8 x 1 TB NVMe local storage
Further details about the above EC2 instance types can be found in the Amazon EC2 instance types documentation.
File Systems
/fsx
12 TB FSx for Lustre (2.4 GB/s throughput)
Used as the primary working directory for all CryoSPARC jobs
/shared
100 GB EBS volume mounted on cluster head node
Shared with compute nodes via NFS
Used as application installation directory
/scratch
Local storage on compute nodes
Only used to test cache performance in certain CryoSPARC steps
Only available on specific EC2 instances (those with additional NVMe or SSD)
Software
CryoSPARC v3.0.0
AWS ParallelCluster v2.10.0
Slurm 20.02.4
CUDA 11.0
NVIDIA Driver 450.80.02
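As a quick sanity check before installing CryoSPARC, a short script such as the sketch below can confirm that the GPU, NVIDIA driver, and CUDA toolkit are visible on a compute node. It is illustrative only and assumes nvidia-smi and nvcc are on the PATH of your GPU AMI.

```python
# Rough environment check on a GPU compute node: reports GPU model,
# NVIDIA driver version, and the installed CUDA toolkit release.
import shutil
import subprocess

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()

if shutil.which("nvidia-smi"):
    # e.g. "Tesla T4, 450.80.02"
    print(run(["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]))
else:
    print("nvidia-smi not found")

if shutil.which("nvcc"):
    # e.g. "Cuda compilation tools, release 11.0, V11.0.221"
    print(next(line for line in run(["nvcc", "--version"]).splitlines() if "release" in line))
else:
    print("nvcc not found")
```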
EMPIAR 10288 Pipeline
The pipeline used to benchmark the EMPIAR-10288 dataset is composed of the following CryoSPARC steps:
Import Movies
Patch Motion Correction
Patch CTF Estimation
Blob Picker
Template Picker
Extract From Micrographs
2D Classification
Ab-initio Reconstruction
Non-Uniform Refinement
Performance Analysis
Each stage was run on 1, 2, 4 or 8 GPUs on each of the listed EC2 instance types (certain stages only make use of a single GPU and are noted in the results). Each pipeline step was run on the attached FSx for Lustre filesystem, and several of the steps were also run using local NVMe drives as a cache to compare performance.
The total runtime for the EMPIAR-10288 pipeline on each instance type is shown below:
The p4d instance provides the best overall performance, but the analysis pipeline doesn’t make use of all 8 GPUs the entire time. Also, the cost of running the entire pipeline on a single p4d instance may not be ideal for some users:
The g4dn.metal instance provides the most cost-effective option if we were to use it for the entire analysis.
An ideal approach is to match EC2 instance types to CryoSPARC pipeline stages, allowing us to make efficient use of the compute resources and keep compute costs down. The benchmarks for each step are listed below and will be used to help identify which instance types to use for which step.
The movie import step does not make use of GPUs, so the performance is determined by the host CPU. The g4 and p4 instances have Intel Cascade Lake processors, the p3dn.24xlarge has a Skylake processor, and the p3 instances have Broadwell processors.
The Patch Motion and Patch CTF Estimation steps show good scaling, and so an instance with 8 GPUs is recommended for these stages.
The Blob Picker and Template Picker stages make use of a single GPU, and there is minimal difference in performance between GPU types. Here a low-cost, single-GPU EC2 instance is recommended (e.g. g4dn.16xlarge).
The Extract from Micrographs step shows scaling only up to 2 GPUs. Currently, there are no EC2 instances that offer only 2 GPUs, so a 4-GPU instance is recommended (e.g. p3.8xlarge). The performance difference is minimal across GPU architectures, so price should be the driving factor in choosing an instance for this stage.
The 2D Classification step also shows scaling only up to 2 GPUs. At present, EC2 instances are available with 1, 4, or 8 GPUs, so a 4-GPU instance is recommended (e.g. p3.8xlarge). Optionally, a lower-cost instance like the g4dn.metal (with 8 GPUs) may be used, with the expectation that scaling may be limited. The performance difference is minimal across GPU architectures, so price should be the driving factor in choosing an instance for this stage.
The Ab-initio Reconstruction step makes use of a single GPU, so the g4dn.16xlarge instance is the best option in terms of price and performance. The larger p3 and p4 instances offer the best performance but increase the overall cost.
The Non-Uniform Refinement stage makes use of 1 GPU and contributes the most to the total runtime. For this stage, a user should consider whether cost or time-to-solution is more important; a p4d.24xlarge will produce results 3.4x faster than a g4dn.16xlarge, but at a higher cost.
These results highlight several aspects of the pipeline to consider when choosing instance types:
The Patch Motion Correction and Patch CTF Estimation stages scale, and so an 8-GPU instance is recommended.
Ab-initio Reconstruction and Non-Uniform Refinement dominate the runtime as single-GPU stages.
2D Classification benefits somewhat from scaling, but the short run time for this stage means cost should be a deciding factor in choosing an instance.
Because CryoSPARC is an analysis pipeline in which each stage uses compute resources differently, care must be given to the selection of instance types.
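One way to act on these observations with Slurm (which AWS ParallelCluster deploys) is to route each stage to a partition backed by a suitable instance type. The sketch below is purely illustrative: the partition names and run_stage.sh script are hypothetical, and in practice CryoSPARC submits its own jobs through its cluster lane configuration rather than through a script like this.

```python
# Illustrative mapping of pipeline stages to hypothetical Slurm partitions,
# each backed by a different EC2 instance type via ParallelCluster queues.
# Prints the sbatch commands only; CryoSPARC's cluster lanes normally handle submission.
import shlex

STAGE_TO_PARTITION = {
    "patch_motion_correction": ("gpu8-p4d", 8),   # scales well: 8-GPU instance
    "patch_ctf_estimation":    ("gpu8-p4d", 8),
    "blob_picker":             ("gpu1-g4dn", 1),  # single-GPU stages
    "template_picker":         ("gpu1-g4dn", 1),
    "extract_micrographs":     ("gpu4-p3", 2),    # scales to roughly 2 GPUs
    "2d_classification":       ("gpu4-p3", 2),
    "abinitio_reconstruction": ("gpu1-g4dn", 1),
    "nonuniform_refinement":   ("gpu1-g4dn", 1),  # or gpu8-p4d for faster time-to-solution
}

for stage, (partition, gpus) in STAGE_TO_PARTITION.items():
    cmd = ["sbatch", f"--partition={partition}", f"--gres=gpu:{gpus}",
           f"--job-name={stage}", "run_stage.sh", stage]  # run_stage.sh is a placeholder
    print(shlex.join(cmd))
```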
Storage Performance
Another key component in the CryoSPARC pipeline is the filesystem. CryoSPARC recommends using fast local storage on the compute node as a cache (e.g., NVMe). Amazon EC2 instances do offer local NVMe drives, but only on a subset of instance types, generally the larger ones (e.g., p4d.24xlarge, p3dn.24xlarge, and g4dn.metal). To provide a cost-optimal solution and allow for smaller GPU instance types, FSx for Lustre was benchmarked to determine whether it could provide the necessary performance.
Below is benchmark data for the 2D classification step (one of the steps in CryoSPARC that can make use of a local cache). It shows the percentage decrease in runtime using FSx for Lustre.
In every case except for one, FSx for Lustre provided better performance than using local storage as a cache.
The main reason is how the local filesystem is created. Larger instances have multiple drives (e.g., the p4d.24xlarge has 8 NVMe drives), and AWS ParallelCluster combines them into a single logical volume; the overhead of that combined volume limits performance. Additionally, the 2D Classification step can make use of multiple GPUs, so performance degrades when multiple CryoSPARC tasks contend for the same local filesystem, whereas FSx for Lustre is designed to support I/O from a large number of processes.
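A rough way to compare the two filesystems outside CryoSPARC is to time large sequential writes and reads against each mount point. The sketch below is a coarse check, not the benchmark used for the results above; the paths and file size are assumptions, and OS page caching can inflate the read numbers.

```python
# Coarse sequential-throughput check for /fsx (FSx for Lustre) vs /scratch
# (local NVMe). Writes then reads a large file and reports MB/s.
import os
import time

CHUNK = 8 * 1024 * 1024        # 8 MiB per write
TOTAL = 4 * 1024 ** 3          # 4 GiB test file
MOUNTS = ["/fsx", "/scratch"]  # adjust to your cluster's mount points

def bench(path):
    test_file = os.path.join(path, "throughput_test.bin")
    buf = os.urandom(CHUNK)

    start = time.perf_counter()
    with open(test_file, "wb") as f:
        for _ in range(TOTAL // CHUNK):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # ensure data actually reaches the filesystem
    write_mbps = TOTAL / (time.perf_counter() - start) / 1e6

    start = time.perf_counter()
    with open(test_file, "rb") as f:
        while f.read(CHUNK):
            pass
    read_mbps = TOTAL / (time.perf_counter() - start) / 1e6

    os.remove(test_file)
    return write_mbps, read_mbps

for mount in MOUNTS:
    if os.path.isdir(mount):
        w, r = bench(mount)
        print(f"{mount}: write {w:.0f} MB/s, read {r:.0f} MB/s")
```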
Cost Analysis
Using a single EC2 instance type for an entire CryoSPARC workload is not an optimal solution. Instances like the p4d.24xlarge provide the best performance, but at the highest cost. General guidelines were given above for which instances to pick for each stage. Below is an example of how a user might implement those recommendations, along with the total cost (including storage) and runtime. Prices are listed in USD for the us-east-1 region.
| Pipeline Stage | Config 1: Instance Type | Config 1: Cost (USD) | Config 1: Runtime (min) | Config 2: Instance Type | Config 2: Cost (USD) | Config 2: Runtime (min) |
| --- | --- | --- | --- | --- | --- | --- |
| Patch Motion Correction | p4d.24xlarge | $12.25 | 22.4 | g4dn.metal | $3.97 | 30.5 |
| Patch CTF Estimation | p4d.24xlarge | $7.46 | 13.7 | g4dn.metal | $1.44 | 11.1 |
| Blob Picker | g4dn.16xlarge | $0.72 | 9.9 | g4dn.16xlarge | $0.72 | 9.9 |
| Template Picker | g4dn.16xlarge | $1.77 | 24.4 | g4dn.16xlarge | $1.77 | 24.4 |
| Micrograph Extraction | p3.8xlarge | $1.65 | 8.25 | g4dn.16xlarge | $0.90 | 12.5 |
| 2D Classification | g4dn.metal | $1.76 | 13.5 | g4dn.metal | $1.76 | 13.5 |
| Ab-initio Reconstruction | g4dn.16xlarge | $2.56 | 35.3 | p3.16xlarge | $6.83 | 23.3 |
| Non-Uniform Refinement | g4dn.16xlarge | $9.43 | 130 | p4d.24xlarge | $33.66 | 82.5 |
| Compute Cost | | $37.60 | | | $51.05 | |
| Lustre Cost (12 TB scratch) | | $9.52 | | | $8.03 | |
| Total | | $47.12 | 257.55 (4.29 hrs) | | $59.08 | 206.8 (3.45 hrs) |
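The per-stage figures above follow from a simple calculation: instance hourly price × runtime, plus the FSx for Lustre charge for the hours the 12 TB scratch filesystem exists. The sketch below approximately reproduces Configuration 1; the hourly rates are approximate us-east-1 on-demand prices at the time of writing and should be checked against current AWS pricing.

```python
# Approximate reconstruction of the Configuration 1 costs above.
# Hourly rates are approximate us-east-1 on-demand prices; verify against
# current AWS pricing before relying on them.
HOURLY_RATE = {            # USD per hour
    "p4d.24xlarge": 32.77,
    "p3.8xlarge": 12.24,
    "g4dn.metal": 7.824,
    "g4dn.16xlarge": 4.352,
}

LUSTRE_GB = 12_000               # 12 TB scratch filesystem
LUSTRE_RATE = 0.140 / 730        # approx. USD per GB-hour (scratch deployment)

# (stage, instance type, runtime in minutes) for Configuration 1
CONFIG_1 = [
    ("Patch Motion Correction",  "p4d.24xlarge",  22.4),
    ("Patch CTF Estimation",     "p4d.24xlarge",  13.7),
    ("Blob Picker",              "g4dn.16xlarge",  9.9),
    ("Template Picker",          "g4dn.16xlarge", 24.4),
    ("Micrograph Extraction",    "p3.8xlarge",     8.25),
    ("2D Classification",        "g4dn.metal",    13.5),
    ("Ab-initio Reconstruction", "g4dn.16xlarge", 35.3),
    ("Non-Uniform Refinement",   "g4dn.16xlarge", 130.0),
]

compute_cost = 0.0
total_minutes = 0.0
for stage, instance, minutes in CONFIG_1:
    cost = HOURLY_RATE[instance] * minutes / 60
    compute_cost += cost
    total_minutes += minutes
    print(f"{stage:26s} {instance:14s} {minutes:6.1f} min  ${cost:5.2f}")

lustre_cost = LUSTRE_GB * LUSTRE_RATE * (total_minutes / 60)
print(f"Compute ${compute_cost:.2f} + Lustre ${lustre_cost:.2f} = "
      f"${compute_cost + lustre_cost:.2f} over {total_minutes / 60:.2f} hours")
```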
For comparison, here are costs and runtimes if we ran the entire benchmark on a single EC2 instance type:
| Instance Type | Runtime | Cost (USD) |
| --- | --- | --- |
| p4d.24xlarge | 2.79 hours | $91.40 |
| g4dn.metal | 4.8 hours | $37.60 |
| p3.16xlarge | 3.4 hours | $83.00 |
Conclusion
The benchmark results presented demonstrate how AWS can be used to accelerate cryo-EM workloads. By making use of AWS ParallelCluster, users can easily create an HPC cluster with a range of GPU instance types, allowing them to best match compute resources with the requirements of each analysis step.
The benchmarks presented here were run by a single user; in practice, it is likely that multiple users will share the cluster for analysis. Further cost optimization can be achieved by running multiple jobs on a larger instance. For example, while a p4d.24xlarge (with 8 A100 GPUs) has a higher hourly cost, running multiple single-GPU stages like Non-Uniform Refinement at the same time helps amortize that cost.
Summarized below are the key points a user should consider when creating a cryo-EM cluster.
Amazon FSx for Lustre provides a high-performance file system that can meet the requirements of a cryo-EM analysis pipeline. This also allows flexibility in choosing instance types (e.g. those that lack local, fast NVMe storage).
A range of GPU instances should be employed. AWS ParallelCluster can be used to easily create different queues with different EC2 instances for this purpose.
p3dn.24xlarge instances are not recommended. They can provide excellent performance, but the p4d.24xlarge instance is priced very closely to the p3dn.24xlarge. The time-to-solution for the p4d.24xlarge is fast enough that in most cases a processing stage will use less compute time and, thus, cost less.
g4dn instances will likely make up the bulk of the compute resources. They provide performance at an excellent price point.