Guide: Performance Benchmarking (v4.3+)
This guide covers the new benchmarking tool in CryoSPARC that allows for benchmarking a worker’s filesystem, CPUs and GPUs. Available in CryoSPARC v4.3.0+.
After installing CryoSPARC and verifying the instance is working correctly (see Installation Testing), use the Performance Benchmarking job to measure the performance of your system and compare it against references provided by Structura and your own past benchmarks.
The new “Benchmark” job is available in the CryoSPARC job builder and can be run on any worker lane connected to your CryoSPARC instance.
The “Benchmark” job ensures the benchmark data exists in the right location (downloading it if it doesn’t), then runs the three benchmark tests (CPU, Filesystem and GPU) in series, as specified.
The benchmark data (17 GB, compressed) must be downloaded and extracted to a location accessible by the job before the benchmarks can run. As a convenience, the Benchmark job does this automatically when the required data does not exist in the project directory. The benchmark data can also be downloaded manually via the link below; once downloaded and extracted, specify the absolute path to the folder in the “Benchmark Data Directory” parameter.
Click here to download the benchmark data package directly from cloud storage.
The benchmark data package contains the movies, particles and volumes required for each of the tests. Its contents are summarized below:
- Particles previously processed in CryoSPARC from a subset of movies in EMPIAR-10025
- TIFF movies: 3x K3 super-resolution (11520, 8184), 70 frames, 1.16 GB each, from EMPIAR-10721
- MRC movies: 3x K2 (3710, 3838), 44 frames, 1.2 GB each, from EMPIAR-10249
- EER movies: 3x Falcon 4 (4096, 4096), 48 frames, 500 MB each, from EMPIAR-10612
In each job, there is a parameter (“Share benchmark data with Structura Biotechnology”, disabled by default) to allow uploading of benchmark data to Structura’s servers. The data sent includes timings and hardware information, but does not include any user identifiable information.
An example of the data uploaded can be seen below:
Structura will use this data to maintain aggregate statistics about CryoSPARC performance in the wild and to help focus optimization efforts on the jobs and code paths with the most benefit to users. Users who do upload benchmark data should not expect any direct response from Structura.
The filesystem benchmark runs a sequential read test on movies, and both sequential and random read tests on particles, simulating real CryoSPARC workflows to measure the performance of the filesystem where the benchmark data is stored.
Turn off the “Use SSD for Tests” parameter to disable the caching system during the particle read tests. The job will then report the time it takes to read the particles in a sequential and random pattern from the project directory rather than from a local cache device.
On Linux, the “page cache” is an area of unused memory used to store data the OS has read, for later rapid retrieval. For example, when you read a 1 GB file twice, the second access is faster because the file blocks come directly from the cache in memory instead of the hard disk or SSD. The OS automatically frees data stored in the page cache as other applications request more memory. The Benchmark job attempts to drop the files it uses from the page cache by calling the `posix_fadvise` function to declare that the files “will not be accessed in the near future” (`POSIX_FADV_DONTNEED`). Doing so allows subsequent runs of the Benchmark job to be reproducible (meaning the numbers reported by the job won’t be skewed by faster read times) without having to manually drop the page cache.
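For illustration, here is a minimal Python sketch of the same idea, using `os.posix_fadvise` with `POSIX_FADV_DONTNEED` (the file path is hypothetical; this is not CryoSPARC’s exact implementation):

```python
import os

def drop_from_page_cache(path: str) -> None:
    """Advise the kernel that this file's cached pages won't be needed soon."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Offset 0 and length 0 cover the entire file; the kernel may then
        # evict the file's clean pages from the page cache.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

drop_from_page_cache("/path/to/benchmark_data/some_movie.tif")
```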
Dropping the page cache: The following two commands first instruct the kernel to write dirty pages to disk, then drop the page cache:
sudo bash -c 'sync; echo 1 > /proc/sys/vm/drop_caches'
Note that there still may be other caches in play if your data is hosted on other machines (e.g., a storage cluster’s cache).
To benchmark sequential reading, which is relevant in the early stages of data processing, three different types of movies (TIFF, EER and MRC) are read and timed. The test reports the average total I/O time for each type. This measures the performance of the storage volume on which the movies are located. To benchmark a storage volume that is different from the project directory, copy the benchmark data to the new location and specify it in the “Benchmark Data Directory” parameter. For more information on the sources of each of the movies, see Benchmark Data.
For TIFF and EER movies, only the time it takes for the system to read the compressed data into memory is recorded. The decompression step (which is always performed when reading TIFF and EER movies) is timed but not included in the filesystem results; decompression time is reported by the CPU benchmark instead.
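As a rough illustration of what a sequential read measurement looks like (not CryoSPARC’s code; the file name and block size are arbitrary):

```python
import time

def time_sequential_read(path: str, block_size: int = 8 * 1024 * 1024) -> float:
    """Read a file front-to-back in large blocks and return the elapsed time."""
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while f.read(block_size):
            pass
    return time.perf_counter() - start

elapsed = time_sequential_read("/path/to/benchmark_data/some_movie.tif")
print(f"sequential read took {elapsed:.2f} s")
```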
To benchmark sequential particle reads, which are relevant during some parts of particle processing, a small particle stack (50,000 particles with shape (256, 256), across 500 files) is randomly generated using `numpy.random.randn`, written to the project directory, cached onto the cache device (if available and enabled), then read back into memory in a sequential pattern. The time it takes the system to cache the particles (`particle_cache_time`), the time to read the particles sequentially (`particle_sequential_read_time`) and the rate at which the particles are read (`particle_sequential_read_rate`) are recorded.
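The sketch below mimics the idea on a much smaller scale (one file instead of 500, and far fewer particles); the file name and sizes are illustrative only, and in a real measurement the page cache would need to be dropped first (see above):

```python
import time
import numpy as np

n_particles, box = 1_000, 256                      # the real test uses 50,000 particles
stack = np.random.randn(n_particles, box, box).astype(np.float32)
np.save("benchmark_particles.npy", stack)          # hypothetical file in the project dir

start = time.perf_counter()
_ = np.load("benchmark_particles.npy")             # read the whole stack sequentially
elapsed = time.perf_counter() - start
print(f"particle_sequential_read_time ~ {elapsed:.3f} s")
print(f"particle_sequential_read_rate ~ {stack.nbytes / elapsed / 1e6:.1f} MB/s")
```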
To benchmark random reads, which are relevant during most parts of particle processing, the same particle stack created during the sequential read test is used, but this time the particles are read in a random pattern. The time it takes the system to read the particles randomly (`particle_random_read_time`) and the rate at which the particles are read (`particle_random_read_rate`) are recorded.
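A corresponding sketch of the random-read pattern, reusing the hypothetical file from the previous example:

```python
import time
import numpy as np

stack = np.load("benchmark_particles.npy", mmap_mode="r")   # memory-map, don't read yet
order = np.random.permutation(len(stack))                   # random particle order

start = time.perf_counter()
for i in order:
    _ = np.asarray(stack[i])                                # force the read of one image
elapsed = time.perf_counter() - start
print(f"particle_random_read_time ~ {elapsed:.3f} s")
print(f"particle_random_read_rate ~ {stack.nbytes / elapsed / 1e6:.1f} MB/s")
```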
The CPU Benchmark reads the same TIFF and EER movies as the Filesystem test (see Sequential Read Test), but instead reports the time it takes to decompress the movies, which is heavily dependent on CPU and memory performance.
Note that in order to measure decompression time, the CryoSPARC environment variable `CRYOSPARC_TIFF_IO_SHM` must be set and turned on (which it is by default). See Environment Variables. This setting tells the I/O system to first copy the contents of TIFF and EER files to `/dev/shm` (a temporary file storage system backed by RAM) before decompressing them, allowing the system to distinguish I/O time from decompression time. Note that this setting also improves performance on some networked file systems.
Decompression time is averaged over three runs with different movies and reported as `tiff` and `eer` in the CPU tab of the Benchmark viewer.
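The sketch below illustrates the principle behind this separation, using the third-party `tifffile` reader and hypothetical paths (an assumption for illustration; CryoSPARC’s internal reader differs):

```python
import shutil
import time
import tifffile

src = "/path/to/benchmark_data/some_movie.tif"   # hypothetical compressed TIFF movie
shm_copy = "/dev/shm/some_movie.tif"

t0 = time.perf_counter()
shutil.copy(src, shm_copy)                       # I/O: read from storage into RAM-backed /dev/shm
t1 = time.perf_counter()
frames = tifffile.imread(shm_copy)               # decompression: decode the frames from RAM
t2 = time.perf_counter()

print(f"io_time = {t1 - t0:.2f} s, decompress_time = {t2 - t1:.2f} s")
```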
The GPU benchmark executes and times a collection of functions from CryoSPARC jobs on each of the worker’s GPUs (unless the “Number of GPUs to benchmark” parameter is specified, in which case only that many GPUs are benchmarked). The tests include:
- FSC Calculations using different masks (a simplified CPU sketch of this computation appears below, after the list):
  - Spherical Mask (`fsc_spherical`)
  - Loose Mask (`fsc_loose`)
  - Tight Mask (`fsc_tight`)
  - Noise Sub Mask (`fsc_noisesub`)
- Non-Uniform Refinement’s core algorithm (`matched_cv_filter_estimation`)
- Particle picking’s core algorithm (`picking`)
- CryoSPARC’s core alignment and reconstruction algorithm (“Engine”), tested with various parameter combinations:
  - Particles in cache
    - Using 1 CPU thread
      - Trilinear interpolation kernel, C1 symmetry, Pose Maximization: Test A (`disk_single_linear10_max_C1`)
      - Tricubic interpolation kernel, C1 symmetry, Pose Maximization: Test B (`disk_single_linear20_max_C1`)
    - Using 2 CPU threads
      - Trilinear interpolation kernel, C1 symmetry, Pose Maximization: Test C (`disk_multi_linear10_max_C1`)
      - Tricubic interpolation kernel, C1 symmetry, Pose Maximization: Test D (`disk_multi_linear20_max_C1`)
  - Particles in memory
    - Using 1 CPU thread
      - Trilinear interpolation kernel, C1 symmetry, Pose Maximization: Test E (`memory_single_linear10_max_C1`)
      - Trilinear interpolation kernel, C1 symmetry, Pose Marginalization: Test F (`memory_single_linear10_marg_C1`)
      - Trilinear interpolation kernel, D7 symmetry, Pose Maximization: Test G (`memory_single_linear10_max_D7`)
      - Tricubic interpolation kernel, C1 symmetry, Pose Maximization: Test H (`memory_single_linear20_max_C1`)
      - Tricubic interpolation kernel, C1 symmetry, Pose Marginalization: Test I (`memory_single_linear20_marg_C1`)
As of CryoSPARC v4.4+, the `memory_multi_*` (particles in memory + multithreaded) tests have been removed.
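As referenced in the list above, here is a simplified, CPU-only sketch of a Fourier Shell Correlation between two half-maps. The benchmark runs its FSC calculations on the GPU with mask-specific variants; this NumPy version only illustrates the underlying per-shell correlation:

```python
import numpy as np

def fsc(vol_a: np.ndarray, vol_b: np.ndarray) -> np.ndarray:
    """Correlation between two 3D maps, computed per spherical shell in Fourier space."""
    fa = np.fft.fftn(vol_a)
    fb = np.fft.fftn(vol_b)
    n = vol_a.shape[0]
    freq = np.fft.fftfreq(n)
    fx, fy, fz = np.meshgrid(freq, freq, freq, indexing="ij")
    # Assign every Fourier voxel to a shell index based on its radius
    shell = np.minimum((np.sqrt(fx**2 + fy**2 + fz**2) * n).astype(int), n // 2).ravel()
    cross = np.bincount(shell, weights=(fa * np.conj(fb)).real.ravel())
    power_a = np.bincount(shell, weights=(np.abs(fa) ** 2).ravel())
    power_b = np.bincount(shell, weights=(np.abs(fb) ** 2).ravel())
    return cross / np.sqrt(power_a * power_b)

rng = np.random.default_rng(0)
half_a = rng.standard_normal((64, 64, 64))    # stand-ins for two half-maps
half_b = half_a + 0.5 * rng.standard_normal((64, 64, 64))
print(np.round(fsc(half_a, half_b)[:8], 3))
```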
The core algorithm used in Non-Uniform Refinement performs multiple data transfers to and from the GPU while executing hundreds of GPU-accelerated FFTs. This test stresses the GPU’s memory performance and the CPU’s PCIe bandwidth, and is limited by the performance of a single CPU core.
The different parameter combinations specified for the core reconstruction algorithm test code paths used by various CryoSPARC jobs, including Homogeneous Refinement, Non-Uniform Refinement, 3D Classification, and more. The test name (e.g., `memory_multi_linear20_marg_C1`) is composed of the parameters used to perform the test:
`<particle location>_<number of CPU threads>_<interpolation kernel>_<pose assignment method>_<symmetry operator>`
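A hypothetical helper (not part of CryoSPARC) that splits a test name into the five components described in the following sections:

```python
def parse_test_name(name: str) -> dict:
    """Decode a GPU benchmark test name such as 'memory_single_linear20_marg_C1'."""
    location, threads, kernel, pose, symmetry = name.split("_")
    return {
        "particle_location": {"disk": "cache device (SSD)", "memory": "RAM"}[location],
        "cpu_threads": {"single": 1, "multi": 2}[threads],
        "interpolation_kernel": {"linear10": "trilinear", "linear20": "tricubic"}[kernel],
        "pose_assignment": {"max": "maximization", "marg": "marginalization"}[pose],
        "symmetry": symmetry,
    }

print(parse_test_name("memory_single_linear20_marg_C1"))
```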
Particle Location:
Particles can either be stored on the cache device (SSD) if caching is enabled, or read into memory. When particles are in memory, IO time becomes negligible.
Number of CPU Threads:
CryoSPARC’s core algorithm can be run with one or two threads. Most of the time it is run with two threads, so that particle I/O and GPU computation are performed concurrently. Note that in tests using 2 CPU threads, some timing numbers are not accurate due to the concurrency of the computation, which is why only the overall time is reported. In these cases, it’s best to compare the timings from the corresponding single-threaded test.
Interpolation Kernel:
CryoSPARC’s refinement and classification algorithms use two main interpolation kernels, trilinear (`linear10`) and tricubic (`linear20`), to interpolate values of the 3D density in Fourier space. Interpolation is necessary when rotating and projecting the 3D density, which is used in the orientation search step in most refinement/classification/variability jobs.
Trilinear interpolation is significantly less computationally expensive than tricubic interpolation, requiring only 8 array accesses (vs. 64) of the underlying 3D density. Trilinear interpolation is also hardware-accelerated on NVIDIA GPUs through CUDA, whereas tricubic interpolation is not.
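A short NumPy sketch of trilinear interpolation at a fractional 3D coordinate, to make the 8-neighbour access pattern concrete (illustrative only; CryoSPARC’s GPU kernels are not implemented this way):

```python
import numpy as np

def trilinear(vol: np.ndarray, x: float, y: float, z: float) -> float:
    """Weighted sum of the 8 voxels surrounding (x, y, z); tricubic would touch 64."""
    x0, y0, z0 = int(np.floor(x)), int(np.floor(y)), int(np.floor(z))
    dx, dy, dz = x - x0, y - y0, z - z0
    value = 0.0
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                weight = (dx if i else 1 - dx) * (dy if j else 1 - dy) * (dz if k else 1 - dz)
                value += weight * vol[x0 + i, y0 + j, z0 + k]
    return value

vol = np.random.randn(16, 16, 16)
print(trilinear(vol, 7.3, 2.8, 9.1))
```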
Pose Assignment Method:
Non-Uniform Refinement supports either pose maximization (`max`) or marginalization (`marg`) during the reconstruction of the 3D density from the particle images. Maximization means that each particle is assigned a single 3D pose and shift during reconstruction. Alternatively, marginalization allows each particle to be assigned multiple 3D poses and shifts, each weighted by its relative likelihood under the image formation model. Maximization is usually sufficient, but for small particles or noisy datasets, marginalization helps to account for uncertainty in estimating the poses. When reconstructing the 3D density, maximization only has to insert each particle image into the 3D reconstruction once; marginalization is more computationally expensive because it requires inserting each image into the reconstruction multiple times.
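A toy NumPy illustration of the difference in reconstruction cost (not CryoSPARC’s actual back-projection code): maximization inserts each particle once at its best pose, while marginalization inserts it once per candidate pose, weighted by relative likelihood:

```python
import numpy as np

likelihoods = np.array([0.05, 0.80, 0.15])   # relative likelihoods of 3 candidate poses
contributions = np.random.randn(3, 8, 8)     # stand-in back-projection per candidate pose

# Maximization: a single insertion using the most likely pose
recon_max = contributions[np.argmax(likelihoods)]

# Marginalization: likelihood-weighted insertions over all candidate poses (3x the work here)
weights = likelihoods / likelihoods.sum()
recon_marg = np.tensordot(weights, contributions, axes=1)
print(recon_max.shape, recon_marg.shape)
```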
Symmetry Operator:
CryoSPARC’s core reconstruction algorithm supports many symmetry operators, but C1 and D7 were chosen for these benchmarks as a way to turn the symmetry-handling code paths “off” and “on”, respectively.
To view and compare previous Benchmark results and reference benchmarks provided by Structura, navigate to the “Benchmarks” tab inside the “Manage” panel.
Under each sub-tab (CPU, File System, GPU, Extensive Validation), there will be reference benchmarks provided by Structura which can be used as a comparison against benchmarks run on the current instance.
To compare multiple references, select them from the table using the checkboxes and click the “Compare” button on the top right side of the screen.
In the comparison view, each benchmark is a column and its timings are listed as rows. The overall time the benchmark took is listed under the “Time” sub-column (A), and each timing’s share of the overall time is shown as a percentage in the “Pct” sub-column.
When a benchmark (column) is selected, it becomes the base “Reference” (B1) for the “Speedup” columns (B2), which are available for all other benchmarks in the comparison view. The “Speedup” is calculated as the reference benchmark’s timing divided by the compared benchmark’s timing, which makes it easy to see how much faster or slower a timing is in comparison to the reference.
When a timing is hovered over, its details will be displayed in the “Benchmark Details” section on the right side of the page (C).
To view more detailed timings (available for the GPU benchmark only), click on the “+” button to expand a row (D). These sub-timings are the low-level functions that get called inside of CryoSPARC’s core reconstruction algorithm. The “Tags” column (E) indicates what hardware component each function’s speed is most dependent on.
For example, for the `setup_scales` sub-timing, the relevant component tags are “PCIe Latency/Bandwidth Speed” and “GPU/CPU Memory Allocation”, because the function allocates space for a float32 array in CPU and GPU memory, fills it with data in CPU memory, then downloads the contents of the array from CPU memory to its corresponding location in GPU memory. When a “download” happens, it occurs over the PCIe lanes that connect the CPU to the GPU, where the link speed (determined by, e.g., PCIe Gen. 3 on most GPUs and PCIe Gen. 4 on NVIDIA Ampere and Ada architectures) matters most in determining how fast the transfer completes.
| Tag Name | Most likely bottleneck |
| --- | --- |
| CPU Performance | Single-core clock speed |
| GPU Performance | float32 performance and memory bandwidth |
| PCIe Latency/Bandwidth Speed | PCIe generation (3, 4) and number of lanes per slot (x8, x16) |
| GPU/CPU Memory Allocation | General CPU/GPU performance |
| Input/Output Speed | Random read speed of the storage device where particle images are located |
At the end of the benchmark job, results are saved as a JSON and a CSV file in the job directory. The exact paths of the files are printed at the end of each test in the job’s Event Log.
To view the original Benchmark job that a benchmark was created from, right click on the column header and select “Show job in sidebar”. The JSON and CSV results can also be downloaded from this context menu.
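A minimal sketch for inspecting a saved results file (the file name is an assumption; use the exact path printed in the Event Log or the file downloaded from the context menu):

```python
import json
from pprint import pprint

with open("benchmark_results.json") as f:   # hypothetical file name
    results = json.load(f)

pprint(results)                             # recorded timings and hardware information
```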
The Extensive Workflow job is now called the Extensive Validation job (v4.3.0+).
The Extensive Validation job creates and queues other jobs in a pre-defined workflow. Workflows are available for the EMPIAR-10025 and EMPIAR-10305 datasets, which are downloaded when the job is run. If you run the Extensive Validation job in “Benchmark” mode, each job defined in the workflow runs in sequence. This allows you to compare the overall performance of each job in the Benchmark UI, alongside the CPU, Filesystem, and GPU performance benchmarks.
First, create an “Extensive Validation” job and select “Benchmark” as the value for the “Run Mode” parameter:
In “Benchmark Mode”, jobs that support multi-GPU parallelization (such as Patch Motion Correction, Patch CTF Estimation, and 2D Classification) can be allocated multiple GPUs. To allocate multiple GPUs, specify a number greater than 1 for the “Number of GPUs to use” parameter field, and either select a lane or specify the exact GPUs using the “Run on specific GPUs” tab in the Resource Selection panel.
For more information on the jobs that are launched by the Extensive Validation job in benchmark mode, see the Extensive Validation documentation here:
Guide: Verify CryoSPARC Installation with the Extensive Validation Job (v4.3+)