Hardware and System Requirements

Description of cryoSPARC HPC software system architecture, typical setups (e.g., workstation, cluster).

cryoSPARC System Architecture Overview

CryoSPARC is a high-performance computing software system, comprising backend and frontend components, that provides data processing and image analysis capabilities for single particle cryo-EM, along with a rich browser-based user interface and command line tools.

Master-worker pattern

The system is based on a master-worker pattern.

  • The master processes (web application, core application and MongoDB database) run together on one node (the master node). The master node requires relatively lightweight resources (2+ CPUs, 16GB+ RAM).

  • Worker processes can be spawned on any available/configured node that has GPUs (worker node). The worker is responsible for all actual computation and data handling, and is dispatched by the master node. Note: The same node can function as both master and worker.

The master-worker architecture allows cryoSPARC to be installed and scaled up in a flexible manner, on a variety of hardware, including a single workstation, groups of workstations, cluster nodes, HPC clusters, cloud nodes, etc.

Core components included in the cryoSPARC system

Typical cryoSPARC System Setups

CryoSPARC can support a heterogeneous mixture of all typical setups in a single instance. This means you can start off by installing cryoSPARC on a single workstation, then connect a worker node or cluster whenever your data processing needs increase.

Single Workstation

Single cryoSPARC workstation example, where the master and worker processes run on a single machine.

Both the cryoSPARC master and cryoSPARC worker processes can be run on the same machine. The only requirement is that GPU resources are available for the cryoSPARC worker processes. This is the simplest setup.

Master-Worker

Single-master, multiple-worker example. All nodes must have access to a shared file system.

In the master-worker setup, the cryoSPARC master is installed on a lightweight machine, and the worker processes are installed on one or more GPU servers. This is the most flexible setup for installing cryoSPARC. There are three main requirements for this setup, which are also explained in greater detail in the Installation sections of this document:

1) All nodes have access to a shared file system. This file system is where the project directories are located, allowing all nodes to read and write intermediate results as jobs start and complete.

2) The master node has password-less SSH access to each of the worker nodes. SSH is used to execute jobs on the worker nodes from the master node.

3) All worker nodes have TCP access to 10 consecutive ports on the master node (default ports are 39000-39010). These ports are used for metadata communication via HTTP Remote Procedure Call (RPC) based API requests.

It is also possible to use one of the worker nodes as the master, in which case no standalone master node is necessary. This can sometimes lead to instability, though: if the GPU worker node hangs or runs out of RAM, the master processes running the web application, database, etc. will also hang.
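As a quick sanity check for requirements 2 and 3 above, something along the following lines can be run (hostnames, the user account, and the port range are placeholders; this assumes a netcat variant that supports `-z` is installed):

```bash
#!/usr/bin/env bash
# Sketch: verify master-worker connectivity requirements (hostnames are placeholders).

MASTER=cryoem-master                # master node hostname (placeholder)
WORKERS="gpu-node-01 gpu-node-02"   # worker node hostnames (placeholders)
PORTS=$(seq 39000 39010)            # default cryoSPARC port range

# Requirement 2: password-less SSH from the master to each worker.
# If this prompts for a password, set up key-based authentication first, e.g.:
#   ssh-keygen -t rsa && ssh-copy-id cryosparcuser@gpu-node-01
for w in $WORKERS; do
    ssh -o BatchMode=yes cryosparcuser@$w 'echo "SSH to $(hostname) OK"' \
        || echo "SSH to $w FAILED"
done

# Requirement 3: TCP access from each worker back to the master's ports.
# Run this part on each worker node (nc exits 0 if the port is reachable).
for p in $PORTS; do
    nc -z -w 2 $MASTER $p && echo "port $p reachable" || echo "port $p BLOCKED"
done
```

Note that the master's ports will only accept connections while the cryoSPARC master processes are running.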

Clusters

CryoSPARC cluster integration example where both nodes have access to a shared file system

The master node can also spawn or submit jobs to a cluster scheduler system (e.g., Slurm Workload Manager). This integration is transparent and works in a similar fashion to the master-worker setup explained above, except that all resource scheduling is handled by the cluster scheduler, and cryoSPARC's scheduler is only used for orchestration and management of jobs. Similar requirements apply:

1) All nodes have access to a shared file system. This file system is where the project directories are located, allowing all nodes to read and write intermediate results as jobs start and complete.

2) All worker nodes have TCP access to 10 consecutive ports on the master node (default ports are 39000-39010). These ports are used for metadata communication via HTTP Remote Procedure Call (RPC) based API requests.

For a cluster setup, the master node can be a regular cluster node (or even a login node) if this makes networking requirements easier, but the cryoSPARC master processes must be run continuously. If the master is to be run on a regular cluster node, the node may need to be requested from your scheduler in interactive mode or for an indefinitely running job.

Supported cluster schedulers

CryoSPARC supports most cluster schedulers, including SLURM, SGE and PBS. Example cluster configurations for popular schedulers are provided in the installation documentation.
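As a rough illustration of what a SLURM integration can look like, below is a trimmed-down submission script template in the style used by cryoSPARC's cluster integration. The Jinja-style variables (e.g. {{ num_cpu }}, {{ num_gpu }}, {{ ram_gb }}, {{ run_cmd }}) are filled in by cryoSPARC at submission time; the partition name and exact directives are placeholders to adapt to your cluster, and the authoritative templates are in the installation documentation:

```bash
#!/usr/bin/env bash
## Sketch of a cryoSPARC SLURM submission script template (cluster_script.sh).
## The {{ ... }} variables are substituted by cryoSPARC when a job is submitted.
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --partition=gpu                    # placeholder: your GPU partition
#SBATCH --ntasks={{ num_cpu }}             # CPUs requested by the job
#SBATCH --gres=gpu:{{ num_gpu }}           # GPUs requested by the job
#SBATCH --mem={{ (ram_gb*1000)|int }}M     # RAM requested by the job
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}

# The complete cryoSPARC worker command for this job, provided by the master.
{{ run_cmd }}
```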

cryoSPARC System Requirements

The following are requirements for every master and worker node in the system unless otherwise specified. The user account cryosparcuser is a service account for hosting the cryoSPARC master processes and running cryoSPARC jobs on worker nodes. You can in fact use any user account or name (other than root), but we recommend creating a dedicated user account to serve as the cryoSPARC service account.
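For example, on most distributions the service account can be created along these lines (the account name is just the recommended convention; adapt to your site's user-management tooling):

```bash
# Create a dedicated, non-root service account for cryoSPARC.
# Admin privileges are needed only for this step; cryoSPARC itself runs unprivileged.
sudo useradd --create-home --shell /bin/bash cryosparcuser
sudo passwd cryosparcuser
```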

| Component | Requirement |
| --- | --- |
| Operating System | Any modern Linux OS (e.g., Ubuntu, CentOS) |
| Shell | Bash |
| User Account | cryosparcuser |
| Software | CUDA ≥9.2, ≤10.2 (worker nodes only); GCC; curl (optional) |
| Filesystem | Shared file system across all nodes |
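A quick way to confirm the software prerequisites on a worker node is shown below (the commands only report versions; the acceptable CUDA range is the one listed above):

```bash
# Quick check of worker-node software prerequisites.
gcc --version     # C compiler
nvcc --version    # CUDA Toolkit version (worker nodes only; must fall in the supported range)
nvidia-smi        # confirms the NVIDIA driver can see the GPUs
```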

Master Node/Cluster Master Minimum Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| CPU | 4+ cores | 8+ cores at 2.8GHz+ |
| RAM | 16GB+ | 32GB DDR4 |
| Fast Local Storage | Not required | Not required |
| GPU | Not required | Not required |
| Network | 1Gbps link to storage servers | 10Gbps link to storage servers |

A 10Gbps connection to the storage servers is recommended because raw cryo-EM movies can total several TB, and I/O bottlenecks are often more of a concern than processing power for pre-processing jobs in cryoSPARC. Also, even though a higher core count is recommended, a CPU with a faster clock rate is more advantageous for the master node, as the master processes benefit more from per-core performance than from parallelism.

Worker Node/Cluster Worker Minimum Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| CPU | 2+ cores per GPU | 4 cores per GPU |
| RAM | 16GB+ per GPU | 32GB DDR4 per GPU |
| Fast Local Storage | 1TB SSD | 2TB PCIe SSD |
| GPU | 1+ NVIDIA GPU with compute capability 3.5+ and 11GB+ VRAM | 1+ NVIDIA Tesla V100, RTX 2080 Ti, etc. |
| Network | 1Gbps link to storage servers | 10Gbps link to storage servers |

System RAM is very important for worker nodes, and should scale proportionately to the number of GPUs available for processing on the system. Fast local storage is also necessary as reconstruction jobs require random access to particle images. SSDs can provide high throughput in this context.

Disks and compression

Fast disks are a necessity for processing cryo-EM data efficiently. Fast sequential read/write throughput is needed during the pre-processing stages (e.g., motion correction), where the volume of data is very large (tens of TB) while the amount of computation is relatively low (sequential processing for motion correction, CTF estimation, particle picking, etc.).

Spinning disk arrays in a RAID configuration are used to store large raw data files, and often cluster file systems are used for larger systems. As a rule of thumb, to saturate a 4-GPU machine during pre-processing, sustained sequential read of 1000MB/s is required.
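A rough way to estimate sustained sequential read speed from the raw data storage is to time a read of a large existing file (the path below is a placeholder; `dd` reports the achieved throughput when it finishes):

```bash
# Rough sequential-read benchmark against the shared storage (placeholder path).
# Use a file much larger than system RAM, or drop caches first, so the page cache
# does not inflate the result.
dd if=/path/to/raw_movies/large_movie_stack.tif of=/dev/null bs=1M status=progress
```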

Compression can greatly reduce the amount of data stored in movie files, and it also greatly speeds up pre-processing because decompression is actually faster than reading uncompressed data straight from disk. Typically, counting-mode movie files are stored in LZW-compressed TIFF format without gain correction, so the gain reference file is stored separately and must be applied on-the-fly during processing (which cryoSPARC supports). Compressing gain-corrected movies often results in much worse compression ratios than compressing pre-gain-corrected (integer count) data.

CryoSPARC natively supports LZW-compressed TIFF format and BZ2-compressed MRC format. In either case, the gain reference must be supplied as an MRC file. Decompression of both formats is performed on-the-fly across multiple CPU cores.
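To confirm how a given movie stack is stored, a tool such as `tiffinfo` (from libtiff, which may need to be installed separately) can be used to inspect the compression scheme of a TIFF movie (the path is a placeholder):

```bash
# Inspect a TIFF movie's metadata; look for "Compression Scheme: LZW" in the output.
tiffinfo /path/to/raw_movies/movie_0001.tif | grep -i compression
```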

Solid State Storage (SSDs)

SSD space is optional on a per-worker node basis, but is highly recommended for worker nodes that will be running refinements and reconstructions using particle images. Nodes reserved for pre-processing (motion correction, particle picking, CTF estimation, etc) do not need to have an SSD.

CryoSPARC particle processing algorithms rely on random-access patterns and multiple passes through the data, rather than a single sequential read. Using a storage medium that allows for fast random reads will speed up processing dramatically.

CryoSPARC manages the SSD cache on each worker node transparently. Files are automatically cached, re-used across the same project and deleted if more space is needed. Please see the SSD Caching tutorial for more information.
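The cache location is specified when a worker is connected to the master. As a sketch (hostnames and paths are placeholders; the exact flags for your version are documented in the installation section), the connection command looks something like:

```bash
# Run on the worker node from the cryosparc_worker directory.
# --ssdpath points at the fast local SSD to use as the particle cache;
# --nossd can be used instead on nodes that have no SSD and will only run
# pre-processing jobs.
./bin/cryosparcw connect \
    --worker gpu-node-01 \
    --master cryoem-master \
    --port 39000 \
    --ssdpath /scratch/cryosparc_cache
```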

Graphical Processing Units (GPUs)

At least one worker node must have GPUs available to be able to run the complete set of cryoSPARC jobs, but non-GPU workers can also be connected to run CPU-only jobs. The worker nodes connected to a cryoSPARC instance can be running different CUDA versions. This can be useful if machines with older GPUs (that require older versions of CUDA) are still being used. To keep installation simple, it's best to connect worker nodes that all use the same version of the CUDA Toolkit.

The GPU memory (VRAM) on each GPU limits the maximum particle box size that can be reconstructed. Typically, a GPU with 12GB of VRAM can handle a box size of up to 700^3, and up to 1024^3 in some job types.
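As a rough, back-of-envelope illustration of why VRAM becomes the limit (this is not an exact memory model): a single 700^3 volume stored in single precision occupies about 700 × 700 × 700 × 4 bytes ≈ 1.4GB, and refinement jobs hold several such volumes plus Fourier-space copies and working buffers at once, so the footprint quickly approaches the 11-12GB available on common GPUs.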

Notable GPUs

Older GPUs can often perform almost equally as well as the newest, fastest GPUs because most computations in cryoSPARC are not bottlenecked by GPU compute speed, but rather by GPU memory bandwidth and disk I/O speed.

Browser Requirements

The cryoSPARC web interface works best on the latest version of Google Chrome. Firefox and Safari are also an option, although some features may not work as intended. Internet Explorer is not supported.

Additional Configuration Notes

Root Access

The cryoSPARC system is specifically designed not to require root access to install or use. The reason for this is to avoid security vulnerabilities that can occur when a network application (web interface, database, etc.) is hosted as the root user. For this reason, the cryoSPARC system must be installed and run as a regular UNIX user (cryosparcuser), and all input and output file locations must be readable and writable as this user. In particular, this means that project input and output directories stored within a regular user's home directory need to be accessible by cryosparcuser, or else (more commonly) another location on a shared file system must be used for cryoSPARC project directories.

Multi-user environment

If you are installing the cryoSPARC system for use by a number of users (for example within a lab), there are two options:

Using UNIX Groups

1) Create a new regular user (cryosparcuser) and install and run cryoSPARC as this user.

2) Create a cryoSPARC project directory (on a shared file system) where project data will be stored, and create sub-directories for each lab member. If extra security is necessary, use UNIX group privileges to make each sub-directory readable and writable only by cryosparcuser and the appropriate lab member's UNIX account (a sketch of this follows below).

3) Within the cryoSPARC command-line interface, create a cryoSPARC user account for each lab member, and have each lab member create their projects within their respective project directories.

This method relies on the cryoSPARC web application to limit each user to seeing only their own projects. This is not guaranteed security: malicious users who try hard enough will be able to modify the system to see the projects and results of other users.
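A minimal sketch of the group-permissions approach described above (group, user, and directory names are placeholders; adapt to your site's conventions):

```bash
# Shared top-level project directory on the shared file system (placeholder path).
sudo mkdir -p /data/cryosparc_projects

# One sub-directory per lab member, readable/writable only by cryosparcuser
# and that member, via a dedicated UNIX group per member (placeholder names).
sudo groupadd alice_cryosparc
sudo usermod -aG alice_cryosparc alice
sudo usermod -aG alice_cryosparc cryosparcuser

sudo mkdir -p /data/cryosparc_projects/alice
sudo chown cryosparcuser:alice_cryosparc /data/cryosparc_projects/alice
sudo chmod 2770 /data/cryosparc_projects/alice   # rwx for owner and group, setgid, no access for others
```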

Using Separate cryoSPARC Instances

If each user must be guaranteed complete isolation and security of their projects, each user must install cryoSPARC independently within their own home directories. Projects can be kept private within user home directories as well, using UNIX permissions. Multiple single-user cryoSPARC master processes can be run on the same master node, and they can all submit jobs to the same cluster scheduler system. This method relies on the UNIX system for security and is more tedious to manage but provides stronger access restrictions. Each user will need to have their own cryoSPARC license ID in this case.
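Because each master instance needs its own block of TCP ports, the base port is chosen at install time. As a sketch only (license variables and paths are placeholders, and the full set of installer options is covered in the installation section), two users on the same node might install with non-overlapping port ranges:

```bash
# Sketch: each user installs their own master in their home directory,
# with a non-overlapping base port (each instance uses a block of consecutive ports).

# User A (placeholder license variable), default base port:
./cryosparc_master/install.sh --license "$LICENSE_ID_A" --port 39000

# User B, on the same node, a different base port:
./cryosparc_master/install.sh --license "$LICENSE_ID_B" --port 40000
```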

Hardware Vendors and Example Systems

We do not currently partner with any specific hardware vendors to sell machines with cryoSPARC pre-installed, but if you email info@structura.bio we can point you to a number of vendors we have worked with and who offer compatible turnkey systems.

Example Hardware Systems

We have provided details for example workstations that meet or exceed the minimum requirements specified above, including those we use internally for development and testing.

The "medium" workstation example is a great starting point for processing cryo-EM data using cryoSPARC, whereas the "large" workstation example details an ideal hardware setup.

Structura Bio Testing Hardware

| Component | Hardware Product |
| --- | --- |
| CPU | AMD Threadripper 2950X |
| Memory | 128GB DDR4 @ 2933MHz |
| Storage | 2TB PCIe SSD (cache); 100TB RAID 6 storage server via 10Gbps link (raw movies) |
| GPU | 4x NVIDIA GV100 |

Medium Workstation

| Component | Hardware Product |
| --- | --- |
| CPU | 16 cores (base clock 3.0GHz+) |
| Memory | 128GB DDR4 |
| Storage | 1TB PCIe SSD (cache); HDD storage server in RAID configuration (raw movies) |
| GPU | 4x NVIDIA RTX 2080 Ti |

Large Workstation

| Component | Hardware Product |
| --- | --- |
| CPU | 32 cores (base clock 2.8GHz+) |
| Memory | 256GB DDR4 |
| Storage | 4TB PCIe SSD (cache); HDD storage server in RAID configuration (raw movies) |
| GPU | 4x NVIDIA Tesla V100 or 4x NVIDIA RTX 8000 |