Guide: SSD Particle Caching in CryoSPARC

Overview of how SSD particle caching works, how much SSD space you need, configuration options and troubleshooting.

Why is particle caching effective?

For classification, refinement, and reconstruction jobs that deal with particles, having local SSDs on worker nodes can significantly speed up computation. Many cryo-EM algorithms rely on random-access patterns and multiple passes through the data, rather than sequentially reading the data once. When you install CryoSPARC, you have the option of adding an ssd_path, which is a fast drive location on the worker node that particles will be copied to and read from when being processed. CryoSPARC manages the SSD cache on each worker node transparently.

When you run jobs that have the Cache particle images on SSD option turned on, particles will be automatically copied to and read from the SSD path specified. Furthermore, if multiple jobs within the same project require the same particles, the cache will be re-used and the copying step is skipped. If more space is needed, previously cached data will be automatically deleted. Setting up an SSD cache is optional on a per-worker node basis, but it is highly recommended. Nodes reserved for pre-processing (motion correction, CTF estimation, particle picking, etc.) do not need to have an SSD.

Hardware

The size of your typical cryo-EM single particle datasets will inform the size of SSD you choose. To store the largest particle stacks, we recommend 2 TB SSDs. You can calculate the exact size of a particle dataset with the following formula:

$$\text{Dataset Size} = (4 \times \text{box\_size}^2 + \text{nsymbt} + \text{header\_length}) \times \text{num\_particles}$$

For example, a 1,000,000-particle dataset with a box size of 256 has a total size of 263.3 GB:

$$(4 \times 256^2 + 128 + 1024) \times 1{,}000{,}000 = 263{,}296{,}000{,}000 \text{ bytes}$$

For example, a 2,000,000-particle dataset with a box size of 432 has a total size of 1.5 TB:

$$(4 \times 432^2 + 128 + 1024) \times 2{,}000{,}000 = 1{,}495{,}296{,}000{,}000 \text{ bytes}$$
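The same arithmetic can be scripted when comparing several datasets against an SSD size. The following is a minimal sketch, not part of CryoSPARC: the function name `dataset_size_bytes` is hypothetical, and the defaults of 128 and 1024 bytes for nsymbt and header_length are simply the values used in the two worked examples.

```shell
# Estimate the size of a particle dataset in bytes.
# Usage: dataset_size_bytes <box_size> <num_particles> [nsymbt] [header_length]
# The defaults (128, 1024) mirror the worked examples and are an assumption for other data.
dataset_size_bytes() {
  local box_size=$1 num_particles=$2
  local nsymbt=${3:-128} header_length=${4:-1024}
  echo $(( (4 * box_size * box_size + nsymbt + header_length) * num_particles ))
}

dataset_size_bytes 256 1000000   # prints 263296000000
dataset_size_bytes 432 2000000   # prints 1495296000000
```

Comparing the result with the capacity of the SSD (less any reserve you plan to set) gives a quick indication of whether a stack will fit in the cache.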

Configuration

Installation

When you connect a worker to your CryoSPARC instance during installation, use the --ssdpath parameter to specify the path to a directory on the worker's SSD. If you don't want to configure an SSD cache for a worker node, specify the --nossd option instead.

bin/cryosparcw connect 
  --worker <worker_hostname> 
  --master <master_hostname> 
  --port <port_num>   
  --ssdpath <ssd_path>             : path to directory on local ssd

By default, if you specify the SSD path then the cache will be enabled with no quota or reserve.

Advanced Parameters

You can specify two advanced parameters to fine-tune your SSD cache:

--ssdquota: The maximum amount of space that CryoSPARC can use on the SSD (MB)

--ssdreserve: The minimum amount of free space to leave on the SSD (MB)

The above options are useful when you're setting up CryoSPARC on a common compute node that will share the SSD with other applications.
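As an illustration of a shared-node setup, the sketch below wraps the connect command in a shell function with a quota and a reserve. Only the flags come from this guide; the hostnames, port, SSD path, sizes, and the function name `connect_worker` are hypothetical placeholders to replace with your own values.

```shell
# Hypothetical placeholder values: substitute your own.
worker_host="worker1.example.org"
master_host="master.example.org"
port=39000
ssd_path="/scratch/cryosparc_cache"

# Both flags take megabytes: cap the cache at about 1 TB
# and always leave about 100 GB free for other applications.
ssd_quota_mb=$(( 1000 * 1000 ))
ssd_reserve_mb=$(( 100 * 1000 ))

# Call this from the directory that contains bin/cryosparcw.
# Add --update if the worker is already connected to the instance.
connect_worker() {
  bin/cryosparcw connect \
    --worker "$worker_host" \
    --master "$master_host" \
    --port "$port" \
    --ssdpath "$ssd_path" \
    --ssdquota "$ssd_quota_mb" \
    --ssdreserve "$ssd_reserve_mb"
}
```

Defining the command as a function keeps the values in one place and lets you review them before running `connect_worker`.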

Updating Configuration

You can update the SSD configuration at any time by running the connect command with the --update flag:

bin/cryosparcw connect
  --worker <worker_hostname>
  --master <master_hostname>
  --port <port_num>
  --update                         : update an existing worker configuration
  [--nossd]                        : connect worker with no SSD
  [--ssdpath <ssd_path> ]          : path to directory on local ssd
  [--ssdquota <ssd_quota_mb> ]     : quota of how much SSD space to use (MB)
  [--ssdreserve <ssd_reserve_mb> ] : minimum free space to leave on SSD (MB)

Use

Use the caching system when running a job

When you are running jobs that process particles (for example: Ab-Initio, Homogeneous Refinement, 2D Classification, 3D Variability), you will find a parameter at the bottom of the job builder under "Compute Settings" called Cache particle images on SSD. Turn this option off to load raw data from their original location instead.

Set a default parameter for the project

By default, the Cache particle images on SSD parameter is always on for every job you build, but if you'd like to keep this option off across all jobs in a project, you can set a project-level default.

In v2.15+, the parameter can be adjusted from the sidebar when a project is selected.

In earlier versions of CryoSPARC, you can adjust this parameter by running the following command in a shell on the master node:

cryosparcm cli "set_project_param_default('PX', 'compute_use_ssd', False)"

where 'PX' is the UID of the project you'd like to set the default for (e.g., 'P2').

You can undo this setting by running:

cryosparcm cli "unset_project_param_default('PX', 'compute_use_ssd')"

Tips and Tricks

Consolidating a Particle Stack

When caching a particle stack that is larger than the space available on your SSD, you may optionally consolidate your particle stack. This option works if the current particle stack is a subset of the original particle stack. For example, the cache reports how much data it is requesting to copy in messages such as `SSD cache : cache requires 1000000.00 MB more on the SSD for files to be downloaded.` and `SSD cache : cache successfully requested to check 2000000 files.` If the reported sizes seem much larger than you expected, you can consolidate your particle stack so that only the particle subset you care about is cached.

You might run into this situation if you ran an "Inspect Picks" job after an "Extract From Micrographs" job, and you modified the picking thresholds of your particles to include a smaller subset than the original stack.

You might also run into this situation after a round of 2D Classification. When you select classes, you create metadata that specifies which subset of the particle stack to use. When using this particle subset in further processing, the caching system will require the entire stack of particles to be cached, even though only the smaller subset is required.

To consolidate your particle stack, build a "Downsample Particles" job, connect your particles, and run the job. There is no need to change any parameters - nothing will change about your particle dataset except for the .cs metafile that will be recreated to reflect the smaller subset. You can use this smaller dataset to continue processing.

Dynamic SSD Cache Paths

On some systems it is not possible to know the SSD cache path ahead of time. Instead, a dynamically-generated path is available for jobs to use at run-time.

To prompt a job to use this path, make the path available via a system-defined environment variable. Open cryosparc_worker/config.sh for editing and set the value of CRYOSPARC_SSD_PATH in the worker environment config to this variable:

# cryosparc_worker/config.sh
export CRYOSPARC_LICENSE_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
export CRYOSPARC_USE_GPU=true
export CRYOSPARC_CUDA_PATH="/usr/local/cuda"
export CRYOSPARC_SSD_PATH="$CUSTOM_DYNAMIC_SSD_PATH"
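Because the path only exists at run-time, it can help to confirm that the variable actually resolves to a writable directory inside the job environment. The check below is a minimal sketch and not part of CryoSPARC: `check_ssd_path` is a hypothetical helper, and CUSTOM_DYNAMIC_SSD_PATH stands in for whatever variable your system defines.

```shell
# Verify that a candidate SSD cache path is set, exists, and is writable.
# Prints "ok: <path>" on success; prints a reason to stderr and returns 1 otherwise.
check_ssd_path() {
  local path=$1
  if [ -z "$path" ]; then
    echo "SSD path variable is empty or unset" >&2
    return 1
  fi
  if [ ! -d "$path" ]; then
    echo "not a directory: $path" >&2
    return 1
  fi
  if [ ! -w "$path" ]; then
    echo "not writable: $path" >&2
    return 1
  fi
  echo "ok: $path"
}

# Example: check the placeholder variable used in config.sh above.
check_ssd_path "${CUSTOM_DYNAMIC_SSD_PATH:-}" || true
```

Running this from within a job submission script, where the variable is populated, shows whether the worker will see a usable cache directory.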

Increase or Reduce Cache Files Lifetime

As of CryoSPARC v3.3, jobs that require cache automatically remove cache files that have not been accessed in over 30 days. If projects on your instance get actively worked on for more or less time, you may change the cache file lifetime by adding the following line to cryosparc_master/config.sh:

export CRYOSPARC_SSD_CACHE_LIFETIME_DAYS=15

Substitute 15 with the number of days your projects typically get worked on.
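To get a feel for how much data a given lifetime would affect, you can list files on the SSD whose last access is older than a chosen number of days. This is a read-only sketch using standard `find`; the helper name `stale_cache_files` is hypothetical, and this guide does not state how CryoSPARC itself determines last access. Access times also depend on the filesystem mount options (for example, noatime), so treat the output as an estimate.

```shell
# List regular files under a directory that were last accessed more than N days ago.
# Usage: stale_cache_files <ssd_path> [days]   (days defaults to 30)
# This only lists files; it does not delete anything.
stale_cache_files() {
  local path=$1 days=${2:-30}
  find "$path" -type f -atime +"$days"
}

# Example (hypothetical path): stale_cache_files /scratch/cryosparc_cache 15
```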

Leveraging Multiple Threads to Copy Particles

In CryoSPARC v4.3.0+, multiple threads (default 2) are used to copy particles from the project directory to the local cache device. To modify the number of threads used, add the following line to cryosparc_worker/config.sh, where num_threads is the number of threads (e.g., 12) to spawn to copy files:

export CRYOSPARC_CACHE_NUM_THREADS=num_threads

Specify export CRYOSPARC_CACHE_NUM_THREADS=1 to turn off multithreading and copy particle files sequentially in the main process.

Troubleshooting

`SSD cache : cache waiting for requested files to become unlocked.`

This temporary message usually means the files this job is trying to access are currently being cached by another job. For example, suppose you start two Refinement jobs (Job A and Job B) at the same time on the same node, both using the same particle stack, which hasn't been cached on the SSD before. Both jobs first try to copy all particles onto the SSD. If Job A acquires the lock for the files first, it starts copying them and Job B shows this message. When Job A finishes copying the files, it unlocks them. Job B then proceeds, finds that the particles are already on the SSD, and skips the copy step.

`SSD cache : cache does not have enough space for download... but there are no files that can be deleted.`

This message means that there is another CryoSPARC job or another application on the workstation taking up space on the SSD. If the former, the job showing this message will try to free up space as soon as it can, and it will continue processing. If there are files on the SSD that are not owned by CryoSPARC, it will not be able to delete them. It may be necessary to delete them manually.
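Before deleting anything manually, it is worth checking what is occupying the SSD. The sketch below uses standard `df` and `du` (with GNU `sort -h`) to show free space and the largest top-level entries; the function name `ssd_usage_report` and the example path are hypothetical.

```shell
# Show free space on the filesystem holding the SSD cache,
# followed by the ten largest top-level entries under the given path.
# Usage: ssd_usage_report <ssd_path>
ssd_usage_report() {
  local path=$1
  df -h "$path" || return 1
  # Errors (e.g. an empty directory or unreadable entries) are suppressed.
  du -sh "$path"/* 2>/dev/null | sort -rh | head -n 10
}

# Example (hypothetical path): ssd_usage_report /scratch/cryosparc_cache
```

Entries that do not belong to CryoSPARC are the ones it cannot remove on its own.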

FAQ

Is it safe to manually delete cache files for completed or unqueued/cleared jobs? Also, can I pre-cache with symlinks to skip caching?

Yes, it is safe to delete cache files at any time (it's a read-only cache). And yes, the cache checks whether files exist based only on path, size, and modification date, so symlinks should cause it to skip the copy. It may be easier, though, to set the Cache particle images on SSD parameter to False in each job that you queue.

Source: How to clear the cache in v2?

