Guide: Data Management in CryoSPARC (≤v3.3)

An overview of all data management utilities and common use cases.

The information in this guide applies to CryoSPARC ≤v3.3. For information about managing project directories in newer versions, see Guide: Data Management in CryoSPARC (v4.0+).

Single particle cryo-EM projects and labs continue to operate at increasing scales. In CryoSPARC v2.11+, we introduced several critical tools and features for cleaning up, archiving, transferring, exporting, and importing projects. We have also created tools for exporting and importing individual jobs and individual results of various types (particle stacks, exposure stacks, volumes, etc.). Together, these tools allow for workflows like:

  • Exporting a project from one CryoSPARC instance and importing it into another instance

  • Sending a project initially started at a centralized facility along with a user who is going to continue processing at home in their own instance

  • Reducing the disk space used by a project by removing intermediate results created by jobs

  • Archiving a project to remote/slow storage for later retrieval and resurrection

  • Uploading the final results of a CryoSPARC job to online repositories (EMDB/EMPIAR/etc) in a self-contained format that other users can download and import

  • Sharing a project or job with another lab or a friend

  • Advanced manipulation of CryoSPARC metadata at a low level, or programmatically, by exporting a result (e.g., a particle stack), manually modifying the associated .cs files, and importing it again

We have tried to simplify the experience of performing any of the above actions by ensuring that many of the steps involved are automatic or background processes.

1. Self-consistent, self-contained project directories (i.e. no need to export projects)

CryoSPARC workflows are naturally divided into projects. Each project should contain the work and jobs for one or more related data collection sessions that are associated with a given sample/target. Project boundaries are strict, in the sense that files and results from one project cannot be directly used in another project.

In v2.11+, a project directory always contains all the information needed to define that project, and therefore all the information required to import that project into a CryoSPARC instance. This means that a project directory can, at any time, be transferred/renamed/copied and imported as a valid project in any (other) CryoSPARC instance that can read the files. This is true even if the original CryoSPARC instance that created the project is no longer functional. Note that this is different from previous CryoSPARC versions, where project metadata was only stored in the database, and a project directory alone was not sufficient to import the project.

Maintenance of valid project directories is accomplished via the "continuous project export" functionality in v2.11+. Project directories are constantly updated in the background, without interruption for the user. During the initial update to v2.11, a one-time migration will occur to update all projects with their metadata. Thereafter, certain changes that happen within a project trigger an update of the on-disk project information.

Similarly, jobs inside the project are stored in a self-contained format, which is updated whenever jobs are created, modified, or completed. Note: jobs that are in launched, started, running, waiting, killed or failed status will not be updated on disk until they enter completed status - either by actually completing, or by the user choosing the "Mark as Completed" option in the Job Details panel. This action ensures that the job directory is updated, and therefore that a project imported into another instance will contain the latest information about the job.

CryoSPARC v2.11+ also contains a new notification system, which will show system notifications to users in an unobtrusive manner.

2. Ability to import projects (from any valid, intact project directory)

Any intact CryoSPARC project directory (from v2.11+) can be imported, regardless of whether the CryoSPARC instance that originally created the project is running. No action needs to be taken to prepare a project to be imported (see section 1 above).

To import a project, click the "Import" button on the projects page. You will need to provide the path to the project directory.

Once the import process begins, you will see a new project appear in the projects page. This project will have a new project UID (possibly distinct from the project UID of the original project). Once import is complete, users can begin to interact with the project and continue processing.

By default, imported projects will have a title in the form 'Imported: ...', which can later be changed by the user. The timestamp associated with the imported project will be replaced so that project sorting remains consistent.

3. Ability to view instance storage statistics

View the sizes of all jobs and projects in CryoSPARC, updated automatically. Storage statistics are available from the Project Details and Job Details panels, as well as through the newly enhanced Resource Manager. Use these metrics to determine which projects and jobs should be cleaned up to free up the most disk space.

4. Ability to clear intermediate results

Unused intermediate results can be cleared from iterative jobs to save space. This action can be executed at a job level (by clicking the "Clear Intermediate Results" button on the Job Details panel) or at a project level for every job (by clicking the "Clear Intermediate Results" button on the Project Details panel). This function will remove all unused outputs created by iterative jobs that save raw data at every iteration. Final results for every result slot will be retained, whether they have been used elsewhere or not.

5. Ability to export and import individual jobs

In v2.11+, project directories are self-contained and can be imported at any time. Job directories (which live inside project directories), however, are not necessarily self-contained, as they may contain symlinks and references to other files in the project.

Any individual job can nevertheless be exported for sharing, manipulation, or archiving. Jobs must be exported manually in order to create a "consolidated" exported-job directory, which can then be imported.

To export a job, click "Export Job" in the job details panel. This will consolidate all job metadata, event log, image files, and the raw output data into a single folder for easy sharing. All exported jobs are created inside of their corresponding project directory, in a folder named PXX/exports/jobs.

Any job exported this way can be imported back into the same or a different instance by clicking the "Import Job" button from the header while inside a workspace. To import, specify the absolute path of the exported job directory. See the detailed use cases below for a guide that explains how to compress/tar a job directory and send it to another instance or user.

6. Ability to export and import low-level output groups

CryoSPARC jobs create various outputs as they run. These outputs are registered in the CryoSPARC system as "output groups". An output group is a collection of items of a given type (e.g., a stack of particles, a set of movies, a set of volumes). Each item in the output group has associated metadata (e.g., CTF parameters for a particle stack, microscope parameters for a set of movies). All of this per-item metadata is stored in binary tabular form in .cs files. Along with the .cs file, a separate text file in .csg format describes the overall metadata for the group.

In v2.11+, it is possible to export and import individual output groups from a given job, without exporting the entire job. This is particularly convenient for sharing data with others, uploading to online repositories, and for advanced users wishing to make manual modifications to metadata.

Any individual output group can be exported for sharing, manipulation, or archiving. Clicking "Export" in the "Outputs" tab of the job card will consolidate all the result group's outputs so that they are in one place for easy sharing. All exported groups are created inside of their corresponding project directory, in a folder named PXX/exports/groups.

Output groups that are exported can be imported back into the same or different instance by running the "Import Result Group" job, available from the Job Builder. Simply specify the absolute path to the .csg file that was created during export when creating the import job.

Common Use Cases

Use Case: Clear up space used by your instance by removing unused intermediate results

Click the "Clear Intermediate Results" button on iteration-based jobs to save substantial amounts of space used by jobs that output intermediate results after each iteration (2D Classification, Refinements, Ab-Initio, etc.). For example, in the "Outputs" tab of an Ab-Initio job, intermediate versions of particles and the reconstructed volumes are outputted every few hundred or so iterations:

Clearing intermediate results for this job will remove any data from unused output result versions. It will always keep the final iteration's data so that the job can always be reused. You can also clear intermediate results at a project level (found on the project details panel) which will execute the function on every job inside the project.

Use Case: Archive a project directory, then delete the project to free up space used by your instance

Since jobs, workspaces and projects themselves are continuously exported to the file system, no further action needs to be taken to export a project directory for later re-use. The steps to archive a project directory are:

  1. Ensure there are no active jobs in your project using the Resource Manager

  2. Find the project directory, then move or tar/compress it and store it for later resurrection (see below for an example)

  3. Only after the project directory has been moved or compressed, delete the project in CryoSPARC. DO NOT delete the project without first moving or compressing the project directory, as this will delete the entire project and you will lose your work.

In step 2, when compressing your project folder, you need to consider whether you want to "dereference" symbolic links. From tar's manual:

"When --dereference(-h) is used with --create(-c), tar archives the files symbolic links point to, instead of the links themselves."

Import jobs in CryoSPARC create symbolic links to the raw data that you imported. If you use the -h option, these links will be copied into the .tar file as actual files, using up more disk space. If you do not use the -h option, you will need to make sure you separately archive the raw data from the project if it was stored outside the project directory.

An example command to compress and consolidate a project directory is:

cd /u/cryosparcuser/cryosparc_projects/
tar -cvhf P47.tar ./P47

Once the project archive has been successfully created and moved to a secure archive location, you can delete the project in CryoSPARC. Note that you can use any method you choose to archive/transfer/store the project directory, as long as the entire contents remain intact.

Use Case: Share an entire project with another user in a different instance

To share a project with another user, follow the same steps as in the archiving section above, but do not delete the project in your instance. Once you have created a .tar file, you can send it by any available means to another user or machine/system. The receiving user should decompress the .tar file into a directory accessible by CryoSPARC. To keep things tidy, create a new folder called imports inside the parent directory of CryoSPARC's projects.

You can rename the project directory to any name you like.

An example of decompressing a project:

cd /u/cryosparcuser/cryosparc_projects/
mkdir imports
cd imports
tar -xvf P47.tar ./

Once this is complete, on the CryoSPARC projects page, click "Import" and specify the absolute path to the newly extracted project directory.

This will create a new project inside the receiving instance and import all workspaces, jobs and results from the extracted project. You will be able to continue processing within the newly imported project.

Use Case: Share a particular job with another user

Though jobs are continuously exported when changes are made to the project (more details below in FAQs: When is your project/workspace/job exported?), it is still necessary to consolidate a job's outputs into a single folder in order to share the job with another user outside of your instance. To do this, click the "Export Job" button in the job details panel while a job is selected. This will export all images and streamlog events, create .csg files (more details below in FAQ: What are .cs and .csg files?), and symbolically link all of the job's outputs into a folder inside the project directory. The exported job will be found at $PROJECT_DIRECTORY/exports/jobs/<project_uid>_<job_uid>_<job_type>.

For example, when exporting a Homogeneous Refinement job:

cryosparcdev@cryoem5:~/cryosparc_projects/P11/exports/jobs/P11_J87_homo_refine$ ls -al
total 172
drwxr-xr-x 6 cryosparcdev cryosparcdev     8 Sep  9 14:38 .
drwxr-xr-x 7 cryosparcdev cryosparcdev     7 Sep  9 14:37 ..
-rw-r--r-- 1 cryosparcdev cryosparcdev 90618 Sep  9 14:37 events.bson
drwxr-xr-x 2 cryosparcdev cryosparcdev     3 Sep  9 14:37 gridfs_data
-rw-r--r-- 1 cryosparcdev cryosparcdev 24671 Sep  9 14:37 job.json
drwxr-xr-x 3 cryosparcdev cryosparcdev     5 Sep  9 14:38 P11_J87_mask
drwxr-xr-x 4 cryosparcdev cryosparcdev     6 Sep  9 14:38 P11_J87_particles
drwxr-xr-x 3 cryosparcdev cryosparcdev     5 Sep  9 14:38 P11_J87_volume

Each output result group (particles, exposures, mask, volume) will be contained inside its own directory. These can also be imported independently using the "Import Result Group" job.

events.bson

This file contains all the job's streamlog events (all text and references to images seen inside the "Overview" tab of a job). It is encoded in BSON to save space and maintain MongoDB data formats.
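If you want to inspect the streamlog outside of CryoSPARC, a minimal Python sketch using the bson module that ships with pymongo (pip install pymongo) might look like the following; the file path is an example and the event contents are not a documented schema:

import bson  # provided by the pymongo package

# Read and decode every BSON document (one per streamlog event) from the file.
with open("events.bson", "rb") as f:
    events = bson.decode_all(f.read())

# Print the keys of the first few events to see what each document contains.
for event in events[:5]:
    print(sorted(event.keys()))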

job.json

This file contains the job document itself, which has been stripped of instance-specific information (queued_to_lane, resources_allocated, interactive_hostname, ui_layouts, parents, children, output_result_groups, output_results).

gridfs_data/gridfsdata_0

A binary file containing all the job's images (images referenced by the streamlog, tile images, output group images, etc.)

P11_J87_volume

The folder containing the details of the output result group "volume":

This folder will always contain at least two items: a .cs file and a .csg file. The .csg file is a YAML file that contains metadata information related to the output group itself (including the name of the .cs file, type of results, and number of items). The .cs file is a highly-optimized array-based file containing specific metadata for every item in the group. Alongside these files, the consolidated data will be symbolically linked into this folder.

cryosparcdev@cryoem5:~/cryosparc_projects/P11/exports/jobs/P11_J87_homo_refine/P11_J87_volume$ ls -al
total 51
drwxr-xr-x 3 cryosparcdev cryosparcdev    5 Sep  9 14:38 .
drwxr-xr-x 6 cryosparcdev cryosparcdev    8 Sep  9 14:38 ..
drwxr-xr-x 2 cryosparcdev cryosparcdev   10 Sep  9 14:38 J87
-rw-r--r-- 1 cryosparcdev cryosparcdev 1353 Sep  9 14:38 P11_J87_volume_exported.cs
-rw-r--r-- 1 cryosparcdev cryosparcdev  922 Sep  9 14:38 P11_J87_volume_exported.csg

cryosparcdev@cryoem5:~/cryosparc_projects/P11/exports/jobs/P11_J87_homo_refine/P11_J87_volume$ ls -al J87
total 21
drwxr-xr-x 2 cryosparcdev cryosparcdev 10 Sep  9 14:38 .
drwxr-xr-x 3 cryosparcdev cryosparcdev  5 Sep  9 14:38 ..
lrwxrwxrwx 1 cryosparcdev cryosparcdev 87 Sep  9 14:38 cryosparc_P11_J87_006_volume_map_half_A.mrc -> /u/cryosparcdev/cryosparc_projects/P11/J87/cryosparc_P11_J87_006_volume_map_half_A.mrc
lrwxrwxrwx 1 cryosparcdev cryosparcdev 87 Sep  9 14:38 cryosparc_P11_J87_006_volume_map_half_B.mrc -> /u/cryosparcdev/cryosparc_projects/P11/J87/cryosparc_P11_J87_006_volume_map_half_B.mrc
lrwxrwxrwx 1 cryosparcdev cryosparcdev 80 Sep  9 14:38 cryosparc_P11_J87_006_volume_map.mrc -> /u/cryosparcdev/cryosparc_projects/P11/J87/cryosparc_P11_J87_006_volume_map.mrc
lrwxrwxrwx 1 cryosparcdev cryosparcdev 86 Sep  9 14:38 cryosparc_P11_J87_006_volume_map_sharp.mrc -> /u/cryosparcdev/cryosparc_projects/P11/J87/cryosparc_P11_J87_006_volume_map_sharp.mrc
lrwxrwxrwx 1 cryosparcdev cryosparcdev 90 Sep  9 14:38 cryosparc_P11_J87_006_volume_mask_fsc_auto.mrc -> /u/cryosparcdev/cryosparc_projects/P11/J87/cryosparc_P11_J87_006_volume_mask_fsc_auto.mrc
lrwxrwxrwx 1 cryosparcdev cryosparcdev 85 Sep  9 14:38 cryosparc_P11_J87_006_volume_mask_fsc.mrc -> /u/cryosparcdev/cryosparc_projects/P11/J87/cryosparc_P11_J87_006_volume_mask_fsc.mrc
lrwxrwxrwx 1 cryosparcdev cryosparcdev 88 Sep  9 14:38 cryosparc_P11_J87_006_volume_mask_refine.mrc -> /u/cryosparcdev/cryosparc_projects/P11/J87/cryosparc_P11_J87_006_volume_mask_refine.mrc
lrwxrwxrwx 1 cryosparcdev cryosparcdev 86 Sep  9 14:38 cryosparc_P11_J87_006_volume_precision.mrc -> /u/cryosparcdev/cryosparc_projects/P11/J87/cryosparc_P11_J87_006_volume_precision.mrc

Compress the folder (e.g. ~/cryosparc_projects/P11/exports/jobs/P11_J87_homo_refine) to create an archive of the job. Be sure to pass the "dereference" (-h) argument if you want to include all the referenced raw files.

cd ~/cryosparc_projects/P11/exports/jobs/
tar -cvhf P11_J87_homo_refine.tar ./P11_J87_homo_refine

For the receiver: To keep things tidy, extract the job archive into the folder imports/jobs inside the project directory you want to import the job into. This could be in the same or a different instance.

cd /u/cryosparcuser/cryosparc_projects/P5/imports/jobs/
tar -xvf P11_J87_homo_refine.tar ./

To import the job into your project, click the "Import Job" button inside a workspace in the project, and specify the absolute path to the extracted job archive.

Use Case: Upload your particle stack to EMPIAR

To share only a specific output result group, you can click the "Export" button under the "Actions" section of an output result group in the "Outputs" tab. This will consolidate the outputs of the result group into a folder named <project_uid>_<job_uid>_<output_result_group_name> inside the PXX/exports/groups folder of the project. The functionality is similar to what happens when you export a job.

The resulting exported group directory will contain symlinks to the raw data files referred to by the group (e.g., .mrc files for a particle stack). This export directory can then be consolidated into a single archive using tar, or uploaded directly to a repository. Ensure that the symbolic links are followed ("dereferenced") during upload or consolidation.
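For example, here is a minimal Python sketch of consolidating an exported group while dereferencing symlinks; the paths and archive name are examples only and should be adjusted to your own project:

import tarfile

# Path to the exported group directory (example path, following the
# PXX/exports/groups/<project_uid>_<job_uid>_<group_name> convention).
group_dir = "/u/cryosparcuser/cryosparc_projects/P11/exports/groups/P11_J87_particles"

# dereference=True makes tarfile archive the files that symlinks point to,
# so the raw .mrc files are included in the archive.
with tarfile.open("P11_J87_particles.tar", "w", dereference=True) as tar:
    tar.add(group_dir, arcname="P11_J87_particles")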

For a user receiving the exported group: The exported result group can be imported as-is by using the "Import Result Group" job found under the "Imports" section of the job builder. Specify the absolute path to the .csg file of the exported output result group, which will be inside of the export directory. This will import the group of items (eg. particle stack) with all associated metadata, so that it can be used for further processing.

Use Case: Manually modify CryoSPARC outputs and metadata for continued experimentation

Please read the FAQs section for a more detailed explanation of the .cs and .csg file formats used by CryoSPARC. More details about these formats, along with source code for reading and writing them, are forthcoming.

As an advanced user, you may often wish to write scripts or interface with other programs that create or modify metadata associated with output groups from CryoSPARC. For example, you may wish to apply a transformation operator to particle alignments. Or you may wish to re-center particle picks. Or you may wish to create an integration for a third-party particle picking tool.

All of these tasks can be accomplished by creating, updating, and importing .cs and .csg files.

.csg files contain text-format, high-level information about an importable group of items in CryoSPARC. The individual metadata for each item (e.g., particle alignments) are stored in .cs files, which are binary and efficient.

When a job completes processing in CryoSPARC, it creates .csg and .cs files describing all of its outputs. These files are also created when a job or output group is exported.

You can make copies of these files, modify them, and then import them again using the "Import Result Group" job type. For example, to apply a transformation to particle coordinates:

  1. Locate the job directory of the job that created particle locations.

  2. Find and make copies of the .csg and .cs file of the particle output group.

  3. Edit your new copy of the .cs file using python/numpy (see Guide: Manipulating .cs Files Created by CryoSPARC) to apply the desired transformation; a minimal sketch follows this list.

  4. Edit your new copy of the .csg file with any text editor to ensure that it points to the new .cs file path.

    Note: a file path starting with > means that the path should be relative to the .csg file itself.

  5. In CryoSPARC, create an "Import Result Group" job and point it to the new .csg file that you copied. This job will import the particle stack, preserving the identity of the particles (i.e., the uid column in the .cs file) but with the new alignments that you have manually modified.

You can also opt to create .csg and .cs files yourself from scratch and import these as well.
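As an illustration of steps 2-5, here is a minimal Python/numpy sketch. The file names, shift values, and field names (e.g., location/center_x_frac) are examples only; inspect the fields present in your own .cs file and see Guide: Manipulating .cs Files Created by CryoSPARC for the field conventions:

import numpy as np

# Load the copied .cs file: a structured numpy array with one row per particle.
particles = np.load("P11_J87_particles_copy.cs")
print(particles.dtype.names)  # list the metadata fields actually present

# Example transformation: nudge every pick by a small fractional offset.
# The field names below are assumed; verify them against dtype.names first.
particles["location/center_x_frac"] += 0.01
particles["location/center_y_frac"] -= 0.01

# Write the modified array back out. Pass an open file object so numpy does
# not append a ".npy" extension to the filename.
with open("P11_J87_particles_copy.cs", "wb") as f:
    np.save(f, particles)

After editing the copied .csg file to point at this .cs file, the "Import Result Group" job can read the modified stack back into CryoSPARC.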

FAQs

What are .cs and .csg files?

CryoSPARC files (.cs) are numpy-array-wrapped data structures used to store metadata for millions of input and output files in CryoSPARC. Code and details for easily dealing with .cs files are forthcoming. You can load and read a .cs file using the numpy.load function (see the example below).

The structure of a .cs file can be visualized in a tabular manner: each row corresponds to one item in the group, and each column (field) holds one piece of per-item metadata.
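For example, a minimal sketch of loading a .cs file and inspecting its rows and columns; the filename is an example only:

import numpy as np

# Any .cs file from a job directory or an export will load as a structured array.
items = np.load("P11_J87_volume_exported.cs")

print(items.dtype.names)  # the columns: uid plus the per-item metadata fields
print(len(items))         # the number of rows (items in the group)
print(items[0])           # the first row, shown as a tuple of field values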

CryoSPARC Group files (.csg) are YAML-formatted (for readability) text files that hold metadata about the output result groups themselves:

$ cat P11_J87_mask_exported.csg

created: 2019-09-09 18:38:02.835041
group:
  description: Refinement mask that was used.
  name: mask
  title: Mask
  type: mask
results:
  mask_refine:
    metafile: '>P11_J87_mask_exported.cs'
    num_items: 1
    type: volume.blob
version: develop

Advanced users creating their own .csg files should use the above format.
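As an illustration, the following Python sketch (assuming PyYAML is installed) writes a .csg file in the format shown above; the group name, result key, types, and item count are placeholders that must match the .cs file being described:

import datetime
import yaml  # provided by PyYAML (pip install pyyaml)

group = {
    "created": datetime.datetime.now(),
    "group": {
        "description": "Manually created particle stack.",
        "name": "particles",
        "title": "Particles",
        "type": "particle",
    },
    "results": {
        "blob": {
            "metafile": ">my_particles.cs",  # '>' means relative to this .csg file
            "num_items": 10000,
            "type": "particle.blob",
        },
    },
    "version": "v2.11",
}

with open("my_particles.csg", "w") as f:
    yaml.safe_dump(group, f, default_flow_style=False)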

When is your project/workspace/job exported?

Projects and workspaces are exported whenever any updates are made to them.

Jobs are automatically exported at specific times:

  • job creation

  • setting & clearing parameters

  • connecting & disconnecting inputs

  • after job completion

  • after job clearing

  • when job is marked as completed

Jobs can be manually exported using "Export Job" from the Job Details panel.

Outputs can be manually exported using "Export" from the Outputs section of the job card.

What is Database Migration?

When an instance is first updated to v2.11, you will notice a “database migration” notification appear a few minutes after the instance is started. Database migration refers to the background process that will consolidate and ensure the consistency of all your existing projects so that they can be imported correctly subsequently. Migration will write all previously created project metadata to disk (including all workspaces and jobs).

Database migration will export jobs in the running, launched, waiting, and queued statuses as killed jobs, since these jobs have not yet finished. Once the jobs finish, they will be written to disk as completed jobs. All other jobs will retain their status as-is and can be imported into another instance immediately once migration is complete.

Database migration can be safely stopped at any time by turning off CryoSPARC using cryosparcm stop - it will resume from where it left off the next time CryoSPARC is turned on.

Progress of the migration, including any errors, will be shown as notifications (which are new in CryoSPARC v2.11). Previous notifications can be viewed in the Notification Manager found inside the Resource Manager.

How are project and job directory sizes calculated?

Currently, for job sizes, the value seen in the Job Details panel is calculated by walking through every file inside the job directory (following symlinks) and accumulating the sizes reported by the filesystem's inode data. We keep track of all inode numbers and use the st_size value for each file.

For project sizes, we sum up all job directory sizes and report the total. In a future update, an explicit calculation of the project directory size will be used instead, in case users add extra folders to the project directory that CryoSPARC does not keep track of.

Users may find discrepancies between the sizes reported by CryoSPARC and those reported by filesystem tools. Since the value calculated by CryoSPARC follows symlinks, use the -L argument with du to dereference symbolic links and obtain a comparable value. Note that you may also see discrepancies depending on how du accounts for filesystem block sizes.
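For reference, here is a rough Python sketch of this kind of symlink-following size calculation; it is an approximation for comparison purposes (and assumes all symlinks resolve), not the exact implementation used by CryoSPARC:

import os

def directory_size(path):
    # Sum st_size over every file under path, following symlinks,
    # counting each underlying inode only once.
    total = 0
    seen = set()
    for dirpath, dirnames, filenames in os.walk(path, followlinks=True):
        for name in filenames:
            st = os.stat(os.path.join(dirpath, name))  # stat() follows symlinks
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            total += st.st_size
    return total

# Example: compare against `du -sL --apparent-size <project directory>`.
print(directory_size("/u/cryosparcuser/cryosparc_projects/P47"))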
