Guide: Data Management in CryoSPARC (≤v3.3)
An overview of all data management utilities and common use cases.
The information in this guide applies to CryoSPARC ≤v3.3. For information about managing project directories in newer versions, see Guide: Data Management in CryoSPARC (v4.0+)
Single particle cryo-EM projects and labs continue to operate at increasing scales. In CryoSPARC v2.11+, we introduce several critical tools and features for cleaning up, archiving, transferring, exporting, and importing projects. We have also created tools for exporting and importing individual jobs and individual results of various types (particle stacks, exposure stacks, volumes, etc). Together, these tools allow for workflows like:
Exporting a project from one CryoSPARC instance and importing it into another instance
Sending a project initially started at a centralized facility along with a user who is going to continue processing at home in their own instance
Reducing the disk space used by a project by removing intermediate results created by jobs
Archiving a project to remote/slow storage for later retrieval and resurrection
Uploading the final results of a CryoSPARC job to online repositories (EMDB/EMPIAR/etc) in a self-contained format that other users can download and import
Sharing a project or job with another lab or a friend
Advanced manipulation of CryoSPARC metadata at a low level, or programmatically, by exporting a result (e.g., a particle stack), manually modifying the associated .cs files, and importing it again
We have tried to simplify the experience of performing any of the above actions by ensuring that many of the steps involved are automatic or background processes.
CryoSPARC workflows are naturally divided into projects. Each project should contain the work and jobs for one or more related data collection sessions that are associated with a given sample/target. Project boundaries are strict, in the sense that files and results from one project cannot be directly used in another project.
In v2.11+, a project directory always contains all the information needed to define that project, and therefore all the information required to import that project into a CryoSPARC instance. This means that a project directory can, at any time, be transferred/renamed/copied and imported as a valid project in any (other) CryoSPARC instance that can read the files. This is true even if the original CryoSPARC instance that created the project is no longer functional. Note that this is different from previous CryoSPARC versions, where project metadata was only stored in the database, and a project directory was not sufficient to import the project.
Maintenance of valid project directories is accomplished via "continuous project export" functionality in v2.11+. Project directories are constantly being updated without interruption for the user. During the initial update to v2.11, a one-time migration will occur to update all projects with their metadata. Thereafter, certain changes that happen within a project trigger an update of the on-disk project information.
Similarly, jobs inside the project are stored in a self-contained format, and this is updated whenever the jobs are created, modified, or completed. Note: jobs in launched, started, running, waiting, killed, or failed status will not be updated on disk until they enter completed status - either by actually completing, or by the user choosing the "mark as completed" option in the Job Details panel. This action ensures that the job directory is updated, and therefore that importing the project into another instance will include the latest information about the job.
CryoSPARC v2.11+ also contains a new notification system, which will show system notifications to users in an unobtrusive manner.
Any intact CryoSPARC project directory (from v2.11+) can be imported, regardless of whether the CryoSPARC instance that originally created the project is running. No action needs to be taken to prepare a project to be imported (see section 1 above).
To import a project, click the "Import" button on the project page. You will need to provide the path to the project directory.
Once the import process begins, you will see a new project appear in the projects page. This project will have a new project UID (possibly distinct from the project UID of the original project). Once import is complete, users can begin to interact with the project and continue processing.
By default, imported projects will have a title in the form: 'Imported: ...' which can be later changed by the user. The timestamp associated with the imported project will be replaced so that project sorting remains consistent.
View the sizes of all jobs and projects in CryoSPARC, updated automatically. Storage statistics are available from the Project Details and Job Details panels, as well as through the newly enhanced Resource Manager. Use these metrics to determine which projects and jobs should be cleaned up to free up the most disk space.
Unused intermediate results can be cleared from iterative jobs to save space. This action can be executed at a job level (by clicking the "Clear Intermediate Results" button on the Job Details panel) or at a project level for every job (by clicking the "Clear Intermediate Results" button on the Project Details panel). This function will remove all unused outputs created by iterative jobs that save raw data at every iteration. Final results for every result slot will be retained, whether they have been used elsewhere or not.
In v2.11+, project directories are self-contained and can be imported any time. Job directories (which live inside projects) however, are not necessarily self-contained, as they may contain symlinks and references to other files in the project.
Any individual job can nonetheless be exported for sharing, manipulation, or archiving. Jobs must be exported manually in order to create a "consolidated" exported-job directory, which is then importable.
To export a job, click "Export Job" in the job details panel. This will consolidate all job metadata, the event log, image files, and the raw output data into a single folder for easy sharing. All exported jobs are created inside their corresponding project directory, in a folder named PXX/exports/jobs.
Any job exported this way can be imported back into the same or a different instance by clicking the "Import Job" button in the header while inside a workspace. To import, specify the absolute path of the directory containing the job to import. See the detailed use cases below for a guide to compressing/tarring a job directory and sending it to another instance or user.
CryoSPARC jobs create various outputs as they run. These outputs are registered in the CryoSPARC system as "output groups". An output group is a collection of items of a given type (e.g., a stack of particles, a set of movies, a set of volumes, etc). Each item in the output group has associated metadata (e.g., CTF parameters for a particle stack, microscope parameters for a set of movies, etc). All of this per-item metadata is stored in binary tabular form in .cs files in CryoSPARC. Along with the .cs file, a separate text file in .csg format describes the overall metadata for the group.
In v2.11+, it is possible to export and import individual output groups from a given job, without exporting the entire job. This is particularly convenient for sharing data with others, uploading to online repositories, and for advanced users wishing to make manual modifications to metadata.
Any individual output group can be exported for sharing, manipulation, or archiving. Clicking "Export" in the "Outputs" tab of the job card will consolidate all the result group's outputs so that they are in one place for easy sharing. All exported groups are created inside their corresponding project directory, in a folder named PXX/exports/groups.
Output groups that have been exported can be imported back into the same or a different instance by running the "Import Result Group" job, available from the Job Builder. When creating the import job, simply specify the absolute path to the .csg file that was created during export.
Click the "Clear Intermediate Results" button on iteration-based jobs to save substantial amounts of space used by jobs that output intermediate results after each iteration (2D Classification, Refinements, Ab-Initio, etc.). For example, in the "Outputs" tab of an Ab-Initio job, intermediate versions of the particles and the reconstructed volumes are output every few hundred iterations.
Clearing intermediate results for this job will remove any data from unused output result versions. The final iteration's data is always kept so that the job can still be reused. You can also clear intermediate results at the project level (from the project details panel), which will execute the function on every job inside the project.
Since jobs, workspaces and projects themselves are continuously exported to the file system, no further action needs to be taken to export a project directory for later re-use. The steps to archive a project directory are:
Ensure there are no active jobs in your project using the Resource Manager
Find the project directory, then move or tar/compress it and store it for later resurrection (see below for an example)
Only after the project directory has been moved or compressed, delete the project in CryoSPARC. DO NOT delete the project without moving or compressing the project directory as this will delete the entire project and you will lose your work.
In step 2), when compressing your project folder, you need to consider whether you want to "dereference" symbolic links. From tar's manual: "When --dereference (-h) is used with --create (-c), tar archives the files symbolic links point to, instead of the links themselves."
Import jobs in CryoSPARC create symbolic links to the raw data that you imported. If you use the -h option, these links will be copied into the .tar file as actual files, using up more disk space. If you do not use the -h option, you will need to make sure you separately archive the raw data if it was stored outside the project directory.
An example command to compress and consolidate a project directory is:
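For instance, a sketch of the compress step. The project name P11 and the demo setup below are placeholders; in practice you would run only the tar line from the parent folder of your real project directory:

```shell
# Demo stand-in for a real project directory (in practice, skip this setup
# and run tar from the parent folder of your project).
mkdir -p demo/cryosparc_projects/P11/J1
echo "movie data" > demo/raw_movie.mrc
ln -s "$PWD/demo/raw_movie.mrc" demo/cryosparc_projects/P11/J1/movie.mrc

# Compress the project directory. -h dereferences symbolic links, storing
# the linked raw data as real files (self-contained, but larger). Drop -h
# to archive only the links; then archive the raw data separately.
tar -czhf P11.tar.gz -C demo/cryosparc_projects P11

# Always verify the archive contents before deleting the project.
tar -tzf P11.tar.gz
```

Pass -h only if you want the raw data pulled into the archive; otherwise, archive the raw data separately as noted above.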
Once the project archive has been successfully created and moved to a secure archive location, you can delete the project in CryoSPARC. Note that you can use any method you choose to archive/transfer/store the project directory, as long as the entire contents remain intact.
To share a project with another user, follow the same steps as in the section above for archiving a project, but do not delete the project in your instance. Once you have created a .tar file, you can send it by any available means to another user or machine/system. The receiving user should decompress the .tar file into a directory accessible by CryoSPARC. To keep things tidy, create a new folder named imports inside the parent directory of CryoSPARC's projects.
You can rename the project directory to any name you like.
An example of decompressing a project:
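A sketch of the receiving side. The archive name P11.tar.gz, the imports folder, and the demo setup are placeholders for your own paths:

```shell
# Demo archive (stand-in for one received from another user).
mkdir -p P11/J1 && echo "results" > P11/J1/results.txt
tar -czf P11.tar.gz P11 && rm -r P11

# Extract into an imports folder alongside CryoSPARC's other projects.
mkdir -p imports
tar -xzf P11.tar.gz -C imports

# The absolute path to this extracted directory is what you give
# to the "Import" dialog on the projects page.
ls imports/P11
```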
Once this is complete, on the CryoSPARC projects page, click "Import" and specify the absolute path of the newly extracted project directory.
This will create a new project inside the receiving instance and import all workspaces, jobs and results from the extracted project. You will be able to continue processing within the newly imported project.
Though jobs are continuously exported when changes are made to the project (more details below in FAQs: When is your project/workspace/job exported?), it is still necessary to consolidate a job's outputs into a single folder in order to share the job with another user outside of your instance. To do this, click the "Export Job" button in the job details panel while a job is selected. This will export all images and streamlog events, create .csg files (more details below in FAQ: What are .cs and .csg files?), and symbolically link all of the job's outputs into a folder inside the project directory. The exported job will be found at $PROJECT_DIRECTORY/exports/jobs/<project_uid>_<job_uid>_<job_type>
For example, when exporting a Homogeneous Refinement:
Each output result group (particles, exposures, mask, volume) will be contained inside its own directory. These can also be imported independently using the "Import Result Group" job.
events.bson
This file contains all the job's streamlog events (all text and references to images seen inside the "Overview" tab of a job). It is encoded in BSON to save space and maintain MongoDB data formats.
job.json
This file contains the job document itself, which has been stripped of personal information (queued_to_lane, resources_allocated, interactive_hostname, ui_layouts, parents, children, output_result_groups, output_results).
gridfs_data/gridfsdata_0
A binary file containing all the job's images (images referenced by the streamlog, tile images, output group images, etc.)
P11_J87_volume
The folder containing the details of the output result group "volume":
This folder will always contain at least two items: a .cs file and a .csg file. The .csg file is a YAML file that contains metadata related to the output group itself (including the name of the .cs file, the type of results, and the number of items). The .cs file is a highly-optimized array-based file containing specific metadata for every item in the group. Alongside these files, the consolidated data will be symbolically linked into this folder.
Compress the folder (e.g. ~/cryosparc_projects/P11/exports/jobs/P11_J87_homo_refine) to create an archive of the job. Be sure to pass the "dereference" -h argument if you want to include all the referenced raw files.
For the receiver: To keep things tidy, extract the job archive into the folder imports/jobs inside the project directory you want to import the job into. This could be in the same or a different instance.
To import the job into your project, click the "Import Job" button inside a workspace in the project, and specify the absolute path to the extracted job archive.
To share only a specific output result group, click the "Export" button under the "Actions" section of an output result group in the "Outputs" tab. This will consolidate the outputs of the result group into a folder named <project_uid>_<job_uid>_<output_result_group_name> inside the PXX/exports/groups folder of the project. The functionality is similar to what happens when you export a job.
The resulting exported group directory will contain symlinks to the raw data files referred to by the group (e.g., .mrc files for a particle stack). This export directory can be consolidated using tar or by uploading to a repository. Ensure that the symbolic links are followed during upload or consolidation by "dereferencing".
For a user receiving the exported group: The exported result group can be imported as-is by using the "Import Result Group" job found under the "Imports" section of the job builder. Specify the absolute path to the .csg file of the exported output result group, which will be inside the export directory. This will import the group of items (e.g., a particle stack) with all associated metadata, so that it can be used for further processing.
Please read the FAQs section for a more detailed explanation of the .cs and .csg file formats used by CryoSPARC. More details about these, and source code for reading/writing them, are forthcoming.
As an advanced user, you may often wish to write scripts or interface with other programs that create or modify metadata associated with output groups from CryoSPARC. For example, you may wish to apply a transformation operator to particle alignments. Or you may wish to re-center particle picks. Or you may wish to create an integration for a third-party particle picking tool.
All of these tasks can be accomplished by creating, updating, and importing .cs and .csg files. .csg files contain high-level, text-format information about an importable group of items in CryoSPARC. The individual metadata for each item (e.g., particle alignments) are stored in .cs files, which are binary and efficient.
When a job completes processing in CryoSPARC, it creates .csg and .cs files describing all of its outputs. These files are also created when a job or output group is exported.
You can make copies of these files, modify them, and then import them again using the "Import Result Group" job type. For example, to apply a transformation to particle coordinates:
Locate the job directory of the job that created the particle locations.
Find and make copies of the .csg and .cs files of the particle output group.
Edit your new copy of the .cs file using python/numpy (see Guide: Manipulating .cs Files Created by CryoSPARC) to apply the desired transformation.
Edit your new copy of the .csg file with any text editor to ensure that it points to the new .cs file path. Note: a file path starting with > means that the path should be relative to the .csg file itself.
In CryoSPARC, create an "Import Result Group" job and point it to the new .csg file that you copied. This job will import the particle stack, preserving the identity of the particles (i.e., the uid column in the .cs file) but with the new alignments that you have manually modified.
You can also opt to create .csg and .cs files yourself from scratch and import these as well.
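The editing step above can be sketched with numpy as follows. The field names and file paths here are illustrative stand-ins, not CryoSPARC's exact schema - inspect dtype.names on your own .cs file to find the real columns:

```python
import numpy as np

# Illustrative only: build a mock particle .cs file (a structured numpy
# array saved in .npy format). Real .cs files use CryoSPARC's own fields.
mock = np.zeros(2, dtype=[('uid', '<u8'),
                          ('location/center_x_frac', '<f4'),
                          ('location/center_y_frac', '<f4')])
mock['uid'] = [101, 102]
mock['location/center_x_frac'] = [0.40, 0.60]
with open('particles_copy.cs', 'wb') as f:
    np.save(f, mock)

# Load the copy, shift the picks, and save it back. Do not touch the uid
# column -- it is what preserves particle identity across export/import.
particles = np.load('particles_copy.cs')
particles['location/center_x_frac'] += 0.05
with open('particles_copy.cs', 'wb') as f:
    np.save(f, particles)
```

After saving, point the copied .csg file at the new .cs path and import it with an "Import Result Group" job.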
What are .cs and .csg files?
CryoSPARC files (.cs) are numpy-array wrapped data structures used to store metadata for millions of input and output files in CryoSPARC. Code and details for easily dealing with .cs files are forthcoming. You can load and read a .cs file using the numpy.load function.
The structure of a .cs file can be visualized in a tabular manner:
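For example, loading a (mock) .cs file with numpy.load exposes this tabular structure. The field names below are illustrative stand-ins, not CryoSPARC's exact schema:

```python
import numpy as np

# Create a small mock .cs file: a structured numpy array saved in .npy
# format. The dtype fields here are illustrative examples only.
mock = np.zeros(3, dtype=[('uid', '<u8'), ('ctf/df1_A', '<f4')])
mock['uid'] = [101, 102, 103]
with open('example.cs', 'wb') as f:
    np.save(f, mock)

# Reading any .cs file is a single call; each row is one item (particle,
# exposure, etc.) and each field is one metadata column.
items = np.load('example.cs')
print(items.dtype.names)   # the metadata columns
print(len(items))          # the number of items in the group
print(items['uid'])        # per-item unique identifiers
```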
CryoSPARC Group files (.csg) are YAML-formatted (for readability) text files that hold metadata about the output result groups themselves. Advanced users creating their own .csg files should follow the same format that CryoSPARC itself produces when exporting a group.
Projects are exported when any updates are made to them.
Workspaces are exported when any updates are made to them.
Jobs are automatically exported at specific times:
job creation
setting & clearing parameters
connecting & disconnecting inputs
after job completion
after job clearing
when job is marked as completed
Jobs can be manually exported using "Export Job" from the Job Details panel.
Outputs can be manually exported using "Export" from the Outputs section of the job card.
When an instance is first updated to v2.11, you will notice a "database migration" notification appear a few minutes after the instance is started. Database migration refers to the background process that consolidates and ensures the consistency of all your existing projects so that they can subsequently be imported correctly. Migration will write all previously created project metadata to disk (including all workspaces and jobs).
Database migration will export jobs in running, launched, waiting, and queued status as killed jobs, since these jobs have not yet finished. Once the jobs finish, they will be written to disk as completed jobs. All other jobs will retain their status as-is and can be imported in another instance immediately once migration is complete.
Database migration can be safely stopped at any time by turning off CryoSPARC using cryosparcm stop - it will resume from where it left off the next time CryoSPARC is turned on.
Progress of the migration, including any errors, will be shown as notifications (which are new to CryoSPARC v2.11). Previous notifications can be viewed in the Notification Manager found inside the Resource Manager.
Currently, for job sizes, the value seen in the job detail panel is calculated by walking through every file inside the directory (including symlinks) and accumulating the sizes reported by the system's inode data. We keep track of all inode numbers and use the st_size value for each file.
For project sizes, we sum up all job directory sizes and report the total. In a future update, an explicit calculation of the project directory size will be used instead, in case users add extra folders to the project directory that CryoSPARC does not keep track of.
Users may find discrepancies between the file sizes reported by CryoSPARC and by the filesystem. Since the value calculated by CryoSPARC follows symlinks, use the -L argument with du to dereference symbolic links and obtain a similar value. Note that further discrepancies can arise from how du accounts for filesystem block sizes.
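A quick demonstration of the difference, using a throwaway directory and an import-style symlink (all paths below are a disposable demo setup):

```shell
# Demo: how -L changes what du reports.
mkdir -p sizedemo/project
head -c 1048576 /dev/zero > sizedemo/raw.dat            # 1 MiB "raw data" file
ln -s "$PWD/sizedemo/raw.dat" sizedemo/project/raw.dat  # import-style symlink

du -sh sizedemo/project    # counts only the symlink itself
du -shL sizedemo/project   # follows the link, similar to CryoSPARC's value
```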