Guide: Data Management in CryoSPARC (v4.0+)
An overview of all data management utilities and common use cases.
Single particle cryo-EM projects and labs continue to operate at increasing scales. In CryoSPARC v4.0, we introduce improved workflows and tools for dealing with archiving, transferring, exporting, and importing projects. Tools introduced in previous versions of CryoSPARC for exporting and importing individual jobs and individual results of various types (particle stacks, exposure stacks, volumes, etc.) remain available.
In CryoSPARC v4.0, the most important changes are:
- CryoSPARC projects are now explicitly locked (or attached) to a single CryoSPARC instance at a time. In previous versions, it was possible for a project directory to be imported into and accidentally modified by two instances at the same time, causing metadata corruption. Now, each project directory contains a lock file that marks the project as in use by (i.e., attached to) a particular instance.
- CryoSPARC project directories are now named based on the project title at creation time, rather than a numeric project-UID (e.g., "P12"). Numeric UIDs are only used to refer to each project within a single instance. This serves to ensure that project directories are user-recognizable, and that the numeric UID is not retained when a project moves from one instance to another.
- The life-cycle of a CryoSPARC project is now separated from the CryoSPARC instance(s) that interact with the project. The following life-cycle actions can be taken on a project:
- Create: an instance can create a new project, and that project lives in a unique and self-contained project directory on disk. The project directory is created at creation time of the project. The project is attached to the instance that created it (and therefore there is a lock file present in the project directory). Within the instance to which the project is attached, the project has a unique numeric UID.
- Detach: a user can opt to detach a project from the instance to which it is currently attached. This action ensures that no jobs or background processes are running in the project, and then removes the lock file from the project directory. In the UI, the project that was previously attached displays as "Detached" and can no longer be interacted with.
- Attach: a project directory that has previously been detached (and therefore has no lock file present) can be attached to an instance. When attachment is performed, all the workspaces, jobs, and Live sessions within the project directory are imported into the attaching instance, and the project becomes usable within this instance and is given a new numeric UID. A lock file is written to the project directory. Attach takes the place of the previous Import action.
- Archive: A project that is attached to an instance can be "archived" without detaching the project. This instructs CryoSPARC that the project directory is no longer available for reading and writing at its current location, but will become available (possibly at a new location) at some future time. A user should archive a project before moving the project directory to a different location on disk, for example a different filesystem, a backup location, tape archive, or cold-storage. Archiving ensures that no jobs or background processes are running in the project and marks the project as archived, but does not remove the lock file from the project. Once archived, the project can still be browsed in the UI, but cannot be modified.
- Unarchive: A project that has been archived can be resurrected in the same instance from which it was archived. When unarchiving, the user is prompted to provide the (possibly changed) location of the project directory on disk. For example, a project can be archived and then the project directory moved to a cold, inaccessible backup. Later, when needed, the project directory can be restored to an accessible filesystem, and the project can be unarchived pointing at the new project directory location. This makes the project available once again for further processing.
- As a minor change, output files of CryoSPARC jobs that are stored in job directories no longer have the cryosparc_PXX_ prefix, since the numeric project UID can change when a project moves from one instance to another. In order to retain the existing behaviour of the CryoSPARC UI and limit confusion between different files, when CryoSPARC results are downloaded through the browser in the UI, the prefix cryosparc_PXX_ is added to the local filename of the download in the browser, using the then-current numeric project UID.
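The life-cycle actions listed above can be read as a small state machine. The sketch below is only an illustration of that reading: the names `State`, `TRANSITIONS`, `apply`, and `has_lock_file` are hypothetical and are not part of CryoSPARC, and only the transitions the text describes are modeled.

```python
from enum import Enum


class State(Enum):
    ATTACHED = "attached"   # lock file present, project usable
    ARCHIVED = "archived"   # lock file present, project cannot be modified
    DETACHED = "detached"   # lock file removed, directory can be attached


# Transitions described in the list above. "Create" has no source state:
# a newly created project simply starts out as ATTACHED.
TRANSITIONS = {
    ("detach", State.ATTACHED): State.DETACHED,
    ("attach", State.DETACHED): State.ATTACHED,
    ("archive", State.ATTACHED): State.ARCHIVED,
    ("unarchive", State.ARCHIVED): State.ATTACHED,
}


def apply(action: str, state: State) -> State:
    """Return the state after `action`, or raise if it is not modeled."""
    try:
        return TRANSITIONS[(action, state)]
    except KeyError:
        raise ValueError(f"cannot {action} a project that is {state.value}") from None


def has_lock_file(state: State) -> bool:
    """Per the text, only detaching removes the lock file."""
    return state is not State.DETACHED
```

In this model an archived project keeps its lock file, which is why it returns to the same instance through unarchive rather than through attach.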
These changes make it much simpler to perform the following actions:
- Detaching a project from one CryoSPARC instance and attaching it to another instance
- Sending a project initially started at a centralized facility to a user who is going to continue processing at home in their own instance, by detaching and attaching
- Archiving a project to remote/slow storage for later retrieval and resurrection
- Copying a project directory to make a complete clone
- Changing the name or location of a project directory on disk, by archiving and unarchiving
The following actions remain possible, with no change in behaviour in v4.0:
- Reducing the disk space used by a project by removing intermediate results created by jobs
- Uploading the final results of a CryoSPARC job to an online repository
- Advanced manipulation of CryoSPARC metadata at a low level or programmatically, by exporting a result (e.g., a particle stack), manually modifying the associated .cs files, and importing again
The following sections describe specific aspects of data management in CryoSPARC in more detail.
CryoSPARC workflows are naturally divided into projects. Each project should contain the work and jobs for one or more related data collection sessions that are associated with a given sample/target. Project boundaries are strict, in the sense that files and results from one project cannot be directly used in another project. Project directories are self-contained, and all image processing data (except for imported raw data, see 7. Imported data and symlinks in project directories) pertaining to a project is written to the project directory.
A project directory always contains all the information needed to define that project. The project directory is written to every time certain actions are taken within CryoSPARC, for example when project, workspace, or job metadata (titles, descriptions, etc.) is changed, and when jobs complete processing. This "continuous export" model ensures that at any time, a project directory is self-contained and if anything goes wrong with a CryoSPARC instance or database, the projects remain intact and up-to-date, without the user having to manually trigger an export action.
As a safety feature, a project directory can, at any time, be transferred/renamed/copied and attached as a valid project in any (other) CryoSPARC instance that can read the files. This is true even if the original CryoSPARC instance that created the project is no longer functional. See Use Case: Rescuing a project from an inoperable instance for details on how to rescue a project from a failed or inoperable instance.
Similar to projects, jobs inside the project are stored in a self-contained format, and job directories are updated whenever the jobs are created, modified, or completed. Note: jobs that are in failed status will not be updated on disk until they enter completed status, either by actually completing or by the user choosing the "mark as completed" option in the Job Details panel.
For projects created in CryoSPARC v4.0+, the project directory is initially named based on the title of the project at creation time. For example, if a project is titled "My Protein, Data Collection (October 1 2022)" the project directory will be created as CS-my-protein-data-collection-october-1-2022. The project directory will be created within the container directory indicated at creation time. The project directory can be changed later on (see Use Case: Renaming a project directory).
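As a rough illustration of the naming scheme, the sketch below derives a directory name from a title. The rule it uses (lowercase, collapse every run of non-alphanumeric characters into a single hyphen, add a `CS-` prefix) is an assumption chosen to reproduce the one example given above. It is not taken from CryoSPARC's code, the function name `project_dir_name` is hypothetical, and the real behaviour for non-ASCII titles or name collisions may differ.

```python
import re


def project_dir_name(title: str, prefix: str = "CS-") -> str:
    """Guess a project directory name from a title (illustrative rule only)."""
    # Lowercase, then collapse each run of non-alphanumeric characters to "-".
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return prefix + slug


print(project_dir_name("My Protein, Data Collection (October 1 2022)"))
# prints: CS-my-protein-data-collection-october-1-2022
```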
Inside each project directory in CryoSPARC v4.0+ (including existing projects), there will be a lock file present called cs.lock. This file should not be removed or changed.
Detached projects can be attached to a CryoSPARC instance. Attaching a project creates a new project in the instance using the indicated project directory. All project details, workspaces, jobs, and sessions in the detached project will be made accessible, and the project directory will be treated as an active project directory by the instance. Any intact CryoSPARC project directory that does not contain a lock file (including from a previous CryoSPARC version) can be attached.
Note: Projects already belonging to another CryoSPARC instance cannot be attached until they are detached from their original instance. When this is not possible, see Use Case: Rescuing a project from an inoperable instance.
Projects can be attached under the “New Project” dropdown menu:
Once the attach process begins, you will see a new project appear in the projects page. This project will have a new numeric UID (distinct from the numeric UID of the project in the instance where the project previously was attached). Once attachment is complete, users can begin to interact with the project and continue processing.
Projects can be detached from their CryoSPARC instance. Detaching a project unlocks the project from its instance, allowing the project folder to be moved to another location or attached to another instance. In the UI of the instance where the project is being detached, the project will also display as ‘detached’ and no longer be usable. A detached project’s details, workspaces, jobs, and sessions are saved to the project directory.
A project can be detached using the “Detach Project” button in the project’s “Actions” menu:
Detached projects will show an icon on their cards:
Upon detaching a project, the project is no longer associated with the CryoSPARC instance, but some project information is retained in the CryoSPARC database. As of v4.1.2, the “Delete Project from Database” action can be used to remove the remaining database entries associated with the project. Performing this action on a detached project hides it in the UI and removes large database files, potentially freeing up space on disk.
CryoSPARC projects can be archived to allow their project folder to be moved on disk and unarchived at a later date. Archiving sets the project to read-only mode, where it can be seen in the UI but cannot be modified. All project details, workspaces, jobs, and sessions will be maintained in the CryoSPARC database as well as in the project directory on disk. CryoSPARC does not expect the project directory to be available for read or write while a project is archived. When moving a project directory, be sure to consider moving the raw data that was imported into the project as well (see 7. Imported data and symlinks in project directories).
A project can be archived using the “Archive Project” button in the project’s “Actions” menu:
Archived projects will show an icon on their card:
The archived status can also be seen in the project details:
Archived projects can be unarchived back into the CryoSPARC instance, removing the read-only status and allowing the project to be modified again. Projects can be unarchived from a different project directory location than the location at time of archive.
Note: Archiving and unarchiving should only be used with the intention of keeping the project tied to the current instance of CryoSPARC. Users looking to transfer projects between CryoSPARC instances should refer to the Attach and Detach features instead.
A project can be unarchived using the “Unarchive Project” button in the project’s “Actions” menu:
When unarchiving a project, the project directory must be specified:
Several job types (2D Classification, 3D Classification, and 3D Variability Analysis) have an option to control whether the job will save intermediate results at all. By default, jobs will save intermediate results. However, this can be turned off on a per-job level using job parameters, or it can be turned off at the project level for each job type. To do so, select the project and at the bottom of the details panel, set job-specific defaults under the 'Generate Intermediate Results' module:
Project-level defaults for generation of intermediate results
When raw data is imported into a CryoSPARC project (using an Import job), the raw data is not copied into the project directory. Rather, symlinks are created within the Import job directory pointing to the raw data. Aside from these symlinks, CryoSPARC jobs do not create symlinks that point to locations that are outside the project directory. This keeps project directories self-contained.
The symlinks within import jobs can be changed if the position of the raw data on disk changes. For example, when archiving a project, if the raw data (e.g., raw movies) are also archived to a different location than where they were imported from, the import symlinks must be updated. For more details, see the guide: A. Moving only raw particle, micrograph or movie data already imported into CryoSPARC
Due to the use of symlinks, it is important that when copying or moving a project directory, symlinks NOT be dereferenced (i.e., do not use a dereferencing option with tar). If symlinks are dereferenced, the new copy of the project directory will also contain copies of all the raw data files, as well as potentially multiple copies of intermediate and output files that are internally symlinked within the project directory. Instead of dereferencing symlinks, raw data should be archived separately from project directories.
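Before copying or archiving a project directory, it can help to see which symlinks point outside it, since those are the files that need to be archived separately. The following is a minimal sketch using only the Python standard library. The function name `external_symlinks` is hypothetical, and the sketch makes no assumptions about how CryoSPARC lays out job directories.

```python
import os
from pathlib import Path


def external_symlinks(project_dir: str) -> list[tuple[Path, Path]]:
    """List (link, target) pairs for symlinks that point outside project_dir."""
    root = Path(project_dir).resolve()
    found = []
    # os.walk does not descend into symlinked directories by default,
    # but it still reports them in dirnames, so they are checked here too.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            link = Path(dirpath) / name
            if not link.is_symlink():
                continue
            # realpath resolves relative targets against the link's directory.
            target = Path(os.path.realpath(link))
            if target != root and root not in target.parents:
                found.append((link, target))
    return found
```

When copying with Python rather than a shell tool, `shutil.copytree(src, dst, symlinks=True)` copies links as links instead of copying the files they point to.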
Sometimes, you may need to move a project directory on disk. You may have created it in the wrong place accidentally, or the disk may be full. If you have tiered storage (e.g., a fast SSD-backed storage system for active projects and a slower HDD-backed storage array for bulk storage), you may wish to move a project directory from the fast filesystem to the slower filesystem once most processing is complete.
In these cases, you can simply:
1. Archive the project
2. Move the project directory to its new location
3. Unarchive the project using the path to the project directory at its new location
The project will now be usable once again, and all reads/writes will happen to the new project directory location.
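Before step 2, it may be worth checking that the destination filesystem has room for the project. The sketch below is one way to do that with the standard library. The helper names `dir_size_bytes` and `fits_on` are hypothetical. Symlinks are deliberately not followed, so imported raw data is not counted.

```python
import os
import shutil


def dir_size_bytes(path: str) -> int:
    """Total size of regular files under path; symlinks are not followed."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def fits_on(destination: str, needed_bytes: int) -> bool:
    """True if the filesystem holding `destination` has enough free space."""
    return shutil.disk_usage(destination).free >= needed_bytes
```

For example, `fits_on("/new/location", dir_size_bytes("/old/location/project"))` (both paths are placeholders) returns False when the move would not fit. Reported sizes are file sizes, so actual disk usage can differ slightly.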
In CryoSPARC v4.0+, project directories are named based on the project title entered at creation time. If you later change the title, you can rename the project directory using the following steps:
1. Archive the project
2. Rename the project directory on disk, but leave it in its original location
3. Unarchive the project using the path to the project directory with its new name
When you need to move a CryoSPARC project between instances, for example when transferring a project from a data collection facility to a user's home facility, use the following steps:
1. Detach the project from its original CryoSPARC instance
2. Copy the project directory to a location accessible by the new CryoSPARC instance
3. Attach the project to the new CryoSPARC instance using the path to the project directory at its new location
4. (Optional, available in v4.1.2+) Use the “Delete Project from Database” action on the detached project to remove any remaining database entries relating to this project from its original CryoSPARC instance
Once processing in a project is complete, the project can be either detached (if it is unlikely to be brought back to the same CryoSPARC instance) or archived. Either action will allow the project directory to be moved or compressed without causing errors in the CryoSPARC instance.
Be sure to separately archive/copy/move the raw data that was imported into the CryoSPARC project, as raw data is not stored inside the project directory. See 7. Imported data and symlinks in project directories.
The project directory can be copied as-is, and stored on a backup, remote, or cold-storage filesystem. In some cases it may help to tar the project directory into a single file. An example command to consolidate a project directory is:
tar -cvf P47.tar ./P47
Note that you can use any method you choose to archive/transfer/store the project directory, as long as the entire contents remain intact.
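The same bundling can be done from Python with the standard-library `tarfile` module. This is a sketch rather than a recommended procedure. The function names `bundle_project` and `list_symlinks` are hypothetical, and the tar file is assumed to be written outside the project directory.

```python
import tarfile
from pathlib import Path


def bundle_project(project_dir: str, tar_path: str) -> None:
    """Write project_dir into an uncompressed tar file, keeping symlinks as links."""
    src = Path(project_dir)
    # dereference=False is the default; it is set explicitly so that symlinks
    # are stored as links instead of copies of the files they point to.
    with tarfile.open(tar_path, "w", dereference=False) as tar:
        tar.add(str(src), arcname=src.name)


def list_symlinks(tar_path: str) -> list[str]:
    """Return the names of all symlink entries stored in the tar file."""
    with tarfile.open(tar_path, "r") as tar:
        return [member.name for member in tar.getmembers() if member.issym()]
```

Calling `list_symlinks` on the finished bundle is a quick way to confirm that import symlinks were stored as links and that the raw data was not pulled into the archive.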
If you need to access the project at a later date, you can un-tar the bundle to any accessible filesystem. Then, if the project was archived (you can tell by checking that the cs.lock file is still present inside the project directory), you can unarchive it to the same instance from where it was archived. Otherwise, if it was detached, you can attach it in any instance.
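The lock-file check described above can be sketched as follows. The function name `restore_action` is hypothetical, and the check only tests whether the file exists: it does not read the lock file, whose contents are not described in this guide. A lock file is also present while a project is still attached, so the result is only meaningful for a directory restored from storage.

```python
from pathlib import Path

LOCK_FILE_NAME = "cs.lock"  # lock file name given in this guide


def restore_action(project_dir: str) -> str:
    """Suggest how to bring a restored project directory back into CryoSPARC."""
    if (Path(project_dir) / LOCK_FILE_NAME).is_file():
        # Lock still present: the project was archived, not detached.
        return "unarchive"
    # No lock file: the project was detached and can be attached anywhere.
    return "attach"
```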
If a CryoSPARC v4.0+ instance is no longer operable (due to database corruption or other issue), a project that was attached to that instance can be rescued by attaching to a new instance. Use the following steps:
1. Ensure that the inoperable instance is completely shut down, and that there are no remaining "zombie" processes associated with that instance still running.
2. For additional safety, make a copy of the project directory to be rescued and use the copy for subsequent steps.
3. Delete the cs.lock file in the project directory.
4. In the new instance, use Attach Project and point to the project directory where the lock file was removed.
5. The new instance should import all available workspaces, jobs, and sessions and make the project directory available for use once again.
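Steps 2 and 3 above can be sketched as a short script. It assumes step 1 has already been done by hand, since the code does not check whether the old instance is still running. The function name `prepare_rescue_copy` is hypothetical, and the copy destination must not already exist.

```python
import shutil
from pathlib import Path


def prepare_rescue_copy(project_dir: str, copy_dir: str) -> Path:
    """Copy a project directory and remove cs.lock from the copy only."""
    copy = Path(copy_dir)
    # symlinks=True copies links as links, so imported raw data is not duplicated.
    shutil.copytree(project_dir, copy, symlinks=True)
    lock = copy / "cs.lock"
    if lock.exists():
        lock.unlink()
    return copy
```

The original directory, including its lock file, is left untouched, and the returned path is the one to give to Attach Project in the new instance.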
The following use cases remain unchanged in v4.0+: