Guide: Data Cleanup (v4.3+)
New features in v4.3+ for managing and cleaning up project data.
Single particle cryo-EM datasets can be large (multiple TB), and the additional data generated by processing can grow equally large. Managing project data and cleaning up at various points in the project life cycle are important aspects of successful cryo-EM operations.
CryoSPARC v4.3 introduces several new features for managing and cleaning up project data. This guide provides a conceptual overview of data generated during processing and outlines recommended strategies for several use cases.
CryoSPARC creates a self-contained project directory each time a new project is created, and all files generated by CryoSPARC related to a project will be stored in its project directory. For details about the project life cycle, see:
Similarly, each time a CryoSPARC job is created, a job directory is created inside the associated project and the job data is stored in that job directory. When a job is cleared, the job directory is emptied but the job’s metadata (parameters, inputs, etc) are kept in the CryoSPARC database. This allows cleared jobs (or chains of cleared jobs) to be re-run at a later time.
Like jobs, CryoSPARC Live Sessions also have a session directory that is created inside their associated project directory, and session data (micrographs, particles, etc) are stored in the session directory.
In the standard project life cycle, projects (including all their jobs and sessions) can be detached from an instance or archived if they are not needed for some time. However, this does not generally reduce the size of a project directory. The new tools described here allow for shrinking projects, workspaces, and Live sessions so they can be dealt with more efficiently.
These tools will delete project data and therefore must be used with care, but are designed to allow cleaning up project data with clarity and confidence about what exactly will get deleted.
CryoSPARC does not currently (as of v4.3) attempt to manage raw data on your filesystems. That is, data imported into CryoSPARC using the import jobs is not copied into project directories and therefore CryoSPARC will not delete or modify it at its original location.
There are five ways that CryoSPARC can clean up project data. Each is summarized here and described in more detail below.
- Clearing jobs that are not needed to achieve a final result: One or more jobs in a project or workspace can be marked as final, meaning that that job and all ancestors providing input to that job should not be removed from the project. All other jobs (i.e., non-final jobs) can be cleared in one step to prune unnecessary branches in the processing workflow.
- Compacting CryoSPARC Live Sessions: Live sessions produce motion corrected micrographs and extracted particles. In v4.3, Live sessions can be compacted, which removes all of this data, and later restored, which uses saved parameters and particle locations to reprocess and reproduce the removed data. Before compacting, useful particles from a session can be retained separately by restacking the particles (see below).
- Clearing preprocessing jobs: Motion correction, CTF estimation and particle extraction jobs create large amounts of project data but can generally be safely cleared and re-run in the future (with the same parameters) in order to regenerate the data they produced. Before clearing particle extraction jobs, useful particles can be retained separately by restacking the particles (see below).
- Clearing Intermediate Results: Reduces the size of job directories by deleting unused intermediate data from them. This means removing results from early iterations of a job but keeping the results from the final iteration for downstream use.
- Clearing killed and failed jobs: Jobs that did not complete may still have produced sizeable outputs and can be cleared in one step.
The Cleanup Data tool can be used to clear project data in bulk using recommended strategies, and can be used at the project or workspace level. The options in the tool can be customized to match specific clearing preferences. Details about each option can be found below.
The Cleanup Data tool can be accessed at any time and is accessible via the quick actions menu or sidebar action panel when a project or workspace is selected.
Whether performing a cleanup on a project or workspace, the available options are identical. The only difference is that the workspace cleanup will only affect jobs that exist in the selected workspace. Note that in v4.3, linked jobs (i.e., jobs that are in the workspace being cleaned, but are also linked in other workspaces) will be treated as being part of the workspace being cleaned and therefore will be cleaned up by the tool, even if they appear in other workspaces as well. For this reason, it is important to mark as final all the important results across a project before cleaning up each workspace. By doing so, ancestors of final results across the project will be preserved in every workspace.
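The scoping rule for linked jobs can be illustrated with a small model. This is only a sketch: the `Job` record, its fields, and `jobs_in_scope` are hypothetical names chosen for illustration and are not CryoSPARC's data model or API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Job:
    uid: str
    workspaces: frozenset  # every workspace the job is linked into (hypothetical field)

def jobs_in_scope(jobs, workspace=None):
    """Return the uids a cleanup would consider.

    A project-level cleanup (workspace=None) considers every job. A
    workspace-level cleanup considers every job linked into that workspace,
    even if the job is also linked into other workspaces.
    """
    if workspace is None:
        return {job.uid for job in jobs}
    return {job.uid for job in jobs if workspace in job.workspaces}

jobs = [
    Job("J1", frozenset({"W1"})),
    Job("J2", frozenset({"W1", "W2"})),  # linked into both workspaces
    Job("J3", frozenset({"W2"})),
]
scope_w1 = jobs_in_scope(jobs, "W1")
print(sorted(scope_w1))
```

In this model, cleaning up `W1` also touches `J2`, which is still visible from `W2`. This is why important results should be marked as final across the whole project first.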
The Cleanup Data tool consists of four components:
- The header provides an overview of what project or workspace will be affected, how many sessions (if any) are contained within the project/workspace, how many jobs are contained within the project/workspace and how much space they take up on disk.
- The left side of the dialog displays a checklist of all cleanup operations available, broken down into categories.
- The right side of the dialog displays the cleanup preview, a tabbed interface in which each tab gives insight into how the cleanup action will affect the project or workspace.
- The footer allows you to return to the browse interface or proceed with the cleanup action.
As you select options, the data cleanup preview will update, providing an overview of how the action will affect the project or workspace size. The preview has multiple tabs:
- Estimate: a breakdown of which jobs will be cleared, which jobs will only have intermediate results cleared, which jobs will be deleted, and a total of all jobs that will be modified as a result of the cleanup action. Below is a visual bar chart representing the breakdown and a percentage estimate of how much the cleanup action will reduce the project directory size on disk.
- Preprocessing: a list of preprocessing jobs, non-preprocessing jobs and total jobs in the project or workspace along with their size on disk and percentage makeup compared to the total size of all jobs.
- Intermediate results: a list of all jobs in the project or workspace (if any) that have produced intermediate results.
- Non-final jobs: a breakdown of all jobs that are marked as final, ancestors of final and neither (non-final) along with their size on disk and percentage makeup compared to the total size of all jobs. See below for details.
- Killed/Failed jobs: a breakdown of jobs by status along with their size on disk and percentage makeup compared to the total size of all jobs.
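The per-category figures in these tabs (job count, size on disk, and percentage of the total size of all jobs) amount to a simple grouping calculation. The following is a minimal sketch of that arithmetic, assuming hypothetical job records of the form `(uid, status, size_bytes)`; it does not reflect how CryoSPARC stores or computes these statistics internally.

```python
from collections import defaultdict

def breakdown_by_status(jobs):
    """Group (uid, status, size_bytes) records by status.

    Returns {status: (job_count, total_bytes, percent_of_all_jobs)}.
    """
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for _uid, status, size_bytes in jobs:
        counts[status] += 1
        sizes[status] += size_bytes
    grand_total = sum(sizes.values())
    return {
        status: (
            counts[status],
            sizes[status],
            100.0 * sizes[status] / grand_total if grand_total else 0.0,
        )
        for status in sizes
    }

# Hypothetical job records: (uid, status, size on disk in bytes)
jobs = [
    ("J1", "completed", 600),
    ("J2", "completed", 200),
    ("J3", "failed", 150),
    ("J4", "killed", 50),
]
report = breakdown_by_status(jobs)
for status, (count, size, percent) in sorted(report.items()):
    print(f"{status:<10} {count:>3} jobs {size:>6} B {percent:5.1f}%")
```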
The Cleanup Data tool does not modify jobs or data created by Live sessions. For reducing sizes of sessions on disk for cleanup or archival purposes, view the section below on Live session compaction and restoration. Within the Cleanup Data tool, the number of sessions and their size on disk will be reported in the header section for reference.
The Cleanup Data tool uses two sources of information to provide insight into the project or workspace breakdown: the size of the project on disk and the size of individual jobs or sessions.
Project sizes are calculated automatically when a project is imported and can be refreshed manually via the sidebar. Job sizes are tracked automatically by CryoSPARC. When the project size is out of date, a notification bar appears under the header indicating that the size should be refreshed, along with a button to do so.
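To cross-check a reported size independently, the size of a directory can be measured by walking it and summing file sizes. The sketch below uses only the Python standard library and is not how CryoSPARC itself computes sizes. It reports apparent file sizes rather than allocated disk blocks, and it skips symbolic links so that data stored outside the directory is not counted (a design choice of this sketch). The directory and file names in the demonstration are placeholders.

```python
import os
import tempfile

def directory_size_bytes(root):
    """Sum the sizes of regular files under `root`, skipping symbolic links."""
    total = 0
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total

# Demonstration on a throwaway directory standing in for a project directory.
with tempfile.TemporaryDirectory() as project_dir:
    os.makedirs(os.path.join(project_dir, "J1"))
    with open(os.path.join(project_dir, "J1", "output.bin"), "wb") as fh:
        fh.write(b"\0" * 10)
    with open(os.path.join(project_dir, "notes.txt"), "wb") as fh:
        fh.write(b"\0" * 20)
    demo_total = directory_size_bytes(project_dir)

print(demo_total)
```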
Migration from older instances: when you upgrade from a previous version to v4.3, existing projects will not yet have the required size statistics calculated. Therefore, when you open the Cleanup Data tool on an older project, you will have to recalculate the project size before continuing.
The Cleanup Data tool has checkboxes for selecting whether to only clear, or clear and delete jobs in each category that can be cleaned up.
Clearing a completed job is a straightforward way of deleting job data while keeping the job inputs, parameters, and metadata intact. A cleared job takes up relatively little space, and can be re-run if its results are needed in the future. Once a job is cleared, all of its output data will be deleted, and its status will be set to building. Child jobs that use a cleared job’s output data as input will no longer be able to run, as the necessary files for them to run are deleted. However, outputs created by child jobs that were previously run will not be affected. Re-running the cleared job and regenerating its outputs will allow child jobs to be run again.
Be aware that re-running a cleared job does not guarantee bit-for-bit identical outputs to those generated the first time. Many factors outside of a job’s parameters can influence its outputs. See Determinism in CryoSPARC for more details.
For easier auditing of projects and workspaces, the data management module (visible when a single project or workspace is selected) displays a breakdown of jobs similar to the one shown in the preview of the Cleanup Data tool.
In the case where a project also contains sessions, a project directory breakdown will display how much of the project size on disk pertains to jobs, how much to sessions and other files (metadata or files transferred by a user that aren’t outputted by CryoSPARC):
Clearing deterministic jobs, such as pre-processing and extraction jobs, can remove a lot of reproducible data and save significant space if their results are no longer needed for processing. When the relevant checkboxes are selected, the Cleanup Data tool will clear jobs as long as they are not marked as a final result. Before clearing extraction jobs (which produce particle data), it can often be helpful to restack useful particles (see below).
Generating results deterministically is an important concept within scientific computing. Unfortunately achieving perfect determinism in high-performance applications is not trivial. While using CryoSPARC, many factors, including floating point precision, dependency versions, GPU models, and non-deterministic parallel execution can result in differences in outputs of jobs with identical inputs.
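The effect of floating point precision and execution order can be seen in a few lines of plain Python. This is a generic illustration of floating-point arithmetic and is not taken from CryoSPARC's code. The helper `naive_sum` is defined here because it accumulates strictly left to right, which makes the order dependence visible.

```python
def naive_sum(values):
    """Accumulate values strictly left to right in double precision."""
    total = 0.0
    for value in values:
        total += value
    return total

# Floating-point addition is not associative: grouping changes the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)

# The same three values accumulated in a different order, as can happen when
# parallel workers finish in a different sequence from one run to the next.
order_a = naive_sum([1e16, 1.0, -1e16])
order_b = naive_sum([1e16, -1e16, 1.0])
print(order_a, order_b)
```

The two orderings give different totals because adding `1.0` to `1e16` is lost to rounding, while adding it after the large values cancel is not.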
Preprocessing jobs (motion correction, CTF estimation, particle picking, extraction) can generally be considered deterministic in the sense that, for the same CryoSPARC version, clearing a job and re-running it later with the same input data and the same parameters will yield output data that is nearly identical, with the differences (for the reasons mentioned above) smaller than the measurable signal in cryo-EM data.
Other downstream job types, for example 2D classification, 3D ab-initio reconstruction, 3D refinements, 3D variability analysis, etc. can generally not be considered deterministic. This is because 1) these jobs rely heavily on stochastic algorithms that depend on random numbers, so forgetting to set, or incorrectly setting, the initial random seed will produce very different results, and 2) even with the same random seed, these jobs are all iterative, so small unavoidable differences (due to the reasons mentioned above) will accumulate and amplify over iterations, leading to measurably different final results.
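How a tiny difference can grow over many iterations can be shown with a toy example. The sketch below iterates the logistic map, a standard textbook example of a system that is sensitive to its starting value. It is an analogy only and is unrelated to the algorithms used by any CryoSPARC job; the starting value, perturbation, and parameter `r` are arbitrary choices.

```python
def iterate(x0, steps, r=3.9):
    """Apply x -> r * x * (1 - x) repeatedly and record every value."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x = r * x * (1.0 - x)
        trajectory.append(x)
    return trajectory

# Two runs whose starting points differ by about one part in a trillion.
run_a = iterate(0.4, 200)
run_b = iterate(0.4 + 1e-12, 200)
gaps = [abs(a - b) for a, b in zip(run_a, run_b)]

print(f"initial gap: {gaps[0]:.1e}")
print(f"largest gap over 200 iterations: {max(gaps):.3f}")
```

The runs start out indistinguishable and end up unrelated, even though the update rule itself contains no randomness.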
For these reasons, the Cleanup Data tool allows clearing of preprocessing jobs that can be safely re-run in the future but does not attempt to clear downstream jobs.
Extracted particle sets take up significant space on disk, and the particles are arranged into files with typically one file per micrograph. Over the course of processing, only a subset of initially extracted particles are typically used for downstream processing. Particles are filtered out during 2D and 3D classification and sometimes particles from entire micrographs may be discarded during curation. Filtering out particles unfortunately does not erase them from disk, and a final particle set will typically be sparsely spread out through all of the initially extracted files.
At any point in a project when a useful subset of particles has been identified, the particles can be restacked using the Restack Particles job. This job takes in an input particle set and writes new particle files containing only those particles as the output. After restacking, the original particle files can be deleted by clearing the extraction job from which they were produced. This can often reduce disk usage substantially and also has performance advantages for caching.
As an example of how to use Restack Particles:
- A particle set is created as the output of Extract From Micrographs
- 2D Classification is run on the particle set, followed by Select 2D Classes, resulting in a subset of filtered particles of the original set as the output
- Restack Particles can now be run using the particle output of Select 2D Classes, producing new files with only those particles
- The original Extract From Micrographs job can be cleared (either manually, or automatically by the Cleanup Data tool) and processing can continue using the output of Restack Particles
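The reason restacking saves space can be sketched with a toy model of the steps above. Everything here is hypothetical: the file names, the particle identifiers, the `batch_size` parameter, and the `restack` function are illustrative stand-ins, not the behavior or options of the Restack Particles job.

```python
def restack(extracted_files, keep, batch_size=1000):
    """Collect kept particles into new, densely packed stacks.

    extracted_files: {file_name: [particle_id, ...]}, one file per micrograph
    keep: set of particle ids that survived classification or curation
    Returns a list of new stacks, each a list of particle ids.
    """
    kept = [
        particle
        for particles in extracted_files.values()
        for particle in particles
        if particle in keep
    ]
    return [kept[i:i + batch_size] for i in range(0, len(kept), batch_size)]

# Three extraction files with four particles each; only three particles are
# still useful, and they are spread across all three files.
extracted = {
    "mic_001.mrc": ["p1", "p2", "p3", "p4"],
    "mic_002.mrc": ["p5", "p6", "p7", "p8"],
    "mic_003.mrc": ["p9", "p10", "p11", "p12"],
}
keep = {"p2", "p7", "p12"}

files_needed_before = [n for n, ps in extracted.items() if keep.intersection(ps)]
stored_before = sum(len(ps) for ps in extracted.values())
stacks = restack(extracted, keep)
stored_after = sum(len(stack) for stack in stacks)

print(f"before: {len(files_needed_before)} files, {stored_before} particles on disk")
print(f"after:  {len(stacks)} file(s), {stored_after} particles on disk")
```

Without restacking, every original file must be kept because each contains at least one useful particle. After restacking, only the useful particles remain on disk once the extraction job is cleared.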
By default, as of v4.3, automatic deletion of intermediate results is enabled for new jobs in all projects. This step is only applicable for users who have disabled the deletion of intermediate results at the project or individual job level.
The Cleanup Data tool will, with the appropriate checkbox checked, clear out intermediate results from all jobs in the project or workspace. This will not stop downstream jobs from being able to be created or run, since final results of jobs will not be cleared. Likewise, if an intermediate result was used in a downstream job, that particular intermediate result will not be cleared.
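The selection rule described above (keep the final iteration, and keep any intermediate result that a downstream job uses) can be written as a short function. This is a sketch of the rule as stated, with hypothetical names; it does not describe how CryoSPARC tracks results internally.

```python
def iterations_to_clear(iterations, used_downstream):
    """Return the iteration numbers whose results can be removed.

    iterations: iteration numbers for which a job saved results
    used_downstream: iteration numbers connected as inputs to other jobs
    """
    final_iteration = max(iterations)
    return [
        i for i in sorted(iterations)
        if i != final_iteration and i not in used_downstream
    ]

# A job that saved results for iterations 0 to 5, where iteration 2 was used
# as the input to another job.
clearable = iterations_to_clear(range(6), used_downstream={2})
print(clearable)
```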
Separately, intermediate results can be manually cleared at the project, workspace, or job level using the button in the Actions menu.
For more details on clearing intermediate results, please see:
Often during the course of processing data, especially in advanced stages, you may experiment with multiple different pathways or attempts with different jobs and parameters to explore the data. Often only one of these will be fruitful and many branches in the processing tree will be redundant.
CryoSPARC v4.3 makes it easy to clean up unnecessary processing branches. At any point during processing, when a significant result or goal is achieved by a job in a CryoSPARC project, that job can be marked as a “final result”. The job will show with a special flag indicator for final result and all ancestors of the job will also be automatically marked as ancestor of final result.
Figure: Right-click menu with the option to mark a job as final.
Figure: Job marked as final.
It is best to mark important jobs as final before performing data cleanup actions, to avoid any loss of important results.
These jobs are treated specially by the Cleanup Data tool:
- Final results: Jobs marked as a final result cannot be cleared, cannot have their parameters changed, and cannot be deleted. Their data is protected from the Cleanup Data tool and from manual actions within CryoSPARC as well. When a job is marked as final, it is marked as final in all workspaces where it appears.
- Ancestor of final results: Ancestors of jobs marked as a final result are given a special status, called ancestor of final. When a job is marked as final, its ancestor jobs will be marked across the entire project, regardless of whether they appear in the same or a different workspace from the final job. Ancestor of final jobs can be cleared; however, their parameters cannot be changed and they cannot be deleted. Therefore, the Cleanup Data tool can clear these jobs.
At any point in a project lifecycle, you can mark important jobs as final, and the Cleanup Data tool can be used to clear or delete non-final jobs (meaning jobs that are not final and are not ancestors of final jobs). When run at the project level, all non-final jobs in the project will be affected. When run at the workspace level, only non-final jobs in the workspace will be affected. Running the Cleanup Data tool with the “Clear non-final jobs” checkbox checked will effectively prune the processing tree, keeping all jobs necessary to achieve the final results but no others.
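Pruning the processing tree is a graph traversal: starting from each final job, follow input connections upstream to find its ancestors, and treat everything else as non-final. The sketch below models this with a plain dictionary mapping each job to the jobs that provide its inputs. The job identifiers, label strings, and function names are hypothetical and are not CryoSPARC's internal representation.

```python
def ancestors(job, parents):
    """Return every job reachable from `job` by following input connections."""
    seen = set()
    stack = list(parents.get(job, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def classify(parents, final_jobs):
    """Label each job as final, ancestor_of_final, or non_final."""
    final_jobs = set(final_jobs)
    protected = set()
    for job in final_jobs:
        protected |= ancestors(job, parents)
    labels = {}
    for job in parents:
        if job in final_jobs:
            labels[job] = "final"
        elif job in protected:
            labels[job] = "ancestor_of_final"
        else:
            labels[job] = "non_final"
    return labels

# job -> jobs providing its inputs
parents = {
    "J1": [], "J2": ["J1"], "J3": ["J2"], "J4": ["J3"],
    "J5": ["J4"], "J6": ["J5"],  # J6 will be marked as final
    "J7": ["J4"], "J8": ["J7"],  # an abandoned branch
}
labels = classify(parents, {"J6"})
non_final = sorted(job for job, label in labels.items() if label == "non_final")
print(non_final)
```

Only the abandoned branch (`J7`, `J8`) is labeled non-final; every job that `J6` depends on is kept as an ancestor of final.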
When the Cleanup Data tool is run with only the “Clear non-final jobs” checkboxes checked, it will preserve ancestors of final jobs, but will clear other unnecessary branches. However, when the Cleanup Data tool is run with any of the “Clear pre-processing jobs” checkboxes checked, it may clear ancestors of final jobs that fall into those categories (e.g. motion correction, CTF estimation, etc).
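The interaction between the two kinds of checkboxes can be summarized as a decision rule. The sketch below encodes the rule as described in this guide: jobs marked as final are never cleared, non-final jobs are cleared when that option is selected, and preprocessing jobs are cleared when that option is selected even if they are ancestors of final. The job type names, label strings, and function signature are hypothetical.

```python
# Hypothetical job type names standing in for the preprocessing categories.
PREPROCESSING = {"motion_correction", "ctf_estimation", "extraction"}

def jobs_to_clear(jobs, clear_non_final=False, clear_preprocessing=False):
    """Return the uids that would be cleared for the selected options.

    jobs: {uid: (job_type, label)} where label is one of
          "final", "ancestor_of_final", "non_final"
    """
    selected = set()
    for uid, (job_type, label) in jobs.items():
        if label == "final":
            continue  # final results are always protected
        if clear_non_final and label == "non_final":
            selected.add(uid)
        if clear_preprocessing and job_type in PREPROCESSING:
            selected.add(uid)
    return selected

jobs = {
    "J2": ("motion_correction", "ancestor_of_final"),
    "J4": ("extraction", "ancestor_of_final"),
    "J5": ("class_2d", "ancestor_of_final"),
    "J6": ("refinement", "final"),
    "J7": ("class_2d", "non_final"),
}
only_non_final = jobs_to_clear(jobs, clear_non_final=True)
with_preprocessing = jobs_to_clear(jobs, clear_non_final=True, clear_preprocessing=True)
print(sorted(only_non_final), sorted(with_preprocessing))
```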
In CryoSPARC v4.3, along with automatically using the Cleanup Data tool to trace jobs and their ancestors, it is also possible to manually select chains of jobs or ancestors/descendants of jobs and perform actions on those selections.
- Select a chain of jobs that are connected to each other
  - Select the last job in the chain, and then Ctrl+click the first job in the chain you wish to select
  - Right click on either job and choose “Select Job Chain”. The selection will update to include all jobs in the chain between the first and last job.
  - Right click on any job in the chain and perform an action on all the jobs such as moving, linking to another workspace, cloning, clearing, deleting, etc.
- Select ancestors or descendants of a job
  - Select a job, right click and choose “Select ancestor jobs” or “Select descendant jobs”
  - The selection will be updated to include all ancestors or descendants
  - Right click on any job in the set and perform an action on all the jobs such as moving, linking to another workspace, cloning, clearing, deleting, etc.
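One way to think about a job chain is as the set of jobs lying on any path from the first job to the last. The sketch below computes that set as the intersection of the first job's descendants and the last job's ancestors. This interpretation is an assumption made for illustration, and the function names and job identifiers are hypothetical; the guide does not specify how “Select Job Chain” is implemented.

```python
def reachable(start, edges):
    """Return every node reachable from `start` by following `edges`."""
    seen = set()
    stack = list(edges.get(start, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(edges.get(current, ()))
    return seen

def job_chain(first, last, parents):
    """Jobs on any path from `first` to `last` (assumed meaning of a chain)."""
    # Invert the parent map to obtain child connections.
    children = {}
    for job, inputs in parents.items():
        for parent in inputs:
            children.setdefault(parent, []).append(job)
    downstream = reachable(first, children) | {first}
    upstream = reachable(last, parents) | {last}
    return downstream & upstream

# job -> jobs providing its inputs; J5 branches off from J2
parents = {"J1": [], "J2": ["J1"], "J3": ["J2"], "J4": ["J3"], "J5": ["J2"]}
chain = job_chain("J1", "J4", parents)
print(sorted(chain))
```

In this model the side branch `J5` is not part of the chain from `J1` to `J4`, and two jobs that are not connected yield an empty selection.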
For more information about multi-selecting jobs, see:
Killed and failed jobs may have generated data while they were running which can be cleared. These jobs may be in this status because of bad inputs or errors and usually have no usable outputs. The Cleanup Data tool contains options to clear these killed and failed jobs.
In some cases, for jobs that generate intermediate results, these intermediate results may be usable as inputs to other jobs (e.g. using intermediate iterations in a classification job). In such a case, the killed or failed job can be marked as completed by right clicking on the job card and selecting “Mark Job as Complete”. This will allow for its intermediate results to be used as inputs to other jobs, and prevent it from being cleared when clearing killed or failed jobs with the Cleanup Data tool.
Live sessions generate large amounts of project data in the form of motion corrected micrographs and particle stacks. This data is stored in the session directory within the project directory.
In CryoSPARC v4.3, there are now actions available to reduce the disk space used by Live sessions once a project has reached a stage where preprocessing and particle extraction do not need to be repeated. Live sessions that are compacted can be restored in case preprocessing or extraction does need to be repeated.
If you wish to continue being able to perform downstream processing with a final particle stack that was initially extracted by CryoSPARC Live, it is best to restack the particles before compacting the Live session (see above).
Compacting can be done after marking a Live session as completed via the session’s Actions menu. Compacting will clear pre-processing and extraction stages, while saving extracted particle locations, all parameters, and user-inputted data (e.g. manual particle picks). A session cannot be modified after it is compacted, until after it is restored.
After compaction, the Live session will not be functional on CryoSPARC versions earlier than v4.3.0. This means that if CryoSPARC is downgraded or if the project is detached and reattached to an instance running an earlier version, the session will not be able to be restored until it is brought back to a v4.3+ instance.
A compacted session can be restored via the session’s Actions menu. The restoration process brings a session back to its pre-compaction status by re-running pre-processing stages for all exposures, which can take significant time. Session restoration requires a lane to run on, which can be set via the Configuration tab of a session.
Should the restoration process be interrupted (e.g. by a failed worker job), it can be resumed by initiating the restoration process again. If individual exposures fail during the restoration process, “Reset failed exposures” can be used to retry processing those exposures.
CryoSPARC projects can be archived and moved to a separate device for long-term storage. Archival can be done after using the Cleanup Data tool to reduce the size of the project to its minimum. See:
Often after periods of heavy use, the CryoSPARC database (which is separate from project and job directories) can become large. It is possible to reduce the database size with additional steps:
The following are example use cases covering recommended actions for clearing project data during various stages of data processing.
- Restack particles to consolidate useful particle data separately from data that will not be immediately useful in further processing, e.g. junk particles or particles not selected after 2D/3D classification
- Use the Cleanup Data tool to clear deterministic jobs, including extraction jobs but not Restack Particles jobs.
- Compact all completed CryoSPARC Live sessions (ensuring that useful particles are restacked before compaction)
- Clear unneeded branches of the workspace job tree by marking useful jobs as final results and running the Cleanup Data tool
After these actions:
- You can continue processing downstream jobs using the restacked particles
- If you need to re-do upstream processing (e.g. particle picking), pre-processing jobs that were cleared can be re-run to reproduce their outputs
- Annotate unneeded branches of the workspace job tree by marking useful jobs as final results
- Run the Cleanup Data tool, and use it to clear all pre-processing jobs (including particle extraction and restack jobs) as well as to clear and delete non-final jobs
- Compact all Live sessions
- Archive the project, allowing results to still be browsed in the CryoSPARC UI but allowing the project directory to be moved
- Move the project directory to long term storage
After these actions:
- If the project ever needs to be restored, move the project directory back into your projects folder and unarchive the project
- Any cleared results can be restored by re-running cleared jobs or restoring Live sessions, albeit at the cost of time