Case Study: Exploratory data processing by Oliver Clarke
The guides presented here are kindly reproduced from Oliver Clarke, PhD, Assistant Professor of Physiology and Cellular Biophysics at Columbia University. They are friendly, approachable introductions to cryoEM data processing in CryoSPARC with a focus on the exploratory, "try-it-and-see" nature of single-particle analysis.
Both guides cover similar topics, but the 2024 version includes some steps which require CryoSPARC v4.4 or later. Below we directly reproduce the general outline section of each guide. The full guide is available in the linked PDF.
This workshop is intended to provide an introduction to "exploratory" data processing in CryoSPARC - that is, data processing with the goal of quickly identifying, reconstructing and refining the molecular species present in a heterogeneous sample. CryoSPARC is used here, but the same general principles & workflow apply to single particle processing in any software package (the particles and micrographs should be directly importable into RELION - just convert the particle.cs file to a STAR file using csparc2star.py in pyem (see here) and you should be good to go). Note: some parts (e.g. symmetry relaxation) require CS 4.4 or later.
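As a rough sketch of the conversion step mentioned above, the snippet below only builds the csparc2star.py command line and prints it. It assumes pyem is installed and csparc2star.py is on your PATH; the helper name `csparc2star_command` and the file names are placeholders for illustration, so check `csparc2star.py --help` for the options your pyem version supports.

```python
import shlex


def csparc2star_command(particles_cs, output_star, passthrough_cs=None):
    """Build the argument list for pyem's csparc2star.py.

    particles_cs   : CryoSPARC particle .cs file (placeholder name below)
    output_star    : STAR file to write for RELION
    passthrough_cs : optional passthrough .cs file from the same job
    """
    cmd = ["csparc2star.py", str(particles_cs)]
    if passthrough_cs is not None:
        # The passthrough file, when given, goes between input and output.
        cmd.append(str(passthrough_cs))
    cmd.append(str(output_star))
    return cmd


# Placeholder file names; substitute the files exported from your own job.
cmd = csparc2star_command("particle.cs", "particles.star")
print(shlex.join(cmd))
```

To actually run the conversion, the same list can be passed to `subprocess.run(cmd, check=True)`.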
General principles to keep in mind
Process small, clean subsets of your dataset before tackling the whole. There are many choices to make during data processing - What picking strategy to use? What cleaning/classification strategy to use? What molecular species are present, and which to focus on? In many cases, the only way to identify the best performing strategy is by trial and error. This is much faster working with a smaller subset of data, and can provide 3D volumes and strategies which can then be used to seed analysis of the entire dataset.
Iterate! Often, optimal processing of a heterogeneous dataset will benefit from multiple passes. The first quick pass identifies any potential issues (non-optimal orientation distribution, variable behavior of particles in different ice thickness regimes) and facilitates identification of the very best micrographs (those with the most particles remaining after initial picking and classification), which can then be used to train a neural network picker such as Topaz to repick the entire dataset.
Experiment/explore! There is no single valid strategy for processing a heterogeneous dataset, and this workshop is only a brief guide to some possible approaches. Mix and match, test what works best, and then apply these strategies to your own data!
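The idea of picking out "the very best micrographs" (those with the most particles remaining after initial picking and classification) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes you already have one micrograph name per surviving particle, exported by whatever means you prefer, and the function name and example micrograph names are hypothetical.

```python
from collections import Counter


def best_micrographs(particle_micrographs, top_n):
    """Rank micrographs by how many surviving particles they contain.

    particle_micrographs : iterable with one micrograph name per particle
                           that remains after picking and classification
    top_n                : number of micrographs to return
    Returns a list of (micrograph_name, particle_count) pairs, best first.
    """
    counts = Counter(particle_micrographs)
    return counts.most_common(top_n)


# Hypothetical example: six surviving particles spread over three micrographs.
surviving = ["mic_001", "mic_002", "mic_001", "mic_003", "mic_001", "mic_002"]
for name, n_particles in best_micrographs(surviving, top_n=2):
    print(name, n_particles)
```

The micrographs at the top of this list are candidates for training a picker such as Topaz before repicking the entire dataset.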
I have included two subsets of data for the first part of the workshop (micrographs and extracted & Fourier cropped particles) derived from a publicly available heterogeneous dataset - EMPIAR-11043, the erythrocyte ankyrin-1 complex purified from digitonin extracts of human red blood cell membranes (PMID: 35835865). For the second part of the workshop, addressing mixed symmetry and pseudosymmetry, I have included subsets of data from EMPIAR-10425 (the MlaBDEF complex, PMID: 34188171), as well as EMPIAR-10059 (TRPV1-DkTx complex, PMID: 27281200).
These datasets are intended to provide a lightweight and portable starting point for data processing initiated from either CTF estimation and picking (micrographs) or ab-initio volume generation and classification (particles), which can be easily accommodated even on systems with limited storage and processing power. Both sets of data are relatively small, but large enough to allow for identification and characterization of multiple species over the course of the workshop.
This workshop is intended to provide an introduction to "exploratory" data processing in CryoSPARC - that is, data processing with the goal of quickly identifying, reconstructing and refining the molecular species present in a heterogeneous sample. CryoSPARC is used here, but the same general principles & workflow apply to single particle processing in any software package (the particles and micrographs should be directly importable into RELION - just convert the particle.cs file to a STAR file using csparc2star.py in pyem (see here) and you should be good to go).
General principles to keep in mind:
Process small, clean subsets of your dataset before tackling the whole. There are many choices to make during data processing - What picking strategy to use? What cleaning/classification strategy to use? What molecular species are present, and which to focus on? In many cases, the only way to identify the best performing strategy is by trial and error. This is much faster working with a smaller subset of data, and can provide 3D volumes and strategies which can then be used to seed analysis of the entire dataset.
Iterate! Often, optimal processing of a heterogeneous dataset will benefit from multiple passes. The first quick pass identifies any potential issues (non-optimal orientation distribution, variable behavior of particles in different ice thickness regimes) and facilitates identification of the very best micrographs (those with the most particles remaining after initial picking and classification), which can then be used to train a neural network picker such as Topaz to repick the entire dataset.
Experiment/explore! There is no single valid strategy for processing a heterogeneous dataset, and this workshop is only a brief guide to some possible approaches. Mix and match, test what works best, and then apply these strategies to your own data!
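One simple way to get a small subset to experiment on is to draw a reproducible random sample of micrographs. The sketch below is an assumption about how you might do this outside CryoSPARC, not a description of any CryoSPARC job; the function name, the fraction, and the micrograph names are all hypothetical.

```python
import random


def pick_subset(micrographs, fraction, seed=0):
    """Return a reproducible random subset of micrograph names.

    micrographs : sequence of micrograph names
    fraction    : share of the dataset to keep, in the range (0, 1]
    seed        : fixed seed so the same subset is drawn on every run
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the range (0, 1]")
    names = list(micrographs)
    if not names:
        return []
    # Keep at least one micrograph, and never more than are available.
    n_keep = min(len(names), max(1, round(len(names) * fraction)))
    rng = random.Random(seed)
    return sorted(rng.sample(names, n_keep))


# Hypothetical dataset of 100 micrographs; keep roughly 10% for quick tests.
all_mics = [f"mic_{i:04d}.mrc" for i in range(100)]
subset = pick_subset(all_mics, fraction=0.1)
print(len(subset), "micrographs selected")
```

Fixing the seed means that each picking or classification strategy is tested on the same subset, so differences in the results reflect the strategy and not the sample.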
I have included two subsets of data (micrographs and extracted & Fourier cropped particles) derived from a publicly available heterogeneous dataset - EMPIAR-11043, the erythrocyte ankyrin-1 complex purified from digitonin extracts of human red blood cell membranes.
These datasets are intended to provide a lightweight and portable starting point for data processing initiated from either CTF estimation and picking (micrographs) or ab-initio volume generation and classification (particles), which can be easily accommodated even on systems with limited storage and processing power. Both sets of data are relatively small, but large enough to allow for identification and characterization of multiple species over the course of the workshop.