Tutorial: 3D Classification (BETA)
3D Classification is a new way to perform discrete heterogeneity analysis in CryoSPARC in a manner that complements Heterogeneous Refinement.
3D Classification (BETA) is a tool in CryoSPARC v3.3+ that provides a way to distinguish discrete heterogeneous states, or classes, from single particle cryo-EM data. Namely, this job currently implements 3D classification without alignment (i.e., realignment of particle orientations or shifts) through a hybrid online and batch expectation maximization algorithm.
By avoiding the computationally burdensome job of realigning particles and relying on higher-order interpolation rather than zero-padding of volumes, 3D Classification can efficiently separate particles into a large number of classes for further downstream analysis at high speed and without very large GPU memory requirements. Furthermore, unlike Heterogeneous Refinement, this job does not require any 3D maps for initialization. Instead, we provide two different initialization modes that can 'bootstrap' reasonable initializations via back-projection. Finally, we also allow for a (soft) mask input to 'focus' the classification on a specific region of heterogeneity, ignoring variation that may be present elsewhere.
In CryoSPARC v4.0, 3D Classification was updated with several significant modifications to the underlying computational algorithm, accepted inputs, initial parameters, and diagnostic plots. Accordingly, this tutorial has also been modified with new salient considerations, and with analysis of two new representative datasets that demonstrate both the power and the limitations of the updated job. Please see the job guide page for a detailed list of new 3D class features included in v4+.
10 states recovered from (focussed) 3D Classification of the voltage-gated sodium channel (Xu et al., 2019). Density shown at two different thresholds. Data from EMPIAR-10261.
The 3D Classification job has one primary input requirement: particles with orientations and shifts in the
A typical pipeline may look as follows:
- 1. Particle extraction, 2D classification, motion correction, CTF computation, etc.
  - Make sure to remove as many 'junk' particles as possible, though it may also be feasible to use the 3D Classification job itself to identify junk classes and remove associated particles
- 2. (Single- or multi-class)
- 3. Non-uniform Refinement to find improved alignments for a final set of particles;
- 4. (Optional, updated in v4.0) Mask generation:
- 5. 3D Classification (BETA) with particles from step 2 or 3, and a focus/solvent mask from step 4.
- 6. Further refinement of (a subset of) class volumes
Number of classes: 2-100+. Unlike Heterogeneous Refinement, this job can feasibly classify large datasets into a large number of classes. For example, as of CryoSPARC v4.0, we can classify a ~1.2 million particle dataset (EMPIAR-10077) into 100 classes in approximately 8.5 hours (exclusive of final output checks) on the CryoSPARC testing hardware.
Target resolution: 2-10 Å. This will define the 3D map box size and consequent Nyquist cut-off resolution. The resolution should represent a reasonable size scale at which the heterogeneity is expected, while being as low (i.e., numerically large) as possible to exclude noise. Computation time will also increase as resolution is increased. Note that in v4.0, each class may also be low-pass filtered below this cut-off, in accordance with its computed FSC curve.
Use FSC to filter each class (default: on). Starting with v4.0, 3D classification has the (recommended) option to filter each class using its intra-class Fourier Shell Correlation.
During online EM, we use a sliding window approach to FSC filtering to avoid computing FSCs with small batch sizes. Accordingly, we apply FSC regularization at every O-EM iteration, but update the per-class FSC curves every 10th iteration (including iteration 0) using a decaying sum of sufficient statistics from the past. During F-EM iterations, FSC curves are re-computed every iteration.
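The update schedule described above can be sketched as follows. This is only an illustration: the function names, the scalar stand-in for the sufficient statistics, and the exponential decay factor are assumptions made for this example and are not taken from CryoSPARC. Only the every-10th-iteration schedule (including iteration 0) comes from the text.

```python
def should_update_fsc(iteration, interval=10):
    """True on iterations 0, 10, 20, ... (the schedule described in the text)."""
    return iteration % interval == 0


def decayed_sum(running, batch_stat, decay=0.9):
    """Fold a new batch statistic into a running sum that down-weights the past.

    Assumption: exponential decay with a hypothetical factor; the actual
    decay scheme used by the job is not specified in this tutorial.
    """
    return decay * running + batch_stat


# Hypothetical scalar statistic accumulated over 25 O-EM iterations.
running = 0.0
update_iterations = []
for iteration in range(25):
    running = decayed_sum(running, batch_stat=1.0)
    if should_update_fsc(iteration):
        # In the real job, per-class FSC curves would be recomputed here
        # from the accumulated statistics.
        update_iterations.append(iteration)

print(update_iterations)  # [0, 10, 20]
```

Because older batches are progressively down-weighted, the accumulated statistics track the current class volumes while still pooling more particles than a single small batch provides.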
Updated in v4.1:
Per-particle scale (default: optimal). Starting with v4.1, 3D classification has the option to optimize per-particle scales with respect to the fixed consensus reconstruction prior to starting classification. By default this is turned on, though upstream scales ('input') or a constant scale of 1.0 ('none') can also be selected. Refer to the considerations below for more information about these settings.
Number of O-EM epochs, O-EM batch size (per class), O-EM learning rate init: These three parameters will have coupled effects on the variety and quality of the classes at the end of Online Expectation Maximization (O-EM). For a fixed number of epochs, reducing the batch size will increase the number of O-EM iterations, the effect of which will also depend on the learning rate.
In general, if you observe unexpected class ‘collapse’ during O-EM, we suggest reducing the learning rate in 0.1 increments, and/or reducing the batch size. If the average Class ESS is not near 1 at the end of O-EM, we suggest increasing the number of epochs.
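To make the coupling between epochs and batch size concrete, here is a rough estimate of the O-EM iteration count. It assumes that one iteration consumes (batch size per class) × (number of classes) particles and that one epoch is a single pass over the dataset. Both are assumptions made for this sketch, and `oem_iterations` is a hypothetical helper rather than part of CryoSPARC.

```python
import math


def oem_iterations(num_particles, num_classes, batch_size_per_class, num_epochs):
    """Approximate number of O-EM iterations under an assumed batching scheme."""
    # Assumption: the total batch scales with the number of classes.
    batch_size = batch_size_per_class * num_classes
    # Assumption: one epoch is a single pass over all particles.
    iterations_per_epoch = math.ceil(num_particles / batch_size)
    return num_epochs * iterations_per_epoch


# Hypothetical settings: 166,000 particles, 10 classes, 2 epochs.
print(oem_iterations(166_000, 10, 1000, 2))  # 34
print(oem_iterations(166_000, 10, 500, 2))   # 68
```

Under these assumptions, halving the batch size doubles the number of iterations for the same number of epochs, which is why the learning rate may need to be adjusted at the same time.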
Convergence criterion (%): this parameter determines when to stop F-EM based on the percentage of particles that switch classes from the previous iteration to the current one. We find that leaving this at 2% works well in many cases. For more difficult datasets, this may need to be increased to account for some particles that switch persistently because they cannot be classified with high certainty. Alternatively, you can turn on the secondary convergence criterion discussed below.
RMS density change convergence check: monitor the root mean square (RMS) of the difference between class volume densities across iterations. For more difficult classification tasks, this number can be quite low despite a relatively high number of class switches (e.g., 5%+). In effect, particles may shuffle between classes without significantly affecting the class volumes. To ensure that classes with very few particles don’t have a disproportionate effect on this number, we compute a weighted average across classes, with weights set by the relative size of each class.
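Both convergence measures can be illustrated with a short NumPy sketch. The function names and array layouts are assumptions made for this example, and the exact normalization used inside the job is not stated in this tutorial, so treat this as a reading of the description above rather than the job's implementation.

```python
import numpy as np


def fraction_switched(prev_labels, curr_labels):
    """Percentage of particles whose assigned class changed between iterations."""
    prev_labels = np.asarray(prev_labels)
    curr_labels = np.asarray(curr_labels)
    return 100.0 * np.mean(prev_labels != curr_labels)


def weighted_rms_change(prev_volumes, curr_volumes, class_sizes):
    """Class-size-weighted average of the per-class RMS density change.

    Volumes are assumed to be stacked as (num_classes, nz, ny, nx).
    """
    prev_volumes = np.asarray(prev_volumes, dtype=float)
    curr_volumes = np.asarray(curr_volumes, dtype=float)
    diff = (curr_volumes - prev_volumes).reshape(len(curr_volumes), -1)
    rms_per_class = np.sqrt(np.mean(diff ** 2, axis=1))
    weights = np.asarray(class_sizes, dtype=float)
    weights = weights / weights.sum()  # relative class sizes
    return float(np.sum(weights * rms_per_class))


# Toy example: one of five particles switches class.
print(fraction_switched([0, 0, 1, 1, 2], [0, 1, 1, 1, 2]))

# Toy example: only the small class changes, so the weighted RMS stays low.
prev = np.zeros((2, 2, 2, 2))
curr = prev.copy()
curr[0] += 0.1
print(weighted_rms_change(prev, curr, class_sizes=[1, 3]))
```

In the second example the class that changed holds only a quarter of the particles, so its density change is down-weighted accordingly.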
Updated in v4.0:
Other parameters such as Class similarity may also affect classification, but we have not found their impact to be significant in our testing.
This job can classify any input particles with the alignments3D key set (e.g., from an ab-initio job, from homogeneous refinement, from an imported EMPIAR particle dataset, etc.); however, the quality of the alignments may affect the resulting classes.
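During the expectation step of both full and online EM, we evaluate the likelihood of each particle under the following class volume:

$$\tilde{V}_k = M_{\text{solvent}} \odot \left[ M_{\text{focus}} \odot V_k + (1 - M_{\text{focus}}) \odot V_{\text{consensus}} \right]$$

where $M_{\text{solvent}}$ is the solvent mask, $M_{\text{focus}}$ is the focus mask, $V_{\text{consensus}}$ is the consensus volume (computed at the outset and fixed), and $V_k$ is the volume associated with class $k$.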
In words: At each iteration, each 3D class volume is masked by the focus mask, and the voxels outside the focus mask are replaced with the consensus reconstruction density voxels (rather than zero). This result is then masked by the solvent mask, and the voxels outside the solvent mask are replaced with zero.
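The masking procedure described in words above can be written as a few lines of NumPy. This is a minimal sketch: the function and argument names are hypothetical, and the masks are assumed to be arrays with values in [0, 1] and the same shape as the volumes.

```python
import numpy as np


def compose_class_volume(class_vol, consensus_vol, focus_mask, solvent_mask):
    """Combine a class volume with the consensus volume using the two masks."""
    # Inside the focus mask: keep the class density.
    inside_focus = focus_mask * class_vol
    # Outside the focus mask: substitute the consensus density (not zero).
    outside_focus = (1.0 - focus_mask) * consensus_vol
    # Finally, zero out everything outside the solvent mask.
    return solvent_mask * (inside_focus + outside_focus)


# Toy 1D "volumes" with four voxels each.
class_vol = np.array([1.0, 2.0, 3.0, 4.0])
consensus_vol = np.array([10.0, 20.0, 30.0, 40.0])
focus_mask = np.array([1.0, 1.0, 0.0, 0.0])
solvent_mask = np.array([1.0, 1.0, 1.0, 0.0])

print(compose_class_volume(class_vol, consensus_vol, focus_mask, solvent_mask))
# [ 1.  2. 30.  0.]
```

The first two voxels come from the class volume, the third from the consensus volume, and the fourth is zeroed by the solvent mask. With soft masks the same expression blends the two volumes smoothly across the mask edge.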
If the focus mask is not supplied, we set
contains a micelle, and no focus mask is supplied, the heterogeneity present within the micelle itself will usually dominate the classes and prevent 3D classification from identifying biologically relevant heterogeneity. Whenever possible, we recommend supplying a focus mask.
The class volumes produced by 3D classification may also be affected by two other important elements:
1. Anisotropic Magnification
Anisotropic magnification will cause class volumes to look stretched along orthogonal axes (often resulting in a ‘wobbling’ effect when classes are animated sequentially, as below). Although 3D classification does not currently estimate anisotropic magnification within the job, it can use upstream estimates encoded in the particle stack to compensate for this effect. If you suspect anisotropic magnification is affecting your results, you can run a Global CTF refinement job and reconnect its output to the 3D classification job, ensuring that Correct anisotropic magnification is turned on.
Anisotropic magnification causes classes to 'wobble'.
2. Particle Scale Factors
🆕 As of CryoSPARC 4.1, 3D Classification includes built-in per-particle scale optimization via the parameter Per-particle scale. By default, this parameter is set to optimal. In this mode, optimal per-particle scales are computed with respect to the consensus volume and fixed particle alignments prior to classification. This procedure, combined with some algorithmic changes in v4.1, should largely avoid the convergence behaviour described below.
Note that in some cases it may still be beneficial to use input scales obtained via a refinement job prior to 3D classification, as these will be optimized simultaneously with alignments and may therefore differ from those computed herein (which use fixed alignments).
Finally, in some further cases (e.g., in data with significant compositional/discrete heterogeneity), it may also be useful to set this parameter to none. In these cases, per-particle scale optimization may make it more difficult for 3D classification to separate classes with missing/additional density.
Even if two particles contain the same signal, their relative scale (i.e., mean intensity) may differ due to ice thickness and other factors. This can have a significant effect on 3D classification. Similar to anisotropic magnification, we don’t compute these scales within the 3D classification job itself, but the job can use optimal scales from an upstream job to obviate their effect. A common manifestation of unequal particle scales is dramatic ‘reshuffling’ during F-EM iterations. If you observe this, (re-)run a Homogeneous refinement job with Minimize over per-particle scale turned on. If per-particle scale factors are indeed an issue, you may observe a multi-modal scale factor histogram (as seen below). The new particle stack from the refinement job can then be input to 3D classification, which will account for these scale factors.
An example of a class flow diagram from a 3D classification job where particle scales are not equal. We often find that unequal scales cause the F-EM iterations to dramatically ‘shuffle’ the classes.
A bimodal distribution of per-particle optimal scales computed with a homogeneous refinement job. Computing these per-particle scales prior to connecting particles to a 3D classification job can significantly affect the resulting 3D classes.
Effective sample size (ESS) and soft class assignments
The ESS is a simple measure of the extent to which a discrete probability distribution is 'dispersed'. In 3D Classification, the class ESS is evaluated over the posterior of class assignments for each particle. Numerically, the per-particle class ESS is equal to the inverse of the sum of squared class probabilities and ranges from 1 to the number of classes. A value near the number of classes indicates that the class posterior is near a uniform distribution, while a value near 1 represents a 'hard' selection of a single class.
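As a worked example of this definition, the per-particle class ESS can be computed directly from a matrix of class posteriors. The function name and the (particles × classes) array layout are assumptions made for this sketch.

```python
import numpy as np


def class_ess(posteriors):
    """Per-particle class ESS: the inverse of the sum of squared class probabilities.

    `posteriors` has shape (num_particles, num_classes); each row sums to 1.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    return 1.0 / np.sum(posteriors ** 2, axis=1)


posteriors = np.array([
    [1.0, 0.0, 0.0, 0.0],      # 'hard' selection of a single class
    [0.5, 0.5, 0.0, 0.0],      # evenly split between two classes
    [0.25, 0.25, 0.25, 0.25],  # uniform over all four classes
])

print(class_ess(posteriors))  # [1. 2. 4.]
```

The three rows cover the range described above: an ESS of 1 for a hard assignment, 2 for a particle shared equally between two classes, and 4 (the number of classes) for a uniform posterior.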
In v4.0, we now display a histogram of class ESS values as part of the standard suite of diagnostic plots.
Per-particle Class ESS Histogram displayed in 3D Classification (≥v4.0).
Weighted back-projection and further refinement
In 3D Classification, the final output volumes are constructed using a weighted back-projection with weights on each particle defined based on the class posterior. This means that although the output class particle sets are disjoint, each particle may contribute to multiple (or all) volumes. When the dataset-wide mean class ESS is near 1, this effect is minimized. Nevertheless, the volumes themselves are primarily useful as a visualization of each class, and further refinement should be done on the relevant particle sets for a final reconstruction.
In v4.0, 3D Classification includes the option to disable this ‘soft back-projection’ by turning on the parameter named Force hard classification.
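The difference between the two modes can be illustrated by the per-particle weights each would pass to back-projection. This is a sketch only: it assumes that hard classification gives each particle a weight of 1 in its most probable class and 0 elsewhere, and the function name is hypothetical.

```python
import numpy as np


def backprojection_weights(posteriors, hard=False):
    """Per-particle, per-class weights used to build the class volumes.

    Soft mode: weights are the class posteriors, so a particle can
    contribute to several volumes.
    Hard mode (assumed behaviour): each particle contributes only to
    its most probable class.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    if not hard:
        return posteriors
    weights = np.zeros_like(posteriors)
    best_class = posteriors.argmax(axis=1)
    weights[np.arange(len(posteriors)), best_class] = 1.0
    return weights


posteriors = [[0.6, 0.4],
              [0.1, 0.9]]
print(backprojection_weights(posteriors))             # soft weights
print(backprojection_weights(posteriors, hard=True))  # one-hot weights
```

In soft mode the first particle contributes 60% to class 0 and 40% to class 1; in hard mode it contributes only to class 0.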
Ribosome with selenocysteine delivery in E. coli, Fischer et al. (2016)
This data captures a ribosome complex binding with a ligand. In the original publication, 6 distinct states (see Figure below) were found.
Figure 1a, Fischer et al. (2016).
- Part 1: 1.19 million particles from
- Part 2: 1.19 million particles from Global CTF Refinement
- Solvent mask
- Mask from
Without correcting for anisotropic magnification, the classes include biologically-salient conformations but also display a characteristic ‘wobble’ (discussed in important considerations above):
Results of a 10 class 3D Classification job without correcting for anisotropic magnification
Here, in another 3D classification run, this wobbling is even more pronounced and most noticeable when there is little other conformational change:
Results of a (different) 10 class 3D Classification job without correcting for anisotropic magnification — here the ‘wobble’ is very evident.
Part 2: Classification with anisotropic magnification correction
To correct for this, we ran a Global CTF refinement job (with Anisotropic Mag. fits turned on) on the particle stack:
Then, using the new particle output group, we re-ran 3D classification which produced 10 classes that were no longer stretched in the same way:
Results of a 10 class 3D Classification job correcting for anisotropic magnification
Note that we still see one class (class 2) with significantly more density:
This may be indicative of some particle scaling issues. We discuss one way to account for these in the next dataset.
Human RNA polymerase III, Girbig et al. (2021)
Two different conformations of the clamping mechanism of the RNA polymerase, from Fig. 2c, Girbig et al. (2021).
- Solvent mask
- Part 1: mask from Homogeneous Reconstruction Only job
- Part 2: mask from
As of CryoSPARC 4.1, 3D Classification includes built-in per-particle scale optimization via the parameter Per-particle scale. Please see the note above regarding updated convergence behaviour in v4.1.
We ran 3D classification with the default parameters on the imported dataset from EMPIAR 10697 consisting of approximately 166K particles. With these inputs, the job required over 30 F-EM iterations to converge below the standard threshold of 2% class switches. We often observed significant class ‘shuffling’, as you can see below.
After 67 total iterations, the job did converge, with the following class distributions:
At first it may seem that we’ve found a number of ‘low population’ classes. However, upon further inspection we see that many of these states are similar:
Ten (of ten) classes from a 3D Classification job run on data from EMPIAR 10697, with no modification to particle scales.
Part 2: Classification with optimized per-particle scales
Investigating this further, if we run Homogeneous refinement on the particles with Minimize over per-particle scale turned on, we see the following scale distribution:
This type of distribution is indicative of multiple scale ‘modes’ which will affect 3D Classification. Indeed, if we re-run the job with these scale-optimized particles, the 3D classification job converges within 2 F-EM iterations to the following class distribution:
This results in four distinct classes, animated below:
Four states recovered using 3D Classification of human RNA Polymerase III (EMPIAR 10697), after per-particle scale optimization.
A. baumannii MlaBDEF complex bound to AppNHp, Mann et al. (2021)
The MlaBDEF complex, from Fig 1c, Mann et al. (2021).
Different MlaB binding configurations, from Supp. Fig. 2, Mann et al. (2021).
- 80K particles from Non-uniform refinement (after particle picking, 2D class, ab-initio)
- Solvent mask
- Mask from
- Focus mask
- Mask created using ChimeraX (following the CryoSPARC mask generation tutorial)
Focus mask isolating the binding sites of MlaB.
- Part 1:
- Classes: 5
- Part 2:
- Classes: 5
- Force hard classification: True
Applying 5-class focussed classification on this dataset results in the following classes (after 6 F-EM iterations):
Note the small hump around 2 in the ESS histogram. This indicates that several thousand particles still have significant probability of belonging to two classes — these particles are ‘spread’ about two volumes.
When we take a look at the volumes themselves:
We see that every class contains at least one MlaB unit; no class is free of MlaB.
Five classes from 3D classification performed on EMPIAR 10425 with a focus mask on the MlaB binding sites.
In this type of case, it might be useful to observe what happens when we turn off weighted backprojection, and instead classify particles using ‘hard classification’.
With hard classification turned on, we see a very different class distribution:
A plurality of the particles are now in class 4, which is a class that may have no MlaB units (though this requires further investigation).
Five classes from 3D classification performed on EMPIAR 10425 with a focus mask on the MlaB binding sites. In this case, hard classification is turned on, and we see the potential presence of a ‘no binding’ class.
Thus, for data where a significant portion of particles cannot be classified into a single class with certainty (i.e., their class ESS is ≥ 2), turning on hard classification may help uncover classes that would otherwise be ‘smeared’ out by this uncertainty.
Fischer, Niels, et al. "The pathway to GTPase activation of elongation factor SelB on the ribosome." Nature 540.7631 (2016): 80-85.
Girbig, Mathias, et al. "Cryo-EM structures of human RNA polymerase III in its unbound and transcribing states." Nature structural & molecular biology 28.2 (2021): 210-219.
Mann, Daniel, et al. "Structure and lipid dynamics in the maintenance of lipid asymmetry inner membrane complex of A. baumannii." Communications biology 4.1 (2021): 1-9.
Xu, Hui, et al. "Structural basis of Nav1.7 inhibition by a gating-modifier spider toxin." Cell 176.4 (2019): 702-715.