In Part One of this tutorial, we covered the basics of 3D Variability Analysis (3DVA) and showed how it can be used to solve for variability components (i.e., eigenvectors of the 3D covariance of images).
This second part of the tutorial covers some of the more advanced ways that 3D variability results can be used to interpret heterogeneity in a dataset.
In cryoSPARC v2.12+, many improvements have been made to the 3D Variability algorithm. It can now solve more variability components simultaneously (12+), resolve more detailed motion, and work on smaller particles.
For example, the following videos show the first two variability components solved at high resolution (4Å) for the T20S proteasome. These components show two types of orthogonal variability in the molecule, corresponding to extension of the barrel and twisting of the top and bottom subunits. Only two components are shown, though in this case 6 components were solved in total.
3D Variability Analysis has also been used in recent cryo-EM projects and has already appeared in a number of publications:
Structural basis for the docking of mTORC1 on the lysosomal surface (Rogala et al. Science 2019) (Note: This video is from the publication's supplementary materials)
Cryo–electron microscopy structures of human oligosaccharyltransferase complexes OST-A and OST-B (Ramirez, Kowal, et al. Science 2019)
Cryo-EM reveals an asymmetry in a novel single-ring viral chaperonin (Stanishneva-Konovalova et al. JSB 2019)
Cryo-electron Microscopy Structure of the Acinetobacter baumannii 70S Ribosome and Implications for New Antibiotic Development (Morgan et al. mBio 2020)
Almost all proteins of biological interest have some amount of conformational heterogeneity, especially continuous heterogeneity. For almost all users, it will be a good idea to use 3D Variability Analysis to directly discover the heterogeneity that cannot be easily solved using traditional 3D classification approaches.
3D variability excels at resolving continuous heterogeneity and motion within a molecule, but it is equally powerful for separating discrete conformational changes.
One of the key theoretical concepts that helps us understand 3D classification of heterogeneity in cryo-EM is that classification (i.e., clustering) problems are naturally difficult, being non-convex and NP-complete in the worst case. This is the reason that, for example, running more than one 3D classification job (e.g., ab-initio and heterogeneous refinement using initial models) can yield different clustering results. Depending on initialization and stochastic steps taken by the clustering algorithms, some clusters can be found twice (i.e., duplicate classes) and some clusters can be missed entirely. There is no simple way to know whether additional undiscovered clusters remain in the dataset.
Clustering algorithms must explore the space of conformations to find clusters that explain the particle data well. This exploration process becomes more difficult with:
more noise in the data
more data points that need to be compared to one another
high dimensionality data points
Raw cryo-EM particle data unfortunately exhibits all three of the above characteristics. Particle images are very noisy, there are hundreds of thousands or millions of them and each one is represented using thousands of dimensions (pixels).
3D Variability Analysis sidesteps this issue, making clustering much simpler. It relies on a simple theoretical result: a linear manifold formed from eigenvectors of the data covariance (i.e., 3D Variability components) will, under some mild conditions, span the subspace in which clusters lie, without needing to know the cluster identities or the number of clusters ahead of time. For cryo-EM heterogeneity, this means that when there are discrete clusters present in a dataset, the first several 3D Variability components will directly show us the difference between clusters, separating them as clearly as is possible given the noise in the images. In this way, the problem of finding clusters becomes much simpler: they will show up as visual "clusters" when we plot the particles in their reaction coordinates. Then, we can simply cluster particles by their low-dimensional reaction coordinates, rather than having to look at all the pixels in every image and explore 3D conformations simultaneously.
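The idea that covariance eigenvectors span the subspace containing the clusters can be illustrated with a small toy experiment. This is only a sketch under simplifying assumptions: the data points are synthetic vectors rather than projection images, the cluster centres, noise level and dimensions are made up, and plain PCA via SVD stands in for the 3DVA solver (which does not form the covariance this way).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 3 cluster centres in a 200-dimensional space.
n_per_cluster, dim, n_clusters = 300, 200, 3
centres = rng.normal(scale=1.0, size=(n_clusters, dim))
true_labels = np.repeat(np.arange(n_clusters), n_per_cluster)
data = centres[true_labels] + rng.normal(scale=2.0, size=(true_labels.size, dim))

# Eigenvectors of the data covariance, obtained via SVD of the centred data.
mean = data.mean(axis=0)
_, _, vt = np.linalg.svd(data - mean, full_matrices=False)
components = vt[:2]  # top two components (rows are orthonormal)

# Low-dimensional coordinates of every data point along those components.
coords = (data - mean) @ components.T  # shape (900, 2)

# Fraction of the between-centre variation that lies inside the 2D subspace.
centre_diffs = centres - centres.mean(axis=0)
projected = centre_diffs @ components.T @ components
captured = np.linalg.norm(projected) ** 2 / np.linalg.norm(centre_diffs) ** 2
print(f"fraction of between-cluster variation captured: {captured:.2f}")
```

No cluster labels are used when computing `components`, yet almost all of the difference between the cluster centres ends up inside the two-dimensional subspace, so a scatter plot of `coords` shows the clusters directly.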
The following example, using 3DVA on a dataset of 50S ribosome particles (data from Davis, Tan, et al. Cell 2016, EMPIAR-10076), shows how 3D Variability components can separate clusters corresponding to discrete conformational changes. 131,899 particles were processed, first through ab-initio reconstruction (single class) and then homogeneous refinement, to obtain a consensus refinement of the ribosome. This structure showed regions of low and high resolution where there was variability.
3DVA was then run, solving 4 components. The components themselves are shown below:
Using the new in-line 3D interactive visualization capability in cryoSPARC v2.13, we are able to inspect three reaction coordinate dimensions of the particles, and see clear separation of clusters. This was done by creating a 3D Variability Display job in cluster mode and selecting 6 as the number of clusters. Clustering in the reaction coordinate space is done using a Gaussian Mixture Model, and each particle is then assigned to a cluster. These are each displayed with a different colour.
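Clustering particles by their reaction coordinates with a Gaussian Mixture Model can be sketched as follows. This is a minimal NumPy-only EM implementation with diagonal covariances, not cryoSPARC's code; the function name `fit_gmm`, the farthest-point initialisation, and the synthetic three-cluster coordinates are all assumptions made for illustration.

```python
import numpy as np

def fit_gmm(coords, n_clusters, n_iter=100, seed=0):
    """Fit a diagonal-covariance GMM with EM; return hard labels and means."""
    rng = np.random.default_rng(seed)
    n, _ = coords.shape
    # Farthest-point initialisation of the means (a simple illustrative choice).
    means = [coords[rng.integers(n)]]
    for _ in range(n_clusters - 1):
        dist = np.min([np.sum((coords - m) ** 2, axis=1) for m in means], axis=0)
        means.append(coords[np.argmax(dist)])
    means = np.array(means)
    variances = np.tile(coords.var(axis=0), (n_clusters, 1))
    weights = np.full(n_clusters, 1.0 / n_clusters)

    for _ in range(n_iter):
        # E-step: responsibility of each component for each particle.
        diff = coords[:, None, :] - means[None, :, :]
        log_p = (np.log(weights)
                 - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                 - 0.5 * np.sum(diff ** 2 / variances, axis=2))
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means and variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ coords) / nk[:, None]
        diff = coords[:, None, :] - means[None, :, :]
        variances = np.einsum("nk,nkd->kd", resp, diff ** 2) / nk[:, None] + 1e-6
    return resp.argmax(axis=1), means

# Synthetic reaction coordinates: 3 well-separated clusters in 3 dimensions.
rng = np.random.default_rng(1)
true_centres = np.array([[-40.0, 0.0, 0.0], [40.0, 0.0, 0.0], [0.0, 60.0, 20.0]])
true_labels = np.repeat(np.arange(3), 500)
coords = true_centres[true_labels] + rng.normal(scale=5.0, size=(1500, 3))

labels, means = fit_gmm(coords, n_clusters=3)
print(np.bincount(labels))  # number of particles assigned to each cluster
```

Because the model only sees a handful of coordinates per particle rather than thousands of pixels, the fit is fast and easy to rerun with a different number of clusters.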
This is a great example where clusters are clearly separated in the reaction coordinate space. The new visualization features in cryoSPARC v2.13 make it easy to see the topology of the particles in reaction coordinate space. Each cluster represents a different conformation of the ribosome in this case. The 3D Variability Display job in cryoSPARC, when set to cluster mode, also uses the clusters that are found here to reconstruct the different conformations individually:
In this case, several large conformational changes of the ribosome have been automatically separated. We can then apply 3D Variability again to particles within each cluster, in order to find discrete sub-classes and continuous flexibility that may be present within each overall cluster. We can also take the particles and reconstruction from each cluster and apply standard homogeneous refinement to improve particle alignments and resolution.
The 3D Variability Display job outputs reconstructions and particle sets from each cluster to make these workflows possible. Currently, the user must specify the number of clusters, though in principle it will be possible to automatically detect the number of clusters in a future version.
In order to see the interactive visual 3D scatter plot as shown above, use the following steps.
Run a 3D Variability Analysis job
Connect the 3DVA job outputs to a 3D Variability Display job
Set the 3D Variability Display job to cluster mode and set a number of clusters using the Number of Frames/Clusters parameter.
Run the 3D Variability Display job
In the streamlog of the 3D Variability Display job, you will see plots showing static 3D images of the scatter plot. Hovering your mouse over these plots will prompt you to click to start interactive mode. Interactive mode allows for zooming, panning, rotating, and turning on/off each cluster's points.
Continuous heterogeneity is often well modelled by 3D Variability components, as the first examples in this tutorial demonstrate. In the default simple mode, cryoSPARC's 3D Variability Display job will create a volume series using the consensus refinement and the 3D Variability component itself to show how the 3D structure changes. This results in a series that contains a smoothly varying, linear change in 3D density values from one end of the series to the other. This linear approximation is often very good for showing small detailed motions and conformational changes within the protein structure, but it can break down for larger motions.
In those cases, it can be helpful to construct a volume series by reconstructing separate volumes directly from subsets of the particle images, sorted and chosen by their reaction coordinate value. This technique, of sub-sampling a dataset in the latent space (i.e., reaction coordinates) for visualization, has been used by other methods that involve representing particles in a reaction coordinate space. These reconstructions are called intermediate reconstructions in cryoSPARC, and can be created using the 3D Variability Display job in intermediate mode. In this mode, particles are sorted along each variability dimension, and then split into (overlapping) subsets, weighted by their position along the variability dimension. This creates a "rolling window" of particles selected for creating each intermediate reconstruction in a volume series.
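The linear volume series produced in simple mode can be sketched as the consensus volume plus a scaled copy of one variability component. This is only an illustration: the function name `linear_volume_series`, the choice of spanning the 3rd to 97th percentile of the reaction coordinates, and the random stand-in volumes are assumptions, not cryoSPARC's implementation.

```python
import numpy as np

def linear_volume_series(consensus, component, coords, n_frames=20, pct=(3, 97)):
    """Return frame positions and volumes consensus + z * component."""
    # Span the bulk of the observed reaction coordinates (assumed range).
    lo, hi = np.percentile(coords, pct)
    positions = np.linspace(lo, hi, n_frames)
    # Broadcast: (n_frames, 1, 1, 1) * (1, N, N, N) -> (n_frames, N, N, N)
    series = consensus[None] + positions[:, None, None, None] * component[None]
    return positions, series

# Random stand-ins for a consensus map, one component, and per-particle coordinates.
rng = np.random.default_rng(0)
consensus = rng.normal(size=(8, 8, 8))
component = rng.normal(size=(8, 8, 8))
coords = rng.normal(scale=30.0, size=1000)

positions, series = linear_volume_series(consensus, component, coords, n_frames=5)
print(series.shape)
```

Every voxel changes by the same amount between consecutive frames, which is why this representation is exact for small density changes but can only approximate a large motion as one density fading out while another fades in.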
Each "triangle" window here shows the weighting for one selection of particles along the sequence that defines the series. Each selection is reconstructed independently using the weighted particles, and these together form a series that can help visualize non-linear changes in a dataset.
The 3D Variability Display job has a parameter that allows setting the "width" of the rolling window used. This can increase or decrease the number of particles per sub-selection of particles. Increasing the width leads to better signal-to-noise levels in reconstructions, while decreasing it leads to less blurring within each intermediate reconstruction due to the particle's flexing motion. A width of zero is also allowed, which will use "tophat" windows instead.
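The rolling-window weighting can be sketched as below. The parametrisation is hypothetical: here `width` is measured in units of the spacing between frame centres, frame centres are spread evenly between the smallest and largest reaction coordinate, and a width of zero falls back to non-overlapping tophat windows. cryoSPARC's actual definition of the width parameter may differ.

```python
import numpy as np

def rolling_window_weights(coords, n_frames=5, width=1.0):
    """Return weights[f, i]: the weight of particle i in frame f (n_frames >= 2)."""
    centres = np.linspace(coords.min(), coords.max(), n_frames)
    spacing = centres[1] - centres[0]
    dist = np.abs(coords[None, :] - centres[:, None])
    if width == 0:
        # Tophat: each particle counts fully in its nearest frame only.
        return (dist <= spacing / 2).astype(float)
    # Triangle: weight falls linearly from 1 at the centre to 0 at width * spacing.
    return np.clip(1.0 - dist / (width * spacing), 0.0, None)

rng = np.random.default_rng(0)
coords = rng.normal(size=1000)  # stand-in reaction coordinates along one component

triangle = rolling_window_weights(coords, n_frames=5, width=1.0)
tophat = rolling_window_weights(coords, n_frames=5, width=0)
print(triangle.sum(axis=1))  # effective number of particles in each frame
```

Each row of the returned array would then be used to weight the particles in one independent reconstruction. With `width=1.0` the triangles of neighbouring frames overlap so that every particle's weights sum to one; larger widths pull more particles into each frame at the cost of mixing more conformations.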
There is ongoing discussion about how symmetry should be handled when solving 3D Variability components, but in general, the Symmetry Expansion job should be used before using 3D Variability, for symmetric particles.
To explain further: when there is “symmetry” in a particle, this only applies to the consensus structure - any 3D variability mode can break the symmetry. However, the variability mode (assuming the underlying symmetry is a true symmetry) should naturally occur in all symmetric versions of itself.
In this sense there are two kinds of variability modes: modes that are changes/motion within just a single subunit, and modes where there are changes/motion across the entire particle in a coordinated fashion. The first kind are modes where each particle image contains information about multiple positions along the mode (since each subunit is in potentially a different position along that mode). The second kind are modes where each particle image contains information about only one position along the mode (since the entire particle is in only one position) but due to the symmetry, the image could be used as information for all symmetric copies of that same mode.
So for the first kind of mode, symmetry expansion is best. In this case it is also fine to create a mask around a single subunit, but this is not necessary, and a subunit mask will make it impossible to find motions across the entire particle. For the second kind of mode, using symmetry expansion means that many copies of the same mode can be found. For example, imagine a symmetric molecule bending along its entire length in one direction: there are equivalent copies of that mode where the molecule bends in the symmetry-related versions of that direction. Symmetry expansion makes sure that every particle counts towards each of these copies, rather than only the one with which the particle is arbitrarily aligned in the input poses.
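Conceptually, symmetry expansion replaces each particle with one copy per symmetry operator, each copy carrying a symmetry-related pose. The sketch below does this for a cyclic (Cn) point group about the z axis. The function names are hypothetical, and the multiplication order (pose times operator) is an assumption that depends on the pose convention in use; it is not taken from cryoSPARC's Symmetry Expansion job.

```python
import numpy as np

def cn_symmetry_ops(n):
    """Rotation matrices for n-fold cyclic symmetry about the z axis."""
    angles = 2 * np.pi * np.arange(n) / n
    ops = np.zeros((n, 3, 3))
    ops[:, 0, 0] = np.cos(angles)
    ops[:, 0, 1] = -np.sin(angles)
    ops[:, 1, 0] = np.sin(angles)
    ops[:, 1, 1] = np.cos(angles)
    ops[:, 2, 2] = 1.0
    return ops

def symmetry_expand(rotations, ops):
    """Return one pose per (particle, operator) pair and the source particle index."""
    # expanded[p, k] = rotations[p] @ ops[k]  (assumed convention)
    expanded = np.einsum("pij,kjl->pkil", rotations, ops).reshape(-1, 3, 3)
    particle_index = np.repeat(np.arange(len(rotations)), len(ops))
    return expanded, particle_index

def random_rotation(rng):
    """Random proper rotation from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return q * np.linalg.det(q)  # flip sign if needed so that det = +1

rng = np.random.default_rng(0)
rotations = np.stack([random_rotation(rng) for _ in range(10)])
ops = cn_symmetry_ops(7)

expanded, particle_index = symmetry_expand(rotations, ops)
print(expanded.shape, particle_index[:8])
```

After expansion the dataset is `len(ops)` times larger, and each image contributes to every symmetry-related copy of a variability mode instead of only the copy it happened to be aligned to.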
Since cryoSPARC v2.12, 3D Variability jobs have a parameter that allows setting a High-pass Resolution in Angstroms. This option adds a “high-pass prior” to the 3D Variability components, limiting the amount of power they can have at low frequencies. This essentially ignores variability that is larger than a certain scale.
Many smaller/membrane proteins have a large amount of “structured noise” present in the images at low resolutions. This could be from pancaked particles floating around at the air-water interface, empty micelles above/below the particles, etc. These phenomena cause most of the variability modes to be full of large blobs appearing and disappearing, rather than actually probing the motion or flexing of the molecule. In these cases, turning on the high-pass resolution can improve results. A typical value for the high-pass resolution is 20Å.
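The effect of suppressing low-frequency power can be illustrated with a plain Fourier-space high-pass filter on a cubic volume. This is a simplification: cryoSPARC applies the constraint as a prior while solving for the components, whereas the sketch below uses a hard cutoff after the fact. The function name `highpass_volume` and the test volume are assumptions for illustration.

```python
import numpy as np

def highpass_volume(volume, pixel_size, highpass_res):
    """Zero all Fourier components coarser than highpass_res (in Angstroms)."""
    n = volume.shape[0]  # assumes a cubic volume
    freqs = np.fft.fftfreq(n, d=pixel_size)  # spatial frequency in cycles per Angstrom
    fx, fy, fz = np.meshgrid(freqs, freqs, freqs, indexing="ij")
    radius = np.sqrt(fx ** 2 + fy ** 2 + fz ** 2)
    keep = radius >= 1.0 / highpass_res  # keep only features finer than the cutoff
    return np.fft.ifftn(np.fft.fftn(volume) * keep).real

# Test volume: a 32 A period wave (a "large blob") plus a 4 A period wave.
n, pixel_size = 32, 1.0
x = np.arange(n)[:, None, None] * np.ones((n, n, n))
low = np.cos(2 * np.pi * 1 * x / n)   # period 32 A, coarser than 20 A
high = np.cos(2 * np.pi * 8 * x / n)  # period 4 A, finer than 20 A

filtered = highpass_volume(low + high, pixel_size, highpass_res=20.0)
print(np.abs(filtered - high).max())
```

With a 20Å cutoff the slowly varying wave is removed while the fine-scale wave passes through unchanged, which mirrors how the high-pass setting discards blob-like variability while retaining finer structural changes.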