Tutorial: 3D Variability Analysis (Part One)
Part One of the two-part tutorial on 3D Variability Analysis for exploring heterogeneity.
This is part one of a two-part series on 3D variability analysis in CryoSPARC. This Part One explains the background, basics, and usage of the new tools, while Part Two covers several case studies, interpretation of results, and advanced usage.
3D Variability Analysis (3DVA) is a powerful tool in CryoSPARC v2.9+ for exploring both discrete and continuous heterogeneity in single particle cryo-EM data sets. It is based on a different fundamental idea from previous methods like 2D/3D classification, focused refinement, multibody refinement, etc.
In a standard 3D cryo-EM refinement, individual particle images are used to reconstruct a single 3D structure, under the assumption that there is only one rigid conformation present in the data. In 3D variability analysis, individual particle images are instead used to reconstruct a continuous family of 3D structures. Together, this family captures the multiple discrete and flexible conformations that are present in the data. Intuitively, the family can be thought of as the set of all possible 3D conformations that appear in the sample. There may be multiple dimensions (intuitively: attributes) needed to characterize the members of the family. For example, a subunit may associate/dissociate (first attribute), while another subunit bends (second attribute), while a ligand binds/unbinds (third attribute). Each single particle image is an image of one particular member of the family, with the position of each particle along each attribute of family being known as a "reaction coordinate", "embedding coordinate", or "latent representation".
3D variability analysis in CryoSPARC solves for the continuous family of 3D structures as well as the assignment of each particle to a position within the family (reaction coordinates). As such, it is a helpful way to understand and visualize the heterogeneity and flexibility in a data set.
In practice, 3DVA can be used in multiple ways. Most simply, it provides visually interpretable insight into the conformational space of a molecule, in the form of 3D videos. In more complex workflows, it can be used to seed subsequent heterogeneous refinement (especially when discrete heterogeneity is present) by creating versions of the molecule at different conformational positions, according to the family of structures solved. It can also be used to determine the direction and amount of continuous flexibility in a molecule, and thereby help in the selection of masked regions for local refinements. Finally, it can directly provide information about the energy landscape of the molecule, by inspecting the distribution of reaction coordinates among particles.
Note that 3DVA was released as BETA in CryoSPARC v2.9, and is a work in progress - stay tuned for further updates and improvements!
To read more about 3D Variability Analysis, please see:
Punjani, A. and Fleet, D.J. 3D Variability Analysis: Resolving continuous flexibility and discrete heterogeneity from single particle cryo-EM images. Journal of Structural Biology, Volume 213, Issue 2, 2021. https://doi.org/10.1016/j.jsb.2021.107702
3D variability analysis is based on a widely explored idea: to try and compute the eigenvectors of the 3D covariance of a set of particle images. This is also known as principal Components Analysis (PCA). A "principal component" is the same as an "eigenvector of the covariance". Eigenvectors of the 3D covariance are, literally, linear directions in the space of 3D volumes in which there is significant variability within the data set. They can be thought of as "trajectories" in the space of 3D structures along which the molecule exhibits conformational variability. Moving along a particular eigenvector yields all the different 3D conformations of a molecule that are captured by that particular eigenvector. Multiple orthogonal eigenvectors can be solved simultaneously, each corresponding to a different type of variability. Together they yield a multi-dimensional subspace that defines the family of conformations in the dataset.
In CryoSPARC, we have developed and implemented an algorithm and new fast GPU implementation for computing high-resolution 3D eigenvectors of the covariance, using particles from all viewing directions simultaneously, while also accurately correcting for the CTF. The algorithm can be used by running the new
3D Variabilityjob in cryoSPARC v2.9+.
The algorithm, as mentioned above, solves for the eigenvectors as well as the position of each particle along each eigenvector - referred to here as reaction coordinates. The number of eigenvectors to solve, K, is a user specified parameter.
Each eigenvector itself is a 3D volume, containing positive and negative values at each voxel. These describe the specific change that is happening to the molecule along the corresponding dimension of variability. Below is an example of what three orthogonal slices of a single typical eigenvector may look like:
Reaction coordinates are a set of numbers, for each particle. The number of reaction coordinates is equal to the number of eigenvectors that are solved. The distribution of reaction coordinates across particles in a dataset gives a picture of the energy landscape of the molecule, by showing how many particles in the population take on a specific position in the conformational space spanned by the eigenvectors.
Below is a sample reaction coordinate plot for the EMPIAR-10028 80s ribosome:
The 0, 1, and 2 axes correspond to the three different components (eigenvectors) that were solved using 3DVA. The left plot shows coordinates for 0 vs 1, and the right plot for 1 vs 2. There is a small "bimodal" split along coordinate 0 (which by inspecting the eigenvector, is determined to correspond to ratcheting of the small subunit), while 1 and 2 correspond to a more continuous flexing of the molecule.
In the case of discrete heterogeneity, we expect to find multiple "clusters" in reaction coordinate space. Each cluster corresponds to a discrete conformation of the molecule, where there is more probability of finding particles. The spaces between clusters may be either non-physical intermediate states that are never occupied (eg. if the clusters represent conformations where a subunit associates/dissociates, the in-between state of a half-density subunit is not realistic) or they may represent intermediate states that are occupied with very low probability (eg. if the clusters represent the opening and closing of a channel, the in-between state of a half-open channel may by physically realistic, but almost never observed). It is important to note that 3D variability analysis starts with fixed alignments of each particle, which are estimated using a standard refinement. As such, 3DVA assumes that the conformations present in a dataset are not so different that a refinement of particles against a consensus structure would yield grossly incorrect alignments. In other words, 3DVA will work well for finding discrete heterogeneity only when the different conformations present have enough shared density structure that particles from any conformation can be reliably aligned to an average structure of all the conformations. This is generally the case for conformational changes induced by eg. ligand binding, opening/closing of channels, or association/dissociation of subunits. It is generally not the case when there are multiple entirely distinct molecules in the sample - for this case, it is recommended to start with a heterogeneous ab-initio reconstruction to separate grossly different conformations.
On the other hand, in the case of non-discrete or continuous flexibility, we expect to find a single cluster with significant variability in one or more dimensions. Those components of variability will correspond to the flexibility itself, and particles along the corresponding dimension of the reaction coordinates are in varying states of "flex". The flexibility itself can be visualized as a "movie" of volumes along the variability dimension. In this case, intermediate states all have significant probability and are physically realistic. It is important to note that in this case, the eigenvectors of the covariance, representing linear transformations of density, can not perfectly represent motion. They are, however, a good approximation to motion at resolutions that are approximately equal to the extent of motion. This means that for example, the motion of an alpha helix moving by 5Å can be well approximated by an eigenvector that is limited in resolution to 5Å. Likewise, the motion of a subunit by 40Å is well approximated by an eigenvector limited to 40Å resolution. This subtle point will be discussed further in the second part of this tutorial series.
We recommend using the
3D Variabilityjob when you suspect or are interested in investigating conformational variability in your dataset. Through this tutorial and the example results below, we can conclude that many molecules analyzed through single particle cryoEM contain at least some observable 3D variability.
- 1.If commencing with a new particle stack, try and clean up the data set as much as possible using
2D Classification. The goal is to remove obvious "junk" particles that may later affect the quality of reconstruction and refinement, but not to try and separate conformations.
- 2.Using the cleaned particle stack, run a multi-class
Number of Ab-initio classes≥ 2) in order to find and exclude grossly different structures that may be present in the data set (e.g., if you have two completely different molecules present). You may also observe some heterogeneity that is conformational variability of a single molecule (e.g,. a subunit moving/flexing or portions dissociating, etc.) - this is fine.
- 3.Combine all particles that have some amount of protein density in common (i.e. all conformational states) and perform a
NB. For now, you will need to set
Refinement box size (Voxels)to the full resolution/box size of the particles. If this is very large, it is better to use the
Downsample Particlesjob first to downsample the particles, do the consensus refinement again, and then apply 3D variability analysis. During the refinement, optionally enter a value to enforce
Symmetry(e.g., C4) if known.
- 1.Create a new
3D Variabilityjob and connect both the particles and the mask from the previous refinement.
NB. For membrane proteins it is particularly helpful to mask out the micelle because otherwise, almost all the observable variability will be of the micelle and not the encapsulated molecule. Similarly, for very flexible proteins, it is helpful to further dilate the mask from a refinement to ensure that the mask is wide enough to contain all the flexed versions of the molecule. This can be done using the Volume tools job in CryoSPARC.
- 1.Alternatively, if only interested in conformational variability of a specific region of the protein, import a different mask and connect this one to the
3D Variabilityjob to perform "focused variability".
Number of modes to solve: This is the number of components/eigenvectors that will be solved for. The default value is 3, and the maximum is limited by GPU memory and the box size. Generally more modes will take (linearly) more time to solve. One subtle but important point is that the algorithm will generally solve these in order of importance, so asking for more modes will not reduce the quality of the first, most important modes of variability. NB. in v2.9 there is a known bug where sometimes, due to numerical instability, using more modes can result in "streaking" and other artefacts in the results. We are working on fixing this.
Filter resolution (Å): This is the resolution at which eigenvectors are filtered during optimization. We recommend setting this to a value lower than the resolution of the consensus refinement performed previously (e.g., you may set it to 5A for a 3A refinement, but it can be experimented with). Using too high of a resolution will allow noise to dominate the determination of the eigenvectors.NB. Currently,
3D Variability(as well as the preceding refinement) must be run at the full resolution/box size of the particles. You can downsample particles before processing, or else downsample the results before downloading outputs, in the next step. Also, you cannot currently enforce symmetry during this job.
- 3.Run the job. It should take 30 mins to 1 hour for 3 components and 100K particles on 1 GPU.
- 2.In the meantime, create a
3D Variability Displayjob, connecting the particles and volumes from the
Output mode: There are several output modes that can be used for different purposes. NB. in v2.9.0 only the simple mode is exposed as the others are experimental. Simple mode: This will output a simple linear "movie" of volumes along each eigenvector. You can control the number of frames, and the starting/ending position as a percentile of the reaction coordinate distribution. The output will be a set of "volume series" which are essentially just a set of numbered
.mrcfiles that can be opened with Chimera (see below).
Number of frames: How many frames (per dimension) or clusters to output.
Downsample to box size: Downsample the volumes before writing, to save disk space.
Crop to size (after downsample): Crop the volumes after downsampling, to save disk space.
Only use these components: Select a subset of variability components to use for visualization or clustering. This can be entered as a comma-separated, 0-indexed list (e.g., '0,2,3'). Leave as None to use all the components computed by the 3DVA job.
- 3.Once completed,
3D Variability Displaywill output a series of frames for each component of the variability detected in the dataset. The outputs can be used to either download and view as movies in Chimera, or can be used to provide initial models and particle subsets for further classification and refinement. See the Visualization section below for detailed instructions about how to display the results in Chimera. Also see the second part of this tutorial series for more advanced usage and interpretation.
The 80S Ribosome dataset contains 100K particles. It is a perfect example of a molecule that has many types of variability that are simultaneously occurring in the dataset. The results here were obtained by first performing a consensus refinement of the dataset reaching 3Å resolution. Next, the particles and mask were used in a 3D variability job with 5 modes. This yielded 5 different eigenvectors and the reaction coordinates of particles along each one:
The figure below shows a projection of each of the 5 modes of variability (eigenvectors) that were solved. Each one contains both black (negative) and white (positive) regions that correspond with density changes.
The figure below shows the reaction coordinate distribution of particles, as scatter plots between adjacent pairs of components (0 vs 1, 1 vs 2, 2 vs 3, etc). Clearly, the first component (0) has the most variability. This corresponds to the ratcheting motion of the small subunit.
The following videos show, from the same view, the first two eigenvectors of variability in the 80S ribosome dataset. These correspond with ratcheting of the small subunit, and rotation of the head region of the small subunit in a transverse direction.
The Nav1.7 channel is an example of a dataset where the first component of variability (pictured below) is a flexible conformational change. Notably, the resolution of this result is high, and the motion of individual helices can be seen at a resolution where side-chains and helical pitch is identifiable. Note that this result was computed using particles from an already-classified subset of the data, containing only the active state of the channel. This was done to emphasize the flexible degrees of variability rather than the discrete channel opening-closing variability. Note also that this result was computed using a mask that excludes the micelle. 3 modes were solved in this case.
Some molecules that have always been assumed to be quite rigid, like the T20S Proteasome, actually do have flexible variability as well! The video below shows the first mode of variability of the proteasome data, computed from 50K particles. In this mode, there appears to be a slight lengthening and twisting of the proteasome molecule.
2. In the volume series tool, click "Open..." and multi-select the series of
frame_XXX.mrcfiles from a single component. This will open them all together as a series.
3. Now in the Volume Series tool there are some options for displaying. Use these options:
- When you set the Data cache size, you have to press "Enter" after typing 1024.0 in the box, then check if it worked by clicking "Current use" to see how much space is allocated for cache.
- Make sure the number of cached renderings is at least as many as the number of frames (20 by default).
- Oscillate mode will make the frames loop back and forth, best way to show the motion.
- The "Normalize threshold levels" option is nice because it adjusts the threshold for each frame so that the total volume enclosed by the mesh from each frame is the same.
- Once you have these settings set, adjust the step and threhsold in the volume viewer so that the initial frame looks good:
- Then you can press "Play". The frames will play slowly at first, but after the first pass through they will speed up a lot because the 3D mesh is cached.
We recommend the following settings:
set bgColor whiteand a colour of
To record a movie use these commands:
windowsize 1200 1200
vseries open *.mrc
vol #0 step 1
vseries play #0 direction oscillate loop false normalize true cacheFrames 30
vseries stop #0
movie encode ~/Desktop/3DVA/result.mpg quality higher