Tutorial: 3D Classification (BETA)
3D Classification is a new way to perform discrete heterogeneity analysis in CryoSPARC in a manner that complements Heterogeneous Refinement.

Introduction

3D Classification (BETA) is a tool in CryoSPARC v3.3+ that provides a way to distinguish discrete heterogeneous states, or classes, from single particle cryo-EM data. Specifically, this job currently implements 3D classification without alignment (i.e., without re-estimating particle orientations or shifts) through a hybrid online and batch expectation-maximization algorithm.
By avoiding the computationally burdensome task of realigning particles and by relying on higher-order interpolation rather than zero-padding of volumes, 3D Classification can separate particles into a large number of classes for further downstream analysis quickly and without very large GPU memory requirements. Furthermore, unlike Heterogeneous Refinement, this job does not require any 3D maps for initialization. Instead, we provide two initialization modes that can 'bootstrap' reasonable initial volumes via back-projection. Finally, we also allow a (soft) mask input to 'focus' the classification on a specific region of heterogeneity, ignoring variation that may be present elsewhere.
Example classes on the EMPIAR 10077 dataset.

Usage

The 3D Classification job has one primary input requirement: particles with orientations and shifts in the alignments3D key.
A typical pipeline may look as follows:
  1. Particle extraction, 2D classification, motion correction, CTF computation, etc.
    • Make sure to remove as many 'junk' particles as possible, though it may also be feasible to use the 3D Classification job itself to identify junk classes and remove associated particles;
  2. (Single- or multi-class) Ab-initio reconstruction;
  3. (Optional) Homogeneous refinement to find improved alignments for a final set of particles;
  4. (Optional) Mask generation:
    • Mask from Homogeneous refinement;
  5. 3D Classification (BETA) with particles from step 2 or 3 and a mask from step 4;
  6. Further refinement of a subset of classes.
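For users who script their workflows, the final classification step of this pipeline can also be queued programmatically with the cryosparc-tools Python package. The sketch below is a hedged illustration only: the credentials, project/workspace/job UIDs, the "class_3D" job-type key, and the input slot and output group names are assumptions that should be verified against the job builder in your own CryoSPARC instance.

```python
# Hedged sketch: queueing 3D Classification (BETA) with cryosparc-tools.
# UIDs, the job-type key, and slot/output names below are assumptions; confirm
# them in your own CryoSPARC instance before running.
from cryosparc.tools import CryoSPARC

cs = CryoSPARC(
    license="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",  # placeholder license ID
    host="localhost",
    base_port=39000,
    email="user@example.com",
    password="password",
)

project = cs.find_project("P1")  # hypothetical project UID

# Assume steps 1-4 above are complete: "J11" holds refined particles (with
# alignments3D) and "J12" holds a softened focus mask.
job = project.create_job(
    "W1",                                   # hypothetical workspace UID
    "class_3D",                             # assumed job-type key for 3D Classification (BETA)
    connections={
        "particles": ("J11", "particles"),  # assumed output group name
        "mask": ("J12", "mask"),            # assumed output group name
    },
)
job.queue("default")                        # scheduler lane configured on your instance
```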

Salient parameters

  • Number of classes
    • 5-100+. Unlike Heterogeneous Refinement, this job can quickly classify large datasets into a large number of classes. For example, in the EMPIAR-10077 example below, we classify a ~1.2 million particle dataset into 100 classes in approximately 8 hours on the CryoSPARC testing hardware.
  • Target resolution
    • 2-10Å. This will define the 3D map box size and the low-pass filtering performed during classification (filtering will be done up to approximately the Nyquist limit based on this box size). The resolution should represent a reasonable size scale at which the heterogeneity is expected, while being as coarse (i.e., as large an Å value) as possible to exclude high-resolution noise. Computation time will also increase as the target resolution becomes finer.
  • Initialization mode
    • See the Important considerations below.
  • Batch size per class
    • This parameter, multiplied by the number of classes, defines the number of particles used in each online expectation-maximization (O-EM) iteration.
  • Auto-tune initial class similarity & Target class ESS fraction
    • When Auto-tune initial class similarity is set to True, the job will automatically tune the initial class similarity so that, at the start of classification, the average particle has a significant likelihood of belonging to multiple classes (characterized by the Effective Sample Size, or ESS, of the probability distribution over classes). The Target class ESS fraction parameter defines the target ESS as a fraction of the total number of classes, and can typically be set between 0.1 and 0.9. A short sketch after this list illustrates how these parameters relate numerically.
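To make the relationships between these parameters concrete, the following is a small, standalone Python sketch. It assumes (as described above) that the classification box is downsampled so that its Nyquist limit roughly matches the target resolution; the exact box-size logic inside the job may differ, and all numeric values are examples only.

```python
# Rough arithmetic relating the salient parameters above (illustrative only).
K = 100                  # Number of classes
target_res = 6.0         # Target resolution (Angstroms)
batch_per_class = 1000   # Batch size per class (example value)
ess_fraction = 0.25      # Target class ESS fraction (example value)

# Original extraction geometry (example values).
orig_box = 360           # extracted particle box size (pixels)
orig_psize = 1.0         # pixel size (Angstroms/pixel)

# Approximate classification box: downsample so that Nyquist ~ target resolution,
# i.e. an effective pixel size of about target_res / 2. The job's internal choice
# may differ (e.g. rounding to an FFT-friendly size).
approx_psize = target_res / 2.0
approx_box = int(round(orig_box * orig_psize / approx_psize))

# Particles consumed per online expectation-maximization (O-EM) iteration.
particles_per_iter = batch_per_class * K

# Target batch-averaged class ESS implied by the ESS fraction.
target_ess = ess_fraction * K

print(f"Approximate classification box: {approx_box} px at ~{approx_psize:.2f} A/px")
print(f"Particles per O-EM iteration:   {particles_per_iter}")
print(f"Target mean class ESS:          {target_ess:.1f} of {K} classes")
```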

Important considerations

  • Source of alignments3D
    • This job can classify any input particles with the alignments3D key set (e.g., from an ab-initio job, from homogeneous refinement, from an imported EMPIAR particle dataset, etc.); however, the quality of the alignments may affect the resulting classes.
  • Initialization modes
    • Simple mode
      • In simple mode, the job will back-project K initial volumes by randomly selecting K sets with N particles each (where N is set by the parameter PCA/simple: particles per reconstruction).
    • PCA mode
      • In PCA (Principal Component Analysis) mode, the job will do the following for K classes (a minimal sketch of this procedure appears after this list):
        1. Select M >> K subsets of N particle images each. M is set by PCA: number of reconstructions;
        2. Reconstruct M volumes;
        3. Apply a Principal Component Analysis (PCA) decomposition on the space of 3D voxels, with the number of components set by PCA: number of components;
        4. In principal component space, cluster the M volumes into K clusters;
        5. Average all volumes in each cluster. Use the resulting K volumes as initial structures for classification.
    • Input mode
      • In input mode, the job will use K input volumes as initial structures for classification. Note that these must be distinct volumes; the job will throw an error if they are not.
  • Effective sample size (ESS) and soft class assignments
    • The ESS is a simple measure of the extent to which a discrete probability distribution is 'dispersed.' In 3D Classification, the class ESS is evaluated over the posterior of class assignments for each particle. Numerically, the per-particle class ESS is equal to the inverse of the sum of squared class probabilities and ranges from 1 to the number of classes. A value near the number of classes indicates that the class posterior is near a uniform distribution, while a value near 1 represents a 'hard' selection of a single class (a small numerical sketch appears after this list).
    • We use a batch-averaged class ESS for initial class similarity tuning and also report this number after each iteration. When this metric approaches one, the classification has likely converged.
  • Weighted back-projection and further refinement
    • Please note that the resulting output volumes are constructed using a weighted back-projection with weights on each particle defined based on the class posterior. This means that although the output class particle sets are disjoint, each particle may contribute to multiple (or all) volumes. When the dataset-wide mean class ESS is near 1, this effect is minimized. Nevertheless, the volumes themselves are primarily useful as a visualization of each class, and further refinement should be done on the relevant particle sets for a final reconstruction.
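To make the PCA initialization procedure described above concrete, below is a minimal sketch using NumPy and scikit-learn. It assumes the M back-projected volumes are already available as voxel arrays and mirrors the listed steps (PCA on voxels, clustering in principal-component space, per-cluster averaging); it is not the job's actual implementation, and the toy sizes are for illustration only.

```python
# Minimal sketch of the PCA initialization mode (steps 2-5 above), assuming
# M reconstructed volumes are already available. Not the job's actual code.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

M, K, C = 32, 4, 2       # reconstructions, classes, PCA components (toy values)
box = 16                 # toy box size; real classification boxes are much larger

rng = np.random.default_rng(0)
volumes = rng.standard_normal((M, box, box, box))   # stand-in for M back-projections

# Step 3: PCA decomposition on the space of 3D voxels.
X = volumes.reshape(M, -1)                           # flatten each volume to a voxel vector
coords = PCA(n_components=C).fit_transform(X)

# Step 4: cluster the M volumes into K clusters in principal-component space.
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(coords)

# Step 5: average the volumes within each cluster to form K initial structures.
initial_structures = np.stack([volumes[labels == k].mean(axis=0) for k in range(K)])
print(initial_structures.shape)  # (K, box, box, box)
```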
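The class ESS and the posterior-weighted back-projection described in the last two items both follow directly from the per-particle class posterior. The NumPy snippet below is a small numerical illustration of those definitions (not the job's internal code): it computes the per-particle ESS as the inverse sum of squared class probabilities, the batch-averaged ESS and ESS fraction, and the per-class back-projection weights.

```python
# Per-particle class ESS and back-projection weights from the class posterior
# (illustrative NumPy sketch of the definitions above, not the job's internals).
import numpy as np

K = 4
# Example class posteriors for three particles (each row sums to 1).
posterior = np.array([
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain -> ESS = K = 4
    [0.70, 0.10, 0.10, 0.10],  # leaning toward class 0
    [0.97, 0.01, 0.01, 0.01],  # nearly 'hard' assignment -> ESS close to 1
])

# ESS_i = 1 / sum_k p_ik^2, ranging from 1 (hard assignment) to K (uniform).
ess = 1.0 / np.sum(posterior ** 2, axis=1)
print("per-particle ESS:", np.round(ess, 2))
print("mean class ESS:  ", round(float(ess.mean()), 2))
print("ESS fraction:    ", round(float(ess.mean()) / K, 2))

# Weighted back-projection: each particle contributes to class k's volume with
# weight p_ik, so a particle with high ESS contributes to several class volumes.
weights_for_class_0 = posterior[:, 0]
print("class-0 weights: ", weights_for_class_0)
```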

Example Results

We report two examples of 3D classification evaluated on two different datasets. The first includes more than a million particles and 100 classes, while the second uses approximately 66,000 particle images but also includes pseudo-symmetry and focussed classification with a generated soft mask.
Ribosome with selenocysteine delivery in E. coli
This dataset captures an E. coli ribosome complex bound by a ligand (the elongation factor SelB, which delivers selenocysteine). In the original publication, 6 distinct states (see the figure below) were found.
Figure 1a from Fischer et al. (2016).
After pre-processing the raw data, extracting approximately 1.2 million particles, and running a consensus refinement in CryoSPARC Live, we applied the 3D Classification (BETA) job with 100 classes. The resulting classes (visualized in ChimeraX using the vseries play command) are shown below:
In these classes, we find all six states reported in the original publication and several other potentially relevant states of heterogeneity.
3D Classification (BETA) Inputs:
Particles: 1,194,964 particles (alignments3D from the Homogeneous refinement job)
Volumes: None
Mask: Generated using ChimeraX to include the ribosome and the ligand. Softened with Volume Tools.
Other salient parameters:
Number of classes: 100
Target resolution: 6Å
Initialization mode: PCA
TRPV5 with calmodulin bound
This dataset was suggested on the CryoSPARC Discussion Forum as an example where 3D Classification has been shown to be useful. In the original publication, Dang et al. (2019) report results from 3D classification (without alignment) using RELION. After performing a consensus refinement, the authors apply a 1D search over a single rotation angle to break the initial C4 symmetry and then apply a focussed classification on two binding sites. The figure below presents the three classes the authors obtained using this procedure.
Figure S9 from the supplementary material of Dang et al. (2019). Three different binding configurations of TRPV5 and calmodulin.
In our testing, we simplified this procedure to the following:
  1. Import 66,071 particles from the EMPIAR entry (these include alignments3D);
  2. Run a Homogeneous Reconstruction Only job to obtain a consensus reconstruction;
  3. Generate a mask that focuses on the interior of the TRPV5 channel based on the reconstruction in step 2; and finally
  4. Run a 3D Classification (BETA) job with 24 classes (without any form of symmetry expansion).

Consensus reconstruction

After the Homogeneous Reconstruction Only job, we obtain the following 3D map:
Homogeneous Reconstruction Only job output on EMPIAR 10256
EMPIAR 10256 reconstruction visualized with ChimeraX (level set such that the micelle is not visible).

3D Classification

Inputs:
Particles: 66,071 from particles.star in EMPIAR-10256
Volumes: None
Mask:
A soft mask that focusses on the interior of the TRPV5 channel and excludes the micelle. For more information on how to generate this type of mask, please see the mask generation tutorial.
We generated the mask for the interior of the TRPV5 channel with the ChimeraX Map Eraser tool.
Other salient parameters:
Number of classes: 24
Target resolution: 6Å
Initialization mode: PCA
24 classes from focussed 3D Classification on the EMPIAR 10256 dataset.
The resulting 24 classes appear to contain the three classes reported in the figure above, along with several other heterogeneous states (including symmetry-related states) suitable for further refinement.

Citations

Fischer, N., Neumann, P., Bock, L. et al. The pathway to GTPase activation of elongation factor SelB on the ribosome. Nature 540, 80–85 (2016).
Dang, S., van Goor, M. K., Asarnow, D., Wang, Y., Julius, D., Cheng, Y., & van der Wijst, J. Structural insight into TRPV5 channel function and modulation. Proceedings of the National Academy of Sciences 116(18), 8869-8878 (2019).