Job: Ab-Initio Reconstruction

At a Glance

Generate a number of initial 3D models directly from particle images without requiring an input model. Ab initio reconstruction is unique among 3D jobs in that it does not require input particle poses or starting volumes — the job generates a 3D volume from the beginning (hence ab initio).

Inputs

Particles

Note that when ab initio reconstruction is used to generate a single volume from a large number of particles, some particles may be omitted from the reconstruction as the algorithm does not need to see all particles to arrive at a converged reconstruction. The particles used to generate the volume are output in a separate group than those that were used. The particles that are used are selected randomly from the entire dataset.

Outputs

This job creates the requested number of volumes and a particle stack for each. If only one volume is requested, an additional particle output contining unused particles is also created.

Commonly Adjusted Parameters

Number of classes

Increasing this number generates more volumes. In an ideal (clean) particle stack, this parameter should match the number of distinct structures present. In practice, providing a small number of classes into which junk particles are classified can improve results.

Number of particles to use

Force the job to use this many particles (from the beginning of the stack — not random) to generate volumes. Results will only contain this many particles as well.

We generally recommend that this is left blank (i.e., use all particles) unless there are a great number of input particles, in which case setting this to a lower number will speed results.

Maximum resolution

Use information up to and including this resolution during reconstruction.

In highly-symmetric particles or small membrane proteins it can help to use higher resolutions (i.e., lower numbers), but keep in mind that ab initio reconstruction does not use gold-standard half sets and so can be sensitive to overfitting.

Initial/final minibatch size

Sets the number of particles that are included in each online-EM iteration.

Increasing this parameter increases memory demand and slows down each iteration, but may improve results with highly symmetric targets, small targets, or noisy images.

Symmetry

Symmetry to impose on the volumes.

We strongly recommend that this option is kept at C1 unless results show flattened, disc-like volumes and a different source of information supports a higher symmetry. In particular, for proteins with suspected octahedral, tetrahedral, or icosahedral symmetry, it may be necessary to impose symmetry to avoid converging to a flattened model.

Class similarity

A higher value for this parameter forces particle assignment to blur between classes during the early iterations. If particles are expected to look very similar, this blurring encourages the algorithm to initially create several similar classes which then can diverge during iterations. Without blurring, all particles which appeared similar would be placed in a single class, preventing the desired classification.

The two Class similarity anneal parameters control when the algorithm begins moving from this input similarity value to the value detected from the data.

Common Uses

The most common use cases for ab initio reconstruction are to perform initial cleaning of a particle stack in 3D and to generate one or more starting volumes for future refinements. Both of these goals can be achieved with a single ab initio reconstruction job by optimizing the number of requested classes. This typically requires launching several jobs, each with a different number of classes, and inspecting the results.

It has been reported that performing extensive particle cleaning in 2D (i.e., by repeatedly selecting 2D classes which look clean) can sometimes result in a loss of rare or difficult-to-distinguish particle views. Cleaning in 3D seems to be less susceptible to this problem.

Common Problems

Ab initio reconstruction may fail when the particle stack has extreme orientation bias. Rebalance 2D Classes can help reduce bias in these cases to generate the initial model.

Highly symmetric particles occasionally produce flattened, disc-like volumes in ab initio reconstruction. These are the only cases in which we recommend imposing symmetry on the ab initio volumes.

Ab initio reconstruction does not use half-sets or any other type of regularization. It can thus be susceptible to overfitting at high resolutions or with particles with low signal-to-noise ratios, and so we do not recommend that it is used to generate higher-resolution models.

When a particle stack is believed to be relatively free of junk, Heterogeneous Refinement can be a better tool for detecting distinct conformations or compositions of the target.

Next Steps

Heterogeneous Refinement is useful for further cleaning, or to detect conformational or compositional differences in the target.

Homogeneous Refinement or Non-Uniform Refinement jobs are useful for creating a single, “consensus” refinement of the particles at a high resolution.

Implementation Details

For a full description of CryoSPARC’s ab initio reconstruction algorithm, see the accompanying paper.

The problem of volume generation in single particle analysis can be stated in broad terms as follows: given a set of input images, what volume most likely produced those images? This question can be further subdivided into two parts:

  • Given a volume and a set of images, what is the most likely pose of the volume in each image?

  • Given a set of images and a set of orientations, how can we improve the volume to make the images more likely?

In cryo-EM literature, methods that seek the most likely density that generate the particle images are referred to as Maximum Likelihood (ML) or Maximum A Posteriori (MAP) methods. All 2D and 3D reconstruction algorithms in CryoSPARC fall under the umbrella of ML methods, because they seek to maximize likelihood in this way.

What is a pose?

In cryoEM image processing, a pose is a set of five numbers: three describe the rotation of the volume in 3D space, and two describe the X- and Y-shifts of the volume (to account for imperfect centering of the extracted particle image). Searching for the correct pose is thus a 5D problem that needs to be solved for each particle image, and therefore can become very computationally expensive.

CryoSPARC’s ab initio algorithm approaches the first part using an algorithm called branch and bound (BnB), and the second part using an algorithm called stochastic gradient descent (SGD).

Branch and Bound

To find the most likely pose, one could create a 5D grid of finely-spaced points and check all of those poses for each particle image. This can be prohibitively slow. Branch and bound creates the same grid of finely-spaced points, where each point is associated with a point from a coarser grid via a subdivision scheme. This builds a tree of branching points at which to calculate how well the volume at the given pose fits the image.

The search for the correct pose starts at the coarsest grid. It then proceeds to the next coarsest grid, but only at points which branch from points that were above a certain probability threshold in the previous iteration. The selection of this threshold is the bound portion of branch and bound and is described in detail in Punjani et al. 2017.

Every pose of the volume has some error associated with it. The higher this error, the less likely the pose is correct. To find the optimal pose, one could calculate this error for every pose and pick the pose with the lowest error. However, the true error is extremely expensive to compute.

We therefore instead calculate a lower bound on the error. This function is orders of magnitude faster to compute and gives us the best-case scenario — we know that the true error cannot possibly be smaller than this lower bound.

We calculate this lower bound for each pose. We then calculate the expensive true error only once, for the pose with the lowest lower bound (put another way, the pose which has the possibility for the lowest error). This pose’s true error is almost certainly higher than this lower bound, but it also has a chance at being the best pose as measured by the true error.

We can then compare each pose’s best case to this true error. Any pose with a lower bound that has higher error than this calculated error cannot possibly be the best pose — we have already found a better pose than its best case. We can thus discard this pose and all of its following branched poses from future iterations.

Thus, the search for the optimal pose for a given image is accelerated by the two prongs of the branch and bound algorithm:

  • the branching structure enables us to rule out a large number of poses without ever calculating any error for them

  • the cheap lower bound lets us rule out a large number of poses without having to calculate their expensive true error

The branch-and-bound algorithm is so fast that it is used for all pose searches in CryoSPARC, not just ab initio reconstruction.

What exactly is this inexpensive lower bound?

Although it is easy to understand what branch and bound is doing, it is not obvious how exactly one might calculate a lower bound on possible error. Full details are available in the paper’s supplementary information, but briefly, error is first split into two parts: error due to signal above a critical frequency L and error due to signal above that frequency. The low-resolution error is relatively easy to calculate and is computed directly. The high-frequency error is bounded by further splitting it into a combination of error due to noise and the worst possible mismatch the reference volume could create.

Stochastic Gradient Descent

Once the optimal poses for all particles have been found, the information in the particle images can be used to improve the volume. In a homogeneous refinement, the existing volume would be updated to the most-likely result given the new particle poses. This algorithm is called Maximum Likelihood (often referred to as ML in the literature) or Maximum a posteriori (often MAP in the literature), and the function used to update the volume is called the gradient.

Maximum likelihood methods work well when the starting volume is close to correct. However, ab initio volumes are likely quite poor, especially in early iterations when the volume is essentially just noise. Poor starting volumes are unlikely to find the correct solution using maximum likelihood approaches because these techniques can only search locally — they will get stuck in suboptimal structures from where there is no clear direction that will improve the quality of the map. To overcome this issue, ab initio reconstruction uses stochastic gradient descent.

In stochastic gradient descent, the gradient is calculated using only a very small subset of particle images. The current volume is then updated according to this gradient. Importantly, the noise in the minibatch sampling and the approximate nature of the gradient creates noise in both the direction and distance of the step the algorithm takes. This noise helps the volume escape local optima and settle into a global optimum near to the correct structure. Thus, stochastic gradient descent can provide good initial volumes for maximum likelihood methods to optimize.

References

  1. Ali Punjani et al., “cryoSPARC: Algorithms for Rapid Unsupervised Cryo-EM Structure Determination,” Nature Methods 14 (06/online 2017): 290, https://doi.org/10.1038/nmeth.4169

Last updated