Job: Subset Particles by Statistic
Last updated
Split particles into groups based on the value of a statistic.
Often, some statistic of an alignment or classification (such as per-particle scale or the distance a particle shifted during an alignment) indicates that several subpopulations exist in a particle stack. Subset Particles by Statistic simplifies the process of separating these populations, either by modeling the selected statistic using a Gaussian mixture model (GMM) with a selected number of components, or with manually entered thresholds.
Subset Particles by Statistic can model the distribution of a statistic for a particular particle stack, or it can model the distribution of the difference in a statistic. When operating on a single statistic, just one Particles input exists. When operating on a difference, the Particles input disappears and is replaced with the Initial Particles and Final Particles inputs.
When operating on a single statistic, such as “Per-particle scale”, only this input is available.
The particles are separated using the statistic selected in the Subset by parameter. Output particles will have the same metadata as this input (e.g., the pose will remain unchanged by the filtering operation).
When operating on a difference statistic, such as “Absolute difference in 3D shift (A)”, only this pair of inputs is available.
First, the absolute difference in the selected statistic is calculated between the Initial Particles and Final Particles. Taking Absolute difference in 3D shift (A) as an example, if one particle had a difference of (0, 1) between the two particle stacks and another had a difference of (-1, 0), both would be at the 1.0 position of the histogram.
The particles are then separated by the distribution of these differences. When particle groups are output, they inherit their metadata (e.g., poses) from the Final Particles input.
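As a rough sketch of the calculation above, the shift-difference statistic is just the Euclidean distance between each particle's shift in the two stacks, scaled to Angstroms. This is plain NumPy with made-up values; the array names are illustrative, not CryoSPARC fields:

```python
import numpy as np

# Hypothetical 2D shifts (in pixels) for the same three particles in the
# initial and final particle stacks, plus an assumed pixel size in Angstroms.
initial_shift = np.array([[0.0, 0.0], [2.0, 3.0], [1.0, 1.0]])
final_shift   = np.array([[0.0, 1.0], [1.0, 3.0], [1.0, 1.0]])
psize_A = 1.0

# Absolute difference in 3D shift (A): Euclidean distance between the two
# shifts. Differences of (0, 1) and (-1, 0) both land at 1.0 on the histogram.
diff_A = psize_A * np.linalg.norm(final_shift - initial_shift, axis=1)
```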
This parameter selects the statistic by which particles will be separated into groups.
Per-particle scale
Per-particle scale accounts for local variations in greyscale and is typically taken as a proxy for ice thickness, but would also be affected by overall particle quality and other factors. Note that if the volume is highly anisotropic due to orientation bias, the per-particle scale may be unreliable.
Particle picking NCC score
Particle picking power score
Average defocus (A)
2D alignment error
The 2D alignment error is a measure of the mismatch between the particle image and the 2D class average to which it is aligned, in the particle’s optimal pose. While a higher 2D alignment error means the particle is a poorer match for its class average than other particles, this may be because the particle is low quality or because the class average represents a slightly different viewing direction. Thus, when 3D alignments are available, 3D alignment error should be preferred.
3D alignment error
The 3D alignment error is a measure of the mismatch between the particle image and the projection of the volume in the particle’s optimal pose. A higher error may generally correlate with poorer images, but the error is also affected by the volume’s overall quality.
Class probability - 2D
Class probability - 3D
Class ESS - 2D
The Effective Sample Size is an alternate measure of the confidence in a particle’s class assignment. Higher ESS values indicate lower confidence in the particle’s class. An ESS of 1.0 indicates that a particle is only a member of a single class. ESS modes typically work best with manual thresholds.
Class ESS - 3D
The Effective Sample Size is an alternate measure of the confidence in a particle’s class assignment. Higher ESS values indicate lower confidence in the particle’s class. An ESS of 1.0 indicates that a particle is only a member of a single class. ESS modes typically work best with manual thresholds.
Total motion (A)
Total motion measures the distance the particle moved while the movie was collected. More motion is generally considered to correlate with poorer images due to blurring.
X location (fraction)
The X location is a value ranging from 0.0 to 1.0, with 0.0 representing the left-most pixel in a micrograph and 1.0 the right-most.
Y location (fraction)
The Y location is a value ranging from 0.0 to 1.0, with 0.0 representing the bottom-most pixel in a micrograph and 1.0 the top-most.
Helical tilt angle (deg)
The helical tilt angle is only meaningful for helical proteins. It measures the angle of deviation of the particle image relative to a vertical helix.
3DVA component X
Absolute difference in 3D pose (deg)
The difference in 3D pose is the absolute difference in degrees between a particle’s pose in the initial and final datasets. Note that this single value takes into account the particle’s rotation in all three degrees of freedom.
Absolute difference in 3D shift (A)
Absolute difference in 2D pose (deg)
The difference in 2D pose is the absolute difference in 2D rotation between the initial and final datasets.
Absolute difference in 2D shift (A)
The difference in 2D shift is the absolute difference in X/Y shift, refined by 2D Classification, between the initial and final datasets.
Absolute difference in average defocus (A)
These parameters only appear when Subset by is set to “Class probability - 3D”.
Class indices selects classes by index (starting with 0) over which the probability should be evaluated. Class probability filtering mode controls whether the particles should be retained if the sum or the maximum of the probabilities in the selected classes is greater than the threshold.
3DVA components are never removed from a particle stack, only updated.
Consider a workflow in which you first run a 3DVA job with 4 components, then run a 3DVA job on the resulting particle stack with only 2 components. In this case, components 0 and 1 refer to the coordinates from the second job, but components 2 and 3 refer to the coordinates from the first job.
If Curation mode is set to “Cluster by gaussian fitting”, a Gaussian mixture model (GMM) will be fit to the statistic selected with Subset by. The number of Gaussian components in the GMM is set by Number of gaussians, which only appears in this curation mode. Particles are then output in sets corresponding to the Gaussian component for which they have the highest probability.
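The gaussian-fitting mode can be approximated outside CryoSPARC with scikit-learn's GaussianMixture. This is a simplified sketch on synthetic per-particle scales, not the job's exact implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic per-particle scales drawn from two well-separated populations.
scales = np.concatenate([rng.normal(0.8, 0.05, 500),
                         rng.normal(1.1, 0.05, 500)])

# Fit a GMM with the chosen number of components ("Number of gaussians").
gmm = GaussianMixture(n_components=2, random_state=0).fit(scales.reshape(-1, 1))

# Each particle is assigned to the component with the highest posterior
# probability, mirroring how particles are divided into output sets.
labels = gmm.predict(scales.reshape(-1, 1))
```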
If Curation mode is set to “Split by manual thresholds”, the user manually selects thresholds which divide the particle groups. The Number of thresholds parameter creates the selected number of Threshold N parameters, which are then used to divide the particles.
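A minimal sketch of threshold-based splitting (with hypothetical values): NumPy's digitize mirrors the way N thresholds create N+1 output sets.

```python
import numpy as np

# Hypothetical statistic values and two manually chosen thresholds
# (Number of thresholds = 2), which create three output groups.
values = np.array([0.62, 0.81, 0.95, 1.04, 1.20])
thresholds = [0.75, 1.0]

# np.digitize returns 0 for values below the first threshold, 1 for values
# between the two thresholds, and 2 above the second -- i.e., the set index.
groups = np.digitize(values, thresholds)
```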
This parameter sets the minimum probability a particle must have to be included in the output. A particle has a 0% probability of belonging to a cluster when we are certain it does not belong to the cluster, and a 100% probability when we are certain it does. In practice, particles will essentially never have a 0% probability but may have 100% probability.
For example, consider a particle stack with the following distribution of per-particle scales:
This distribution of per-particle scales looks like it has three distinct groups, so we will use a GMM with three Gaussians. The resulting model looks like this:
We can use this GMM to assign each particle to either the red, pink, or green group, depending on the height of the Gaussians at that position. We can then use these probabilities to split the particles into groups. Each particle is assigned into the group with the highest probability at that particle’s per-particle scale. This would result in the following groups:
With this technique, we keep all particles for future use. However, some particles have an equal probability of belonging to two Gaussian components. In this example, particles with per-particle scales around 1.0 are equally likely to belong to the red and pink components.
In some cases, it may be preferable to remove particles with an uncertain group assignment. This is what the Minimum probability parameter does. In this example, say we set the Minimum probability to 80%.
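Conceptually, the Minimum probability filter keeps a particle only if its best component posterior clears the threshold. This is a toy sketch with a hand-made posterior matrix, not CryoSPARC's internal code:

```python
import numpy as np

# Hypothetical GMM posteriors (rows: particles, columns: components).
posteriors = np.array([
    [0.98, 0.02],   # confidently component 0
    [0.55, 0.45],   # ambiguous -- near the crossover between components
    [0.10, 0.90],   # confidently component 1
])

min_probability = 0.80  # the "Minimum probability" parameter
best = posteriors.max(axis=1)

# Keep only particles whose best-component probability clears the threshold;
# the ambiguous middle particle is discarded.
kept = np.flatnonzero(best >= min_probability)
```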
The separated input particles are output in N sets, depending on the selection of thresholds or GMM components. Sets are numbered from low to high, with set 0 having the lowest mean value for the selected statistic.
Note that if the selected statistic is a difference statistic, the metadata (including pose, CTF estimate, etc.) will come from the Final Particles input.
The next steps will depend on the selected statistic, but this job is often used to select a subset of interest for further refinement.
Separating particles by per-particle scale and removing the low-value group can improve refinement results in some cases. See the Subsetting By Per-Particle Scale section below for an example of this use case.
In this example, we performed a Non-Uniform Refinement of 241,210 TRPV1 particles with C4 symmetry and Minimize over per-particle scale turned on. The resulting map had a GSFSC resolution of 2.69 Å.
The per-particle scales show a clear bi-modal distribution. Typically, per-particle scales are expected to be normally distributed about 1.0.
One explanation for the bimodal distribution is there are two populations of particles, one in thick ice and one in thin ice. The particles in thick ice would need a higher per-particle scale to have the same greyscale as the particles in thin ice, giving rise to a bimodal distribution.
However, it is also possible that junk or low-quality particles do not match the TRPV1 volume very well even in their best orientation, and so are assigned a lower per-particle scale to reduce their overall error. If this is the case, filtering out these particles will yield a better reconstruction.
We can use Subset Particles by Statistic to separate the particles into two groups with a Gaussian Mixture Model. Subset Particles by Statistic also plots the viewing direction distribution of each set, which we can check to ensure that the different scales do not simply correlate with a specific viewing direction.
Although there is some difference between the two viewing direction plots, there’s no clear orientation bias in either one. We can now refine each of these sets separately and evaluate the particle quality based on the final map. Since cluster 1 contains more particles than cluster 0, a random subset of the particles from cluster 1 will be used to make the comparison even.
With the same number of particles in each refinement, the high-scale particles produce a significantly better map. It seems likely that the low scale factors in the original refinement were, in fact, accounting for remaining junk particles in the stack. Note also that both particle stacks’ per-particle scales have recentered around 1.0 — the actual value of the per-particle scale is a function of the overall range of the greyscale across the entire particle stack.
For example, consider a 3D Classification with 3 classes. A particle with probabilities of [0.8, 0.1, 0.1] for the three classes would be assigned to class 0 and included in the class 0 output. However, a particle with probabilities of [0.34, 0.33, 0.33] would also be assigned to class 0, even though the assignment of this second particle is far less confident than that of the first.
Class indices
0
Curation mode
Split by manual thresholds
Threshold 1
0.75
In this example, class 0 is good. The threshold is set to 0.75, and Class probability filtering mode is set to “sum”.
If instead classes 0 and 1 were both good, Class indices could instead be set to 0, 1. Particles would then be retained if the sum of their probabilities of belonging to class 0 and class 1 were greater than 0.75. In this case, particles with probabilities [0.8, 0.1, 0.1] or [0.5, 0.4, 0.1] would be kept. This makes sense if both class 0 and class 1 are high-quality classes of the same volume: particles may not be confidently assigned to one of the two, but they are definitely good if their probability is split between those two classes.
In this example, classes 0 and 1 are good, but represent different particles. The threshold is still set to 0.75, but now the Class probability filtering mode is set to “max”. Thus, only particles confidently assigned to one or the other good class are retained.
Now, suppose class 0 and class 1 are both good, but they represent different targets. Good particles should be confidently assigned to one of the two classes; particles which are split between two different-but-good volumes are likely low quality. In this case, Class probability filtering mode should be set to “max”. Now, the [0.8, 0.1, 0.1] particle will be kept but the [0.5, 0.4, 0.1] particle rejected.
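The difference between the two filtering modes can be sketched with the probabilities from the example above (hypothetical arrays, not the job's implementation):

```python
import numpy as np

# Hypothetical 3D class posteriors for two particles over three classes.
posteriors = np.array([
    [0.8, 0.1, 0.1],
    [0.5, 0.4, 0.1],
])
class_indices = [0, 1]   # "Class indices" parameter
threshold = 0.75         # "Threshold 1"

selected = posteriors[:, class_indices]
# "sum" mode keeps both particles (0.9 and 0.9 both exceed 0.75);
# "max" mode keeps only the first (0.8 passes, 0.5 does not).
keep_sum = selected.sum(axis=1) >= threshold
keep_max = selected.max(axis=1) >= threshold
```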
Per-particle scale
The value of alignments3D/alpha_min
Particle picking NCC score
The value of pick_stats/ncc_score
Particle picking power score
The value of pick_stats/power
Average defocus (A)
The average of ctf/df1_A and ctf/df2_A
2D alignment error
The value of alignments2D/error
3D alignment error
The value of alignments3D/error
Class probability - 2D
The value of alignments2D/class_posterior
Class probability - 3D
The sum of the elements of alignments3D_multi/class_posterior selected by the Class indices parameter.
Class ESS - 2D
The value of alignments2D/class_ess
Class ESS - 3D
The value of alignments3D_multi/class_ess
Total motion (A)
The summed length of all steps of the particle’s path recorded in the file located at motion/path
X location (fraction)
The value of location/center_x_frac
Y location (fraction)
The value of location/center_y_frac
Helical tilt angle (deg)
The rotation about the imaging axis in degrees. This can be calculated from the pose recorded in rotation vector format in alignments3D/pose.
3DVA component X
components_mode_x/value, where x is replaced by the component number.
Absolute difference in 3D pose (deg)
The absolute rotational difference between the particle’s alignments3D/pose in the two datasets.
Absolute difference in 3D shift (A)
alignments3D/psize_A times the Euclidean distance between the particle’s alignments3D/shift in each of the two datasets.
Absolute difference in 2D pose (deg)
The absolute value of the difference in alignments2D/pose of the particle in each of the two datasets, converted from radians into degrees.
Absolute difference in 2D shift (A)
alignments2D/psize_A times the Euclidean distance between the particle’s alignments2D/shift in each of the two datasets.
Absolute difference in average defocus (A)
The absolute value of the difference in the average of each particle’s ctf/df1_A and ctf/df2_A in the two datasets.
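As an illustration (not CryoSPARC's implementation), two of the difference statistics above can be reproduced with NumPy and SciPy from hypothetical field arrays; SciPy's Rotation handles the rotation-vector geometry needed for the 3D pose difference:

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Hypothetical field values for two particles in the two input stacks.
pose_i = np.array([[0.0, 0.0, 0.0], [0.1, 0.2, 0.3]])        # alignments3D/pose (rotation vectors, radians)
pose_f = np.array([[0.0, 0.0, np.pi / 2], [0.1, 0.2, 0.3]])
shift_i = np.array([[0.0, 0.0], [1.0, 1.0]])                 # alignments3D/shift (pixels)
shift_f = np.array([[3.0, 4.0], [1.0, 1.0]])
psize_A = 1.0                                                # alignments3D/psize_A

# Absolute difference in 3D pose (deg): the magnitude of the relative
# rotation between the two poses, covering all three rotational degrees
# of freedom in a single angle.
rel = Rotation.from_rotvec(pose_i).inv() * Rotation.from_rotvec(pose_f)
pose_diff_deg = np.degrees(rel.magnitude())

# Absolute difference in 3D shift (A): Euclidean distance times pixel size.
shift_diff_A = psize_A * np.linalg.norm(shift_f - shift_i, axis=1)
```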
In some cases, separating particles in this way is an efficient means of eliminating junk or otherwise low-quality particles. For an example of this use case, see the Subsetting By Per-Particle Scale section below.
The score measures how well the particle image matches the template, blob, or ring used to select the particle. It is only set during particle picking.
The power score measures the overall contrast of the patch surrounding a particle pick. It is only set during particle picking.
The average defocus of a particle takes into account any defocus refinements that have been performed on the particle stack.
A particle’s 2D class probability is the probability assigned to that particle’s best class during 2D Classification. Note that this does not filter particles by which class they are assigned to, only by the confidence of their assignment. See the worked class probability example later on this page for more info. Probability modes typically work best with manual thresholds.
When a particle is classified using Heterogeneous Refinement or 3D Classification, it is assigned a probability of belonging to each class. This mode filters by the sum or maximum of the probabilities for the classes selected with the Class indices parameter. See the worked class probability example later on this page for more info. Probability modes typically work best with manual thresholds.
When particles are analyzed with 3D Variability Analysis (3DVA), they are assigned a coordinate along each component. This statistic represents the particle’s value along the Xth component, where X is the value entered into 3DVA Component Number.
The difference in 3D shift is the absolute difference in X/Y shift, refined during 3D alignment, between the initial and final datasets. Note that in 3D refinements, there are still only two degrees of freedom for the shift.
The difference in average defocus is the absolute difference in the average defocus (see above) between the initial and final datasets. Typically, the average defocus only changes during a CTF refinement job, or a 3D refinement in which per-particle defocus optimization is enabled.
This parameter only appears when Subset by is set to “3DVA component X”. It selects which component from a 3DVA job is used to separate particles.
This removes particles with low-confidence group assignments. In other words, we have traded reduced particle count for increased confidence. In this way, the Minimum probability parameter is conceptually related to the operation performed by the Class Probability Filter job.
If you are using this job to split particles up into evenly-spaced regions of 3DVA coordinate space, the intermediates mode of 3D Variability Display may prove easier to use.
This example uses a publicly available TRPV1 particle stack.
The Class Probability modes of Subset Particles by Statistic replace the functionality of Class Probability Filter, which is a legacy job as of CryoSPARC v4.7.
When particles are classified (in, e.g., 2D Classification, Heterogeneous Refinement, or 3D Classification) they are assigned a probability of belonging to each class. Once the job completes, particles are assigned to the class they have the highest probability of belonging to.
When a particle stack must be very clean, it may be beneficial to keep only the most confidently-assigned particles. Taking the example above, suppose class 0 is the good class and classes 1 and 2 are junk. One could create a dataset which contained only the first particle (and others like it) using Subset Particles by Statistic and setting the following parameters:
Subset Particles by Statistic operates by filtering particles based on the value (or differences in values) of a Dataset Field. For most users, this is an unimportant detail; for users interested in using cryosparc-tools to incorporate similar analyses in their scripts, we list the dataset fields related to each operation below.