Job: Subset Particles by Statistic



At a Glance

Split particles into groups based on the value of a statistic.

Description

Often, some statistic of an alignment or classification (such as per-particle scale or the distance a particle shifted during an alignment) indicates that several subpopulations exist in a particle stack. Subset Particles by Statistic simplifies the process of separating these populations, either by modeling the selected statistic using a Gaussian mixture model (GMM) with a selected number of components, or with manually entered thresholds.

In some cases, separating particles in this way is an efficient means of eliminating junk or otherwise low-quality particles. For an example of this use case, see the Subsetting By Per-Particle Scale section below.
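The "Cluster by gaussian fitting" behavior can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not CryoSPARC's actual code: it fits a 1D GMM by expectation-maximization, then assigns each particle to its most probable component, numbering sets from lowest to highest mean as the job does.

```python
import numpy as np

def fit_gmm_1d(x, n_components, n_iter=200):
    """Fit a 1D Gaussian mixture by expectation-maximization (illustrative)."""
    # Initialize means from spread-out quantiles of the data.
    means = np.quantile(x, np.linspace(0.1, 0.9, n_components))
    stds = np.full(n_components, x.std() + 1e-6)
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each particle.
        dens = weights * np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) \
               / (stds * np.sqrt(2.0 * np.pi))
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: re-estimate weights, means, and standard deviations.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk) + 1e-9
    return weights, means, stds

def split_by_gmm(x, n_components):
    """Assign each particle to its most probable component, renumbered so
    that set 0 has the lowest mean (matching the job's output numbering)."""
    weights, means, stds = fit_gmm_1d(x, n_components)
    # The sqrt(2*pi) constant is dropped: it does not affect the argmax.
    dens = weights * np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) / stds
    labels = dens.argmax(axis=1)
    return means.argsort().argsort()[labels]
```

For two well-separated populations of per-particle scales, this recovers the low-scale and high-scale groups directly from the 1D statistic.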

Inputs

Subset Particles by Statistic can model the distribution of a statistic for a particular particle stack, or it can model the distribution of the difference in a statistic. When operating on a single statistic, just one Particles input exists. When operating on a difference, the Particles input disappears and is replaced with the Initial Particles and Final Particles inputs.

Particles

When operating on a single statistic, such as “Per-particle scale”, only this input is available.

The particles are separated using the statistic selected in the Subset by parameter. Output particles will have the same metadata as this input (e.g., the pose will remain unchanged by the filtering operation).

Initial Particles and Final Particles

When operating on a difference statistic, such as “Absolute difference in 3D shift (A)”, only this pair of inputs is available.

First, the absolute difference in the selected statistic is calculated between the Initial Particles and Final Particles. Taking Absolute difference in 3D shift (A) as an example, if one particle had a difference of (0, 1) between the two particle stacks and another had a difference of (-1, 0), both would be at the 1.0 position of the histogram.

The particles are then separated by the distribution of these differences. When particle groups are output, they inherit their metadata (e.g., poses) from the Final Particles input.

Since differences are absolute, Subset Particles by Statistic will find the same groups no matter which input a given particle stack is connected to. You can therefore always place the particle stack with your desired metadata (poses, CTF values, etc.) in the Final Particles input.
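A minimal NumPy sketch of the difference computation, using "Absolute difference in 2D shift (A)" as the statistic. The uids, shifts, and pixel size are made-up values, and matching stacks by uid is an assumption made for illustration:

```python
import numpy as np

# Hypothetical per-particle records: a uid plus a 2D shift (in pixels).
initial_uid = np.array([3, 1, 2])
initial_shift = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
final_uid = np.array([1, 2, 3])
final_shift = np.array([[2.0, 0.0], [1.0, 0.0], [-1.0, 1.0]])
psize_A = 1.06  # assumed pixel size in Angstroms

# Match particles between the two stacks by uid before differencing.
order = np.argsort(initial_uid)
idx = order[np.searchsorted(initial_uid[order], final_uid)]

# Absolute (unsigned) difference in shift, converted from pixels to Angstroms.
diff_A = psize_A * np.linalg.norm(final_shift - initial_shift[idx], axis=1)
```

Note that the second and third particles differ by (0, -1) and (-1, 0) pixels respectively, yet land at the same position in the histogram, because only the magnitude of the difference is kept.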

Commonly Adjusted Parameters

Subset by

This parameter selects the statistic by which particles will be separated into groups.

Statistic
Description

Per-particle scale

Per-particle scale accounts for local variations in greyscale and is typically taken as a proxy for ice thickness, but it is also affected by overall particle quality and other factors. Note that if the volume is highly anisotropic due to orientation bias, the per-particle scale may be unreliable.

Particle picking NCC score

The Normalized Cross Correlation (NCC) score measures how well the particle image matches the template, blob, or ring used to select the particle. It is only set during particle picking.

Particle picking power score

The power score measures the overall contrast of the patch surrounding a particle pick. It is only set during particle picking.

Average defocus (A)

The average defocus of a particle takes into account astigmatism and any CTF refinements that have been performed on the particle stack.

2D alignment error

The 2D alignment error is a measure of the mismatch between the particle image and the 2D class average to which it is aligned, in the particle’s optimal pose. While a higher 2D alignment error means the particle is a poorer match for its class average than other particles, this may be because the particle is low quality or because the class average represents a slightly different viewing direction. Thus, when 3D alignments are available, 3D alignment error should be preferred.

3D alignment error

The 3D alignment error is a measure of the mismatch between the particle image and the projection of the volume in the particle’s optimal pose. A higher error may generally correlate with poorer images, but the error is also affected by the volume’s overall quality.

Class probability - 2D

A particle’s 2D class probability is the probability assigned to that particle’s best class during 2D Classification. Note that this does not filter particles by which class they are assigned to, only the confidence of their assignment. See the Class Probability section of this page for more info. Probability modes typically work best with manual thresholds.

Class probability - 3D

When a particle is classified using Heterogeneous Refinement or 3D Classification, it is assigned a probability of belonging to each class. This mode filters by the sum or maximum of all probabilities for the classes selected with the Class indices parameter. See the Class Probability section of this page for more info. Probability modes typically work best with manual thresholds.

Class ESS - 2D

The Effective Sample Size (ESS) is an alternate measure of the confidence in a particle’s class assignment. Higher ESS values indicate lower confidence in the particle’s class. An ESS of 1.0 indicates that a particle is only a member of a single class. ESS modes typically work best with manual thresholds.

Class ESS - 3D

As above, but for the class assignments of a 3D classification.

Total motion (A)

Total motion measures the distance the particle moved while the movie was collected. More motion is generally considered to correlate with poorer images due to blurring.

X location (fraction)

The X location is a value ranging from 0.0 to 1.0, with 0.0 representing the left-most pixel in a micrograph and 1.0 the right-most.

Y location (fraction)

The Y location is a value ranging from 0.0 to 1.0, with 0.0 representing the bottom-most pixel in a micrograph and 1.0 the top-most.

Helical tilt angle (deg)

The helical tilt angle is only meaningful for helical proteins. It measures the angle of deviation of the particle image relative to a vertical helix.

3DVA component X

When particles are analyzed with 3D Variability Analysis, they are assigned a coordinate along each component. This statistic represents the particle’s value along the Xth component, where X is the value entered into 3DVA Component Number.

Absolute difference in 3D pose (deg)

The difference in 3D pose is the absolute difference in degrees between a particle’s pose in the initial and final datasets. Note that this single value takes into account the particle’s rotation in all three degrees of freedom.

Absolute difference in 3D shift (A)

The difference in 3D shift is the absolute difference in X/Y shift, refined by a 3D method, between the initial and final datasets. Note that in 3D refinements, there are still only two degrees of freedom for the shift.

Absolute difference in 2D pose (deg)

The difference in 2D pose is the absolute difference in 2D rotation between the initial and final datasets.

Absolute difference in 2D shift (A)

The difference in 2D shift is the absolute difference in X/Y shift, refined by 2D Classification, between the initial and final datasets.

Absolute difference in average defocus (A)

The difference in average defocus is the absolute difference in the average defocus (see above) between the initial and final datasets. Typically, the average defocus only changes during a Local CTF Refinement job, or a 3D refinement in which on-the-fly CTF refinement is enabled.
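To make the ESS behavior concrete, one widely used definition of effective sample size over class posteriors is the inverse participation ratio, 1 / Σ pᵢ². This formula is an assumption made here for illustration (CryoSPARC's exact definition is not spelled out on this page), but it reproduces the behavior described above: a particle firmly in one class has ESS 1.0, and higher values mean the probability is spread over more classes.

```python
import numpy as np

def class_ess(posteriors):
    """Effective sample size of a class-posterior vector, assumed here to be
    the inverse participation ratio 1 / sum(p_i^2) for illustration."""
    p = np.asarray(posteriors, dtype=float)
    return 1.0 / (p ** 2).sum()

# A particle entirely in one class has ESS 1.0; a particle split evenly
# over k classes has ESS k (maximum uncertainty).
```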

Class indices and Class probability filtering mode

These parameters only appear when Subset by is set to “Class probability - 3D”.

Class indices selects classes by index (starting with 0) over which the probability should be evaluated. Class probability filtering mode controls whether the particles should be retained if the sum or the maximum of the probabilities in the selected classes is greater than the threshold.

3DVA Component Number

This parameter only appears when Subset by is set to “3DVA component X”. It selects which component from a 3D Variability Analysis job is used to separate particles.

Note that 3DVA components are never removed from a particle stack, only updated. For example, consider a workflow in which you first run a 3DVA job with 4 components, then run a 3DVA job on the resulting particle stack with only 2 components. In this case, components 0 and 1 refer to the coordinates from the second job, but components 2 and 3 refer to the coordinates from the first job.

Curation mode

If Curation mode is set to “Cluster by gaussian fitting”, a Gaussian mixture model (GMM) is fit to the statistic selected with Subset by. The number of Gaussian components in the GMM is set by Number of gaussians, which only appears in this curation mode. Particles are then output in the set corresponding to the Gaussian component for which they have the highest probability.

If Curation mode is set to “Split by manual thresholds”, the user manually selects thresholds which divide the particle groups. The Number of thresholds parameter creates the selected number of Threshold N parameters, which are then used to divide the particles.
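In “Split by manual thresholds” mode, the grouping is simple binning by the chosen thresholds. A minimal NumPy sketch with made-up scales and thresholds (the values are illustrative, not from a real dataset):

```python
import numpy as np

# Hypothetical per-particle scales and two manually chosen thresholds.
scales = np.array([0.62, 0.95, 1.01, 1.30, 0.88])
thresholds = [0.8, 1.1]   # Number of thresholds = 2, giving 3 output sets

# np.digitize bins each value by the thresholds; set 0 holds the lowest
# values, matching the job's low-to-high set numbering.
set_index = np.digitize(scales, thresholds)
```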

Minimum probability to include particle in cluster (%)

This parameter sets the minimum probability a particle must have to be included in the output. A particle has a 0% probability of belonging to a cluster when we are certain it does not belong to the cluster, and a 100% probability when we are certain it does. In practice, particles will essentially never have a 0% probability but may have 100% probability.

For example, consider a particle stack with the following distribution of per-particle scales:

This distribution of per-particle scales looks like it has three distinct groups, so we will use a GMM with three Gaussians. The resulting model looks like this:

We can use this GMM to assign each particle to either the red, pink, or green group, depending on the height of the Gaussians at that position. We can then use these probabilities to split the particles into groups. Each particle is assigned into the group with the highest probability at that particle’s per-particle scale. This would result in the following groups:

With this technique, we keep all particles for future use. However, some particles have an equal probability of belonging to two Gaussian components. In this example, particles with per-particle scales around 1.0 are equally likely to belong to the red and pink components.

In some cases, it may be preferable to remove particles with an uncertain group assignment. This is what the Minimum probability parameter does. In this example, say we set the Minimum probability to 80%. This removes particles with low-confidence group assignments; in other words, we have traded reduced particle count for increased confidence. In this way, the Minimum probability parameter is conceptually related to the operation performed in the Class Probability Filter job.
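A short sketch of what the probability cutoff does, using made-up GMM responsibilities (these are illustrative numbers, not CryoSPARC output):

```python
import numpy as np

# Made-up GMM responsibilities for five particles over two components
# (each row sums to 1).
resp = np.array([
    [0.99, 0.01],
    [0.55, 0.45],   # ambiguous: near the crossing point of two Gaussians
    [0.10, 0.90],
    [0.81, 0.19],
    [0.50, 0.50],   # maximally ambiguous
])
min_prob = 0.80     # "Minimum probability to include particle in cluster (%)" = 80

labels = resp.argmax(axis=1)          # group with the highest probability
kept = resp.max(axis=1) >= min_prob   # drop low-confidence assignments
```

Only the first, third, and fourth particles survive the 80% cutoff; the two ambiguous particles are discarded rather than assigned to a group.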

Outputs

Particles set N

The separated input particles are output in N sets, depending on the selection of thresholds or GMM components. Sets are numbered from low to high, with set 0 having the lowest mean value for the selected statistic.

Note that if the selected statistic is a difference statistic, the metadata (including pose, CTF estimate, etc.) will come from the Final Particles input.

Common Next Steps

The next steps will depend on the selected statistic, but this job is often used to select a subset of interest for further refinement.

Separating particles by per-particle scale and removing the low-scale group can improve refinement results in some cases. See the Subsetting By Per-Particle Scale section below for an example of this use case.

Recommended Alternatives

If you are using this job to split particles up into evenly-spaced regions of 3DVA coordinate space, the Intermediates mode of 3D Variability Display may prove easier to use.
Subsetting By Per-Particle Scale

In this example, we performed a Non-Uniform Refinement of 241,210 TRPV1 particles (EMPIAR-10059) with C4 symmetry and Minimize over per-particle scale turned on. The resulting map had a GSFSC resolution of 2.69 Å.

The per-particle scales show a clear bi-modal distribution. Typically, per-particle scales are expected to be normally distributed about 1.0.

One explanation for the bimodal distribution is that there are two populations of particles: one in thick ice and one in thin ice. The particles in thick ice would need a higher per-particle scale to have the same greyscale as the particles in thin ice, giving rise to a bimodal distribution.

However, it is also possible that junk or low-quality particles do not match the TRPV1 volume very well even in their best orientation, and so are assigned a lower per-particle scale to reduce their overall error. If this is the case, filtering out these particles will yield a better reconstruction.

We can use Subset Particles by Statistic to separate the particles into two groups with a Gaussian Mixture Model. Subset Particles by Statistic also plots the viewing direction distribution of each set, which we can check to ensure that the different scales do not simply correlate with a specific viewing direction.

Although there is some difference between the two viewing direction plots, there’s no clear orientation bias in either one. We can now refine each of these sets separately and evaluate the particle quality based on the final map. Since cluster 1 contains more particles than cluster 0, a random subset of the particles from cluster 1 will be used to make the comparison even.

With the same number of particles in each refinement, the high-scale particles produce a significantly better map. It seems likely that the low scale factors in the original refinement were, in fact, accounting for remaining junk particles in the stack. Note also that both particle stacks’ per-particle scales have recentered around 1.0 — the actual value of the per-particle scale is a function of the overall range of the greyscale across the entire particle stack.

Class Probability

The Class Probability modes of Subset Particles by Statistic replace the functionality of Class Probability Filter, which is a legacy job as of CryoSPARC v4.7.

When particles are classified (in, e.g., 2D Classification, 3D Classification, or Heterogeneous Refinement) they are assigned a probability of belonging to each class. Once the job completes, particles are assigned to the class they have the highest probability of belonging to.

For example, consider a 3D Classification with 3 classes. A particle with probabilities of [0.8, 0.1, 0.1] for the three classes would be assigned to class 0 and included in the class 0 output. However, a particle with probabilities of [0.34, 0.33, 0.33] would also be assigned to class 0, even though the assignment of this second particle is far less confident than the first.

When a particle stack must be very clean (for instance, before performing 3D Flexible Refinement) it may be beneficial to keep only the most confidently-assigned particles. Taking the example above, suppose class 0 is the good class and classes 1 and 2 are junk. One could create a dataset which contained only the first particle (and others like it) using Subset Particles by Statistic and setting the following parameters:

Subset by: Class probability - 3D
Class indices: 0
Curation mode: Split by manual thresholds
Threshold 1: 0.75

In this example, classes 0 and 1 are good. The threshold is set to 0.75, and Class probability filtering mode is set to “sum”.

If instead classes 0 and 1 were both good, Class indices could be set to 0, 1. Particles would then be retained if the sum of their probabilities of belonging to class 0 and class 1 was greater than 0.75. In this case, particles with probabilities [0.8, 0.1, 0.1] or [0.5, 0.4, 0.1] would be kept. This makes sense if both class 0 and class 1 are high-quality classes of the same volume: particles may not be confidently assigned to one of the two, but they are definitely good if they are split among those two classes.

In this example, classes 0 and 1 are good, but represent different particles. The threshold is still set to 0.75, but now the Class probability filtering mode is set to “max”. Thus, only particles confidently assigned to one or the other good class are retained.

Now, suppose class 0 and class 1 are both good, but they represent different targets. Good particles should be confidently assigned to one of the two classes — particles which are split between two different-but-good volumes are likely low quality. In this case, Class probability filtering mode should be set to “max”. Now, the [0.8, 0.1, 0.1] particle will be kept but the [0.5, 0.4, 0.1] particle rejected.
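The sum-versus-max behavior can be checked with a few lines of NumPy, using the posteriors from the discussion above (made-up values for illustration):

```python
import numpy as np

# Made-up 3D class posteriors for three particles across three classes.
posteriors = np.array([
    [0.80, 0.10, 0.10],
    [0.50, 0.40, 0.10],
    [0.34, 0.33, 0.33],
])
class_indices = [0, 1]   # Class indices parameter
threshold = 0.75         # manual threshold

# "sum" mode: classes 0 and 1 are interchangeable (same volume).
keep_sum = posteriors[:, class_indices].sum(axis=1) > threshold
# "max" mode: classes 0 and 1 are distinct targets; demand a confident
# assignment to one of them.
keep_max = posteriors[:, class_indices].max(axis=1) > threshold
```

Under "sum", both the [0.8, 0.1, 0.1] and [0.5, 0.4, 0.1] particles pass; under "max", only the first does.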

CryoSPARC Dataset Fields

Subset Particles by Statistic operates by filtering particles based on the value (or differences in values) of a dataset field. For most users, this is an unimportant detail; for users interested in using CryoSPARC Tools to incorporate similar analyses in their scripts, we list the dataset fields related to each operation below.
Statistic
Field name(s)

Per-particle scale

The value of alignments3D/alpha_min

Particle picking NCC score

The value of pick_stats/ncc_score

Particle picking power score

The value of pick_stats/power

Average defocus (A)

The average of ctf/df1_A and ctf/df2_A

2D alignment error

The value of alignments2D/error

3D alignment error

The value of alignments3D/error

Class probability - 2D

The value of alignments2D/class_posterior

Class probability - 3D

The sum of the elements of alignments3D_multi/class_posterior selected by the Class indices parameter.

Class ESS - 2D

The value of alignments2D/class_ess

Class ESS - 3D

The value of alignments3D_multi/class_ess

Total motion (A)

The summed length of all steps of the particle’s path recorded in the file located at motion/path

X location (fraction)

The value of location/center_x_frac

Y location (fraction)

The value of location/center_y_frac

Helical tilt angle (deg)

The rotation about the imaging axis in degrees. This can be calculated from the pose recorded in rotation vector format in alignments3D/pose.

3DVA component X

The value of components_mode_X/value, where X is replaced by the component number.

Absolute difference in 3D pose (deg)

The absolute rotational difference between the particle’s alignments3D/pose in the two datasets.

Absolute difference in 3D shift (A)

alignments3D/psize_A times the Euclidean distance between the particle’s alignments3D/shift in each of the two datasets.

Absolute difference in 2D pose (deg)

The absolute value of the difference in alignments2D/pose of the particle in each of the two datasets, converted from radians into degrees.

Absolute difference in 2D shift (A)

alignments2D/psize_A times the Euclidean distance between the particle’s alignments2D/shift in each of the two datasets.

Absolute difference in average defocus (A)

The absolute value of the difference in the average of each particle’s ctf/df1_A and ctf/df2_A in the two datasets.
