Job: Subset Particles by Statistic
Last updated
Split particles into groups based on the value of a statistic.
Often, some statistic of an alignment or classification (such as per-particle scale or the distance a particle shifted during an alignment) indicates that several subpopulations exist in a particle stack. Subset Particles by Statistic simplifies the process of separating these populations, either by modeling the selected statistic using a Gaussian mixture model (GMM) with a selected number of components, or with manually entered thresholds.
Subset Particles by Statistic can model the distribution of a statistic for a particular particle stack, or it can model the distribution of the difference in a statistic. When operating on a single statistic, just one Particles input exists. When operating on a difference, the Particles input disappears and is replaced with the Initial Particles and Final Particles inputs.
When operating on a single statistic, such as “Per-particle scale”, only this input is available.
The particles are separated using the statistic selected in the Subset by parameter. Output particles will have the same metadata as this input (e.g., the pose will remain unchanged by the filtering operation).
When operating on a difference statistic, such as “Absolute difference in 3D shift (A)”, only this pair of inputs is available.
First, the absolute difference in the selected statistic is calculated between the Initial Particles and Final Particles. Taking Absolute difference in 3D shift (A) as an example, if one particle had a difference of (0, 1) between the two particle stacks and another had a difference of (-1, 0), both would be at the 1.0 position of the histogram.
The particles are then separated by the distribution of these differences. When particle groups are output, they inherit their metadata (e.g., poses) from the Final Particles input.
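As a rough sketch of the calculation above, the shift-difference statistic is just the Euclidean distance between each particle's shift in the two stacks, scaled to Angstroms. This is plain NumPy with made-up values; the array names are illustrative, not CryoSPARC fields:

```python
import numpy as np

# Hypothetical 2D shifts (in pixels) for the same three particles in the
# initial and final particle stacks, plus an assumed pixel size in Angstroms.
initial_shift = np.array([[0.0, 0.0], [2.0, 3.0], [1.0, 1.0]])
final_shift   = np.array([[0.0, 1.0], [1.0, 3.0], [1.0, 1.0]])
psize_A = 1.0

# Absolute difference in 3D shift (A): Euclidean distance between the two
# shifts. Differences of (0, 1) and (-1, 0) both land at 1.0 on the histogram.
diff_A = psize_A * np.linalg.norm(final_shift - initial_shift, axis=1)
```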
This parameter selects the statistic by which particles will be separated into groups.
Per-particle scale
Per-particle scale accounts for local variations in greyscale and is typically taken as a proxy for ice thickness, but would also be affected by overall particle quality and other factors. Note that if the volume is highly anisotropic due to orientation bias, the per-particle scale may be unreliable.
Particle picking NCC score
Particle picking power score
Average defocus (A)
2D alignment error
The 2D alignment error is a measure of the mismatch between the particle image and the 2D class average to which it is aligned, in the particle’s optimal pose. While a higher 2D alignment error means the particle is a poorer match for its class average than other particles, this may be because the particle is low quality or because the class average represents a slightly different viewing direction. Thus, when 3D alignments are available, 3D alignment error should be preferred.
3D alignment error
The 3D alignment error is a measure of the mismatch between the particle image and the projection of the volume in the particle’s optimal pose. A higher error may generally correlate with poorer images, but the error is also affected by the volume’s overall quality.
Class probability - 2D
Class probability - 3D
Class ESS - 2D
The Effective Sample Size is an alternate measure of the confidence in a particle’s class assignment. Higher ESS values indicate lower confidence in the particle’s class. An ESS of 1.0 indicates that a particle is only a member of a single class. ESS modes typically work best with manual thresholds.
Class ESS - 3D
The Effective Sample Size is an alternate measure of the confidence in a particle’s class assignment. Higher ESS values indicate lower confidence in the particle’s class. An ESS of 1.0 indicates that a particle is only a member of a single class. ESS modes typically work best with manual thresholds.
Total motion (A)
Total motion measures the distance the particle moved while the movie was collected. More motion is generally considered to correlate with poorer images due to blurring.
X location (fraction)
The X location is a value ranging from 0.0 to 1.0, with 0.0 representing the left-most pixel in a micrograph and 1.0 the right-most.
Y location (fraction)
The Y location is a value ranging from 0.0 to 1.0, with 0.0 representing the bottom-most pixel in a micrograph and 1.0 the top-most.
Helical tilt angle (deg)
The helical tilt angle is only meaningful for helical proteins. It measures the angle of deviation of the particle image relative to a vertical helix.
3DVA component X
Absolute difference in 3D pose (deg)
The difference in 3D pose is the absolute difference in degrees between a particle’s pose in the initial and final datasets. Note that this single value takes into account the particle’s rotation in all three degrees of freedom.
Absolute difference in 3D shift (A)
Absolute difference in 2D pose (deg)
The difference in 2D pose is the absolute difference in 2D rotation between the initial and final datasets.
Absolute difference in 2D shift (A)
The difference in 2D shift is the absolute difference in X/Y shift, refined by 2D Classification, between the initial and final datasets.
Absolute difference in average defocus (A)
These parameters only appear when Subset by is set to “Class probability - 3D”.
Class indices selects classes by index (starting with 0) over which the probability should be evaluated. Class probability filtering mode controls whether the particles should be retained if the sum or the maximum of the probabilities in the selected classes is greater than the threshold.
3DVA components are never removed from a particle stack, only updated.
Consider a workflow in which you first run a 3DVA job with 4 components, then run a 3DVA job on the resulting particle stack with only 2 components. In this case, components 0 and 1 refer to the coordinates from the second job, but components 2 and 3 refer to the coordinates from the first job.
If Curation mode is set to “Cluster by gaussian fitting”, a Gaussian mixture model (GMM) will be fit to the statistic selected with Subset by. The number of Gaussian components in the GMM is set by Number of gaussians, which only appears in this curation mode. Particles are then output in sets corresponding to the Gaussian component for which they have the highest probability.
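The gaussian-fitting mode can be approximated outside CryoSPARC with scikit-learn's GaussianMixture. This is a simplified sketch on synthetic per-particle scales, not the job's exact implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic per-particle scales drawn from two well-separated populations.
scales = np.concatenate([rng.normal(0.8, 0.05, 500),
                         rng.normal(1.1, 0.05, 500)])

# Fit a GMM with the chosen number of components ("Number of gaussians").
gmm = GaussianMixture(n_components=2, random_state=0).fit(scales.reshape(-1, 1))

# Each particle is assigned to the component with the highest posterior
# probability, mirroring how particles are divided into output sets.
labels = gmm.predict(scales.reshape(-1, 1))
```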
If Curation mode is set to “Split by manual thresholds”, the user manually selects thresholds which divide the particle groups. The Number of thresholds parameter creates the selected number of Threshold N parameters, which are then used to divide the particles.
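A minimal sketch of threshold-based splitting (with hypothetical values): NumPy's digitize mirrors the way N thresholds create N+1 output sets.

```python
import numpy as np

# Hypothetical statistic values and two manually chosen thresholds
# (Number of thresholds = 2), which create three output groups.
values = np.array([0.62, 0.81, 0.95, 1.04, 1.20])
thresholds = [0.75, 1.0]

# np.digitize returns 0 for values below the first threshold, 1 for values
# between the two thresholds, and 2 above the second -- i.e., the set index.
groups = np.digitize(values, thresholds)
```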
This parameter sets the minimum probability a particle must have to be included in the output. A particle has a 0% probability of belonging to a cluster when we are certain it does not belong to the cluster, and a 100% probability when we are certain it does. In practice, particles will essentially never have a 0% probability but may have 100% probability.
For example, consider a particle stack with the following distribution of per-particle scales:
This distribution of per-particle scales looks like it has three distinct groups, so we will use a GMM with three Gaussians. The resulting model looks like this:
We can use this GMM to assign each particle to either the red, pink, or green group, depending on the height of the Gaussians at that position. We can then use these probabilities to split the particles into groups. Each particle is assigned into the group with the highest probability at that particle’s per-particle scale. This would result in the following groups:
With this technique, we keep all particles for future use. However, some particles have an equal probability of belonging to two Gaussian components. In this example, particles with per-particle scales around 1.0 are equally likely to belong to the red and pink components.
In some cases, it may be preferable to remove particles with an uncertain group assignment. This is what the Minimum probability parameter does. In this example, say we set the Minimum probability to 80%.
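Conceptually, the Minimum probability filter keeps a particle only if its best component posterior clears the threshold. This is a toy sketch with a hand-made posterior matrix, not CryoSPARC's internal code:

```python
import numpy as np

# Hypothetical GMM posteriors (rows: particles, columns: components).
posteriors = np.array([
    [0.98, 0.02],   # confidently component 0
    [0.55, 0.45],   # ambiguous -- near the crossover between components
    [0.10, 0.90],   # confidently component 1
])

min_probability = 0.80  # the "Minimum probability" parameter
best = posteriors.max(axis=1)

# Keep only particles whose best-component probability clears the threshold;
# the ambiguous middle particle is discarded.
kept = np.flatnonzero(best >= min_probability)
```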
The separated input particles are output in N sets, depending on the selection of thresholds or GMM components. Sets are numbered from low to high, with set 0 having the lowest mean value for the selected statistic.
Note that if the selected statistic is a difference statistic, the metadata (including pose, CTF estimate, etc.) will come from the Final Particles input.
The next steps will depend on the selected statistic, but this job is often used to select a subset of interest for further refinement.
Separating particles by per-particle scale and removing the low-value group can improve refinement results in some cases. See the Subsetting By Per-Particle Scale section below for an example of this use case.
In this example, we performed a Non-Uniform Refinement of 241,210 TRPV1 particles with C4 symmetry and Minimize over per-particle scale turned on. The resulting map had a GSFSC resolution of 2.69 Å.
The per-particle scales show a clear bi-modal distribution. Typically, per-particle scales are expected to be normally distributed about 1.0.
One explanation for the bimodal distribution is there are two populations of particles, one in thick ice and one in thin ice. The particles in thick ice would need a higher per-particle scale to have the same greyscale as the particles in thin ice, giving rise to a bimodal distribution.
However, it is also possible that junk or low-quality particles do not match the TRPV1 volume very well even in their best orientation, and so are assigned a lower per-particle scale to reduce their overall error. If this is the case, filtering out these particles will yield a better reconstruction.
We can use Subset Particles by Statistic to separate the particles into two groups with a Gaussian Mixture Model. Subset Particles by Statistic also plots the viewing direction distribution of each set, which we can check to ensure that the different scales do not simply correlate with a specific viewing direction.
Although there is some difference between the two viewing direction plots, there’s no clear orientation bias in either one. We can now refine each of these sets separately and evaluate the particle quality based on the final map. Since cluster 1 contains more particles than cluster 0, a random subset of the particles from cluster 1 will be used to make the comparison even.
With the same number of particles in each refinement, the high-scale particles produce a significantly better map. It seems likely that the low scale factors in the original refinement were, in fact, accounting for remaining junk particles in the stack. Note also that both particle stacks’ per-particle scales have recentered around 1.0 — the actual value of the per-particle scale is a function of the overall range of the greyscale across the entire particle stack.
For example, consider a 3D Classification with 3 classes. A particle with probabilities of [0.8, 0.1, 0.1] for the three classes would be assigned to class 0 and included in the class 0 output. However, a particle with probabilities of [0.34, 0.33, 0.33] would also be assigned to class 0, even though the assignment of this second particle is far less confident than that of the first.
Class indices
0
Curation mode
Split by manual thresholds
Threshold 1
0.75
In this example, class 0 is good. The threshold is set to 0.75, and Class probability filtering mode is set to “sum”.
If instead classes 0 and 1 were both good, Class indices could instead be set to 0, 1. Particles would then be retained if the sum of their probabilities of belonging to class 0 and class 1 were greater than 0.75. In this case, particles with probabilities [0.8, 0.1, 0.1] or [0.5, 0.4, 0.1] would be kept. This makes sense if both class 0 and class 1 are high-quality classes of the same volume: particles may not be confidently assigned to one of the two, but they are definitely good if their probability is split between those two classes.
In this example, classes 0 and 1 are good, but represent different particles. The threshold is still set to 0.75, but now the Class probability filtering mode is set to “max”. Thus, only particles confidently assigned to one or the other good class are retained.
Now, suppose class 0 and class 1 are both good, but they represent different targets. Good particles should be confidently assigned to one of the two classes; particles which are split between two different-but-good volumes are likely low quality. In this case, Class probability filtering mode should be set to “max”. Now, the [0.8, 0.1, 0.1] particle will be kept but the [0.5, 0.4, 0.1] particle rejected.
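The difference between the two filtering modes can be sketched with the probabilities from the example above (hypothetical arrays, not the job's implementation):

```python
import numpy as np

# Hypothetical 3D class posteriors for two particles over three classes.
posteriors = np.array([
    [0.8, 0.1, 0.1],
    [0.5, 0.4, 0.1],
])
class_indices = [0, 1]   # "Class indices" parameter
threshold = 0.75         # "Threshold 1"

selected = posteriors[:, class_indices]
# "sum" mode keeps both particles (0.9 and 0.9 both exceed 0.75);
# "max" mode keeps only the first (0.8 passes, 0.5 does not).
keep_sum = selected.sum(axis=1) >= threshold
keep_max = selected.max(axis=1) >= threshold
```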
Per-particle scale
The value of alignments3D/alpha_min
Particle picking NCC score
The value of pick_stats/ncc_score
Particle picking power score
The value of pick_stats/power
Average defocus (A)
The average of ctf/df1_A and ctf/df2_A
2D alignment error
The value of alignments2D/error
3D alignment error
The value of alignments3D/error
Class probability - 2D
The value of alignments2D/class_posterior
Class probability - 3D
The sum of the elements of alignments3D_multi/class_posterior selected by the Class indices parameter.
Class ESS - 2D
The value of alignments2D/class_ess
Class ESS - 3D
The value of alignments3D_multi/class_ess
Total motion (A)
The summed length of all steps of the particle’s path recorded in the file located at motion/path
X location (fraction)
The value of location/center_x_frac
Y location (fraction)
The value of location/center_y_frac
Helical tilt angle (deg)
The rotation about the imaging axis in degrees. This can be calculated from the pose recorded in rotation vector format in alignments3D/pose.
3DVA component X
components_mode_x/value, where x is replaced by the component number.
Absolute difference in 3D pose (deg)
The absolute rotational difference between the particle’s alignments3D/pose in the two datasets.
Absolute difference in 3D shift (A)
alignments3D/psize_A times the Euclidean distance between the particle’s alignments3D/shift in each of the two datasets.
Absolute difference in 2D pose (deg)
The absolute value of the difference in alignments2D/pose of the particle in each of the two datasets, converted from radians into degrees.
Absolute difference in 2D shift (A)
alignments2D/psize_A times the Euclidean distance between the particle’s alignments2D/shift in each of the two datasets.
Absolute difference in average defocus (A)
The absolute value of the difference in the average of each particle’s ctf/df1_A and ctf/df2_A in the two datasets.
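As an illustration (not CryoSPARC's implementation), two of the difference statistics above can be reproduced with NumPy and SciPy from hypothetical field arrays; SciPy's Rotation handles the rotation-vector geometry needed for the 3D pose difference:

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Hypothetical field values for two particles in the two input stacks.
pose_i = np.array([[0.0, 0.0, 0.0], [0.1, 0.2, 0.3]])        # alignments3D/pose (rotation vectors, radians)
pose_f = np.array([[0.0, 0.0, np.pi / 2], [0.1, 0.2, 0.3]])
shift_i = np.array([[0.0, 0.0], [1.0, 1.0]])                 # alignments3D/shift (pixels)
shift_f = np.array([[3.0, 4.0], [1.0, 1.0]])
psize_A = 1.0                                                # alignments3D/psize_A

# Absolute difference in 3D pose (deg): the magnitude of the relative
# rotation between the two poses, covering all three rotational degrees
# of freedom in a single angle.
rel = Rotation.from_rotvec(pose_i).inv() * Rotation.from_rotvec(pose_f)
pose_diff_deg = np.degrees(rel.magnitude())

# Absolute difference in 3D shift (A): Euclidean distance times pixel size.
shift_diff_A = psize_A * np.linalg.norm(shift_f - shift_i, axis=1)
```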
In some cases, separating particles in this way is an efficient means of eliminating junk or otherwise low-quality particles. For an example of this use case, see the Subsetting By Per-Particle Scale section below.
The score measures how well the particle image matches the template, blob, or ring used to select the particle. It is only set during particle picking.
The power score measures the overall contrast of the patch surrounding a particle pick. It is only set during particle picking.
The average defocus of a particle takes into account any defocus refinements that have been performed on the particle stack.
A particle’s 2D class probability is the probability assigned to that particle’s best class during 2D Classification. Note that this does not filter particles by which class they are assigned to, only by the confidence of their assignment. See the worked class probability example later on this page for more info. Probability modes typically work best with manual thresholds.
When a particle is classified using Heterogeneous Refinement or 3D Classification, it is assigned a probability of belonging to each class. This mode filters by the sum or maximum of the probabilities for the classes selected with the Class indices parameter. See the worked class probability example later on this page for more info. Probability modes typically work best with manual thresholds.
When particles are analyzed with 3D Variability Analysis (3DVA), they are assigned a coordinate along each component. This statistic represents the particle’s value along the Xth component, where X is the value entered into 3DVA Component Number.
The difference in 3D shift is the absolute difference in X/Y shift, refined during 3D alignment, between the initial and final datasets. Note that in 3D refinements, there are still only two degrees of freedom for the shift.
The difference in average defocus is the absolute difference in the average defocus (see above) between the initial and final datasets. Typically, the average defocus only changes during a CTF refinement job, or a 3D refinement in which per-particle defocus optimization is enabled.
This parameter only appears when Subset by is set to “3DVA component X”. It selects which component from a 3DVA job is used to separate particles.
This removes particles with low-confidence group assignments. In other words, we have traded reduced particle count for increased confidence. In this way, the Minimum probability parameter is conceptually related to the operation performed by the Class Probability Filter job.
If you are using this job to split particles up into evenly-spaced regions of 3DVA coordinate space, the intermediates mode of 3D Variability Display may prove easier to use.
This example uses a publicly available TRPV1 particle stack.
The Class Probability modes of Subset Particles by Statistic replace the functionality of Class Probability Filter, which is a legacy job as of CryoSPARC v4.7.
When particles are classified (in, e.g., 2D Classification, Heterogeneous Refinement, or 3D Classification) they are assigned a probability of belonging to each class. Once the job completes, particles are assigned to the class they have the highest probability of belonging to.
When a particle stack must be very clean, it may be beneficial to keep only the most confidently-assigned particles. Taking the example above, suppose class 0 is the good class and classes 1 and 2 are junk. One could create a dataset which contained only the first particle (and others like it) using Subset Particles by Statistic and setting the following parameters:
Subset Particles by Statistic operates by filtering particles based on the value (or differences in values) of a Dataset Field. For most users, this is an unimportant detail; for users interested in using cryosparc-tools to incorporate similar analyses in their scripts, we list the dataset fields related to each operation below.