Guideline for Supervised Particle Picking using Deep Learning Models
Supervised particle picking.
Supervised deep learning models are machine learning models that are trained using pre-existing data, in this case particles, to pick particles from micrographs. The primary benefit of these models is that they can learn from an (ideally) small number of high-quality particle picks to produce a model that can be used to pick particles from the remainder of the dataset.
This document provides guidelines to help train effective particle picking models. The quality of a model depends on the training data and the training parameters. This document first explores the training data, then the training parameters. Finally, it details what to watch for when monitoring the training of a model in order to catch and resolve issues.
The general concepts and instructions on this page are written in the context of the cryoSPARC deep neural network particle picking workflows. Other methods (e.g. Topaz) share similar fundamental characteristics and considerations, but use differently named parameters.
The first step in training a model is acquiring the training data. The best way to acquire training data is to work with a small subset of micrographs from the dataset of interest. An effective model for a simple, symmetric protein such as the T20S proteasome can be trained on fewer than 20 micrographs, but numbers in the low hundreds may be required for more difficult datasets. Once the micrograph subset has been selected, manual picking or one of the existing pickers can be used to acquire an initial set of picks. Manual picking will be required for the most challenging cases, but for others a recommended workflow is to use the blob picker (or the pretrained deep model) to acquire an initial set of high-quality picks. Even if the quality of the initial picks is not perfect, the set can be cleaned using 2D Classification followed by the Select 2D job to remove poor picks. The 2D Classification and Select 2D jobs can be repeated until the picks are of sufficient quality, but a single round is often enough.
Training data can be flawed in four ways:
1. The picks are misaligned from the particles.
2. Not all particles in the micrograph are picked.
3. Aggregates are picked.
4. The picks are not particles.
The impact of the latter two can be minimized using 2D classification followed by select 2D, as described above for acquiring training data. The other two flaws are not as impactful as they may initially seem, due to the way the model processes data. The effect of misalignment is reduced because the model infers rough blobs approximately where each particle is, and a centroid is then extracted from each blob. As long as the misalignments are not severe (in which case 2D classification should have filtered out the picks) and are not identical across all training picks (which is extremely unlikely), the model will learn to effectively ignore them, at least to the point where they are negligible for downstream processing. The issue of unpicked particles is reduced by how the micrographs are fed into the model: each micrograph is split into patches, and the patches are fed into the model individually. This is possible because identifying a particle does not require the context provided by the whole micrograph. If the initial picks for a micrograph are missing a fifth of the available particles, each patch may be missing only a few, which reduces the impact of the missed particles on training.
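The blob-to-centroid step described above can be illustrated with a minimal sketch. It assumes the model output has already been thresholded into a binary mask, and the function name `blob_centroids` and the use of 4-connectivity are illustrative choices, not the actual cryoSPARC implementation.

```python
from collections import deque


def blob_centroids(mask):
    """Return the (row, col) centroid of each 4-connected blob in a binary mask.

    `mask` is a list of equal-length rows containing 0 or 1. This is an
    illustrative stand-in for the post-processing of a picker's output.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    centroids = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill collects every pixel of this blob.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # The centroid is the mean pixel position of the blob.
            cy = sum(p[0] for p in pixels) / len(pixels)
            cx = sum(p[1] for p in pixels) / len(pixels)
            centroids.append((cy, cx))
    return centroids


mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
centroids = blob_centroids(mask)
```

Because the centroid averages over the whole blob, a blob that is roughly in the right place yields a usable pick even if individual training picks were slightly off-center.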
This section will go over each notable training parameter and how to determine which value to select.
Number of parallel threads: This parameter distributes micrograph preprocessing over multiple threads to reduce the preprocessing time. Higher values may introduce overhead, so values between 4 and 8 are generally safe. If preprocessing is still too slow, higher values can be tried.
Degree of lowpass filtering: If micrographs are too noisy, the model may struggle to learn particle locations. Decreasing this parameter can reduce the noise in the micrographs, thereby improving training. Values that are too low can begin to filter out valuable information from the micrographs. 50 is a standard value, and values down to 15 are recommended for noisier micrographs.
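As a rough illustration of distributing per-micrograph work over a fixed number of threads, the sketch below uses Python's standard `ThreadPoolExecutor`. The `preprocess` function is a hypothetical placeholder (simple mean subtraction) and does not reflect what the actual preprocessing does.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(micrograph):
    """Hypothetical stand-in for per-micrograph preprocessing: subtract the mean."""
    mean = sum(micrograph) / len(micrograph)
    return [value - mean for value in micrograph]


def preprocess_all(micrographs, num_threads=4):
    """Preprocess micrographs using a pool of `num_threads` worker threads."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(preprocess, micrographs))


micrographs = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
processed = preprocess_all(micrographs, num_threads=4)
```

Each extra worker adds scheduling and memory overhead, which is one reason very high thread counts can stop paying off.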
Initial learning rate and final learning rate: These two parameters determine the learning rate decay used during training. The initial learning rate is expected to be higher than the final learning rate. Values of 0.01 to 0.001 have been found to work best for the initial learning rate. However, if it takes a long time to reach training and validation losses of 1 and below, it is recommended to increase the initial learning rate. The final learning rate can be altered to improve the final loss and accuracy. To determine whether an acceptable final learning rate was selected, observe the losses from the final few epochs. If the training loss is significantly lower than the validation loss, this indicates that the final learning rate should be decreased. Indicators that the learning rate is too high include losses and accuracies that do not change, and losses that remain large over the initial epochs.
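To make the relationship between the two learning rates concrete, the sketch below builds a per-epoch schedule that decays from the initial to the final value. The exponential (geometric) form is an assumption chosen for illustration; this document does not state which decay curve the training job actually uses.

```python
def learning_rate_schedule(initial_lr, final_lr, num_epochs):
    """Return one learning rate per epoch, decaying geometrically.

    Assumed decay form: each epoch multiplies the rate by a constant
    ratio so that the last epoch lands on `final_lr`.
    """
    if num_epochs < 2:
        return [initial_lr] * num_epochs
    ratio = (final_lr / initial_lr) ** (1.0 / (num_epochs - 1))
    return [initial_lr * ratio ** epoch for epoch in range(num_epochs)]


schedule = learning_rate_schedule(initial_lr=0.01, final_lr=0.0001, num_epochs=5)
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: learning rate {lr:.6f}")
```

Under this assumed schedule, shortening the run compresses the decay, so fewer epochs are spent near the final learning rate.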
Minibatch size: The minibatch size determines the size of the batches that are fed into the model during training. A minibatch size of 1 has been found to perform well for particle picking. Increasing the minibatch size results in faster training at the cost of higher GPU memory usage. It has also been found that increasing the minibatch size can result in worse particle picks.
Number of epochs: The number of epochs is the number of passes through the dataset performed during training. It is recommended to first run a training job with a lower number of epochs, such as between 20 and 50, and then increase the number of epochs if the loss was still decreasing at the end of the training job. If the loss stagnated at the end of training, use a value slightly greater than the epoch at which the loss began to stagnate. For example, if the loss began to stagnate at epoch 10, the next training job can have 15 epochs. The slight increase accounts for the fact that training will spend fewer epochs at lower learning rates.
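The rule of thumb above can be sketched as a small helper that looks for the epoch where the loss stopped improving and adds a margin. The names `stagnation_epoch` and `suggest_num_epochs`, the relative tolerance, the patience, the 5-epoch margin, and the choice to double the epoch count when the loss never stagnates are all illustrative assumptions.

```python
def stagnation_epoch(losses, tolerance=0.01, patience=3):
    """Return the 1-based epoch at which the loss began to stagnate, else None.

    Stagnation is taken to mean `patience` consecutive epochs whose
    relative improvement over the previous epoch is below `tolerance`.
    """
    run = 0
    for i in range(1, len(losses)):
        previous = losses[i - 1]
        improvement = (previous - losses[i]) / abs(previous) if previous else 0.0
        if improvement < tolerance:
            run += 1
            if run == patience:
                # Index of the first stagnant epoch, converted to 1-based.
                return i - patience + 2
        else:
            run = 0
    return None


def suggest_num_epochs(losses, margin=5):
    """Suggest an epoch count for the next training job."""
    epoch = stagnation_epoch(losses)
    if epoch is None:
        # Loss was still decreasing: train for longer (doubling is arbitrary).
        return 2 * len(losses)
    return epoch + margin


losses = [8.0, 4.0, 2.0, 1.0, 0.999, 0.998, 0.998, 0.997]
print(stagnation_epoch(losses), suggest_num_epochs(losses))
```

With the example losses, the improvement collapses from epoch 5 onward, so the helper suggests a run slightly longer than 5 epochs.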
Use class weights: The use class weights parameter alters the loss function to increase the impact of correctly picking particles. For datasets with many particles per micrograph, such as the T20S proteasome dataset, it is unnecessary to set this parameter. However, if the particles are sparse in the dataset, turning this parameter on can greatly improve performance. As a rule of thumb, it is suggested to keep this parameter on.
Normalize micrographs: The normalize micrographs parameter normalizes micrographs to zero mean and unit variance prior to training. For datasets with little junk, this parameter can worsen training, as it makes particles more difficult to differentiate. However, it reduces the impact of junk, which can improve performance on datasets where junk is prevalent.
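One common way to weight classes is a weighted binary cross-entropy, sketched below. This is an assumption for illustration: the document does not specify the loss used by the training job, and `weighted_bce` and `inverse_frequency_weight` are hypothetical names.

```python
import math


def weighted_bce(labels, probs, particle_weight=1.0, background_weight=1.0):
    """Mean binary cross-entropy with a separate weight per class.

    `labels` are 1 (particle) or 0 (background); `probs` are the
    predicted particle probabilities for the same pixels.
    """
    eps = 1e-12  # keeps log() finite for probabilities of exactly 0 or 1
    total = 0.0
    for label, prob in zip(labels, probs):
        prob = min(max(prob, eps), 1.0 - eps)
        if label == 1:
            total += -particle_weight * math.log(prob)
        else:
            total += -background_weight * math.log(1.0 - prob)
    return total / len(labels)


def inverse_frequency_weight(labels):
    """Weight particles by how much rarer they are than background."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return negatives / positives if positives else 1.0


# Sparse particles: one particle pixel among ten, and a model that never picks.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
never_pick = [0.01] * 10
unweighted = weighted_bce(labels, never_pick)
weighted = weighted_bce(
    labels, never_pick, particle_weight=inverse_frequency_weight(labels)
)
```

In this toy case a model that never picks anything gets a low unweighted loss, while the weighted loss penalizes the missed particle much more heavily, which is why weighting helps most when particles are sparse.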
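Normalizing to zero mean and unit variance means subtracting the mean pixel value and dividing by the standard deviation. The sketch below applies this to a flat list of pixel values; the function name `normalize` and the handling of a constant image are illustrative assumptions.

```python
import statistics


def normalize(pixels):
    """Rescale pixel values to zero mean and unit variance."""
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels)  # population standard deviation
    if std == 0:
        # A constant image has no variance to rescale.
        return [0.0 for _ in pixels]
    return [(value - mean) / std for value in pixels]


normalized = normalize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```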
Debugging training issues can be done by observing the loss and accuracy values during and/or after training. There are two major types of training issues: underfitting and overfitting.
Underfitting occurs when the model fails to learn at all. It can be diagnosed by loss values that remain in the 1000s and above, or by NaN values. Underfitting can also be diagnosed by accuracies that remain at 60% and below.
Underfitting can be resolved by increasing the initial and final learning rates and/or increasing the number of epochs. It is recommended to increase the initial and final learning rates to resolve the underfitting and then alter the number of epochs to optimize training after.
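The underfitting symptoms above can be turned into a simple check on the logged metrics. This is a heuristic sketch: the function name `looks_underfit`, the 3-epoch window, and the assumption that accuracies are logged as fractions between 0 and 1 are illustrative, while the thresholds come from the text.

```python
import math


def looks_underfit(losses, accuracies, window=3,
                   loss_threshold=1000.0, accuracy_threshold=0.60):
    """Heuristic underfitting check over the last `window` epochs."""
    if not losses or not accuracies:
        raise ValueError("losses and accuracies must not be empty")
    recent_losses = losses[-window:]
    recent_accuracies = accuracies[-window:]
    # NaN losses mean training has broken down entirely.
    if any(math.isnan(loss) for loss in recent_losses):
        return True
    loss_stuck = all(loss >= loss_threshold for loss in recent_losses)
    accuracy_stuck = all(acc <= accuracy_threshold for acc in recent_accuracies)
    return loss_stuck or accuracy_stuck


stuck_run = looks_underfit([5000.0, 4800.0, 4900.0], [0.50, 0.52, 0.51])
healthy_run = looks_underfit([3.0, 1.2, 0.8, 0.6], [0.70, 0.85, 0.90, 0.93])
```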
Overfitting occurs when the model begins to memorize noise in the training data as if it were a general pattern. Overfitting can be diagnosed by looking for a divergence between the training and validation losses and accuracies. The training loss being lower than the validation loss, or the training accuracy being higher than the validation accuracy, is not by itself an indicator of overfitting. These behaviours are common and still result in successfully trained models. The kind of divergence that indicates overfitting is when the training loss continues to decrease while the validation loss starts increasing, and when the training accuracy continues increasing while the validation accuracy begins decreasing. A model that has overfit will perform well on the training dataset but poorly on any data outside of it.
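The distinction between a harmless gap and a true divergence can be sketched as a check on the direction of the two loss curves. The function name `losses_diverging` and the 3-epoch window are illustrative assumptions.

```python
def losses_diverging(train_losses, val_losses, window=3):
    """Return True if training loss fell while validation loss rose
    across each of the last `window` epoch-to-epoch steps."""
    if len(train_losses) <= window or len(val_losses) <= window:
        return False  # not enough history to judge
    train_tail = train_losses[-(window + 1):]
    val_tail = val_losses[-(window + 1):]
    train_falling = all(b < a for a, b in zip(train_tail, train_tail[1:]))
    val_rising = all(b > a for a, b in zip(val_tail, val_tail[1:]))
    return train_falling and val_rising


train = [1.0, 0.8, 0.6, 0.5, 0.4]
# Validation loss is higher than training loss but still falling: not overfitting.
gap_only = losses_diverging(train, [1.1, 0.9, 0.8, 0.75, 0.7])
# Validation loss turns around and climbs while training loss keeps falling.
diverged = losses_diverging(train, [1.1, 0.9, 1.0, 1.1, 1.2])
```

The same comparison can be applied to accuracies by swapping the directions: training accuracy rising while validation accuracy falls.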
Overfitting can be resolved by increasing the learning rates. Since models are unlikely to begin overfitting near the beginning of training, it is recommended to increase the final learning rate.
One issue that can occur is that the model does not output any particles. This can happen when the model falls into a local optimum of never finding particles. There are two approaches to resolving this issue. The first is to use class weights during training by turning the "Use class weights" parameter on. This parameter rewards the model for correctly picking particles, encouraging it towards the global optimum. The second approach is to decrease the initial learning rate. A lower initial learning rate helps prevent training from falling into a local optimum.