# Job: Topaz Train and Job: Topaz Cross Validation

Topaz job types available via wrapper in CryoSPARC.

To perform particle picking using Topaz, a model must first be trained using either the `Topaz Train` job or the `Topaz Cross Validation` job. Both of these jobs require the same inputs and produce the same outputs, as listed below:

**Inputs**

Particle Picks

Micrographs

**Outputs**

Topaz Model

Micrographs

**Parameters**


Both the `Topaz Train` and `Topaz Cross Validation` jobs feature various parameters. The basic parameters are detailed below:

**Path to Topaz Executable** The absolute path to the Topaz executable that will run the job.

**Downsampling Factor** The factor by which to downsample micrographs. It is highly recommended to downsample micrographs to reduce memory load and improve model performance. For example, a recommended downsampling factor for a K2 Super Resolution (7676x7420) dataset (e.g. EMPIAR 10025) is 16.
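As a rough illustration of what this factor does to the data, the sketch below estimates the size of a downsampled micrograph. The function name `downsampled_shape` is hypothetical, and flooring the dimensions is an assumption; the actual preprocessing may round differently.

```python
def downsampled_shape(width, height, factor):
    """Approximate micrograph dimensions after downsampling by `factor`.

    Assumption: dimensions are floored. The real preprocessing may round differently.
    """
    if factor < 1:
        raise ValueError("downsampling factor must be at least 1")
    return width // factor, height // factor


# K2 Super Resolution micrograph from the example above, downsampled by 16.
new_width, new_height = downsampled_shape(7676, 7420, 16)
pixel_reduction = (7676 * 7420) / (new_width * new_height)
print(new_width, new_height, round(pixel_reduction))
```

Because both dimensions shrink by the same factor, a downsampling factor of 16 reduces the number of pixels per micrograph by roughly 16 x 16 = 256 times.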

**Learning Rate** The value that determines the extent to which model weights are updated. Higher values will result in training approaching an optimum faster but may prevent the model from reaching the optimum itself, potentially resulting in worse final accuracy.

**Minibatch Size** The number of examples that are used within each batch during training. Lower values will improve model accuracy at the cost of increased training time.

**Number of Epochs** The number of iterations through the entire dataset the training performs. A higher number of epochs will naturally lead to longer training times. The number of epochs does not have to be optimized, as the train and cross validation jobs will automatically output the model from the epoch with the highest precision.

**Epoch Size** The number of updates that occur each epoch. Increasing this value will increase the amount of training performed in each epoch in exchange for slower training speed.
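The minibatch size, number of epochs, and epoch size together determine how much training is performed. The following sketch simply multiplies them out using the definitions above. The function name and the example values are hypothetical and are not the job defaults.

```python
def training_budget(num_epochs, epoch_size, minibatch_size):
    """Return (total weight updates, total examples drawn) for a training run."""
    total_updates = num_epochs * epoch_size          # epoch size = updates per epoch
    total_examples = total_updates * minibatch_size  # each update uses one minibatch
    return total_updates, total_examples


# Hypothetical settings, chosen only for illustration.
updates, examples = training_budget(num_epochs=10, epoch_size=5000, minibatch_size=128)
print(updates, examples)
```

Doubling any one of the three values doubles the number of examples processed, which is why each of them trades training time for more training.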

**Train-Test Split** The fraction of the dataset to use for testing. For example, a value of 0.2 will use 80% of the input micrographs for training and the remaining 20% for testing. It is highly recommended to use a train-test split greater than 0.
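A minimal sketch of how the split fraction translates into micrograph counts is shown below. The function name `split_counts` is hypothetical, and rounding to the nearest whole micrograph is an assumption; the job may assign micrographs differently.

```python
def split_counts(num_micrographs, test_fraction):
    """Return (training count, testing count) for a given test fraction.

    Assumption: the test count is rounded to the nearest whole micrograph.
    """
    if not 0.0 <= test_fraction < 1.0:
        raise ValueError("test fraction must be in the range [0, 1)")
    num_test = round(num_micrographs * test_fraction)
    return num_micrographs - num_test, num_test


# 100 micrographs with a train-test split of 0.2.
num_train, num_test = split_counts(100, 0.2)
print(num_train, num_test)
```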

**Expected Number of Particles** The average expected number of particles within each micrograph. This value does not have to be exact, but a reasonably accurate value is necessary for Topaz to perform well. This is a required parameter with no default value, so it must be entered by the user. Note that if this parameter is lower than the average number of labeled picks input into the training job, the training job will switch to the PN loss function, which was experimentally found to perform worse than the GE-binomial loss function.
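The fallback rule described above can be summarized in a few lines. This is only an illustration of the rule as stated in this guide, not the wrapper's actual code; the function and argument names are hypothetical.

```python
def effective_loss(expected_particles, avg_labeled_picks, requested_loss="GE-binomial"):
    """Return the loss function that would be used under the rule described above.

    If the expected number of particles per micrograph is lower than the
    average number of labeled picks per micrograph, training falls back to PN.
    """
    if expected_particles < avg_labeled_picks:
        return "PN"
    return requested_loss


print(effective_loss(expected_particles=100, avg_labeled_picks=250))
print(effective_loss(expected_particles=300, avg_labeled_picks=250))
```

In practice this means the expected number of particles should be set at or above the average number of picks per micrograph in the training input.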

**Number of Parallel Threads** The number of threads to distribute preprocessing over. This parameter decreases the preprocessing time by a factor approximately equal to the input value. It is recommended to set this value to at least 4, as the preprocessing time is often a bottleneck in the time performance of the job. Values less than 2 will default to a single thread.

The advanced parameters are detailed below:

**Pixel sampling factor / Number of iterations / Score threshold** Parameters that affect the preprocessing of micrographs. It is recommended not to change these parameters.

**Loss function** The loss function used to train the model. It is recommended to use GE-binomial for the following reasons. The PU loss function is a non-negative risk estimator approach and PN is a naive approach where unlabeled data is considered as negative for training. Both of these loss functions were found to perform poorly compared to the GE-binomial and GE-KL loss functions in the paper "Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs" by Bepler, T. et al. [1], the developers of Topaz. The paper also found that while GE-binomial and GE-KL had similar performance in most cases, there were a few cases where GE-binomial outperformed GE-KL. Thus it is recommended to use GE-binomial.

**Slack** The weight on the loss function if GE-binomial or GE-KL is selected as the loss function. It is recommended to keep the slack at -1 as -1 will use the default parameters. If the user desires to change this parameter, it is recommended to read the paper by Bepler, T. et al. [1] prior to doing so.

**Autoencoder weight** The weight on the reconstruction error of the autoencoder. An autoencoder weight of 0 will disable the autoencoder. According to the paper by Bepler, T. et al. [1], the autoencoder improves classifier performance when using fewer labeled data points. However, the degree of improvement diminishes with more labeled data points until it begins to negatively affect classifier performance due to over-regularization. The paper recommends using an autoencoder weight of 10 / N when N ≤ 250 and using an autoencoder weight of 0 otherwise, where N is the number of labeled data points.
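The recommendation above can be written as a small helper. The function name is hypothetical; the rule itself (10 / N for N ≤ 250, otherwise 0) is the one quoted from the paper.

```python
def recommended_autoencoder_weight(num_labeled):
    """Autoencoder weight suggested by the rule above: 10 / N if N <= 250, else 0."""
    if num_labeled <= 0:
        raise ValueError("the number of labeled data points must be positive")
    return 10 / num_labeled if num_labeled <= 250 else 0.0


for n in (100, 250, 1000):
    print(n, recommended_autoencoder_weight(n))
```

With more than 250 labeled data points the helper returns 0, which disables the autoencoder.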

**Regularization** The regularization parameter on the loss function using L2 regularization. Values less than 1 can be used to improve model performance but values greater than 1 are likely to begin to impede training.

**Model architecture** ResNet stands for residual network, a neural network architecture that is popular for mitigating the vanishing gradient problem. Note that average pooling cannot be used with the ResNet8 model architecture.

Conv stands for convolutional neural network which is a popular architecture for computer vision problems. Note that max pooling cannot be used with Conv model architectures.

According to the Topaz GitHub page, ResNet8 provides a balance of good performance and receptive field size. Conv31 and Conv63, which have smaller receptive fields, can be useful when less complex models are desired. Conv127 should not be used unless quite complex models are required. The following are the receptive fields for each architecture as shown on the aforementioned GitHub page:

resnet8 [receptive field = 71]

conv127 [receptive field = 127]

conv63 [receptive field = 63]

conv31 [receptive field = 31]

**Number of units in base layer** The number of units in the base layer. The ResNet8 model architecture will double the number of units during convolutions and pooling. For the Conv model architectures, the scaling of units can be specified using the **Unit scaling** parameter.

**Dropout rate** The probability that a unit is disabled for one batch iteration during training. Dropout is sometimes useful for preventing overfitting. Low dropout rates greater than 0 and less than 0.5 can be used when the Topaz model begins overfitting during training.

**Batch normalization** Enabling batch normalization reduces the covariate shift of hidden units during training. This enables higher learning rates as the activations of hidden units are reduced, reduces overfitting, and provides some regularization. It is recommended to use batch normalization.

**Pooling method** The pooling method is the type of layer used to reduce the spatial complexity of layers within the model. Pooling methods improve training speed in exchange for some information loss.

Max pooling uses the max of the values within the pooling kernel as the output value of the kernel. Note that max pooling cannot be used with Conv model architectures.

Average pooling uses the average of the values within the pooling kernel as the output value of the kernel. Note that average pooling cannot be used with the ResNet8 model architecture.
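To make the difference between the two pooling methods concrete, here is a small pure-Python sketch that applies both to the same feature map. It assumes non-overlapping 2 x 2 kernels (stride equal to the kernel size), which is a simplification chosen for illustration; the kernel and stride used inside the Topaz architectures may differ.

```python
def pool2d(grid, kernel, reduce_fn):
    """Apply non-overlapping kernel x kernel pooling to a 2D list of numbers."""
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows - kernel + 1, kernel):
        pooled_row = []
        for c in range(0, cols - kernel + 1, kernel):
            # Collect every value covered by the kernel at this position.
            window = [grid[r + i][c + j] for i in range(kernel) for j in range(kernel)]
            pooled_row.append(reduce_fn(window))
        pooled.append(pooled_row)
    return pooled


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 10, 13, 14],
    [11, 12, 15, 16],
]

max_pooled = pool2d(feature_map, 2, max)   # keeps the largest value per window
avg_pooled = pool2d(feature_map, 2, mean)  # keeps the average value per window
print(max_pooled)
print(avg_pooled)
```

Both methods reduce the 4 x 4 input to a 2 x 2 output; they differ only in how each window is summarized.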

There is no strong recommendation regarding pooling method.

**Unit scaling** The factor by which to scale the number of units during convolutions and pooling when using Conv model architectures.

**Encoder network unit scaling** The factor by which to scale the number of units during convolutions and pooling within the autoencoder architecture. Only applies when an autoencoder weight greater than 0 is used.

The `Topaz Cross Validation` job includes unique parameters that enable the user to select which parameter to vary and how to vary the parameter during training. These parameters are:

Parameter to Optimize

Number of Cross Validation Folds

Initial Value to begin with during Cross Validation

Value to Increment Parameter by during Cross Validation

The first parameter allows the user to select which parameter to vary. The number of cross validation folds indicates how many training jobs to perform for each tested value during cross validation. The initial value and the incremental value parameters serve to specify which values to test. For example, choosing the parameters found in the table below will result in the `Topaz Cross Validation` job testing the following learning rates two times each: 0.0001, 0.0002, 0.0003. After finding the learning rate yielding the best results, it will use that learning rate to perform the final training.

## Example Parameters

| Parameter | Value |
| --- | --- |
| Parameter to optimize | Learning rate |
| Number of cross validation folds | 2 |
| Initial value to begin with | 0.0001 |
| Value to increment by | 0.0001 |
| Number of times to increment parameter | 3 |
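The values tested for the example parameters can be reproduced with a short sketch. The function name is hypothetical, and the rounding step is only there to hide floating-point noise. The run count assumes each value is trained once per fold, as in the example described above.

```python
def cross_validation_values(initial, increment, count):
    """List the parameter values tested: initial, initial + increment, and so on."""
    # Rounding hides floating-point noise such as 0.00030000000000000003.
    return [round(initial + i * increment, 10) for i in range(count)]


values = cross_validation_values(initial=0.0001, increment=0.0001, count=3)
num_folds = 2

# Assumption: every value is trained once per fold, followed by one final training.
cross_validation_runs = len(values) * num_folds
print(values, cross_validation_runs)
```

For the example parameters this gives three learning rates and six cross validation trainings, followed by the final training with the best learning rate.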

There are other advanced training and model parameters that will not be discussed in this introductory user guide, such as the selection of pooling layer or encoding network. These parameters can potentially improve the Topaz model. Note that some of these parameters are incompatible with certain model architectures, and the job will output an error if it attempts to use incompatible parameters. The following parameter combinations are forbidden:

**Parameters incompatible with ResNet architecture:**

Average Pooling

Autoencoder/Encoding network

**Parameters incompatible with Convolutional Neural Network architecture:**

Max Pooling

Dropout
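The forbidden combinations listed above can be checked before launching a job with a sketch like the following. The function name, argument names, and message text are hypothetical and do not reflect how the job itself reports errors.

```python
def find_incompatibilities(architecture, pooling=None, dropout=0.0, autoencoder_weight=0.0):
    """Return a list of forbidden parameter combinations, following the lists above."""
    arch = architecture.lower()
    problems = []
    if arch.startswith("resnet"):
        if pooling == "average":
            problems.append("average pooling cannot be used with ResNet")
        if autoencoder_weight > 0:
            problems.append("the autoencoder cannot be used with ResNet")
    elif arch.startswith("conv"):
        if pooling == "max":
            problems.append("max pooling cannot be used with Conv architectures")
        if dropout > 0:
            problems.append("dropout cannot be used with Conv architectures")
    return problems


print(find_incompatibilities("resnet8", pooling="average"))
print(find_incompatibilities("conv63", pooling="max", dropout=0.2))
print(find_incompatibilities("resnet8", pooling="max", dropout=0.2))
```

An empty list means none of the forbidden combinations listed above were found.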

## Similarities and Differences between Topaz Train and Topaz Cross Validation

The `Topaz Train` and `Topaz Cross Validation` jobs serve the same purpose in that they both use particle picks and micrographs to produce models which can then be used to automatically pick particles.

The `Topaz Cross Validation` job is different in that it runs multiple instances of the `Topaz Train` job while varying a specified parameter, enabling the job to find an optimal value for a certain parameter. The `Topaz Cross Validation` job then uses the optimal parameter value to perform one last `Topaz Train` job and produces a usable model. However, a key disadvantage of the `Topaz Cross Validation` job is that it is **significantly** slower than the standard `Topaz Train` job.

It is recommended to use the `Topaz Train` job for training the Topaz model and to only use the `Topaz Cross Validation` job when attempting to find the optimal value for a particular parameter.

## Interpreting Training Results from Train and Cross Validation

Once training using either the `Topaz Train` or `Topaz Cross Validation` job is complete, the job will output a plot indicating the performance on the training set over each epoch. If a train-test split greater than 0 is used, a plot for the performance on the test set will also be output. The testing plot is a better indicator of the overall training results than the training plot and should be used to interpret the results whenever available. The x-axis indicates the epoch and the y-axis indicates the precision. The precision measures how accurate the model is. Successfully trained models will have a test plot in which precision gradually increases as the epoch increases.

If the precision begins to decrease after increasing for several epochs, then the model has begun to overfit to the training set. However, the job will automatically select the model from the epoch with the highest precision. Therefore, assuming that the precision was improving prior to overfitting, the job will output a version of the model from before it began overfitting.
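The selection rule can be illustrated with a few lines of Python. The precision values below are made up for illustration, the function name is hypothetical, and numbering epochs from 1 is an assumption.

```python
def best_epoch(test_precision):
    """Return (epoch number, precision) for the epoch with the highest precision.

    Assumption: epochs are numbered starting from 1.
    """
    best_index = max(range(len(test_precision)), key=lambda i: test_precision[i])
    return best_index + 1, test_precision[best_index]


# Made-up test precision per epoch: it rises, then falls as overfitting sets in.
precision_by_epoch = [0.21, 0.35, 0.42, 0.47, 0.45, 0.40]
epoch, precision = best_epoch(precision_by_epoch)
print(epoch, precision)
```

Here the model from the fourth epoch would be kept, even though training continued for two more epochs.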

Below is an example of a test plot from a well-performing Topaz model.

The `Topaz Cross Validation` job also features a plot that presents the results of the cross validation and the performance at each value. An example of a cross validation plot using the example parameters shown previously can be found below:
