Guideline for Supervised Particle Picking using Deep Learning Models

Supervised particle picking.

Introduction

Supervised deep learning models are machine learning models that are trained on pre-existing data, in this case particle picks, to pick particles from micrographs. The primary benefit of these models is that they can learn from an (ideally) small number of high-quality particle picks from a dataset to produce a model that picks particles from the remainder of the dataset.

This document provides guidelines for training effective particle picking models. The quality of a model depends on both the training data and the training parameters, so this document first explores the training data and then the training parameters. Finally, it details what to watch for while a model trains, so that issues can be caught and resolved.

The general concepts and instructions below are written in the context of the CryoSPARC deep neural network particle picking workflows. Other methods (e.g., Topaz) share the same fundamental characteristics and considerations, but use differently named parameters.

Training Data

Acquiring training data

The first step in training a model is acquiring the training data. The best way to do this is to work with a small subset of micrographs from the dataset of interest. An effective model for a simple, symmetric protein such as the T20S proteasome can be trained on fewer than 20 micrographs, but more difficult datasets may require counts in the low hundreds. Once the micrograph subset has been selected, manual picking or one of the existing pickers can be used to acquire an initial set of picks. Manual picking will be required for the most challenging cases, but otherwise a recommended workflow is to use the blob picker (or the pretrained deep model) to acquire an initial set of high-quality picks. Even if the initial picks are imperfect, they can be filtered using 2D Classification followed by the Select 2D job. These two jobs can be repeated until the picks are of sufficient quality, though a single round is often enough.

Determining the quality of training data

Training data can be flawed in four ways:

  1. the picks are misaligned from the particles

  2. not all particles from the micrograph are picked

  3. aggregations are picked

  4. the picks are not particles

The impact of the latter two flaws can be minimized using the 2D Classification and Select 2D workflow mentioned in the "Acquiring training data" section. The other two flaws are less harmful than they may initially seem, due to the design of the data processing performed by the model. The effect of misalignments is reduced because the model infers rough blobs where each particle approximately is, and centroids are then extracted from each blob. As long as the misalignments are not severe (in which case 2D classification should have filtered the picks) and are not identical across all training picks (which is extremely unlikely), the model will eventually learn to effectively ignore them, at least to the point that the error is negligible for downstream processing. The issue of incomplete picking is reduced by how micrographs are fed into the model: each micrograph is split into patches, and the patches are fed into the model individually. This is possible because recognizing the shape of a particle pick does not require the context of the whole micrograph. A micrograph missing a fifth of its available picks may thus result in each patch missing only a few particles, reducing the impact of the missed particles on training.
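The patch-based scheme described above can be sketched in a few lines. This is a conceptual illustration only: the patch size here is arbitrary, and any overlap or padding used by the actual picker is an internal detail not specified in this guide.

```python
import numpy as np

def split_into_patches(micrograph, patch_size):
    """Split a 2D micrograph into non-overlapping square patches.

    Conceptual sketch of the patch-based processing described above;
    the real picker's patch size, overlap, and padding may differ.
    """
    h, w = micrograph.shape
    patches = []
    for i in range(0, h - patch_size + 1, patch_size):
        for j in range(0, w - patch_size + 1, patch_size):
            patches.append(micrograph[i:i + patch_size, j:j + patch_size])
    return patches

# A 1024x1024 micrograph yields 16 patches of 256x256; a handful of
# missed picks spread over 16 patches barely affects any single patch.
patches = split_into_patches(np.zeros((1024, 1024), dtype=np.float32), 256)
```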

Training Parameters

This section will go over each notable training parameter and how to determine which value to select.

  • Number of parallel threads: This parameter distributes micrograph preprocessing over multiple threads to reduce preprocessing time. Very high values can introduce overhead, so values between 4 and 8 are generally safe. If preprocessing is still observed to be too slow, higher values can be tried.

  • Degree of lowpass filtering: If the micrographs are too noisy, the model may struggle to learn particle locations. Decreasing this parameter can reduce the noise in the micrographs, thereby improving training, but values that are too low begin to filter valuable information out of the micrographs. 50 is a standard value, and values down to 15 are recommended for noisier micrographs.

  • Initial learning rate and final learning rate: These two parameters determine the learning rate decay used during training. The initial learning rate is expected to be higher than the final learning rate. Values of 0.01 to 0.001 have been found to work best for the initial learning rate. If it takes a long time to reach training and validation losses of 1 and below, it is recommended to increase the initial learning rate. The final learning rate can be altered to improve the final loss and accuracy. To determine whether an acceptable final learning rate was selected, observe the losses from the final few epochs: if the training loss is significantly lower than the validation loss, the final learning rate should be decreased. Indicators that the learning rate is too high include losses and accuracies that do not change, and losses that remain large over the initial epochs.

  • Minibatch size: The minibatch size determines the size of the batches fed into the model during training. A minibatch size of 1 has been found to perform well for particle picking. Increasing the minibatch size results in faster training at the cost of higher GPU memory usage, and the change in training stochasticity introduced by a larger minibatch size has also been found to result in worse particle picks.

  • Number of epochs: The number of epochs is the number of passes through the dataset performed during training. It is recommended to first run a training job with a lower number of epochs, such as 20 to 50, and then increase the number of epochs if the loss was still decreasing at the end of the job. If the loss stagnated before the end of training, use a value slightly greater than the epoch at which the loss began to stagnate. For example, if the loss began to stagnate at epoch 10, the next training job can use 15 epochs. The slight increase accounts for the fact that less progress is made per epoch once the learning rate has decayed to its lower values.

  • Use class weights: This parameter alters the loss function to increase the impact of correctly picking particles. For datasets with many particles per micrograph, such as the T20S proteasome dataset, setting this parameter is unnecessary. However, if particles are sparse in the dataset, turning it on can greatly improve performance. As a rule of thumb, it is suggested to keep this parameter on.

  • Normalize micrographs: This parameter normalizes micrographs to zero mean and unit variance prior to training. For datasets with little junk, normalization can worsen training by making particles more difficult to differentiate. However, it reduces the impact of junk, which can improve performance on datasets where junk is prevalent.
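Two of the parameters above lend themselves to short numerical sketches: the decay between the initial and final learning rates, and per-micrograph normalization. The exponential interpolation below is an assumption made for illustration; the exact decay schedule CryoSPARC uses is not specified in this guide.

```python
import numpy as np

def lr_schedule(initial_lr, final_lr, num_epochs):
    # One plausible schedule (an assumption, not CryoSPARC's documented one):
    # decay exponentially so epoch 0 uses initial_lr and the last epoch
    # uses final_lr.
    decay = (final_lr / initial_lr) ** (1.0 / max(num_epochs - 1, 1))
    return [initial_lr * decay**epoch for epoch in range(num_epochs)]

def normalize(mic):
    # Zero-mean, unit-variance normalization applied per micrograph.
    return (mic - mic.mean()) / mic.std()

# With the recommended values, the rate falls smoothly from 0.01 toward
# 0.001 over 20 epochs, so most of the per-epoch progress happens early.
rates = lr_schedule(0.01, 0.001, 20)
```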

Identifying Training Issues

Debugging training issues can be done by observing the loss and accuracy values during and/or after training. There are two major types of training issues: underfitting and overfitting.

Underfitting

Underfitting occurs when the model fails to learn at all. It can be diagnosed by loss values that remain in the thousands or above, or become NaN, and by accuracies that remain at or below 60%.
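The heuristics above can be written as a small check on the recorded loss and accuracy values; the thresholds mirror the ones stated in the text, and the function itself is an illustrative helper rather than part of CryoSPARC.

```python
import math

def is_underfitting(losses, accuracies):
    # Underfitting signature from the text: loss stuck in the thousands
    # (or NaN), or accuracy stuck at or below 60%.
    last_loss = losses[-1]
    if math.isnan(last_loss) or last_loss >= 1000:
        return True
    return accuracies[-1] <= 0.60
```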

Underfitting can be resolved by increasing the initial and final learning rates and/or increasing the number of epochs. It is recommended to increase the learning rates first, and then adjust the number of epochs to optimize training.

Overfitting

Overfitting occurs when the model begins to memorize noise in the training data as if it were a general pattern. It can be diagnosed by looking for a divergence between the training and validation losses and accuracies. The training loss simply being lower than the validation loss, or the training accuracy being higher than the validation accuracy, is not an indicator of overfitting; these behaviours are common and still result in successfully trained models. The kind of divergence that indicates overfitting is the training loss continuing to decrease while the validation loss starts increasing, or the training accuracy continuing to increase while the validation accuracy begins decreasing. A model that has overfit will perform well on the training dataset but poorly on any data outside of it.
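The divergence pattern described above, training loss still falling while validation loss rises, can be expressed as a simple heuristic over the last few epochs. The window size is an arbitrary illustrative choice, and this check is a sketch, not a feature of CryoSPARC.

```python
def detect_overfitting(train_loss, val_loss, window=3):
    # Overfitting signature: training loss strictly decreasing while
    # validation loss strictly increasing over the last `window` steps.
    if len(train_loss) < window + 1 or len(val_loss) < window + 1:
        return False
    t = train_loss[-window - 1:]
    v = val_loss[-window - 1:]
    train_falling = all(b < a for a, b in zip(t, t[1:]))
    val_rising = all(b > a for a, b in zip(v, v[1:]))
    return train_falling and val_rising
```

Note that a healthy run, where both losses fall together, does not trigger the check even though the training loss may sit below the validation loss the whole time.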

Overfitting can be resolved by increasing the learning rates. Since models are unlikely to begin overfitting near the beginning of training, it is recommended to increase the final learning rate.

Other Issues

One issue that can occur is the model not outputting any particles at all. This happens when the model falls into a local optimum of never finding particles. There are two approaches to resolving this issue. The first is to use class weights during training, by turning the "Use class weights" parameter on; this rewards the model for correctly picking particles, encouraging it to find the global optimum. The second is to decrease the initial learning rate; a lower initial learning rate helps prevent training from falling into the local optimum.
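The effect of class weights can be illustrated with a weighted binary cross-entropy: raising the weight on the particle class makes a missed particle cost more than a false positive, so "never pick anything" stops being a cheap solution. This formula is a standard illustration and is not claimed to be CryoSPARC's exact loss function.

```python
import math

def weighted_bce(labels, preds, pos_weight=1.0):
    # Binary cross-entropy with an extra weight on positive (particle)
    # labels; pos_weight > 1 penalizes missed particles more heavily.
    eps = 1e-7
    total = 0.0
    for t, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)
        total += -(pos_weight * t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(labels)

# A missed particle (label 1 predicted at 0.1) incurs a larger loss as
# pos_weight grows, nudging the model out of the "no particles" optimum.
low = weighted_bce([1, 0], [0.1, 0.1], pos_weight=1.0)
high = weighted_bce([1, 0], [0.1, 0.1], pos_weight=5.0)
```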

